G(r)ood Testing 12 - Knowledge of plane crashes can help prevent IT disasters

Last month I talked about risk prevention. In my previous column, G(r)ood Testing 11 - Explosive software – when risks do count, I explained Failure Mode and Effect Analysis (FMEA) as it is used in the petrochemical industry. I stated that in chemical plants risks do count and that their impact can be quite disastrous. No one likes chemical plants to explode. This month I would like to discuss some other disasters that also need to be avoided.

I know many testers are interested in airplane disasters. Lee Copeland, for example, can tell in an entertaining way what happened in the hours before flight so-and-so ended prematurely. I also meet colleagues who can dish up details from the NASA reports written after the crashes of the space shuttles Challenger and Columbia. Another favorite topic of conversation in testers’ circles is the ‘Ariane rocket’. For example, read “A Bug and a Crash” by James Gleick or watch the video.

Quality ambassadors like to look at failure cases
It is not surprising that we, as quality ambassadors, like to look at those failure cases. It helps us to define our goal and what we fight for on a daily basis: preventing IT disasters from happening. Perhaps it also offers consolation. When a defect occurs in production, we can soothe ourselves with the thought that even the big guys like NASA make mistakes.

But I think most testers like to read about flight and space disasters because they are instructive. Unlike in the average software project, when a plane crashes extensive research is done to understand the causes. The lesson that can be drawn from these investigations, Lee Copeland once told me, is that almost all disasters occur through an accumulation of several small mistakes. Only a few accidents have a single cause; disasters are often caused by an unfortunate combination of factors.

From airline disaster to network testing
Last year I had an experience that made me aware that this knowledge is also valuable within our average test projects. With a group of colleagues I was asked to do some Wi-Fi network testing. We did this pretty ad hoc: some testers walked through and around the building to gauge the network strength, others fired up a load generator and measured the network capacity, while a third group thought it would be fun to spoof the network. Spoofing is a technique in which the connection is taken over by another network that presents itself as the trusted network, for example in order to secretly retrieve user data.
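As an aside, one simple check a tester could script for this kind of spoofing (a so-called ‘evil twin’ access point) is to compare the access points broadcasting the corporate network name against the list of known, trusted ones. The sketch below is only a minimal illustration of that idea: the network name, the trusted addresses and the scan data are made-up examples, and in a real test the scan results would come from an actual Wi-Fi scanning tool.

```python
# Minimal sketch: flag access points that broadcast our SSID but are not on
# the trusted list - a possible sign of a spoofed ("evil twin") network.
# The SSID, trusted BSSIDs and scan results below are made-up examples.

TRUSTED_SSID = "CorpNet"                      # hypothetical corporate network name
TRUSTED_BSSIDS = {"aa:bb:cc:00:00:01",        # hypothetical MAC addresses of the
                  "aa:bb:cc:00:00:02"}        # organisation's own access points


def find_suspect_access_points(scan_results):
    """Return (bssid, signal) pairs that advertise our SSID but are unknown.

    scan_results is a list of dicts as a Wi-Fi scanning tool might produce:
    {"ssid": ..., "bssid": ..., "signal_dbm": ...}
    """
    return [
        (ap["bssid"], ap["signal_dbm"])
        for ap in scan_results
        if ap["ssid"] == TRUSTED_SSID and ap["bssid"].lower() not in TRUSTED_BSSIDS
    ]


if __name__ == "__main__":
    # Example scan data; in a real test this would come from an actual scan.
    example_scan = [
        {"ssid": "CorpNet", "bssid": "aa:bb:cc:00:00:01", "signal_dbm": -48},
        {"ssid": "CorpNet", "bssid": "de:ad:be:ef:00:99", "signal_dbm": -55},
        {"ssid": "GuestNet", "bssid": "aa:bb:cc:00:00:10", "signal_dbm": -70},
    ]
    for bssid, signal in find_suspect_access_points(example_scan):
        print(f"Suspect access point {bssid} broadcasting {TRUSTED_SSID} at {signal} dBm")
```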

We all found some anomalies and recorded them as bugs. During the debriefing, we reported the results of our tests to the business stakeholder. All findings were legitimate and accepted by the receiving party, who simply nodded and agreed.

But then something wonderful happened
But then something wonderful happened: in the dialogue that arose, we began to combine the reported errors. “When there is signal strength outside the building,” we reasoned, “uninvited guests can log on to the network unnoticed and retrieve data.” The load test had proved that this would affect the speed. Since the organization works with many cloud applications, such a load problem would lead to a situation in which employees are unable to do their daily work. The help desk, for example, could no longer handle customer calls, with all the implications that has for customer satisfaction.

If details about the network are accessible, and our test proved they were, the network can be spoofed. If unwanted people take over the network, they can view data unnoticed: for example strategic customer information or plans for the newest products. Because the network monitoring was at a low ebb, we could tell our client that he probably would not even notice this immediately. With this last point we had the full attention of the business. “Darn,” you could see him thinking, “when combined, these issues sure have serious implications that need urgent attention.”

We need to look beyond the individual incidents
The parallel between the described disasters and the sketched scene is a clear one. Small mistakes are often overlooked or pushed aside as unimportant. The problems with the O-rings in the booster rocket of the Challenger were well known before it was launched for the last time. Nevertheless, NASA decided to continue the mission, apparently having underestimated the risk or being unable to estimate the consequences of this small defect.

To prevent disasters, we need to look beyond the individual incidents themselves; it is also important to know about the other incidents. We must use our knowledge of the organization and the business to combine errors. Rather than aiming to reduce the issue list as fast as possible, doing some fixing and retesting and getting the whole thing solved, we need to think about scenarios in which different factors can reinforce each other. This requires a different mindset, because some people might see it as seeking out problems instead of killing them. But if we do it well, we will have the full attention and commitment of the business. Testing will then transform from a checking activity into quality ambassadorship, which fits much better with modern (read: agile) IT development. With that we can join forces and collaborate in order to prevent those IT disasters.

About the Author

Derk-Jan de Grood

Derk-Jan de Grood works for Valori as a senior test manager and product manager. His drive is to improve the visibility of testing in both agile and traditional organizations by sharing his knowledge and experience through training, presentations, workshops and publications. He is a regular speaker at conferences such as BTD and the STAR conferences in Europe and America. He has written several successful books on software testing as well as articles for the major testing magazines.
Find out more via @derkjandegrood