Epic Software Failures then and now – what is still to learn?

We all know software errors can cause problems, but what happens when software causes epic failures? This post looks at some of the most notable epic software failures of recent decades.

1990 – 4ESS Switches

In this failure, thousands of people placed long distance calls but were unable to connect. The cause was the long distance telephone company’s 114 4ESS switches, which kept rebooting in sequence. The system was meant to work like this: if one switch has a call, it sends a “do not disturb” message to the next switch, which then takes the call. The second switch resets itself to avoid disturbing the first switch. Switch 2 then checks on switch 1; if it detects any activity, it resets again to show that switch 1 is back online.

However, on January 15th, 1990, 60,000 customers were left unable to make calls. A month before the system went down, the company tweaked the code in an effort to speed up the process. The problem was that this change made the system too fast. The first switch to unload sent two messages, and one of them hit the second switch while it was in the middle of a reset. The second switch concluded that there was a fault in its logic and reset itself. It then put up a “do not disturb” sign and passed the problem on to the next switch.

Eventually the problem spread through the whole system. All 114 switches kept resetting until the network finally broke down.

As a result of this software bug, the company lost around $60 million in long distance calls that did not go through. To make it up to customers, the company also offered a special discount, which added further to the cost of the failure.

If this happened today, the problem could potentially be prevented by automating test case creation. A tool that finds the smallest number of test cases with the most coverage would help ensure that no critical tests are missed. In this case a line of code was written to speed the process up; however, the failure indicates that the system was not tested to see whether it could handle the faster service.
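To make the idea of "fewest test cases, most coverage" concrete, here is a minimal sketch in Python. It uses a simple greedy selection, which is an assumption chosen for illustration and not how any particular tool works; the test names and the behaviours they cover are hypothetical. A greedy pick gives a small set but is not guaranteed to be the smallest possible one.

```python
def select_tests(coverage: dict[str, set[str]]) -> list[str]:
    """Greedily pick tests until every coverable behaviour is covered."""
    # Everything that at least one test can cover.
    uncovered = set().union(*coverage.values())
    chosen: list[str] = []
    while uncovered:
        # Pick the test covering the most still-uncovered behaviours.
        # Iterating in sorted order makes ties resolve alphabetically.
        best = max(sorted(coverage), key=lambda t: len(coverage[t] & uncovered))
        chosen.append(best)
        uncovered -= coverage[best]
    return chosen


# Hypothetical tests for a call-routing switch and the behaviours each exercises.
coverage = {
    "t_single_message": {"normal_call", "do_not_disturb"},
    "t_reset": {"reset", "do_not_disturb"},
    "t_message_during_reset": {"reset", "message_during_reset", "normal_call"},
    "t_normal": {"normal_call"},
}

selected = select_tests(coverage)
# selected == ["t_message_during_reset", "t_reset"]
```

Two of the four tests are enough to cover every listed behaviour, including the "message arrives during a reset" case that matters in a failure like this one.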

1998 – NASA Mars Orbiter

The failure of the NASA Orbiter was caused by the project teams using two different units of measurement: one worked in imperial units and the other in metric. This is something that should have been clearly defined in the requirements. If the requirements had been presented in a way that was easy for everyone in the software development lifecycle to understand, this error might not have occurred.
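One way to stop a unit mix-up from slipping through is to make the unit part of the code itself, so that a bare number can never be passed between teams. The sketch below is illustrative only: it assumes one team reports thruster impulse in pound-force seconds while another expects newton-seconds, and the class and function names are hypothetical rather than taken from any NASA software.

```python
from dataclasses import dataclass

# 1 pound-force second expressed in newton-seconds.
LBF_S_TO_N_S = 4.4482216152605


@dataclass(frozen=True)
class Impulse:
    """An impulse value that is always stored in newton-seconds."""
    newton_seconds: float

    @classmethod
    def from_newton_seconds(cls, value: float) -> "Impulse":
        return cls(value)

    @classmethod
    def from_pound_force_seconds(cls, value: float) -> "Impulse":
        # Convert at the boundary so the rest of the code sees one unit only.
        return cls(value * LBF_S_TO_N_S)


def total_impulse(burns: list[Impulse]) -> Impulse:
    """Add up impulses; callers cannot pass bare numbers by accident."""
    return Impulse(sum(b.newton_seconds for b in burns))


burns = [
    Impulse.from_pound_force_seconds(10.0),  # value supplied in imperial units
    Impulse.from_newton_seconds(5.0),        # value supplied in metric units
]
total = total_impulse(burns)
```

Because every value has to be created through a constructor that names its unit, the requirement "which unit are we using?" has to be answered in the code rather than assumed.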

2003 – North America Blackout 

A software defect caused a blackout in North America that left around 50 million people without power. The failure was down to a software bug known as a ‘race condition’, which existed in the energy management system. The bug stalled the control room alarm system for more than an hour. The system operators were not made aware of the problem because the bug suppressed alerts for any changes in the system.

After the alarm failure, the unprocessed events queued up and caused the primary server to fail within 30 minutes. All applications were then transferred automatically to the backup server, which also failed.
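For readers who have not met one, a race condition happens when two threads read and update the same data at the same time and one update overwrites the other. The sketch below is a generic Python illustration, not the energy management system's code; the `EventCounter` class and `run` helper are hypothetical names. It shows the usual remedy, which is to guard the shared update with a lock.

```python
import threading


class EventCounter:
    """Counts events reported by several threads at once."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.pending = 0

    def record_event(self) -> None:
        # The read-then-write below is the critical section. Without the
        # lock, two threads could read the same value and one update
        # would be silently lost.
        with self._lock:
            current = self.pending
            self.pending = current + 1


def run(workers: int, events_per_worker: int) -> int:
    """Record events from several threads and return the final count."""
    counter = EventCounter()

    def work() -> None:
        for _ in range(events_per_worker):
            counter.record_event()

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.pending


total_events = run(workers=4, events_per_worker=1000)
```

With the lock in place, the final count always equals the number of events recorded. Race conditions are hard to catch precisely because the unguarded version can also pass most of the time.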

2014 – Emergency Call Failure

Thousands of people in America were unable to make calls to the emergency services due to a software defect. The automated system would usually assign a code to each incoming call in order to keep track of it. However, in April 2014 the software stopped counting calls after 40 million. This led to no new calls being counted, causing bottlenecks and failures further down the line.
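A hard-coded ceiling on a counter is easy to forget until the day it is reached. The sketch below shows one defensive approach in Python: warn before the limit is hit and fail loudly when it is, rather than silently stopping. It is a hypothetical illustration; the class names, the warning threshold and the small limit used here are assumptions, and nothing in it reflects how the real call-routing system was built.

```python
class CallIdExhausted(RuntimeError):
    """Raised when no more call IDs can be issued."""


class CallIdAllocator:
    """Issues sequential call IDs up to a fixed limit."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        # Warn once the last 10% of the ID range is reached.
        self.warn_at = limit - limit // 10
        self.next_id = 1
        self.warnings: list[str] = []

    def allocate(self) -> int:
        if self.next_id > self.limit:
            # Fail loudly so operators see the problem immediately.
            raise CallIdExhausted(f"call ID limit of {self.limit} reached")
        call_id = self.next_id
        self.next_id += 1
        if call_id == self.warn_at:
            self.warnings.append(f"call IDs nearly exhausted ({call_id}/{self.limit})")
        return call_id


# A tiny limit keeps the example quick; a real limit would be far larger.
allocator = CallIdAllocator(limit=10)
ids = [allocator.allocate() for _ in range(10)]
```

A test that simply drives the counter up to and past its limit would expose this kind of defect long before 40 million real calls did.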

 

So, how can we prevent these fatal errors in the future?

Clearly defined requirements can help to avoid defects in software. When requirements are written as active flowcharts, they are expressed in a way that everyone in the team can understand, so you can avoid miscommunication and bugs such as the NASA software failure. Teams work with the same vision, helping to avoid rework, delays and the wrong features being developed and tested.

This improves collaboration between the user, business and IT, reducing the ambiguity that leads to cost, delays and defects. A flowchart breaks down the ‘wall of words’ into smaller, understandable, visual processes. It allows users to do some of the hard thinking earlier in the development lifecycle.

Once imported, the otherwise static Business Process Flows can be optimized, creating a single active flowchart that contains all the qualitative information about a system needed for testing and development. Users can automatically generate use cases, complexity metrics, test cases, test data, virtual data, automation scripts, expected results and backlogs. This overcomes the constraints of having to work from disparate information sources such as Jira, Word and Visio, and provides a clarity of vision that is dynamic and can cope with change.
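To give a feel for how test cases can fall out of a flowchart, here is a minimal Python sketch that treats the flowchart as a graph and lists every route from start to finish, with each route becoming one candidate test case. This is an assumption made for illustration, not a description of how any specific tool generates its tests, and the flowchart steps are hypothetical.

```python
def enumerate_paths(flow: dict[str, list[str]], start: str, end: str) -> list[list[str]]:
    """Return every loop-free path through the flowchart from start to end."""
    paths: list[list[str]] = []

    def walk(node: str, path: list[str]) -> None:
        path = path + [node]
        if node == end:
            paths.append(path)
            return
        for nxt in flow.get(node, []):
            # Skip steps already on this path so loops cannot run forever.
            if nxt not in path:
                walk(nxt, path)

    walk(start, [])
    return paths


# A hypothetical call-handling flowchart: each key lists the steps that can follow it.
flow = {
    "receive_call": ["switch_busy?"],
    "switch_busy?": ["connect", "forward_to_next_switch"],
    "forward_to_next_switch": ["connect"],
    "connect": ["done"],
}

test_cases = enumerate_paths(flow, "receive_call", "done")
# test_cases contains two paths: one connecting directly and one that
# forwards the call to the next switch first.
```

Each path is a scenario the team can read, review and test, which is much harder to do with the same logic buried in paragraphs of requirements text.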

A more effective and efficient testing process could prevent a number of software failures, including the ones I have listed in this blog. The answer is not to employ more testers to do the job, but to use a tool that enhances the work of the existing testers. It is all about quality over quantity, and automated test case creation tools provide just that.

 

About the Author

 

Huw Price
Huw has been the lead technical architect for several US and European software companies over the last 30 years. Huw is Managing Director of Grid-Tools, a leading test data management vendor, and Chief Technical Architect at Agile Designer. Huw has been a guest speaker at many conferences including Oracle, HP, Star East and the IIBA’s UK Chapter. He was awarded “IT Director of the Year 2010” by QA Guild and is an adviser to King’s College London (KCL). Huw also has experience offering strategic advice to large corporations like Deutsche Bank and Capgemini.
About the Author

Rebecca

I am a technical writer with an interest in the world of testing and improving the test process.
Find out more about @rebeccagt