Is it possible to prevent a software glitch?
A software glitch is an unforeseen incident. Is it possible to prevent one at all?
The term “glitch” was described in the early 1960s by astronaut John Glenn and his flight team as a sudden change in voltage, one that could cause computers to restart. Something very similar happened to the “Altéa Amadeus” software, which manages the passenger reservation and verification system for 125 airlines in seven countries, including such well-known airlines as British Airways, Qantas, Southwest and Lufthansa.
We still remember the incident involving British Airways in May 2017, and now history is repeating itself, this time globally but temporarily. British Airways also suffered a similar mishap in August 2017, which required manual boarding of passengers; at that time the company offered a public apology but did not specify the cause of the problem, so it may have been an incident unrelated to a “software glitch”.
As a result of this failure, many people vented their opinions sarcastically on social networks. Remember that in this case we are not talking about a power failure: “software glitch” is an elegant way of labelling a transitory error that occurs under certain circumstances and that, even though the system restarts automatically, can cause many problems. Can you imagine spending four hours waiting to get on a plane? That is what happened to the travelers. Keeping planes on the ground while they waited for their passengers also cost the airlines a huge amount of money!
What is a “software glitch”?
The term “software glitch” comes from the 1960s, the dawn of the space age. Computers were still very primitive, but they calculated trajectories faster than any human being (and still do), so they became very important as a source of processed data. It turned out that these machines were sensitive to fluctuations in electrical voltage, which caused them to restart automatically. They then recovered without any intervention, as if nothing had happened, causing only a slight delay.
From there, the idea took hold that any unforeseen event from which a system recovers in a relatively short period of human time can be called a “software glitch” (bear in mind that for a computer anything over 200 milliseconds is an eternity). This is how the name spread from electronics to computer science (where it is very common in the field of video games), to radio and television broadcasting, and even to human behaviour (we consider that in a football match the unwritten rule on “the law of advantage” is a “human glitch”).
But in the 21st century, in a globalized world, a “software glitch” in an interconnected system inevitably affects the performance of the entire network or system. So let’s see whether it is possible to prevent one, or at least to avoid its repetition.
Official recognition of software glitch
The official statement about the “Altéa Amadeus” incident can be read at this link, although it does not explicitly mention the “software glitch” that occurred.
“Amadeus can confirm that our systems have recovered and are now functioning normally. Over the course of the morning, we had a network problem that caused disruptions in some of our systems. As a result of the incident, clients experienced interruptions in certain services. Our technical team took immediate steps to identify the cause of the problem and mitigate the impact on customers. Amadeus apologizes for any inconvenience caused to our customers.”
The fifteen-minute glitch
That was the time it took for the system to recover from the software glitch. But, like falling dominoes, systems worldwide then had to resynchronize with each other, and the resulting intermittent failures added up to four hours of inconvenience.
“Airlines For Europe” is an association of fifteen airlines that carries more than half a billion passengers in Europe. In its article of 27 July 2017 it “announced” delays of up to four hours due to the new immigration controls (the image at the top of this post is precisely the one they used in their warning campaign), but they never imagined that a “software glitch” would be what ultimately made their prediction a reality.
As this article announcing a cooperation agreement between “MIAT Mongolian Airlines” and “Altéa Amadeus” reveals, we quote:
“On this basis, everything will be fully automated, enabling a high level of technological service without human intervention.”
It is precisely this lack of human intervention that calls for automated monitoring, 24 hours a day, every day of the year, tirelessly searching for possible failures across the entire network at a global level. Software like Pandora FMS will always be ready to supervise thousands of nodes. Because it is free software, it can run in a GNU/Linux environment, an operating system that can be used to build a cluster of computers, which adds redundancy to the monitoring itself and gives us a tool scaled to the size of what needs to be monitored. For this cluster of servers, we can also set up “event correlation alerts”, which we can receive by text message on our mobile phone or via services such as Twitter or Telegram.
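To make the idea of an “event correlation alert” more concrete, here is a minimal sketch in Python. It is not Pandora FMS code and does not use its API: the `Event` and `CorrelationRule` names, the status values and the thresholds are all assumptions made for illustration. The rule fires when several distinct nodes report a critical status within the same time window, and it assumes events arrive in timestamp order.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One monitoring event (hypothetical shape, not a Pandora FMS type)."""
    timestamp: float  # seconds since an arbitrary reference point
    node: str         # illustrative node name
    status: str       # "ok" or "critical"


class CorrelationRule:
    """Fire when `min_nodes` distinct nodes go critical within `window` seconds."""

    def __init__(self, min_nodes: int, window: float) -> None:
        self.min_nodes = min_nodes
        self.window = window
        self._recent = deque()  # critical events still inside the window

    def feed(self, event: Event) -> bool:
        """Record one event; return True if the rule fires on it."""
        if event.status != "critical":
            return False
        self._recent.append(event)
        # Discard critical events that are older than the time window.
        while event.timestamp - self._recent[0].timestamp > self.window:
            self._recent.popleft()
        distinct_nodes = {e.node for e in self._recent}
        return len(distinct_nodes) >= self.min_nodes
```

With `CorrelationRule(min_nodes=3, window=60.0)`, a single node failing repeatedly never triggers the alert, while three different nodes failing within a minute does. In a real deployment the `True` result would be handed to whatever notification channel is configured (text message, Telegram and so on).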
Pandora FMS also has add-ons to monitor systems that handle large amounts of data; a specific example is Apache Cassandra (a distributed NoSQL database that can sit behind public-facing services). All of this should always follow best practices, since we need an action plan if we want to implement any serious and reliable monitoring system.
The nature of the problem raised
According to Bill Curtis, SVP and chief scientist at the software analysis company CAST, determining the exact causes of failure will take time: “Airline computers juggle various systems that must interact to control the door, reservations, ticket sales and frequent flyers. Each of these pieces may have been written separately by different companies.”
“Even if an airline has backup systems, the software running those probably has the same coding flaw. Tracking a software failure can be very difficult. It’s like investigating a crime; there’s a lot of data you have to go through and then try to figure out what really happened.”
In this environment of scattered data and apparent chaos we propose the Pandora FMS work method: all the information, collected directly or through agents, is saved second by second in powerful MySQL databases (which can also be deployed in clusters and with replicas for backup), and all of it remains available for future analysis. We are like the “black box” of an airplane, but for a “software glitch”.
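The “black box” idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the standard-library `sqlite3` module as a stand-in for MySQL, and the `samples` table, its columns and the function names are hypothetical, not the Pandora FMS schema. It stores timestamped samples and then pulls out everything recorded in a window around an incident.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the sample store (SQLite here, standing in for MySQL)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS samples ("
        " ts INTEGER NOT NULL,"    # Unix timestamp in seconds
        " agent TEXT NOT NULL,"    # machine that reported the sample
        " module TEXT NOT NULL,"   # what was measured
        " value REAL NOT NULL)"
    )
    # Index on time so that window queries stay fast as data grows.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_samples_ts ON samples (ts)")
    return conn


def record_sample(conn: sqlite3.Connection, ts: int, agent: str,
                  module: str, value: float) -> None:
    """Store one measurement."""
    conn.execute(
        "INSERT INTO samples (ts, agent, module, value) VALUES (?, ?, ?, ?)",
        (ts, agent, module, value),
    )
    conn.commit()


def window_around(conn: sqlite3.Connection, incident_ts: int,
                  before: int = 300, after: int = 300) -> list:
    """Return all samples from `before` seconds before the incident
    to `after` seconds after it, in chronological order."""
    cursor = conn.execute(
        "SELECT ts, agent, module, value FROM samples"
        " WHERE ts BETWEEN ? AND ? ORDER BY ts, agent, module",
        (incident_ts - before, incident_ts + after),
    )
    return cursor.fetchall()
```

Calling `window_around(conn, incident_ts)` after an outage returns the chronology of every stored measurement in the five minutes on either side of it, which is the kind of raw material a developer needs to reconstruct what happened.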
Concluding remarks on “software glitch”
Computer applications will always have failures. Even so, monitoring software that records the chronology and the conditions at the time of a “software glitch”, and that presents all this data to the developers in charge as a clear, well-organized report, will greatly shorten the search for the cause and its correction, avoiding “tripping over the same stone twice”.
“Altéa Suite” is a registered trademark of the company “Amadeus”, founded in 1987.