This past weekend, the Wall St. Journal published a front-page article detailing the investigation into the recent, deadly crashes of Boeing 737 MAX aircraft. It is a pretty extensive combination of information that I had seen before, new insights, and interviews with insiders. Cutting to the chase, they placed a large chunk of the blame on a regulatory structure that puts too much weight on the shoulders of the pilots. They showed, with a timeline, how many conflicting alarms the pilots received within a four-second period. If the pilots could have figured out the problem in those four seconds and taken the prescribed action, they could have saved the plane. The fact that the pilots had a procedure that they should have followed means the system fits within the safety guidelines for aircraft systems design.
Reading the article, I couldn’t help but think of another article that I read a few months back. I was directed to the older article by a friend, a software professional, on social media. His link was to an IEEE article that is now locked behind their members-only portal. The IEEE article, however, was a version of a blog post by the author and that original post remains available on Medium.
This detailed analysis is even longer than the newspaper version, but also very informative. Like the Wall St. Journal, the blog post traces the history behind the design of the hardware and software systems that went into the MAX’s upgrade. Informed speculation describes how the systems of those aircraft caused the crashes and, furthermore, how those systems came to be in the first place. As long as it is, I found it well worth the time to read in its entirety.
On my friend’s social media share, there was a comment to the effect that software developers should understand the underlying systems for which they are writing software. My immediate reaction was a “no,” and it’s that reaction I want to talk about here. I’ll also point out that Mr. Travis, the blog-post author and a programmer, is not blaming programmers or even programming methodology per se. His criticism is aimed at the highest level: at the corporate culture and processes and at the regulatory environment that governs these corporations. In this I generally agree with him, although I could probably nitpick some of his points. But first, the question of the software developer and what they should, can, and sometimes don’t understand.
There was a time, in my own career and (I would assume) in the career of the author, that statements about the requisite knowledge of programmers made sense. It was probably even industry practice to ensure that developers of control system software understood the controls engineering aspects of what they were supposed to be doing. Avionics software was probably an exception, rather than the rule, in that the industry was an early adopter of formal processes. For much of the software-elements-of-engineering-systems industry, programmers came from a wide mix of backgrounds and a key component of that background was what programmers might call “domain experience.” Fortran could be taught in a classroom but ten years’ worth of industry experience had to come the hard way.
Since we’ve been reminiscing about the artificial intelligence industry of the 80s and 90s, I’ll go back there again. I’ve discussed the neural network state-of-the-art, such as it was, of that time. Neural networks were intended to allow the machines to extract information about the system which the programmers didn’t have. Another solution to the same category of problems, again one that seemed to hold promise, was a category called expert systems, which directly made use of those experts who did have the knowledge that the programmers lacked. Typically, expert systems were programs built around a data set of “rules.” The rules contained descriptions of the action of the software relative to the physical system in a way that would be intuitive to a non-programmer. The goal was a division of labor. Software developers, experts in the programming, would develop a system to collect, synthesize, and execute the rules in an optimized way. Engineers or scientists, experts in the system being controlled, would create those rules without having to worry about the software engineering.
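To make that division of labor concrete, here is a minimal sketch of what such a rule engine might look like. The rules, thresholds, and sensor readings are hypothetical examples of my own, not taken from any real expert system product; the point is that a domain expert would author entries like these while the software team owns the engine that evaluates them.

```python
# A minimal sketch of a rule engine of the kind described above. The rules and
# sensor readings are invented for illustration.

RULES = [
    {"name": "high_temp_shutdown",
     "condition": lambda state: state["temperature_c"] > 95,
     "action": "shut down pump"},
    {"name": "low_pressure_alarm",
     "condition": lambda state: state["pressure_kpa"] < 200,
     "action": "raise low-pressure alarm"},
]

def evaluate(rules, state):
    """Return the action of every rule whose condition matches the current state."""
    return [rule["action"] for rule in rules if rule["condition"](state)]

print(evaluate(RULES, {"temperature_c": 102, "pressure_kpa": 310}))
# -> ['shut down pump']
```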
Was this a good idea? In retrospect, maybe not. While neural networks have found a new niche in today’s software world, expert systems remain an odd artifact found on the fringes of software design. So if it isn’t a good idea, why not? One question I remember being asked way-back-when got to the why. Is there anything you can do with a rule-based system that you couldn’t also implement with standard software techniques? To put it another way, is my implemented expert system engine capable of doing anything that I couldn’t have my C++ team code up? The answer was, and I think obviously, “no.” Follow that with, maybe, a justification about improved efficiencies in terms of development that might come from the expert systems approach.
Why take this particular trip down memory lane? Boeing’s system is not what we’d classify as AI. However, I want to focus on a particular software flaw implicated as a proximate cause of the crashes: the one that uses the pitch (angle-of-attack) sensors to avert stalls. Aboard the Boeing MAX, this is the “Maneuvering Characteristics Augmentation System” (MCAS). It is intended to enhance the pilot’s operation of the plane by automatically correcting for, and thereby eliminating, rare and non-intuitive flight conditions. Explaining the purpose of the system with more pedestrian terminology, Mr. Travis’ blog calls it the “cheap way to prevent a stall when the pilots punch it” system. It was made a part of the Boeing MAX as a way to keep the airplane’s operation the same as it always has been, using feedback about the angle-of-attack to avoid a condition that could occur only on a Boeing MAX.
On a large aircraft, the pitch sensors are redundant. There is one on each side of the plane and both the pilot and the co-pilot have indicators for their side’s sensor. Thus, if the pilot’s sensor fails and he sees a faulty reading, his co-pilot will still be seeing a good reading and can offer a different explanation for what he is seeing. As implemented, the software is part of the pilot’s control loop. MCAS is quickly, silently, and automatically doing what the pilot would be doing, were he to have noticed that the nose of the plane was rising toward a stall condition at high altitude. What it misses is the human interaction between the pilot and his co-pilot that might occur if the nose-up condition were falsely indicated by a faulty sensor. The pilot might say, “I see the nose rising too high. I’m pushing the nose down, but it doesn’t seem to be responding correctly.” At this point, the co-pilot might respond, “I don’t see that. My angle-of-attack reports normal.” This should lead them to deciding that the pilot should not, in fact, be responding to the warning produced by his own sensor.
Now, according to the Wall St. Journal article, Boeing wasn’t so blind as to simply ignore the possibility of a sensor failure. This wasn’t explained in the Medium article, but there are (and were) other systems that should have alerted a stricken flight crew to an incompatible difference in values between the two angle-of-attack sensors. Further, there was a procedure, called the “runaway stabilizer checklist,” that was to be enacted under that condition. Proper following of that checklist (within the four-second window, mind you) would have resulted in the deactivation of the MCAS system in reaction to the sensor failure. But why not, instead, design the MCAS system to either a) take all available, relevant sensors as input before assuming corrective action is necessary or b) take as input the warning about conflicting sensor readings? I won’t pretend to understand what all goes into this system. There are probably any number of reasons (some good, some not-so-good, and some entirely compelling) that drove Boeing to this particular solution. It is for that reason I led off using my expert system as an analogy: since I’m making the analogies, I can claim I understand, entirely, the problem that I’m defining.
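Just to illustrate what I mean by option (a), here is a toy sketch. This is emphatically not Boeing’s implementation; the function name, the disagreement limit, and the stall threshold are all numbers and labels I made up for the example.

```python
# A purely illustrative sketch (not Boeing's implementation) of cross-checking
# the two redundant angle-of-attack readings before commanding any nose-down
# trim. The thresholds and names are assumptions made for the example.

DISAGREEMENT_LIMIT_DEG = 5.0
STALL_THRESHOLD_DEG = 14.0

def trim_command(aoa_left_deg, aoa_right_deg):
    """Command nose-down trim only when both sensors agree one is needed."""
    if abs(aoa_left_deg - aoa_right_deg) > DISAGREEMENT_LIMIT_DEG:
        # The sensors disagree: flag the fault and leave control with the pilots.
        return {"trim_nose_down": False, "fault": "AOA DISAGREE"}
    if min(aoa_left_deg, aoa_right_deg) > STALL_THRESHOLD_DEG:
        return {"trim_nose_down": True, "fault": None}
    return {"trim_nose_down": False, "fault": None}

# A failed left sensor pegged high against a healthy right sensor is rejected
# here rather than triggering repeated nose-down trim.
print(trim_command(22.0, 3.0))
# -> {'trim_nose_down': False, 'fault': 'AOA DISAGREE'}
```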
Back then, a fellow engineer and enthusiast for technologies like expert systems and fuzzy logic (a promising technique to use rules for non-binary control) explained it to me with a textbook example. Imagine, as we’ve done before, you have a self-driving car. In this case, its self-driving intelligence uses a rule-based expert system for high-level decision making. While out and about, the car comes to an unexpected fork in the road. In computing how to react, one rule says to swerve left and one says to swerve right. In a fuzzy controller, the solution to conflicting conclusions might be to weight and average the two rule outputs. As a result, our intelligent car would elect to drive on, straight ahead, crashing into the tree that had just appeared in the middle of the road. The example is oversimplified to the point of absurdity, but it does point out a particular, albeit potential, flaw with rule-based systems. I also think it helps explain, by analogy, the danger lurking in the control of complex systems when your analysis is focused on discrete functions.
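For what it’s worth, the absurdity is easy to reproduce. Here is a toy sketch of the weighted-average step, one common way a fuzzy controller combines rule outputs; the rules, weights, and angles are invented for the illustration.

```python
# Two equally weighted rules recommend opposite steering angles, and a
# weighted-average combination step blends them into "go straight."
# All weights and angles are invented for illustration.

def blend_steering(rule_outputs):
    """Combine (weight, steering_angle_deg) pairs by weighted average."""
    total_weight = sum(weight for weight, _ in rule_outputs)
    return sum(weight * angle for weight, angle in rule_outputs) / total_weight

conflicting_rules = [
    (1.0, -30.0),  # rule 1: obstacle ahead, road is clear to the left
    (1.0, +30.0),  # rule 2: obstacle ahead, road is clear to the right
]

print(blend_steering(conflicting_rules))
# -> 0.0, i.e., straight ahead into the tree
```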
With the logic for your system being made up of independent components, the overall system behavior becomes “emergent” – a combination of the rule base and the environment in which it operates. In the above case, each piece of the component logic dictated swerving away from the obstacle. It was only when the higher-level system did its stuff that the non-intuitive “don’t swerve” emerged. In contrast with more traditional code design, the number of possible states in rule-based development may be indeterminate by design. Your expert input might be intended to be partial, completed only when synthesized with the operational environment. Or look at it by way of the quality assurance problem it creates. While you may be creating the control system logic without understanding the entire environment within which it will operate, wouldn’t you still be required to understand, exhaustively, that entire environment when testing? Otherwise, how could you guarantee what the addition of one more expert rule would or wouldn’t do to your operation?
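To put a number on that last question, and still using the invented steering example from above, adding a single innocuous-looking rule shifts the blended output for a scenario that had already been verified.

```python
# Continuing the invented steering example: one new rule changes the blended
# result for a previously "tested" scenario, which is the QA problem in
# miniature. All weights and angles are still made up.

def blend_steering(rule_outputs):
    """Combine (weight, steering_angle_deg) pairs by weighted average."""
    total_weight = sum(weight for weight, _ in rule_outputs)
    return sum(weight * angle for weight, angle in rule_outputs) / total_weight

baseline = [(1.0, -30.0), (1.0, +30.0)]      # verified earlier: blends to 0.0
with_new_rule = baseline + [(0.5, -30.0)]    # a new "prefer the left fork" rule

print(blend_steering(baseline))       # -> 0.0
print(blend_steering(with_new_rule))  # -> -6.0, the verified behavior has shifted
```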
Modern software engineering processes have been built, to a large extent, based on an understanding that the earlier you find a software issue, the cheaper it is to solve. A problem identified in the preliminary, architectural stage may be trivial. Finding and fixing something during implementation is more expensive, but not as expensive as creating a piece of buggy software that has to be fixed either during the full QA testing or, worse yet, after release. Good design methodologies also eliminate, as much as possible, the influence that lone coders and their variable styles and personalities might have upon the generation of large code bases. We now feel that integrated teams are superior to a few eccentric coding geniuses. This goes many times over when it comes to critical control systems upon which people’s lives may depend. Even back when, say, an accounting system might have been cobbled together by a brilliant hacker-turned-straight, avionics software development followed rigid processes meant to tightly control the quality of the final product. This all seems to be for the best, right?
Yes, but part of what I see here is a systematization that eliminates not just the bad influences of the individual, but their creative and corrective influence as well. If one person had complete creative control over the Boeing MAX software, that person likely would never have shipped something like the MCAS reliance on only one of a pair of sensors. The way we write code today, however, there may be no individual in charge. In this case, the decision to make the MCAS a link between the pilot’s control stick and the tail’s horizontal stabilizer rather than an automated response at a higher level isn’t a software decision; it’s a cockpit design decision. As such it’s not only outside of the purview of software design, but perhaps outside of the control of Boeing itself if it evolved as a reaction to a part of the regulatory structure. In a more general sense, though, will the modern emphasis on team-based, structured coding methodology have the effect of siloing the coders? A small programming team that has been assigned a discrete piece of the puzzle not only doesn’t have responsibility for the big-picture issues; those issues may not even be visible to them.
In other words (cycling back to that comment on my friend’s posting many months ago), shouldn’t the software developers understand the underlying systems for which they are writing software? Likely, the design/implementation structure for this part of the system would mean that it wouldn’t be possible for a programmer to see that either a) the sensor they are using as input is one of a redundant pair of sensors and/or b) there is a separate indication that might tell them whether the sensor they are using as input is reliable. Likewise, the large team-based development methodologies probably don’t attract to the avionics software team the programmer who is also a controls engineer who also has experience piloting aircraft – that ideal combination of programmer and domain expert that we talked about in the expert system days. I really don’t know whether this is an inevitable direction for software development or whether it varies, for better or for worse, from company to company. If the latter, the solutions may lie simply with culture and management within software development companies.
So far, I’ve mostly been explaining why we shouldn’t point the finger at the programmers, though neither of the articles does. In both cases, blame seems to be reserved for the highest levels of aircraft development: the business level and the regulatory level. The Medium article criticizes the use of engineered work-arounds for awkward physics as solutions to business problems (increasing the capacity of an existing plane rather than, expensively, creating a new one). The Wall St. Journal focuses on the philosophy that pilots will respond unerringly to warning indicators, often under very tight time constraints and under ambiguous and conflicting conditions. Both articles would tend to fault under-regulation by the FAA, but heavy-handed regulation may be just as much to blame as light oversight. Particularly, I’m thinking of the extent to which Boeing hesitated to pass information to its customers for fear of triggering expensive regulatory requirements. When regulations encourage a reduction in safety, is the problem under-regulation or over-regulation?
Another point that jumped out at me in the Journal article is that at least one of the redesigns that went into the Boeing MAX was driven by FAA high-level design requirements for today’s human-machine interfaces for aircraft control. From the WSJ:
[Boeing and FAA test pilots] suggested MCAS be expanded to work at lower speeds so the MAX could meet FAA regulations, which require a plane’s controls to operate smoothly, with steadily increasing amounts of pressure as pilots pull back on the yoke.
To adjust MCAS for lower speeds, engineers quadrupled the amount the system could repeatedly move the stabilizer, to increments of 2.5 degrees. The changes ended up playing a major role in the Lion Air and Ethiopian crashes.
To put this in context, the MCAS system was created to prevent an instability in some high-altitude conditions, conditions which came about as a result of larger engines that had been moved to a suboptimal position. Boeing decided that this instability could be corrected with software. But if I’m reading the above correctly, there are FAA regulations focused on making sure a fly-by-wire system still feels like the mechanically-linked controls of yore, and MCAS seemed perfectly suited to help satisfy that requirement as well. Pushing this little corner of the philosophy too far may have been a proximate cause of the Boeing crashes. Doesn’t this also, however, point to a larger issue? Is there a fundamental flaw with requiring that control systems artificially inject physical feedback as a way to communicate with the pilots?
In some ways, it’s a similar concern to what I talked about with the automated systems in cars. In addition to the question of whether over-automation is removing the connection between the driver/pilot and the operational environment, there is, for aircraft, an additional layer. An aircraft yoke’s design came about because it was directly linked to the control surfaces. In a modern plane, the controls are not. Today’s passenger jet could just as well use a steering wheel or a touch screen interface or voice-recognition commands. The designs are how they are to maintain a continuity between the old and the new, not necessarily to provide the easiest or most intuitive control of the aircraft as it exists today. In addition, and by regulatory fiat apparently, controls are required to mimic that non-existent physical feedback. That continuity and feedback may also be obscuring logical linkages between different control surfaces that could never have existed when the interface was mechanically linked to the controlled components.
I foresee two areas where danger could creep in. First, the pilot responds to the artificially-induced control feedback under the assumption that it is telling him something about the physical forces on the aircraft. But what if there is a difference? Could the pilot be getting the wrong information? It sure seems like a possibility when the feedback is being generated internally by the control system software. Second, the control component (in this case, the MCAS system) is doing two things at once: stabilizing the aircraft AND providing realistic feedback to the pilots by feel through the control yoke. Like the car that can’t decide whether to swerve right or left, such a system risks, in trying to do both, getting neither right.
I’ll sum up by saying I’m not questioning the design of modern fly-by-wire controls and cockpit layouts; I’m not qualified to do so. My questions are about the extent to which both regulatory requirements and software design orthodoxy box in the range of solutions available to aircraft-control designers in a way that limits the possibilities of creating safer, more efficient, and more effective aircraft for the future.