I guess you can call me a pragmatist. I prefer to think of myself as a realist. But mostly, I like to think of myself as what I really am – a Flight Operations engineer. Operations Engineers are “different” from Design Engineers (we’re right, and they’re wrong….’nuff said? No….just kidding.) “Different” doesn’t signify better or worse – it just means – different.
I got to thinking about this today after reading a good and thoughtful post over on the Yahoo GRT (EFIS) group that I help to moderate. The post was to the effect that the poster had found a bug in the software, and that he hoped that GRT would fix it soon because he didn’t feel like he could depend on the EFIS until it was fixed. Now he didn’t say what the bug was, so it is hard to evaluate whether or not it was “Safety of Flight”, or merely an ancillary function that wasn’t working, or not working right. (And I am not denigrating the poster – it is important that each of us feels comfortable to our own level.) But the concept of expecting software to be bug free was, I realized, a significant difference between designers and operators.
You see, I firmly believe that “bug free” software is a myth – just like Unicorns, Bigfoot, and the Lost Continent of Atlantis. It’s not that we don’t strive for and demand quality and perfection – it’s just that I have never met software – including extremely high reliability Aerospace software – that didn’t have the capability to surprise me. I have been flying man-rated spacecraft for over 25 years, and I can tell you that I have experienced my share of bugs that the designers would have sworn were impossible. I’ve had computers stop completely; I’ve had systems go into looping dilemmas, and I have seen the ever-popular “stack overflow” a number of times. Every single time, the cause is different, but almost without exception, we had just gotten into a corner of the software world that no one had ever thought about. Bugs happen, and no, they can’t be predicted.
So what do we do about it? Do we refuse to fly until they are all found and fixed? Nope – that is a logical impossibility, just like expecting to find a perpetual motion machine. You can’t find something that might or might not be there, and you don’t know what it will look like. Enter the Operations Engineer’s perspective. “OK”, we say, “let’s just assume that we are going to have bugs. The question is not how do we prevent them (for us – we aren’t designers), the question is ‘what are we going to do when they hit us?’” You plan for the worst case (this isn’t necessarily when the computer locks up….it might be more subtle – like the computer keeps running, but SLOWLY gives you bad answers….), and have a backup plan.
It all comes back to risk management. I simply don’t expect my EFIS – in ANY air or spacecraft, certified or experimental – to be perfect. But I do expect the overall system design to have a backup, and acceptable reliability. AHRS goes belly up? Have a spare AHRS. Display Unit freezes? Have a spare display, or backup instruments. Autopilot goes on strike? Be ready to hand fly to safety. If we refuse to fly a system until it is proven to be bug free, then we will never fly the system – you can’t prove that something “isn’t there”, at least not in the real world.
I’ll go out on a limb and state that ALL of the popular EFIS’s have, or will exhibit, bugs of some kind. The major reliability related ones have probably been fixed in testing. Minor, subtle bugs are still there for us to find. The answer is to design your overall avionics package to provide you with unrelated redundancy. In the Shuttle, we have similar function software coded by different people. In my airplane, I have an autopilot which can run independent of the EFIS, and comes from a different group of people. In essence, I am planning on seeing failures – planning on finding bugs. But good risk management practices keep those bugs from being fatal. I appreciate the hard work of avionics and software designers that strive for perfection. I guess I just believe that perfection is unattainable by human beings, no matter how hard we try – at some point, we just have to go fly!
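The idea of unrelated redundancy above boils down to something simple: two independently designed sources watching the same quantity, with the operator (or a monitor) alert for a miscompare. Here is a minimal sketch of that cross-check, in Python; the function name, the pitch comparison, and the 5-degree threshold are all hypothetical illustrations, not any vendor’s actual monitor logic.

```python
# Hypothetical sketch: cross-checking two independently designed attitude
# sources (say, a primary AHRS and an unrelated backup). The threshold
# is made up for illustration -- real miscompare limits are set by the
# system designer, not by this example.

DISAGREE_LIMIT_DEG = 5.0  # hypothetical miscompare threshold

def sources_agree(primary_pitch_deg, backup_pitch_deg,
                  limit=DISAGREE_LIMIT_DEG):
    """Return True if the two sources agree to within the limit.

    Note what a miscompare does NOT tell you: which source is lying.
    It only tells you that one of them is -- which is the cue to fall
    back on the backup plan (partial panel, hand flying, etc.).
    """
    return abs(primary_pitch_deg - backup_pitch_deg) <= limit

# Normal case: both sources track together, keep monitoring.
print(sources_agree(2.0, 2.5))   # agreement
# Failure case: one source is slowly drifting toward bad answers.
print(sources_agree(2.0, 9.0))   # miscompare
```

The point of the sketch is the design choice, not the arithmetic: the comparison only catches anything because the two inputs come from hardware and software with no common heritage, which is exactly the dissimilar-redundancy argument made above.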
Paul