I guess you can call me a pragmatist. I prefer to think of myself as a realist. But mostly, I like to think of myself as what I really am – a Flight Operations engineer. Operations Engineers are “different” from Design Engineers (we’re right, and they’re wrong….’nuff said? No….just kidding.) “Different” doesn’t signify better or worse – it just means – different.
I got to thinking about this today after reading a good and thoughtful post over on the Yahoo GRT (EFIS) group that I help to moderate. The post was to the effect that the poster had found a bug in the software, and that he hoped that GRT would fix it soon because he didn’t feel like he could depend on the EFIS until it was fixed. Now he didn’t say what the bug was, so it is hard to evaluate whether or not it was “Safety of Flight”, or merely an ancillary function that wasn’t working, or not working right. (And I am not denigrating the poster – it is important that each of us feels comfortable to our own level.) But the concept of expecting software to be bug free was, I realized, a significant difference between designers and operators.
You see, I firmly believe that “bug free” software is a myth – just like Unicorns, Bigfoot, and the Lost Continent of Atlantis. It’s not that we don’t strive for and demand quality and perfection – it’s just that I have never met software – including extremely high reliability Aerospace software – that didn’t have the capability to surprise me. I have been flying man-rated spacecraft for over 25 years, and I can tell you that I have experienced my share of bugs that the designers would have sworn were impossible. I’ve had computers stop completely; I’ve had systems go into looping dilemmas, and I have seen the ever-popular “stack overflow” a number of times. Every single time, the cause is different, but almost without exception, we had just gotten into a corner of the software world that no one had ever thought about. Bugs happen, and no, they can’t be predicted.
So what do we do about it? Do we refuse to fly until they are all found and fixed? Nope – that is a logical impossibility, just like expecting to find a perpetual motion machine. You can’t find something that might or might not be there, and you don’t know what it will look like. Enter the Operations Engineer’s perspective. “OK”, we say, “let’s just assume that we are going to have bugs. The question is not how do we prevent them (for us – we aren’t designers), the question is ‘what are we going to do when they hit us?’” You plan for the worst case (this isn’t necessarily when the computer locks up….it might be more subtle – like the computer keeps running, but SLOWLY gives you bad answers….), and have a backup plan.
It all comes back to risk management. I simply don’t expect my EFIS – in ANY air or spacecraft, certified or experimental – to be perfect. But I do expect the overall system design to have a backup, and acceptable reliability. AHRS goes belly up? Have a spare AHRS. Display Unit freezes? Have a spare display, or backup instruments. Autopilot goes on strike? Be ready to hand fly to safety. If we refuse to fly a system until it is proven to be bug free, then we will never fly the system – you can’t prove that something “isn’t there”, at least not in the real world.
I’ll go out on a limb and state that ALL of the popular EFIS’s have, or will exhibit, bugs of some kind. The major reliability related ones have probably been fixed in testing. Minor, subtle bugs are still there for us to find. The answer is to design your overall avionics package to provide you with unrelated redundancy. In the Shuttle, we have similar function software coded by different people. In my airplane, I have an autopilot which can run independent of the EFIS, and comes from a different group of people. In essence, I am planning on seeing failures – planning on finding bugs. But good risk management practices keep those bugs from being fatal. I appreciate the hard work of avionics and software designers that strive for perfection. I guess I just believe that perfection is unattainable by human beings, no matter how hard we try – at some point, we just have to go fly!
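The idea of unrelated redundancy above boils down to something simple: two independently designed sources watching the same quantity, with the operator (or a monitor) alert for a miscompare. Here is a minimal sketch of that cross-check, in Python; the function name, the pitch comparison, and the 5-degree threshold are all hypothetical illustrations, not any vendor’s actual monitor logic.

```python
# Hypothetical sketch: cross-checking two independently designed attitude
# sources (say, a primary AHRS and an unrelated backup). The threshold
# is made up for illustration -- real miscompare limits are set by the
# system designer, not by this example.

DISAGREE_LIMIT_DEG = 5.0  # hypothetical miscompare threshold

def sources_agree(primary_pitch_deg, backup_pitch_deg,
                  limit=DISAGREE_LIMIT_DEG):
    """Return True if the two sources agree to within the limit.

    Note what a miscompare does NOT tell you: which source is lying.
    It only tells you that one of them is -- which is the cue to fall
    back on the backup plan (partial panel, hand flying, etc.).
    """
    return abs(primary_pitch_deg - backup_pitch_deg) <= limit

# Normal case: both sources track together, keep monitoring.
print(sources_agree(2.0, 2.5))   # agreement
# Failure case: one source is slowly drifting toward bad answers.
print(sources_agree(2.0, 9.0))   # miscompare
```

The point of the sketch is the design choice, not the arithmetic: the comparison only catches anything because the two inputs come from hardware and software with no common heritage, which is exactly the dissimilar-redundancy argument made above.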
Paul