Sunday, May 12, 2013

The Fault -> Error -> Failure Chain

[Editor's Note: First posted on THE SHORTED TURN's WordPress.com site on 2010/02/16.]


Copyright © 2010 Gary R. Van Sickle

Hello once again, Gentle Reader,

Today let’s discuss software defect terminology.  We’ve all complained about how “buggy” some piece of software we use is, or read the headlines about a “glitch” in some large software system causing unspeakable mayhem.  Terms like “bug” and “glitch” perhaps have a certain charm that makes software defects easier for the layperson to swallow or the PR department to explain.  “A spoonful of sugar” and all that.  But Gentle Reader, as software professionals, you and I don’t traffic in charm – the world in which you and I live is a cold, dark place where the only “charming” “bug” is a “dead” “bug”, and our only respite is the incessant grind of ensuring, of knowing, that our software is as defect-free and as safe as we could make it.

As professionals, we can and should be more precise with our terminology when discussing software defects.




To wit:

Faults

A Fault can be succinctly defined as “incorrect code”.  Consider the following code from some fictional spacecraft guidance system (why do I pick on spaceships so much?).  This code has at least one glaring Fault:
float CalculateAccelerationOnMassDueToForce(float mass_in_kgs, float force_in_N)
{
    return (force_in_N/mass_in_kgs) + 0.1;
}
Anyone who’s taken high-school physics will recognize that the “+ 0.1” has no business being there.  How it got there is not germane to our discussion.  Maybe it was a cut-and-paste-o, maybe the developer didn’t do well in physics class: whatever the case, the code is inarguably incorrect.  This is termed a fault.
“Ah-hah! Now I got ya Shorted Turn, you old so-and-so!”, you might say.  “Maybe that code is taking into account the gravitational acceleration of some other planet or something!  Then it’s not wrong!  YOU JUST GOT SERVED!”
Yes, you might say that, but you’d be wrong, twice:

  • The function is named “CalculateAccelerationOnMassDueToForce()”, not “CalculateAccelerationOnMassDueToForceAndTheGravityOfSomePlanet()”.  Code that doesn’t do what it claims to do is faulty.
  • Neither is the function commented to that effect.  Uncommented code with non-blindingly-obvious functionality is faulty.

It’s important to note that faults exist statically in the code – i.e., the code doesn’t have to be running for the fault to exist.  As we’ll see in the next section, faults are errors waiting to happen.
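
For contrast, here is a minimal sketch of what a fault-free version might look like, assuming the function really is meant to compute nothing more than Newton's second law (a = F/m).  The name CalculateAccelerationOnMassDueToForceFixed is hypothetical, invented here purely for illustration:

```c
/* Hypothetical corrected version; the name is an assumption for illustration.
   Returns the acceleration in m/s^2 from Newton's second law, a = F / m.
   The caller is assumed to pass a nonzero mass. */
float CalculateAccelerationOnMassDueToForceFixed(float mass_in_kgs, float force_in_N)
{
    return force_in_N / mass_in_kgs;
}
```

Note that the name and the behavior now agree, which is exactly what the faulty version lacked.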

Errors

Errors have a slightly slippery definition.  They’re not what you think they are: you’re thinking of “failures”, and we’ll get to them in a moment.  In the spacecraft context introduced above, an error occurs whenever the function is called and returns the wrong answer.  The incorrect answer is the error.  In Software Fault Tolerance legalese, “an error is the part of the system state that is liable to lead to a failure” [1].

Notice two important characteristics of an error:
  1. It is a run-time entity.  No error exists until the code is run.
  2. It is not necessarily a problem.  Yeah, you read that right.  Maybe the code above is calculating the acceleration caused by the five massive first-stage F-1 engines on our 1-kg Mars Super Orbiter Plus (Gold Edition) only once at the very beginning of launch for some reason, and the spurious 0.1 m/s^2 offset is completely swamped by the acceleration the main engines produce.  Then again, maybe that code is used for calculating the effect of our poor spacecraft’s tiny ullage motors on its overall trajectory, in which case we may very well have a problem.
When an error does become a problem, then Houston, we have a failure.
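
To put rough numbers on the second point, the sketch below measures how large the error is relative to the correct answer.  RelativeErrorOfAcceleration is a hypothetical helper that is not part of the original guidance code, and the masses and forces mentioned afterwards are made up for illustration:

```c
/* The faulty function from the Faults section, reproduced unchanged. */
float CalculateAccelerationOnMassDueToForce(float mass_in_kgs, float force_in_N)
{
    return (force_in_N/mass_in_kgs) + 0.1;
}

/* Hypothetical helper: the size of the error as a fraction of the correct
   value F / m.  Assumes a nonzero mass and a nonzero force. */
double RelativeErrorOfAcceleration(float mass_in_kgs, float force_in_N)
{
    double correct = (double)force_in_N / (double)mass_in_kgs;
    double actual = CalculateAccelerationOnMassDueToForce(mass_in_kgs, force_in_N);
    double diff = actual - correct;
    double scale = correct;

    if (diff < 0.0) diff = -diff;      /* absolute size of the error */
    if (scale < 0.0) scale = -scale;   /* absolute size of the right answer */
    return diff / scale;
}
```

With those made-up numbers, a 1000 kg vehicle under 50,000 N of thrust sees a relative error of about 0.2%, which may well be tolerable.  The same vehicle under 10 N gets back an acceleration roughly eleven times the correct one.  Same fault, same 0.1 m/s^2 error, very different odds of a failure.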

Failures

Now for the easy part, failure.  The failure is the spacecraft crashing into Mars when it was supposed to go into orbit.  The crash is due to the above error, which was in turn caused by the fault in the spacecraft’s guidance system code at the start of this screed.  More generally, a failure occurs when the system’s behavior deviates from its intended behavior.

“But Shorted Turn, Why Should I Care?”

Gentle Reader, for shame!  You know The Shorted Turn only presents information of the highest usability!  But, you’re right, perhaps you shouldn’t care.  Perhaps the Gentle Reader’s boss likes it when you crash his spacecraft into other planets.  Oh, he doesn’t?  Oh, OK, so how about these reasons:
  1.  Developers tend to report faults, while Testers and Users almost exclusively report failures.
    Take a look at the reports in your organization’s defect tracking system sometime.  Armed with the above, I’d wager a nominal sum that you can tell if the reporter of any particular issue was a developer or a tester.  It’s important to realize this difference to facilitate communication between these two groups, who tend to look at the same issue from different angles.
  2. Systems can often be engineered to tolerate and even correct errors.
    Don’t skim past that, Gentle Reader.  It is widely accepted that fault-free software development is quite beyond the current state of the art, and once a failure happens, it’s too late.  But errors, since they occur at runtime, do allow us an opportunity to intercept and deal with them before they become failures.
  3. Knowing how the Fault->Error->Failure chain works can keep meetings from going off the rails.
    How many meetings have you attended that went on for hours on end, where all everyone did was throw out all kinds of ideas on how to “fix” a failure that hasn’t even had the beginnings of a proper analysis?  Right: Too many.  Often a simple statement to the effect of “Guys, we don’t even fully understand the failure report yet, let’s leave the solution brainstorming for later” can save literally hours of non-productivity.
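
As a rough illustration of reason 2, the sketch below wraps the faulty function in a run-time acceptance test with a fallback, in the general spirit of the recovery-block idea from the software fault tolerance literature.  It is a sketch under stated assumptions, not a recommended flight-software design: AccelerationIsPlausible, CheckedAcceleration, and the 1% tolerance are all hypothetical, and the fallback simply stands in for an independently written alternate routine.

```c
/* Primary routine: the faulty function from the Faults section, unchanged. */
float CalculateAccelerationOnMassDueToForce(float mass_in_kgs, float force_in_N)
{
    return (force_in_N/mass_in_kgs) + 0.1;
}

/* Hypothetical acceptance test: does accel * mass reproduce the applied
   force to within a relative tolerance?  Returns 1 if plausible, else 0. */
int AccelerationIsPlausible(float accel, float mass_in_kgs, float force_in_N,
                            float rel_tol)
{
    float diff = (accel * mass_in_kgs) - force_in_N;
    float scale = force_in_N;

    if (diff < 0.0f) diff = -diff;
    if (scale < 0.0f) scale = -scale;
    return diff <= rel_tol * scale;
}

/* Hypothetical wrapper: run the primary, check its answer, and fall back to
   an alternate computation if the check fails.  *error_detected is set to 1
   when the error was caught and to 0 otherwise.  Assumes a nonzero mass. */
float CheckedAcceleration(float mass_in_kgs, float force_in_N, int *error_detected)
{
    float accel = CalculateAccelerationOnMassDueToForce(mass_in_kgs, force_in_N);

    /* The 1% tolerance is an assumed value chosen for illustration. */
    *error_detected = !AccelerationIsPlausible(accel, mass_in_kgs, force_in_N, 0.01f);
    if (*error_detected) {
        /* Stand-in for an independently developed alternate routine. */
        accel = force_in_N / mass_in_kgs;
    }
    return accel;
}
```

The fault is still sitting in the primary routine, but the error it produces is intercepted at run time whenever it is big enough to matter, before it has a chance to become a failure.  When the error is small compared to the tolerance, the check lets it through, which is the "not necessarily a problem" case from the Errors section.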
Those, Gentle Reader, are but a few reasons for knowing about the Fault->Error->Failure chain terminology.  Now get out there and write some code - but hey: Let’s be careful out there!

Until next time, Gentle Reader, I remain,
Gary R. Van Sickle

[1] Pullum, Laura L., Software Fault Tolerance Techniques and Implementation, Massachusetts: Artech House, Inc., 2001, p. 3.
