
Fail Safe


One of the most misunderstood engineering terms is 'fail safe'. Most people from a non-engineering background (including many software developers) believe it means that something won't fail. Last week even The Economist used it incorrectly.

A 'fail safe' device/system is expected to fail eventually, but when it does it will fail in a safe way. Classic examples include train brakes that engage automatically when the braking system fails, and ratchet mechanisms in lifts/elevators that stop the car from dropping if the cable breaks. Well-engineered physical devices will state their Mean Time Between Failure (MTBF) and define how they can fail and what happens when they do. A well-maintained physical device may never fail over its lifetime, but you know what will happen if it does.

A fail safe physical device may also define what happens when user error causes it to behave in an undesired manner. For example, the “dead man's handles” on lawn mowers or electric drills. I own an angle grinder, and in order to turn it on I have to flick a switch and then pull a trigger. Importantly, if I let go of the trigger the cutting blade stops. This means that if I drop it I'm much less likely to lose a foot. When the trigger is released the switch is also reset, so the trigger can't be pressed accidentally if the tool bounces off an object.

As there is no physical wear and tear on a software system, the concept of MTBF is arguably not applicable. However, software systems can and do fail all the time, so it's perhaps surprising that many software systems I've experienced neither cope with failure very well nor define what should happen when they fail. For example, the following may happen:

  • Underlying hardware failure. Networks and external disks are the ones I encounter most.
  • External system failure. Obviously your system is perfect but external systems you rely on start to feed you garbage.
  • User error. If you create an idiot-proof system then I guarantee someone will employ a better idiot.

It's tempting to try to correct a failure situation and keep on running but this can lead to a system getting into an unknown state and creating more issues. For example:

  • The network is not responding but you keep on processing inputs and queuing outputs, hoping it comes back. Your caches and disks fill up, affecting other systems. Eventually the network does come back online and your system stops responding as it churns through hours' worth of stale data.
  • An external data provider starts sending blanks in a numeric field. A developer had previously decided to 'interpret' an empty field as zero (when it was actually missing data); this fed through a bank's pricing systems and was forwarded on to other systems, which then tried to execute buys (they were obviously a bargain at zero!). A sketch of a stricter approach follows this list.
  • In finance we worry about 'fat fingers', where a trader hits the wrong keys and buys 12 million rather than 1 million...
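As an illustration of failing the input rather than 'correcting' it, here is a minimal Java sketch of a stricter parser for that numeric field. The names (PriceParser, PriceParseException, parsePrice) are hypothetical, not from any real feed API:

    import java.math.BigDecimal;

    // Hypothetical parser for a numeric price field from an external feed.
    public final class PriceParser {

        public static class PriceParseException extends Exception {
            public PriceParseException(String message) {
                super(message);
            }
        }

        // Fail the input rather than 'interpret' a blank as zero:
        // a missing price is missing data, not a price of 0.
        public static BigDecimal parsePrice(String rawField) throws PriceParseException {
            if (rawField == null || rawField.trim().isEmpty()) {
                throw new PriceParseException("Missing price field; refusing to default to zero");
            }
            try {
                return new BigDecimal(rawField.trim());
            } catch (NumberFormatException e) {
                throw new PriceParseException("Unparseable price field: '" + rawField + "'");
            }
        }
    }

The exception forces the caller to make an explicit decision about the bad transaction instead of letting a silent zero flow downstream.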

All of the above are real examples I have come across. How would I have changed the failure handling? I prefer to put the system into a known, safe state if possible.

  • Put limits on anything you do in recovery situations, e.g. retry only three times, put a time limit on cached data, etc. Don't keep doing something that isn't working (see the bounded-retry sketch after this list).
  • Don't make generic assumptions about correcting data across a system. If an input isn't good then fail that input: you have no idea what it really means, and 'correcting' it just hides the error. Note that I'm not suggesting the entire system should be suspended, but the transactions that are in error should be suspended and reported on.
  • User inputs are often sanity-checked, but “are you sure?” dialogs are clicked through without being read, or the “never show this again” checkbox is ticked. Ultimately there is only so much you can do to save users from themselves, but you might want to keep an audit of their decisions...
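Here is a minimal sketch of the bounded-retry idea in Java. The helper and its parameters (withRetries, maxAttempts, delayMillis) are illustrative; a real system would likely add back-off and distinguish retryable from fatal errors:

    import java.util.concurrent.Callable;

    // Hypothetical bounded-retry helper: give up after a fixed number of
    // attempts instead of retrying forever against a dead network.
    public final class BoundedRetry {

        public static <T> T withRetries(Callable<T> task, int maxAttempts, long delayMillis)
                throws Exception {
            if (maxAttempts < 1) {
                throw new IllegalArgumentException("maxAttempts must be at least 1");
            }
            Exception lastFailure = null;
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                try {
                    return task.call();
                } catch (Exception e) {
                    lastFailure = e;
                    if (attempt < maxAttempts) {
                        Thread.sleep(delayMillis);
                    }
                }
            }
            // A known, safe state: stop retrying and surface the failure so the
            // transaction can be suspended and reported, not silently looped on.
            throw lastFailure;
        }
    }

Calling code might wrap a flaky send as withRetries(() -> sendToDownstream(msg), 3, 1000) (sendToDownstream being whatever your system's send operation is) and, on failure, suspend and report the transaction rather than queuing work indefinitely.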

It's important not just to put the system (or transaction) into a safe state but also to inform those who can resolve the situation. As developers we often write

LOG.warn("Transaction X has failed")

and think nothing more of it. It's amazing to point a reporting tool like Splunk at a mature system and extract all the worrying messages. Would it be more appropriate to send an email, a pager message or a text message, or to change a dashboard status?
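As a sketch of what "more than a log line" might look like, the failure handler below (using SLF4J-style logging, as in the snippet above) both logs the failure and pushes it to an operator-facing channel. AlertService and notifyOperators are hypothetical; in practice they might be backed by email, a pager, or a dashboard:

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class TransactionFailureHandler {

        private static final Logger LOG = LoggerFactory.getLogger(TransactionFailureHandler.class);

        // Hypothetical operator-facing channel; not a real library API.
        public interface AlertService {
            void notifyOperators(String transactionId, String reason);
        }

        private final AlertService alerts;

        public TransactionFailureHandler(AlertService alerts) {
            this.alerts = alerts;
        }

        public void onFailure(String transactionId, String reason) {
            // Keep the log entry for forensics...
            LOG.warn("Transaction {} has failed: {}", transactionId, reason);
            // ...but also actively tell someone who can resolve it.
            alerts.notifyOperators(transactionId, reason);
        }
    }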

We need to design the error reporting and monitoring services up front and define how the operators should be kept informed. We also need to allow the operators to resolve issues speedily and safely.

To conclude, ask these questions of any system:

  • How can a system fail?
  • What safe state can be entered?
  • How can the failure be reported?
  • How can the issue be resolved?
