
Fail Safe


One of the most misunderstood engineering terms is 'fail safe'. Most people from a non-engineering background (including many software developers) believe it means that something won't fail. Last week even The Economist used it incorrectly.

A 'fail safe' device/system is expected to fail eventually, but when it does it will fail in a safe way. Classic examples include train brakes that engage automatically when the braking system fails, and ratchet mechanisms in lifts/elevators that stop the car from dropping if the cable breaks. Well-engineered physical devices will state their Mean Time Between Failure (MTBF) and define how they can fail and what happens when they do. A well-maintained physical device may never fail over its lifetime, but you know what will happen if it does.

A fail safe physical device may also define what happens when user error causes it to behave in an undesired manner. For example, the “dead man's handles” on lawn mowers or electric drills. I own an angle grinder, and in order to turn it on I have to flick a switch and then pull a trigger. Importantly, if I let go of the trigger the cutting blade stops. This means that if I drop it I'm much less likely to lose a foot. When the trigger is released the switch is also reset, so the trigger can't be pressed accidentally if the tool bounces off an object.

As there is no physical wear and tear on a software system, the concept of MTBF is arguably not applicable. However, software systems can and do fail all the time, so it's perhaps surprising that many software systems I've experienced neither cope with failure very well nor define what should happen when they fail. For example, the following may happen:

  • Underlying hardware failure. Networks and external disks are the ones I encounter most.
  • External system failure. Obviously your system is perfect but external systems you rely on start to feed you garbage.
  • User error. If you create an idiot-proof system then I guarantee someone will employ a better idiot.

It's tempting to try to correct a failure situation and keep on running but this can lead to a system getting into an unknown state and creating more issues. For example:

  • The network is not responding but you keep on processing inputs and queuing outputs, hoping it comes back. Your caches and disks fill up, affecting other systems. Eventually the network does come back online and your system stops responding as it churns through hours' worth of stale data.
  • An external data provider starts sending blanks in a numeric field. A developer had previously decided to 'interpret' an empty field as zero (when it was actually missing data); this fed through a bank's pricing systems and was forwarded on to other systems, which then tried to execute buys (they were obviously a bargain at zero!). A sketch of a stricter approach follows this list.
  • In finance we worry about 'fat fingers', where a trader hits the wrong keys and buys 12 million rather than 1 million...
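As an illustration of failing the input rather than 'correcting' it, here is a minimal Java sketch of a stricter parser for that numeric field. The names (PriceParser, PriceParseException, parsePrice) are hypothetical, not from any real feed API:

    import java.math.BigDecimal;

    // Hypothetical parser for a numeric price field from an external feed.
    public final class PriceParser {

        public static class PriceParseException extends Exception {
            public PriceParseException(String message) {
                super(message);
            }
        }

        // Fail the input rather than 'interpret' a blank as zero:
        // a missing price is missing data, not a price of 0.
        public static BigDecimal parsePrice(String rawField) throws PriceParseException {
            if (rawField == null || rawField.trim().isEmpty()) {
                throw new PriceParseException("Missing price field; refusing to default to zero");
            }
            try {
                return new BigDecimal(rawField.trim());
            } catch (NumberFormatException e) {
                throw new PriceParseException("Unparseable price field: '" + rawField + "'");
            }
        }
    }

The exception forces the caller to make an explicit decision about the bad transaction instead of letting a silent zero flow downstream.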

All of the above are real examples I have come across. How would I have changed the failure handling? I prefer to put the system into a known, safe state if possible.

  • Put limits on anything you do in recovery situations, e.g. retry only three times, put a time limit on cached data, etc. Don't keep doing something that isn't working (see the bounded-retry sketch after this list).
  • Don't make generic assumptions about correcting data across a system. If an input isn't good then fail that input: you have no idea what it really means, and 'correcting' it just hides the error. Note that I'm not suggesting the entire system should be suspended, but the transactions that are in error should be suspended and reported on.
  • User inputs are often sanity-checked, but “are you sure?” dialogs are clicked through without being read, or the “never show this again” checkbox is ticked. Ultimately there is only so much you can do to save users from themselves, but you might want to keep an audit of their decisions...
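Here is a minimal sketch of the bounded-retry idea in Java. The helper and its parameters (withRetries, maxAttempts, delayMillis) are illustrative; a real system would likely add back-off and distinguish retryable from fatal errors:

    import java.util.concurrent.Callable;

    // Hypothetical bounded-retry helper: give up after a fixed number of
    // attempts instead of retrying forever against a dead network.
    public final class BoundedRetry {

        public static <T> T withRetries(Callable<T> task, int maxAttempts, long delayMillis)
                throws Exception {
            if (maxAttempts < 1) {
                throw new IllegalArgumentException("maxAttempts must be at least 1");
            }
            Exception lastFailure = null;
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                try {
                    return task.call();
                } catch (Exception e) {
                    lastFailure = e;
                    if (attempt < maxAttempts) {
                        Thread.sleep(delayMillis);
                    }
                }
            }
            // A known, safe state: stop retrying and surface the failure so the
            // transaction can be suspended and reported, not silently looped on.
            throw lastFailure;
        }
    }

Calling code might wrap a flaky send as withRetries(() -> sendToDownstream(msg), 3, 1000) (sendToDownstream being whatever your system's send operation is) and, on failure, suspend and report the transaction rather than queuing work indefinitely.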

It's important not just to put the system (or transaction) into a safe state but also to inform those who can resolve the situation. As developers we often write

LOG.warn("Transaction X has failed")

and think nothing more of it. It's amazing to point a reporting tool like Splunk at a mature system and extract all the worrying messages. Would it be more appropriate to send an email, a pager message or a text message, or to change a dashboard status?
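As a sketch of what "more than a log line" might look like, the failure handler below (using SLF4J-style logging, as in the snippet above) both logs the failure and pushes it to an operator-facing channel. AlertService and notifyOperators are hypothetical; in practice they might be backed by email, a pager, or a dashboard:

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class TransactionFailureHandler {

        private static final Logger LOG = LoggerFactory.getLogger(TransactionFailureHandler.class);

        // Hypothetical operator-facing channel; not a real library API.
        public interface AlertService {
            void notifyOperators(String transactionId, String reason);
        }

        private final AlertService alerts;

        public TransactionFailureHandler(AlertService alerts) {
            this.alerts = alerts;
        }

        public void onFailure(String transactionId, String reason) {
            // Keep the log entry for forensics...
            LOG.warn("Transaction {} has failed: {}", transactionId, reason);
            // ...but also actively tell someone who can resolve it.
            alerts.notifyOperators(transactionId, reason);
        }
    }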

We need to design the error reporting and monitoring services up front and define how the operators should be kept informed. We also need to allow the operators to resolve issues speedily and safely.

To conclude, ask these questions of any system:

  • How can a system fail?
  • What safe state can be entered?
  • How can the failure be reported?
  • How can the issue be resolved?
