Friday 3 January 2014

The most common cause of unavailability

Hi, Happy new year.

I've done a lot of work on high-availability systems. There is a lot of writing on the subject: how to implement failover, hot-spare systems, load-balancers and so on.

However, most of these seem to make an assumption: humans are infallible.

In practice, this is not always the case.

In fact, I'd say that probably about 75% of downtime is caused by human errors, cock-ups and mistakes. I'm not an expert, but I suspect it's roughly the same proportion as air crashes caused by pilot error (or someone else's).

So, it's the human, stupid. PEBKAC (problem exists between keyboard and chair).

Here are some possible fixes:

Give humans less work to do

We can avoid SOME human errors by having systems automatically configure and set themselves up, or by performing sanity checks before accepting settings.

"Blindly accepting" instructions from an idiot human is something computers are generally very good at, and it often results in chaos.
  • Automatic configuration is less likely to make a mistake (when was the last time you saw a DHCP server give out a duplicate IP address? Never?)
  • Testing a configuration provided by a human, BEFORE applying it, might prevent a mistake (see the sketch after this list).
HOWEVER
  • Automatic configuration is a "clever" piece of software which can be difficult to test; autoconfiguration tests are sometimes mind-blowingly difficult to set up, which means it's not likely to be very well tested.
  • If autoconfig gets it wrong, it's likely to do it on a large scale.
  • Automatic configuration probably means that the humans have no idea how it works, which may be bad.
  • Autoconfig might not correctly handle "very unusual" circumstances.
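
As a concrete illustration of "test it before applying it", here is a minimal Python sketch of a pre-flight check on a human-supplied configuration. The field names (listen_port, server_address) are invented for the example; the point is simply that dubious settings are rejected, with an explanation, before they can take effect.

# Sketch only: the field names here are hypothetical.
import ipaddress

def validate_config(cfg):
    """Return a list of human-readable problems; an empty list means the config looks sane."""
    problems = []

    port = cfg.get("listen_port")
    if not isinstance(port, int) or not 1 <= port <= 65535:
        problems.append(f"listen_port must be an integer between 1 and 65535, got {port!r}")

    addr = cfg.get("server_address", "")
    try:
        ipaddress.ip_address(addr)
    except ValueError:
        problems.append(f"server_address {addr!r} is not a valid IP address")

    return problems

def apply_config(cfg):
    problems = validate_config(cfg)
    if problems:
        # Refuse to "blindly accept" the human's instructions.
        raise ValueError("refusing to apply config: " + "; ".join(problems))
    # ...the settings would actually be applied here...

# A typo'd address is caught before it can cause any downtime.
try:
    apply_config({"listen_port": 8080, "server_address": "192.168.1.300"})
except ValueError as err:
    print(err)

Running it prints a complaint about the broken address instead of quietly accepting it.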

Give humans more work to do

Perversely, we could give the humans more work. Something someone does every day, they're unlikely to get wrong; it's the procedures done infrequently that are more likely to be a problem.

  • Manual configuration means less code to test.
  • Humans (ops engineers, support engineers) become practised at getting the manual configuration right
  • Humans might be able to set up unusual configurations required for unusual circumstances. (Corollary: such unusual configurations probably won't have been tested!)

My view

This is a completely opinionated rant...

  • Humans cannot be trusted to do anything
  • Make everything either automatic, or hard-coded
  • Give users as few options as possible. Perhaps let them make some cosmetic changes (colours, window layout?). Don't let them change the TCP port number your application uses.
  • There is usually far less value in allowing a parameter to be changed than in forcing it to a reasonable hard-coded value (see the sketch after this list).
  • Sales engineers want to see screens full of knobs and dials? Give them placebo ones which do little or nothing (i.e. ones which can't break anything).
  • If you put a big red button in the reactor control room, marked "Do not press.", someone will press it sooner or later.
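
Here is a minimal Python sketch of what "hard-code the important values, expose only the cosmetic ones" might look like. The setting names and the port number are made up for the example; the idea is that anything which could break connectivity simply isn't offered as an option.

# Sketch only: the settings and port number are hypothetical.
SERVER_PORT = 8443   # hard-coded: users never get to change this

COSMETIC_SETTINGS = {"theme_colour", "window_layout"}

class UserPreferences:
    def __init__(self):
        self._prefs = {"theme_colour": "blue", "window_layout": "default"}

    def set(self, name, value):
        if name not in COSMETIC_SETTINGS:
            # Anything that could cause an outage is simply not a preference.
            raise KeyError(f"{name!r} is not a user-configurable option")
        self._prefs[name] = value

prefs = UserPreferences()
prefs.set("theme_colour", "green")      # fine: cosmetic
try:
    prefs.set("server_port", 1234)      # rejected: not cosmetic
except KeyError as err:
    print(err)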

On testing...

  • If your program fails to correctly find the server when running on Chinese-language Windows Vista connected to two VPNs at once, it's probably not really important.
  • If your software can't find the server for anyone who mistypes the address, that's more important.

Of course, users sometimes want to configure "policy". This is understandable, but try not to let them create stupid policies (or at least make sure such policies are strongly discouraged!). A policy of "delete all emails instantly and irretrievably" sounds more like a bug to me.
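
To make that concrete, here is a minimal Python sketch of a guard-rail on a hypothetical email retention setting, expressed in days (0 meaning "delete immediately"). The names are invented; the point is that an obviously destructive policy is refused unless the human explicitly insists.

# Sketch only: the retention setting and its names are hypothetical.
MIN_SANE_RETENTION_DAYS = 1

def set_retention_policy(days, force=False):
    if days < MIN_SANE_RETENTION_DAYS and not force:
        # "Delete everything instantly" is almost certainly a mistake, so make
        # the human confirm it explicitly rather than silently obeying.
        raise ValueError(
            f"a retention of {days} day(s) would destroy mail irretrievably; "
            "pass force=True if you really mean it"
        )
    print(f"retention policy set to {days} day(s)")

set_retention_policy(30)       # fine
try:
    set_retention_policy(0)    # looks more like a bug than a policy, so it is refused
except ValueError as err:
    print(err)

Even the force=True escape hatch is arguably too generous, but at least the stupid policy can no longer be created by accident.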