Monday 23 May 2011

Expecting the unexpected

When trying to build robust platforms, few developers consider all the possible modes of failure. Indeed, it is difficult to consider them all, let alone plan for them or design tests which exercise particular symptoms.

In this post, I discuss some of the types of failure we can see in real systems.

Complete server failure



Most developers DO consider this. In a "Complete server failure", what generally happens is:

* The server stops processing new requests, completely.
* The server's OS no longer responds to any network request at all (e.g. "ping")
* Processing does not continue within the server
* The contents of memory are immediately and irretrievably lost.

Typically the server recovers: it is rebooted and restored to full health, and all writes which were acknowledged before the failure have been persisted.

This is very easy to simulate (just hit the "power off" button in your VM hypervisor) and fairly easy to plan for; most robust systems consider this kind of scenario.

Network failure



There are many different kinds of network failure, but consider the simplest and most severe:

* One or more machines in the infrastructure lose network connectivity
* None of them can talk to anything at all, including each other
* Local processing on these servers continues as normal
* No machines need to be rebooted to fix the fault; once it is repaired, everything is back to normal.

This is the symptom of, for example, a switch suffering a complete failure.

I won't discuss network failures in any depth here, but there are many different kinds. My experience suggests that the most common is partial or complete loss of internet connectivity from one location (datacentre).

IO subsystem failures


* One or more discs / volumes suddenly become unavailable
* The OS does not reboot; processes do not stop

These are the kinds of failures which developers typically don't consider, and they are a lot more difficult to simulate. What might happen is that the power fails for a disc enclosure unit but not its host server; in that case the OS and its boot discs remain available, but the data discs are not. In these cases, failover might not be triggered, or might behave incorrectly.

Heavy load or unexpected poor performance



* A single server unexpectedly starts performing very badly
* In the extreme, this means it no longer has sufficient capacity to do useful work
* But it hasn't failed; no individual subsystem is totally unavailable
* Sometimes the effect is severe enough to prevent operations engineers from logging in to diagnose or fix the fault

These kinds of faults usually cause a larger problem, because failover systems aren't triggered, or cannot take over in a timely fashion. Common causes include:

* A rogue process consuming lots of resources (see the sketch after this list for one crude way to simulate this)
* Denial-of-service attack
* Bad application design causing legitimate requests to suddenly spike in resource usage
* Operational error (well, anything can be caused by operational error :) )
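
If you want to exercise this scenario deliberately on a disposable test box, something as crude as the sketch below will do it. Node.js and the file name load.js are my own choices for illustration, not anything from a particular tool: it forks one busy-looping child per CPU core, so the machine stays up and reachable but has almost no spare capacity left for real work.

    // load.js - crude load generator: one spinning child per core.
    // Run only on a machine you are happy to make useless for a while.
    var fork = require('child_process').fork;
    var os = require('os');

    if (process.argv[2] === 'spin') {
        while (true) {}                      // burn one core forever
    } else {
        os.cpus().forEach(function () {
            fork(__filename, ['spin']);      // one child per CPU core
        });
    }

Watching how monitoring, failover and remote logins behave while that runs is a cheap way to find out whether "slow but not down" is handled at all.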

"Zombie" systems or, back from the dead



* A system fails in a catastrophic way and can't be remotely recovered
* Operations engineers assume that it's going to be completely dead until physically replaced (they are some distance away and either don't raise a "remote hands" request or are unable to recover it by doing so)
* Another system is provisioned in its place, and takes over its IP address, role etc
* Then one day... the "Zombie" system unexpectedly comes back from the dead to haunt its successor ... Brraaainss....

Of course this could be months later, after many software updates (possibly security updates). The "zombie" system is running an old build and will not carry out correct processing if it is given work to do.

Conclusion



These are just a few of the annoying types of failures which happen to real systems in production. Expect the unexpected (as if that's not a contradiction!).

Happy hacking!

Sunday 20 February 2011

HTML 2d Canvas upscaling - really inefficient

I started writing some test programs with the HTML canvas element. This is great, as you can actually write games in Javascript - efficiently - in principle.

My previous attempts have all used the DOM API, which is not very convenient and not very efficient.

I had assumed the canvas 2d drawing context was basically a software renderer - it's not extremely efficient, but provided the canvas doesn't have too many pixels, you can still do a lot of work per frame on a modern machine.

Which is fine.

Suppose you have a canvas which is 640x320 pixels; you can then get it upscaled to whatever resolution the browser window is, making the game appear the same size for everyone. Great.
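
As a concrete sketch of that setup (the element id and the drawing are made up for illustration): the backing store stays at 640x320, and CSS stretches the element to fill its container, so the browser has to rescale the rendered frame every time it is painted.

    <canvas id="game" width="640" height="320"
            style="width: 100%; height: 100%;"></canvas>
    <script>
        var canvas = document.getElementById('game');
        var ctx = canvas.getContext('2d');
        setInterval(function () {
            // All drawing happens at 640x320; the browser then scales the
            // result up to the CSS size of the element.
            ctx.fillStyle = '#000';
            ctx.fillRect(0, 0, canvas.width, canvas.height);
        }, 16);
    </script>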

Except that the upscaling in web browsers kills performance. I tried Firefox 3.6 and Chrome 9; both of them use loads of CPU scaling the canvas onto the screen.

If we use a canvas element without any scaling (no CSS width etc) then all is fine.

Scale it up to a large window and boom! Now it's as slow as a pregnant snail. Bummer.
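
If the game still needs to fill the window, one workaround (my own sketch, not the example linked below) is to keep the backing store the same size as the element on screen, so the browser never has to rescale the canvas; the trade-off is that your own drawing code now pushes more pixels per frame.

    var canvas = document.getElementById('game');   // id carried over from the sketch above

    function resizeBackingStore() {
        // Match the drawing buffer to the displayed size. Note that setting
        // width/height also clears the canvas, so redraw after calling this.
        canvas.width = canvas.clientWidth;
        canvas.height = canvas.clientHeight;
    }

    window.onresize = resizeBackingStore;
    resizeBackingStore();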

See Example here