Monday, 23 May 2011

Expecting the unexpected

Few developers, when trying to build robust platforms, consider all the possible modes of failure. Indeed, it is difficult to consider them all, let alone plan for them or design tests which exercise particular symptoms.

In this post, I discuss some of the types of failure we can see in real systems.

Complete server failure



Most developers DO consider this. In a "Complete server failure", what generally happens is:

* The server stops processing new requests, completely.
* The server's OS no longer responds to any network request at all (e.g. "ping")
* Processing does not continue within the server
* The contents of memory are immediately and irretrievably lost.

Typically the server eventually recovers; when it does, it is rebooted and restored to full health, and all writes which were acknowledged before the failure have been persisted.

This is very easy to simulate (just hit the "power off" button in your VM hypervisor) and fairly easy to plan for; most robust systems consider this kind of scenario.
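If your test servers are virtual machines, you can even script this kind of failure. A minimal sketch using the libvirt Python bindings (the connection URI and domain name are placeholders for your own environment):

import libvirt

# destroy() is the equivalent of pulling the virtual power cord:
# it does not attempt a clean shutdown of the guest.
conn = libvirt.open("qemu:///system")   # hypothetical local hypervisor URI
dom = conn.lookupByName("test-db-01")   # hypothetical guest name
dom.destroy()
conn.close()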

Network failure



There are many different kinds of network failure, but consider the simplest, most severe network failure:

* One or more machines in the infrastructure lose network connectivity
* None of them can talk to anything at all, including each other
* Local processing on these servers continues as normal
* No machines need to be rebooted to fix the fault; once it is repaired, everything is back to normal.

This is the symptom of, for example, a switch suffering a "complete" failure.

I won't discuss network failures in any more detail here, except to say that there are many different kinds. My experience suggests that the most common is partial or complete loss of internet connectivity from one location (datacentre).

IO subsystem failures


* One or more discs / volumes suddenly become unavailable
* The OS does not reboot; processes do not stop

These are the kinds of failures which developers typically don't consider, and they are a lot more difficult to simulate. What might happen is that the power fails for a disc enclosure unit but not for its host server; the OS and its boot discs remain available, but the data discs do not. In these cases, failover might not be triggered, or might behave incorrectly.

Heavy load or unexpected poor performance



* A single server unexpectedly starts performing very badly
* In the extreme, this means it no longer has sufficient capacity to do any useful work
* But it hasn't failed; no subsystem is individually totally unavailable
* Sometimes the effect is severe enough to prevent operations engineers from logging in to diagnose or fix the fault

These kinds of faults usually cause a larger problem, because failover systems either aren't triggered or cannot take over in a timely fashion. Common causes include:

* Rogue process consuming lots of resources
* Denial-of-service attack
* Bad application design causing legitimate requests to suddenly spike in resource usage.
* Operational error (well, anything can be caused by operational error :) )

"Zombie" systems or, back from the dead



* A system fails in a catastrophic way and can't be remotely recovered
* Operations engineers assume that it's going to be completely dead until physically replaced (They are some distance away and don't raise a "remote hands" request, or are unable to recover it by doing so)
* Another system is provisioned in its place, and takes over its IP address, role etc
* Then one day... the "Zombie" system unexpectedly comes back from the dead to haunt its successor ... Brraaainss....

Of course this could be months later, after many software updates (possibly security updates). The "zombie" system is running an old build and will not carry out correct processing if it is given work to do.
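One possible mitigation (not something described above; the file path and build identifier below are purely hypothetical) is for every node to check at startup that its build matches what the cluster currently expects, and refuse to accept work otherwise:

import sys

RUNNING_BUILD = "r1234"  # hypothetical build identifier baked in at release time

def expected_build():
    # Hypothetical source of truth, e.g. pushed out by configuration management
    with open("/etc/cluster/expected_build") as f:
        return f.read().strip()

if expected_build() != RUNNING_BUILD:
    # A resurrected "zombie" bails out rather than doing stale processing
    sys.exit("build mismatch: refusing to accept work")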

Conclusion



These are just a few of the annoying types of failures which happen to real systems in production. Expect the unexpected (as if that's not a contradiction!).

Happy hacking!

Sunday, 20 February 2011

HTML 2d Canvas upscaling - really inefficient

I started writing some test programs with the HTML canvas element. This is great, as you can actually write games in Javascript - efficiently - in principle.

My previous attempts have all used the DOM API which is not very convenient, and not very efficient.

I had assumed the canvas 2d drawing context was basically a software-renderer - it's not extremely efficient, but provided the canvas doesn't have too many pixels, you can still do a lot of work per frame on a modern machine.

Which is fine.

Suppose you have a canvas which is 640x320 pixels; you can then get it upscaled to whatever resolution the browser window provides, making the game appear the same size for everyone. Great.

Except that the upscaling in web browsers kills performance. I tried Firefox 3.6 and Chrome 9; both of them use loads of CPU scaling the canvas on to the screen.

If we use a canvas element without any scaling (no CSS width etc) then all is fine.

Scale it up to a large window, Boom! now it's slow as a pregnant snail. Bummer.

See Example here

Wednesday, 26 May 2010

MySQL, what are you smoking?

There are a lot of weird things which MySQL does to handle its mix of transactional and non-transactional behaviour, but this one was new to me :)

create table t1 (ID INT NOT NULL PRIMARY KEY, V INT NOT NULL);

Query OK, 0 rows affected (0.01 sec)

mysql> insert into t1 (ID,V) VALUES (2,NULL);
ERROR 1048 (23000): Column 'V' cannot be null

mysql> insert into t1 (ID,V) VALUES (3,1),(4,1);
Query OK, 2 rows affected (0.00 sec)
Records: 2 Duplicates: 0 Warnings: 0

insert into t1 (ID,V) VALUES (5,1),(6,NULL);
Query OK, 2 rows affected, 1 warning (0.00 sec)
Records: 2 Duplicates: 0 Warnings: 1

mysql> show warnings;
+---------+------+---------------------------+
| Level | Code | Message |
+---------+------+---------------------------+
| Warning | 1048 | Column 'V' cannot be null |
+---------+------+---------------------------+
1 row in set (0.00 sec)

select * from t1;

+----+---+
| ID | V |
+----+---+
| 3 | 1 |
| 4 | 1 |
| 5 | 1 |
| 6 | 0 |
+----+---+
4 rows in set (0.00 sec)
What's going on?
MySQL does not consider (some) errors to be errors, if they happen on the second or subsequent row of a multi-row insert.

On the other hand, if it happens on the first row, it's an error.

Why?

Because non-transactional engines can't roll back to a savepoint. This means that if it has already inserted one or more rows, generating an error would leave some of the data behind in the database.

No, really why?

I don't know. It isn't consistent with, for example, a unique index violation, which does stop a multi-row insert half way through - even though non-transactional engines can't roll that back either.

So if you insert a duplicate, THAT still generates an error on the second or subsequent row. It's not even consistent!

mysql> insert into t1 (ID,V) VALUES (10,1),(10,2);
ERROR 1062 (23000): Duplicate entry '10' for key 'PRIMARY'
mysql> select * from t1;
+----+---+
| ID | V |
+----+---+
| 3 | 1 |
| 4 | 1 |
| 5 | 1 |
| 6 | 0 |
| 10 | 1 |
+----+---+
5 rows in set (0.00 sec)

Of course if we use a transactional engine, it looks better:


mysql> ALTER TABLE t1 ENGINE=InnoDB;
Query OK, 5 rows affected (0.00 sec)
Records: 5 Duplicates: 0 Warnings: 0

mysql> insert into t1 (ID,V) VALUES (20,1),(20,2);
ERROR 1062 (23000): Duplicate entry '20' for key 'PRIMARY'
mysql> select * from t1 WHERE ID=20;
Empty set (0.00 sec)



But that won't change the behaviour of a multi-row insert with NULLs in invalid places.

This kind of stuff is nonsense and we need it to GO AWAY NOW.

How?

SET SQL_MODE='STRICT_ALL_TABLES'

And now every error is really an error. Yay! Why can't this be the default? (I know the answer; this is a rhetorical question)
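As a quick illustration, a sketch in Python of setting it per-session (assuming the pymysql driver and the t1 table from above; connection parameters are placeholders):

import pymysql

conn = pymysql.connect(host="localhost", user="root", password="", database="test")
with conn.cursor() as cur:
    cur.execute("SET SESSION sql_mode = 'STRICT_ALL_TABLES'")
    try:
        cur.execute("INSERT INTO t1 (ID, V) VALUES (7, 1), (8, NULL)")
    except pymysql.err.MySQLError as exc:
        # In strict mode the NULL on the second row is a real error (1048),
        # not a warning plus a silently-inserted zero
        print("rejected:", exc)
conn.close()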

Sunday, 25 April 2010

Streaming data from MySQL

How do we stream data from MySQL? Traditionally, developers have thought that they can simply use their API's cursor-equivalent object (e.g. resultset) and move through the results.

Unfortunately, this does not actually stream in most cases. Most client libraries (which call mysql_store_result) read the entire result into memory, so you're just going through an already-in-memory data set. This will fail if it doesn't fit in memory.

Enter mysql_use_result - if your library can use this instead of mysql_store_result, you can then skip through the records without needing to keep them all in ram at once.

However, there is some bad news - while going through a result set in mysql_use_result, you can't do any OTHER queries on the connection. An attempt to do so generally violates MySQL's protocol and fails. So if you need to do other queries while you're processing the large data set, you should open another connection.
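As an illustration, here is a sketch in Python (assuming the pymysql driver; MySQLdb is similar) of streaming with an unbuffered cursor, which uses mysql_use_result underneath, plus a second connection for any queries needed while iterating. Table and connection details are placeholders:

import pymysql
import pymysql.cursors

params = dict(host="localhost", user="app", password="secret", database="big_db")

conn = pymysql.connect(**params)
stream = conn.cursor(pymysql.cursors.SSCursor)   # unbuffered, streaming cursor
stream.execute("SELECT id, payload FROM big_table")

side = pymysql.connect(**params)   # separate connection for any other queries

for row in stream:
    pass   # process each row without holding the whole result set in RAM

stream.close()
side.close()
conn.close()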

We've found that in Perl DBI, mysql_use_result performs significantly worse on small data sets (i.e. ones which fit in RAM, say 100M - 1G) than mysql_store_result (which is used by default). It seems that many more calls need to be made from Perl into the client library.

Enter HANDLER. This is basically an SQL-level object which behaves a bit like a server-side cursor would in some other databases. You can have multiple handlers open on the same connection, and can do other operations while a handler is open. The disadvantage: it cannot be used for arbitrary queries. Only table and index scans of a single table can be done - no joins or sorting. And the client needs to know a bit about the table to choose the best method - the optimiser doesn't get involved.
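For illustration, a sketch of driving HANDLER from Python over a single connection (the table name, index and batch size are hypothetical):

import pymysql

conn = pymysql.connect(host="localhost", user="app", password="secret",
                       database="big_db")
with conn.cursor() as cur:
    cur.execute("HANDLER big_table OPEN")
    cur.execute("HANDLER big_table READ `PRIMARY` FIRST LIMIT 1000")
    rows = cur.fetchall()
    while rows:
        # process the batch here; unlike mysql_use_result, other queries
        # on this same connection are allowed between batches
        cur.execute("HANDLER big_table READ `PRIMARY` NEXT LIMIT 1000")
        rows = cur.fetchall()
    cur.execute("HANDLER big_table CLOSE")
conn.close()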

However, in the cases where a lot of data are likely to be returned, we can often either avoid joins, or do them on the client side - if there are small "dimension tables" which can be held in memory on the client.

None of this is great, of course; it would be better if MySQL actually supported server-side cursors. Unfortunately it doesn't (except in stored procedures).

Sunday, 29 November 2009

What is Cassandra good for

The Cassandra database has been getting quite a lot of publicity recently. I think this is a good thing in general, but it seems that some people are considering using it for unsuitable purposes.

Cassandra is a cluster database which uses multiple nodes to provide
  • Read-scaling
  • Write-scaling
  • High availability
Unless you need at least TWO of those things, you should probably not bother.

Good reasons to use Cassandra:
High availability

Cassandra tolerates the failure of some nodes and will continue to read data and take writes despite some nodes being offline or unreachable - the exact behaviour depends on its settings and what consistency level of read/write is requested.

Write scaling

Cassandra allows you to scale writes by just adding more nodes: writes are split between nodes, so you can generally get better and better write performance simply by adding more of them. (NB: it doesn't necessarily balance the load evenly, so this won't hold in every case, but it is what Cassandra aspires to.)

Less good reasons to use Cassandra
Read scaling
Cassandra gives you read-scaling in the same way as write-scaling. This is a good thing, but can also be achieved relatively easily* with a conventional database by adding more and more read-only slaves / replicas, or using a cache (if you tend to get a lot of similar requests). Many big MySQL users do both.

Also, Cassandra does NOT create more than the configured number of replicas of any given piece of data, regardless of how much traffic that part of the data receives, so you could end up with a small number of servers being hammered and the rest idle.

Bad reasons to use Cassandra

Schema flexibility

aka "I cannot figure out how to use ALTER TABLE", or at least make a flexible conventional schema ...

Some people have cited schema flexibility as a good reason to use Cassandra (same argument applies for Voldemort, Couchdb etc).

However, in practice this is NOT a benefit, because it comes at the cost of EVERYTHING ELSE YOU HAVE IN A TRADITIONAL DATABASE.

Let's see what Cassandra does NOT do:
  • Secondary indexes - I'd be really surprised if your app doesn't need any of those!
  • Sorting, grouping or other advanced queries
  • Filtering (mostly)
  • Synchronous behaviour of updates
  • Bulk updates (UPDATE 10,000 rows in one operation)
  • Efficient table creation / drop
That's quite a big list (and very incomplete) so you'd better have a better reason for using it than "I cannot figure out how to use ALTER TABLE"

Because X or Y uses it

Just because Digg, Facebook et al use Cassandra, doesn't mean you have to. Your data are probably more important than theirs. Your workload is probably different from theirs. In particular, your write/read scale requirements are probably less than theirs.

I have a lot of respect for Facebook, Digg developers etc, but I also have a lot of envy:
  • They lose data, nobody cares
  • They lose data, nobody rings up and complains
  • They lose data, and NOBODY DEMANDS THEIR MONEY BACK
They could get a bit of bad press, their users might desert them in numbers, but they wouldn't lose money directly and immediately.

Most companies who have big data provide a service, which comes with an SLA. The SLA often says that if you lose the customer's data, they get their money back.


* May or may not be easy, depending on the calibre of your developers, ops staff, change control requirements, data structure etc.

Tuesday, 27 October 2009

When commit appears to fail

So you're using explicit transactions. Everything appears to work (every query gives the expected result) until you get to COMMIT.

Then you get an exception thrown from COMMIT. What happened?

Usually this would be because the server has been shut down, or you've lost the connection.

The problem is that you can't assume the commit failed, but you also can't assume it succeeded.

A robust application must make NO ASSUMPTION about whether a failed commit did, indeed, commit the transaction or not. It can safely assume that either all or none of it was committed, but can't easily tell which.

So the only way to really know is to have your application somehow remember that the transaction MIGHT have failed, and check later.

Possible solutions:
  • Ignore it and deal with any inconsistencies manually, or decide that you don't care :)
  • Write your entire transaction such that if it is repeated having been committed previously, it is a no-op or has no harmful side effects (e.g. change INSERT to INSERT IGNORE to avoid unique index violations). This is rather difficult for complex transactions, but works for some.
  • Check, in your code, whether an apparently failed commit did in fact commit (see the sketch below). If you discover that it didn't, then retry or report failure to the caller / user.
  • Perform another transaction to "undo" or "cancel" whatever the original transaction did if commit failed (Problem 1: this may itself fail to commit; problem 2: for a short time, an inconsistency exists)
None of these is ideal, and I'd like to think that this never happens. But if you do enough transactions, it's going to happen sooner or later (networks and servers do fail, unfortunately).
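As an illustration of the "remember and check later" approach, here is a minimal sketch in Python, assuming the pymysql driver and a hypothetical table with a client-generated token column that lets us verify afterwards whether the transaction actually committed:

import uuid
import pymysql

def insert_with_check(params, payload):
    token = str(uuid.uuid4())   # client-generated marker stored with the write
    conn = pymysql.connect(**params)
    try:
        with conn.cursor() as cur:
            cur.execute("INSERT INTO orders (token, payload) VALUES (%s, %s)",
                        (token, payload))
        conn.commit()
        return True
    except pymysql.err.MySQLError:
        # COMMIT failed - or did it? Check over a fresh connection.
        check = pymysql.connect(**params)
        with check.cursor() as cur:
            cur.execute("SELECT 1 FROM orders WHERE token = %s", (token,))
            committed = cur.fetchone() is not None
        check.close()
        return committed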

In testing, we've found that this happens so infrequently that even deliberate malice (kill -9 of mysql while loads of transactions are processing etc) fails to reproduce it reliably. In the end, a set of iptables rules which blocks the response from a commit was constructed.

Wednesday, 2 September 2009

On mysql's myisam_block_size setting

There is a little-known setting, myisam_block_size in MySQL. This affects the block size used in the indexes of MyISAM tables stored in .MYI files on disc and in the key buffer.

The default value is 1k, which is possibly too small for best performance on modern systems; in particular, many filesystems use a bigger block size, so writing a single index block requires a read followed by a write. Random reads are really slow on hard discs (writes are mostly fast, as they go into your battery-backed RAID controller, which has lots of RAM).

I am currently in the process of experimenting with myisam_block_size, and so far have determined the following:
  • myisam_block_size is settable only at server start time, either in my.cnf or on the command-line
  • myisam_block_size only affects newly created tables or tables rebuilt using ALTER TABLE; existing MyISAM tables keep their old index block size and work no differently.
There is a second setting, key_cache_block_size, which is different: it controls the size of blocks in the key buffer (key cache) and is independent of myisam_block_size.
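A sketch in Python (assuming the pymysql driver; connection parameters and the table name are placeholders) of checking both values and rebuilding an existing table so that it picks up the new index block size:

import pymysql

conn = pymysql.connect(host="localhost", user="root", password="", database="test")
with conn.cursor() as cur:
    cur.execute("SHOW VARIABLES LIKE 'myisam_block_size'")
    print(cur.fetchone())
    cur.execute("SHOW VARIABLES LIKE 'key_cache_block_size'")
    print(cur.fetchone())
    # Existing tables keep their old index block size until rebuilt,
    # for example with a no-op engine change:
    cur.execute("ALTER TABLE my_table ENGINE=MyISAM")
conn.close()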

Advantages of higher myisam_block_size
  • Fewer IO requests required to satisfy some requests (e.g. range scans) as fewer blocks are needed
  • Better index prefix compression, as the blocks are larger - particularly with PACK_KEYS enabled (index compression works on a block-by-block basis)
  • No read-then-write required by the OS to write to the underlying disc, if the block size is >= the filesystem block size
Disadvantages
  • More wasted space in partially filled blocks
  • Possibly less efficient use of key cache, as more of the "wrong" blocks are cached
  • Larger IOs are often required to fulfil the same request, as it needs the same number of (now bigger) blocks
So it could work either way. It may depend on your workload.

Testing continues.