Sunday, 29 November 2009

What is Cassandra good for

The Cassandra database has been getting quite a lot of publicity recently. I think this is a good thing in general, but it seems that some people are considering using it for unsuitable purposes.

Cassandra is a cluster database which uses multiple nodes to provide
  • Read-scaling
  • Write-scaling
  • High availability
Unless you need at least TWO of those things, you should probably not bother.

Good reasons to use Cassandra:
High availability

Cassandra tolerates the failure of some nodes and will continue to read data and take writes despite some nodes being offline or unreachable - the exact behaviour depends on its settings and what consistency level of read/write is requested.

Write scaling

Cassandra allows you to scale writes by just adding more nodes; writes are split between nodes, hence you can generally get better and better write performance by JUST adding more nodes (NB: it doesn't necessarily do load balancing, so you might not in all cases, but this is what it aspires to)

Less good reasons to use Cassandra
Read scaling
Cassandra gives you read-scaling in the same way as write-scaling. This is a good thing, but can also be achieved relatively easily* with a conventional database by adding more and more read-only slaves / replicas, or using a cache (if you tend to get a lot of similar requests). Many big MySQL users do both.

Also Cassandra does NOT create more than the configured number of replicas of any given piece of data, regardless of the amount of traffic on that part, so you could end up having a small number of servers hammered and the rest idle.

Bad reasons to use Cassandra

Schema flexibility

aka "I cannot figure out how to use ALTER TABLE", or at least make a flexible conventional schema ...

Some people have cited schema flexibility as a good reason to use Cassandra (same argument applies for Voldemort, Couchdb etc).

However, in practice this is NOT a benefit, because it comes at the cost of EVERYTHING ELSE YOU HAVE IN A TRADITIONAL DATABASE.

Let's see what Cassandra does NOT do:
  • Secondary indexes - I'd be really surprised if your app doesn't need any of those!
  • Sorting, grouping or other advanced queries
  • Filtering (mostly)
  • Synchronous behaviour of updates
  • Bulk updates (UPDATE 10,000 rows in one operation)
  • Efficient table creation / drop
That's quite a big list (and very incomplete) so you'd better have a better reason for using it than "I cannot figure out how to use ALTER TABLE"

Because X or Y uses it

Just because Digg, Facebook et al use Cassandra, doesn't mean you have to. Your data are probably more important than theirs. Your workload is probably different from theirs. In particular, your write/read scale requirements are probably less than theirs.

I have a lot of respect for Facebook, Digg developers etc, but I also have a lot of envy:
  • They lose data, nobody cares
  • They lose data, nobody rings up and complains
They could get a bit of bad press, their users might desert them in numbers, but they wouldn't lose money directly and immediately.

Most companies who have big data provide a service, which comes with an SLA. The SLA often says that if we lose their data, they get their money back.

* May or may not be easy, depending on the calibre of your developers, ops staff, change control requirements, data structure etc.

Tuesday, 27 October 2009

When commit appears to fail

So you're using explicit transactions. Everything appears to work (every query gives the expected result) until you get to COMMIT.

Then you get an exception thrown from COMMIT. What happened?

Usually this would be because the server has been shut down, or you've lost the connection.

The problem is, that you can't assume that the commit failed, but you also can't assume it succeeded.

A robust application must make NO ASSUMPTION about whether a failed commit did, indeed, commit the transaction or not. It can safely assume that either all or none of it was committed, but can't easily tell which.

So the only way to really know is to have your application somehow remember that the transaction MIGHT have failed, and check later.

Possible solutions:
  • Ignore it and deal with any inconsistencies manually, or decide that you don't care :)
  • Write your entire transaction such that if it is repeated having been committed previously, it is a no-op or has no harmful side effects (e.g. change INSERT to INSERT IGNORE to avoid unique index violations). This is rather difficult for complex transactions, but works for some.
  • Check, in your code, even if the commit apparently succeeded that it really did. If you discover that it didn't, then retry or report failure to the caller / user.
  • Perform another transaction to "undo" or "cancel" whatever the original transaction did if commit failed (Problem 1: this may itself fail to commit; problem 2: for a short time, an inconsistency exists)
None of these is ideal, and I'd like to think that this never happens. But if you do enough transactions, it's going to happen sooner or later (networks and servers do fail, unfortunately)

In testing, we've found that this happens so infrequently that even deliberate malice (kill -9 of mysql while loads of transactions are processing etc) fails to reproduce it reliably. In the end, a set of iptables rules which blocks the response from a commit was constructed.

Wednesday, 2 September 2009

On mysql's myisam_block_size setting

There is a little-known setting, myisam_block_size in MySQL. This affects the block size used in the indexes of MyISAM tables stored in .MYI files on disc and in the key buffer.

The default value is 1k, this is possibly too small for best performance on modern systems; in particular, many filesystems used a bigger block size, so writing a single index block requires a read followed by a write. Random reads are really slow on hard discs (writes are mostly fast as they go into your battery-backed raid controller which has lots of RAM).

I am currently in the process of experimenting with myisam_block_size, and so far have determined the following:
  • myisam_block_size is settable only at server start time, either in my.cnf or on the command-line
  • myisam_block_size only affects newly created tables or tables rebuilt using ALTER TABLE; existing MyISAM tables keep their old index block size and work no differently.
There is a second setting, key_buffer_block_size, which is different. This controls the size of blocks in the key buffer, which is independent.

Advantages of higher myisam_block_size
  • Fewer IO requests required to satisfy some requests (e.g. range scans) as fewer blocks are needed
  • Better index prefix compression, as the blocks are larger - particularly with PACK_KEYS enabled (index compression works on a block-by-block basis)
  • No read-then-write required by the OS to write to the underlying disc, if the block size is >= the filesystem block size
  • More wasted space in partially filled blocks
  • Possibly less efficient use of key cache, as more of the "wrong" blocks are cached
  • Larger IOs are required to fulfil the same request often - as it requires the same number of bigger blocks
So it could work either way. It may depend on your workload.

Testing continues.

Thursday, 6 August 2009

Cassandra database and range scans

I've been doing a little more playing with Cassandra, an open source distributed database. It has several features which make it very compelling for storing large data which has a lot of writes:
  • Write-scaling - adding more nodes increases write capacity
  • No single point of failure
  • configurable redundancy
And the most important:

  • Key range scans

Key range scans are really important because they allow applications to do what users normally want to do:
  • What emails did I receive this week
  • Give me all the transactions for customer X in time range Y
Answering these questions without range scans is extremely difficult; with efficient range scans they become fairly easy (provided you pick your keys right).

Other distributed-hash-table database (e.g. Voldemort) don't do this. This makes it difficult to do such queries.

Conventional RDBMSs do range scans all the time, in fact many queries which return more than one row will be implemented as a range scan.

Cassandra is extremely promising, but still a little bit rough around the edges; I've only done a small amount of research so far, but already found several bugs.

I can't complain about the service though; the main developer(s) have always looked into any problems I've reported immediately.

I hope it continues and becomes something really good.

Sunday, 29 March 2009

Distributed data storage systems

I'm looking for a distributed data storage system. Ideally such a system would have some (or ideally all) of the following characteristics:
  • Software only, using shared-nothing (must)
  • Stores arbitrarily large (actually 2G would be enough) items of binary data accessed by a key (a short-ish string would do) specified at store-time. Items would be stored across multiple storage nodes.
  • No single point of failure (preferable, a single point of failure which does not immediately impact the service would be acceptable)
  • Keeps N copies of each item in different nodes, specifyable either in config or at store-time
  • Automatic repairer to re-duplicate items following a storage node's demise (or administrative removal)
  • Automatic expiry of old data after a time specified at store-time
  • Managability: all nodes would share the same config; nodes can be administratively added and removed without any explicit config changes to other nodes.
  • Storage management: nodes should be able to be configured to use a given amount of maximum space; nodes should be able to be put into "readonly" mode where new data are not accepted
  • Automatic balancing of load for storage of new items
  • Monitoring: some cluster-aware monitoring tools which could report on the number of available nodes, total space available, which nodes were almost full, how much data is broken and being repaired, etc.
This is a bit of a tall order. There are some systems which almost do it, or do many of the above.

Tahoe seems to be the closest so far.

Of course things like Amazon S3 must do at least most of the above internally, but they aren't open source, indeed you can't even buy it except as a service.

Tuesday, 17 March 2009

REPAIR TABLE does not always upgrade it

MySQL 5.0 uses a new MyISAM table format than 4.1. So when upgrading, the tables need to be upgraded.

This can be done using REPAIR TABLE. Or can it?

  • On 5.0.67 (I've been testing with), REPAIR TABLE does indeed upgrade it
  • On 5.0.22 (Earlier version, used on a backup slave), REPAIR TABLE does NOT upgrade it
Oh dear. That breaks my plan for how to upgrade the slave. Not a disaster, but annoying nonetheless

NB: this is a non-production system, and this is the kind of thing that we discover by testing it.