Thursday, 6 August 2009

Cassandra database and range scans

I've been playing a bit more with Cassandra, an open source distributed database. Several of its features make it very compelling for storing large, write-heavy data sets:
  • Write-scaling - adding more nodes increases write capacity
  • No single point of failure
  • Configurable redundancy
And the most important:

  • Key range scans

Key range scans are really important because they allow applications to do what users normally want to do:
  • What emails did I receive this week?
  • Give me all the transactions for customer X in time range Y
Answering these questions without range scans is extremely difficult; with efficient range scans they become fairly easy (provided you pick your keys right).
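To make "pick your keys right" concrete, here is a minimal Python sketch of a range scan over an ordered key space. It is a toy in-memory model, not Cassandra's API: the `SortedKeyStore` class, the `make_key` helper, and the `customer:timestamp` key layout are all assumptions made up for illustration. The point is only that when keys sort by customer and then by time, "all transactions for customer X in time range Y" becomes one contiguous slice.

```python
import bisect


class SortedKeyStore:
    """Toy in-memory store that keeps its keys sorted (hypothetical, not Cassandra's API)."""

    def __init__(self):
        self._keys = []      # keys kept in sorted order
        self._values = {}    # key -> value

    def put(self, key, value):
        # Only insert into the sorted list the first time a key is seen.
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def range_scan(self, start, end):
        """Return (key, value) pairs with start <= key < end, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        return [(k, self._values[k]) for k in self._keys[lo:hi]]


def make_key(customer, timestamp):
    # Assumed key layout: customer id first, then an ISO-style timestamp,
    # so that string order matches (customer, time) order.
    return f"{customer}:{timestamp}"


store = SortedKeyStore()
store.put(make_key("custX", "2009-08-01T09:00"), "txn A")
store.put(make_key("custX", "2009-08-05T14:30"), "txn B")
store.put(make_key("custY", "2009-08-04T11:00"), "txn C")
store.put(make_key("custX", "2009-07-20T08:15"), "txn D")

# All transactions for custX in the first week of August 2009.
this_week = store.range_scan(
    make_key("custX", "2009-08-01"),
    make_key("custX", "2009-08-08"),
)
```

The scan touches only the matching slice of the key list. If the timestamp came first in the key instead, one customer's rows would be interleaved with everyone else's and the same query would no longer be a single contiguous range.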

Other distributed-hash-table databases (e.g. Voldemort) don't support range scans, which makes such queries difficult.
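A rough Python sketch of why hash placement gets in the way: if each key is assigned to a node by hashing it, neighbouring keys end up scattered, so a range query has to visit every node and examine every key. This assumes plain hash-modulo placement over a hypothetical four-node cluster; it is not Voldemort's actual partitioning scheme, and `node_for`, `put`, and `range_query` are invented names.

```python
import hashlib

NUM_NODES = 4  # hypothetical cluster size

# One dict per node stands in for that node's local storage.
nodes = [dict() for _ in range(NUM_NODES)]


def node_for(key):
    # Assumed placement rule: hash the key, then take it modulo the node count.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_NODES


def put(key, value):
    nodes[node_for(key)][key] = value


def range_query(start, end):
    """Find start <= key < end by brute force; returns (sorted hits, keys examined)."""
    hits = []
    keys_examined = 0
    for node in nodes:              # every node must be asked
        for key, value in node.items():
            keys_examined += 1      # every stored key must be checked
            if start <= key < end:
                hits.append((key, value))
    return sorted(hits), keys_examined


for day in range(1, 11):
    put(f"custX:2009-08-{day:02d}", f"txn {day}")

hits, keys_examined = range_query("custX:2009-08-03", "custX:2009-08-06")
```

Here three keys match, but all ten stored keys are examined to find them, and the work grows with the total data size rather than with the size of the result.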

Conventional RDBMSs do range scans all the time; in fact, many queries which return more than one row are implemented as a range scan.

Cassandra is extremely promising, but still a little rough around the edges; I've only done a small amount of research so far, but have already found several bugs.

I can't complain about the service though; the main developer(s) have always looked into any problems I've reported immediately.

I hope it continues and becomes something really good.