Mark's stream of verbiage: Cassandra database and range scans

Thursday, 6 August 2009

Cassandra database and range scans

I've been playing a little more with Cassandra, an open-source distributed database. It has several features which make it very compelling for storing large, write-heavy data sets:

Write-scaling - adding more nodes increases write capacity
No single point of failure
Configurable redundancy

And the most important:

Key range scans

Key range scans are really important because they allow applications to do what users normally want to do:

What emails did I receive this week?
Give me all the transactions for customer X in time range Y

Answering these questions without range scans is extremely difficult; with efficient range scans they become fairly easy (provided you pick your keys right).
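To make "pick your keys right" concrete, here is a minimal sketch in Python of a range scan over sorted keys. It is not Cassandra's API; `SortedStore`, `txn_key` and the sample data are all made up for illustration. The assumption is a key built as customer ID followed by an ISO-8601 timestamp, so one customer's transactions sort next to each other in time order.

```python
import bisect


class SortedStore:
    """Toy in-memory store that keeps its keys in sorted order (hypothetical)."""

    def __init__(self):
        self._keys = []    # sorted list of keys
        self._values = {}  # key -> value

    def put(self, key, value):
        # Only insert the key into the sorted list the first time we see it.
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def range_scan(self, start, end):
        """Return (key, value) pairs with start <= key < end, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        return [(k, self._values[k]) for k in self._keys[lo:hi]]


def txn_key(customer, timestamp):
    # Customer first, then timestamp: ISO-8601 strings sort chronologically.
    return f"{customer}:{timestamp}"


store = SortedStore()
store.put(txn_key("custX", "2009-08-01T10:00:00"), "coffee")
store.put(txn_key("custX", "2009-08-05T09:30:00"), "books")
store.put(txn_key("custX", "2009-09-02T12:00:00"), "rent")
store.put(txn_key("custY", "2009-08-03T08:00:00"), "fuel")

# "All the transactions for customer X in August 2009"
august_x = store.range_scan(txn_key("custX", "2009-08-01"),
                            txn_key("custX", "2009-09-01"))
for key, value in august_x:
    print(key, value)
```

Because the keys are sorted, the scan only touches the slice between the two bounds. Put the timestamp first instead and the same question would need a scan over every customer's data for that period.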

Other distributed-hash-table databases (e.g. Voldemort) don't support range scans, which makes such queries difficult.
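A rough sketch of why hashing gets in the way, again in Python. This is a toy model, not how Voldemort or any real system is implemented; `HashStore` and its methods are invented for the example. Keys are placed on a node by hash, so neighbouring keys end up scattered and a range query has to look at every key on every node.

```python
import zlib


class HashStore:
    """Toy store that spreads keys over nodes by hash (hypothetical)."""

    def __init__(self, num_nodes=4):
        self.nodes = [dict() for _ in range(num_nodes)]

    def _node_for(self, key):
        # The hash decides placement, so key order is lost.
        return zlib.crc32(key.encode("utf-8")) % len(self.nodes)

    def put(self, key, value):
        self.nodes[self._node_for(key)][key] = value

    def get(self, key):
        # Single-key lookups are cheap: exactly one node is consulted.
        return self.nodes[self._node_for(key)].get(key)

    def range_query(self, start, end):
        """Brute force: inspect every key on every node."""
        visited = 0
        hits = []
        for node in self.nodes:
            for key, value in node.items():
                visited += 1
                if start <= key < end:
                    hits.append((key, value))
        return sorted(hits), visited


hstore = HashStore()
for day in range(1, 11):
    hstore.put(f"custX:2009-08-{day:02d}", f"txn{day}")

hits, visited = hstore.range_query("custX:2009-08-03", "custX:2009-08-06")
print(len(hits), "matches after inspecting", visited, "keys")
```

The lookup by exact key stays fast, but the range query's cost grows with the total amount of data rather than with the size of the answer.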

Conventional RDBMSs do range scans all the time; in fact, many queries which return more than one row will be implemented as a range scan.

Cassandra is extremely promising, but still a little bit rough around the edges; I've only done a small amount of research so far, but have already found several bugs.

I can't complain about the service though; the main developer(s) have always looked into any problems I've reported immediately.

I hope it continues and becomes something really good.

1 comment:

Jonathan Ellis said (6 August 2009 at 20:31): I'm optimistic that 0.4 will be a really solid release. Until then trunk will be a little rough to play in. :)
