Thursday, 6 August 2009

Cassandra database and range scans

I've been doing a little more playing with Cassandra, an open source distributed database. It has several features which make it very compelling for storing large volumes of data with a lot of writes:
  • Write-scaling - adding more nodes increases write capacity
  • No single point of failure
  • Configurable redundancy
And the most important:

  • Key range scans

Key range scans are really important because they allow applications to do what users normally want to do:
  • What emails did I receive this week?
  • Give me all the transactions for customer X in time range Y
Answering these questions without range scans is extremely difficult; with efficient range scans they become fairly easy (provided you pick your keys right).
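To make the "pick your keys right" part concrete, here's a toy Python sketch - not Cassandra's API, just an ordered key space, which is roughly what an order-preserving partitioner gives you. The "customer|timestamp" key format and the data are made up; the point is that "all transactions for customer X in range Y" becomes everything between two boundary keys:

    import bisect

    # Toy stand-in for an ordered key space; the "customer|timestamp" key
    # format is just an example - pick whatever sorts the way you query.
    rows = {
        "custA|2009-08-01T09:00": "txn 1",
        "custA|2009-08-03T14:30": "txn 2",
        "custA|2009-08-06T08:15": "txn 3",
        "custB|2009-08-02T11:00": "txn 4",
    }
    sorted_keys = sorted(rows)

    def range_scan(start_key, end_key):
        """Return (key, value) pairs with start_key <= key < end_key."""
        lo = bisect.bisect_left(sorted_keys, start_key)
        hi = bisect.bisect_left(sorted_keys, end_key)
        return [(k, rows[k]) for k in sorted_keys[lo:hi]]

    # "All of customer A's transactions in the first week of August":
    for key, value in range_scan("custA|2009-08-01", "custA|2009-08-08"):
        print(key, value)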

Other distributed-hash-table databases (e.g. Voldemort) don't do this, which makes such queries difficult.

Conventional RDBMSs do range scans all the time, in fact many queries which return more than one row will be implemented as a range scan.

Cassandra is extremely promising, but still a little bit rough around the edges; I've only done a small amount of research so far, but already found several bugs.

I can't complain about the service though; the main developer(s) have always looked into any problems I've reported immediately.

I hope it continues and becomes something really good.

Sunday, 29 March 2009

Distributed data storage systems

I'm looking for a distributed data storage system. Ideally such a system would have some (or ideally all) of the following characteristics:
  • Software only, using shared-nothing (must)
  • Stores arbitrarily large (actually 2G would be enough) items of binary data accessed by a key (a short-ish string would do) specified at store-time. Items would be stored across multiple storage nodes.
  • No single point of failure (preferably; a single point of failure which does not immediately impact the service would be acceptable)
  • Keeps N copies of each item on different nodes, specifiable either in config or at store-time
  • Automatic repairer to re-duplicate items following a storage node's demise (or administrative removal)
  • Automatic expiry of old data after a time specified at store-time
  • Manageability: all nodes would share the same config; nodes can be administratively added and removed without any explicit config changes to other nodes.
  • Storage management: nodes should be able to be configured to use a given amount of maximum space; nodes should be able to be put into "readonly" mode where new data are not accepted
  • Automatic balancing of load for storage of new items
  • Monitoring: some cluster-aware monitoring tools which could report on the number of available nodes, total space available, which nodes were almost full, how much data is broken and being repaired, etc.
This is a bit of a tall order. There are some systems which almost do it, or do many of the above.

Tahoe seems to be the closest so far.

Of course, things like Amazon S3 must do at least most of the above internally, but they aren't open source; indeed, you can't even buy them except as a service.

Tuesday, 17 March 2009

REPAIR TABLE does not always upgrade it

MySQL 5.0 uses a newer MyISAM table format than 4.1, so when upgrading the server, the tables need to be upgraded too.

This can be done using REPAIR TABLE. Or can it?

  • On 5.0.67 (the version I've been testing with), REPAIR TABLE does indeed upgrade it
  • On 5.0.22 (an earlier version, used on a backup slave), REPAIR TABLE does NOT upgrade it
Oh dear. That breaks my plan for how to upgrade the slave. Not a disaster, but annoying nonetheless.
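One way to see whether REPAIR TABLE really upgraded a table on a given server is simply to run CHECK TABLE ... FOR UPGRADE again afterwards. A rough sketch using the MySQLdb module - the connection details and table name are made up:

    import MySQLdb  # the MySQL-python module; connection details below are made up

    conn = MySQLdb.connect(host="localhost", user="root", passwd="secret", db="test")
    cur = conn.cursor()

    def table_is_ok(table):
        # CHECK TABLE ... FOR UPGRADE returns rows of (Table, Op, Msg_type, Msg_text);
        # a final Msg_text of 'OK' means the table doesn't need upgrading.
        cur.execute("CHECK TABLE `%s` FOR UPGRADE" % table)
        rows = cur.fetchall()
        return bool(rows) and rows[-1][-1] == "OK"

    table = "my_old_table"  # hypothetical table name
    if not table_is_ok(table):
        cur.execute("REPAIR TABLE `%s`" % table)
        cur.fetchall()
        # On 5.0.67 this now reports OK; on 5.0.22 it apparently still may not.
        print("upgraded:", table_is_ok(table))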

NB: this is a non-production system, and this is the kind of thing that we discover by testing it.

Sunday, 21 December 2008

MySQL running out of disc space

Running out of disc space is not a good situation. However, if it does happen, it would be nice to have some control over what happens.

We use MyISAM. When you run out of disc space, MyISAM just sits there and waits. And waits, and waits, apparently forever, for some space to become available.

This is not good, because an auditing/logging application (which ours is) may have lots of available servers which it could send its data to - getting an error from one would simply mean that the data could be audited elsewhere.

But if the server just hangs, and waits, the application isn't (currently) smart enough to give up and try another server, so it hangs the audit process too. Which means that audit data starts to back up, and customers wonder why they can't see recent data in their reports etc.

There has to be a better way. I propose:
  • A background thread monitors the free disc space every few seconds
  • When it falls below a critical level (still more than can reasonably be filled up in a few seconds), force the server to become read-only
  • When in this mode, modifications to the data fail, quickly, with an error code which tells the process (or developer) exactly what the problem is (Out of disc space)
  • When free disc space rises back above some threshold, the read-only mode is turned off again.
That way, clients get what they expect - either quick service for inserts, or a fast error telling them what's wrong (Go away, I'm full, audit your data somewhere else).
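Nothing like this exists in the server today, but a crude approximation can be bolted on from the outside: a watchdog that polls free disc space and flips the read_only flag. A sketch, assuming the MySQLdb module; the path and thresholds are made up, and since read_only doesn't affect users with SUPER (and does nothing for writes already in progress), this is no substitute for proper server support:

    import os
    import time
    import MySQLdb

    DATADIR = "/var/lib/mysql"       # made-up path on the data filesystem
    CRITICAL_FREE = 5 * 1024 ** 3    # go read-only below 5GB free (made-up threshold)
    RECOVER_FREE = 10 * 1024 ** 3    # allow writes again above 10GB free

    def free_bytes(path):
        st = os.statvfs(path)
        return st.f_bavail * st.f_frsize

    conn = MySQLdb.connect(host="localhost", user="root", passwd="secret")
    cur = conn.cursor()
    read_only = False

    while True:
        free = free_bytes(DATADIR)
        if not read_only and free < CRITICAL_FREE:
            cur.execute("SET GLOBAL read_only = 1")   # writers now get a quick error
            read_only = True
        elif read_only and free > RECOVER_FREE:
            cur.execute("SET GLOBAL read_only = 0")
            read_only = False
        time.sleep(5)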

Drizzle etc. should do this.

Or perhaps it's a job for the storage engine?

Happy Christmas.

Tuesday, 16 September 2008

Migrating from MySQL 4.1 to 5.x

Suppose I have around 30 servers running MySQL 4.1, each with up to 1TB of MyISAM tables. We need to upgrade their OS to support new hardware, and we want to move to a 64-bit architecture to use that hardware efficiently.

Upgrading to 5.x at the same time seems like a good idea, as the validation work will be the same as would have to be done anyway - so we could effectively upgrade our MySQL version "for free", i.e. with no more QA work than is already required.

Upgrading at any other time would involve a repetition of the validation exercise, which is not trivial as we must test at least:
  • All features of the complex application which use these databases still work without errors
  • The behaviour of the new version is consistent with the old one (e.g. results returned)
  • Performance needs to be measured to ensure that performance regressions haven't happened (or are acceptable)
  • Soak / stress testing needs to be done to ensure that the new version can perform under sustained high load.
Of course some changes to our software are inevitable, but these are likely to be minor. Testing is the main work.

Migration and rollback

My initial plan was to leave the MyISAM table files exactly as they were - after all, 5.0.x can read them fine (apparently) - so that if we need to rollback, we simply move back to the old version and everything works as before (including data created while we were on the new version).

However, this looks tricky. Load testing confirms that 5.0.x has frequent crashes when dealing with "old" tables (i.e. those which have not been upgraded). The MySQL manual insists that mysql_upgrade is run to upgrade all existing MyISAM tables.

This is fine if your database is small and noncritical, but ours are large and important. So what to do?

It is possible to upgrade tables using CHECK TABLE FOR UPGRADE and REPAIR TABLE. However, NO ROLLBACK METHOD exists. This is a problem.

I've estimated (using performance data from a production-spec test system) that in production, migration can be performed in 2-3 hours. This is an acceptable level of planned downtime. The only way of rolling BACK the table versions is to mysqldump the database and reload it. I estimate that this could take 16 hours, which is not acceptable.

It is not acceptable to apply a change without a working (i.e. tested) rollback procedure. Neither is it particularly acceptable to have 16 hours of downtime to perform one.

This makes the upgrade tricky. We're working on it.

Sunday, 13 July 2008

Versioning your schema

How do you version your database schemas?

There are quite a lot of different options. On some very small / trivial applications developed by just one person, it might be acceptable to use ad-hoc queries (maybe with a GUI) to manage the schema - but this runs the risk of development and production schemas becoming out of sync (or rather, out of sync with their respective versions of code).

Requirements are typically:
  • Schema changes are held with the rest of the software in a SCM system
  • Schema changes can be made in one development environment and will be correctly propagated with code to other development and production environments as required
  • Schema changes can be rolled back (if for example, a deployment contains a serious bug)

There doesn't seem to be an easy answer for this one.

Our teams do something like:
  • Have a base create.sql which generates the original schema
  • In each revision where a schema change is required, write an apply.sql and remove.sql script to apply / roll back the change
  • These scripts need to be hand written as there are often special cases - sometimes several ALTER statements are required to make a single change, sometimes in combination with UPDATEs etc. Rollback scripts are even more tricky and sometimes need to restore from a backup table.
  • Have a system of scripts which works out which apply.sql / remove.sql scripts are required (when deploying a new version of the code) and runs them to update the schema - roughly as sketched below
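The runner itself is the easy part; something roughly like this minimal sketch (the schema_version table, directory layout and MySQLdb usage here are illustrative assumptions, not our actual scripts):

    import os
    import MySQLdb

    MIGRATIONS_DIR = "db/migrations"  # assumed layout: db/migrations/<version>/apply.sql

    conn = MySQLdb.connect(host="localhost", user="app", passwd="secret", db="app")
    cur = conn.cursor()

    # A single-row table recording which schema version this database is at.
    cur.execute("CREATE TABLE IF NOT EXISTS schema_version (version INT NOT NULL)")
    cur.execute("SELECT version FROM schema_version")
    row = cur.fetchone()
    current = row[0] if row else 0

    # Apply each pending revision's apply.sql, lowest version first.
    pending = sorted(int(d) for d in os.listdir(MIGRATIONS_DIR)
                     if d.isdigit() and int(d) > current)
    for version in pending:
        path = os.path.join(MIGRATIONS_DIR, str(version), "apply.sql")
        with open(path) as f:
            # Naive statement splitting - good enough for a sketch, not for
            # scripts with semicolons inside strings or comments.
            for statement in f.read().split(";"):
                if statement.strip():
                    cur.execute(statement)
        if row:
            cur.execute("UPDATE schema_version SET version = %s", (version,))
        else:
            cur.execute("INSERT INTO schema_version (version) VALUES (%s)", (version,))
            row = (version,)
        conn.commit()
        print("applied schema version", version)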
The problem with this system is that it relies on developers correctly writing apply.sql and remove.sql - it is fairly error prone. Although our validation process is fairly rigorous, it still lets errors slip through occasionally.

The problem is compounded by having multiple teams working on different code branches, not knowing necessarily which order their releases will go out in (one team would ultimately be responsible for merging the other's changes into their branch prior to release).

Ideas on a postcard please :)

Mark

Saturday, 10 May 2008

Linux IP load balancing without a load balancer

I've been investigating load balancing without a load balancer. I'm building my own implementation of a high availability IP load balancer / failover system.

This will essentially work like the netfilter CLUSTERIP target, except that it will also be self-configuring and self-monitoring / repairing - thus not requiring other tools (such as the complicated LinuxHA tools) to work. Some other efforts to do this have been:
  • Saru http://www.ultramonkey.org/papers/active_active/ - seems abandoned
  • Microsoft's "network load balancing" does something similar
An author known as "flavio" wrote an article about load balancer-less clusters; the original page seems to have disappeared, although it's still available on the Wayback Machine.

Here's how IP load balancing works without a dedicated load balancer host:
  • ARP requests for the cluster IP address are answered with a multicast ethernet address
  • All the hosts join the ethernet multicast group
  • Hosts selectively accept / ignore traffic based on whether they want to handle it or not, by some hashing algorithm (sketched below).
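The "hashing algorithm" step is the heart of it. A Python sketch of the idea (the bucket count and the choice of hashing the source IP and port are my assumptions, not a settled design): every node computes the same deterministic hash over the same packet, and only the node that owns the resulting bucket accepts it.

    import hashlib
    import struct

    NUM_BUCKETS = 256  # assumed granularity for dividing traffic between nodes

    def bucket_for(src_ip, src_port):
        # Every node runs the same deterministic hash over the same packet,
        # so they all reach the same conclusion about who owns it.
        digest = hashlib.md5(("%s:%d" % (src_ip, src_port)).encode()).digest()
        return struct.unpack("!I", digest[:4])[0] % NUM_BUCKETS

    def should_accept(src_ip, src_port, my_buckets):
        return bucket_for(src_ip, src_port) in my_buckets

    # e.g. a node owning the first half of the buckets in a two-node, equal-weight cluster:
    print(should_accept("192.0.2.7", 51200, set(range(NUM_BUCKETS // 2))))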
I've started work on the implementation on Google Code. Most parts of it can be done in user space (a kernel implementation might be necessary for performance later):
  • I use arptables to block the kernel's own ARP responses on the load balanced IP, otherwise it would give out its own unicast link address.
  • A small userspace daemon responds to ARP requests, giving out a multicast address.
  • The IP address is configured normally with "ip addr add ..."
  • Iptables is used to filter out the traffic we don't want and accept traffic we do want. It uses connection tracking to ensure that established connections are always kept, invalid ones ignored, and new connections passed to a userspace daemon using NFQUEUE
  • A userspace daemon reads the packets from NFQUEUE and uses a hashing algorithm to determine whether to accept them or not (see the sketch below). Each host in the cluster receives the same packets and does the same hash - so reaches the same conclusion about who should receive the packet - thus EXACTLY ONE host will accept each new connection.
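The NFQUEUE daemon might look roughly like this - a sketch using the python NetfilterQueue binding (my choice of library here, not necessarily what the project will use). It parses the source IP and port out of the raw IPv4 packet and accepts or drops based on the same kind of hash as above; the queue number, the bucket assignment and the iptables rule in the comment are assumptions.

    import hashlib
    import socket
    import struct
    from netfilterqueue import NetfilterQueue  # pip package "NetfilterQueue"

    NUM_BUCKETS = 256
    MY_BUCKETS = set(range(128))  # assumed: handed to this node by the cluster "leader"

    def bucket_for(src_ip, src_port):
        digest = hashlib.md5(("%s:%d" % (src_ip, src_port)).encode()).digest()
        return struct.unpack("!I", digest[:4])[0] % NUM_BUCKETS

    def handle(pkt):
        raw = pkt.get_payload()           # raw IPv4 packet, starting at the IP header
        src_ip = socket.inet_ntoa(raw[12:16])
        ihl = (raw[0] & 0x0F) * 4         # IP header length in bytes
        src_port = struct.unpack("!H", raw[ihl:ihl + 2])[0]  # assumes TCP/UDP
        # Same deterministic hash on every node, so exactly one node accepts
        # each new connection; the rest drop it.
        if bucket_for(src_ip, src_port) in MY_BUCKETS:
            pkt.accept()
        else:
            pkt.drop()

    # Fed by something like (queue number is arbitrary):
    #   iptables -A INPUT -d <cluster-ip> -m state --state NEW -j NFQUEUE --queue-num 0
    nfq = NetfilterQueue()
    nfq.bind(0, handle)
    try:
        nfq.run()
    finally:
        nfq.unbind()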
Load balancing can be done fairly (all nodes equal weight) or unfairly (different weights). Also, when administratively shutting down a node, we can set its weight to zero and existing connections will be allowed to finish (new ones will then be given to other nodes).

I've created a very sketchy design; it's all basically doable. The userspace daemons use UDP multicast packets to talk to each other and will organise a "leader", which will then tell the other nodes which hash values to accept/reject, ensuring that there is no overlap and no gaps.
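Carving up the hash space by weight - contiguous ranges, no overlap, no gaps - is easy for the leader to compute. Something along these lines (the names and bucket count are assumptions):

    NUM_BUCKETS = 256

    def assign_buckets(weights):
        """Split buckets 0..NUM_BUCKETS-1 into contiguous, non-overlapping ranges
        in proportion to each node's weight - no overlap, no gaps. A node whose
        weight has been set to zero (being drained) gets nothing. Assumes at
        least one node has a positive weight."""
        total = sum(weights.values())
        assignment = {}
        start = cumulative = 0
        for node in sorted(weights):      # deterministic order on every node
            cumulative += weights[node]
            end = (cumulative * NUM_BUCKETS) // total
            assignment[node] = set(range(start, end))
            start = end
        return assignment

    # Two equal nodes plus one being drained:
    print(assign_buckets({"node1": 1, "node2": 1, "node3": 0}))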

There are a lot of possibilities for race conditions during a reconfiguration due to a node weight change / failure / recovery. I haven't thought about these yet.

This principle works well for TCP-based services such as web and email, but may not be good for some UDP-based services because conntrack cannot ensure that the packets continue going to the same node for the lifetime of the connection (as it does for TCP).

---
Problems / disadvantages:
  • Apparently, an ARP reply indicating a link-layer multicast address is forbidden by RFC1812
  • The Linux kernel ignores TCP packets which have a link-layer multicast destination. I've worked around this with a really small kernel module (the same as what CLUSTERIP does)
  • Interoperability with other network OSs might not be good as this isn't a very official technique. Apparently some routers ignore these ARP packets.