Sunday, 25 April 2010

Streaming data from MySQL

How do we stream data from MySQL? Traditionally, developers have thought that they can simply use their API's cursor-equivalent object (e.g. resultset) and move through the results.

Unfortunately, this does not do streaming in most cases. Normally most client libraries (which call mysql_store_result) will read the entire result into memory, and you're just going through an already-in-memory data set. This will fail if it doesn't fit in memory.

Enter mysql_use_result - if your library can use this instead of mysql_store_result, you can then skip through the records without needing to keep them all in ram at once.

However, there is some bad news - while going through a result set in mysql_use_result, you can't do any OTHER queries on the connection. An attempt to do so generally violates MySQL's protocol and fails. So if you need to do other queries while you're processing the large data set, you should open another connection.

We've found that in Perl DBI, mysql_use_result performs significantly worse on small data sets (i.e. ones which fit in ram, say 100M - 1G) than mysql_store_result (which is used by default). It seems that a lot more requests need to be done into the libraries from Perl.

Enter HANDLER. This is basically an SQL-level object which behaves a bit like a server-side cursor would in some other databases. You can have multiple handlers open on the same connection, and can do other operations while a handler is open. The disadvantage: it cannot be used for arbitrary queries. Only table and index scans of a single table can be done - no joins or sorting. And the client needs to know a bit about the table to choose the best method - the optimiser doesn't get involved.

However, in the cases where a lot of data are likely to be returned, we can either not require joins, or do joins on the client-side - if there are small "dimension tables" that can be held in-memory on the client.

None of this is great of course, it would be better if mysql actually supported server-side cursors. Unfortunately it doesn't (except in stored procedures).