Sunday, 21 December 2008

MySQL running out of disc space

Running out of disc space is not a good situation. However, if it does happen, it would be nice to have some control over what happens.

We use MyISAM. When you run out of disc space, MyISAM just sits there and waits. And waits, and waits, apparently forever, for some space to become available.

This is not good, because an auditing/logging application (which ours is) may have lots of available servers which it could send its data to - getting an error from one would simply mean that the data could be audited elsewhere.

But if the server just hangs, and waits, the application isn't (currently) smart enough to give up and try another server, so it hangs the audit process too. Which means that audit data starts to back up, and customers wonder why they can't see recent data in their reports etc.

There has to be a better way. I propose:
  • A background thread monitors the disc space level every few seconds
  • When it falls below a critical level (still more than can reasonably be filled up in a few seconds), force the server to become read-only
  • When in this mode, modifications to the data fail, quickly, with an error code which tells the process (or developer) exactly what the problem is (out of disc space)
  • When free disc space recovers above a (higher) threshold, the read-only mode is turned back off.
That way, clients get what they expect - either quick service for inserts, or a fast error telling them what's wrong (Go away, I'm full, audit your data somewhere else)
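A minimal sketch of that monitor thread, in Python. The thresholds are made up, and set_read_only() stands in for whatever hook the server would actually expose to refuse writes with a clear error:

```python
import os
import time

CRITICAL_FREE_BYTES = 512 * 1024 * 1024   # force read-only below this
RECOVERY_FREE_BYTES = 1024 * 1024 * 1024  # leave read-only above this

def free_bytes(path):
    """Free disc space on the filesystem holding `path`."""
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize

def next_state(read_only, free):
    """Hysteresis: flip to read-only below the critical level, and only
    flip back once free space has recovered above a higher threshold."""
    if not read_only and free < CRITICAL_FREE_BYTES:
        return True
    if read_only and free > RECOVERY_FREE_BYTES:
        return False
    return read_only

def monitor(datadir, set_read_only, interval=5.0):
    """Background loop - poll every few seconds, toggle on change."""
    read_only = False
    while True:
        new = next_state(read_only, free_bytes(datadir))
        if new != read_only:
            read_only = new
            set_read_only(read_only)  # writes now fail fast, or resume
        time.sleep(interval)
```

The gap between the two thresholds stops the server flapping in and out of read-only mode as space hovers around the limit.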

Drizzle etc. should do this.

Or perhaps, it's a job for the storage engine?

Happy Christmas.

Tuesday, 16 September 2008

Migrating from MySQL 4.1 to 5.x

Suppose I have around 30 servers with MySQL 4.1 each with up to 1Tb of MyISAM tables. We need to upgrade their OS to support new hardware. We want to upgrade to 64-bit architecture to use the hardware efficiently.

Upgrading to 5.x seems like a good idea, as the validation work will be the same as would have to be done anyway - so we could effectively upgrade our MySQL version "for free", with no more QA work than the OS upgrade already requires.

Upgrading at any other time would involve a repetition of the validation exercise, which is not trivial as we must test at least:
  • All features of the complex application which use these databases still work without errors
  • The behaviour of the new version is consistent with the old one (e.g. results returned)
  • Performance needs to be measured to ensure that performance regressions haven't happened (or are acceptable)
  • Soak / stress testing needs to be done to ensure that the new version can perform under sustained high load.
Of course some changes to our software are inevitable, but these are likely to be minor. Testing is the main work.

Migration and rollback

My initial plan was to leave the MyISAM table files exactly as they were - after all, 5.0.x can read them fine (apparently) - so that if we need to rollback, we simply move back to the old version and everything works as before (including data created while we were on the new version).

However, this looks tricky. Load testing confirms that 5.0.x has frequent crashes when dealing with "old" tables (i.e. those which have not been upgraded). The MySQL manual insists that mysql_upgrade is run to upgrade all existing MyISAM tables.

This is fine if your database is small and noncritical, but ours are large and important. So what to do?

It is possible to upgrade tables using CHECK TABLE FOR UPGRADE and REPAIR TABLE. However, NO ROLLBACK METHOD exists. This is a problem.
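The per-table upgrade logic could be driven by something like this sketch. run_sql is a stand-in for however you execute statements against the server, and the table name is invented:

```python
def upgrade_table(table, run_sql):
    """Issue CHECK TABLE ... FOR UPGRADE, and only REPAIR the table if
    the check says it needs it.  run_sql(stmt) returns a status string;
    we return the list of statements actually issued."""
    issued = ["CHECK TABLE `%s` FOR UPGRADE" % table]
    status = run_sql(issued[0])
    if status != "OK":                    # e.g. "Needs upgrade"
        issued.append("REPAIR TABLE `%s`" % table)
        run_sql(issued[1])
    return issued
```

Skipping the REPAIR when CHECK reports OK matters on tables this size - REPAIR rebuilds the table, and that is where the hours go.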

I've estimated (using performance data from a production-spec test system) that in production, migration can be performed in 2-3 hours. This is an acceptable level of planned downtime. The only way of rolling BACK the table versions is to mysqldump the database and reload it. I estimate that this could take 16 hours, which is not acceptable.

It is not acceptable to apply a change without a working (i.e. tested) rollback procedure. Neither is it particularly acceptable to have 16 hours of downtime to perform one.

This makes the upgrade tricky. We're working on it.

Sunday, 13 July 2008

Versioning your schema

How do you version your database schemas?

There are quite a lot of different options. On some very small / trivial applications developed by just one person, it might be acceptable to use ad-hoc queries (maybe with a GUI) to manage the schema - but this runs the risk of development and production schemas becoming out of sync (or rather, out of sync with their respective versions of code).

Requirements are typically
  • Schema changes are held with the rest of the software in a SCM system
  • Schema changes can be made in one development environment and will be correctly propagated with code to other development and production environments as required
  • Schema changes can be rolled back (if for example, a deployment contains a serious bug)

There doesn't seem to be an easy answer for this one.

Our teams do something like:
  • Have a base create.sql which generates the original schema
  • In each revision where a schema change is required, write an apply.sql and remove.sql script to apply / roll back the change
  • These scripts need to be hand written as there are often special cases - sometimes several ALTER statements are required to make a single change, sometimes in combination with UPDATEs etc. Rollback scripts are even more tricky and sometimes need to restore from a backup table.
  • Have a system of scripts which works out which apply.sql / remove.sql is/are required (when deploying a new version of the code) and runs them, in order to update the schema
The problem with this system is that it relies on developers correctly writing apply.sql and remove.sql - it is fairly error prone. Although our validation process is fairly rigorous, it still lets errors slip through occasionally.
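The "work out which scripts to run" step might look like this sketch, assuming a hypothetical layout of one numbered directory per schema-changing revision:

```python
def scripts_to_run(current, target):
    """Return the ordered list of migration scripts needed to move a
    schema from revision `current` to revision `target`.  Upgrades run
    apply.sql scripts forwards; rollbacks run remove.sql backwards."""
    if target >= current:
        return ["%d/apply.sql" % r for r in range(current + 1, target + 1)]
    return ["%d/remove.sql" % r for r in range(current, target, -1)]
```

The hard part, as noted above, is not this bookkeeping - it's that the apply.sql / remove.sql contents are hand-written and error prone.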

The problem is compounded by having multiple teams working on different code branches, not knowing necessarily which order their releases will go out in (one team would ultimately be responsible for merging the other's changes into their branch prior to release).

Ideas on a postcard please :)

Mark

Saturday, 10 May 2008

Linux IP load balancing without a load balancer

I've been investigating load balancing without a load balancer. I'm building my own implementation of a high availability IP load balancer / failover system.

This will essentially work like the netfilter CLUSTERIP target, except that it will also be self-configuring and self-monitoring / repairing - thus not requiring other tools (such as the complicated LinuxHA tools) to work. Some other efforts to do this have been:
  • Saru http://www.ultramonkey.org/papers/active_active/ - seems abandoned
  • Microsoft's "network load balancing" does something similar
An author known as "flavio" wrote an article about load balancer-less clusters; it has since disappeared, but it can still be read on the Wayback Machine.

How IP load balancing works without a dedicated load balancer host is:
  • ARP requests for the cluster IP address are answered with a multicast ethernet address
  • All the hosts join the ethernet multicast group
  • Hosts selectively accept / ignore traffic based on whether they want to handle it or not, by some hashing algorithm.
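The accept/ignore decision could be as simple as this illustrative Python. The particular hash is arbitrary - the only requirement is that every node computes the same answer for the same packet:

```python
import hashlib

def owner(src_ip, num_nodes):
    """Deterministically map a source address to exactly one node index.
    Every node runs the same hash over the same packet, so every node
    reaches the same conclusion about who should take the connection."""
    digest = hashlib.md5(src_ip.encode()).digest()
    return digest[0] % num_nodes

def should_accept(my_index, src_ip, num_nodes):
    """True on exactly one node in the cluster for any given source."""
    return owner(src_ip, num_nodes) == my_index
```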
I've started work on the implementation on Google Code. Most parts of it can be done in user space (a kernel implementation might be necessary later for performance):
  • I use arptables to block the kernel's own ARP responses on the load balanced IP, otherwise it would give out its own unicast link address.
  • A small userspace daemon responds to ARP requests, giving out a multicast address.
  • The IP address is configured normally with "ip addr add ..."
  • iptables is used to filter out the traffic we don't want and accept the traffic we do. It uses connection tracking to ensure that established connections are always kept, invalid ones are ignored, and new connections are passed to a userspace daemon using NFQUEUE
  • A userspace daemon reads the packets from NFQUEUE and uses a hashing algorithm to determine whether to accept them or not. Each host in the cluster receives the same packets and does the same hash - so reaches the same conclusion about who should receive the packet - thus EXACTLY ONE host will accept each new connection.
Load balancing can be done fairly (all nodes equal weight) or unfairly (different weights). Also, when administratively shutting down a node, we can set its weight to zero and existing connections will be allowed to finish (new ones will then be given to other nodes).
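One way the leader might divide the hash space by weight (a sketch - the slot count and boundary arithmetic are my own choices, not part of the design). A draining node gets weight zero and therefore no new connections, with no gaps and no overlaps in the slot map:

```python
def allocate_slots(weights, num_slots=256):
    """Divide `num_slots` hash slots among nodes in proportion to their
    weights.  Returns a list mapping slot -> node index."""
    total = sum(weights)
    assert total > 0, "at least one node must have a non-zero weight"
    # Cumulative slot boundary for each node (integer arithmetic, so
    # the boundaries always sum exactly to num_slots).
    bounds, acc = [], 0
    for w in weights:
        acc += w
        bounds.append(acc * num_slots // total)
    slots, node = [], 0
    for s in range(num_slots):
        while s >= bounds[node]:   # skip zero-share (draining) nodes
            node += 1
        slots.append(node)
    return slots
```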

I've created a very sketchy design; it all looks do-able. The userspace daemons use UDP multicast packets to talk to the other nodes and organise a "leader", which then tells the other nodes which hash values to accept/reject, ensuring that there is no overlap and no gaps.

There are a lot of possibilities for race conditions during a reconfiguration due to a node weight change / failure / recovery. I haven't thought about these yet.

This principle works well for TCP-based services such as web and email, but may not be good for some UDP-based services because conntrack cannot ensure that the packets continue going to the same node for the lifetime of the connection (as it does for TCP).

---
Problems / disadvantages:
  • Apparently, an ARP reply indicating a link-layer multicast address is forbidden by RFC1812
  • The Linux kernel ignores TCP packets which have a link-layer multicast destination. I've worked around this with a really small kernel module (the same as what CLUSTERIP does)
  • Interoperability with other network OSs might not be good as this isn't a very official technique. Apparently some routers ignore these ARP packets.

Sunday, 20 April 2008

MySQL engines space usage comparison - continued


Since Oracle announced the InnoDB plugin which provides compressed row formats, I thought I'd do this comparison again.

Here are some more data that I've cobbled together.

I've created a 2M row table with a schema that we use for audit data. It has 47 columns and 17 indexes.

I copied the table into new empty tables using INSERT ... SELECT with each of these engines:
  • InnoDB row_format=compressed (from the InnoDB plugin)
  • InnoDB row_format=Compact
  • InnoDB row_format=Redundant
  • MyISAM pack_keys=1
  • MyISAM packed with myisampack (NOTE: these tables are readonly)
  • PBXT
  • Falcon
Here are the data:

  Engine                Space (MB)
  InnoDB compressed            636
  InnoDB compact              1946
  InnoDB redundant            2253
  MyISAM pack_keys=1           738
  MyISAM packed                476
  PBXT                        3379
  Falcon                       999

I am very impressed with Falcon, and with the space usage of InnoDB's compressed row format.
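To put those numbers in perspective, here are the ratios relative to the smallest result (packed MyISAM, which is read-only):

```python
# Space usage from the table above, in MB, normalised against the
# packed MyISAM baseline.
sizes_mb = {
    "InnoDB compressed": 636,
    "InnoDB compact": 1946,
    "InnoDB redundant": 2253,
    "MyISAM pack_keys=1": 738,
    "MyISAM packed": 476,
    "PBXT": 3379,
    "Falcon": 999,
}
baseline = sizes_mb["MyISAM packed"]
ratios = {engine: round(size / baseline, 2)
          for engine, size in sizes_mb.items()}
```

PBXT comes out at over 7x the packed baseline; InnoDB compressed at about 1.34x.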

I don't think it's very reasonable to compare the loading times for the different engines as they all work in different ways, but some were a LOT faster than others. Take this with a pinch of salt:

Fast: MyISAM, Falcon, InnoDB except compressed
Slow: InnoDB compressed, PBXT

InnoDB compressed and PBXT both took well over an hour to complete what the others did in a few minutes. I can't explain this - maybe it's a tuning issue.

In each case, the copy was done in an INSERT ... SELECT which uses a single transaction. I tried to tune the buffers as best as I could (the server has enough ram) but I'm no expert in PBXT and Falcon.

Sunday, 13 April 2008

Improve reliability and maintainability by using stateless connections

MySQL application developers have some problems with connection state. There is a lot of state associated with a single connection:
  • Transactions (autocommit, uncommitted work, when a transaction was started, isolation level)
  • Temporary tables
  • Session variables which affect behaviour
  • Other session variables
If you can avoid all of these, then you can benefit from transparent auto-reconnection and transparent connection reuse / pooling etc.

If you can't, then you will have to deal with reconnection and retry at a higher level. This is complicated and difficult to test.

I'll discuss ways of avoiding these things one at a time:

Transactions

I'm not advocating shunning transactions completely, just avoiding keeping any transaction state in the connection:
  • Keep autocommit on. Execute everything you want in a transaction in a single statement (in MySQL it is possible to delete from several tables in one statement, insert several rows etc).
  • Transactional engines will still use transactions if autocommit is on - a single statement becomes a single transaction - you will see a consistent view within that statement - no other thread will see parts of a batch insert, update or delete.
  • Non-transactional engines behave the same anyway.
  • Alternative: do everything that needs to be in a transaction inside a stored procedure - if these are retried, they will be retried as a whole.
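For example, a batch insert can be collapsed into a single statement (and therefore a single transaction under autocommit). A sketch - real code should use the client library's parameter binding rather than string formatting:

```python
def multi_row_insert(table, cols, rows):
    """Build one multi-row INSERT so the whole batch is one statement.
    Values here are assumed to be already-escaped literals; this only
    illustrates the shape of the statement."""
    values = ", ".join(
        "(" + ", ".join(str(v) for v in row) + ")" for row in rows
    )
    return "INSERT INTO %s (%s) VALUES %s" % (table, ", ".join(cols), values)
```

Because it is one statement, no other thread sees part of the batch, and a retry after reconnection re-runs the whole batch or none of it.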
Temporary tables

These are best avoided. They also cause problems for statement-level replication.
  • Use short-lived permanent tables with unique names instead (perhaps in a separate database which is automatically cleaned up when tables get old)
  • Wrap things which need temporary tables inside stored procedures
  • Use derived tables instead - in some cases this is sufficient.
Session variables which affect behaviour

Sometimes you can't get away from having to set these variables. Common examples are:
  • SET NAMES utf8
  • SET SQL_MODE = 'TRADITIONAL'
If an autoreconnect happens while these are in effect, you could accidentally get rubbish data inserted into the database if a connection reverts to latin1 unexpectedly or SQL_MODE reverts to its default "anything goes".

As these are unavoidable, they always have to be set - but there is a solution - the MYSQL_INIT_COMMAND option. As these commands are harmless to repeat, we can have them automatically re-run when the library reconnects to the database.

The only problem with this approach is that not all client library interfaces allow the setting of this option - but in some cases it can be set in my.cnf (for client libraries which read this)
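For example, something like this hypothetical my.cnf fragment - check that your client library actually reads option files, and which group it reads:

```ini
[client]
# Re-run automatically on every connect and reconnect by client
# libraries that honour it (the option-file form of MYSQL_INIT_COMMAND)
init-command="SET NAMES utf8"
```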

Other session variables

Quite simply - don't use them. I know of no good reason to use any other session variables.
  • Don't select @@LAST_INSERT_ID or last_insert_id() - use your API method to get it instead. An auto-reconnect happening between the insert and the select @@LAST_INSERT_ID will cause it to return zero, which is probably not what you wanted or expected.
  • Don't define your own variables - keep them on the client instead (or if you absolutely must store state server-side, put it in a table!)
Conclusions
  • It is possible to enable automatic reconnect for greater resilience at the MySQL API level
  • To do so safely you must avoid keeping session state in the connection
  • Variables which need to be reset on connect can be reset with MYSQL_INIT_COMMAND

Saturday, 22 March 2008

MySQL engines and space usage

A lot of people seem to spend a lot of effort comparing storage engines in MySQL - chiefly focusing on the difference between MyISAM and InnoDB.

People normally compare:

Feature sets: InnoDB has better durability, transactions, MVCC, foreign key constraints and row-level locking; MyISAM has fulltext and spatial indexes, and table-level locking
Run-time performance: (see your favourite benchmark)

But few compare actual storage space usage. As this is very important to our application, I decided to run some tests.

I'm testing here with a realistic-sized table for our application (we partition data into daily partitions and spread them across many servers anyway, so this is just a small piece). We currently use MyISAM, and this is a typical table with approximately 4 million rows.

I can't dump the schema or content of this table here for confidentiality reasons, but it has:
  • 47 columns, many of which are VARCHARs
  • 17 indexes
  • Typical row size of 300 bytes (as reported by "SHOW TABLE STATUS")
Method

First I generated 4 Million rows of dummy data using a test tool (this is extremely useful as it avoids having to take real, potentially sensitive and very large data from production). Then I measured the space usage using SHOW TABLE STATUS, and successively ALTERed the table to different engines.

Note that ALTERing a table causes its indexes to be rebuilt - they will be a good deal smaller than if the rows had been inserted normally. Therefore, just an ALTER which doesn't change anything could make a big space saving.

Results

  Engine                         Data size   Index size  Notes
  MyISAM, PACK_KEYS=1 (initial)  1152854640   682459136  Initial data load
  MyISAM, PACK_KEYS=0            1152854640  2396920832  ALTER
  MyISAM, PACK_KEYS=DEFAULT      1152854640   677806080  ALTER
  MyISAM, PACK_KEYS=1 (alter)    1152854640   433162240  ALTER
  InnoDB                         1973420032  2442166272  ALTER

Sizes are in bytes.

In case that's not clear from the numbers above, here is a summary of the results
  • In MyISAM, the data are always the same size regardless of options
  • The initial (PACK_KEYS=1) index file was 650M
  • PACK_KEYS=0 makes the index 251% bigger
  • PACK_KEYS=DEFAULT makes the index marginally smaller (but it's been ALTERed which improves things)
  • Altering the table with PACK_KEYS=1 makes the index around 36% smaller than it originally was
  • InnoDB leaves the data 71% bigger and the index 257% bigger
Moving from MyISAM to InnoDB in this case would see an overall increase in space usage of 140% - that's making the disc requirement nearly 2.5 x more.

Conclusions
  • Before making a switch from MyISAM to InnoDB, plan your space usage carefully.
  • Measure your actual data, not mine. Yours may be different. We have a lot of indexes compared to some people.
Future work

I hope to be able to repeat this test with Falcon, Maria and PBXT some time soon.

Wednesday, 2 January 2008

mysql address space

Address space can be a significant problem on 32-bit mysql installations. After several IRC discussions attempting to explain this to some people, I decided to write this post about it.

What address space is

Each process running on a multitasking virtual memory OS has its own private address space. This is a range of addresses of a fixed size. The exact size depends on the CPU architecture, but on most 32-bit processors it is 32 bits. Intel/AMD chips address memory in bytes individually, so the maximum capacity of the address space is 2^32 bytes.

Address space has nothing whatsoever to do with the physical amount of RAM installed on a machine; you may have more address space than RAM, but critically, you may also have less.

It is not possible to modify or upgrade the address space of a machine or software program, except by recompiling it for a different architecture.

How much space is available

Unfortunately, not all of the approximately 4GB of address space is usable by mysql for buffers; some is used by:
  • The operating system kernel reserving address space for its own use
  • Code from the mysql executable file and its libraries etc
  • Fixed data allocated by mysql and its libraries
  • Thread stacks
  • etc
The amount that these use up is extremely system dependent, but it is normally safe to assume that on a 32-bit Linux OS, roughly 1.5GB is available for mysql's buffers. On some systems there may be a bit more than this, but it's not likely to exceed 3GB on any platform.

What impact running out has on mysql

MySQL is a multithreaded server process - there is typically only one instance running on a machine, and several applications or databases can use the same server. This means that all its threads share the same, limited address space.

The database server can be tuned to use a lot of memory. There are two kinds of buffer - per thread and global buffers. The MyISAM key buffer is an example of a global buffer, the packet read buffer is an example of a per thread buffer. This is explained in more detail in the manual here: How MYSQL uses memory

All this memory comes from the same, limited address space. Running out of address space is very bad. When you run out of address space, the following happens:
  • A memory allocation attempt will fail. How gracefully this happens depends on where the allocation attempt was made. It could be: in the mysql core, in a storage engine, in an underlying library (e.g. the C or C++ runtime libraries) or in some other extension.
  • MySQL will probably crash. If a crash happens:
  • The server will restart, kicking all clients off.
  • InnoDB will start its recovery process. This may take a long time during which the server will be unusable.
  • Eventually all transactions will be rolled back and normal service will resume.
  • MyISAM tables which were in use at the time will be marked as crashed and needing a repair - this can take a very long time on large tables, during which time those tables are unusable.
This is very inconvenient for everyone concerned (the app developer, the sysadmin/DBA and especially the end user).

Mitigation: How to prevent running out

The most obvious and convenient solution is to run a 64-bit mysql server on a 64-bit OS. These have a lot more address space and are very unlikely to run out.

If, for some reason, you cannot upgrade to 64-bit, it is a very good idea to ensure that you do not run out of address space. This is mostly done by tuning the server parameters.
  • Do not set max_connections higher than necessary.
  • Keep per-thread buffers to a minimum
  • Tune other buffers reasonably
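A back-of-envelope check makes the danger obvious. All the numbers below are illustrative, not recommendations - plug in your own my.cnf values:

```python
MB = 1024 * 1024

def worst_case_usage(global_buffers, per_thread_buffers, max_connections):
    """Worst case: every connection fills all of its per-thread buffers
    at once, on top of the global buffers."""
    return global_buffers + per_thread_buffers * max_connections

key_buffer = 512 * MB             # global: MyISAM key buffer
per_thread = 6 * MB               # e.g. read + sort + join + read_rnd buffers
usage = worst_case_usage(key_buffer, per_thread, max_connections=200)
usable = 1536 * MB                # the ~1.5GB usable on 32-bit Linux
fits = usage <= usable            # False: this config can run out
```

A modest-looking 6MB of per-thread buffers times 200 connections already blows the 32-bit budget, even with a moderate key buffer.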
Monitoring the address space in use by mysql is a good idea. This is normally possible under Linux with "top" or "ps", as the "virt" or "VSZ" columns show the amount of address space used.

Final notes
  • Run a 64-bit OS if possible and for all new development. Run a 64-bit mysql if you run a 64-bit OS.
  • Running out of address space is not the same as running out of memory - it will probably happen a lot sooner (your server does have more than 2GB of RAM, right?)
  • You can still run out of address space even if you have less than 2GB of RAM, because virtual memory counts too
  • None of this is Linux-specific, it will happen on other OSs too (but the amount of space reserved by the kernel may be less)
  • Avoiding address space exhaustion is important in production systems