Mark's stream of verbiage

Saturday, 10 May 2008

Linux IP load balancing without a load balancer

I've been investigating load balancing without a load balancer. I'm building my own implementation of a high availability IP load balancer / failover system.

This will essentially work like the netfilter CLUSTERIP target, except that it will also be self-configuring and self-monitoring / repairing - thus not requiring other tools (such as the complicated LinuxHA tools) to work. Some other efforts to do this have been:

Saru http://www.ultramonkey.org/papers/active_active/ - seems abandoned
Microsoft's "network load balancing" does something similar

An author known as "flavio" wrote an article about load balancer-less clusters. It seems to have disappeared from its original location, although it's still on the Wayback Machine.

IP load balancing without a dedicated load balancer host works like this:

ARP requests for the cluster IP address are answered with a multicast Ethernet address
All the hosts join the ethernet multicast group
Hosts selectively accept / ignore traffic based on whether they want to handle it or not, by some hashing algorithm.

I've started work on the implementation on Google Code. Most parts of it can be done in user space (a kernel implementation might be necessary for performance later):

I use arptables to block the kernel's own ARP responses on the load balanced IP, otherwise it would give out its own unicast link address.
A small userspace daemon responds to ARP requests, giving out a multicast address.
The IP address is configured normally with "ip addr add ..."
Iptables is used to filter out the traffic we don't want and accept traffic we do want. It uses connection tracking to ensure that established connections are always kept, invalid ones ignored, and new connections passed to a userspace daemon using NFQUEUE
A userspace daemon reads the packets from NFQUEUE and uses a hashing algorithm to determine whether to accept them or not. Each host in the cluster receives the same packets and does the same hash - so reaches the same conclusion about who should receive the packet - thus EXACTLY ONE host will accept each new connection.
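
The accept / ignore decision in the last step can be sketched as follows. This is a minimal illustration in Python, not the project's actual code: hashing the source address and port with CRC32, the bucket count of 256, and the names bucket_for and should_accept are all assumptions made for the example.

```python
import zlib

NUM_BUCKETS = 256  # assumed bucket count, for illustration only


def bucket_for(src_ip: str, src_port: int) -> int:
    """Map a new connection to a hash bucket, identically on every node."""
    key = f"{src_ip}:{src_port}".encode("ascii")
    return zlib.crc32(key) % NUM_BUCKETS


def should_accept(src_ip: str, src_port: int, my_buckets: set) -> bool:
    """Accept the connection only if its bucket is one this node owns."""
    return bucket_for(src_ip, src_port) in my_buckets


# Two nodes splitting the bucket space with no overlap and no gaps.
node_a = set(range(0, 128))
node_b = set(range(128, 256))
```

Because both nodes compute the same bucket for the same packet, and each bucket belongs to exactly one node, exactly one of them accepts any given new connection.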

Load balancing can be done fairly (all nodes equal weight) or unfairly (different weights). Also, when administratively shutting down a node, we can set its weight to zero and existing connections will be allowed to finish (new ones will then be given to other nodes).

I've created a very sketchy design; it all looks basically do-able. The userspace daemon uses UDP multicast packets to talk to the other nodes and will elect a "leader", which then tells the other nodes which hash values to accept/reject, ensuring that there is no overlap and no gaps.
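
One way the leader could divide the hash values is sketched below. This is only an illustration under assumptions of my own: the function name assign_buckets, the 256-bucket space and the integer weights are hypothetical, and the real design may end up quite different.

```python
def assign_buckets(weights: dict, num_buckets: int = 256) -> dict:
    """Split range(num_buckets) between nodes in proportion to their weights.

    Weights are assumed to be non-negative integers. Every bucket goes to
    exactly one node; zero-weight nodes get an empty range.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one node needs a positive weight")
    assignment = {}
    start = 0
    cumulative = 0
    for node in sorted(weights):  # fixed order so every run gives the same plan
        cumulative += weights[node]
        end = (cumulative * num_buckets) // total
        assignment[node] = range(start, end)
        start = end
    return assignment


# node3 is being shut down administratively, so its weight is zero.
plan = assign_buckets({"node1": 2, "node2": 1, "node3": 0})
```

Each node's range starts where the previous one ended and the last range ends at num_buckets, which is what gives "no overlap and no gaps".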

There are a lot of possibilities for race conditions during a reconfiguration due to a node weight change / failure / recovery. I haven't thought about these yet.

This principle works well for TCP-based services such as web and email, but may not be good for some UDP-based services because conntrack cannot ensure that the packets continue going to the same node for the lifetime of the connection (as it does for TCP).

---
Problems / disadvantages:

Apparently, an ARP reply indicating a link-layer multicast address is forbidden by RFC1812
The Linux kernel ignores TCP packets which have a link-layer multicast destination. I've worked around this with a really small kernel module (the same as what CLUSTERIP does)
Interoperability with other network OSs might not be good as this isn't a very official technique. Apparently some routers ignore these ARP packets.

Sunday, 20 April 2008

MySQL engines space usage comparison - continued

Since Oracle announced the InnoDB plugin which provides compressed row formats, I thought I'd do this comparison again.

Here are some more data that I've cobbled together.

I've created a 2M row table with a schema that we use for audit data. It has 47 columns and 17 indexes.

I copied the table into new empty tables using INSERT ... SELECT with each of these engines:

InnoDB row_format=compressed (from the InnoDB plugin)
InnoDB row_format=Compact
InnoDB row_format=Redundant
MyISAM pack_keys=1
MyISAM packed with myisampack (NOTE: these tables are readonly)
PBXT
Falcon

Here are the data:

Engine	Space (MB)
InnoDB compressed	636
InnoDB compact	1946
InnoDB redundant	2253
MyISAM pack_keys=1	738
MyISAM packed	476
PBXT	3379
Falcon	999

I am very impressed with how good Falcon is. I am very impressed with InnoDB's compressed row format's space usage.

I don't think it's very reasonable to compare the loading times for the different engines as they all work in different ways, but some were a LOT faster than others. Take this with a pinch of salt:

Fast: MyISAM, Falcon, InnoDB except compressed
Slow: InnoDB compressed, PBXT

InnoDB compressed and PBXT both took well over an hour to complete what the others did in a few minutes. I can't explain this - maybe it's a tuning issue.

In each case, the copy was done in an INSERT ... SELECT which uses a single transaction. I tried to tune the buffers as best as I could (the server has enough ram) but I'm no expert in PBXT and Falcon.

Sunday, 13 April 2008

Improve reliability and maintainability by using stateless connections

MySQL application developers have some problems with connection state. There is a lot of state which can be associated with a single connection:

Transactions (autocommit, uncommitted work, when a transaction was started, isolation level)
Temporary tables
Session variables which affect behaviour
Other session variables

If you can avoid all of these, then you can benefit from transparent auto-reconnection and transparent connection reuse / pooling etc.

If you can't, then you will have to deal with reconnection and retry at a higher level. This is complicated and difficult to test.

I'll discuss ways of avoiding these things one at a time:

Transactions

I'm not advocating shunning transactions completely, just avoiding keeping any transaction state in the connection:

Keep autocommit on. Execute everything you want in a transaction in a single statement (in MySQL it is possible to delete from several tables in one statement, insert several rows etc).
Transactional engines will still use transactions if autocommit is on - a single statement becomes a single transaction - you will see a consistent view within that statement - no other thread will see parts of a batch insert, update or delete.
Non-transactional engines behave the same anyway.
Alternative: do everything that needs to be in a transaction inside a stored procedure - if these are retried, they will be retried as a whole.
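
As a sketch of the "single statement" approach, the helper below builds one multi-row INSERT instead of several separate ones. The function name build_multi_row_insert and the audit_log table are hypothetical, and the %s placeholder style is an assumption about the client library in use.

```python
def build_multi_row_insert(table: str, columns: list, rows: list):
    """Return (sql, params) for one INSERT statement covering all rows.

    Table and column names are interpolated directly, so they must come
    from trusted code, never from user input.
    """
    if not rows:
        raise ValueError("need at least one row")
    row_placeholder = "(" + ", ".join(["%s"] * len(columns)) + ")"
    sql = "INSERT INTO {} ({}) VALUES {}".format(
        table,
        ", ".join(columns),
        ", ".join([row_placeholder] * len(rows)),
    )
    # Flatten the rows into one parameter list matching the placeholders.
    params = [value for row in rows for value in row]
    return sql, params


sql, params = build_multi_row_insert(
    "audit_log", ["user_id", "action"], [(1, "login"), (2, "logout")]
)
```

With autocommit on, the resulting statement is its own transaction, so a retry after a reconnect repeats all of it or none of it.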

Temporary tables

These are best avoided. They also cause problems for statement-level replication.

Use short-lived permanent tables with unique names instead (perhaps in a separate database which is automatically cleaned up when tables get old)
Wrap things which need temporary tables inside stored procedures
Use derived tables instead - in some cases this is sufficient.

Session variables which affect behaviour

Sometimes you can't get away from having to set these variables. Common examples are:

SET NAMES utf8
SET SQL_MODE = 'TRADITIONAL'

If an autoreconnect happens while these are in effect, you could accidentally get rubbish data inserted into the database if a connection reverts to latin1 unexpectedly or SQL_MODE reverts to its default "anything goes".

As these are unavoidable, they always have to be set - but there is a solution: the MYSQL_INIT_COMMAND option. These commands only set up the session and don't change any data themselves, so it is safe to have them automatically re-run whenever the library reconnects to the database.

The only problem with this approach is that not all client library interfaces allow the setting of this option - but in some cases it can be set in my.cnf (for client libraries which read this)
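
To illustrate why replaying the init commands matters, here is a small simulation in Python. It does not talk to MySQL at all: FakeConnection and ReconnectingClient are hypothetical stand-ins invented for this example, and the fake connection only understands the two statements shown above.

```python
class FakeConnection:
    """Stand-in for a server connection; a new one has default session state."""

    def __init__(self):
        self.session = {"names": "latin1", "sql_mode": ""}

    def execute(self, statement: str) -> None:
        # Only the two example statements from the text are understood here.
        if statement == "SET NAMES utf8":
            self.session["names"] = "utf8"
        elif statement == "SET SQL_MODE = 'TRADITIONAL'":
            self.session["sql_mode"] = "TRADITIONAL"


class ReconnectingClient:
    """Replays its init commands every time it (re)connects."""

    def __init__(self, init_commands):
        self.init_commands = list(init_commands)
        self.reconnect()

    def reconnect(self) -> None:
        self.conn = FakeConnection()  # all previous session state is lost
        for statement in self.init_commands:
            self.conn.execute(statement)


client = ReconnectingClient(["SET NAMES utf8", "SET SQL_MODE = 'TRADITIONAL'"])
client.reconnect()  # simulate a dropped connection being re-established
```

After the simulated reconnect the session is still utf8 / TRADITIONAL; a client created with no init commands would silently be back on the defaults.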

Other session variables

Quite simply - don't use them. There is no reason I know of for having another session variable.

Don't select @@LAST_INSERT_ID or last_insert_id() - use your API method to get it instead. An auto-reconnect happening between the insert and the select @@LAST_INSERT_ID will cause it to return zero, which is probably not what you wanted or expected.
Don't define your own variables - keep them on the client instead (or if you absolutely must store state server-side, put it in a table!)

Conclusions

It is possible to enable automatic reconnect for greater resilience at the MySQL API level
To do so safely you must avoid keeping session state in the connection
Variables which need to be reset on connect can be reset with MYSQL_INIT_COMMAND

Saturday, 22 March 2008

MySQL engines and space usage

A lot of people seem to spend a lot of effort comparing storage engines in MySQL - chiefly focusing on the difference between MyISAM and InnoDB.

People normally compare:

Feature sets: InnoDB: better durability, transactions, MVCC, foreign key constraints, row-level locking. MyISAM: fulltext, spatial indexes, table-level locking
Run-time performance: (see your favourite benchmark)

But few compare actual storage space usage. As this is very important to our application, I decided to run some tests.

I'm testing here with a realistic-sized table for our application (we partition data into daily partitions and spread them across many servers anyway, so this is just a small piece). We currently use MyISAM, and this is a typical table with approximately 4 million rows.

I can't dump the schema or content of this table here for confidentiality reasons, but it has:

47 columns, many of which are VARCHARs
17 indexes
Typical row size of 300 bytes (as reported by "SHOW TABLE STATUS")

Method

First I generated 4 Million rows of dummy data using a test tool (this is extremely useful as it avoids having to take real, potentially sensitive and very large data from production). Then I measured the space usage using SHOW TABLE STATUS, and successively ALTERed the table to different engines.

Note that ALTERing a table causes its indexes to be rebuilt - they will be a good deal smaller than if the rows had been inserted normally. Therefore, just an ALTER which doesn't change anything could make a big space saving.

Results

Engine	Data size	Index size	Notes
MyISAM, PACK_KEYS=1 (initial)	1152854640	682459136	Initial data load
MyISAM, PACK_KEYS=0	1152854640	2396920832	ALTER
MyISAM, PACK_KEYS=DEFAULT	1152854640	677806080	ALTER
MyISAM, PACK_KEYS=1 (alter)	1152854640	433162240	ALTER
InnoDB	1973420032	2442166272	ALTER

Sizes are in bytes.

In case that's not clear from the numbers above, here is a summary of the results:

In MyISAM, the data are always the same size regardless of options
The initial (PACK_KEYS=1) index file was 650M
PACK_KEYS=0 makes the index 251% bigger
PACK_KEYS=DEFAULT makes the index marginally smaller (but it's been ALTERed which improves things)
Altering the table with PACK_KEYS=1 makes the index around 36% smaller than it originally was
InnoDB leaves the data 71% bigger and the index about 258% bigger

Moving from MyISAM to InnoDB in this case would see an overall increase in space usage of 140% - that's making the disc requirement about 2.4 times what it was.
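
For anyone wanting to check the arithmetic, the percentages can be recomputed from the byte counts in the results table. This is just a sketch (the helper name percent_bigger is mine); the results land within a percentage point of the figures quoted above.

```python
# Byte counts copied from the results table.
myisam_data, myisam_index = 1152854640, 682459136    # PACK_KEYS=1 (initial)
innodb_data, innodb_index = 1973420032, 2442166272


def percent_bigger(new: int, old: int) -> float:
    """How much bigger `new` is than `old`, as a percentage of `old`."""
    return (new - old) / old * 100


data_growth = percent_bigger(innodb_data, myisam_data)
index_growth = percent_bigger(innodb_index, myisam_index)
total_growth = percent_bigger(innodb_data + innodb_index,
                              myisam_data + myisam_index)

print(f"data +{data_growth:.1f}%, index +{index_growth:.1f}%, "
      f"total +{total_growth:.1f}%")
```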

Conclusions

Before making a switch from MyISAM to InnoDB, plan your space usage carefully.
Measure your actual data, not mine. Yours may be different. We have a lot of indexes compared to some people.

Future work

I hope to be able to repeat this test with Falcon, Maria and PBXT some time soon.

Wednesday, 2 January 2008

mysql address space

Address space can be a significant problem on 32-bit mysql installations. After several IRC discussions attempting to explain this to some people, I decided to write this post about it.

What address space is

Each process running on a multitasking virtual memory OS has its own private address space. This is a range of addresses of a fixed size. The exact size depends on the CPU architecture, but on most 32-bit processors it is 32-bits. Intel/AMD chips address memory in bytes individually, so the maximum capacity of the address space is 2^32 bytes.

Address space has nothing whatsoever to do with the physical amount of RAM installed on a machine; you may have more address space than RAM, but critically, you may also have less.

It is not possible to modify or upgrade the address space of a machine or software program, except by recompiling it for a different architecture.

How much space is available

Unfortunately, not all of the approximately 4GB of address space is usable by mysql for buffers; some is used by:

The operating system kernel reserving address space for its own use
Code from the mysql executable file and its libraries etc
Fixed data allocated by mysql and its libraries
Thread stacks
etc

The amount that these use up is extremely system dependent, but it is normally safe to assume that on a 32-bit Linux OS, roughly 1.5GB is available for mysql's buffers. On some systems there may be a bit more than this, but it's not likely to exceed 3GB on any platform.

What impact running out has on mysql

MySQL is a multithreaded server: each instance of mysql running on a machine is a single process (there is typically only one instance; several applications or databases can use the same server). This means that all its threads share the same, limited address space.

The database server can be tuned to use a lot of memory. There are two kinds of buffer - per thread and global buffers. The MyISAM key buffer is an example of a global buffer; the packet read buffer is an example of a per thread buffer. This is explained in more detail in the manual here: How MySQL Uses Memory

All this memory comes from the same, limited address space. Running out of address space is very bad. When you run out of address space, the following happens:

A memory allocation attempt will fail. How gracefully this happens depends on where the allocation attempt was made. It could be: in the mysql core, in a storage engine, in an underlying library (e.g. the C or C++ runtime libraries) or in some other extension.
MySQL will probably crash. If a crash happens:
The server will restart, kicking all clients off.
InnoDB will start its recovery process. This may take a long time during which the server will be unusable.
Eventually all transactions will be rolled back and normal service will resume.
MyISAM tables which were in use at the time will be marked as crashed and needing a repair - this can take a very long time on large tables, during which time those tables are unusable.

This is very inconvenient for everyone concerned (the app developer, the sysadmin/DBA and especially the end user).

Mitigation: How to prevent running out

The most obvious and convenient solution is to run a 64-bit mysql server on a 64-bit OS. These have a lot more address space and are very unlikely to run out.

If, for some reason, you cannot upgrade to 64-bit, it is a very good idea to ensure that you do not run out of address space. This is mostly done by tuning the server parameters.

Do not set max_connections higher than necessary.
Keep per-thread buffers to a minimum
Tune other buffers reasonably
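
A rough way to sanity-check a set of tuning values is to add up the worst case. The sketch below is deliberately crude and built on assumptions: it treats the roughly 1.5GB figure above as the budget, lumps all global buffers and all per-thread buffers into two totals, and assumes every connection allocates all of its per-thread buffers at once. The function names and example numbers are hypothetical.

```python
ADDRESS_SPACE_BUDGET = int(1.5 * 1024**3)  # the rough 1.5GB figure, in bytes


def worst_case_bytes(global_buffers: int, per_thread_buffers: int,
                     max_connections: int) -> int:
    """Upper bound if every connection allocates all its per-thread buffers."""
    return global_buffers + per_thread_buffers * max_connections


def fits_budget(global_buffers: int, per_thread_buffers: int,
                max_connections: int,
                budget: int = ADDRESS_SPACE_BUDGET) -> bool:
    """True if the worst case stays within the address space budget."""
    return worst_case_bytes(global_buffers, per_thread_buffers,
                            max_connections) <= budget


MB = 1024**2
# Hypothetical settings: 512MB of global buffers, 4MB per thread.
ok_100 = fits_budget(512 * MB, 4 * MB, max_connections=100)
ok_500 = fits_budget(512 * MB, 4 * MB, max_connections=500)
```

With these made-up numbers, 100 connections fit comfortably while 500 do not, which is why max_connections matters so much on 32-bit.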

Monitoring the address space in use by mysql is a good idea. This is normally possible under Linux with "top" or "ps", as the "virt" or "VSZ" columns show the amount of address space used.
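
If you would rather script the check than watch "top", the same figure is available on Linux as the VmSize line of /proc/<pid>/status. The following is a minimal sketch; the function names are mine and the sample text is made up.

```python
def vmsize_kb(status_text: str) -> int:
    """Extract the VmSize value (in kB) from /proc/<pid>/status content."""
    for line in status_text.splitlines():
        if line.startswith("VmSize:"):
            # The line looks like: "VmSize:   1843200 kB"
            return int(line.split()[1])
    raise ValueError("no VmSize line found")


def read_vmsize_kb(pid: int) -> int:
    """Read the current virtual size of a running process (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        return vmsize_kb(f.read())


# Made-up sample in the format of /proc/<pid>/status.
sample = (
    "Name:\tmysqld\n"
    "VmPeak:\t 1900000 kB\n"
    "VmSize:\t 1843200 kB\n"
    "VmRSS:\t  512000 kB\n"
)
```

Calling read_vmsize_kb with the pid of the mysql server process and alerting when the value approaches your platform's limit would be one way to get early warning.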

Final notes

Run a 64-bit OS if possible and for all new development. Run a 64-bit mysql if you run a 64-bit OS.
Running out of address space is not the same as running out of memory - it will probably happen a lot sooner (your server does have more than 2GB of RAM, right?)
You can still run out of address space if you have less than 2GB of RAM, because virtual memory is included
None of this is Linux-specific, it will happen on other OSs too (but the amount of space reserved by the kernel may be less)
Avoiding address space exhaustion is important in production systems

Saturday, 29 December 2007

On web application configuration

There are two common approaches for configuring web applications:

Configuration file(s) in the web application's directory
Storing configuration data in the database

What sort of data am I talking about:

Email setup (where to send response form, error emails etc, whether to send mail to customers etc)
Setup of things which need to vary between development and production - payment service provider configuration, setup of other third party integration things
Error handler configuration (whether to display errors, where to log them, email them etc)
Enabling / disabling features (for example, debug things)
Visual stuff or branding (labels etc)

Clearly the database connection string needs to be stored somewhere too, but that can't be in the database for obvious reasons.

Database configuration

pros: does not need to be configured on each web server of a farm; not accidentally overwritten by new deployments
cons: bad for hierarchical or structured data; performance overhead; difficult to maintain configuration data stored in a database in SCM; modifications require apply / rollback scripts.

Configuration files

pros: Easy to manage in SCM (like all your other files, even if not in the same place); less runtime performance overhead; better for hierarchical data; database connection string can be held with other items.
cons: Needs to be synchronised across a web farm; care should be taken that the right versions are always deployed everywhere.

Ideas

For a larger application with a web farm, using a database for some configuration is clearly beneficial. If you have many servers and/or servers of other types (non-web servers), they can store all their configuration in a central database. If performance is a problem, copies can be cached locally.

For a small simple application, configuration files are better - they're more manageable generally. I prefer to put most configuration in a file if possible.

Monday, 3 December 2007

mysql dumps are not text files

Today's pearl of wisdom:

MySQL dump files, as produced by mysqldump, are not text files.

A common mistake is to think that mysqldump files are text files. While they superficially look like text files, they in fact are not. This leads to confusion and problems if people try to edit them in a text editor application, which will in many cases cause corruption.

MySQL dump files contain the SQL commands required to recreate the contents of a database (or subset thereof). However, these are SQL commands in arbitrary (binary) encoding and can consist of a mixture of different encodings (e.g. if you have BLOBs). This mixture is not safe to edit with a text editor, which expects a text file to contain character data in exactly one encoding (and will typically guess, ask the user or just use its default to determine which).

Suppose you use utf8 data in your database - then your mysqldump file will probably contain all of its strings in utf8. This is fine, until you get to a BLOB. This is a binary lump of data and will be full of sequences of bytes which are NOT LEGAL in utf8, therefore any editor attempting to make sense of them in utf8 won't succeed. It will therefore be forced to change them into something else (i.e. different from the original BLOB data) in order to load them into its buffers.

Saving this modified (or even unmodified) file back out will result in loss of data.
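
The effect is easy to reproduce without an editor. The sketch below builds a fragment shaped like a dump line (the table name and byte values are made up), then does what a lenient editor effectively does: decode as utf8, substituting anything undecodable, and write the result back out.

```python
# A fragment shaped like a dump line: utf8 text plus raw BLOB bytes.
# The table name and byte values are made up for illustration.
blob = bytes([0xFF, 0xFE, 0x80, 0x41])
dump_line = "INSERT INTO t VALUES ('caf\u00e9', '".encode("utf-8") + blob + b"');\n"


def is_valid_utf8(data: bytes) -> bool:
    """True if the bytes decode cleanly as utf8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Load with substitution of undecodable bytes, then save again.
round_tripped = dump_line.decode("utf-8", errors="replace").encode("utf-8")
```

The line is not valid utf8 to begin with, and after the round trip the original BLOB bytes are gone, even though nothing was "edited".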

Summary

MySQL dump files aren't text files even though they look like them

Editing one in a text editor will probably lose data

They are often also too big for a text editor, and have lines which are too long for one