Saturday, 10 May 2008

Linux IP load balancing without a load balancer

I've been investigating load balancing without a load balancer. I'm building my own implementation of a high availability IP load balancer / failover system.

This will essentially work like the netfilter CLUSTERIP target, except that it will also be self-configuring and self-monitoring / repairing - thus not requiring other tools (such as the complicated LinuxHA tools) to work. Some other efforts to do this have been:
  • Saru http://www.ultramonkey.org/papers/active_active/ - seems abandoned
  • Microsoft's "network load balancing" does something similar
  • An author known as "flavio" wrote an article about load balancer-less clusters; it seems to have disappeared from its original location, although it's still on the Wayback Machine

IP load balancing without a dedicated load balancer host works as follows:
  • ARP requests for the cluster IP address are answered with a multicast Ethernet address
  • All the hosts join the ethernet multicast group
  • Hosts selectively accept / ignore traffic based on whether they want to handle it or not, by some hashing algorithm.
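
To make the ARP step concrete, here is a minimal Python sketch of the reply frame such a daemon might build. It is an illustration only, not the project's code: the function names, the example MAC addresses and the documentation-range IP addresses are all made up. It assumes the multicast address goes only into the ARP payload's sender hardware address, while the Ethernet header keeps the node's own unicast MAC as its source.

```python
import socket
import struct

ETHERTYPE_ARP = 0x0806
ARP_OP_REPLY = 2


def mac_bytes(mac: str) -> bytes:
    """Convert 'aa:bb:cc:dd:ee:ff' into 6 raw bytes."""
    return bytes(int(part, 16) for part in mac.split(":"))


def is_multicast_mac(mac: bytes) -> bool:
    """An Ethernet address is a group address if the lowest bit of octet 0 is set."""
    return bool(mac[0] & 0x01)


def build_arp_reply(cluster_mac: str, cluster_ip: str, node_mac: str,
                    requester_mac: str, requester_ip: str) -> bytes:
    """Build an Ethernet frame carrying an ARP reply for the cluster IP.

    Assumption: only the ARP payload advertises the multicast address;
    the frame itself is sent from the node's unicast MAC.
    """
    sender_hw = mac_bytes(cluster_mac)
    target_hw = mac_bytes(requester_mac)
    # Ethernet header: destination, source, ethertype
    eth = target_hw + mac_bytes(node_mac) + struct.pack("!H", ETHERTYPE_ARP)
    # ARP header: hardware type 1 (Ethernet), protocol 0x0800 (IPv4),
    # hardware length 6, protocol length 4, opcode 2 (reply)
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, ARP_OP_REPLY)
    arp += sender_hw + socket.inet_aton(cluster_ip)
    arp += target_hw + socket.inet_aton(requester_ip)
    return eth + arp


frame = build_arp_reply(
    cluster_mac="01:00:5e:00:00:20",   # hypothetical multicast MAC
    cluster_ip="192.0.2.10",           # hypothetical cluster IP
    node_mac="02:00:00:00:00:01",      # hypothetical unicast MAC of this node
    requester_mac="02:00:00:00:00:99",
    requester_ip="192.0.2.50",
)
```

Actually sending the frame would need a raw socket and root privileges, which the sketch leaves out.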
I've started work on the implementation on Google Code. Most parts of it can be done in user-space (a kernel implementation might be necessary for performance later):
  • I use arptables to block the kernel's own ARP responses on the load balanced IP, otherwise it would give out its own unicast link address.
  • A small userspace daemon responds to ARP requests, giving out a multicast address.
  • The IP address is configured normally with "ip addr add ..."
  • Iptables is used to filter out the traffic we don't want and accept traffic we do want. It uses connection tracking to ensure that established connections are always kept, invalid ones ignored, and new connections passed to a userspace daemon using NFQUEUE
  • A userspace daemon reads the packets from NFQUEUE and uses a hashing algorithm to determine whether to accept them or not. Each host in the cluster receives the same packets and does the same hash - so reaches the same conclusion about who should receive the packet - thus EXACTLY ONE host will accept each new connection.
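
The accept/reject decision in the last step can be sketched as follows. This is a simplified illustration with equal weights, not the actual daemon: the choice of SHA-1, of hashing only the source address and port, and the names `connection_hash` and `should_accept` are all assumptions made for the example. The one hard requirement is that every node computes an identical, deterministic hash, which rules out Python's built-in `hash()` because it is randomised per process for strings and bytes.

```python
import hashlib
import socket
import struct


def connection_hash(src_ip: str, src_port: int) -> int:
    """Deterministic hash of a new connection's source address and port."""
    key = socket.inet_aton(src_ip) + struct.pack("!H", src_port)
    digest = hashlib.sha1(key).digest()
    return int.from_bytes(digest[:4], "big")


def should_accept(src_ip: str, src_port: int,
                  node_index: int, node_count: int) -> bool:
    """Return True if the node with this index owns the connection.

    Every node runs the same function on the same packet, so exactly one
    index in range(node_count) gets True.
    """
    return connection_hash(src_ip, src_port) % node_count == node_index


# Example: which of three nodes takes a connection from 198.51.100.7:40000?
owners = [i for i in range(3) if should_accept("198.51.100.7", 40000, i, 3)]
```

In the real system the verdict would be passed back to NFQUEUE as accept or drop. That part is left out here.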
Load balancing can be done fairly (all nodes equal weight) or unfairly (different weights). Also, when administratively shutting down a node, we can set its weight to zero and existing connections will be allowed to finish (new ones will then be given to other nodes).

I've created a very sketchy design; it's all basically do-able. The userspace daemons use UDP multicast packets to talk to each other and will elect a "leader", which then tells the other nodes which hash values to accept/reject, ensuring that there is no overlap and no gaps.
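
One way the leader could split the hash space is sketched below in Python. This is a guess at a workable scheme, not the design itself: the fixed number of hash buckets, the contiguous ranges and the name `assign_buckets` are assumptions for illustration. Using cumulative weights guarantees that the ranges cover every bucket exactly once, and a node with weight zero gets an empty range, so it receives no new connections.

```python
from typing import Dict


def assign_buckets(weights: Dict[str, int],
                   bucket_count: int = 256) -> Dict[str, range]:
    """Split range(bucket_count) among nodes in proportion to their weights.

    Nodes are processed in sorted order so that every node computing this
    from the same weights gets the same answer.
    """
    if any(w < 0 for w in weights.values()):
        raise ValueError("weights must not be negative")
    total = sum(weights.values())
    if total == 0:
        raise ValueError("at least one node needs a non-zero weight")

    assignment = {}
    start = 0
    cumulative = 0
    for node in sorted(weights):
        cumulative += weights[node]
        # Integer arithmetic: the last node's end is always bucket_count
        end = bucket_count * cumulative // total
        assignment[node] = range(start, end)
        start = end
    return assignment


# Hypothetical cluster: node-c is being drained (weight zero)
plan = assign_buckets({"node-a": 2, "node-b": 1, "node-c": 0})
```

A node would then accept a new connection only if its hash, taken modulo `bucket_count`, falls inside its own range.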

There are a lot of possibilities for race conditions during a reconfiguration due to a node weight change / failure / recovery. I haven't thought about these yet.

This principle works well for TCP-based services such as web and email, but may not be good for some UDP-based services because conntrack cannot ensure that the packets continue going to the same node for the lifetime of the connection (as it does for TCP).

---
Problems / disadvantages:
  • Apparently, RFC 1812 says that routers must not believe an ARP reply which claims that a link-layer address is a multicast address
  • The Linux kernel ignores TCP packets which have a link-layer multicast destination. I've worked around this with a really small kernel module (the same as what CLUSTERIP does)
  • Interoperability with other network OSs might not be good as this isn't a very official technique. Apparently some routers ignore these ARP packets.