This will essentially work like the netfilter CLUSTERIP target, except that it will also be self-configuring, self-monitoring and self-repairing, so it will not need other tools (such as the complicated LinuxHA tools) to work. Some other efforts to do this have been:
- Saru http://www.ultramonkey.org/papers/active_active/ - seems abandoned
- Microsoft's "network load balancing" does something similar
IP load balancing without a dedicated load balancer host works as follows:
- ARP requests for the cluster IP address are responded to by a multicast ethernet address
- All the hosts join the ethernet multicast group
- Each host selectively accepts or ignores traffic, using a hashing algorithm to decide whether it should handle it.
- I use arptables to block the kernel's own ARP responses for the load-balanced IP; otherwise the kernel would give out its own unicast link address.
- A small userspace daemon responds to ARP requests, giving out a multicast address.
- The IP address is configured normally with "ip addr add ..."
- Iptables is used to filter out the traffic we don't want and accept traffic we do want. It uses connection tracking to ensure that established connections are always kept, invalid ones ignored, and new connections passed to a userspace daemon using NFQUEUE
- A userspace daemon reads the packets from NFQUEUE and uses a hashing algorithm to determine whether to accept them or not. Each host in the cluster receives the same packets and computes the same hash, so every host reaches the same conclusion about who should receive the packet - thus EXACTLY ONE host will accept each new connection.
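To illustrate the accept / ignore decision, here is a minimal sketch in Python. It is not the daemon's actual code: the bucket count, the choice of CRC32 over source address and port, and the names bucket_for and should_accept are all assumptions made for the example. The only property that matters is that every node computes the same value for the same packet, which is why it uses zlib.crc32 rather than Python's built-in hash() (which is randomised per process for bytes and strings).

```python
import ipaddress
import zlib

NUM_BUCKETS = 256  # assumed size of the hash space, not taken from the design


def bucket_for(src_ip: str, src_port: int) -> int:
    """Map a new connection's source address and port to a hash bucket."""
    key = ipaddress.ip_address(src_ip).packed + src_port.to_bytes(2, "big")
    # CRC32 gives the same result on every node for the same input bytes.
    return zlib.crc32(key) % NUM_BUCKETS


def should_accept(src_ip: str, src_port: int, my_buckets: frozenset) -> bool:
    """Return True if this node owns the bucket the connection hashes to."""
    return bucket_for(src_ip, src_port) in my_buckets


# Hypothetical two-node cluster splitting the hash space in half,
# with no overlap and no gaps.
node_a = frozenset(range(0, 128))
node_b = frozenset(range(128, 256))

# Both nodes see the same packet; exactly one of them accepts it.
verdicts = [should_accept("192.0.2.10", 40000, n) for n in (node_a, node_b)]
```

As long as the bucket sets of the nodes are disjoint and together cover the whole hash space, exactly one entry in verdicts is True for any source address and port.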
I've created a very sketchy design; it is all basically doable. The userspace daemon uses UDP multicast packets to talk to the other nodes. The nodes will organise a "leader", which will then tell the other nodes which hash values to accept or reject, ensuring that there is no overlap and no gaps.
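One way a leader could hand out hash values with no overlap and no gaps is to cut the hash space into contiguous slices in proportion to each node's weight. The sketch below is only an assumption about how that might look: the function name assign_buckets, the 256-bucket hash space, and the node names are invented for the example, and it does not deal with the messaging between nodes.

```python
NUM_BUCKETS = 256  # assumed size of the hash space, not taken from the design


def assign_buckets(weights: dict, num_buckets: int = NUM_BUCKETS) -> dict:
    """Split range(num_buckets) into contiguous slices proportional to weight.

    weights maps a node name to a non-negative integer weight.
    Returns a dict mapping each node name to the range of buckets it owns.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one node needs a positive weight")
    assignment = {}
    start = 0
    cumulative = 0
    # Sort the node names so the same weights always give the same layout.
    for node in sorted(weights):
        cumulative += weights[node]
        # The last node's end is always num_buckets, so nothing is left over.
        end = num_buckets * cumulative // total
        assignment[node] = range(start, end)
        start = end
    return assignment


# Hypothetical three-node cluster where node "c" has twice the weight.
layout = assign_buckets({"a": 1, "b": 1, "c": 2})
```

Because each slice starts where the previous one ended and the last slice ends at num_buckets, the slices cannot overlap or leave a gap. A node with weight 0 simply gets an empty range, which is one way to drain a node before removing it.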
There are a lot of possibilities for race conditions during a reconfiguration due to a node weight change / failure / recovery. I haven't thought about these yet.
This principle works well for TCP-based services such as web and email, but may not be good for some UDP-based services because conntrack cannot ensure that the packets continue going to the same node for the lifetime of the connection (as it does for TCP).
Problems / disadvantages:
- Apparently, RFC 1812 says that a router must not believe an ARP reply which claims that a host's link-layer address is a multicast address
- The Linux kernel ignores TCP packets which have a link-layer multicast destination. I've worked around this with a really small kernel module (the same as what CLUSTERIP does)
- Interoperability with other network operating systems might not be good, as this isn't a very official technique. Apparently some routers ignore these ARP replies.