Halfbakery: Resilient Wi-Fi Mesh Network

This is an idea for a type of Wi-Fi mesh network that is resilient to devices joining and leaving, and allows networks to break into fragments when they get out of range and merge again when they come back in range. I just recently realized that this would also work for a wired network, in case anybody wants to make an overengineered ad-hoc wired network for some reason.

I invented this a year and a half ago, inspired by a mesh network project someone else was doing that used a different architecture. I posted it as a comment on a project log post of that project ([link]), but didn't think to post it here. One of the people behind that project replied to my comment and said he thought it could be good. That [link]ed project log post outlines their idea of how to accomplish this sort of thing, so read that if you want context for my idea.

My idea uses DHCP as the mechanism for managing the network. This might require the addition to DHCP of another message type beyond DORA, or maybe (preferably) the heartbeat messages I describe could be worked into existing lease renewal messages, with either very short leases or renewal much more frequently than the lease duration would require.

I wrote it with IPv4 in mind; requiring IPv6 could help with address conflicts, especially on large networks. I'm not yet sure how NAT could be implemented (though it probably wouldn't be too difficult, and I will probably add it at some point), so currently, each participating device has to get a public IP address if the network is to be connected to the Internet. (This means that there needs to be some kind of flag so that all participating devices know not to bridge between the mesh network and the Internet if any of the IP addresses used internally are not usable public addresses. In other words, if a network is founded without a list of public IP addresses to assign, it can never connect to the Internet, until I either add NAT or invent some protocol to get a list of allocable public IP addresses and reassign them to all participating devices. Such a flag could even be attached to the SSID so that Internet-connectable networks and non-Internet-connectable networks with the same name don't try to merge. But better to just figure out how to add NAT.)

I have not yet considered how to add security to this kind of network, but I will. Suggestions are welcome.

Original writeup follows, with a couple of added notes in [brackets].

------

The network must always have a certain number greater than one of active DHCP servers. Rather than allocating addresses directly out of the pool of available addresses, they each maintain a short list (~5?) of reserved addresses to allocate from, which the other DHCP servers will not allocate from, because their short lists will not contain those addresses. This guarantees that no address will be allocated to multiple devices by different DHCP servers. Every DHCP server knows the addresses of the others and exchanges heartbeats with them, synchronizing the lease file and refilling their short lists from the general pool of available addresses. They also need to sync their short lists to keep them mutually exclusive (or just sync the pool, but conflict resolution might be harder that way, and it’s more data to sync if sunc in full). There will have to be a conflict resolution method for refilling the short lists, but I think it would be a lot easier to do that than try to resolve conflicts between addresses after they've been allocated. It could be as simple as both DHCP servers that claimed an address from the pool waiting a random amount of time, and whichever one reacts first gets the contested address to add to its short list, and the other one claims a different address. If there are no addresses left in the pool to add to the short lists, the DHCP servers keep allocating addresses from their short lists until they run out. When they run out, they will no longer respond with address offers to joining devices’ discovery messages. In this way, the joining device is guaranteed to get an address even if the nearest DHCP server to it doesn’t have one to allocate.

If there are too few DHCP servers on the network, the existing ones will ask another device to become a DHCP server. They should probably pick a device as far away from them on the mesh as possible so that the distribution of DHCP servers on the mesh is more uniform. If there are too many, then they pick one to deactivate, at which point it becomes just another device on the network. In this case, they should probably pick the one nearest to the rest, for the same reason. If using a routing algorithm that doesn’t allow full knowledge of the network, such as BATMAN, they can each pick a device to activate/deactivate as a DHCP server and decide by a vote. (BATMAN hardly allows any knowledge of the layout of the network, so latency should probably be used as the metric in this case.)

A device joining the network will, as usual, send out a discovery message looking for DHCP servers. At least one will respond, and the joining device will then have to choose which one to use—the easiest thing to do would be to simply choose the first one to respond, because it’s probably closest. The others can be ignored, because if they don’t receive address requests back from the joining device, they won’t allocate the addresses they offered. The device that just joined will then remember which DHCP server gave it its address and exchange heartbeats with that DHCP server. These heartbeats include a list of the other DHCP servers' addresses. If a device notices a lack of heartbeat from its associated DHCP server, it will assume that DHCP server has gone offline suddenly and reassociate with a different one (because it knows all of their addresses), keeping the same address it already had. If a DHCP server notices a lack of heartbeat from any of its associated devices, it will first ask the other DHCP servers if that device is now associated with any of them, and, if not, add that address back to its short list (if the short list isn’t full) or to the general pool. If any device cannot contact any DHCP server, it assumes there are none left on the network (or that it’s founding a network) and becomes one. This, according to what has been discussed so far, would result in every remaining device becoming a DHCP server. [I no longer know why I thought this would happen.] Then they can then decide which ones to deactivate. With network IDs and the resulting merging process, described below, each device would still become a DHCP server, but would create its own network fragment. These fragments would then re- merge. This results in an extremely resilient network.

If a DHCP server goes down from the perspective of the other DHCP servers, they should first attempt to contact the devices it allocated addresses to, to see if any of them can still contact it. [How do they know which devices were associated with it?] If so, then messages can just be routed via them. (If the mesh routing works well enough, this would happen without them even noticing.) If it can't be contacted, the other DHCP servers ask all of the devices that previously belonged to the downed one to reassociate with another DHCP server, again, using their existing IP addresses. (If some of them are not contactable, but continue to be in contact with their DHCP server, this just becomes a separate network.) If necessary, the remaining DHCP servers then ask another device on the network to become a DHCP server, as described above, to replace the lost one. Also, for optimization, it might be good to have devices automatically reassociate with different DHCP servers if the network topology changes to bring them closer together. There should be some hysteresis, though, to keep them from switching back and forth quickly.

What about merging networks? Well, what triggers a merger? Obviously, two distinct networks coming into range of each other. There are two cases for this: case 1, where some IP addresses are allocated to a device on each network, and case 2, where the networks don’t have any allocated IP addresses that overlap. In case 2, the most obvious way to detect proximity of another network is to detect that the destination IP address on a received packet isn’t allocated on the current network. However, this won’t necessarily work for case 1, and it isn’t reliable even in case 2. An analogous detection method for case 1 is receiving a packet from one’s own IP address, or receiving replies from two other devices in response to only one message being sent. This, however, is also unreliable. Therefore, we can discount this detection method. In case 2, it might be possible for the networks to merge passively, using their other synchronization and conflict resolution methods to bring all of the DHCP servers into agreement. However, the situation could easily transition to case 1 during the merger, causing more difficult conflicts.

Therefore, I think it’s necessary for each network to have an ID of some sort, agreed upon and distributed to the other devices by the DHCP servers. This could be a UUID or a traditional SSID. This ID would facilitate reliable detection of two networks coming into range of each other.

In the case of using an SSID, any device receiving a packet with an SSID on it that doesn’t match that of the current network would notify its DHCP server, which would initiate a merger. However, if the SSID is the same, as would be the case if the network broke in two and then came back together, no such notification would be sent. In this case, there could be IP address conflicts, so passive merging might not be reliable. Therefore, an SSID alone is not sufficient to avoid conflicts. However, an SSID has the advantage of being human-readable.

The advantage of using a UUID is that there could not be a case where the two network[ fragment]s think they’re one, so there could be no passive mergers, and so no unnoticed IP address conflicts. With UUIDs, a breakaway network[ fragment]’s DHCP server(s) would create and propagate a new UUID for the new network fragment as soon as it notices the break. (Only a fragment composed of a majority of the previous network could keep its current UUID, because only it would be able to know that there was no larger fragment resulting from the break.) Therefore, whenever a network loses a DHCP server (without that DHCP server telling the others it’s about to go offline and telling its associated devices to reassociate), a new UUID might need to be created for the remainder of the network. This leads to the idea of using not a UUID but instead some kind of hash of the active DHCP servers’ MAC addresses. (Such a hash might also be useful to enable joining devices to authenticate the DHCP server that offers them an address. I haven’t really thought about that.)

Therefore, I think it’s necessary to use both types of ID. The SSID would be the human-readable network name, enabling users to choose a network to join, and enabling multiple distinct networks to operate in the same geographical area without attempting to merge. The UUID or DHCP server MAC address hash would identify every network fragment uniquely to facilitate reliable detection of the necessity to merge. With both types of ID being used together, networks would merge when their SSIDs are the same but their UUIDs/hashes are different.

Now we come to the actual process of merging. Once the two fragments discover that they share an SSID but not a UUID/hash, they will know that they need to merge. Their DHCP servers will set up provisional communications via whatever edge nodes happened to come into range of the other network. Once this is in place, their DHCP servers can negotiate which DHCP servers to deactivate and which to leave active, and then create and propagate a new UUID/hash for the new combined [network], and synchronize the DHCP leases and short lists. To resolve address conflicts, they will choose one device to keep its address and one to switch to a different address based on some factor such as relative traffic volumes. [Presumably I wrote that to minimize ARP traffic, though either way, all devices on the same fragment as whichever one gets a new address will have to learn their peer's new address and will be sending mis-addressed messages until that happens. This might require a 'push' version of ARP so the DHCP servers can propagate the new address association; this could probably be included in the heartbeat messages.] DHCP servers should probably get priority over other devices to keep their addresses. If there are too many devices to merge the fragments without address conflicts, they will not merge, or some kind of subnet thing will happen. [I now realize that this would require something like NAT, internal to the network. Of course, it's not a problem if using a sufficiently large pool of IP addresses.] If the UUID or hash of either merging fragment happens to change during the merger process (e.g. by one of the fragments breaking or merging with a third fragment), then the merger should be aborted and retried. In this way, all fragments in the area with the same SSID will eventually merge.