This is an idea for a type of Wi-Fi mesh network that is resilient to devices joining and leaving, and allows networks to break into
fragments when they get out of range and merge again when they come back in range. I just recently realized that this would also
work for a wired network, in case anybody
wants to make an overengineered ad-hoc wired network for some reason.
I invented this a year and a half ago, inspired by a mesh network project someone else was doing that used a different architecture. I
posted it as a comment on a project log post of that project ([link]), but didn't think to post it here. One of the people behind that
project replied to my comment and said he thought it could be good. That [link]ed project log post outlines their idea of how to
accomplish this sort of thing, so read that if you want context for my idea.
My idea uses DHCP as the mechanism for managing the network. This might require the addition to DHCP of another message type
beyond DORA, or maybe (preferably) the heartbeat messages I describe could be worked into existing lease renewal messages, with
either very short leases or renewal much more frequently than the lease duration would require.
I wrote it with IPv4 in mind; requiring IPv6 could help with address conflicts, especially on large networks. I'm not yet sure how NAT
could be implemented (though it probably wouldn't be too difficult, and I will probably add it at some point), so currently, each
participating device has to get a public IP address if the network is to be connected to the Internet. (This means that there needs to
be some kind of flag so that all participating devices know not to bridge between the mesh network and the Internet if any of the IP
addresses used internally are not usable public addresses. In other words, if a network is founded without a list of public IP addresses
to assign, it can never connect to the Internet, until I either add NAT or invent some protocol to get a list of allocable public IP
addresses and reassign them to all participating devices. Such a flag could even be attached to the SSID so that Internet-connectable
networks and non-Internet-connectable networks with the same name don't try to merge. But better to just figure out how to add
NAT.)
I have not yet considered how to add security to this kind of network, but I will. Suggestions are welcome.
Original writeup follows, with a couple of added notes in [brackets].
------
The network must always have a certain number greater than one of active DHCP servers. Rather than allocating addresses directly
out of the pool of available addresses, they each maintain a short list (~5?) of reserved addresses to allocate from, which the other
DHCP servers will not allocate from, because their short lists will not contain those addresses. This guarantees that no address will be
allocated to multiple devices by different DHCP servers. Every DHCP server knows the addresses of the others and exchanges
heartbeats with them, synchronizing the lease file and refilling their short lists from the general pool of available addresses. They
also need to sync their short lists to keep them mutually exclusive (or just sync the pool, but conflict resolution might be harder that
way, and its more data to sync if sunc in full). There will have to be a conflict resolution method for refilling the short lists, but I
think it would be a lot easier to do that than try to resolve conflicts between addresses after they've been allocated. It could be as
simple as both DHCP servers that claimed an address from the pool waiting a random amount of time, and whichever one reacts first
gets the contested address to add to its short list, and the other one claims a different address. If there are no addresses left in the
pool to add to the short lists, the DHCP servers keep allocating addresses from their short lists until they run out. When they run out,
they will no longer respond with address offers to joining devices discovery messages. In this way, the joining device is guaranteed to
get an address even if the nearest DHCP server to it doesnt have one to allocate.
If there are too few DHCP servers on the network, the existing ones will ask another device to become a DHCP server. They should
probably pick a device as far away from them on the mesh as possible so that the distribution of DHCP servers on the mesh is more
uniform. If there are too many, then they pick one to deactivate, at which point it becomes just another device on the network. In
this case, they should probably pick the one nearest to the rest, for the same reason. If using a routing algorithm that doesnt allow
full knowledge of the network, such as BATMAN, they can each pick a device to activate/deactivate as a DHCP server and decide by a
vote. (BATMAN hardly allows any knowledge of the layout of the network, so latency should probably be used as the metric in this
case.)
A device joining the network will, as usual, send out a discovery message looking for DHCP servers. At least one will respond, and the
joining device will then have to choose which one to usethe easiest thing to do would be to simply choose the first one to respond,
because its probably closest. The others can be ignored, because if they dont receive address requests back from the joining device,
they wont allocate the addresses they offered. The device that just joined will then remember which DHCP server gave it its address
and exchange heartbeats with that DHCP server. These heartbeats include a list of the other DHCP servers' addresses. If a device
notices a lack of heartbeat from its associated DHCP server, it will assume that DHCP server has gone offline suddenly and reassociate
with a different one (because it knows all of their addresses), keeping the same address it already had. If a DHCP server notices a lack
of heartbeat from any of its associated devices, it will first ask the other DHCP servers if that device is now associated with any of
them, and, if not, add that address back to its short list (if the short list isnt full) or to the general pool. If any device cannot contact
any DHCP server, it assumes there are none left on the network (or that its founding a network) and becomes one. This, according to
what has been discussed so far, would result in every remaining device becoming a DHCP server. [I no longer know why I thought this
would happen.] Then they can then decide which ones to deactivate. With network IDs and the resulting merging process, described
below, each device would still become a DHCP server, but would create its own network fragment. These fragments would then re-
merge. This results in an extremely resilient network.
If a DHCP server goes down from the perspective of the other DHCP servers, they should first attempt to contact the devices it
allocated addresses to, to see if any of them can still contact it. [How do they know which devices were associated with it?] If so,
then messages can just be routed via them. (If the mesh routing works well enough, this would happen without them even noticing.)
If it can't be contacted, the other DHCP servers ask all of the devices that previously belonged to the downed one to reassociate with
another DHCP server, again, using their existing IP addresses. (If some of them are not contactable, but continue to be in contact with
their DHCP server, this just becomes a separate network.) If necessary, the remaining DHCP servers then ask another device on the
network to become a DHCP server, as described above, to replace the lost one. Also, for optimization, it might be good to have
devices automatically reassociate with different DHCP servers if the network topology changes to bring them closer together. There
should be some hysteresis, though, to keep them from switching back and forth quickly.
What about merging networks? Well, what triggers a merger? Obviously, two distinct networks coming into range of each other.
There are two cases for this: case 1, where some IP addresses are allocated to a device on each network, and case 2, where the
networks dont have any allocated IP addresses that overlap. In case 2, the most obvious way to detect proximity of another network
is to detect that the destination IP address on a received packet isnt allocated on the current network. However, this wont
necessarily work for case 1, and it isnt reliable even in case 2. An analogous detection method for case 1 is receiving a packet from
ones own IP address, or receiving replies from two other devices in response to only one message being sent. This, however, is also
unreliable. Therefore, we can discount this detection method. In case 2, it might be possible for the networks to merge passively,
using their other synchronization and conflict resolution methods to bring all of the DHCP servers into agreement. However, the
situation could easily transition to case 1 during the merger, causing more difficult conflicts.
Therefore, I think its necessary for each network to have an ID of some sort, agreed upon and distributed to the other devices by the
DHCP servers. This could be a UUID or a traditional SSID. This ID would facilitate reliable detection of two networks coming into range
of each other.
In the case of using an SSID, any device receiving a packet with an SSID on it that doesnt match that of the current network would
notify its DHCP server, which would initiate a merger. However, if the SSID is the same, as would be the case if the network broke in
two and then came back together, no such notification would be sent. In this case, there could be IP address conflicts, so passive
merging might not be reliable. Therefore, an SSID alone is not sufficient to avoid conflicts. However, an SSID has the advantage of
being human-readable.
The advantage of using a UUID is that there could not be a case where the two network[ fragment]s think theyre one, so there could
be no passive mergers, and so no unnoticed IP address conflicts. With UUIDs, a breakaway network[ fragment]s DHCP server(s) would
create and propagate a new UUID for the new network fragment as soon as it notices the break. (Only a fragment composed of a
majority of the previous network could keep its current UUID, because only it would be able to know that there was no larger
fragment resulting from the break.) Therefore, whenever a network loses a DHCP server (without that DHCP server telling the others
its about to go offline and telling its associated devices to reassociate), a new UUID might need to be created for the remainder of
the network. This leads to the idea of using not a UUID but instead some kind of hash of the active DHCP servers MAC addresses.
(Such a hash might also be useful to enable joining devices to authenticate the DHCP server that offers them an address. I havent
really thought about that.)
Therefore, I think its necessary to use both types of ID. The SSID would be the human-readable network name, enabling users to
choose a network to join, and enabling multiple distinct networks to operate in the same geographical area without attempting to
merge. The UUID or DHCP server MAC address hash would identify every network fragment uniquely to facilitate reliable detection of
the necessity to merge. With both types of ID being used together, networks would merge when their SSIDs are the same but their
UUIDs/hashes are different.
Now we come to the actual process of merging. Once the two fragments discover that they share an SSID but not a UUID/hash, they
will know that they need to merge. Their DHCP servers will set up provisional communications via whatever edge nodes happened to
come into range of the other network. Once this is in place, their DHCP servers can negotiate which DHCP servers to deactivate and
which to leave active, and then create and propagate a new UUID/hash for the new combined [network], and synchronize the DHCP
leases and short lists. To resolve address conflicts, they will choose one device to keep its address and one to switch to a different
address based on some factor such as relative traffic volumes. [Presumably I wrote that to minimize ARP traffic, though either way,
all devices on the same fragment as whichever one gets a new address will have to learn their peer's new address and will be sending
mis-addressed messages until that happens. This might require a 'push' version of ARP so the DHCP servers can propagate the new
address association; this could probably be included in the heartbeat messages.] DHCP servers should probably get priority over other
devices to keep their addresses. If there are too many devices to merge the fragments without address conflicts, they will not merge,
or some kind of subnet thing will happen. [I now realize that this would require something like NAT, internal to the network. Of
course, it's not a problem if using a sufficiently large pool of IP addresses.] If the UUID or hash of either merging fragment happens to
change during the merger process (e.g. by one of the fragments breaking or merging with a third fragment), then the merger should
be aborted and retried. In this way, all fragments in the area with the same SSID will eventually merge.