Replicated data (by alaric)
So you want a tradeoff. Ideally, you'd tell the system that you want every record copied to at least N nodes for reliability (where N is typically 2 or 3), but otherwise split evenly across the cluster. That's quite easy to do.
But the fun part is how to keep that the case while new nodes are joining the system and dead nodes are dropping off it... especially when you consider that establishing consensus about which set of nodes is up is hard in a distributed system.
So I need to think about it in more detail, but before I forget it, here's an outline of the approach I think I should take.
My existing plans for TUNGSTEN involved every entity (the ARGON analogue for a 'file' or 'object') being assigned to a 'Volume', and each node having a list of which volumes it carries. Provisionally, and probably for the first release, every node will carry a copy of every entity from every volume it carries. In other words, full replication.
One important fact is that there's a system volume, which EVERY node carries. It contains cluster-wide configuration information - such as the list of nodes. It's fully replicated, since every node needs that information when booting in order to contact the rest of the cluster.
So I propose that we use a hash function to map the ID of an entity within each volume to a 32-bit number, which we will call the 'page number'. And we maintain, in the system volume, for every volume, a list of nodes that carry that volume. That list of nodes is divided into groups I call fascicles. Each fascicle is assigned one or more ranges of page numbers.
Therefore, for a given entity, the system can work out the page number, then look in the volume definition to find which fascicle is responsible for that page, and the list of nodes carrying that fascicle.
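To make that concrete, here's a rough Python sketch of the kind of mapping I mean. The names (page_number, Fascicle, VolumeMap and so on) are purely illustrative, as is the choice of hash; this is just the shape of the data, not real TUNGSTEN code.

```python
import hashlib
from dataclasses import dataclass, field


def page_number(entity_id: bytes) -> int:
    """Hash an entity ID down to a 32-bit page number."""
    digest = hashlib.sha256(entity_id).digest()
    return int.from_bytes(digest[:4], "big")


@dataclass
class Fascicle:
    # Inclusive (lo, hi) page ranges this fascicle is responsible for.
    page_ranges: list[tuple[int, int]]
    # Node ID -> state, where state is "LIVE", "UPDATING" or "DEAD".
    nodes: dict[str, str]

    def covers(self, page: int) -> bool:
        return any(lo <= page <= hi for lo, hi in self.page_ranges)


@dataclass
class VolumeMap:
    """The per-volume record kept in the system volume."""
    fascicles: list[Fascicle] = field(default_factory=list)

    def fascicle_for(self, entity_id: bytes) -> Fascicle:
        page = page_number(entity_id)
        for f in self.fascicles:
            if f.covers(page):
                return f
        raise KeyError("no fascicle covers page %d" % page)
```

Given that, routing a request for an entity is just volume_map.fascicle_for(entity_id).nodes.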
The fascicle also lists a state for each node in it - DEAD, UPDATING or LIVE. Writes to an entity must be multicast to all non-DEAD nodes in the fascicle, but reads should only be serviced from LIVE nodes.
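In sketch form, picking the targets for a read or a write is then a one-liner each (again, the helper names are mine):

```python
def write_targets(nodes: dict[str, str]) -> list[str]:
    """Writes are multicast to every non-DEAD node in the fascicle."""
    return [n for n, state in nodes.items() if state != "DEAD"]


def read_targets(nodes: dict[str, str]) -> list[str]:
    """Reads are only serviced from LIVE (fully caught-up) nodes."""
    return [n for n, state in nodes.items() if state == "LIVE"]


nodes = {"a": "LIVE", "b": "UPDATING", "c": "DEAD"}
assert write_targets(nodes) == ["a", "b"]
assert read_targets(nodes) == ["a"]
```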
Now, when a new node is assigned to the volume, the system assigns it, in state UPDATING, to the fascicle with the fewest nodes. The node is forewarned; once it has acknowledged it's ready, this is written to the system volume, so all nodes hear about it and start sending it updates. At the same time, it starts fetching copies of entities from the LIVE nodes in the fascicle.
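A minimal sketch of the assignment step, assuming each fascicle is represented by just its node-ID to state map:

```python
def assign_new_node(fascicles: list[dict[str, str]], node_id: str) -> int:
    """Add a newly assigned node to the fascicle with the fewest members.

    Returns the index of the chosen fascicle. The real sequence around this
    is: forewarn the node, wait for its acknowledgement, then write the change
    to the system volume so every node starts sending it updates while it
    copies the existing entities from the LIVE members.
    """
    idx = min(range(len(fascicles)), key=lambda i: len(fascicles[i]))
    # The node joins in UPDATING state: it receives all new writes from now
    # on, but serves no reads until it has caught up and been promoted.
    fascicles[idx][node_id] = "UPDATING"
    return idx
```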
If a node dies, then as soon as the cluster notices, the node is labelled as DEAD, so reads and writes stop going to it. When it comes back, it compares the timestamp it last saw to the timestamps currently active in the cluster, goes to the UPDATING state, and requests changes to entities that have occurred since it went down from the LIVE nodes in the fascicle.
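Something like this, where fetch_changes_since stands in for whatever the real network call ends up being, and apply_change is whatever applies a replicated update to local storage:

```python
def fetch_changes_since(peer: str, timestamp: float):
    """Placeholder: ask `peer` for every entity change recorded since `timestamp`."""
    raise NotImplementedError("network layer goes here")


def rejoin(node_id: str, last_seen_timestamp: float,
           fascicle_nodes: dict[str, str], apply_change) -> None:
    """A recovered node catching up before it serves reads again."""
    # Re-enter the fascicle in UPDATING state: writes flow to it again,
    # but reads stay away until it has caught up.
    fascicle_nodes[node_id] = "UPDATING"

    # Ask the LIVE members for everything that changed while it was down.
    for peer in (n for n, s in fascicle_nodes.items() if s == "LIVE"):
        for change in fetch_changes_since(peer, last_seen_timestamp):
            apply_change(change)

    # Caught up: promote back to LIVE so it can service reads.
    fascicle_nodes[node_id] = "LIVE"
```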
If a node is administratively deleted, then it's just removed from its fascicles.
Either way, whenever a fascicle shrinks, the system has to assess the situation to see if the fascicle now has too few nodes in it (where the minimum number of nodes per fascicle is manually defined per volume; it can be any integer from 1 to infinity. The system volume chooses 'infinity', so the system copies everything to every node). If so, it must merge the fascicle with another.
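The check itself is trivial; here's a sketch, where I've assumed DEAD nodes don't count towards the minimum, and where passing math.inf as the minimum gives the system volume's behaviour of full replication:

```python
import math


def needs_merge(fascicle_nodes: dict[str, str], min_nodes: float) -> bool:
    """Has this fascicle dropped below its volume's minimum node count?

    `min_nodes` may be math.inf, which is what the system volume uses: any
    finite fascicle is then 'too small', so the only stable arrangement is
    a single fascicle containing every node - i.e. full replication.
    """
    members = sum(1 for state in fascicle_nodes.values() if state != "DEAD")
    return members < min_nodes
```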
There are two ways this could be done. Either pick a fascicle whose page range adjoins this one, and merge with it to create a new, larger fascicle with a single page range. This is desirable in that it means that every fascicle has just one page range, which simplifies some stuff. But it gives the system little freedom in deciding which fascicles to merge.
Or it could choose another fascicle which is rather on the small side, and merge with it, creating a new fascicle with two page ranges. That results in longer-term complexity, but offers the option of more optimal mergings. I suspect experiments will have to be performed.
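The two partner-selection strategies might look something like this; the dict layout ("range", "ranges", "nodes") is just for illustration:

```python
def adjacent_partner(fascicles: list[dict], idx: int):
    """Strategy one: merge with a fascicle whose page range adjoins ours.

    Assumes every fascicle has a single contiguous range stored as
    fascicle["range"] = (lo, hi). Returns a neighbour's index, or None.
    """
    lo, hi = fascicles[idx]["range"]
    for j, f in enumerate(fascicles):
        if j == idx:
            continue
        other_lo, other_hi = f["range"]
        if other_hi + 1 == lo or hi + 1 == other_lo:
            return j
    return None


def smallest_partner(fascicles: list[dict], idx: int):
    """Strategy two: merge with whichever other fascicle has the fewest
    nodes, accepting that the result may hold two disjoint page ranges."""
    others = [j for j in range(len(fascicles)) if j != idx]
    return min(others, key=lambda j: len(fascicles[j]["nodes"])) if others else None
```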
Anyway, how do we merge two fascicles, while the system is running? My hunch is that we should record that the merge is in progress. Mark both fascicles as merging with the other fascicle. Roll out the change to the system volume. Then the rest of the system, when it wants to write a change, looks up the appropriate fascicle, and notices that the fascicle is merging with another - so sends the write to both. But reads still just come from the appropriate fascicle. That way, all the nodes of the new fascicle will be hearing about all the writes while they are busy updating themselves by copying the contents of the other fascicle to themselves, so no updates get lost. When all the nodes have finished updating themselves, they find this out using a distributed termination algorithm, and update the system volume to reflect themselves as all being members of a single merged fascicle.
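In sketch form, the write path during a merge just fans out to both halves while the read path stays put. Here each fascicle is a dict carrying its page ranges, its node states, and an optional "merging_with" index; all of that is illustrative structure, not a real format:

```python
def write_targets_during_merge(fascicles: list[dict], page: int) -> list[str]:
    """Where to send a write while a fascicle merge is in progress.

    Each fascicle is a dict with 'ranges' (a list of (lo, hi) page ranges),
    'nodes' (node ID -> state), and optionally 'merging_with' (the index of
    the fascicle it is merging with).
    """
    def covers(f):
        return any(lo <= page <= hi for lo, hi in f["ranges"])

    # The fascicle responsible for this page, found as usual.
    idx = next(i for i, f in enumerate(fascicles) if covers(f))
    targets = dict(fascicles[idx]["nodes"])

    # If a merge is in progress, the partner's nodes must also hear about
    # the write, so nothing is missed while they copy this fascicle's contents.
    partner = fascicles[idx].get("merging_with")
    if partner is not None:
        targets.update(fascicles[partner]["nodes"])

    # Reads, by contrast, still use only the original fascicle's LIVE nodes.
    return [n for n, state in targets.items() if state != "DEAD"]
```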
But what if, due to new nodes being added or fascicle merges, the fascicles get too big? If a fascicle gets larger than a threshold value, it can be split. Deciding how to split is easy if the fascicles are always contiguous page ranges - maybe split the page range exactly in half, or maybe look at the distribution of actual entities within the page range and choose a split point that splits the entity count in half. Or look at the distribution of numbers of bytes used by entities, or amount of CPU load generated by entities, or number of requests per second, or a function of all of them, and choose a split point that splits the load fairly. If the fascicles may have arbitrary numbers of noncontiguous page ranges, it gets a little trickier. But approaches spring to mind.
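For the contiguous case, the entity-count version is just a median; here's a sketch, with a note on how the weighted variants would generalise it:

```python
def split_point_by_count(entity_pages: list[int]) -> int:
    """Choose a page number that splits a fascicle's entities roughly in half.

    `entity_pages` holds the page number of every entity the fascicle
    currently carries; taking the median balances entity count rather than
    raw page-range width. Weighting by bytes, CPU load or request rate would
    just mean sorting (page, weight) pairs and walking until half the total
    weight has been passed.
    """
    pages = sorted(entity_pages)
    return pages[len(pages) // 2]
```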
A split is easier to arrange. Just make the split in the system volume. The write load on each node will roughly halve, since they are now in a fascicle covering fewer page numbers, and likewise the read load. They will be carrying copies of entities they're no longer responsible for, and since they'll no longer be sent updates for those entities, the copies may well go out of date; but they won't be sent any read requests for them either, so that's no danger - the copies just need deleting to free the space up for other uses.
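The cleanup pass is a simple sweep; a sketch, assuming a plain entity-ID to data map as the local store and reusing the page-number hash from earlier:

```python
def garbage_collect_after_split(store: dict[bytes, bytes],
                                my_ranges: list[tuple[int, int]],
                                page_of) -> int:
    """Delete local copies of entities this node no longer covers.

    `store` maps entity ID -> entity data, `my_ranges` is the node's new set
    of (lo, hi) page ranges, and `page_of` is the entity-ID -> page-number
    hash. Returns how many entities were dropped. Nothing reads these stale
    copies, so deleting them is purely about reclaiming space.
    """
    dropped = 0
    for entity_id in list(store):
        page = page_of(entity_id)
        if not any(lo <= page <= hi for lo, hi in my_ranges):
            del store[entity_id]
            dropped += 1
    return dropped
```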
But the fun part is dealing with failures in interim states. I've spoken of what happens when a LIVE node dies above. But what happens if a node dies while it's UPDATING? Or while it's undergoing a fascicle merge? Or what happens if a node dies, and when it comes back, finds it's assigned to different fascicles than when it died?