Jean woke us up at 6am today, wanting to be in Mummy and Daddy's Bed, but I couldn't get back to sleep after this so lay there thinking about things.
I was pondering the cache-deletion issue in Ugarit that I mentioned before.
The cache in Ugarit serves an important function: in order to know if it needs to upload a block of data or not, Ugarit checks first to see if a block with the same Tiger hash already exists. This operation is performed on everything in your entire filesystem every time you do a snapshot (although I'm scheming up some optimsiations to avoid this based on caching the hashes of files by their modtime and all that). If you have a slow remote archive, such as in S3, then each check for existence of a block might easily take a tenth of a second or more - and since the largest filesystem I need backing up will contain at least a million blocks, that'd be over a day just spent on checking what blocks already exist. So the cache contains a list of known-existing blocks, stored in a B-Tree on the local filesystem. Whenever a block is uploaded, it's added to the B-Tree, as we know it exists in the archive. When a check for the existance of a block comes in, we check the cache; if we see it, then we know instantly that the block exists. Otherwise we check the archive, and if the block was found, we mark this fact in the cache so that we can use it in future. And, of course, if we are asked to unlink a block, we pass that request to the archive, and if it reports that it finally deleted the block rather than just decrementing its reference count, we remove the block's hash from the cache.
However, if you have multiple computers sharing a single archive - which is a perfectly sensible thing to do, as the shared archive will mean that all the different filesystems being backed up to it will share copies (and thus not waste upload bandwidth) of files that appear in both (such as most of everything outside of /var
, /home
, and */etc
on a NetBSD system) - then deletion with caches poses an issue: if you delete a file, and update your local cache, but somebody else is also using a cache on the same archive, then their cache will not know about the deletion. This is dangerous since it means that cache will then falsely report existence of a block that's not actually in the archive, which means the contents of that block won't be uploaded - and since it was deleted from the archive, it means that block won't be backed up anywhere. Danger, danger!
But as I lay there thinking, a solution came to me.
I should make the cache backend maintain an intent log of deletions it proposes to make. When this log is about to become too big to fit in a block itself, it should:
- Upload it to the storage backend
- Request the key of the block pointed at by a special
**DELETION CHAIN HEAD**
tag
- Upload a small block containing that key, and the key of the block full of deleted block keys.
- Update the
**DELETION CHAIN HEAD**
tag to point to the new block
- Process the actual deletions
That way, it keeps a log of deleted references in the archive, and makes sure it's initialised before doing any actual deletes.
Then the cache just needs to store a local copy of the **DELETION CHAIN HEAD**
tag. When it starts up (or wants to do a periodic check) it can fetch the one from the archive; if they differ, then it should follow the chain of deletion blocks, removing them from the cache, until it reaches the deletion block with the key it has stored, or the end of the list, at which point it can stop and update its local copy of the tag.
There's still potential races - another deletion operation running in parallel might race over the **DELETION CHAIN HEAD**
tag, although I've tried to keep only a very small block upload within the window between getting and setting the tag - so I've added tag-lock!
and tag-unlock!
primitives to the storage engine API, to avoid that entirely.
More excitingly, if a deletion is running in parallel with a snapshot, then the cache being used by the snapshot might not realise a block has been deleted and return success right away.
Perhaps I need to extend tag-lock!
and tag-unlock!
to be a full-fledged semaphore system, so I can maintain archive invariants, such as that there may be N snapshots running or 1 deletion, like a read/write lock. But I don't like locks - doing it without locks would be much better!
Currently, the archive storage engine API won't quite allow the intent log, anyway, since it just has an unlink!
function that returns false if the block still has references, and returns the contents of the block if it does not (so that the caller can then recursively unlink!
everything mentioned from within that block). So there's no easy way of asking the storage engine we're wrapping the cache around whether an unlink!
would delete a block without actually deleting it. But, we can make do without, at the cost of a little less safety; we can instead make the cache store an after-the-fact log of deleted block keys, and just upload them when it gets full.
So, I'm still not sure if we need the complexity of safe deletions. Are frequent deletions actually a good idea anyway? The neat thing about a content-addressed store is that it does work well as a continual archive, as it essentially stores differences. I decided to implement deletion since I know there will be situations where the thought of freeing up a hundred gigabytes will be more inviting than having a year's snapshots from a decade ago lying around; if I don't implement deletion, then users will forever be pestering me about it. So perhaps I should just leave deletion in as is, along with a warning (automatically output when a cache spots its first deletion request of a session) to the user that doing a deletion will invalidate any other caches around the same archive on different machines.
Still, one cheery thought struck me: if you're running a snapshot, and your computer crashes, then you can just start the snapshot again. We'll only update the snapshot tag when a snapshot is completed, so you won't get a partial snapshot; but when you start again, all the blocks you'd already uploaded will not need to be uploaded again, so it'll just pick up about where it left off last time. Excellent!