Log structured file systems revisited

I've been thinking about log structured transactional filesystems again, and I think I've come up with something potentially useful for making POSIX-style file systems as well as for the TUNGSTEN object store. The job of a filesystem can basically be boiled down to identifying disk blocks with some kind of numeric identifier. On a UNIX filesystem, every data block can be identified by the inode number of the file containing it, and the index of the block within the file. Something like the inode table can be handled within this model by making it a file itself; but you don't need to reference the inode table to locate a file, just for metadata. That way, you don't have any bootstrapping problems about finding the inode file itself.

This could be handled by a single giant B-Tree, with three kinds of block. The superblock contains the depth of the B-Tree and the block number of the root node. And, presumably, the inode numbers of the inode table file and the root directory, unless they are hardcoded. The nodes of the B-Tree contain pointers either to further levels of nodes or to data blocks - since the tree is balanced to have uniform height on all branches, the depth integer stored in the superblock can be used to count off levels as you walk down the tree, so you can tell immediately when the transition to data blocks occurs, rather than needing flags in the B-Tree itself.

The B-Tree index maps a pair - (inode number, logical block number) - to the physical block number where that logical block can be found.
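
As a rough sketch of that lookup (in Python; the Superblock and Node layouts, and the dict standing in for the disk, are all invented for illustration, not a real on-disk format):

    from bisect import bisect_right
    from dataclasses import dataclass

    @dataclass
    class Superblock:
        depth: int        # number of B-Tree levels between the root and the data
        root_block: int   # physical block number of the root node

    @dataclass
    class Node:
        keys: list        # keys[i] is the lowest (inode, logical block) under children[i]
        children: list    # physical block numbers: further nodes, or data at the last level

    def lookup(disk, sb, inode, logical_block):
        """Map (inode number, logical block number) to the physical data block."""
        key = (inode, logical_block)
        block_no = sb.root_block
        for _ in range(sb.depth):                    # count levels off the superblock
            node = disk[block_no]
            i = max(bisect_right(node.keys, key) - 1, 0)
            block_no = node.children[i]
        return block_no                              # now a data block, not a node

    # e.g. a one-level tree: the root's children point straight at data blocks
    disk = {10: Node(keys=[(7, 0), (7, 1)], children=[100, 101])}
    assert lookup(disk, Superblock(depth=1, root_block=10), 7, 1) == 101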

We can handle updates using familiar techniques from database systems: rather than updating a block in place, write the new value to a previously unused location, then update its parent in the B-Tree to point at the new location, and so on all the way up to a modified superblock. Writes can be buffered and executed in arbitrary order, except that the final write of the new superblock, once the changes in the B-Tree have rippled up to the root, must happen LAST to ensure consistency. Superblock updates can be handled specially; the update daemon can periodically execute a flush-write-cache-then-write-new-superblock request, and a call to fsync() can be handled either with such a global flush or with a partial flush that only writes out modifications made to that file (being careful, since there may be write-cached B-Tree blocks touched by changes to more than one file!), followed by the superblock.
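
A sketch of that copy-on-write path, reusing the toy Superblock/Node structures from the sketch above (the allocator here is a stand-in; real allocation would come from the free-space manager):

    import copy
    from bisect import bisect_right

    def allocate(disk):
        """Toy allocator: hand out a previously unused block number."""
        return max(disk, default=0) + 1

    def cow_update(disk, sb, inode, logical_block, new_data_block):
        """Record that (inode, logical_block) now lives at new_data_block,
        copying (never overwriting) every B-Tree node on the path to it."""
        key = (inode, logical_block)
        path, block_no = [], sb.root_block
        for _ in range(sb.depth):
            node = disk[block_no]
            i = max(bisect_right(node.keys, key) - 1, 0)
            path.append((copy.deepcopy(node), i))    # work on a copy, not the original
            block_no = node.children[i]
        # Ripple the new pointers up: each copied node is written to a fresh block
        child = new_data_block
        for node, i in reversed(path):
            node.children[i] = child
            child = allocate(disk)
            disk[child] = node
        # The new root only becomes live when this superblock is written out, LAST
        return Superblock(depth=sb.depth, root_block=child)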

We can also handle transactions. Let each transaction keep its own write buffer pool for data blocks; buffers can be flushed out to disk when necessary to free up memory, by allocating new blocks and writing the data there. Each transaction also keeps track of which blocks it has modified, recording for each one the block's identity (inode number and logical block number) along with either the new physical block number or, if the block hasn't been written yet, a pointer to its buffer.

Then, to commit the transaction, flush that transaction's entire write buffer, then atomically update the B-Tree from the transaction's modified-block list. Note that I avoid locking B-Tree blocks to transactions, since that harms concurrency; instead, the B-Tree locks are only held while the transaction is committing.
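
Continuing the same toy sketch, a commit might look like this; write_data_block stands in for the real allocate-and-write routine, and btree_lock is only held for the duration of the commit itself:

    from dataclasses import dataclass, field
    from threading import Lock

    @dataclass
    class Transaction:
        # (inode, logical block) -> bytes still buffered in RAM, or an int
        # physical block number if the buffer was already flushed to free memory
        dirty: dict = field(default_factory=dict)

        def write(self, inode, logical_block, data: bytes):
            self.dirty[(inode, logical_block)] = data

    def commit(disk, sb, txn, write_data_block, btree_lock: Lock):
        """Flush the transaction's remaining buffers, then atomically fold its
        modified-block list into the B-Tree (using cow_update from above)."""
        flushed = {}
        for key, value in txn.dirty.items():
            flushed[key] = value if isinstance(value, int) else write_data_block(value)
        with btree_lock:                      # held only while committing
            for (inode, lblock), pblock in flushed.items():
                sb = cow_update(disk, sb, inode, lblock, pblock)
        return sb                             # the caller writes this superblock LAST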

So what does this give us? We now have a (potentially transactional!) Unix-style file system. Any block of any file can be found in just a single B-Tree lookup; with 512 byte blocks, 32-bit inode numbers, and 64-bit (physical and logical) block numbers, we can fit 25 links inside each B-Tree block. So with a single index block, we can manage 25*512 = 12.5KB of storage; with two levels, we can handle 25 times that amount, or 312.5KB; with three levels, 7.5MB; with four levels, 190MB; with five levels, 4.5GB; with six levels, 116GB, and so on.
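
The arithmetic behind those figures, should you want to rerun it with a different block size or key width:

    BLOCK = 512                  # bytes per block
    LINK = 4 + 8 + 8             # 32-bit inode + 64-bit logical + 64-bit physical block number
    FANOUT = BLOCK // LINK       # = 25 links per index block

    for levels in range(1, 7):
        capacity_kib = (FANOUT ** levels) * BLOCK / 1024
        print(f"{levels} level(s): {capacity_kib:,.1f} KiB")
    # 12.5 KiB; 312.5 KiB; ~7.6 MiB; ~191 MiB; ~4.7 GiB; ~116 GiB --
    # i.e. the figures quoted above, give or take rounding.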

With six levels of B-Tree, this means that any block of any file can be found with one lookup to get the superblock, six lookups to walk down the tree, then a lookup for the data - eight lookups to find any block out of 116GB, given the inode number and logical block number. In practice, the top few levels of the B-Tree will end up entirely cached, bringing the actual lookup count down. And if a file is opened, then the B-Tree index blocks for the entire file will end up cached; as above, for a 12.5KB file, only one or two index blocks will need caching; for a 312.5KB file, about 26 blocks of index in the cache will enable us to find which physical block contains any logical block of the file without disk access. Etc.

Oh, and since it provides atomic updates by rippling changes up the tree, it will never need to 'recover' from being unmounted uncleanly. A single superblock update switches the filesystem from one consistent state to the next consistent state. To avoid problems with a power failure during a superblock write, one can use the usual trick of having two or more superblocks which are updated in sequence; if, when the filesystem is mounted, the superblock is corrupted, then the next one is tried instead until a valid one is found. At which point, we have a consistent filesystem tree hanging off of it.
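
A minimal sketch of that mount-time check, assuming each superblock copy carries a trailing checksum (the checksum and layout are my assumption here, not specified above):

    import zlib

    def pick_superblock(copies):
        """Try the superblock copies in the order they are written out; return the
        first one whose checksum verifies, skipping any torn or corrupted copy."""
        for raw in copies:
            payload, stored = raw[:-4], int.from_bytes(raw[-4:], "little")
            if zlib.crc32(payload) == stored:
                return raw          # a consistent filesystem tree hangs off this one
        raise IOError("no valid superblock found")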

And did you notice that it's using a 512-byte block size? That means low wastage of space due to block granularity!

However, you will need some form of garbage collector to reclaim unused storage. As data and index blocks are 'updated' by new copies being written elsewhere, the blocks containing the obsolete versions become reclaimable as soon as the new superblock referencing the new filesystem state has been written. Some system process needs to identify them and somehow mark them as free for new allocations.

Here, I had another interesting idea...

6 Comments

  • By alaric, Sun 5th Sep 2004 @ 7:02 pm

    A few improvements have come to mind since writing the above and mulling it over at lunch today.

    1. It shouldn't be too hard to come up with a scheme where the leaf links in the bottom level of the B-Tree point to extents of data blocks; e.g., rather than saying "Logical block 14 of file 7 is over there", it would say "Logical blocks 14-500 of file 7 start over there". This way, large contiguous regions written to files (which are quite common!) can be written contiguously at the allocation pointer and the index updated to suit. If a block in the middle of an existing extent is modified, then the index entry for the extent will need to be split into two extents, skipping the changed block - and in between them, a single-block pointer to the new version of the block (see the sketch after this list). This would reduce the size of the index considerably if most files are written contiguously.
    2. For some noncritical data, all the effort of copying blocks on update to avoid corrupting them if a disk failure occurs during writing, and to ensure consistent snapshots for reads during the update, is unnecessary. For example, caches. As such, it would seem logical to provide an interface for the application to overwrite a block of a file 'unsafely', causing an in-place update of the data blocks in question, but one that might leave the file corrupted if power fails during the write. Luckily, this would be really easy to do; just walk the index to find the current data block, and overwrite it. Ok, it means we have to seek around to do the write, but it can be elevator-scheduled like a read.
    3. Note that if we have uncommitted transactions that have been forced to write some of their dirty blocks out to disk to free up buffer space, then the cleaner must know that those dirty blocks are in use. As such, the cleaner should scan the transaction dirty buffer pools as well as doing its tree traversal to find live blocks to relocate.
    4. It occurred to me that it would be really easy to support snapshots of the filesystem state, for short-term "oops I overwrote my stuff" rescues, by storing (say) a list of references to past index roots in the superblock. Or having a linked list of snapshot blocks coming from the superblock, each containing a snapshot name and date, and the pointer to the index root for that snapshot. This then means that the cleaner has to take them into account when doing its tree walk to find live blocks, obviously.
    5. The snapshots of the filesystem would conventionally be read only, but there's actually nothing stopping you making them writable. After all, any updated blocks would just copy-on-write, either leaving the original snapshot unmolested and creating a new root, or updating the snapshot's root pointer. With the former option, you could branch a filesystem, CVS style. Probably potentially useful, somehow...
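
    A sketch of the extent split from point 1, representing an extent as (first logical block, length, first physical block) - the representation is invented here:

        def split_extent(extent, changed_lblock, new_pblock):
            """Rewrite one block in the middle of an extent, returning the
            replacement extents: before-part, the new single block, after-part."""
            first_l, count, first_p = extent
            assert first_l <= changed_lblock < first_l + count
            offset = changed_lblock - first_l
            out = []
            if offset > 0:                                # blocks before the change
                out.append((first_l, offset, first_p))
            out.append((changed_lblock, 1, new_pblock))   # the freshly written block
            if count - offset - 1 > 0:                    # blocks after the change
                out.append((changed_lblock + 1, count - offset - 1, first_p + offset + 1))
            return out

        # "Logical blocks 14-500 of file 7 start at physical block 9000", block 20 rewritten:
        assert split_extent((14, 487, 9000), 20, 123456) == [
            (14, 6, 9000), (20, 1, 123456), (21, 480, 9007)]
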
  • By alaric, Tue 7th Sep 2004 @ 11:37 am

    Another idea I've had is a better allocation strategy for handling filesystems split over multiple disks.

    Instead of having the allocation and clean pointers extend over a giant continuous space consisting of all the disks concatenated together, with big gaps in it to allow disks to be swapped in and out without messing with the numbering, one could instead have each partition keep its own alloc and clean pointers. When wishing to write out some data, the filesystem can then use any of a number of techniques to decide which disk(s) to put the data onto. For a start, a write could be sent to whichever disk has the most free space, or it could be split evenly across all disks currently accepting new writes (e.g., not ones currently being cleaned out because they're about to be removed) to make more efficient use of disk bandwidth. Or all index blocks could go to a special disk dedicated to them, so the cleaner's tree walk only uses up read bandwidth on one disk. Or the write could go to whichever disk appears likely to be the fastest to write to, based on the estimated current head position compared to the disk's allocation pointer, and the disk's physical write speed.
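
    As a sketch of the simplest of those policies (per-disk pointers, pick the disk with the most free space; the structure and the free-space estimate are invented):

        from dataclasses import dataclass

        @dataclass
        class Disk:
            name: str
            size: int                # total blocks
            alloc_ptr: int           # where the next log write will go
            clean_ptr: int           # oldest block not yet reclaimed by the cleaner
            accepting: bool = True   # False while the disk is being drained for removal

            def free_blocks(self):
                # space between the allocation pointer and the cleaner, wrapping around
                return (self.clean_ptr - self.alloc_ptr) % self.size

        def choose_disk(disks):
            """Send the write to the accepting disk with the most free space."""
            candidates = [d for d in disks if d.accepting]
            return max(candidates, key=lambda d: d.free_blocks())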

    The cleaner would be more interesting. When running, it could clear up space on all connected disks in a single walk of the index; each disk would, as well as having a clean pointer, also have a target clean pointer. Blocks lying between their disk's clean pointer and target pointer would then be relocated.

  • By alaric, Mon 20th Sep 2004 @ 12:07 am

    Looks like Sun Microsystems read my blog:

    http://www.sun.com/2004-0914/feature/

    🙂

    What does ZFS have that I've not thought of? Their striping technique is probably similar in effect to my revised multiple-disk allocation system. I was anticipating leaving mirroring to an underlying RAID layer; I don't think there's any benefit to putting it in the volume manager itself - except perhaps for slightly easier system administration.

    128 bit block numbers, eh? I was wondering about using variable-length integers, so you can have huge block numbers when dealing with huge disk pools, but only need 40 bits for filesystems under 512GB.

    Compression? I'd considered that (since writes are done in contiguous segments, one could compress each segment then use a "block number of start of compressed segment, index into compressed segment" address scheme), but decided it wasn't worthwhile; it makes the system more complex.

    I think their "variable sized blocks" feature is equivalent to my notion of tracking extents of data blocks in the index, rather than just individual blocks.

    But it looks annoyingly like ZFS is pretty damned similar to my idea - grrr! That's very demoralising. Why do I bother having neat ideas barely days before large companies announce the same idea, so everyone thinks I'm just a copycat? sigh 🙁

  • By John Everitt, Tue 21st Sep 2004 @ 11:10 am

    But it looks annoyingly like ZFS is pretty damned similar to my idea - grrr! That's very demoralising. Why do I bother having neat ideas barely days before large companies announce the same idea, so everyone thinks I'm just a copycat? *sigh* 🙁

    Or worse :(. Infringing Software Patents...

  • By alaric, Tue 21st Sep 2004 @ 9:32 pm

    Indeed! Luckily, my FS idea is really just a combination of existing ideas in databases, file systems, and garbage collecting memory managers... Sun say they have some patent-pending aspects of ZFS, so I'm pretty confident that if any of that overlaps my FS I can point at my sources and say "PRIOR ART!"...

  • By alaric, Thu 23rd Sep 2004 @ 11:15 pm

    Ok, so I've had a bit of a sidetrack thinking how this could be used as a filesystem with POSIX semantics, but of course, for ARGON, it needs to be something different.

    A useful abstraction of the filesystem to expose to the TUNGSTEN storage class layer is that each entity (identified by a serial number) is composed of objects (also identified by serial numbers), and that each object is managed by a particular storage class. Each object contains a key:value map, with both keys and values being arbitrary byte strings.

    The storage class can request an entry from within the object by specifying the key, write to the object by specifying a key and a value, delete an entry given a key, or iterate through the keys. All of this happens within the context of a potentially nested transaction. To create a new object, just ask WOLFRAM for a new cluster-wide unique serial number, and write some mappings using that serial number as the object ID. To delete an object, just empty it out.

    This is, of course, implemented on disk as a B-Tree. Each B-Tree key combines the entity ID, the object ID, and either the mapping key itself inline (if it's small) or the precomputed hash of the real mapping key plus a pointer to the key elsewhere on disk; each B-Tree value is either the mapping value inline (if it's small) or a pointer to the mapping value elsewhere on disk. The precomputed hash is used to speed up equality tests.
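
    A sketch of that key encoding; the inline-size threshold, the tag byte, and the choice of hash are all assumptions of mine:

        import hashlib
        import struct
        from typing import Optional

        INLINE_LIMIT = 64      # assumed threshold: larger mapping keys go out-of-line

        def encode_btree_key(entity_id: int, object_id: int, mapping_key: bytes,
                             overflow_ptr: Optional[int] = None) -> bytes:
            """Build the composite B-Tree key: entity ID, object ID, then either
            the mapping key inline or (hash of the key, pointer to it on disk)."""
            prefix = struct.pack(">QQ", entity_id, object_id)
            if len(mapping_key) <= INLINE_LIMIT:
                return prefix + b"\x00" + mapping_key
            assert overflow_ptr is not None, "a large key must already be written out"
            digest = hashlib.sha1(mapping_key).digest()   # used to speed up equality tests
            return prefix + b"\x01" + digest + struct.pack(">Q", overflow_ptr)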

    The system writes to disk sequentially, with an allocation pointer and a cleaning pointer onto each disk in the system and a cleaner process that advances the cleaning pointers on request, as mentioned above.

    The system keeps, for each transaction, a hash map of modified mappings, binding the entity ID / object ID / key to a value. The key and the value may be arbitrary byte vectors, but either can be flushed to disk to free up RAM for other functions; if the key is flushed, then it is represented in the hash map by a hash of its contents (so the hash of the EID/OID/key triplet can still be computed), and a pointer to the on-disk copy. If the value is flushed, then the pointer to its on-disk copy is stored.

    If so many mappings are modified in one transaction that the hash map gets too big, then pages of it can be sent to a swap area.

    Therefore, there are three kinds of disk write:

    1. Where we flush out some uncommitted data to free up RAM. In that case, we write a series of keys or values, in a single write. The lower level disk space manager must reveal the position of the write pointer before the write commences, so the system can work out the correct pointers to the starts of all the byte vectors written out in the run. For reasons to be explained later, it is not sufficient for the system to tell us (after the write) what went where; it must be possible to compute the pointers BEFORE the write.
    2. When we commit a transaction. In this case, any uncommitted keys or values are written out, followed immediately by the changed B-Tree index blocks. To do all of this in a single write, we need to know the new pointers to store in the new B-Tree index nodes; thus the requirement to know the allocation pointer BEFORE the write, so that all the new locations can be computed in one go and stored in the index nodes (see the sketch after this list). At the end of this write, the system then needs to write a new superblock to all the superblock mirror locations in the system, containing the new allocation pointer and the new B-Tree index root pointer.
    3. When we relocate some data due to cleaning. This is very similar to the commit case; we need to write out the relocated keys and values, then all the B-Tree index nodes that need updating to reflect the relocation. At the end of the write, the system then needs to write a new superblock to all the mirrors, containing the new allocation pointer, the new cleaning pointer, and the new B-Tree index root pointer.
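
    A sketch of the pointer pre-computation behind cases 1 and 2: given the allocation pointer before the write and the sizes of everything going out, every on-disk location can be worked out up front, so the new index nodes can be built with the right pointers and the whole lot written in one go (the block size and framing are assumed here):

        BLOCK = 512

        def plan_offsets(alloc_ptr, sizes):
            """Return the on-disk byte offset each piece will land at, plus the
            block-aligned allocation pointer for the write after this one."""
            offsets, cursor = [], alloc_ptr
            for size in sizes:
                offsets.append(cursor)
                cursor += size
            return offsets, cursor + (-cursor) % BLOCK   # pad the tail to a block boundary

        # e.g. two values, one key and one index node, written at a block-aligned pointer:
        offsets, new_alloc_ptr = plan_offsets(1_024_000, [300, 4096, 17, 180])
        assert offsets == [1_024_000, 1_024_300, 1_028_396, 1_028_413]
        assert new_alloc_ptr == 1_028_608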

    However, we need not allocate data in units of disk blocks. Although the disk can only atomically write in units of blocks, so we need to align writes to block boundaries, we can use byte offsets as actual pointers to on-disk data. The allocation and cleaning pointers will always be aligned to physical disk block boundaries. When a write is done, the system will pad the end of the data being written to a block boundary, to preserve this alignment.

    This has interesting consequences. For a start, we are not constrained to fixed size B-Tree nodes; we should still have a size limit on them, for ease of management and bounded processing times, but we need not worry about wasting space at the ends of disk blocks on every node. Also, when we write a series of index nodes, mapping keys, and mapping values, we concatenate them (perhaps aligned to machine words rather than packed at the byte level) rather than rounding each up to individual blocks. This saves space!

    Providing the storage classes with an interface to the object as a dictionary mapping arbitrary keys to arbitrary values makes life a lot simpler for them than if they had POSIX-style files.

    For an array storage class, it can use the array indices as keys and store the elements in the corresponding values. For small elements, it might be worth storing a page of elements in each mapping; divide the index by some suitable constant to find the page number, which is used as the key, with a series of elements in the value. This instantly gets you a sparse array that can provide access and update in trivial time (logarithmic in the size of the entire system, but with significant savings due to index nodes being cached in RAM).
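
    For instance (with an invented page size and element size), the page key and in-page offset for a given array index would just be:

        ELEMENT_SIZE = 8       # assumed: fixed 8-byte elements
        PAGE_ELEMENTS = 64     # assumed: elements packed per mapping value

        def locate(index):
            """Which mapping key holds array element `index`, and where within its value?"""
            page, slot = divmod(index, PAGE_ELEMENTS)
            return page.to_bytes(8, "big"), slot * ELEMENT_SIZE

        # element 1000 lives in the page-15 mapping, at byte offset 320 of its value
        assert locate(1000) == ((15).to_bytes(8, "big"), 320)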

    For a basic "single cell" storage class, it can just use a single mapping from a constant key (zero length byte array), with the contents of the cell as the value.

    Because our mapping keys are variable-length, we can trivially implement multiple independent "virtual dictionaries" within the object, by using the ID of the virtual dictionary concatenated with the virtual key as the actual key, and storing the virtual mapping value directly in the actual value. This can be used to implement a storage class like an SQL table, with the rows themselves in one virtual dictionary, keyed on a row serial number, and then other virtual dictionaries containing indexes (each mapping an indexed field value to the list of serial numbers of rows with that value in the indexed field).
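
    A sketch of that key scheme; the length prefix on the virtual dictionary ID is my addition, just to keep the concatenation unambiguous:

        import struct

        def real_key(vdict_id: bytes, virtual_key: bytes) -> bytes:
            """Combine a virtual dictionary ID and a virtual key into one actual
            mapping key; length-prefixing the ID keeps the pairing unambiguous."""
            return struct.pack(">H", len(vdict_id)) + vdict_id + virtual_key

        # an SQL-style table: rows in one virtual dictionary, an index in another
        row_key   = real_key(b"rows", (42).to_bytes(8, "big"))    # row with serial number 42
        index_key = real_key(b"ix:surname", b"Smith")             # index entry for "Smith"
        assert row_key != index_key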

    A dictionary is such a natural programming abstraction for storage... it makes the storage classes so much simpler to implement compared to POSIX-style stream files, and we can very efficiently provide that abstraction by using a B-Tree!

    This should make it easy to provide a wide range of storage classes, suited to different applications, with relatively little effort. The SQL-style table mentioned above would clearly take relatively little code, for example.

    I can't wait to implement it! 🙂

Creative Commons Attribution-NonCommercial-ShareAlike 2.0 UK: England & Wales