Log-structured file systems revisited (by alaric)
So what does this mean for ARGON? Well, it could provide a lovely local filestore for TUNGSTEN. Index blocks on a triple of the entity number, the object number, and the logical block number.
Store metadata about the entity itself in one or more special objects in the entity, with well-known hardcoded object numbers. Store metadata about the whole system in the node entity; store the node entity number in the superblock so you can find it when booting up.
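For concreteness, here's a minimal sketch of packing that triple into an index key; the fixed 64-bit widths and big-endian layout are just one plausible encoding, not anything the design above actually fixes:

```python
import struct

def index_key(entity: int, obj: int, block: int) -> bytes:
    # ">QQQ" = three big-endian 64-bit unsigned integers; big-endian means a
    # plain byte-wise comparison sorts first by entity, then object, then block,
    # which is exactly the order the index wants.
    return struct.pack(">QQQ", entity, obj, block)

# All blocks of one object sort together, and all objects of one entity sort together.
assert index_key(1, 2, 3) < index_key(1, 2, 4) < index_key(1, 3, 0) < index_key(2, 0, 0)
```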
And there you have it!
6 Comments
By alaric, Sun 5th Sep 2004 @ 7:02 pm
A few improvements have come to mind since writing the above and mulling it over lunch today.
By alaric, Tue 7th Sep 2004 @ 11:37 am
Another idea I've had is a better allocation strategy for handling filesystems split over multiple disks.
Instead of having the allocation and clean pointers extend over one giant continuous space consisting of all the disks concatenated together, with big gaps in the numbering to allow disks to be swapped in and out without renumbering everything, each partition could instead have its own alloc and clean pointers. When wishing to write out some data, the filesystem can then use any of a number of techniques to decide which disk(s) to put the data onto. For a start, a write could be sent to whichever disk has the most free space, or it could be split evenly across all disks currently accepting new writes (e.g. not ones currently being cleaned out because they're about to be removed) to make more efficient use of disk bandwidth. Or all index blocks could go to a special disk dedicated to them, so the cleaner's tree walk only uses up read bandwidth on one disk. Or the write could go to whichever disk looks like it will be fastest to write to, based on the estimated current head position compared to the disk's allocation pointer, and the disk's physical write speed.
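A rough sketch of the per-disk pointer idea; the `Disk` record and the two selection policies below are illustrative guesses at how it might look, not anything from the design itself:

```python
from dataclasses import dataclass

@dataclass
class Disk:
    name: str
    size: int                       # total bytes
    alloc_ptr: int                  # next write position in the circular log
    clean_ptr: int                  # everything before this has been cleaned out
    accepting_writes: bool = True   # False while being drained prior to removal

    def free_space(self) -> int:
        # In a circular log the free region lies between the allocation
        # pointer and the cleaning pointer.
        return (self.clean_ptr - self.alloc_ptr) % self.size

def pick_most_free(disks):
    """Policy 1: send the whole write to the emptiest disk."""
    candidates = [d for d in disks if d.accepting_writes]
    return max(candidates, key=Disk.free_space)

def split_evenly(disks, nbytes):
    """Policy 2: stripe the write across every disk currently accepting writes."""
    candidates = [d for d in disks if d.accepting_writes]
    share = nbytes // len(candidates)
    return [(d, share) for d in candidates]
```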
The cleaner would be more interesting. When running, it could clear up space on all connected disks in a single walk of the index; each disk would, as well as having a clean pointer, also have a target clean pointer. Any block lying between the clean pointer and the target clean pointer of the disk that holds it would then be relocated.
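A sketch of that cleaner; the function and attribute names, and the shape of the `relocate` callback, are hypothetical, and wrap-around of the circular log is ignored for brevity:

```python
def clean(disks, live_blocks, relocate):
    """`live_blocks` yields (key, disk, offset) for every live block found in one
    walk of the index; `relocate` copies a block to some disk's allocation
    pointer and updates the index to point at the new copy."""
    for key, disk, offset in live_blocks:
        if disk.clean_ptr <= offset < disk.target_clean_ptr:
            relocate(key, disk, offset)
    for disk in disks:
        disk.clean_ptr = disk.target_clean_ptr   # the cleaned region is now reusable
```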
By alaric, Mon 20th Sep 2004 @ 12:07 am
Looks like Sun Microsystems read my blog:
http://www.sun.com/2004-0914/feature/
🙂
What does ZFS have that I've not thought of? Their striping technique is probably similar in effect to my revised multiple-disk allocation system. I was anticipating leaving mirroring to an underlying RAID layer; I don't think there's any benefit to putting it in the volume manager itself - except perhaps for slightly easier system administration.
128-bit block numbers, eh? I was wondering about using variable-length integers, so you can have huge block numbers when dealing with huge disk pools, but only need 40 bits for filesystems under 512GB.
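If it helps make that concrete, here's a throwaway sketch of one possible variable-length encoding (a base-128 varint, which is just one of many ways to do it):

```python
def encode_varint(n: int) -> bytes:
    # 7 bits of the number per byte, high bit set while more bytes follow.
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        out.append(byte | (0x80 if n else 0x00))
        if not n:
            return bytes(out)

def decode_varint(data: bytes) -> int:
    n, shift = 0, 0
    for byte in data:
        n |= (byte & 0x7F) << shift
        shift += 7
        if not (byte & 0x80):
            break
    return n

# A 40-bit block number costs 6 bytes; a full 128-bit one would need 19.
assert decode_varint(encode_varint((1 << 40) - 1)) == (1 << 40) - 1
```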
Compression? I'd considered that (since writes are done in contiguous segments, one could compress each segment then use a "block number of start of compressed segment, index into compressed segment" address scheme), but decided it wasn't worthwhile; it makes the system more complex.
I think their "variable sized blocks" is equivalent to my notion of tracking extents of data blocks in the index, rather than just individual blocks.
But it looks annoyingly like ZFS is pretty damned similar to my idea - grrr! That's very demoralising. Why do I bother having neat ideas barely days before large companies announce the same idea, so everyone thinks I'm just a copycat? sigh 🙁
By John Everitt, Tue 21st Sep 2004 @ 11:10 am
Or worse :(. Infringing Software Patents...
By alaric, Tue 21st Sep 2004 @ 9:32 pm
Indeed! Luckily, my FS idea is really just a combination of existing ideas in databases, file systems, and garbage collecting memory managers... Sun say they have some patent-pending aspects of ZFS, so I'm pretty confident that if any of that overlaps my FS I can point at my sources and say "PRIOR ART!"...
By alaric, Thu 23rd Sep 2004 @ 11:15 pm
Ok, so I've been on a bit of a sidetrack thinking about how this could be used as a filesystem with POSIX semantics, but of course, for ARGON, it needs to be something different.
What would be a useful abstraction of the filesystem to expose to the TUNGSTEN storage class layer is that each entity (identified by a serial number) is composed of objects (also identified by serial numbers), and that each object is managed by a particular storage class. Each object contains a key:value map, with both keys and values being arbitrary byte strings.
The storage class can request an entry from within the object by specifying the key, write to the object by specifying a key and a value, delete an entry given a key, or iterate through the keys. All of this happens within the context of a potentially nested transaction. To create a new object, just ask WOLFRAM for a new cluster-wide unique serial number, and write some mappings using that serial number as the object ID. To delete an object, just empty it out.
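A hedged sketch of what that might look like from a storage class's point of view; the class and method names here (ObjectHandle, get/put/delete/keys, wolfram_new_id, the store's lookup/write/erase/scan) are made-up stand-ins, not a real TUNGSTEN or WOLFRAM API:

```python
import itertools

# Made-up stand-in for WOLFRAM's cluster-wide unique serial numbers.
_fake_ids = itertools.count(1)
def wolfram_new_id() -> int:
    return next(_fake_ids)

class ObjectHandle:
    """One object within an entity, seen as a key:value map of byte strings."""
    def __init__(self, store, entity_id: int, object_id: int):
        self.store, self.entity_id, self.object_id = store, entity_id, object_id

    def get(self, txn, key: bytes) -> bytes:
        # Fetch the value bound to `key`, within (possibly nested) transaction `txn`.
        return self.store.lookup(txn, self.entity_id, self.object_id, key)

    def put(self, txn, key: bytes, value: bytes) -> None:
        self.store.write(txn, self.entity_id, self.object_id, key, value)

    def delete(self, txn, key: bytes) -> None:
        self.store.erase(txn, self.entity_id, self.object_id, key)

    def keys(self, txn):
        # Iterate through all keys currently present in this object.
        return self.store.scan(txn, self.entity_id, self.object_id)

def create_object(store, entity_id: int) -> ObjectHandle:
    # Creating an object is just "get a fresh serial number and start writing
    # mappings under it"; deleting one is just emptying it out.
    return ObjectHandle(store, entity_id, wolfram_new_id())
```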
This is, of course, implemented on disk as a B-Tree. Each entry's key combines the entity ID, the object ID, and either an inline mapping key (if it's small) or the precomputed hash of the real mapping key plus a pointer to the key elsewhere on disk; the entry then carries either an inline mapping value (if it's small) or a pointer to the mapping value elsewhere on disk. The precomputed hash is used to speed up equality tests.
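A sketch of that key layout; the inline-size threshold, the choice of SHA-1, and the tag bytes are all arbitrary illustrative choices, not the on-disk format:

```python
import hashlib
import struct

INLINE_KEY_LIMIT = 64   # arbitrary cut-off for keeping the mapping key inline

def btree_key(entity_id: int, object_id: int, mapping_key: bytes, key_disk_ptr=0) -> bytes:
    """Build a B-Tree entry key: entity ID, object ID, then either the mapping
    key itself (tag 0) or its hash plus a pointer to the full key on disk (tag 1)."""
    prefix = struct.pack(">QQ", entity_id, object_id)
    if len(mapping_key) <= INLINE_KEY_LIMIT:
        return prefix + b"\x00" + mapping_key
    digest = hashlib.sha1(mapping_key).digest()        # the precomputed hash
    return prefix + b"\x01" + digest + struct.pack(">Q", key_disk_ptr)

# The hash lets an equality test reject almost every non-matching long key
# without touching the disk; only on a hash match is the full key fetched.
```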
The system writes to disk sequentially, with an allocation pointer and a cleaning pointer for each disk in the system and a cleaner process that advances the cleaning pointers on request, as mentioned above.
The system keeps, for each transaction, a hash map of modified mappings, binding the entity ID / object ID / key to a value. The key and the value may be arbitrary byte vectors, but either can be flushed to disk to free up RAM for other functions; if the key is flushed, then it is represented in the hash map by a hash of its contents (so the hash of the EID/OID/key triplet can still be computed), and a pointer to the on-disk copy. If the value is flushed, then the pointer to its on-disk copy is stored.
If so many mappings are modified in one transaction that the hash map gets too big, then pages of it can be sent to a swap area.
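A sketch of that per-transaction structure; the class and field names are hypothetical, and `write_to_log` stands in for whatever actually appends bytes to the log and returns their on-disk location:

```python
import hashlib
import struct
from dataclasses import dataclass
from typing import Optional

@dataclass
class Binding:
    key_hash: bytes            # hash over EID/OID/key, always kept in RAM
    key: Optional[bytes]       # None once the key bytes have been flushed to disk
    key_ptr: Optional[int]     # where the flushed key lives on disk
    value: Optional[bytes]     # None once the value has been flushed
    value_ptr: Optional[int]   # where the flushed value lives on disk

class TransactionMap:
    """Modified mappings for one transaction, keyed by hash so that a binding
    whose key has been flushed to disk can still be found."""
    def __init__(self):
        self.bindings = {}

    def put(self, eid: int, oid: int, key: bytes, value: bytes) -> None:
        h = hashlib.sha1(struct.pack(">QQ", eid, oid) + key).digest()
        self.bindings[h] = Binding(h, key, None, value, None)

    def flush_value(self, h: bytes, write_to_log) -> None:
        # Push one value out to the log to free RAM, keeping only its pointer.
        b = self.bindings[h]
        if b.value is not None:
            b.value_ptr = write_to_log(b.value)
            b.value = None
```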
Therefore, there are three kinds of disk write:
However, we need not allocate data in units of disk blocks. Although the disk can only atomically write in units of blocks, so writes themselves need to be aligned to block boundaries, we can use byte offsets as the actual pointers to on-disk data. The allocation and cleaning pointers will always be aligned to physical disk block boundaries. When a write is done, the system will pad the end of the data being written out to a block boundary, to preserve this alignment.
This has interesting consequences. For a start, we are not constrained to fixed-size B-Tree nodes; we should still have a size limit on them, for ease of management and bounded processing times, but we need not worry about wasting space at the ends of disk blocks on every node. Also, when we write a series of index nodes, mapping keys, and mapping values, we concatenate them (aligned perhaps to machine words rather than to individual bytes) rather than rounding each up to whole blocks. This saves space!
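A sketch of writing one such concatenated run, assuming a 4KB physical block size and a hypothetical disk object with a block-aligned `alloc_ptr` and a `write_at(offset, data)` method:

```python
BLOCK = 4096   # assumed physical block size

def write_segment(disk, pieces):
    """Concatenate `pieces` (byte strings), remember each piece's byte offset,
    and pad only the tail of the whole write out to a block boundary.
    Returns the byte offset assigned to each piece."""
    assert disk.alloc_ptr % BLOCK == 0       # allocation pointer stays block-aligned
    offsets, cursor = [], disk.alloc_ptr
    for p in pieces:
        offsets.append(cursor)               # pointers are plain byte offsets
        cursor += len(p)
    data = b"".join(pieces)
    pad = (-len(data)) % BLOCK               # pad the tail so the next write
    disk.write_at(disk.alloc_ptr, data + b"\x00" * pad)   # starts block-aligned
    disk.alloc_ptr = cursor + pad
    return offsets
```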
Providing the storage classes with an interface to the object as a dictionary mapping arbitrary keys to arbitrary values makes life a lot simpler for them than if they had POSIX-style files.
For an array storage class, it can use the array indices as keys and store the elements in the corresponding values. For small elements, it might be worth storing a page of elements in each mapping: divide the index by some suitable constant to find the page number, which is used as the key, with a run of elements in the value. This instantly gets you a sparse array that provides access and update in trivial time (logarithmic in the size of the entire system, with significant savings because index nodes are cached in RAM).
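A sketch of that paging scheme, using the hypothetical object interface from above and assuming `get` returns None for an absent key; the page size and fixed element size are purely illustrative:

```python
import struct

ELEMENTS_PER_PAGE = 128    # the "suitable constant" - purely illustrative
ELEMENT_SIZE = 8           # assume fixed 8-byte elements for this sketch

def array_set(obj, txn, index: int, element: bytes) -> None:
    page_no, slot = divmod(index, ELEMENTS_PER_PAGE)
    key = struct.pack(">Q", page_no)                  # the page number is the mapping key
    page = bytearray(obj.get(txn, key) or bytes(ELEMENTS_PER_PAGE * ELEMENT_SIZE))
    page[slot * ELEMENT_SIZE:(slot + 1) * ELEMENT_SIZE] = element
    obj.put(txn, key, bytes(page))

def array_get(obj, txn, index: int) -> bytes:
    page_no, slot = divmod(index, ELEMENTS_PER_PAGE)
    page = obj.get(txn, struct.pack(">Q", page_no))   # absent page reads as zeroes: sparse
    if page is None:
        return bytes(ELEMENT_SIZE)
    return page[slot * ELEMENT_SIZE:(slot + 1) * ELEMENT_SIZE]
```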
For a basic "single cell" storage class, it can just use a single mapping from a constant key (a zero-length byte array), with the contents of the cell as the value.
Because our mapping keys are variable-length, we can trivially implement multiple independent "virtual dictionaries" within the object by using the ID of the virtual dictionary concatenated with the virtual key as the actual key, and storing the virtual mapping value directly in the actual value. This can be used to implement a storage class like an SQL table, with the rows themselves in one virtual dictionary, keyed on a row serial number, and then other virtual dictionaries containing indexes (mapping directly from keys to lists of row serial numbers with that key in the indexed field).
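A sketch of the prefixing trick, again on the hypothetical object interface above; the dictionary IDs, encodings, and helper names are illustrative only:

```python
import struct

def vkey(dict_id: int, key: bytes) -> bytes:
    # Actual B-Tree mapping key = virtual dictionary ID ++ virtual key.
    return struct.pack(">I", dict_id) + key

ROWS, NAME_INDEX = 0, 1    # two virtual dictionaries inside one "table" object

def insert_row(obj, txn, row_id: int, name: bytes, row_bytes: bytes) -> None:
    # The row itself, keyed on its serial number...
    obj.put(txn, vkey(ROWS, struct.pack(">Q", row_id)), row_bytes)
    # ...and an index entry: name -> list of row serial numbers with that name.
    existing = obj.get(txn, vkey(NAME_INDEX, name)) or b""
    obj.put(txn, vkey(NAME_INDEX, name), existing + struct.pack(">Q", row_id))
```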
A dictionary is such a natural programming abstraction for storage... it makes the storage classes so much simpler to implement compared to POSIX-style stream files, and we can very efficiently provide that abstraction by using a B-Tree!
This should make it easy to provide a wide range of storage classes, suited to different applications, with relatively little effort. The SQL-style table mentioned above would clearly take relatively little code, for example.
I can't wait to implement it! 🙂