Generic indexed storage for TUNGSTEN (by alaric)
For TUNGSTEN, I plan to split the local storage manager into two layers: the physical manager and the logical managers. The physical manager takes a collection of block storage devices (such as hard disks) and a configuration advising which disks to use for what; it provides an administrative interface for modifying that configuration and monitoring usage, and an interface to the logical managers. The latter interface presents a single unified associative data store: the local store is divided into regions for each volume mirrored on the local node (by a numeric volume ID), which are in turn divided into regions for each entity (by a numeric entity ID), which are in turn divided into regions for each object within the entity (by a numeric object ID), which may then be structured as the logical manager responsible for that object sees fit. Within an object, the structure available to the logical manager is a set of records, each with a key and a value. The physical manager provides the ability to create, abort, or commit transactions, and within a transaction, to read, write, add, and remove records, including iterating through records by key range.
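To make that interface concrete, here's a minimal in-memory sketch in Python of the transactional record store described above. All names are hypothetical, and a real implementation would of course be backed by on-disk structures rather than a dict:

```python
class Transaction:
    """Buffers writes until commit; a removal is recorded as None."""

    def __init__(self, store):
        self._store = store
        self._writes = {}

    def read(self, key):
        if key in self._writes:
            return self._writes[key]
        return self._store._records.get(key)

    def write(self, key, value):
        self._writes[key] = value

    def remove(self, key):
        self._writes[key] = None

    def scan(self, lo, hi):
        """Iterate records with lo <= key < hi, seeing this
        transaction's uncommitted writes merged over committed state."""
        merged = dict(self._store._records)
        merged.update(self._writes)
        for key in sorted(merged):
            if lo <= key < hi and merged[key] is not None:
                yield key, merged[key]

    def commit(self):
        for key, value in self._writes.items():
            if value is None:
                self._store._records.pop(key, None)
            else:
                self._store._records[key] = value
        self._writes = {}

    def abort(self):
        self._writes = {}


class PhysicalManager:
    """Holds the committed records; keys are arbitrary byte strings."""

    def __init__(self):
        self._records = {}

    def begin(self):
        return Transaction(self)
```

Buffering writes inside the transaction until commit means readers of the committed state are undisturbed until the commit point, loosely in the spirit of the copy-on-write scheme discussed below.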
Logical managers use this facility to provide an application-level data structure. The simplest would just store a single IRON object that is loaded from TUNGSTEN or stored back as a whole, enclosed in a simple locking wrapper so that at most one transaction may hold a write lock, with copy-on-write so readers may read the current value even while it's write-locked. A more complex logical manager would implement something like an SQL table: a set of IRON objects as 'rows', with a list of attributes that should be indexed. In this case, the logical manager might store the objects themselves with a numeric row ID as the key, then represent the indices by storing records with the values of the indexed attributes as keys and lists of row IDs as values. Since IRON objects may be nested, the logical storage manager may in turn invoke other logical storage managers, letting them use the storage within the outer object's region of the store by making them use a prefix on their keys (such as the row ID in the last example).
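As a rough illustration of that indexed-table key layout, here's a Python sketch using a plain dict with tuple keys in place of the physical manager's record store; the function names and key tags are made up for the example:

```python
def insert_row(store, row_id, row, indexed_attrs):
    """Store the row under ("row", row_id); for each indexed attribute,
    append the row ID to the index record keyed by the attribute's
    name and value."""
    store[("row", row_id)] = row
    for attr in indexed_attrs:
        store.setdefault(("idx", attr, row[attr]), []).append(row_id)


def lookup(store, attr, value):
    """Return the IDs of rows whose indexed attribute equals value."""
    return store.get(("idx", attr, value), [])
```

In the real system the tuple components would be flattened into the record key, prefixed by the object's region, exactly as described above.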
The local storage manager is, in turn, isolated from application code by the distributed storage manager, which uses the local storage manager on each node, communicating with the other nodes via WOLFRAM, to present applications with a choosable trade-off between global consensus on the one hand and speed and availability when the network is partitioned on the other.
Now, my original plan for the physical local storage manager was to use a giant B-Tree, split across all the available disks, with copy-on-write propagation of changes up the tree to create a new root pointer, thereby providing ACID properties. The B-Tree would have variable-length entries, mapping a variable-length key to either an inline record value (if it's small) or a pointer to the record value on disk. Pointers to locations on disk, be they internal pointers to child nodes in the index tree or pointers to data, can be stored variable-length as well, making the system efficient on small disks while still handling large ones. Since the entries are variable-length, index nodes could be split on byte size rather than on the number of entries.
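One way to get variable-length pointers is a LEB128-style varint, where each byte carries seven bits of the offset and the high bit marks continuation. This is just a sketch of the idea, not necessarily the encoding the system would use:

```python
def encode_pointer(offset):
    """Encode a non-negative disk offset as a variable-length integer:
    seven payload bits per byte, high bit set on all but the last byte."""
    out = bytearray()
    while True:
        byte = offset & 0x7F
        offset >>= 7
        if offset:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def decode_pointer(data):
    """Decode a varint from the front of data; return (offset, length)."""
    offset = shift = 0
    for i, byte in enumerate(data):
        offset |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return offset, i + 1
        shift += 7
    raise ValueError("truncated pointer")
```

With this scheme a one-terabyte offset fits in six bytes, while offsets into a small disk take only one or two.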
The keys of the B-Tree would be composed by concatenating the volume ID, the entity ID, the object ID, and whatever string the logical storage manager wanted to use as its own key. Thus, the logical storage manager gets to use the B-Tree to handle the internal structure of the object; the whole system really is a single B-Tree. This makes access nice and efficient; the system can locate a small-grained record out of a huge disk array in very few disk accesses, especially since the upper levels of the B-Tree would tend to remain cached.
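For that concatenation to behave well, the ID components need an encoding under which byte-wise comparison preserves numeric order, so that regions stay contiguous in the tree; fixed-width big-endian integers are the simplest such encoding, used here for illustration (the variable-length encodings mentioned above would need to be order-preserving too). A hypothetical sketch:

```python
import struct


def compose_key(volume_id, entity_id, object_id, logical_key):
    """Concatenate big-endian fixed-width IDs with the logical manager's
    own key. Keys then sort first by volume, then entity, then object,
    and a prefix scan covers exactly one object's region."""
    return struct.pack(">QQQ", volume_id, entity_id, object_id) + logical_key
```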
I chose to expose the key-value nature of the B-Tree to the logical storage managers, since it's quite a universal abstraction upon which to build data structures; certainly much easier to work with than the POSIX approach of providing a byte vector "file".
However, I have since begun to worry about the one situation I know of in which a key-value mapping is insufficient for building data structures, one in which having access to a raw array of fixed-size blocks would be preferable: when you want to use multi-dimensional keys.
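To show the problem concretely: if points are keyed by a concatenated (x, y) pair, a range query on x alone selects one contiguous run of keys, but a range query on y alone selects keys scattered throughout the keyspace, forcing a full scan. A small Python demonstration (the helper function is made up for the example):

```python
def contiguous(matches, ordered):
    """True if the matching keys form one unbroken run in key order."""
    idxs = [ordered.index(k) for k in matches]
    return idxs == list(range(idxs[0], idxs[-1] + 1))


# A 4x4 grid of points keyed by (x, y); tuples compare like
# concatenated keys: by x first, then y.
points = sorted((x, y) for x in range(4) for y in range(4))

x_slice = [k for k in points if 1 <= k[0] <= 2]  # range on the first dimension
y_slice = [k for k in points if 1 <= k[1] <= 2]  # range on the second
```

The x query maps onto a single key-range scan, but the y query's matches interleave with non-matches across the whole key order, which is exactly why a flat key-value mapping struggles with multi-dimensional access.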