Memory management (by alaric)
A lot of the core infrastructure in a computer system - such as memory management - is somewhat overlooked for opportunities to improve things, because a vicious circle of conservatism sets in: we tune memory manages for systems that work like POSIX, and then stick with systems that work like POSIX since the memory management is so well-tuned...
So I set myself the task of designing, from scratch, an approach to memory management that's appropriate to an event-driven system based around a purely functional language with mutation expressed via unique typing.
It's far from complete, but here's a brain dump:
HELIUM memory management
Assumptions about processes
- A process is a pure function. When started, it is given some inputs from its parent process, which can be summarised as a single pointer and all objects reachable from that pointer. When terminated, it returns some outputs to its parent, which again can be summarised as a single pointer and all objects reachable. There can be other communications between processes, but they'll be represented as apparently unconnected sets of 'send' and 'receive' operations organised by threading the state of linear IPC-system objects, with the communication happening under the hood and thus invisibly to this system, so it shall not be considered further.
- Objects are either pure - shareable, read only - or linear, with only one reference to it, and mutable.
- A process cannot terminate until all its children terminate. At best, it can go into a 'dying' state where its return value is already prepared, and any returns from its children will be discarded, and its value returned when its children have died.
- A process starts with access to its inputs from the parent, which basically boil down to a function application - one of the inputs is the code to run, and then any further inputs are arguments to that code (possibly other bits of code, of course). It can then allocate linear or shared data objects, which are initialised to contain data available to it at the time (eg, pointers to existing objects, or literal values). It can mutate linear objects, but no other process can read or write those (except for any encapsulated violations of this done by calling out to 'unmanaged code', such as inter-process communication systems as mentioned above). Shared objects created by a process can be read by that process or by its child processes, but can only be read by other processes if returned by that process to its parent, and then only after the process has died.
- By that, I'm assuming that only linear objects can ever be passed between processes along communications ports. And due to the linearity constraint, of course, to send an object to another process means losing it yourself.
- Most processes will be short lived, but some may turn out to live forever.
Assumptions about memory
- There can be any number of processors accessing a shared memory.
- Processors may own bits of memory on some arbitary granularity (eg, cache lines), and changing ownership may be expensive. Eg, try and keep data localised to the process operating on it. If we know we are on a NUMA system, then try to ship data into memory that's close to the processor that will operate upon it; we know what memory is owned by what process, so use this information in choosing a CPU to run the process on.
Therefore
- Therefore, have a global heap, optimised for highly concurrent access and large objects.
- Let each process then have a local heap, composed of large chunks allocated from the global heap. Each chunk has a header of H bytes. Each object has a header of W bytes. Each process also has a free-space index.
- Let a process have two tunable parameters, C (the chunk size) and L (the chunk allocation limit, less than or equal to C).
- For a new process, the local heap starts in fast mode. The local heap is represented as a linked list of chunks. H is large enough to accomodate a pointer to the next chunk in the list, plus a count of the number of AUs allocated in the chunk. Space within a chunk is allocated sequentially, by advancing the AU count in the chunk header. The process stores a pointer to the head of the list. When a process wants to allocate N bytes, if N > L-W, it requests a dedicated (H+W+N)-byte block from the global heap, and puts it in the local heap as the chunk below the head of the list (unless it's the first chunk allocated, in which case it becomes the head of the list), and allocates from that. Otherwise, if the space left in the chunk at the head of the list is sufficient for W+N AUs, it allocates the space from there. If not, it allocates a new chunk of (H+C) bytes from the global heap and makes that the new head of the list, and allocates from that.
- If a processes local heap grows beyond a certain size, S, the local heap switches to long-running mode. In simple mode, the object headers of W bytes were initialised with the object length, shared/linear flags, and process-allocation-order back pointers, to keep the heap starting off in the correct format for starting off in long-running mode. The only work to be done is to scan the linked list of local heap chunks and collect together the little spare regions at the end of each, and add them to the free-space index with them.
- When a process wants to allocate N bytes, if N > L-W, it requests a dedicated (H+W+N)-byte block from the global heap, adds it to its local heap, and allocates the N bytes from that. Otherwise, if it can find an existing hole in the local heap's free-space index, it uses that. Otherwise, it allocates (H+C) bytes from the global heap (W+N <= C since W+N <= L and L <= C), adds it to the local heap, and allocates from that.
- Linear objects that are deallocated by a process create holes in the local heap which are then placed on the free-space index, whether the heap is in fast or long-running mode.
- A heap in long-running mode may be garbage collected. Linear objects are deallocated by the code itself, but shared objects must be deallocated when no more references to them exist within that process or its child processes.
- Let every object header include a marked bit. We can mark all objects referenced within a process by marking its root set then walking its process-allocation-order chain of objects backwards, queuing unmarked objects for deletion, and for marked objects, marking each object referenced within that object. This will also mark objects inherited from its parent. But the objects now queued for deletion may include objects referenced by children, but that they have not yet marked. And having multiple collectors, one per process, with some objects shared, breaks the ability to clear all the marked bits in constant time by just flipping the sense of the bits. This can be done a few steps at a time every time the process tries to allocate, with more steps being used as memory pressure rises.
- When a process dies, we need to pass its outputs up to the parent somehow (copy them to new parent allocations? Or, for chunks that are mainly being returned, just repurpose them into the parent's local heap, with all but the returned stuff added to the parent's free-space index?), then we can return the entire remaining local heap to the global heap.
- When a linear object graph is passed between processes, it must likewise be copied-and-freed, unless it is the sole occupant of any chunks, in which case those can be transferred without copying.
- Linear objects are not included in the GC cycle, so we need to know which shared objects are referenced by linear objects - perhaps with a reference count.
But
How do we tell if an object that a process passes to its child is still referenced by one or more children? An object apparently unreachable in the parent may be used by children. Objects queued for deletion by the per-process GC may, if they have been passed to a child, require further examination before really being deleted.
Note in particular that process A may spawn process B, but process B may then spawn process C and pass down things inherited from A, so a process's objects may also be referenced by its descendants to arbitrary depth.
So
- Perhaps we need an owning-process pointer in every object, and when checking or setting the mark of an object, XOR the mark bit with the owning process's mark flag (atomically) so a process may clear the flags of all of its children by just flipping that flag. Be careful of race conditions between that and GCs in other processes.
- Perhaps we need to, when creating a child process, flag every object reachable from its inputs as 'inherited' so the parent handles them specially. But bear in mind that an object may be inherited by more than one process, and that inheritance may be carried down to a new generation of process, so we either have to have a reference count or something similar in the object header if we want to be able to unset the inherited flag when no inherited references remain, or something like that. Each process should definitely keep a weak pointer to its inputs in the process header, so they can be found again in future.
- Naive algorithm: Keep a 'first passed to a child' timestamp in every object header. When a child is created, examine all the objects reachable from the child's inputs and set that timestamp to now unless it is already set. Also keep that timestamp in the header of the child process. Then when a child process dies, any object whose timestamp is earlier than the start-time of the earliest existing child (grandchild, great-grandchild, ...) process can be considered considered no longer potentially used by children.
- Perhaps we can put use the linear reference count in the object header by also counting the number of child processes keeping references to the object. When a process is created, a bitmap of all the objects it inherits is made, and all set to one. There's also a second copy of the table, all set to one as well. When the GC runs in a child process, if it follows a reference to an object from a live object and finds that the object was inherited (from a parent, grandparent, etc), then as well as marking it live directly, it also finds (how? hash on address?) and sets the corresponding bit in the inherited-reference table. Once the GC cycle has completed, the tables are examined, and if an object's flag is zero in both tables, then the object is no longer referenced by this process so it can have its reference count decreased. The table is copied over the second table, and then cleared, and the GC cycle runs again. We need to make sure that references are correctly passed up the process parentage tree. As the GC goes through the objects, any object with a non-zero reference count is considered live and its subobjects marked.
Memory budgets
- Every process has a pointer to a memory budget object. A memory budget countains counters for objects and AUs allocated, and soft and hard limits on same.
- A process may save its memory budget and use a new one, and the old one is restored at the end of the scope of the new budget, so the budgets form a LIFO stack.
- When a process allocates an object, a pointer to the budget is stored in the object header, and the budget's AU and object counts increased. If that would put them over the hard limit, the allocation is denied. If that would put them over the soft limit, the allocation is allowed but a warning given somehow.
- When an object is freed, its budget is looked up via the pointer in the header, and the object's contribution to the AU and object counts undone.
Thoughts
- We have a lot of stuff in an object header. An allocation sequence chain pointer for the GC, a reference count for references from linear objects or child processes, a memory budget pointer, the length of the object, the type of the object (a pointer to a type object), and some flags? The flags and the refcount can go into one machine word, so that's 5 cells of object header. Hmmm.
- Perhaps we can remove the need for the memory budget pointer, assuming that there won't be too many memory budgets, by letting each process have a different local heap for every memory budget it actually uses.
- Most objects' sizes are fixed for their type, so we can remove the need for a size in everything other than vectors/arrays by putting a size in the type and indirecting.
- Perhaps we can remove the need for a size in arrays etc. by putting all objects of the same size in the same local-heap chunk, and having a size in the header of the chunk. There would need to be several linked lists in the fast heap case, each for different sizes. Large objects that get their own dedicated chunk of course get their own size listed in the chunk's object-size header field, and there's just one of them in the chunk. To reduce the number of object sizes required, we can round object sizes up to the nearest 16 cells or something like that. Perhaps have standard sizes of 4 cells, 8 cells, 16 cells, 24 cells, 32 cells, (going up in 8s) ... 256 cells, 512 cells, 768 cells, ... (going up in 256es), etc. up to C.
- Perhaps we can remove the need for a type field in some cases by having special local heaps for certain common types, with the type stored in the chunk header. If there are too many types around to make that worthwhile, or for less common types, we can store a NULL in the chunk-header type pointer, leading code to look in the (suitably larger) object header after all.
- Since a pointer can only point to the start of a >C sized object, if we make C a power of two, we can always find the header of the chunk holding an object by ANDing it with 0xffff....fff000...000. So finding the chunk-wide header is easy.
- That gets us down to an allocation sequence chain pointer, and a refcount and flags word. Much better.