Garbage collection (by alaric)
I read this paper shortly after it came out:
One Pass Real-Time Generational Mark-Sweep Garbage Collection
However, I've spent ages since trying to find it again, mainly because I kept confusing it with The Treadmill: Real-Time Garbage Collection Without Motion Sickness, which describes a somewhat related algorithm that uses a read barrier to support mutation.
The algorithm described in the first paper is pretty amazing. It's simple, effective, fast, real-time, and requires no read barrier. The downside, however, is that it requires that pointers only ever point back in time to older objects, which is fine if you have a purely functional language, since an object being created can only ever obtain references to objects that already exist at creation, and those references can never be updated to point to a newer object thereafter. However, you cannot then use any optimisations that rely on mutation "under the hood", such as in-place update of linear values.
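To make that concrete, here's a minimal C sketch of how I understand the one-pass collector to work; the struct layout and names are mine, not the paper's. The allocation chain is kept newest-first, so a single walk from the head can both propagate marks (everything an object refers to is older, so it comes later in the walk) and free anything that nothing newer has marked:

    #include <stdlib.h>

    typedef struct object {
        struct object  *next;     /* next-older object in the chain        */
        int             marked;
        size_t          n_refs;
        struct object  *refs[];   /* references, all to older objects      */
    } object;

    static void mark(object *o) { if (o) o->marked = 1; }

    /* 'head' is the newest object; 'roots' are marked before the sweep.
     * Returns the new head of the (possibly shortened) chain. */
    object *collect(object *head, object **roots, size_t n_roots)
    {
        for (size_t i = 0; i < n_roots; i++)
            mark(roots[i]);

        object **link = &head;
        while (*link) {
            object *o = *link;
            if (o->marked) {
                o->marked = 0;                 /* reset for the next cycle */
                for (size_t i = 0; i < o->n_refs; i++)
                    mark(o->refs[i]);          /* children are older, so   */
                link = &o->next;               /* they'll be swept later   */
            } else {
                *link = o->next;               /* nothing newer wants it:  */
                free(o);                       /* unlink and reclaim now   */
            }
        }
        return head;
    }

The nice property is that the sweep never has to revisit anything: by the time it reaches an object, every possible referrer has already had its say.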
I'd attempted to work out a way of handling mutation in this algorithm before, by moving things back to the head of the list when they were modified, but it started to get costly.
The thing is, I'm wondering if you can make use of the invariant that a non-linear value can never point to a linear value, and the fact that all transient state of a thread is linear, to get around this. Linear values themselves don't need garbage collection; in the absence of pointer sharing, the owner of a reference always knows when it abandons the object, and can manually deallocate it (or if that's costly, put it in a queue for deallocation).
In which case, each thread would have a root set of linear values, each containing references to other linear values or into the heap. All mutation would be within the linear values, and heap objects could never refer to linear values, so the heap would still be time-ordered and we could still use the algorithm; linear stuff would just never take part in it. There would need to be two phases of garbage collection: first tracing the linear values (being careful not to get caught out, since the mutating threads can be randomly swapping pointers around all over the place; this might require a write barrier to avoid race conditions) and marking anything in the heap referenced from a linear value - in effect, working out the 'root set' of the heap - then running the collector on the heap. It's important to note that the traversal of the linear objects occurs purely to find out what is referenced in the heap; we're not actually garbage collecting the linear data, since it is cleared up automatically.
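As a rough sketch of those two phases, building on the 'object' and 'collect' definitions above (the 'linear' layout here is just a placeholder of mine, not a real data structure):

    /* Linear values are walked only to find which heap objects they
     * reference; they are never freed here, since their owners
     * deallocate them explicitly when they abandon them. */
    typedef struct linear {
        struct linear  *next_linear;   /* this thread's list of linear values */
        size_t          n_heap_refs;
        object        **heap_refs;     /* references into the shared heap     */
    } linear;

    object *gc_thread(object *heap_head, linear *linear_roots)
    {
        /* Phase 1: trace the linear values to build the heap's root set.
         * If other code can mutate them while we scan, a write barrier
         * (not shown) is needed to avoid missing a freshly stored pointer. */
        size_t n = 0, cap = 16;
        object **roots = malloc(cap * sizeof *roots);
        for (linear *l = linear_roots; l != NULL; l = l->next_linear)
            for (size_t i = 0; i < l->n_heap_refs; i++) {
                if (n == cap)
                    roots = realloc(roots, (cap *= 2) * sizeof *roots);
                roots[n++] = l->heap_refs[i];
            }

        /* Phase 2: one-pass collection of the time-ordered heap. */
        object *new_head = collect(heap_head, roots, n);
        free(roots);
        return new_head;
    }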
I think that'll work, and be a lot simpler and more efficient than the solutions I was thinking of previously...
And as I noted before, the fact that ARGON focusses on short-term event handlers rather than long-running processes, wherever possible, means we can effect some optimisations. If every thread has its own linked list of allocated objects (and, possibly, its own free list, to reduce contention over a shared free list; but that's another issue), and every operation that communicates an object "outside" of a thread either copies the object (or, if it's linear, just hands over ownership of it) or tags it (and everything it refers to) as now being "shared", then it's relatively easy to clear up everything owned by a thread (apart from shared things) when it exits. Only threads that seem to run for a long time will actually need garbage collection.
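A sketch of what that thread-exit cleanup might look like; all the names and the 'shared' flag are my own invention rather than anything ARGON actually has yet:

    #include <stdlib.h>
    #include <stdbool.h>

    typedef struct tobj {
        struct tobj  *next;
        bool          shared;    /* has this escaped to another thread?    */
        size_t        n_refs;
        struct tobj  *refs[];
    } tobj;

    /* Tag an object, and everything it transitively refers to, as shared.
     * Called by whatever primitive communicates the object "outside". */
    void share(tobj *o)
    {
        if (o == NULL || o->shared)
            return;
        o->shared = true;
        for (size_t i = 0; i < o->n_refs; i++)
            share(o->refs[i]);
    }

    /* On thread exit: free everything the thread allocated except what has
     * escaped, preserving the allocation order of the survivors so they
     * can be handed to a longer-lived chain. No marking is needed at all. */
    tobj *thread_exit_cleanup(tobj *head)
    {
        tobj *survivors = NULL, **tail = &survivors;
        while (head != NULL) {
            tobj *next = head->next;
            if (head->shared) {
                *tail = head;
                tail  = &head->next;
            } else {
                free(head);
            }
            head = next;
        }
        *tail = NULL;
        return survivors;
    }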
In general, threads will fork off child threads which can access objects of the parent thread as well as creating new objects of their own. I think it should be possible to modify the algorithm to support this, in effect allowing the history chain to "fork"; this, combined with a pointer to the owning thread (or, decoupling things slightly, the "containing memory pool") from every object, makes it easy to identify the boundary between your own memory pool and your parent's one.
So what happens if an object is passed from one thread to another via some kind of communications channel?
Perhaps you could cause the closest common ancestor of the two threads, in the thread family tree, to adopt the object and everything it refers to, so that it can be in the history of both source and destination threads (don't forget that unless we're talking about the limited (and trivial to handle!) case of passing linear values, the source needs to still be able to access it). But you can't just put it at the head of the ancestor's chain, since it will then no longer be visible in the chains of the two threads at all; they both forked off of it at some point in the past, so their chains will join the ancestor's chain somewhere beneath where the object would be reparented to.
Obviously, before making something shareable between threads like that, you'll need to recursively perform the same operation on everything it references, too. But that means that we can move the object down the allocation chain, since anything it refers to will be pushed down below it anyway, and there's no danger of ending up with a pointer into the future.
So perhaps we could get away with inserting it into the allocation chain of the common ancestor, just below the point where the earliest fork of the source or destination thread occurred. It's legal to move an object down the chain, as long as it never moves below an object it refers to.
Obviously, some objects that are being prepared for sharing may already have been pushed partway down the thread family tree by previous sharings, or because the object was obtained from an ancestor in the first place - so in some cases, the recursive move-down operation won't have to do anything. A thread passing objects to one of its own children is an easy no-op, for a start, if the object existed before the fork (so is below it in the chain).
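Here's a rough sketch of that promotion operation. The pool structure, the 'fork_point' bookkeeping and the simplified no-op test are all assumptions of mine, and the finer positional check (is the object already below the fork point within the ancestor's own chain?) is elided:

    #include <stddef.h>

    typedef struct pool pool;
    typedef struct pobj pobj;

    struct pobj {
        pobj   *next;        /* next-older object in its pool's chain       */
        pool   *owner;       /* the memory pool that currently owns it      */
        size_t  n_refs;
        pobj  **refs;
    };

    struct pool {
        pool   *parent;      /* parent thread's pool, NULL at the root      */
        int     depth;       /* distance from the root, for ancestor search */
        pobj   *head;        /* newest object in this pool's chain          */
        pobj  **fork_point;  /* chain link just below the earliest fork     */
    };

    /* Is 'a' the same as, or an ancestor of, 'b' in the thread family tree? */
    static int is_ancestor(pool *a, pool *b)
    {
        for (; b != NULL; b = b->parent)
            if (b == a) return 1;
        return 0;
    }

    /* Closest common ancestor of two pools. */
    static pool *common_ancestor(pool *a, pool *b)
    {
        while (a->depth > b->depth) a = a->parent;
        while (b->depth > a->depth) b = b->parent;
        while (a != b) { a = a->parent; b = b->parent; }
        return a;
    }

    /* Unlink 'o' from its current pool's chain (linear scan; just a sketch). */
    static void unlink_obj(pobj *o)
    {
        for (pobj **link = &o->owner->head; *link != NULL; link = &(*link)->next)
            if (*link == o) { *link = o->next; return; }
    }

    /* Move 'o', and everything it references, down into 'target', splicing
     * it in just below the earliest fork point. Objects already owned by
     * 'target' or an older ancestor are left alone. */
    static void promote(pobj *o, pool *target)
    {
        if (is_ancestor(o->owner, target))
            return;

        /* Children first, so by the time 'o' is spliced in, everything it
         * points to already sits below it and nothing points to the future. */
        for (size_t i = 0; i < o->n_refs; i++)
            promote(o->refs[i], target);

        unlink_obj(o);
        o->next = *target->fork_point;
        *target->fork_point = o;
        o->owner = target;
    }

    /* A channel send from pool 'src' to pool 'dst' would then do roughly:
     *     promote(obj, common_ancestor(src, dst));                         */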
Is it all worth it compared to a single allocation history chain shared by all threads? I dunno. More thread-local storage means less hassle with cache fighting on multi-processor systems (and they seem to be the future), and being able to quickly discard a given thread's garbage as soon as the thread dies is attractive in an environment with many short-lived lightweight threads. But the cost is in complex operations to maintain invariants when objects are passed between threads. And potential complex race conditions when two threads share the same 'tail' of their allocation chain list: who actually collects garbage in the tail? Really, the two threads' collectors ought to run down until they hit the tail, then block until BOTH have reached the same point; then the parent continues down the tail while the child starts again from the top, in order to ensure that all objects in the tail referenced by the child are correctly marked before the parent GC can risk considering anything free.
I guess I'll have to provide the detailed programming interface for creating memory pools and associating them with threads, then experiment with multiple implementations of the API: one that uses a global chain, and one that uses per-thread chains. The latter could either allow object sharing between threads (beyond a child thread sharing its parent's objects), or just force copying whenever things are communicated.
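To give a feel for it, here's the sort of interface I have in mind, sketched as a C header; every name in it is provisional and none of this exists yet:

    #ifndef ARGON_POOL_H
    #define ARGON_POOL_H

    #include <stddef.h>

    typedef struct pool   pool;    /* an allocation chain + free list       */
    typedef struct thread thread;  /* whatever the runtime's thread type is */

    /* Create a pool as a child of 'parent' (NULL for a root pool) and
     * attach it to 'owner', which becomes the only thread allocating in it. */
    pool *pool_create(pool *parent, thread *owner);

    /* Allocate 'size' bytes holding 'n_refs' heap references; the object is
     * appended at the head (newest end) of the pool's chain. */
    void *pool_alloc(pool *p, size_t size, size_t n_refs);

    /* Communicate 'obj' to another thread: depending on the implementation
     * this copies it, hands over ownership (for linear values), or promotes
     * it to the common ancestor pool and tags it as shared. */
    void *pool_send(pool *src, pool *dst, void *obj);

    /* Run a collection cycle over this pool's chain (and, where chains are
     * shared with a parent, synchronise with the parent's collector). */
    void  pool_collect(pool *p);

    /* Destroy the pool on thread exit, freeing everything it still owns
     * apart from objects that have been shared outward. */
    void  pool_destroy(pool *p);

    #endif /* ARGON_POOL_H */

Keeping the interface the same in front of both implementations should make it possible to compare them without rewriting any client code.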