Category: Scheme

Ugarit

The core of Ugarit is ready. Now I just need to wrap some UI around it...

However, the code I have will now archive and restore directory trees!

Take a look. Here's the input:

    -bash-3.2$ ls -l test-data
    total 4
    prw-r--r--  1 alaric  users     0 Jan 15 20:52 FIFO
    lrwxr-xr-x  1 alaric  users    18 Jan 15 03:02 LICENCE.txt -> subdir/LICENCE.txt
    -rw-r--r--  1 alaric  users     0 Jan 15 02:44 README.txt
    brw-r--r--  1 alaric  users  0, 0 Jan 15 20:52 blockdev
    crw-r--r--  1 alaric  users  0, 0 Jan 15 20:53 chardev
    drwxr-xr-x  3 alaric  users   512 Jan 15 10:01 subdir
    -bash-3.2$ ls -l test-data/subdir/
    total 4
    -rw-r--r--  1 alaric  users  1527 Jan 15 02:44 LICENCE.txt

...and here's the output. Note that the file modification times are preserved, and that the FIFO, the device special files, the symlink, the zero-length file, and the subdirectory all make it through, too. The modtime of the symlink isn't preserved, but that's because there doesn't seem to be a POSIX function to do it - I use utime to set the times. NetBSD seems to have a utimes/lutimes pair that can handle symlinks properly, but I don't know how widespread that is.

It does the mode, uid and gid, too; note that the files come out owned by alaric, even though I ran the extract as root.

    -bash-3.2$ ls -l tmp3
    total 4
    prw-r--r--  1 alaric  users     0 Jan 15 20:52 FIFO
    lrwxr-xr-x  1 alaric  users    18 Jan 17 01:03 LICENCE.txt -> subdir/LICENCE.txt
    -rw-r--r--  1 alaric  users     0 Jan 15 02:44 README.txt
    brw-r--r--  1 alaric  users  0, 0 Jan 15 20:52 blockdev
    crw-r--r--  1 alaric  users  0, 0 Jan 15 20:53 chardev
    drwxr-xr-x  3 alaric  users   512 Jan 15 10:01 subdir
    -bash-3.2$ ls -l tmp3/subdir/
    total 4
    -rw-r--r--  1 alaric  users  1527 Jan 15 02:44 LICENCE.txt

Now, give me a moment while I wrap a command-line shell around it and some higher-level management stuff, then run a more serious test, and it'll be ready for initial release...

Backup system progress update

I forgot to mention before... I chose a name for it. Ugarit, after the site of the oldest human archive yet found.

Anyway, I had another few hours at it last night (I think my total's up to about a working day now; one more working day to go before I know if my two day estimate was correct!).

I've implemented two fundamental abstractions. In a limited-block-size reference-counted content-addressed store, a large file has to be stored as lots of chunks, with the keys of the chunks stored in one or more indirect chunks full of keys. If there are too many keys to fit in one indirect chunk, then you need lots of indirect chunks, and in turn, one or more doubly-indirect chunks to store the keys of THEM in, until you finally have a single chunk at the top of a big tree, whose key you can use to refer to the file.

Reference counting complicates this. As you're writing the file's chunks, some may already exist in the archive, and some might not. For those that already exist, you need to bump their reference count, so the archive knows they're being used. For those that don't, you upload the actual content to create them.

But if you reuse all of the blocks of a file - because it's not changed - then you'll find you can reuse the indirect blocks, too, since they won't have changed. In which case you mustn't increment the refcounts of the data blocks, since they're still only referenced from the one indirect block, even though that indirect block now has two references.

So, what I've done is to implement a "key stream writer" API. You create a key stream writer and feed keys into it, but along with every key, you feed a 'reused?' flag which is true if the key is a reused one rather than a new one.

At the end, you finish the key stream writer, and it returns you a key for the entire tree, plus a reused? flag.

And it does all the magic internally, incrementing the refcounts on reused keys passed to it if and only if it creates a brand new indirect block referring to that key, and correctly handling refcounts on any indirect blocks it uses. It does this with a lovely recursion; the key stream writer accepts keys into a buffer and flushes the buffer out to a key stream indirect block when it gets full, at which point it creates a parent key stream writer to manage the next "level" up in the tree, writing the key of the newly created indirect block to that writer. I'm quite taken aback by how simple and elegant the code is, for what it does 🙂
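
To make that recursion concrete, here's a minimal sketch of a key stream writer - not Ugarit's actual code - assuming two hypothetical archive primitives: (put-indirect-block! keys), which stores a block of keys and returns a (key . reused?) pair, and (increment-refcount! key).

    (define block-capacity 64)   ; keys per indirect block (illustrative)

    (define (make-key-stream-writer)
      (let ((buffer '())   ; (key . reused?) pairs, newest first
            (parent #f))   ; writer for the level above, made on demand
        (define (flush!)
          (let* ((entries (reverse buffer))
                 (result (put-indirect-block! (map car entries))))
            ;; Only if this indirect block is brand new do the reused keys
            ;; inside it gain a new referrer and need their counts bumped.
            (if (not (cdr result))
                (for-each
                 (lambda (e) (if (cdr e) (increment-refcount! (car e))))
                 entries))
            ;; The recursion: hand the indirect block's key up a level.
            (if (not parent) (set! parent (make-key-stream-writer)))
            ((parent 'write!) (car result) (cdr result))
            (set! buffer '())))
        (define (write! key reused?)
          (set! buffer (cons (cons key reused?) buffer))
          (if (>= (length buffer) block-capacity) (flush!)))
        (define (finish!)   ; -> (key . reused?) for the root of the tree
          (cond
           (parent (if (pair? buffer) (flush!))
                   ((parent 'finish!)))
           ((and (pair? buffer) (null? (cdr buffer)))
            (car buffer))   ; a single key with no parent is already the root
           (else (flush!)
                 ((parent 'finish!)))))
        (lambda (msg)
          (case msg
            ((write!) write!)
            ((finish!) finish!)))))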

Then, on top of that, I implemented an sexpr-stream writer, which accepts sexprs (Lisp data records, in effect) and buffers them, writing them to the archive when there's too much for a single block, and using a key stream writer to track the blocks written. This is to be used to implement directories, which will be a list of records giving details of all the files in the directory. The sexpr stream writer needs to know about keys mentioned in the sexprs, so it can do the same refcount-tracking magic to handle sharing correctly. Luckily, in spite of those requirements, it also works beautifully and is also elegant code!
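
In the same spirit, here's a simplified sketch of the sexpr stream writer layered on top - again not the real code, and omitting the refcount magic for keys mentioned inside the records. (put-block! is another hypothetical primitive that stores raw bytes and returns their key; this sketch also treats every sexpr block as brand new, which the real one mustn't.)

    (define sexpr-block-size 4096)   ; bytes per block (illustrative)

    (define (make-sexpr-stream-writer)
      (let ((keys (make-key-stream-writer))   ; tracks the blocks we write
            (pending '())                     ; serialised records, newest first
            (size 0))
        (define (spill!)
          ;; write the buffered records out as one block, feeding its key
          ;; to the key stream writer
          (let ((key (put-block! (apply string-append (reverse pending)))))
            ((keys 'write!) key #f)
            (set! pending '())
            (set! size 0)))
        (define (write-sexpr! sexpr)
          (let ((text (with-output-to-string
                        (lambda () (write sexpr) (newline)))))
            (set! pending (cons text pending))
            (set! size (+ size (string-length text)))
            (if (> size sexpr-block-size) (spill!))))
        (define (finish!)   ; -> (key . reused?) for the whole record stream
          (if (pair? pending) (spill!))
          ((keys 'finish!)))
        (lambda (msg)
          (case msg
            ((write!) write-sexpr!)
            ((finish!) finish!)))))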

Then I just need to implement storing files (which should be quite easy; split the file into chunks, check to see if each chunk is already in the archive and upload if not, write keys to key stream), then I can implement storing directories (recursively store files and directories, use an sexpr stream writer to log each file created). Directory storage will need a few cases handled, since I plan to store funny filesystem objects like FIFOs, device specials, symlinks, and so on.
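
The chunk-splitting step might look something like this sketch, where hash-block, block-exists? and put-block! are hypothetical content-addressed store operations, and read-string is assumed to behave as in R7RS (returning an eof-object at end of file):

    (define chunk-size (* 64 1024))   ; illustrative chunk size

    (define (store-file! path)
      (let ((writer (make-key-stream-writer))
            (port (open-input-file path)))
        (let loop ()
          (let ((chunk (read-string chunk-size port)))
            (unless (eof-object? chunk)
              (let ((key (hash-block chunk)))
                (if (block-exists? key)
                    ((writer 'write!) key #t)   ; already archived: reused
                    (begin
                      (put-block! chunk)        ; new chunk: upload it
                      ((writer 'write!) key #f))))
              (loop))))
        (close-input-port port)
        ((writer 'finish!))))   ; -> (key . reused?) for the whole file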

Then I'll need the main driver that does a snapshot of a directory tree, then creates a snapshot object with metadata in and links it into the snapshot chain for a chosen tag.

I reckon that lot will take me half of the next day - then I'll implement deletion (I've been implementing primitive deletion operators for key streams and sexpr streams as I go along, so this won't take long) and extraction (again, I've written fold operators for key and sexpr streams, so this won't take long), then I'll have the core done!

What will remain then? A command-line wrapper with decent option parsing and command line ops to do a snapshot to a tag, retrieve all or part of a snapshot, list snapshots on a tag, delete a snapshot, and delete a tag, which should just be a matter of plumbing. Then I'll be done!

I still think I'm on for two days... but mainly because I'm punting on writing an actual S3 backend for now. That'll take a while, dealing with fiddly Web Services authentication stuff!

Other things I ought to get around to doing at some point are:

  1. LZMA compression. For now, I'm using a deflate implementation for Chicken Scheme that I found, but I would like to wrap liblzma as a Chicken egg, since LZMA is generally thought to compress more tightly than deflate.
  2. Encryption. Right now I have a stub in the code for encrypt/decrypt operations, applied to the compressed blocks, but for now it just returns the block unchanged. I need to either use Chicken's libcrypt egg (which looks complicated) or find a lightweight implementation of a good cipher that I can wrap as an egg.
  3. More storage backends.
    1. A replication backend that wraps a list of backends, doing writes and deletes to all of them, and servicing reads from the first backend that can provide the block (there's a sketch of this after the list). That way I could back up to a local disk and to S3, and restores would come from the local disk if possible, or from S3 if not. Even if I lost part of the local disk in a filesystem corruption incident, anything I lost would still be available from S3.
    2. A generational backend, that wraps a list of backends. Writes are only done to the first, but reads (and deletes) are serviced from the first one that responds. That way, I can fill up one hard disk with archives, then start filling another, and so on, building up a chain of archive partitions.
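
Here's a rough sketch of the replication idea from 3.1, modelling a backend as a simple dispatcher closure with 'get, 'put! and 'delete! operations (not Ugarit's real backend protocol):

    (define (make-replicated-backend backends)
      (lambda (msg)
        (case msg
          ;; writes and deletes go to every backend in the list
          ((put!) (lambda (key block)
                    (for-each (lambda (b) ((b 'put!) key block)) backends)))
          ((delete!) (lambda (key)
                       (for-each (lambda (b) ((b 'delete!) key)) backends)))
          ;; reads are serviced by the first backend that has the block
          ((get) (lambda (key)
                   (let loop ((bs backends))
                     (if (null? bs)
                         #f
                         (or (((car bs) 'get) key)
                             (loop (cdr bs))))))))))

The generational backend of 3.2 would differ only in its put!, which would write to (car backends) alone.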

Care must be taken when doing complex things with backends, though, in order to preserve the correctness of reference counts. A group of archives that are used together by replication or generations will have refcounts that are correct across the whole group, not necessarily individually within each component archive.

More forebodingly, my existing cache backend wrapper won't mix well with deletion of snapshots when the same backend is shared between multiple machines, as it basically caches the existence of blocks; but if a block is deleted by another system, the local cache won't know, and will incorrectly report that the block is already known. Which means it won't get uploaded to the archive, which means data could be lost.

For now I'll recommend that people who want to delete from shared repositories don't use the cache, or flush the caches after a deletion, but in the long run I'd like to think about how to improve the cache. Perhaps by logging deleted keys to a special area in the backend, and when a cache starts up, it can fetch the log and look for any deletions it doesn't already have. Or just compress the entire cache file and upload it when closing the connection, and download it when starting a connection, as long as no two cache-using machines are snapshotting at the same time...
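
If I went the deletion-log route, the startup resync might be as simple as this sketch - every name in it (fetch-deletions-since, cache-last-seen, cache-forget!, cache-note-seen!) is hypothetical, just to illustrate the shape of the idea:

    (define (resync-cache! cache backend)
      ;; ask the backend for all deletions logged since the last one this
      ;; cache saw, and forget those keys locally
      (let ((deleted (fetch-deletions-since backend (cache-last-seen cache))))
        (for-each (lambda (key) (cache-forget! cache key)) deleted)
        (cache-note-seen! cache (length deleted))))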

I'm missing Scheme

I've not done any Scheme programming for ages. In fact, the past few months have been quite a haze of relentless hard work; I'm liking what I'm actually doing for a living, except that lately I've been doing rather a lot of faffing about with recruitment rather than actually doing it.

I'm having to spend half of my week in London, and the other half working from home - but with Sarah away that half of the week doing her course, I'm working from home alone by day and looking after Jean in the evening, while having my working day bracketed by taking Jean to and from nursery, which is a half-hour round trip each time. All the thrill of commuting without the fun of working somewhere different to where you sleep, or with people.

So, no programming-for-fun lately! But that can't last forever, since trying to stop my mind from going exploring for too many months in a row is always rather futile.

So I came across Ventonegro's post on and-let* and it set me thinking. The Lisp family of languages (which includes Scheme) is renowned for its macros, which are the key rationale for the minimalist syntax; without things like if holding a special place in the language, user-written macros are just as powerful as anything that comes built into the language. This lets you extend the language with features that you'd be mad to build into a language core, but which are nonetheless useful reusable constructs, such as and-let*.

As an aside, let me just explain and-let* - the name is a terse mnemonic that makes sense to Schemers and nobody else, but it's a way of compactly writing bits of code that attempt to compute something in steps, where the trail might end at any step and fall back to some default. The example Ventonegro gives is rather good:

    (define (get-session request)
      (and-let* ((cookies (request-cookies request))
                 (p (assoc "session_id" cookies))
                 (sid-str (cdr p))
                 (sid (string->number sid-str))
                 ((integer? sid))
                 ((exact? sid))
                 (sp (assq sid *sessions*)))
        (cdr sp)))

Which translates to:

  • If there are cookies available
  • And there is one called session_id
  • And its value parses as a number
  • And that number is an integer
  • And that number is exact (eg, 3 rather than 3.0)
  • And that number is the ID of an existing session
  • ...return that session

A few languages happen to make that pattern easy to write natively by putting assignments inside an and, as Peter Bex points out, but with Lisp you don't need to rely on that piece of luck; you can roll your own luck.
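
To see what the macro is saving you, here's roughly what get-session looks like written out by hand with nested lets and ands (using the same hypothetical request-cookies and *sessions* as the example above):

    (define (get-session request)
      (let ((cookies (request-cookies request)))
        (and cookies
             (let ((p (assoc "session_id" cookies)))
               (and p
                    (let ((sid (string->number (cdr p))))
                      (and sid
                           (integer? sid)
                           (exact? sid)
                           (let ((sp (assq sid *sessions*)))
                             (and sp (cdr sp))))))))))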

There's a whole library of useful macros and combinators (another handy higher-order programming tool) in most Scheme systems, and any your system lacks can be copied easily enough. But it occurs to me that there are very few educational resources on actually using them. I think a definite theme, if not a whole chapter, in a "practical Scheme" book would have to be the introduction and then applied use of such handy macros (and a damned good reference guide to them all), because reading the definition of and-let* failed to really fill me with inspiration for situations I'd use it in, while reading Ventonegro's example reminded me of some ugly code I'd written that could be tidied up just by using and-let*.

It's great to be able to assemble your own syntactic tools, but presenting them as one unorganised mass will just make your language seem as complex and messy as C++ and Perl combined; yet sticking only to the core base language and expecting programmers to spot their own patterns and abstract them out results in duplication of effort, with every piece of source code starting with a preamble full of rather general macros. Neither extreme is good. Rather than choosing a tradeoff between the two, as static programming language designers are forced to, we need to find a way of cataloguing such tools so that they can easily be split by category and by priority; the ones that are most widely useful can be learnt wholesale, and the ones that are more useful in certain niches can be glanced at once, then gone back to and studied if required.

Type systems

There are a number of type systems out there in different programming languages. In fact, there are zillions, but they boil down to a few basic approaches.


Insomnia

yawn

5:30am and I haven't slept a wink yet! I really need to sort out my lifestyle so I get (a) exercise and (b) time to think every day. Time to think is important for me; if I don't get enough, then when I go to bed, I lie there and think. Lots.

Tonight's thoughts have included:

  1. Some ideas about how whole-program transformations (eg, the macroexpander/partial evaluator and the OO system) in CHROME might be handled. The OO system needs to be a whole-program transformation rather than just some macros using the normal macroexpander since things like inheritance graphs and method lists for generic functions need to be accumulated together; most Lisps handle that with macros that destructively update data structures, but I'm trying to design a system without rampant mutation, so need a whole-program code walk to do this. Clearly, since we want to be able to write macros that produce generic functions, methods, and the like, we need to do this AFTER normal macro expansion, but before the compiler gets its hands on it.
  2. Some ideas about separating the generic function/method system - the dispatch part of Lispy OO - from the classes-inheriting thing. Subtype relationships that are used to dispatch GFs should be doable with plain predicates - pair?, my-record-type?, etc. Or more complex predicate expressions on groups of arguments, so we can support multivariate typeclasses in the Haskell sense, as a rich interface/implementation system as well as a traditional records-with-single-inheritance class system. To do this properly we also need declarations that one predicate implies another - (integer? x) implies (number? x) - so that a method on numbers will be used for integers, yet a more specific integer method can override it (there's a toy sketch of this dispatch idea after this list). I'm not sure how decidable the "most specific method wins" thing can be with complex multivariate type predicates, though. Must experiment and ask knowledgeable formal logic folks.
  3. Thoughts about future computer architectures. The drive is for more speed, but these days, most CPUs are idle waiting for memory to feed them code and data, or (much more often) for the disk, network, or user to feed them. The only places where the CPU maxes out tend to be highly parallelisable tasks - server stuff handling lots of requests at once, games and other audiovisual things, and scientific number crunching. This suggests to me that a future direction of growth would be one or more high-bang-per-buck MISC processors embedded directly into SRAM chips (sort of like a CPU with an onboard cache... but the other way around, since the CPU would be much smaller than the SRAM array) which are bonded to the same chip carrier module as a set of DRAMs. One or more of CPU-and-SRAM and some DRAM chips are then all designed together as a tightly-coupled integrated unit for maximum speed due to short traces and the lack of standardised modular interfaces between them (like DIMMs and CPU socket standards) meaning that the interface can evolve rapidly. The whole CPU+SRAM+DRAM unit is then pluggable into a standardised socket, which motherboards will have several of. The result? Lots of cores of low power consumption reasonably fast CPU with high speed access to high speed memory. And for those demanding games/media/scientific tasks? The higher-end modules will have FPGAs on as well...
  4. Forget nasty complex unpredictable memory caches: have a nonuniform physical address space (with regions of memory of varying speed) and let the OS use the virtual memory system to allocate pages to regions based upon application hints and/or access statistics. Not having cache tag and management facilities makes for more chip area to put actual memory in.
  5. We've been wondering about getting goats lately. Goats are useful creatures; they produce milk (which can be turned into cheese) and they produce decent wool (just not in the quantities sheep produce it). Their milk and cheese don't make Sarah ill the way cow-derived versions do. Plus, we need something to come and graze our paddock. We've been doing a little bit of research and apparently two goats would be about right for the space we have. We'd need to put an inner layer of fence around the paddock to keep them in while still allowing the public footpath, and we'd need a little shed for them to shelter in. But thinking about setting things up in the paddock, I'm now wondering if it would be a good idea to build a duck run in there too, down at the bottom by the stream, all securely fenced against foxes and mink and with a duck-house up on stilts in case of flooding, but with a little pond dug out for them (connected to the stream by a channel with a grille over it to prevent escapage). It would be a convenient place to have the ducks, and it would make a good home for them, I think.
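
Since I'm awake anyway, here's the toy sketch of the predicate-dispatch idea from thought 2 (nothing to do with any real CHROME code; "most specific first" is faked here by insertion order, where the real thing would sort methods using the implication declarations):

    (define (make-generic-function)
      (let ((methods '()))   ; (predicate . implementation), most recent first
        (define (add-method! pred impl)
          (set! methods (cons (cons pred impl) methods)))
        (define (dispatch . args)
          (let loop ((ms methods))
            (cond ((null? ms) (error "no applicable method" args))
                  ((apply (caar ms) args) (apply (cdar ms) args))
                  (else (loop (cdr ms))))))
        (values dispatch add-method!)))

    ;; Usage: a method on numbers, overridden by a later, more specific
    ;; method on integers:
    ;;   (define-values (describe describe-add!) (make-generic-function))
    ;;   (describe-add! number? (lambda (x) 'some-number))
    ;;   (describe-add! integer? (lambda (x) 'an-integer))
    ;;   (describe 3)    ;=> an-integer
    ;;   (describe 3.5)  ;=> some-number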

It's now 6am. Do I try and go to sleep, or try and last the day out? Hmmm...


Creative Commons Attribution-NonCommercial-ShareAlike 2.0 UK: England & Wales