Category: Computing

Configuring replication (by )

Storing all your data on one disk, or even inside one computer, is a risky thing to do. Anything stored in only one, small, physical location is all too easily destroyed by flood, fire, idiots, or deliberate action; and any one electronic device is prone to failure, as its continued functioning depends on the functioning of many tiny components that are not very easily replaced.

So it's sensible to store multiple copies, ideally in physically remote locations.

One way of doing this is by taking backups; this involves taking a copy of the data and putting it into a special storage system, such as compressed files on another disk, magnetic tape, a Ugarit vault, etc.

If the original data is lost, the backed-up data can't generally be used as-is, but has to be restored from the backup storage.

Another way is by replicating the data, which means storing multiple, equivalent, copies. Any of those copies can then be used to read the data, which is useful - there's no special restore process to get the data back, and if you have lots of requests to read the data, you can service those requests from your nearest copy of it (reducing delays and long-distance communication costs). Or you can spread the read workload across multiple copies in order to increase your total throughput.

Replication provides a better quality of service, but it has a downside; as all the copies are equally important, you can't use cheaper, slower, more compact storage methods for your extra copies, as you can with backups onto slower disks or tapes.

And then there are hybrid systems, perhaps where you have a primary copy and replicate onto slower disks as a "backup", while only using the primary copy for day-to-day use; if it fails, you switch to the slower "backup replica" and tolerate slower service until a new primary copy is made.

Traditionally, replicated storage systems such as HDFS require the administrator to specify a "replication factor", either system-wide or on a per-file basis. This is the number of replicas that must be made of the file. Two is the minimum to actually get any replication, but three is popular - if one replica is lost, then you still have two replicas to keep you going while you rebuild the missing replica, meaning you have to be unlucky and have two failures in quick succession before you're down to a single copy of anything.

However, this is a crude and nasty way of controlling replication. Needless to say, I've been considering how to configure replication of blocks within a Ugarit vault, and have designed a much fancier way.

For Ugarit replication, I want to cobble together all sorts of disks to make one large vault. I want to replicate data between disks to protect me against disk failures, and to make it possible to grow the vault by adding more disks, rather than having to transfer a single monolithic vault onto a larger disk when it gets full.

But as I'm a cheapskate, I'll be dealing with disks of varying reliability, capacity, and performance. So how do I control replication in such a complex, heterogeneous, environment?

What I've decided is to give each "shard" of the vault four configurable parameters.

The most interesting one is the "trust". This is a percentage. For a block to be considered sufficiently replicated, copies of it must exist on enough shards that the sum of those shards' trusts is greater than or equal to 100%.

So a simple system with identical disks, where I want to replicate everything three times, can be had by giving each disk a trust of 34%; any three of them will sum to 102%, so every block will be copied three times.

But disks I trust less could be given a trust of 20%, requiring five copies if a block is stored only on such disks - or some combination of good and less-good disks.

That allows for simple homogeneous configurations, as well as complex heterogeneous ones, with a simple and intuitive configuration parameter. Nice!
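To make the rule concrete, here's a minimal sketch of the check in Python (the shard names, trust values, and function names are invented for illustration; this isn't Ugarit's actual code):

    # A block is "sufficiently replicated" once the trusts of the shards
    # holding a copy sum to at least the target (100% by default).
    # Shard names and trust values are invented for illustration.
    shard_trust = {"disk-a": 34, "disk-b": 34, "disk-c": 34, "old-usb": 20}

    def sufficiently_replicated(holding_shards, target=100):
        return sum(shard_trust[s] for s in holding_shards) >= target

    print(sufficiently_replicated({"disk-a", "disk-b", "disk-c"}))  # True: 34+34+34 = 102
    print(sufficiently_replicated({"disk-a", "old-usb"}))           # False: 34+20 = 54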

The second is "write weighting". This is a dimensionless number, which defaults to 1 (it's not compulsory to specify it). Basically, when the system is given a block to store, it will pick shards at random until it has enough to meet the trust limit of 100%. But the write weighting is used as a weighting when making that random choice - a shard with a write weightinh of 2 will get twice as many blocks written to it as a normal block, on average.

So if I have two disks, one of which has 2TiB free and the other of which has 1TiB free, I can give a write weighting of 2 to the first one, and they'll fill so that they're both full at about the same time.

Of course, if I have disks that are now completely full in my vault, I can set their write weighting to 0 and they'll never be picked for writing new blocks to. They'll still be available for reading all the blocks they already have. If I left the write weighting untouched everything would still work, as the write requests failing would cause another shard to be picked for the write, but setting the weighting to 0 would speed things up by stopping the system from trying the write in the first place.
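Here's a rough sketch of that write-time selection, again in Python with invented names; a real backend would also have to handle write failures by falling back to other shards:

    import random

    # Invented per-shard configuration: (trust percentage, write weighting).
    # "full-disk" has a weighting of 0, so it is never picked for new writes.
    shards = {"disk-a": (34, 1), "disk-b": (34, 2), "disk-c": (34, 1), "full-disk": (34, 0)}

    def pick_write_shards(target=100):
        # Pick shards at random, weighted by write weighting, until the
        # chosen shards' trusts sum to at least the target.
        chosen, trust_so_far = [], 0
        candidates = {name: w for name, (_, w) in shards.items() if w > 0}
        while trust_so_far < target and candidates:
            names = list(candidates)
            pick = random.choices(names, weights=[candidates[n] for n in names])[0]
            chosen.append(pick)
            trust_so_far += shards[pick][0]
            del candidates[pick]  # a block is stored at most once per shard
        return chosen

    print(pick_write_shards())  # e.g. ['disk-b', 'disk-c', 'disk-a']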

The third parameter is a read priority, which is also optional and defaults to 1. When a block must be read, the list of shards it's replicated on is looked up, and a shard picked in read priority order. If there are multiple shards with the same read priority, then one is picked at random. If the read fails, we repeat the process (excluding already-tried shards), so the read priority can be used to make sure we consult a fast, nearby, cheap-to-access local disk before trying to use a remote shard, for instance.

By default, all shards have the same read priority, so read requests will be randomly spread across them, sharing the load.

Finally, we have a read weighting, which defaults to 1. When we randomly pick a shard to read from, out of a set of alternatives with the same priority, we weight the random choice with this weighting. So if we have a disk that's twice as fast as another, we can give it twice the weighting, and on a busy system it'll get twice as many reads as the other, spreading the load fairly.
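And a sketch of the read side, assuming a higher priority number means "try this shard first" (again, the names and data layout are mine, not Ugarit's):

    import random

    # Invented read configuration: (read priority, read weighting) per shard.
    read_config = {"local-ssd": (2, 1), "disk-a": (1, 2), "disk-b": (1, 1)}

    def read_order(replicas):
        # Highest-priority band first; within a band, a weighted random
        # ordering spreads load according to the read weighting.
        remaining = [s for s in replicas if s in read_config]
        order = []
        while remaining:
            top = max(read_config[s][0] for s in remaining)
            band = [s for s in remaining if read_config[s][0] == top]
            pick = random.choices(band, weights=[read_config[s][1] for s in band])[0]
            order.append(pick)
            remaining.remove(pick)
        return order  # try shards in this order until a read succeeds

    print(read_order(["disk-a", "disk-b", "local-ssd"]))  # local-ssd always comes first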

I like this approach, since it can be dumbed down to giving defaults for everything - 34% trust (for three-way replication), and all the weightings and priorities at 1 (to spread everything evenly).

Or you can fine-tune it based on details of your available storage shards.

Or you can use extreme values for various special cases.

Got a "memcached backend" that offers fast storage, but will forget things? Give it a 0% trust and a high write weighting, so everything gets written there, but also gets properly replicated to stable storage; and give it a high read priority, so it gets checked first. Et voila, it's working as a cache.

Got 100% reliable storage shards, and just want to "stripe" them together to create a single, larger, one? Give them 100% trust, so every block is only written to one, but use read/write weightings to distribute load between them.

Got a read-only shard, perhaps due to its disk being full, or because you've explicitly copied it onto some protected read-only media (eg, optical) for security reasons? Just set the write weighting to 0, and it'll be there for reading.

Got some crazy combination of the above? Go for it!
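As an illustration, a combined configuration covering those special cases might look something like this hypothetical shard table (the column names and values are invented; Ugarit's real configuration syntax will differ):

    # Hypothetical shard table combining the special cases above.
    #                 trust  write_w  read_pri  read_w
    shards = {
        "memcached":   (  0,       5,        2,      1),  # cache: never counts towards the trust target
        "big-raid":    (100,       2,        1,      2),  # fully trusted, striped with small-raid
        "small-raid":  (100,       1,        1,      1),
        "dvd-archive": ( 34,       0,        1,      1),  # read-only: never picked for new writes
    }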

Also, systems such as HDFS let you specify the replication factor on a per-file basis, requiring more replication for more important files (increasing the number of shard failures required to totally lose them) and to make them more widely available in the cluster (increasing the total read throughput available on that file, useful for small-but-widely-required files such as configuration or reference data). We can do that too! By default, every block written needs to be replicated enough to attain 100% trust - but this could be overridden on a per-block basis. Indeed, you could store a block on every shard by setting a trust target of "infinity"; normally, when given a trust target it can't meet (even with every shard), the system would do its best and emit a warning that the data is in danger, but a trust target of "infinity" should probably suppress that warning, as it can be taken to mean "every shard".

The trust target of a block should be stored along with it, because the system needs to be able to check that blocks are still sufficiently replicated when shards are removed (or lost), and replicate them to new shards until every block has met its trust target again.
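A sketch of that check, with invented names; block_index here is a hypothetical map from block ID to its stored trust target and the set of shards currently holding it:

    import math

    def blocks_needing_replication(block_index, shard_trust):
        # Yield the IDs of blocks whose stored trust target is no longer met.
        for block_id, (target, holders) in block_index.items():
            if target == math.inf:
                # "infinity" means "every shard": under-replicated until a
                # copy exists everywhere, but no warning is warranted.
                if set(holders) != set(shard_trust):
                    yield block_id
            elif sum(shard_trust[s] for s in holders) < target:
                yield block_id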

Tell me what you think. I designed this for Ugarit's replicated storage backend and WOLFRAM replicated storage in ARGON, but I think it could be a useful replication control framework in other projects, too.

The only extension I'm considering is having a write priority as well as a write weighting, just as we do with reads - because that would be a better way of enforcing all writes go to a "fast local cache" backend than just giving it a weighting of 99999999 or something, but I'm not sure it's necessary and four numbers is already a lot. What do you think?

Ugarit archive mode progress (by )

Ugarit's archive mode is getting along nicely. I now have importing from a manifest file (which specifies properties for the import as a whole, plus a list of files to import with their own properties), and basic browsing of the audit trail of an archive in the virtual file system. That includes access to the properties of an import via the virtual "properties.sexpr" file. Note also that lots of import and file properties are automatically added, such as the hostname we import from, the input path for each file, a MIME type deduced from the extension, and so on.

Below the fold is a transcript of it in use, which probably won't mean much to many people...


Recent Ugarit progress (by )

I had some time to work on Ugarit yesterday, which I made good use of.

I really should have worked on raw byte-stream-level performance issues - I did a large extract recently, and it took a whole week - but, having a restricted time window, I caved in and did something fun instead; I started work on archival mode. As a pre-requisite for this, I added the facility to give tags a "type" so we can distinguish archive tags from snapshot tags - thereby preventing embarrassing accidents that end up with a tag pointing to a mixture of snapshot and archive-import objects...

(Not that I didn't think about the performance issues. I have a plan in mind to rearrange the basic bulk-block-shovelling logic to avoid any allocation whatsoever by using a small number of reusable buffers, which should also avoid the copying required when talking to compression/encryption engines written in C.)


Cuddly Science at The British Science Festival (by )

Cuddly Science Puppet show Photo thanks to Fiona Austen

The weekend saw me, Alaric and Jean at the British Science Festival in Birmingham. I was doing the most in-depth version of Cuddly Science yet - everyone who knows me will no doubt now be sick of hearing about Cuddly Science, but just in case, here is the rundown 🙂

I came up with an idea during my science communication course at UWE and have spent the last six months working on it, initially just as a piece of coursework, but I soon realised that this was the thing that would link together all my skill sets. It grew and adapted.

It is a set of puppets, larger-than-life versions of influential scientists, technologists, engineers, maths peeps and medical persons. Initially I focused on Ada - she was a natural choice as we have taken part in every single Ada Lovelace Day so far!

Ada went on a few trips out and about, telling kids about programming computers and her own erratic childhood. But right from the beginning I knew this needed to be bigger; I have a list of puppets that need to be made.

I now have 5 puppets. I only actually had two proper shows prepared for the Science Festival, as I'd planned to repeat one of them, but people decided that they were going to keep coming back to my next show, so I improvised the last show, which was more about the experiments and science games we'd sorted out.

As Cuddly Science is mostly just me, each puppet has their own show with an activity of some sort for the kids to take part in. So Darwin told of how he wasn't very good at school or sitting still and about his discoveries, and this led onto DNA (which wasn't known about in his day!). We then did a little DNA extraction experiment with the kids, which they loved.

Alaric extracting DNA

Ada has a game that Alaric designed and I have done the graphics for, called Robo Bob's Jobs. We want to make a giant version of it as, to our amazement, there were way more than the 30 people we had designed our shows around, and we need something that can be seen from the back etc. The size of the crowd and how busy the library got during the day caused some issues with noise levels, so I want to get a portable PA system as well. I need funding.

We also had some bits from Universe in a Box, which the kids loved and which was the stage for Brahmagupta, a 1500-year-old maths and astronomy dude. I generally entertained the kids between shows with the puppets and also during the activity sessions. We also had colouring sheets which I had drawn - manga scientists with room for the older kids to write down little factoids about the scientists etc...

I want to draw some more of these and maybe have a proper bundle for people to take away with them or download from the web etc...

There were also science crayons for the colouring in - it was very popular and parents were desperate for their kids to have one of each of the pictures.

Science crayons

Those who could hear the shows seemed to really enjoy them and I had so many people coming up to me to say how brilliant it was, how the children really responded to the puppets etc... I did get very nervous for the Ada show, which was strange as I have done that one several times before. There were a lot of people there, but not as many as for the last show, which was improvised so should have been more nerve-wracking!

This is why I am off to do an improv comedy course at the end of the month - I am going to nail those nerves!

The appeal of the puppets was pretty universal and I got people who were just in the library and hoping for a story time - I equipped them with programmes for the rest of the festival, and some of the kids would have played Al's game for hours and hours and had to be shooed away by Ada Puppet.

Ada was termed a princess by many, and at least one parent turned round and said that they hadn't known girls could program. I obviously thought about all of this when deciding what puppets to put in, but was amazed to see the impact straight away. Questions from adults and kids alike - mainly about Ada and Brahmagupta - it was the idea that people like "me" have done big science, tech, etc.... I really did not expect to see it so vividly.

I believe science is for everyone, and this has been a big part of wanting to do science communication and the science art, and it has made me more resolute and determined that Cuddly Science needs to get out there. It may be one of my mad hat schemes; it may just be stupid puppets that me and my mum designed and games my husband made, and a mish-mash of my science education, experience running craft workshops, being in musical theatre, being an artist, poet and children's instructor. It may have gotten its inspiration from all over the place, but Cuddly Science has the chance to make a difference, to help build a better world.

Cuddly Science awaiting at the Birmingham Library

The library and festival volunteers were amazing at looking after us and a chain of people I know from various things came to see me which was very encouraging 🙂 Jeany loved it, especially when I let her set up the Story Steps at the library!

Jean setting up the story steps Jean too tired to continue with the setting up of the story steps

The library itself was pretty epic! And I loved the fact it was connected to the Theatre with poetry on the doors 🙂

The library Birmingham

I even bumped into a fellow poet just outside 🙂

And got to go to dinner with friends and meet their little one and stuff.

More photos of Birmingham:

Jean drinking milk in the Rep; Gold dudes; Gold Dudes planning a topiary train; giant flowers on the library buildings with giant crosses on them; reflective buildings; Brum in silhouette; first proper view of Brum

Jean and Alaric found where they had been doing the custard walking 🙂

Jean and Alaric find where the custard walking had been

And so yeah - Cuddly Science is GO!

A user interface design for a scrolling log viewer with varying levels of importance (by )

Like many people involved with computer programming and systems administration, I spend a lot of time looking at rapidly scrolling logs.

These logs tend to have lines of varying importance in them. As I see it, this falls into two kinds: one is where the lines have a "severity" (ranging from fatal errors down to debugging information); the other is where there's an explicit structure, with headings and subheadings.

Both suffer from a shared problem: important events or top-level headings whoosh past amidst a stream of minutiae, and can be missed. A fatal error message can be obscured by thousands of routine notifications.

What I think might help is a tool that can be shoved in a pipe when viewing such a log, that uses some means (regexps, etc) to classify log lines with a numerical "importance" as appropriate, and then relays them to the output.

However, it will use terminal control sequences to:

  1. Colour the lines according to their importance
  2. Ensure that the most recent entry at each level of importance remains onscreen, unless superseded by a later entry with a higher importance.

The latter deserves some explanation.

To start with, if we just have two levels of importance - ERROR and WARNING, for instance - it means that in a stream of output, as an ERROR scrolls up the screen, when it gets to the top it will "stick" and not scroll off, even while WARNINGs scroll by beneath it.

If a new ERROR appears at the bottom of the screen, it supersedes the old one, which can now disappear - letting the new ERROR scroll up until it hits the top and sticks.

Likewise, if you have three levels - ERROR, WARNING and INFO - then the most recent ERROR and WARNING will be stuck at the top of the screen (the WARNING below the ERROR) while INFOs scroll by. If a new WARNING appears, then the old one will unstick and scroll away until the new WARNING hits the top. If a new ERROR appears, then the old ERROR and WARNING at the top will become unstuck and scroll away until the new ERROR reaches the top.

So the screen is divided into two areas; the stuck things at the top, and the scrolling area at the bottom. Messages always scroll up through the scrolling area as they come, but any message that scrolls off the top will stick in the stuck things area unless there's another message at the same or higher level further down the scrolling area. And the emergence of a message into the bottom of the scrolling area automatically unsticks any message at that, or a less important, level from the stuck area.

That way, you can quickly look at the screen and see a scrolling status display, as well as (for activity logs from servers) the most recent FATAL, ERROR, WARNING, etc. message; or for the kinds of logs generated by long-running batch jobs, which tend to have lots of headings and subheadings, you'll always instantly see the headings/subheadings in effect for the log items you're reading.
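To make the sticking rule concrete, here's a toy model of it in Python, ignoring real terminal control sequences; the screen height and level names are arbitrary choices for the sketch:

    from collections import deque

    SCREEN_HEIGHT = 10
    LEVELS = {"INFO": 0, "WARNING": 1, "ERROR": 2}

    stuck = {}           # importance level -> most recent line that scrolled off
    scrolling = deque()  # (level, text) lines currently in the scrolling area

    def add_line(level_name, text):
        level = LEVELS[level_name]
        # A newly arrived line unsticks anything at its level or below.
        for l in [l for l in stuck if l <= level]:
            del stuck[l]
        scrolling.append((level, text))
        # The scrolling area gets whatever space the stuck area leaves free.
        while len(scrolling) > SCREEN_HEIGHT - len(stuck):
            lvl, old = scrolling.popleft()
            # Stick it only if nothing at the same or higher level follows.
            if not any(l >= lvl for l, _ in scrolling):
                stuck[lvl] = old

    def render():
        for lvl in sorted(stuck, reverse=True):  # stuck area at the top
            print("** " + stuck[lvl])
        for _, text in scrolling:                # then the scrolling area
            print("   " + text)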

This is related somewhat to the idea of having ERRORs and WARNINGs be situations with a beginning and an end (rather than just logged when they arise), such as "being low on disk space"; such a "situation alert" (rather than an event alert, as a single log message is) should linger on-screen somewhere until it's cancelled by the software that raised it emitting a corresponding "situation is over" event. Also related is the idea that event alerts above a certain severity should cause some kind of beeping/flashing to happen, which persists until manually stopped by pushing a button to acknowledge all current alerts. Such facilities can be integrated into the system.

This is relevant for a HYDROGEN console UI and pertinent to my previous thoughts on user interfaces for streams of events and programming interfaces to logging systems.

Creative Commons Attribution-NonCommercial-ShareAlike 2.0 UK: England & Wales