
You Can’t Be A Scientist (by )

Me: *explains science*

Kid: How do you know all this? *slits eyes*

Me: Because I happen to be a trained scientist

Kid: No you're not

Me: Yes I am

Kid: Nooooo, he's the scientist *points to Al*

Al: Nope, I'm an engineer - she's the scientist

Kid: What really?

Me: Yep, I've blasted moon rock with lasers

Kid: *puzzled look*

This was last week. I'm sure I've told you all about the time the toddlers refused to believe I was a geologist, and insisted that my friend, with his shorts and big beard, must be the geologist?

I'm sure you've all read the posts I've made about how I get parents thanking me for having Ada and a list of female scientists and engineers, but this time I don't think the issue was exactly one of gender. I am a mum; I had occupied the space of "mum person", "stay at home parent", and in our society that means a person who doesn't do anything but housework, or maybe a side job in a shop or nursery.

He had no problem with there being female scientist puppets, or with the idea of girls in the group doing science, but mums? Mums don't do or know this kind of stuff; mums, well, they're kind of dumb and reserved and frightened things.

And I have gotten this so much.

I've noticed that, crutches aside, people just assume I don't want to or can't do stuff any more because I've had a baby or two, and it is INFURIATING - more so because, though Alaric suffers from this too, he gets a watered-down version, probably as he is out at "proper work". Stay at home dads I know tend to have issues with people thinking they are lazy, and heaven forbid they try to expand their minds by reading or anything whilst at home - surely they should be fixing everything - what, you mean they cook and clean? They need to go out and get a job.

And on and on and on.

But I am feeling a bit stressed about the bits of science that are not my bit of science, because though I know all the stuff at primary school level, I feel like I am now The Proof - the only proof - that slightly dumpy mummies can do science too.

The Dyslexic Author (by )

Sarah Snell-Pym, Award-Winning Author

This week is Dyslexia Awareness Week; it is also the beginning of an insane writing challenge called NaNoWriMo, which stands for National Novel Writing Month. The idea is that you write a minimum of fifty thousand words in a month, and I have been doing this challenge, and a picture book sister challenge called PiBoIdMo (Picture Book Idea Month), since 2009, which is now scarily long ago.

When I first started the challenge and using the forum I felt very edgy; being severely dyslexic made me hesitate to enter into online written discussions with grammarian monsters - the sort that correct friends' emails. How was I ever going to compare to such writing experts when sometimes I can't spell my own or my kids' names correctly?

Trying to belt out a novel is an amazing experience, but it is also an emotionally fraught one, especially for those low on self-confidence. Self-confidence is a key to success - not the only key, but one of the main three: self-confidence, endurance and improvisation/adaptability. Dyslexics, due to our education system and social attitudes, tend to be high on intelligence and low on that whole confidence thing. To keep going with the writing you kind of need to believe that your story is good enough, that your imagination is fantastic and that everyone is going to want to read it. Many authors go through a cycle of thinking their stuff is amazing and will win a Nobel Prize, then sinking into a pit of despair over how rubbish it is.

But dyslexics have an added edge of nerves, an extra question over their abilities. Not only are there the language structure issues, but there is the widely held idea that if you cannot spell you cannot write. This is wrong.

And it turned out that the way NaNoWriMo works is fantastic for boosting dyslexic writers. It goes something like this: everyone is rushing to get down as many words as they can, and you are encouraged to leave the typos as they are and just keep going. Everyone has typos, inversions of letters, missed letters where they are just typing so fast. Normal people see these and correct them; the dyslexic brain may think that the typo is the correct spelling, and at other times it will see it as wrong - but conversely it might see the correct spelling as wrong and correct it to something incorrect - DOH!

What this means, though, is that when you are sitting in a cafe or pub with a group of writers, your red-line squiggles are no longer an issue - everyone has them. Then there is the concept that you can edit a book with mistakes in it, no matter how many mistakes there are, but if there is no book to begin with you cannot edit it into something. This frees you up to write.

One of the things I also found was that I was increasingly learning language intricacies and histories, and that I could grab the grammar nazis by the proverbial and correct them if and when they started. Grammar is not a fixed thing - look at the history of writing and you find that Shakespeare couldn't spell his own name consistently, that names themselves are pretty fluid, and that grammar is basically just a mark-up language to tell the reader when to breathe when reading aloud.

But can a dyslexic ever be a writer, be a published author, a journalist?

Yes, they can, and when they do they tend to be multi-genre writers - not brilliant for becoming a household name, but good for writing how-to and last-minute books, for switching the brain from science to sports to craft, for being journalists (with patient editors!), for being non-specialist, all-round jacks of all trades. And, increasingly, this is becoming acceptable back in the realm of fiction, thanks to authors such as Neil Gaiman.

So where does that leave me? I have said repeatedly that I must be insane trying to be a writer whilst being very, very badly dyslexic, but, you know what, I wasn't - I find that being dyslexic helps with research for stories and articles, as I can't rely on the words or even the grammar alone. I often have to use both plus the context, meaning that I can often pick up on the big or small picture, the hidden concepts and deeper meanings. It also stops me making stupid assumptions, as I can't take the writing literally, and if it doesn't seem right I am forced to ask, to check. For science writing this is extremely important.

Now before we go any further, dyslexia is not something I can really define; it is just a part of how my brain is wired so I will not say that my writing success is because of, nor in spite of, the dyslexia. It could have stopped me; it was a hurdle, and it has stopped many but mainly because they are told they can't do things because of it. Also, yes, I am contrary and stubborn so when people told me I could not, or that I would find stuff hard, I was determined to show them I could do it - especially when my intelligence itself was under attack.

But would my life achievements have been different without the dyslexia? I kind of think not; I just had to take a different path. And that path has been strange and winding, and this last week I have found myself writing craft workshops, reading my kids' poetry and stories to kids whilst dressed up in ridiculous outfits at various kids' clubs, being asked to perform my page poetry at several events, being asked to run writing days for adults and kids, getting sci-fi stories accepted, writing blog copy, and presenting my project Cuddly Science, which includes script writing and picture book writing and report writing and talk writing.

And that was just this week. This last month included articles on sci-fi/fantasy and science and crafts and gardening, and grant applications; and this last year saw me become a member of the Poetry Society, the British Science Fiction Association and the British Science Writers Association (and yes, that does confuse me, especially as there is also the British Hen Welfare Trust that we got the chickens from!). I have been asked to present awards to school kids, and I completed a Science Communication course - something I dismissed as a "can't" during my undergraduate degree, due to the dyslexic issues.

I now firmly place myself in the role of writer, of author, and so do others. I am finally what I was told I could never be - a dyslexic author. It was not trial-free and it is not yet over; it kind of never will be over, and I'm OK with that.

Back to NaNoWriMo: I find myself actively encouraging dyslexics to write, to take part, and I love wandering around the forums and Facebook pages and Twitter seeing articles like this pop up. I love to be able to say to those who are worried, those who are struggling: don't give up, you can succeed at this. And that doesn't just go for writing; it goes for every aspect of career and life 😀

Folding history (by )

Ugarit is a content-addressed store; the vault is a series of blocks, identified by a hash, that cannot change once they are written.

But logically, they appear as a set of "tags", each of which points either to an archive (a set of files with associated metadata, which can be added to, or the metadata of existing files changed) or to a chain of snapshots of a filesystem, each taken at a point in time.

So in a store where objects cannot be modified, how do we create the illusion of mutable state in these "tags"?
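
As a very rough sketch of the shape of the answer (in Python, purely illustrative - not Ugarit's actual code): the immutable blocks live in a content-addressed store keyed by their hashes, and each tag is the one small piece of mutable state, a pointer holding the hash of its current head block. To "change" a tag's contents, you write new immutable blocks and repoint the tag:

import hashlib

class Vault:
    def __init__(self):
        self.blocks = {}  # hash -> immutable bytes, write-once
        self.tags = {}    # tag name -> hash of the current head block

    def put_block(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        self.blocks[key] = data  # same data always yields the same key
        return key

    def set_tag(self, name: str, key: str) -> None:
        self.tags[name] = key  # the only mutation in the whole system

    def head(self, name: str) -> bytes:
        return self.blocks[self.tags[name]]

vault = Vault()
snap1 = vault.put_block(b"snapshot 1, no parent")
vault.set_tag("backups", snap1)
snap2 = vault.put_block(b"snapshot 2, parent: " + snap1.encode())
vault.set_tag("backups", snap2)  # old blocks remain untouched

Each new head block can reference its predecessor, so the full chain of history stays reachable from the tag even though the tag itself only ever names one block.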

Further progress on Ugarit archival mode (by )

Further to my last post on the matter, I've been working on the basic user interface to accessing archive metadata.

As before, let's do an import to an archive tag in a vault. I've made a manifest file with three MP3s in it - all data that could be extracted from ID3 tags, and I plan to write a tool to automate the generation of manifests by examining files' contents in exactly that manner, but for now I had to hand-write one:

[alaric@ahusai ugarit]$ cat test.manifest
(object "/home/alaric/archive/sorted-music/UNKLE/Psyence Fiction/13 Be There.mp3"
        (title = "Be There")
        (track = 13)
        (artist = "UNKLE")
        (album = "Psyence Fiction"))

(object "/home/alaric/archive/sorted-music/UNKLE/Psyence Fiction/11 Rabbit in Your Headlights.mp3"
        (title = "Rabbit in Your Headlights")
        (track = 11)
        (artist = "UNKLE")
        (album = "Psyence Fiction"))

(object "/home/alaric/archive/sorted-music/Led Zeppelin/Remasters/1-09 Celebration Day.mp3"
        (title = "Celebration Day")
        (track = 9)
        (volume = 1)
        (artist = "Led Zeppelin")
        (album = "Remasters"))
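
For the record, such a manifest-generating tool might look something like this rough Python sketch, using the third-party mutagen library to read ID3 tags - a hypothetical helper of my own, not part of Ugarit:

import sys
from pathlib import Path
from mutagen.easyid3 import EasyID3  # pip install mutagen

def quote(s):
    # S-expression string literal, escaping backslashes and double quotes
    return '"' + str(s).replace('\\', '\\\\').replace('"', '\\"') + '"'

def manifest_entry(path):
    tags = EasyID3(str(path))
    def first(key):
        values = tags.get(key, [])
        return values[0] if values else None
    lines = ['(object ' + quote(path)]
    if first("title"):
        lines.append('        (title = %s)' % quote(first("title")))
    if first("tracknumber"):
        # ID3 track numbers can look like "13" or "13/24"; keep the number
        lines.append('        (track = %s)' % first("tracknumber").split("/")[0])
    if first("artist"):
        lines.append('        (artist = %s)' % quote(first("artist")))
    if first("album"):
        lines.append('        (album = %s)' % quote(first("album")))
    return "\n".join(lines) + ')'

if __name__ == "__main__":
    for mp3 in sorted(Path(sys.argv[1]).rglob("*.mp3")):
        print(manifest_entry(mp3))
        print()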

As before, I import it, loading the files into the content-addressable storage of the vault, automatically deduplicating, and possibly storing the data on a cluster of remote servers (although in this case, I'm just using a local vault). This was done with Ugarit revision [80b324f3af]:

[alaric@ahusai ugarit]$ ugarit import test.conf music test.manifest
Loading manifest file test.manifest...
Importing from test.manifest to tag music...
Importing /home/alaric/archive/sorted-music/Led Zeppelin/Remasters/1-09 Celebration Day.mp3...
...imported with key 4d64e4650333741cb56c3e6a785b6de4d23324cb1055e529
Importing /home/alaric/archive/sorted-music/UNKLE/Psyence Fiction/11 Rabbit in Your Headlights.mp3...
...imported with key 370bee7debb458357a2b879014d4abbeb409215ed269c1c6
Importing /home/alaric/archive/sorted-music/UNKLE/Psyence Fiction/13 Be There.mp3...
...imported with key 39df8bafd530a66614ad60ab323033b1385cdd842528dbd2
Committing import...
Imported successfully to tag music with import key ac26354ccfb0530109932c1aaddd414b59d4394d44ec43cd
Written 16MiB to the vault in 24 blocks, and reused 0B in 1 blocks (before compression)

Now that it's in, we can query the metadata. Firstly, let's see what properties are available - a combination of the ones we wrote in the manifest and automatically-generated ones, such as a MIME type and the original import path:

[alaric@ahusai ugarit]$ ugarit search-props test.conf music
album
artist
filename
import-path
mime-type
title
track
volume

Let's see what values there are for the "artist" property:

[alaric@ahusai ugarit]$ ugarit search-values test.conf music artist
UNKLE
Led Zeppelin

(they're sorted by popularity, and we have two UNKLE tracks, so that comes first)
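
That popularity ordering is easy to sketch - here in Python, illustrative only, not Ugarit's implementation: count how many objects carry each value of the property, and list the most common first.

from collections import Counter

# Toy metadata: one dict of properties per archived object
objects = [{"artist": "UNKLE"}, {"artist": "UNKLE"}, {"artist": "Led Zeppelin"}]

counts = Counter(obj["artist"] for obj in objects if "artist" in obj)
for value, _count in counts.most_common():
    print(value)  # UNKLE first (two tracks), then Led Zeppelin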

Let's see what UNKLE albums we have, by filtering for objects with an artist property of "UNKLE" and asking what values of the "album" property are available:

[alaric@ahusai ugarit]$ ugarit search-values test.conf music '(= ($ artist) "UNKLE")' album
Psyence Fiction
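
The filter language is just s-expressions: ($ artist) looks up a property of the object, and (= a b) compares two values. A toy evaluator shows the idea - my own Python sketch, with expressions written as nested tuples rather than parsed text, and the "and"/"or" operators are assumptions on my part:

def evaluate(expr, props):
    """Evaluate a filter expression against an object's property dict."""
    if not isinstance(expr, tuple):
        return expr  # a literal value, such as "UNKLE" or 13
    op, *args = expr
    if op == "$":    # ($ artist): property lookup
        return props.get(args[0])
    if op == "=":    # (= a b): equality test
        return evaluate(args[0], props) == evaluate(args[1], props)
    if op == "and":
        return all(evaluate(a, props) for a in args)
    if op == "or":
        return any(evaluate(a, props) for a in args)
    raise ValueError("unknown operator: %r" % op)

track = {"artist": "UNKLE", "album": "Psyence Fiction", "track": 13}
assert evaluate(("=", ("$", "artist"), "UNKLE"), track)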

Let's see what we know about music by UNKLE:

[alaric@ahusai ugarit]$ ugarit search test.conf music '(= ($ artist) "UNKLE")'
object 39df8bafd530a66614ad60ab323033b1385cdd842528dbd2
    (album = "Psyence Fiction")
    (artist = "UNKLE")
    (filename = "13 Be There.mp3")
    (import-path = "/home/alaric/archive/sorted-music/UNKLE/Psyence Fiction/13 Be There.mp3")
    (mime-type = "audio/mpeg")
    (title = "Be There")
    (track = 13)
object 370bee7debb458357a2b879014d4abbeb409215ed269c1c6
    (album = "Psyence Fiction")
    (artist = "UNKLE")
    (filename = "11 Rabbit in Your Headlights.mp3")
    (import-path = "/home/alaric/archive/sorted-music/UNKLE/Psyence Fiction/11 Rabbit in Your Headlights.mp3")
    (mime-type = "audio/mpeg")
    (title = "Rabbit in Your Headlights")
    (track = 11)

Ok, let's listen to all our music by UNKLE (the extra "keys" parameter to the search command says to just output the object keys, one per line, and the "archive-stream" command streams the contents of an archived file to standard output):

[alaric@ahusai ugarit]$ for i in `ugarit search test.conf music '(= ($ artist) "UNKLE")' keys`;
do ugarit archive-stream test.conf music $i | mpg123 -;
done

...music by UNKLE plays...

We're slowly moving towards having a usable and useful archival filesystem, backed on a modular content-addressable storage system! Isn't that neat? Of course, it's not amazingly useful as it stands - at first sight, it's like a very crude version of the browser found in any modern music collection management app; but this is the seed of something much more interesting. For a start, it can categorise files using any user-defined schema. The backend storage can be encrypted, and accessed remotely over a network (and, in future, replicated over a cluster, or mirrored between your laptop and a home fileserver, and automatically synchronised when they're connected). The same storage can be used to store backup snapshots as well as archives, and if a file exists in any combination of archives and snapshots, then only one copy of it will be stored (or need uploading, even); most files in an archive will have started off in a backed-up directory tree, or will be extracted into one.

There are many interesting use cases for Ugarit, but my personal one is to have a fault-tolerant vault of all the data that matters to me, neatly organised so I can find things quickly, and so I can access things from different locations (even when offline). Rather than having files scattered over different disks on different machines, and having to move things around to make space, and remember where they are, I can add more disks to the vault when I need more capacity, and have Ugarit manage everything for me. With the amount of data I manage, that'll be a great weight off my mind!

Configuring replication (by )

Storing all your data on one disk, or even inside one computer, is a risky thing to do. Anything stored in only one, small, physical location is all too easily destroyed by flood, fire, idiots, or deliberate action; and any one electronic device is prone to failure, as its continued functioning depends on the functioning of many tiny components that are not very easily replaced.

So it's sensible to store multiple copies, ideally in physically remote locations.

One way of doing this is by taking backups; this involves taking a copy of the data and putting it into a special storage system, such as compressed files on another disk, magnetic tape, a Ugarit vault, etc.

If the original data is lost, the backed-up data can't generally be used as-is, but has to be restored from the backup storage.

Another way is by replicating the data, which means storing multiple, equivalent, copies. Any of those copies can then be used to read the data, which is useful - there's no special restore process to get the data back, and if you have lots of requests to read the data, you can service those requests from your nearest copy of it (reducing delays and long-distance communication costs). Or you can spread the read workload across multiple copies in order to increase your total throughput.

Replication provides a better quality of service, but it has a downside; as all the copies are equally important, you can't use cheaper, slower, more compact storage methods for your extra copies, as you can with backups onto slower disks or tapes.

And then there are hybrid systems, perhaps where you have a primary copy and replicate onto slower disks as a "backup", while only using the primary copy for day-to-day use; if it fails then you switch to the slower "backup replica", and tolerate slower service until a new primary copy is made.

Traditionally, replicated storage systems such as HDFS require the administrator to specify a "replication factor", either system-wide or on a per-file basis. This is the number of replicas that must be made of the file. Two is the minimum to actually get any replication, but three is popular - if one replica is lost, then you still have two replicas to keep you going while you rebuild the missing replica, meaning you have to be unlucky and have two failures in quick succession before you're down to a single copy of anything.

However, this is a crude and nasty way of controlling replication. Needless to say, I've been considering how to configure replication of blocks within a Ugarit vault, and have designed a much fancier way.

For Ugarit replication, I want to cobble together all sorts of disks to make one large vault. I want to replicate data between disks to protect me against disk failures, and to make it possible to grow the vault by adding more disks, rather than having to transfer a single monolithic vault onto a larger disk when it gets full.

But as I'm a cheapskate, I'll be dealing with disks of varying reliability, capacity, and performance. So how do I control replication in such a complex, heterogeneous, environment?

What I've decided is to give each "shard" of the vault four configurable parameters.

The most interesting one is the "trust". This is a percentage. For a block to be considered sufficiently replicated, copies of it must exist on enough shards that the sum of the trusts of those shards is greater than or equal to 100%.

So a simple system with identical disks, where I want to replicate everything three times, can be had by giving each disk a trust of 34%; any three of them will sum to 102%, so every block will be copied three times.

But disks I trust less could be given a trust of 20%, requiring five copies if a block is stored only on such disks - or some combination of good and less-good disks.

That allows for simple homogeneous configurations, as well as complex heterogeneous ones, with a simple and intuitive configuration parameter. Nice!
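
In code, the rule is tiny - a Python sketch of the rule as just described, not Ugarit's implementation, with made-up shard records:

def sufficiently_replicated(holding_shards, trust_target=100):
    """A block is safe once the trusts of the shards holding it reach the target."""
    return sum(shard["trust"] for shard in holding_shards) >= trust_target

shards = [{"name": "disk-a", "trust": 34},
          {"name": "disk-b", "trust": 34},
          {"name": "disk-c", "trust": 34}]
assert sufficiently_replicated(shards)          # 34 * 3 = 102 >= 100
assert not sufficiently_replicated(shards[:2])  # 68 < 100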

The second is "write weighting". This is a dimensionless number, which defaults to 1 (it's not compulsory to specify it). Basically, when the system is given a block to store, it will pick shards at random until it has enough to meet the trust target of 100%. But the write weighting is used as a weighting when making that random choice - a shard with a write weighting of 2 will get twice as many blocks written to it as a normal shard, on average.

So if I have two disks, one of which has 2TiB free and the other of which has 1TiB free, I can give a write weighting of 2 to the first one, and they'll fill so that they're both full at about the same time.

Of course, if I have disks that are now completely full in my vault, I can set their write weighting to 0 and they'll never be picked for writing new blocks to. They'll still be available for reading all the blocks they already have. If I left the write weighting untouched everything would still work, as the write requests failing would cause another shard to be picked for the write, but setting the weighting to 0 would speed things up by stopping the system from trying the write in the first place.
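
Putting trust and write weighting together, write placement might look like this (an illustrative Python sketch under my own assumptions about the configuration fields, not Ugarit's code); note that the trust target is a parameter, which anticipates the per-block targets discussed below:

import random

def place_block(shards, trust_target=100):
    """Pick shards at random, weighted by write weighting, until the target is met."""
    candidates = [s for s in shards if s["write_weight"] > 0]
    chosen, trust = [], 0
    while trust < trust_target and candidates:
        pick = random.choices(
            candidates, weights=[s["write_weight"] for s in candidates])[0]
        candidates.remove(pick)
        chosen.append(pick)
        trust += pick["trust"]
    if trust < trust_target:
        print("warning: trust target unreachable; block is under-replicated")
    return chosen

shards = [{"name": "big-disk", "trust": 34, "write_weight": 2},
          {"name": "small-disk", "trust": 34, "write_weight": 1},
          {"name": "old-disk", "trust": 34, "write_weight": 1},
          {"name": "full-disk", "trust": 34, "write_weight": 0}]
print([s["name"] for s in place_block(shards)])  # three of the writable disks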

The third parameter is a read priority, which is also optional and defaults to 1. When a block must be read, the list of shards it's replicated on is looked up, and a shard picked in read priority order. If there are multiple shards with the same read priority, then one is picked at random. If the read fails, we repeat the process (excluding already-tried shards), so the read priority can be used to make sure we consult a fast, nearby, cheap-to-access local disk before trying to use a remote shard, for instance.

By default, all shards have the same read priority, so read requests will be randomly spread across them, sharing the load.

Finally, we have a read weighting, which defaults to 1. When we randomly pick a shard to read from, out of a set of alternatives with the same priority, we weight the random choice with this weighting. So if we have a disk that's twice as fast as another, we can give it twice the weighting, and on a busy system it'll get twice as many reads as the other, spreading the load fairly.
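
Reads, then, might be ordered like this (again an illustrative Python sketch, with field names of my own invention): group the shards holding the block by read priority, take the highest-priority group first, and pick within a group at random, weighted by read weighting; on failure, just move on to the next shard the generator yields.

import random

def read_order(holding_shards):
    """Yield shards in the order a read should try them."""
    remaining = list(holding_shards)
    while remaining:
        top = max(s["read_priority"] for s in remaining)
        group = [s for s in remaining if s["read_priority"] == top]
        pick = random.choices(
            group, weights=[s["read_weight"] for s in group])[0]
        remaining.remove(pick)
        yield pick

shards = [{"name": "local-ssd", "read_priority": 2, "read_weight": 1},
          {"name": "nas-1", "read_priority": 1, "read_weight": 2},
          {"name": "nas-2", "read_priority": 1, "read_weight": 1}]
print([s["name"] for s in read_order(shards)])  # local-ssd always tried first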

I like this approach, since it can be dumbed down to giving defaults for everything - 33% trust (for a three-way replication), and all the weightings and priorities at 1 (to spread everything evenly).

Or you can fine-tune it based on details of your available storage shards.

Or you can use extreme values for various special cases.

Got a "memcached backend" that offers fast storage, but will forget things? Give it a 0% trust and a high write weighting, so everything gets written there, but also gets properly replicated to stable storage; and give it a high read priority, so it gets checked first. Et voilà: it's working as a cache.

Got 100% reliable storage shards, and just want to "stripe" them together to create a single, larger, one? Give them 100% trust, so every block is only written to one, but use read/write weightings to distribute load between them.

Got a read-only shard, perhaps due to its disk being full, or because you've explicitly copied it onto some protected read-only media (eg, optical) for security reasons? Just set the write weighting to 0, and it'll be there for reading.

Got some crazy combination of the above? Go for it!

Also, systems such as HDFS let you specify the replication factor on a per-file basis, requiring more replication for more important files (increasing the number of shard failures required to totally lose them) and to make them more widely available in the cluster (increasing the total read throughput available on that file, useful for small-but-widely-required files such as configuration or reference data). We can do that too! By default, every block written needs to be replicated enough to attain 100% trust - but this could be overridden on a per-block basis. Indeed, you could store a block on every shard by setting a trust target of "infinity"; normally, when given a trust target it can't meet (even with every shard), the system would do its best and emit a warning that the block is under-replicated, but a trust target of "infinity" should probably suppress that warning, as it can be taken to mean "every shard".

The trust target of a block should be stored along with it, because the system needs to be able to check that blocks are still sufficiently replicated when shards are removed (or lost), and replicate them to new shards until every block has met its trust target again.

Tell me what you think. I designed this for Ugarit's replicated storage backend and WOLFRAM replicated storage in ARGON, but I think it could be a useful replication control framework in other projects, too.

The only extension I'm considering is having a write priority as well as a write weighting, just as we do with reads - because that would be a better way of enforcing all writes go to a "fast local cache" backend than just giving it a weighting of 99999999 or something, but I'm not sure it's necessary and four numbers is already a lot. What do you think?
