HYDROGEN: Extensibility (by alaric)
Word size
The simplest thing to handle is the word size. Different CPUs have a different 'native' size of number they handle; these days, the choices are usually 8, 16, 32 or 64 bits (although 16 is getting pretty rare).
C, and many other languages that have grown up with C held high in the designer's mind, handle this issue rather naffly. In C, there is a type called int which is meant to map to the platform's native word size for arithmetic operations (the native word size for memory addresses can be different, as described below). There's also support for smaller and larger integers, with short int and long int. Let's not get into char...
But the problem is that these definitions are too vague. short int may not actually be any smaller than int, although on most 32-bit platforms it's a 16-bit integer; and long int may not be any bigger than int (on 32-bit platforms, long int is usually 32 bits as well, as a legacy from 16-bit systems that had a 16-bit short int and int but a 32-bit long int).
An int has to be at least 16 bits, so you can never really be sure if you can put a number above 32767 into one safely. Now, this can be sort of OK - if you have a system that tracks some resource (perhaps books being checked in and out of a library), you can use a type like int to identify instances of the resource, and then on a 32-bit system you'll be able to deal with more of them than on a 16-bit system, so it sort of organically scales as the system's capabilities improve. But, these days, interoperability is a concern; things like file formats and network protocols need specified widths of fields in them. So C has added types like uint32_t (unsigned 32-bit integer); however, they're nasty names that are unpleasant to type, so most people still use int and assume it'll be 32 bits.
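As a rough illustration of the difference between the vague types and the fixed-width ones, here is a minimal C sketch. The names STORAGE_BITS, fits_in_int, book_record and check_minimum_guarantees are made up for this example; only the standard headers and the stdint.h types are real.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Storage width of a type, in bits, on whatever platform compiles this. */
#define STORAGE_BITS(T) (sizeof(T) * CHAR_BIT)

/* Returns 1 if value fits in a plain int on this platform, 0 otherwise.
   Only -32767..32767 is guaranteed to fit on every conforming platform. */
int fits_in_int(long long value)
{
    return value >= INT_MIN && value <= INT_MAX;
}

/* Hypothetical record for the library example: the field widths are
   fixed by the (imaginary) file format, not by the platform. */
struct book_record {
    uint32_t book_id;
    uint16_t branch_id;
    uint16_t flags;
};

/* The only things the language promises about the classic types. */
void check_minimum_guarantees(void)
{
    assert(STORAGE_BITS(short) >= 16);
    assert(STORAGE_BITS(int) >= 16);
    assert(STORAGE_BITS(long) >= 32);
    assert(STORAGE_BITS(uint32_t) == 32); /* exact, wherever the type exists */
    assert(fits_in_int(32767));           /* true everywhere */
}
```

Whether fits_in_int(40000) is true depends entirely on the platform, which is exactly the vagueness complained about above; the width of book_id does not.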
So in the HYDROGEN virtual machine, we primarily define our integer types in terms of certain bit widths. A HYDROGEN system must support u8, s8, u16, s16, u32 and s32 for signed and unsigned versions of the three most common integer sizes; for situations where we want to talk about a value without assigning an interpretation such as "unsigned integer" to it, we can refer to it as a "cell", so we also have c8, c16, and c32. When we do want to use machine words for things, we just drop the size, and so have c, s, and u: respectively a single uninterpreted machine word, one interpreted as a signed integer, and one interpreted as an unsigned integer.
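To make the naming scheme concrete, here is one way the types might be spelled if a HYDROGEN implementation were hosted in C. This is only a sketch: the hyd_ prefix, the HYD_BITS macro and the choice of unsigned int as the machine word are assumptions of this example, not part of HYDROGEN. The "least" types from stdint.h are used because they promise a minimum width rather than an exact one.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Sized types: unsigned, signed, and uninterpreted "cell" flavours. */
typedef uint_least8_t  hyd_u8;
typedef int_least8_t   hyd_s8;
typedef uint_least8_t  hyd_c8;
typedef uint_least16_t hyd_u16;
typedef int_least16_t  hyd_s16;
typedef uint_least16_t hyd_c16;
typedef uint_least32_t hyd_u32;
typedef int_least32_t  hyd_s32;
typedef uint_least32_t hyd_c32;

/* Machine-word types: the same names with the size dropped.
   Assumption: the host's unsigned int / int stand in for the word. */
typedef unsigned int hyd_u;
typedef int          hyd_s;
typedef unsigned int hyd_c;

/* Storage width of a type in bits. */
#define HYD_BITS(T) (sizeof(T) * CHAR_BIT)

/* Every sized type is at least as wide as its name says. */
void hyd_check_widths(void)
{
    assert(HYD_BITS(hyd_u8) >= 8 && HYD_BITS(hyd_s8) >= 8);
    assert(HYD_BITS(hyd_u16) >= 16 && HYD_BITS(hyd_s16) >= 16);
    assert(HYD_BITS(hyd_u32) >= 32 && HYD_BITS(hyd_s32) >= 32);
    assert(HYD_BITS(hyd_c) >= 16); /* a C int is never narrower than 16 bits */
}
```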
We have a character type, ch. We rather avoid the issue of what the character encoding is, and how large a character might be; I am desperately trying to avoid the mess that Unicode opened up, but that's a topic for another blog post (hint: there's no easy way to represent a Unicode character - Unicode is defined in terms of code points, several of which may make a character...). So defining the semantics of characters is an open topic, but for now, all we actually need for the core of HYDROGEN is for them to be fixed-width comparable quantities, so we can store and compare strings.
It's allowed for a value like a u8 to actually have more bits of precision than eight. u8 really means "At least eight bits, interpreted as an unsigned integer". Indeed, on a word-addressed implementation where accessing individual bytes of a word is expensive, a u8 might be uniformly implemented as a 32-bit word, both in memory and in registers. Rather than having a plethora of words to handle all these different types, it's expected that the arithmetic support for all sizes smaller than or equal to the machine word size be actually the same primitive operations; as we require a cell to be at least sixteen bits at the moment, this means that u8+, s8+, u16+, s16+, u+, and s+ are pretty much guaranteed to all be words for the same primitive (indeed, as addition is the same for unsigned and twos-complement signed integers, u...+ and s...+ will always be the same anyway). The words that transfer values of types like u8, s8 and c8 to and from memory, and specify how large they are in memory, may work with individual bytes or might round them up to entire words; the only time you're guaranteed to get only eight bits is when you use the u8normalise word (which takes the u8 value at the top of the stack and makes sure it's in the range 0..255), or when using special memory access operations for reading and writing "packed structures" for interoperability with other systems.
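The shared-primitive idea can be sketched in C as follows. This assumes a 32-bit machine word and twos-complement signed values, and assumes u8normalise works by masking off the high bits (the text only says it brings the value into range); the names cell, cell_add, u8_normalise and the conversion helpers are invented for the example and are not HYDROGEN's actual implementation.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t cell; /* assumption: a 32-bit machine word */

/* One primitive can back u8+, s8+, u16+, s16+, u+ and s+:
   plain addition modulo 2^32. */
cell cell_add(cell a, cell b)
{
    return a + b;
}

/* Sketch of u8normalise: force the value into 0..255 by masking. */
cell u8_normalise(cell v)
{
    return v & 0xFFu;
}

/* Store a signed value in a cell (well-defined modulo conversion). */
cell cell_from_signed(int32_t x)
{
    return (cell)x;
}

/* Read a cell back as a twos-complement signed value, avoiding the
   implementation-defined unsigned-to-signed cast. */
int32_t cell_as_signed(cell v)
{
    if (v >= 0x80000000u)
        return (int32_t)(v - 0x80000000u) - 0x7FFFFFFF - 1;
    return (int32_t)v;
}

/* Usage: the same add serves signed and unsigned interpretations. */
void shared_primitive_demo(void)
{
    cell sum = cell_add(cell_from_signed(-3), cell_from_signed(5));
    assert(cell_as_signed(sum) == 2);

    /* A "u8" held in a full word can drift above 255 until normalised. */
    cell wide = cell_add(250u, 10u);
    assert(wide == 260u);
    assert(u8_normalise(wide) == 4u);
}
```

The point of the demo is that nothing in cell_add knows whether its operands are signed, unsigned, or nominally eight bits wide; only the explicit normalise step narrows the result.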