Syntactic sugar in s-expression languages (by alaric)
S-expression based languages (eg, Lisps, and by a broader definition of s-expression, things like Prolog) use a single regular syntax to represent the parse tree of code, rather than having parsing rules for each syntactic construct in the language. For example, some C code might look like:
if (x == 5) {
foo(y, 7);
}
The compiler parses that to produce a tree structure in memory, along the lines of:
if
condition: equality
lhs: x
rhs: 5
body: block
stmt1: function-call
func: foo
arg1: variable
y
arg2: int-constant
7
This can then be analyzed for meaning and thus compiled.
However, languages like Lisp don't have parsing rules for each syntactic construct in the language, above a very basic minimal set ('integer', 'word', etc). The above C code in Lisp would look like:
(if (= x 5)
(foo y 7))
This is parsed by the read
standard library function into a tree structure in memory - a list starting with the symbol if
followed by two sublists, one containing the symbols =
and x
then the number 5
, the second containing the symbols foo
and y
, then the number 7
.
The Lisp language defines conventions that assign meaning to this. A number evaluates to itself. A symbol evaluates to that symbol's bound value in the current lexical environment (eg, the value of the variable with that name). A list's evaluation depends on its first element - if it's something if
then it's handled as a conditional, if it's not recognised as special then it's considered the name of a function and treated as a function call.
The fun thing about this is that you can then define your own syntax. For a start, you can define your own functions, which then stand "alongside" compiler/interpreter-provided things like if
since they both use the same basic "name of thing then list of arguments" structure. Whereas in C, you can't define anything that looks like if
(bar a few limited hacks with the preprocessor). Then there are macros, which allow you to do things beyond what can be done with a function call (if
can't be implemented with a function call, for example, but it can be implemented with a macro). For a thought-provoking guide to what can be done with Common Lisp macros, see On Lisp by Paul Graham; and for an alternative macro system, see my final year project from my degree.
However, the uniform syntax of s-expressions has a downside: everything dissolves into a sea of parenthesis and symbols. Things like the C if
construct with its curly braces stand out in the source code, clearly distinct from the function calls within, while Lisp source code can be confusing. Also, it can sometimes be too verbose. Abbreviations are welcome.
To get around this, you can extend read
to give you abbreviations for awkward forms. Lisp code often features lambda constructs of one argument, which look like (lambda (x) (+ x 1))
. This common structure could be abbreviated to [+ _ 1]
, by having read
expand it upon reading to the underlying form. By defining such abbreviations in read
(and, ideally but not necessarily, putting pattern-matching in write
to automatically abbreviate printed output for the sake of human readers), you can keep the regularity of the underlying s-expressions while smoothing off a few rough edges.
Common Lisp's read
supports the addition of user abbreviations by defining "read macros" - bits of user code that get called by read
when it encounters certain sequences, so they can have customised parsing rules. Scheme's doesn't, but a few abbreviations are built into Scheme's read
by the language spec (namely, things like 'foo
for (quote foo)
and so on).
There's other tricks you could do, too. Perhaps you could define x(a b){c d}
as sugar for (x a b c d)
, and then use it for control flow constructs:
if ((= x 5)) {
(foo y 7)
}
Suddenly, you might have less trouble getting people into Lisp.
However, such a syntax can't be recreated "at the other end" by write
. Unless you give write
a table of control flow structures, giving their name and the number of arguments they have that should go in ()
before the rest go in {}
, which would introduce messiness (think about how user-defined control flow macros would be added to the table, etc), write
would not be able to know when to use the special syntax.
One solution that's occurred to me is to parse the above example into the following s-expression:
(SPECIAL{} 1 if (= x 5)
(foo y 7))
Then have a macro called SPECIAL{}
that just ignores its first argument (the number of subexpressions found in the ()
part) and returns the list of arguments after that - in other words, removing the SPECIAL{}
construct from the semantics of the language, making sure the code still works as (if ...)
.
However, write
would handle the SPECIAL{}
specially - by turning it back into the special syntax, and applying appropriate indentation.
It would be cleaner, in fact, to replace SPECIAL{}
with a more general construct for storing parse-time information that would otherwise be discarded. (parse-metadata data body)
could be a macro that just expands to the body
, with the data
part being a list of parse metadata. The name of the file the body s-expression was parsed from, and the line and column at which the body started and ended. The text of any comment that was elided. Details of any special syntax shortcuts used.
The removal of these parse metadata tags would need to occur early in macro expansion, otherwise they'd cause havoc with pattern matching on macro arguments. Or perhaps they should be preserved, but the macro pattern matching made 'blind' to them, so they survive through into the outputs of the macros when source code is transformed, allowing debuggers to map back from compiled code addresses to positions in source code.
However, that might get messy. Is it all worth the effort?