Extending Syntax from Within a Language
↩ ↪February 13, 2011
A big goal with my little language Magpie is to do as much as possible at the library level and keep the core language small. I’ve been toiling for the past few weeks and I finally moved a huge chunk of Magpie over to the library side: all infix operators are now defined and implemented in Magpie. As far as the core language is concerned, it doesn’t even have operators.
Not only can you now define your own infix operators, including specifying precedence and associativity, but other large chunks of Magpie syntax are now defined at the library level. Take this chunk of (meaningless) code:
def doStuff(a, b)
if a and b then
print("Both " ~ a ~ " and " ~ b ~ " are truthy")
else
print("Their sum is " ~ a + b)
end
end
The operators you see there, +
and ~
(for string concatenation) are both
implemented in Magpie. So is the and
keyword. (It’s particularly
interesting because it doesn’t become just a regular function call since
it needs to short-circuit.) In fact, not even if
is baked into the
language. Or def
for that matter. In fact, about the only bit up there
that is part of the core syntax is print()
.
If this is the kind of thing you’re interested in, I’ll try to walk you through how this is possible, and how it works.
Pratt parsers FTW!
The real magic bit that makes this work is Pratt parsers. Magpie’s core parser, which is written in Java, is an object-oriented version of a top-down operator precedence parser. That sounds intimidating, but it’s actually the simplest parser technique I’ve ever seen. I kind of wish everyone knew about them, but I’m kind of glad no one does because it makes me feel like I have a secret weapon.
Update 2021/10/23: I have a chapter in my book Crafting Interpreters about Pratt parsers too.
Here’s the Java code that parses expressions in Magpie (which means it parses pretty much everything since Magpie doesn’t have statements):
public Expr parseExpression(int stickiness) {
Token token = consume();
PrefixParser prefix = mGrammar.getPrefixParser(token);
Expect.notNull(prefix);
Expr left = prefix.parse(this, token);
while (stickiness < mGrammar.getStickiness(current())) {
token = consume();
InfixParser infix = mGrammar.getInfixParser(token);
left = infix.parse(this, left, token);
}
return left;
}
That’s it, for reals. Precedence, associativity, infix operators, they all get
handled by that little chunk of code. You may be thinking that mGrammars
variable hides all the dirty work. That’s actually pretty simple too. It’s
basically a container for two dictionaries. One maps tokens to prefix parsers,
and one maps them to infix parsers. So, you give it a +
token representing the
plus operator, and it returns a parser object that knows how to parse an
addition expression. Prefix and infix parsers are just objects that implement
one of these dead simple interfaces:
interface PrefixParser {
Expr parse(MagpieParser parser, Token token);
}
interface InfixParser {
Expr parse(MagpieParser parser, Expr left, Token token);
int getStickiness();
}
I won’t go into detail about how these work (I’m trying to throw together a more complete post about just Pratt parsers later), but the important bit is that the grammar being parsed is decoupled from the core parsing code.
Recursive descent parsers have a fixed set of functions representing each
grammar production. If you’re using yacc or bison, you’ll have an
entirely separate offline process that generates code for your grammar and bakes
it in. But with this, our grammar is just a collection of objects. If we toss a
new instance of PrefixParser
or InfixParser
into the grammar dictionary,
we’ve just extended the syntax of the language.
Exposing this to Magpie
This gets us partway there. Now we’ve got a simple architecture that lets us decouple the grammar into separate objects. This is a worthwhile exercise in itself because it makes the parsing code written in Java less monolithic, but you still have to hack Java code and rebuild the interpreter to extend the syntax. Lame.
To fix that, we’ll build a shim. What we need is a Java object that implements
PrefixParser
or InfixParser
but which actually runs Magpie code to do the
parsing. I’ll pick infix here just ‘cause. It looks like this:
private static class MagpieInfixParser extends InfixParser {
public MagpieInfixParser(Interpreter interpreter, Obj parser) {
mInterpreter = interpreter;
mParser = parser;
}
public Expr parse(MagpieParser parser, Expr left, Token token) {
// Wrap the Java parser in a Magpie one.
Obj parserObj = mInterpreter.instantiate(
mInterpreter.getMagpieParserClass(), parser);
Obj exprObj = JavaToMagpie.convert(mInterpreter, left);
Obj tokenObj = JavaToMagpie.convert(mInterpreter, token);
Obj arg = mInterpreter.createTuple(
parserObj, exprObj, tokenObj);
// Let the Magpie code do the parsing.
Obj expr = mInterpreter.invokeMethod(mParser, "parse", arg);
// Marshall it back to Java format.
return MagpieToJava.convertExpr(mInterpreter, expr);
}
public int getStickiness() {
return mInterpreter.getMember(Position.none(),
mParser, "stickiness").asInt();
}
private Interpreter mInterpreter;
private Obj mParser;
}
There are a couple of important bits here. The Obj
class is the core class in
the interpreter that represents a Magpie object. If Object
is any object in
Java land, Obj
is any object in Magpie land.
Interpreter
is exactly what it sounds like: the class that represents a live
Magpie interpreter. It keeps track of variables and provides helper functions
for executing code.
There are two static classes, MagpieToJava
and JavaToMagpie
. Their job is to
marshall objects between the two languages. They convert from Magpie Obj
instances to instances of appropriate “real” Java classes and back.
Given that, the shim’s job is dead simple. When you call parse
on the
MagpieInfixParser
object, it just translates the arguments to Magpie and
invokes a parse
method on its Magpie-side sister object. It gets the result
back (an expression) and marshalls that back to Java and returns it.
The other side of the glass
Over in Magpie, it looks like this:
class AndParser
def parse(parser MagpieParser, left Expression,
token Token -> Expression)
// Ignore a newline after "and".
parser matchToken(TokenType line)
var right = parser parseExpression(stickiness)
{ do
var temp__ = `left
match temp__ true?
case true then `right
else temp__
end
end }
end
get stickiness Int = 30
end
MagpieParser registerInfixParser("and", AndParser new())
The Magpie side of the parser is just a class with a parse
method. The first
two lines parse the rest of the expression. (The left-hand side of the operator
will have already been parsed and gets passed in as left
.) Then the last bit
in curly braces shows another neat feature of Magpie: quotations.
Like the Lisp languages, Magpie lets you treat code as data, and has classes to let you build objects that represent bits of code. Doing that manually is kind of lame though:
CallExpression new(MessageExpression(name: "+"),
TupleExpression new(List of(
IntExpression new(1), IntExpression new(2))))
Quotations let you write that out just like it would appear in code:
{ 1 + 2 }
If you surround an expression in curlies, you get an object representing the
expression back, instead of evaluating it. Inside a quotation, you can unquote
using a backquote character (`
), which is like string interpolation but at
the code level, not textual. That way you can build chunks of code declaratively
and fill in the blanks with dynamically-generated stuff.
So the parser class here returns that last quotation, which contains a match
expression. (Magpie has full destructuring pattern matching, and that
is one of the core primitive parts of the language.)
Finally, the call to registerInfixParser
just takes that object and passes it
over to Java. That’s where we wrap it in the shim class and toss it into the
grammar dictionary. Once this parser is all hooked up, when an and
is
encountered, the parser desugars it to a little pattern match that does the
right thing. In other words, if you type:
happy and know(it)
The parser expands it to:
do
var temp__ = happy
match temp__ true?
case true then know(it)
else temp__
end
end
That looks a little weird, but if you think about it, it does the right thing.
If happy
is truthy, it returns know(it)
. Otherwise, it short-circuits and
returns just happy
.
But that’s too hard!
Now we can extend the syntax from Magpie, which is cool. But I gotta admit,
that AndParser
class is a bit hairy, even with quotations in there. Let’s
say I want to make a ~*
operator that repeats a string a given number of
times. Do I have to write a whole class just to do that?
The answer is “no”, of course. If you just want an infix operator that desugars to a function call, look no further than:
definfix ~* 60 (left String, count Int -> String)
var result = ""
for i = 1 to(count) do result = result ~ left
result
end
Here 60
defines the precedence level so it knows how to parse our new
operator if you’re crazy enough to mix it in with others without parentheses.
Now when the parser encounters:
"Beetlejuice " ~* 3
It transform that to a call to:
~*("Beetlejuice ", 3)
You may be wondering where definfix
comes from. Why, it’s a custom parser
written in Magpie, of course! Turtles all the way down!
OH NOES!
I got all of this working and then ran into a real snag. Let’s say we have some module that defines a new operator:
// In repeat.mag:
definfix ~* 60 (left String, count Int -> String)
var result = ""
for i = 1 to(count) do result = result ~ left
result
end
Then we want to import it and use it in another one:
// In summon-ghost.mag:
import("repeat.mag")
print("Beetlejuice " ~* 3)
Do you see the problem? Let’s walk through how most interpreters (including Magpie) handle this:
The user runs
magpie summon-ghost.mag
.The interpreter starts up.
It reads
summon-ghost.mag
and parses it.It starts interpreting it…
It evaluates the
import
and parses and runsrepeat.mag
.It evaluates the
print("Beetlejuice " ~* 3)
line.
Crap. It parses all of summon-ghost.mag
before it has had a chance to
evaluate the import
and actually define the operator we need to parse. So the
parser is going to hit ~*
and not know what to do with it.
But, a clever solution appears! It turns out that when Magpie is parsing the top level of a file, it returns a list of expressions. In other words, a Magpie program isn’t a single giant block expression, it’s a flat list of distinct ones. (Each expression in that list may be arbitrarily big and nested, of course, but at the top level, it’s a list.)
This gives us an opening to fix this little problem. It can parse and evaluate each expression in a file incrementally. Getting this working was actually a tiny code change and means you can now extend the syntax and then use that extension immediately in the same file. This is perfectly valid:
definfix ~* 60 (left String, count Int -> String)
var result = ""
for i = 1 to(count) do result = result ~ left
result
end
print("Beetlejuice " ~* 3)
The only limitation is that you have to do this at the top-level. Given that most languages don’t let you do any of this, that doesn’t seem like too much to give up. In return, we can make the language dramatically more powerful at building internal DSLs, and the core language itself becomes simpler.