Fixing Ambiguities in Grammars

December 28, 2008 code f-sharp language parsing

For kicks, I’ve been writing a little toy programming language using F#, lex and yacc (fslex and fsyacc, specifically, although everything here should apply to any lex and yacc). If you’ve tried doing this, inevitably you’ve run into the issue of ambiguity in your grammar. This means there’s more than one equally valid way to parse an input string, so the parser can’t decide, gets very upset, and starts writing on its LiveJournal about how unfair life is.

The two areas where you’ll run into problems are associativity and precedence. This is so common that yacc has support built-in to address them, but unfortunately %left and %right don’t always work as well as you’d like. Instead, I’ll show you some techniques I figured out from the 1985 C ANSI reference grammar for how to resolve these using just basic grammar rules. First, definitions:

Associativity

Associativity means that, given a run of the same operator, should it go left to right, or right to left? For example:

1 + 2 + 3

One of those additions needs to be performed first. Should it be (1 + 2) + 3 or 1 + (2 + 3)? Left-associative operators pick the first, and right- associative ones take the other. (Most operators you see in math are left- associative, and moreso, most produce the same result in either order. Subtraction, like 3 - 2 - 1, is one example where associativity does matter.)

Precedence (AKA Order of Operations)

Precedence is the other bugaboo. The question here is, given a few different operations, which are performed first? For example:

1 + 2 * 3

Should that be (1 + 2) * 3, or 1 + (2 * 3)? yacc doesn’t know your Dear Aunt Sally, so you have to tell it. I’m using arithmetic expressions as examples here, but this applies to all facets of expressions and statements: Given result = foo.x + 2 should we do the assignment (=), field access (.), or addition (+)first?

A Toy Language

For our exercise, we’ll consider a little toy language. It supports integer literals, functions that take and return a single integer, and binary operators. Functions are any normal identifier name like foo or rockMeAmadeus. Operators can be any punctuation character like + or &. Statements end with a semicolon (;), and parentheses (( )) can be used for grouping.

Function application is higher precedence than operators, such that foo 1 + 2 should apply 1 to foo first and then add 2 to the result.

Binary operators are left-associative and all have the same precedence. In other words, unlike in arithmetic, + and * have the same order of operations. Here are some examples:

123;                // an int
foo bar 1;          // pass 1 to bar, pass the result to foo
6 + 2 * 3 / 4;      // binary operators (result = 6)
foo 1 + bar 2;      // mix operators and functions
foo (1 + bar 2);    // parentheses for grouping

Pretty simple, but already you can see the trouble spots. In the third example, how does the grammar know to make it left-associative? In the fourth, how does it know to do the function application before the addition?

Lexing and Parsing (briefly)

You and I are both smart enough to figure out how to lex the above. Let’s assume we’ve got a working lexer than has the emits following tokens: INT, FUNCTION, OPERATOR, LPAREN, RPAREN, and SEMI.

Likewise, I’m not going to cover setting up yacc or discuss the AST emitted by it. Assume the action steps here do about what you think, and know that the entry point for our grammar is start: and we shall proceed…

From the Bottom Up

We’re going to build the full grammar for the toy language above with the proper precedence and associativity from the bottom up. We’ll start with integers:

Integers

start:
    | Expression SEMI               { $1 }

Expression:
    | Primary                       { $1 }

Primary:
    | INT                           { Int $1 }

This is a little more complicated than necessary, but it’ll make sense later. Our entrypoint basically says we’re parsing a single expression terminated with a semicolon. An expression in turn, can only be one thing (right now): a “primary expression”. A primary expression is just an integer. Magical. We can now parse 123. Break open the champagne bottles.

Operators

Let’s throw operators in:

Expression:
    | Primary                        { $1 }
    | Expression OPERATOR Expression { Operator ($1, $2, $3) }

Try compiling that and watch a yacc barf on your screen. yacc can handle recursive rules pretty well, but that operator one is a doozy. Not only is it recursive, but it provides two paths to recurse and no guidance for which one to choose.

Congratulations, you just hit the associativity problem. The issue is, given the text 1 + 2 * 3 yacc looks at that rule and says: “Well ‘1 + 2’ is an expression and ’*’ is an operator, and '3’ is an expression, so I could parse that way. But wait, '1’ is also an expression, and ’+’ is an operator, and '2 * 3’ is an expression, so the rule works that way too. Oh noes!

Specifying Associativity

We fix it by simply specifically ruling out one of those cases. If one side of an operator can only be contain an integer, then half of that ambiguity disappears. Behold:

start:
    | Expression SEMI               { $1 }

Expression:
    | Primary                       { $1 }
    | Expression OPERATOR Primary   { Operator ($1, $2, $3) }

Primary:
    | INT                           { Int $1 }

Note that the Operator rule now has Primary on the right of the operator now. What we’ve done is required the Operator rule to only be recursive on one side and force it to cascade down a level on the other. By putting the same-level recursive Expression to the left of the OPERATOR, we’ve chosen to make operators left-associative. Switch it around to Primary OPERATOR Expression and you’ve got a right-associative rule. So that’s how to solve half our ambiguity problems.

Functions

Let’s throw functions in. Consider just this:

Expression:
    | Primary                       { $1 }
    | FUNCTION Expression           { Function ($1, $2) }

Now an expression can either be an integer, or a function applied to an expression. Making the rule recursive like that lets us handle not only foo 123 but chained function calls like foo bar 123.

Pretty good. Now mix it in with our operator rule:

start:
    | Expression SEMI               { $1 }

Expression:
    | Primary                       { $1 }
    | FUNCTION Expression           { Function ($1, $2) }
    | Expression OPERATOR Primary   { Operator ($1, $2, $3) }

Primary:
    | INT                           { Int $1 }

yacc vomit covers your screen. Given foo 1 + 2, yacc says, “Well 'foo’ is a function and '1 + 2’ is an expression so I can pick the function rule. But 'foo 1’ is an expression and '2’ is a primary, so I could pick the operator rule instead. I’m feeling queasy!” Actually, yacc, like a good party guest, will clean up his own vomit and let you proceed with this conflict by trying to guess at your preferred precedence, but it’s unwise to rely on that.

What we need is to tell yacc to always do function application first. foo 1 + 2 should always be read as (foo 1) + 2.

Specifying Precedence

We can fix this in a way similar to how we handled associativity. Remember how I described the rules as a cascading series from complex to simple? Another way to look at those is as explicit order of operations, from last to first. Let’s split out the rules a bit:

start:
    | Expression SEMI               { $1 }

Expression:
    | Primary                       { $1 }
    | Function                      { $1 }
    | Operator                      { $1 }

Operator:
    | Expression OPERATOR Function  { Operator ($1, $2, $3) }

Function:
    | FUNCTION Expression           { Function ($1, $2) }

Primary:
    | INT                           { Int $1 }

That doesn’t actually fix anything for us, yet. The problem is that the Expression, Operator, and Function rules all recursively point to each other. Our neat cascade is trying to flow uphill. The trick, and the fix for our issue, is to simply not allow any rule to reference a rule above it:

start:
    | Expression SEMI               { $1 }

Expression:
    | Primary                       { $1 }
    | Function                      { $1 }
    | Operator                      { $1 }

Operator:
    | Operator OPERATOR Function    { Operator ($1, $2, $3) }

Function:
    | FUNCTION Function             { Function ($1, $2) }

Primary:
    | INT                           { Int $1 }

This is better, but we just broke a bunch of stuff. Since nothing bounces back up to Expression, there isn’t a way to actually terminate anything with an integer anymore. We can no longer parse foo 1 or 1 + 2. The problem is that when, for example, the operator rule references Function, what it means is “you can insert any function expression here”, but we want it to mean “you can insert any function or higher precedence expression here”. For our purposes, that would mean a primary expression: an integer.

We can fix that by making each precendence level have a fall-through case to the next highest level. Thus, anywhere we can use a function, we can also use a primary, and anywhere we can use an operator, we can also use a function. Like so:

start:
    | Expression SEMI               { $1 }

Expression:
    | Operator                      { $1 }

Operator:
    | Function                      { $1 }
    | Operator OPERATOR Function    { Operator ($1, $2, $3) }

Function:
    | Primary                       { $1 }
    | FUNCTION Function             { Function ($1, $2) }

Primary:
    | INT                           { Int $1 }

Congratulations, we’ve fixed our precedence problem. Our parser will now correctly parse foo 1 / bar 2 + 3 as ((foo 1) / (bar 2)) + 3. As a nice side-effect, the order of operations (from last to first) in our language is easily visible simply by scanning through the grammar. This is trivial in this one, but a language like C has something like 15 different precedence levels.

(): “Hey You Forgot Us!”

One last thing: with any expression system, users will also want to be able to use parentheses to override the default order of operations. This is fixed by adding a simple rule at the very end:

start:
    | Expression SEMI               { $1 }

Expression:
    | Operator                      { $1 }

Operator:
    | Function                      { $1 }
    | Operator OPERATOR Function    { Operator ($1, $2, $3) }

Function:
    | Primary                       { $1 }
    | FUNCTION Function             { Function ($1, $2) }

Primary:
    | INT                           { Int $1 }
    | LPAREN Expression RPAREN      { $2 }

All the way at the highest level of precedence, you can now use () to start the cascade back over at the lowest predecence level (Expression). This means that from the outside, a pair of parentheses have the highest precedence, but are able to contain an expression of any precedence, thus letting the cascade loop back on itself predictably.

Interestingly, you’ll note that nothing specific needs to be done in the action step for parentheses. It’s effect is handled totally by the parser and simply affects the shape of the AST generated.

Conclusion

I’m still far from an expert when it comes to yacc. The reason I had to infer this from an old C reference is because I don’t have any books on the subject, so there may be entirely inane things going on here I’m not aware of, but I think this is a sound technique. For your own grammars, it’s still worth considering using the precedence and associativity built into yacc, but if that doesn’t work for you, this should be a viable alternative.