Building Expression Parsers
This page builds up expression parsing from the ground up. For a less discussion-driven explanation, see precedence, as well as chain combinators and heterogeneous chain combinators for more specific use-cases.
Expression parsing is a ubiquitous problem in parsing. It concerns the correct reading of operators and values, which are usually organised by precedence and fixity. For the purposes of this page, a fixity describes both where an operator sits and how it associates: infix-left, infix-right, prefix, and postfix. For example, here is a grammar for reading simple expressions consisting of numbers, brackets, addition, subtraction, and multiplication with the standard precedences and left associativity.
<number> ::= <digit>+
<expr> ::= <expr> '+' <term> | <expr> '-' <term> | <term>
<term> ::= <term> '*' <atom> | <atom>
<atom> ::= '(' <expr> ')' | <number>
Here, the precedence is encoded by the fact that <expr> contains <term>, but <term>
only contains <atom>. The <expr> on the left of the '+' means that addition
associates to the left, which is what we expect.
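As a quick hand-worked check, the only derivation this grammar allows for "6-2-1" groups the subtraction to the left (brackets added just to show the grouping, and a few steps elided):
<expr> -> <expr> '-' <term>                    -- the recursion sits to the left of the operator
       -> (<expr> '-' <term>) '-' <term>        -- expanding the left <expr> again
       -> ((<term>) '-' <term>) '-' <term>       -- i.e. "6-2-1" reads as (6 - 2) - 1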
1 The Problem with Left-Recursion
For a first attempt, let's directly translate this grammar into Parsley (for now, we'll
parse straight into an Int: behold, the magic of combinators!):
import parsley.Parsley, Parsley.atomic
import parsley.character.digit
import parsley.syntax.character.charLift
import parsley.syntax.zipped._
// Standard number parser
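// foldLeft1 reads one or more digits, accumulating left-to-right: "123" becomes ((0*10 + 1)*10 + 2)*10 + 3 = 123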
val number = digit.foldLeft1[Int](0)((n, d) => n * 10 + d.asDigit)
lazy val expr: Parsley[Int] =
atomic((expr <* '+', term).zipped(_ + _)) |
(expr <* '-', term).zipped(_ - _) |
term
lazy val term: Parsley[Int] = (term <* '*', atom).zipped(_ * _) | atom
lazy val atom: Parsley[Int] = '(' *> expr <* ')' | number
This parser has a few glaring issues: for a start, the atomic is causing excessive backtracking!
While there are ways to improve this, the real problem here is
the left-recursion. Imagine evaluating this parser: first you look at expr, and then
your first task is to evaluate expr! In fact, due to the strictness of Parsley's combinators, this example breaks before the parser even runs: on Scala 2, it will throw a StackOverflowError at runtime while constructing the parser, and on Scala 3, the compiler
will report an infinitely recursive definition for expr and term. The solution is to turn to the chain combinators, but before we do that, let's
eliminate the atomics and refactor it a little to make the transition less jarring:
import parsley.Parsley, Parsley.atomic
import parsley.character.digit
import parsley.syntax.character.charLift
// Standard number parser
val number = digit.foldLeft1[Int](0)((n, d) => n * 10 + d.asDigit)
val add = (y: Int) => (x: Int) => x + y
val sub = (y: Int) => (x: Int) => x - y
val mul = (y: Int) => (x: Int) => x * y
lazy val expr: Parsley[Int] =
atomic(expr <**> ('+'.as(add) <*> term)) |
expr <**> ('-'.as(sub) <*> term) |
term
lazy val term: Parsley[Int] = term <**> ('*'.as(mul) <*> atom) | atom
lazy val atom: Parsley[Int] = '(' ~> expr <~ ')' | number
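Before unpacking this transformation, it may help to track the types through the '+' alternative (these annotations are just for illustration):
'+'.as(add)                      // Parsley[Int => Int => Int]: read a '+', producing the (flipped, curried) addition
'+'.as(add) <*> term             // Parsley[Int => Int]: supply the right operand, leaving a function awaiting the left
expr <**> ('+'.as(add) <*> term) // Parsley[Int]: finally apply that function to the left operand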
The first step is to perform the translation from the previous page: make the operator's
result a function and apply it (flipped) to the right-hand side (with <*>) and then to the left
(with <**>). Now, in this form, hopefully you can see that we've exposed the leading expr so that
it's on its own, which means we can factor a bit more:
lazy val expr: Parsley[Int] =
expr <**> (('+'.as(add) <*> term) | ('-'.as(sub) <*> term)) |
term
Now we've eliminated the "backtracking" (if only we could get that far!), but we can right-factor
the | too, obtaining the simplest form of the parser:
lazy val expr: Parsley[Int] =
expr <**> (('+'.as(add) | '-'.as(sub)) <*> term) |
term
Now, at this point, I could demonstrate how to left-factor this grammar and produce something that
is right recursive whilst preserving left-associativity. However, there isn't much point in doing
this, as now we are in a good position to use the chain.left1 combinator, which perfectly embodies
the translation.
2 Using expr.chain
The left-recursion problem is not a new one: the parser combinator community has known about it
for a long time. For parser combinator libraries, it is necessary to factor the left-recursion out of the grammar.
Thankfully, that factoring can itself be encoded nicely as a combinator: this is
embodied by the chain family. Here is the same example as before, but fixed using chain.left1:
import parsley.Parsley
import parsley.character.digit
import parsley.syntax.character.charLift
import parsley.expr.chain
// Standard number parser
val number = digit.foldLeft1[Int](0)((n, d) => n * 10 + d.asDigit)
val add = (x: Int, y: Int) => x + y
val sub = (x: Int, y: Int) => x - y
// chain.left1[A](p: Parsley[A])(op: Parsley[(A, A) => A]): Parsley[A]
lazy val expr: Parsley[Int] = chain.left1(term)('+'.as(add) | '-'.as(sub))
lazy val term = chain.left1(atom)('*' as (_ * _))
lazy val atom = '(' ~> expr <~ ')' | number
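As a quick sanity check (results worked out by hand), running the parser with the parse method shows it both evaluates correctly and associates to the left:
expr.parse("6*(2+3)")  // Success(30)
expr.parse("10-2-3")   // Success(5), i.e. (10-2)-3 rather than 10-(2-3)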
The structure of the parser is roughly the same, however now you'll notice that expr and term
are no longer self-recursive, and neither term nor atom needs an explicit type; the remaining lazy keywords are only there to tie the recursive knot between the three definitions, not because any of them refers to itself.
To make the relationship between what we had before and what we have now very clear, observe that
the transformation from explicit recursion to chains follows these shapes:
self <**> (op <*> next) | next == chain.left1(next)(op) // flipped op
self <**> op <*> next | next == chain.left1(next)(op) // normal op
next <**> (op <*> self </> identity) == chain.right1(next)(op) // no backtracking, flipped
atomic(next <**> op <*> self) | next == chain.right1(next)(op) // backtracking, normal op
In this parser, the nesting of the chains dictates the precedence order (again, terms are found inside
expressions and atoms inside terms). Since addition and subtraction are on the same level, they
belong in the same chain. The left1 indicates that the operators are left-associative and that there
must be at least one of the next layer down. There are also chain.right1, chain.prefix, and
chain.postfix combinators (one of these is sketched at the end of this section). Building these parsers is fairly mechanical, however, and it is
tiresome to keep finding new names for new layers of the precedence table. When more than one chain
interacts, expr.precedence comes in handy (but note that
expr.precedence is complete overkill as a replacement for a single chain!).
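As an example of one of the other chains, a right-associative exponentiation level could be slotted in between term and atom with chain.right1; this is only a sketch, and pow and expo are names invented here:
val pow = (x: Int, y: Int) => math.pow(x, y).toInt
// right-associative: "2^3^2" would read as 2^(3^2) = 512, not (2^3)^2 = 64
lazy val expo = chain.right1(atom)('^'.as(pow))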
3 Using expr.precedence
The final form of this parser uses an expression parser builder, called precedence. Since Parsley parsers
are implemented in pure Scala, there is nothing to stop you from developing tools like this yourself: the
ability to work with parsers as values and build combinators from them is the biggest advantage of
the approach. That being said, most combinator libraries provide this sort of functionality out of the box,
and Parsley is no exception. Let's see the same parser one last time and see what's changed:
import parsley.Parsley
import parsley.character.digit
import parsley.syntax.character.charLift
import parsley.expr.{precedence, Ops, InfixL}
val number = digit.foldLeft1[Int](0)((n, d) => n * 10 + d.asDigit)
lazy val expr: Parsley[Int] = precedence[Int]('(' ~> expr <~ ')', number)(
Ops(InfixL)('*' as (_ * _)),
Ops(InfixL)('+' as (_ + _), '-' as (_ - _)))
This is a lot smaller! The way precedence works is that it is first provided with the
atoms of the expression, and then each precedence level in turn (as many as needed),
starting with the tightest-binding operators. These levels are provided with Ops, which
takes a fixity and then any number of parsers whose results are functions matching that fixity.
Under the hood, it forms the same nested chains that were used in the previous section.
In essence, there is no practical difference between the two implementations.
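As before, a quick hand-worked check confirms the precedence and associativity come out as intended:
expr.parse("1+2*3")    // Success(7): multiplication binds tighter
expr.parse("(1+2)*3")  // Success(9)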
The precedence table can actually also be reversed so that it works the other way round:
lazy val expr: Parsley[Int] = precedence[Int](
Ops[Int](InfixL)('+' as (_ + _), '-' as (_ - _)),
Ops[Int](InfixL)('*' as (_ * _)))(
'(' ~> expr <~ ')', number)
But due to the order in which type inference happens, this form is a bit more cumbersome.
As mentioned before, the fixity given to Ops influences what type the operators need to have.
This works by a Scala feature called path-dependent typing, which is extraordinarily useful.
If you want to know more about this, see the relevant sub-section: you don't need to know about it
or understand it to use precedence, however.
There is still a little more to this story, though. So far we've been working with a homogeneous
datatype: every level in the precedence table shares the same type, Int. Now, in an abstract syntax
tree, which is the far more common result of parsing, you could represent all expressions
homogeneously (which I call a monolithic AST). But sometimes it's desirable to maintain stronger
guarantees about how the AST is structured, and for that we need a heterogeneous precedence table.
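For reference, a monolithic AST for this little language might look something like the sketch below (the names are invented here and not used elsewhere on this page); notice that nothing stops it representing, say, an unbracketed addition directly underneath a multiplication:
sealed trait MonoExpr
case class MonoAdd(x: MonoExpr, y: MonoExpr) extends MonoExpr
case class MonoSub(x: MonoExpr, y: MonoExpr) extends MonoExpr
case class MonoMul(x: MonoExpr, y: MonoExpr) extends MonoExpr
case class MonoNum(x: Int) extends MonoExpr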
3.1 Generalising precedence with GOps, SOps, and Levels
3.1.1 Subtyped ASTs with SOps
In some circumstances, it might be desirable to change the type of the parsers at each layer of the precedence table: this allows for a more strongly-typed AST, for example. Unlike in Haskell, this can be achieved easily in Scala by using subtyping.
For example, we can make an AST for our expressions like so:
sealed trait Expr
case class Add(x: Expr, y: Term) extends Expr
case class Sub(x: Expr, y: Term) extends Expr
sealed trait Term extends Expr
case class Mul(x: Term, y: Atom) extends Term
sealed trait Atom extends Term
case class Number(x: Int) extends Atom
case class Parens(x: Expr) extends Atom
The magic of subtyping means that Number(10) is a valid value of type Expr. That being said,
we have a guarantee that an Expr can only be found inside a Mul if it is wrapped in Parens:
since Expr is not a subtype of Term or Atom, Mul(Add(Number(6), Number(7)), Number(8)) does
not type-check!
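To represent that expression legally, the addition must be explicitly bracketed in the tree:
Mul(Parens(Add(Number(6), Number(7))), Number(8))  // fine: Parens is an Atom, and Atom <: Term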
Let's see what happens if we try and use our existing precedence knowledge with Ops:
val mul = (x: Expr, y: Expr) => Mul(x, y)
val add = (x: Expr, y: Expr) => Add(x, y)
val sub = (x: Expr, y: Expr) => Sub(x, y)
lazy val atom: Parsley[Atom] = number.map(Number) | '(' ~> expr.map(Parens) <~ ')'
lazy val expr = precedence[Expr](atom)(
Ops(InfixL)('*' as mul),
Ops(InfixL)('+' as add, '-' as sub))
// error: type mismatch;
// found : Expr
// required: Term
// val mul = (x: Expr, y: Expr) => Mul(x, y)
// ^
// error: type mismatch;
// found : Expr
// required: Atom
// val mul = (x: Expr, y: Expr) => Mul(x, y)
// ^
// error: type mismatch;
// found : Expr
// required: Term
// val add = (x: Expr, y: Expr) => Add(x, y)
// ^
// error: type mismatch;
// found : Expr
// required: Term
// val sub = (x: Expr, y: Expr) => Sub(x, y)
// ^
The problem is that, though all Terms are Exprs (and ditto
for Atom), we are forced to create operators of the shape (Expr, Expr) => Expr to fit into the
precedence table, but we can't guarantee that those Exprs we are passing into the function are
actually Terms (even though we know intuitively that they will be). In other words,
(Term, Atom) => Term is not a subtype of (Expr, Expr) => Expr!
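To see the variance issue in isolation (the val names here are just for illustration):
val mulOp: (Term, Atom) => Term = Mul(_, _)
// Function2 is contravariant in its argument types, so mulOp cannot be used where an
// (Expr, Expr) => Expr is expected:
// val bad: (Expr, Expr) => Expr = mulOp  // does not compile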
So, how do we fix this? Well, we need to stop using Ops and use SOps: instead of requiring an
(A, A) => A operator for InfixL, SOps will demand those with shape (B, A) => B such that
A <: B. So, does our (Term, Atom) => Term match this type? Yes: A = Atom, B = Term and
Atom <: Term; all is good. Why do we require that A <: B exactly? Well, suppose we didn't
read any multiplication operators: then we would be handing just an Atom to the layer
above, even though we claim to produce Terms. Of course, this is fine, because Atoms
are themselves Terms.
Unfortunately, we can't just provide the SOps as variadic arguments to the combinator, since they all
have different types from each other (that is the point, after all). Instead, we use a heterogeneous
list of precedence levels called, well, Levels.
trait Levels[+A]
case class Atoms[+A](atoms: Parsley[A]*) extends Levels[A]
// and
case class Level[A, B](
nextLevels: Levels[A],
ops: Ops[A, B])
extends Levels[B]
Basically, the type parameter to Levels says that the outer-most level of the structure produces
values of type A. There are two constructors in the list: Atoms is the end of the
list, and it produces As. Level[A, B], the equivalent of ::,
is a bit more complex: it says that, given a precedence table that produces As, it can use its own operators (which work on A) to produce values of type B; as a result, the larger table produces Bs.
Now, to make life nicer for us, the Levels list supports the common-place Scala collections
operators of +: and :+, which can be used in place of Level. Just like other Scala
collections, the rest of the table appears on the side of the :. As a result, we can build tables
like:
Atoms(atom1, atom2, .., atomN) :+ ops1 :+ ops2 :+ .. :+ opsN
// or
opsN +: .. +: ops2 +: ops1 +: Atoms(atom1, atom2, .., atomN)
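In terms of the Level constructor sketched above, these are just sugar for nested applications; for instance:
Atoms(atom1) :+ ops1 :+ ops2
// is the same table as
Level(Level(Atoms(atom1), ops1), ops2)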
The first form is the tightest-first approach, and the second is the weakest-first approach. So, what does our parser look like if we use Levels and SOps?
import parsley.Parsley
import parsley.character.digit
import parsley.syntax.character.charLift
import parsley.expr.{precedence, SOps, InfixL, Atoms}
val number = digit.foldLeft1[Int](0)((n, d) => n * 10 + d.asDigit)
sealed trait Expr
case class Add(x: Expr, y: Term) extends Expr
case class Sub(x: Expr, y: Term) extends Expr
sealed trait Term extends Expr
case class Mul(x: Term, y: Atom) extends Term
sealed trait Atom extends Term
case class Number(x: Int) extends Atom
case class Parens(x: Expr) extends Atom
lazy val expr: Parsley[Expr] = precedence {
Atoms(number.map(Number), '(' ~> expr.map(Parens) <~ ')') :+
SOps(InfixL)('*' as Mul) :+
SOps(InfixL)('+' as Add, '-' as Sub)
}
Not so bad! We've constructed the Levels list using :+, so this is tightest-first.
This time, turning the list the other way around wouldn't force us to add type annotations,
unlike when we reversed the Ops-based table earlier. Nice! An extra advantage of
this approach is that if we tried to use InfixR instead of InfixL, this happens:
type mismatch;
found : (Term, Atom) => Term
required: (Atom, Term) => Term
This means that, by using SOps, we get a guarantee that our parser's associativity correctly matches
what is advertised by our AST's constructors!
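As a quick check of the shapes we now get back (results worked out by hand):
expr.parse("6*7+8")  // Success(Add(Mul(Number(6), Number(7)), Number(8)))
expr.parse("6+7*8")  // Success(Add(Number(6), Mul(Number(7), Number(8))))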
3.1.2 Non-Subtyped Heterogeneous ASTs with GOps
So far, we've seen how to generalise our expression parsers to work with heterogeneous trees by relying
on subtyping. However, there may be cases where subtyping is undesirable, or otherwise not
possible (for example, if you want a layer to go from Int to Expr), but we still want these strongly-typed guarantees about the shape of the tree. In this case, we would change the datatype as follows:
sealed trait Expr
case class Add(x: Expr, y: Term) extends Expr
case class Sub(x: Expr, y: Term) extends Expr
case class OfTerm(t: Term) extends Expr
sealed trait Term
case class Mul(x: Term, y: Atom) extends Term
case class OfAtom(x: Atom) extends Term
sealed trait Atom
case class Number(x: Int) extends Atom
case class Parens(x: Expr) extends Atom
So, how do we use the precedence combinator now? The types of these
constructors no longer match (B, A) => B with A <: B! This is where GOps comes in. It's
very similar to SOps, except it doesn't come with the constraint that A is a subtype of B.
Instead, a GOps constructor requires you to provide a function of type A => B as well; in our case,
this will be OfAtom or OfTerm from above. Note that, if there are
any implicit conversions available from A to B, GOps will happily use those (this includes the
implicit evidence A =:= B and A <:< B for type equality and subtyping respectively:
GOps can recover the behaviour of Ops and SOps via these conversions). So, what does this
look like in practice?
import parsley.Parsley
import parsley.character.digit
import parsley.syntax.character.charLift
import parsley.expr.{precedence, GOps, InfixL, Atoms}
val number = digit.foldLeft1[Int](0)((n, d) => n * 10 + d.asDigit)
lazy val expr: Parsley[Expr] = precedence {
Atoms(number.map(Number), '(' ~> expr.map(Parens) <~ ')') :+
GOps[Atom, Term](InfixL)('*' as Mul)(OfAtom) :+
GOps[Term, Expr](InfixL)('+' as Add, '-' as Sub)(OfTerm)
}
Not so different from the version using SOps; but if subtyping is acceptable in your AST, the SOps
form is much less brittle. What makes GOps brittle? Well, notice that this time we've
had to manually specify the types that each level deals with: without a subtyping
constraint, Scala is reluctant to give Mul the type (Term, Atom) => Term. Instead, it infers
(Term, Atom) => Mul and then complains that OfAtom doesn't have the type Atom => Mul. Oops!
By the way, you can actually intermingle Ops, SOps, and GOps in the same table, as
long as you are using Levels: each of them is just a builder for values of type Ops[A, B].
3.2 Path-Dependent Typing and Ops/SOps/GOps
To support the advertised behaviour that the type of an operator depends on the fixity it has,
the Fixity trait has an abstract type called Op. Let's take the machinery behind the simpler Ops as an
example.
sealed trait Fixity {
type Op[A, B]
}
object Ops {
def apply[A](fixity: Fixity)(ops: Parsley[fixity.Op[A, A]]*): Ops[A, A] = ???
}
This is saying that the types of the parsers we pass to a call to Ops.apply should depend on the
type of the Op supported by the fixity. For instance, let's take InfixL and Prefix:
case object InfixL extends Fixity {
override type Op[-A, B] = (B, A) => B
}
case object Prefix extends Fixity {
override type Op[A, B] = B => B
}
Why Op works with As and Bs is explained in the very last subsection, so for now just always assume that A =:= B. Now observe the types of the partial applications of Ops.apply
to the different fixities:
def infixLefts[A](ops: Parsley[(A, A) => A]*): Ops[A, A] =
Ops(InfixL)(ops: _*)
def prefixes[A](ops: Parsley[A => A]*): Ops[A, A] =
Ops(Prefix)(ops: _*)
The path-dependent type fixity.Op[A, A] allows the types of the parsers to change accordingly.
There is a similar story for the GOps and SOps objects, but they instead rely on Levels as
opposed to variadic arguments.
3.3 Afternote: Why (B, A) => B and B => B?
The types given to each fixity are as follows:
- InfixL is (B, A) => B
- InfixR is (A, B) => B
- Prefix and Postfix are both B => B
This might seem confusing at first: why, for instance, do the unary operators not mention A at all?
Well, let's first understand why (B, A) => B is appropriate for left-associative things but not
right ones.
sealed trait Expr
case class LOp(x: Expr, y: Int) extends Expr
case class ROp(x: Int, y: Expr) extends Expr
case class Number(x: Int) extends Expr
Notice that LOp(LOp(Number(6), 5), 4) is ok, because the right hand argument to LOp is always an
Int and the left-hand argument is always an expression. In the case of 6, Int is not an Expr,
so we wrap it up in the Number constructor. So, for LOp, if we take A = Int and B = Expr,
it has the shape (B, A) => B. On the other hand, ROp(ROp(Number(6), 5), 4) is not ok, because
ROp(...) is not an Int! This justifies the (A, B) => B type: like-expressions can appear on the
right, but not the left. The level for this would be GOps[Int, Expr](InfixL)('@'.as(LOp))(Number) or
GOps[Int, Expr](InfixR)('@'.as(ROp))(Number) (notice that switching them round wouldn't type-check!).
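Putting the InfixL version into a small (hypothetical) table, assuming int is a Parsley[Int] that reads a raw number:
import parsley.expr.{precedence, GOps, InfixL, Atoms}
import parsley.syntax.character.charLift
// "6@5@4" would build LOp(LOp(Number(6), 5), 4)
val lexpr: Parsley[Expr] = precedence {
    Atoms(int) :+ GOps[Int, Expr](InfixL)('@'.as(LOp))(Number)
}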
For Prefix and Postfix it's a similar story:
sealed trait BoolExpr
case class Not(x: BoolExpr) extends BoolExpr
case class Literal(b: Boolean) extends BoolExpr
We would like to be able to write Not(Not(Literal(false))), which means that Not needs to accept
the same type as it produces. This explains B => B; in this case, the booleans themselves need to be
wrapped up with Literal, which has shape A => B. This is the same role that Number played before, which also has
shape A => B. The level for this would be GOps[Boolean, BoolExpr](Prefix)("not".as(Not))(Literal).
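Putting that level into a full (hypothetical) table, with simplistic keyword parsers and the invented names bool and notExpr:
import parsley.Parsley
import parsley.character.string
import parsley.expr.{precedence, GOps, Prefix, Atoms}
// hypothetical literal parser for the atoms
val bool: Parsley[Boolean] = string("true").as(true) | string("false").as(false)
// "not " keeps a trailing space because this sketch does no whitespace handling
val notExpr: Parsley[BoolExpr] = precedence {
    Atoms(bool) :+ GOps[Boolean, BoolExpr](Prefix)(string("not ").as(Not))(Literal)
}
// notExpr.parse("not not true") would give Success(Not(Not(Literal(true))))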