Interlude 1: Building a Parser for Haskell
We've covered a lot of ground in the series so far! I think it's time to take a medium-length
break and implement a parser for a (somewhat) simplified Haskell from scratch. Haskell is a
deceptively simple looking language, but is actually a minefield of ambiguous parses and the
whitespace sensitivity in the language makes it even trickier. For now, I'm going to make a
couple of simplifications: firstly, I'm disallowing multi-line anything. We are going to make
a \n
a significant character and not include it in the whitespace parsing (at least until
part 3!). Furthermore, I'm banning where
clauses, more than one definition in a let
, guards
on a new line, and case
statements must use {}
(these are optional in Haskell, so long as
the code is well-indented). I'm also not going to deal with user defined operators, because the
introduction of these during the parser does create a context-sensitive grammar, which I'd
prefer to avoid for now. With that being said, here's the grammar. I wouldn't recommend spending
too much time looking at the grammar, and instead we'll get started with the fairly easy task of
lexing.
<program> ::= ((<data> | <declaration> | <clause>) NEWLINE)*
<data> ::= 'data' <con-id> <var-id>* '=' <constructors>
<constructors> ::= <constructor> ['|' <constructors>]
<constructor> ::= <con-id> <type-atom>*
<declaration> ::= <var-id> '::' <type>
<clause> ::= <var-id> <pat-naked>* [<guard>] '=' <expr>
<pat-naked> ::= <var-id> | <pat-con> | '()' | '[]' | <literal> | '_'
| '(' <pat> ')'
| '(' <pat> [',' <pat>]+ ')'
| '[' <pat> [',' <pat>]* ']'
<pat> ::= <pat-paren> [':' <pat>]
<pat-paren> ::= <pat-app> | <pat-naked>
<pat-app> ::= <pat-con> <pat>+
<pat-con> ::= '(' ','+ ')' | <con-id> | '(' ':' ')'
<guard> ::= '|' <expr>
<type> ::= <type-app> ['->' <type>]
<type-app> ::= <type-atom>+
<type-atom> ::= <type-con> | <var-id> | '()' | '[' <type> ']'
| '(' <type> (',' <type> ')')+ | '(' <type> ')'
<type-con> ::= <con-id> | '[]' | '(' '->' ')' | '(' ','+ ')'
<expr> ::= [<expr> '$'] <expr-1>
<expr-1> ::= <expr-2> ['||' <expr-1>]
<expr-2> ::= <expr-3> ['&&' <expr-2>]
<expr-3> ::= <expr-4> [('<' | '<=' | '>' | '>=' | '==' | '/=') <expr-4>]
<expr-4> ::= <expr-5> [(':' | '++') <expr-4>]
<expr-5> ::= [<expr-5> ('+' | '-')] <expr-6>
<expr-6> ::= '-' <expr-6> | <expr-7>
<expr-7> ::= [<expr-7> ('*' | '/')] <expr-8>
<expr-8> ::= <expr-9> ['^' <expr-8>]
<expr-9> ::= <expr-10> ['.' <expr-9>]
<expr-10> ::= '\' <pat-naked>+ '->' <expr>
| 'let' <clause> 'in' <expr>
| 'if' <expr> 'then' <expr> 'else' <expr>
| 'case' <expr> 'of' '{' <alt> [(';'|NEWLINE) <alt>]* '}'
| <func-app>
<alt> ::= <pat> '->' <expr>
<func-app> ::= <term>+
<term> ::= <var-id> | <con-id> | '()' | '(' ','+ ')'
| '(' <expr> ')'
| '(' <expr> (',' <expr>)+) ')'
| '[' [<expr> (',' <expr>)*] ']'
| <literal>
<literal> ::= FLOAT | INTEGER | STRING | CHAR
<var-id> ::= VAR-ID
<con-id> ::= CON-ID
1 Lexing
As it turns out, lexing Haskell in Parsley is particularly easy: the Lexer
class is compliant
with the Haskell specification!
import parsley.Parsley
object lexer {
import parsley.token.{Lexer, Unicode, Basic}
import parsley.token.descriptions.{LexicalDesc, NameDesc, SymbolDesc, SpaceDesc, NumericDesc, ExponentDesc, EscapeDesc, TextDesc}
import parsley.character.newline
private val haskellDesc = LexicalDesc(
NameDesc.plain.copy(
identifierStart = Unicode(c => Character.isLetter(c) || c == '_'),
identifierLetter =
Unicode(c => Character.isLetterOrDigit(c) || c == '_' || c == '\''),
),
SymbolDesc.plain.copy(
hardKeywords = Set("if", "then", "else", "data", "where",
"let", "in", "case", "of"),
hardOperators = Set("$", "||", "&&", "<", "<=", ">", ">=", "==", "/=", ":",
"++", "+", "-", "*", "/", "^", "."),
),
NumericDesc.plain.copy(
octalExponentDesc = ExponentDesc.NoExponents,
binaryExponentDesc = ExponentDesc.NoExponents,
),
TextDesc.plain.copy(
escapeSequences = EscapeDesc.haskell,
),
SpaceDesc.plain.copy(
lineCommentStart = "--",
multiLineCommentStart = "{-",
multiLineCommentEnd = "-}",
multiLineNestedComments = true,
space = Basic(c => c == ' ' || c == '\t'),
)
)
private val lexer = new Lexer(haskellDesc)
val CON_ID = lexer.lexeme.names.identifier(Basic(_.isUpper))
val VAR_ID = lexer.lexeme.names.identifier(Basic(_.isLower))
val INTEGER = lexer.lexeme.natural.number
val FLOAT = lexer.lexeme.floating.number
val INT_OR_FLOAT = lexer.lexeme.unsignedCombined.number
// Strictly speaking, Haskell files are probably encoded as UTF-8, but this
// is not supported by Parsley _yet_
val STRING = lexer.lexeme.string.fullUtf16
val CHAR = lexer.lexeme.character.fullUtf16
val NEWLINE = lexer.lexeme(newline).void
def fully[A](p: Parsley[A]) = lexer.fully(p)
val implicits = lexer.lexeme.symbol.implicits
}
The only tricky bit here is identifiers. Ideally, we can make a distinction between so-called
constructor ids and variable ids: this is done by using the identifier
combinator, which
refines what the first letter of the identifier is allowed to be. I've exposed INT_OR_FLOAT
to our interface here, since it prevents any backtracking required by INTEGER | FLOAT
. Also seen here, is the NEWLINE
token, which we will use in the parser to deliberate delimit newlines, but still
ensure it consumes whitespace! The reason I have picked the shouty-case names is to mimic how tokens sometimes look in grammars. This is purely stylistic, but will help us distinguish between parts of our parser and the primitives of our lexer.
2 The AST + Parser Bridge Pattern
Now it's time to build the AST for our Haskell Parser to return. Since we'll be using the Parser Bridge pattern anyway, I get a choice about whether or not I want position tracking for each node in the tree. Just to keep the AST looking simple, I'll not track anything. Of course, if I did change my mind, I could do it here by changing which generic bridge trait is used. More interesting will be what bridge constructor shapes I pick for each of the AST nodes. Let's start by just outlining the datatypes themselves and why they are how they are:
object ast {
import parsley.generic._
case class HaskellProgram(lines: List[ProgramUnit])
sealed trait ProgramUnit
case class Data(id: ConId, tys: List[VarId], cons: List[Con]) extends ProgramUnit
case class Con(id: ConId, tys: List[TyAtom])
case class Decl(id: VarId, ty: Type) extends ProgramUnit
case class Clause(id: VarId, pats: List[PatNaked], guard: Option[Expr], rhs: Expr)
extends ProgramUnit
sealed trait Pat
case class PatCons(x: PatParen, xs: Pat) extends Pat
sealed trait PatParen extends Pat
case class PatApp(con: PatCon, args: List[PatNaked]) extends PatParen
sealed trait PatNaked extends PatParen
case object NilCon extends PatNaked with ParserBridge0[PatNaked]
case object Wild extends PatNaked with ParserBridge0[PatNaked]
case class NestedPat(pat: Pat) extends PatNaked
case class PatTuple(xs: List[Pat]) extends PatNaked
case class PatList(xs: List[Pat]) extends PatNaked
sealed trait PatCon extends PatNaked
case object ConsCon extends PatCon with ParserBridge0[PatCon]
sealed trait Type
case class FunTy(argTy: Type_, resTy: Type) extends Type
sealed trait Type_ extends Type
case class TyApp(tyF: Type_, tyX: TyAtom) extends Type_
sealed trait TyAtom extends Type_
case object UnitTy extends TyAtom with ParserBridge0[TyAtom]
case class ListTy(ty: Type) extends TyAtom
case class TupleTy(tys: List[Type]) extends TyAtom
// This is needed if we want to maximise the well-typedness of the parser
// For a parser as big as this one, it's definitely desirable: we can always
// weaken the types later if we want to!
case class ParenTy(ty: Type) extends TyAtom
case object ListConTy extends TyAtom with ParserBridge0[TyAtom]
case object FunConTy extends TyAtom with ParserBridge0[TyAtom]
case class TupleConTy(arity: Int) extends TyAtom
// We'll model this layer by layer, to maximise the flexiblity whilst maintaining
// The type safety: by using subtyping, we can avoid useless wrapper constructors
sealed trait Expr
case class WeakApp(f: Expr, arg: Expr1) extends Expr
sealed trait Expr1 extends Expr
case class Or(x: Expr2, y: Expr1) extends Expr1
sealed trait Expr2 extends Expr1
case class And(x: Expr3, y: Expr2) extends Expr2
sealed trait Expr3 extends Expr2
// We could certainly compress this by factoring out the op!
// Notice that these operators have Expr4 on both sides: this implies they are
// not left _or_ right associative! x < y < z is not legal in Haskell
case class Less(x: Expr4, y: Expr4) extends Expr3
case class LessEqual(x: Expr4, y: Expr4) extends Expr3
case class Greater(x: Expr4, y: Expr4) extends Expr3
case class GreaterEqual(x: Expr4, y: Expr4) extends Expr3
case class Equal(x: Expr4, y: Expr4) extends Expr3
case class NotEqual(x: Expr4, y: Expr4) extends Expr3
sealed trait Expr4 extends Expr3
case class Cons(x: Expr5, xs: Expr4) extends Expr4
case class Append(xs: Expr5, ys: Expr4) extends Expr4
sealed trait Expr5 extends Expr4
case class Add(x: Expr5, y: Expr6) extends Expr5
case class Sub(x: Expr5, y: Expr6) extends Expr5
sealed trait Expr6 extends Expr5
case class Negate(x: Expr6) extends Expr6
sealed trait Expr7 extends Expr6
case class Mul(x: Expr7, y: Expr8) extends Expr7
case class Div(x: Expr7, y: Expr8) extends Expr7
sealed trait Expr8 extends Expr7
case class Exp(x: Expr9, y: Expr8) extends Expr8
sealed trait Expr9 extends Expr8
case class Comp(f: Expr10, g: Expr9) extends Expr9
sealed trait Expr10 extends Expr9
case class Lam(args: List[Pat], body: Expr) extends Expr10
case class Let(binding: Clause, in: Expr) extends Expr10
case class If(cond: Expr, thenExpr: Expr, elseExpr: Expr) extends Expr10
case class Case(scrutinee: Expr, cases: List[Alt]) extends Expr10
case class Alt(pat: Pat, body: Expr)
sealed trait Expr10_ extends Expr10
case class StrongApp(f: Expr10_, arg: Term) extends Expr10_
sealed trait Term extends Expr10_
case class ConId(v: String) extends Term with PatCon with TyAtom
case class VarId(v: String) extends Term with PatNaked with TyAtom
case object UnitCon extends Term with PatNaked with ParserBridge0[Term with PatNaked]
case class TupleCon(arity: Int) extends Term with PatCon
case class ParensVal(x: Expr) extends Term
case class TupleLit(xs: List[Expr]) extends Term
case class ListLit(xs: List[Expr]) extends Term
trait Literal extends Term with PatNaked
case class HsInt(x: BigInt) extends Literal
case class HsString(s: String) extends Literal
case class HsChar(c: Int) extends Literal
case class HsDouble(x: BigDecimal) extends Literal
object WeakApp extends ParserBridge2[Expr, Expr1, Expr]
object Or extends ParserBridge2[Expr2, Expr1, Expr1]
object And extends ParserBridge2[Expr3, Expr2, Expr2]
object Less extends ParserBridge2[Expr4, Expr4, Expr3]
object LessEqual extends ParserBridge2[Expr4, Expr4, Expr3]
object Greater extends ParserBridge2[Expr4, Expr4, Expr3]
object GreaterEqual extends ParserBridge2[Expr4, Expr4, Expr3]
object Equal extends ParserBridge2[Expr4, Expr4, Expr3]
object NotEqual extends ParserBridge2[Expr4, Expr4, Expr3]
object Cons extends ParserBridge2[Expr5, Expr4, Expr4]
object Append extends ParserBridge2[Expr5, Expr4, Expr4]
object Add extends ParserBridge2[Expr5, Expr6, Expr5]
object Sub extends ParserBridge2[Expr5, Expr6, Expr5]
object Negate extends ParserBridge1[Expr6, Expr6]
object Mul extends ParserBridge2[Expr7, Expr8, Expr7]
object Div extends ParserBridge2[Expr7, Expr8, Expr7]
object Exp extends ParserBridge2[Expr9, Expr8, Expr8]
object Comp extends ParserBridge2[Expr10, Expr9, Expr9]
object FunTy extends ParserBridge2[Type_, Type, Type]
object Lam extends ParserBridge2[List[Pat], Expr, Lam]
object Let extends ParserBridge2[Clause, Expr, Let]
object If extends ParserBridge3[Expr, Expr, Expr, If]
object Case extends ParserBridge2[Expr, List[Alt], Case]
object Alt extends ParserBridge2[Pat, Expr, Alt]
object ConId extends ParserBridge1[String, ConId]
object VarId extends ParserBridge1[String, VarId]
object TupleCon extends ParserBridge1[Int, TupleCon]
object ParensVal extends ParserBridge1[Expr, ParensVal]
object TupleLit extends ParserBridge1[List[Expr], TupleLit]
object ListLit extends ParserBridge1[List[Expr], ListLit]
object HsInt extends ParserBridge1[BigInt, HsInt]
object HsString extends ParserBridge1[String, HsString]
object HsChar extends ParserBridge1[Int, HsChar]
object HsDouble extends ParserBridge1[BigDecimal, HsDouble]
object Data extends ParserBridge3[ConId, List[VarId], List[Con], Data]
object Con extends ParserBridge2[ConId, List[TyAtom], Con]
object Decl extends ParserBridge2[VarId, Type, Decl]
object Clause extends ParserBridge4[VarId, List[PatNaked], Option[Expr], Expr, Clause]
object PatCons extends ParserBridge2[PatParen, Pat, Pat]
object PatApp extends ParserBridge2[PatCon, List[PatNaked], PatApp]
object NestedPat extends ParserBridge1[Pat, NestedPat]
object PatTuple extends ParserBridge1[List[Pat], PatTuple]
object PatList extends ParserBridge1[List[Pat], PatList]
object TupleConTy extends ParserBridge1[Int, TupleConTy]
object ParenTy extends ParserBridge1[Type, ParenTy]
object TupleTy extends ParserBridge1[List[Type], TupleTy]
object ListTy extends ParserBridge1[Type, ListTy]
}
There is a lot of constructors here, but that's because the grammar is quite big. Notice that
the shape of the AST roughly follows the shape of the grammar (down to the naming). Subtyping
has been used where the same rule can appear in multiple places (ConId
or VarId
for
instance). This helps keep the parser simple whilst still providing a level of type safety. We
know, for instance, the associativity of the operators purely based on their types alone. By
generalising, we can see that left-associative types are shaped like (B, A) => B
, right
associative ones as (A, B) => B
and non-associative ones as (A, A) => B
. This is a helpful
guide for us, and in fact it will also ensure that we can't get the precedence table "wrong". A
consequence, as we'll see later, is we will be forced to use SOps
instead of Ops
in the
precedence tables.
The bridges have all also been defined above as well, including those marked
with ParserBridge0
, which is done on the object itself.
Notably though, there are two AST nodes I'm not going to give bridge constructors to: StrongApp
, TyApp
. If you look at
the grammar, you'll see that the two relevant rules are both just many
-like. Another option for
these datatypes would have been to have taken a List
of the sub-parses. But, morally, Haskell
function applications are done one at a time so I've not flattened the structure, and as we'll see,
we'll use a reduce
to handle these cases. In reality, providing a position to either of these two
nodes is quite difficult, because they are actually both delimited by ' '
, so there really isn't a
sensible position to latch on to.
3 Parsing
Now it's finally time to tackle the parser itself. Remember, our lexer handles all whitespace
(except newlines) for us, and our bridge constructors would handle position information for us. The
type of the AST is going to help us make sure that the parsers are constructed in the right way. One
concern we need to be aware of is notice areas of the grammar where ambiguity lies, and make sure
we resolve it properly with atomic
(and only when needed!). Let's start at the bottom and work
our way up: this means tackling, in order, atoms, expressions, patterns, clauses, types,
declarations, and then finally data.
3.1 Atoms of the Expression
Let's make a start with <term>
. There is nothing particular special about this, but we will need
to be careful to handle the difference between tupled expressions and parenthesised expressions.
There are a couple of ways we can try and tackle this: the first is to write both cases out and be
sure to atomic
one of them, so we can back out if required. The second is to parse them as one
case together and then disambiguate which of the constructors should be used in a bridge! I'm
going to take the first approach for now, and then we can revisit later. Now for the <term>
parser:
import parsley.Parsley.atomic
import parsley.combinator.{sepBy, sepBy1, countSome}
import lexer._
import implicits.implicitSymbol
import ast._
val `<literal>` = atomic(HsDouble(FLOAT)) | HsInt(INTEGER) | HsString(STRING) | HsChar(CHAR)
val `<var-id>` = VarId(VAR_ID)
val `<con-id>` = ConId(CON_ID)
val `<expr>`: Parsley[Expr] = /???/
val `<term>` = ( `<var-id>` | `<con-id>` | (UnitCon from "()")
| atomic(TupleCon("(" ~> countSome(",") <~ ")"))
| atomic(ParensVal("(" ~> `<expr>` <~ ")"))
| TupleLit("(" ~> sepBy1(`<expr>`, ",") <~ ")")
| ListLit("[" ~> sepBy(`<expr>`, ",") <~ "]")
| `<literal>`
)
Here, I've followed the structure of the grammar quite closely. I'm even making sure to follow the
same order that the grammar uses: this means that I use lazy val
for any parser which forward
references another grammar rule (this is why I didn't need the laziness in the bridges). I'm making
use of the sepBy
and sepBy1
combinators to handle the comma separated values and parse them into
a list. Notice that there are three instances of backtracking alone in this parser. Tuple
constructions (like (,,,)
), parenthesised values, and tuple literals all share the (
token. That
being said, so does ()
, but there it's been treated as a single atomic token by our
implicitSymbol
, so no backtracking required at the branching level. As I said, this could get
expensive, so we will re-visit it later. The same crops up with FLOAT
and INTEGER
, which may
also overlap with each other: again, we will re-visit this later and use the FLOAT_OR_INT
token
instead. To deal with the number of ,
representing the arity of a tuple constructor operator, I've
also drafted up a countSome
combinator, which will parse its given parser one or more times, counting
the number of successes. You'll also notice that, in Scala, anything goes between backticks! For
this parser, I'll go with this notation to make it match a little more closely with the grammar
itself (and for variety): it's up to you whether or not you like this notation. One thing that's
nice about it is that it clearly distinguishes between our combinators and the grammar rules
themselves.
3.2 Expressions
Next we'll tackle up to <expr>
. By now, we should know that the correct tool to reach for is
precedence
. While <expr-10>
can be considered to have operators in it, they do not fit with
any associativity in Haskell so I will split them out for digestibility. Without further ado, let's get going:
import parsley.expr.{precedence, SOps, InfixL, InfixR, InfixN, Prefix, Atoms}
lazy val `<expr>`: Parsley[Expr] = precedence {
SOps(InfixL)(WeakApp from "$") +:
SOps(InfixR)(Or from "||") +:
SOps(InfixR)(And from "&&") +:
SOps(InfixN)(Less from "<", LessEqual from "<=",
Greater from ">", GreaterEqual from ">=",
Equal from "==", NotEqual from "/=") +:
SOps(InfixR)(Cons from ":", Append from "++") +:
SOps(InfixL)(Add from "+", Sub from "-") +:
SOps(Prefix)(Negate from "-") +:
SOps(InfixL)(Mul from "*", Div from "/") +:
SOps(InfixR)(Exp from "^") +:
SOps(InfixR)(Comp from ".") +:
`<expr-10>`
}
lazy val `<expr-10>` = Atoms(/???/)
Here we can see a whole bunch of interesting things! Firstly, up to this point we've been used to
seeing Ops
in our precedence, where here we are using SOps
and the Levels
list. This is
important, because our AST is far more strongly typed. If we made each layer of the tree the same
(Expr, Expr) => Expr
shape, then we could use Ops
as we've been used to in other pages.
However, since I opted to make a more strongly typed tree using subtyping, we have to use the more
complex and general SOps
precedence architecture. This has some really nice consequences:
1) if, say, I removed SOps(InfixR)(Exp <# "^")
from the list, it would no longer compile
2) if, say, I accidentally said SOps(InfixL)(Exp <# "^")
, then it would no longer compile
3) if, say, I reordered the lines in the list, then it would no longer compile
4) by using subtyping, we don't need to provide any explicit wrapper constructors
Next up is the remaining three rules: <expr-10>
, <alt>
, and <func-app>
.
import parsley.Parsley.{some, many}
val `<clause>`: Parsley[Clause] = /???/
val `<pat-naked>`: Parsley[PatNaked] = /???/
val `<pat>`: Parsley[Pat] = /???/
val `<alt>` = Alt(`<pat>`, "->" ~> `<expr>`)
lazy val `<func-app>` = `<term>`.reduceLeft(StrongApp)
lazy val `<expr-10>` = Atoms(
Lam("\\" ~> some(`<pat-naked>`), "->" ~> `<expr>`),
Let("let" ~> `<clause>`, "in" ~> `<expr>`),
If("if" ~> `<expr>`, "then" ~> `<expr>`, "else" ~> `<expr>`),
Case("case" ~> `<expr>`,
"of" ~> "{" ~> sepBy1(`<alt>`, (";" | NEWLINE) <~ many(NEWLINE)) <~ "}"),
`<func-app>`)
This section of the parser is much more straightforward: we are using the regular shape of the
grammar in conjunction with our bridge constructors. Notice here we are explicitly making use of
NEWLINE
, so that we can make multi-line case
statements. This is ok because we have explicit
curly braces to delimit the start and end of the case
. Note here that the "\\"
is just parsing a
\
, but the backslash much be escaped to fit within the Scala string! The <func-app>
rule is
interesting, because it is the same as some(`<term>`).map(_.reduceLeft(StrongApp))
, but is
more efficient, not having to have to construct an intermediate list. It's always a good idea to
check out parsley
's API to see if you can find any hidden gems like this one!
Now we have tackled everything from <expr>
down, we are now in a position to deal with <clause>
and its sub grammars.
3.3 Clauses
These set of parsers are similar to what we saw with <term>
. There are going to be elements of
ambiguity in the grammar that need to be resolved with atomic
, specifically those involving
parentheses. But other than that there isn't anything really new here.
import parsley.Parsley.many
import parsley.combinator.option
import parsley.expr.infix
lazy val `<clause>` =
Clause(`<var-id>`, many(`<pat-naked>`), option(`<guard>`), "=" ~> `<expr>`)
lazy val `<pat-naked>`: Parsley[PatNaked] =
( `<var-id>` | atomic(`<pat-con>`)
| (UnitCon from "()") | (NilCon from "[]") | `<literal>` | (Wild from "_")
| atomic(NestedPat("(" ~> `<pat>` <~ ")"))
| PatTuple("(" ~> sepBy1(`<pat>`, ",") <~ ")")
| PatList("[" ~> sepBy(`<pat>`, ",") <~ "]")
)
lazy val `<pat>` = infix.right1(`<pat-paren>`)(PatCons from ":")
lazy val `<pat-paren>` = atomic(`<pat-app>`) | `<pat-naked>`
lazy val `<pat-app>` = PatApp(`<pat-con>`, some(`<pat-naked>`))
lazy val `<pat-con>` = ( atomic("(" ~> (ConsCon from ":") <~ ")")
| TupleCon("(" ~> countSome(",") <~ ")")
| `<con-id>`
)
lazy val `<guard>` = "|" ~> `<expr>`
Like I said, nothing too interesting here. Notice, however, that for <pat>
I have used a
infix.right1
: we've been used to using precedence
, but in cases like these the chain
s are just
so much more simple. Don't be afraid to make use of them! There are plenty of atomic
s here that we
will come back and eliminate later on: notably, the atomic
in <pat-paren>
is used to guard
against a <pat-con>
being parsed without any <pat-naked>
- this needs to be parsed again in
<pat-naked>
itself.
3.4 Types
Again, here we can just see more examples of the same concepts we've already been working with. There is a sense in which we've really reached the limit of stuff we need for our practical common cases: there isn't much more to say until we try and deal with much more complex grammatical features.
lazy val `<type>`: Parsley[Type] = infix.right1(`<type-app>`)(FunTy from "->")
lazy val `<type-app>` = `<type-atom>`.reduceLeft(TyApp)
lazy val `<type-atom>` = ( `<type-con>` | `<var-id>` | (UnitTy from "()")
| ListTy("[" ~> `<type>` <~ "]")
| atomic(ParenTy("(" ~> `<type>` <~ ")"))
| TupleTy("(" ~> sepBy1(`<type>`, ",") <~ ")")
)
lazy val `<type-con>` = ( `<con-id>`
| (ListConTy from "[]")
| atomic("(" ~> (FunConTy from "->") <~ ")")
| atomic(TupleConTy("(" ~> countSome(",") <~ ")"))
)
We can see another instance of infix.right1
out in the wild here, as well as the other reduceLeft
used for type application. Yet again, there is some ambiguity evidenced by the atomic
here, and
just like in the other instances, it's to do with parentheses. The second atomic
in <type-con>
is
interesting: it's not clear from the rule itself why it is there. In fact, it's there because we
need to be able to backtrack out of <type-con>
during <type-atom>
; however, since the ambiguity
only arises from the tuple constructor (the function constructor doesn't cause ambiguity in this case,
because of the existing atomic
), we don't need to enable backtracking on the entire rule.
Finally we can move onto the top level parsers and start tackling the performance gotchas: this is
the more challenging aspect of managing a parser of this size and complexity.
3.5 Declarations and Data
Here is a nice easy finish. These last rules are really just book-keeping. I'm also going to introduce a way of running the parser directly.
import parsley.Parsley.some
import parsley.combinator.sepEndBy
import parsley.errors.ErrorBuilder
def parse[Err: ErrorBuilder](input: String) = `<program>`.parse(input)
lazy val `<program>` =
fully(sepEndBy(`<data>` | atomic(`<declaration>`) | `<clause>`, some(NEWLINE)))
lazy val `<data>` = Data("data" ~> `<con-id>`, many(`<var-id>`), "=" ~> `<constructors>`)
lazy val `<constructors>` = sepBy1(`<constructor>`, "|")
lazy val `<constructor>` = Con(`<con-id>`, many(`<type-atom>`))
lazy val `<declaration>` = Decl(`<var-id>`, "::" ~> `<type>`)
We do have to be careful here, there is some overlap between <declaration>
and <clause>
. This,
unfortunately, is unavoidable without more involved work, but the overlap is just a single variable
name, so it's fairly minor. That being said, really, the scope on that atomic
is too big: once we've
seen the "::"
, we know for sure that everything that follows (good or bad) is a type. At the
moment, if something bad happens in the type, it can backtrack all the way out and try reading a
clause instead. This isn't too much of a problem for us here, since the clause will also fail and
the declaration's error message got further (and so takes precedence), but suppose the backtracking
allowed the parser to succeed somehow (perhaps it was many(`<declaration>`)
) then the error
will for sure be in the wrong place!
The parse
method is our hook to the outside world. I've written it "properly"
here so that it is parameterised by an instance of the ErrorBuilder
typeclass. We could omit this,
but then we'd always have to run it with the ErrorBuilder
in scope for whatever concrete Err
we
choose (in most cases, String
). It's useful to do this in case your unit tests want to test for
the correctness of error messages!
4 Optimising the Parser
4.1 Fixing Parentheses Backtracking
Before we finish up this page, and move onto learning about error messages, let's deal with the last
remaining warts in this version of our parser. There are several instances of backtracking: some are
very avoidable, and others are potentially expensive. We will progress through them, in some cases
making incremental changes. As I've said before, dealing with backtracking in an effective way is in
most cases the trickiest part of actually writing a parser. Make sure you understand how the parser
has been built up so far and convince yourself about why each instance of atomic
in the previous
parsers has been necessary before continuing.
4.1.1 <pat-con>
, <type-con>
, and <term>
These three grammar rules are very straightforward to factor out and remove the backtracking for, so we will start with them first. Let's remind ourselves of the three rules in question and identify the parts we can handle:
lazy val `<pat-con>` = ( atomic("(" ~> (ConsCon from ":") <~ ")")
| TupleCon("(" ~> countSome(",") <~ ")")
| `<con-id>`
)
val `<type-con>` = ( `<con-id>`
| (ListConTy from "[]")
| atomic("(" ~> (FunConTy from "->") <~ ")")
| atomic(TupleConTy("(" ~> countSome(",") <~ ")"))
)
val `<term>` = ( `<var-id>` | `<con-id>` | (UnitCon from "()")
| atomic(TupleCon("(" ~> countSome(",") <~ ")"))
| atomic(ParensVal("(" ~> `<expr>` <~ ")"))
| TupleLit("(" ~> sepBy1(`<expr>`, ",") <~ ")")
| ListLit("[" ~> sepBy(`<expr>`, ",") <~ "]")
| `<literal>`
)
Now, with <pat-con>
and <type-con>
, they both contain backtracking because there are two portions
of the parser which lie within parentheses. You can see the same thing in the <term>
parser,
however, as we'll see, <term>
will require a bit more extra work to fix the second instance of
backtracking. Thankfully, these are all relatively easy to fix: we just need to distribute the
parentheses through the rules that contain them on the inside. This is a nice warm-up exercise:
lazy val `<pat-con>` = ( atomic("(" ~> (ConsCon from ":") <~ ")")
| "(" ~> TupleCon(countSome(",")) <~ ")"
| `<con-id>`
)
val `<type-con>` = ( `<con-id>`
| (ListConTy from "[]")
| atomic("(" ~> (FunConTy from "->") <~ ")")
| atomic("(" ~> TupleConTy(countSome(",")) <~ ")")
)
val `<term>` = ( `<var-id>` | `<con-id>` | (UnitCon from "()")
| atomic("(" ~> TupleCon(countSome(",")) <~ ")")
| atomic("(" ~> ParensVal(`<expr>`) <~ ")")
| "(" ~> TupleLit(sepBy1(`<expr>`, ",")) <~ ")"
| ListLit("[" ~> sepBy(`<expr>`, ",") <~ "]")
| `<literal>`
)
With the parentheses distributed, we can see that they are easily factored out (on both the left- and the right-hand sides):
lazy val `<pat-con>` = ( "(" ~> ((ConsCon from ":") | TupleCon(countSome(","))) <~ ")"
| `<con-id>`
)
val `<type-con>` = ( `<con-id>`
| (ListConTy from "[]")
| atomic("(" ~> ((FunConTy from "->") | TupleConTy(countSome(","))) <~ ")")
)
val `<term>` = ( `<var-id>` | `<con-id>` | (UnitCon from "()")
| "(" ~> ( TupleCon(countSome(","))
| atomic(ParensVal(`<expr>`))
| TupleLit(sepBy1(`<expr>`, ","))
) <~ ")"
| ListLit("[" ~> sepBy(`<expr>`, ",") <~ "]")
| `<literal>`
)
This has immediately eliminated three of the atomic
s, but one persists inside <term>
: this is
because ParensVal
and TupleLit
both share an <expr>
. This is a bit trickier to eliminate, but
let's move on to tackling these. Note that the other atomic
in <type-con>
is due to a conflict
in the wider grammar, we can look at that in part 4.
4.1.2 <pat-naked>
, <type-atom>
, and <term>
(again)
The next three grammar rules contain similar patterns to the last three, but solving the backtracking is less obvious. Let's start by recapping what the three rules are:
lazy val `<pat-naked>`: Parsley[PatNaked] =
( `<var-id>` | atomic(`<pat-con>`)
| (UnitCon from "()") | (NilCon from "[]") | `<literal>` | (Wild from "_")
| atomic(NestedPat("(" ~> `<pat>` <~ ")"))
| PatTuple("(" ~> sepBy1(`<pat>`, ",") <~ ")")
| PatList("[" ~> sepBy(`<pat>`, ",") <~ "]")
)
lazy val `<type-atom>` = ( `<type-con>` | `<var-id>` | (UnitTy from "()")
| ListTy("[" ~> `<type>` <~ "]")
| atomic(ParenTy("(" ~> `<type>` <~ ")"))
| TupleTy("(" ~> sepBy1(`<type>`, ",") <~ ")")
)
val `<term>` = ( `<var-id>` | `<con-id>` | (UnitCon from "()")
| "(" ~> ( TupleCon(countSome(","))
| atomic(ParensVal(`<expr>`))
| TupleLit(sepBy1(`<expr>`, ","))
) <~ ")"
| ListLit("[" ~> sepBy(`<expr>`, ",") <~ "]")
| `<literal>`
)
First, let's do what we did to <term>
to <pat-naked>
and <type-atom>
:
lazy val `<pat-naked>`: Parsley[PatNaked] =
( `<var-id>` | atomic(`<pat-con>`)
| (UnitCon from "()") | (NilCon from "[]") | `<literal>` | (Wild from "_")
| "(" ~> ( atomic(NestedPat(`<pat>`))
| PatTuple(sepBy1(`<pat>`, ","))
) <~ ")"
| PatList("[" ~> sepBy(`<pat>`, ",") <~ "]")
)
lazy val `<type-atom>` = ( `<type-con>` | `<var-id>` | (UnitTy from "()")
| ListTy("[" ~> `<type>` <~ "]")
| "(" ~> ( atomic(ParenTy(`<type>`))
| TupleTy(sepBy1(`<type>`, ","))
) <~ ")"
)
val `<term>` = ( `<var-id>` | `<con-id>` | (UnitCon from "()")
| "(" ~> ( TupleCon(countSome(","))
| atomic(ParensVal(`<expr>`))
| TupleLit(sepBy1(`<expr>`, ","))
) <~ ")"
| ListLit("[" ~> sepBy(`<expr>`, ",") <~ "]")
| `<literal>`
)
Hopefully you can see that all three rules are similar to each other: they all have an atomic
inside the factored parentheses. The problem is that they use different bridge constructors and the
sepBy1
does not allow for easy factoring. That being said, we could deal with this by creating... a
new disambiguator bridge! Let's take a look at them:
object NestedPatOrPatTuple extends ParserBridge1[List[Pat], PatNaked] {
def apply(ps: List[Pat]): PatNaked = ps match {
case List(p) => NestedPat(p)
case ps => PatTuple(ps)
}
}
object ParenTyOrTupleTy extends ParserBridge1[List[Type], TyAtom] {
def apply(tys: List[Type]): TyAtom = tys match {
case List(ty) => ParenTy(ty)
case tys => TupleTy(tys)
}
}
object TupleLitOrParensVal extends ParserBridge1[List[Expr], Term] {
def apply(xs: List[Expr]): Term = xs match {
case List(x) => ParensVal(x)
case xs => TupleLit(xs)
}
}
Compared with bridge constructors, disambiguator bridges are used to arbitrate between several possible instantiations dependending on the data that is fed to them. These bridges each encapsulate both of the overlapping cases. If the list of results only has one element, then it indicates we should not be creating a tuple. With these new bridges replacing the old ones, we can adjust our parsers:
lazy val `<pat-naked>`: Parsley[PatNaked] =
( `<var-id>` | atomic(`<pat-con>`)
| (UnitCon from "()") | (NilCon from "[]") | `<literal>` | (Wild from "_")
| NestedPatOrPatTuple("(" ~> sepBy1(`<pat>`, ",") <~ ")")
| PatList("[" ~> sepBy(`<pat>`, ",") <~ "]")
)
lazy val `<type-atom>` = ( `<type-con>` | `<var-id>` | (UnitTy from "()")
| ListTy("[" ~> `<type>` <~ "]")
| ParenTyOrTupleTy("(" ~> sepBy1(`<type>`, ",") <~ ")")
)
val `<term>` = ( `<var-id>` | `<con-id>` | (UnitCon from "()")
| "(" ~> ( TupleCon(countSome(","))
| TupleLitOrParensVal(sepBy1(`<expr>`, ","))
) <~ ")"
| ListLit("[" ~> sepBy(`<expr>`, ",") <~ "]")
| `<literal>`
)
By making use of our special disambiguators bridge, we've eliminated the pesky atomic
s. I've put the
parentheses back inside the bridge call for the <pat-naked>
and <type-atom>
rules, because, in my opinion, it looks a bit cleaner.
4.1.3 Numbers in <literal>
With the tools we've already developed so far, this one is easy. Let's remind ourselves of the rule first:
/*
val INTEGER = lexer.lexeme.numeric.natural.number
val FLOAT = lexer.lexeme.numeric.floating.number
val INT_OR_FLOAT = lexer.lexeme.numeric.unsignedCombined.number
*/
val `<literal>` = atomic(HsDouble(FLOAT)) | HsInt(INTEGER) | HsString(STRING) | HsChar(CHAR)
The problem here is that floats and ints share a common leading prefix: the whole number part. When
we defined the lexer
, I mentioned that it supports a INT_OR_FLOAT
token. Now it's time to make
use of it to remove this atomic
. Our first thought might be to make a disambiguator bridge that can accommodate either of them, and that would be a fine idea:
object HsIntOrDouble extends ParserBridge1[Either[BigInt, BigDecimal], Literal] {
def apply(x: Either[BigInt, BigDecimal]): Literal = x.fold(HsInt(_), HsDouble(_))
}
val `<literal>` = HsIntOrDouble(INT_OR_FLOAT) | HsString(STRING) | HsChar(CHAR)
And this works fine!
4.1.4 Loose ends
At this point, there are four atomic
s left in the entire parser: one in <type-con>
, one in
<pat-paren>
, one in <pat-naked>
, and, finally, one in <program>
. At this point it gets much
harder to remove them without altering the grammar substantially. Let's see how much more we can do
to remove them and explore some of the more dramatic changes it entails to the parser. We'll start
with the atomic
in <type-con>
.
The reason for this atomic
is more obvious when we compare it with <type-atom>
:
val `<type-con>` = ( `<con-id>`
| (ListConTy from "[]")
| atomic("(" ~> ((FunConTy from "->") | TupleConTy(countSome(","))) <~ ")")
)
lazy val `<type-atom>` = ( `<type-con>` | `<var-id>` | (UnitTy from "()")
| ListTy("[" ~> `<type>` <~ "]")
| ParenTyOrTupleTy("(" ~> sepBy1(`<type>`, ",") <~ ")")
)
The atomic
in <type-con>
is used to backtrack out of the parentheses, since ParenTyOrTupleTy
will also consume them if we didn't see a ->
or a ,
. Thankfully, <type-con>
is used in one
place in the parser, and we are looking at it! To fix this atomic
we just need to be destructive
and stop following the grammar as rigidly as we have been. Let's inline <type-con>
into <type-atom>
to start with:
lazy val `<type-atom>` = ( `<con-id>`
| (ListConTy from "[]")
| atomic("(" ~> ( (FunConTy from "->")
| TupleConTy(countSome(","))
) <~ ")")
| `<var-id>` | (UnitTy from "()")
| ListTy("[" ~> `<type>` <~ "]")
| ParenTyOrTupleTy("(" ~> sepBy1(`<type>`, ",") <~ ")")
)
Right, now that they have been put together, we can see the problem more clearly. Let's now reorganise the parser so that the problematic parentheses appear next to each other: it's worth mentioning that, for parsers without backtracking, we can always reorder the branches without consequence; the restriction is that backtracking parsers cannot move ahead of their paired up branch.
lazy val `<type-atom>` = ( `<con-id>`
| (ListConTy from "[]")
| `<var-id>` | (UnitTy from "()")
| ListTy("[" ~> `<type>` <~ "]")
| atomic("(" ~> ( (FunConTy from "->")
| TupleConTy(countSome(","))
) <~ ")")
| ParenTyOrTupleTy("(" ~> sepBy1(`<type>`, ",") <~ ")")
)
This parser is a bit neater, and now we can apply our favourite tricks from part 1 to resolve this
atomic
:
lazy val `<type-atom>` = ( `<con-id>`
| (ListConTy from "[]")
| `<var-id>` | (UnitTy from "()")
| ListTy("[" ~> `<type>` <~ "]")
| "(" ~> ( (FunConTy from "->")
| TupleConTy(countSome(","))
| ParenTyOrTupleTy(sepBy1(`<type>`, ","))
) <~ ")"
)
Nice! One down, three to go. Let's have a look at the two involving patterns together:
lazy val `<pat-naked>`: Parsley[PatNaked] =
( `<var-id>` | atomic(`<pat-con>`)
| (UnitCon from "()") | (NilCon from "[]") | `<literal>` | (Wild from "_")
| NestedPatOrPatTuple("(" ~> sepBy1(`<pat>`, ",") <~ ")")
| PatList("[" ~> sepBy(`<pat>`, ",") <~ "]")
)
lazy val `<pat-paren>` = atomic(`<pat-app>`) | `<pat-naked>`
lazy val `<pat-app>` = PatApp(`<pat-con>`, some(`<pat-naked>`))
lazy val `<pat-con>` = ( "(" ~> ((ConsCon from ":") | TupleCon(countSome(","))) <~ ")"
| `<con-id>`
)
I've omitted the <pat>
rule here, since it's not relevant. Right, so the interaction of these rules
is quite intricate! The atomic
in <pat-paren>
is there because <pat-app>
reads a <pat-con>
and <pat-naked>
can also read one of those. Additionally, within <pat-naked>
, the atomic
is
guarding against yet more parentheses ambiguity caused by ConsCon
, TupleCon
, and
NestedPatOrPatTuple
. Now, our first thought might be that we can eliminate the atomic
within
<pat-naked>
with the same strategy we used for <type-con>
; we certainly could, but <pat-con>
is used in two places, so doing so will cause duplication in the parser. This is a trade-off:
inlining <pat-con>
will eliminate backtracking and make the parser more efficient, but that comes
at the cost of increased code size; in this case, in fact, the backtracking is limited to a single
character, which is relatively cheap, and the size of the rule is small, so inlining it will not
increase the code size significantly. It doesn't really matter either way what we do, so let's
reinforce our factoring skills and duplicate the code to eliminate the atomic
!
lazy val `<pat-naked>`: Parsley[PatNaked] =
( `<var-id>` | `<con-id>`
| atomic("(" ~> ((ConsCon from ":") | TupleCon(countSome(","))) <~ ")")
| NestedPatOrPatTuple("(" ~> sepBy1(`<pat>`, ",") <~ ")")
| (UnitCon from "()") | (NilCon from "[]") | `<literal>` | (Wild from "_")
| PatList("[" ~> sepBy(`<pat>`, ",") <~ "]")
)
lazy val `<pat-paren>` = atomic(`<pat-app>`) | `<pat-naked>`
lazy val `<pat-app>` = PatApp(`<pat-con>`, some(`<pat-naked>`))
lazy val `<pat-con>` = ( "(" ~> ((ConsCon from ":") | TupleCon(countSome(","))) <~ ")"
| `<con-id>`
)
In the above parser, I inlined the parser and reorganised it to bring the offending sub-rules together. We know the drill by this point, let's factor that out:
lazy val `<pat-naked>`: Parsley[PatNaked] =
( `<var-id>` | `<con-id>`
| "(" ~> ( (ConsCon from ":")
| TupleCon(countSome(","))
| NestedPatOrPatTuple(sepBy1(`<pat>`, ","))
) <~ ")"
| (UnitCon from "()") | (NilCon from "[]") | `<literal>` | (Wild from "_")
| PatList("[" ~> sepBy(`<pat>`, ",") <~ "]")
)
lazy val `<pat-paren>` = atomic(`<pat-app>`) | `<pat-naked>`
lazy val `<pat-app>` = PatApp(`<pat-con>`, some(`<pat-naked>`))
lazy val `<pat-con>` = ( "(" ~> ((ConsCon from ":") | TupleCon(countSome(","))) <~ ")"
| `<con-id>`
)
Nice, another atomic
down! Now, what about the atomic
in <pat-paren>
? Well, it turns out
that, by eliminating the first atomic
, we've stopped ourselves from being able to deal with this one!
The reason is that we've cannibalised the common <pat-con>
structure into our factored parentheses
in <pat-naked>
: oops! This atomic
is actually worse than the other one, since it can backtrack
out of an entire <pat-con>
as opposed to just a single (
. So, could we undo what we've just done
and fix this one instead? We certainly could; but this transformation is very violent since
<pat-naked>
appears in other places in the parser. That being said, lets do it!
The first step is to return to our old parser:
lazy val `<pat-naked>`: Parsley[PatNaked] =
( `<var-id>` | atomic(`<pat-con>`)
| (UnitCon from "()") | (NilCon from "[]") | `<literal>` | (Wild from "_")
| NestedPatOrPatTuple("(" ~> sepBy1(`<pat>`, ",") <~ ")")
| PatList("[" ~> sepBy(`<pat>`, ",") <~ "]")
)
lazy val `<pat-paren>` = atomic(`<pat-app>`) | `<pat-naked>`
lazy val `<pat-app>` = PatApp(`<pat-con>`, some(`<pat-naked>`))
lazy val `<pat-con>` = ( "(" ~> ((ConsCon from ":") | TupleCon(countSome(","))) <~ ")"
| `<con-id>`
)
Now, we know that the <pat-con>
is the problematic bit here, so let's break the <pat-naked>
into
two rules:
lazy val `<pat-naked>` = atomic(`<pat-con>`) | `<pat-naked'>`
lazy val `<pat-naked'>`: Parsley[PatNaked] =
( `<var-id>`
| (UnitCon from "()") | (NilCon from "[]") | `<literal>` | (Wild from "_")
| NestedPatOrPatTuple("(" ~> sepBy1(`<pat>`, ",") <~ ")")
| PatList("[" ~> sepBy(`<pat>`, ",") <~ "]")
)
lazy val `<pat-paren>` = atomic(`<pat-app>`) | atomic(`<pat-con>`) | `<pat-naked'>`
lazy val `<pat-app>` = PatApp(`<pat-con>`, some(`<pat-naked>`))
lazy val `<pat-con>` = ( "(" ~> ((ConsCon from ":") | TupleCon(countSome(","))) <~ ")"
| `<con-id>`
)
Now, notice that I've inlined <pat-naked>
into <pat-paren>
. The reason I did this is to make it
clear that the part we are trying to factor is the <pat-con>
. In fact, let's do a little bit of
shuffling and move it into <pat-app>
:
lazy val `<pat-paren>` = atomic(`<pat-app>`) | `<pat-naked'>`
lazy val `<pat-app>` = atomic(PatApp(`<pat-con>`, some(`<pat-naked>`))) | `<pat-con>`
Now, the aim here is to smash those <pat-con>
s together! We can introduce a new disambiguator bridge to
handle this, and switch some
for many
:
object PatAppIfNonEmpty extends ParserBridge2[PatCon, List[PatNaked], PatParen] {
def apply(con: PatCon, args: List[PatNaked]): PatParen = args match {
case Nil => con
case args => PatApp(con, args)
}
}
Now, compared to the original PatApp
bridge constructor, this one returns a PatParen
instead of a
PatApp
. This is because that is the common supertype of PatApp
and PatCon
. Let's see what the
parser looks like now:
lazy val `<pat-paren>` = atomic(`<pat-app>`) | `<pat-naked'>`
lazy val `<pat-app>` = PatAppIfNonEmpty(`<pat-con>`, many(`<pat-naked>`))
Now, since we've switched to a many
, we can actually push both of our atomic
s down into the
<pat-con>
and leave it at that:
lazy val `<pat-naked>` = `<pat-con>` | `<pat-naked'>`
lazy val `<pat-naked'>`: Parsley[PatNaked] =
( `<var-id>`
| (UnitCon from "()") | (NilCon from "[]") | `<literal>` | (Wild from "_")
| NestedPatOrPatTuple("(" ~> sepBy1(`<pat>`, ",") <~ ")")
| PatList("[" ~> sepBy(`<pat>`, ",") <~ "]")
)
lazy val `<pat-paren>` = `<pat-app>` | `<pat-naked'>`
lazy val `<pat-app>` = PatAppIfNonEmpty(`<pat-con>`, many(`<pat-naked>`))
lazy val `<pat-con>` = ( atomic("(" ~> ((ConsCon from ":") | TupleCon(countSome(","))) <~ ")")
| `<con-id>`
)
So, can we remove that last atomic
? No. At least not without a collosal amount of very destructive
refactoring of the grammar. What we can do, however, is make its scope ever so slightly smaller:
lazy val `<pat-con>` = ( atomic("(" ~> ((ConsCon from ":") | TupleCon(countSome(",")))) <~ ")"
| `<con-id>`
)
All I've done there is move it one parser to the left, that way, we commit to the branch as soon as
we've seen either a ,
or a :
and can't backtrack out of the closing bracket. That's as good as
we're going to get. A little unsatisfying, perhaps, but it's really such a minor point.
So, now what? Well, we have one final atomic
we can look at. And it's trickier than it looks.
val `<declaration>` = Decl(`<var-id>`, "::" ~> `<type>`)
val `<clause>` =
Clause(`<var-id>`, many(`<pat-naked>`), option(`<guard>`), "=" ~> `<expr>`)
val `<program>` =
fully(sepEndBy(`<data>` | atomic(`<declaration>`) | `<clause>`, some(NEWLINE)))
I've skipped out the irrelevant <data>
parser here. So, from the outset, this atomic
doesn't
look so bad: both <declaration>
and <clause>
share a <var-id>
. This, in theory, should be easy
to factor out. The problem is the bridges: when we factor out <var-id>
we no longer have the right
shape to use their bridges. You might think that the solution is to introduce an Either
, like we
did with the INT_OR_FLOAT
lexeme. This would work out ok, but is a little clunky. The new bridge
factory would be dealing with Either[Type, (List[PatNaked], Option[Expr], Expr)])]
. Let's start with
this approach first, and then see an alternative that keeps the two bridges separate and avoids the
tuple.
// These make use of `ast.Clause` because they are defined outside of `ast` (in this .md file)
type PartialClause = (List[PatNaked], Option[Expr], Expr)
object DeclOrClause extends ParserBridge2[VarId, Either[Type, PartialClause], ProgramUnit] {
def apply(id: VarId, declOrClause: Either[Type, PartialClause]): ProgramUnit =
declOrClause match {
case Left(ty) => Decl(id, ty)
case Right((args, guard, body)) => ast.Clause(id, args, guard, body)
}
}
object Clause extends ParserBridge2[VarId, PartialClause, Clause] {
def apply(id: VarId, partialClause: PartialClause): Clause = {
val (args, guard, body) = partialClause
ast.Clause(id, args, guard, body)
}
}
Now, to make this work nicely, I'm going to make use of the <+>
combinator: pronounced "sum", this parser works like |
except it returns its result into a co-product (Either
). Using
this, we can define the factored <program>
:
import parsley.syntax.zipped._
val `<declaration>` = "::" ~> `<type>`
val `<partial-clause>` = (many(`<pat-naked>`), option(`<guard>`), "=" ~> `<expr>`).zipped
val `<clause>` = Clause(`<var-id>`, `<partial-clause>`)
val `<decl-or-clause>` = DeclOrClause(`<var-id>`, `<declaration>` <+> `<partial-clause>`)
val `<program>` =
fully(sepEndBy(`<data>` | `<decl-or-clause>`, some(NEWLINE)))
Why have we got two <clause>
s? Well, we also need a <clause>
for the let
-expressions further
down the parser. Now, there isn't anything wrong with this parser, and its a perfectly reasonable
approach.
Let's also take a look at the other way we could have done this. This time, we'll change the two original bridge constructors, but won't introduce a third:
// These make use of `ast.Clause` because they are defined outside of `ast` (in this .md file)
object Decl extends ParserBridge1[Type, VarId => ast.Decl] {
def apply(ty: Type): VarId => ast.Decl = ast.Decl(_, ty)
}
object Clause extends ParserBridge3[List[PatNaked], Option[Expr], Expr, VarId => ast.Clause] {
def apply(pats: List[PatNaked], guard: Option[Expr], rhs: Expr): VarId => ast.Clause =
ast.Clause(_, pats, guard, rhs)
}
This time, the two bridge constructors return functions that take the factored VarId
. How does this
change the parser?
val `<declaration>` = Decl("::" ~> `<type>`)
val `<partial-clause>` = Clause(many(`<pat-naked>`), option(`<guard>`), "=" ~> `<expr>`)
val `<clause>` = `<var-id>` <**> `<partial-clause>`
val `<decl-or-clause>` = `<var-id>` <**> (`<declaration>` | `<partial-clause>`)
val `<program>` =
fully(sepEndBy(`<data>` | `<decl-or-clause>`, some(NEWLINE)))
This also works completely fine. This time, we use <**>
to apply the <var-id>
to the partially
built AST nodes. The advantage of this style is that it's a little more concise, and involves less
unnecessary object construction. However, this version has a subtle flaw: suppose we wanted to
add position information to the AST, then where would the pos
go? The position information would
be in the var-id>
, and we'd have to extract it out in our bridges, for example:
object Decl extends ParserBridge1[Type, VarId => Decl] {
def apply(ty: Type): VarId => Decl = v => Decl(v, ty)(v.pos)
}
object Clause extends ParserBridge3[List[PatNaked], Option[Expr], Expr, VarId => Clause] {
def apply(pats: List[PatNaked], guard: Option[Expr], rhs: Expr): VarId => Clause =
v => Clause(v, pats, guard, rhs)(v.pos)
}
This is really brittle: it is relying on the VarId
type having and exposing its position
information. In contrast, here's how our other bridge factory would be transformed:
object DeclOrClause
extends ParserBridgePos2[VarId, Either[Type, PartialClause], ProgramUnit] {
def apply(id: VarId, declOrClause: Either[Type, PartialClause])
(pos: (Int, Int)): ProgramUnit = declOrClause match {
case Left(ty) => Decl(id, ty)(pos)
case Right((args, guard, body)) => Clause(id, args, guard, body)(pos)
}
}
Much more straightforward, with no effect on any other bridge! This is worth considering if you do
find yourself in this sort of position (no pun intended). That being said, the <**>
-based version
is slightly cleaner and a tiny bit more efficient. Regardless of which method we pick, however,
that pesky atomic
is gone! That leaves us in a final state with a single atomic
which has a
maximum backtrack of one character. In other words, this parser runs in linear-time. Excellent!
5 Concluding Thoughts
In this (rather long) page, we've explored the implementation of an entire parser for a subset of Haskell from scratch. We've also seen a few techniques for factoring out common branches of the parser and applied everything we've learnt so far to a real example. We are going to come back to this parser later in the series: we've got to add better error messages, and deal with Haskell's indentation-sensitive off-side rule!