Effective Lexing
Parsley offers a user-friendly API for lexing, which is detailed at the end of this document. We recommend using this API for most purposes.
The initial sections of this document explain the fundamental principles behind lexing with parser combinators.
In the previous post, we saw the basic principles behind handling whitespace in a transparent manner. To remind ourselves of what we ended up lets pick up where we left off:
import parsley.Parsley, Parsley.{atomic, eof, many}
import parsley.character.{digit, whitespace, string, item, endOfLine}
import parsley.combinator.manyTill
import parsley.expr.{precedence, Ops, InfixL}
import parsley.errors.combinator.ErrorMethods
object lexer {
    private def symbol(str: String): Parsley[String] = atomic(string(str))
    private val lineComment = symbol("//") ~> manyTill(item, endOfLine).void
    private val multiComment = symbol("/*") ~> manyTill(item, symbol("*/")).void
    private val comment = lineComment | multiComment
    private val skipWhitespace = many(whitespace.void | comment).void.hide
    private def lexeme[A](p: Parsley[A]): Parsley[A] = p <~ skipWhitespace
    private def token[A](p: Parsley[A]): Parsley[A] = lexeme(atomic(p))
    def fully[A](p: Parsley[A]): Parsley[A] = skipWhitespace ~> p <~ eof
    val number = token(digit.foldLeft1[BigInt](0)((n, d) => n * 10 + d.asDigit))
    object implicits {
        implicit def implicitSymbol(s: String): Parsley[String] = lexeme(symbol(s))
    }
}
object expressions {
    import lexer.implicits.implicitSymbol
    import lexer.{number, fully}
    private lazy val atom: Parsley[BigInt] = "(" ~> expr <~ ")" | number
    private lazy val expr = precedence[BigInt](atom)(
        Ops(InfixL)("*" as (_ * _)),
        Ops(InfixL)("+" as (_ + _), "-" as (_ - _)))
    val parser = fully(expr)
}So far, we've broken the parser into two distinct chunks: the lexer and the main parser. Within
        the lexer we need to be careful and explicit about where we handle whitespace and where we
        don't; within the parser we can assume that all the whitespace has been correctly dealt with and
        can focus on the main content. To help motivate the changes we are going to make to the lexer
        object later on in the post, I want to first extend our "language" to add in variables into the
        language and a negate operator. In the process I'm going to swap the integer result for an
        abstract syntax tree.
import parsley.expr.{precedence, Ops, InfixL, Prefix}
object expressions {
    import lexer.implicits.implicitSymbol
    import lexer.{number, fully, identifier}
    // for now, assume that `identifier` is just 1 or more alphabetical characters
    sealed trait Expr
    case class Add(x: Expr, y: Expr) extends Expr
    case class Mul(x: Expr, y: Expr) extends Expr
    case class Sub(x: Expr, y: Expr) extends Expr
    case class Neg(x: Expr) extends Expr
    case class Num(x: BigInt) extends Expr
    case class Var(x: String) extends Expr
    private lazy val atom: Parsley[Expr] =
        "(" ~> expr <~ ")" | number.map(Num) | identifier.map(Var)
    private lazy val expr = precedence[Expr](atom)(
        Ops(Prefix)("negate" as Neg),
        Ops(InfixL)("*" as Mul),
        Ops(InfixL)("+" as Add, "-" as Sub))
    val parser = fully(expr)
}Now, we can assume that, since identifier comes from the lexer, this parser handles all the whitespace correctly. The question is, does it work?
expressions.parser.parse("x + y")
// res1: parsley.Result[String, expressions.Expr] = Success(Add(Var(x),Var(y)))
expressions.parser.parse("negate + z")
// res2: parsley.Result[String, expressions.Expr] = Failure((line 1, column 8):
//   unexpected "+"
//   expected "(", "negate", digit, or letter
//   >negate + z
//           ^)
expressions.parser.parse("negate x + z")
// res3: parsley.Result[String, expressions.Expr] = Success(Add(Neg(Var(x)),Var(z)))
expressions.parser.parse("negatex + z")
// res4: parsley.Result[String, expressions.Expr] = Success(Add(Neg(Var(x)),Var(z)))So, looking at these examples, the first one seems to work fine. The second one also works
        fine, but given that we've said that identifiers are just alpha-numeric characters, you might
        assume it was legal (indeed, it really shouldn't be legal in most languages that don't have
        "soft" keywords). The third example again works as intended, but the fourth is suspicious:
        negatex is clearly an identifier but was parsed as negate x! This now gives us a set up for
        refining our lexer for the rest of the page. We won't be touching the expressions object
        again, so take a long hard look at it.
1 Keywords, Identifiers and Operators
The crux of the problem we unearthed in the last section is that the implicit used to handle strings has no awareness of keywords (or indeed operators) and identifiers that work for any alpha-numeric sequence are an accident waiting to happen. Let's start by creating a couple of sets to represent the valid keywords and operators in the language:
val keywords = Set("negate")
val operators = Set("*", "+", "-")Now we can use these to define some more specialist combinators for dealing with these lexemes:
- We want to ensure that identifiers are not valid keywords.
- We want to ensure reading a keyword does not have a valid identifier "letter" after it.
- We want to ensure that a specific operator does not end up being the prefix of another, parsable, operator. This satisfies the "maximal-munch" style of parsing.
We'll start with identifier: to check that an identifier we've just read is not itself a
        valid keyword we can use the filter family of combinators. In particular, filterOut
        provides an error message that explains why the parser has failed. Here is our new and improved
        identifier:
import parsley.errors.combinator.ErrorMethods //for filterOut
import parsley.character.{letter, stringOfSome}
// `stringOfSome(letter)` is loosely equivalent to `some(letter).map(_.mkString)`
val identifier = token(stringOfSome(letter).filterOut {
    case v if keywords(v) => s"keyword $v may not be used as an identifier"
})The filterOut combinator takes a PartialFunction from the parser's result to String. If
        the partial function is defined for its input, that produces the error message that the parser
        fails with. Notice that I've been very careful to make sure the filter happens inside the
        scope of the token combinator. If we do read an identifier and then rule it out because its
        a keyword, we want the ability to backtrack. Filtering after the input has been irrevocably consumed will mean this parser fails more strongly than it should.
Next we'll tackle the keywords, using the handy notFollowedBy combinator that was briefly
        referenced in the very first post:
import parsley.Parsley.notFollowedBy
def keyword(k: String): Parsley[Unit] = token(string(k) ~> notFollowedBy(letter))Again, notice that I've been very careful to perform the negative-lookahead within the scope of
        the token so that we can backtrack if necessary (indeed, it's likely that a valid
        alternative was an identifier!). Additionally, we also need to ensure that whitespace isn't
        read before we try and check for the letterOrDigit, otherwise negate x would also fail, so that's
        another reason to keep it within token.
Finally, let's look at how operator is dealt with. It's a bit trickier, and in this case is meaningless, because all of our operators are single character and don't form valid prefixes of each other. But it will be useful to discuss anyway:
import parsley.character.strings
def operator(op: String): Parsley[Unit] = {
    val biggerOps = operators.collect {
        case biggerOp if biggerOp.startsWith(op)
                      && biggerOp > op => biggerOp.substring(op.length)
    }.toList
    biggerOps match {
        case Nil => lexeme(symbol(op)).void
        // strings requires one non-varargs argument
        case biggerOp :: biggerOps =>
            token(string(op) ~> notFollowedBy(strings(biggerOp, biggerOps: _*)))
    }
}Let's unpack what's going on here: first we read the op as normal, then we ensure that it's
        not followed by the rest of any operators for which it forms a valid prefix. This is using the
        regular collect method on Scala Set. As an example, if we have the operator set
        Set("*", "*+", "*-", "++") and we call operator("*"), we would be checking
        notFollowedBy(strings("+", "-")), since *+ and *- are both prefixed by *. This is,
        again, a great example of how powerful access to regular Scala code in our parsers is! This
        would be quite tricky to define in a parser generator!
So, the question is, what do we do with our new found combinators? We could just expose them to
        the rest of the parser as they are, but that leaves room for error if we forget, or miss out,
        any of the replacements. And, in addition, we lose the nice string literal syntax we've made
        good use of until this point. So, a better solution would be to change our definition of
        implicitSymbol:
object implicits {
    implicit def implicitSymbol(s: String): Parsley[Unit] = {
        if (keywords(s))       keyword(s)
        else if (operators(s)) operator(s)
        else                   lexeme(symbol(s)).void
    }
}Now, when we use a string literal in our original parser, it will first check to see if that is a valid keyword or an operator and, if so, it can use our specialised combinators: neat! With this done, let's see what the new lexer looks like and relook at the problematic example:
object lexer {
    private val keywords = Set("negate")
    private val operators = Set("*", "+", "-")
    private def symbol(str: String): Parsley[String] = atomic(string(str))
    private val lineComment = symbol("//") ~> manyTill(item, endOfLine).void
    private val multiComment = symbol("/*") ~> manyTill(item, symbol("*/")).void
    private val comment = lineComment | multiComment
    private val skipWhitespace = many(whitespace.void | comment).void.hide
    private def lexeme[A](p: Parsley[A]): Parsley[A] = p <~ skipWhitespace
    private def token[A](p: Parsley[A]): Parsley[A] = lexeme(atomic(p))
    def fully[A](p: Parsley[A]): Parsley[A] = skipWhitespace ~> p <~ eof
    val number = token(digit.foldLeft1[BigInt](0)((n, d) => n * 10 + d.asDigit))
    val identifier = token(stringOfSome(letter).filterOut {
        case v if keywords(v) => s"keyword $v may not be used as an identifier"
    })
    private def operator(op: String): Parsley[Unit] = {
        val biggerOps = operators.collect {
            case biggerOp if biggerOp.startsWith(op)
                        && biggerOp > op => biggerOp.substring(op.length)
        }.toList
        biggerOps match {
            case Nil => lexeme(symbol(op)).void
            // strings requires one non-varargs argument
            case biggerOp :: biggerOps =>
                token(string(op) ~> notFollowedBy(strings(biggerOp, biggerOps: _*)))
        }
    }
    private def keyword(k: String): Parsley[Unit] =
        token(string(k) ~> notFollowedBy(letter))
    object implicits {
        implicit def implicitSymbol(s: String): Parsley[Unit] = {
            if (keywords(s))       keyword(s)
            else if (operators(s)) operator(s)
            else                   lexeme(symbol(s)).void
        }
    }
}And the original failing example:
expressions.parser.parse("negatex + z")
// res8: parsley.Result[String, expressions.Expr] = Success(Add(Var(negatex),Var(z)))Exactly as desired!
2 Using token.descriptions.LexicalDesc and token.Lexer
        Whilst everything we have done above is nice and instructive, in practice all this work is
        already done for us with token.Lexer. By providing a suitable token.descriptions.LexicalDesc,
        we can get a whole bunch of combinators for dealing with tokens for free. There is a lot of
        functionality found inside the Lexer, and most of it is highly configurable with the LexicalDesc
        and its sub-components. Let's make use of this new found power and change up our lexer object one
        last time:
object lexer {
    import parsley.token.{Lexer, Basic}
    import parsley.token.descriptions.{LexicalDesc, NameDesc, SymbolDesc}
    private val desc = LexicalDesc.plain.copy(
        nameDesc = NameDesc.plain.copy(
            // Unicode is also possible instead of Basic
            identifierStart = Basic(_.isLetter),
            identifierLetter = Basic(_.isLetter),
        ),
        symbolDesc = SymbolDesc.plain.copy(
            hardKeywords = Set("negate"),
            hardOperators = Set("*", "+", "-"),
        ),
    )
    private val lexer = new Lexer(desc)
    val identifier = lexer.lexeme.names.identifier
    val number = lexer.lexeme.natural.decimal
    def fully[A](p: Parsley[A]) = lexer.fully(p)
    val implicits = lexer.lexeme.symbol.implicits
}The implicitSymbol function we developed before, along with operator and
        keyword are all implemented by lexer.lexeme.symbol. The names.identifier parser
        accounts for the keyword problem for us. The basic natural.decimal parser
        meets our needs without any additional configuration: it also returns BigInt, which
        is arbitrary precision. By using token.lexeme, this will already handle the
        whitespace and atomicity of the token for us. This is just the tip of the iceberg
        when it comes to the lexer functionality within Parsley. It is well worth having a
        play around with this functionality and getting used to it!
A more detailed description of this functionality can be found in the API Guide.