Token Extraction in ErrorBuilder
When vanilla error messages are generated internally to parsley,
the unexpected component is usually derived from the raw input,
or a name explicitly given to an unexpected combinator. However,
that does not necessarily provide the most informative or precise
error messages.
Instead, the ErrorBuilder typeclass has an unexpectedToken method
that can be used to determine how the token should be formulated
in the event that it would have otherwise come raw from the input.
Its signature is as follows:
def unexpectedToken(
    cs: Iterable[Char],
    amountOfInputParserWanted: Int,
    lexicalError: Boolean
  ): Token
The first argument, cs, is the input from the point that the bad
input was found; the second is the amount of input the parser tried
to read when it failed; and lexicalError denotes whether the failure
happened whilst trying to parse a token from Lexer. The return
value, Token, is one of the following classes:
case class Named(name: String, span: TokenSpan) extends Token
case class Raw(tok: String) extends Token
sealed trait TokenSpan
case class Spanning(line: Int, col: Int) extends TokenSpan
case class Width(w: Int) extends TokenSpan
A Raw token indicates that no further processing of the input could
produce a better token, so some of the input is returned verbatim.
Otherwise, a Named token can replace a raw token with something
derived from the input; the span here denotes how wide that token
has been determined to be.
The Spanning class will be removed in parsley:5.0.0; in future, the
width will be in Named directly.
The idea is that unexpectedToken should examine the provided
arguments and determine if a more specific token can be extracted
from the residual input, or, if not, produce a final Raw token of
the desired width.
In practice, while a user could implement the unexpectedToken method
by hand, parsley provides a collection of token extractors that can
be mixed in to an ErrorBuilder to implement it instead.
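To make the contract concrete, here is a minimal hand-written sketch of such a method. It does not use parsley: the Token classes are redeclared locally (mirroring the definitions above, minus Spanning) so that it runs on its own, and the policy of naming leading whitespace is an assumption chosen for illustration, not parsley's actual behaviour.

```scala
object HandWrittenExtractor {
  // Local mirrors of the classes shown above, so the sketch runs without parsley.
  sealed trait Token
  case class Named(name: String, span: TokenSpan) extends Token
  case class Raw(tok: String) extends Token
  sealed trait TokenSpan
  case class Width(w: Int) extends TokenSpan

  // Hypothetical policy: give a name to a leading whitespace character,
  // otherwise return as much raw input as the parser asked for (at least
  // one character). The lexicalError flag is ignored in this sketch.
  def unexpectedToken(
      cs: Iterable[Char],
      amountOfInputParserWanted: Int,
      lexicalError: Boolean
  ): Token = cs.headOption match {
    case Some('\n')                => Named("newline", Width(1))
    case Some(c) if c.isWhitespace => Named("whitespace", Width(1))
    case _                         => Raw(cs.take(math.max(1, amountOfInputParserWanted)).mkString)
  }
}
```

With this policy, input beginning with a newline is reported by name, whereas input such as "hello" with a demand of 3 yields the raw token "hel".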
Basic Extractors
There are three basic extractors available in parsley: SingleChar,
MatchParserDemand, and TillNextWhitespace. Each is discussed below.
Each of them has special handling for whitespace characters and
unprintable characters, which are given names.
SingleChar
This extractor simply takes the first codepoint of the input stream
cs and returns it. A codepoint is a single Unicode character, which
may consist of one or two 16-bit Char values (a surrogate pair in
the latter case). As an example, the default formatting may be
instantiated with this extractor by writing:
import parsley.errors.DefaultErrorBuilder
import parsley.errors.tokenextractors.SingleChar
val builder = new DefaultErrorBuilder with SingleChar
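The notion of "first codepoint" can be sketched without parsley. The helper below, firstCodepoint, is a hypothetical name introduced for illustration and is not part of the library; it only shows why the extracted token may be two Char values wide when the input starts with a surrogate pair.

```scala
object FirstCodepoint {
  // Returns the first codepoint of `cs` as a String: two Chars if the input
  // starts with a valid surrogate pair, otherwise one. Empty input gives "".
  def firstCodepoint(cs: Iterable[Char]): String = {
    val it = cs.iterator
    if (!it.hasNext) ""
    else {
      val hi = it.next()
      if (Character.isHighSurrogate(hi) && it.hasNext) {
        val lo = it.next()
        // Only keep the second Char when it really completes the pair.
        if (Character.isLowSurrogate(lo)) s"$hi$lo" else hi.toString
      }
      else hi.toString
    }
  }
}
```

For plain ASCII input the result is a single character; for input starting with a character outside the Basic Multilingual Plane, both halves of the pair are kept together.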
MatchParserDemand
This extractor, as its name suggests, takes more than a single
codepoint from the input, instead taking as many as the parser has
requested via the amountOfInputParserWanted argument.
As an example, the default formatting may be instantiated with
this extractor by writing:
import parsley.errors.DefaultErrorBuilder
import parsley.errors.tokenextractors.MatchParserDemand
val builder = new DefaultErrorBuilder with MatchParserDemand
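As a rough illustration of the difference from SingleChar, the following standalone sketch takes as much input as was demanded. The name matchDemand is hypothetical, and the sketch counts Char values rather than codepoints and skips the special naming of whitespace, so it is a simplification rather than the library's implementation.

```scala
object DemandSketch {
  // Take as many characters as the parser asked for, but never fewer than
  // one. Simplification: counts Chars, not codepoints.
  def matchDemand(cs: Iterable[Char], amountOfInputParserWanted: Int): String =
    cs.take(math.max(1, amountOfInputParserWanted)).mkString
}
```

If a parser tried to read a five-character keyword and failed on the input "whale song", this approach would report "whale" rather than just "w".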
TillNextWhitespace
Unlike the other extractors, this one has additional configuration.
It generally aims to take as much input as necessary to find the
next whitespace character; what counts as whitespace can be changed
by overriding the isWhitespace method. However, when
trimToParserDemand is true, the token is capped to the shorter of
the input the parser demanded and the input up to the next
whitespace.
As an example, the default formatting may be instantiated with
this extractor by writing:
import parsley.errors.DefaultErrorBuilder
import parsley.errors.tokenextractors.TillNextWhitespace
val builder = new DefaultErrorBuilder with TillNextWhitespace {
    def trimToParserDemand = true
}
This extractor, with trimToParserDemand = true, is the default
currently used by parsley for all error messages. By default,
isWhitespace matches any character c for which c.isWhitespace is
true.
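The interaction between the two settings can be sketched in isolation. The function below is a hypothetical standalone model, not parsley's code: it takes the settings as parameters instead of overridable methods, and it leaves out the naming of leading whitespace and unprintable characters.

```scala
object WhitespaceSketch {
  // Take input up to (not including) the next whitespace character; if
  // trimToParserDemand is set, cap that at what the parser asked for.
  def tillNextWhitespace(
      cs: Iterable[Char],
      amountOfInputParserWanted: Int,
      trimToParserDemand: Boolean,
      isWhitespace: Char => Boolean = _.isWhitespace
  ): String = {
    val upToSpace = cs.takeWhile(c => !isWhitespace(c))
    val kept =
      if (trimToParserDemand) upToSpace.take(math.max(1, amountOfInputParserWanted))
      else upToSpace
    kept.mkString
  }
}
```

On the input "foobar baz" with a demand of 3, the trimmed variant gives "foo" and the untrimmed variant gives "foobar". Passing a different isWhitespace predicate, such as one that matches only commas, changes where the token ends.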
Lexer-backed Extraction
The default strategies outlined above all ignore the lexicalError
flag passed to unexpectedToken. To provide a more language-directed
token extraction, however, the LexToken extractor is also provided.
It has one compulsory configuration and two more that have defaults:
trait LexToken {
    def tokens: Seq[Parsley[String]]
    def extractItem(cs: Iterable[Char], amountOfInputParserWanted: Int): Token = {
        SingleChar.unexpectedToken(cs)
    }
    def selectToken(matchedToks: List[(String, (Int, Int))]): (String, (Int, Int)) = {
        matchedToks.maxBy(_._2)
    }
}
Here, the tokens are parsers for valid tokens within the language
being parsed: each returns the name of that token as it would be
displayed in the error message. The extractor will try to parse all
of these tokens and, should at least one succeed, the non-empty list
of parsed tokens will be passed to selectToken for one to be picked
to be used in the error: by default, the one which is the widest is
chosen. If no tokens could be parsed, or the error occurred during
the parsing of a token or within the markAsToken combinator (as
normally denoted by lexicalError), then extractItem is used instead.
This should usually defer to another kind of token extractor; for
convenience, all of them expose their functionality in their
companion objects.
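The default selection rule can be tried out on its own. The sketch below copies the body of selectToken from the trait above and applies it to two made-up candidates; it assumes that each (Int, Int) pair is the (line, column) position at which that token's parse finished, which is what makes the largest pair correspond to the widest token.

```scala
object SelectTokenSketch {
  // Same body as the default shown above: pairs are compared
  // lexicographically, so the entry with the greatest pair wins.
  def selectToken(matchedToks: List[(String, (Int, Int))]): (String, (Int, Int)) =
    matchedToks.maxBy(_._2)

  // Hypothetical candidates for the input "iffy": a keyword parser that
  // stopped after "if" and an identifier parser that consumed everything.
  val candidates = List(
    ("keyword if", (1, 3)),
    ("identifier iffy", (1, 5))
  )

  val chosen = selectToken(candidates)
}
```

Under that assumption, the identifier is chosen because its parse ended further along the line. Overriding selectToken allows a different preference, for example favouring keywords over identifiers.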
The intention of the tokens sequence is that they should not consume
whitespace: were they to do so, this whitespace would form part of
the generated token! When using Lexer to fill this sequence, be sure
to use lexer.nonlexeme to source the tokens.
As an example, a language which already has an available lexer built
with lexical description desc can implement a LexToken as follows:
import parsley.errors.DefaultErrorBuilder
import parsley.errors.tokenextractors.LexToken
val builder = new DefaultErrorBuilder with LexToken {
    def tokens = Seq(
        lexer.nonlexeme.integer.decimal.map(n => s"integer $n"),
        lexer.nonlexeme.names.identifier.map(v => s"identifier $v")
    ) ++ desc.symbolDesc.hardKeywords.map { k =>
        lexer.nonlexeme.symbol(k).as(s"keyword $k")
    }
}
This is not necessarily an exhaustive list of tokens, but it illustrates how to set things up.