Lexer (parsley.token.Lexer)

The Lexer class is the main entry point to the combinator-based functionality of the parsley.token package. It is given configuration in the form of a LexicalDesc and an optional ErrorConfig. Internally, it is structured as a collection of objects, each bundling a different form of functionality: these are explored in more detail on this page.

It is worth noting the highest-level structure: the functionality is split between the lexeme and nonlexeme objects, alongside the fully combinator and the space object, all of which are described below.

The Scaladoc for this page can be found at parsley.token.Lexer.

Distinguishing Between "Lexeme" and "Non-Lexeme"

Broadly, the Lexer duplicates the vast majority of its functionality between two different objects: lexeme and nonlexeme. Everything within nonlexeme can be found inside lexeme, but not the other way around. The name "lexeme" is not great terminology-wise, but there is historical precedent set by parsec.

Non-lexeme things A non-lexeme thing does not care about whitespace: these are raw tokens. It is highly likely that you wouldn't want to use these in a regular parser, but they may be handy for custom error handling or building composite tokens.

Lexeme things These do account for whitespace that occurs after a token, consuming everything up until the next token. This means there are some extra pieces of functionality available that don't make much sense for non-lexeme handling. The lexeme object can also be used as a function via its apply method, allowing it to make any parser into one that handles whitespace: this should be done for any composite tokens made with nonlexeme.
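
As a hedged sketch (lexer stands for any suitably configured Lexer here; the token shape is invented for illustration):

// a composite token: an identifier immediately followed by !, with no
// whitespace permitted in between; lexeme then eats the trailing whitespace once
val shout = lexer.lexeme(lexer.nonlexeme.names.identifier <~> lexer.nonlexeme.symbol("!"))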

Whitespace should ideally be handled uniformly by lexeme: it establishes a convention of only consuming trailing whitespace, which is important for avoiding ambiguity in a parser. If you cannot use lexeme.apply, you must still adhere to this same convention.

For handling initial whitespace in the parser (before the very first token), you should use Lexer.fully.
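
As a hedged sketch (assuming the plain lexer, val lexer = new Lexer(LexicalDesc.plain), as constructed later on this page):

// fully skips leading whitespace, runs the parser, then demands end-of-input
val prog = lexer.fully(lexer.lexeme.signed.number)
// prog.parse("  42  ") should be Success(42): fully handles the leading
// whitespace, and lexeme the trailing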

Names and Symbols

These two categories of parser are closely linked, as described below.

Lexer.{lexeme, nonlexeme}.names

This object contains the definitions of several different parsers for dealing with values that represent names in a language, specifically identifiers and operators. These are configured directly by LexicalDesc.nameDesc; however, valid names are also affected by the keywords and reserved operators given in LexicalDesc.symbolDesc. Both identifiers and operators are defined in terms of an initial letter, followed by any number of subsequent letters.

Note that both the start and end letters must be defined for identifier/userDefinedOperator to work properly. It is not the case that, say, if identifierStart is omitted, identifierLetter is used in its place.

In some cases, languages may have special descriptions of identifiers or operators that work in specific scenarios or with specific properties. For instance: Haskell's distinction between constructors, which start uppercase, and variables, which start lowercase; and Scala's special treatment of operators that end in :. In these cases, the identifier and userDefinedOperator parsers provided by Names allow you to refine the start letter (and optionally the end letter for operators) to restrict them to a smaller subset. This allows for these special cases to be handled directly.
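
As a hedged sketch (assuming a lexer whose nameDesc defines the relevant letters, and the CharPredicate-taking overloads of these parsers; check the Scaladoc for the exact signatures):

import parsley.token.predicate

// hypothetical refinements in the style of Haskell's names:
// constructors start uppercase, variables start lowercase
val constructor = lexer.lexeme.names.identifier(predicate.Basic(_.isUpper))
val variable = lexer.lexeme.names.identifier(predicate.Basic(_.isLower))
// both still use the configured identifierLetter for the later characters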

Lexer.{lexeme, nonlexeme}.symbol

Compared with names, which deals with user-defined identifiers and operators, symbol is responsible for the hard-coded or reserved elements of a language. This includes keywords and built-in operators, as well as specific symbols like { or ;. The description for symbols, found in LexicalDesc.symbolDesc, describes what the "hard" keywords and operators of the language are: these are always regarded as reserved, and identifiers and user-defined operators may not take these names.

However, the symbol object also defines the softKeyword and softOperator combinators: these are for keywords that are only contextually reserved. For example, in Scala 3, the soft keyword opaque is only considered a keyword if it appears before type; this means it is possible to define a variable val opaque = 4 without issue. In parsley, this could be handled by writing atomic(symbol.softKeyword("opaque") ~> symbol("type")). Keywords and reserved operators are only legal when they are not followed by something that would turn them into part of a wider identifier or user-defined operator: even if if is a keyword, iffy should not be parsed as if then fy!

Neither soft nor hard keywords can form part of a wider identifier. However, for this to work, it is important that NameDesc.identifierLetter (and/or NameDesc.operatorLetter) is defined. If not, parsley will not know what constitutes an illegal continuation of the symbol!

To make things easier, symbol.apply(String) can be used to take any literal symbol and handle it properly with respect to the configuration (though soft keywords must still go via softKeyword). If if is part of the hardKeywords set, then symbol("if") will properly parse it, disallowing iffy, and so on. If the provided string is not reserved in any way, it will be parsed literally, as if string had been used.
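
As a hedged sketch (this configuration is invented for illustration; note that identifierLetter is defined, as the warning above requires):

import parsley.token.Lexer
import parsley.token.predicate
import parsley.token.descriptions.{LexicalDesc, NameDesc, SymbolDesc}

val kwLexer = new Lexer(LexicalDesc.plain.copy(
    nameDesc = NameDesc.plain.copy(
        identifierStart = predicate.Basic(_.isLetter),
        identifierLetter = predicate.Basic(_.isLetterOrDigit)),
    symbolDesc = SymbolDesc.plain.copy(hardKeywords = Set("if"))
))
// kwLexer.nonlexeme.symbol("if").parse("if") should succeed, whereas
// kwLexer.nonlexeme.symbol("if").parse("iffy") should fail, since the
// keyword would continue into a wider identifier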

The symbol object also defines a bunch of pre-made helper parsers for some common symbols like ;, ,, and so on. They are just defined in terms of symbol.apply(String) or symbol.apply(Char).

Implicits symbol.implicits contains the function implicitSymbol, which does the same job as symbol.apply, but is defined as an implicit conversion. By importing this, string literals can themselves serve as parsers of type Parsley[Unit], and parse symbols correctly. With this, instead of symbol("if") you can simply write "if".
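
Continuing the hedged kwLexer sketch from above:

import kwLexer.lexeme.symbol.implicits.implicitSymbol

// the string literal "if" is implicitly converted into a symbol parser
val ifNum = "if" ~> kwLexer.lexeme.signed.number
// ifNum.parse("if 42") should be Success(42), but parsing "iffy 42" should fail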

Lexer.{lexeme, nonlexeme} Numeric Parsers

This object contains the definitions of several different parsers for handling numeric data: this includes both integers and floating-point numbers. The configuration for all of these parsers is managed by LexicalDesc.numericDesc. The members are split into three kinds: integer parsers (the unsigned and signed objects), real parsers (the real object), and combined parsers, which can handle literals of either kind.

The configuration which specifies which of the numeric bases are legal for a number literal applies only to the number parsers within Integer, Real, and Combined. A parser for a specific base can always be used directly, even when otherwise disabled in the configuration, as illustrated below.

Examples of Configuration and Valid Literals

The plain definition of NumericDesc provides a variety of different configurations for the numeric literals depending on the literal base, so it mostly suffices to look at the effects of these on the different bases to get a sense of what does what.

import parsley.token.Lexer
import parsley.token.descriptions.LexicalDesc

val lexer = new Lexer(LexicalDesc.plain)

The basic configuration allows number to work with hexadecimal and octal literals, as well as decimal ones. These have their standard prefixes of 0x and 0o, respectively (or uppercase variants). This means that unsigned.number will allow literals like 0, 0xff, 0o45, and 345. For signed, each of these may be preceded by a + sign, but this is not required; if positiveSign is set to PlusSignPresence.Compulsory, positive literals would always require a +; and if it is set to PlusSignPresence.Illegal, the + prefix can never be used (but - is fine regardless). By default, 023 is legal, but this can be disabled by setting leadingZerosAllowed to false.

val num = lexer.lexeme.signed.number
// num: parsley.Parsley[BigInt] = parsley.Parsley@4dd0fe2d
num.parse("0")
// res0: parsley.Result[String, BigInt] = Success(x = 0)
num.parse("0xff")
// res1: parsley.Result[String, BigInt] = Success(x = 255)
num.parse("+0o45")
// res2: parsley.Result[String, BigInt] = Success(x = 37)
num.parse("-345")
// res3: parsley.Result[String, BigInt] = Success(x = -345)
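
Relatedly, a parser for a specific base remains available directly, even though the plain configuration does not enable that base for number; a hedged sketch (the binary parser name is assumed to mirror real.binary below):

val bin = lexer.lexeme.unsigned.binary
// bin.parse("0b101") should be Success(5), even though plain NumericDesc
// does not let unsigned.number accept binary literals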

In the basic configuration, break characters are not supported. However, setting literalBreakChar to BreakCharDesc.Supported('_', allowedAfterNonDecimalPrefix = true), say, will allow for 1_000 or 0x_400. Setting the second parameter to false forbids the latter example, as break characters may then only appear between digits.

import parsley.token.descriptions.numeric.{BreakCharDesc, NumericDesc}

val lexerWithBreak = new Lexer(LexicalDesc.plain.copy(
        numericDesc = NumericDesc.plain.copy(
            literalBreakChar = BreakCharDesc.Supported('_', allowedAfterNonDecimalPrefix = true))
    ))
// lexerWithBreak: Lexer = parsley.token.Lexer@4731c284
val withBreak = lexerWithBreak.lexeme.signed.number
// withBreak: parsley.Parsley[BigInt] = parsley.Parsley@7248ea83
withBreak.parse("1_000")
// res4: parsley.Result[String, BigInt] = Success(x = 1000)
withBreak.parse("1_")
// res5: parsley.Result[String, BigInt] = Failure(
// ...
withBreak.parse("2__0") // no double break
// res6: parsley.Result[String, BigInt] = Failure(
// ...

Real numbers in the default configuration do not support literals like .0 or 1.; this behaviour must be explicitly enabled with trailingDotAllowed and leadingDotAllowed. Note that . is not a valid literal, even with both flags enabled! By default, all four bases support exponents on their literals for floating-point numbers; this can be turned off for each by using ExponentDesc.NoExponents. However, with exponents enabled, the default configuration makes the non-decimal bases all require exponents for valid literals. Whilst 3.142 is a valid decimal literal, 0x3.142 is not a legal hexadecimal literal: to make it work, an exponent must be added, i.e. 0x3.142p0, where p0 performs * 2^0. For each of the non-decimal literals, the base of the exponent is configured to be 2, hence 2^0 in the previous example; for decimal it is set to the usual 10, so that 2e3 is 2*10^3, or 2000. Notice that literals do not require a point, so long as they do have an exponent.

val real = lexer.lexeme.real
// real: parsley.token.numeric.Real = parsley.token.numeric.LexemeReal@147e4580
real.hexadecimalDouble.parse("0x3.142")
// res7: parsley.Result[String, Double] = Failure(
// ...
real.hexadecimalDouble.parse("0x3.142p0")
// res8: parsley.Result[String, Double] = Success(x = 3.07861328125)
real.binary.parse("0b0.1011p0")
// res9: parsley.Result[String, BigDecimal] = Success(x = 0.6875)
real.decimal.parse("3.142")
// res10: parsley.Result[String, BigDecimal] = Success(x = 3.142)
real.decimal.parse("4")
// res11: parsley.Result[String, BigDecimal] = Failure(
// ...
real.decimal.parse("2e3")
// res12: parsley.Result[String, BigDecimal] = Success(x = 2000)

When a floating-point literal is parsed in a non-decimal base, each digit past the point denotes a fraction of that base. The example 0x3.142p0, for instance, is not equal to the decimal 3.142. Instead, it is equal to (3 + 1/16 + 4/16^2 + 2/16^3) * 2^0 = 3.07861328125. Handily, hexadecimal floats still correspond to the 4-bit grouping of binary floats: 0x0.Bp0 is the same as 0b0.1011p0, both of which are 0.6875 in decimal.
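
This arithmetic is easy to sanity-check in plain Scala (a quick verification, not part of the lexer API):

// 0x3.142p0, digit by digit, as described above
val hex = (3.0 + 1.0/16 + 4.0/(16*16) + 2.0/(16*16*16)) * math.pow(2, 0)
// hex == 3.07861328125
// and the 4-bit grouping: 0x0.Bp0 is 11.0/16 == 0.6875 == 0b0.1011p0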

Lexer.{lexeme, nonlexeme} Text Parsers

This object deals with the parsing of both string literals and character literals, configured broadly by LexicalDesc.textDesc.

Examples of Configuration and Valid Literals

The majority of configuration for strings and characters is focused on the escape sequences. Outside of that, it is mostly just which start and end sequences are valid for the different flavours of literal. However, the graphicCharacter predicate denotes which characters may appear in a string verbatim. This can be restricted to a smaller set than would otherwise be checked by the ascii or latin1 parsers; in these instances, a different error message is generated:

import parsley.token.predicate
import parsley.token.predicate.CharPredicate
import parsley.token.descriptions.text.{EscapeDesc, TextDesc}

val aboveSpace = predicate.Unicode(_ >= 0x20)
// aboveSpace: predicate.Unicode = Unicode(<function1>)
def stringParsers(graphicChar: CharPredicate = aboveSpace,
                  escapeDesc: EscapeDesc = EscapeDesc.plain) =
    new Lexer(LexicalDesc.plain.copy(
        textDesc = TextDesc.plain.copy(
            escapeSequences = escapeDesc,
            graphicCharacter = graphicChar
        )
    )).nonlexeme.string

val fullUnicode = stringParsers(aboveSpace)
// fullUnicode: parsley.token.text.String = parsley.token.text.ConcreteString@50383c91
val latin1Limited = stringParsers(predicate.Basic(c => c >= 0x20 && c <= 0xcf))
// latin1Limited: parsley.token.text.String = parsley.token.text.ConcreteString@c2a7678

fullUnicode.latin1.parse("\"hello α\"")
// res13: parsley.Result[String, String] = Failure((line 1, column 2):
//   non-latin1 characters in string literal, this is not allowed
//   >"hello α"
//     ^^^^^^^)
latin1Limited.fullUtf16.parse("\"hello α\"")
// res14: parsley.Result[String, String] = Failure((line 1, column 8):
//   unexpected "α"
//   expected """ or string character
//   >"hello α"
//           ^)

When it comes to escape characters, the configuration distinguishes between four kinds of escape sequence, which are further sub-divided:

Denotative escapes These are a family of escape sequences that are names or symbols for the escape characters they represent. Parsley supports three different kinds of denotative escape characters: literals, a Set of characters that represent themselves (like " or \); singleMap, a Map from single characters to the code points they represent (like n -> 0xa); and multiMap, a Map from whole strings to code points (like NULL -> 0x0).

Of course, all denotative escape sequences can be represented by the multiMap on its own, and all the above examples could be represented by Map("\"" -> '"', "\\" -> '\\', "n" -> 0xa, "NULL" -> 0x0). For literals in particular, the Set is more ergonomic than the Map.

Note that the literals set, along with the keys of singleMap and multiMap, must all be distinct from each other. Furthermore, no empty sequences may be placed in multiMap. Violating any of these requirements will result in an error.
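
As a hedged sketch of a denotative configuration (built from EscapeDesc.plain; the field names follow the text above, but check the Scaladoc for the exact types):

import parsley.token.descriptions.text.EscapeDesc

val escapes = EscapeDesc.plain.copy(
    literals = Set('"', '\\'),      // \" and \\ stand for themselves
    singleMap = Map('n' -> 0xa),    // \n is the newline character
    multiMap = Map("NULL" -> 0x0)   // \NULL is the null character
)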

Numeric escapes These are escapes that represent the numeric code of a specific character. There are four different bases for numeric escapes: binary, octal, hexadecimal, and decimal. Each of these can have its own unique prefix (or lack thereof), maximum allowed value, and specific number of digits.
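
A hedged sketch of enabling hexadecimal escapes like \x41 (the NumericEscape.Supported and NumberOfDigits names are assumptions here; consult the Scaladoc):

import parsley.token.descriptions.text.{EscapeDesc, NumericEscape, NumberOfDigits}

val hexEscapes = EscapeDesc.plain.copy(
    hexadecimalEscape = NumericEscape.Supported(
        prefix = Some('x'),                   // escapes begin with \x
        numDigits = NumberOfDigits.Unbounded, // any number of digits
        maxValue = 0x10ffff                   // capped at the largest code point
    )
)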

String gaps Supported for string literals only, string gaps allow for prunable whitespace within a string literal. These take the form of a backslash, followed by whitespace, terminated by another backslash (this can include newlines, even in otherwise single-line strings). As an example:

val withGaps = stringParsers(escapeDesc = EscapeDesc.plain.copy(gapsSupported = true))
// withGaps: parsley.token.text.String = parsley.token.text.ConcreteString@6474aa8c
withGaps.ascii.parse(""""Hello \

      \World!" """)
// res15: parsley.Result[String, String] = Success(Hello World!)

Empty escapes These are also only supported by string literals. They have no effect on the contents of the string, but allow for disambiguation with multi-character escape sequences. For example, if EscapeDesc.emptyEscape is set to Some('&'), then "\x20\&7" would be interpreted as the string " 7"; without the \&, it would instead try to interpret the escape as the character 0x207.
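
Continuing the hedged hexEscapes sketch from above, the disambiguation plays out like this:

val withEmpty = stringParsers(escapeDesc = hexEscapes.copy(emptyEscape = Some('&')))
// withEmpty.ascii.parse(""""\x20\&7"""") should be Success(" 7"): the \&
// ends the \x escape, so the 7 is read as an ordinary character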

Lexer.lexeme.{enclosing, separators}

These two objects just contain various shortcuts for doing things such as semicolon-separated items, or brace-enclosed blocks, etc. There is nothing special about them: with lexer.lexeme.symbol.implicits.implicitSymbol imported, "(" ~> p <~ ")" is the same as lexer.lexeme.enclosing.parens(p). The choice of one style over the other is purely a matter of taste.
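
A hedged sketch of the equivalence (the number parser merely stands in for any p; semiSep is the name mirrored from parsec, so check the Scaladoc):

import lexer.lexeme.symbol.implicits.implicitSymbol

val p = lexer.lexeme.signed.number // any parser will do here
val viaSymbols = "(" ~> p <~ ")"
val viaEnclosing = lexer.lexeme.enclosing.parens(p)
// zero or more ps separated by semi-colons
val stmts = lexer.lexeme.separators.semiSep(p)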

Whitespace-Sensitive Languages and Lexer.space

Normally, the whitespace definitions used by lexeme are fixed, as described by LexicalDesc.spaceDesc, accounting for both comments and the spaces themselves. However, some languages, like Python and Haskell, do not have constant definitions of whitespace: for instance, inside a pair of parentheses, newline characters are no longer considered for the current indentation. To support this, parsley allows the space definition to be locally altered during parsing if LexicalDesc.spaceDesc.whitespaceIsContextDependent is set to true (this may impact the performance of the parser).

If the LexicalDesc.spaceDesc.whitespaceIsContextDependent flag is turned on, it is crucial that either the Lexer.fully combinator is used, or Lexer.space.init is run as the very first thing the top-level parser does. Without this, the context-dependent whitespace will not be set up correctly!
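
A hedged sketch of the second option (topLevel merely stands in for the real top-level parser; eof lives in parsley.combinator):

import parsley.combinator.eof

// when fully is not used, set up the whitespace context by hand first
val topLevel = lexer.lexeme.signed.number
val main = lexer.space.init ~> topLevel <~ eof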

In this mode, it is possible to use the lexer.space.alter combinator to temporarily change the definition of whitespace (but not comments) within the scope of a given parser. As an example:

val withNewline = predicate.Basic(_.isWhitespace)
// expr must be lazy, since it refers to itself recursively
lazy val expr = ... | "(" ~> lexer.space.alter(withNewline)(expr) <~ ")"

For the duration of that nested expr call, newlines are considered regular whitespace. This, of course, is assuming that newlines were not considered whitespace under normal conditions.