gigaparsec-0.3.0.0: Refreshed parsec-style library for compatibility with Scala parsley
License: BSD-3-Clause
Maintainer: Jamie Willis, Gigaparsec Maintainers
Stability: experimental
Safe Haskell: Safe
Language: Haskell2010

Text.Gigaparsec.Token.Lexer

Description

This module provides a large selection of functionality concerned with lexing.

In traditional compilers, lexing and parsing are two largely separate processes; lexing turns raw input into a series of tokens, and parsing then processes these tokens. Parser combinators, on the other hand, are often implemented to deal directly with the input stream.

Nonetheless, a lexer abstraction may be achieved by defining a core set of lexing combinators that convert input to tokens, and then defining the parsing combinators in terms of these. The parsers provided by Lexer serve as these lexing combinators, creating a clear and logical separation from the rest of the parser.

It is possible that some of the implementations of parsers found within this module may have been hand-optimised for performance: care will have been taken to ensure these implementations precisely match the semantics of the originals.


Lexing

data Lexer Source #

A lexer describes how to transform the input string into a series of tokens.

mkLexer Source #

Arguments

:: LexicalDesc

The description of the lexical structure of the language.

-> Lexer

A lexer which can convert the input stream into a series of lexemes.

Create a Lexer with a given description for the lexical structure of the language.
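For example, a lexer for a small language might be built from the plain description. This is a minimal sketch; plain, plainName, plainSymbol, and the record fields shown here are assumed to come from Text.Gigaparsec.Token.Descriptions, and the language itself is hypothetical. Later sketches in this page reuse the myLexer defined here.

  import Text.Gigaparsec.Token.Lexer (Lexer, mkLexer)
  import Text.Gigaparsec.Token.Descriptions
    -- assumed: provides plain, plainName, plainSymbol and the record fields used below
  import Data.Char (isAlpha, isAlphaNum)
  import qualified Data.Set as Set

  -- a hypothetical description for a small language
  myDesc :: LexicalDesc
  myDesc = plain
    { nameDesc = plainName
        { identifierStart  = Just isAlpha     -- identifiers start with a letter ...
        , identifierLetter = Just isAlphaNum  -- ... and continue with letters or digits
        }
    , symbolDesc = plainSymbol
        { hardKeywords  = Set.fromList ["if", "then", "else"]
        , hardOperators = Set.fromList ["+", "*", "="]
        }
    }

  myLexer :: Lexer
  myLexer = mkLexer myDesc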

mkLexerWithErrorConfig Source #

Arguments

:: LexicalDesc

The description of the lexical structure of the language.

-> ErrorConfig

The description of how to process errors during lexing.

-> Lexer

A lexer which can convert the input stream into a series of lexemes.

Create a Lexer with a given description for the lexical structure of the language, which reports errors according to the given error config.

Lexemes and Non-Lexemes

A key distinction in lexers is between lexemes and non-lexemes:

  • lexeme consumes whitespace. It should be used by a wider parser, to ensure whitespace is handled uniformly. The output of lexeme can be considered a token as provided by traditional lexers, and can be used by the parser.
  • nonlexeme does not consume whitespace. It should be used to define further composite tokens or in special circumstances where whitespace should not be consumed. One may consider the output of nonlexeme to still be in the lexing stage of parsing, and not necessarily a valid token.

Lexemes

Ideally, a wider parser should not be concerned with handling whitespace, as it should operate on a stream of tokens. With parser combinators, however, there is usually no sharp separation between the lexing and parsing phases. That said, it is good practice to establish a logical separation between the two worlds. As such, lexeme contains parsers that parse tokens, and these are whitespace-aware: whitespace is consumed after any of these parsers succeeds, although whitespace is not required to be present.
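As a small sketch (building on the hypothetical myLexer from above; names, identifier, sym, integer, and decimal are all described later in this module), each token below consumes the whitespace that follows it, so inputs such as "x = 5" and "x=5" both parse:

  assignment :: Parsec (String, Integer)
  assignment = (,) <$> identifier (names lexe)
                   <*  sym lexe "="
                   <*> decimal (integer lexe)
    where
      lexe = lexeme myLexer  -- the whitespace-aware view of the hypothetical lexer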

lexeme :: Lexer -> Lexeme Source #

This contains parsers for tokens treated as "words", such that whitespace will be consumed after each token has been parsed.

Non-Lexemes

Whilst the functionality in lexeme is strongly recommended for wider use in a parser, the functionality here may be useful for more specialised use-cases. In particular, these may form the building blocks of more complex tokens (where whitespace is not allowed between them, say), in which case these compound tokens can be turned into lexemes manually.

For example, the lexer does not have configuration for trailing specifiers on numeric literals (like 1024L in Scala, say): the desired numeric literal parser could be extended with this functionality before whitespace is consumed, by using the variants found here.
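A sketch of that Scala-style trailing specifier, building on the hypothetical myLexer from above: the raw (non-whitespace-consuming) natural is combined with an optional 'L' suffix, and only then turned into a lexeme with apply (described under Lexeme Fields below). char is assumed from Text.Gigaparsec.Char, and (<|>) from Control.Applicative.

  -- build the compound token with nonlexeme, so no whitespace is consumed
  -- between the digits and the suffix ...
  longLiteral :: Parsec (Integer, Bool)
  longLiteral = (,) <$> decimal (natural (nonlexeme myLexer))
                    <*> (True <$ char 'L' <|> pure False)

  -- ... and only then make it whitespace-aware
  longLexeme :: Parsec (Integer, Bool)
  longLexeme = apply (lexeme myLexer) longLiteral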

These tokens can also be used for lexical extraction, which can be performed by the ErrorBuilder typeclass: this can be used to try and extract tokens from the input stream when an error happens, to provide a more informative error. In this case, it is desirable to not consume whitespace after the token to keep the error tight and precise.

nonlexeme :: Lexer -> Lexeme Source #

This contains parsers for tokens that do not give any special treatment to whitespace.

Fully and Space

fully :: Lexer -> forall a. Parsec a -> Parsec a Source #

This combinator ensures a parser fully parses all available input, and consumes whitespace at the start.
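A minimal sketch of a top-level entry point, reusing the assignment parser sketched earlier. parse, Result, and the default String error format are assumed from Text.Gigaparsec; the TypeApplications extension is used to select the error representation.

  runProgram :: String -> Result String (String, Integer)
  runProgram = parse @String (fully myLexer assignment)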

space :: Lexer -> Space Source #

This contains parsers that directly treat whitespace.

Lexeme Fields

Despite their differences, lexemes and non-lexemes share a lot of common functionality. The type Lexeme describes both lexemes and non-lexemes, so that this common functionality may be exploited.

data Lexeme Source #

A Lexeme is a collection of parsers for handling various tokens (such as symbols and names), where either all or none of the parsers consume whitespace.

Lexemes and Non-Lexemes are described by these common fields.

apply :: Lexeme -> forall a. Parsec a -> Parsec a Source #

This turns a non-lexeme parser into a lexeme one by ensuring whitespace is consumed after the parser.

sym :: Lexeme -> String -> Parsec () Source #

Parse the given string.

symbol :: Lexeme -> Symbol Source #

This contains lexing functionality relevant to the parsing of atomic symbols.

names :: Lexeme -> Names Source #

This contains lexing functionality relevant to the parsing of names, which include operators or identifiers. The parsing of names is mostly concerned with finding the longest valid name that is not a reserved name, such as a hard keyword or a special operator.

Symbolic Tokens

The Symbol interface handles the parsing of symbolic tokens, such as keywords.

data Symbol Source #

This contains lexing functionality relevant to the parsing of atomic symbols.

Symbols are characterised by their "unitness", that is, every parser inside returns (). This is because they all parse a specific known entity, and, as such, the result of the parse is irrelevant. These can be things such as reserved names, or small symbols like parentheses.

This type also contains a means of creating new symbols as well as implicit conversions that allow Haskell's string literals (with the OverloadedStrings extension enabled) to serve as symbols within a parser.

softKeyword :: Symbol -> String -> Parsec () Source #

This combinator parses a given soft keyword atomically: the keyword is only valid if it is not followed directly by a character which would make it a larger valid identifier.

Soft keywords are keywords that are only reserved within certain contexts. The apply combinator handles so-called hard keywords automatically, as the given string is checked to see what class of symbol it might belong to. However, soft keywords are not included in this set, as they are not always reserved in all situations. As such, when a soft keyword does need to be parsed, this combinator should be used to do it explicitly. Care should be taken to ensure that soft keywords take parsing priority over identifiers when they do occur.
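For instance, a keyword that is only reserved inside an import form might be handled as follows. This is a sketch building on the hypothetical myLexer from above; the import form itself is made up for illustration.

  -- "from" is only a keyword inside this form, so it is parsed as a soft keyword;
  -- "import" would typically be listed among the hardKeywords of the description
  importClause :: Parsec (String, String)
  importClause = (,) <$  sym lexe "import" <*> ident
                     <*  softKeyword (symbol lexe) "from" <*> ident
    where
      lexe  = lexeme myLexer
      ident = identifier (names lexe)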

softOperator :: Symbol -> String -> Parsec () Source #

This combinator parses a given soft operator atomically: the operator is only valid if it is not followed directly by a character which would make it a larger valid operator (reserved or otherwise).

Soft operators are operators that are only reserved within certain contexts. The apply combinator handles so-called hard operators automatically, as the given string is checked to see what class of symbol it might belong to. However, soft operators are not included in this set, as they are not always reserved in all situations. As such, when a soft operator does need to be parsed, this combinator should be used to do it explicitly.

Name Tokens

The Names interface handles the parsing of identifiers and operators.

data Names Source #

This type defines a uniform interface for defining parsers for user-defined names (identifiers and operators), independent of how whitespace should be handled after the name.

The parsing of names is mostly concerned with finding the longest valid name that is not a reserved name, such as a hard keyword or a special operator.

identifier :: Names -> Parsec String Source #

Parse an identifier based on the given NameDesc predicates identifierStart and identifierLetter. The NameDesc is provided by mkNames.

Capable of handling unicode characters if the configuration permits. If hard keywords are specified by the configuration, this parser is not permitted to parse them.

identifier' :: Names -> CharPredicate -> Parsec String Source #

Parse an identifier whose start satisfies the given predicate, and whose subsequent letters satisfy identifierLetter in the given NameDesc. The NameDesc is provided by mkNames.

Behaves as identifier, then ensures the first character matches the given predicate. Thus, identifier' can only refine the output of identifier; if identifier fails due to the first character, then so will identifier', even if this character passes the supplied predicate.

Capable of handling unicode characters if the configuration permits. If hard keywords are specified by the configuration, this parser is not permitted to parse them.
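For example, to accept only identifiers that begin with an upper-case letter (a sketch building on the hypothetical myLexer; CharPredicate is assumed to be Maybe (Char -> Bool), as in the descriptions module):

  -- isUpper comes from Data.Char
  typeName :: Parsec String
  typeName = identifier' (names (lexeme myLexer)) (Just isUpper)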

userDefinedOperator :: Names -> Parsec String Source #

Parse a user-defined operator based on the given SymbolDesc predicates operatorStart and operatorLetter. The SymbolDesc is provided by mkNames.

Capable of handling unicode characters if the configuration permits. If hard operators are specified by the configuration, this parser is not permitted to parse them.

userDefinedOperator' :: Names -> CharPredicate -> Parsec String Source #

Parse a user-defined operator whose first character satisfies the given predicate, and whose subsequent characters satisfy operatorLetter in the given SymbolDesc. The SymbolDesc is provided by mkNames.

Behaves as userDefinedOperator, then ensures the first character matches the given predicate. Thus, userDefinedOperator' can only refine the output of userDefinedOperator; if userDefinedOperator fails due to the first character, then so will userDefinedOperator', even if this character passes the supplied predicate.

Capable of handling unicode characters if the configuration permits. If hard operators are specified by the configuration, this parser is not permitted to parse them.

Numeric Tokens

These types and combinators parse numeric literals, such as integers and reals.

type CanHoldSigned = CanHoldSigned Source #

type CanHoldUnsigned = CanHoldUnsigned Source #

Integer Parsers

IntegerParsers handles integer parsing (signed and unsigned). This is mainly used by the combinators integer and natural.

data IntegerParsers (canHold :: Bits -> Type -> Constraint) Source #

A uniform interface for defining parsers for integer literals, independent of how whitespace should be handled after the literal or whether the literal should allow for negative numbers.

integer :: Lexeme -> IntegerParsers CanHoldSigned Source #

This is a collection of parsers concerned with handling signed integer literals.

Signed integer literals are an extension of unsigned integer literals which may be prefixed by a sign.

natural :: Lexeme -> IntegerParsers CanHoldUnsigned Source #

A collection of parsers concerned with handling unsigned (positive) integer literals.

Fixed-Base Parsers

decimal :: IntegerParsers canHold -> Parsec Integer Source #

Parse a single integer literal in decimal form (base 10).

hexadecimal :: IntegerParsers canHold -> Parsec Integer Source #

Parse a single integer literal in hexadecimal form (base 16).

octal :: IntegerParsers canHold -> Parsec Integer Source #

Parse a single integer literal in octal form (base 8).

binary :: IntegerParsers canHold -> Parsec Integer Source #

Parse a single integer literal in binary form (base 2).
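A few sketches using the lexeme-level collections and the hypothetical myLexer from above; the prefixes accepted for the non-decimal bases are governed by the NumericDesc in the lexical description.

  nat, int, addr :: Parsec Integer
  nat  = decimal     (natural (lexeme myLexer))  -- e.g. "42"
  int  = decimal     (integer (lexeme myLexer))  -- e.g. "-42", a sign is allowed
  addr = hexadecimal (natural (lexeme myLexer))  -- e.g. "0xff", subject to the NumericDesc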

Fixed-Width Numeric Tokens

These combinators tokenize numbers that must be within specific bit-widths. The possible bit-widths are provided by Bits.

Decimal Tokens

decimal8 :: forall a (canHold :: Bits -> Type -> Constraint). canHold 'B8 a => IntegerParsers canHold -> Parsec a Source #

This parser behaves the same as decimal except it ensures that the resulting value is a valid 8-bit number.

The resulting number will be converted to the given type a, which must be able to losslessly store the parsed value; this is enforced by the canHold constraint on the type. This accounts for unsignedness when necessary.
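For instance, with the fixed-width types from Data.Word and Data.Int (a sketch building on the hypothetical myLexer; the exact set of permitted result types is determined by the canHold constraint):

  import Data.Word (Word8)
  import Data.Int  (Int8)

  byte :: Parsec Word8
  byte = decimal8 (natural (lexeme myLexer))   -- accepts 0..255

  sbyte :: Parsec Int8
  sbyte = decimal8 (integer (lexeme myLexer))  -- accepts -128..127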

decimal16 :: forall a (canHold :: Bits -> Type -> Constraint). canHold 'B16 a => IntegerParsers canHold -> Parsec a Source #

This parser behaves the same as decimal except it ensures that the resulting value is a valid 16-bit number.

The resulting number will be converted to the given type a, which must be able to losslessly store the parsed value; this is enforced by the canHold constraint on the type. This accounts for unsignedness when necessary.

decimal32 :: forall a (canHold :: Bits -> Type -> Constraint). canHold 'B32 a => IntegerParsers canHold -> Parsec a Source #

This parser behaves the same as decimal except it ensures that the resulting value is a valid 32-bit number.

The resulting number will be converted to the given type a, which must be able to losslessly store the parsed value; this is enforced by the canHold constraint on the type. This accounts for unsignedness when necessary.

decimal64 :: forall a (canHold :: Bits -> Type -> Constraint). canHold 'B64 a => IntegerParsers canHold -> Parsec a Source #

This parser behaves the same as decimal except it ensures that the resulting value is a valid 64-bit number.

The resulting number will be converted to the given type a, which must be able to losslessly store the parsed value; this is enforced by the canHold constraint on the type. This accounts for unsignedness when necessary.

Hexadecimal Tokens

hexadecimal8 :: forall a (canHold :: Bits -> Type -> Constraint). canHold 'B8 a => IntegerParsers canHold -> Parsec a Source #

This parser behaves the same as hexadecimal except it ensures that the resulting value is a valid 8-bit number.

The resulting number will be converted to the given type a, which must be able to losslessly store the parsed value; this is enforced by the canHold constraint on the type. This accounts for unsignedness when necessary.

hexadecimal16 :: forall a (canHold :: Bits -> Type -> Constraint). canHold 'B16 a => IntegerParsers canHold -> Parsec a Source #

This parser behaves the same as hexadecimal except it ensures that the resulting value is a valid 16-bit number.

The resulting number will be converted to the given type a, which must be able to losslessly store the parsed value; this is enforced by the canHold constraint on the type. This accounts for unsignedness when necessary.

hexadecimal32 :: forall a (canHold :: Bits -> Type -> Constraint). canHold 'B32 a => IntegerParsers canHold -> Parsec a Source #

This parser behaves the same as hexadecimal except it ensures that the resulting value is a valid 32-bit number.

The resulting number will be converted to the given type a, which must be able to losslessly store the parsed value; this is enforced by the canHold constraint on the type. This accounts for unsignedness when necessary.

hexadecimal64 :: forall a (canHold :: Bits -> Type -> Constraint). canHold 'B64 a => IntegerParsers canHold -> Parsec a Source #

This parser behaves the same as hexadecimal except it ensures that the resulting value is a valid 64-bit number.

The resulting number will be converted to the given type a, which must be able to losslessly store the parsed value; this is enforced by the canHold constraint on the type. This accounts for unsignedness when necessary.

Octal Tokens

octal8 :: forall a (canHold :: Bits -> Type -> Constraint). canHold 'B8 a => IntegerParsers canHold -> Parsec a Source #

This parser behaves the same as octal except it ensures that the resulting value is a valid 8-bit number.

The resulting number will be converted to the given type a, which must be able to losslessly store the parsed value; this is enforced by the canHold constraint on the type. This accounts for unsignedness when necessary.

octal16 :: forall a (canHold :: Bits -> Type -> Constraint). canHold 'B16 a => IntegerParsers canHold -> Parsec a Source #

This parser behaves the same as octal except it ensures that the resulting value is a valid 16-bit number.

The resulting number will be converted to the given type a, which must be able to losslessly store the parsed value; this is enforced by the canHold constraint on the type. This accounts for unsignedness when necessary.

octal32 :: forall a (canHold :: Bits -> Type -> Constraint). canHold 'B32 a => IntegerParsers canHold -> Parsec a Source #

This parser behaves the same as octal except it ensures that the resulting value is a valid 32-bit number.

The resulting number will be converted to the given type a, which must be able to losslessly store the parsed value; this is enforced by the canHold constraint on the type. This accounts for unsignedness when necessary.

octal64 :: forall a (canHold :: Bits -> Type -> Constraint). canHold 'B64 a => IntegerParsers canHold -> Parsec a Source #

This parser behaves the same as octal except it ensures that the resulting value is a valid 64-bit number.

The resulting number will be converted to the given type a, which must be able to losslessly store the parsed value; this is enforced by the canHold constraint on the type. This accounts for unsignedness when necessary.

Binary Tokens

binary8 :: forall a (canHold :: Bits -> Type -> Constraint). canHold 'B8 a => IntegerParsers canHold -> Parsec a Source #

This parser behaves the same as binary except it ensures that the resulting value is a valid 8-bit number.

The resulting number will be converted to the given type a, which must be able to losslessly store the parsed value; this is enforced by the canHold constraint on the type. This accounts for unsignedness when necessary.

binary16 :: forall a (canHold :: Bits -> Type -> Constraint). canHold 'B16 a => IntegerParsers canHold -> Parsec a Source #

This parser behaves the same as binary except it ensures that the resulting value is a valid 16-bit number.

The resulting number will be converted to the given type a, which must be able to losslessly store the parsed value; this is enforced by the canHold constraint on the type. This accounts for unsignedness when necessary.

binary32 :: forall a (canHold :: Bits -> Type -> Constraint). canHold 'B32 a => IntegerParsers canHold -> Parsec a Source #

This parser behaves the same as binary except it ensures that the resulting value is a valid 32-bit number.

The resulting number will be converted to the given type a, which must be able to losslessly store the parsed value; this is enforced by the canHold constraint on the type. This accounts for unsignedness when necessary.

binary64 :: forall a (canHold :: Bits -> Type -> Constraint). canHold 'B64 a => IntegerParsers canHold -> Parsec a Source #

This parser behaves the same as binary except it ensures that the resulting value is a valid 64-bit number.

The resulting number will be converted to the given type a, which must be able to losslessly store the parsed value; this is enforced by the canHold constraint on the type. This accounts for unsignedness when necessary.

Textual Tokens

The TextParsers interface handles the parsing of string and character literals.

data TextParsers t Source #

This type defines a uniform interface for defining parsers for textual literals, independent of how whitespace should be handled after the literal.

The type of these literals is determined by the parameter t.

ascii :: TextParsers t -> Parsec t Source #

Parses a single t-literal, which may contain any graphic ASCII character. These are characters with ordinals in the range 0 to 127 inclusive.

It may also contain escape sequences, but only those which result in ASCII characters.

unicode :: TextParsers t -> Parsec t Source #

Parses a single t-literal, which may contain any unicode graphic character as defined by up to two UTF-16 codepoints.

It may also contain escape sequences.

latin1 :: TextParsers t -> Parsec t Source #

Parses a single t-literal, which may contain any graphic extended ASCII character. These are characters with ordinals in the range 0 to 255 inclusive.

It may also contain escape sequences, but only those which result in extended ASCII characters.

String Parsers

A Lexer provides the following TextParsers for string literals.

stringLiteral :: Lexeme -> TextParsers String Source #

A collection of parsers concerned with handling single-line string literals.

String literals are generally described by the TextDesc fields.
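For example, the same string-literal collection can be restricted to different character sets (a sketch building on the hypothetical myLexer from above):

  asciiStr, latinStr, uniStr :: Parsec String
  asciiStr = ascii   (stringLiteral (lexeme myLexer))
  latinStr = latin1  (stringLiteral (lexeme myLexer))
  uniStr   = unicode (stringLiteral (lexeme myLexer))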

rawStringLiteral :: Lexeme -> TextParsers String Source #

A collection of parsers concerned with handling single-line string literals, without handling any escape sequences: this includes literal-end characters and the escape prefix (often " and \ respectively).

String literals are generally described by the TextDesc fields.

multiStringLiteral :: Lexeme -> TextParsers String Source #

A collection of parsers concerned with handling multi-line string literals.

Multi-string literals are generally described by the TextDesc fields.

rawMultiStringLiteral :: Lexeme -> TextParsers String Source #

A collection of parsers concerned with handling multi-line string literals, without handling any escape sequences: this includes literal-end characters and the escape prefix (often " and \ respectively).

Multi-string literals are generally described by the TextDesc fields.

Character Parsers

A Lexer provides the following TextParsers for character literals.

charLiteral :: Lexeme -> TextParsers Char Source #

A collection of parsers concerned with handling character literals.

Character literals are generally described by the TextDesc fields.

Whitespace and Comments

Space and its fields are concerned with special treatment of whitespace itself.

Most of the time, the functionality herein will not be required, as lexeme and fully will consistently handle whitespace.

However, whitespace is significant in some languages, like Python and Haskell, in which case Space provides a way to control how whitespace is consumed.

data Space Source #

This type is concerned with special treatment of whitespace.

For the vast majority of cases, the functionality within this object shouldn't be needed, as whitespace is consistently handled by lexeme and fully. However, for grammars where whitespace is significant (like indentation-sensitive languages), this object provides some more fine-grained control over how whitespace is consumed by the parsers within lexeme.

skipComments :: Space -> Parsec () Source #

Skips zero or more comments.

The implementation of this combinator does not vary with whiteSpaceIsContextDependent. It uses the hide combinator so as not to appear as a valid alternative in an error message: adding a comment is often legal, but rarely a useful suggestion for making the input syntactically valid.

whiteSpace :: Space -> Parsec () Source #

Skips zero or more (insignificant) whitespace characters as well as comments.

The implementation of this parser depends on whether whiteSpaceIsContextDependent is true: when it is, this parser may change based on the use of the alter combinator.

This parser will always use the hide combinator so as not to appear as a valid alternative in an error message: it is likely that whitespace can be added at almost any point, but that does not make it a useful suggestion unless the whitespace is significant.

alter :: Space -> forall a. CharPredicate -> Parsec a -> Parsec a Source #

This combinator changes how lexemes parse whitespace for the duration of a given parser.

So long as whiteSpaceIsContextDependent is true, this combinator will be able to locally change the definition of whitespace during the given parser.

Examples

  • In indentation-sensitive languages, the indentation sensitivity is often ignored within parentheses or braces. In these cases, parens (alter withNewLine p) would allow unrestricted newlines within parentheses, as sketched below.
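A sketch of that parenthesised case, building on the hypothetical myLexer from above. Here parens is a hand-rolled bracketing combinator, isSpace comes from Data.Char, and CharPredicate is assumed to be Maybe (Char -> Bool); alter only has an effect when whiteSpaceIsContextDependent is set in the SpaceDesc.

  parens :: Parsec a -> Parsec a
  parens p = sym lexe "(" *> p <* sym lexe ")"
    where lexe = lexeme myLexer

  -- inside the parentheses, newlines are treated as ordinary (insignificant) whitespace
  grouped :: Parsec a -> Parsec a
  grouped p = parens (alter (space myLexer) (Just isSpace) p)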

initSpace :: Space -> Parsec () Source #

This parser initialises the whitespace used by the lexer when whiteSpaceIsContextDependent is true.

The whitespace is set to the implementation given by the lexical description. This parser must be used, by fully or otherwise, as the first thing the global parser does or an UnfilledRegisterException will occur.

See alter for how to change whitespace during a parse.
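A sketch of a hand-rolled top-level parser that initialises the whitespace explicitly rather than going through fully, building on the hypothetical myLexer from above; eof is assumed from Text.Gigaparsec.

  topLevel :: Parsec a -> Parsec a
  topLevel p = initSpace sp *> whiteSpace sp *> p <* eof
    where sp = space myLexer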