License	BSD-3-Clause
Maintainer	Jamie Willis, Gigaparsec Maintainers
Stability	experimental
Safe Haskell	Safe
Language	Haskell2010

Text.Gigaparsec.Token.Lexer

Contents

Lexing
Lexemes and Non-Lexemes
Symbolic Tokens
Name Tokens
Numeric Tokens
- Integer Parsers
  - Fixed-Base Parsers
  - Fixed-Width Numeric Tokens
Textual Tokens
- String Parsers
- Character Parsers
Whitespace and Comments

Description

This module provides a large selection of functionality concerned with lexing.

In traditional compilers, lexing and parsing are two largely separate processes; lexing turns raw input into a series of tokens, and parsing then processes these tokens. Parser combinators, on the other hand, are often implemented to deal directly with the input stream.

Nonetheless, a lexer abstraction may be achieved by defining a core set of lexing combinators that convert input to tokens, and then defining the parsing combinators in terms of these. The parsers defined using Lexer construct these lexing combinators, which creates a clear and logical separation from the rest of the parser.

Some of the parsers in this module may have been hand-optimised for performance; care has been taken to ensure that these implementations precisely match the semantics of the originals.

Synopsis

data Lexer
mkLexer :: LexicalDesc -> Lexer
mkLexerWithErrorConfig :: LexicalDesc -> ErrorConfig -> Lexer
lexeme :: Lexer -> Lexeme
nonlexeme :: Lexer -> Lexeme
fully :: Lexer -> forall a. Parsec a -> Parsec a
space :: Lexer -> Space
data Lexeme
apply :: Lexeme -> forall a. Parsec a -> Parsec a
sym :: Lexeme -> String -> Parsec ()
symbol :: Lexeme -> Symbol
names :: Lexeme -> Names
data Symbol
softKeyword :: Symbol -> String -> Parsec ()
softOperator :: Symbol -> String -> Parsec ()
data Names
identifier :: Names -> Parsec String
identifier' :: Names -> CharPredicate -> Parsec String
userDefinedOperator :: Names -> Parsec String
userDefinedOperator' :: Names -> CharPredicate -> Parsec String
type CanHoldSigned = CanHoldSigned
type CanHoldUnsigned = CanHoldUnsigned
data IntegerParsers (canHold :: Bits -> Type -> Constraint)
integer :: Lexeme -> IntegerParsers CanHoldSigned
natural :: Lexeme -> IntegerParsers CanHoldUnsigned
decimal :: IntegerParsers canHold -> Parsec Integer
hexadecimal :: IntegerParsers canHold -> Parsec Integer
octal :: IntegerParsers canHold -> Parsec Integer
binary :: IntegerParsers canHold -> Parsec Integer
decimal8 :: forall a (canHold :: Bits -> Type -> Constraint). canHold 'B8 a => IntegerParsers canHold -> Parsec a
decimal16 :: forall a (canHold :: Bits -> Type -> Constraint). canHold 'B16 a => IntegerParsers canHold -> Parsec a
decimal32 :: forall a (canHold :: Bits -> Type -> Constraint). canHold 'B32 a => IntegerParsers canHold -> Parsec a
decimal64 :: forall a (canHold :: Bits -> Type -> Constraint). canHold 'B64 a => IntegerParsers canHold -> Parsec a
hexadecimal8 :: forall a (canHold :: Bits -> Type -> Constraint). canHold 'B8 a => IntegerParsers canHold -> Parsec a
hexadecimal16 :: forall a (canHold :: Bits -> Type -> Constraint). canHold 'B16 a => IntegerParsers canHold -> Parsec a
hexadecimal32 :: forall a (canHold :: Bits -> Type -> Constraint). canHold 'B32 a => IntegerParsers canHold -> Parsec a
hexadecimal64 :: forall a (canHold :: Bits -> Type -> Constraint). canHold 'B64 a => IntegerParsers canHold -> Parsec a
octal8 :: forall a (canHold :: Bits -> Type -> Constraint). canHold 'B8 a => IntegerParsers canHold -> Parsec a
octal16 :: forall a (canHold :: Bits -> Type -> Constraint). canHold 'B16 a => IntegerParsers canHold -> Parsec a
octal32 :: forall a (canHold :: Bits -> Type -> Constraint). canHold 'B32 a => IntegerParsers canHold -> Parsec a
octal64 :: forall a (canHold :: Bits -> Type -> Constraint). canHold 'B64 a => IntegerParsers canHold -> Parsec a
binary8 :: forall a (canHold :: Bits -> Type -> Constraint). canHold 'B8 a => IntegerParsers canHold -> Parsec a
binary16 :: forall a (canHold :: Bits -> Type -> Constraint). canHold 'B16 a => IntegerParsers canHold -> Parsec a
binary32 :: forall a (canHold :: Bits -> Type -> Constraint). canHold 'B32 a => IntegerParsers canHold -> Parsec a
binary64 :: forall a (canHold :: Bits -> Type -> Constraint). canHold 'B64 a => IntegerParsers canHold -> Parsec a
data TextParsers t
ascii :: TextParsers t -> Parsec t
unicode :: TextParsers t -> Parsec t
latin1 :: TextParsers t -> Parsec t
stringLiteral :: Lexeme -> TextParsers String
rawStringLiteral :: Lexeme -> TextParsers String
multiStringLiteral :: Lexeme -> TextParsers String
rawMultiStringLiteral :: Lexeme -> TextParsers String
charLiteral :: Lexeme -> TextParsers Char
data Space
skipComments :: Space -> Parsec ()
whiteSpace :: Space -> Parsec ()
alter :: Space -> forall a. CharPredicate -> Parsec a -> Parsec a
initSpace :: Space -> Parsec ()

Lexing

data Lexer

A lexer describes how to transform the input string into a series of tokens.

mkLexer

Arguments

:: LexicalDesc	The description of the lexical structure of the language.
-> Lexer	A lexer which can convert the input stream into a series of lexemes.

Create a Lexer with a given description for the lexical structure of the language.

mkLexerWithErrorConfig

Arguments

:: LexicalDesc	The description of the lexical structure of the language.
-> ErrorConfig	The description of how to process errors during lexing.
-> Lexer	A lexer which can convert the input stream into a series of lexemes.

Create a Lexer with a given description for the lexical structure of the language, which reports errors according to the given error config.
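As a minimal sketch of building a lexer, the following assumes that plain (a default LexicalDesc) is exported by Text.Gigaparsec.Token.Descriptions, and that parse and Result come from Text.Gigaparsec with String usable as the error type. The names lexer, openParen and example are illustrative only.

```haskell
import Text.Gigaparsec (Parsec, Result (..), parse)
import Text.Gigaparsec.Token.Descriptions (plain)
import Text.Gigaparsec.Token.Lexer (Lexer, lexeme, mkLexer, sym)

-- Assumption: `plain` is a default description; a real language would
-- override its fields to describe its own names, symbols and whitespace.
lexer :: Lexer
lexer = mkLexer plain

-- A token parser obtained from the lexer: an opening parenthesis,
-- with any whitespace after it consumed as well.
openParen :: Parsec ()
openParen = sym (lexeme lexer) "("

-- Assumption: String can serve as the error type for `parse`.
example :: Result String ()
example = parse openParen "(  "
```

A lexer is typically defined once at the top level and shared by every parser in the grammar, so that all tokens agree on what counts as whitespace, a keyword, and so on.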

Lexemes and Non-Lexemes

A key distinction in lexers is between lexemes and non-lexemes:

- lexeme consumes whitespace. It should be used by a wider parser, to ensure whitespace is handled uniformly. The output of lexeme can be considered a token as provided by traditional lexers, and can be used by the parser.
- nonlexeme does not consume whitespace. It should be used to define further composite tokens or in special circumstances where whitespace should not be consumed. One may consider the output of nonlexeme to still be in the lexing stage of parsing, and not necessarily a valid token.
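The difference can be sketched with the same numeric parser taken from each side. This assumes plain from Text.Gigaparsec.Token.Descriptions treats ordinary spaces as whitespace, that char comes from Text.Gigaparsec.Char, and that String is usable as the error type of parse. The names spaced, tight and run are illustrative.

```haskell
import Text.Gigaparsec (Parsec, Result (..), parse)
import Text.Gigaparsec.Char (char)
import Text.Gigaparsec.Token.Descriptions (plain)
import Text.Gigaparsec.Token.Lexer
  (Lexer, decimal, lexeme, mkLexer, natural, nonlexeme)

lexer :: Lexer
lexer = mkLexer plain

-- Lexeme version: whitespace after the number is skipped,
-- so the following comma is reached even in "12 ,".
spaced :: Parsec Integer
spaced = decimal (natural (lexeme lexer)) <* char ','

-- Non-lexeme version: nothing is skipped after the number,
-- so the comma must follow the digits immediately.
tight :: Parsec Integer
tight = decimal (natural (nonlexeme lexer)) <* char ','

-- Helper fixing the error type to String (an assumption of this sketch).
run :: Parsec a -> String -> Result String a
run = parse
```

Under these assumptions, run spaced "12 ," should succeed while run tight "12 ," should fail at the space; both should accept "12,".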

Lexemes

Ideally, a wider parser should not be concerned with handling whitespace, as it is responsible for dealing with a stream of tokens. With parser combinators, however, there is usually no clear distinction between the parsing phase and the lexing phase. That said, it is good practice to establish a logical separation between the two worlds. As such, lexeme contains parsers that parse tokens, and these are whitespace-aware: whitespace will be consumed after any of these parsers has run. It is not required that whitespace be present.

lexeme :: Lexer -> Lexeme

This contains parsers for tokens treated as "words", such that whitespace will be consumed after each token has been parsed.

Non-Lexemes

Whilst the functionality in lexeme is strongly recommended for wider use in a parser, the functionality here may be useful for more specialised use-cases. In particular, these may form the building blocks for more complex tokens (where whitespace is not allowed between them, say), in which case these compound tokens can be turned into lexemes manually.

For example, the lexer has no configuration for trailing specifiers on numeric literals (such as 1024L in Scala): the desired numeric literal parser can be extended with this functionality before whitespace is consumed by using the variant found in nonlexeme.
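The trailing-specifier case might be sketched as follows: the number is parsed with the non-lexeme parser, the suffix is required directly after it, and apply then turns the compound token back into a lexeme. This assumes plain from Text.Gigaparsec.Token.Descriptions, char from Text.Gigaparsec.Char, and String as the error type of parse. The name longLiteral is hypothetical.

```haskell
import Text.Gigaparsec (Parsec, Result (..), parse)
import Text.Gigaparsec.Char (char)
import Text.Gigaparsec.Token.Descriptions (plain)
import Text.Gigaparsec.Token.Lexer
  (Lexer, apply, decimal, lexeme, mkLexer, natural, nonlexeme)

lexer :: Lexer
lexer = mkLexer plain

-- Digits immediately followed by 'L', with no whitespace in between;
-- `apply (lexeme lexer)` consumes whitespace after the whole token.
longLiteral :: Parsec Integer
longLiteral =
  apply (lexeme lexer) (decimal (natural (nonlexeme lexer)) <* char 'L')

-- Helper fixing the error type to String (an assumption of this sketch).
run :: Parsec a -> String -> Result String a
run = parse
```

Had the lexeme variant of the number been used instead, an input such as "1024 L" would also be accepted, which is usually not what a suffix is meant to allow.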

These tokens can also be used for lexical extraction, which can be performed by the ErrorBuilder typeclass: this can be used to try and extract tokens from the input stream when an error happens, to provide a more informative error. In this case, it is desirable to not consume whitespace after the token to keep the error tight and precise.

nonlexeme :: Lexer -> Lexeme

This contains parsers for tokens that do not give any special treatment to whitespace.

Fully and Space

fully :: Lexer -> forall a. Parsec a -> Parsec a

This combinator ensures a parser fully parses all available input, and consumes whitespace at the start.
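A small sketch of wrapping a top-level parser with fully, assuming plain from Text.Gigaparsec.Token.Descriptions treats ordinary spaces as whitespace and String is usable as the error type of parse. The name program is illustrative.

```haskell
import Text.Gigaparsec (Parsec, Result (..), parse)
import Text.Gigaparsec.Token.Descriptions (plain)
import Text.Gigaparsec.Token.Lexer
  (Lexer, decimal, fully, lexeme, mkLexer, natural)

lexer :: Lexer
lexer = mkLexer plain

-- Leading whitespace is skipped by `fully`, trailing whitespace by the
-- lexeme parser, and no further input may remain afterwards.
program :: Parsec Integer
program = fully lexer (decimal (natural (lexeme lexer)))

-- Helper fixing the error type to String (an assumption of this sketch).
run :: Parsec a -> String -> Result String a
run = parse
```

fully is normally applied exactly once, around the outermost parser of the grammar.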

space :: Lexer -> Space

This contains parsers that directly treat whitespace.

`Lexeme` Fields

Despite their differences, lexemes and non-lexemes share a lot of common functionality. The type Lexeme describes both lexemes and non-lexemes, so that this common functionality may be exploited.

data Lexeme

A Lexeme is a collection of parsers for handling various tokens (such as symbols and names), where either all or none of the parsers consume whitespace.

Lexemes and Non-Lexemes are described by these common fields.

apply :: Lexeme -> forall a. Parsec a -> Parsec a

This turns a non-lexeme parser into a lexeme one by ensuring whitespace is consumed after the parser.

sym :: Lexeme -> String -> Parsec ()

Parse the given string.

symbol :: Lexeme -> Symbol

This contains lexing functionality relevant to the parsing of atomic symbols.

names :: Lexeme -> Names

This contains lexing functionality relevant to the parsing of names, which include operators or identifiers. The parsing of names is mostly concerned with finding the longest valid name that is not a reserved name, such as a hard keyword or a special operator.

Symbolic Tokens

The Symbol interface handles the parsing of symbolic tokens, such as keywords.

data Symbol

This contains lexing functionality relevant to the parsing of atomic symbols.

Symbols are characterised by their "unitness", that is, every parser inside returns (). This is because they all parse a specific known entity, and, as such, the result of the parse is irrelevant. These can be things such as reserved names, or small symbols like parentheses.

This type also contains a means of creating new symbols as well as implicit conversions to allow for Haskell's string literals (with OverloadedStrings enabled) to serve as symbols within a parser.

softKeyword :: Symbol -> String -> Parsec ()

This combinator parses a given soft keyword atomically: the keyword is only valid if it is not followed directly by a character which would make it a larger valid identifier.

Soft keywords are keywords that are only reserved within certain contexts. The sym combinator handles so-called hard keywords automatically, as the given string is checked to see what class of symbol it might belong to. However, soft keywords are not included in this set, as they are not reserved in all situations. As such, when a soft keyword does need to be parsed, this combinator should be used to do it explicitly. Care should be taken to ensure that soft keywords take parsing priority over identifiers when they do occur.
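A sketch of parsing a soft keyword. It assumes that Text.Gigaparsec.Token.Descriptions exports plain and plainName, that LexicalDesc has a nameDesc field, and that NameDesc has identifierStart and identifierLetter fields of type Maybe (Char -> Bool); these details are assumptions about the description types rather than something stated here. The keyword "as" and the name kwAs are illustrative.

```haskell
import Data.Char (isAlpha, isAlphaNum)
import Text.Gigaparsec (Parsec, Result (..), parse)
import Text.Gigaparsec.Token.Descriptions
  ( LexicalDesc (nameDesc)
  , NameDesc (identifierLetter, identifierStart)
  , plain
  , plainName
  )
import Text.Gigaparsec.Token.Lexer (Lexer, lexeme, mkLexer, softKeyword, symbol)

-- Assumption: identifier characters are configured through these fields,
-- so the lexer knows which characters would extend "as" into a longer name.
lexer :: Lexer
lexer = mkLexer plain
  { nameDesc = plainName
      { identifierStart = Just isAlpha
      , identifierLetter = Just isAlphaNum
      }
  }

-- "as" is only matched when it is not the prefix of a longer identifier.
kwAs :: Parsec ()
kwAs = softKeyword (symbol (lexeme lexer)) "as"

-- Helper fixing the error type to String (an assumption of this sketch).
run :: Parsec a -> String -> Result String a
run = parse
```

With this configuration, "as x" should be accepted, whereas "ask" should be rejected because the k would make it a larger valid identifier.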

softOperator :: Symbol -> String -> Parsec ()

This combinator parses a given soft operator atomically: the operator is only valid if it is not followed directly by a character which would make it a larger valid operator (reserved or otherwise).

Soft operators are operators that are only reserved within certain contexts. The sym combinator handles so-called hard operators automatically, as the given string is checked to see what class of symbol it might belong to. However, soft operators are not included in this set, as they are not reserved in all situations. As such, when a soft operator does need to be parsed, this combinator should be used to do it explicitly.

Name Tokens

The Names interface handles the parsing of identifiers and operators.

data Names

This type defines a uniform interface for defining parsers for user-defined names (identifiers and operators), independent of how whitespace should be handled after the name.

The parsing of names is mostly concerned with finding the longest valid name that is not a reserved name, such as a hard keyword or a special operator.

identifier :: Names -> Parsec String

Parse an identifier based on the given NameDesc predicates identifierStart and identifierLetter. The NameDesc is provided by mkNames.

Capable of handling unicode characters if the configuration permits. If hard keywords are specified by the configuration, this parser is not permitted to parse them.

identifier' :: Names -> CharPredicate -> Parsec String

Parse an identifier whose start satisfies the given predicate, and whose subsequent letters satisfy identifierLetter in the given NameDesc. The NameDesc is provided by mkNames.

Behaves as identifier, then ensures the first character matches the given predicate. Thus, identifier' can only refine the output of identifier; if identifier fails due to the first character, then so will identifier', even if this character passes the supplied predicate.

Capable of handling unicode characters if the configuration permits. If hard keywords are specified by the configuration, this parser is not permitted to parse them.
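The relationship between identifier and identifier' might be sketched as below. It assumes the description record fields nameDesc, symbolDesc, identifierStart, identifierLetter and hardKeywords (a Set String), the defaults plain, plainName and plainSymbol, and that CharPredicate is Maybe (Char -> Bool); all of these are assumptions about Text.Gigaparsec.Token.Descriptions. The names ident and typeName and the keyword "if" are illustrative.

```haskell
import Data.Char (isAlpha, isAlphaNum, isUpper)
import qualified Data.Set as Set
import Text.Gigaparsec (Parsec, Result (..), parse)
import Text.Gigaparsec.Token.Descriptions
  ( LexicalDesc (nameDesc, symbolDesc)
  , NameDesc (identifierLetter, identifierStart)
  , SymbolDesc (hardKeywords)
  , plain
  , plainName
  , plainSymbol
  )
import Text.Gigaparsec.Token.Lexer
  (Lexer, identifier, identifier', lexeme, mkLexer, names)

-- Assumed configuration: letters start an identifier, letters and digits
-- continue it, and "if" is reserved as a hard keyword.
lexer :: Lexer
lexer = mkLexer plain
  { nameDesc = plainName
      { identifierStart = Just isAlpha
      , identifierLetter = Just isAlphaNum
      }
  , symbolDesc = plainSymbol { hardKeywords = Set.fromList ["if"] }
  }

-- Any identifier that is not a hard keyword.
ident :: Parsec String
ident = identifier (names (lexeme lexer))

-- A refinement of `ident`: the first character must also be upper-case.
typeName :: Parsec String
typeName = identifier' (names (lexeme lexer)) (Just isUpper)

-- Helper fixing the error type to String (an assumption of this sketch).
run :: Parsec a -> String -> Result String a
run = parse
```

Because identifier' only refines identifier, passing a predicate that accepts, say, digits would not make "1abc" parse here: the configured identifierStart rejects it first.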

userDefinedOperator :: Names -> Parsec String

Parse a user-defined operator based on the given SymbolDesc predicates operatorStart and operatorLetter. The SymbolDesc is provided by mkNames.

Capable of handling unicode characters if the configuration permits. If hard operators are specified by the configuration, this parser is not permitted to parse them.

userDefinedOperator' :: Names -> CharPredicate -> Parsec String

Parse a user-defined operator whose first character satisfies the given predicate, and whose subsequent characters satisfy operatorLetter in the given SymbolDesc. The SymbolDesc is provided by mkNames.

Behaves as userDefinedOperator, then ensures the first character matches the given predicate. Thus, userDefinedOperator' can only refine the output of userDefinedOperator; if userDefinedOperator fails due to the first character, then so will userDefinedOperator', even if this character passes the supplied predicate.

Capable of handling unicode characters if the configuration permits. If hard operators are specified by the configuration, this parser is not permitted to parse them.

Numeric Tokens

These types and combinators parse numeric literals, such as integers and reals.

type CanHoldSigned = CanHoldSigned

type CanHoldUnsigned = CanHoldUnsigned

Integer Parsers

IntegerParsers handles integer parsing (signed and unsigned). This is mainly used by the combinators integer and natural.

data IntegerParsers (canHold :: Bits -> Type -> Constraint)

A uniform interface for defining parsers for integer literals, independent of how whitespace should be handled after the literal or whether the literal should allow for negative numbers.

integer :: Lexeme -> IntegerParsers CanHoldSigned

This is a collection of parsers concerned with handling signed integer literals.

Signed integer literals are an extension of unsigned integer literals which may be prefixed by a sign.

natural :: Lexeme -> IntegerParsers CanHoldUnsigned

A collection of parsers concerned with handling unsigned (non-negative) integer literals.

Fixed-Base Parsers

decimal :: IntegerParsers canHold -> Parsec Integer

Parse a single integer literal in decimal form (base 10).

hexadecimal :: IntegerParsers canHold -> Parsec Integer

Parse a single integer literal in hexadecimal form (base 16).

octal :: IntegerParsers canHold -> Parsec Integer

Parse a single integer literal in octal form (base 8).

binary :: IntegerParsers canHold -> Parsec Integer

Parse a single integer literal in binary form (base 2).
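The fixed-base parsers are selected from either integer or natural, as in the sketch below. It assumes plain from Text.Gigaparsec.Token.Descriptions, and in particular that this default description expects hexadecimal literals to carry a 0x prefix and allows a leading minus sign on signed literals; both points depend on the numeric description in use. The names signedDecimal and unsignedHex are illustrative.

```haskell
import Text.Gigaparsec (Parsec, Result (..), parse)
import Text.Gigaparsec.Token.Descriptions (plain)
import Text.Gigaparsec.Token.Lexer
  (Lexer, decimal, hexadecimal, integer, lexeme, mkLexer, natural)

lexer :: Lexer
lexer = mkLexer plain

-- A signed base-10 literal.
signedDecimal :: Parsec Integer
signedDecimal = decimal (integer (lexeme lexer))

-- An unsigned base-16 literal (assumed here to be written with a 0x prefix).
unsignedHex :: Parsec Integer
unsignedHex = hexadecimal (natural (lexeme lexer))

-- Helper fixing the error type to String (an assumption of this sketch).
run :: Parsec a -> String -> Result String a
run = parse
```

octal and binary are used in the same way; only the accessor applied to the IntegerParsers value changes.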

Fixed-Width Numeric Tokens

These combinators tokenize numbers that must be within specific bit-widths. The possible bit-widths are provided by Bits.

Decimal Tokens

decimal8 :: forall a (canHold :: Bits -> Type -> Constraint). canHold 'B8 a => IntegerParsers canHold -> Parsec a

This parser behaves the same as decimal except it ensures that the resulting value is a valid 8-bit number.
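A sketch of a bounded literal, where the result type is chosen by the caller. It assumes that the library provides an instance allowing Word8 to hold an unsigned 8-bit value (that is, CanHoldUnsigned 'B8 Word8), together with plain from Text.Gigaparsec.Token.Descriptions. The name byte is illustrative.

```haskell
import Data.Word (Word8)
import Text.Gigaparsec (Parsec, Result (..), parse)
import Text.Gigaparsec.Token.Descriptions (plain)
import Text.Gigaparsec.Token.Lexer
  (Lexer, decimal8, lexeme, mkLexer, natural)

lexer :: Lexer
lexer = mkLexer plain

-- The type annotation picks the result type; the canHold constraint
-- (assumed to be satisfied for Word8) ensures it is wide enough.
byte :: Parsec Word8
byte = decimal8 (natural (lexeme lexer))

-- Helper fixing the error type to String (an assumption of this sketch).
run :: Parsec a -> String -> Result String a
run = parse
```

Under these assumptions, "255" should be accepted and "256" rejected as out of range, rather than silently wrapping around.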