License | BSD-3-Clause |
---|---|
Maintainer | Jamie Willis, Gigaparsec Maintainers |
Stability | experimental |
Safe Haskell | Safe |
Language | Haskell2010 |
This module provides a large selection of functionality concerned with lexing.
In traditional compilers, lexing and parsing are two largely separate processes; lexing turns raw input into a series of tokens, and parsing then processes these tokens. Parser combinators, on the other hand, are often implemented to deal directly with the input stream.
Nonetheless, a lexer abstraction may be achieved by defining a core set of lexing combinators that convert input to tokens,
and then defining the parsing combinators in terms of these.
The parsers defined using Lexer construct these lexing combinators, which creates a clear and logical separation from the rest of the parser.
It is possible that some of the parsers found within this module have been hand-optimised for performance: care will have been taken to ensure these implementations precisely match the semantics of the originals.
Synopsis
- data Lexer
- mkLexer :: LexicalDesc -> Lexer
- mkLexerWithErrorConfig :: LexicalDesc -> ErrorConfig -> Lexer
- lexeme :: Lexer -> Lexeme
- nonlexeme :: Lexer -> Lexeme
- fully :: Lexer -> forall a. Parsec a -> Parsec a
- space :: Lexer -> Space
- data Lexeme
- apply :: Lexeme -> forall a. Parsec a -> Parsec a
- sym :: Lexeme -> String -> Parsec ()
- symbol :: Lexeme -> Symbol
- names :: Lexeme -> Names
- data Symbol
- softKeyword :: Symbol -> String -> Parsec ()
- softOperator :: Symbol -> String -> Parsec ()
- data Names
- identifier :: Names -> Parsec String
- identifier' :: Names -> CharPredicate -> Parsec String
- userDefinedOperator :: Names -> Parsec String
- userDefinedOperator' :: Names -> CharPredicate -> Parsec String
- type CanHoldSigned = CanHoldSigned
- type CanHoldUnsigned = CanHoldUnsigned
- data IntegerParsers (canHold :: Bits -> Type -> Constraint)
- integer :: Lexeme -> IntegerParsers CanHoldSigned
- natural :: Lexeme -> IntegerParsers CanHoldUnsigned
- decimal :: IntegerParsers canHold -> Parsec Integer
- hexadecimal :: IntegerParsers canHold -> Parsec Integer
- octal :: IntegerParsers canHold -> Parsec Integer
- binary :: IntegerParsers canHold -> Parsec Integer
- decimal8 :: forall a (canHold :: Bits -> Type -> Constraint). canHold 'B8 a => IntegerParsers canHold -> Parsec a
- decimal16 :: forall a (canHold :: Bits -> Type -> Constraint). canHold 'B16 a => IntegerParsers canHold -> Parsec a
- decimal32 :: forall a (canHold :: Bits -> Type -> Constraint). canHold 'B32 a => IntegerParsers canHold -> Parsec a
- decimal64 :: forall a (canHold :: Bits -> Type -> Constraint). canHold 'B64 a => IntegerParsers canHold -> Parsec a
- hexadecimal8 :: forall a (canHold :: Bits -> Type -> Constraint). canHold 'B8 a => IntegerParsers canHold -> Parsec a
- hexadecimal16 :: forall a (canHold :: Bits -> Type -> Constraint). canHold 'B16 a => IntegerParsers canHold -> Parsec a
- hexadecimal32 :: forall a (canHold :: Bits -> Type -> Constraint). canHold 'B32 a => IntegerParsers canHold -> Parsec a
- hexadecimal64 :: forall a (canHold :: Bits -> Type -> Constraint). canHold 'B64 a => IntegerParsers canHold -> Parsec a
- octal8 :: forall a (canHold :: Bits -> Type -> Constraint). canHold 'B8 a => IntegerParsers canHold -> Parsec a
- octal16 :: forall a (canHold :: Bits -> Type -> Constraint). canHold 'B16 a => IntegerParsers canHold -> Parsec a
- octal32 :: forall a (canHold :: Bits -> Type -> Constraint). canHold 'B32 a => IntegerParsers canHold -> Parsec a
- octal64 :: forall a (canHold :: Bits -> Type -> Constraint). canHold 'B64 a => IntegerParsers canHold -> Parsec a
- binary8 :: forall a (canHold :: Bits -> Type -> Constraint). canHold 'B8 a => IntegerParsers canHold -> Parsec a
- binary16 :: forall a (canHold :: Bits -> Type -> Constraint). canHold 'B16 a => IntegerParsers canHold -> Parsec a
- binary32 :: forall a (canHold :: Bits -> Type -> Constraint). canHold 'B32 a => IntegerParsers canHold -> Parsec a
- binary64 :: forall a (canHold :: Bits -> Type -> Constraint). canHold 'B64 a => IntegerParsers canHold -> Parsec a
- data TextParsers t
- ascii :: TextParsers t -> Parsec t
- unicode :: TextParsers t -> Parsec t
- latin1 :: TextParsers t -> Parsec t
- stringLiteral :: Lexeme -> TextParsers String
- rawStringLiteral :: Lexeme -> TextParsers String
- multiStringLiteral :: Lexeme -> TextParsers String
- rawMultiStringLiteral :: Lexeme -> TextParsers String
- charLiteral :: Lexeme -> TextParsers Char
- data Space
- skipComments :: Space -> Parsec ()
- whiteSpace :: Space -> Parsec ()
- alter :: Space -> forall a. CharPredicate -> Parsec a -> Parsec a
- initSpace :: Space -> Parsec ()
Lexing
mkLexer Source #
:: LexicalDesc | The description of the lexical structure of the language. |
-> Lexer | A lexer which can convert the input stream into a series of lexemes. |
Create a Lexer with a given description for the lexical structure of the language.
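As a minimal sketch of how a Lexer might be built and run: this assumes a default description named plain exported by Text.Gigaparsec.Token.Descriptions, and the parse function and Result type from Text.Gigaparsec, none of which are documented on this page. The names lexer, number and runNumber are illustrative only.

```haskell
import Text.Gigaparsec (Parsec, Result (..), parse)
import Text.Gigaparsec.Token.Descriptions (plain)
import Text.Gigaparsec.Token.Lexer
  (Lexer, decimal, fully, lexeme, mkLexer, natural)

-- Assumption: 'plain' is a default LexicalDesc with ordinary whitespace.
lexer :: Lexer
lexer = mkLexer plain

-- A whitespace-aware decimal literal, wrapped in 'fully' so that leading
-- whitespace is skipped and all of the input must be consumed.
number :: Parsec Integer
number = fully lexer (decimal (natural (lexeme lexer)))

-- Runs the parser, rendering any error as a String
-- (assumes 'parse' can build String errors).
runNumber :: String -> Either String Integer
runNumber input = case parse number input of
  Success n -> Right n
  Failure err -> Left err
```

Building the Lexer once and sharing it between all token parsers keeps the whitespace handling uniform across the grammar.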
mkLexerWithErrorConfig Source #
:: LexicalDesc | The description of the lexical structure of the language. |
-> ErrorConfig | The description of how to process errors during lexing. |
-> Lexer | A lexer which can convert the input stream into a series of lexemes. |
Create a Lexer with a given description for the lexical structure of the language, which reports errors according to the given error config.
Lexemes and Non-Lexemes
A key distinction in lexers is between lexemes and non-lexemes:
- lexeme consumes whitespace. It should be used by a wider parser, to ensure whitespace is handled uniformly. The output of lexeme can be considered a token as provided by traditional lexers, and can be used by the parser.
- nonlexeme does not consume whitespace. It should be used to define further composite tokens or in special circumstances where whitespace should not be consumed. One may consider the output of nonlexeme to still be in the lexing stage of parsing, and not necessarily a valid token.
Lexemes
Ideally, a wider parser should not be concerned with handling whitespace,
as it is responsible for dealing with a stream of tokens.
With parser combinators, however, it is usually not the case that there is a separate distinction between the parsing phase and the lexing phase.
That said, it is good practice to establish a logical separation between the two worlds.
As such, lexeme contains parsers that parse tokens, and these are whitespace-aware.
This means that whitespace will be consumed after any of these parsers are parsed.
It is not required that whitespace be present.
lexeme :: Lexer -> Lexeme Source #
This contains parsers for tokens treated as "words", such that whitespace will be consumed after each token has been parsed.
Non-Lexemes
Whilst the functionality in lexeme is strongly recommended for wider use in a parser, the functionality here may be useful for more specialised use-cases. In particular, these may form the building blocks for more complex tokens (where whitespace is not allowed between them, say), in which case these compound tokens can be turned into lexemes manually.
For example, the lexer does not have configuration for trailing specifiers on numeric literals (like 1024L in Scala, say): the desired numeric literal parser could be extended with this functionality before whitespace is consumed by using the non-lexeme variant found here.
These tokens can also be used for lexical extraction, which can be performed by the ErrorBuilder typeclass: this can be used to try and extract tokens from the input stream when an error happens, to provide a more informative error. In this case, it is desirable to not consume whitespace after the token to keep the error tight and precise.
nonlexeme :: Lexer -> Lexeme Source #
This contains parsers for tokens that do not give any special treatment to whitespace.
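The trailing-specifier case mentioned above might be sketched as follows. This is an illustration rather than a recommended implementation: it assumes the default description plain from Text.Gigaparsec.Token.Descriptions, char from Text.Gigaparsec.Char, and parse and Result from Text.Gigaparsec; longLiteral and runLong are hypothetical names.

```haskell
import Text.Gigaparsec (Parsec, Result (..), parse)
import Text.Gigaparsec.Char (char)
import Text.Gigaparsec.Token.Descriptions (plain)
import Text.Gigaparsec.Token.Lexer
  (Lexer, apply, decimal, fully, lexeme, mkLexer, natural, nonlexeme)

-- Assumption: 'plain' is a default LexicalDesc.
lexer :: Lexer
lexer = mkLexer plain

-- The digits are read with the non-lexeme variant, so no whitespace can
-- separate the number from its suffix; 'apply' then turns the compound
-- token into a lexeme by consuming whitespace after the suffix.
longLiteral :: Parsec Integer
longLiteral =
  apply (lexeme lexer) (decimal (natural (nonlexeme lexer)) <* char 'L')

-- Hypothetical runner that renders errors as Strings.
runLong :: String -> Either String Integer
runLong input = case parse (fully lexer longLiteral) input of
  Success n -> Right n
  Failure err -> Left err
```

Under these assumptions, input with a space between the digits and the suffix would be rejected, since the suffix is required immediately after the last digit.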
Fully and Space
fully :: Lexer -> forall a. Parsec a -> Parsec a Source #
This combinator ensures a parser fully parses all available input, and consumes whitespace at the start.
Lexeme
Fields
Despite their differences, lexemes and non-lexemes share a lot of common functionality.
The type Lexeme describes both lexemes and non-lexemes, so that this common functionality may be exploited.
A Lexeme is a collection of parsers for handling various tokens (such as symbols and names), where either all or none of the parsers consume whitespace.
Lexemes and Non-Lexemes are described by these common fields.
apply :: Lexeme -> forall a. Parsec a -> Parsec a Source #
This turns a non-lexeme parser into a lexeme one by ensuring whitespace is consumed after the parser.
symbol :: Lexeme -> Symbol Source #
This contains lexing functionality relevant to the parsing of atomic symbols.
names :: Lexeme -> Names Source #
This contains lexing functionality relevant to the parsing of names, which include operators or identifiers. The parsing of names is mostly concerned with finding the longest valid name that is not a reserved name, such as a hard keyword or a special operator.
Symbolic Tokens
The Symbol interface handles the parsing of symbolic tokens, such as keywords.
This contains lexing functionality relevant to the parsing of atomic symbols.
Symbols are characterised by their "unitness", that is, every parser inside returns the unit value, (). This is because they all parse a specific known entity, and, as such, the result of the parse is irrelevant. These can be things such as reserved names, or small symbols like parentheses.
This type also contains a means of creating new symbols as well as implicit conversions to allow for Haskell's string literals (with OverloadedStrings enabled) to serve as symbols within a parser.
softKeyword :: Symbol -> String -> Parsec () Source #
This combinator parses a given soft keyword atomically: the keyword is only valid if it is not followed directly by a character which would make it a larger valid identifier.
Soft keywords are keywords that are only reserved within certain contexts. The apply combinator handles so-called hard keywords automatically, as the given string is checked to see what class of symbol it might belong to. However, soft keywords are not included in this set, as they are not always reserved in all situations. As such, when a soft keyword does need to be parsed, this combinator should be used to do it explicitly. Care should be taken to ensure that soft keywords take parsing priority over identifiers when they do occur.
softOperator :: Symbol -> String -> Parsec () Source #
This combinator parses a given soft operator atomically: the operator is only valid if it is not followed directly by a character which would make it a larger valid operator (reserved or otherwise).
Soft operators are operators that are only reserved within certain contexts. The apply combinator handles so-called hard operators automatically, as the given string is checked to see what class of symbol it might belong to. However, soft operators are not included in this set, as they are not always reserved in all situations. As such, when a soft operator does need to be parsed, this combinator should be used to do it explicitly.
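A sketch of giving a soft keyword priority over identifiers follows. It rests on several assumptions not stated on this page: that Text.Gigaparsec.Token.Descriptions exports plain and plainName, that LexicalDesc and NameDesc expose the record fields nameDesc, identifierStart and identifierLetter, and that a CharPredicate is a Maybe (Char -> Bool). The Token type and the names asyncOrName and runToken are hypothetical.

```haskell
import Control.Applicative ((<|>))
import Data.Char (isAlpha, isAlphaNum)
import Text.Gigaparsec (Parsec, Result (..), parse)
import Text.Gigaparsec.Token.Descriptions
  (LexicalDesc (..), NameDesc (..), plain, plainName)
import Text.Gigaparsec.Token.Lexer
  (Lexeme, Lexer, fully, identifier, lexeme, mkLexer, names, softKeyword, symbol)

-- Assumption: these record fields exist and take 'Maybe (Char -> Bool)'.
desc :: LexicalDesc
desc = plain
  { nameDesc = plainName
      { identifierStart = Just isAlpha
      , identifierLetter = Just isAlphaNum
      }
  }

lexer :: Lexer
lexer = mkLexer desc

lx :: Lexeme
lx = lexeme lexer

-- Hypothetical token type for the illustration.
data Token = Async | Name String
  deriving (Eq, Show)

-- The soft keyword is tried first so that it takes priority over the
-- identifier parser. A longer word that merely starts with "async" is
-- not accepted by 'softKeyword', so it falls through to 'identifier'.
asyncOrName :: Parsec Token
asyncOrName =
  (Async <$ softKeyword (symbol lx) "async")
    <|> (Name <$> identifier (names lx))

-- Hypothetical runner that renders errors as Strings.
runToken :: String -> Either String Token
runToken input = case parse (fully lexer asyncOrName) input of
  Success t -> Right t
  Failure err -> Left err
```

Reversing the two alternatives would let identifier claim the keyword first, which is the ordering problem the note on parsing priority warns about.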
Name Tokens
The Names interface handles the parsing of identifiers and operators.
This class defines a uniform interface for defining parsers for user-defined names (identifiers and operators), independent of how whitespace should be handled after the name.
The parsing of names is mostly concerned with finding the longest valid name that is not a reserved name, such as a hard keyword or a special operator.
identifier :: Names -> Parsec String Source #
Parse an identifier based on the given NameDesc predicates identifierStart and identifierLetter.
The NameDesc is provided by mkNames.
Capable of handling unicode characters if the configuration permits. If hard keywords are specified by the configuration, this parser is not permitted to parse them.
identifier' :: Names -> CharPredicate -> Parsec String Source #
Parse an identifier whose start satisfies the given predicate, and whose subsequent letters satisfy identifierLetter in the given NameDesc.
The NameDesc is provided by mkNames.
Behaves as identifier, then ensures the first character matches the given predicate. Thus, identifier' can only refine the output of identifier; if identifier fails due to the first character, then so will identifier', even if this character passes the supplied predicate.
Capable of handling unicode characters if the configuration permits. If hard keywords are specified by the configuration, this parser is not permitted to parse them.
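The refinement behaviour of identifier' might be used to separate capitalised names from other identifiers, as in this sketch. It assumes the same undocumented details as any custom description: plain and plainName from Text.Gigaparsec.Token.Descriptions, the record fields nameDesc, identifierStart and identifierLetter, and CharPredicate being Maybe (Char -> Bool). The names constructorName and runConstructor are hypothetical.

```haskell
import Data.Char (isAlpha, isAlphaNum, isUpper)
import Text.Gigaparsec (Parsec, Result (..), parse)
import Text.Gigaparsec.Token.Descriptions
  (LexicalDesc (..), NameDesc (..), plain, plainName)
import Text.Gigaparsec.Token.Lexer
  (Lexer, fully, identifier', lexeme, mkLexer, names)

-- Assumption: these record fields exist and take 'Maybe (Char -> Bool)'.
lexer :: Lexer
lexer = mkLexer plain
  { nameDesc = plainName
      { identifierStart = Just isAlpha
      , identifierLetter = Just isAlphaNum
      }
  }

-- Only identifiers that are valid under the description AND start with an
-- upper-case letter are accepted: the predicate narrows 'identifier', it
-- cannot widen it.
constructorName :: Parsec String
constructorName = identifier' (names (lexeme lexer)) (Just isUpper)

-- Hypothetical runner that renders errors as Strings.
runConstructor :: String -> Either String String
runConstructor input = case parse (fully lexer constructorName) input of
  Success name -> Right name
  Failure err -> Left err
```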
userDefinedOperator :: Names -> Parsec String Source #
Parse a user-defined operator based on the given SymbolDesc predicates operatorStart and operatorLetter.
The SymbolDesc is provided by mkNames.
Capable of handling unicode characters if the configuration permits. If hard operators are specified by the configuration, this parser is not permitted to parse them.
userDefinedOperator' :: Names -> CharPredicate -> Parsec String Source #
Parse a user-defined operator whose first character satisfies the given predicate, and whose subsequent characters satisfy operatorLetter in the given SymbolDesc.
The SymbolDesc is provided by mkNames.
Behaves as userDefinedOperator, then ensures the first character matches the given predicate. Thus, userDefinedOperator' can only refine the output of userDefinedOperator; if userDefinedOperator fails due to the first character, then so will userDefinedOperator', even if this character passes the supplied predicate.
Capable of handling unicode characters if the configuration permits. If hard operators are specified by the configuration, this parser is not permitted to parse them.
Numeric Tokens
These types and combinators parse numeric literals, such as integers and reals.
type CanHoldSigned = CanHoldSigned Source #
type CanHoldUnsigned = CanHoldUnsigned Source #
Integer Parsers
IntegerParsers handles integer parsing (signed and unsigned).
This is mainly used by the combinators integer and natural.
data IntegerParsers (canHold :: Bits -> Type -> Constraint) Source #
A uniform interface for defining parsers for integer literals, independent of how whitespace should be handled after the literal or whether the literal should allow for negative numbers.
integer :: Lexeme -> IntegerParsers CanHoldSigned Source #
This is a collection of parsers concerned with handling signed integer literals.
Signed integer literals are an extension of unsigned integer literals which may be prefixed by a sign.
natural :: Lexeme -> IntegerParsers CanHoldUnsigned Source #
A collection of parsers concerned with handling unsigned (non-negative) integer literals.
Fixed-Base Parsers
decimal :: IntegerParsers canHold -> Parsec Integer Source #
Parse a single integer literal in decimal form (base 10).
hexadecimal :: IntegerParsers canHold -> Parsec Integer Source #
Parse a single integer literal in hexadecimal form (base 16).
octal :: IntegerParsers canHold -> Parsec Integer Source #
Parse a single integer literal in octal form (base 8).
binary :: IntegerParsers canHold -> Parsec Integer Source #
Parse a single integer literal in binary form (base 2).
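The fixed-base parsers are all obtained by applying a field to the result of integer or natural, as sketched below. The sketch assumes the default description plain from Text.Gigaparsec.Token.Descriptions and parse and Result from Text.Gigaparsec; which prefixes each base accepts is decided by the numeric description and is not shown here. The names ending in Lit and runLit are hypothetical.

```haskell
import Text.Gigaparsec (Parsec, Result (..), parse)
import Text.Gigaparsec.Token.Descriptions (plain)
import Text.Gigaparsec.Token.Lexer
  (Lexeme, Lexer, binary, decimal, fully, hexadecimal, integer, lexeme,
   mkLexer, octal)

-- Assumption: 'plain' is a default LexicalDesc.
lexer :: Lexer
lexer = mkLexer plain

lx :: Lexeme
lx = lexeme lexer

-- Signed literals in each base; swapping 'integer' for 'natural' gives the
-- unsigned equivalents with the same field names.
decimalLit, hexLit, octalLit, binaryLit :: Parsec Integer
decimalLit = decimal (integer lx)
hexLit = hexadecimal (integer lx)
octalLit = octal (integer lx)
binaryLit = binary (integer lx)

-- Hypothetical runner that renders errors as Strings.
runLit :: Parsec Integer -> String -> Either String Integer
runLit p input = case parse (fully lexer p) input of
  Success n -> Right n
  Failure err -> Left err
```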
Fixed-Width Numeric Tokens
These combinators tokenize numbers that must be within specific bit-widths.
The possible bit-widths are provided by Bits.
Decimal Tokens
decimal8 :: forall a (canHold :: Bits -> Type -> Constraint). canHold 'B8 a => IntegerParsers canHold -> Parsec a Source #
This parser behaves the same as decimal except it ensures that the resulting value is a valid 8-bit number.
The resulting number will be converted to the given type a, which must be able to losslessly store the parsed value; this is enforced by the canHold constraint on the type.
This accounts for unsignedness when necessary.
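The bounds implied by a bit-width can be modelled with plain arithmetic. The following is only an illustration of the ranges involved, not the library's implementation; fitsUnsigned and fitsSigned are hypothetical helpers.

```haskell
-- | Hypothetical helper: does the value fit in an unsigned number of the
-- given bit-width? The valid range is 0 to 2^bits - 1.
fitsUnsigned :: Int -> Integer -> Bool
fitsUnsigned bits n = 0 <= n && n < 2 ^ bits

-- | Hypothetical helper: does the value fit in a signed (two's complement)
-- number of the given bit-width? The valid range is -2^(bits-1) to
-- 2^(bits-1) - 1.
fitsSigned :: Int -> Integer -> Bool
fitsSigned bits n = negate limit <= n && n < limit
  where
    limit = 2 ^ (bits - 1)
```

For 8 bits this gives 0 to 255 for unsigned literals and -128 to 127 for signed ones, which is the sense in which the fixed-width parsers account for unsignedness.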
decimal16 :: forall a (canHold :: Bits -> Type -> Constraint). canHold 'B16 a => IntegerParsers canHold -> Parsec a Source #
This parser behaves the same as decimal except it ensures that the resulting value is a valid 16-bit number.
The resulting number will be converted to the given type a, which must be able to losslessly store the parsed value; this is enforced by the canHold constraint on the type.
This accounts for unsignedness when necessary.
decimal32 :: forall a (canHold :: Bits -> Type -> Constraint). canHold 'B32 a => IntegerParsers canHold -> Parsec a Source #
This parser behaves the same as decimal except it ensures that the resulting value is a valid 32-bit number.
The resulting number will be converted to the given type a, which must be able to losslessly store the parsed value; this is enforced by the canHold constraint on the type.
This accounts for unsignedness when necessary.
decimal64 :: forall a (canHold :: Bits -> Type -> Constraint). canHold 'B64 a => IntegerParsers canHold -> Parsec a Source #
This parser behaves the same as decimal except it ensures that the resulting value is a valid 64-bit number.
The resulting number will be converted to the given type a, which must be able to losslessly store the parsed value; this is enforced by the canHold constraint on the type.
This accounts for unsignedness when necessary.
Hexadecimal Tokens
hexadecimal8 :: forall a (canHold :: Bits -> Type -> Constraint). canHold 'B8 a => IntegerParsers canHold -> Parsec a Source #
This parser behaves the same as hexadecimal except it ensures that the resulting value is a valid 8-bit number.
The resulting number will be converted to the given type a, which must be able to losslessly store the parsed value; this is enforced by the canHold constraint on the type.
This accounts for unsignedness when necessary.
hexadecimal16 :: forall a (canHold :: Bits -> Type -> Constraint). canHold 'B16 a => IntegerParsers canHold -> Parsec a Source #
This parser behaves the same as hexadecimal except it ensures that the resulting value is a valid 16-bit number.
The resulting number will be converted to the given type a, which must be able to losslessly store the parsed value; this is enforced by the canHold constraint on the type.
This accounts for unsignedness when necessary.
hexadecimal32 :: forall a (canHold :: Bits -> Type -> Constraint). canHold 'B32 a => IntegerParsers canHold -> Parsec a Source #
This parser behaves the same as hexadecimal except it ensures that the resulting value is a valid 32-bit number.
The resulting number will be converted to the given type a, which must be able to losslessly store the parsed value; this is enforced by the canHold constraint on the type.
This accounts for unsignedness when necessary.
hexadecimal64 :: forall a (canHold :: Bits -> Type -> Constraint). canHold 'B64 a => IntegerParsers canHold -> Parsec a Source #
This parser behaves the same as hexadecimal except it ensures that the resulting value is a valid 64-bit number.
The resulting number will be converted to the given type a, which must be able to losslessly store the parsed value; this is enforced by the canHold constraint on the type.
This accounts for unsignedness when necessary.
Octal Tokens
octal8 :: forall a (canHold :: Bits -> Type -> Constraint). canHold 'B8 a => IntegerParsers canHold -> Parsec a Source #
This parser behaves the same as octal except it ensures that the resulting value is a valid 8-bit number.
The resulting number will be converted to the given type a, which must be able to losslessly store the parsed value; this is enforced by the canHold constraint on the type.
This accounts for unsignedness when necessary.
octal16 :: forall a (canHold :: Bits -> Type -> Constraint). canHold 'B16 a => IntegerParsers canHold -> Parsec a Source #
This parser behaves the same as octal except it ensures that the resulting value is a valid 16-bit number.
The resulting number will be converted to the given type a, which must be able to losslessly store the parsed value; this is enforced by the canHold constraint on the type.
This accounts for unsignedness when necessary.
octal32 :: forall a (canHold :: Bits -> Type -> Constraint). canHold 'B32 a => IntegerParsers canHold -> Parsec a Source #
This parser behaves the same as octal except it ensures that the resulting value is a valid 32-bit number.
The resulting number will be converted to the given type a, which must be able to losslessly store the parsed value; this is enforced by the canHold constraint on the type.
This accounts for unsignedness when necessary.
octal64 :: forall a (canHold :: Bits -> Type -> Constraint). canHold 'B64 a => IntegerParsers canHold -> Parsec a Source #
This parser behaves the same as octal except it ensures that the resulting value is a valid 64-bit number.
The resulting number will be converted to the given type a, which must be able to losslessly store the parsed value; this is enforced by the canHold constraint on the type.
This accounts for unsignedness when necessary.
Binary Tokens
binary8 :: forall a (canHold :: Bits -> Type -> Constraint). canHold 'B8 a => IntegerParsers canHold -> Parsec a Source #
This parser behaves the same as binary except it ensures that the resulting value is a valid 8-bit number.
The resulting number will be converted to the given type a, which must be able to losslessly store the parsed value; this is enforced by the canHold constraint on the type.
This accounts for unsignedness when necessary.
binary16 :: forall a (canHold :: Bits -> Type -> Constraint). canHold 'B16 a => IntegerParsers canHold -> Parsec a Source #
This parser behaves the same as binary except it ensures that the resulting value is a valid 16-bit number.
The resulting number will be converted to the given type a, which must be able to losslessly store the parsed value; this is enforced by the canHold constraint on the type.
This accounts for unsignedness when necessary.
binary32 :: forall a (canHold :: Bits -> Type -> Constraint). canHold 'B32 a => IntegerParsers canHold -> Parsec a Source #
This parser behaves the same as binary except it ensures that the resulting value is a valid 32-bit number.
The resulting number will be converted to the given type a, which must be able to losslessly store the parsed value; this is enforced by the canHold constraint on the type.
This accounts for unsignedness when necessary.
binary64 :: forall a (canHold :: Bits -> Type -> Constraint). canHold 'B64 a => IntegerParsers canHold -> Parsec a Source #
This parser behaves the same as binary except it ensures that the resulting value is a valid 64-bit number.
The resulting number will be converted to the given type a, which must be able to losslessly store the parsed value; this is enforced by the canHold constraint on the type.
This accounts for unsignedness when necessary.
Textual Tokens
The TextParsers interface handles the parsing of string and character literals.
data TextParsers t Source #
This type defines a uniform interface for defining parsers for textual literals, independent of how whitespace should be handled after the literal.
The type of these literals is determined by the parameter t.
ascii :: TextParsers t -> Parsec t Source #
Parses a single t-literal, which may contain any graphic ASCII character.
These are characters with ordinals in range 0 to 127 inclusive.
It may also contain escape sequences, but only those which result in ASCII characters.
unicode :: TextParsers t -> Parsec t Source #
Parses a single t-literal, which may contain any unicode graphic character as defined by up to two UTF-16 codepoints.
It may also contain escape sequences.
latin1 :: TextParsers t -> Parsec t Source #
Parses a single t-literal, which may contain any graphic extended ASCII character.
These are characters with ordinals in range 0 to 255 inclusive.
It may also contain escape sequences, but only those which result in extended ASCII characters.
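A TextParsers value is obtained from a Lexeme (for instance via stringLiteral or charLiteral), and one of the three fields above then selects the permitted character set. The sketch below assumes the default description plain from Text.Gigaparsec.Token.Descriptions and parse and Result from Text.Gigaparsec; the literal delimiters and escape sequences come from that description and are not specified here. The names asciiString, unicodeChar and runString are hypothetical.

```haskell
import Text.Gigaparsec (Parsec, Result (..), parse)
import Text.Gigaparsec.Token.Descriptions (plain)
import Text.Gigaparsec.Token.Lexer
  (Lexeme, Lexer, ascii, charLiteral, fully, lexeme, mkLexer,
   stringLiteral, unicode)

-- Assumption: 'plain' is a default LexicalDesc.
lexer :: Lexer
lexer = mkLexer plain

lx :: Lexeme
lx = lexeme lexer

-- A single-line string literal restricted to ASCII content.
asciiString :: Parsec String
asciiString = ascii (stringLiteral lx)

-- A character literal that may contain any unicode graphic character.
unicodeChar :: Parsec Char
unicodeChar = unicode (charLiteral lx)

-- Hypothetical runner that renders errors as Strings.
runString :: String -> Either String String
runString input = case parse (fully lexer asciiString) input of
  Success s -> Right s
  Failure err -> Left err
```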
String Parsers
A Lexer provides the following TextParsers for string literals.
stringLiteral :: Lexeme -> TextParsers String Source #
A collection of parsers concerned with handling single-line string literals.
String literals are generally described by the TextDesc fields:
rawStringLiteral :: Lexeme -> TextParsers String Source #
A collection of parsers concerned with handling single-line string literals, without handling any escape sequences: this includes literal-end characters and the escape prefix (often " and \ respectively).
String literals are generally described by the TextDesc fields:
multiStringLiteral :: Lexeme -> TextParsers String Source #
A collection of parsers concerned with handling multi-line string literals.
Multi-string literals are generally described by the TextDesc fields:
rawMultiStringLiteral :: Lexeme -> TextParsers String Source #
A collection of parsers concerned with handling multi-line string literals, without handling any escape sequences: this includes literal-end characters and the escape prefix (often " and \ respectively).
Multi-string literals are generally described by the TextDesc fields:
Character Parsers
A Lexer provides the following TextParsers for character literals.
charLiteral :: Lexeme -> TextParsers Char Source #
A collection of parsers concerned with handling character literals.
Character literals are generally described by the TextDesc fields:
Whitespace and Comments
Space and its fields are concerned with special treatment of whitespace itself.
Most of the time, the functionality herein will not be required, as lexeme and fully will consistently handle whitespace.
However, whitespace is significant in some languages, like Python and Haskell, in which case Space provides a way to control how whitespace is consumed.
This type is concerned with special treatment of whitespace.
For the vast majority of cases, the functionality within this object shouldn't be needed, as whitespace is consistently handled by lexeme and fully. However, for grammars where whitespace is significant (like indentation-sensitive languages), this object provides some more fine-grained control over how whitespace is consumed by the parsers within lexeme.
skipComments :: Space -> Parsec () Source #
Skips zero or more comments.
The implementation of this combinator does not vary with whiteSpaceIsContextDependent.
It will use the hide combinator so as not to appear as a valid alternative in an error message: adding a comment is often legal, but not a useful solution for how to make the input syntactically valid.
whiteSpace :: Space -> Parsec () Source #
Skips zero or more (insignificant) whitespace characters as well as comments.
The implementation of this parser depends on whether whiteSpaceIsContextDependent is true: when it is, this parser may change based on the use of the alter combinator.
This parser will always use the hide combinator so as not to appear as a valid alternative in an error message: it is almost always the case that whitespace can be added at any given point, but that doesn't make it a useful suggestion unless it is significant.
alter :: Space -> forall a. CharPredicate -> Parsec a -> Parsec a Source #
This combinator changes how lexemes parse whitespace for the duration of a given parser.
So long as whiteSpaceIsContextDependent is true, this combinator will be able to locally change the definition of whitespace during the given parser.
Examples
- In indentation-sensitive languages, the indentation sensitivity is often ignored within parentheses or braces. In these cases, parens (alter withNewLine p) would allow unrestricted newlines within parentheses.
initSpace :: Space -> Parsec () Source #
This parser initialises the whitespace used by the lexer when whiteSpaceIsContextDependent is true.
The whitespace is set to the implementation given by the lexical description. This parser must be used, by fully or otherwise, as the first thing the global parser does or an UnfilledRegisterException will occur.
See alter for how to change whitespace during a parse.