License	BSD-3-Clause
Maintainer	Jamie Willis, Gigaparsec Maintainers
Stability	experimental
Safe Haskell	Safe
Language	Haskell2010

Text.Gigaparsec.Token.Descriptions

Contents

Lexical Descriptions

Description

This module contains the descriptions of various lexical structures to configure the lexer.

Many languages share common lexical tokens, such as numeric and string literals. Writing lexers turning these strings into tokens is effectively boilerplate. A Description encodes how to lex one of these common tokens. Feeding a LexicalDesc to a Lexer provides many combinators for dealing with these tokens.

Usage

Rather than use the internal constructors, such as NameDesc, one should extend the 'plain' definitions with record field updates. For example,

myLexicalDesc = plain
  { nameDesc = myNameDesc
  , textDesc = myTextDesc
  }

will produce a description that replaces the default name and text descriptions with those given. See plainName, plainSymbol, plainNumeric, plainText and plainSpace for further examples.
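As a slightly fuller sketch of the pattern above, the fragment below builds both sub-descriptions before combining them. The particular choices (letters then alphanumerics for identifiers, triple-quoted multi-line strings) and the module name MyLexicalDesc are illustrative assumptions, not library defaults.

```haskell
module MyLexicalDesc (myLexicalDesc) where

import Data.Char (isAlpha, isAlphaNum)
import qualified Data.Set as Set
import Text.Gigaparsec.Token.Descriptions

-- Identifiers start with a letter and continue with letters or digits
-- (an illustrative choice).
myNameDesc :: NameDesc
myNameDesc = plainName
  { identifierStart  = Just isAlpha
  , identifierLetter = Just isAlphaNum
  }

-- Multi-line strings delimited by triple quotes (an illustrative choice).
myTextDesc :: TextDesc
myTextDesc = plainText
  { multiStringEnds = Set.fromList [("\"\"\"", "\"\"\"")]
  }

-- Every field not mentioned here keeps the value it has in plain.
myLexicalDesc :: LexicalDesc
myLexicalDesc = plain
  { nameDesc = myNameDesc
  , textDesc = myTextDesc
  }
```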

Since: 0.2.2.0

Synopsis

plain :: LexicalDesc
plainEscape :: EscapeDesc
plainName :: NameDesc
plainNumeric :: NumericDesc
plainSpace :: SpaceDesc
plainSymbol :: SymbolDesc
plainText :: TextDesc
data BreakCharDesc
- = NoBreakChar
- | BreakCharSupported {
  - breakChar :: !Char
  - allowedAfterNonDecimalPrefix :: !Bool
  }
type CharPredicate = Maybe (Char -> Bool)
data EscapeDesc = EscapeDesc {
- escBegin :: !Char
- literals :: !(Set Char)
- mapping :: !(Map String Char)
- decimalEscape :: !NumericEscape
- hexadecimalEscape :: !NumericEscape
- octalEscape :: !NumericEscape
- binaryEscape :: !NumericEscape
- emptyEscape :: !(Maybe Char)
- gapsSupported :: !Bool
}
data ExponentDesc
- = NoExponents
- | ExponentsSupported {
  - compulsory :: !Bool
  - chars :: !(Set Char)
  - base :: !Int
  - expSign :: !PlusSignPresence
  - expLeadingZerosAllowd :: !Bool
  }
data LexicalDesc = LexicalDesc {
- nameDesc :: !NameDesc
- symbolDesc :: !SymbolDesc
- numericDesc :: !NumericDesc
- textDesc :: !TextDesc
- spaceDesc :: !SpaceDesc
}
data NameDesc = NameDesc {
- identifierStart :: !CharPredicate
- identifierLetter :: !CharPredicate
- operatorStart :: !CharPredicate
- operatorLetter :: !CharPredicate
}
data NumberOfDigits
- = Unbounded
- | Exactly !(NonEmpty Word)
- | AtMost !Word
data NumericDesc = NumericDesc {
- literalBreakChar :: !BreakCharDesc
- leadingDotAllowed :: !Bool
- trailingDotAllowed :: !Bool
- leadingZerosAllowed :: !Bool
- positiveSign :: !PlusSignPresence
- integerNumbersCanBeHexadecimal :: !Bool
- integerNumbersCanBeOctal :: !Bool
- integerNumbersCanBeBinary :: !Bool
- realNumbersCanBeHexadecimal :: !Bool
- realNumbersCanBeOctal :: !Bool
- realNumbersCanBeBinary :: !Bool
- hexadecimalLeads :: !(Set Char)
- octalLeads :: !(Set Char)
- binaryLeads :: !(Set Char)
- decimalExponentDesc :: !ExponentDesc
- hexadecimalExponentDesc :: !ExponentDesc
- octalExponentDesc :: !ExponentDesc
- binaryExponentDesc :: !ExponentDesc
}
data NumericEscape
- = NumericIllegal
- | NumericSupported {
  - prefix :: !(Maybe Char)
  - numDigits :: !NumberOfDigits
  - maxValue :: !Char
  }
data PlusSignPresence
- = PlusRequired
- | PlusOptional
- | PlusIllegal
data SpaceDesc = SpaceDesc {
- lineCommentStart :: !String
- lineCommentAllowsEOF :: !Bool
- multiLineCommentStart :: !String
- multiLineCommentEnd :: !String
- multiLineNestedComments :: !Bool
- space :: !CharPredicate
- whitespaceIsContextDependent :: !Bool
}
data SymbolDesc = SymbolDesc {
- hardKeywords :: !(Set String)
- hardOperators :: !(Set String)
- caseSensitive :: !Bool
}
data TextDesc = TextDesc {
- escapeSequences :: !EscapeDesc
- characterLiteralEnd :: !Char
- stringEnds :: !(Set (String, String))
- multiStringEnds :: !(Set (String, String))
- graphicCharacter :: !CharPredicate
}

Lexical Descriptions

A lexer is configured by extending the default plain template, producing a LexicalDesc.

LexicalDesc
plain

Name Descriptions

A NameDesc configures the lexing of name-like tokens, such as variable and function names. To create a NameDesc, use plainName, and configure it to your liking with record updates.

NameDesc
- identifierStart
- identifierLetter
- operatorStart
- operatorLetter
plainName

Symbol Descriptions

A SymbolDesc configures the lexing of 'symbols' (textual literals), such as keywords and operators. To create a SymbolDesc, use plainSymbol and configure it to your liking with record updates.

SymbolDesc
- hardKeywords
- hardOperators
- caseSensitive
plainSymbol

Numeric Descriptions

A NumericDesc configures the lexing of numeric literals, such as integer and floating point literals. To create a NumericDesc, use plainNumeric and configure it to your liking with record updates. Also see ExponentDesc, BreakCharDesc, and PlusSignPresence, for further configuration options.

Exponent Descriptions

An ExponentDesc configures scientific exponent notation.

ExponentDesc
- NoExponents
- ExponentsSupported
  - compulsory
  - chars
  - base
  - expSign
  - expLeadingZerosAllowd

Break-Characters in Numeric Literals

Some languages allow a single numeric literal to be separated by a 'break' symbol.

BreakCharDesc
NoBreakChar
BreakCharSupported
- breakChar
- allowedAfterNonDecimalPrefix

Numeric Literal Prefix Configuration

PlusSignPresence
- PlusRequired
- PlusOptional
- PlusIllegal

Text Descriptions

A TextDesc configures the lexing of string and character literals, as well as escaped numeric literals. To create a TextDesc, use plainText and configure it to your liking with record updates. See EscapeDesc, NumericEscape and NumberOfDigits for further configuration of escape sequences and escaped numeric literals.

TextDesc
- escapeSequences
- characterLiteralEnd
- stringEnds
- multiStringEnds
- graphicCharacter
plainText

Escape Character Descriptions

Configuration of escape sequences, such as tabs \t and newlines \n, and escaped numbers, such as hexadecimals 0x... and binary 0b....

EscapeDesc
- escBegin
- literals
- mapping
- decimalEscape
- hexadecimalEscape
- octalEscape
- binaryEscape
- emptyEscape
- gapsSupported
plainEscape

Numeric Escape Sequences

Configuration of escaped numeric literals. For example, hexadecimals, 0x....

NumericEscape
- NumericIllegal
- NumericSupported
  - prefix
  - numDigits
  - maxValue
NumberOfDigits
- Unbounded
- Exactly
- AtMost

Whitespace and Comment Descriptions

A SpaceDesc configures the lexing of whitespace and comments. To create a SpaceDesc, use plainSpace and configure it to your liking with record updates.

plain :: LexicalDesc

This lexical description contains the template plain<...> descriptions defined in this module. See plainName, plainSymbol, plainNumeric, plainText and plainSpace for how this description configures the lexer.

plainEscape :: EscapeDesc

This is a blank escape description template, which should be extended to form a custom escape description.

In its default state, the only escape symbol in plainEscape is a backslash, "\\". To change this, one should use record field updates, for example:

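{-# LANGUAGE OverloadedLists #-} -- This lets us write [a,b] to get a Set containing a and b,
                                 -- and [(a,b),(c,d)] for a Map which sends a ↦ b and c ↦ d
myPlainEscape :: EscapeDesc
myPlainEscape = plainEscape
  { literals = ['\'', '"', '\\']            -- characters that escape to themselves
  , mapping = [("t", '\t'), ("r", '\r')]     -- escape bodies and the characters they denote
  , hexadecimalEscape = NumericSupported     -- illustrative values, not library defaults
      { prefix = Just 'x'
      , numDigits = Unbounded
      , maxValue = maxBound
      }
  }

myPlainEscape will then treat \', \" and \\ as escapes that stand for themselves, map \t and \r to a tab and a carriage return, and accept hexadecimal escapes such as \x41.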

plainName :: NameDesc

This is a blank name description template, which should be extended to form a custom name description.

In its default state, plainName makes no characters able to be part of an identifier or operator. To change this, one should use record field updates, for example:

myNameDesc :: NameDesc
myNameDesc = plainName
  { identifierStart = myIdentifierStartPredicate
  , identifierLetter = myIdentifierLetterPredicate
  }

myNameDesc will then lex identifiers according to the given predicates.

plainNumeric :: NumericDesc

This is a blank numeric description template, which should be extended to form a custom numeric description.

In its default state, plainNumeric allows for hexadecimal, octal, and binary numeric literals, with the standard prefixes. To change this, one should use record field updates.
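For instance, a numeric description might be customised as in the following sketch. The name myNumericDesc and the chosen settings are illustrative assumptions; only the field and constructor names come from this module.

```haskell
import Text.Gigaparsec.Token.Descriptions

myNumericDesc :: NumericDesc
myNumericDesc = plainNumeric
  { -- allow literals such as 1_000_000, but not 0x_FF
    literalBreakChar = BreakCharSupported
      { breakChar = '_'
      , allowedAfterNonDecimalPrefix = False
      }
    -- reject extraneous leading zeros, as in 007
  , leadingZerosAllowed = False
    -- generic integers may not be written in octal
  , integerNumbersCanBeOctal = False
    -- a leading + is not part of a numeric literal
  , positiveSign = PlusIllegal
  }
```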

plainSpace :: SpaceDesc

This is a blank whitespace description template, which should be extended to form the desired whitespace descriptions.

In its default state, plainSpace makes no comments possible, and the only whitespace characters are those defined by isSpace.
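As with the other templates, comments can be enabled with a record update. The sketch below configures Haskell-style comments; the name mySpaceDesc and the chosen delimiters are illustrative assumptions.

```haskell
import Text.Gigaparsec.Token.Descriptions

mySpaceDesc :: SpaceDesc
mySpaceDesc = plainSpace
  { lineCommentStart        = "--"   -- single-line comments start with --
  , lineCommentAllowsEOF    = True   -- a line comment may be ended by the end of input
  , multiLineCommentStart   = "{-"
  , multiLineCommentEnd     = "-}"
  , multiLineNestedComments = True   -- {- a {- nested -} comment -} is allowed
  }
```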

plainSymbol :: SymbolDesc

This is a blank symbol description template, which should be extended to form a custom symbol description.

In its default state, plainSymbol has no keywords or reserved/hard operators. To change this, one should use record field updates, for example:

{-# LANGUAGE OverloadedLists #-} -- This lets us write [a,b] to get a Set containing a and b
                                 -- If you don't want to use this, just use fromList [a,b]
mySymbolDesc :: SymbolDesc
mySymbolDesc = plainSymbol
  { hardKeywords = ["data", "where"]
  , hardOperators = ["->"]
  , caseSensitive = True
  }

mySymbolDesc will then treat data and where as keywords, and -> as a reserved operator.

plainText :: TextDesc

This is a blank text description template, which should be extended to form a custom text description.

In its default state, plainText parses characters as symbols between ' and ', and strings between " and ". To change this, one should use record field updates, for example:

{-# LANGUAGE OverloadedLists #-} -- This lets us write [a,b] to get a Set containing a and b
                                 -- If you don't want to use this, just use fromList [a,b]
myPlainText :: TextDesc
myPlainText = plainText
  { characterLiteralEnd = a
  , stringEnds = [(b, c)]
  }

myPlainText will then parse characters as a single character between a and a, and a string as characters between b and c.

data BreakCharDesc

Prescribes whether or not numeric literals can be broken up by a specific symbol.

For example, can one write 300.2_3?

Constructors

NoBreakChar	Literals cannot be broken.
BreakCharSupported	Literals can be broken.
Fields

breakChar :: !Char
the character allowed to break a literal (often _).
allowedAfterNonDecimalPrefix :: !Bool
can non-decimals be broken; e.g. can one write 0x_300?

type CharPredicate = Maybe (Char -> Bool)

An optional predicate on characters: if pred :: CharPredicate is Just p and p x is True, then the lexer should accept the character x.

Examples

A predicate that only accepts alphabetic or numeric characters:

   isAlphaNumPred = Just isAlphaNum

A predicate that only accepts ASCII capital letters:

   isCapital = Just isAsciiUpper
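To make the reading of such predicates concrete, here is a minimal sketch that needs only base. The synonym is repeated locally so the example stands alone, and accepts is a hypothetical helper, not part of the library; it assumes that Nothing accepts no characters at all.

```haskell
import Data.Char (isAlphaNum, isAsciiUpper)

-- Local copy of the synonym documented above.
type CharPredicate = Maybe (Char -> Bool)

-- Hypothetical helper: does the predicate accept the character?
-- Assumption: Nothing accepts no characters.
accepts :: CharPredicate -> Char -> Bool
accepts Nothing  _ = False
accepts (Just p) c = p c

isAlphaNumPred :: CharPredicate
isAlphaNumPred = Just isAlphaNum

isCapital :: CharPredicate
isCapital = Just isAsciiUpper

main :: IO ()
main = do
  print (accepts isAlphaNumPred '7')  -- True
  print (accepts isCapital 'q')       -- False
  print (accepts Nothing 'q')         -- False
```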

data EscapeDesc

Defines the escape characters, and their meaning.

This includes character escapes (e.g. tabs, carriage returns), and numeric escapes, such as binary (usually "0b") and hexadecimal, "0x".

Constructors

EscapeDesc

Fields

escBegin :: !Char
the character that begins an escape sequence: this is usually \.
literals :: !(Set Char)
the characters that can be directly escaped, but still represent themselves, for instance '"', or '\'.
mapping :: !(Map String Char)
the possible escape sequences that map to a character other than themselves and the (full Unicode) character they map to, for instance "n" -> 0xa.
decimalEscape :: !NumericEscape
if allowed, the description of how numeric escape sequences work for base 10.
hexadecimalEscape :: !NumericEscape
if allowed, the description of how numeric escape sequences work for base 16
octalEscape :: !NumericEscape
if allowed, the description of how numeric escape sequences work for base 8
binaryEscape :: !NumericEscape
if allowed, the description of how numeric escape sequences work for base 2
emptyEscape :: !(Maybe Char)
if one should exist, the character which has no effect on the string but can be used to disambiguate other escape sequences: in Haskell this would be &
gapsSupported :: !Bool
specifies whether or not string gaps are supported: this is where whitespace can be injected between two escBegin characters and this will all be ignored in the final string, such that "hello \ \world" is "hello world".
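The difference between literals and mapping can be sketched as follows. SimpleEscapes is a cut-down stand-in for EscapeDesc holding only those two fields, and resolveEscape is a hypothetical helper; neither is part of the library, and the real lexer may resolve escapes differently.

```haskell
import qualified Data.Map as Map
import qualified Data.Set as Set

-- Cut-down stand-in for EscapeDesc: only the two fields used here.
data SimpleEscapes = SimpleEscapes
  { simpleLiterals :: Set.Set Char
  , simpleMapping  :: Map.Map String Char
  }

-- Hypothetical helper: resolve the text that follows the escape-begin character.
-- A literal stands for itself; anything else is looked up in the mapping.
resolveEscape :: SimpleEscapes -> String -> Maybe Char
resolveEscape desc body = case body of
  [c] | c `Set.member` simpleLiterals desc -> Just c
  _ -> Map.lookup body (simpleMapping desc)

haskellLike :: SimpleEscapes
haskellLike = SimpleEscapes
  { simpleLiterals = Set.fromList "\"'\\"
  , simpleMapping  = Map.fromList [("n", '\n'), ("t", '\t'), ("r", '\r')]
  }

main :: IO ()
main = do
  print (resolveEscape haskellLike "n")   -- Just '\n'
  print (resolveEscape haskellLike "\"")  -- Just '"'
  print (resolveEscape haskellLike "q")   -- Nothing
```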

data ExponentDesc

Describes how scientific exponent notation can be used within real literals.

A common notation would be 1.6e3 for 1.6 × 10³, which the following ExponentDesc describes:

{-# LANGUAGE OverloadedLists #-} -- Lets us write [a] to generate a singleton Set containing a.
usualNotation :: ExponentDesc
usualNotation = ExponentsSupported
  { compulsory = False
  , chars = ['e']  -- The letter 'e' separates the significand from the exponent
  , base  = 10   -- The base of the exponent is 10, so that 2.3e5 means 2.3 × 10⁵
  , expSign = PlusOptional -- A positive exponent does not need a plus sign, but can have one.
  , expLeadingZerosAllowd = True -- We allow leading zeros on exponents; so 1.2e005 is valid.
  }

Constructors

NoExponents	The language does not allow exponent notation.
ExponentsSupported	The language does allow exponent notation, according to the following fields:
Fields

compulsory :: !Bool
Is exponent notation required for real literals?
chars :: !(Set Char)
The characters that separate the significand from the exponent.
base :: !Int
The base of the exponent; this is usually base ten.
expSign :: !PlusSignPresence
Is a plus (`+`) sign required for positive exponents?
expLeadingZerosAllowd :: !Bool
Can the exponent contain leading zeros; for example is `3.2e005` valid?
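The role of base can be illustrated with a small piece of arithmetic. applyExponent is a hypothetical helper, not part of the library: it computes the value denoted by a significand and an exponent under a given exponent base, so base 10 covers 1.6e3, while base 2 corresponds to C-style hexadecimal floats, where 0x1.8p3 means 1.5 × 2³.

```haskell
-- Hypothetical helper: significand × base^exponent.
applyExponent :: Int -> Double -> Integer -> Double
applyExponent expBase significand expo =
  significand * fromIntegral expBase ^^ expo

main :: IO ()
main = do
  print (applyExponent 10 1.6 3)  -- 1600.0, i.e. 1.6e3 with base 10
  print (applyExponent 2 1.5 3)   -- 12.0, i.e. 1.5 × 2³ with base 2
```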

data LexicalDesc

This type describes the aggregation of a bunch of different sub-configurations for lexing a specific language.

See the plain smart constructor to define a LexicalDesc.

Constructors

LexicalDesc
Fields

nameDesc :: !NameDesc
the description of name-like lexemes
symbolDesc :: !SymbolDesc
the description of specific symbolic lexemes
numericDesc :: !NumericDesc
the description of numeric literals
textDesc :: !TextDesc
the description of text literals
spaceDesc :: !SpaceDesc
the description of whitespace

data NameDesc

This type describes the lexical structure of name-like tokens.

In particular, this defines which characters will constitute identifiers and operators.

See the plainName smart constructor for how to implement a custom name description.

Constructors

NameDesc
Fields

identifierStart :: !CharPredicate
the characters that start an identifier
identifierLetter :: !CharPredicate
the characters that continue an identifier
operatorStart :: !CharPredicate
the characters that start a user-defined operator
operatorLetter :: !CharPredicate
the characters that continue a user-defined operator

data NumberOfDigits

Describes how many digits a numeric escape sequence may contain.

Constructors

Unbounded	there is no limit on the number of digits that may appear in this sequence.
Exactly !(NonEmpty Word)	the number of digits in the literal must be one of the given values.
AtMost !Word	the number of digits in the literal must not exceed the given value (inclusive).
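The three constructors can be read as a check on the number of digits seen, as in the sketch below. The data type is repeated locally so the example needs only base, and digitCountAllowed is a hypothetical helper, not part of the library; it assumes that the requirement of at least one digit is enforced elsewhere.

```haskell
import Data.List.NonEmpty (NonEmpty (..))

-- Local copy of the type documented above.
data NumberOfDigits
  = Unbounded
  | Exactly !(NonEmpty Word)
  | AtMost !Word

-- Hypothetical helper: is a literal with this many digits acceptable?
digitCountAllowed :: NumberOfDigits -> Word -> Bool
digitCountAllowed Unbounded    _ = True
digitCountAllowed (Exactly ns) n = n `elem` ns
digitCountAllowed (AtMost m)   n = n <= m

main :: IO ()
main = do
  print (digitCountAllowed (Exactly (2 :| [4])) 4)  -- True
  print (digitCountAllowed (Exactly (2 :| [4])) 3)  -- False
  print (digitCountAllowed (AtMost 6) 7)            -- False
  print (digitCountAllowed Unbounded 40)            -- True
```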

data NumericDesc

This type describes how numeric literals (integers, decimals, hexadecimals, etc...), should be lexically processed.

Constructors

NumericDesc

Fields

literalBreakChar :: !BreakCharDesc
can breaks be found within numeric literals? (see BreakCharDesc)
leadingDotAllowed :: !Bool
can a real number omit a leading 0 before the point?
trailingDotAllowed :: !Bool
can a real number omit a trailing 0 after the point?
leadingZerosAllowed :: !Bool
are extraneous zeros allowed at the start of decimal numbers?
positiveSign :: !PlusSignPresence
describes if positive (+) signs are allowed, compulsory, or illegal.
integerNumbersCanBeHexadecimal :: !Bool
can generic "integer numbers" be hexadecimal?
integerNumbersCanBeOctal :: !Bool
can generic "integer numbers" be octal?
integerNumbersCanBeBinary :: !Bool
can generic "integer numbers" be binary?
realNumbersCanBeHexadecimal :: !Bool
can generic "real numbers" be hexadecimal?
realNumbersCanBeOctal :: !Bool
can generic "real numbers" be octal?
realNumbersCanBeBinary :: !Bool
can generic "real numbers" be binary?
hexadecimalLeads :: !(Set Char)
the characters that begin a hexadecimal literal following a 0 (may be empty).
octalLeads :: !(Set Char)
the characters that begin an octal literal following a 0 (may be empty).
binaryLeads :: !(Set Char)
the characters that begin a binary literal following a 0 (may be empty).
decimalExponentDesc :: !ExponentDesc
describes how scientific exponent notation should work for decimal literals.
hexadecimalExponentDesc :: !ExponentDesc
describes how scientific exponent notation should work for hexadecimal literals.
octalExponentDesc :: !ExponentDesc
describes how scientific exponent notation should work for octal literals.
binaryExponentDesc :: !ExponentDesc
describes how scientific exponent notation should work for binary literals.

data NumericEscape

Describes how numeric escape sequences should work for a given base.

Constructors

NumericIllegal	Numeric literals are disallowed for this specific base.
NumericSupported	Numeric literals are supported for this specific base.
Fields

prefix :: !(Maybe Char)
the character, if any, that is required to start the literal (like x for hexadecimal escapes in some languages).
numDigits :: !NumberOfDigits
the number of digits required for this literal: this may be unbounded, an exact number, or up to a specific number.
maxValue :: !Char
the largest character value that can be expressed by this numeric escape.

data PlusSignPresence

Whether or not a plus sign (+) can prefix a numeric literal.

Constructors

PlusRequired	(`+`) must always precede a positive numeric literal
PlusOptional	(`+`) may precede a positive numeric literal, but is not necessary
PlusIllegal	(`+`) cannot precede a numeric literal as a prefix (this is separate from allowing an infix binary `+` operator).
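The three options can be summarised as a small decision table, sketched below using only base. The data type is repeated locally, while Sign and signAllowed are hypothetical names introduced for illustration; the sketch also assumes that a minus sign is always acceptable.

```haskell
-- Local copy of the type documented above.
data PlusSignPresence = PlusRequired | PlusOptional | PlusIllegal

-- Hypothetical type: the sign, if any, written in front of a literal.
data Sign = Plus | Minus | NoSign

-- Hypothetical helper: is this sign prefix acceptable under the given setting?
signAllowed :: PlusSignPresence -> Sign -> Bool
signAllowed _            Minus  = True   -- assumption: minus is always fine
signAllowed PlusRequired Plus   = True
signAllowed PlusRequired NoSign = False  -- positive literals must carry a +
signAllowed PlusOptional _      = True
signAllowed PlusIllegal  Plus   = False
signAllowed PlusIllegal  NoSign = True

main :: IO ()
main = do
  print (signAllowed PlusRequired NoSign)  -- False
  print (signAllowed PlusOptional Plus)    -- True
  print (signAllowed PlusIllegal Plus)     -- False
```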

data SpaceDesc

This type describes how whitespace and comments should be handled lexically.

Constructors

SpaceDesc

Fields

lineCommentStart :: !String
how to start single-line comments (empty for no single-line comments).
lineCommentAllowsEOF :: !Bool
can a single-line comment be terminated by the end-of-file (True), or must it end with a newline (False)?
multiLineCommentStart :: !String
how to start multi-line comments (empty for no multi-line comments).
multiLineCommentEnd :: !String
how to end multi-line comments (empty for no multi-line comments).
multiLineNestedComments :: !Bool
True when multi-line comments can be nested, False otherwise.
space :: !CharPredicate
the characters to be treated as whitespace
whitespaceIsContextDependent :: !Bool
does the context change the definition of whitespace (True), or not (False)? (e.g. in Python, newlines are valid whitespace within parentheses, but are significant outside of them)

data SymbolDesc

This type describes how symbols (textual literals in a BNF) should be processed lexically, including keywords and operators.

This includes keywords and (hard) operators that are reserved by the language. For example, in Haskell, "data" is a keyword, and "->" is a hard operator.

See the plainSymbol smart constructor for how to implement a custom symbol description.

Constructors

SymbolDesc
Fields

hardKeywords :: !(Set String)
what keywords are always treated as keywords within the language.
hardOperators :: !(Set String)
what operators are always treated as reserved operators within the language.
caseSensitive :: !Bool
`True` if the keywords are case sensitive, `False` if not (so that e.g. `IF = if`).
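The effect of caseSensitive on keyword recognition can be sketched as follows. SimpleSymbols is a cut-down stand-in for SymbolDesc and isHardKeyword is a hypothetical helper; neither is part of the library, and the real lexer may normalise case differently.

```haskell
import Data.Char (toLower)
import qualified Data.Set as Set

-- Cut-down stand-in for SymbolDesc: keywords and case sensitivity only.
data SimpleSymbols = SimpleSymbols
  { simpleKeywords  :: Set.Set String
  , simpleSensitive :: Bool
  }

-- Hypothetical helper: is the word one of the hard keywords?
-- When case is ignored, both sides are compared in lower case.
isHardKeyword :: SimpleSymbols -> String -> Bool
isHardKeyword desc word
  | simpleSensitive desc = word `Set.member` simpleKeywords desc
  | otherwise =
      map toLower word `Set.member` Set.map (map toLower) (simpleKeywords desc)

sensitive, insensitive :: SimpleSymbols
sensitive   = SimpleSymbols (Set.fromList ["data", "where"]) True
insensitive = sensitive { simpleSensitive = False }

main :: IO ()
main = do
  print (isHardKeyword sensitive "data")     -- True
  print (isHardKeyword sensitive "DATA")     -- False
  print (isHardKeyword insensitive "WHERE")  -- True
```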

data TextDesc

This type describes how to parse string and character literals.

Constructors

TextDesc

Fields

escapeSequences :: !EscapeDesc
the description of escape sequences in literals.
characterLiteralEnd :: !Char
the character that starts and ends a character literal.
stringEnds :: !(Set (String, String))
the sequences that may begin and end a string literal.
multiStringEnds :: !(Set (String, String))
the sequences that may begin and end a multi-line string literal.
graphicCharacter :: !CharPredicate
the characters that can be written verbatim into a character or string literal.