Context-Sensitive Parsing (parsley.state)

Normally, context-sensitive parsing can be done with monadic flatMap. However, in parsley, flatMap is a very expensive operation and is best avoided. Instead, parsley supports a form of arbitrary state threading called references. These can be used to perform context-sensitive parsing in a more performant way at a cost to expressive power.

The Scaladoc for this page can be found at parsley.state.

References

A reference is a single piece of mutable state threaded through a parser. They can be made in three different ways:

  1. Importing parsley.state.RefMaker allows the use of the makeRef method on any type A:

    def makeRef[B](body: Ref[A] => Parsley[B]): Parsley[B]

    This will construct a new reference filled with the receiver value, and this can be used within the scope of the given continuation. Everytime this is executed, it will be uniquely scoped (when the parser is recursive).

  2. Similarly, importing parsley.state.StateCombinators allows the use of the fillRef combinator on Parsley[A], which has the same signature as makeRef.

    It behaves similarly to makeRef, but sources its result from a parser. Loosely: p.fillRef(body) is the same as p.flatMap(_.makeRef(body)), but is much more efficient.

  3. The Ref.make[A]: Ref[A] method allows for the creation of a reference outside of the parsing context. This is not recommended, as it will not guarantee uniqueness of scoping. The state will not be initialised.

    Using the same globally constructed reference in two different parsers (in terms of calling .parse on them) is undefined behaviour and should not be done.

References themselves have two core operations, with several more built on top:

class Ref[A] {
    def get: Parsley[A]
    def set(p: Parsley[A]): Parsley[Unit]
}

The get method will read the reference at parse-time and return the value contained within. The set method takes the result of a parser and stores that into the reference. As examples:

import parsley.character.item
import parsley.state._

List.empty[Char].makeRef { r1 =>
    item.fillRef { r2 =>
        r1.set(r2.get <::> (r2.get <::> r1.get))
    } *> r1.get
}.parse("a")
// res0: parsley.Result[String, List[Char]] = Success(List(a, a))

The above example fills a reference r1, with the empty list, then fills a second reference r2 with the result of parsing any character. The value in r1 is updated with the list obtained by prepending the parsed character onto the empty list stored in r1 twice. After r2 goes out of scope, the current value of r1 is returned.

Persistence

In the above example, the reference r2 is only used for get, but is used multiple times. Normally, using the value of a parser more than once requires a flatMap; this is not the case in parsley when using references. To make this application more ergonomic, parsley.state.StateCombinators also exposes the persist combinator:

List.empty[Char].makeRef { r1 =>
    item.persist { c =>
        r1.set(c <::> (c <::> r1.get))
    } *> r1.get
}.parse("a")
// res1: parsley.Result[String, List[Char]] = Success(List(a, a))

Persist can be thought of as a composition of fillRef and get, or alternatively as a composition of flatMap and pure.

Using Persistence

One use of persist is to otherwise reduce the scope of an expensive flatMap: the flatMap combinator is expensive because it has to process the body of the function in full everytime it is executed, if the size of the body is reduced, that will keep the parser faster. Currently, there is no primitive functionality for parsing with respect to values inside references, like so:

def string(r: Ref[String]): Parsley[String] = r.get.flatMap(parsley.character.string(_))

The scope of the flatMap in that combinator is small, however, so is likely much more efficient than one that didn't use persistence. With this, the context-sensitive parsing of XML tags can be done:

import parsley.Parsley.{atomic, notFollowedBy}
import parsley.character.{stringOfSome, letter}
import parsley.combinator.optional
import parsley.syntax.character.{charLift, stringLift}

val openTag = atomic('<' <~ notFollowedBy('/'))
// openTag: Parsley[Char] = parsley.Parsley@43510415
val tagName = stringOfSome(letter)
// tagName: Parsley[String] = parsley.Parsley@48e5a9f3

lazy val content: Parsley[Unit] = optional(tag)
lazy val tag: Parsley[Unit] = (openTag ~> tagName <~ '>').fillRef { name =>
    content <~ ("</" ~> string(name) <~ ">")
}

tag.parse("<hello></hello>")
// res2: parsley.Result[String, Unit] = Success(())
tag.parse("<hello></hi>")
// res3: parsley.Result[String, Unit] = Failure((line 1, column 10):
//   unexpected "hi>"
//   expected "hello"
//   ><hello></hi>
//             ^^^)
tag.parse("<a><b></b></c>")
// res4: parsley.Result[String, Unit] = Failure((line 1, column 13):
//   unexpected "c"
//   expected "a"
//   ><a><b></b></c>
//                ^)

Long-Term State

Persistence is an example of read-only state used to preserve a value for later. Writable state can also be used for context-sensitive tasks, by tracking a value over time. Examples include whitespace-sensitivity or tracking matching brackets.

Matching Brackets

The setDuring and updateDuring combinators can be used to reset state after a specific context is executed. They can be built primitively out of get, set, and persist. The following parser reports the position of the last unclosed bracket that is well-interleaved with the other kinds of brackets.

import parsley.Parsley.{eof, many}
import parsley.character.char
import parsley.errors.patterns.VerifiedErrors
import parsley.position.pos

case class Brackets(open: Char, position: (Int, Int)) {
    def enter(c: Char, p: (Int, Int)) = {
        val (line, col) = p
        // adjust the column, because it was parsed post-bracket
        this.copy(open = c, position = (line, col-1))
    }
    def missingClose = s"unclosed $open originating at $position"
}
object Brackets {
    def empty = Brackets(0, null) // this will never be matched on
    def toOpen(c: Char) = c match {
        case ')' => '('
        case ']' => '['
        case '}' => '{'
    }
}

def brackets = Brackets.empty.makeRef { bs =>
    def open(c: Char): Parsley[Unit] =
        char(c) ~> bs.update(pos.map[Brackets => Brackets](p => _.enter(c, p)))
    def close(c: Char): Parsley[Unit] =
        char(c).void | bs.get.verifiedExplain(_.missingClose)

    // ensure it is reset
    def scope[A](p: Parsley[A]): Parsley[A] = bs.updateDuring(identity[Brackets])(p)

    lazy val matching: Parsley[Unit] = scope {
        many {
              open('(') ~> matching <~ close(')')
            | open('[') ~> matching <~ close(']')
            | open('{') ~> matching <~ close('}')
        }.void
    }
    matching <~ eof
}

The above parser is designed to report where the last unclosed bracket was. It creates a reference bs that stores a Brackets, which tracks the last open character and its position. Then, whenever a bracket is entered, matching will save the existing information using the updateDuring combinator: giving it the identity function will mean it will simply restore the existing state after it returns. Whenever an open bracket is parsed, it will write its position into the state (lagging by one character), and then if the corresponding closing bracket cannot be parsed, it will use an unconditional Verified Error to report a message based on the last opened bracket. The results are below:

val p = brackets
// p: Parsley[Unit] = parsley.Parsley@71066bf7

p.parse("()()()")
// res5: parsley.Result[String, Unit] = Success(())
p.parse("[][][]")
// res6: parsley.Result[String, Unit] = Success(())
p.parse("{}[]()")
// res7: parsley.Result[String, Unit] = Success(())
p.parse("[[[[[]]]]]")
// res8: parsley.Result[String, Unit] = Success(())

p.parse("([)]")
// res9: parsley.Result[String, Unit] = Failure((line 1, column 3):
//   unexpected ")"
//   expected "(", "[", "]", or "{"
//   unclosed [ originating at (1,2)
//   >([)]
//      ^)
p.parse("({(")
// res10: parsley.Result[String, Unit] = Failure((line 1, column 4):
//   unexpected end of input
//   expected "(", ")", "[", or "{"
//   unclosed ( originating at (1,3)
//   >({(
//       ^)
p.parse("()[]{[(){}}")
// res11: parsley.Result[String, Unit] = Failure((line 1, column 11):
//   unexpected "}"
//   expected "(", "[", "]", or "{"
//   unclosed [ originating at (1,6)
//   >()[]{[(){}}
//              ^)

Given the relatively simple construction, it works quite well, and efficiently too: no flatMap necessary!

Tail-Recursive Combinators

When combinators can be implemented tail recursively instead of recursively, they can be more efficient. In the context of parsley, tail-recursive combinators are ones which only return the result of the last recursive call they make:

lazy val tailRec: Parsley[Unit] = 'a' ~> tailRec | unit

The above is tail recursive, for instance. Combinators like skipMany are implemented tail recursively, with additional optimisations to make them more efficient: implementing new combinators in terms of skipMany with references to carry state is likely to be efficient. For example:

def setOf[A](p: Parsley[A]): Parsley[Set[A]] = {
    Set.empty[A].makeRef { set =>
        many(set.update(p.map[Set[A] => Set[A]](x => _ + x))) ~> set.get
    }
}

In the above code, a set is carried around in a reference, and a new element is added into this set every iteration. When the loop completes (successfully), the set in the reference is returned. A more efficient implementation, however, would use persist and a mutable set (along with impure and fresh): this, of course, still uses a reference.

Whitespace-Sensitive Languages

Another application of long-term state is to track indentation levels in a whitespace-sensitive language: here the start of a new cause, statement or block will record the current column number, and further statements must verify that the column number is correct. Leaving a block will restore the identation level back to how it was before. This is similar to how matching brackets worked.

Stateful Combinators

For some applications, a more structured strategy for tracking state can be useful. The forP combinators allow for looping where the a stateful variable is used to control the control flow. For instance, the classic context-sensitive grammar of a^n b^n c^n can be matched effectively using a reference and a forP:

import parsley.Parsley.pure
import parsley.state.forP
val abcs = 0.makeRef { i =>
    many('a' ~> i.update(_ + 1)) ~>
    forP[Int](i.get, pure(_ > 0), pure(_ - 1)) {
        'b'
    } ~>
    forP[Int](i.get, pure(_ > 0), pure(_ - 1)) {
        'c'
    } ~>
    i.get <~ eof
}
// abcs: Parsley[Int] = parsley.Parsley@784dd7b4

abcs.parse("aabbcc")
// res12: parsley.Result[String, Int] = Success(2)
abcs.parse("aaaaabbbbbccccc")
// res13: parsley.Result[String, Int] = Success(5)
abcs.parse("aaaabbbbbbcccc")
// res14: parsley.Result[String, Int] = Failure((line 1, column 9):
//   unexpected "b"
//   expected "c"
//   >aaaabbbbbbcccc
//            ^)

First, as many as as possible are read and each one will increment the counter i. Then, run the equivalent of a C-style: for (int j = i, j > 0, j -= 1) reading b and c. Internally, forP will use a reference to track its state.