Context-Sensitive Parsing (parsley.state
)
Normally, context-sensitive parsing can be done with monadic flatMap
. However,
in parsley
, flatMap
is a very expensive operation and is best avoided.
Instead, parsley
supports a form of arbitrary state threading called references. These can be used to perform context-sensitive parsing in a
more performant way at a cost to expressive power.
The Scaladoc for this page can be found at parsley.state
.
References
A reference is a single piece of mutable state threaded through a parser. They can be made in three different ways:
-
Importing
parsley.state.RefMaker
allows the use of themakeRef
method on any typeA
:def makeRef[B](body: Ref[A] => Parsley[B]): Parsley[B]
This will construct a new reference filled with the receiver value, and this can be used within the scope of the given continuation. Everytime this is executed, it will be uniquely scoped (when the parser is recursive).
-
Similarly, importing
parsley.state.StateCombinators
allows the use of thefillRef
combinator onParsley[A]
, which has the same signature asmakeRef
.It behaves similarly to
makeRef
, but sources its result from a parser. Loosely:p.fillRef(body)
is the same asp.flatMap(_.makeRef(body))
, but is much more efficient. -
The
Ref.make[A]: Ref[A]
method allows for the creation of a reference outside of the parsing context. This is not recommended, as it will not guarantee uniqueness of scoping. The state will not be initialised.Using the same globally constructed reference in two different parsers (in terms of calling
.parse
on them) is undefined behaviour and should not be done.
References themselves have two core operations, with several more built on top:
class Ref[A] {
def get: Parsley[A]
def set(p: Parsley[A]): Parsley[Unit]
}
The get
method will read the reference at parse-time and return the value
contained within. The set
method takes the result of a parser and stores
that into the reference. As examples:
import parsley.character.item
import parsley.state._
List.empty[Char].makeRef { r1 =>
item.fillRef { r2 =>
r1.set(r2.get <::> (r2.get <::> r1.get))
} *> r1.get
}.parse("a")
// res0: parsley.Result[String, List[Char]] = Success(List(a, a))
The above example fills a reference r1
, with the empty list, then fills
a second reference r2
with the result of parsing any character. The value
in r1
is updated with the list obtained by prepending the parsed character onto
the empty list stored in r1
twice. After r2
goes out of scope, the current
value of r1
is returned.
Persistence
In the above example, the reference r2
is only used for get
, but is used
multiple times. Normally, using the value of a parser more than once requires
a flatMap
; this is not the case in parsley
when using references. To make
this application more ergonomic, parsley.state.StateCombinators
also
exposes the persist
combinator:
List.empty[Char].makeRef { r1 =>
item.persist { c =>
r1.set(c <::> (c <::> r1.get))
} *> r1.get
}.parse("a")
// res1: parsley.Result[String, List[Char]] = Success(List(a, a))
Persist can be thought of as a composition of fillRef
and get
, or
alternatively as a composition of flatMap
and pure
.
Using Persistence
One use of persist
is to otherwise reduce the scope of an expensive flatMap
:
the flatMap
combinator is expensive because it has to process the body of the
function in full everytime it is executed, if the size of the body is reduced,
that will keep the parser faster. Currently, there is no primitive functionality
for parsing with respect to values inside references, like so:
def string(r: Ref[String]): Parsley[String] = r.get.flatMap(parsley.character.string(_))
The scope of the flatMap
in that combinator is small, however, so is likely
much more efficient than one that didn't use persistence. With this, the
context-sensitive parsing of XML tags can be done:
import parsley.Parsley.{atomic, notFollowedBy}
import parsley.character.{stringOfSome, letter}
import parsley.combinator.optional
import parsley.syntax.character.{charLift, stringLift}
val openTag = atomic('<' <~ notFollowedBy('/'))
// openTag: Parsley[Char] = parsley.Parsley@43510415
val tagName = stringOfSome(letter)
// tagName: Parsley[String] = parsley.Parsley@48e5a9f3
lazy val content: Parsley[Unit] = optional(tag)
lazy val tag: Parsley[Unit] = (openTag ~> tagName <~ '>').fillRef { name =>
content <~ ("</" ~> string(name) <~ ">")
}
tag.parse("<hello></hello>")
// res2: parsley.Result[String, Unit] = Success(())
tag.parse("<hello></hi>")
// res3: parsley.Result[String, Unit] = Failure((line 1, column 10):
// unexpected "hi>"
// expected "hello"
// ><hello></hi>
// ^^^)
tag.parse("<a><b></b></c>")
// res4: parsley.Result[String, Unit] = Failure((line 1, column 13):
// unexpected "c"
// expected "a"
// ><a><b></b></c>
// ^)
Long-Term State
Persistence is an example of read-only state used to preserve a value for later. Writable state can also be used for context-sensitive tasks, by tracking a value over time. Examples include whitespace-sensitivity or tracking matching brackets.
Matching Brackets
The setDuring
and updateDuring
combinators can be used to reset state after a specific context is
executed. They can be built primitively out of get
, set
, and persist
. The
following parser reports the position of the last unclosed bracket that is
well-interleaved with the other kinds of brackets.
import parsley.Parsley.{eof, many}
import parsley.character.char
import parsley.errors.patterns.VerifiedErrors
import parsley.position.pos
case class Brackets(open: Char, position: (Int, Int)) {
def enter(c: Char, p: (Int, Int)) = {
val (line, col) = p
// adjust the column, because it was parsed post-bracket
this.copy(open = c, position = (line, col-1))
}
def missingClose = s"unclosed $open originating at $position"
}
object Brackets {
def empty = Brackets(0, null) // this will never be matched on
def toOpen(c: Char) = c match {
case ')' => '('
case ']' => '['
case '}' => '{'
}
}
def brackets = Brackets.empty.makeRef { bs =>
def open(c: Char): Parsley[Unit] =
char(c) ~> bs.update(pos.map[Brackets => Brackets](p => _.enter(c, p)))
def close(c: Char): Parsley[Unit] =
char(c).void | bs.get.verifiedExplain(_.missingClose)
// ensure it is reset
def scope[A](p: Parsley[A]): Parsley[A] = bs.updateDuring(identity[Brackets])(p)
lazy val matching: Parsley[Unit] = scope {
many {
open('(') ~> matching <~ close(')')
| open('[') ~> matching <~ close(']')
| open('{') ~> matching <~ close('}')
}.void
}
matching <~ eof
}
The above parser is designed to report where the last unclosed bracket
was. It creates a reference bs
that stores a Brackets
, which tracks
the last open character and its position. Then, whenever a bracket is
entered, matching
will save the existing information using the updateDuring
combinator: giving it the identity
function will mean it will simply restore
the existing state after it returns. Whenever an open bracket is parsed, it
will write its position into the state (lagging by one character), and then
if the corresponding closing bracket cannot be parsed, it will use an
unconditional Verified Error to report a message based on the last opened bracket. The results are below:
val p = brackets
// p: Parsley[Unit] = parsley.Parsley@71066bf7
p.parse("()()()")
// res5: parsley.Result[String, Unit] = Success(())
p.parse("[][][]")
// res6: parsley.Result[String, Unit] = Success(())
p.parse("{}[]()")
// res7: parsley.Result[String, Unit] = Success(())
p.parse("[[[[[]]]]]")
// res8: parsley.Result[String, Unit] = Success(())
p.parse("([)]")
// res9: parsley.Result[String, Unit] = Failure((line 1, column 3):
// unexpected ")"
// expected "(", "[", "]", or "{"
// unclosed [ originating at (1,2)
// >([)]
// ^)
p.parse("({(")
// res10: parsley.Result[String, Unit] = Failure((line 1, column 4):
// unexpected end of input
// expected "(", ")", "[", or "{"
// unclosed ( originating at (1,3)
// >({(
// ^)
p.parse("()[]{[(){}}")
// res11: parsley.Result[String, Unit] = Failure((line 1, column 11):
// unexpected "}"
// expected "(", "[", "]", or "{"
// unclosed [ originating at (1,6)
// >()[]{[(){}}
// ^)
Given the relatively simple construction, it works
quite well, and efficiently too: no flatMap
necessary!
Tail-Recursive Combinators
When combinators can be implemented tail recursively instead of
recursively, they can be more efficient. In the context of parsley
,
tail-recursive combinators are ones which only return the result
of the last recursive call they make:
lazy val tailRec: Parsley[Unit] = 'a' ~> tailRec | unit
The above is tail recursive, for instance. Combinators like skipMany
are implemented tail recursively, with additional optimisations to make
them more efficient: implementing new combinators in terms of skipMany
with references to carry state is likely to be efficient. For example:
def setOf[A](p: Parsley[A]): Parsley[Set[A]] = {
Set.empty[A].makeRef { set =>
many(set.update(p.map[Set[A] => Set[A]](x => _ + x))) ~> set.get
}
}
In the above code, a set is carried around in a reference, and a new element
is added into this set every iteration. When the loop completes (successfully), the set in the reference is returned. A more efficient implementation, however,
would use persist
and a mutable set (along with impure
and fresh
): this, of course, still uses a reference.
Whitespace-Sensitive Languages
Another application of long-term state is to track indentation levels in a whitespace-sensitive language: here the start of a new cause, statement or block will record the current column number, and further statements must verify that the column number is correct. Leaving a block will restore the identation level back to how it was before. This is similar to how matching brackets worked.
Stateful Combinators
For some applications, a more structured strategy for tracking state can be
useful. The forP
combinators allow for looping where the a stateful variable
is used to control the control flow. For instance, the classic context-sensitive
grammar of a^n b^n c^n
can be matched effectively using a reference and a forP
:
import parsley.Parsley.pure
import parsley.state.forP
val abcs = 0.makeRef { i =>
many('a' ~> i.update(_ + 1)) ~>
forP[Int](i.get, pure(_ > 0), pure(_ - 1)) {
'b'
} ~>
forP[Int](i.get, pure(_ > 0), pure(_ - 1)) {
'c'
} ~>
i.get <~ eof
}
// abcs: Parsley[Int] = parsley.Parsley@784dd7b4
abcs.parse("aabbcc")
// res12: parsley.Result[String, Int] = Success(2)
abcs.parse("aaaaabbbbbccccc")
// res13: parsley.Result[String, Int] = Success(5)
abcs.parse("aaaabbbbbbcccc")
// res14: parsley.Result[String, Int] = Failure((line 1, column 9):
// unexpected "b"
// expected "c"
// >aaaabbbbbbcccc
// ^)
First, as many a
s as possible are read and each one will increment the counter
i
. Then, run the equivalent of a C-style: for (int j = i, j > 0, j -= 1)
reading b
and c
. Internally, forP
will use a reference to track its state.