Posts in This Series
- Simple C# Tokenizer Using Regex
- Simple C# Parser
- Using A Parse Tree
Introduction
I have started building a JSON data mapper for which I have first defined a simple mapping language.
In order to build the data mapper I need to be able to parse mapping scripts written in the mapping language; the second step of this is to take the script tokens and parse them into an AST. The tokenizing step is covered in a previous post.
The latest code for this parser is available as part of my JSuite GitHub project with the code file currently located here.
My Parser Requirements
- I want the parser to be configurable, allowing me to define parser rules based on the token types I defined previously.
- The configuration should be easy to modify if I need to add or change a feature of the language definition.
- The output of the parser should be a concise, well-structured tree that:
- Terminates in the input tokens at the leaf nodes.
- Does not contain unnecessary nodes (e.g. if brackets in the language are implied by the resultant tree structure they do not need to be included in the tree).
- Does not contain unnecessary structure (e.g. if, as an artifact of the rule definitions, additional nesting is added that does not aid understanding, it should be possible to remove it).
Pre-Parsing
There are a few artifacts of the language definition that make it hard to define simple rules based on the series of tokens obtained; for instance, comments are allowed. As a result I decided it would make sense to have an intermediate set of steps between tokenization and parsing which would strip out surplus tokens, apply some basic clean-up and split the tokens into individual mapping statements. The parser then operates on the individual mapping statements, making it far simpler to define.
The set of pre-parsing tasks that are performed are as follows:
- Applying Modifications
- This is a simple step whose only task is to update the token values of quoted item tokens to remove the quoting. This means that any later use of a token's value does not need to worry about de-quoting the value and can simply use it in the same way an unquoted value would be used (a sketch of this step is shown after this list).
- Splitting into Statements
- This step splits the tokens into individual statements, removing comments, line continuations and empty lines as it goes. After this a statement can either be a mapping definition or a partial definition.
- Applying Partials
- This step finds the partial definitions and replaces their usages with the definition tokens. Doing this here means that the parser does not need to worry about partial definitions and can simply treat each statement as a single mapping definition with no complications.
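As a loose illustration of the first of these steps, the modification step can be little more than a projection over the token stream. This is only a sketch: the Token shape, the "QuotedItem" type name and the quote character are assumptions here, not the actual JSuite definitions.

using System.Collections.Generic;
using System.Linq;

// Illustrative sketch only: a minimal token shape and the quote-stripping step.
public record Token(string Type, string Value);

public static class PreParsing
{
    // Strips the surrounding quotes from quoted item tokens so later steps can
    // use their values exactly as they would an unquoted value.
    public static IEnumerable<Token> ApplyModifications(IEnumerable<Token> tokens)
        => tokens.Select(t =>
            t.Type == "QuotedItem"                      // assumed token type name
                ? t with { Value = t.Value.Trim('"') }  // assumed quote character
                : t);
}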
The Parser
A parser is basically a pattern matcher across a series of tokens. As a result, a good way to configure a parser is to define a set of rules, each of which matches a certain series of tokens or other rules, with all matching eventually bottoming out at tokens. The parser can then apply these rules to the input tokens to find the rule and token structure that matches the input and return that structure.
Configuration Structure
The parse configuration structure I decided on is as follows:
- A set of named rules are defined.
- Each rule consists of one or more rule options; for a rule to be matched, one of its options must be matched.
- Rule options are tested in the order they were defined, with the first one to match being the one used.
- Each rule option consists of a series of elements to match where an element:
- Can be a token or another rule.
- Has an occurrence expectation (zero or more, once, at most once, at least once).
The parser is also told which rule is the top-level rule: the rule it should start with when asked to parse a new series of tokens.
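To make the shape of this configuration concrete, here is one way it could be modelled in code. These types are purely illustrative (they are not the JSuite types) and assume string-based rule and token type names.

using System.Collections.Generic;

// Illustrative sketch of the configuration structure described above.
public enum Occurrence { Once, AtMostOnce, AtLeastOnce, ZeroOrMore }

public sealed class RuleElement
{
    public string Name { get; init; }            // a token type or the name of another rule
    public bool IsRule { get; init; }            // false means the element matches a token
    public Occurrence Occurrence { get; init; }  // how many times the element may occur
}

public sealed class RuleOption
{
    // The elements are matched in order; all of them must succeed for the option to match.
    public IReadOnlyList<RuleElement> Elements { get; init; }
}

public sealed class Rule
{
    public string Name { get; init; }
    // Options are tried in definition order; the first one that matches is used.
    public IReadOnlyList<RuleOption> Options { get; init; }
}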
Parsing Process
Once configured, the parser can parse a series of tokens as follows (a rough sketch of this process is shown after the list):
- Starting with the top level rule
- Loop through the rule options, trying to apply each one until one is found that matches
- Within each rule option, loop through the elements, attempting to match each element in accordance with its occurrence expectation. If an element cannot be matched correctly then the option fails.
- If a rule option is found to match at the top level, it is checked to ensure it matches all the provided tokens; if it does not then parsing fails.
- If no matching rule option was found for the top level rule then parsing fails.
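The sketch below shows the shape of this recursive matching. It reuses the illustrative Token, Rule, RuleOption, RuleElement and Occurrence types from the earlier sketches and is not the JSuite implementation.

using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative sketch of the matching process described above.
public sealed record TreeNode(string Name, IReadOnlyList<TreeNode> Children, Token Token = null);

public sealed class SketchParser
{
    private readonly IReadOnlyDictionary<string, Rule> rulesByName;
    private readonly Rule rootRule;

    public SketchParser(IEnumerable<Rule> rules, string rootRuleName)
    {
        this.rulesByName = rules.ToDictionary(r => r.Name);
        this.rootRule = this.rulesByName[rootRuleName];
    }

    public TreeNode Parse(IReadOnlyList<Token> tokens)
    {
        var position = 0;
        var result = this.TryMatchRule(this.rootRule, tokens, ref position);
        if (result == null || position != tokens.Count)
            throw new InvalidOperationException("Parsing failed."); // the real parser reports the furthest matched token
        return result;
    }

    private TreeNode TryMatchRule(Rule rule, IReadOnlyList<Token> tokens, ref int position)
    {
        // Try each option in definition order; the first one that matches wins.
        foreach (var option in rule.Options)
        {
            var start = position;
            var children = new List<TreeNode>();
            if (this.TryMatchOption(option, tokens, ref position, children))
                return new TreeNode(rule.Name, children);
            position = start; // the option failed, so rewind before trying the next one
        }
        return null; // no option matched
    }

    private bool TryMatchOption(RuleOption option, IReadOnlyList<Token> tokens, ref int position, List<TreeNode> children)
    {
        foreach (var element in option.Elements)
        {
            var matched = 0;
            while (true)
            {
                var before = position;
                if (!this.TryMatchElement(element, tokens, ref position, children))
                    break;
                matched++;
                if (position == before)
                    break; // guard against repeating a match that consumed no tokens
                if (element.Occurrence == Occurrence.Once || element.Occurrence == Occurrence.AtMostOnce)
                    break; // these expectations allow at most a single match
            }
            // A mandatory element that never matched fails the whole option.
            if (matched == 0 && (element.Occurrence == Occurrence.Once || element.Occurrence == Occurrence.AtLeastOnce))
                return false;
        }
        return true;
    }

    private bool TryMatchElement(RuleElement element, IReadOnlyList<Token> tokens, ref int position, List<TreeNode> children)
    {
        if (element.IsRule)
        {
            var node = this.TryMatchRule(this.rulesByName[element.Name], tokens, ref position);
            if (node == null) return false;
            children.Add(node);
            return true;
        }
        if (position < tokens.Count && tokens[position].Type == element.Name)
        {
            children.Add(new TreeNode(element.Name, Array.Empty<TreeNode>(), tokens[position]));
            position++;
            return true;
        }
        return false;
    }
}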
Additional Tracking
The parser also tracks the furthest token it has managed to match. This information can be used, along with the positional information I added to the tokens, to report useful parsing errors to users: if a match is not found we can report "unexpected token X at location Y" style error messages, giving the user plenty of contextual information to start investigating the problem.
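As a rough illustration (again an assumption rather than the actual JSuite code), the tracking itself can be as simple as recording the highest token index consumed during matching and using it when building the error message:

using System;

// Illustrative only: tracks the furthest token position reached during matching
// so that a failed parse can report where matching stopped.
public sealed class FurthestMatchTracker
{
    public int FurthestMatchedIndex { get; private set; } = -1;

    public void Record(int tokenIndex)
        => this.FurthestMatchedIndex = Math.Max(this.FurthestMatchedIndex, tokenIndex);

    // Assuming tokens carry line and column information, a failed parse can then
    // report something like: "Unexpected token 'X' at line L, column C".
}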
Ensuring Output is Concise and Well Structured
Imagine you have a very simple language allowing you to define addition of multiple numbers which has tokens Number and Add. You could define the following rules for parsing this language:
- Sum
- First Number
- Then SubsequentSum AtLeastOnce
- SubsequentSum
- First Add
- Then Number
If we applied this to the sum 1+2+3 we would get the following tree:
Sum:
    Number: 1
    SubsequentSum:
        Add: +
        Number: 2
    SubsequentSum:
        Add: +
        Number: 3
This is not as concise or as well structured as it could be. An ideal result would be something like the following, where we have removed the SubsequentSum nesting and the Add tokens since they do not add any useful information:
Sum:
    Number: 1
    Number: 2
    Number: 3
To achieve this there are two additional concepts that can be applied to a rule element:
- Exclude
- This can be applied to token elements (such as Add) and will cause that token to be matched but not included in the resultant tree.
- Hoist
- This can be applied to rule elements (such as SubsequentSum) and will cause the element to be replaced by its content, hoisting the content up a level.
By updating our rule definitions using these two concepts we can now achieve our desired output tree structure:
- Sum
- First Number
- Then SubsequentSum Hoist AtLeastOnce
- SubsequentSum
- First Add Exclude
- Then Number
Self Referencing Rules
As rules get more complicated we may want to be able to include circular references.
A good example of this is bracketed sub-expressions in sums, where either a number or a bracketed sum can be used interchangeably. Here you may want to define Sum as Value SubsequentSum, define SubsequentSum as Add Value and define Value as Number or (Sum). As you can see, Sum references Value which in turn references Sum.
This is perfectly valid; however, it does mean we could end up in an infinite loop if we are not careful. The way to avoid these loops is to ensure that the self-reference is never the first element to be matched. In our example we need an opening bracket before we can try to match a Sum within a Sum, meaning we will never hit an infinite loop.
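Expressed with the illustrative configuration types sketched earlier (and assuming hypothetical OpenBracket and CloseBracket token types), the Value rule from this example might look like the sketch below. The recursive reference to Sum only appears after the OpenBracket element, so it can never be the first thing the parser tries to match.

// Illustrative only: the self-referencing Value rule, where the reference back to
// "Sum" can only be reached after an opening bracket has been matched.
var value = new Rule
{
    Name = "Value",
    Options = new[]
    {
        // Option 1: a plain number.
        new RuleOption
        {
            Elements = new[]
            {
                new RuleElement { Name = "Number", IsRule = false, Occurrence = Occurrence.Once }
            }
        },
        // Option 2: a bracketed sub-sum, which recursively references the Sum rule.
        new RuleOption
        {
            Elements = new[]
            {
                new RuleElement { Name = "OpenBracket", IsRule = false, Occurrence = Occurrence.Once },
                new RuleElement { Name = "Sum", IsRule = true, Occurrence = Occurrence.Once },
                new RuleElement { Name = "CloseBracket", IsRule = false, Occurrence = Occurrence.Once }
            }
        }
    }
};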
Some parsers allow head recursion in definitions, which does make the definitions slightly simpler (in our example we could get rid of the SubsequentSum rule and define Sum as Number or Sum + Number). This, however, makes the parser logic a little more complicated, so it is not supported here.
Usage
The parser exposes a fluent API for configuration and is generic on the types used to represent the token types and the rule types. Using this, the above example can be defined as follows (assuming that both the token types and rule types are defined by string values; I would generally recommend defining enums for stronger typing here):
var parser = new Parser&lt;string, string&gt;()
    .Rule(
        "Sum",
        o => o
            .WithT("Number").Once()
            .ThenR("SubsequentSum").Hoist().AtLeastOnce())
    .Rule(
        "SubsequentSum",
        o => o
            .WithT("Add").Exclude().Once()
            .ThenT("Number").Once())
    .WithRoot("Sum");
Conclusion
The parser itself is currently less than 500 lines of code and meets all my requirements; it is also able to report nicely on parsing errors. It is easily configurable using a fluent API, which makes the rule definitions easy to read and understand.