gwerren.com

Programming notes and tools...

Simple C# Parser

Thu, 14 May 2020 JSON Data MapperC#Text Parsing

Posts in This Series

Introduction

I have started building a JSON data mapper for which I have first defined a simple mapping language.

In order to build the data mapper I need to be able to parse mapping scripts written in the mapping language, the second step of which is to take the script tokens and parse them into an AST. The tokenizing step is covered in a previous post.

The latest code for this parser is available as part of my JSuite GitHub project with the code file currently located here.

My Parser Requirements

Pre Parsing

There are a few artifacts of the language definition that make it hard to define simple rules based on the series of tokens obtained, for instance comments are allowed. As a result I decided that it would make sense to have an intermediate set of steps between the tokenization and parsing which would strip out surplus tokens, apply some basic clean-up and split the tokens into individual mapping statements. The parser will then operate on the individual mapping statements making it far simpler to define.

The set of pre-parsing tasks that are performed are as follows:

The Parser

A parser is basically a pattern matcher across a series of tokens. As a result a good way to configure a parser is to define a set of rules which each match a certain series of tokens or other rules with all matching eventually getting to tokens. The parser can then apply these rules to the input tokens to find the rule and token structure that matches the input tokens and return that structure.

Configuration Structure

The parse configuration structure I decided on is as follows:

The parser is also told which rule is the top level rule, this is the rule that it should start with when asked to parse a new series of tokens.

Parsing Process

Once configured the parser can parse a series of tokens as follows:

Additional Tracking

The parser also tracks what the furthest matched token has been. This information can be used along with positional information I added to the tokens to report useful parsing errors to users. The idea being that, if a match is not found then we can report "unexpected token X at location Y" style error messages giving the user a lot of contextual information to start investigating the problem.

Ensuring Output is Concise and Well Structured

Imagine you have a very simple language allowing you to define addition of multiple numbers which has tokens Number and Add. You could define the following rules for parsing this language:

If we applied this to the sum 1+2+3 we would get the following tree:

Sum:
    Number: 1
    SubsequentSum:
        Add: +
        Number: 2
    SubsequentSum:
        Add: +
        Number: 3

This is not as concise as it could be nor is it as well structured as it could be. An ideal result would be something like the following where we have removed the SubsequentSum nesting and the Add tokens since they do not add any useful information:

Sum:
    Number: 1
    Number: 2
    Number: 3

To achieve this there are two additional concepts that can be applied to a rule element:

By updating our rule definitions using these two concepts we can now achieve our desired output tree structure:

Self Referencing Rules

As rules get more complicated we may want to be able to include circular references.

A good example of this is bracketed sub-expressions in sums where either a number of a bracketed sum are interchangeable. Here you may want to define Sum as Value SubsequentSum, define SubsequentSum as Add Value and define Value as Number or (Sum). As you can see Sum references Value which in turn references Sum.

This is perfectly valid however it does mean we could end up in an infinite loop if we are not careful. The way to avoid these loops is to ensure that the self referencing is never the first element to be matched. In our example we need an opening bracket before we can try to match a Sum within a Sum meaning that we will never hit an infinite loop.

Some parsers allow head recursion definitions which does make the definitions slightly simpler (in our example we could get rid of the SubsequentSum rule and define Sum as Number or Sum + Number). This however makes the parser logic a little more complicated so it is not supported here.

Usage

The parser defines a fluent API for configuring and is defined as generic on the type used to represent the different rules. Using this the above example can be defined as follows (assuming that both the token types and rule types are defined by string values - I would generally recommend defining enums for stronger typing here):

var parser = new Parser<string, string>()
    .Rule(
        "Sum",
        o => o
            .WithT("Number").Once()
            .ThenR("SubsequentSum").Hoist().AtLeastOnce())
    .Rule(
        "SubsequentSum",
        o => o
            .WithT("Add").Exclude().Once()
            .ThenT("Number").Once())
    .WithRoot("Sum");

Conclusion

The parser itself is currently less than 500 lines of code and meets all my requirements, it is also able to report nicely on parsing errors. It is easily configurable using a fluent API which makes the rule definitions easy to read and understand.