
Simple C# Tokenizer Using Regex

Tue, 28 Apr 2020 | JSON Data Mapper, Regex, C#, Text Parsing


Introduction

I have started building a JSON data mapper, for which I first defined a simple mapping language.

In order to build the data mapper I need to be able to parse mapping scripts written in the mapping language, and the first step is to tokenize them. I considered using ANTLR, and may well switch if things get much more complicated; however, I thought I would first have a go at building what I needed in C#.

The code for this tokenizer is available as part of my JSuite GitHub project, with the latest version under the tag LatestRegexTokenizer and the code file currently located here.

My Tokenizer Requirements

Looking at my language specification I realized that there were only a few complex tokens (such as a quoted item: a series of characters surrounded by double quotes, with any contained double quotes escaped by a second double quote) that needed to be carefully extracted. After that the tokens were all simple (e.g. a colon symbol), with anything remaining at the end being string segments.

I decided to see if it would be possible to extract one token type at a time using Regex matching. Using this approach I would need to ensure extraction was done in the correct order (e.g. extract quoted strings early so that subsequent extractions do not need to worry about potentially being within a quoted string).

To start with I came up with the set of tokens I wanted to be able to extract and the order they should be extracted in (the full, ordered definition appears in the Usage section below):

1. Comments
2. Quoted items
3. Partials and variables
4. Line continuations, new lines and whitespace
5. Simple symbols (e.g. ::, =, ., brackets, commas, ?, !, :)
6. Items (names) and wildcards, with anything left over remaining as string segments

The most difficult are the first two, since something looking like a comment could appear in a quoted string and quotes could appear within a comment. Luckily the C# Regex implementation supports lookbehind, and with a bit of work I was able to define a Regex to match a comment that does not start in the middle of a quoted string.
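To see what this means in practice, here is a quick check of the comment Regex from the Usage section below against two hypothetical input lines (a small sketch of mine, not from the post's code):

using System;
using System.Text.RegularExpressions;

var commentRegex = new Regex(
    @"(?<=^([^""\r\n]|""[^""\r\n]*"")*)//[^\r\n]*",
    RegexOptions.Multiline);

// A genuine comment is matched...
Console.WriteLine(commentRegex.Match(@"a = b // trailing comment").Value);
// -> // trailing comment

// ...but a // inside a quoted string is not, since the lookbehind
// requires the quotes before the match position to be balanced.
Console.WriteLine(commentRegex.IsMatch(@"a = ""not a // comment"""));
// -> False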

The Tokenizer

Given the above requirements I came up with the following generic tokenizer implementation, which iteratively applies the token Regexes to extract the tokens and returns the resultant list. The implementation is generic on the type that holds the token type (TType). This allows us to use any type we want; I defined an enum for my usage (see below), but you could just as easily use string or some other type if you prefer (there is a small sketch of this after the code).

using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public readonly struct Token<TType>
{
    public Token(TType type, string value)
    {
        this.Type = type;
        this.Value = value;
    }

    public TType Type { get; }

    public string Value { get; }
}

public class Tokenizer<TType>
{
    private readonly IList<TokenType> tokenTypes = new List<TokenType>();
    private readonly TType defaultTokenType;

    public Tokenizer(TType defaultTokenType) => this.defaultTokenType = defaultTokenType;

    public Tokenizer<TType> Token(TType type, params string[] matchingRegexs)
    {
        foreach (var matchingRegex in matchingRegexs)
            this.tokenTypes.Add(new TokenType(type, matchingRegex));

        return this;
    }

    public IList<Token<TType>> Tokenize(string input)
    {
        // Start with the whole input as a single default-type token, then let
        // each registered token type split the remaining default-type segments.
        IEnumerable<Token<TType>> tokens = new[] { new Token<TType>(this.defaultTokenType, input) };
        foreach (var type in this.tokenTypes)
            tokens = ExtractTokenType(tokens, type);

        return tokens.ToList();
    }

    private IEnumerable<Token<TType>> ExtractTokenType(
        IEnumerable<Token<TType>> tokens,
        TokenType toExtract)
    {
        var tokenType = toExtract.Type;
        var tokenMatcher = new Regex(toExtract.MatchingRegex, RegexOptions.Multiline);
        foreach (var token in tokens)
        {
            // Previously extracted tokens pass through untouched; only
            // default-type (not yet matched) segments are searched further.
            if (!token.Type.Equals(this.defaultTokenType))
            {
                yield return token;
                continue;
            }

            var matches = tokenMatcher.Matches(token.Value);
            if (matches.Count == 0)
            {
                yield return token;
                continue;
            }

            var currentIndex = 0;
            foreach (Match match in matches)
            {
                // Text before this match stays as a default-type token.
                if (currentIndex < match.Index)
                {
                    yield return new Token<TType>(
                        this.defaultTokenType,
                        token.Value.Substring(currentIndex, match.Index - currentIndex));
                }

                yield return new Token<TType>(tokenType, match.Value);
                currentIndex = match.Index + match.Length;
            }

            // Any trailing text after the final match also stays default-type.
            if (currentIndex < token.Value.Length)
            {
                yield return new Token<TType>(
                    this.defaultTokenType,
                    token.Value.Substring(currentIndex, token.Value.Length - currentIndex));
            }
        }
    }

    private readonly struct TokenType
    {
        public TokenType(TType type, string matchingRegex)
        {
            this.Type = type;
            this.MatchingRegex = matchingRegex;
        }

        public TType Type { get; }

        public string MatchingRegex { get; }
    }
}
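Since TType is generic, nothing forces the use of an enum. As a quick illustration (a minimal sketch of my own, not from the JSuite code), the same tokenizer can be driven by plain strings:

var numbers = new Tokenizer<string>("text")
    .Token("number", "[0-9]+");

foreach (var token in numbers.Tokenize("abc123def"))
    Console.WriteLine($"{token.Type}: '{token.Value}'");

// Output:
// text: 'abc'
// number: '123'
// text: 'def'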

Usage

Having built the tokenizer I was able to use it to define a specific tokenizer for my mapping scripts. My definition looks as follows:

var tokenizer = new Tokenizer<TokenType>(TokenType.String)
    .Token(TokenType.Comment, @"(?<=^([^""\r\n]|""[^""\r\n]*"")*)//[^\r\n]*")
    .Token(TokenType.QuotedItem, @"""([^""]|""{2})+""")
    .Token(TokenType.Partial, @"<<[a-zA-Z0-9_-]+>>")
    .Token(TokenType.Variable, @"\$\([a-zA-Z0-9_-]+\)")
    .Token(TokenType.LineContinuation, @"[\r\n]+[ \t\r\n]+[ \t]+")
    .Token(TokenType.NewLine, @"[\r\n]+")
    .Token(TokenType.Whitespace, @"[ \t]+")
    .Token(TokenType.PartialAssignment, "::")
    .Token(TokenType.Equals, "=")
    .Token(TokenType.Dot, @"\.")
    .Token(TokenType.OpenSquareBracket, @"\[")
    .Token(TokenType.CloseSquareBracket, @"\]")
    .Token(TokenType.OpenCurlyBracket, "{")
    .Token(TokenType.CloseCurlyBracket, "}")
    .Token(TokenType.Comma, ",")
    .Token(TokenType.QuestionMark, @"\?")
    .Token(TokenType.ExclaimationMark, @"\!")
    .Token(TokenType.OpenRoundBracket, @"\(")
    .Token(TokenType.CloseRoundBracket, @"\)")
    .Token(TokenType.Colon, ":")
    .Token(TokenType.Item, "[a-zA-Z0-9_-]+")
    .Token(TokenType.WildCard, @"\*");
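The TokenType enum referenced here is just a flat list of token kinds. Reconstructed from the calls above (the actual definition lives in the JSuite repository), it would look something like this:

public enum TokenType
{
    String, // the default type: anything not matched by the rules above
    Comment,
    QuotedItem,
    Partial,
    Variable,
    LineContinuation,
    NewLine,
    Whitespace,
    PartialAssignment,
    Equals,
    Dot,
    OpenSquareBracket,
    CloseSquareBracket,
    OpenCurlyBracket,
    CloseCurlyBracket,
    Comma,
    QuestionMark,
    ExclaimationMark,
    OpenRoundBracket,
    CloseRoundBracket,
    Colon,
    Item,
    WildCard
}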

And now I can use this to tokenize my scripts as follows:

var tokens = tokenizer.Tokenize(script);
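For example, feeding in a hypothetical one-line script (the exact mapping syntax is beyond the scope of this post) and dumping the result:

var tokens = tokenizer.Tokenize("target.name = source.firstName");
foreach (var token in tokens)
    Console.WriteLine($"{token.Type,-12} '{token.Value}'");

// Item         'target'
// Dot          '.'
// Item         'name'
// Whitespace   ' '
// Equals       '='
// Whitespace   ' '
// Item         'source'
// Dot          '.'
// Item         'firstName'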

Conclusion

For this level of script complexity the approach works well for tokenization; however, if things get much more complex I would definitely want to look at alternative options such as ANTLR.