Cook Computing

Functional Style Regex Engine in C# Revisited

May 28, 2009 Written by Charles Cook

In January 2007 I posted about a Functional Style Regex Engine in C#. This was an exercise in using iterators and anonymous functions in C#. I decided it was time to see if it was possible to rewrite the code using Linq.

To recap, the goal is to be able to specify a regular expression in a functional style like this:


// c(a|d)+r

Rgx.Expr e = 
  Rgx.Seq(Rgx.Char('c'), 
    Rgx.Seq(Rgx.Plus(Rgx.Alt(Rgx.Char('a'), Rgx.Char('d'))), 
      Rgx.Char('r')));

foreach (string r in e("cdar"))
  Console.WriteLine("Match with remainder: {0}", r);

Calling an instance of Expr returns an enumeration of all the matches which start at the beginning of the input string. When each of the functions is called it returns a lambda expression which takes a string parameter. The expression returns an enumeration containing the remainder for every match on the string. For example, if the lambda expression generated by Rgx.Char('x') is passed the string "xyz", the enumeration returned by the expression will contain the single string "yz", this being the remainder from the match on the character 'x'.

The rewritten class turns out to be quite neat compared to the original version which required some hackiness because the yield statement cannot be used in an anonymous block. Here it is:


using System;
using System.Collections.Generic;
using System.Linq;

class Rgx
{
  public delegate IEnumerable<string> Expr(string s);

  static Expr Nil()
  {
    return s => Enumerable.Repeat(s, 1);
  }

  static public Expr Seq(Expr l, Expr r)
  {
    return s => l(s).SelectMany(x => r(x));
  }

  static public Expr Alt(Expr l, Expr r)
  {
    return s => l(s).Concat(r(s));
  }

  static public Expr Star(Expr e)
  {
    return s => Nil()(s).Concat(Seq(e, Star(e))(s));
  }

  static public Expr Plus(Expr e)
  {
    return Seq(e, Star(e));
  }

  static public Expr Char(char c)
  {
    return s => s.Length > 0 && s[0] == c
      ? Enumerable.Repeat(s.Substring(1), 1)
      : Enumerable.Repeat("", 0);
  }
}

The implementation of Seq() made me think the most. The enumeration of match remainders from the left-hand side of the sequence is processed by the right-hand side of the sequence by using the SelectMany() function, i.e. each string in the output from the left-hand side is passed to the expression on the right-hand side, each call to the right-hand expression resulting in zero or more match remainders.

Again, this is definitely not the way to implement regular expression matching, but just an exercise in using Linq.