Cook Computing

Structural Regular Expressions in C#

July 31, 2010 Written by Charles Cook

When I read Joe Gregorio's sregex: Structural Regular Expressions in Python last October I thought porting his code to C# would provide an interesting exercise in C# functional style coding. Structural regular expressions were originally described in a paper by Rob Pike (available here). Joe's sregex project page describes like this:

Structural regular expressions work by describing the shape of the whole string, not just the piece you want to match. Each pattern is a list of operators to perform on a string, each time constraining the range of text that matches the pattern. Examples will make this much clearer.

The first operator to consider is the x// operator, which means e(x)tract. When applied to a string, all the substrings that match the regular expression between // are passed on to the next operator in the pattern.

Given the source string "Atom-Powered Robots Run Amok" and the pattern "x/A.../" the result would be ['Atom', 'Amok']. The sregex module does that using the 'sres' function:

>>> list(sres("Atom-Powered Robots Run Amok", "x/A.../"))
['Atom', 'Amok']

A pattern can contain mulitple operators, separated by whitespace, which are applied in order, each to the result of the previous match.

>>> list(sres("Atom-Powered Robots Run Amok", "x/A.../ x/.*m$/"))
['Atom']

There are four operators in total:

  • x/regex/ - Matches all the text that matches the regular expression
  • y/regex/ - Matches all the text that does not match the regular expression
  • g/regex/ - If the regex finds a match in the string then the whole string is passed along.
  • v/regex/ - If the regex does not find a match in the string then the whole string is passed along.

I ported Joe's code to C# keeping the general structure of the code the same and using Linq where possible. The C# version is used like this:


string src = "Atom-Powered Robots Run Amok";
Console.WriteLine(string.Join(", ", sregex.sres(src, "y/ / x/R.*/")));
Console.WriteLine(sregex.sub(src, "y/( |-)/ v/^R/ g/om/", "Coal"));
Console.WriteLine(sregex.sub(src, "x/A.../", s => s.ToUpper()));

/* Outputs to console:
     
Robots, Run
Coal-Powered Robots Run Amok
ATOM-Powered Robots Run AMOK
 
*/

UPDATE 2nd Aug 2010: I've improved the code to handle the 'y' command. I'm using the EnumerableEx.Defer() method from Rx for .NET Framework to ensure that the value of the remaining variable is taken after all the matches in the range has been processed.


public class sregex
{
  static Regex CMD = new Regex("(?<cmd>[xygv])/(?<rgx>[^/]*)/");
  static Regex VALIDCMD
    = new Regex(@"^\s*([xygv]/[^/]*/)(\s+([xygv]/[^/]*/))*\s*$");
  delegate IEnumerable<Range> Func(IEnumerable<Range> f, string pattern);

  static bool IsPattern(string s)
  {
    return (s == "" || VALIDCMD.IsMatch(s));
  }

  static Range _makeRange(Range range, Match match)
  {
    return new Range(match.Index + range.Start,
      match.Index + match.Length + range.Start);
  }

  public static IEnumerable<Range> sre(string src, string pattern)
  {
    IEnumerable<Range> start = new Range[] { new Range(0, src.Length) };

    Func from_g = (f, pat) => f.Where(range =>
      Regex.Match(slice(src, range), pat).Success);

    Func from_v = (f, pat) => f.Where(range =>
      !Regex.Match(slice(src, range), pat).Success);

    Func from_x = (f, pat) => f.SelectMany(range =>
        Regex.Matches(slice(src, range), pat).Cast<Match>()
          .Select(m => _makeRange(range, m)));

    Func from_y = (f, pat) => f.SelectMany(range =>
    {
      var remaining = range;
      return Regex.Matches(slice(src, range), pat).Cast<Match>()
          .Select(m =>
          {
            Range r = new Range(remaining.Start, range.Start + m.Index);
            remaining = new Range(range.Start + m.Index + m.Length, range.End);
            return r;
          })
          .Concat(EnumerableEx.Defer(() => new Range[] { remaining }))
          .Where(r => r.Length > 0); 
    });

    if (!IsPattern(pattern))
      throw new ArgumentException("Invalid expression");

    var wrapper = new Dictionary<string, Func>();
    wrapper["g"] = from_g;
    wrapper["v"] = from_v;
    wrapper["x"] = from_x;
    wrapper["y"] = from_y;

    var ret = CMD.Matches(pattern).Cast<Match>()
      .Aggregate(start, (f, m) =>
        wrapper[m.Groups["cmd"].Value](f, m.Groups["rgx"].Value));
    return ret;
  }

  public static IEnumerable<string> sres(string src, string pattern)
  {
    return sre(src, pattern).Select(r => src.Substring(r.Start,
      r.End - r.Start));
  }

  public static string sub(string src, string pattern,
    Func<string, string> repl)
  {
    StringBuilder sb = new StringBuilder(src);
    foreach (Range range in sre(src, pattern).Reverse())
    {
      sb.Remove(range.Start, range.Length);
      sb.Insert(range.Start,
        repl(src.Substring(range.Start, range.Length)));
    }
    return sb.ToString();
  }

  public static string sub(string src, string pattern, string repl)
  {
    return sub(src, pattern, s => repl);
  }

  static string slice(string str, Range range)
  {
    return str.Substring(range.Start, range.Length);
  }
}

public struct Range
{
  public readonly int Start;
  public readonly int End;
  public int Length { get { return End - Start; } }

  public Range(int start, int end)
  {
    Start = start;
    End = end;
  }
}