Your assignment is to make a regular expression parser. I expect you will work in Java, although most other languages are fine, just check with me. You should accept a regular expression as a string and a test string and output if the test string is or isn't in the language specified by the regular expression. You most probably will first that the regular expression, parse it, create a NFA from it, and simulate the NFA deterministically on the test string (by following the set of states the NFA is simultaneously pursuing). However, other approaches might be possible. There is undoubtably a lot of information about this on the web, since parsing and string pattern matching are really important subjects.

It would be easiest if you program worked from the command line, but you will probably make subroutines that take a pattern string and a test string. You might also make one subroutine that takes a pattern string and returns a reference to a pattern matching gizmo. A second call takes the reference to the gizmo and a test string and returns true or false. A few of the standard languages work this way in which the pattern matching NFA is constructed from the regular expression in a first step, and an pattern matching object returned. This object is then applied to test strings in a second step.

We agree that the syntax for regular expressions will follow the definition in the Sipser book, with a few modifications and clarifications.

  1. The characters a,b c, ... , z are regular expressions.
  2. The character E is a regular expression denoting the empty string.
  3. The character Z is a regular expression denoting the empty set.
  4. If X and Y are regular expressions, so is ( X | Y ), representing union.
  5. If X and Y are regular expressions, so if (XY), representing concatenation.
  6. If X is a regular expression, so is (X*), representing star.
  7. For convienience, the character . (dot) is the regular expression for the language of any single character, a through z.
  8. Any string of characters, such as the three characters cat, is short for the concatentation, dropping parentheses. Strictly speaking we should otherwise write (c(at)) or ((ca)t).
  9. White space is ignored.
Hopefully this works out. Let me know if there are any large problems getting this to work simply.