A Portable C++ Regular Expression Facility - Regex(3C++)

Constructing and matching

Regular expressions in this library are modeled by the single class Regex. Constructing a Regex is simple.

       Regex r1("foo");
       Regex r2("(foo)|(bar)");
       Regex r3("[a-z]$");

Regular expressions are as in egrep(1), with the following exceptions:

the meta-character $ matches the null character (in egrep(1), $ matches the newline character);
newline characters are treated as ordinary characters; and,
subexpression references \0, \1, ..., \9 are allowed.

Thus, the first Regex above will match any string containing the literal string ``foo'', the second will match any string containing ``foo'' or ``bar'', and the third will match any string ending in a lower case letter.

       r1.match("foo");     // true
       r1.match("foobar");  // also true
       r2.match("_foo_");   // true
       r3.match("12a");     // true
       r3.match("12a\n");   // false

Notice that, as in egrep(1), a match is successful as long as the target contains a matching substring; that is, a substring (possibly the null string, or the entire target string itself) which exactly matches the pattern.

As illustrated below, the egrep(1) meta-characters ^ and $ can be used to anchor the match to the beginning and end of the target string, respectively.

       Regex r("^foo$");
       r.match("foo");      // true
       r.match("foobar");   // false
       r.match("foo\nbar"); // also false

The last statement emphasizes the fact that unlike in egrep(1), the meta-character $ matches the end of the string, not the newline character.
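The substring and anchored behaviors described above can be approximated with the standard library for readers who do not have this Regex class at hand. The following is a minimal sketch, assuming a C++11 compiler and std::regex in place of Regex; the helper names contains_match, whole_match and check_anchoring_examples are hypothetical and are not part of this library.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Stand-in for the substring semantics of match(): true if some substring
// of target (possibly the null string) matches pattern.
// Uses std::regex, not the Regex class described in this document.
bool contains_match(const std::string& pattern, const std::string& target) {
    return std::regex_search(target, std::regex(pattern));
}

// Stand-in for a pattern anchored at both ends: true only if the entire
// target, embedded newlines included, matches pattern.
bool whole_match(const std::string& pattern, const std::string& target) {
    return std::regex_match(target, std::regex(pattern));
}

// Mirrors the examples in the text with the stand-in helpers.
void check_anchoring_examples() {
    assert(contains_match("foo", "foo"));
    assert(contains_match("foo", "foobar"));
    assert(contains_match("(foo)|(bar)", "_foo_"));
    assert(contains_match("x*", "abc"));      // the null string matches
    assert(whole_match("foo", "foo"));
    assert(!whole_match("foo", "foobar"));
    assert(!whole_match("foo", "foo\nbar"));  // newline is an ordinary character
}
```

Using std::regex_match for the anchored case avoids relying on how a particular std::regex implementation treats $ next to a newline, which is the very point on which this library departs from egrep(1).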

Backslashes in patterns must be escaped in order to get past the C++ compiler. For example, the egrep(1) pattern \\ (literal backslash) must be constructed as follows:

       Regex("\\\\");

and the egrep(1) pattern ^(\+|-)?\.[0-9]+$ (optionally signed decimal fraction) must be constructed as follows:

       Regex("^(\\+|-)?\\.[0-9]+$");
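To make the doubling of backslashes concrete, the sketch below shows what the compiler actually stores for each escaped literal, and the same patterns written as raw string literals. Raw string literals assume a C++11 or later compiler, which may postdate this library; the function names are hypothetical and serve only as illustration.

```cpp
#include <cassert>
#include <string>

// Escaped literals: each \\ in the source becomes one backslash in the string.
std::string literal_backslash_pattern() { return "\\\\"; }
std::string signed_fraction_pattern()   { return "^(\\+|-)?\\.[0-9]+$"; }

// The same patterns as C++11 raw string literals, with no doubling needed.
std::string literal_backslash_raw() { return R"(\\)"; }
std::string signed_fraction_raw()   { return R"(^(\+|-)?\.[0-9]+$)"; }

// Both spellings produce identical pattern text.
void check_escaping() {
    assert(literal_backslash_pattern().size() == 2);  // the two characters \ and \
    assert(literal_backslash_pattern() == literal_backslash_raw());
    assert(signed_fraction_pattern() == signed_fraction_raw());
}
```

Either spelling hands the same characters to the pattern compiler; the raw form simply reads the way the egrep(1) pattern is written.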

Alphabetic characters in patterns are by default treated case sensitively.

       Regex r("foo");
       r.match("foo");  // true
       r.match("Foo");  // false

Case insensitive matching can be requested in the constructor.

       Regex r("foo", Regex::case_insensitive);
       r.match("foo");  // true
       r.match("Foo");  // true

The case sensitivity can also be changed at any time.

       Regex r("foo");  // case_sensitive by default
       r.set_sensitivity(Regex::case_insensitive);
       r.match("Foo");  // true
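For comparison, a similar choice between case sensitive and case insensitive matching can be sketched with std::regex. This is only an analogue under the assumption of a C++11 standard library: std::regex has no counterpart to set_sensitivity, so the sketch rebuilds the pattern with the desired flag on each call. The names Sensitivity, search_with_case and check_case_examples are hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical stand-in for the two sensitivity settings.
enum class Sensitivity { case_sensitive, case_insensitive };

// Substring match with a selectable case rule, built on std::regex
// (not on the Regex class described in this document).
bool search_with_case(const std::string& pattern, const std::string& target,
                      Sensitivity s) {
    std::regex_constants::syntax_option_type flags =
        std::regex_constants::ECMAScript;
    if (s == Sensitivity::case_insensitive) {
        flags = flags | std::regex_constants::icase;
    }
    return std::regex_search(target, std::regex(pattern, flags));
}

// Mirrors the "foo" / "Foo" examples in the text.
void check_case_examples() {
    assert(search_with_case("foo", "foo", Sensitivity::case_sensitive));
    assert(!search_with_case("foo", "Foo", Sensitivity::case_sensitive));
    assert(search_with_case("foo", "foo", Sensitivity::case_insensitive));
    assert(search_with_case("foo", "Foo", Sensitivity::case_insensitive));
}
```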

When used with character class ranges (e.g., [a-z], [0-9]), case insensitivity is applied only after range expansion. For example, the (rather unusual) range [A-c] is always first expanded into the character class { A, B, ..., Y, Z, [, \, ], ^, _, `, a, b, c } (using the ASCII collating sequence). Under case sensitive matching this matches any character in that set, while under case insensitive matching it matches any character in the set { A, a, B, b, ..., Y, y, Z, z, [, \, ], ^, _, ` }. Similarly, the (again rather unusual) range [a-Z] is always first expanded into the empty character class, because a follows Z in the ASCII collating sequence. This matches no characters under either case sensitive or case insensitive matching.
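The expand-then-fold rule can be checked with a few lines of code. The sketch below is an illustration of the rule as described, not the library's implementation; it assumes ASCII and the default "C" locale, and the names expand_range, in_class and check_range_examples are hypothetical.

```cpp
#include <cassert>
#include <cctype>
#include <set>

// Step 1: expand the range lo-hi by ASCII code. An inverted range such as
// a-Z yields the empty class.
std::set<char> expand_range(char lo, char hi) {
    std::set<char> out;
    for (int c = lo; c <= hi; ++c) {
        out.insert(static_cast<char>(c));
    }
    return out;
}

// Step 2: test membership. Under case insensitive matching a character
// also matches if its upper or lower case form is in the expanded class.
bool in_class(const std::set<char>& cls, char c, bool case_insensitive) {
    if (cls.count(c) > 0) return true;
    if (!case_insensitive) return false;
    const unsigned char u = static_cast<unsigned char>(c);
    return cls.count(static_cast<char>(std::tolower(u))) > 0 ||
           cls.count(static_cast<char>(std::toupper(u))) > 0;
}

// Mirrors the [A-c] and [a-Z] examples in the text.
void check_range_examples() {
    const std::set<char> a_to_c = expand_range('A', 'c');
    assert(a_to_c.size() == 35);           // 'A' (65) through 'c' (99)
    assert(in_class(a_to_c, '^', false));  // punctuation between Z and a
    assert(!in_class(a_to_c, 'd', false)); // d is outside the expanded class
    assert(in_class(a_to_c, 'd', true));   // but D is inside it
    const std::set<char> inverted = expand_range('a', 'Z');
    assert(inverted.empty());
    assert(!in_class(inverted, 'a', true));
}
```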

Some commonly used regular expressions come pre-defined in the library.

       ... = Regex::Int;        // ^(\+|-)?[0-9]+$
       ... = Regex::Float;
             // ^(\+|-)?((\.[0-9]+)|([0-9]+(\.[0-9]*)?))$
       ... = Regex::Double;
             // ^(\+|-)?((\.[0-9]+)|([0-9]+(\.[0-9]*)?))
             // ([eE](\+|-)?[0-9]+)?$
       ... = Regex::Alpha;      // ^[A-Za-z]+$
       ... = Regex::Alphanum;   // ^[0-9A-Za-z]+$
       ... = Regex::Identifier; // ^[A-Za-z_][A-Za-z0-9_]*$

For the sake of example, let's look at the definitions of the last four of these Regexes.

       Regex Regex::Double(
           "^(\\+|-)?((\\.[0-9]+)|([0-9]+(\\.[0-9]*)?))"
           "(e(\\+|-)?[0-9]+)?$", Regex::case_insensitive);
       Regex Regex::Alpha(
           "^[a-z]+$", Regex::case_insensitive);
       Regex Regex::Alphanumeric(
           "^[0-9a-z]+$", Regex::case_insensitive);
       Regex Regex::Identifier(
           "^[a-z_][a-z0-9_]*$", Regex::case_insensitive);

Again notice the extra backslashes needed to get past the C++ compiler.
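To see what some of these patterns accept, the sketch below rebuilds three of them on top of std::regex. It is an approximation under stated assumptions: a C++11 standard library, the case sensitive spellings given in the comments of the pre-defined list rather than the case_insensitive definitions, and hypothetical helper names (is_int, is_double, is_identifier, check_predefined_examples) that are not part of this library.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Stand-in for Regex::Int: optionally signed run of digits.
bool is_int(const std::string& s) {
    static const std::regex re("^(\\+|-)?[0-9]+$");
    return std::regex_search(s, re);
}

// Stand-in for Regex::Double: decimal number with optional exponent.
bool is_double(const std::string& s) {
    static const std::regex re(
        "^(\\+|-)?((\\.[0-9]+)|([0-9]+(\\.[0-9]*)?))"
        "([eE](\\+|-)?[0-9]+)?$");
    return std::regex_search(s, re);
}

// Stand-in for Regex::Identifier: letter or underscore, then letters,
// digits or underscores.
bool is_identifier(const std::string& s) {
    static const std::regex re("^[A-Za-z_][A-Za-z0-9_]*$");
    return std::regex_search(s, re);
}

// A few inputs on either side of each pattern.
void check_predefined_examples() {
    assert(is_int("+42"));
    assert(!is_int("4.2"));
    assert(is_double("1.5e10"));
    assert(is_double("-.5"));
    assert(!is_double("1e"));      // exponent needs at least one digit
    assert(is_identifier("_foo1"));
    assert(!is_identifier("1foo"));
}
```

The adjacent string literals in is_double are concatenated by the compiler, the same device the definition of Regex::Double uses to split a long pattern across lines.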