Regular expressions in this library are modeled by the single class Regex. Constructing a Regex is simple.
    Regex r1("foo");
    Regex r2("(foo)|(bar)");
    Regex r3("[a-z]$");
Regular expressions are as in egrep(1), with the exceptions noted below. A Regex is tested against a target string with match():
    r1.match("foo");     // true
    r1.match("foobar");  // also true
    r2.match("_foo_");   // true
    r3.match("12a");     // true
    r3.match("12a\n");   // false
Notice that, as in egrep(1), a match is successful as long as the target contains a matching substring; that is, a substring (possibly the null string, or the entire target string itself) that exactly matches the pattern.
As illustrated below, the egrep(1) meta-characters ^ and $ can be used to anchor the match to the beginning and end of the target string, respectively.
    Regex r("^foo$");
    r.match("foo");       // true
    r.match("foobar");    // false
    r.match("foo\nbar");  // also false
The last statement emphasizes that, unlike in egrep(1), the meta-character $ matches the end of the string, not the newline character.
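The difference between substring matching and anchored matching can be tried out without this library. The sketch below is only an analogy: it uses the standard std::regex facility (default ECMAScript grammar), not the Regex class, and the helper names contains_match and matches_whole are hypothetical. Its treatment of newlines is not claimed to agree with Regex::match.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper (not part of the Regex class): true if `target`
// contains at least one substring that matches `pattern`.
bool contains_match(const std::string& pattern, const std::string& target) {
    const std::regex re(pattern);          // default ECMAScript grammar
    return std::regex_search(target, re);  // substring search
}

// Hypothetical helper: true only if the whole of `target` matches `pattern`,
// which is the effect of writing ^ and $ around the pattern.
bool matches_whole(const std::string& pattern, const std::string& target) {
    const std::regex re(pattern);
    return std::regex_match(target, re);   // entire-string match
}
```

With these helpers, contains_match("foo", "foobar") succeeds because "foo" occurs as a substring, while contains_match("^foo$", "foobar") and matches_whole("foo", "foobar") both fail.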
Backslashes in patterns must be escaped in order to get past the C++ compiler. For example, the egrep(1) pattern \\ (literal backslash) must be constructed as follows:
    Regex("\\\\");
and the egrep(1) pattern ^(\+|-)?\.[0-9]+$ (optionally signed decimal fraction) must be constructed as follows:
    Regex("^(\\+|-)?\\.[0-9]+$");
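The doubling is done by the compiler, not by the pattern language, and can be checked on plain strings. The sketch below compares the ordinary string literal with a C++11 raw string literal, which takes backslashes literally. The variable names are hypothetical, and whether raw literals are available depends on the compiler used with this library; that is an assumption, not something the library guarantees.

```cpp
#include <cassert>
#include <string>

// Pattern text for "optionally signed decimal fraction", written two ways.

// Ordinary literal: each backslash intended for the pattern is doubled.
const std::string escaped_pattern = "^(\\+|-)?\\.[0-9]+$";

// Raw literal (C++11 and later): backslashes are taken as written.
const std::string raw_pattern = R"(^(\+|-)?\.[0-9]+$)";

// The egrep(1) pattern for a literal backslash is two characters long,
// which takes four backslashes in an ordinary literal.
const std::string backslash_pattern = "\\\\";
```

Both spellings produce the same characters, so either string could be handed to a constructor that accepts pattern text.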
Alphabetic characters in patterns are case sensitive by default.
    Regex r("foo");
    r.match("foo");  // true
    r.match("Foo");  // false
If the user wants case insensitive matching, that can be specified in the constructor.
    Regex r("foo", Regex::case_insensitive);
    r.match("foo");  // true
    r.match("Foo");  // true
The case sensitivity can also be changed at any time.
    Regex r("foo");  // case_sensitive by default
    r.set_sensitivity(Regex::case_insensitive);
    r.match("Foo");  // true
When used with character class ranges (e.g., [a-z], [0-9]), case insensitivity is applied only after range expansion. For example, the (rather unusual) range [A-c] is always first expanded into the character class { A, B, ..., Y, Z, [, \, ], ^, _, `, a, b, c } (using the ASCII collating sequence). Under case sensitive matching this matches any character in the shown set, while under case insensitive matching this matches any character in the set { A, a, B, b, ..., Y, y, Z, z, [, \, ], ^, _, ` }. Similarly, the (again rather unusual) range [a-Z] is always first expanded into the empty character class (using the ASCII collating sequence), because a follows Z in ASCII. This matches no characters under either case sensitive or case insensitive matching.
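The expand-then-fold rule can be modeled in a few lines. The sketch below is not the library's implementation; expand_range and in_class are hypothetical names, and ASCII input is assumed.

```cpp
#include <cassert>
#include <cctype>
#include <set>

// Expand the range [lo-hi] in ASCII order. The result is empty when lo
// comes after hi, as with [a-Z].
std::set<char> expand_range(char lo, char hi) {
    std::set<char> members;
    for (int c = lo; c <= hi; ++c) {
        members.insert(static_cast<char>(c));
    }
    return members;
}

// Membership test. Case folding is applied only here, after expansion:
// a character matches if it, or its other-case form, is in the set.
bool in_class(const std::set<char>& members, char c, bool case_insensitive) {
    if (members.count(c) > 0) return true;
    if (!case_insensitive) return false;
    const unsigned char uc = static_cast<unsigned char>(c);
    return members.count(static_cast<char>(std::tolower(uc))) > 0 ||
           members.count(static_cast<char>(std::toupper(uc))) > 0;
}
```

In this model, 'z' is rejected by [A-c] under case sensitive matching but accepted under case insensitive matching, because 'Z' is in the expanded set. The range [a-Z] expands to nothing, so folding has no members to work with.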
Some commonly used regular expressions come pre-defined in the library.
    ... = Regex::Int;         // ^(\+|-)?[0-9]+$
    ... = Regex::Float;       // ^(\+|-)?((\.[0-9]+)|([0-9]+(\.[0-9]*)?))$
    ... = Regex::Double;      // ^(\+|-)?((\.[0-9]+)|([0-9]+(\.[0-9]*)?))
                              // ([eE](\+|-)?[0-9]+)?$
    ... = Regex::Alpha;       // ^[A-Za-z]+$
    ... = Regex::Alphanum;    // ^[0-9A-Za-z]+$
    ... = Regex::Identifier;  // ^[A-Za-z_][A-Za-z0-9_]*$
For the sake of example, let's look at the definitions of the last four of these Regexes.
    Regex Regex::Double(
        "^(\\+|-)?((\\.[0-9]+)|([0-9]+(\\.[0-9]*)?))"
        "(e(\\+|-)?[0-9]+)?$",
        Regex::case_insensitive);
    Regex Regex::Alpha(
        "^[a-z]+$",
        Regex::case_insensitive);
    Regex Regex::Alphanumeric(
        "^[0-9a-z]+$",
        Regex::case_insensitive);
    Regex Regex::Identifier(
        "^[a-z_][a-z0-9_]*$",
        Regex::case_insensitive);
Again notice the extra backslashes needed to get past the C++ compiler.
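To see which strings the documented Int and Float patterns accept, they can be tried with the standard std::regex facility. This is a stand-in sketch only: it uses std::regex with its default ECMAScript grammar rather than the Regex class, and the function names looks_like_int and looks_like_float are hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Stand-in for the pattern documented for Regex::Int: ^(\+|-)?[0-9]+$
bool looks_like_int(const std::string& s) {
    static const std::regex re("^(\\+|-)?[0-9]+$");
    return std::regex_search(s, re);
}

// Stand-in for the pattern documented for Regex::Float:
// ^(\+|-)?((\.[0-9]+)|([0-9]+(\.[0-9]*)?))$
bool looks_like_float(const std::string& s) {
    static const std::regex re(
        "^(\\+|-)?((\\.[0-9]+)|([0-9]+(\\.[0-9]*)?))$");
    return std::regex_search(s, re);
}
```

Under these patterns "-42" is an Int, while ".5", "3." and "3" are all accepted as Float. A lone "." is rejected because at least one digit is required on one side of the decimal point.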