What’s so regular about “^(XF)?(CAT|DOG)([\d]+|[^KSCBMT]+)$” anyway?
Regular expressions are an extremely powerful tool for dealing with text input of all kinds. The name comes from formal language theory, where “regular” describes the class of patterns they can express; in practice, it means there is a very specific way of defining them, so that you can get the results you desire. Regular expressions grew up in Unix tools and were later standardized by POSIX and popularized by Perl, and they have now been adopted by many tools and toolkits across many platforms.
At their heart, regular expressions are just a way of searching text. Regular expressions are just text, but formatted in a very specific way.
For example, a simple regular expression is just a piece of text, like “cat”. You could use this expression to search other text, like so:
    ' Requires: Imports System.Text.RegularExpressions
    Dim expression As String = "cat"
    Dim someText As String = "This is a story about a dog and a cat."
    If Regex.IsMatch(someText, expression) Then
        MsgBox("I found a cat!")
    End If
If you ran this code you would get a popup saying you found a cat. That’s because the expression “cat” matches the input string someText. That’s nice, but not very impressive, since you could do a simple string search like that in any number of ways. To realize the full power of regular expressions, you need to know about the other tools in the regex toolbox.
Quantifiers
These allow you to modify the expression you are searching for. They modify the previous “atom”, which is the smallest unit of the expression that can match. Here are the quantifiers:
- * Match zero or more of the preceding atom, with no upper limit.
- ? Match zero or one of the preceding atom.
- + Match one or more of the preceding atom, with no upper limit.
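Here are some examples of how to use them. Because the search is not anchored, an expression matches a word whenever it matches any part of that word.
Expression | Match Info |
cat* | Matches “ca” followed by zero or more “t”s, so it finds a match in ca, cat, can, calamazoo, catty, cattle, etc. |
cat? | Matches “ca” or “cat”, so it finds a match in ca, cat, can, and calamazoo. It also finds one in catty and cattle, but the match covers only the first “cat”. |
cat+ | Matches “ca” followed by one or more “t”s, so it finds a match in cat, catty, and cattle, but not in ca, can, or calamazoo. |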
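To see exactly which part of each word a quantified expression picks up, a small console program can print the first match. This is only a sketch: the module name `QuantifierDemo` and the helper `FirstMatch` are made up for illustration, and it assumes the standard .NET `System.Text.RegularExpressions` namespace.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module QuantifierDemo
    ' Hypothetical helper: returns the text of the first match,
    ' or "(no match)" if the pattern is not found anywhere in the input.
    Function FirstMatch(input As String, pattern As String) As String
        Dim m As Match = Regex.Match(input, pattern)
        Return If(m.Success, m.Value, "(no match)")
    End Function

    Sub Main()
        For Each word As String In New String() {"ca", "cat", "catty", "can"}
            Console.WriteLine("{0}: cat* -> {1}, cat? -> {2}, cat+ -> {3}",
                              word,
                              FirstMatch(word, "cat*"),
                              FirstMatch(word, "cat?"),
                              FirstMatch(word, "cat+"))
        Next
    End Sub
End Module
```

For “catty”, this should report “catt” for cat* and cat+, but only “cat” for cat?, since the question mark allows at most one “t”.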
Bigger Atoms
Those quantifiers might be even more useful if you could apply them to something bigger than just one character. And you can. For example, if you wanted to apply them to the entire word “cat”, the simplest way is to surround that word with parentheses. Then you can use it like this:
Expression | Match Info |
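(cat)* | Matches any text, because “cat” can appear zero or more times. |
(cat)? | Also matches any text, because “cat” can appear zero times or once. In “my cat is a big cat”, its matches include both occurrences of “cat”, along with empty matches everywhere else. |
(cat)+ | Matches “my cat is huge” once and “cat times two is catcat” twice: first “cat”, then “catcat” as a single match. |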
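The difference between a group that may be empty and one that must appear at least once is easy to check in code. The following is a rough sketch; `GroupQuantifierDemo` and `AllMatches` are illustrative names, not part of any library.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module GroupQuantifierDemo
    ' Hypothetical helper: collects the text of every match of pattern in input.
    Function AllMatches(input As String, pattern As String) As List(Of String)
        Dim results As New List(Of String)
        For Each m As Match In Regex.Matches(input, pattern)
            results.Add(m.Value)
        Next
        Return results
    End Function

    Sub Main()
        ' (cat)+ needs at least one "cat"; adjacent repeats form a single match.
        Dim found As List(Of String) = AllMatches("cat times two is catcat", "(cat)+")
        Console.WriteLine(String.Join(", ", found))                          ' cat, catcat

        ' (cat)? and (cat)* succeed even when "cat" is absent,
        ' because they are allowed to match the empty string.
        Console.WriteLine(Regex.IsMatch("no felines here", "(cat)?"))        ' True
        Console.WriteLine(Regex.Match("no felines here", "(cat)*").Length)   ' 0
    End Sub
End Module
```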
Special Characters
Part of the power of regular expressions is that certain characters have special meaning. Here they are:
Character | Meaning |
. | A dot matches any single character (except a newline, by default). |
\n | Matches a newline (line feed) character. To match a CR/LF combination, use \r\n. |
\t | Matches a tab character |
\d | Matches a digit (characters 0 through 9) |
\D | Matches any character that is NOT a digit |
\w | Matches any alphanumeric character or an underscore |
\W | Matches any character that is NOT alphanumeric or an underscore |
\s | Matches any whitespace character |
\S | Matches any character that is NOT whitespace |
\ | Escapes special characters. For example, \. would match a literal dot. |
^ | Match the beginning of the string |
$ | Match the end of the string |
These are fairly self-explanatory, but the last two bear special mention. Matching the beginning and ending of a string allows you to tighten up what matches. For example:
Expression | Matches |
^cat | Matches any string starting with “cat” |
cat$ | Matches any string ending with “cat” |
^cat$ | Matches only strings that start with “cat”, followed immediately by end of string. In other words, only matches “cat”. |
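A few quick checks show how the anchors and escapes behave. This is a minimal sketch with made-up sample strings; `AnchorDemo` is just an illustrative module name. Note that VB string literals do not treat the backslash as an escape, so the patterns can be written as-is.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module AnchorDemo
    Sub Main()
        Console.WriteLine(Regex.IsMatch("catalog", "^cat"))      ' True: starts with "cat"
        Console.WriteLine(Regex.IsMatch("bobcat", "^cat"))       ' False: "cat" is not at the start
        Console.WriteLine(Regex.IsMatch("bobcat", "cat$"))       ' True: ends with "cat"
        Console.WriteLine(Regex.IsMatch("catalog", "^cat$"))     ' False: extra text after "cat"
        Console.WriteLine(Regex.IsMatch("cat", "^cat$"))         ' True: exactly "cat"

        ' \w, \s and \d combine naturally with anchors and quantifiers.
        Console.WriteLine(Regex.IsMatch("cat 42", "^\w+\s\d+$")) ' True

        ' An escaped dot matches only a literal dot.
        Console.WriteLine(Regex.IsMatch("cat.jpg", "cat\.jpg"))  ' True
        Console.WriteLine(Regex.IsMatch("catXjpg", "cat\.jpg"))  ' False
    End Sub
End Module
```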
Character Classes
Another common feature of regular expressions is the character class. This is simply something placed in square brackets. The expression engine then matches any character which appears inside the brackets. For example:
Expression | Matches |
[cat] | Matches any single character that is either a “c”, an “a”, or a “t” |
[0-9] | Matches any single character that is in the range “0” through “9” |
[A-Z] | Matches any uppercase letter |
[a-z] | Matches any lowercase letter |
[A-Za-z4] | Matches any letter, or the digit “4” |
[+?.*] | Matches any of the literal characters +, ?, *, or dot. Inside a character class, these have no special meaning. |
[^A-E] | Matches any character that is NOT one of “A” through “E”. The ^ inside a character class means to negate the match. |
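One way to get a feel for character classes is to pull every matching character out of a sample string. The sketch below assumes a made-up sample sentence; `CharacterClassDemo` and `Pick` are hypothetical names used only for this illustration.

```vbnet
Imports System
Imports System.Text
Imports System.Text.RegularExpressions

Module CharacterClassDemo
    ' Hypothetical helper: joins every single-character match of the class.
    Function Pick(input As String, charClass As String) As String
        Dim sb As New StringBuilder()
        For Each m As Match In Regex.Matches(input, charClass)
            sb.Append(m.Value)
        Next
        Return sb.ToString()
    End Function

    Sub Main()
        Dim sample As String = "Cat 4 sale: A+ tabby?"
        Console.WriteLine(Pick(sample, "[cat]"))    ' atata  (lowercase c, a, t only)
        Console.WriteLine(Pick(sample, "[0-9]"))    ' 4
        Console.WriteLine(Pick(sample, "[A-Z]"))    ' CA
        Console.WriteLine(Pick(sample, "[+?.*]"))   ' +?     (literal characters)
        Console.WriteLine(Pick(sample, "[^a-z ]"))  ' C4:A+? (not lowercase, not a space)
    End Sub
End Module
```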
More on Parentheses
Earlier, parentheses were shown to make bigger atoms. But they have other functions as well. A common one is to enclose alternatives. To use them in that way, just separate the alternative matches with a vertical bar character. For example:
Expression | Matches |
(cat|dog) | Matches anything with either “cat” or “dog” in it. |
(house|tom)\s*cat | Matches “house cat”, “tom cat”, “housecat”, “tomcat”, but not “alley cat” |
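The alternation example can be tried out with a short program. This is a sketch only; `AlternationDemo` and `IsHouseOrTomCat` are hypothetical names. The pattern is written with no literal space before “cat”, so that \s* alone decides whether whitespace is present.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module AlternationDemo
    ' Hypothetical helper: True if the input contains "house" or "tom",
    ' then optional whitespace, then "cat".
    Function IsHouseOrTomCat(input As String) As Boolean
        Return Regex.IsMatch(input, "(house|tom)\s*cat")
    End Function

    Sub Main()
        For Each phrase As String In New String() {"house cat", "tomcat", "alley cat"}
            Console.WriteLine("{0}: {1}", phrase, IsHouseOrTomCat(phrase))
        Next
    End Sub
End Module
```

The first two phrases should print True and the last one False.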
Parentheses can also be used to extract text from the string for further processing. How you do that is dependent on the language you are using. For example, in VB, you would do something like this:
    ' Requires: Imports System.Text.RegularExpressions
    Dim sentence As String = "I went to Catmandu with my house cat on a catamaran"
    For Each m As Match In Regex.Matches(sentence, "(cat\w*)", RegexOptions.IgnoreCase)
        Console.WriteLine("'{0}' found at index {1}.", m.Value, m.Index)
    Next
That would match the words Catmandu, cat, and catamaran, in that order. Note the use of the IgnoreCase option. Regular expressions are normally case sensitive, so “cat” would not normally match “Catmandu”. Also note that regular expressions are normally “greedy”, meaning they will match the largest text possible. So if you search the sentence in the example above for the expression “cat.*” you would find only one match, containing everything from “Catmandu” to the end of the string. That’s why the example used “\w*”, which matches only letters, digits, and underscores, so each match stops at the space after the word.
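To compare the greedy “cat.*” search with a word-limited one, and to pull out just the parenthesized part of a match, something along these lines could be used. It is a sketch under assumptions: `CaptureDemo` and `MatchTexts` are illustrative names, and the last pattern is an extra example that is not taken from the text above.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module CaptureDemo
    ' Hypothetical helper: text of every case-insensitive match of pattern in input.
    Function MatchTexts(input As String, pattern As String) As List(Of String)
        Dim results As New List(Of String)
        For Each m As Match In Regex.Matches(input, pattern, RegexOptions.IgnoreCase)
            results.Add(m.Value)
        Next
        Return results
    End Function

    Sub Main()
        Dim sentence As String = "I went to Catmandu with my house cat on a catamaran"

        ' Greedy .* runs to the end of the string, so there is a single match.
        Console.WriteLine(MatchTexts(sentence, "cat.*").Count)                ' 1
        ' \w* stops at the first non-word character, giving one match per word.
        Console.WriteLine(String.Join(", ", MatchTexts(sentence, "cat\w*")))  ' Catmandu, cat, catamaran

        ' Groups(0) is the whole match; Groups(1) is the first parenthesized part.
        Dim m As Match = Regex.Match(sentence, "(\w+)\scat\s")
        Console.WriteLine(m.Groups(1).Value)                                  ' house
    End Sub
End Module
```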
There are more advanced features of regular expressions, to be sure, but these are the most common and will get you started.
And that mess at the top of this page? Here’s what it means:
“^(XF)?(CAT|DOG)([\d]+|[^KSCBMT]+)$”
Match any string that starts with CAT or DOG, optionally preceded by “XF”, followed by either one or more digits or one or more characters that are not K, S, C, B, M, or T, followed by the end of the string. Why you would want to match that particular combination, I have no idea, but that’s what it does. With case-insensitive matching turned on, it would match things like “cat24”, “dognapper”, and “xfcat7j”, but not things like “chihuahua”, “xfdog”, or “catblt”. Without that option, only uppercase input such as “CAT24” would match.
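The explanation can be checked by running the expression against those sample strings. The sketch below assumes the IgnoreCase option, since the samples are lowercase while the expression is uppercase; `TopPatternDemo` and `MatchesTopPattern` are hypothetical names.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module TopPatternDemo
    Private Const TopPattern As String = "^(XF)?(CAT|DOG)([\d]+|[^KSCBMT]+)$"

    ' Hypothetical helper: tests the expression from the top of the page,
    ' ignoring case (an assumption made so the lowercase samples can match).
    Function MatchesTopPattern(input As String) As Boolean
        Return Regex.IsMatch(input, TopPattern, RegexOptions.IgnoreCase)
    End Function

    Sub Main()
        Dim samples As String() = {"cat24", "dognapper", "xfcat7j",
                                   "chihuahua", "xfdog", "catblt"}
        For Each candidate As String In samples
            Console.WriteLine("{0}: {1}", candidate, MatchesTopPattern(candidate))
        Next

        ' Without IgnoreCase, the lowercase input no longer matches.
        Console.WriteLine(Regex.IsMatch("cat24", TopPattern))   ' False
        Console.WriteLine(Regex.IsMatch("CAT24", TopPattern))   ' True
    End Sub
End Module
```

The first three samples should print True and the last three False.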