Regular Expressions

What’s so regular about “^(XF)?(CAT|DOG)([\d]+|[^KSCBMT]+)$” anyway?

Regular expressions are an extremely powerful tool for dealing with text input of all kinds. The name comes from formal language theory, where a “regular” language is one that can be described by a small, strictly defined set of pattern rules. For practical purposes, it means there is a very specific way of writing them, so that you can get the results you desire. The syntax most people use today was shaped by Unix tools, POSIX, and Perl, and has since been adopted by many tools and toolkits across many platforms.

At their heart, regular expressions are just a way of searching text. The expressions themselves are also just text, written in a very specific format.

For example, a simple regular expression is just a piece of text, like “cat”. You could use this expression to search other text, like so:

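   ' Requires: Imports System.Text.RegularExpressions
   Dim expression = "cat"
   Dim someText = "This is a story about a dog and a cat."
   ' IsMatch returns True if the expression occurs anywhere in the text.
   If Regex.IsMatch(someText, expression) Then
      MsgBox("I found a cat!")
   End If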

If you ran this code you would get a popup saying you found a cat. That’s because the expression “cat” matches the input string someText. That’s nice, but not very impressive, since you could do a simple string search like that in any number of ways. To realize the full power of regular expressions, you need to know about the other tools in the regex toolbox.

Quantifiers

Quantifiers control how many times part of an expression may repeat. Each one applies to the preceding “atom”, which is the smallest unit of the expression that can match (by default, a single character). Here are the quantifiers:

  1. * Match zero or more of the preceding atom, with no upper limit.
  2. ? Match zero or one of the preceding atom.
  3. + Match one or more of the preceding atom, with no upper limit.

Here are some examples of how to use them.

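Expression   Match Info
cat*         “ca” followed by zero or more “t”s. Finds a match in ca, can, calamazoo, cat, catty, cattle, etc.
cat?         “ca” followed by at most one “t”. Finds a match in the same words as cat*, but consumes at most one “t”: in catty or cattle it matches just “cat”.
cat+         “ca” followed by one or more “t”s. Finds a match in cat, catty, and cattle, but not in ca, can, or calamazoo.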
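
To see exactly which text each quantifier consumes, here is a minimal sketch in VB. The module name QuantifierSketch and the helper FirstMatch are made up for this illustration; Regex.Match and Match.Value are the standard .NET calls.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module QuantifierSketch
    ' Hypothetical helper: returns the text of the first match, or "(no match)".
    Function FirstMatch(input As String, pattern As String) As String
        Dim m As Match = Regex.Match(input, pattern)
        Return If(m.Success, m.Value, "(no match)")
    End Function

    Sub Main()
        For Each pattern As String In New String() {"cat*", "cat?", "cat+"}
            For Each word As String In New String() {"ca", "cat", "cattle", "can"}
                Console.WriteLine("{0} on {1}: {2}", pattern, word, FirstMatch(word, pattern))
            Next
        Next
    End Sub
End Module
```

On “cattle”, cat* and cat+ both consume “catt”, while cat? stops at “cat”. On “can”, cat* and cat? match just “ca”, and cat+ finds nothing.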

Bigger Atoms

Those quantifiers might be even more useful if you could apply them to something bigger than just one character. And you can. For example, if you wanted to apply them to the entire word “cat”, the simplest way is to surround that word with parentheses. Then you can use it like this:

Expression   Match Info
(cat)*       Matches any text, because “cat” can appear zero or more times.
(cat)?       Also matches any text, because “cat” is optional. Finds “cat” once in “my cat has fleas” and twice in “my cat is a big cat”.
(cat)+       Matches “my cat is huge” once and “cat times two is catcat” twice (“cat” and “catcat”).
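
Here is a small sketch of grouped quantifiers in VB. The names GroupSketch and NonEmptyMatches are invented for the example. Because (cat)* and (cat)? are also satisfied by nothing at all, they produce empty matches between the real ones, so this helper skips those.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module GroupSketch
    ' Hypothetical helper: collects every match that is not an empty string.
    Function NonEmptyMatches(input As String, pattern As String) As List(Of String)
        Dim found As New List(Of String)
        For Each m As Match In Regex.Matches(input, pattern)
            If m.Length > 0 Then found.Add(m.Value)
        Next
        Return found
    End Function

    Sub Main()
        ' (cat)+ treats "catcat" as a single match of the repeated group.
        Console.WriteLine(String.Join(", ", NonEmptyMatches("cat times two is catcat", "(cat)+")))
        ' prints: cat, catcat

        ' (cat)? takes at most one "cat" per match.
        Console.WriteLine(String.Join(", ", NonEmptyMatches("my cat is a big cat", "(cat)?")))
        ' prints: cat, cat
    End Sub
End Module
```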

Special Characters

Part of the power of regular expressions is that certain characters have special meaning. Here they are:

Character   Meaning
.           A dot matches any single character (in most engines, any character except a newline, unless you turn on a “single line” option).
\n          Matches a newline (line feed) character. A Windows CR/LF pair is written \r\n.
\t          Matches a tab character
\d          Matches a digit (characters 0 through 9)
\D          Matches any character that is NOT a digit
\w          Matches any “word” character: a letter, a digit, or an underscore
\W          Matches any character that is NOT a word character
\s          Matches any whitespace character
\S          Matches any character that is NOT whitespace
\           Escapes special characters. For example, \. would match a literal dot.
^           Match the beginning of the string
$           Match the end of the string

These are fairly self-explanatory, but the last two bear special mention. Matching the beginning and ending of a string allows you to tighten up what matches. For example:

Expression   Matches
^cat         Matches any string starting with “cat”
cat$         Matches any string ending with “cat”
^cat$        Matches only strings that start with “cat”, followed immediately by end of string. In other words, only matches “cat”.
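
A quick way to get a feel for the anchors is to run the same few words through each version of the expression. This is only a sketch. The module name AnchorSketch and the sample words are my own choices.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module AnchorSketch
    Sub Main()
        Dim patterns() As String = {"cat", "^cat", "cat$", "^cat$"}
        Dim inputs() As String = {"cat", "catalog", "bobcat", "concatenate"}

        ' Print one line per pattern/input pair: True if the pattern matches.
        For Each p As String In patterns
            For Each s As String In inputs
                Console.WriteLine("{0,-6} {1,-12} {2}", p, s, Regex.IsMatch(s, p))
            Next
        Next
    End Sub
End Module
```

The unanchored “cat” matches all four words, “^cat” matches only “cat” and “catalog”, “cat$” matches only “cat” and “bobcat”, and “^cat$” matches only “cat”.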

Character Classes

Another common feature of regular expressions is the character class. This is simply a set of characters placed in square brackets. The expression engine then matches any single character that appears inside the brackets. For example:

Expression   Matches
[cat]        Matches any single character that is either a “c”, an “a”, or a “t”
[0-9]        Matches any single character that is in the range “0” through “9”
[A-Z]        Matches any uppercase letter
[a-z]        Matches any lowercase letter
[A-Za-z4]    Matches any letter, or the digit “4”
[+?.*]       Matches any of the literal characters +, ?, *, or dot. Inside a character class, these have no special meaning.
[^A-E]       Matches any character that is NOT one of “A” through “E”. The ^ at the start of a character class means to negate the match.
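
The sketch below lists every character a class picks out of a sample string. The module CharClassSketch, the helper Hits, and the sample strings are all made up for the illustration.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module CharClassSketch
    ' Hypothetical helper: joins every match with spaces so the hits are easy to read.
    Function Hits(input As String, pattern As String) As String
        Dim found As New List(Of String)
        For Each m As Match In Regex.Matches(input, pattern)
            found.Add(m.Value)
        Next
        Return String.Join(" ", found)
    End Function

    Sub Main()
        Console.WriteLine(Hits("tractor", "[cat]"))            ' prints: t a c t
        Console.WriteLine(Hits("Route 66, Exit 4B", "[0-9]"))  ' prints: 6 6 4
        Console.WriteLine(Hits("1+1=2. Really?*", "[+?.*]"))   ' prints: + . ? *
        Console.WriteLine(Hits("ABCDEFG", "[^A-E]"))           ' prints: F G
    End Sub
End Module
```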

More on Parentheses

Earlier, parentheses were shown to make bigger atoms. But they have other functions as well. A common one is to enclose alternatives. To use them that way, just separate the alternative matches with a vertical bar character. For example:

Expression          Matches
(cat|dog)           Matches anything with either “cat” or “dog” in it.
(house|tom)\s*cat   Matches “house cat”, “tom cat”, “housecat”, “tomcat”, but not “alley cat”
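
Here is a rough sketch of the alternation example in VB. The function name IsHouseOrTomCat is invented for this post. Note that the pattern is written with no literal space after \s*, because a space there would have to be matched too, and “housecat” would then fail.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module AlternationSketch
    ' True when the phrase contains "house" or "tom", optional whitespace, then "cat".
    Function IsHouseOrTomCat(phrase As String) As Boolean
        Return Regex.IsMatch(phrase, "(house|tom)\s*cat")
    End Function

    Sub Main()
        For Each phrase As String In New String() {"house cat", "tomcat", "alley cat"}
            Console.WriteLine("{0}: {1}", phrase, IsHouseOrTomCat(phrase))
        Next
    End Sub
End Module
```

The first two phrases print True and “alley cat” prints False.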

Parentheses can also be used to extract text from the string for further processing. How you do that depends on the language you are using. For example, in VB, you would do something like this:

   ' Requires: Imports System.Text.RegularExpressions
   Dim sentence = "I went to Catmandu with my house cat on a catamaran"
   For Each m As Match In Regex.Matches(sentence, "(cat\w*)", RegexOptions.IgnoreCase)
      Console.WriteLine("'{0}' found at index {1}.", m.Value, m.Index)
   Next

That would match the words Catmandu, cat, and catamaran, in that order. Note the use of the IgnoreCase option. Regular expressions are normally case sensitive, so “cat” would not normally match “Catmandu”. Also note that quantifiers are normally “greedy”, meaning they will match as much text as possible. So if you search the sentence in the example above for the expression “cat.*” you would find only one match, containing everything from “Catmandu” to the end of the string. That’s why the example used “\w*” instead: it matches only word characters, so each match stops at the space that ends the word.
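
To make the greedy behavior concrete, here is a sketch that counts matches for “cat.*” and “cat\w*” on the same sentence, then pulls out the pieces captured by each pair of parentheses. The names GreedySketch, Sentence, and CountMatches, along with the two-group pattern “(cat)(\w*)”, are my own additions for illustration.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module GreedySketch
    Const Sentence As String = "I went to Catmandu with my house cat on a catamaran"

    ' Hypothetical helper: number of case-insensitive matches in the sentence.
    Function CountMatches(pattern As String) As Integer
        Return Regex.Matches(Sentence, pattern, RegexOptions.IgnoreCase).Count
    End Function

    Sub Main()
        ' Greedy .* runs to the end of the string, so there is only one match.
        Console.WriteLine(CountMatches("cat.*"))   ' prints 1
        ' \w* stops at the first non-word character, so each word matches separately.
        Console.WriteLine(CountMatches("cat\w*"))  ' prints 3

        ' Groups(1) and Groups(2) hold the text captured by each pair of parentheses.
        For Each m As Match In Regex.Matches(Sentence, "(cat)(\w*)", RegexOptions.IgnoreCase)
            Console.WriteLine("whole='{0}' first='{1}' rest='{2}'",
                              m.Value, m.Groups(1).Value, m.Groups(2).Value)
        Next
    End Sub
End Module
```

For “Catmandu” the loop reports first='Cat' and rest='mandu'. For the standalone “cat” the second group is empty.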

There are more advanced features of regular expressions, to be sure, but these are the most common and will get you started.

And that mess at the top of this page? Here’s what it means:

“^(XF)?(CAT|DOG)([\d]+|[^KSCBMT]+)$”

Match any string that starts with CAT or DOG, optionally preceded by “XF”, followed by either one or more digits or one or more characters that are not K, S, C, B, M, or T, followed by the end of the string. Why you would want to match that particular combination, I have no idea, but that’s what it does. Because the expression is written in uppercase and matching is case sensitive by default, it would match things like “CAT24”, “DOGNAPPER”, and “XFCAT7J”, but not things like “CHIHUAHUA”, “XFDOG”, or “CATBLT”.
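
If you want to check that reading for yourself, here is a small sketch that runs the expression against a handful of candidates. The module name TopOfPageSketch and the function IsAccepted are hypothetical. The inputs are uppercase because the expression is uppercase and no IgnoreCase option is used.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module TopOfPageSketch
    Const Pattern As String = "^(XF)?(CAT|DOG)([\d]+|[^KSCBMT]+)$"

    ' Hypothetical helper: True when the whole input fits the pattern.
    Function IsAccepted(input As String) As Boolean
        Return Regex.IsMatch(input, Pattern)
    End Function

    Sub Main()
        Dim candidates() As String = {"CAT24", "DOGNAPPER", "XFCAT7J", "CHIHUAHUA", "XFDOG", "CATBLT"}
        For Each candidate As String In candidates
            Console.WriteLine("{0,-10} {1}", candidate, IsAccepted(candidate))
        Next
    End Sub
End Module
```

The first three candidates print True and the last three print False. “XFDOG” fails because the final group needs at least one character, and “CATBLT” fails because of the B and the T.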
