Wednesday, April 20, 2011

Using regular expressions

Regular expressions can be used for many things; however, they are typically used for input validation or to perform advanced searches on text in supporting applications.  This first article will explain how to create a regular expression pattern; the expression defines what is considered a match.  The second article will provide details on how to implement regular expressions in .NET applications.
Prior to starting, I would like to point out a free regular expression tester available from my website; you can use this to test the behavior of your regular expression. During the second article I will discuss the specific options available on this test page as well as how the page was created. You can find the tester here!
Regular expressions use three basic types of symbols: metacharacters, escaped characters, and character classes. The entries below list each important metacharacter with a short description and an example.


^ : Indicates the start of a string; used to match a specific beginning sequence. Example: ^abc matches abc, abc123, abcdefg
$ : Indicates the end of a string; used to match a specific ending sequence. Example: abc$ matches 123456789abc, 987abc
. : Any character excluding \n (new line). Example: a.c matches abc, aac, a9c
| : Or operator; used to specify one criterion or another. Example: john|jane matches jane, john
* : Zero or more of the previous expression. Example: 12c* matches 12, 12c, 12cc
+ : One or more of the previous expression. Example: 1a+c matches 1ac, 1aac
? : Zero or one of the previous expression. Example: 12?c matches 1c, 12c
\ : Escape character; used to make any of the special characters (^, $, ., |, *, +, ?, (, [, {, etc.) literal for matching. See the next chart for other escape characters. Example: 1\*a matches 1*a
{....} : Explicit quantifier notation; used to indicate a specific number of occurrences of a character or character class. A comma can be added to provide min/max occurrences. Example: 12a{2} matches 12aa, 12aa3
[....] : Matches one character from a set; you can provide collections of characters (abcdefg) as well as hyphenated ranges of characters (A-Z). Example: 123[abc] matches 123a, 123b, 123c
(....) : Groups a portion of the expression; used to group sections for display or so that a quantifier applies to the whole group. Example: (123){2} matches 123123
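
The metacharacter examples above can be checked with a few lines of code. The sketch below uses Python's re module purely for illustration (an assumption on my part; the second article covers .NET), and the non-matching inputs are extra samples added to show what each pattern rejects.

```python
import re

# (pattern, inputs that should match, inputs that should not match).
# The non-matching inputs are illustrative additions, not from the list above.
EXAMPLES = [
    (r"^abc", ["abc", "abc123", "abcdefg"], ["acb123", "xabc"]),
    (r"abc$", ["123456789abc", "987abc"], ["abc1"]),
    (r"a.c", ["abc", "aac", "a9c"], ["ac"]),
    (r"john|jane", ["jane", "john"], ["joan"]),
    (r"12c*", ["12", "12c", "12cc"], ["1c"]),
    (r"1a+c", ["1ac", "1aac"], ["1c"]),
    (r"12?c", ["1c", "12c"], ["122c"]),
    (r"1\*a", ["1*a"], ["1a", "11a"]),
    (r"12a{2}", ["12aa", "12aa3"], ["12a"]),
    (r"123[abc]", ["123a", "123b", "123c"], ["123d"]),
    (r"(123){2}", ["123123"], ["123"]),
]


def is_match(pattern, text):
    """Return True if the pattern is found anywhere in text."""
    return re.search(pattern, text) is not None


for pattern, hits, misses in EXAMPLES:
    for text in hits + misses:
        print(f"{pattern!r:14} {text!r:16} {is_match(pattern, text)}")
```

re.search is used here because most of these patterns are not anchored at both ends, so a match anywhere in the input counts (which is why 12a{2} matches 12aa3).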

Escaped Characters

The characters listed below are used to match special characters in regular expressions. We will use some of these later in this article.
NOTE: This is not a complete list of escape characters, but a list of commonly used ones.
\b : Word boundary; matches the position between a word character and a non-word character (such as a space), which marks the start or end of a word
\t : Tab character
\n : New line character (great for multi-line textboxes)
\(any metacharacter) : Matches that metacharacter literally (e.g. \* matches *, \$ matches $)
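
As a rough illustration of these escapes, here is a small sketch using Python's re module (chosen only for demonstration; the sample strings are made up). It shows \b restricting a match to a whole word, \t and \n matching tab and new line characters, and a backslash turning $ and . into literals.

```python
import re

# \b keeps "cat" from matching inside a longer word.
whole_word = re.compile(r"\bcat\b")
found_word = whole_word.search("the cat sat") is not None
found_inside = whole_word.search("concatenate") is not None

# \t matches a tab character; \n matches a new line character.
has_tab = re.search(r"a\tb", "a\tb") is not None
lines = re.split(r"\n", "line one\nline two")

# \$ and \. match a literal dollar sign and a literal period.
price = re.compile(r"\$5\.00")
literal_ok = price.search("cost: $5.00") is not None
literal_bad = price.search("cost: $5x00") is not None

print(found_word, found_inside, has_tab, lines, literal_ok, literal_bad)
```

Without the backslash, the . in the price pattern would accept any character, so "$5x00" would slip through.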

Character Classes

The character classes below represent different groups of characters, making it easier to match on common groups.
. : Matches any character except for \n. If the Single Line option is enabled it matches ANY character. Example: a.c matches aac, abc, a1c
[rstlne] : Matches any single character in the provided list. Example: a[rstlne] matches ar, as, al
[^aeiou] : Matches any single character NOT in the provided list. Example: a[^aeiou] matches ab, ad, ah
[0-9a-zA-Z] : Matches any single character in the following ranges (0 through 9, a through z, and A through Z). The hyphen indicates a range element. Example: 123[0-9A-F] matches 123A, 1234
\w : Matches any word character; in ECMAScript mode this matches [a-zA-Z_0-9]. Example: 123\w matches 123a, 1234
\W : Matches any NON word character; in ECMAScript mode this is the same as [^a-zA-Z_0-9]. Example: 123\W matches 123$, 123-
\s : Matches any whitespace character; in ECMAScript mode this matches spaces, tabs, and new lines. Example: 123\sa matches 123 a
\S : Matches any NON whitespace character. Example: 1\Sa matches 14a, 1ba
\d : Matches any digit character; in ECMAScript mode this matches 0-9. Example: \d2 matches 12, 32
\D : Matches any NON digit character; in ECMAScript mode this matches anything that is not 0-9. Example: \D2 matches a2, b2
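
The character classes above can be tried out with the following sketch. It uses Python's re module for illustration only, so two flags stand in for the options mentioned above: re.DOTALL plays the role of the Single Line option, and re.ASCII limits \w, \d, and \s to ASCII characters, which is roughly comparable to the ECMAScript behavior described. The helper name `matches` is my own.

```python
import re


def matches(pattern, text, flags=0):
    """Return True if the pattern is found anywhere in text."""
    return re.search(pattern, text, flags) is not None


results = {
    "list hit": matches(r"a[rstlne]", "as"),
    "list miss": matches(r"a[rstlne]", "ab"),
    "negated hit": matches(r"a[^aeiou]", "ab"),
    "negated miss": matches(r"a[^aeiou]", "ae"),
    "range hit": matches(r"123[0-9A-F]", "123A"),
    "range miss": matches(r"123[0-9A-F]", "123G"),
    # \w includes the underscore as well as letters and digits.
    "word underscore": matches(r"123\w", "123_", re.ASCII),
    "non-word": matches(r"123\W", "123$"),
    "whitespace tab": matches(r"123\sa", "123\ta"),
    "non-whitespace miss": matches(r"1\Sa", "1 a"),
    "digit": matches(r"\d2", "32", re.ASCII),
    "non-digit": matches(r"\D2", "a2"),
    # The dot skips \n unless the "single line" style flag is set.
    "dot newline": matches(r"a.c", "a\nc"),
    "dot newline DOTALL": matches(r"a.c", "a\nc", re.DOTALL),
}

for name, value in results.items():
    print(f"{name:22} {value}")
```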

How To Apply This Information

Now that we have explained the various characters used in regular expressions, let's walk through some practical examples to illustrate how all of these items are pulled together.  In the following subsections I will walk you through a series of real world validations and provide examples with detailed information.
Prior to beginning the examples I do want to point out that in ALL of my examples the regular expressions start with the ^ character and end with the $ character. This is done to ensure that the expression matches the entire string, and ONLY that string. Otherwise you can receive matches for strings that contain extra characters before or after the intended input. You may play around with this using my expression tester to see the effects of omitting the ^ and $ characters.
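
To see the effect of the ^ and $ characters in code, here is a minimal sketch using Python's re module (an illustrative choice; the five-digit pattern and sample inputs are my own). The unanchored pattern finds five digits buried inside a longer string, while the anchored version accepts only a string that is exactly five digits.

```python
import re

UNANCHORED = r"\d{5}"
ANCHORED = r"^\d{5}$"

# Five digits surrounded by other characters.
noisy_input = "abc12345xyz"

unanchored_hit = re.search(UNANCHORED, noisy_input) is not None
anchored_hit = re.search(ANCHORED, noisy_input) is not None
exact_hit = re.search(ANCHORED, "12345") is not None

print(unanchored_hit)  # True: a match is found inside the longer string
print(anchored_hit)    # False: the whole string must be five digits
print(exact_hit)       # True
```

One Python-specific detail: $ also matches just before a trailing new line, so re.fullmatch(r"\d{5}", text) is a stricter alternative when that matters.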
Postal Code Validation
Postal code validation is a very common form of user input validation. Typically a postal code will be either 5 digits, or 9 digits with a hyphen after the 5th digit.  We can validate this input with the following expression:
^\d{5}(-\d{4})?$
First we have the "\d{5}" portion of the expression, which indicates that the input must start with five digit characters (0-9). Next, the portion of the expression inside the parentheses, "-\d{4}", indicates a - followed by four digit characters. This is grouped within parentheses and has a question mark appended to the end. The question mark indicates that the input should have zero or one of the preceding item, which happens to be the entire expression in the parentheses. Therefore, in the case of zero the input would simply be five digit characters; in the case of one the input would be five digits, a hyphen, and four more digits.
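
Putting those pieces together with the ^ and $ anchors gives ^\d{5}(-\d{4})?$. The sketch below tries that expression against a few inputs using Python's re module (for illustration only; the function name and sample values are hypothetical).

```python
import re

# Five digits, optionally followed by a hyphen and four more digits.
POSTAL_CODE = re.compile(r"^\d{5}(-\d{4})?$")


def is_valid_postal_code(text):
    """Return True if text is 5 digits or 5 digits, a hyphen, and 4 digits."""
    return POSTAL_CODE.search(text) is not None


for sample in ["12345", "12345-6789", "1234", "123456", "12345-678", "12345 6789"]:
    print(f"{sample!r:14} {is_valid_postal_code(sample)}")
```

Only the first two samples pass; the rest fail because of a wrong digit count or a missing hyphen.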
Simple Date Validation
Validation of date input is another very common occurrence. Full regular expression date validation is very involved; however, it is very easy to restrict users to a MM/DD/YYYY format with basic checking for incorrect input.  Below is a regular expression to validate a date in the MM/DD/YYYY format. I have added parentheses for readability.
^([01]\d)/([0-3]\d)/(\d{4})$
The first section of this expression, "([01]\d)", represents the month portion of our date. Since there are only 12 months in the year, we restrict the first digit to either a zero or a one, and the second character can be any number 0-9.  This is one portion of this example that can be improved upon: you can create regular expressions that are capable of validating that the input is between 1 and 12; however, this is outside the scope of this article.

The second section of this expression, "([0-3]\d)", represents the day portion of our date.  This is separated from our first part by a / character, which is a literal requirement that the month be separated from the day by a forward slash.  The first part of our day check requires that the first digit of the day is a 0, 1, 2, or 3; the second digit can be any number 0-9.  Just as with the month portion, this can be expanded to ensure that the day value is appropriate for the month provided; however, it is outside the scope of this article.

The final section of this expression is again separated by a / character, then it allows for 4 digit characters to be entered.  This forms the year, the final portion of the date.
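
Assembled from the three sections described above, the expression is ^([01]\d)/([0-3]\d)/(\d{4})$. The following sketch runs it with Python's re module (an illustrative choice; the helper name and samples are hypothetical) and also shows the limitation already mentioned: values such as month 13 or day 39 still pass, because only the format and first digits are checked.

```python
import re

# MM/DD/YYYY with basic first-digit checks on month and day.
SIMPLE_DATE = re.compile(r"^([01]\d)/([0-3]\d)/(\d{4})$")


def parse_simple_date(text):
    """Return (month, day, year) strings, or None if the format is wrong."""
    match = SIMPLE_DATE.search(text)
    return match.groups() if match else None


print(parse_simple_date("04/20/2011"))  # ('04', '20', '2011')
print(parse_simple_date("4/20/2011"))   # None: month needs two digits
print(parse_simple_date("04-20-2011"))  # None: separator must be /
print(parse_simple_date("13/39/2011"))  # passes the format check anyway
```

The parentheses double as capture groups here, so the month, day, and year come back as separate values.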
Phone Number Validation
Another common input item to validate is a phone number, including area code and extension. Below you will find a sample regular expression that validates a phone number that meets one of the following formats: (555) 555-1212, 555-555-1212, (555) 555-1212 x1111, or 555-555-1212 x1111. Portions of the expression have been highlighted to illustrate the different sections of logic. These sections will be explained below.
The first portion of this expression (highlighted in yellow) validates the area code input. Notice that we have two individual groups separated by the or operator |. This indicates that one of the two expressions must be true.  The first one validates on a left parenthesis (, three digits, a right parenthesis ), and a space; the second option validates on three digits and a space.  Therefore the phone number must begin with either "(555) " or "555 ". This validates the area code portion of our phone number.

The next portion of this expression (highlighted in green) validates the remaining portion of the standard phone number.  The first part, "\d{3}", requires three digits, then the "[\s-]" allows for either a space or a hyphen.  This is then followed by the "\d{4}" portion, which indicates that an additional 4 digits are required.  We now have validation for a standard 10 digit phone number with support for multiple formats.

The last portion of this expression (highlighted in gray) validates the optional telephone extension.  The expression "\sx\d+" indicates that the input string should have a space, the letter x, and then one or more digits.  This is enclosed in parentheses and followed by a question mark to indicate that it is optional.  This provides for validation of numbers such as (555) 555-1212 x102.
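
The three portions described above can be sketched as one expression. Note that this is my own reconstruction, not necessarily the exact expression originally shown: in particular, the second area code alternative below accepts either a space or a hyphen after the three digits (an assumption, so that the 555-555-1212 format listed earlier passes). Python's re module is used only for illustration.

```python
import re

PHONE = re.compile(
    r"^(\(\d{3}\)\s|\d{3}[\s-])"  # area code: "(555) " or "555-" / "555 " (assumed)
    r"\d{3}[\s-]\d{4}"            # three digits, a space or hyphen, four digits
    r"(\sx\d+)?$"                 # optional extension: space, x, one or more digits
)


def is_valid_phone(text):
    """Return True if text matches one of the supported phone formats."""
    return PHONE.search(text) is not None


samples = [
    "(555) 555-1212",
    "555-555-1212",
    "(555) 555-1212 x1111",
    "555-555-1212 x1111",
    "5555551212",
    "(555)555-1212",
    "555-555-1212 x",
]
for sample in samples:
    print(f"{sample!r:24} {is_valid_phone(sample)}")
```

The first four samples are the formats listed above and pass; the last three fail because of a missing separator, a missing space after the area code, and an extension with no digits.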
