An example-driven guide to regular expressions

The problem

You have a string of text and need either to check that it fits a validation pattern or to extract information from it.

For validation, you might want to know whether a given input is a valid currency amount such as £100, so that you can prompt the user to correct it before you process a transaction.

For parsing, you might want to get the version of a user's web browser from a User-Agent string like these:

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.172 Safari/537.22
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:5.0) Gecko/20100101 Firefox/5.0
Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)

To solve both problems we either need to split the text and manually check for conditions in our code, or we can use a regular expression.
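To show what the manual route looks like, here is a rough sketch in Python (the language is an assumption; the guide itself is language-neutral). The helper name ie_version_by_splitting is hypothetical and only handles the 'MSIE ' marker seen in the sample strings above.

```python
def ie_version_by_splitting(user_agent):
    """Hypothetical helper: find the MSIE version without a regular expression."""
    marker = "MSIE "
    start = user_agent.find(marker)
    if start == -1:
        return None  # no 'MSIE ' marker, so this is not an IE string
    start += len(marker)
    # The version runs up to the next ';' (or to the end if there is none).
    end = user_agent.find(";", start)
    if end == -1:
        end = len(user_agent)
    return user_agent[start:end]


ie = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)"
firefox = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:5.0) Gecko/20100101 Firefox/5.0"

print(ie_version_by_splitting(ie))       # 8.0
print(ie_version_by_splitting(firefox))  # None
```

It works, but every new browser or format needs more hand-written index arithmetic, which is the bookkeeping a regular expression takes off our hands.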

What is a regular expression?

Regular expressions are a domain-specific language (DSL) for describing patterns in text. Using one involves two parts: a target string and the regular expression itself. The regular expression is a little like the wildcard patterns you use to search for files. They look scary at first, but you only need to know a few rules to get the most out of them.

Given a target string of 'Mississippi' and a regular expression of /s/, we get a match because the target string contains at least one 's'. That is a very simple example; real patterns usually combine several regular expression features, as in /^d\w[uiop](in|vi)[^a-f]*$/, which matches 'driving'.
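To try both patterns out, here is a minimal sketch using Python's standard re module (choosing Python is an assumption; most languages offer an equivalent). Python has no /.../ literal syntax, so the patterns are written as raw strings without the slashes.

```python
import re

# /s/ : does the target contain at least one 's'?
simple = re.search(r"s", "Mississippi")
print(simple is not None)  # True

# The longer pattern from the text, applied to 'driving'.
complex_pattern = re.compile(r"^d\w[uiop](in|vi)[^a-f]*$")
driving = complex_pattern.search("driving")
print(driving is not None)  # True

# A word that does not start with 'd' fails at the very first step.
walking = complex_pattern.search("walking")
print(walking is not None)  # False
```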

How does it work?

A regular expression is made up of literal characters, metacharacters and escape sequences.

A literal is what we saw in the Mississippi example above: /s/ literally means the target string contains an 's' somewhere.

A metacharacter is a character that has a special meaning inside a regular expression rather than a literal one. For example, a caret '^' indicates that the regular expression must match from the start of the line, meaning /^s/ would no longer match 'Mississippi' but /^M/ would (matching is case-sensitive by default, so /^m/ would not).
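The effect of the caret can be checked with a short sketch, again assuming Python's re module. Matching is case-sensitive by default, so the anchored pattern that succeeds uses a capital 'M'.

```python
import re

target = "Mississippi"

# Without an anchor, 's' can match anywhere in the target.
anywhere = bool(re.search(r"s", target))
print(anywhere)  # True

# With '^', the pattern must match at the very start.
anchored_s = bool(re.search(r"^s", target))
print(anchored_s)  # False: the target starts with 'M', not 's'

anchored_upper_m = bool(re.search(r"^M", target))
print(anchored_upper_m)  # True

anchored_lower_m = bool(re.search(r"^m", target))
print(anchored_lower_m)  # False: lowercase 'm' is not the same as 'M'
```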

Finally, an escape sequence converts a metacharacter into a literal when the need arises. For example, the dollar sign '$' has a special meaning, so to search for it literally we escape it by putting a backslash in front: /\$/.
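Here is a small sketch of the difference escaping makes, assuming Python's re module. The sample strings are made up for illustration.

```python
import re

price = "Total: $100"
no_dollar = "Total: 100"

# Escaped: '\$' matches a literal dollar sign.
escaped_hit = bool(re.search(r"\$", price))
escaped_miss = bool(re.search(r"\$", no_dollar))
print(escaped_hit)   # True
print(escaped_miss)  # False

# Unescaped: '$' is the end-of-string anchor, so it "matches" the empty
# position at the end even when the text contains no dollar sign at all.
unescaped = bool(re.search(r"$", no_dollar))
print(unescaped)     # True
```

The last result is the trap: forgetting the backslash does not raise an error, it silently tests for something else.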

Capture groups

The first metacharacter we'll get to know properly is the capture group, as this is what allows you to extract a substring of text from a target string.

Let's say we wanted to find the version of the IE web browser a visitor was using, given a target string of 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)'. Here we are looking for the '8.0' following 'MSIE'. A simple way to achieve this is /MSIE (8.0)/, where the parentheses capture the '8.0'.
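As a sketch of how the captured text is read back, assuming Python's re module: group(0) is the whole match and group(1) is whatever the first pair of parentheses captured.

```python
import re

user_agent = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)"

match = re.search(r"MSIE (8.0)", user_agent)
if match:
    print(match.group(0))  # MSIE 8.0  (everything the pattern matched)
    print(match.group(1))  # 8.0       (only the capture group)
```

Note that the '.' in this pattern is unescaped, so it matches any single character rather than only a literal dot; a stricter version would write 8\.0.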

The problem

What is a regular expression?

How does it work?

Capture groups

Boolean 'or'

Matching any single character

Iteration metacharacters

Positioning

Character classes

Shorthand character classes

Capture groups extended

/^d\w[uiop](in|vi)[^a-f]*$/

Resources