Regular expressions

Regular Expressions

A Rationale

The regular expression is a kind of a programming language—but a programming language with a specific purpose: to describe strings.

The tools we’ve used so far in class let us ask coarse-grained questions about strings, such as “does this string contain the letter ‘a’?” and “does this string have nine or more characters?”

Regular expressions are a language that lets us ask, in very nuanced ways, the question “Does the string look like this?”

Regular expression sandwich

Regular expression sandwich

This is a strange analogy, but it might be helpful for your understanding. You can think of your sandwich order at the deli as a kind of regular expression. You’re asking for a sandwich with its ingredients in a particular order: “two slices of bread, with swiss, salami, and romaine in between.” If you’re getting a pre-made sandwich from the refridgerator, you might be less picky, and look instead for two slices of bread, with cheese, meat and lettuce in between. Maybe you’re picking up a sandwich for a friend, who doesn’t care whether or not the sandwich has lettuce: “two slices of bread, with some kind of cheese and meat between them, and optionally lettuce.”

Regular expressions are a way to make computers understand these kinds of requests. Except for strings, not sandwiches.

Otherwise stated: a cautionary tale

Our programs so far have been able check whether strings meet very simple criteria, such as “does this string begin with a particular character” or “does this string contain another string”? But imagine writing a program that performs the following task: print only those lines from input that contain a ZIP code (i.e., a five-character sequence of digits). Give up? Here’s my attempt, using only the tools we’ve discussed so far (no regular expressions):

# finding zip codes in input without regular expressions---don't do this!
import sys

for line in sys.stdin:
  in_number = False
  number_char_count = 0
  found_zip = False
  for char in line:
    if char in '0123456789':
      in_number = True
      number_char_count += 1
    else:
      if number_char_count == 5:
        found_zip = True
      in_number = False
      number_char_count = 0
  if found_zip:
    print line.strip()

Basically, we have to iterate over each character in the line, check to see if that character is a digit, set a flag if so, continue reading characters until we reach a non-digit character, unset the flag, check to see if we found exactly five digit characters, and print out the line if so. Problems with this code: it’s messy; it doesn’t overtly communicate what it’s doing; it’s not easily generalized to other, similar tasks (e.g., if we wanted to write a program that printed out lines with phone numbers, the code would likely look completely different).

Our ancient UNIX pioneers had this problem, and in pursuit of a solution, thought to themselves, “Let’s make a tiny language that allows us to write specifications for textual patterns, and match those patterns against strings. No one will ever have to write fiddly code that checks strings character-by-character ever again.” And regular expressions were born.

Here’s that same program using regular expressions, by the way:

import sys
import re

for line in sys.stdin:
  line = line.strip()
  if re.search(r"\d{5}", line):
    print line

I’ll allow that the \d{5} in there is mighty cryptic (though hopefully it won’t be when you’re done reading this page and/or participating in the associated lecture). But the overall structure of the program is much simpler.

Crafting with egrep: patternmaking and matchmaking

The most basic operation that regular expressions perform is matching strings: you’re asking the computer whether a particular string matches some description. The grep utility has a very unsophisticated way of performing this operation. The following command, for example:

  grep you <poem.txt

… will return only those lines in poem.txt that contain the literal string you. In fact, that’s the only operation grep lets us perform: “Does the current line contain this literal string?”

But what if we want to ask the computer for lines matching a more complex description? UNIX has the tool for you! There’s a command called egrep which works just like grep, except that it prints lines that match a regular expression, not just a literal string. For example:

  egrep 'yo.' <poem.txt

… will print lines in poem.txt that contain the letters yo followed by any other character–not just lines containing “you” but also lines containing “yon,” “york,” “yolanda,” and so forth. The regular expression that we pass to egrep is called a pattern. (Make sure to put your pattern inside single quotes—otherwise, the UNIX shell might misinterpret or mangle the pattern.)

Metacharacters

The period (.) in the above example is a metacharacter—a character that, when inside a regular expression, means something other than its literal value. The metacharacters in regular expressions fall (broadly) into four categories. We’ll be discussing each category, but we’ll start out with the most important one: character classes.

.    match any character
\w   match any alphanumeric ("word") character (lowercase and capital letters, 0 through 9, underscore)
\W   match any non-alphanumeric character (the inverse of \w)
\s   match any whitespace character (i.e., space and tab)
\S   match any non-whitespace character (the inverse of \s)
\d   match any digit (0 through 9)

(nb. \s, \S and \d work in Python, but not in egrep; see below)

You can define your own character classes by enclosing a list of characters, or range of characters, inside square brackets:

[aeiou]   matches any vowel
[02468]   matches any even digit
[a-z]     matches any lower-case letter
[^0-9]    matches any non-digit (the ^ inverts the class, matches anything not in the list)

The next important kind of metacharacters are anchors. These allow you to specify exactly where in the string your pattern should match:

^     match at beginning of string
$     match at end of string
\b    match at word boundary

(Note that ^ in a character class has a different meaning from ^ outside a character class!)

Now we have enough regular expression knowledge to do some fairly sophisticated matching. An example:

  egrep '^T\w\w ' <frost.txt

You can read the pattern above like this: “At the beginning of a line, match an uppercase T followed by two alphanumeric characters, followed by a space.” Run this on the frost.txt text from class, and you’ll get the following output:

Two roads diverged in a yellow wood,
Two roads diverged in a wood, and I—

Another example:

  egrep '\b[aeiou]\w\w\w\b' <frost.txt

This pattern reads as “at the beginning of a word, match any vowel followed by three alphanumeric characters, followed by the end of the word”—in other words, match any line that contains four-letter words beginning with a vowel. Output from frost.txt:

I doubted if I should ever come back.
Somewhere ages and ages hence:

Quantifiers

Typing out all of those \ws is kind of a pain. Fortunately, there’s a way to specify how many times to match a particular character, using quantifiers. These affect the character that immediately precede them:

{n}   match exactly n times
{n,m} match at least n times, but no more than m times
{n,}  match at least n times
+     match one or more times (same as {1,})
*     match zero or more times
?     match one time or zero times

This example finds all lines that have a word that begins with s and ends with p (with at least one character between them):

  egrep '\bw\w+d\b' <frost.txt

This produces the following output from frost.txt:

Two roads diverged in a yellow wood,
Because it was grassy and wanted wear;
Two roads diverged in a wood, and I—

The following example will match all lines that have five or more consecutive vowels:

  egrep '[aeiou]{5,}' <sowpods.txt

Run against our scrabble word list, we’ll get the following output:

cooeeing
euouae
euouaes
forhooieing
miaoued
miaouing
queueing
queueings
zoaeae
zooeae

Escaping: Sometimes a bracket is just a bracket

Whew! That’s a lot of metacharacters. But what if we want to match a character that has a special meaning, without invoking that character’s special meaning?

The answer: put a backslash before the metacharacter. \[ will match a literal square bracket; \* will match a literal asterisk, and so forth.

This call to egrep, for example, will match all lines that end in a period:

  egrep '\.$' <frost.txt

… yielding the following output from frost.txt:

In leaves no step had trodden black.
I doubted if I should ever come back.
And that has made all the difference.

Alternation

One final bit of regular expression syntax: alternation.

(x|y)   match either x or y

You can have as many alternatives in the parentheses as you want, separated by pipe characters (|).

This example matches either “The” or “And” at the beginning of a line:

  egrep '^(And|I)' <frost.txt

…producing the following output:

And sorry I could not travel both
And be one traveler, long I stood
And looked down one as far as I could
And having perhaps the better claim,
And both that morning equally lay
In leaves no step had trodden black.
I doubted if I should ever come back.
I shall be telling this with a sigh
I took the one less travelled by,
And that has made all the difference.

Regular expressions in Python

Using egrep is all well and good, but there comes a time when you’ll find it useful to use regular expressions inside a Python program. Luckily, Python’s standard library includes a set of functions and objects to facilitate this: the re (“regular expression”) library. In the following section, we’ll go over the contents of the re library and demonstrate its various uses.

There are three things we’re likely to want to do with regular expressions in Python: matching, capture and substitution. We’ll look at these one at a time.

“Raw” string literals: r””

But first: a bit about string literals.

Python provides another way to include string literals in your program, in addition to the single- and double-quoted strings we discussed in session 2. The r"" string literal, or “raw” string, includes all characters inside the quotes literally, without interpolating special escape characters. Here’s an example from the interactive interpreter:

>>> print "this is\na test"
this is
a test
>>> print r"this is\na test"
this is\na test
>>> print "I love \\ backslashes!"
I love \ backslashes!
>>> print r"I love \ backslashes!"
I love \ backslashes!

As you can see, whereas a double- or single-quoted string literal interprets \n as a new line character, the raw quoted string includes those characters as they were literally written. More importantly, for our purposes at least, is the fact that, in the raw quoted string, we only need to write one backslash in order to get a literal backslash in our string.

Why is this important? Because regular expressions use backslashes all the time, and we don’t want Python to try to interpret those backslashes as special characters. (Inside a regular string, we’d have to write a simple regular expression like \b\w+\b as \\b\\w+\\b—yecch.)

So the basic rule of thumb is this: use r"" to quote any regular expressions in your program. All of the examples you’ll see below will use this convention.

Matching strings

The most basic operation we’ll want to perform with regular expressions in Python is simply to check whether or not a pattern matches a given string. We can do this using the re module’s search function. The first parameter to search is the regular expression you’re trying to match; the second parameter is the string you’re matching against. If the pattern doesn’t match, as in the following transcript from the interactive interpreter:

>>> import re
>>> print re.search(r"^\w{5}$", "foo") # match strings consisting of exactly five characters
None

… then the search function returns None. (None is Python’s version of null in other languages: it simply means nothing.)

If, on the other hand, the pattern does match, the search function returns a special “match object.” Here’s a continuation of the interactive interpreter session above:

>>> print re.search(r"^\w{5}$", "xyzzy")
<_sre.SRE_Match object at 0x71608>

We’ll discuss the specifics of match objects below, but for now the important thing is that if search succeeds, it counts as “true”; if the search fails, it counts as “false.” This fact allows us to build a simple grep clone in Python, with the following code: (you.py)

import sys
import re

for line in sys.stdin:
  line = line.strip()
  if re.search(r'\b[Yy]ou\b', line):
    print line

Run the program and see the results. You’ll only see the lines from the file that have the word ‘you’ in them (either with a capital or lower-case ‘y’).

The overall structure of this program should look familiar: we visit every line in standard input. The key line is line 6, in which we check the current line against a regular expression. If search returns true (in other words, if the match succeeds), then print the line; otherwise, do nothing.

Capture

In some cases, we want to be able to not just check to see if a string matches a regular expression, but also to find out just what, exactly, in the string matched the regular expression. In other words, we want to capture whatever it was that matched.

The easiest way to do this is with the re module’s findall function, which takes a regular expression and a string to match it against, and returns a list of all parts of the string that the regular expression matched. Here’s an example from the interactive interpreter:

>>> import re
>>> re.findall(r"\b\w{5}\b", "alpha beta gamma delta epsilon zeta eta theta")
['alpha', 'gamma', 'delta', 'theta']
>>> re.findall(r"\b\w{5}\b", "snub")
[]

The regular expression here is checking for any word that contains exactly five characters. The first call results in a list of all five-character words in the string; the second call has no matches, and so returns an empty list.

We can use this function to build a different version of you.py, one which tells us not just which lines matched the word “you” but what word came directly after. Here’s the source code: (contextual_you.py):

for line in sys.stdin:
  line = line.strip()
  for matching_string in re.findall(r'\b[Yy]ou \w+', line):
    print matching_string

Here, we loop over every line in standard input and then loop over every part of the line that matches the given regular expression (which finds the word “you” or “You” followed by a space and one or more alphanumeric characters). Run this with, e.g., Pride and Prejudice as input, and you’ll get results that look like the following:

you heard
you not
you must
you be
you talk
you must
You and
you may
you are
you the
you flatter
you must
you know
you must
you do
...

Match objects; capturing with parentheses

We saw above that the re.search function doesn’t just return True or False: it returns a special “match object,” which has a number of methods that we might find useful. The official documentation has its say here, but the method we’re going to focus on is group. Simply stated: the group method returns the part of the string that the regular expression matched.

Here’s an example from the interactive interpreter, in which a telephone number is extracted from

>>> import re
>>> search_me = "my phone number is 917-555-1234 hey hey"
>>> match_obj = re.search(r"\d{3}-\d{3}-\d{4}", search_me)
>>> match_obj.group()
'917-555-1234'

Sometimes we have fairly complex regular expressions, from which we want to be able to capture only selected parts of the matching string. Let’s take the example above as a starting point. We might want to capture the three parts of the phone number (area code, exchange, etc.) without having to parse out these values after the search. To do these, we surround the portions of the regular expression that we want to capture later with parentheses, like so:

>>> import re
>>> search_me = "my phone number is 917-555-1234 hey hey"
>>> match_obj = re.search(r"(\d{3})-(\d{3})-(\d{4})", search_me)
>>> match_obj.group()
'917-555-1234'
>>> match_obj.group(1)
'917'
>>> match_obj.group(2)
'555'
>>> match_obj.group(3)
'1234'

The group method still returns the whole match, but now we can pass parameters to the function to get whatever string matched between the corresponding group of parentheses. A call to group(1) gets whatever matched between the first set of parentheses in the regular expression; a call to group(2) gets whatever matched between the second set; etc.

The re.search function, as we’ve already established, only returns the first match. We used the re.findall function above to find all of the strings that match a particular regular expression, but what if we want to get all of the match objects that correspond to each match? There’s a function for that, too: re.finditer (short for “find with iterator”).

The finditer function allows you to use a for loop to iterate over all of the times a regular expression matched a particular string. Unlike findall, finditer gives you the match objects, not the strings.

Let’s look at an example that uses finditer, line-by-line. Here’s whats_in_what.py, which searches the source text for occurences of the string “in the” and captures the words directly to the right and left, thereby building a dictionary of containers and the things therein contained:

import sys
import re

data = dict()

for line in sys.stdin:
  line = line.strip()
  for match_obj in re.finditer(r"\b(\w+) in the (\w+)", line):
    contained = match_obj.group(1)
    container = match_obj.group(2)
    contained = contained.lower()
    container = container.lower()
    if container in data:
      data[container].append(contained)
    else:
      data[container] = list()
    data[container].append(contained)

for container in data.keys():
  contained_list = data[container]
  all_contained = ', '.join(contained_list)
  print container + ": " + all_contained

The most important lines in this program are lines 8-10. Line 8 has the call to re.finditer, which is what the for loop is going to loop over. In lines 9 and 10, the first and second matching groups (as defined by the parentheses in the regular expression in line 8) are extracted and placed into their own variables, one for the “container” (whatever comes after “in the”) and the “contained” (whatever was in the container).

After the end of the first for loop, the data variable will point to a dictionary, whose keys are containers, and whose values are lists of things contained in those containers. Here’s some sample output, just to illustrate (using Pride and Prejudice as input):

circle: blank
east: regiment
warmest: seated, bickerton
hope: time
kingdom: rank, any, haunts, houses
town: as
complaint: infection
evening: late, and, quadrille, come, join, table, there, settled, and, dance, over, house, up, there, fire, call
...

Here we’ve used regular expressions to extract what might be somewhat meaningful relationships between words in the source text—suitable for remixing into a new text, or for visualization, or for whatever.

Substitution

The last task that we’ll attempt with regular expressions is substitution, wherein whatever a particular regular expression matches will be replaced with a new string. The re module provides a straightforward function called re.sub to accomplish just such a task. It takes three parameters: a regular expression to match, a string to replace whatever matched in the regular expression, and the string to search. It returns a copy of the search string, with the relevant substitution applied.

Here’s an example of its use from the interactive interpreter:

>>> import re
>>> replace_me = "Frogs! Why do frogs always whisper my name? Oh, frogs."
>>> re.sub(r"[fF]rogs", "antelope", replace_me)
'antelope! Why do antelope always whisper my name? Oh, antelope.'

Using regular expressions for substitution is a tremendously powerful technique. Here’s a more full-fledged example, genderswap.py:

import re
import sys

for line in sys.stdin:
  line = line.strip()
  line = re.sub(r'\b[Hh]im\b', 'her', line)
  line = re.sub(r'\b[Hh]e\b', 'she', line)
  line = re.sub(r'\b[Hh]is\b', 'her', line)
  print line

This program has the effect of substituting third-person masculine pronouns in English with third-person feminine pronouns. Here’s some sample output, from Pride and Prejudice:

It is a truth universally acknowledged, that a single man in possession
of a good fortune, must be in want of a wife.

However little known the feelings or views of such a man may be on her
first entering a neighbourhood, this truth is so well fixed in the minds
of the surrounding families, that she is considered the rightful property
of some one or other of their daughters.

"My dear Mr. Bennet," said her lady to her one day, "have you heard that
Netherfield Park is let at last?"

Mr. Bennet replied that she had not.

"But it is," returned she; "for Mrs. Long has just been here, and she
told me all about it."
...

Important note: Remember to assign the return value of re.sub to a variable! Remember, re.sub returns a copy of the string; it doesn’t modify the string in-place. If you don’t assign the result of the substitution to a variable, your call to the function essentially has no effect.

Exercises

1. Make a version of you.py that takes an arbitrary string on the command line (use sys.argv), instead of using the hardcoded pattern already in the program.

2. Make a version of the program you made in exercise 1 above that prints only those lines in the input that do not match the specified pattern.

3. Make a program that finds every instance of proper names with titles (e.g., “Mr. X,” “Ms. Y,” “Dr. Z”) in a text, and then prints out the names with their titles. Bonus: Keep track of how many times each title occurs, and print out that count as well.

Helpful reading

Reply