Regular expressions.

Regular Expressions

A Rationale

The regular expression is a kind of a programming language—but a programming language with a specific purpose: to describe strings.

The tools we’ve used so far in class let us ask coarse-grained questions about strings, such as “does this string contain the letter ‘a’?” and “does this string have nine or more characters?”

Regular expressions are a language that lets us ask, in very nuanced ways, the question “Does the string look like this?”

Regular expression sandwich

This is a strange analogy, but it might be helpful for your understanding. You can think of your sandwich order at the deli as a kind of regular expression. You’re asking for a sandwich with its ingredients in a particular order: “two slices of bread, with swiss, salami, and romaine in between.” If you’re getting a pre-made sandwich from the refridgerator, you might be less picky, and look instead for two slices of bread, with cheese, meat and lettuce in between. Maybe you’re picking up a sandwich for a friend, who doesn’t care whether or not the sandwich has lettuce: “two slices of bread, with some kind of cheese and meat between them, and optionally lettuce.”

Regular expressions are a way to make computers understand these kinds of requests. Except for strings, not sandwiches.

Crafting with egrep: patternmaking and matchmaking

The most basic operation that regular expressions perform is matching strings: you’re asking the computer whether a particular string matches some description. The grep utility has a very unsophisticated way of performing this operation. The following command, for example:

  grep you <poem.txt

… will return only those lines in poem.txt that contain the literal string you. In fact, that’s the only operation grep lets us perform: “Does the current line contain this literal string?”

But what if we want to ask the computer for lines matching a more complex description? UNIX has the tool for you! There’s a command called egrep which works just like grep, except that it prints lines that match a regular expression, not just a literal string. For example:

  egrep 'yo.' <poem.txt

… will print lines in poem.txt that contain the letters yo followed by any other character–not just lines containing “you” but also lines containing “yon,” “york,” “yolanda,” and so forth. The regular expression that we pass to egrep is called a pattern. (Make sure to put your pattern inside single quotes—otherwise, the UNIX shell might misinterpret or mangle the pattern.)

Metacharacters

The period (.) in the above example is a metacharacter—a character that, when inside a regular expression, means something other than its literal value. The metacharacters in regular expressions fall (broadly) into four categories. We’ll be discussing each category, but we’ll start out with the most important one: character classes.

.    match any character
\w   match any alphanumeric ("word") character (lowercase and capital letters, 0 through 9, underscore)
\W   match any non-alphanumeric character (the inverse of \w)
\s   match any whitespace character (i.e., space and tab)
\S   match any non-whitespace character (the inverse of \s)
\d   match any digit (0 through 9)

(nb. \s, \S and \d work in Python, but not in egrep; see below)

You can define your own character classes by enclosing a list of characters, or range of characters, inside square brackets:

[aeiou]   matches any vowel
[02468]   matches any even digit
[a-z]     matches any lower-case letter
[^0-9]    matches any non-digit (the ^ inverts the class, matches anything not in the list)

The next important kind of metacharacters are anchors. These allow you to specify exactly where in the string your pattern should match:

^     match at beginning of string
$     match at end of string
\b    match at word boundary

(Note that ^ in a character class has a different meaning from ^ outside a character class!)

Now we have enough regular expression knowledge to do some fairly sophisticated matching. An example:

  egrep '^T\w\w ' <frost.txt

You can read the pattern above like this: “At the beginning of a line, match an uppercase T followed by two alphanumeric characters, followed by a space.” Run this on the frost.txt text from class, and you’ll get the following output:

The darkest evening of the year.
The only other sound's the sweep
The woods are lovely, dark and deep.

Another example:

  egrep '\b[aeiou]\w\w\w\b' <frost.txt

This pattern reads as “at the beginning of a word, match any vowel followed by three alphanumeric characters, followed by the end of the word”—in other words, match any line that contains four-letter words beginning with a vowel. Output from frost.txt:

The only other sound's the sweep
Of easy wind and downy flake.

Quantifiers

Typing out all of those \ws is kind of a pain. Fortunately, there’s a way to specify how many times to match a particular character, using quantifiers. These affect the character that immediately precede them:

{n}   match exactly n times
{n,m} match at least n times, but no more than m times
{n,}  match at least n times
+     match one or more times (same as {1,})
*     match zero or more times
?     match one time or zero times

This example finds all lines that have a word that begins with s and ends with p (with at least one character between them):

  egrep '\bs\w+p\b' <frost.txt

This produces the following output from frost.txt:

To stop without a farmhouse near
The only other sound's the sweep
And miles to go before I sleep,
And miles to go before I sleep.

The following example will match all lines that have five or more consecutive vowels:

  egrep '[aeiou]{5,}' <sowpods.txt

Run against our scrabble word list, we’ll get the following output:

cooeeing
euouae
euouaes
forhooieing
miaoued
miaouing
queueing
queueings
zoaeae
zooeae

Escaping: Sometimes a bracket is just a bracket

Whew! That’s a lot of metacharacters. But what if we want to match a character that has a special meaning, without invoking that character’s special meaning?

The answer: put a backslash before the metacharacter. \[ will match a literal square bracket; \* will match a literal asterisk, and so forth.

This call to egrep, for example, will match all lines that end in a period:

  egrep '\.$' <frost.txt

… yielding the following output from frost.txt:

Whose woods these are I think I know.
To watch his woods fill up with snow.
The darkest evening of the year.
To ask if there is some mistake.
Of easy wind and downy flake.
The woods are lovely, dark and deep.
And miles to go before I sleep.

Alternation

One final bit of regular expression syntax: alternation.

(x|y)   match either x or y

You can have as many alternatives in the parentheses as you want, separated by pipe characters (|).

This example matches either “The” or “And” at the beginning of a line:

  egrep '^(The|And)' <frost.txt

…producing the following output:

The darkest evening of the year.
The only other sound's the sweep
The woods are lovely, dark and deep.
And miles to go before I sleep,
And miles to go before I sleep.

Regular expressions in Python

Using egrep is all well and good, but there comes a time when you’ll find it useful to use regular expressions inside a Python program. Luckily, Python’s standard library includes a set of functions and objects to facilitate this: the re (“regular expression”) library. In the following section, we’ll go over the contents of the re library and demonstrate its various uses.

There are three things we’re likely to want to do with regular expressions in Python: matching, capture and substitution. We’ll look at these one at a time.

“Raw” string literals: r””

But first: a bit about string literals.

Python provides another way to include string literals in your program, in addition to the single- and double-quoted strings we discussed in session 2. The r"" string literal, or “raw” string, includes all characters inside the quotes literally, without interpolating special escape characters. Here’s an example from the interactive interpreter:

>>> print "this is\na test"
this is
a test
>>> print r"this is\na test"
this is\na test
>>> print "I love \\ backslashes!"
I love \ backslashes!
>>> print r"I love \ backslashes!"
I love \ backslashes!

As you can see, whereas a double- or single-quoted string literal interprets \n as a new line character, the raw quoted string includes those characters as they were literally written. More importantly, for our purposes at least, is the fact that, in the raw quoted string, we only need to write one backslash in order to get a literal backslash in our string.

Why is this important? Because regular expressions use backslashes all the time, and we don’t want Python to try to interpret those backslashes as special characters. (Inside a regular string, we’d have to write a simple regular expression like \b\w+\b as \\b\\w+\\b—yecch.)

So the basic rule of thumb is this: use r"" to quote any regular expressions in your program. All of the examples you’ll see below will use this convention.

Matching strings

The most basic operation we’ll want to perform with regular expressions in Python is simply to check whether or not a pattern matches a given string. We can do this using the re module’s search function. The first parameter to search is the regular expression you’re trying to match; the second parameter is the string you’re matching against. If the pattern doesn’t match, as in the following transcript from the interactive interpreter:

>>> import re
>>> print re.search(r"^\w{5}$", "foo") # match strings consisting of exactly five characters
None

… then the search function returns None. (None is Python’s version of null in other languages: it simply means nothing.)

If, on the other hand, the pattern does match, the search function returns a special “match object.” Here’s a continuation of the interactive interpreter session above:

>>> print re.search(r"^\w{5}$", "xyzzy")
<_sre.SRE_Match object at 0x71608>

We’ll discuss the specifics of match objects below, but for now the important thing is that if search succeeds, it counts as “true”; if the search fails, it counts as “false.” This fact allows us to build a simple grep clone in Python, with the following code: (you.py)

import sys
import re

for line in sys.stdin:
  line = line.strip()
  if re.search(r'\b[Yy]ou\b', line):
    print line

Run the program and see the results. You’ll only see the lines from the file that have the word ‘you’ in them (either with a capital or lower-case ‘y’).

The overall structure of this program should look familiar: we visit every line in standard input. The key line is line 6, in which we check the current line against a regular expression. If search returns true (in other words, if the match succeeds), then print the line; otherwise, do nothing.

Capture

In some cases, we want to be able to not just check to see if a string matches a regular expression, but also to find out just what, exactly, in the string matched the regular expression. In other words, we want to capture whatever it was that matched.

The easiest way to do this is with the re module’s findall function, which takes a regular expression and a string to match it against, and returns a list of all parts of the string that the regular expression matched. Here’s an example from the interactive interpreter:

>>> import re
>>> re.findall(r"\b\w{5}\b", "alpha beta gamma delta epsilon zeta eta theta")
['alpha', 'gamma', 'delta', 'theta']
>>> re.findall(r"\b\w{5}\b", "snub")
[]

The regular expression here is checking for any word that contains exactly five characters. The first call results in a list of all five-character words in the string; the second call has no matches, and so returns an empty list.

We can use this function to build a different version of you.py, one which tells us not just which lines matched the word “you” but what word came directly after. Here’s the source code: (contextual_you.py):

for line in sys.stdin:
  line = line.strip()
  for matching_string in re.findall(r'\b[Yy]ou \w+', line):
    print matching_string

Here, we loop over every line in standard input and then loop over every part of the line that matches the given regular expression (which finds the word “you” or “You” followed by a space and one or more alphanumeric characters). Run this with, e.g., Pride and Prejudice as input, and you’ll get results that look like the following:

you heard
you not
you must
you be
you talk
you must
You and
you may
you are
you the
you flatter
you must
you know
you must
you do
...

Match objects; capturing with parentheses

We saw above that the re.search function doesn’t just return True or False: it returns a special “match object,” which has a number of methods that we might find useful. The official documentation has its say here, but the method we’re going to focus on is group. Simply stated: the group method returns the part of the string that the regular expression matched.

Here’s an example from the interactive interpreter, in which a telephone number is extracted from

>>> import re
>>> search_me = "my phone number is 917-555-1234 hey hey"
>>> match_obj = re.search(r"\d{3}-\d{3}-\d{4}", search_me)
>>> match_obj.group()
'917-555-1234'

Sometimes we have fairly complex regular expressions, from which we want to be able to capture only selected parts of the matching string. Let’s take the example above as a starting point. We might want to capture the three parts of the phone number (area code, exchange, etc.) without having to parse out these values after the search. To do these, we surround the portions of the regular expression that we want to capture later with parentheses, like so:

>>> import re
>>> search_me = "my phone number is 917-555-1234 hey hey"
>>> match_obj = re.search(r"(\d{3})-(\d{3})-(\d{4})", search_me)
>>> match_obj.group()
'917-555-1234'
>>> match_obj.group(1)
'917'
>>> match_obj.group(2)
'555'
>>> match_obj.group(3)
'1234'

The group method still returns the whole match, but now we can pass parameters to the function to get whatever string matched between the corresponding group of parentheses. A call to group(1) gets whatever matched between the first set of parentheses in the regular expression; a call to group(2) gets whatever matched between the second set; etc.

The re.search function, as we’ve already established, only returns the first match. We used the re.findall function above to find all of the strings that match a particular regular expression, but what if we want to get all of the match objects that correspond to each match? There’s a function for that, too: re.finditer (short for “find with iterator”).

The finditer function allows you to use a for loop to iterate over all of the times a regular expression matched a particular string. Unlike findall, finditer gives you the match objects, not the strings.

Let’s look at an example that uses finditer, line-by-line. Here’s whats_in_what.py, which searches the source text for occurences of the string “in the” and captures the words directly to the right and left, thereby building a dictionary of containers and the things therein contained:

import sys
import re

data = dict()

for line in sys.stdin:
  line = line.strip()
  for match_obj in re.finditer(r"\b(\w+) in the (\w+)", line):
    contained = match_obj.group(1)
    container = match_obj.group(2)
    contained = contained.lower()
    container = container.lower()
    if container in data:
      data[container].append(contained)
    else:
      data[container] = list()
      data[container].append(contained)

for container in data.keys():
  contained_list = data[container]
  all_contained = ', '.join(contained_list)
  print container + ": " + all_contained

The most important lines in this program are lines 8-10. Line 8 has the call to re.finditer, which is what the for loop is going to loop over. In lines 9 and 10, the first and second matching groups (as defined by the parentheses in the regular expression in line 8) are extracted and placed into their own variables, one for the “container” (whatever comes after “in the”) and the “contained” (whatever was in the container).

After the end of the first for loop, the data variable will point to a dictionary, whose keys are containers, and whose values are lists of things contained in those containers. Here’s some sample output, just to illustrate (using Pride and Prejudice as input):

circle: blank
east: regiment
warmest: seated, bickerton
hope: time
kingdom: rank, any, haunts, houses
town: as
complaint: infection
evening: late, and, quadrille, come, join, table, there, settled, and, dance, over, house, up, there, fire, call
...

Here we’ve used regular expressions to extract what might be somewhat meaningful relationships between words in the source text—suitable for remixing into a new text, or for visualization, or for whatever.

Substitution

The last task that we’ll attempt with regular expressions is substitution, wherein whatever a particular regular expression matches will be replaced with a new string. The re module provides a straightforward function called re.sub to accomplish just such a task. It takes three parameters: a regular expression to match, a string to replace whatever matched in the regular expression, and the string to search. It returns a copy of the search string, with the relevant substitution applied.

Here’s an example of its use from the interactive interpreter:

>>> import re
>>> replace_me = "Frogs! Why do frogs always whisper my name? Oh, frogs."
>>> re.sub(r"[fF]rogs", "antelope", replace_me)
'antelope! Why do antelope always whisper my name? Oh, antelope.'

Using regular expressions for substitution is a tremendously powerful technique. Here’s a more full-fledged example, genderswap.py:

import re
import sys

for line in sys.stdin:
  line = line.strip()
  line = re.sub(r'\b[Hh]im\b', 'her', line)
  line = re.sub(r'\b[Hh]e\b', 'she', line)
  line = re.sub(r'\b[Hh]is\b', 'her', line)
  print line

This program has the effect of substituting third-person masculine pronouns in English with third-person feminine pronouns. Here’s some sample output, from Pride and Prejudice:

It is a truth universally acknowledged, that a single man in possession
of a good fortune, must be in want of a wife.

However little known the feelings or views of such a man may be on her
first entering a neighbourhood, this truth is so well fixed in the minds
of the surrounding families, that she is considered the rightful property
of some one or other of their daughters.

"My dear Mr. Bennet," said her lady to her one day, "have you heard that
Netherfield Park is let at last?"

Mr. Bennet replied that she had not.

"But it is," returned she; "for Mrs. Long has just been here, and she
told me all about it."
...

Important note: Remember to assign the return value of re.sub to a variable! Remember, re.sub returns a copy of the string; it doesn’t modify the string in-place. If you don’t assign the result of the substitution to a variable, your call to the function essentially has no effect.

Helpful reading

The Data structures section from the Python tutorial
Regular Expression Operations from the Python library documentation
egrep for linguists includes a good regular expressions tutorial, helpful for supplementing our in-class exercises.
Dictionaries from How to Think Like a Computer Scientist

Homework #2: Due 13 July

Create a program that creatively transforms a text using regular expressions. The program should take its input from the keyboard and send its to the screen (or redirect to/from a file). Your program might (a) filter lines from the input, based on whether they match a pattern; (b) match and display certain portions of each line; (c) replace certain portions of each line with new text; or (d) any combination of the above.

Choose one text that you created with your program to read in class.

Your program must make use of at least one set, dictionary, or list.

Bonus: If the logic of your procedure requires it, define and use a function in your program.

SUPER BONUS x4 MULTIPLIER: Use some aspect of regular expression syntax or Python’s re library that we didn’t discuss in class. (Possible ideas: flags like re.I; back references.)

No comments

Comments feed for this article

Trackback link: http://www.decontextualize.com/teaching/dwwp/regular-expressions-functions/trackback/

decontextualize