Programming from ^ to $

Download this week’s example code here.

Regular Expressions: A Rationale

The regular expression is a kind of a programming language—but a programming language with a specific purpose: to describe strings.

The tools we’ve used so far in class let us ask coarse-grained questions about strings, such as “does this string contain the letter ‘a’?” and “does this string have nine or more characters?”

Regular expressions will let us ask questions like, “Does this string begin with the letter ‘a’, have at least seven characters (but no more than ten), followed by a semicolon? Oh yeah, give me whatever text came after the semicolon.”

Crafting with egrep: patternmaking and matchmaking

The most basic operation that regular expressions perform is matching strings: you’re asking the computer whether a particular string matches some description. The grep utility has a very unsophisticated way of performing this operation. The following command, for example:

  grep you <poem.txt

… will return only those lines in poem.txt that contain the literal string you. In fact, that’s the only operation grep lets us perform: “Does the current line contain this literal string?”

But what if we want to ask the computer for lines matching a more complex description? UNIX has the tool for you! There’s a command called egrep which works just like grep, except that it prints lines that match a regular expression, not just a literal string. For example:

  egrep 'yo.' <poem.txt

… will print lines in poem.txt that contain the letters yo followed by any other character–not just lines containing “you” but also lines containing “yon,” “york,” “yolanda,” and so forth. The regular expression that we pass to egrep is called a pattern. (Make sure to put your pattern inside single quotes—otherwise, the UNIX shell might misinterpret or mangle the pattern.)

Metacharacters

The period (.) in the above example is a metacharacter—a character that, when inside a regular expression, means something other than its literal value. The metacharacters in regular expressions fall (broadly) into four categories. We’ll be discussing each category, but we’ll start out with the most important one: character classes.

.    match any character
\w   match any alphanumeric ("word") character (lowercase and capital letters, 0 through 9, underscore)
\W   match any non-alphanumeric character (the inverse of \w)
\s   match any whitespace character (i.e., space and tab)
\S   match any non-whitespace character (the inverse of \s)
\d   match any digit (0 through 9)

(nb. \s, \S and \d work in Java, but not in egrep; see below)

You can define your own character classes by enclosing a list of characters, or range of characters, inside square brackets:

[aeiou]   matches any vowel
[02468]   matches any even digit
[a-z]     matches any lower-case letter
[^0-9]    matches any non-digit (the ^ inverts the class, matches anything not in the list)

The next important kind of metacharacters are anchors. These allow you to specify exactly where in the string your pattern should match:

^     match at beginning of string
$     match at end of string
\b    match at word boundary

(Note that ^ in a character class has a different meaning from ^ outside a character class!)

Now we have enough regular expression knowledge to do some fairly sophisticated matching. An example:

  egrep '^T\w\w ' <frost.txt

You can read the pattern above like this: “At the beginning of a line, match an uppercase T followed by two alphanumeric characters, followed by a space.” Run this on the frost.txt text from class, and you’ll get the following output:

The darkest evening of the year.
The only other sound's the sweep
The woods are lovely, dark and deep.

Another example:

  egrep '\b[aeiou]\w\w\w\b' <frost.txt

This pattern reads as “at the beginning of a word, match any vowel followed by three alphanumeric characters, followed by the end of the word”—in other words, match any line that contains four-letter words beginning with a vowel. Output from frost.txt:

The only other sound's the sweep
Of easy wind and downy flake.

Quantifiers

Typing out all of those \ws is kind of a pain. Fortunately, there’s a way to specify how many times to match a particular character, using quantifiers. These affect the character that immediately precede them:

{n}   match exactly n times
{n,m} match at least n times, but no more than m times
{n,}  match at least n times
+     match one or more times (same as {1,})
*     match zero or more times
?     match one time or zero times

This example finds all lines that have a word that begins with s and ends with p (with at least one character between them):

  egrep '\bs\w+p\b' <frost.txt

This produces the following output from frost.txt:

To stop without a farmhouse near
The only other sound's the sweep
And miles to go before I sleep,
And miles to go before I sleep.

The following example will match all lines that have five or more consecutive vowels:

  egrep '[aeiou]{5,}' <sowpods.txt

Run against our scrabble word list, we’ll get the following output:

cooeeing
euouae
euouaes
forhooieing
miaoued
miaouing
queueing
queueings
zoaeae
zooeae

Escaping: Sometimes a bracket is just a bracket

Whew! That’s a lot of metacharacters. But what if we want to match a character that has a special meaning, without invoking that character’s special meaning?

The answer: put a backslash before the metacharacter. \[ will match a literal square bracket; \* will match a literal asterisk, and so forth.

This call to egrep, for example, will match all lines that end in a period:

  egrep '\.$' <frost.txt

… yielding the following output from frost.txt:

Whose woods these are I think I know.
To watch his woods fill up with snow.
The darkest evening of the year.
To ask if there is some mistake.
Of easy wind and downy flake.
The woods are lovely, dark and deep.
And miles to go before I sleep.

Alternation

One final bit of regular expression syntax: alternation.

(x|y)   match either x or y

You can have as many alternatives in the parentheses as you want, separated by pipe characters (|). (See PronounGroup.java below for an example.)

This example matches either “The” or “And” at the beginning of a line:

  egrep '^(The|And)' <frost.txt

…producing the following output:

The darkest evening of the year.
The only other sound's the sweep
The woods are lovely, dark and deep.
And miles to go before I sleep,
And miles to go before I sleep.

Regular expressions in Java

Java comes with built-in support for regular expressions. Using regular expressions in Java takes a bit of work, and it looks strange at first. But it’s very powerful, and unlike egrep, it allows us not just to match text, but to extract and replace it as well. Let’s take a look at a very simple example—You.java:

import com.decontextualize.a2z.TextFilter;
import java.util.regex.*;

public class You extends TextFilter {
  public static void main(String[] args) {
    new You().run();
  }
  public void eachLine(String line) {
    Pattern p = Pattern.compile("^[Yy]ou ");
    Matcher m = p.matcher(line);
    if (m.find()) {
      println(line);
    }
  }
}

If you run this program, you’ll find that it’s essentially a version of egrep with the pattern hard-coded in. It matches and prints out every line from input that begins with the word “You” (or “you”).

The first thing to notice is import java.util.regex.*;—you’ll need to include this line whenever you’re using the Pattern and Matcher classes. Just as you might write a “Ball” class to keep track of balls floating around the screen in a Processing sketch, Java’s built-in Pattern class keeps track of regular expression patterns. Likewise, a Matcher object uses a Pattern object to match strings.

In the eachLine method, a new Pattern object is created by calling Pattern.compile with a string containing our regular expression pattern. Once we have a Pattern object, we call the matcher method on it with the string we want to check as a parameter. This returns a Matcher object, which we can then use to check whether or not the line matched the regular expression. The find method of the Matcher object will return true if the line matches, and false otherwise.

Capturing

But the Matcher object lets us do more than just determine whether or not a string matches a particular pattern. Let’s take a look at another example (PronounGroup.java):

import com.decontextualize.a2z.TextFilter;
import java.util.regex.*;
public class PronounGroup extends TextFilter {
  public static void main(String[] args) {
    new PronounGroup().run();
  }
  private String pronounPattern = "\\b(I|[Ww]e|[Yy]ou|[Hh]e|[Ss]he|[Tt]hey) (\\w+)\\b";
  public void eachLine(String line) {
    Pattern p = Pattern.compile(pronounPattern);
    Matcher m = p.matcher(line);
    if (m.find()) {
      String pronoun = m.group(1);
      String afterPronoun = m.group(2);
      println(pronoun + " " + afterPronoun);
    }
  }
}

Compile and run this program, sending in some text as input—sonnets.txt, for example. You’ll get back a list of all occurrences of English pronouns in the text, along with the word that followed the pronoun. So how does it work?

In this program, after we’ve determined that the string matches the pattern, we’re capturing parts of the string. You’ll notice that certain portions of the pattern are enclosed in parentheses: these parentheses define a “group”—a part of the regular expression that we’re interested in extracting information from later. (This function of parentheses works in parallel with the alternation syntax discussed above.)

The group method of the Matcher object tells us what, exactly, matched between a given set of parentheses; the parameter passed to group determines which set of parentheses to check. The call to m.group(1) in the example above, for example, roughly translates like this: “Hey Java, I know that I asked you to match ‘I’ or ‘we’ or ‘you’ or ‘he’ or ‘she’ or ‘they’—but what did you actually find?”

Likewise, the call to m.group(2) translates as: “Hey Java, I know that I asked you to find one or more alphanumeric characters, followed by a word boundary—but what did you actually find?”

The Case of the Extra Backslashes

One unusual thing that you probably noticed about our pattern in PronounGroup.java is all those extra backslashes. What’s going on there?

You might remember from ICM that backslashes in Java strings already have a special meaning: you use them to make “escape sequences” (such as \n for a new line, \r for a carriage return, \" for a literal double-quote mark). In order to include a literal backslash in our string, you have to escape the escape character—in other words, write the backslash twice: \\.

If you’re having trouble wrapping your head around all that, don’t worry. The rule you need to remember is actually very simple: whenever you’d have a single backslash in a pattern you’d pass to egrep, you need to use two backslashes in your Java regular expression.

Replacements

Let’s look at one more regular expression example, PronounReplace.java:

import java.util.regex.*;
import com.decontextualize.a2z.TextFilter;

public class PronounReplace extends TextFilter {
  public static void main(String[] args) {
    new PronounReplace().run();
  }
  public void eachLine(String line) {
    line = line.replaceAll("\\bhe\\b", "she");
    line = line.replaceAll("\\bHe\\b", "She");
    line = line.replaceAll("\\bhis\\b", "her");
    line = line.replaceAll("\\bHis\\b", "Her");
    line = line.replaceAll("\\bhim\\b", "her");
    line = line.replaceAll("\\bHim\\b", "Her");
    println(line);
  }
}

This program reads lines from input and replaces every occurrence of a masculine pronoun in English (he, his, him) with the equivalent feminine pronoun (she, her, her).

In this program, we’re using the handy replaceAll method of the String object. This method finds every stretch of text in the String object that matches the regular expression in the first argument and replaces it with the string passed in as the second argument.

(You could accomplish the equivalent with the same Pattern/Matcher pair that we used in the previous examples, but the replaceAll method from the String class is a bit more convenient.)

ArrayLists

(notes forthcoming)

Reading for next week

Assignment #3

Make a program that creatively transforms or performs analysis on a text using regular expressions. The program should take its input from the keyboard and send its to the screen (or redirect to/from a file). Your program might (a) filter lines from the input, based on whether they match a pattern; (b) match and display certain portions of each line; (c) replace certain portions of each line with new text; or (d) any combination of the above.

Sample ideas: Replace all words in a text of a certain length with a random word; find telephone numbers or e-mail addresses in a text; locate words within a word list that will have a certain score in Scrabble; etc.

Bonus challenge 1: Use one or more features of regular expression syntax that we didn’t discuss in class. Reference here.

Bonus challenge 2: Use one or more features of the Pattern or Matcher class that we didn’t discuss in class. Of particular interest: regex flags (CASE_INSENSITIVE, MULTILINE), “back references” in replaceAll. Matcher class reference here.

Reply