Regular expressions; Perl: strings and lists

Regular Expressions

A Rationale

A regular expression is a kind of programming language, but one with a specific purpose: to describe strings.

The tools we’ve used so far in class let us ask coarse-grained questions about strings, such as “does this string contain the letter ‘a’?” and “does this string have nine or more characters?”

Regular expressions are a language that lets us ask, in very nuanced ways, the question “Does the string look like this?”

Regular expressions are also baked into Perl’s syntax; you can’t really know Perl unless you know regular expressions.

Crafting with egrep: patternmaking and matchmaking

The most basic operation that regular expressions perform is matching strings: you’re asking the computer whether a particular string matches some description. The grep utility has a very unsophisticated way of performing this operation. The following command, for example:

  grep you <poem.txt

… will return only those lines in poem.txt that contain the literal string you. For this class, we’ll treat that as the only question grep can ask: “Does the current line contain this literal string?” (Strictly speaking, grep does understand a more limited pattern syntax of its own, but we won’t use it.)

But what if we want to ask the computer for lines matching a more complex description? UNIX has the tool for you! There’s a command called egrep which works just like grep, except that it prints lines that match an extended regular expression, not just a literal string. (On current systems, egrep is usually a synonym for grep -E.) For example:

  egrep 'yo.' <poem.txt

… will print lines in poem.txt that contain the letters yo followed by any single character: not just lines containing “you” but also lines containing “yon,” “york,” “yolanda,” and so forth. The regular expression that we pass to egrep is called a pattern. (Make sure to put your pattern inside single quotes; otherwise, the UNIX shell might misinterpret or mangle the pattern.)

Metacharacters

The period (.) in the above example is a metacharacter—a character that, when inside a regular expression, means something other than its literal value. The metacharacters in regular expressions fall (broadly) into four categories. We’ll be discussing each category, but we’ll start out with the most important one: character classes.

.    match any character
\w   match any alphanumeric ("word") character (lowercase and capital letters, 0 through 9, underscore)
\W   match any non-alphanumeric character (the inverse of \w)
\s   match any whitespace character (e.g., space, tab, newline)
\S   match any non-whitespace character (the inverse of \s)
\d   match any digit (0 through 9)

(nb. \d works in Perl, but not in egrep. \s and \S work in Perl and in GNU egrep, but other versions of egrep may not support them.)

You can define your own character classes by enclosing a list of characters, or range of characters, inside square brackets:

[aeiou]   matches any vowel
[02468]   matches any even digit
[a-z]     matches any lower-case letter
[^0-9]    matches any non-digit (the ^ inverts the class, matches anything not in the list)

The next important kind of metacharacters are anchors. These allow you to specify exactly where in the string your pattern should match:

^     match at beginning of string
$     match at end of string
\b    match at word boundary

(Note that ^ in a character class has a different meaning from ^ outside a character class!)
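The character classes and anchors above behave the same way inside a Perl pattern. The following is a minimal sketch, not part of the class examples: the sample strings are invented for illustration, and it uses Perl’s built-in grep function (a list filter, not the command-line tool) to collect the strings that match each pattern.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample strings, made up for this illustration.
my @samples = ("The end", "then 42", "bathe", "see the cat");

# ^T: a capital T at the very beginning of the string.
my @starts_with_T = grep { /^T/ } @samples;

# \d$: a digit at the very end of the string.
my @ends_in_digit = grep { /\d$/ } @samples;

# \bthe\b: "the" as a whole word, not as part of "then" or "bathe".
my @whole_word = grep { /\bthe\b/ } @samples;

# [^a-z ]: any character that is neither a lowercase letter nor a space.
# Here ^ inverts the class; it is not an anchor.
my @has_other = grep { /[^a-z ]/ } @samples;

print "starts with T:     @starts_with_T\n";
print "ends in a digit:   @ends_in_digit\n";
print "whole word 'the':  @whole_word\n";
print "has other chars:   ", join(", ", @has_other), "\n";
```

Note how the two uses of ^ differ: in the first pattern it pins the match to the start of the string, while in the last pattern it negates the bracketed list.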

Now we have enough regular expression knowledge to do some fairly sophisticated matching. An example:

  egrep '^T\w\w ' <frost.txt

You can read the pattern above like this: “At the beginning of a line, match an uppercase T followed by two alphanumeric characters, followed by a space.” Run this on the frost.txt text from class, and you’ll get the following output:

The darkest evening of the year.
The only other sound's the sweep
The woods are lovely, dark and deep.

Another example:

  egrep '\b[aeiou]\w\w\w\b' <frost.txt

This pattern reads as “at the beginning of a word, match any lowercase vowel followed by three alphanumeric characters, followed by the end of the word.” In other words, match any line that contains a four-letter word beginning with a vowel. Output from frost.txt:

The only other sound's the sweep
Of easy wind and downy flake.

Quantifiers

Typing out all of those \ws is kind of a pain. Fortunately, there’s a way to specify how many times to match a particular character, using quantifiers. A quantifier affects the character (or character class, or parenthesized group) that immediately precedes it:

{n}   match exactly n times
{n,m} match at least n times, but no more than m times
{n,}  match at least n times
+     match one or more times (same as {1,})
*     match zero or more times
?     match one time or zero times
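To see how the quantifiers in this table differ from one another, here is a small Perl sketch. The word list is hypothetical (chosen to echo the frost.txt examples), and the patterns are anchored with ^ and $ so that each one has to describe the whole word.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical word list, made up for this illustration.
my @words = ("sp", "sip", "stop", "sweep", "sleep", "sleeps");

# + : at least one word character between the s and the p.
my @plus = grep { /^s\w+p$/ } @words;

# * : zero or more word characters, so "sp" now qualifies too.
my @star = grep { /^s\w*p$/ } @words;

# {2,3} : between two and three word characters in the middle.
my @range = grep { /^s\w{2,3}p$/ } @words;

# ? : the final s is optional.
my @optional = grep { /^sleeps?$/ } @words;

print "+     : @plus\n";
print "*     : @star\n";
print "{2,3} : @range\n";
print "?     : @optional\n";
```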

This example finds all lines that have a word that begins with s and ends with p (with at least one character between them):

  egrep '\bs\w+p\b' <frost.txt

This produces the following output from frost.txt:

To stop without a farmhouse near
The only other sound's the sweep
And miles to go before I sleep,
And miles to go before I sleep.

The following example will match all lines that have five or more consecutive vowels:

  egrep '[aeiou]{5,}' <sowpods.txt

Run against our Scrabble word list, this gives the following output:

cooeeing
euouae
euouaes
forhooieing
miaoued
miaouing
queueing
queueings
zoaeae
zooeae

Escaping: Sometimes a bracket is just a bracket

Whew! That’s a lot of metacharacters. But what if we want to match a character that has a special meaning, without invoking that character’s special meaning?

The answer: put a backslash before the metacharacter. \[ will match a literal square bracket; \* will match a literal asterisk, and so forth.
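The same backslash escaping works in Perl patterns. The sketch below uses invented sample lines. It also shows Perl’s built-in quotemeta function, which goes a step beyond what this section covers: it adds the backslashes for you when the text you want to match literally is stored in a variable.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample lines, made up for this illustration.
my @lines = ("Is this a [test]?", "2 * 3 = 6", "plain text");

# \[ matches a literal opening square bracket.
my @with_bracket = grep { /\[/ } @lines;

# \* matches a literal asterisk.
my @with_star = grep { /\*/ } @lines;

# quotemeta puts a backslash before every non-word character,
# so the result can be used as a pattern that matches the text literally.
my $literal = "2 * 3";
my $escaped = quotemeta($literal);
my @found   = grep { /$escaped/ } @lines;

print "bracket: @with_bracket\n";
print "star:    @with_star\n";
print "escaped: $escaped\n";
print "found:   @found\n";
```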

This call to egrep, for example, will match all lines that end in a period:

  egrep '\.$' <frost.txt

… yielding the following output from frost.txt:

Whose woods these are I think I know.
To watch his woods fill up with snow.
The darkest evening of the year.
To ask if there is some mistake.
Of easy wind and downy flake.
The woods are lovely, dark and deep.
And miles to go before I sleep.

Alternation

One final bit of regular expression syntax: alternation.

(x|y)   match either x or y

You can have as many alternatives in the parentheses as you want, separated by pipe characters (|).

This example matches either “The” or “And” at the beginning of a line:

  egrep '^(The|And)' <frost.txt

…producing the following output:

The darkest evening of the year.
The only other sound's the sweep
The woods are lovely, dark and deep.
And miles to go before I sleep,
And miles to go before I sleep.
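The parentheses in that pattern matter. As a hedged illustration in Perl (the sample lines are invented, loosely modeled on frost.txt), compare the grouped pattern with an ungrouped one, where the ^ anchor applies only to the first alternative:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample lines, made up for this illustration.
my @lines = ("The woods", "And miles", "Of wind And flake");

# Grouped: the ^ anchor applies to both alternatives.
my @grouped = grep { /^(The|And)/ } @lines;

# Ungrouped: this means "The at the start" OR "And anywhere",
# so the third line matches as well.
my @ungrouped = grep { /^The|And/ } @lines;

print "grouped:   ", join(", ", @grouped), "\n";
print "ungrouped: ", join(", ", @ungrouped), "\n";
```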

Perl begins: a grep clone

Regular expressions are such a huge part of programming Perl that our very first example is going to use them. Let’s take a look at grep.pl, which will print all lines in the input that match a pattern specified in the program:

#!/usr/bin/perl
use strict;

while (my $line = <>) {
  chomp($line);
  if ($line =~ /\b[Ee]\w{4}\b/) {
    print "$line\n";
  }
}

What’s going on in here?

Line 1: the “shebang.” This is one way to let the UNIX command line know which interpreter is responsible for running a program. Perl also reads this line itself (for example, to pick up command-line switches), so make sure it’s the first line in your programs. (The path to perl might be different on other machines.)

Line 2: use strict puts Perl into “strict mode,” which halts compilation of the program if certain types of errors are found. Most importantly, it stops you from using variables that you haven’t yet declared. If you’re programming something that you hope to maintain, always use this.

Line 4: There’s a lot going on here. The while loop works as you’d expect: inside the parentheses, there’s an expression that will evaluate as true or false; the loop will execute until that condition evaluates as false. The <> operator reads one line of input (from any files named on the command line or, if there are none, from standard input), which is assigned to the variable $line. When the input runs out, the condition becomes false and the loop ends. Variables that you don’t want to be global must be declared with my; more on this below.

Line 5: What does chomp do? One way to find out is to consult the documentation! Here’s the chomp documentation from perldoc.perl.org. You can also use the command-line tool perldoc -f nnnn, replacing nnnn with the name of the function you want information on, e.g.:

$ perldoc -f chomp

(chomp modifies the value in the variable passed to it, removing the trailing newline from the end of the string, if there is one.)

Line 6: Lots of stuff happening here as well. The =~ operator “binds” the variable on the left to the regular expression on the right. The binding operation, among other things, returns true if the variable to the left matches the pattern on the right. Regular expression patterns in Perl are found inside slashes (/).

Line 7: Like PHP, Perl will interpolate variable names inside of a double-quoted string and replace them with the variable’s value. Unlike PHP, Perl’s function to print a line to output is print (not echo).

Line 8: Also unlike PHP, compound statements with conditionals or loops (if, while, etc.) must include curly braces, even if there’s only one line. (Perl does have a shorter way of writing single-line conditionals and loops, which we’ll get to later.)

Mini-exercise. Make a version of grep.pl that prints only those lines in the input that do not match the specified pattern. Use conditionals to your advantage, or see if perhaps there is an operator that can help you.

A few more notes:

  • Note the $ sign in front of the variable name. These are required, just as in PHP, although the story is a bit more complicated in Perl, as we’ll see below.
  • Parentheses around calls to functions are optional in Perl. We could just as easily write chomp $line; on line 5 (although, for reasons of precedence and clarity, it’s usually better to include the parentheses).

Strings and lists

The following program (multigrep.pl) is very similar to grep.pl, except that it finds lines that match any of the patterns in a list of patterns given in the source code. The @patterns variable below is assigned an array value from an array literal; inside of our input-reading loop, we use Perl’s foreach loop to try to match the current line against each expression in the list. Here’s the code:

#!/usr/bin/perl
use strict;

my @patterns = ('\byou\b', '\bme\b', '\bhe\b', '\bshe\b');

while (my $line = <>) {
  chomp($line);
  foreach my $pattern (@patterns) {
    if ($line =~ /$pattern/) {
      print "$line\n";
      last;
    }
  }
}

In Perl, variables that contain arrays are prefixed with @ (more about this below). On line 4, the variable @patterns gets assigned an array from an array literal on the right side of the assignment operator. There are a number of ways to write array literals in Perl, but the most common is what you see above: a pair of parentheses, surrounding zero or more comma-separated values.

Line 8 demonstrates Perl’s foreach loop. The foreach syntax looks like this:

  foreach my $x (@some_kind_of_array) {
    # code here gets executed once for
    # each item in @some_kind_of_array
  }

The loop variable ($x above or $pattern in multigrep.pl) should be declared with my, which limits it to the body of the loop; under use strict, using an undeclared loop variable is a compile-time error.

On line 11, we bail out of the innermost loop using last, which is Perl’s equivalent of the break statement in PHP (or C or C++ or Java…). Because we’ve already determined that the line in question matches at least one of the patterns, we don’t need to check the other patterns, so we exit the loop.

Mini-exercise. Make a version of multigrep.pl that, for each line, prints out how many patterns from @patterns the current line matches.

A few other notes:

  • Just as in PHP, Perl arrays can be dynamically resized: you can add elements to an array with push and remove elements from the end of the array with pop. (Analogous operations can be performed on the beginning of an array with unshift and shift.)
  • Also as in PHP, the members of an array can be of mixed types: strings, numbers—any scalar value. (See below for the definition of “scalar” in the context of Perl.)
  • You can get individual elements from an array using C-style syntax, using a number to indicate which index you want: $foo[5] gets the value at index 5 of the array (the sixth element, since indices start at 0). You can also assign to these indices; importantly, you can assign to an index that hasn’t yet been defined, and Perl will automatically resize the array to accommodate.
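The array operations in these notes can be sketched as follows. The array contents are hypothetical; the comments track what the array holds after each step.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical array, made up for this illustration.
my @letters = ("a", "b", "c");

push @letters, "d";            # add to the end:        a b c d
my $last = pop @letters;       # remove from the end:   a b c   ($last is "d")
unshift @letters, "z";         # add to the beginning:  z a b c
my $first = shift @letters;    # remove from beginning: a b c   ($first is "z")

my $second = $letters[1];      # indices start at 0, so this is "b"

# Assigning past the end grows the array; the skipped slots are undefined.
$letters[5] = "f";
my $length = scalar @letters;  # an array in scalar context gives its length

print "last: $last, first: $first, second: $second\n";
print "length after assigning to index 5: $length\n";
```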

Scalars and arrays: What’s in a sigil?

Perl has three kinds of values: scalars, arrays, and hashes. A scalar contains a single value (say, a number or a string); an array contains multiple scalar values (like a list of strings); a hash contains a set of mappings between keys and values (like a map in C++/Java or a dictionary in Python). Each of these is associated with a different sigil, the character that comes at the beginning of a variable name. Scalars use $; arrays use @; and hashes use %. Some example code:

$foo = "hello"; # a scalar
@bar = ("one", "two", "three"); # an array
%baz = ("bird" => "duck", "cheese" => "feta"); # a hash

You’ll use these sigils when you’re assigning an initial value to a variable, or when you’re passing a variable to a function:

print $foo; # prints scalar value $foo to output
@bar = sort @bar; # sort returns a new sorted list, so assign it back to @bar
@keys_of_hash = keys %baz;

The trick with sigils, though, is that they indicate not the kind of variable you’re working with, but the kind of value. For example, if you’re getting a single value out of an array, you’re dealing with a scalar, not an array, so you use $ (where you might expect @):

print $bar[0]; # not @bar[0]
print $baz{"cheese"}; # not %baz{"cheese"}
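Here is a short, self-contained sketch of the “sigil follows the value” rule, reusing the @bar and %baz examples from above (declared with my so that it runs under use strict). The array slice and the scalar-context count go slightly beyond the text; they are included only as further instances of the same rule.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @bar = ("one", "two", "three");
my %baz = ("bird" => "duck", "cheese" => "feta");

# One element is a single value, so the sigil is $.
my $first  = $bar[0];
my $cheese = $baz{"cheese"};

# Several elements at once (a slice) form a list, so the sigil is @.
my @first_two = @bar[0, 1];

# Assigning an array to a scalar gives the number of elements.
my $count = @bar;

# keys returns a list; sorting makes the order predictable.
my @sorted_keys = sort keys %baz;

print "first: $first, cheese: $cheese\n";
print "first two: @first_two\n";
print "count: $count\n";
print "keys: @sorted_keys\n";
```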

Perl scope: my, my, my

So what’s with this my keyword we’re seeing all over the place? It’s a keyword that controls variable scope. In Perl, the default behavior when you assign to a variable (without use strict) is to create a global variable: every part of the program will from then on be able to see the variable you’ve just assigned to, and you may have inadvertently overwritten the value of an existing global variable with the same name.

This is obviously not optimal: if all of your variables are global, you can’t write interoperable pieces of code. To solve this problem, Perl has the my keyword which, when it accompanies a variable assignment or declaration, restricts the variable to the enclosing block. “Enclosing block” here essentially means everything from the most recent opening curly bracket to the matching closing curly bracket. Variables declared with my will temporarily mask variables with identical names in the outer scope. Here’s an example illustrating this behavior:

$foo = 2;
{
  my $foo = 1;
  print $foo; # prints 1
}
print $foo; # prints 2

Variables declared with my as the looping variable in a foreach, or inside the condition of a while loop, are similarly local to the respective loop.
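A small sketch of that loop scoping, using hypothetical variable names: the $item declared in the foreach is a separate variable from the $item declared outside it, so the outer value is unchanged once the loop finishes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $item = "outer";   # hypothetical variable in the enclosing scope
my @seen;

# This $item exists only inside the loop and masks the outer one there.
foreach my $item ("a", "b") {
  push @seen, $item;
}

print "seen inside the loop: @seen\n";
print "after the loop: $item\n";
```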

Further reading
