Perl: Syntax, hashes, and style

Perl syntax grab bag

Whitespace is mostly insignificant outside of quoted strings. You’re free to format your programs however you wish.
Everything from a # to the end of the line is a comment. (You can’t use C/C++ style comment syntax.)
Operators in Perl work very similarly to their counterparts in C/C++. The perlop documentation has a full list of Perl operators. We’ll discuss a number of Perl’s less familiar operators and operator behavior below.
Boolean operators like && and || return the last value evaluated, not true or false; you’ll often see code like $foo = $bar || $baz;, which will set $foo to the value of $bar if it’s non-false, and the value of $baz otherwise.
for and foreach are synonyms. You can use C-style syntax in for loops (e.g. for (my $i = 0; $i < 10; $i++) { print $i; }), but you'll hardly ever see it.
Perl's keyword for "else if" is elsif (spelled just like that).
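Here is a minimal sketch of the || idiom and of elsif from the list above. The helper names pick_name and describe are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the first argument unless it is false
# (undef, 0, "0", or the empty string); otherwise fall back to the second.
sub pick_name {
  my ($given, $fallback) = @_;
  my $name = $given || $fallback;   # || returns the last value it evaluated
  return $name;
}

# Hypothetical helper showing if / elsif / else.
sub describe {
  my ($n) = @_;
  if ($n < 0) {
    return "negative";
  } elsif ($n == 0) {
    return "zero";
  } else {
    return "positive";
  }
}

print pick_name("Alice", "anonymous"), "\n";  # prints Alice
print pick_name("", "anonymous"), "\n";       # prints anonymous
print describe(0), "\n";                      # prints zero
```

Note that || falls back on any false value, so a legitimate 0 or empty string would also be replaced by the fallback.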

All strings being equal

Perl has two equality operators, == and eq. Use == to compare numbers, and eq to compare strings.

To be more specific, == checks whether its operands are numerically equal (i.e., whether they evaluate to the same numerical value), and eq checks whether they are stringwise equal (i.e., whether they evaluate to the same string). Perl will do its best to figure out what numerical value is in a string, but a string with no leading numerical content always evaluates (numerically) to zero. Importantly, this means that == can evaluate to true for strings in unexpected situations. The following code, for example, prints numerically equal (but not, of course, stringwise equal), because both strings are numerically zero. (With use warnings enabled, Perl also warns that the arguments aren't numeric.)

$foo = "blah";
$bar = "ducks";
print "numerically equal\n" if ($foo == $bar);
print "stringwise equal\n" if ($foo eq $bar);

Along with eq come a whole host of operators that compare variables based on their stringwise value. Most important is ne, the stringwise equivalent of !=, but there's also lt (stringwise less than), gt (stringwise greater than), le (stringwise less-than-or-equal-to), etc. You can use these operators to compare and sort strings according to their alphabetical order.
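The difference between the numeric and stringwise operators shows up clearly with strings that look like numbers. The following is a small sketch; num_less and str_less are hypothetical helper names.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helpers wrapping the two kinds of "less than".
sub num_less { my ($x, $y) = @_; return $x <  $y ? 1 : 0; }
sub str_less { my ($x, $y) = @_; return $x lt $y ? 1 : 0; }

print num_less("10", "9"), "\n";   # prints 0: 10 is not numerically less than 9
print str_less("10", "9"), "\n";   # prints 1: the character "1" sorts before "9"

# sort compares stringwise by default, which gives alphabetical order here
my @words  = ("pear", "apple", "fig");
my @sorted = sort @words;
print "@sorted\n";                 # prints apple fig pear
```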

Quote me on that

One of Perl's strengths is that it has a multitude of quoting operators. Double-quote and single-quote work as they do in PHP (the former allows for variable interpolation and the latter doesn't), but Perl supports the following operators in addition:

The double-quote operator qq{string} is the equivalent of "string"
The single-quote operator q{string} is the equivalent of 'string'
The "quote words" operator qw{foo bar baz} is equivalent to writing ("foo", "bar", "baz") (this is great for creating arrays of words)

In each of the special quote operators above, the delimiter characters { and } can be replaced with any other matching pair of characters (say, ( and )) or any other single character. The two following statements, for example, both print the string foo:

print qq/foo/;
print q#foo#;

(This is handy for quoting strings with characters you would otherwise have to escape in a single- or double-quoted string.)
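As a rough illustration of the three operators, consider the sketch below. The helpers greet, literal, and word_count are invented for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# qq interpolates like double quotes, so the embedded " needs no escaping
sub greet {
  my ($name) = @_;
  return qq{Say "hello" to the $name};
}

# q does not interpolate, like single quotes, so $name stays as literal text
sub literal {
  return q{It's $name, literally};
}

# qw splits on whitespace and returns a list of words
sub word_count {
  my @words = qw(foo bar baz);
  return scalar @words;
}

print greet("world"), "\n";   # prints Say "hello" to the world
print literal(), "\n";        # prints It's $name, literally
print word_count(), "\n";     # prints 3
```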

Hashes; Capturing regex matches

The following program, you_capture.pl, looks on each line for the word "you" and whatever word follows it (only the first such match on a line is counted); it stores a count of how many times each following word occurs. In order to do this, we introduce two new features of Perl: capturing and hashes. Let's look at the source code:

#!/usr/bin/perl
use strict;

my %words = ();

while (my $line = <>) {
  chomp($line);
  if ($line =~ /[Yy]ou (\w+)/) {
    $words{$1}++;
  }
}

foreach my $key (keys %words) {
  print "$key: $words{$key}\n";
}

Here's how it works. First off, we declare a hash and assign an empty list to it. Hashes are the Perl equivalent of PHP's associative arrays, or Python's dictionaries, or C++'s STL map. They store associations between keys and values.

The regular expression on line 8 should look comprehensible. The new syntax here is the parentheses, which tell Perl to store (or "capture") the part of the source string that matched that portion of the regular expression. You might render the regular expression in English as "match 'you', beginning with either an upper- or lowercase 'y', then a space, and then remember the sequence of one or more word characters (letters, digits, or underscores) that follows." Note that the pattern doesn't require 'you' to be a whole word, so it also matches inside words like 'bayou'.

When you use regular expressions with parentheses, a successful match with the binding operator =~ not only returns true, but also sets the variables $1, $2, $3 and upwards, each of which contains the string captured by the corresponding set of parentheses in the regular expression (e.g., $1 contains the string captured by the first set of parentheses, $2 the string captured by the second, etc.).
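To see more than one capture at work, here is a short sketch with two sets of parentheses. The parse_time helper and the sample sentence are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: pull the hours and minutes out of a string like "12:30".
sub parse_time {
  my ($text) = @_;
  if ($text =~ /(\d+):(\d+)/) {
    return ($1, $2);   # $1 holds the first group, $2 the second
  }
  return;              # empty list when the pattern doesn't match
}

my ($hours, $minutes) = parse_time("Lunch is at 12:30 today");
print "hours=$hours minutes=$minutes\n";   # prints hours=12 minutes=30
```

The capture variables are only reset by a successful match, so it's safest to use $1 and friends only after checking that the match succeeded, as the if does here.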

Working with hashes

You can access or modify values in a hash with the following syntax:

$x{"y"}

... where x is the variable name of your hash and y is the key whose value you wish to access or modify. In you_capture.pl above, we're using the matched word ($1 on line 9) as the key, and incrementing the key's value each time it's matched. (In Perl, incrementing the value of a key that doesn't exist yet isn't an error: the missing value is undef, which counts as zero in numeric context, so the first ++ creates the entry and sets it to 1. Perl's term autovivification refers, strictly speaking, to the related behavior of creating nested hashes and arrays on demand, which we'll see in parse_config.pl.)

There are several strategies for looping over each key/value pair in a hash. The final foreach in you_capture.pl demonstrates one such strategy: the keys built-in function returns a list of all of the keys in a hash, which the foreach then loops over.

Further notes about hashes:

PHP remembers in what order you put keys into an associative array, and then returns those keys in the same order (when you, e.g., loop over the array). Perl does not remember the order of keys; you'll get the keys back in an arbitrary order.
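If you need a predictable order, one common approach is to sort the keys before looping. A minimal sketch follows; the report helper and the sample counts are invented for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: format a hash as "key: value" lines,
# visiting the keys in alphabetical order.
sub report {
  my (%h) = @_;
  my $out = "";
  foreach my $key (sort keys %h) {
    $out .= "$key: $h{$key}\n";
  }
  return $out;
}

my %count = (ducks => 3, woods => 1, cheese => 2);
print report(%count);
# prints:
# cheese: 2
# ducks: 3
# woods: 1
```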

Substitutions and hash literals

Next up: hash literals and substitutions. The following program replaces all instances of the patterns in the keys of %patterns with the string in the corresponding values:

#!/usr/bin/perl
use strict;

my %patterns = (
  '\bwoods\b' => 'ducks',
  'ee' => 'eeeeeee',
  '\.$' => '. Oh yeah, baby!'
);

while (my $line = <>) {
  chomp $line;
  while (my ($pat, $repl) = each %patterns) {
    $line =~ s/$pat/$repl/g;
  }
  print "$line\n";
}

Notes:

Hash literals look a lot like array literals, with one key difference: every other entry in the list becomes a key, with the next entry as the corresponding value. (The => operator is nearly synonymous with ,; the one difference is that it also quotes a bare word on its left.)
Line 12 illustrates another way to loop over the key/value pairs in a hash. The each function returns the next key/value pair from the given hash each time it's called; once all pairs have been exhausted, it returns an empty list, which makes the while condition false.
each returns a list: you can assign the result of such functions to more than one variable using list syntax. (e.g. my ($first, $second, $third) = returns_list() will create variables $first, $second, and $third in lexical scope with the corresponding values from the return value of returns_list()).
The s/x/y/ operator replaces occurrences of the pattern x with the string y. The final g means "global," i.e., replace every match on the given line; without the g, only the first match on each line would be replaced.
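The effect of the g flag is easiest to see side by side. This is a small sketch with made-up helper names and sample text.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helpers: the same substitution without and with the g flag.
sub replace_first { my ($text) = @_; $text =~ s/ee/EE/;  return $text; }
sub replace_all   { my ($text) = @_; $text =~ s/ee/EE/g; return $text; }

print replace_first("tree by the creek"), "\n";  # prints trEE by the creek
print replace_all("tree by the creek"), "\n";    # prints trEE by the crEEk
```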

Putting it all together: parsing a configuration file

Let's say we had a configuration file in the following format, and we wanted to get some information out of it. Let's write a Perl program to do just that.

# this is a test configuration file!
# we're going to write a perl script to parse it
# format:
#
# [section header]
# key=value
#
# lines beginning with # are comments, and won't be interpreted.
# anything from # to the end of the line will be ignored.
# the '=' sign between key/value pairs can have leading/trailing whitespace.

[widget]
blinkiness=7
num_buttons=6
brand_name = Jim's Widget and Baking Supply Co.

[zookeeper]
is_surly=true
raise=$1000       # our zookeeper has been very thorough!
ducks_owned=438
favorite_cheese=gouda

[classic_movie]
this=is a test
hello=there

What we want is a data structure that will let us get to these configuration values in a hierarchical fashion. Here's parse_config.pl, designed to do just such a job. At the end of the first while, the %config hash will contain a key for each section in the config file, whose value is another hash, which has a key/value pair for each key/value pair in the configuration file.

#!/usr/bin/perl
use strict;

my %config = ();
my $current_section = "undefined";

while (my $line = <>) {
  next if $line =~ /^#/; # next line if this is a comment

  chomp($line);
  $line =~ s/\s*#.*$//; # remove comments (and the whitespace before them) from the end of the line

  if ($line =~ /^\[(\w+)\]$/) {
    $current_section = $1;
  }
  if ($line =~ /^(\w+)\s*=\s*(.*)$/) {
    # creates/updates a reference to a hash within a hash
    $config{$current_section}{$1} = $2;
  }
}

while (my ($section, $pairs) = each %config) {
  print "Section: $section\n";
  # "%$pairs" below because $pairs is a reference to a hash
  while (my ($key, $val) = each %$pairs) {
    print "  $key -> $val\n";
  }
}

Notes:

expr1 if expr2 is short for if (expr2) { expr1 }
next is Perl-speak for continue
On line 16, a single regular expression captures two things: the key ends up in $1 and the value in $2.
On line 18, we're assigning to a hash inside another hash. There's a trick here, though: hash values can only be scalars. What's actually happening is that the value for each section key is a reference to a hash, which Perl creates automatically the first time we assign into it. More on this later, but it's also the reason that we need the extra % in front of $pairs on line 25.
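Once the hash of hashes is built, individual values can be read with two sets of braces. The sketch below uses a hand-built hash shaped like %config rather than the parser's output. The config_value helper is hypothetical, and the { ... } syntax for writing an anonymous hash reference inline is not otherwise covered in this section.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A hand-built hash of hashes shaped like %config after parsing.
# Each { ... } creates a reference to an anonymous hash.
my %config = (
  widget    => { blinkiness => 7, num_buttons => 6 },
  zookeeper => { is_surly => "true", ducks_owned => 438 },
);

# Hypothetical helper: look up one value, or return undef if the section
# is missing. Checking exists first avoids creating an empty section
# as a side effect of the lookup.
sub config_value {
  my ($section, $key) = @_;
  return undef unless exists $config{$section};
  return $config{$section}{$key};
}

print config_value("widget", "blinkiness"), "\n";   # prints 7

# %{ ... } dereferences the inner hash reference so that keys can list it
my @keys = sort keys %{ $config{zookeeper} };
print "@keys\n";                                    # prints ducks_owned is_surly
```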

Mini-exercise. Make a version of replace.pl that uses a configuration file to specify patterns and replacements. You may need to investigate Perl's functions for opening and reading from file handles other than standard input (see open).

Laconic Perl

Perl has a reputation for being a very terse language, with the ability to write powerful code that doesn't use a lot of characters. There are a number of tricks that make this possible. Let's learn about some of them.

The default variable

The first is the use of $_, the so-called "default variable," which gets set inside of looping constructs like foreach. The following code, for example, will print the digits one through nine, each on a new line:

my @numbers = (1, 2, 3, 4, 5, 6, 7, 8, 9);
foreach (@numbers) { # note missing loop variable!
  print "$_\n";
}

Certain built-in Perl functions will use $_ as a default argument if no other argument is present. Among these is print; the above code could be re-written as:

my @numbers = (1, 2, 3, 4, 5, 6, 7, 8, 9);
foreach (@numbers) { # note missing loop variable!
  print; # no argument specified, so uses $_
  print "\n";
}

Statement modifiers

Certain conditionals and loops can be written in Perl as statement modifiers. We saw such a modifier in parse_config.pl (the next if ... line). Perl foreach and while loops support a similar syntax. With a foreach modifier, $_ in the statement to the left of the modifier contains the current element of the list. The brief program above could be rewritten like so:

my @numbers = (1, 2, 3, 4, 5, 6, 7, 8, 9);
print "$_\n" foreach @numbers;

When reading a line with <> is the only thing in the condition of a while loop, the line is automatically assigned to $_. (<> reads from any files named on the command line, or from standard input if there are none.) You can take advantage of this fact to write incredibly terse programs. For example, the following program could be a simple clone of the cat UNIX utility in Perl:

#!/usr/bin/perl
print $_ while <>;
# or even shorter:
# print while <>;

Grep short

Let's take a look at grep_short.pl, which does the same thing as grep.pl, but takes advantage of some of these tricks:

#!/usr/bin/perl
use strict;

while (<>) {
  print if /\b[Ee]\w{4}\b/;
}
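For comparison, here is a sketch of the same test with the loop variable and the binding operator spelled out. It runs over a made-up array of lines instead of standard input, and the matches helper is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: true if the line contains a five-character word
# that starts with E or e.
sub matches {
  my ($line) = @_;
  return $line =~ /\b[Ee]\w{4}\b/ ? 1 : 0;
}

my @lines = ("Every good boy\n", "does fine\n", "early birds\n");

# The explicit form: a named loop variable instead of $_,
# and =~ instead of a bare pattern matched against $_.
foreach my $line (@lines) {
  print $line if matches($line);
}
# prints:
# Every good boy
# early birds
```

In grep_short.pl, the bare pattern in print if /.../ matches against $_ and the bare print outputs $_, which is why neither needs to name a variable.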
