Perl: references; map/grep/sort; subroutines; packages

Elsif and division

In Perl, you’ve got if/elsif/else:

if (some_condition) {
  # code
}
elsif (some_other_condition) {
  # code
}
else {
  # code
}

The elsif and else are, of course, optional; but the curly braces surrounding the blocks are not optional (even if a block contains only one statement).
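Here's a minimal sketch of the three branches in action. The variable names and the sample numbers are made up for illustration:

```perl
use strict;
use warnings;

# Collect a label for each number (illustrative values only).
my @labels;
foreach my $n (-3, 0, 12) {
  if ($n < 0) {
    push @labels, "negative";
  }
  elsif ($n == 0) {
    push @labels, "zero";
  }
  else {
    push @labels, "positive";
  }
}

print join(", ", @labels), "\n"; # prints negative, zero, positive
```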

Division in Perl yields floating-point numbers, even if both operands are integers.

print 7 / 4;
# prints 1.75
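If you do want a whole number, the built-in int() truncates the result, and % gives the remainder. A short sketch (the variable names are just for illustration):

```perl
use strict;
use warnings;

my $quotient  = 7 / 4;      # 1.75
my $whole     = 8 / 4;      # 2 (no fractional part, so none is printed)
my $truncated = int(7 / 4); # 1: int() drops the fractional part
my $remainder = 7 % 4;      # 3: % is the modulus operator

print "$quotient $whole $truncated $remainder\n"; # prints 1.75 2 1 3
```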

References

A reference is a scalar variable that points to another variable. The variable pointed to can be an array, a hash, or even another scalar. These are helpful when constructing nested data structures (lists of lists, hashes of hashes, lists of hashes, etc.) or in certain situations when you’re passing variables to subroutines. Here’s how references are made, and how to get back values from your references:

# one way:
@an_array = ("aardvark", "abacus", "academic");
$arrayref = \@an_array;
%a_hash = ("a"=>1, "b"=>2);
$hashref = \%a_hash;

# or, the short way:
$hashref = {"a"=>1, "b"=>2};
$arrayref = ["aardvark", "abacus", "academic"];

# use -> to get a value for a key or index from a hashref or arrayref
print $hashref->{"a"}; # prints 1
print $arrayref->[2]; # prints academic

# in situations where you need to pass your hashref or arrayref
# to a function that takes an actual (non-referenced) hash or
# array, just put % or @ in front of the reference:
print foreach @$arrayref; # prints aardvarkabacusacademic
print foreach keys %$hashref; # prints ab or ba (hash key order is not guaranteed)

# built-in function ref() tells you what type of variable the reference points to:
print ref($arrayref); # prints ARRAY
print ref($hashref); # prints HASH

# you can also do this:
$scalarval = 5;
$scalarref = \$scalarval;
print $$scalarref; # prints 5

Using this syntax, we can easily make sophisticated nested data structures:

use Data::Dumper; # loads a perl module to print out nested data structures
my $data = {
  "a" => [1, 2, 3, 4, [5, 6, 7, 8]],
  "b" => {"c"=>[1, 2, 3], "d"=>[1000, 2000, 3000]},
  "frozz" => [1, 2, {"baz"=>"quux"}]
};
print Dumper($data);
# prints a representation of the data (long)
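To pull individual values back out of a structure like this, chain the -> lookups. Between adjacent subscripts the arrow is optional. A sketch using the same data (the variable names on the left are assumptions for illustration):

```perl
use strict;
use warnings;

my $data = {
  "a" => [1, 2, 3, 4, [5, 6, 7, 8]],
  "b" => {"c"=>[1, 2, 3], "d"=>[1000, 2000, 3000]},
  "frozz" => [1, 2, {"baz"=>"quux"}]
};

# every step written out with ->
my $six = $data->{"a"}->[4]->[1];

# the arrow can be dropped between adjacent subscripts
my $thousand = $data->{"b"}{"d"}[0];
my $quux = $data->{"frozz"}[2]{"baz"};

# dereference an inner arrayref with @{ } to count its items
my $count = scalar(@{ $data->{"a"} });

print "$six $thousand $quux $count\n"; # prints 6 1000 quux 5
```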

Perl modules

Perl’s library is extensive. Many subroutines and objects are available as “modules,” bits of external code. Use the use statement to import such modules. Here’s a list of Perl’s standard modules; there are thousands of user-submitted modules on CPAN. (The breadth and depth of CPAN are among Perl’s great strengths as a language.)
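As a small illustration, here's a sketch that imports two functions from List::Util, one of the standard modules that ships with Perl. The data is made up:

```perl
use strict;
use warnings;
use List::Util qw(sum max); # import only the functions we need

my @data = (6, 19, 1, 2, 24);
my $total = sum(@data);   # 52
my $largest = max(@data); # 24

print "total: $total, largest: $largest\n"; # prints total: 52, largest: 24
```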

We can also create our own modules. We’ll do this in the Context Free Grammar example below.

Join and split: twin siblings of destruction and rebirth

join and split
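A minimal sketch of the pair, with made-up strings: split breaks a string into a list on a pattern, and join glues a list back into a single string with a separator.

```perl
use strict;
use warnings;

my $sentence = "the amoeba escaped";

# split takes a pattern and a string, and returns a list of pieces
my @words = split / /, $sentence; # ("the", "amoeba", "escaped")

# join takes a separator and a list, and returns one string
my $rejoined = join "-", @words;  # "the-amoeba-escaped"

print scalar(@words), "\n"; # prints 3
print "$rejoined\n";        # prints the-amoeba-escaped
```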

Subroutines

Subroutines in Perl… are weird.

- syntax (sub name { })
- arguments passed in as an array (@_), either assigned to as locals (my ($foo, $bar)); shifted off (shift @_ or just shift); or referenced individually ($_[0])
- return
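The three ways of getting at arguments listed above can be sketched side by side. The subroutine names here are hypothetical; each one adds its two arguments:

```perl
use strict;
use warnings;

# style 1: copy @_ into named lexical variables
sub add_named {
  my ($foo, $bar) = @_;
  return $foo + $bar;
}

# style 2: shift arguments off the front of @_ one at a time
# (inside a sub, a bare shift works on @_)
sub add_shifted {
  my $foo = shift;
  my $bar = shift;
  return $foo + $bar;
}

# style 3: index into @_ directly
sub add_indexed {
  return $_[0] + $_[1];
}

my @sums = (add_named(2, 3), add_shifted(2, 3), add_indexed(2, 3));
print join(" ", @sums), "\n"; # prints 5 5 5
```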

A simple example in disemvowel.pl:

#!/usr/bin/perl

sub disemvowel {
  my $tmp = shift @_; # or: $_[0]
  $tmp =~ s/[aeiou]//g;
  return $tmp;
}

while (my $line = <>) {
  chomp($line);
  $line = disemvowel($line);
  print "$line\n";
}

Map, grep, and sort

These are built-in functions for manipulating arrays.

my @data = (6, 19, 1, 2, 24);

# map evaluates the expression in the block for each item
# in the array, returning a new list made up of the results
my @mapped_data = map { $_ + 1 } @data;
print join ", ", @mapped_data; # prints 7, 20, 2, 3, 25

# grep evaluates the expression in the block for each item
# in the array, returning an array with only those items for
# which the expression evaluates to true
my @grepped_data = grep { $_ >= 3 } @data;
print join ", ", @grepped_data; # prints 6, 19, 24

# sort returns a sorted copy of the array. inside the block,
# variables $a and $b are available, which correspond to the two
# items being compared; the block must return a negative number,
# zero, or a positive number, depending on whether $a should sort
# before, the same as, or after $b. The <=> operator does this for
# numbers; there's a stringwise variant, cmp.
my @sorted_data = sort { $a <=> $b } @data;
print join ", ", @sorted_data; # prints 1, 2, 6, 19, 24

# easy way to reverse-sort an array:
my @reversed = sort { $b <=> $a } @data;
print join ", ", @reversed; # prints 24, 19, 6, 2, 1

Examples: token_sort.pl, second_sort.pl, you_capture_sort.pl

  • <=>: what it does, compare to cmp
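To see why the choice between <=> and cmp matters, here's a sketch that sorts the same made-up numbers both ways. cmp compares character by character, so "10" sorts before "9":

```perl
use strict;
use warnings;

my @numbers = (10, 9, 100, 1);

my @numeric    = sort { $a <=> $b } @numbers; # compares as numbers
my @stringwise = sort { $a cmp $b } @numbers; # compares as strings
my @default    = sort @numbers;               # no block: same as cmp

print join(", ", @numeric), "\n";    # prints 1, 9, 10, 100
print join(", ", @stringwise), "\n"; # prints 1, 10, 100, 9
print join(", ", @default), "\n";    # prints 1, 10, 100, 9

# the operators themselves return -1, 0, or 1
my $num_result = (2 <=> 10);   # -1, because 2 is less than 10
my $str_result = ("2" cmp "10"); # 1, because "2" sorts after "1"
print "$num_result $str_result\n"; # prints -1 1
```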

Context-Free Grammars

Language is organized hierarchically: sentences are composed of clauses, which themselves might be composed of clauses. A sentence, for example, is composed of a noun phrase and a verb phrase, and noun phrases and verb phrases can have their own internal structure, even containing other noun phrases or verb phrases. Linguists illustrate this kind of structure with tree diagrams:

[Figures: two syntax tree diagrams]

(generate your own tree diagrams with phpSyntaxTree)

Clearly, we need a model that captures the hierarchical and recursive nature of syntax. One such model is the context-free grammar.

A grammar is a set of rules. A context-free grammar defines rules in a particular way: as a series of productions. Our goal is to use these rules to generate sentences that have the structure of English. Here’s a mini-grammar of English that we can use for our initial experiments:

S -> NP VP
NP -> the N
VP -> V
VP -> V NP
N -> amoeba
N -> dichotomy
N -> seagull
V -> shook
V -> escaped
V -> changed

Key: NP = noun phrase; VP = verb phrase; S = sentence; N = noun; V = verb

Each rule specifies an “expansion”: it states that the symbol on the left side of the arrow, whenever it occurs in the process of producing a sentence, can be rewritten with the series of symbols on the right side of the arrow. Symbols to the left of the arrow are non-terminal; symbols to the right side of the arrow can be a mix of terminal and non-terminal. (“Terminal” here simply means that there is no rule that specifies how that symbol should be re-written.)

A symbol that appears on the left side more than once (like VP, N and V above) can be rewritten with the right side of any of the given rules. We’ll sometimes write such rules like this, for convenience, but it has the same meaning as the three N rules above:

N -> amoeba | dichotomy | seagull

Generating sentences

Okay! So let’s generate a simple sentence from our grammar. We have to choose a non-terminal symbol to start the process: this symbol is called the axiom, and it’s usually S (or whatever the “top level” rule is).

After we choose the axiom, the process for generating a sentence is simple. The process can be described like this:

  • Replace each non-terminal symbol with an expansion of that symbol.
  • If the expansion of a symbol itself contains non-terminal symbols, replace each of those symbols with their expansions.
  • When there are no non-terminal symbols left, stop.

For this example, we’ll use S as our axiom, and stipulate that whenever we encounter a non-terminal symbol with more than one production rule, we’ll choose among those options randomly. Here’s what the generation might look like:

  • S (apply rule: S -> NP VP)
  • NP VP (apply rule: NP -> the N)
  • the N VP (apply rule: N -> amoeba)
  • the amoeba VP (apply rule: VP -> V)
  • the amoeba V (apply rule: V -> escaped)
  • the amoeba escaped

This isn’t a very sophisticated example, true, but the machinery is in place for producing very complex sentences. Of course, no one wants to do all this by hand, so let’s look at a program written in Perl that automates the process of generating sentences from context-free grammars.
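Before getting to that program, here's a minimal sketch of the idea in a single file. This is not the ContextFree.pm module discussed below: the %rules layout and the expand_symbol name are assumptions made for illustration. Each non-terminal maps to a list of possible expansions, and expansion recurses until only terminals are left.

```perl
use strict;
use warnings;

# The mini-grammar above as a hash: each key is a non-terminal,
# each value is an arrayref of possible expansions, and each
# expansion is itself an arrayref of symbols.
my %rules = (
  "S"  => [ ["NP", "VP"] ],
  "NP" => [ ["the", "N"] ],
  "VP" => [ ["V"], ["V", "NP"] ],
  "N"  => [ ["amoeba"], ["dichotomy"], ["seagull"] ],
  "V"  => [ ["shook"], ["escaped"], ["changed"] ],
);

# Hypothetical helper: expand one symbol into a list of words.
sub expand_symbol {
  my ($symbol) = @_;

  # a terminal has no rule, so it stands for itself
  return ($symbol) unless exists $rules{$symbol};

  # pick one of the expansions at random...
  my @options = @{ $rules{$symbol} };
  my $chosen = $options[ int(rand(@options)) ];

  # ...and recursively expand every symbol in it
  return map { expand_symbol($_) } @$chosen;
}

my $sentence = join " ", expand_symbol("S");
print "$sentence\n"; # e.g. the amoeba escaped (output varies per run)
```

Because no rule in this grammar can expand into itself, the recursion always stops. A grammar with recursive rules would need its choices weighted so that it terminates with reasonable probability.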

ContextFree.pm

- $rules is a hashref (literally defined with { })
- => is used when defining hash literals; auto-quotes value to left
- [ ] define arrayrefs
- qw() autoquotes space-separated items, returns them as a list
- nested data structure
- use warnings left until after array definition
- \@results gives us a *reference* to the array (why is this important? because otherwise we’d just get something appended to the @_ as the method argument); you can take a reference of any variable by putting the backslash in front of it
- get the original variable out of the reference by prepending the appropriate sigil: @$ref, %$ref, $$ref…
- [ (1, 2, 3) ] is the same as [ 1, 2, 3 ] is the same as [ 1, (2, 3) ]
- “use ContextFree” causes the code in “ContextFree” to become available
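Several of these points can be checked with a short sketch. The count_args subroutine and the sample data are hypothetical, chosen only to show list flattening and why passing \@results differs from passing @results:

```perl
use strict;
use warnings;

# parentheses inside [ ] just flatten: all three hold 1, 2, 3
my $one   = [ (1, 2, 3) ];
my $two   = [ 1, 2, 3 ];
my $three = [ 1, (2, 3) ];
print scalar(@$one), scalar(@$two), scalar(@$three), "\n"; # prints 333

# hypothetical helper: report how many arguments arrived in @_
sub count_args {
  return scalar(@_);
}

my @results = ("x", "y", "z");
my $flattened = count_args("label", @results);  # array is flattened into @_
my $by_ref    = count_args("label", \@results); # reference is one scalar
print "$flattened $by_ref\n"; # prints 4 2

# => quotes the bareword to its left; qw() quotes each word
my %rule  = (NP => "the N");
my @nouns = qw(amoeba dichotomy seagull);
print "$rule{NP}, $nouns[2]\n"; # prints the N, seagull
```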

(8) ContextFree.pm
- our first module! (.pm = “perl module”)
- “package” keyword creates a separate “name space”; global names in this package are available to other packages as “$ContextFree::foo” (the default package is called “main”)
- expand is recursive… doesn’t look different in perl
- using an array reference or hash reference: ->
- ref() takes a reference and returns what kind of thing the reference points to (ARRAY, HASH, SCALAR)
- in choice: talk about array/scalar contexts (force scalar context with scalar(@array))
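The notes above on packages, ref(), and context can be sketched in one file. The package name Chooser and its contents are assumptions for illustration, not the contents of ContextFree.pm:

```perl
use strict;
use warnings;

package Chooser; # hypothetical package, starts a new name space

our $default = "seagull"; # visible elsewhere as $Chooser::default

# pick a random item from an arrayref
sub choice {
  my ($items) = @_;
  die "choice() needs an array reference" unless ref($items) eq "ARRAY";
  my $size = scalar(@$items); # scalar() forces scalar context: item count
  return $items->[ int(rand($size)) ];
}

package main; # back to the default package

my @nouns = qw(amoeba dichotomy seagull);
my $count = @nouns;   # scalar context: the number of items (3)
my ($first) = @nouns; # list context: the first item ("amoeba")
print "$count $first $Chooser::default\n"; # prints 3 amoeba seagull

my $pick = Chooser::choice(\@nouns);
print "$pick\n"; # one of the three nouns, chosen at random
```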

Further reading
