PHP: Classes, web templates

(draft!)

Object-oriented PHP

PHP thinks about classes and objects in a way that strongly resembles C++ and Java. We’re not going to get into the nitty-gritty details (like, say, static methods, inheritance or interfaces)—just enough so that you get the gist. Here’s a simple example of how to define and instantiate a class in PHP:

<?php

class Foo {
  public $z; // visible to anyone
  protected $x; // visible to objects of this class and of derived classes
  private $y; // visible only to objects of this class

  // constructor method
  function __construct($x) {
    // $this refers to this object, use -> to access data or methods 
    $this->x = $x;
  }

  // methods can also be public, protected, private
  public function bar() {
    echo "foo? " . $this->x . "!\n";
  }

}

// instantiate object
$foo = new Foo("bar"); // parameters here will be passed to constructor
$foo->bar(); // call method on object
// prints "foo? bar!"

Keep this in mind for a while, as we digress for a moment into the world of…

N-gram analysis and Markov chains

“Markov chain generator” is a term we’re using here for a particular species of awesome text generators, which you’ve probably experienced on the web. It uses statistical properties of a given text to generate a new text that strongly resembles, but is not identical to, the source text. Here’s an example of text generated from a Markov chain, using Shakespeare’s sonnets as a source:

A woman's eyes witness with thy scythe eart,
My bount.
That is the believe thy bars
So, either hast prison of think of praising, swift did husband out be than thee 
age of such more cause t
Nor so dignified.
  And defore beauty as her thief, and unfair in a woman's eyes not land,
Much cheered chang
O absence than gently night fair
As I'll read;
For his larg'd;
These I nevery with young.
  And sight grind

Text generated from a Markov chain is often humorous, and can also be used to get a better feel for the “style” of a particular text. The text you generate will have interesting juxtapositions that weren’t present in the source text. A number of poets have used such texts as the basis for jumpstarting their creative process.

The code in Markov.php and chain.php implements a Markov chain algorithm. Run chain.php, giving it some input on standard input, and you’ll get an infinite stream of generated text. (Pipe it through more and hit Ctrl+C to quit.)

We’re going to get around to explaining the code in those files, but let’s talk a little bit of theory first.

N-grams

An n-gram is simply a sequence of units drawn from a longer sequence; in the case of text, the unit in question is usually a character or a word. The unit of the n-gram is called its level; the length of the n-gram is called its order. For example, the following is a list of all unique character-level order-2 n-grams in the word condescendences:

co
on
nd
de
es
sc
ce
en
nc

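Finding these n-grams programmatically is a short loop. Here’s a quick sketch (the helper name ngrams is ours, not part of the course code):

```php
<?php

// Collect the unique n-grams of length $n from a string,
// in order of first appearance.
function ngrams($text, $n) {
  $found = array();
  for ($i = 0; $i <= strlen($text) - $n; $i++) {
    $gram = substr($text, $i, $n);
    if (!in_array($gram, $found)) {
      $found[] = $gram;
    }
  }
  return $found;
}

// the nine unique bigrams listed above
print_r(ngrams("condescendences", 2));
```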
And the following is an excerpt from the list of all unique word-level order-5 n-grams from frost.txt:

whose woods these are i
woods these are i think
these are i think i
are i think i know
i think i know his
...

N-grams are used frequently in natural language processing and are a basic tool of text analysis. Their applications range from programs that correct spelling to creative visualizations to compression algorithms to generative text. Identifying the n-grams in a text is the first step toward building a Markov chain generator.

The next question we’re going to try to answer is this: Given a particular n-gram in a text, what unit is most likely to come next? In order to answer this question, we’ll need to parse a text, keep a record of the unique n-grams, and associate each n-gram with a list of what directly follows that n-gram.

Let’s do a quick example by hand. This is the same character-level order-2 n-gram analysis of the (very brief) text “condescendences” as above, but this time keeping track of all characters that follow each n-gram:

n-gram   next?
co       n
on       d
nd       e, e
de       s, n
es       c, (end of text)
sc       e
ce       n, s
en       d, c
nc       e
From this table, we can determine that the n-gram co is followed by n 100% of the time, and the n-gram on is followed by d 100% of the time, while the n-gram de is followed by s 50% of the time and by n the rest of the time. Likewise, the n-gram es is followed by c 50% of the time, and followed by the end of the text the other 50% of the time.
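
This bookkeeping translates directly into PHP: an associative array mapping each n-gram to an array of the characters that follow it. Here’s a sketch (next_table is our own name; an empty string stands in for “end of text”):

```php
<?php

// Map each n-gram of length $n to the characters that follow it.
// The empty string stands in for "end of text".
function next_table($text, $n) {
  $table = array();
  for ($i = 0; $i <= strlen($text) - $n; $i++) {
    $gram = substr($text, $i, $n);
    $following = substr($text, $i + $n, 1); // "" past the end (PHP 7+)
    if (!array_key_exists($gram, $table)) {
      $table[$gram] = array();
    }
    $table[$gram][] = $following;
  }
  return $table;
}

$table = next_table("condescendences", 2);
print_r($table["de"]); // s and n
print_r($table["es"]); // c and "" (end of text)
```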

Generative text with Markov chains

The table above doesn’t just give us some interesting statistical data. It also allows us to reconstruct the underlying text—or, at least, generate a text that is statistically similar to the original text. Here’s how we’ll do it:

1. Start with the initial n-gram (co)—those are the first two characters of our output.
2. Look at the last n characters of output, where n is the order of the n-grams in our table, and find those characters in the “n-grams” column.
3. Choose randomly among the possibilities in the corresponding “next” column, and append that character to the output. (Sometimes, as with co, there’s only one possibility.)
4. If you chose “end of text,” then the algorithm is over. Otherwise, repeat the process starting with step 2.

Here’s a record of the algorithm in action:

co
con
cond
conde
conden
condend
condende
condendes
condendesc
condendesce
condendesces

As you can see, we’ve come up with a word that looks like the original word, and could even be passed off as a genuine English word (if you squint at it). From a statistical standpoint, the output of our algorithm is nearly indistinguishable from the input.
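
The numbered steps above can be sketched in a few lines of PHP, hardwired to the order-2 table for “condescendences”; choice() picks a random element of an array (the same helper Markov.php defines later):

```php
<?php

// pick a random element of an array
function choice($in) {
  return $in[rand(0, count($in) - 1)];
}

// the "next?" table from above; "" marks the end of the text
$table = array(
  "co" => array("n"),       "on" => array("d"),
  "nd" => array("e", "e"),  "de" => array("s", "n"),
  "es" => array("c", ""),   "sc" => array("e"),
  "ce" => array("n", "s"),  "en" => array("d", "c"),
  "nc" => array("e"),
);

$output = "co"; // (1) start with the initial n-gram
while (true) {
  $current = substr($output, -2);   // (2) the last n characters of output
  $next = choice($table[$current]); // (3) choose randomly among the possibilities
  if ($next === "") {               // (4) "end of text" ends the algorithm
    break;
  }
  $output .= $next;
}
echo $output . "\n"; // e.g. "condendesces" -- different on every run
```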

This kind of algorithm—moving from one state to the next, according to a weighted list of possibilities—is known as a Markov chain. For our purposes, the term “Markov chain” is synonymous with “text generated from n-gram model probability tables,” but the field of research is actually much richer than that. A starting point for the interested: Monopoly as a Markov chain.

A PHP implementation of a Markov chain generator

Our PHP implementation is split across two files: chain.php and Markov.php. Here’s chain.php:

<?php

require_once("Markov.php");

$mark = new Markov(4, 100);

while ($line = fgets(STDIN)) {
  $line = rtrim($line);
  $mark->feed($line);
}

// markov until eternity (^C to quit)
while (1) {
  echo $mark->gen();
  echo "\n";
}

What’s new?

  • require_once() is one way to load PHP code from another file into the current file. chain.php loads in code from Markov.php, which actually contains the Markov class.
  • Instantiating an object of class Markov (which, again, is defined in Markov.php), and calling that object’s feed() and gen() methods

Markov.php

<?php

function choice($in) {
  return $in[rand(0, count($in) - 1)];
}

class Markov {
  protected $n;
  protected $max;
  protected $grams = array();
  protected $begin = array();

  public function __construct($n, $max) {
    $this->n = $n;
    $this->max = $max;
  }

  public function feed($line) {

    // add the beginning of this string to the $begin array
    $this->begin[] = substr($line, 0, $this->n);

    // create $grams array by grabbing strings of length $n from
    // the given string
    $i = 0;
    while ($i <= strlen($line) - $this->n) { // stop at the last full-length gram
      $gram = substr($line, $i, $this->n);
      $following = substr($line, $i + $this->n, 1);
      if (array_key_exists($gram, $this->grams)) {
        $this->grams[$gram][] = $following;
      }
      else {
        $this->grams[$gram] = array($following);
      }
      $i++;
    }

  }

  public function gen($startwith="") {

    $output = "";
    $current = "";
    if ($startwith) {
      $current = $startwith;
    }
    else {
      $current = choice($this->begin);
    }
    $output .= $current;
    $max = $this->max;

    while ($max--) {
      if ($current && array_key_exists($current, $this->grams)) {
        $possible = $this->grams[$current];
        $next = choice($possible);
        $output .= $next;
        $current = substr($current . $next, 1, $this->n);
      }
      else {
        break;
      }
    }

    return $output;

  }

  public function get_grams() {
    return $this->grams;
  }

}
  • feed() builds n-grams from specified strings; call this as many times as necessary
  • gen() uses an algorithm similar to the one we used above with “condescendences” to generate strings from the stored n-grams
  • Call print_r($mark->get_grams()) to see what the data structure looks like

PHP as a web templating program

markov_web.php (in public_html)

- gotta make your own public_html directory
- this explains the whole <?php ?> thing!
- code gets interpreted between these processing directives; otherwise, the text of the file is printed out as-is
- (you might also see <? ?>, <?= ?>, <% %>, <%= %>, but DON’T USE THEM… short open tags can be turned off in configuration, making your script useless, and the ASP-style tags were removed entirely in PHP 7)
- information is passed to your script in various ways, mainly via $_POST and $_SERVER (also $_GET …); var_dump these for more information, and maybe test adding a field to the form
- line-by-line: <form method="" action=""><textarea> or <input>, submit button, etc.
- htmlentities and strip_tags are for security; use </textarea><script>…</script> to demonstrate XSS attack
- how error messages work in PHP for the web
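
Pulling these notes together, here’s a minimal sketch of a template along the lines of markov_web.php. Everything outside the <?php ?> processing directives is sent to the browser as-is; the field name source and the surrounding markup are our assumptions, not necessarily what the actual file does:

```php
<form method="post" action="markov_web.php">
  <textarea name="source"></textarea>
  <input type="submit" value="generate">
</form>
<?php
// code between the processing directives runs on the server;
// the HTML above was printed out as-is
if (isset($_POST["source"])) {
  // htmlentities() keeps the submitted text from being interpreted
  // as HTML; without it, a submission like
  // </textarea><script>...</script> becomes a live script tag (XSS)
  echo "<p>" . htmlentities($_POST["source"]) . "</p>";
}
?>
```

In the real markov_web.php, the submitted text would presumably be fed line by line to the Markov class and the generated string echoed (through htmlentities) instead.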

Further reading
