A Markov Distinction

Get this week’s example code.

Rationale

Last week we developed code to help us answer the following question: How often does a particular unit (character, word) occur in a text? This week, we attempt to answer a slightly more sophisticated question: Which units co-occur in a text, and in which order? This is our first foray into a syntactic analysis of text. Our eventual goal is to be able to use this syntactic model to generate a text that shares statistical properties with a given source text—not just in the distribution of units, but in the sequence of those units.

N-Grams

The first kind of text analysis that we’ll look at today is an n-gram model. An n-gram is simply a sequence of units drawn from a longer sequence; in the case of text, the unit in question is usually a character or a word. The unit of the n-gram is called its level; the length of the n-gram is called its order. For example, the following is a list of all unique character-level order-2 n-grams in the word condescendences:

co
on
nd
de
es
sc
ce
en
nc
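
Here's a minimal sketch of how a list like this might be computed in Java, using only the standard library (the class and variable names are just for illustration; the full word-level program appears below):

import java.util.LinkedHashSet;

public class CharNGrams {
  public static void main(String[] args) {
    String text = "condescendences";
    int n = 2; // the order of the n-grams
    // a LinkedHashSet keeps only unique n-grams, in the order they were first seen
    LinkedHashSet<String> grams = new LinkedHashSet<String>();
    for (int i = 0; i <= text.length() - n; i++) {
      grams.add(text.substring(i, i + n));
    }
    for (String gram: grams) {
      System.out.println(gram);
    }
  }
}

Run as-is, this prints exactly the nine bigrams listed above, in the order they first occur in the word.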

And the following is an excerpt from the list of all unique word-level order-5 n-grams from frost.txt:

whose woods these are i
woods these are i think
these are i think i
are i think i know
i think i know his
...

N-grams are used frequently in natural language processing and are a basic tool of text analysis. Their applications range from programs that correct spelling to creative visualizations to compression algorithms to generative text. They can be used as the basis of a Markov chain algorithm—and, in fact, that’s one of the applications we’ll be using them for later in this lesson.

NGram.java: a basic program for finding word-level n-grams

During class, we cooperatively developed a program that would take genesis.txt as input and output the contents of ngram_output.txt (included in this week’s example code). The idea was to find all unique word-level n-grams of order 3 in the input, then print those n-grams to output. The program that follows isn’t exactly what we came up with, but it’s pretty close.

import java.util.ArrayList;
import java.util.HashSet;
import com.decontextualize.a2z.TextFilter;
public class NGram extends TextFilter {
  public static void main(String[] args) {
    new NGram().run();
  }
  protected ArrayList<String> words = new ArrayList<String>();
  protected HashSet<String> grams = new HashSet<String>();
  protected int n = 3; // order
  public void eachLine(String line) {
    String[] tokens = line.split(" ");
    for (String t: tokens) {
      t = t.toLowerCase();
      t = t.replaceAll("[.,:;?!]", "");
      words.add(t);
    }
  }
  public void end() {
    for (int i = 0; i < words.size() - n + 1; i++) {
      String key = "";
      for (int j = i; j < i + n; j++) {
        key += words.get(j);
        key += " ";
      }
      grams.add(key);
    }
    for (String ngram: grams) {
      println(ngram);
    }
  }
}

Three aspects of the desired output text (ngram_output.txt) immediately present themselves: first, there is no punctuation; second, the n-grams overlap line endings; and third, everything is in lower case. Our eachLine function, accordingly, breaks each line up into words, strips out punctuation from those words, and converts them each to lower case. It then adds each word, in order, to an ArrayList of strings named words. After all lines have been read, the words ArrayList will contain a list of all words from the source text, converted to lower case and with punctuation removed, in the order that they occurred in the source text.

Let’s look at the end function. The first for loop visits each index of the ArrayList in turn; if that index is i, then the inner for loop visits indices i, i + 1, and so on up to (but not including) i + n, retrieving the words at those indices from the ArrayList along the way and using those strings to build a string variable key. (Note how the outer for loop’s termination condition is carefully designed to avoid accessing elements beyond the end of the ArrayList.) The key is then added to a HashSet object grams (remember from last week that HashSets will ignore attempts to add duplicate keys).

A second loop retrieves each object from grams and prints them out. The order in which the n-grams are printed is arbitrary—it reflects neither the order that the n-grams were put into the HashSet, nor their alphabetical order. (Review last week for an explanation of this.) The easiest way to get your output to match ngram_output.txt exactly is simply to pipe the program’s results through sort on the command line:

$ java NGram <genesis.txt | sort

N-gram models: what comes next?

Now that we have the ability to find and record the n-grams in a text, it’s time to take our analysis one step further. The next question we’re going to try to answer is this: Given a particular n-gram in a text, what unit is most likely to come next?

We can imagine the kind of algorithm we’ll need to extract this information from the text. It will look very similar to the code in NGram.java above, but it will need to keep track not just of the n-grams but also a list of all units (word, character, whatever) that follow those n-grams.
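
Here’s a rough character-level sketch of that idea: a HashMap from each n-gram to the list of characters that follow it. (The example code later in this lesson wraps the same bookkeeping up in a dedicated MarkovGram class; the class and variable names here are just for illustration.)

import java.util.ArrayList;
import java.util.HashMap;

public class NextChars {
  public static void main(String[] args) {
    String text = "condescendences";
    int n = 2; // the order of the n-grams
    // map each n-gram to every character that follows it in the text;
    // null stands in for "end of text"
    HashMap<String, ArrayList<String>> next = new HashMap<String, ArrayList<String>>();
    for (int i = 0; i <= text.length() - n; i++) {
      String gram = text.substring(i, i + n);
      String following = (i + n < text.length()) ? String.valueOf(text.charAt(i + n)) : null;
      if (!next.containsKey(gram)) {
        next.put(gram, new ArrayList<String>());
      }
      next.get(gram).add(following);
    }
    for (String gram: next.keySet()) {
      System.out.println(gram + ": " + next.get(gram));
    }
  }
}

Running this prints the same pairings as the hand-made table below, though in an arbitrary order and with null standing in for the end of the text.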

Let’s do a quick example by hand. This is the same character-level order-2 n-gram analysis of the (very brief) text “condescendences” as above, but this time keeping track of all characters that follow each n-gram:

n-gram    next?
co        n
on        d
nd        e, e
de        s, n
es        c, (end of text)
sc        e
ce        n, s
en        d, c
nc        e

From this table, we can determine that while the n-gram co is followed by n 100% of the time, and the n-gram on is followed by d 100% of the time, the n-gram de is followed by s 50% of the time and by n the rest of the time. Likewise, the n-gram es is followed by c 50% of the time, and by the end of the text the other 50% of the time.

Generative text with Markov chains

The table above doesn’t just give us some interesting statistical data. It also allows us to reconstruct the underlying text—or, at least, generate a text that is statistically similar to the original text. Here’s how we’ll do it: (1) start with the initial n-gram (co)—those are the first two characters of our output. (2) Now, look at the last n characters of output, where n is the order of the n-grams in our table, and find those characters in the “n-grams” column. (3) Choose randomly among the possibilities in the corresponding “next” column, and append that letter to the output. (Sometimes, as with co, there’s only one possibility). (4) If you chose “end of text,” then the algorithm is over. Otherwise, repeat the process starting with (2). Here’s a record of the algorithm in action:

co
con
cond
conde
conden
condend
condende
condendes
condendesc
condendesce
condendesces

As you can see, we’ve come up with a word that looks like the original word, and could even be passed off as a genuine English word (if you squint at it). From a statistical standpoint, the output of our algorithm is nearly indistinguishable from the input.
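
As a sketch of the whole procedure in code, the following hypothetical program builds the same next-character table as before and then runs steps (1) through (4) until it draws “end of text.” (Again, the names are for illustration only; the real example code below works line by line and caps the output length instead.)

import java.util.ArrayList;
import java.util.HashMap;

public class TinyMarkov {
  public static void main(String[] args) {
    String text = "condescendences";
    int n = 2; // the order of the n-grams
    // build the table of n-grams and following characters (null means "end of text")
    HashMap<String, ArrayList<String>> next = new HashMap<String, ArrayList<String>>();
    for (int i = 0; i <= text.length() - n; i++) {
      String gram = text.substring(i, i + n);
      String following = (i + n < text.length()) ? String.valueOf(text.charAt(i + n)) : null;
      if (!next.containsKey(gram)) {
        next.put(gram, new ArrayList<String>());
      }
      next.get(gram).add(following);
    }
    // (1) start with the initial n-gram
    String output = text.substring(0, n);
    while (true) {
      // (2) look at the last n characters of the output so far
      String current = output.substring(output.length() - n);
      ArrayList<String> choices = next.get(current);
      // (3) choose randomly among the possibilities recorded for that n-gram
      String pick = choices.get((int)(Math.random() * choices.size()));
      if (pick == null) {
        break; // (4) we drew "end of text", so the algorithm is over
      }
      output += pick;
    }
    System.out.println(output);
  }
}

Running it a handful of times yields words like condences, condescendences itself, or the condendesces traced above.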

This kind of algorithm—moving from one state to the next, according to a weighted list of possibilities—is known as a Markov chain. For our purposes, the term “Markov chain” is synonymous with “text generated from n-gram model probability tables,” but the field of research is actually much richer than that. A starting point for the interested: Monopoly as a Markov chain.

Markov chain algorithm in Java

The Markov chain example code for this week is structured similarly to last week’s concordance example. There are three classes:

  • MarkovFilter.java: Extends TextFilter, provides basic input/output functionality
  • Markov.java: Instantiated by MarkovFilter; contains all of the Markov chain logic
  • MarkovGram.java: A basic class for keeping track of n-grams and units that follow that n-gram; used as the “value” type for the HashMap in Markov.java

The reason for splitting up the code into three classes is reusability. True, we could have written all of the Markov chain logic right into the MarkovFilter.java class. But that implementation would have been tied to the particular input/output method that we’re using for this example. By separating the algorithmic logic into a separate class, we’ve created something that we can use in many different situations (e.g., not just command line programs, but Processing sketches, Android applications, etc.).

Markov.java

The MarkovFilter class and the MarkovGram class are simple enough to be readily understandable on their own, so I’m going to focus here on explaining what happens in the Markov class. Specifically, we’ll look at the feedLine and generateLine methods. (See the .zip file with this week’s examples for the code in full.)

The Markov class has four fields: n (the order of n-gram used in analysis and generation), max (the maximum number of characters that the chain will generate), grams (a HashMap that associates strings with MarkovGram objects), and lineStarts (an ArrayList that stores the first n characters of each line passed to feedLine). These fields are set or initialized in the constructor.
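
In outline, the declarations probably look something like this (a sketch only; the constructor’s exact signature is an assumption on my part, and the code in the .zip is the authoritative version):

import java.util.ArrayList;
import java.util.HashMap;

public class Markov {
  protected int n;   // order of the n-grams used for analysis and generation
  protected int max; // maximum number of characters generateLine will produce
  protected HashMap<String, MarkovGram> grams = new HashMap<String, MarkovGram>();
  protected ArrayList<String> lineStarts = new ArrayList<String>();
  public Markov(int n, int max) { // assumed signature: the order and the maximum length
    this.n = n;
    this.max = max;
  }
  // feedLine and generateLine, discussed below, complete the class
}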

Let’s take a look at the feedLine method.

  public void feedLine(String line) {
    String lineStart = line.substring(0, n);
    lineStarts.add(lineStart);
    for (int i = 0; i < line.length() - n; i++) {
      String gramStr = line.substring(i, i+n);
      String next = String.valueOf(line.charAt(i+n));
      if (grams.containsKey(gramStr)) {
        MarkovGram g = grams.get(gramStr);
        g.addNext(next);
      }
      else {
        MarkovGram g = new MarkovGram(gramStr);
        g.addNext(next);
        grams.put(gramStr, g);
      }
    }
  }

The purpose of this method is to break down a line into its component n-grams, then extract the character following each n-gram. To accomplish this, a for loop visits each index of the string (up until the length of the line minus the length of the n-gram), extracting a substring of length n for each index. The next variable is set to the value of the character that directly follows that n-gram. The grams HashMap is then checked to see if the n-gram in question has already been seen in the text; if so, the MarkovGram object for that n-gram is retrieved, and the next variable is passed into its addNext method, thereby recording the value of next as one of the characters that the n-gram is followed by in the text. If the n-gram hasn’t been seen yet, we create a new MarkovGram object, add the next string with addNext, and put the MarkovGram object into the HashMap.
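
For reference, a MarkovGram class that supports the calls used here (a constructor that takes the n-gram, addNext, and getRandomNext) might look roughly like the sketch below; the version in this week’s .zip may differ in its details:

import java.util.ArrayList;

public class MarkovGram {
  protected String gramStr; // the n-gram itself
  // every unit observed immediately after this n-gram; duplicates are kept on
  // purpose, so that more frequent followers are chosen more often
  protected ArrayList<String> nexts = new ArrayList<String>();
  public MarkovGram(String gramStr) {
    this.gramStr = gramStr;
  }
  public void addNext(String next) {
    nexts.add(next);
  }
  public String getRandomNext() {
    int randomIndex = (int)(Math.random() * nexts.size());
    return nexts.get(randomIndex);
  }
}

Keeping duplicates in a plain list is the simplest way to get the weighted random choice the Markov chain needs: an n-gram followed by e twice and s once will return e about two thirds of the time.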

The generateLine method takes the data structure built up over multiple calls to feedLine and generates a random text with it. Let’s look at the code:

  public String generateLine() {
    int randomIndex = (int)(Math.random() * lineStarts.size());
    String current = lineStarts.get(randomIndex);
    String output = current;
    for (int i = 0; i < max; i++) {
      if (grams.containsKey(current)) {
        MarkovGram g = grams.get(current);
        String nextStr = g.getRandomNext();
        output += nextStr;
        current = output.substring(output.length() - n);
      }
      else {
        break;
      }
    }
    return output;
  }

The first thing that this method does is select a random entry from the lineStarts ArrayList, which contains a list of all n-grams that occurred at the beginning of a line. (Check the first two lines of feedLine above to see how this happens.)

The two most important variables in this method are current and output. The current string is the current n-gram: the last n characters of the chain (where n is the length of our n-grams). The output string is what we’re eventually going to return: at the end of the method, it will have the string of text that the Markov chain algorithm generated.

The for loop guarantees that we only generate max characters in our output. The first thing that we do inside the loop is check grams to see if there is an n-gram equal to the value of current. (Remember, current starts out as a random n-gram from the beginning of a line of input text, and will thereafter be set to the n characters most recently added to the output.) If we find an n-gram with the value of current, we retrieve the MarkovGram object associated with it. The getRandomNext method of the MarkovGram object returns one of the characters that might follow the given n-gram; this is appended to the output, and the value of current is set to the last n characters of the output.

If the n-gram isn’t found (i.e., the given n-gram was never followed by another character in the source text), or if the for loop terminates, the method returns the generated output. Fed the text of genesis.txt, with n-grams of order 4 and a max of 100, we might expect generated output like this over multiple calls to generateLine:

And God creeping God called the earth.

And the earth the air, and over the was good. And the earth.

And the earth: and the darkness he heaven and to rule over the fifth day.

And God divided the light. And God said, Let the air, and have dominion over the stars also.
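
To tie the pieces together, a driver in the style of MarkovFilter might look roughly like this (a sketch only: the class name is hypothetical, and the assumption that Markov’s constructor takes the order and maximum length comes from the field sketch above, not from the example code):

import com.decontextualize.a2z.TextFilter;

public class MarkovFilterSketch extends TextFilter {
  public static void main(String[] args) {
    new MarkovFilterSketch().run();
  }
  // order-4 n-grams, at most 100 generated characters, as in the sample output above
  protected Markov markov = new Markov(4, 100);
  public void eachLine(String line) {
    markov.feedLine(line); // build up the n-gram table, one line at a time
  }
  public void end() {
    for (int i = 0; i < 4; i++) {
      println(markov.generateLine()); // print a few generated lines
    }
  }
}

Run it the same way as NGram: java MarkovFilterSketch <genesis.txt.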

Some relevant work

(more to come)

Reading for next week

Homework for next week

Modify, augment, or replace the in-class Markov chain example. A few ideas:

  • As given, the basic unit of the Markov example is the character. Rework the example so that it uses something else as its basic unit—the word, the line, the sentence, etc.
  • Along the same lines: rework the example so that it uses something other than text (or, at least, text that represents language) as its basic unit. For example: musical notes, songs in playlists, pixels in an image, etc. (An example.)
  • Modify the MarkovFilter class so that it accepts parameters on the command line—for example, the value of n (the “order” of the Markov chain), the value of max (the maximum number of characters to generate).
  • Develop a program that compares two text files based on how similar their n-grams are.
  • Create a text by feeding an existing text into the Markov chain algorithm. Justify your source text—what text did you use, and why? What effect does the value of n (the “order” of the n-gram) have on the resulting text?
