Topics: N-Grams and Markov chains.

N-Grams and Markov chains

Rationale

Last week we developed code to help us answer the following question: How often does a particular unit (character, word) occur in a text? This week, we attempt to answer a slightly more sophisticated question: Which units co-occur in a text, and in which order? This is our first foray into a syntactic analysis of text. Our eventual goal is to be able to use this syntactic model to generate a text that shares statistical properties with a given source text—not just in the distribution of units, but in the sequence of those units.

N-Grams

The first kind of text analysis that we’ll look at today is an n-gram model. An n-gram is simply a sequence of units drawn from a longer sequence; in the case of text, the unit in question is usually a character or a word. The unit of the n-gram is called its level; the length of the n-gram is called its order. For example, the following is a list of all unique character-level order-2 n-grams in the word condescendences:

co
on
nd
de
es
sc
ce
en
nc

And the following is an excerpt from the list of all unique word-level order-5 n-grams from frost.txt:

whose woods these are i
woods these are i think
these are i think i
are i think i know
i think i know his
...
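
To make the definition concrete, here's a minimal sketch (not one of this session's example programs) that finds the unique character-level order-2 n-grams of condescendences listed above:

word = "condescendences"
seen = list()

for i in range(len(word) - 1):
  gram = word[i:i+2]  # the character at i plus the character that follows it
  if gram not in seen:
    seen.append(gram)

for gram in seen:
  print gram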

N-grams are used frequently in natural language processing and are a basic tool of text analysis. Their applications range from programs that correct spelling to creative visualizations to compression algorithms to generative text. They can be used as the basis of a Markov chain algorithm—and, in fact, that’s one of the applications we’ll be using them for later in this lesson.

Baby steps: finding and counting word pairs

First off, let’s go after a simple task: finding word pairs in a text. A word pair is essentially a word-level order-2 n-gram; once we have code to find word pairs, we’ll generalize it to handle n-grams of any order. The following program (word_pairs.py) will print out every pair of adjacent words in the input text:

import sys

# will be a list of lists
pairs = list()

for line in sys.stdin:

  # strip line, break into words
  line = line.strip()
  words = line.split(" ")

  # skip this line if there are fewer than two words on it
  if len(words) < 2:
    continue

  # iterate from 0 up to the length of the list minus one (so we don't
  # inadvertently grab elements from beyond the length of the list)
  for i in range(len(words) - 1):
    pair = words[i:i+2] # i.e., the element at i and the next element
    pairs.append(pair)

# print out pairs
for pair in pairs:
  print ' '.join(pair)

The operation of this program is simple. Each line of the input is broken up into its component words; the list of these words is then iterated over by numerical index. The slice of the word list containing the word at the current index and the word that follows it is added to the pairs list; by the end of the outer for loop, pairs will be a list of word pairs, with the pairs themselves represented as lists. Finally, we loop over pairs, joining the words of each pair with a space and printing the result.

This program’s output for genesis.txt looks like this:

In the
the beginning
beginning God
God created
created the
the heaven
heaven and
and the
the earth.
And the
the earth
earth was
...

Mini-exercise: What UNIX utility could you use to take the output of this program and produce a list of unique word pairs?

The next program (count_word_pairs.py) takes an extra step, and keeps track not just of which word pairs occur, but how many times each pair occurs. With this program, we can begin to see which word pairs are the most common in a given source text.

import sys

# will be a dictionary with tuples as keys, integers as values
pairs = dict()

for line in sys.stdin:

  # strip line, break into words
  line = line.strip()
  words = line.split(" ")

  # skip this line if there are fewer than two words on it
  if len(words) < 2:
    continue

  # iterate from 0 up to the length of the list minus one (so we don't
  # inadvertently grab elements from beyond the length of the list)
  for i in range(len(words) - 1):
    # find the pair; convert to tuple so we can use it as a dictionary key
    pair = tuple(words[i:i+2])
    if pair in pairs:
      pairs[pair] += 1
    else:
      pairs[pair] = 1

# print out pairs
for pair in pairs.keys():
  count = pairs[pair]
  if count > 1: # only print pairs that occur more than once
    print ' '.join(pair) + ": " + str(count)

Mechanically, this program functions a lot like the last one: loop over every word in each line of input, extracting each pair of words in numerical order. The difference is that we’re storing the word pairs in a dictionary, rather than a list, with the word pairs themselves being the key, and the count stored as the value for that key.

Note the use of tuple here. We want to be able to use the pairs themselves as keys, but we can’t do that if they’re simply slices of the word list, since lists can’t be dictionary keys. In order to get around this limitation, we convert each word pair slice to a tuple, which acts like a list in most ways and can additionally be used as a dictionary key. (See the discussion of tuples from session 06 for more information.)
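
To see the difference for yourself, here's a quick sketch, separate from the example programs, illustrating why the conversion is necessary:

counts = dict()

pair_as_tuple = ('the', 'waters')
counts[pair_as_tuple] = 1       # works: tuples can be dictionary keys

pair_as_list = ['the', 'waters']
# counts[pair_as_list] = 1      # would raise TypeError: unhashable type: 'list'

print counts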

The output of this program specifically excludes any word pair that occurs only once. The output, using genesis.txt again, looks something like this:

bring forth: 3
the fowl: 2
were the: 6
the waters: 8
the sea,: 2
that it: 6
which were: 2
was so.: 6
the morning: 6
the firmament: 5
and God: 7
...

From this, we can ascertain that the phrase “the waters” occurs eight times in the source text, the phrase “and God” occurs seven times, etc.
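
The program prints the pairs in whatever order the dictionary happens to yield them. If you wanted the most common pairs first, one possible approach (not part of count_word_pairs.py as written) is to replace the final loop with something like this:

# print out pairs, most frequent first
sorted_pairs = sorted(pairs.items(), key=lambda item: item[1], reverse=True)
for pair, count in sorted_pairs:
  if count > 1: # only print pairs that occur more than once
    print ' '.join(pair) + ": " + str(count)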

From word pairs to arbitrary n-grams: ngram_count.py

Now that we have our heads wrapped around finding word pairs, it shouldn’t be hard to expand our program to find and count n-grams of arbitrary length. As an additional step, we’ll create a class to do the work, so that we can easily use our n-gram finding/counting code in other programs. Such a class is defined in ngram_count.py. Here’s an excerpt:

# NGramCounter builds a dictionary relating ngrams (as tuples) to the number
# of times that ngram occurs in a text (as integers)
class NGramCounter(object):

  # parameter n is the 'order' (length) of the desired n-gram
  def __init__(self, n):
    self.n = n
    self.ngrams = dict()

  # tokenize breaks the given string up into units (here, words); the feed
  # method calls it before counting n-grams
  def tokenize(self, text):
    return text.split(" ")

  # feed method takes text, tokenizes it, and visits every group of n tokens
  # in turn, adding the group to self.ngrams or incrementing count in same
  def feed(self, text):

    tokens = self.tokenize(text)

    # e.g., for a list of length 10 and n of 4, range(10 - 4 + 1) yields i
    # from 0 to 6; the last slice, tokens[6:10], is the final four elements
    for i in range(len(tokens) - self.n + 1):
      gram = tuple(tokens[i:i+self.n])
      if gram in self.ngrams:
        self.ngrams[gram] += 1
      else:
        self.ngrams[gram] = 1

  def get_ngrams(self):
    return self.ngrams

if __name__ == '__main__':

  import sys

  # create an NGramCounter object and feed data to it
  ngram_counter = NGramCounter(4)
  for line in sys.stdin:
    line = line.strip()
    ngram_counter.feed(line)

  # get ngrams from ngram counter; iterate over keys, printing out keys with
  # a count greater than one
  ngrams = ngram_counter.get_ngrams()
  for ngram in ngrams.keys():
    count = ngrams[ngram]
    if count > 1:
      print ' '.join(ngram) + ": " + str(count)

The bulk of the interesting functionality in this class is in the feed method, which takes a string as input, breaks it into a list of units (“tokens”), then finds each successive group of n units (where n is the parameter passed to the __init__ function). If an n-gram hasn’t been seen before, it’s added as a key to self.ngrams, with a value of 1; on subsequent sightings, the value gets incremented.

(Note here again how we’re using tuple to convert the list slice to a tuple.)

Take special note of the tokenize function, which is called by feed to break the input into a list of units.

Exercise: Modify the NGramCounter class to use characters as units instead of words. Could you do this by subclassing NGramCounter? What function(s) would you have to override?

Quiz time! What parameter would we have to pass to NGramCounter’s __init__ method in order to get the program above to behave exactly like count_word_pairs.py?

N-gram models: what comes next?

Now that we have the ability to find and record the n-grams in a text, it’s time to take our analysis one step further. The next question we’re going to try to answer is this: Given a particular n-gram in a text, what unit is most likely to come next?

We can imagine the kind of algorithm we’ll need to extract this information from the text. It will look very similar to the code in NGramCounter above, but it will need to keep track not just of the n-grams but also a list of all units (word, character, whatever) that follow those n-grams.

Let’s do a quick example by hand. This is the same character-level order-2 n-gram analysis of the (very brief) text “condescendences” as above, but this time keeping track of all characters that follow each n-gram:

n-gram   what follows?
co       n
on       d
nd       e, e
de       s, n
es       c, (end of text)
sc       e
ce       n, s
en       d, c
nc       e
From this table, we can determine that the n-gram co is followed by n 100% of the time, and the n-gram on is followed by d 100% of the time, while the n-gram de is followed by s 50% of the time and by n the other 50%. Likewise, the n-gram es is followed by c 50% of the time, and by the end of the text the other 50% of the time.
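
Here's a minimal sketch of how we might build this kind of table in code; think of it as a character-level warm-up for the MarkovGenerator class we'll examine below:

text = "condescendences"
n = 2
model = dict()

for i in range(len(text) - n + 1):
  gram = text[i:i+n]
  # the unit that follows this n-gram; None stands in for "end of text"
  if i + n < len(text):
    following = text[i+n]
  else:
    following = None
  if gram in model:
    model[gram].append(following)
  else:
    model[gram] = [following]

for gram in model.keys():
  print gram, model[gram]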

Generative text with Markov chains

The table above doesn’t just give us some interesting statistical data. It also allows us to reconstruct the underlying text—or, at least, generate a text that is statistically similar to the original text. Here’s how we’ll do it:

  1. Start with the initial n-gram (co); those are the first two characters of our output.
  2. Look at the last n characters of the output, where n is the order of the n-grams in our table, and find those characters in the “n-gram” column.
  3. Choose randomly among the possibilities in the corresponding “what follows?” column, and append that letter to the output. (Sometimes, as with co, there’s only one possibility.)
  4. If you chose “end of text,” the algorithm is over. Otherwise, repeat the process starting with step 2.

Here’s a record of the algorithm in action:

co
con
cond
conde
conden
condend
condende
condendes
condendesc
condendesce
condendesces

As you can see, we’ve come up with a word that looks like the original word, and could even be passed off as a genuine English word (if you squint at it). From a statistical standpoint, the output of our algorithm is nearly indistinguishable from the input.
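
Here's a rough sketch of that procedure in code, reusing the hand-built model from above; the general-purpose version is the MarkovGenerator class discussed below:

from random import choice

# build the same order-2 character model as in the table above; None stands
# in for "end of text"
text = "condescendences"
n = 2
model = dict()
for i in range(len(text) - n + 1):
  gram = text[i:i+n]
  if i + n < len(text):
    following = text[i+n]
  else:
    following = None
  if gram in model:
    model[gram].append(following)
  else:
    model[gram] = [following]

# start with the first n characters, then repeatedly look up the last n
# characters of the output and append a randomly chosen follower
output = text[:n]
print output
while True:
  followers = model[output[-n:]]
  chosen = choice(followers)
  if chosen is None: # we happened to choose "end of text"
    break
  output = output + chosen
  print output

Depending on the random choices, this might print the original word itself, or variants like condendesces or condescences.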

This kind of algorithm—moving from one state to the next, according to a weighted list of possibilities—is known as a Markov chain. For our purposes, the term “Markov chain” is synonymous with “text generated from n-gram model probability tables,” but the field of research is actually much richer than that. A starting point for the interested: Monopoly as a Markov chain.

MarkovGenerator.py

The MarkovGenerator class in MarkovGenerator.py implements a general-purpose n-gram-based text generator, using a Markov-like algorithm. It’s the most sophisticated program we’ve created so far. Before we drill down into a line-by-line explanation of the code, let’s look at all of the methods of MarkovGenerator and what they do:

  • def __init__(self, n, max): takes two parameters: n, which determines the order of the n-grams that the object will look for, and max, which defines the maximum number of units the generate method will generate
  • def feed(self, text): takes a string as input, which it parses into units and n-grams. Each n-gram is examined and put into self.ngrams as a key; the unit that follows the n-gram is appended to the list that is the value for the n-gram’s key in self.ngrams. Also builds self.beginnings, a list of all n-grams that occurred at the beginning of a line
  • def tokenize(self, text): used by feed to break incoming text into units
  • def generate(self): uses self.ngrams to generate a text. It starts with a random entry from self.beginnings and builds a list from there, grabbing a random entry from the list in self.ngrams for the key defined by the last n elements of the output.
  • def concatenate(self, source): used by generate to join together the output elements

Here’s how we might use the MarkovGenerator class. We instantiate it, feed some input to it via the feed method, then finally ask it to generate some stuff with generate.

import sys
import markov

generator = markov.MarkovGenerator(n=3, max=500)
for line in sys.stdin:
  line = line.strip()
  generator.feed(line)

for i in range(5):
  print generator.generate()

More detail: feed and generate

Let’s take a closer look at feed and generate. First, feed:

  def feed(self, text):

    tokens = self.tokenize(text)

    # discard this line if it's too short
    if len(tokens) < self.n:
      return

    # store the first ngram of this line
    beginning = tuple(tokens[:self.n])
    self.beginnings.append(beginning)

    for i in range(len(tokens) - self.n):

      gram = tuple(tokens[i:i+self.n])
      next = tokens[i+self.n] # get the element after the gram

      # if we've already seen this ngram, append; otherwise, set the
      # value for this key as a new list
      if gram in self.ngrams:
        self.ngrams[gram].append(next)
      else:
        self.ngrams[gram] = [next]

The purpose of feed is to find all of the n-grams in the source text and all of the units that directly follow those n-grams.

After tokenizing the incoming string, we check to see if the tokens list is long enough (we can’t find an n-gram of length n unless there are at least n elements in the list!). Then we store the first n-gram in the tokens list as a tuple in self.beginnings—we’ll use this list later when we’re generating text.

Then we loop through every numerical index of the tokens list, up to the length of the list minus the order of the desired n-grams (this is to prevent looking for elements outside the bounds of the list). The current n-gram is identified (and stored in gram), along with the unit that directly follows it (in next). We look up gram in the dictionary; if it’s there, we append next to the list of possible units that follow that n-gram. If not, we create a new list and assign it to that key.

After you’re done calling feed, self.ngrams will be a dictionary with tuples (your n-grams) as keys, each with a list as a value, which contains all of the words that follow that n-gram in the text. It looks something like this:

{ ... ('every', 'green', 'herb'): ['for'], ('rule', 'the', 'night:'): ['he'],
('to', 'give', 'light'): ['upon', 'upon'], ('the', 'air,', 'and'): ['over', 'over', 'to'],
('in', 'the', 'image'): ['of'], ('tree', 'yielding', 'fruit,'): ['whose'],
('that', 'it', 'was'): ['good:', 'good.', 'good.', 'good.', 'good.', 'good.'], ... }

The generate method uses this data structure to create a new text. Here’s the source code:

  # generate a text from the information in self.ngrams
  def generate(self):

    from random import choice

    # get a random line beginning; convert to a list. 
    current = choice(self.beginnings)
    output = list(current)

    for i in range(self.max):
      if current in self.ngrams:
        possible_next = self.ngrams[current]
        next = choice(possible_next)
        output.append(next)
        # get the last N entries of the output; we'll use this to look up
        # an ngram in the next iteration of the loop
        current = tuple(output[-self.n:])
      else:
        break

    output_str = self.concatenate(output)
    return output_str

The output variable is the running result of the generation process. Initially, we set it to be a random n-gram that occurred at the beginning of a line in feed. (This is an aesthetic choice: for poetic texts, this tends to produce more coherent results.)

From there, the loop in generate runs up to self.max times. On each pass, it looks up the current n-gram (the last n elements of the output so far) in self.ngrams. If it’s there, it picks one of the units that followed that n-gram in the source text at random, appends it to the output, and updates current. If it isn’t there (which happens when the n-gram only ever occurred at the very end of a line, and so never had a recorded follower), the loop stops early. Finally, concatenate joins the accumulated units back into a single string, which generate returns.

Differences in order and level

General rules of thumb:

  • A higher order will generally create more comprehensible text. If your order is too high, however, you risk having your generated text repeat large portions of the source text. (A sketch for experimenting with different orders follows this list.)
  • Word-level n-grams are generally more comprehensible, but tend to repeat the source text at much lower orders; you need a lot of source text to make word-level n-grams differ significantly from the source text. Character-level n-grams require less text to create interesting results, but you might find that the generation algorithm produces non-words from time to time (see “condendesces” above).
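
As a starting point for that kind of experimentation, here's a small sketch that assumes the markov module (containing the MarkovGenerator class described above) and generates output from the same source text at several different orders:

import sys
import markov

# read standard input once so we can feed the same text to several generators
lines = list()
for line in sys.stdin:
  lines.append(line.strip())

for n in [1, 2, 3, 4]:
  generator = markov.MarkovGenerator(n=n, max=50)
  for line in lines:
    generator.feed(line)
  print "order " + str(n) + ":"
  print generator.generate()
  print ""

Pipe the same source file into this program a few times and compare how quickly the higher orders start reproducing the source verbatim.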

Exercises

  • Create a text by feeding an existing text into the Markov chain algorithm. Justify your source text—what text did you use, and why? What effect does the value of n (the “order” of the n-gram) have on the resulting text?
  • Rework any of the example programs to use something other than text (or, at least, text that represents language) as its basic unit. For example: musical notes, songs in playlists, pixels in an image, etc. (An example.)
  • Modify the MarkovGenerator class to ignore case and punctuation. How does the quality of the resulting texts differ from the original?
  • Use the MarkovGenerator class to create a new text, using the output of your web API/screen scraping program as input. (You could do this either by piping the output of your web-scraping program to the input of your MarkovGenerator program, or by making a secondary version of your web-scraping program that uses the MarkovGenerator class.)
  • Write a program that uses the MarkovGenerator class. Make your program accept parameters on the command line for n and max.
  • Develop a program that compares two text files based on how similar their n-grams are.