N-grams and Markov chains

Rationale

We’ve already developed code that helps to answer the question: How often does a particular unit (character, word) occur in a text? This week, we attempt to answer a slightly more sophisticated question: Which units co-occur in a text, and in which order? This is our first foray into a syntactic analysis of text. Our eventual goal is to be able to use this syntactic model to generate a text that shares statistical properties with a given source text—not just in the distribution of units, but in the sequence of those units.

N-Grams

The first kind of text analysis that we’ll look at today is an n-gram model. An n-gram is simply a sequence of units drawn from a longer sequence; in the case of text, the unit in question is usually a character or a word. The unit of the n-gram is called its level; the length of the n-gram is called its order. For example, the following is a list of all unique character-level order-2 n-grams in the word condescendences:

co
on
nd
de
es
sc
ce
en
nc

And the following is an excerpt from the list of all unique word-level order-5 n-grams from frost.txt:

whose woods these are i
woods these are i think
these are i think i
are i think i know
i think i know his
...

N-grams are used frequently in natural language processing and are a basic tool of text analysis. Their applications range from programs that correct spelling to creative visualizations to compression algorithms to generative text. They can be used as the basis of a Markov chain algorithm; in fact, that’s one of the applications we’ll be using them for later in this lesson.

Baby steps: finding and counting word pairs

First off, let’s go after a simple task: finding word pairs in a text. A word pair is essentially a word-level order-2 n-gram; once we have code to find word pairs, we’ll generalize it to handle n-grams of any order. The following program (word_pairs.py) will print out a list of each pairing of two words in the input text:

import sys

# will be a list of lists
pairs = list()

for line in sys.stdin:

  # strip line, break into words
  line = line.strip()
  words = line.split(" ")

  # skip this line if there are fewer than two words on it
  if len(words) < 2:
    continue

  # iterate from 0 up to the length of the list minus one (so we don't
  # inadvertently grab elements from beyond the length of the list)
  for i in range(len(words) - 1):
    pair = words[i:i+2] # i.e., the element at i and the next element
    pairs.append(pair)

# print out pairs
for pair in pairs:
  print ' '.join(pair)

The operation of this program is simple. Each line of the input is broken up into its component words; the list of these words is then iterated over by numerical index. The slice of the word list containing the word at the current index and the word at the next index is appended to the pairs list; by the end of the outer for loop, pairs is a list of word pairs, each pair itself represented as a list. Finally, we loop over pairs, joining each entry back together and printing it out.

This program’s output for genesis.txt looks like this:

In the
the beginning
beginning God
God created
created the
the heaven
heaven and
and the
the earth.
And the
the earth
earth was
...

Mini-exercise: What UNIX utility could you use to take the output of this program and produce a list of unique word pairs?

The next program (count_word_pairs.py) takes an extra step, and keeps track not just of which word pairs occur, but how many times each pair occurs. With this program, we can begin to see which word pairs are the most common in a given source text.

import sys

# will be a dictionary with tuples as keys, integers as values
pairs = dict()

for line in sys.stdin:

  # strip line, break into words
  line = line.strip()
  words = line.split(" ")

  # skip this line if there are fewer than two words on it
  if len(words) < 2:
    continue

  # iterate from 0 up to the length of the list minus one (so we don't
  # inadvertently grab elements from beyond the length of the list)
  for i in range(len(words) - 1):
    # find the pair; convert to tuple so we can use it as a dictionary key
    pair = tuple(words[i:i+2])
    if pair in pairs:
      pairs[pair] += 1
    else:
      pairs[pair] = 1

# print out pairs
for pair in pairs.keys():
  count = pairs[pair]
  if count > 1: # only print pairs that occur more than once
    print ' '.join(pair) + ": " + str(count)

Mechanically, this program functions a lot like the last one: loop over every word in each line of input, extracting each pair of words in numerical order. The difference is that we’re storing the word pairs in a dictionary, rather than a list, with the word pairs themselves being the key, and the count stored as the value for that key.

Note the use of tuple here. We want to be able to use the pairs themselves as keys, but we can’t do that if they’re simply slices of the word list, since lists can’t be dictionary keys. In order to get around this limitation, we convert each word pair slice to a tuple, which acts like a list in most ways and can additionally be used as a dictionary key.
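If you’re curious what actually goes wrong with a list key, here’s the behavior in the interactive interpreter (the exact wording of the error may vary slightly between Python versions):

>>> pairs = dict()
>>> pairs[['hello', 'world']] = 1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'
>>> pairs[('hello', 'world')] = 1
>>> pairs[('hello', 'world')]
1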

The output of this program specifically excludes any word pair that occurs only once. The output, using genesis.txt again, looks something like this:

bring forth: 3
the fowl: 2
were the: 6
the waters: 8
the sea,: 2
that it: 6
which were: 2
was so.: 6
the morning: 6
the firmament: 5
and God: 7
...

From this, we can ascertain that the phrase “the waters” occurs eight times in the source text, the phrase “and God” occurs seven times, etc.

From word pairs to arbitrary n-grams: ngramcount.py

Now that we have our heads wrapped around finding word pairs, it shouldn’t be hard to expand our program to find and count n-grams of arbitrary length. As an additional step, we’ll create a Python module, so that we can easily use our n-gram finding/counting code in other programs. Such a module is defined in ngramcount.py. Here’s an excerpt:

def count_ngrams(tokens, n):
  """Returns a dictionary mapping tuples of n-grams with order n to the number
    of times that n-gram occurred in list tokens."""
  ngrams = dict()
  if len(tokens) < n:
    return ngrams
  for i in range(len(tokens) - n + 1):
    ngram = tuple(tokens[i:i+n])
    if ngram not in ngrams:
      ngrams[ngram] = 0
    ngrams[ngram] += 1
  return ngrams

def merge_counts(ngrams_list):
  """Returns a dictionary with counts merged from a list of n-gram count
    dictionaries (such as that returned from count_ngrams)."""
  merged = dict()
  for ngrams in ngrams_list:
    for key, val in ngrams.iteritems():
      if key not in merged:
        merged[key] = 0
      merged[key] += val
  return merged

if __name__ == '__main__':
  import sys
  n = int(sys.argv[1])
  counts = list()
  for line in sys.stdin:
    line = line.strip()
    counts.append(count_ngrams(list(line), n))
  combined_counts = merge_counts(counts)
  sorted_counts = sorted(combined_counts.items(), key=lambda x: x[1],
      reverse=True)
  for key, val in sorted_counts:
    print ''.join(key) + ": " + str(val)

The bulk of the interesting functionality in this module is in the count_ngrams function, which takes a list of units (“tokens”) as input and finds each successive group of n units. If an n-gram hasn’t been seen before, it’s added as a key to ngrams with a count of one; on subsequent sightings, the count gets incremented. The function returns the dictionary containing the n-grams and their counts. (Note here again how we’re using tuple to convert the list slice to a tuple, so that it can serve as a dictionary key.)
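For example, here’s what a quick test in the interactive interpreter might look like, using the order-2 character-level analysis of “condescendences” from earlier (this assumes ngramcount.py is in the current directory):

>>> import ngramcount
>>> counts = ngramcount.count_ngrams(list("condescendences"), 2)
>>> len(counts)
9
>>> counts[('n', 'd')]
2
>>> counts[('c', 'o')]
1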

The merge_counts function merges together multiple dictionaries returned from the count_ngrams function. This is helpful if you want to combine the n-gram analyses of two different texts.
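Continuing the hypothetical interpreter session above, merging the counts from two short word-level analyses might look like this:

>>> a = ngramcount.count_ngrams("the cat sat".split(), 2)
>>> b = ngramcount.count_ngrams("the cat ran".split(), 2)
>>> merged = ngramcount.merge_counts([a, b])
>>> merged[('the', 'cat')]
2
>>> merged[('cat', 'sat')]
1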

In the __main__ section of this module, we create a list of n-gram count dictionaries, one for each line in standard input. We then merge all of these n-gram counts together, to give an n-gram analysis of the entire text. (You could just call the count_ngrams function on a string containing the entire file! But then the n-gram analysis would also include newline characters, which usually isn’t what we want when working with line-based text like poetry.) Here’s the (truncated) result of running the program on sea_rose.txt with an n of 4:

the : 5
 the: 5
n th: 5
are : 4
n a : 3
 are: 3
 lea: 3
you : 3
ou a: 3
u ar: 3
leaf: 3
rose: 3
in t: 3
 in : 3
 of : 2
with: 2
 ros: 2
th s: 2
sand: 2
 on : 2
ose,: 2
...

Exercise: Write a program that uses the count_ngrams function to count word-level n-grams (instead of character-level n-grams, as illustrated in the __main__ part of the module above).

Quiz time! What parameter would we have to pass to count_ngrams to get the program above to behave exactly like count_word_pairs.py?

N-gram models: what comes next?

Now that we have the ability to find and record the n-grams in a text, it’s time to take our analysis one step further. The next question we’re going to try to answer is this: Given a particular n-gram in a text, what unit is most likely to come next?

We can imagine the kind of algorithm we’ll need to extract this information from the text. It will look very similar to the code in ngramcount.py above, but it will need to keep track not just of the n-grams but also of a list of all units (word, character, whatever) that follow those n-grams.

Let’s do a quick example by hand. This is the same character-level order-2 n-gram analysis of the (very brief) text “condescendences” as above, but this time keeping track of all characters that follow each n-gram:

n-grams next?
co n
on d
nd e, e
de s, n
es c, (end of text)
sc e
ce n, s
en d, c
nc e

From this table, we can determine that the n-gram co is followed by n 100% of the time, and the n-gram on is followed by d 100% of the time, while the n-gram de is followed by s 50% of the time, and by n the rest of the time. Likewise, the n-gram es is followed by c 50% of the time, and followed by the end of the text the other 50% of the time.
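Here’s a minimal sketch of one way to compute such a table for a character-level analysis (build_followers is a name made up for this illustration; a more general version appears in the markov module below):

def build_followers(text, n):
  # map each character-level n-gram in text to a list of the characters
  # that follow it; None stands in for "end of text"
  followers = dict()
  for i in range(len(text) - n):
    followers.setdefault(text[i:i+n], []).append(text[i+n])
  followers.setdefault(text[-n:], []).append(None)
  return followers

Testing it on our example word:

>>> table = build_followers("condescendences", 2)
>>> table["de"]
['s', 'n']
>>> table["es"]
['c', None]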

Generative text with Markov chains

The table above doesn’t just give us some interesting statistical data. It also allows us to reconstruct the underlying text—or, at least, generate a text that is statistically similar to the original text. Here’s how we’ll do it: (1) start with the initial n-gram (co)—those are the first two characters of our output. (2) Now, look at the last n characters of output, where n is the order of the n-grams in our table, and find those characters in the “n-grams” column. (3) Choose randomly among the possibilities in the corresponding “next” column, and append that letter to the output. (Sometimes, as with co, there’s only one possibility). (4) If you chose “end of text,” then the algorithm is over. Otherwise, repeat the process starting with (2). Here’s a record of the algorithm in action:

co
con
cond
conde
conden
condend
condende
condendes
condendesc
condendesce
condendesces

As you can see, we’ve come up with a word that looks like the original word, and could even be passed off as a genuine English word (if you squint at it). From a statistical standpoint, the output of our algorithm is nearly indistinguishable from the input.

This kind of algorithm, which moves from one state to the next according to a weighted list of possibilities, is known as a Markov chain. For our purposes, the term “Markov chain” is synonymous with “text generated from n-gram model probability tables,” but the field of research is actually much richer than that. A starting point for the interested: Monopoly as a Markov chain.

markov.py

The markov module implements a general-purpose n-gram-based text generator, using a Markov-like algorithm. It’s the most sophisticated program we’ve created so far. Before we drill down into a line-by-line explanation of the code, let’s look at all of the functions in the markov module to see what they do:

  • build_model(tokens, n): takes two parameters: tokens, a list of items we want to model as a Markov chain, and n, which determines the order of the n-grams that the function will look for. This function returns a dictionary data structure that maps each n-gram in the token list to a list of tokens that follow that n-gram (like the data structure we built from “condescendences” above).
  • generate(model, n, seed=None, max_iterations=100): uses a model returned from build_model to generate a text. It starts with the n-gram given as seed and builds a list from there, grabbing a random entry from the list in model for the key defined by the last n elements of the output.
  • merge_models(models): merges together multiple models returned from build_model.
  • generate_from_token_lists(token_lines, n, count=14, max_iterations=100): generates text from a list of lists of tokens. This function is intended for input text where each line forms a distinct unit (e.g., poetry), and where the desired output is to recreate lines in that form. It does this by keeping track of the n-gram that comes at the beginning of each line, and then only generating lines that begin with one of these “beginnings.” It also builds a separate Markov model for each line, and then merges those models together, to ensure that lines end with n-grams statistically likely to end lines in the original text.
  • The char_level_generate and word_level_generate functions are convenience functions to generate Markov chain texts from lists of strings, using character-level and word-level n-grams, respectively.

Here’s how we might use the functions in the markov module: build a model from some tokens using build_model, then generate a list of tokens from that model with generate. For example:

import sys
import markov

text = sys.stdin.read()
model = markov.build_model(text.split(), 3)
generated = markov.generate(model, 3)
print ' '.join(generated)

More detail: build_model and generate

Let’s take a closer look at build_model and generate. First, build_model:

def build_model(tokens, n):
  "Builds a Markov model from the list of tokens, using n-grams of length n."
  model = dict()
  if len(tokens) < n:
    return model
  for i in range(len(tokens) - n):
    gram = tuple(tokens[i:i+n])
    next_token = tokens[i+n]
    if gram in model:
      model[gram].append(next_token)
    else:
      model[gram] = [next_token]
  final_gram = tuple(tokens[len(tokens)-n:])
  if final_gram in model:
    model[final_gram].append(None)
  else:
    model[final_gram] = [None]
  return model

The purpose of build_model is to find all of the n-grams in the source text and all of the units that directly follow those n-grams.

First, we create an empty dictionary to hold the n-grams and their following units. We check to see if the tokens list is long enough (we can’t find an n-gram of length n unless there are at least n elements in the list!).

Then we loop through every numerical index of the tokens list, up to the length of the list minus the order of the desired n-grams (this is to prevent looking for elements outside the bounds of the list). The current n-gram is identified (and stored in gram), along with the unit that directly follows it (in next_token). We look up gram in the dictionary; if it’s there, we append next_token to the list of possible units that follow that n-gram. If not, we create a new list containing next_token and assign it to that key.

Finally, we have a special case for the last n-gram in the text: we add None as a potential following token for this n-gram. This ensures that n-grams that end texts are equally likely to end any generation procedure that uses the model. (Without this safeguard, the generation process can easily lead to weird run-on strings that look nothing like your input.)

The build_model function returns a dictionary with tuples (your n-grams) as keys, each with a list as a value, which contains all of the words that follow that n-gram in the text. It looks something like this:

{ ... ('every', 'green', 'herb'): ['for'], ('rule', 'the', 'night:'): ['he'],
('to', 'give', 'light'): ['upon', 'upon'], ('the', 'air,', 'and'): ['over', 'over', 'to'],
('in', 'the', 'image'): ['of'], ('tree', 'yielding', 'fruit,'): ['whose'],
('that', 'it', 'was'): ['good:', 'good.', 'good.', 'good.', 'good.', 'good.'], ... }

The generate function uses this data structure to create a new text. Here’s the source code:

def generate(model, n, seed=None, max_iterations=100):
  """Generates a list of tokens from information in model, using n as the
    length of n-grams in the model. Starts the generation with the n-gram
    given as seed. If more than max_iterations iterations are reached, the
    process is stopped. (This is to prevent infinite loops)"""
  if seed is None:
    seed = random.choice(model.keys())
  output = list(seed)
  current = tuple(seed)
  for i in range(max_iterations):
    if current in model:
      possible_next_tokens = model[current]
      next_token = random.choice(possible_next_tokens)
      if next_token is None: break
      output.append(next_token)
      current = tuple(output[-n:])
    else:
      break
  return output

The output variable is the running result of the generation process. Initially, we set it to be the “seed” text passed into the function. (If the seed isn’t specified, we simply select a random key from the model.) The current variable keeps track of which n-gram we’re currently generating from, and is updated in each iteration of the loop. In the loop, we check to see if our current n-gram is found in the Markov model; if it is, we select a random next token from those listed in the model for that n-gram, and append it to the output. We then set current to the new n-gram (at the end of the output). If the n-gram isn’t found in the model, or if the randomly selected token is None, we stop generating early. Finally, the output is returned.

Here’s a transcript of using these two functions in the interactive interpreter:

>>> import markov
>>> text = open("../sea_rose.txt").read()
>>> model = markov.build_model(text, 3)
>>> markov.generate(model, 3)
['l', 'l', ' ', 'l', 'e', 'a', 'f', '?', '\n']
>>> markov.generate(model, 3)
['a', 'f', ',', ' ', '\n', 's', 'i', 'n', 'g', 'l', 'e', ' ', 'o', 'n', ' ', 'a', ' ', 's', 't', 'e', 'm', ' ', '-', '-', ' ', '\n', 'y', 'o', 'u', ' ', 'a', 'r', 'e', ' ', 'c', 'r', 'i', 's', 'p', ' ', 's', 'a', 'n', 'd', ' ', 'w', 'i', 't', 'h', ' ', 's', 'm', 'a', 'l', 'l', ' ', 'l', 'e', 'a', 'f', ',', '\n', '\n', 'm', 'o', 'r', 'e', ' ', 'p', 'r', 'e', 'c', 'i', 'o', 'u', 's', ' ', '\n', 't', 'h', 'a', 'n', ' ', 't', 'h', 'e', ' ', 'w', 'i', 't', 'h', ' ', 's', 't', 'i', 'n', 't', ' ', 'o', 'f', ' ', 'l', 'e']
>>> print ''.join(markov.generate(model, 3))
 
meagre of leaf, 
single on the spice-rose 
drift.

Can the drip sand, with stem -- 
you are of leaf, 
>>> print ''.join(markov.generate(model, 3))
 and, 
you are lift.

Stunted in the spice-rose, harsh rose 
spare caught in the drifted 
in the sand 


As noted above, the char_level_generate and word_level_generate functions are shorthand functions for generating text from character-level and word-level (respectively) n-grams, based on lists of strings passed to the function. (These functions use the generate_from_token_lists function internally, which takes care to retain beginnings of lines as “seeds” for generation. This makes for slightly more coherent output, especially when working with poetic text.)
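If you’re wondering what that strategy looks like in code, here’s a minimal sketch of it; this is not the module’s actual source, just an illustration that assumes build_model, generate, and merge_models behave as described above (the generate_lines name is made up for this example):

import random
import markov

def generate_lines(token_lines, n, count=14, max_iterations=100):
  # remember the n-gram that begins each (sufficiently long) line; these
  # serve as seeds, so every generated line starts like some original line
  beginnings = [tuple(tokens[:n]) for tokens in token_lines if len(tokens) >= n]
  # build one model per line and merge them, so that line-final n-grams map
  # to None and can bring a generated line to a close
  models = [markov.build_model(tokens, n) for tokens in token_lines
            if len(tokens) >= n]
  combined = markov.merge_models(models)
  # each generated line comes back as a list of tokens
  generated = list()
  for i in range(count):
    seed = random.choice(beginnings)
    generated.append(markov.generate(combined, n, seed, max_iterations))
  return generated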

Here’s an example of using both of these functions in the interactive interpreter:

>>> import markov
>>> lines = list()
>>> for line in open("../sea_rose.txt"):
...     line = line.strip()
...     lines.append(line)
... 
>>> markov.char_level_generate(lines, 3, count=4)
['single on a leaf,', '', 'drift.', 'meagrance']
>>> markov.word_level_generate(lines, 3, count=4)
['single on a stem --', 'you are flung on the sand,', '', 'more precious']
>>> print '\n'.join(markov.char_level_generate(lines, 6, count=8))
Stunted, with small leaf,
hardened in a leaf?
marred and with stint of petals,
meagre flower, thin,
that drives in the spice-rose
meagre flower, thin,

marred and with small leaf,
>>> print '\n'.join(markov.word_level_generate(lines, 2, count=8))
more precious
Stunted, with small leaf,


you are flung on the sand,
than a wet rose
Rose, harsh rose,
marred and with stint of petals,

Differences in order and level

General rules of thumb:

  • A higher order will generally create more comprehensible text. If your order is too high, however, you risk having your generated text repeat large portions of the source text.
  • Word-level n-grams are generally more comprehensible, but tend to repeat the source text at much lower orders; you need a lot of source text to make word-level n-grams differ significantly from the source text. Character-level n-grams require less text to create interesting results, but you might find that the generation algorithm produces non-words from time to time (see “condendesces” above).

Some relevant work

Exercises

  • Create a text by feeding an existing text into the Markov chain algorithm. Justify your source text—what text did you use, and why? What effect does the value of n (the “order” of the n-gram) have on the resulting text?
  • Rework any of the example programs to use something other than text (or, at least, something other than text that represents language) as its basic unit. For example: musical notes, songs in playlists, pixels in an image, etc. (An example.)
  • Modify the markov module to ignore case and punctuation. How does the quality of the resulting texts differ from the original?
  • Use the markov library to create a new text, using the output of your web API/screen scraping program as input. (You could do this either by piping the output of your web-scraping program to the input of your markov program, or by making a secondary version of your web-scraping program that uses the markov module.)
  • Write a program that uses the markov module. Make your program accept parameters on the command line for n and max.
  • Develop a program that compares two text files based on how similar their n-grams are.
