Concordance: Complete Example Code

The Concordance, a Complete Example

The preliminaries: HashSets, HashMaps.

Now we can get a list of unique words from a text, and we can count how many times a particular word occurs in a text. The next step is to store not just how many times a word occurs, but also in which contexts a word occurs.

Our final concordance example is split across three files—a first for this course. We’ve broken up the functionality of our program into three classes:

  • ConcordanceFilter: interacts with input/output, sends incoming lines to Concordance object for analysis, prints out results
  • Concordance: does the textual analysis, stores words and their contexts, generates (but does not display) results
  • WordContext: a basic class to store a word along with its contexts (these are the values of the HashMap declared in the Concordance class)

Why bother splitting up the functionality like this? Our goal here is to make code that is reusable. Once we’ve written a fully-functioning Concordance class—one that doesn’t depend on any particular input/output methods to function—we can use it in all sorts of contexts: not just from the command line, but also in a Processing sketch, or a Twitter bot, or in a program that listens to your Arduino that’s capturing radio messages from weather balloons (or whatever).

To run our concordance program, type java ConcordanceFilter searchword <your_source_text.txt on the command line, replacing your_source_text.txt with the text that you want to analyze and searchword with the word whose contexts you wish to view. Given genesis.txt as an input, and beast as a search word, we’d expect the following output:

And God said, Let the earth bring forth the living creature after his kind, cattle, and creeping thing, and beast of the earth after his kind: and it was so.

And God made the beast of the earth after his kind, and cattle after their kind, and every thing that creepeth upon the earth after his kind: and God saw that it was good.

And to every beast of the earth, and to every fowl of the air, and to every thing that creepeth upon the earth, wherein there is life, I have given every green herb for meat: and it was so.

Many classes, many source files

A few things to keep in mind when working with multiple classes in Java:

  • Despite what you might have learned from Processing, when you’re working with (pure) Java, each public class must be defined in its own file, and that file must be named after the class (ConcordanceFilter lives in ConcordanceFilter.java). We’re using three separate classes in this example, which means we’re working with three separate files. When you’re compiling these examples, make sure that you’ve compiled all of the relevant source files!
  • Only one class gets its main method run: whichever class you pass to the java command on the command line. You’ll note that neither the Concordance class nor the WordContext class even has a main method, so these classes can’t be run as standalone programs. They need a helper class, like ConcordanceFilter, to instantiate them and make use of their functionality.

Concordance: a class-by-class rundown

Let’s take a look at each of the classes involved in our concordance program, in turn. The first is ConcordanceFilter.java:

import com.decontextualize.a2z.TextFilter;
public class ConcordanceFilter extends TextFilter {
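  public static void main(String[] args) {
    // The search word is required; without this check, args[0] would
    // throw an ArrayIndexOutOfBoundsException when no argument is given.
    if (args.length < 1) {
      System.err.println("usage: java ConcordanceFilter searchword <your_source_text.txt");
      System.exit(1);
    }
    ConcordanceFilter cf = new ConcordanceFilter();
    cf.setSearchWord(args[0]);
    cf.run();
  }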
  private Concordance concord = new Concordance();
  private String searchWord;
  public void setSearchWord(String searchWord_) {
    searchWord = searchWord_;
  }
  public void eachLine(String line) {
    concord.feedLine(line);
  }
  public void end() {
    if (concord.containsWord(searchWord)) {
      WordContext w = concord.getWord(searchWord);
      for (String context: w.getContexts()) {
        println(context);
      }
    }
  }
}

The first new thing we’re doing in this program is making use of the String[] args parameter of the main method. The args array contains the command-line arguments to the program—whatever text comes after (e.g.) java ConcordanceFilter. For example, if we ran the program like this:

$ java ConcordanceFilter foo bar baz

Then args[0] would evaluate to foo, args[1] would evaluate to bar, and args[2] would evaluate to baz. (File names with redirection, like <genesis.txt, aren’t put in args. Java imposes no limit on the number of arguments you can pass, but your flavor of UNIX might.)
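If you want to see this for yourself, here’s a minimal sketch that prints each argument alongside its index. The class name ArgsDemo and its describe helper are made up for illustration; they aren’t part of the concordance program.

```java
// Hypothetical demo class, not part of the concordance program.
class ArgsDemo {
  // Builds one "args[i] = value" line per command-line argument.
  static String describe(String[] args) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < args.length; i++) {
      sb.append("args[").append(i).append("] = ").append(args[i]).append("\n");
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    if (args.length == 0) {
      System.out.println("no arguments given");
      return;
    }
    System.out.print(describe(args));
  }
}
```

Running java ArgsDemo foo bar baz should print three lines: args[0] = foo, args[1] = bar and args[2] = baz.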

We instantiate a new ConcordanceFilter object and call its setSearchWord method to tell it what the user passed in on the command line. The eachLine method is unusually simple: we merely take whatever line just came in and pass it directly to our Concordance object. The end method calls two methods on the Concordance object: containsWord, to check whether the search word is present in the concordance, and getWord, to get the WordContext object holding all contexts associated with that word (if present). Those contexts are then printed to the screen.

Other than handling the user interface and input/output, the ConcordanceFilter class is pretty bare. Most of the logic is in the Concordance class:

import java.util.HashMap;
import java.util.ArrayList;

public class Concordance {
  protected HashMap<String, WordContext> concordance;
  public Concordance() {
    concordance = new HashMap<String, WordContext>();
  }
  public void feedLine(String line) {
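    // Split on runs of non-word characters. A blank line, or a line that
    // begins with punctuation, yields an empty token, which we skip.
    String[] tokens = line.split("\\W+");
    for (String token: tokens) {
      if (token.isEmpty()) {
        continue;
      }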
      if (concordance.containsKey(token)) {
        WordContext w = concordance.get(token);
        w.addContext(line);
      }
      else {
        WordContext w = new WordContext(token);
        w.addContext(line);
        concordance.put(token, w);
      }
    }
  }
  public boolean containsWord(String token) {
    return concordance.containsKey(token);
  }
  public WordContext getWord(String token) {
    return concordance.get(token);
  }
  public ArrayList<WordContext> getAllWords() {
    return new ArrayList<WordContext>(concordance.values());
  }
}

The main meat of the Concordance object is in the feedLine method, which takes a line of text, splits it into words (breaking on any run of non-word characters), and then checks to see if each of those words is already present in a HashMap object. If a given word is present, then it retrieves the WordContext object associated with that word (see below for more details on WordContext), and tells the object to add the current line to its list of contexts. If the given word isn’t already present, it creates a new WordContext object for the given word, adds the current line to the object, then puts the new WordContext object into the HashMap, using the current token as a key.

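The net effect of this logic is this: when the program has reached the end of the input, the concordance HashMap will have one key for every unique word in the text. That key will be mapped to a WordContext object, which will have a list of all lines in the text that the given word occurs in. Two details are worth knowing. Because feedLine adds the line once per token, a line in which the word occurs twice is stored twice. And because the keys are compared exactly, And and and count as different words.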
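To make that end state concrete, here’s a small sketch that builds the same kind of map from two made-up lines. It’s a simplified stand-in, not the Concordance class itself: the class name ConcordanceSketch is invented, and the map’s values are plain lists of lines rather than WordContext objects.

```java
import java.util.ArrayList;
import java.util.HashMap;

// Hypothetical stand-in for Concordance: values are plain lists of lines
// rather than WordContext objects, to keep the sketch in one class.
class ConcordanceSketch {
  static HashMap<String, ArrayList<String>> build(String[] lines) {
    HashMap<String, ArrayList<String>> index = new HashMap<String, ArrayList<String>>();
    for (String line : lines) {
      for (String token : line.split("\\W+")) {
        if (token.isEmpty()) {
          continue; // split() can produce an empty first token
        }
        if (!index.containsKey(token)) {
          index.put(token, new ArrayList<String>());
        }
        index.get(token).add(line); // one entry per occurrence
      }
    }
    return index;
  }

  public static void main(String[] args) {
    String[] lines = { "the cat sat", "the dog saw the cat" };
    HashMap<String, ArrayList<String>> index = build(lines);
    System.out.println(index.size() + " unique words");
    for (String context : index.get("cat")) {
      System.out.println("cat: " + context);
    }
  }
}
```

With those two lines, the map ends up with five keys (the, cat, sat, dog, saw). The key cat maps to both lines, and the key the maps to three entries, since the second line contains the twice.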

The containsWord, getWord and getAllWords methods provide an interface to the Concordance object’s internal HashMap. The containsWord method returns true if the HashMap has the given word; the getWord method returns the WordContext object for a given word (or null if the word isn’t there, which is why ConcordanceFilter checks containsWord first); and the getAllWords method (not used in this example) returns an ArrayList of all WordContext objects in the concordance.
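Since getAllWords goes unused here, a sketch of one thing you might do with it: loop over every WordContext and find the word with the most stored contexts. The AllWordsDemo class and its mostFrequent helper are invented for illustration, and the Concordance and WordContext classes inside it are trimmed-down copies, nested only so the sketch fits in a single file.

```java
import java.util.ArrayList;
import java.util.HashMap;

// Hypothetical demo. Concordance and WordContext are trimmed-down copies,
// nested here only so that the sketch compiles as a single file.
class AllWordsDemo {
  static class WordContext {
    private String word;
    private ArrayList<String> contexts = new ArrayList<String>();
    WordContext(String word_) { word = word_; }
    void addContext(String context) { contexts.add(context); }
    ArrayList<String> getContexts() { return contexts; }
    String getWord() { return word; }
  }

  static class Concordance {
    private HashMap<String, WordContext> concordance = new HashMap<String, WordContext>();
    void feedLine(String line) {
      for (String token : line.split("\\W+")) {
        if (token.isEmpty()) continue;
        WordContext w = concordance.get(token);
        if (w == null) {
          w = new WordContext(token);
          concordance.put(token, w);
        }
        w.addContext(line);
      }
    }
    ArrayList<WordContext> getAllWords() {
      return new ArrayList<WordContext>(concordance.values());
    }
  }

  // Returns the word with the most stored contexts, or null if there are none.
  static String mostFrequent(Concordance c) {
    WordContext best = null;
    for (WordContext w : c.getAllWords()) {
      if (best == null || w.getContexts().size() > best.getContexts().size()) {
        best = w;
      }
    }
    return best == null ? null : best.getWord();
  }

  public static void main(String[] args) {
    Concordance c = new Concordance();
    c.feedLine("the cat sat");
    c.feedLine("the dog saw the cat");
    System.out.println(mostFrequent(c)); // prints "the"
  }
}
```

Note that a HashMap makes no promise about the order of its values, so the list that getAllWords returns shouldn’t be assumed to follow the order of the text.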

The WordContext class itself is the simplest of the bunch. Let’s take a look:

import java.util.ArrayList;
public class WordContext {
  private ArrayList<String> contexts;
  private String word;
  public WordContext(String word_) {
    word = word_;
    contexts = new ArrayList<String>();
  }
  public void addContext(String context) {
    contexts.add(context);
  }
  public ArrayList<String> getContexts() {
    return contexts;
  }
  public String getWord() {
    return word;
  }
}

The purpose of the WordContext object is merely to contain a list of lines, specifically the lines in which a particular word appears. The Concordance object calls the WordContext object’s addContext method when it wants to add another line to the list of contexts that a particular word occurs in. Later, ConcordanceFilter calls the getContexts method to retrieve all of the contexts that were stored.
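Here’s a short sketch of that add-then-retrieve cycle on its own, using two abbreviated lines from the sample output. The WordContextDemo wrapper is invented for illustration; the WordContext inside it is a copy of the class above, nested so the sketch fits in one file.

```java
import java.util.ArrayList;

// Hypothetical demo wrapper around a copy of the WordContext class.
class WordContextDemo {
  static class WordContext {
    private ArrayList<String> contexts;
    private String word;
    public WordContext(String word_) {
      word = word_;
      contexts = new ArrayList<String>();
    }
    public void addContext(String context) { contexts.add(context); }
    public ArrayList<String> getContexts() { return contexts; }
    public String getWord() { return word; }
  }

  public static void main(String[] args) {
    WordContext w = new WordContext("beast");
    w.addContext("And God made the beast of the earth after his kind");
    w.addContext("And to every beast of the earth");
    ArrayList<String> contexts = w.getContexts();
    // prints "beast appears in 2 lines"
    System.out.println(w.getWord() + " appears in " + contexts.size() + " lines");
    // getContexts hands back the internal list itself, not a copy,
    // so changes made through it are visible inside the WordContext.
    contexts.clear();
    System.out.println(w.getContexts().size()); // prints 0
  }
}
```

Returning the list directly keeps the class simple, at the cost of letting callers modify it. If that ever became a problem, one option would be to return a copy instead.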
