Strung Out (On Java)

Last week we learned about the UNIX command line: specifically, programs for the UNIX command line that cut up, filter, and mangle text. This week’s goal is to learn how to make programs that behave like those UNIX programs. Our language of choice: Java.

Download the source code for this week’s examples here. I recommend going to your Terminal application and changing to the directory we created last week (a2z) and using curl to download the examples:

$ cd a2z
$ curl -O http://a2z.decontextualize.com/code/week02.zip
$ unzip week02.zip

In our examples for this week, we’ll be using a Java library called TextFilter, a little project of mine that I’ve been working on for this course. The library is included in the examples file above. Find the complete documentation here.

Compiling Java source code

Our first example program is Echo.java. Here’s the listing:

import com.decontextualize.a2z.TextFilter;
public class Echo extends TextFilter {

  public static void main(String[] args) {
    Echo e = new Echo();
    e.run(); 
  }
  public void begin() {
    println("beginning");
  }
  public void eachLine(String line) {
    println(line);
  }
  public void end() {
    println("at end of file");
  }

}

This program prints out the string beginning, then prints out each line of input as it comes in (from the keyboard or from a redirected file). When the end of the file is reached (or when you hit Ctrl+D), it prints out at end of file. It’s like our own version of cat, but with some extra stuff.

(There’s a lot of strange syntax here for those of you only familiar with Processing. All that will be explained in a second. Sit tight.)

Before we can run this program, we need to compile it. The compiler takes your source code and converts it to Java bytecode: a series of bytes that the Java virtual machine knows how to execute. If you’re running OS X, the compiler is already installed on your computer. It’s called javac, and you run it like this:

$ export CLASSPATH=a2z.jar:.
$ javac Echo.java

(The export CLASSPATH line only needs to be executed once per terminal session. It tells Java where to find any external libraries: in this case, the a2z.jar file and the current directory. Come see me if you want tips on how to make this automatic.)

After you run javac, there should be a new file in your directory: Echo.class. This is the compiled version of our code, which Java can actually execute. To run Java code, use the java command, passing it the name of the class you want to execute:

$ java Echo

Have fun with the ensuing madness, and hit Ctrl+D to quit. (Try redirecting the output to a file, or piping it through another UNIX command.)

Defining and executing classes

If you’re only familiar with Processing, Echo.java might seem a little bewildering. What’s all that extra garbage? Here’s what you need to know:

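(1) Everything in Java is part of a class. For our purposes, one .java file defines one class, and the name of the file (e.g., Echo.java) must match the name of the public class defined in the file (e.g., class Echo). (A file can technically contain additional classes, as long as no more than one of them is declared public.)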

(2) If a class defines a main method, then the java command can run that class as a program. When you run java, it searches for a class with the name you specified on the command line, and then executes the main method of that class. (The main method must be specified as static, meaning that Java can use the method without actually creating an instance of the class.)

(3) Classes can extend other classes. When you’re extending a class, you gain access to all of its functionality, and have the option of overriding some of the behavior in the class with your own behavior.

(4) Variables and methods in a class can be defined as public, private, or protected. These are called “access modifiers” and they affect how other classes can use the data and methods in your class. Don’t worry too much about these guys right now. A good rule of thumb is that variables in your class should be declared private, and methods should be declared public.
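Here's a minimal sketch that puts the four points above in one place. Greeter and LoudGreeter are hypothetical names invented for this illustration; they aren't part of this week's examples. Neither class is declared public, which is what lets both of them live in a single file.

```java
// A hypothetical base class, standing in for a library class you might extend.
class Greeter {
  // private: only code inside Greeter can read or change this variable
  private String name;

  public Greeter(String name) {
    this.name = name;
  }

  // public: any other class can call these methods
  public String getName() {
    return name;
  }

  public String greet() {
    return "Hello, " + name;
  }
}

// LoudGreeter extends Greeter: it inherits getName() and overrides greet().
class LoudGreeter extends Greeter {
  public LoudGreeter(String name) {
    super(name);
  }

  public String greet() {
    return "HELLO, " + getName().toUpperCase() + "!";
  }

  // static: the java command can call main without creating an object first
  public static void main(String[] args) {
    Greeter plain = new Greeter("world");
    Greeter loud = new LoudGreeter("world");
    System.out.println(plain.greet()); // prints Hello, world
    System.out.println(loud.greet());  // prints HELLO, WORLD!
  }
}
```

Because LoudGreeter is the class with a main method, you would run this with java LoudGreeter after compiling.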

Processing’s dirty secrets

The Processing environment hides these details from you, but they’re still present in the code that Processing generates. In fact, whenever you press the “play” button in the Processing IDE (or when you export as an applet or application), Processing rewrites your code to follow the above conventions. You can see the “real” Java code by looking at the .java file that Processing produces when you export a sketch. Compare the Java source and Processing source of this applet, for example.

TextFilter, a PApplet for text

As you can see from the source code above, Processing sketches are, at heart, just Java classes that extend a class called PApplet. The PApplet class provides functionality (drawing lines and polygons and tracking mouse position), but allows you to specify behavior: specifically, what happens before the sketch begins (setup) and what happens whenever the sketch is supposed to draw something to the screen (draw). You get the tough stuff for free and the fun stuff easy.

Likewise, TextFilter is a Java class (of your instructor’s design) that hides the complexity of Java’s input/output operations, and lets you get down to the work of munging text. A class that extends TextFilter need only define any of the following:

  • begin(): This method will be called before any lines are read from input. (Analogous to Processing’s setup() method.)
  • eachLine(String line): This method will be called for each line read from input, in the order they occur in the input. The current line is passed to the method as a parameter.
  • end(): This method will be called after all lines have been read from input.

Other methods available to your classes that extend TextFilter:

  • println: prints a line to output; takes a String or a char as a parameter
  • print: prints a String or char to output (without terminating the current line)

Comprehensive documentation here. We’ll be going over more advanced uses of the library as the course progresses.
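If you're curious what a class like this might be doing behind the scenes, here's a rough sketch. To be clear: MiniFilter is a hypothetical stand-in written for this page, not the actual TextFilter source, and the real library may well work differently. It only illustrates the general pattern of a base class calling begin, eachLine, and end in order while a subclass fills in the behavior.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;

// Hypothetical stand-in for illustration only; NOT the real TextFilter.
class MiniFilter {
  // Subclasses override whichever of these three hooks they need.
  public void begin() {}
  public void eachLine(String line) {}
  public void end() {}

  // Reads from standard input (the keyboard, a pipe, or a redirected file).
  public void run() {
    run(new InputStreamReader(System.in));
  }

  // Reads from any source of characters, calling the hooks in order.
  public void run(Reader source) {
    BufferedReader reader = new BufferedReader(source);
    begin();
    try {
      String line;
      while ((line = reader.readLine()) != null) {
        eachLine(line);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    end();
  }
}

// A subclass supplies behavior: count the lines, report the total at the end.
class LineCounter extends MiniFilter {
  private int count = 0;

  public void eachLine(String line) {
    count++;
  }

  public void end() {
    System.out.println(count + " lines");
  }

  public int getCount() {
    return count;
  }

  public static void main(String[] args) {
    new LineCounter().run();
  }
}
```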

Using TextFilter

In order to use the TextFilter library, the a2z.jar file must be visible to your compiler (that’s what the export CLASSPATH=a2z.jar:. command that you executed earlier does). You must also put the following line at the top of your Java file:

import com.decontextualize.a2z.TextFilter;

This tells the Java compiler that you want to be able to use the TextFilter class in your program.

(Remember: If you want the java command to be able to run your TextFilter class as a standalone program, you need to define a static method called main in your class. This method should create an object of the class that you’ve defined, and call the run method on that object. See any of this week’s examples for an idea of how it works.)

SimpleFilter.java

The example programs this week are all simple examples of programs that filter or analyze text. They’re also intended to exploit a number of features of Java’s String class. Here’s SimpleFilter.java, which you can think of as a rudimentary form of grep:

import com.decontextualize.a2z.TextFilter;

public class SimpleFilter extends TextFilter {
  public static void main(String[] args) {
    SimpleFilter s = new SimpleFilter();
    s.run();
  }
  public void eachLine(String line) {
    if (line.indexOf('a') != -1) {
      println(line);
    }
  }
}

This program reads from input, printing out only those lines that contain the character a. (Notice that we don’t have to define begin and end methods if we don’t want the program to do anything special when it begins or ends.) The important bit of the code is this:

line.indexOf('a') != -1

The indexOf method of the String class (official documentation here) tells you where a particular substring or character occurs within the string object that you call it on. If the substring is present, indexOf returns the index of the substring—i.e., where that substring begins. If the substring is not present, indexOf returns -1. In this case, we don’t care where the character a appears in the string. We just want to know whether or not it’s there—so the program checks the return value of indexOf and prints out the line only if it isn’t -1.
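A few concrete return values may help. This standalone sketch uses System.out.println rather than TextFilter's println so it can run on its own; IndexOfDemo and containsText are names made up for the example.

```java
class IndexOfDemo {
  // Returns true if needle occurs anywhere in line (hypothetical helper).
  static boolean containsText(String line, String needle) {
    return line.indexOf(needle) != -1;
  }

  public static void main(String[] args) {
    String line = "banana";
    System.out.println(line.indexOf('a'));   // prints 1 (first a is at index 1)
    System.out.println(line.indexOf("nan")); // prints 2 (substring starts at index 2)
    System.out.println(line.indexOf('z'));   // prints -1 (not present)
    System.out.println(containsText(line, "ana")); // prints true
  }
}
```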

Reverse.java

Next up, Reverse.java:

import com.decontextualize.a2z.TextFilter;

public class Reverse extends TextFilter {
  public static void main(String[] args) {
    Reverse r = new Reverse();
    r.run();
  }
  public void eachLine(String line) {
    for (int i = line.length() - 1; i >= 0; i--) {
      print(line.charAt(i));
    }
    println();
  }
}

This program demonstrates two important methods of the String class, namely length and charAt. The length method returns the length of the string (its number of characters). The charAt method returns the character that occurs at a particular index of the string—the index of the first character is 0, the second character is 1, and so forth. Think of it as a way to use a String as an array of characters. For example:

String foo = "hello";
println(foo.length()); // prints 5
println(foo.charAt(0)); // prints h
println(foo.charAt(4)); // prints o

In Reverse.java, we’re using these two methods of the String object to print out each line of input in reverse. The for loop starts at the last index of the string (line.length() - 1), then counts backwards to 0, printing out the character at each index. After the for loop completes, the program prints a newline character by calling println with no parameters.
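For comparison, here's a sketch of the same loop written as a method that returns the reversed string instead of printing it, next to the StringBuilder class from the standard library, which has a reverse method built in. ReverseDemo and both method names are invented for this example; it isn't one of this week's files.

```java
class ReverseDemo {
  // Same idea as Reverse.java: walk from the last index down to 0.
  static String reverseWithLoop(String line) {
    StringBuilder out = new StringBuilder();
    for (int i = line.length() - 1; i >= 0; i--) {
      out.append(line.charAt(i));
    }
    return out.toString();
  }

  // The standard library can do the work for us.
  static String reverseWithBuilder(String line) {
    return new StringBuilder(line).reverse().toString();
  }

  public static void main(String[] args) {
    System.out.println(reverseWithLoop("hello"));    // prints olleh
    System.out.println(reverseWithBuilder("hello")); // prints olleh
  }
}
```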

AverageWordLength.java

This program is a simple example of text analysis. Instead of printing out or mangling the input as it comes in, AverageWordLength.java looks at each line and tries to extract some statistical information about it—in this case, the average length of the words in the input. When all lines have been read, that statistical information is printed out.

import com.decontextualize.a2z.TextFilter;

public class AverageWordLength extends TextFilter {

  public static void main(String[] args) {
    new AverageWordLength().run();
  }

  private int wordCount = 0;
  private int wordLengthTotal = 0;

  public void eachLine(String line) {
    String[] components = line.split(" ");
    for (int i = 0; i < components.length; i++) {
      wordCount++;
      wordLengthTotal += components[i].length();
    }
  }

  public void end() {
    println(String.valueOf(wordLengthTotal / (float)wordCount));
  }

}

The first new bit of syntax in this program is new AverageWordLength().run();. All we're doing here is calling a method on the AverageWordLength object without having assigned that object to a variable. This code does the exact same thing as the following:

AverageWordLength a = new AverageWordLength();
a.run();

The shorter form saves us a few keystrokes, and also cuts down on repetition (and therefore, a chance to introduce a typo).

Another new thing we've done in this program is introduce instance variables: wordCount and wordLengthTotal. The object uses these variables to keep track of how many words have occurred in the text so far, and the total length of all words in the text. (Remember, your TextFilter classes are just like any other class: you can define your own variables and methods, in addition to using those that the TextFilter class defines for you.)

Inside the eachLine method, we use the split method of the String class to "split" the incoming text into an array of strings. The split method takes one argument: the delimiter that separates the pieces we want to retrieve. It's sort of like the UNIX cut command. For example:

String foo = "hello there you";
String[] fooWords = foo.split(" "); // fooWords now contains "hello", "there", "you"
String bar = "comma,separated,values";
String[] barValues = bar.split(","); // barValues now contains "comma", "separated", "values"

Using a space character as a delimiter isn't the best way to extract "words" from a text---we'll be counting strings like hello! and said, as words---but it's an easy implementation, and it's often good enough. (The parameter passed to split is actually a regular expression, which we'll talk about next week. For now, simple patterns like " " and "," should work like you expect them to.)
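To see where the single-space delimiter falls short, here's a small sketch comparing it with a slightly more careful approach. SplitDemo, countPieces, and countWords are hypothetical names for this illustration, and the "\\s+" pattern (one or more whitespace characters) is a regular expression of the kind we'll cover next week, so treat it as a preview rather than something you need to understand yet.

```java
class SplitDemo {
  // What AverageWordLength.java effectively counts: pieces between single spaces.
  static int countPieces(String line) {
    return line.split(" ").length;
  }

  // A more forgiving count: trim the line, then split on runs of whitespace.
  static int countWords(String line) {
    String trimmed = line.trim();
    if (trimmed.isEmpty()) {
      return 0;
    }
    return trimmed.split("\\s+").length;
  }

  public static void main(String[] args) {
    System.out.println(countPieces("hello there you")); // prints 3
    System.out.println(countPieces("hello  there"));    // prints 3 (two spaces leave an empty piece)
    System.out.println(countPieces(""));                // prints 1 (a blank line counts as one empty piece)
    System.out.println(countWords("hello  there"));     // prints 2
    System.out.println(countWords(""));                 // prints 0
  }
}
```

In other words, blank lines and doubled spaces in the input will nudge the average word length downward, since each empty piece counts as a word of length 0.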

After splitting the line into words, the program loops over the resulting array of strings, incrementing the wordCount variable and adding the length of each string to wordLengthTotal. In the end method, the program prints wordLengthTotal divided by wordCount---i.e., the average number of characters per word.

Processing vs. Java: continued

Two things to notice in the end method of AverageWordLength.java. The first is that we used this strange syntax to convert wordCount to a float:

(float)wordCount

... instead of what we might use in Processing (i.e., float(wordCount)). The other is that we used String.valueOf() to convert that float value into a string, instead of just passing the value in to println. (Documentation of String.valueOf() begins here.)

These are examples of functions that are either present in Processing, but not present in barebones Java (the float function) or that work differently in Processing than they do in Java (println). In fact, the vast majority of Processing's built-in functions (see here) are Processing-specific. Most of them---like the ones that deal with drawing stuff---you won't need for this class. Others---like data conversion functions, string manipulation functions---you will need. Fortunately, Java provides most of this functionality for you. You just have to dig a bit into the Java Standard Library.
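Here's a short sketch of a few plain-Java conversions you might reach for in place of Processing's conversion functions. ConversionDemo and average are made-up names for this example. It also shows why the cast in AverageWordLength.java matters: dividing one int by another throws away the fractional part.

```java
class ConversionDemo {
  // Casting one operand to float forces floating-point division.
  static float average(int total, int count) {
    return total / (float) count;
  }

  public static void main(String[] args) {
    int total = 7;
    int count = 2;
    System.out.println(total / count);         // prints 3 (integer division)
    System.out.println(average(total, count)); // prints 3.5

    // float to String
    String asText = String.valueOf(average(total, count));
    System.out.println(asText.length());       // prints 3 (the string is "3.5")

    // String to int, and String to float
    int n = Integer.parseInt("42");
    float f = Float.parseFloat("2.5");
    System.out.println(n + f);                 // prints 44.5
  }
}
```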

Java API

The Java API is a set of classes that any Java program can use. They're all available by default and already present on your computer (and on the computer of anyone you might give your program to). Before you try to program something complex, it's best to look in the standard library to see if Java already has a class that you can use. Browse the library here.

The API is huge, though, so it may be hard to find what you need. The homework this week involves getting familiar with the String class. You might also look at the Math class for implementations of basic math functions (e.g., sin, floor, random...). The Collection classes (ArrayList, HashMap...) will loom large for the remainder of the course, so poke at the documentation for those too.
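As a small taste of those classes, here's a sketch that touches Math, ArrayList, and HashMap. ApiDemo and countWords are hypothetical names, and this is only one of many ways to tally words; we'll look at the collection classes properly later in the course.

```java
import java.util.ArrayList;
import java.util.HashMap;

class ApiDemo {
  // Tallies how many times each space-separated word appears in a line.
  static HashMap<String, Integer> countWords(String line) {
    HashMap<String, Integer> counts = new HashMap<String, Integer>();
    for (String word : line.split(" ")) {
      Integer seen = counts.get(word); // null if we haven't seen this word yet
      counts.put(word, seen == null ? 1 : seen + 1);
    }
    return counts;
  }

  public static void main(String[] args) {
    // Math: static methods, called on the class itself
    System.out.println(Math.floor(3.7)); // prints 3.0
    System.out.println(Math.max(2, 9));  // prints 9

    // ArrayList: a list that grows as you add to it
    ArrayList<String> words = new ArrayList<String>();
    words.add("hello");
    words.add("there");
    System.out.println(words.size());    // prints 2
    System.out.println(words.get(0));    // prints hello

    // HashMap: look up a value by its key
    HashMap<String, Integer> counts = countWords("the cat and the hat");
    System.out.println(counts.get("the")); // prints 2
  }
}
```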

Reading for next week

Assignment #2

Create a program (using, e.g., the tools presented in class) that behaves like a UNIX text processing program (such as cat, grep, tr, etc.). Your program should take text as input (any text, or a particular text of your choosing) and output a version of the text that has been filtered and/or munged. Your program should use at least one method of Java's String class that we didn't discuss in class.

Be creative, insightful, or intentionally banal. Optional: Use the program that you created in tandem with another UNIX command line utility.
