“Boy” in a Bubble

Download this week’s example code.

Managing HTML

(This is a bit of catch-up from last week.)

Sometimes we want to get data from a web site that doesn’t provide any kind of XML feed. In those cases, we have to find some way to parse the site’s HTML. This is harder than it seems at first. Due to a lax specification and years of browser-compatibility hacks, it’s hard to get meaningful data from an HTML page in an automated fashion (i.e., without a human reader).

Parsing HTML with the intention of extracting particular bits of information is called “screen scraping”—so called because HTML isn’t designed to be readable by computers. Better stated: it is designed to be readable by computers, but only so the computer can render it to the screen; only a human reader can understand the content of the HTML. So we’re going to have to work a little bit to get that information out.

The first thing to remember is that HTML is not XML: they’re visually and syntactically similar, but there are a number of important differences. For example:

  • HTML doesn’t require an XML declaration (the <?xml ... ?> line at the beginning of an XML file)
  • HTML can have multiple elements at the root, not just one
  • Attributes on HTML tags don’t have to have quotes around them (e.g., <foo bar=baz></foo> is valid HTML but not valid XML)
  • In HTML, empty element tags don’t require the closing slash as they do in XML (e.g., <img src="hello"> is valid HTML but not valid XML)
  • HTML is forgiving of improperly nested tags: <b><i>foo</b></i> works fine in HTML, but would make an XML parser choke

TagSoup to the rescue

Still, HTML documents work much the same way as XML documents: they’re composed of elements, which can have attributes, child elements, and content. It would be nice to use our XML parsing tools to parse HTML.

Fortunately for us, there are a number of Java libraries that clean up HTML and output valid XML. One of these is John Cowan’s TagSoup. TagSoup does its best to transform HTML—no matter how horrendous—into valid XML that dom4j can use as input.
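To make the difference concrete, here’s a minimal sketch (the class name and sample markup are invented for illustration; it assumes the same JARs listed below are on your classpath) that feeds one piece of sloppy HTML first to dom4j’s default XML parser, then to TagSoup:

import java.io.StringReader;
import org.dom4j.Document;
import org.dom4j.DocumentException;
import org.dom4j.io.SAXReader;

public class SoupDemo {
  public static void main(String[] args) throws Exception {
    // improperly nested tags, an unquoted attribute, no closing slash
    String html = "<p><b><i>hello</b></i> <img src=hello></p>";
    try {
      new SAXReader().read(new StringReader(html)); // strict XML parsing
      System.out.println("strict parse succeeded (unexpected)");
    } catch (DocumentException e) {
      System.out.println("strict XML parser choked: " + e.getMessage());
    }
    // same input, with TagSoup standing in for the default parser
    SAXReader soup = new SAXReader(new org.ccil.cowan.tagsoup.Parser());
    Document doc = soup.read(new StringReader(html));
    System.out.println(doc.asXML());
  }
}

The first read throws a DocumentException; the second succeeds, and the printed XML shows that TagSoup has repaired the nesting and wrapped everything in html and body elements.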

Let’s take a look at an example program: HomeworkList.java. This program extracts the names of homework assignments and links to those assignments from the course homework wiki.

So what are we looking for in the HTML? Here’s the HTML source of the page in plain text format. A quick look through the source shows that, in order to extract the needed information, we’ll need to grab every li tag, then every a tag that has ‘urllink’ as its class attribute. We’ll take the value of the a tag’s href attribute and the content of the tag, printing them out only if the content of the li tag matches the student name passed to the program on the command line.

Before trying to compile this code, make sure that your Java classpath contains all of the necessary JAR files (included in the example code). On a UNIX-like operating system, you’ll need to run this on the command line:

$ export CLASSPATH=jaxen-1.1.1.jar:tagsoup-1.2.jar:dom4j-1.6.1.jar:.

Here’s HomeworkList.java in full:
import org.dom4j.Document;
import org.dom4j.DocumentFactory;
import org.dom4j.io.SAXReader;
import org.dom4j.Element;
import org.xml.sax.XMLReader;
import java.util.List;
import java.util.HashMap;
import java.util.regex.*;

public class HomeworkList {
  public static void main(String[] args) throws Exception {
    String studentName = args[0];
    Pattern namePattern = Pattern.compile(studentName,
      Pattern.CASE_INSENSITIVE);

    HashMap<String, String> map = new HashMap<String, String>();
    map.put("xhtml", "http://www.w3.org/1999/xhtml");
    DocumentFactory factory = DocumentFactory.getInstance();
    factory.setXPathNamespaceURIs(map);

    XMLReader tagsoup = new org.ccil.cowan.tagsoup.Parser();
    SAXReader reader = new SAXReader(tagsoup);
    EasyHTTPGet getter = new EasyHTTPGet(
      "http://itp.nyu.edu/varwiki/Classwork/A2Z-S09"
    );

    Document document = reader.read(getter.responseAsInputStream());
    List listItems = document.selectNodes("//xhtml:li");
    for (Object o: listItems) {
      Element elem = (Element)o;
      String[] parts = elem.getText().split("/");
      Matcher m = namePattern.matcher(parts[0]);
      if (m.find()) {
        Element anchor = (Element)elem.selectSingleNode("xhtml:a");
        String project = anchor.getText();
        String href = anchor.attributeValue("href");
        System.out.println(project + " " + href);
      }
    }
  }
}

Lines 12-14: This program takes a string on the command line, which it compiles into a case-insensitive regular expression; that regular expression is used later to search for a particular student’s name.

Lines 21-22: This is where TagSoup comes in. On line 21, we create a new instance of the TagSoup parser, which we then hand off to dom4j to use instead of its default XML parser. (Normally, we call the SAXReader constructor with no arguments; in this case, we give it the TagSoup object. This has the effect of telling dom4j, “Don’t parse your input the normal way—use this object to parse your input instead.”)

Lines 16-19: When TagSoup creates XML from HTML, it puts every element into the XHTML namespace (http://www.w3.org/1999/xhtml). These lines associate the prefix “xhtml” with that namespace, so we can use the prefix in our XPath queries later on. (See last week’s notes on XML namespaces and XPath for details.)
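If you skip that mapping, a prefix-less XPath query matches nothing at all, since plain //li looks for li elements that aren’t in any namespace. Here’s a small, hedged demonstration (the class name and markup are invented; the API calls are the same ones used in HomeworkList.java):

import java.io.StringReader;
import java.util.HashMap;
import java.util.List;
import org.dom4j.Document;
import org.dom4j.DocumentFactory;
import org.dom4j.io.SAXReader;

public class NamespaceDemo {
  public static void main(String[] args) throws Exception {
    String html = "<ul><li>one</li><li>two</li></ul>";
    SAXReader reader = new SAXReader(new org.ccil.cowan.tagsoup.Parser());
    Document document = reader.read(new StringReader(html));

    // no prefix: looks for li elements outside of any namespace, finds nothing
    List wrong = document.selectNodes("//li");

    // map the prefix, then query again with it
    HashMap<String, String> map = new HashMap<String, String>();
    map.put("xhtml", "http://www.w3.org/1999/xhtml");
    DocumentFactory.getInstance().setXPathNamespaceURIs(map);
    List right = document.selectNodes("//xhtml:li");

    System.out.println(wrong.size() + " vs. " + right.size()); // prints: 0 vs. 2
  }
}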

Line 28: Create a list of all li elements in the document, using the XPath expression //xhtml:li.

Line 31: Get the text of the li element (whatever’s between the opening and closing tag) and split it using the ‘/’ character. We’re only interested in the first element of the resulting array—whatever was to the left of the slash, which is probably a student’s name.

Line 32: Try to match the student name from the li element with the regular expression we created on lines 12-14.

Lines 33-38: If the regular expression matched, get the a element that is the direct child of the current li element, and extract its href attribute and its text. Print them both out to standard output.

Here’s a sample run of the program, with output:

$ java HomeworkList steven
Coupled Couplets http://lehrblogger.com/2009/01/26/programming-a-to-z-assignment-1-coupled-couplets/
Repunctuate.java http://lehrblogger.com/2009/02/02/programming-a-to-z-assignment-2-repunctuatejava/
URLFinder.java http://lehrblogger.com/2009/02/08/programming-a-to-z-assignment-3-urlfinderjava/
Concordance Sorting http://lehrblogger.com/2009/02/24/programming-a-to-z-assignment-4-concordance-sorting/
Markov v vokraM http://lehrblogger.com/2009/02/24/programming-a-to-z-assignment-5-markov-v-vokram/
Delvicious http://delvicious.com
grammar extensions http://lehrblogger.com/2009/03/24/programming-a-to-z-assignments-6-and-7
BayesianNGramClassifier.java http://lehrblogger.com/2009/03/24/programming-a-to-z-assignments-6-and-7
Delvicious and Django http://lehrblogger.com/2009/03/31/programming-a-to-z-delvicious-django-and-assignment-8/

Text visualizations (some interactive)

A taxonomy. Which of these are successful, and when they are successful, why?

  • Word count
  • Character and word transliterations
  • N-grams (and other adjacency relationships)
  • Structural
  • Semantics (i.e., “wordnet lookup”)

Interactive and visual text: interactive_ngram

This program accepts input from the keyboard. The most recent 3 characters are used to search a word list (in this case, the SOWPODS scrabble dictionary) for all words containing those three characters in sequence.

View the applet online.

How it works:

  • The main Processing tab takes care of loading the input file, accepting input from the keyboard, and displaying the results to the screen.
  • Text analysis takes place in the NGramTracker class. This class has a HashMap that relates n-grams to an ArrayList of words containing those n-grams. (A sketch of what such a class might look like follows this list.)
  • The n-gram analysis happens in the feed method, and the getWordsForGram method returns a list of words that contain the given n-gram.
  • The keyPressed function in the Processing sketch checks to see whether the given key was alphanumeric, then appends it to the buffer; if the buffer is more than three characters, it’s clipped to the last three characters in the string. The “buffer” variable is then used to look for words containing that n-gram.
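Here’s a minimal sketch of what a class like NGramTracker could look like (feed and getWordsForGram are the method names described above; the rest of the implementation is an assumption, not the actual class from the example code):

import java.util.ArrayList;
import java.util.HashMap;

public class NGramTracker {
  // maps each n-gram to the list of words that contain it
  private HashMap<String, ArrayList<String>> grams =
    new HashMap<String, ArrayList<String>>();
  private int n;

  public NGramTracker(int n) {
    this.n = n;
  }

  // record every n-character sequence in the given word
  public void feed(String word) {
    for (int i = 0; i + n <= word.length(); i++) {
      String gram = word.substring(i, i + n);
      ArrayList<String> words = grams.get(gram);
      if (words == null) {
        words = new ArrayList<String>();
        grams.put(gram, words);
      }
      if (!words.contains(word)) {
        words.add(word);
      }
    }
  }

  // all words seen so far that contain the given n-gram
  public ArrayList<String> getWordsForGram(String gram) {
    ArrayList<String> words = grams.get(gram);
    return (words == null) ? new ArrayList<String>() : words;
  }
}

Feeding it the whole word list once up front means that each keystroke afterwards is a single HashMap lookup, which is what keeps the sketch responsive.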

Interactive and visual text: genderplot3

This program draws two overlapping lines: one tracks every occurrence of third-person masculine pronouns (he, him), the other tracks every occurrence of third-person feminine pronouns (she, her). A nominative pronoun (he, she) causes the line to turn slightly to the right; an accusative pronoun (him, her) causes the line to turn slightly to the left. The length of each segment is determined by the pronoun’s position in the text.
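To make the geometry concrete, here’s a hedged sketch of the turn-and-step rule described above, tracking just one of the two lines (plain Java, not the applet’s actual drawing code; the turn angle and the distance scaling are invented for illustration):

public class GenderPlotSketch {
  public static void main(String[] args) {
    // fake input in PronounExtractor's output format: position and pronoun
    int[] positions = {348, 468, 561};
    String[] pronouns = {"he", "him", "she"};

    double x = 0, y = 0, heading = 0; // heading in radians
    int lastPos = 0;
    for (int i = 0; i < positions.length; i++) {
      // nominative pronouns (he, she) nudge the heading one way,
      // accusative pronouns (him, her) the other
      if (pronouns[i].equals("he") || pronouns[i].equals("she")) {
        heading += 0.1;
      } else {
        heading -= 0.1;
      }
      // segment length comes from how far apart the matches were in the text
      double len = (positions[i] - lastPos) * 0.01;
      x += Math.cos(heading) * len;
      y += Math.sin(heading) * len;
      lastPos = positions[i];
      System.out.printf("%.2f, %.2f%n", x, y);
    }
  }
}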

Output from Pride and Prejudice; output from the KJV Bible.

The functionality of the program is divided into two parts: PronounExtractor.java does the actual text munging, while the Processing applet (genderplot3.pde) displays the data. Here’s PronounExtractor.java in full:

import java.util.regex.*;
import com.decontextualize.a2z.TextFilter;
class PronounExtractor extends TextFilter {
  public static void main(String[] args) {
    PronounExtractor pe = new PronounExtractor();
    pe.run();
  }
  StringBuffer contents = new StringBuffer();
  public void eachLine(String line) {
    contents.append(line + " ");
  }
  public void end() {
    String everything = contents.toString();
    Pattern p = Pattern.compile("(\\S+)\\s+(she|her|he|him)\\s+(\\S+)",
      Pattern.CASE_INSENSITIVE);
    Matcher m = p.matcher(everything);
    while (m.find()) {
      println(m.start(2) + "\t" + m.group(2).toLowerCase() + "\t" + m.group());
    }
  }
}

This is a TextFilter class, and its overall structure should look fairly familiar. The eachLine method accumulates all of the lines from the input into one long string; the end method runs a regular expression over that string and does something with each of the matches. The main departure from previous example code is that we’re using Java’s StringBuffer class, instead of a regular String, to build the variable containing the entire contents of the file. Here’s a good overview of the difference between String and StringBuffer, and when to use which class.
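The short version, as a hedged micro-benchmark (class name and loop size invented):

public class AppendDemo {
  public static void main(String[] args) {
    int n = 50000;

    // String is immutable: each += allocates a new string and copies
    // everything accumulated so far, so this loop does quadratic work
    long start = System.currentTimeMillis();
    String s = "";
    for (int i = 0; i < n; i++) {
      s += "x";
    }
    System.out.println("String:       " + (System.currentTimeMillis() - start) + " ms");

    // StringBuffer appends in place, growing an internal array as needed
    start = System.currentTimeMillis();
    StringBuffer sb = new StringBuffer();
    for (int i = 0; i < n; i++) {
      sb.append("x");
    }
    System.out.println("StringBuffer: " + (System.currentTimeMillis() - start) + " ms");
  }
}

On any recent JVM the StringBuffer version finishes orders of magnitude faster, which is why PronounExtractor uses one to accumulate the whole file.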

The regular expression on line 14 is designed to find every third-person singular masculine or feminine pronoun (he, him, she, her), and also to capture the word that came before and the word that came after. On line 18, we print out the position in the text where the match occurred (with the start method of the Matcher object), along with the actual pronoun that matched (m.group(2).toLowerCase()) and the entire string that matched (m.group()). Here’s a sample run of the program, with an excerpt of the output:

$ java PronounExtractor <austen.txt
348     he      that he is
468     him     to him one
561     he      that he had
640     she     and she told
998     he      that he came
1100    he      that he agreed
1144    he      that he is
1300    he      "Is he married

The visual component (genderplot3.pde) expects a file in exactly this format; it reads in the file using Processing’s loadStrings function. (Deciphering the drawing code in genderplot3 is an exercise left to the reader.)

Homework

Acquire some text. Visualize it. Source and methodology are up to you, but be prepared to justify your choices.
