A Significant Markup

Download this week’s example code here.

URLs

Up to now, we’ve been using static texts: spare scraps that we’ve had on our hard drives, or a stray public domain text we’ve grabbed from wherever. This week, we’re going to try to get our hands on some living text: text from the web. In particular, we’re going to learn how to use the web-facing APIs of Delicious and Twitter to get XML and extract interesting information from it.

Getting information from the web is easy: all you need is a web client and a URL. By “web client” I mean any program that knows how to talk HTTP (Hypertext Transfer Protocol)—your computer is lousy with ‘em. Your web browser is the one you’re most familiar with; curl is another. (curl is the command-line utility we’ve been using in class to fetch text and sample code from the server.) The most basic operation that we can do with HTTP is to fetch the document at a given URL. Here’s how to do that from the command line with curl:

$ curl http://www.decontextualize.com/teaching/a2z/a-significant-markup/

Most of what we do on the web—whether we’re using a web browser or writing a program that accesses the web—boils down to manipulating URLs. So it’s important to understand the structure of the URL. Let’s take the following URL, for example:

http://www.example.com/foo/bar?arg1=baz&arg2=quux

Here are the components of that URL:

http protocol
www.example.com host
/foo/bar path
?arg1=baz&arg2=quux query string

The host specifies which server on the Internet we’re going to be talking to, and the protocol specifies how we’re going to be talking to that server. (For web requests, the protocol will be either http or https.) The path and the query string together tell the server what it is exactly that we want that computer to send us.

Traditionally, the path corresponds to a particular file on the server, and the query string specifies further information about how that file should be accessed. (The query string is a list of key/value pairs separated by ampersands: you can think of it like sending arguments to a function.) Everything in a URL after the host name, however, is fair game, and individual web sites are free to form their URLs however they please.
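To make that anatomy concrete, here's a short Java program (not part of this week's example code) that uses the standard library's java.net.URL class to pull apart our example URL, then splits the query string into its key/value pairs:

```java
import java.net.URL;
import java.util.HashMap;

public class URLParts {
  public static void main(String[] args) throws Exception {
    URL url = new URL("http://www.example.com/foo/bar?arg1=baz&arg2=quux");
    System.out.println(url.getProtocol()); // http
    System.out.println(url.getHost());     // www.example.com
    System.out.println(url.getPath());     // /foo/bar
    System.out.println(url.getQuery());    // arg1=baz&arg2=quux

    // split the query string into key/value pairs, like
    // arguments being passed to a function
    HashMap<String, String> params = new HashMap<String, String>();
    for (String pair : url.getQuery().split("&")) {
      String[] keyAndValue = pair.split("=", 2);
      params.put(keyAndValue[0], keyAndValue[1]);
    }
    System.out.println(params.get("arg2")); // quux
  }
}
```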

In the course of normal web browsing, this flexibility in URL structure isn’t a problem, since the HTML documents that our browser retrieves from a site already contain URLs to other documents on the site (in the form of hyperlinks). Most web services, however, require you to construct URLs in very particular ways in order to get the data you want.

In fact, most of the work you’ll do in learning how to use a web service is learning how to construct and manipulate URLs. A quick glance through the documentation for web services like Twitter and Delicious reveals that the bulk of the documentation is just a big list of URLs, with information on how to adjust those URLs to get the information you want.

HTML, XML, web services and “web APIs”

The most common format for documents on the web is HTML (HyperText Markup Language). Web browsers like HTML because they know how to render it as human-readable documents—in fact, that’s what web browsers are for: turning HTML from the web into visually splendid and exciting multimedia affairs. (You’re looking at an HTML page right now.) HTML was specifically designed to be a tool for creating web pages, and it excels at that, but it’s not so great for describing structured data. Another popular format—and the format we’ll be learning how to work with this week—is XML (eXtensible Markup Language). XML is similar to HTML in many ways, but has additional affordances that allow it to describe virtually any kind of data in a structured and portable way.

Roughly speaking, whenever a web site exposes a URL for human readers, the document at that URL is in HTML. Whenever a web site exposes a URL for programmatic use, the document at that URL is in XML. (There are other formats commonly used for computer-readable documents, like JSON. But let’s keep it simple for now.) As an example, Twitter has a human-readable version of its public timeline, available at the following URL:

http://twitter.com/public_timeline

But Twitter also has a version of its public timeline designed to be easily readable by computers. This is the URL, and it returns a document in XML format:

http://twitter.com/statuses/public_timeline.xml

Every web site makes available a number of URLs that return human-readable documents; many web sites (like Twitter) also make available URLs that return documents intended to be read by computer programs. Often—as is the case with Twitter, or with sites like Metafilter that make their content available through RSS feeds—these are just two views into the same data.

You can think of a “web service” as the set of URLs that a web site makes available that are intended to be read by computer programs. “Web API” is a popular term that means the same thing. (API stands for “application programming interface”; a “web API” is an interface that enables you to write programs that make use of the web site’s data.)


EasyHTTPGet

In this week’s example code, there’s a class called EasyHTTPGet that makes it easy to get documents from the web from Java. Getter.java demonstrates how it works:

public class Getter {
  public static void main(String[] args) throws Exception {
    String url = args[0];
    EasyHTTPGet getter = new EasyHTTPGet(url);
    String response = getter.responseAsString();
    System.out.println(response);
  }
}

You can think of Getter.java as a kind of bare-bones replacement for curl. It takes a URL on the command-line and then fetches that URL. The EasyHTTPGet class takes a URL as a parameter to its constructor; after you’ve got an EasyHTTPGet object, you can call that object’s responseAsString method to get the entire contents of the given URL as a big long string.
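EasyHTTPGet’s actual source ships in this week’s zip, so use that. But as a rough sketch of how such a class might be built on the standard library’s java.net.HttpURLConnection—the details below are my assumptions, not the real implementation—consider:

```java
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// A bare-bones sketch of an EasyHTTPGet-like class. The real class is in
// this week's example code; this version is illustrative only.
public class SimpleHTTPGet {
  private final String urlString;

  public SimpleHTTPGet(String urlString) {
    // just remember the URL; nothing is fetched until a response method is called
    this.urlString = urlString;
  }

  public InputStream responseAsInputStream() throws Exception {
    HttpURLConnection conn =
      (HttpURLConnection) new URL(urlString).openConnection();
    conn.setRequestMethod("GET");
    return conn.getInputStream();
  }

  public String responseAsString() throws Exception {
    // slurp the entire response into one big string
    InputStream in = responseAsInputStream();
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buffer = new byte[4096];
    int bytesRead;
    while ((bytesRead = in.read(buffer)) != -1) {
      out.write(buffer, 0, bytesRead);
    }
    in.close();
    return out.toString("UTF-8");
  }
}
```

Real-world code would also want to follow redirects and respect the response’s declared character encoding; the genuine EasyHTTPGet may handle those details differently.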

So, for example, to get Twitter’s public timeline as XML, you’d run Getter like so:

$ java Getter http://twitter.com/statuses/public_timeline.xml


XML: the basics

So, now we can get XML from a web service. But what is XML, and how do we understand it?

All documents have some kind of content. For the plain text files we’ve been working with so far in this class, the content is just what’s in the file: bytes of text, separated into lines. Documents can also have structure: some indication of (a) what parts of the document mean and (b) how parts of a document relate to one another. Plain text documents (by definition) have almost no structure: there’s no indication of how the content should be divided up, and no indication of what the different parts of the document are supposed to mean.

XML is a standard for “marking up” a plain-text document, in order to structure its content. It’s intended to make structured data easy to share and easy to parse. Let’s take a look at a sample XML file, which describes a pair of kittens and their favorite television shows:

<?xml version="1.0" encoding="UTF-8"?>
<kittens>
  <kitten name="Fluffy" color="brown">
    <televisionshows>
      <show>House</show>
      <show>Weeds</show>
    </televisionshows>
    <lastcheckup date="2009-01-01" />
  </kitten>
  <kitten name="Monsieur Whiskeurs" color="gray">
    <televisionshows>
      <show>The Office</show>
    </televisionshows>
    <lastcheckup date="2009-02-15" />
  </kitten>
</kittens>

Let’s talk a little bit about the anatomy of this file. Line 1 contains the XML declaration, which specifies which version of XML the document contains and what character encoding it uses. Line 2 contains the opening tag of the root element: this element contains all other elements in the file. Every XML document must have exactly one element at the root level.

XML elements (like HTML elements) are defined by pairs of matching tags: everything from <kittens> to </kittens> comprises the “kittens” element; everything from <kitten name="Fluffy" color="brown"> to </kitten> comprises the first “kitten” element. Some elements—like the “lastcheckup” elements in the example above—don’t have closing tags; such tags use a special syntax, with a slash before the closing angle bracket. (This is unlike HTML, where unclosed tags like <img> don’t need the trailing slash.)

XML elements exist in a hierarchical relationship: in the example above, the “kittens” element contains several “kitten” elements, which in turn contain “televisionshows” and “lastcheckup” elements. These hierarchical relationships are commonly referred to using a familial metaphor: the first “kitten” element, for example, is the child of the “kittens” element—and the “kittens” element is the parent of both “kitten” elements. Two elements that are children of the same parent are called siblings: for example, both “show” elements under Fluffy’s “televisionshows” element are siblings (since they have the same parent).

In addition to containing other elements, an XML element can have attributes: name/value pairs that occur in the opening tag of an element. Both “kitten” tags have two attributes: “name” and “color.” An XML element can also contain text that isn’t a part of any other element: this is often called the element’s contents or text. (The “show” elements in the examples above are a good example of elements with text.)


Fetching and parsing XML in Java with dom4j

Parsing XML isn’t a task that we want to take on for ourselves—sure, we could get basic information out using regular expressions, but it’d be more trouble than it’s worth. Thankfully, there are several libraries for Java that make parsing XML and extracting information from it a snap. The library we’re going to use is called dom4j.

Note: To compile the examples in this section, you’ll need to add all of the relevant JAR files to your classpath. The JARs are included in the zip file for this week’s examples. If you’re on OS X or another UNIX-like operating system, you should run the following command before trying to compile any of the example code:

$ export CLASSPATH=dom4j-1.6.1.jar:jaxen-1.1.1.jar:.

Let’s dive right into some example code. Here’s TwitterStatus.java, which extracts just the status text of a given user’s Twitter updates:

import org.dom4j.Document;
import org.dom4j.io.SAXReader;
import org.dom4j.Element;
import java.util.List;
public class TwitterStatus {
  public static void main(String[] args) throws Exception {
    String username = args[0];
    EasyHTTPGet getter = new EasyHTTPGet(
      "http://twitter.com/statuses/user_timeline/" + username + ".xml"
    );
    SAXReader reader = new SAXReader();
    Document document = reader.read(getter.responseAsInputStream());
    List statusText = document.selectNodes("//status/text");
    for (Object o: statusText) {
      Element elem = (Element)o;
      String text = elem.getText();
      System.out.println(text);
    }
  }
}

This program shows the basic steps needed to get an XML document from the web and extract information from it. First, we grab a Twitter username from the command line and use it to create a URL, which we pass into EasyHTTPGet.

Next, we create a SAXReader object and call its read method, which returns a Document object. The SAXReader class is from the dom4j library; this object is what actually parses the XML file. The Document class is also from dom4j: objects of this class are what allow us to access data within the document, after it’s parsed.

(Aside: note that SAXReader’s read method requires an InputStream object, rather than a string. The EasyHTTPGet class has a responseAsInputStream method that conveniently returns the results of fetching a URL as an InputStream object.)

Once we have a Document object, we call the selectNodes method, which takes an XPath query as an argument and returns a list of objects. (More on XPath in the next section.) This particular XPath query selects all “text” elements that are direct children of “status” elements. (These are the elements of Twitter’s XML that give us the actual text of someone’s status update.)

We iterate over the list of objects, casting them to dom4j Element objects as necessary, and extract the contents of each element using the Element object’s getText method, which we then print to output. Here’s an excerpt from some sample output (for my Twitter account):

"game lecture over please try again" cactus at gdc: http://tr.im/hOvz
TRAINS WITH CANNONS.
@vikramtank but where do we get the hydrogen?
why don't more web apps make packages for virtual platforms? (e.g. why no official WordPress AMI for EC2?) like this: http://tr.im/hL9p
@brendn to the extent that they're used for authentication, yeah. truthfully though I just hate signing up for accounts
@chasing sure, but API keys sever your data from the web---a uri to the resource doesn't fully describe how to access the resource
is there any purpose for API keys other than rate limiting and usage tracking? why should web services have keys when web sites don't?
@ckolderup http://twitpic.com/2ds7m
beat Moving with about 60% completion. Gonna have to play through it again, move some of the things I missed the first time through

List? What happened to ArrayList? And what’s an Object?

There are some weird things afoot in the above code. Let’s focus in on a couple of points:

List statusText = document.selectNodes("//status/text");
for (Object o: statusText) {
  Element elem = (Element)o;
  String text = elem.getText();
  ...
}

First off: “List” is just dom4j’s way of saying that selectNodes returns something that behaves like a list (i.e., we can iterate over it, get objects from it, add to it), without telling us exactly what kind of list it is under the hood. When a method returns a List, you can treat it exactly like you’d treat an ArrayList.

“But wait,” you’re saying. “What happened to the angle brackets? Why don’t we write List<Element> instead of just List?” An astute question! Support for the angle-bracket syntax to add type information to lists (and HashMaps) was added in Java 1.5; dom4j predates Java 1.5, so it doesn’t support that syntax.

As a consequence, we don’t know what kind of objects are stored in the List that selectNodes returns. That’s the reason that we’re using a variable of class Object when we’re looping over the list. The Object class is our way of telling Java, “hey, I want to store an object in a variable, but I don’t know what kind of object it is.”

Before we use the object, we have to cast it to the appropriate class. We know from the dom4j documentation that the selectNodes method returns a list of objects that we can cast to the Element class; the following code

Element elem = (Element)o;

accomplishes that. Once we have an Element object, we can do useful work, like extracting the element’s text with getText.
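If the raw-List, Object, and cast dance still seems abstract, here’s a tiny self-contained program (not part of the example code) that goes through the same motions with a list we build ourselves:

```java
import java.util.ArrayList;
import java.util.List;

public class RawListDemo {
  public static void main(String[] args) {
    // a "raw" (pre-Java 1.5 style) list: no angle brackets, so Java
    // only knows that it contains Objects
    List words = new ArrayList();
    words.add("hello");
    words.add("kittens");
    for (Object o : words) {
      String s = (String) o; // cast before using any String methods
      System.out.println(s.toUpperCase());
    }
    // prints:
    // HELLO
    // KITTENS
  }
}
```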


Dom4j Element cheat sheet

Once you have an Element object, you can do the following useful things:

  • element.getText() returns the text inside the element’s tags. Example: for an element like <foo>hello there!</foo>, a call to element.getText() would return hello there!
  • element.attributeValue(str) returns the text inside the attribute with the given name. Example: for an element like <foo bar="baz" />, a call to element.attributeValue("bar") would return baz
  • element.selectNodes(xpathexpression) returns a list of elements matching the given XPath expression
  • element.selectSingleNode(xpathexpression) returns a single Element object matching the given XPath expression

XPath

XPath is a language for extracting portions of XML documents. It’s sort of like SQL for XML; you can also think of it as a kind of regular expression language. It allows you to extract particular elements from an XML document.

XPath is a heavy-duty query language, and is capable of some remarkable things, but most of our examples will be simple. The XPath queries in both TwitterStatus.java and DeliciousPosts.java can be understood even if you only know these three bits of XPath syntax:

  • //foo — search for an element with the name “foo” anywhere in the document
  • foo — search for an element with the name “foo” that is the child of the element on which selectNodes was called (this is usually the Document object, but can be any Element)
  • foo/bar — search for an element with the name “bar” that is the child of an element named “foo”

For example:

  • //status/text translates as “find me every text element that is a child of a status element, anywhere in the document”
  • //posts/post translates as “find me every post element that is a child of a posts element, anywhere in the document”
  • item/title translates as “find me any title node that is a child of an item node that is a child of the current element”
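XPath isn’t specific to dom4j. Just to show the query language itself at work, here’s a //kitten query run against a stripped-down version of our kittens document using Java’s built-in javax.xml.xpath package (the class name and the inline XML here are made up for the example):

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class KittenXPath {
  public static void main(String[] args) throws Exception {
    String xml = "<kittens>"
      + "<kitten name=\"Fluffy\" color=\"brown\" />"
      + "<kitten name=\"Monsieur Whiskeurs\" color=\"gray\" />"
      + "</kittens>";
    // parse the XML string into a DOM Document
    Document doc = DocumentBuilderFactory.newInstance()
      .newDocumentBuilder()
      .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
    // find every "kitten" element anywhere in the document
    XPath xpath = XPathFactory.newInstance().newXPath();
    NodeList kittens = (NodeList) xpath.evaluate(
      "//kitten", doc, XPathConstants.NODESET);
    for (int i = 0; i < kittens.getLength(); i++) {
      Element kitten = (Element) kittens.item(i);
      System.out.println(kitten.getAttribute("name"));
    }
    // prints:
    // Fluffy
    // Monsieur Whiskeurs
  }
}
```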


An example: Parsing RSS with Processing

MetaFilter RSS in Processing (screenshot)

Download the source code here.

RSS is an XML format that many web sites use to generate “syndicated” versions of their content. The Processing sketch above fetches the RSS feed for popular interwub timewaster MetaFilter, and draws to the screen the titles and URLs of every article in the feed. Here’s a direct link to the RSS file in question.

The relevant part of the code is in the fetchTitles function:

    SAXReader reader = new SAXReader();
    EasyHTTPGet getter = new EasyHTTPGet(rssUrl);
    Document document = reader.read(getter.responseAsInputStream());
    List items = document.selectNodes("//item");
    for (int i = 0; i < items.size(); i++) {
      Element item = (Element)items.get(i);
      Element title = (Element)item.selectSingleNode("title");
      Element href = (Element)item.selectSingleNode("link");
      titles.add(title.getText());
      links.add(href.getText());
    }

Once we’ve retrieved all item nodes (querying the document object with XPath //item), we extract the contents of the title node and link node.

Atom and XML namespaces

Another popular syndication format is Atom. (Here’s a comparison of RSS and Atom. Some sites support both, some sites only support one or the other.)

Notably, Twitter’s search API returns documents in Atom format. The example program for this section—TwitterSearch.java—fetches the Atom feed for a particular Twitter search term, and then extracts the contents of the matching Twitter posts, along with their authors. Run it on the command line like so:

$ java TwitterSearch sandwich

Atom’s just regular XML, but it utilizes a feature of XML that we haven’t talked about yet: namespaces. Namespaces look a bit scary, but they’re actually very useful—though they do require us to change our code a little bit. We’ll take a very practical approach here to explaining their use.

How can you tell if an XML document uses namespaces? It’s all in the attribute of the root element. Here’s the root element from a typical Atom feed XML document:

<feed xmlns="http://www.w3.org/2005/Atom">
...
</feed>

It’s the xmlns attribute that we’re looking for. When the xmlns attribute is found on the root element of an XML document, it means that the entire document is using a particular namespace.

You can think of the namespace as a prefix that is automatically prepended to all tag names in the document. For example, if the xmlns attribute is set to http://www.w3.org/2005/Atom, what is written as <title> is seen by the parser as (something like) <{http://www.w3.org/2005/Atom}title>.
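You can actually see this happen with the JDK’s own namespace-aware parser. This short program (not part of the example code; it uses the built-in org.w3c.dom classes rather than dom4j) parses a minimal Atom document and shows that the parser stores each tag as a namespace URI plus a local name:

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class NamespaceDemo {
  public static void main(String[] args) throws Exception {
    String xml = "<feed xmlns=\"http://www.w3.org/2005/Atom\">"
      + "<title>hello</title></feed>";
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setNamespaceAware(true); // off by default in the JDK
    Document doc = factory.newDocumentBuilder()
      .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
    Element root = doc.getDocumentElement();
    // the parser sees the tag as a (namespace URI, local name) pair
    System.out.println(root.getNamespaceURI()); // http://www.w3.org/2005/Atom
    System.out.println(root.getLocalName());   // feed
  }
}
```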

XPath queries with namespaces

You might be able to see the problem now: if the parser sees (e.g.) <link> internally as <{http://www.w3.org/2005/Atom}link>, our regular XPath queries won’t work: //link won’t match the tag. In order to get around this problem, we use dom4j’s DocumentFactory class. Here’s the relevant code, from TwitterSearch.java:

    HashMap<String, String> map = new HashMap<String, String>();
    map.put("atom", "http://www.w3.org/2005/Atom");
    DocumentFactory factory = DocumentFactory.getInstance();
    factory.setXPathNamespaceURIs(map);

The DocumentFactory class is what dom4j uses internally to create new document objects. By manipulating DocumentFactory, we can change the behavior of our Document objects before we create them. The setXPathNamespaceURIs method will let us use XML namespaces in our XPath queries. It takes a HashMap as an argument—with the namespace URI (the value of the xmlns attribute of the root element) as the value, and the “short name” that we want to use for that namespace as the key. (The short name could be anything; I chose “atom” here because it’s easy to remember.)

Once we’ve made this adjustment to the way our document thinks about XPath queries, we can get on with business. Here’s the portion of TwitterSearch.java that queries the XML document:

    Document document = reader.read(getter.responseAsInputStream());
    List entries = document.selectNodes("//atom:entry");
    for (Object o: entries) {
      Element entry = (Element)o;
      Element title = (Element)entry.selectSingleNode("atom:title");
      Element authorName =
        (Element)entry.selectSingleNode("atom:author/atom:name");
      System.out.println(authorName.getText() + ": " + title.getText());
    }

As you can see, we can reference tags within the Atom namespace by prepending atom: (or whatever the key in our setXPathNamespaceURIs call was) to the beginning of the tag name. Once we have a list of entry tags (referenced in XPath as atom:entry), we can get every title and name tag within the entry; we then print those out to the screen.

The moral of the story

Check the root element of your XML first! If you find an xmlns attribute, you’ll need to use the code above to make the namespace visible to your XPath queries. As an example, let’s say that we added a namespace to our kitten example above, like so:

<?xml version="1.0" encoding="UTF-8"?>
<kittens xmlns="http://a2z.decontextualize.com/kittens">
  <kitten name="Fluffy" color="brown">
    <televisionshows>
      <show>House</show>
      <show>Weeds</show>
    </televisionshows>
    <lastcheckup date="2009-01-01" />
  </kitten>
  ...
</kittens>

A regular XPath query, like //kitten, wouldn’t work—because (internally) the name of the tag is actually <{http://a2z.decontextualize.com/kittens}kitten>. In order to find that tag, we’d have to add something like this to our code (before any dom4j document objects were created):

    HashMap<String, String> map = new HashMap<String, String>();
    map.put("k", "http://a2z.decontextualize.com/kittens");
    DocumentFactory factory = DocumentFactory.getInstance();
    factory.setXPathNamespaceURIs(map);

The XPath query above could then be re-written as //k:kitten. To get all show tags: //k:televisionshows/k:show, and so on.
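dom4j isn’t the only library that makes you do this. For comparison—this program isn’t part of the example code, and it uses the JDK’s built-in XPath engine rather than dom4j—here the “k” prefix is registered by hand with a NamespaceContext, and an unprefixed //kitten query comes back empty while //k:kitten finds the tag:

```java
import java.io.ByteArrayInputStream;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class NamespacedXPath {
  public static void main(String[] args) throws Exception {
    String xml = "<kittens xmlns=\"http://a2z.decontextualize.com/kittens\">"
      + "<kitten name=\"Fluffy\" color=\"brown\" /></kittens>";
    DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
    dbf.setNamespaceAware(true);
    Document doc = dbf.newDocumentBuilder()
      .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
    XPath xpath = XPathFactory.newInstance().newXPath();
    // without a prefix mapping, //kitten matches nothing
    NodeList plain = (NodeList) xpath.evaluate(
      "//kitten", doc, XPathConstants.NODESET);
    System.out.println(plain.getLength()); // 0
    // map the short name "k" to the namespace URI, much like
    // dom4j's setXPathNamespaceURIs
    xpath.setNamespaceContext(new NamespaceContext() {
      public String getNamespaceURI(String prefix) {
        if ("k".equals(prefix)) {
          return "http://a2z.decontextualize.com/kittens";
        }
        return XMLConstants.NULL_NS_URI;
      }
      public String getPrefix(String uri) { return null; }
      public Iterator<String> getPrefixes(String uri) { return null; }
    });
    NodeList prefixed = (NodeList) xpath.evaluate(
      "//k:kitten", doc, XPathConstants.NODESET);
    System.out.println(prefixed.getLength()); // 1
  }
}
```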

(Why go to all this trouble? XML namespaces exist so that we can easily include snippets of other XML documents within our own documents, without risking any ambiguity.)


Homework for next week

Get some XML from a web service. Extract some interesting information from the XML and use it as input to one of your homework assignments from previous weeks.

You could do this either by piping the output of your XML parsing program to the input of your previously implemented program, or by using a class (e.g., instantiate and use the MarkovChain class).
