Getting data from the web

Python: hidden details

In the interest of brevity, we’ve skipped over some fairly important details of Python. Here’s our chance to play catch-up.

Other kinds of loops; loop control

The for loop is far and away the most common loop in Python. But there’s another kind of loop that you’ll encounter frequently: the while loop. The while loop begins with the keyword while and a condition; as long as that condition evaluates to True, the loop will continue to execute. An example:

>>> i = 0
>>> while i < 10:
...     i += 1
...     print i
... 
1
2
3
4
5
6
7
8
9
10

Python also has two loop control statements. The first, continue, tells Python to skip the rest of the current iteration of the loop, and continue with the next iteration. To make the previous example only print even numbers, for example:

>>> i = 0
>>> while i < 10:
...     i += 1
...     if i % 2 == 1:
...             continue
...     print i
... 
2
4
6
8
10

The continue statement causes Python to skip back to the top of the loop; the remaining statements aren't executed.

Finally, we have break, which causes Python to drop out of the loop altogether. One last modification of the example above:

>>> i = 0
>>> while i < 10:
...     i += 1
...     if i > 5:
...             break
...     print i
... 
1
2
3
4
5

Here, as soon as i reaches a value greater than 5, the break statement gets executed, and Python stops executing the loop.

If these examples seem contrived to you, you're very perceptive! It's usually easier to use for with a carefully crafted list to iterate over, rather than a while loop with a bunch of continue and break statements.

One case where the while loop is helpful is when we don't have a set list to iterate over---when we want to do something over and over, forever. Forever, at least, until a certain condition obtains. Here's a common idiom for such code in Python:

while True:
  some statements...
  if some condition:
    break
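To make that concrete, here's a small runnable version of the idiom (a made-up example, not from the class code): keep echoing the user's input until they type "quit".

while True:
    line = raw_input("> ")  # prompt for a line of input
    if line == "quit":
        break               # the condition obtains; leave the loop
    print line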

Tuples

Tuples are another basic data structure of Python, along with lists, sets, and dictionaries. They look and behave almost exactly like lists, except they're declared with parentheses rather than square brackets. Here's a comparison:

>>> foo_tuple = (1, 2, 3, 4)
>>> foo_list = [1, 2, 3, 4]
>>> foo_tuple[0]
1
>>> foo_list[0]
1

The main difference between lists and tuples is that tuples are immutable: they can't be changed once they've been declared. For example:

>>> foo_list[0] = 5
>>> foo_list
[5, 2, 3, 4]
>>> foo_tuple[0] = 5
Traceback (most recent call last):
  File "", line 1, in 
TypeError: 'tuple' object does not support item assignment

As you can see, the list allowed us to assign to one of its indices; the tuple did not.

Because tuples are immutable, Python can do some creative optimizations behind the scenes, meaning that for many applications tuples are faster than lists. Another difference between tuples and lists is that tuples can be dictionary keys, while lists cannot. (This will become important in future class sessions.)
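Here's a quick (made-up) illustration of that last point, using latitude/longitude pairs as dictionary keys:

>>> locations = {}
>>> locations[(40.7, -74.0)] = "New York"  # a tuple works as a key
>>> locations[(40.7, -74.0)]
'New York'
>>> locations[[40.7, -74.0]] = "oops"      # a list does not
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'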

from module import stuff

There's an alternate way to use the import statement that is in many cases cleaner than the way we've been using import so far. The from module import names syntax imports the code in module, but also makes the names listed in names available directly, without the module's name in front. For example:

>>> from random import choice
>>> choice(foo_list) # not random.choice!
3

Multiple names can be specified:

>>> from re import search, findall
>>> search(r"foo", "foobar")
<_sre.SRE_Match object at 0x6d330>
>>> findall(r"\b\w{3}\b", "the quick brown fox")
['the', 'fox']

String formatting

So far, we've been using the + operator to put strings together. Python supports an alternate syntax, with the % operator, which allows us to use a very simple templating language for interpolating values into strings.

The way it works: on the left side of the %, specify a string with some number of occurrences of %s in it; on the right side, a tuple. Successive elements from the tuple will be used to "fill in" the corresponding occurrences of %s. (If there's only one %s, a single value will do in place of the tuple.) Some examples:

>>> "the %s fox" % ('quick') # single replacement
'the quick fox'
>>> "the %s %s fox" % ('quick', 'brown') # multiple replacements
'the quick brown fox'
>>> "the %s %s fox %s over the lazy %s" % ('quick', 'brown', 'jumped', 'ocelot') 
'the quick brown fox jumped over the lazy ocelot'

This is the most basic (and most common) way to use this syntax, but the % operator is much more sophisticated; there are other replacement strings aside from %s, for example. Here's an overview from the Python documentation. (Those of you familiar with the sprintf function in C/Perl should feel right at home.)
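To give you a taste of those other replacement strings (just the ones you're most likely to want first, not an exhaustive list): %d interpolates an integer, and %f a floating-point number, optionally with a specified precision.

>>> "%d kittens" % 2
'2 kittens'
>>> "%.2f pounds" % 7.5
'7.50 pounds'
>>> "%s weighs %.1f pounds at age %d" % ('Fluffy', 7.5, 3)
'Fluffy weighs 7.5 pounds at age 3'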

Python 2.6 and later support an entirely different set of string formatting/templating operations, which has a number of advantages (and drawbacks). Read about it here.

File objects

We've skirted around the issue of reading from files, for the most part, by relying on UNIX stdin/stdout for input and output. As it turns out, Python makes opening files and reading data from them remarkably easy. File input/output is managed by file objects. You can create one by calling the built-in open function, with the name of the file you want to open as an argument. Here's some code demonstrating what you can do with such objects:

>>> file_obj = open("this_is_just.txt")
>>> type(file_obj)
<type 'file'>
>>> dir(file_obj)
[... 'closed', 'encoding', 'fileno', 'flush', 'isatty', 'mode', 'name', 'newlines', 'next', 'read', 'readinto', 'readline', 'readlines', 'seek', 'softspace', 'tell', 'truncate', 'write', 'writelines', 'xreadlines']
>>> file_obj.readline() # reads a single line from the file
'This is just to say\n'
>>> file_obj.read() # reads the entire file into a string
'\nI have eaten\nthe plums\nthat were in\nthe icebox\n\nand which\nyou were probably\nsaving\nfor breakfast\n\nForgive me\nthey were delicious\nso sweet\nand so cold\n'

The most helpful method of the file object demonstrated above is read, which reads in the entire file as a string. More information on reading and writing from the Python tutorial.
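One more trick that isn't demonstrated above: file objects support iteration, so a for loop over a file object yields one line at a time. A minimal sketch, assuming this_is_just.txt is in the current directory:

for line in open("this_is_just.txt"):
    print line.strip()  # strip off the trailing newline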

Getting data from the web

Up to now, we've been using static texts: spare scraps that we've had on our hard drives, or a stray public domain text we've grabbed from wherever. This week, we're going to try to get our hands on some living text: text from the web. In particular, we're going to learn how to mine Atom and RSS feeds, and use Twitter's web-facing APIs to get XML and extract interesting information from it.

Getting information from the web is easy: all you need is a web client and a URL. By "web client" I mean any program that knows how to talk HTTP (Hypertext Transfer Protocol)---your computer is lousy with 'em. Your web browser is the one you're most familiar with; curl is another. (curl is the command-line utility we've been using in class to fetch text and sample code from the server.) The most basic operation that we can do with HTTP is to fetch the document at a given URL. Here's how to do that from the command line with curl:

$ curl http://www.decontextualize.com/teaching/dwwp/getting-data-from-the-web/

URLs

Most of what we do on the web---whether we're using a web browser or writing a program that accesses the web---boils down to manipulating URLs. So it's important to understand the structure of the URL. Let's take the following URL, for example:

http://www.example.com/foo/bar?arg1=baz&arg2=quux

Here are the components of that URL:

  • http: the protocol
  • www.example.com: the host
  • /foo/bar: the path
  • ?arg1=baz&arg2=quux: the query string

The host specifies which server on the Internet we're going to be talking to, and the protocol specifies how we're going to be talking to that server. (For web requests, the protocol will be either http or https.) The path and the query string together tell the server what it is exactly that we want that computer to send us.

Traditionally, the path corresponds to a particular file on the server, and the query string specifies further information about how that file should be accessed. (The query string is a list of key/value pairs separated by ampersands: you can think of it like sending arguments to a function.) Everything in a URL after the host name, however, is fair game, and individual web sites are free to form their URLs however they please.
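As an aside: Python can build query strings for you. Here's a minimal sketch using the urlencode function from the urllib module (which we'll meet properly below); the host and parameter names are the made-up ones from the example URL above.

import urllib

params = {'arg1': 'baz', 'arg2': 'quux'}
url = "http://www.example.com/foo/bar?" + urllib.urlencode(params)
print url  # e.g. http://www.example.com/foo/bar?arg1=baz&arg2=quux
           # (the order of the pairs may vary)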

In the course of normal web browsing, this flexibility in URL structure isn't a problem, since the HTML documents that our browser retrieves from a site already contain URLs to other documents on the site (in the form of hyperlinks). Most web services, however, require you to construct URLs in very particular ways in order to get the data you want.

In fact, most of the work you'll do in learning how to use a web service is learning how to construct and manipulate URLs. A quick glance through the documentation for web services like Twitter and Delicious reveals that the bulk of the documentation is just a big list of URLs, with information on how to adjust those URLs to get the information you want.

HTML, XML, web services and "web APIs"

The most common format for documents on the web is HTML (HyperText Markup Language). Web browsers like HTML because they know how to render it as human-readable documents---in fact, that's what web browsers are for: turning HTML from the web into visually splendid and exciting multimedia affairs. (You're looking at an HTML page right now.) HTML was specifically designed to be a tool for creating web pages, and it excels at that, but it's not so great for describing structured data. Another popular format---and the format we'll be learning how to work with this week---is XML (eXtensible Markup Language). XML is similar to HTML in many ways, but has additional affordances that allow it to describe virtually any kind of data in a structured and portable way.

Roughly speaking, whenever a web site exposes a URL for human readers, the document at that URL is in HTML. Whenever a web site exposes a URL for programmatic use, the document at that URL is in XML. (There are other formats commonly used for computer-readable documents, like JSON. But let's keep it simple for now.) As an example, Twitter has a human-readable version of its public timeline, available at the following URL:

http://twitter.com/public_timeline

But Twitter also has a version of its public timeline designed to be easily readable by computers. This is the URL, and it returns a document in XML format:

http://twitter.com/statuses/public_timeline.xml

Every web site makes available a number of URLs that return human-readable documents; many web sites (like Twitter) also make available URLs that return documents intended to be read by computer programs. Often---as is the case with Twitter, or with sites like Metafilter that make their content available through RSS feeds---these are just two views into the same data.

You can think of a "web service" as the set of URLs that a web site makes available that are intended to be read by computer programs. "Web API" is a popular term that means the same thing. (API stands for "application programming interface"; a "web API" is an interface that enables you to program applications that use the web site's data.)

urllib: Getting the contents of URLs with Python

Python makes it easy to fetch the data at a particular URL. The simplest means of doing this is Python's urllib module. The urllib module gives us a function called urlopen, which takes a URL as an argument and returns a file-like object, from which we can extract the data retrieved from the URL.

It's file-like in that we can call the read method on it, which will return all of the data retrieved from the URL as a string. Here's a basic example (get_url.py):

import urllib
import sys

url = sys.argv[1]             # URL from the command line
urlobj = urllib.urlopen(url)  # returns a file-like object
data = urlobj.read()          # read the entire response into a string
print data

This program takes a URL on the command line, and prints out whatever it gets from that URL to standard output. It's a tiny clone of curl.
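For example, to fetch the page from the curl example earlier:

$ python get_url.py http://www.decontextualize.com/teaching/dwwp/getting-data-from-the-web/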

All well and good, you say, but now that we have data from the web, how do we get meaningful information out of it? Most information on the web is in either HTML or XML, so first, we're going to have to understand how those two languages work.

Structure of XML/HTML

All documents have some kind of content. For the plain text files we've been working with so far in this class, the content is just what's in the file: bytes of text, separated into lines. Documents can also have structure: some indication of (a) what parts of the document mean and (b) how parts of a document relate to one another. Plain text documents (by definition) have almost no structure: there's no indication of how the content should be divided up, and no indication of what the different parts of the document are supposed to mean.

XML and HTML are standards for "marking up" a plain-text document, in order to structure its content. They're intended to make structured data easy to share and easy to parse. Let's take a look at a sample XML file, which describes a pair of kittens and their favorite television shows:

<?xml version="1.0" encoding="UTF-8"?>
<kittens>
  <kitten name="Fluffy" color="brown">
    <televisionshows>
      <show>House</show>
      <show>Weeds</show>
    </televisionshows>
    <lastcheckup date="2009-01-01" />
  </kitten>
  <kitten name="Monsieur Whiskeurs" color="gray">
    <televisionshows>
      <show>The Office</show>
    </televisionshows>
    <lastcheckup date="2009-02-15" />
  </kitten>
</kittens>

Let's talk a little bit about the anatomy of this file. Line 1 contains the XML declaration, which specifies which version of XML the document contains and what character encoding it uses. (Depending on the flavor of HTML, this declaration may be absent or different.) Line 2 contains the opening tag of the root element: this element contains all other elements in the file. Every XML document must have exactly one element at the root level.

XML and HTML elements are defined by pairs of matching tags: everything from <kittens> to </kittens> comprises the "kittens" element; everything from <kitten name="Fluffy" color="brown"> to </kitten> comprises the first "kitten" element. Some elements---like the "lastcheckup" elements in the example above---don't have closing tags; such tags use a special syntax, with a slash before the closing angle bracket. (This is unlike HTML, where unclosed tags like <img> don't need the trailing slash.)

XML and HTML elements exist in a hierarchical relationship: in the example above, the "kittens" element contains several "kitten" elements, which in turn contain "televisionshows" and "lastcheckup" elements. These hierarchical relationships are commonly referred to using a familial metaphor: the first "kitten" element, for example, is a child of the "kittens" element---and the "kittens" element is the parent of both "kitten" elements. Two elements that are children of the same parent are called siblings: for example, both "show" elements under Fluffy's "televisionshows" element are siblings (since they have the same parent).

In addition to containing other elements, an element can have attributes: name/value pairs that occur in the opening tag of an element. Both "kitten" tags have two attributes: "name" and "color." An element can also contain text that isn't a part of any other element: this is often called the element's contents or text. (The "show" elements in the examples above are a good example of elements with text.)

XML vs. HTML

XML and HTML have a common ancestry, and because they're used in similar domains (making data available on the web), tools designed to work with one will often work with the other. The library we'll be working with to parse XML/HTML, BeautifulSoup, is one such tool. But there are a number of differences between the two that are worth noting.

  • HTML doesn't require an XML declaration (the <?xml ... ?> line at the beginning of every XML file)
  • HTML can have multiple elements at the root, not just one
  • Attributes on HTML tags don't have to have quotes around them (e.g., <foo bar=baz></foo> is valid HTML but not valid XML)
  • In HTML, empty element tags don't require the closing slash as they do in XML (e.g., <img src="hello"> is valid HTML but not valid XML)
  • HTML is forgiving of improperly nested tags: <b><i>foo</b></i> works fine in HTML, but would make an XML parser choke

Most importantly: XML is designed to be a general format for exchanging any kind of data. You can specify whatever tags and attributes you want. HTML is a language for structuring web pages; it has a closed set of tags, and a long history of being hacked for browser compatibility. HTML works well for making web pages, but often doesn't work well for making computer-parseable data.

Parsing HTML/XML with BeautifulSoup

There are many ways to parse XML and HTML with Python. We're going to use a library by Leonard Richardson called Beautiful Soup. (Beautiful Soup isn't part of the standard distribution; you have to download it yourself. There's a copy in the directory on the server with the example code for this session.)

Beautiful Soup makes it easy to extract data from HTML and XML documents in efficient and easy-to-understand ways.

BeautifulSoup: the basics

Let's go back to the kittens.xml file mentioned above. I'll print it here again, just to keep it fresh in your mind:

<?xml version="1.0" encoding="UTF-8"?>
<kittens>
  <kitten name="Fluffy" color="brown">
    <televisionshows>
      <show>House</show>
      <show>Weeds</show>
    </televisionshows>
    <lastcheckup date="2009-01-01" />
  </kitten>
  <kitten name="Monsieur Whiskeurs" color="gray">
    <televisionshows>
      <show>The Office</show>
    </televisionshows>
    <lastcheckup date="2009-02-15" />
  </kitten>
</kittens>

Let's begin an interactive session, first getting the contents of kittens.xml and then using BeautifulSoup to extract data from the XML.

>>> from BeautifulSoup import BeautifulStoneSoup
>>> data = open("kittens.xml").read()
>>> soup = BeautifulStoneSoup(data)
>>> kitten_tag = soup.find('kitten')

The first three lines above demonstrate how to bring the appropriate class into your program from the Beautiful Soup library, load a text file, and create a BeautifulStoneSoup object with that data. (Note: for XML data, use the BeautifulStoneSoup class; for HTML, use BeautifulSoup.)

We then use the find method on soup, which will return the first kitten tag in the document. The kitten_tag variable will contain an object of type Tag (a Beautiful Soup class). What can we do with such objects? Lots of stuff:

>>> kitten_tag = soup.find('kitten')
>>> kitten_tag['name']
u'Fluffy'
>>> kitten_tag['color']
u'brown'
>>> lastcheckup_tag = kitten_tag.find('lastcheckup')
>>> lastcheckup_tag['date']
u'2009-01-01'

We can use tag objects as though they were dictionaries, which will allow us to access the attributes for that tag. (Check the XML; you'll see that the name and color attributes of the first kitten tag are, indeed, 'Fluffy' and 'brown', respectively.)

The tag object also itself supports the find method. Calling find on a tag will return the first tag with the specified name that is a descendant of the given tag (i.e., child, grandchild, great-grandchild, etc.). Finding the first lastcheckup tag and then checking the date attribute of that tag does indeed reveal Fluffy's most recent checkup date (2009-01-01).

More than one: findAll; the string attribute

The findAll method works just like find, except it returns a list of all matching tags (actually a "ResultSet," but it behaves like a list for our purposes). Let's say we wanted to extract and then loop over all kitten tags in the document. Here's how to do it:

>>> kitten_tags = soup.findAll("kitten")
>>> for kitten_tag in kitten_tags:
...     print kitten_tag['name']
... 
Fluffy
Monsieur Whiskeurs

Let's say that in addition to printing out each kitten's name, we also wanted to print out his or her favorite TV shows. We'll accomplish this by calling findAll on each kitten tag in the loop:

>>> for kitten_tag in kitten_tags:
...     print kitten_tag['name']
...     for show_tag in kitten_tag.findAll('show'):
...             print "show: " + show_tag.string
... 
Fluffy
show: House
show: Weeds
Monsieur Whiskeurs
show: The Office

The one new thing here is the string attribute. We use string to get the text inside of a tag—e.g., the names of the television shows inside of the show tags in our kitten XML file.

This idiom—performing findAll on the whole document, looping over the results, and then performing find or findAll on each of those tags—is very common in both XML parsing and web scraping.

More specific: the attrs argument

The find and findAll methods can take a second parameter, named attrs, which allows you to search for only those tags that have attributes with particular values. The parameter that you pass should be a dictionary, whose keys are the attributes you want to match, and whose values are the desired values for those attributes. For example, if we wanted to find only kitten tags whose color value is ‘gray’:

>>> kitten_tags = soup.findAll('kitten', attrs={'color': 'gray'})
>>> for kitten_tag in kitten_tags:
...     print kitten_tag['name']
... 
Monsieur Whiskeurs

We can see that the only matching kitten tag was Monsieur Whiskeurs. This functionality doesn’t look impressive when we’re dealing with such a small data set, but it becomes immensely important for dealing with (e.g.) HTML. We might know, for example, that the piece of data we want is contained within a div tag with a particular attribute (say a class or id attribute). The attrs argument to findAll allows us to find only the tags we want to see.
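To make that concrete, here's a hedged sketch: scraping a scrap of HTML for div tags with a particular class attribute. (The HTML and the class name are invented for illustration; note that we use the BeautifulSoup class here, since we're parsing HTML rather than XML.)

from BeautifulSoup import BeautifulSoup

html = '<div class="entry">First post!</div><div class="sidebar">junk</div>'
soup = BeautifulSoup(html)
for div_tag in soup.findAll('div', attrs={'class': 'entry'}):
    print div_tag.string  # prints: First post!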

(Both find and findAll have even more powerful ways to specify particular tags to find. Check the documentation for more information.)

Lateral movement: findNextSibling

Sometimes we want to be able to find tags that are siblings of the current tag (i.e., on the same level of the XML hierarchy), rather than descendants. For this purpose, Beautiful Soup supplies a findNextSibling method:

>>> kitten_tag = soup.find('kitten')
>>> print kitten_tag['name']
Fluffy
>>> next_kitten_tag = kitten_tag.findNextSibling('kitten')
>>> print next_kitten_tag['name']
Monsieur Whiskeurs

The initial find on soup returns the first kitten tag. Calling findNextSibling on that tag gives us the next kitten tag on the same level of the hierarchy.

(Again, this becomes more useful in web scraping applications where, for example, we want to get the paragraph that follows a particular h3 tag.)
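Here's a small invented example of exactly that h3-then-paragraph pattern (the HTML is made up, but the method calls are the same ones you'd use on a real page):

from BeautifulSoup import BeautifulSoup

html = "<h3>Kittens</h3><p>All about kittens.</p><h3>Puppies</h3><p>All about puppies.</p>"
soup = BeautifulSoup(html)
h3_tag = soup.find('h3')
print h3_tag.findNextSibling('p').string  # prints: All about kittens.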

HTML scraping: some examples

youtube_comments.py; itp_course_list.py; alphapedia_crawl.py (if time)

XML parsing: some examples

RSS is an XML format that many web sites use to generate “syndicated” versions of their content. The example program for this section fetches the RSS feed for popular interwub timewaster MetaFilter, and prints the titles and URLs of every article in the feed. Here’s a direct link to the RSS file in question.

meta_summary.py
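(meta_summary.py isn't reproduced here, but the core of a program like it might look like the following sketch. The feed URL is an assumption---check MetaFilter itself for its current feed address---and in RSS, item elements contain title and link elements with text inside.)

import urllib
from BeautifulSoup import BeautifulStoneSoup

feed_url = "http://www.metafilter.com/rss.xml"  # assumed feed URL
soup = BeautifulStoneSoup(urllib.urlopen(feed_url).read())
for item_tag in soup.findAll('item'):
    print item_tag.find('title').string
    print item_tag.find('link').string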

Another popular syndication format is Atom. (Here’s a comparison of RSS and Atom. Some sites support both, some sites only support one or the other.)

Notably, Twitter’s search API returns documents in Atom format. The example program for this section fetches the Atom feed for a particular Twitter search term, and then extracts the contents of the matching Twitter posts, along with their authors.

search_twitter.py
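(Likewise, search_twitter.py isn't reproduced here; a bare-bones version might look like this sketch. The URL follows Twitter's Atom search API as it existed at the time; entry, title, and author/name are standard Atom elements.)

import urllib
from BeautifulSoup import BeautifulStoneSoup

query = urllib.quote("kittens")  # escape the search term for use in a URL
url = "http://search.twitter.com/search.atom?q=" + query
soup = BeautifulStoneSoup(urllib.urlopen(url).read())
for entry_tag in soup.findAll('entry'):
    author_tag = entry_tag.find('author')
    print author_tag.find('name').string + ": " + entry_tag.find('title').string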

Custom XML schemas: more Twitter

twitter_friend_locator.py

Homework #3: Due 29 July

Write a Python program that extracts text from somewhere on the web. Use this program to provide input to a text generation/mungeing algorithm, in order to make a new remixed text. Think about the relationship between the data that you’re getting from the web and the algorithm you’ve chosen to modify it. (The text generation/mungeing algorithm can be something you’ve already written.)

Bonus: Write your web-extraction program as a class. (See youtube_comments_extractor.py for one possible model.)
