Getting data from the web

Python: hidden details

In the interest of brevity, we’ve skipped over some fairly important details of Python. Here’s our chance to play catch-up.

Other kinds of loops; loop control

The for loop is far and away the most common loop in Python. But there’s another kind of loop that you’ll encounter frequently: the while loop. The while loop begins with the keyword while and a condition; as long as that condition evaluates to True, the loop will continue to execute. An example:

>>> i = 0
>>> while i < 10:
...     i += 1
...     print i

Python also has two loop control statements. The first, continue, tells Python to skip the rest of the current iteration of the loop, and continue with the next iteration. To make the previous example only print even numbers, for example:

>>> i = 0
>>> while i < 10:
...     i += 1
...     if i % 2 == 1:
...             continue
...     print i

The continue statement causes Python to skip back to the top of the loop; the remaining statements aren't executed.

Finally, we have break, which causes Python to drop out of the loop altogether. One last modification of the example above:

>>> i = 0
>>> while i < 10:
...     i += 1
...     if i > 5:
...             break
...     print i

Here, as soon as i achieves a value greater than 5, the break statement gets executed, and Python stops executing the loop.

If these examples seem contrived to you, you're very perceptive! It's usually easier to use for with a carefully crafted list to iterate over, rather than a while with a bunch of continue and break statements.

One case where the while loop is helpful is when we don't have a set list to iterate over---when we want to do something over and over, forever. Forever, at least, until a certain condition obtains. Here's a common idiom for such code in Python:

while True:
    some statements...
    if some condition:
        break

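To make the idiom concrete, here's a small sketch (the die-rolling scenario is invented for illustration): roll a six-sided die over and over, and use break to escape the otherwise-infinite loop the first time a six comes up.

```python
import random

# Keep rolling until we get a six; break ends the otherwise-infinite loop.
rolls = 0
while True:
    rolls += 1
    if random.randint(1, 6) == 6:
        break

print(rolls)  # how many rolls it took
```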

Tuples

Tuples are another basic data structure of Python, along with lists, sets, and dictionaries. They look and behave almost exactly like lists, except they're declared with parentheses rather than square brackets. Here's a comparison:

>>> foo_tuple = (1, 2, 3, 4)
>>> foo_list = [1, 2, 3, 4]
>>> foo_tuple[0]
1
>>> foo_list[0]
1

The main difference between lists and tuples is that tuples are immutable: they can't be changed once they've been declared. For example:

>>> foo_list[0] = 5
>>> foo_list
[5, 2, 3, 4]
>>> foo_tuple[0] = 5
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'tuple' object does not support item assignment

As you can see, the list allowed us to assign to one of its indices; the tuple did not.

Because tuples are immutable, Python can do some creative optimizations behind the scenes, meaning that for many applications tuples are faster than lists. Another difference between tuples and lists is that tuples can be dictionary keys, while lists cannot. (This will become important in future class sessions.)
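
Here's a quick sketch of that last point (the point names are made up for illustration): a tuple works fine as a dictionary key, while a list raises a TypeError.

```python
# Tuples are immutable (and hashable), so they can serve as dictionary keys.
points = {(0, 0): "origin", (3, 4): "treasure"}
print(points[(3, 4)])

# Lists are mutable (and unhashable), so using one as a key fails.
try:
    points[[0, 0]] = "nope"
except TypeError as err:
    print(err)
```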

from module import stuff

There's an alternate way to use the import statement that is in many cases cleaner than the way we've been using import so far. The from module import names syntax imports the code in module, but also makes it so the names in names don't have to be used with the module's name beforehand. For example:

>>> from random import choice
>>> choice(foo_list) # not random.choice!

Multiple names can be specified:

>>> from re import search, findall
>>> search(r"foo", "foobar")
<_sre.SRE_Match object at 0x6d330>
>>> findall(r"\b\w{3}\b", "the quick brown fox")
['the', 'fox']

File objects

We've skirted around the issue of reading from files, for the most part, by relying on UNIX stdin/stdout for input and output. As it turns out, Python makes opening files and reading data from them remarkably easy. File input/output is managed by file objects. You can create one by calling the open function, with the name of the file you want to open as an argument. Here's some code demonstrating what you can do with such objects:

>>> file_obj = open("this_is_just.txt")
>>> type(file_obj)
<type 'file'>
>>> dir(file_obj)
[... 'closed', 'encoding', 'fileno', 'flush', 'isatty', 'mode', 'name', 'newlines', 'next', 'read', 'readinto', 'readline', 'readlines', 'seek', 'softspace', 'tell', 'truncate', 'write', 'writelines', 'xreadlines']
>>> file_obj.readline() # reads a single line from the file
'This is just to say\n'
>>> file_obj.read() # reads the rest of the file into a string
'\nI have eaten\nthe plums\nthat were in\nthe icebox\n\nand which\nyou were probably\nsaving\nfor breakfast\n\nForgive me\nthey were delicious\nso sweet\nand so cold\n'

The most helpful method of the file object demonstrated above is read, which reads in the entire file as a string. More information on reading and writing files can be found in the Python tutorial.
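
If you want to experiment without the poem on hand, here's a self-contained sketch (the filename is made up) that writes a small file and then reads it back with read:

```python
# Create a small file to experiment with.
out = open("just_a_test.txt", "w")
out.write("This is just to say\nI have eaten\nthe plums\n")
out.close()

# Open it again and slurp the whole thing into a string.
file_obj = open("just_a_test.txt")
contents = file_obj.read()
file_obj.close()
print(contents)
```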

Getting data from the web

Up to now, we've been using static texts: spare scraps that we've had on our hard drives, or a stray public domain text we've grabbed from wherever. This week, we're going to try to get our hands on some living text: text from the web. In particular, we're going to learn how to extract interesting text from the JSON returned from a few popular web APIs.

Getting information from the web is easy: all you need is a web client and a URL. By "web client" I mean any program that knows how to talk HTTP (Hypertext Transfer Protocol)---your computer is lousy with 'em. Your web browser is the one you're most familiar with; curl is another. (curl is the command-line utility we've been using in class to fetch text and sample code from the server.) The most basic operation that we can do with HTTP is to get the document at a given URL.


Most of what we do on the web---whether we're using a web browser or writing a program that accesses the web---boils down to manipulating URLs. So it's important to understand the structure of the URL. Let's take the following URL, for example (the host name here is an invented example):

http://example.com/foo/bar?arg1=baz&arg2=quux

Here are the components of that URL:

http                 protocol
example.com          host
/foo/bar             path
?arg1=baz&arg2=quux  query string

The host specifies which server on the Internet we're going to be talking to, and the protocol specifies how we're going to be talking to that server. (For web requests, the protocol will be either http or https.) The path and the query string together tell the server what it is exactly that we want that computer to send us.

Traditionally, the path corresponds to a particular file on the server, and the query string specifies further information about how that file should be accessed. The query string is a list of key/value pairs separated by ampersands: you can think of it like sending arguments to a function. These are just conventions---everything in a URL after the host name is fair game, and individual web sites are free to form their URLs however they please.
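
Python can take a URL apart into exactly these components for us. Here's a sketch using the standard library's urlparse module (renamed urllib.parse in Python 3); the URL itself is an invented example:

```python
try:
    from urlparse import urlparse, parse_qs      # Python 2
except ImportError:
    from urllib.parse import urlparse, parse_qs  # Python 3

parts = urlparse("http://example.com/foo/bar?arg1=baz&arg2=quux")
print(parts.scheme)           # the protocol: http
print(parts.netloc)           # the host: example.com
print(parts.path)             # the path: /foo/bar
print(parse_qs(parts.query))  # the query string, parsed into a dict
```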

In the course of normal web browsing, this flexibility in URL structure isn't a problem, since the HTML documents that our browser retrieves from a site already contain URLs to other documents on the site (in the form of hyperlinks). Most web services, however, require you to construct URLs in very particular ways in order to get the data you want.

In fact, most of the work you'll do in learning how to use a web service is learning how to construct and manipulate URLs. A quick glance through the documentation for web services like Twitter and Facebook reveals that the bulk of the documentation is just a big list of URLs, with information on how to adjust those URLs to get the information you want.

HTML, JSON, web services and "web APIs"

The most common format for documents on the web is HTML (HyperText Markup Language). Web browsers like HTML because they know how to render it as human-readable documents---in fact, that's what web browsers are for: turning HTML from the web into visually splendid and exciting multimedia affairs. (You're looking at an HTML page right now.) HTML was specifically designed to be a tool for creating web pages, and it excels at that, but it's not so great for describing structured data. Another popular format---and the format we'll be learning how to work with this week---is JSON (JavaScript Object Notation). Like HTML, JSON is a format for exchanging structured data between two computer programs. Unlike HTML, JSON is primarily intended to communicate content, rather than layout.

Roughly speaking, whenever a web site exposes a URL for human readers, the document at that URL is in HTML. Whenever a web site exposes a URL for programmatic use, the document at that URL is in JSON. (There are other formats commonly used for computer-readable documents, like XML. But let's keep it simple for now.) As an example, Facebook has a human-readable version of a fan page for Python, available at the following URL:

But Facebook also has a version of this fan page designed to be easily readable by computers. This is the URL, and it returns a document in JSON format:

Every web site makes available a number of URLs that return human-readable documents; many web sites (like Twitter) also make available URLs that return documents intended to be read by computer programs. Often---as is the case with Facebook, or with sites like Metafilter that make their content available through RSS feeds---these are just two views into the same data.

You can think of a "web service" as the set of URLs that a web site makes available that are intended to be read by computer programs. "Web API" is a popular term that means the same thing. (API stands for "application programming interface"; a "web API" is an interface that enables you to program applications that use the web site's data.)

Further reading on URLs and web services:

Using curl to get a URL

We've used curl before to fetch the content of a URL. It's a great tool for exploring web services. Here's how we would use curl to fetch the JSON version of the Facebook object mentioned above:

$ curl -s
{"id":"7899581788", ...[lots of json data]... ,"were_here_count":0}

curl simply does a request to the remote server for the given URL, and prints the response to standard output. (The -s flag tells curl not to display a progress bar. The progress bar is useful when we're just downloading a file, but not so useful when we want the contents of the response printed directly to the screen.)

Well, okay, we got the data, but it's a huge mess. We can clean it up using a special invocation of Python's JSON module, piping the results from curl in as input:

$ curl -s | python -mjson.tool
{
    "about": "programming, the way Guido indented it",
    "app_id": "0",
    "can_post": false,
    "category": "Product/service",
    "checkins": 0,
    "company_overview": "Python is a dynamic object-oriented programming language that can be used for many kinds of software development. It offers strong support for integration with other languages and tools, comes with extensive standard libraries, and can be learned in a few days. Many Python programmers report substantial productivity gains and feel the language encourages the development of higher quality, more maintainable code.",
    "cover": {
        "cover_id": "10150985230661789",
        "offset_x": 0,
        "offset_y": 0,
        "source": ""
    },
    "founded": "February 1991 by Guido van Rossum",
    "has_added_app": false,
    "id": "7899581788",
    "is_community_page": false,
    "is_published": true,
    "likes": 79538,
    "link": "",
    "name": "Python",
    "talking_about_count": 737,
    "username": "pythonlang",
    "website": "",
    "were_here_count": 0
}

Much better! We can see that the JSON returned from Facebook has a particular structure. It even looks a little bit like a string representation of a Python data structure---like a dictionary with lists and dictionaries inside. It's not exactly that, but it's easy to convert a JSON data structure to a Python data structure, which will make it easy for us to use this data in our Python programs. More on this later.

Further reading and tools of interest:

  • jutil, a command-line program for slicing and dicing JSON responses from web services on the command line
  • curl usage explained, a tutorial on how to use curl effectively
  • wget, a popular alternative to curl

urllib: Getting the contents of URLs with Python

Python makes it easy to fetch the data at a particular URL. The simplest means of doing this is Python's urllib module. The urllib module gives us a function called urlopen, which takes a URL as an argument and returns a file-like object, from which we can extract the data retrieved from the URL.

It's file-like in that we can call the read method on it, which will return all of the data retrieved from the URL as a string. Here's a basic example:

import urllib
import sys

url = sys.argv[1]
urlobj = urllib.urlopen(url)
data = urlobj.read()
print data

This program takes a URL on the command line, and prints out whatever it gets from that URL to standard output. It's a tiny clone of curl.

Other tools of interest and further reading:

  • urllib documentation
  • Requests, an alternate HTTP library for Python. (If you're doing anything more sophisticated than simple GETs, you should use this library instead of Python's built-in urllib.)

All well and good, you say, but now that we have data from the web, how do we get meaningful information out of it? We'll discuss how to extract information from HTML documents below, using BeautifulSoup. Right now, though, let's talk about JSON.

Encoding strings and queries with urllib.quote_plus and urllib.urlencode

We spoke briefly above about query string parameters---those funny key/value pairs that so often follow a ? in URLs. Almost every web service you're likely to run across makes use of query string parameters in one way or another, and it turns out that there are some funny rules about how those query string parameters need to be structured. (Mostly, you need to replace certain characters with other sequences of characters; read more about it here.)
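
For escaping a single value, Python provides urllib.quote_plus (moved to urllib.parse in Python 3). A quick sketch:

```python
try:
    from urllib import quote_plus        # Python 2
except ImportError:
    from urllib.parse import quote_plus  # Python 3

# Spaces become plus signs; reserved characters become %XX escapes.
print(quote_plus("rabid chinchillas & friends"))  # rabid+chinchillas+%26+friends
```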

Fortunately, Python has a built-in function called urllib.urlencode that will automatically format query string parameters for us. Just pass a Python dictionary to urllib.urlencode, and it will return the key/value pairs of that dictionary formatted as query string parameters. An example:

>>> import urllib
>>> urllib.urlencode({"q": "rabid chinchillas", "rpp": 100})

We can then take the result of urlencode and append it to the end of a URL.
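
Here's a sketch of that step (the search URL is invented for illustration; a real web service documents its own endpoint and parameter names):

```python
try:
    from urllib import urlencode        # Python 2
except ImportError:
    from urllib.parse import urlencode  # Python 3

# Format the parameters, then glue them onto the URL after a "?".
params = urlencode({"q": "rabid chinchillas", "rpp": 100})
url = "http://example.com/search?" + params
print(url)
```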

Using JSON

JSON (JavaScript Object Notation) is a popular way of formatting data so that it can be shared between different computer systems. The idea is that you might have a data structure in one application, and you want to be able to send that data structure to another application. In order to do this, we need three things: (1) a common format that both applications understand (like JSON); (2) a way to take an in-memory data structure on the source machine and convert it to that format---this is called "serialization"; and (3) a way to change the "serialized" data back into an in-memory data structure on the target machine.

Python has a library called json that does the work in (2) and (3) for us. The json library has two important functions: dumps ("dump string"), which converts a Python data structure to JSON, and loads ("load string"), which converts a JSON string to a Python data structure. Here's an example:

>>> import json
>>> mouse = {"name": "gerald", "length": 22.5, "favorite_food": "gouda", "age": 2}
>>> json.dumps(mouse)
'{"age": 2, "length": 22.5, "name": "gerald", "favorite_food": "gouda"}'
>>> type(json.dumps(mouse))
<type 'str'>

As you can see, the literal notation for Python objects (i.e., the way we write them in our programs) has a strong resemblance to the way that same data looks when encoded as JSON. There are a number of differences (i.e., JSON uses null instead of None; JSON always has double-quoted keys and values; escape sequences in JSON strings are very different from those in Python), but for the most part the formatted data should look very familiar. The json library can take pretty much any Python data structure and turn it into JSON---dictionaries, lists, ints, floats---even nested data structures, like dictionaries with lists as values.
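
A quick sketch of those differences, and of a round trip through dumps and loads:

```python
import json

# Python's None, True and False become JSON's null, true and false.
print(json.dumps({"plums": None, "eaten": True}))

# Nested structures survive a dumps/loads round trip intact.
data = {"kittens": [{"name": "Fluffy"}, {"name": "Monsieur Whiskeurs"}]}
print(json.loads(json.dumps(data)) == data)  # True
```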

The great thing about JSON (as illustrated above) is that JSON-encoded data is just a string. I could copy this JSON data into an e-mail and send it to you, without having to worry about formatting, and you could paste it back into Python to get back the original data structure. (Or I could make a web application that encodes data structures as JSON, and you could read them with another computer program.)

Here's an example of how to take a JSON-encoded data structure and turn it back into a Python data structure:

>>> json_encoded = '{"age": 2, "length": 22.5, "name": "gerald", "favorite_food": "gouda"}'
>>> gerald = json.loads(json_encoded)
>>> type(gerald)
<type 'dict'>
>>> gerald['name']
u'gerald'

Nicely done. Where once we had a (JSON) string, we now have a (Python) dictionary. With urllib and json.loads, we essentially have everything we need to consume web services.

Consuming web services

Consuming web services basically works like this: (1) browse through the documentation of a web service to see which URLs are interesting; (2) use urllib to fetch the contents of that remote URL (in JSON format); (3) "deserialize" the data using json.loads; and (4) poke around in the resulting Python data structure for the data that we want.
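
Steps 2 and 3 are short enough to wrap in a helper; here's a sketch (the function name and the commented-out URL are invented for illustration):

```python
import json

try:
    from urllib import urlopen          # Python 2
except ImportError:
    from urllib.request import urlopen  # Python 3

def fetch_json(url):
    # Fetch the document at the URL, then deserialize the JSON it contains.
    return json.loads(urlopen(url).read())

# With a real web service URL, step 4 might look something like:
# data = fetch_json("http://api.example.com/word.json?word=python")
# print(data["word"])
```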

In class, we'll examine a few examples of consuming the Wordnik API, an API for retrieving information about words.

examples tk:,

Web service authentication

Some web services require authentication. "Authentication" here means some kind of information that associates the request with an individual. In many APIs, this takes the form of a "token" or "key" (also called a "client ID" and/or "secret")---most usually an extra parameter that you pass on the end of the URL (or in an HTTP header) that identifies the request as having come from a unique user. Some services (like Facebook) provide a subset of functionality to non-authenticated ("anonymous") requests; others require authentication for all requests.

So how do you get "keys" or "tokens"? There's usually some kind of sign-up form in or near the developer documentation for the service in question. Sign up for Wordnik here, or Foursquare here (click on "My Apps" in the header after you've logged in). The form may ask you for a description of your application; it's usually safe to leave this blank, or to put in some placeholder text. Only rarely is this text reviewed by an actual human being; your key is usually issued automatically.

Different services have different requirements regarding how to include your keys in your request; you'll have to consult the documentation to know for sure.

Performing actions on behalf of a user

Some web services make it possible for you to undertake actions on behalf of a user of that service. This includes, e.g., posting a status update to Twitter, or uploading a photo to Facebook, or "checking in" on Foursquare. Services that offer these abilities usually require you to have the user "authorize" your application to act on their behalf. (You've probably done this before when, e.g., signing into a web site with Twitter or playing a game on Facebook.) Some web services (like Twitter) require every call to the API to be taking place on behalf of a user, even if you're creating an automated agent.

There's a commonly accepted set of conventions and code that make this type of authentication possible, and it's called OAuth. A full discussion of OAuth is not in the scope of this lecture, but you'll need to know how it works if you want to use, e.g., the Twitter API or the more useful parts of the Facebook Graph API. OAuth requests require a couple of extra steps to get started, and also require using a 3rd-party library for making HTTP requests, but are otherwise very similar to unauthenticated HTTP requests. Come see me for more information.

Structure of XML/HTML

All documents have some kind of content. For the plain text files we've been working with so far in this class, the content is just what's in the file: bytes of text, separated into lines. Documents can also have structure: some indication of (a) what parts of the document mean and (b) how parts of a document relate to one another. Plain text documents (by definition) have almost no structure: there's no indication of how the content should be divided up, and no indication of what the different parts of the document are supposed to mean.

XML and HTML are standards for "marking up" a plain-text document, in order to structure its content. They're intended to make structured data easy to share and easy to parse. Let's take a look at a sample XML file, which describes a pair of kittens and their favorite television shows:

<?xml version="1.0" encoding="UTF-8"?>
<kittens>
  <kitten name="Fluffy" color="brown">
    <televisionshows>
      <show>House</show>
      <show>Weeds</show>
    </televisionshows>
    <lastcheckup date="2009-01-01" />
  </kitten>
  <kitten name="Monsieur Whiskeurs" color="gray">
    <televisionshows>
      <show>The Office</show>
    </televisionshows>
    <lastcheckup date="2009-02-15" />
  </kitten>
</kittens>

Let's talk a little bit about the anatomy of this file. Line 1 contains the XML declaration, which specifies which version of XML the document contains and what character encoding it uses. (Depending on the flavor of HTML, this declaration may be absent or different.) Line 2 contains the opening tag of the root element: this element contains all other elements in the file. Every XML document must have exactly one element at the root level.

XML and HTML elements are defined by pairs of matching tags: everything from <kittens> to </kittens> comprises the "kittens" element; everything from <kitten name="Fluffy" color="brown"> to </kitten> comprises the first "kitten" element. Some elements---like the "lastcheckup" elements in the example above---don't have closing tags; such tags use a special syntax, with a slash before the closing angle bracket. (This is unlike HTML, where unclosed tags like <img> don't need the trailing slash.)

XML and HTML elements exist in a hierarchical relationship: in the example above, the "kittens" element contains several "kitten" elements, which in turn contain "televisionshows" and "lastcheckup" elements. These hierarchical relationships are commonly referred to using a familial metaphor: the first "kitten" element, for example, is a child of the "kittens" element---and the "kittens" element is the parent of both "kitten" elements. Two elements that are children of the same parent are called siblings: for example, both "show" elements under Fluffy's "televisionshows" element are siblings (since they have the same parent).

In addition to containing other elements, an element can have attributes: name/value pairs that occur in the opening tag of an element. Both "kitten" tags have two attributes: "name" and "color." An element can also contain text that isn't a part of any other element: this is often called the element's contents or text. (The "show" elements in the examples above are a good example of elements with text.)

Further reading:


XML vs. HTML

XML and HTML have a common ancestry, and because they're used in similar domains (making data available on the web), tools designed to work with one will often work with the other. The library we'll be working with to parse XML/HTML, BeautifulSoup, is one such tool. But there are a number of differences between the two that are worth noting.

  • HTML doesn't require an XML declaration (the <?xml ... ?> line at the beginning of all XML files)
  • HTML can have multiple elements at the root, not just one
  • Attributes on HTML tags don't have to have quotes around them (e.g., <foo bar=baz></foo> is valid HTML but not valid XML)
  • In HTML, empty element tags don't require the closing slash as they do in XML (e.g., <img src="hello"> is valid HTML but not valid XML)
  • HTML is forgiving of improperly nested tags: <b><i>foo</b></i> works fine in HTML, but would make an XML parser choke

Most importantly: XML is designed to be a general format for exchanging any kind of data. You can specify whatever tags and attributes you want. HTML is a language for structuring web pages; it has a closed set of tags, and a long history of being hacked for browser compatibility. HTML works well for making web pages, but often doesn't work well for making computer-parseable data.
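
The nesting difference noted above is easy to demonstrate with Python's built-in XML parser, which (being an XML parser) chokes on the improper nesting that HTML tolerates:

```python
import xml.etree.ElementTree as ET

# Properly nested markup parses fine.
element = ET.fromstring("<b><i>foo</i></b>")
print(element.tag)  # b

# Improperly nested markup raises a ParseError.
try:
    ET.fromstring("<b><i>foo</b></i>")
except ET.ParseError as err:
    print("ParseError:", err)
```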

Parsing HTML/XML with BeautifulSoup

There are many ways to parse XML and HTML with Python. We're going to use a library by Leonard Richardson called Beautiful Soup. (Beautiful Soup isn't part of the standard distribution; you have to download it yourself. There's a copy in the directory on the server with the example code for this session.)

Beautiful Soup makes it easy to extract data from HTML and XML documents in efficient and easy-to-understand ways.

BeautifulSoup: the basics

Let's go back to the kittens.xml file mentioned above. I'll print it here again, just to keep it fresh in your mind:

<?xml version="1.0" encoding="UTF-8"?>
<kittens>
  <kitten name="Fluffy" color="brown">
    <televisionshows>
      <show>House</show>
      <show>Weeds</show>
    </televisionshows>
    <lastcheckup date="2009-01-01" />
  </kitten>
  <kitten name="Monsieur Whiskeurs" color="gray">
    <televisionshows>
      <show>The Office</show>
    </televisionshows>
    <lastcheckup date="2009-02-15" />
  </kitten>
</kittens>

Let's begin an interactive session, first getting the contents of kittens.xml and then using BeautifulSoup to extract data from the XML.

>>> from bs4 import BeautifulSoup
>>> data = open("kittens.xml").read()
>>> soup = BeautifulSoup(data)
>>> kitten_tag = soup.find('kitten')

The first three lines above demonstrate how to bring the appropriate class into your program from the Beautiful Soup library, load a text file, and create a BeautifulSoup object with that data. (Note: in older versions of Beautiful Soup, XML data was handled by a separate BeautifulStoneSoup class; in bs4, the BeautifulSoup class shown here handles both XML and HTML, and you can pass "xml" as a second argument if you want a strict XML parser.)

We then use the find method on soup, which will return the first kitten tag in the document. The kitten_tag variable will contain an object of type Tag (a Beautiful Soup class). What can we do with such objects? Lots of stuff:

>>> kitten_tag = soup.find('kitten')
>>> kitten_tag['name']
u'Fluffy'
>>> kitten_tag['color']
u'brown'
>>> lastcheckup_tag = kitten_tag.find('lastcheckup')
>>> lastcheckup_tag['date']
u'2009-01-01'

We can use tag objects as though they were dictionaries, which will allow us to access the attributes for that tag. (Check the XML; you'll see that the name and color attributes of the first kitten tag are, indeed, 'Fluffy' and 'brown', respectively.)

The tag object also itself supports the find method. Calling find on a tag will return the first tag with the specified name that is a descendant of the given tag (i.e., child, grandchild, great-grandchild, etc.). Finding the first lastcheckup tag and then checking the date attribute of that tag does indeed reveal Fluffy's most recent checkup date (2009-01-01).

More than one: findAll; the string attribute

The findAll method works just like find, except it returns all matching tags in a list (actually a "ResultSet," but it behaves like a list for our purposes). Let's say we wanted to extract and then loop over all kitten tags in the document. Here's how to do it:

>>> kitten_tags = soup.findAll("kitten")
>>> for kitten_tag in kitten_tags:
...     print kitten_tag['name']
Fluffy
Monsieur Whiskeurs

Let's say that in addition to printing out each kitten's name, we also wanted to print out his or her favorite TV shows. We'll accomplish this by calling findAll on each kitten tag in the loop:

>>> for kitten_tag in kitten_tags:
...     print kitten_tag['name']
...     for show_tag in kitten_tag.findAll('show'):
...             print "show: " + show_tag.string
Fluffy
show: House
show: Weeds
Monsieur Whiskeurs
show: The Office

The one new thing here is the string attribute. We use string to get the text inside of a tag—e.g., the name of the television shows inside of the show tags in our kitten XML file.

This idiom—performing findAll on the whole document, looping over the results, and then performing find or findAll on each of those tags—is very common in both XML parsing and web scraping.

More specific: the attrs argument

The find and findAll methods can take a second parameter, named attrs, which allows you to search for only those tags that have attributes with particular values. The parameter that you pass should be a dictionary, whose keys are the attributes you want to match, and whose values are the desired values for those attributes. For example, if we wanted to find only kitten tags whose color value is 'gray':

>>> kitten_tags = soup.findAll('kitten', attrs={'color': 'gray'})
>>> for kitten_tag in kitten_tags:
...     print kitten_tag['name']
Monsieur Whiskeurs

We can see that the only matching kitten tag was Monsieur Whiskeurs. This functionality doesn’t look impressive when we’re dealing with such a small data set, but it becomes immensely important for dealing with (e.g.) HTML. We might know, for example, that the piece of data we want is contained within a div tag with a particular attribute (say a class or id attribute). The attrs argument to findAll allows us to find only the tags we want to see.

(Both find and findAll have even more powerful ways to specify particular tags to find. Check the documentation for more information.)

Lateral movement: findNextSibling

Sometimes we want to be able to find tags that are siblings of the current tag (i.e., on the same level of the XML hierarchy), rather than descendants. For this purpose, Beautiful Soup supplies a findNextSibling function:

>>> kitten_tag = soup.find('kitten')
>>> print kitten_tag['name']
Fluffy
>>> next_kitten_tag = kitten_tag.findNextSibling('kitten')
>>> print next_kitten_tag['name']
Monsieur Whiskeurs

The initial find on soup returns the first kitten tag. Calling findNextSibling on that tag gives us the next kitten tag on the same level of the hierarchy.

(Again, this becomes more useful in web scraping applications where, for example, we want to get the paragraph that follows a particular h3 tag.)

HTML scraping: some examples;; (if time)