Chance operations. Simple models of text.

Reading discussion

Nissenbaum “Bias in Computer Systems”

  • Technical bias results from “the attempt to make human constructs amenable to computers, when we quantify the qualitative, discretize the continuous, or formalize the nonformal.” Is there an expressive potential in this process?
  • Nissenbaum discusses a “counterstrategy” for avoiding bias in MLSA (the time-sharing system). Can you think of other examples of “hacking” bias?

Can you think of any examples of how programs that deal with text contain bias? (Or how the underlying system of representing text contains biases?)

Word frequency analysis

Glazier “Grep: A Grammar”

  • What are the “fundamental ‘materials’ of writing” according to grep? How does this differ from the “materials” of writing in other kinds of practice?
  • What kinds of texts does the grep procedure create? What are their aesthetics? Can they be read, and if so, how?

Chance operations

We’ll discuss here several varieties of “non-intentional” composition (which may or may not incorporate “chance” in the sense of “randomness”). These categories overlap a great deal, and it isn’t always clear where to draw the lines between them. Which of these can be reproduced algorithmically?

Aleatory. “Some element of the composition is left to chance, and/or some primary element of a composed work’s realization is left to the determination of its performer(s). The term is most often associated with procedures in which the chance element involves a relatively limited number of possibilities.” (from Wikipedia.) Examples: automatic writing, John Cage’s early mesostics.

           asK
             Little
          autO
Where it wantS
             To take
             You.

Deterministic. A non-intentional process that leads to the same result every time. (No chance or choice involved.) Examples: Jackson Mac Low’s diastics and asymmetries. Asymmetry 205, There are many ways to use Strayer’s Vegetable Soybeans:

To hours, enough. Remove enough
And. Remove enough
Minutes. And not Iowa
Water and Iowa simmer.
To or
Until simmer. Enough
Simmer. To. Remove and Iowa enough. Remove simmer.
Vegetable. Enough good enough to and buttered loaf, enough
Simmer. Or Iowa buttered enough and not simmer.

Tomatoes, hot egg. Roll egg.
Added. Roll egg.
Minutes. Added, nutty in.
Wash added, in soak
Tomatoes, overnight,
Until soak egg,
Soak tomatoes. Roll added, in egg. Roll soak
Vitamins—egg, giving egg, tomatoes, added, beans, largest egg,
Soak overnight, in beans, egg, added, nutty soak

Stochastic. Existing parts are re-arranged and juxtaposed using chance. Example: Raymond Queneau’s Cent mille milliards de poèmes.
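
Of the three categories, the stochastic one is perhaps the most direct to reproduce in code. The sketch below borrows only the structure of Queneau’s book (ten sonnets of fourteen lines each, where the line at any position can be swapped for the line at the same position in another sonnet). The sonnets data and the assemble_poem function are placeholders invented for illustration, not Queneau’s text, and random.choice comes from the random module introduced in the next section.

```python
import random

# Placeholder data: 10 "sonnets" of 14 lines each. A real version would
# load actual lines of verse; these strings are stand-ins.
sonnets = [["sonnet %d, line %d" % (s, n) for n in range(1, 15)]
           for s in range(1, 11)]

def assemble_poem(sources):
    """Build one poem by taking, for each line position, the line at
    that position from a randomly chosen source sonnet."""
    line_count = len(sources[0])
    poem = []
    for position in range(line_count):
        source = random.choice(sources)
        poem.append(source[position])
    return poem

for line in assemble_poem(sonnets):
    print(line)
```

With ten choices at each of fourteen positions there are 10 to the 14th power possible poems, which is the “hundred thousand billion” of the title.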

Random numbers in Python

Python makes it easy to work with random numbers. The random module includes several functions for generating random numbers and choosing random items from lists. Here’s a sample transcript from the interactive interpreter:

>>> import random
>>> random.random() # random number between 0 and 1
0.90685046351757992
>>> random.randrange(1, 10) # random integer from 1 to 9 (upper bound excluded)
2
>>> random.gauss(0, 1) # gaussian random, mean 0, stddev 1
-0.15235026257945011
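
Two details are easy to trip over here. First, randrange excludes its upper bound, while randint includes it. Second, calling random.seed with a fixed value makes a “random” run repeatable, which blurs the line between chance and deterministic procedures. The sketch below illustrates both; the seed value 42 and the die-rolling scenario are arbitrary choices, and print is written with parentheses so the code runs under Python 2 or 3.

```python
import random

random.seed(42)  # arbitrary seed; the same seed replays the same sequence

# randrange(1, 7) and randint(1, 6) both simulate a six-sided die
rolls = [random.randrange(1, 7) for i in range(1000)]
print(min(rolls) >= 1)  # True: never lower than 1
print(max(rolls) <= 6)  # True: never higher than 6

random.seed(42)
first = [random.randint(1, 6) for i in range(5)]
random.seed(42)
second = [random.randint(1, 6) for i in range(5)]
print(first == second)  # True: reseeding repeats the rolls
```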

Python: Simple models of text

So far, we’ve been working with programs that examined just one line of a file at a time. During this session, we’ll be expanding our scope a little bit: we want to make programs that can build a picture of how an entire text looks, seems and behaves. In order to facilitate that, we’ll be looking at a few simple data structures.

Lists

Lists in Python are a kind of object that stores other objects. (They’re a lot like arrays in Processing, but more powerful, as you’ll see.) Once you’ve created a list, or put objects into a list, you can retrieve them using the same syntax we used last week to get individual characters out of strings. You can also get slices of a list, using the same syntax we used to get slices of strings. Here’s some example code:

>>> parts = ['led', 'resistor', 'capacitor']
>>> len(parts) # how many elements are in the list?
3
>>> type(parts) # what kind of object is this?
<type 'list'>
>>> parts[1]
'resistor'
>>> parts.append('ultrasonic range finder') # adds a new element to list
>>> parts
['led', 'resistor', 'capacitor', 'ultrasonic range finder']
>>> parts[2:]
['capacitor', 'ultrasonic range finder']
>>> parts.sort() # sorts the list in-place (i.e., changes the list)
>>> parts
['capacitor', 'led', 'resistor', 'ultrasonic range finder']
>>> parts.reverse() # reverses the list in-place
>>> parts
['ultrasonic range finder', 'resistor', 'led', 'capacitor']
>>> 'led' in parts # is the string 'led' in the list?
True
>>> 'flex sensor' in parts
False
>>> more_parts = [] # create an empty list
>>> more_parts = list() # same thing

As you can see, list literals are made by surrounding a comma-separated list of objects with square brackets. You can store any kind of object in a list: strings, integers, floats, even other lists (or sets or dictionaries)!
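
To make that last point concrete, here is a small sketch of a list that holds other lists. The kits data is made up for illustration.

```python
# A list whose elements are themselves lists (hypothetical data)
kits = [['led', 'resistor'], ['capacitor'], []]

print(len(kits))       # 3: the outer list has three elements
print(kits[0])         # ['led', 'resistor']
print(kits[0][1])      # resistor: index the inner list in turn
print(kits[-1])        # []: negative indexes count from the end

# Mixed types are fine too
mixed = ['felt', 8, 2.5, kits]
print(len(mixed))      # 4
```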

You can iterate over the elements of a list with for, just like you iterate over the individual characters of a string. Here’s a transcript to demonstrate, from the interactive interpreter:

>>> materials = ['poplar', '8-segment LED', 'photoresistor', 'felt', 'lard']
>>> for material in materials:
...     print material
... 
poplar
8-segment LED
photoresistor
felt
lard

What if I just want to count from one to ten?

Use Python’s built-in range() function, which returns a list containing numbers in the desired range:

>>> range(1,11)
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

This script:

for i in range(1,6):
  print i

Will output:

1
2
3
4
5

Randomness with lists

Python’s random library provides two helpful functions for performing chance operations on lists. The first is shuffle, which takes a list and randomly shuffles its contents; the second is choice, which returns a random element from the list.

>>> import random
>>> cards = ['two of cups', 'four of swords', 'the empress', 'the fool']
>>> random.shuffle(cards)
>>> cards
['two of cups', 'the fool', 'four of swords', 'the empress']
>>> random.choice(cards)
'the fool'
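
Note that shuffle changes the list in place and returns None, so writing cards = random.shuffle(cards) would lose the list. If the original order matters, one option (sketched below, with draw_order and hand as names chosen for illustration) is to shuffle a copy instead, or to use random.sample, which picks several distinct items without touching the list.

```python
import random

cards = ['two of cups', 'four of swords', 'the empress', 'the fool']

draw_order = list(cards)      # make a copy; cards itself stays untouched
result = random.shuffle(draw_order)

print(result)                 # None: shuffle works in place
print(cards[0])               # two of cups (original order preserved)
print(sorted(draw_order) == sorted(cards))  # True: same items, new order

# random.sample picks several distinct items without changing the list
hand = random.sample(cards, 2)
print(len(hand))              # 2
```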

Data structures to store text: randomize_lines.py

This brings us to our first full-fledged example program, randomize_lines.py. Instead of operating on one line at a time, this program stores all of the lines from standard input into a list, then uses the random.shuffle function to print out the lines in random order. Here’s the code:

import sys
import random

all_lines = list()

for line in sys.stdin:
  line = line.strip()
  all_lines.append(line)

random.shuffle(all_lines)

for line in all_lines:
  print line

The all_lines variable points to a list. Inside the first for loop, we add each line that comes in from standard input to the list. After calling shuffle to re-order the list, we then print the list back out. If you pass in the William Carlos Williams poem, you’ll get back a delightful re-imagining:

for breakfast
the icebox
so sweet
Forgive me
the plums

I have eaten

and so cold

This is just to say
that were in
saving
and which
they were delicious
you were probably

Strings and lists: split and join

String objects in Python provide two helpful methods: one to break a string up into a list of strings, and one to join a list of strings back into a single string. The split method “splits” a string into a list of strings, using the parameter that you pass to the method as a delimiter. The join method uses whatever string you call it on as glue to join together the list of strings passed in as a parameter, creating a single string. These are a little tricky, so it’s helpful to see them in action. Here’s a transcript from the interactive interpreter:

>>> foo = "mother said there'd be days like these"
>>> foo.split(" ") # split on the space character
['mother', 'said', "there'd", 'be', 'days', 'like', 'these']
>>> foo.split("e") # split on the letter "e"
['moth', 'r said th', 'r', "'d b", ' days lik', ' th', 's', '']
>>> wordlist = ['this', 'is', 'a', 'test']
>>> separator = " "
>>> separator.join(wordlist)
'this is a test'
>>> " ".join(wordlist) # same thing
'this is a test'
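
One caveat worth seeing in action: split(" ") splits on every single space, so doubled spaces produce empty strings in the result, and tabs are not treated as delimiters at all. Calling split() with no argument splits on any run of whitespace instead. The sample string below is made up for illustration.

```python
text = "plums  in the\ticebox"   # two spaces, then a tab

by_space = text.split(" ")
by_whitespace = text.split()

print(by_space)        # ['plums', '', 'in', 'the\ticebox']
print(by_whitespace)   # ['plums', 'in', 'the', 'icebox']

# join reverses the process, with whatever separator you choose
print("-".join(by_whitespace))   # plums-in-the-icebox
```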

We’ll most often be using the split method as a shorthand for “split this string into a list of words.” (We’ll find more robust solutions for the problem of parsing words from a string when we discuss regular expressions.) Here’s a program that shuffles the order of the words on each line of standard input (available in the examples as randomize_words.py):

import sys
import random

for line in sys.stdin:
  line = line.strip()
  words = line.split(" ")
  random.shuffle(words)
  output = " ".join(words)
  print output

Here’s the result from passing in our favorite Robert Frost poem:

Stopping Woods On Snowy A By Evening

I think Whose I these woods know. are
is village the house in though; His
see He stopping here not me will
snow. fill woods up watch with To his
queer think My it horse little must
without farmhouse stop To a near
and the lake woods Between frozen
darkest the of year. evening The
bells his a harness shake He gives
To some mistake. if is there ask
sound's only The sweep the other
flake. easy and wind downy Of
lovely, woods dark and The are deep.
to keep, have I But promises
to go And sleep, miles before I
sleep. to before I go miles And

sys.argv: Python’s important built-in list

Last week we learned how to run our Python scripts from the command line, as though they were UNIX text munging utilities. Most UNIX utilities take arguments on the command line: grep takes a pattern to search for, for example. We can read command-line parameters from Python as well, using the sys.argv list. This list contains all of the parameters passed on the command line, including the name of the script itself.

For example, take the following script, called argv_reader.py:

import sys

for arg in sys.argv:
  print arg

If you ran it on the command line like so:

$ python argv_reader.py cat wallaby armadillo

You’d get the following output:

argv_reader.py
cat
wallaby
armadillo
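
Since sys.argv is an ordinary list, slicing works on it. A common pattern, sketched here, is to skip the script name with sys.argv[1:]. The function name describe_args and the script name argv_params.py used below are hypothetical.

```python
import sys

def describe_args(argv):
    """Return one line of text per command-line parameter,
    skipping argv[0], which is the script's own name."""
    params = argv[1:]
    lines = []
    # enumerate yields each item together with its position (from 0)
    for position, param in enumerate(params):
        lines.append("parameter %d: %s" % (position + 1, param))
    return lines

for line in describe_args(sys.argv):
    print(line)
```

Saved as argv_params.py and run as python argv_params.py cat wallaby, this would print “parameter 1: cat” and then “parameter 2: wallaby”.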

Sets

The set is our second important data structure. You can think of a set as a kind of list, but with the following caveats:

  1. Sets don’t maintain the order of objects after you’ve put them in.
  2. You can’t add an object to a set if it already has an identical object.

Objects can be added to a set by calling its add method (as opposed to the append method used for lists).

A corollary to #1 above is that you can’t use the square bracket notation to access a particular element in a set. Once you’ve added an object, the main operations we’ll use are checking whether an object is in the set (with the in operator) and iterating over all objects in the set (with, for example, for). Here’s a transcript of an interactive interpreter session that demonstrates these basic features:

>>> foo = set()
>>> foo.add(1)
>>> foo.add(2)
>>> foo.add(3)
>>> foo
set([1, 2, 3])
>>> foo.add(1) # will be ignored---only one of any identical object can be in set
>>> foo
set([1, 2, 3])
>>> 1 in foo
True
>>> 5 in foo
False
>>> for elem in foo:
...     print elem
... 
2
1
3

An additional aspect of sets to note from the transcript above: because sets don’t maintain the order of objects, you’ll get the objects back in (seemingly) random order when you iterate over the set. For most applications, this isn’t a problem, but it’s something to keep in mind.
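
When order does matter, one option is to pass the set to the built-in sorted function, which returns a new, ordered list. The sketch below also shows that a set can be built directly from a list, which drops the duplicates. The word list is made up for illustration.

```python
words = ['so', 'sweet', 'and', 'so', 'cold', 'and']

unique = set(words)          # duplicates collapse to one copy each
print(len(words))            # 6
print(len(unique))           # 4

ordered = sorted(unique)     # a new list, in alphabetical order
print(ordered)               # ['and', 'cold', 'so', 'sweet']
```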

Sets are great when you want to store data, but you want to ignore duplicates in the data. One classic example is to create a list of unique words in a text file. Here’s a program that does just that, available in this session’s example programs as unique_words.py:

import sys

words = set()

for line in sys.stdin:
  line = line.strip()
  line_words = line.split()
  for word in line_words:
    words.add(word)

for word in words:
  print word

The important lines in this program are lines 8 and 9, in which we loop over every word from the current line and add each one to the set. Because sets ignore any attempt to add an object that is already in the set, the set will only ever contain one copy of each word, no matter how many times that word occurs.

On lines 11 and 12, we loop over the contents of the set and print them out. (Note here again that the words in the set won’t appear in any particular order.) Here’s some sample output, obtained by running the program with this_is_just.txt as input:

and
saving
they
just
sweet
is
say
have
in
breakfast
cold
for
to
Forgive
This
delicious
which
probably
you
plums
icebox
that
I
eaten
me
so
were
the

Dictionaries

The “dictionary” is a very powerful data structure. You can think of it as an array whose indices are strings (or other objects) instead of numbers. In PHP, they’re known as “associative arrays” and in Perl they’re “hashes”; in Java, there’s a class called “Map” that does the same thing. Dictionary literals in Python consist of comma-separated key/value pairs, with a colon between the key and the value (see the transcript below for an example). Keys can be almost any object, with the exception of mutable objects such as lists and other dictionaries; values can be any object. You can access values of a dictionary with square brackets, much like the list indexing syntax. Some sample code for the interactive interpreter:

>>> assoc = {'butter': 'flies', 'cheese': 'wheel', 'milk': 'expensive'}
>>> assoc['butter'] # access value at a key
'flies'
>>> assoc['gelato'] = 'delicious' # assign to a key
>>> assoc.keys()
['butter', 'cheese', 'milk', 'gelato']
>>> assoc.values()
['flies', 'wheel', 'expensive', 'delicious']
>>> 'milk' in assoc
True
>>> 'yogurt' in assoc
False
>>> foo = {} # create an empty dictionary
>>> foo = dict() # same thing

The concordance

So what are dictionaries good for? One classic application is to build a simple concordance: a list of words that occur in a text, and how many times those words occur. Here’s the source listing for just such a program (concordance.py):

import sys

words = dict()

for line in sys.stdin:
  line = line.strip()
  line_words = line.split(" ")
  for word in line_words:
    if word in words:
      words[word] += 1
    else:
      words[word] = 1

for word in words.keys():
  print word + ": " + str(words[word])

This program illustrates several important idioms for working with dictionaries:

  1. It’s okay to assign to a key that doesn’t already exist in a dictionary, but it’s an error in Python to try to access a non-existent key. That’s what the code in lines 9-12 is for: first we check to see if the current word is already in the dictionary; if it is, then we increment its value by one. Otherwise, we assign a value to that key.
  2. There are many ways to iterate over the contents of a dictionary. One is to iterate over the list of keys returned from the dictionary’s keys method, then access the corresponding value by using that key as an index.
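
Two variations on these idioms, offered as a sketch rather than as part of concordance.py: the dictionary’s get method returns a fallback value instead of raising an error for a missing key, and the items method yields each key together with its value. The count_words function and the sample text are invented for illustration.

```python
def count_words(lines):
    """Return a dictionary mapping each word to its number of occurrences."""
    counts = dict()
    for line in lines:
        for word in line.strip().split():
            # get returns 0 when the word has not been seen yet
            counts[word] = counts.get(word, 0) + 1
    return counts

sample = ["so sweet", "and so cold"]
counts = count_words(sample)

# sort the key/value pairs so the output order is predictable
for word, count in sorted(counts.items()):
    print(word + ": " + str(count))
```

Using get folds the if/else check into a single line; whether that reads more clearly than the explicit test is a matter of taste.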

Exercises

We’ll do some of these in class.

1. How would you write a program that randomizes the words in each line of standard input, then prints out the randomized lines in random order? Can this be done without writing any new code at all?

2. Let’s say that we wanted our concordance script to store not just how many times a particular word occurred, but on which lines. How would we go about doing that? What kind of data structure would we need? What additional information would we need that we aren’t already tracking in concordance.py?

3. The alpha_replace.py script reads in a text file as a source file, then replaces the words in standard input with words in the source file that begin with the same letter. Write a version of this script that replaces words not according to their first letter, but according to the number of letters in the word.

4. Make any of the example scripts insensitive to case.

Reading for next week
