You have now seen a large portion of the Python language; enough to be able to write useful programs for data analysis. Some of the features we have not looked at are just syntactic sugar, they allow you to express concepts we have already considered more elegantly using less characters. I’ll leave you to discover these features yourselves. Some of the missing features become important when you start to think of Python as an object oriented language and we will consider these on Thursday. In this session we will look at how to write a relatively simple but useful program. The program will follow many of the patterns you will find if you write real data analysis code: you will learn how to read data from a collection of data files, perform some simple analysis of the data, and then produce a report of the analysis. Along the way you will learn how to write code in small pieces such that are easily reusable. You will learn to design a script that can act as a module (and a module that can act as a script).

The task you will focus on is the analysis of historical monthly rainfall data for the south of England and Wales. This data is made available by the Met Office and is stored in fairly simple text files. These files do have some quirks and you will need to work out how to deal with these. You will use the same dataset on Thursday when you learn how to use Python to plot time-series and geographical data, and perform more advanced analysis.

Caution Some aspects of this practical need extra Python modules installed beyond those which make up the standard library. In particular, the exercises make use of Matplotlib and Basemap to plot the results of your programs. If you are using your own computer basic installation instructions can be found here.

Exercise 1: A first script

Before making a start on the rainfall data we will look at how to write a Python program that can be used from the command line. In the final exercise of practical 1 you looked at how to make Python perform encryption and decryption using the Caesar cipher – a simple one-to-one substitution cipher with a key that shifts the alphabet. This code has existed for at least two millennia. In the earlier version of this code the plaintext message (or the encoded ciphertext) is embedded with the Python source code. This isn’t very useful, as you would need to change the source every time you wrote a new message or changed the key (the shift). It would be helpful to be able to run this as a script from the command line so, for example, you could enter python ./caesar.py encode 12 < plaintext.txt > ciphertext.txt to encode a message with a shift of 12. The file caesar.py, reproduced below, shows how this can be done. Key lines are described. Make sure you understand how this is supposed to work. Fill in the blanks and test the resulting code.

def encode(plaintext, shift):1

    alphabet = 'abcdefghijklmnopqrstuvwxyz'
    cipherbet = alphabet[shift:27]+alphabet[0:shift]

    # Put code in here.
    2

    ciphertext = ''.join(ciphertext) # Turn list into a string

    return ciphertext

def decode(ciphertext, shift):1

    alphabet = 'abcdefghijklmnopqrstuvwxyz'
    cipherbet = alphabet[shift:27]+alphabet[0:shift]

    # Put code in here.
    2

    plaintext = ''.join(plaintext) # Turn list into a string

    return plaintext

if __name__ == "__main__":3
    import sys 4
    if (len(sys.argv)!=3): 5
        print "ERROR: Two arguments needed, mode and shift"
    else:
        mode = sys.argv[1] 6
        shift = int(sys.argv[2]) 7
        for line in sys.stdin: 8
            if mode == 'encode': 9
                print encode(line, shift) 10
            elif mode == 'decode':
                print decode(line, shift)
            else:
                print "ERROR: mode must be encode or decode"
1 plaintext and ciphertext are strings and shift is an integer. The encode and decode functions return strings.
2 You need to provide the body of this function
3 Things below this line only get executed if the file is run as a python program, not if it is loaded as a module
4 We need to import the sys module from the standard library to access command line arguments
5 A check to see if we have the correct number of command line arguments
6 Read the mode from the command line (encode or decode)
7 Read the key (shift) from the command line. Note that we need to change this from a string into a integer
8 Loop over the lines from standard input (sys.stdin is a file type pre-connected to whatever file name comes after the < character
9 Call the appropriate function or raise an error
10 By default, print writes to standard output, this is pre-connected to whatever file name follows the > character, or prints to the teminal.

On unix like system, it is common to have the shell select the interpreter rather than have the user specify it. That is, the shell reads a "hashbang" line at the top of an executable text file and uses this to select the python interpreter so the user can enter ./caesar.py encode 12 < plaintext.txt > ciphertext.txt to encode a message. Make this possible by adding a suitable hashbang and making the Python program executable.

Tip On unix like systems the chmod command is used to change the permission bits of a file. You will need to set (+) the execute (x) bit for yourself (the user, u) for the file caesar.py by entering chmod u+x caesar.py. The shell will need to find the Python interpreter. A particularly portable way to set this up so that it does not matter where Python is installed is to add #!/usr/bin/env python to the top of your program. Further information can be found by entering`man chmod` and man env.

A useful form of documentation is to add a literal string (a "docstring") to the top of each function describing what it does. The most convenient form is to use three quotation marks (""" some text """) before or after the string (so that you can break the documentation over several lines). Add documentation to the two functions. Does the script still work?

Exercise 2: Graphing rainfall

The file plot_rain.py (reproduced below) contains a snippet of rainfall data covering data from 1992 - 2010 for one weather station in Camborne, Cornwall. As you can see, the data is stored within the Python source file in a data structure that consists of a dictionary for each year with the monthly rainfall stored in a list. There is a function that graphs the data and this is used if you run the program. Run the program, do you see a graph?

#!/usr/bin/env python

# Import plotting modules - we will cover these on Thursday
import matplotlib
import matplotlib.pyplot as plt

# 18 years of data
monthly_data = {1992: [29.6, 65.0, 58.2, 103.9, 14.7, 9.9,
                       57.7, 154.4, 93.1, 95.7, 175.1, 124.3],
                2009: [146.0, 71.3, 48.8, 144.2, 45.2, 18.8,
                       222.2, 68.8, 37.8, 93.4, 193.6, 101.8],
                2010: [108.8, 63.0, 76.6, 35.6, 56.6, 34.0,
                       78.2, 71.8, 83.6, 85.6, 164.8, 48.1]}

def plot_rainfall(rainfall, annual=None):
    # Plot the graph - don't worry about how this function works,
    # we will look at it on Thursday.

    # etc.

    return fig

if __name__ == "__main__":

    # Calculate the annual rainfall here.

    plot_rainfall(monthly_data)
    plt.show()

As well as an argument for the data, the plot_rainfall() function also accepts an optional second argument for the annual rainfall. This should be expressed as a dictionary with integer years as the keys and floating point numbers as the values for the total annual rainfall:

  • Write the Python code needed to calculate the annual rainfall and create the "dictionary of years data" structure described above.

  • Pass the data structre into the plotting function as a second argument and run the code.

  • Do the wettest years contain the wettest months?

Exercise 3: Reading a simplified data file

The file cambornedata_simple.txt is a simplified version of the data file distributed by the Met Office. It consists of 7 columns of data with one row per month that the Camborne weather station has been operating. The columns are year, month (1-12), maximum temperature (degrees C), minimum temperature (degrees C), number of days of air frost, total rain for the month (mm) and total number of hours of sun for the month. Modify the plot_rain.py program to read data from this file and build the monthly data structure. Can you graph the monthly and annual rainfall without further modifications to the code? How did you deal with missing months at the start and end of the file? I hope you put the file reading machinery in a function. What is the function’s argument?

Tip Remember that you can read lines from a file by opening it and using the resulting file type in a for line in file: block. In each iteration line will then correspond to a different month of data. You can extract each item of data by turning the string into a list by doing list = line.split(). .split() chops up input on spaces by default. Don’t forget to close the file when you have finished with it. Remember that you’ll need to change the string extracted from the file into data of the float type.

Exercise 4: Mean monthly rainfall

How wet is an average October in Camborne? What about an average July? Write a function to calculate the mean monthly rainfall for each calendar month from the dictionary of lists data structure you can generate from a file. You can assume that any month that is reported as having no rainfall has missing data. Ignore these months when calculating the mean. Your function should return two 12-element lists, one with the mean rainfall for each month and one with the number of months included in the average. Print out a formatted list of average rainfall. You should end up printing out something like this:

Jan 121.896969697 33
Feb 91.7818181818 33
Mar 80.7848484848 33
Apr 69.9878787879 33
May 63.8878787879 33
Jun 59.1 33
Jul 65.05 32
Aug 67.934375 32
Sep 74.7818181818 33
Oct 113.748484848 33
Nov 120.751515152 33
Dec 130.312121212 33

Does the dictionary of lists data structure make this exercise easy? Can you think of a data structure that would lend itself to this problem? Finding that a task that should be easy seems to be difficult is often a sign that the data structure is not a good match to the problem. Thinking about the problem in terms of the way you represent the data can be helpful, even if you need to change the way that data is stored in your application. We will return to this theme on Thursday.

Exercise 5: Parsing the full file format

The file format provided by the Met Office is a little more complex than the file provided in task 2. Some of the data is estimated, this is noted with an asterisk next to the measured rainfall in the file (there is no space between the numbers and the asterisk). What happens if you try to convert 12.4* into a float (e.g. float("12.4*"))? A second complication is that the data files contain a large block of header text. Some of this is useful (the station name on the first line, the location data on the second) but some only helps people reading the file and mainly gets in the way of your attempt to read the file. Finally, some data is missing (marked ---). Modify your data file reading function to handle these two cases and put the weather station name in a field called "name" in the returned dictionary. Read data from the file called cambornedata.txt and cambridgedata.txt.

Tip Your for line in file: iteration will visit each line in the file in turn. The Met Office header is always the same number of lines, so you can simply keep track of how many lines you have seen so far and only start reading monthly data after you have seen 8 lines. Think back to how do you run a piece of code within the iteration conditionally?

The location data in the data file is in an extremely unhelpful format and, as far as I can see, isn’t documented anywhere. It turns out that the numbers before the N and E line are the number of hundreds of meters east and north of a point south west of the Scilly Isles. The numbers are a six figure Ordinance Survey grid reference in disguise. I’ve provided a module (in the file metloc.py) with functions to convert the location to an (approximate) latitude and longitude. You have two options: (1) Extract the numbers from the second line of the data file and use my met2latlon function. If you do this you may want to explore Python’s regular expression library. (2) Skip the data extraction part and pass the whole second line (as a string) into my parselocationlinelatlon function which extracts the numbers for you. In either case, make your file reading function return (data, stationName, lat, lon) so that you can use this data.

Tip Remember that you will need to import metloc to access my functions.

There is another function in the metloc module, plot_location. This takes three arguments, the latitude, longitude (in degrees) and station name (a string). Use this function to plot the station location at the very end of your program.

Exercise 6: Great-circle distances

What is the distance between the Camborne and Cambridge weather stations? Write a script to find out. This should take two arguments (the two files) and can use your file reading function (imported, so your existing script behaves as a Python module) to read the files. You can approximate the Earth as a sphere with a radius, $r$, of 6367 km. With this approximation the distance between two points with latitude $\phi_1$ and $\phi_2$ and longitude separation $\Delta \lambda$ is given by:

$d = 2 r \arcsin(haversin(\phi_1-\phi_2) + \cos(\phi_1)\cos(\phi_2)haversin(\Delta \lambda))$

Where the Haversine function is defined by:

$haversin(\theta) = \sin^2(\frac{\theta}{2}) = \frac{(1-\cos(\theta))}{2}$

Remember that you can access basic mathematical functions such as sqrt, asin and cos from the math module in the standard library. Can you write a function to perform a similar calculation for a non-spherical Earth?