Notebook download

A more advanced example Python data analysis notebook¶

As we did at the start of the beginner course, we are going to illustrate how to use Python to perform a slight more "advanced" analysis: retrieve data, do some computations based on it, and visualise the results.

A tour of capabilities¶

Let's start this second part by taking a look at some of the awesome things that you can do joining programming and the Internet. As before, you will not understand all the detail of the code in this section yet, and nor should you. If you do, maybe one of our more advanced courses is more appropriate for you!

As we show the code for different parts of the work, we will be touching on various aspects you may want to keep in mind, either related to Python specifically, or to research programming more generally.

Why write software to manage your data and plots?¶

We can use programs for our entire research pipeline. Not just big scientific simulation codes, but also the small scripts which we use to tidy up data and produce plots. This should be code, so that the whole research pipeline is recorded for reproducibility. Data manipulation in spreadsheets is much harder to share or check.

You can see another similar demonstration on the Software Carpentry site. We'll try to give links to other sources of Python training along the way. Part of our approach is that we assume you know how to use the internet! If you find something confusing out there, please bring it along to the next session. In this course, we'll always try to draw your attention to other sources of information about what we're learning. Paying attention to as many of these as you need to, is just as important as these core notes.

Importing libraries¶

Research programming is all about using libraries: tools other people have provided programs that do many cool things. By combining them we can feel really powerful but doing minimum work ourselves. The Python syntax to import someone else's library is import.

In [1]:

import geopy # A Python library for investigating geographic information.
# https://pypi.org/project/geopy/

Now, if you try to follow along on this example in an Jupyter notebook, you'll probably find that you just got an error message.

You'll need to wait until we've covered installation of additional Python libraries later in the course, then come back to this and try again. For now, just follow along and try get the feel for how programming for data-focused research works.

In [2]:

# Select geocoding service provided by OpenStreetMap's Nominatim - https://wiki.openstreetmap.org/wiki/Nominatim
geocoder = geopy.geocoders.Nominatim(user_agent="uk/doctoral-programming-intro") 
geocoder.geocode('Cambridge', exactly_one=False)

Out[2]:

[Location(Cambridge, Cambridgeshire, Cambridgeshire and Peterborough, England, United Kingdom, (52.197584649999996, 0.13915373736874398, 0.0)),
 Location(Cambridgeshire, Cambridgeshire and Peterborough, England, United Kingdom, (52.2055314, 0.1186637, 0.0)),
 Location(Cambridge, Middlesex County, Massachusetts, United States, (42.3750997, -71.1056157, 0.0)),
 Location(Cambridge, Region of Waterloo, Southwestern Ontario, Ontario, Canada, (43.3600536, -80.3123023, 0.0)),
 Location(Cambridge, Henry County, Illinois, United States, (41.3025257, -90.1962861, 0.0)),
 Location(Cambridge, Isanti County, Minnesota, 55008, United States, (45.5727408, -93.2243921, 0.0)),
 Location(Cambridge, Story County, Iowa, 50046, United States, (41.8990768, -93.5294029, 0.0)),
 Location(Cambridge, Dorchester County, Maryland, 21613, United States, (38.5714624, -76.0763177, 0.0)),
 Location(Cambridge, Guernsey County, Ohio, 43725, United States, (40.031183, -81.5884561, 0.0)),
 Location(Cambridge, Jefferson County, Kentucky, United States, (38.2217369, -85.616627, 0.0))]

The results are returned as a list with each element in the list a Location object representing a particular location and specifying among other things text specifying the address of the location and a triplet of numbers specifying the latitude, longitude and altitude of the location. Programs represent data in a variety of different containers like this.

Functions¶

As we've learnt from the first half, we can wrap code up in a function, so that we can repeatedly get just the information we want.

In [3]:

def geolocate(place):
    location = geocoder.geocode(place, exactly_one=True)
    return (location.latitude, location.longitude)

Remember! Defining functions which put together code to make a more complex task seem simple from the outside is the most important thing in programming. The output of the function is stated by return; the input comes in in brackets after the function name:

In [4]:

geolocate('Cambridge')

Out[4]:

(52.197584649999996, 0.13915373736874398)

Variables¶

We can store a result in a variable (with a self-descriptive name):

In [5]:

london_location = geolocate("London")
print(london_location)

(51.5073219, -0.1276474)

More complex functions (connecting with the Internet)¶

We'll fetch a map of a place from the Google Maps server, given a longitude and latitude. The URLs look like: https://mt0.google.com/vt?x=658&y=340&z=10&lyrs=s. Since we'll frequently be generating these URLs, we will create two helper functions to make our life easier. The first is a function to convert our latitude and longitude into the coordinate system used by Google Maps. If interested, you can see more details of the mathematics. We will then create a second function to build up a web request from the URL given our parameters.

In [6]:

import os
import math
import requests

def deg_to_tilenum(lat_deg, lon_deg, zoom):
    lat_rad = math.radians(lat_deg)
    n = 2.0 ** zoom
    xtile = int((lon_deg + 180.0) / 360.0 * n)
    ytile = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    return (xtile, ytile)

def request_map_at(lat, long, zoom=10, satellite=True):
    base_url = "https://mt0.google.com/vt?"

    x_coord, y_coord = deg_to_tilenum(lat, long, zoom)

    params = dict(
        x=int(x_coord),
        y=int(y_coord),
        z=zoom
    )
    if satellite:
        params['lyrs'] = 's'

    return requests.get(base_url, params=params)

In [7]:

map_response = request_map_at(51.5072, -0.1275)

Checking our work¶

Let's see what URL we ended up with:

In [8]:

url = map_response.url
print(url[0:25])
print(url[25:])

https://mt0.google.com/vt
?x=511&y=340&z=10&lyrs=s

We can write automated tests so that if we change our code later, we can check the results are still valid.

In [9]:

assert "https://mt0.google.com/vt?" in url
assert "z=10" in url
assert "lyrs=s" in url

Our previous function comes back with an object representing the web request. In object oriented programming, we use the . operator to get access to a particular property of the object, in this case, the actual image at that URL is in the content property. It's a big file, so we'll just get the first few bytes:

In [10]:

map_response.content[0:20]

Out[10]:

b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x00\x00\x01\x00\x01\x00\x00'

Displaying results¶

We'll need to do this a lot, so we'll wrap up our previous function in another function, to save on typing.

In [11]:

def map_at(*args, **kwargs):
    return request_map_at(*args, **kwargs).content

We can use a library that comes with Jupyter notebook to display the image. Being able to work with variables which contain images, or documents, or any other weird kind of data, just as easily as we can with numbers or letters, is one of the really powerful things about modern programming languages like Python.

In [12]:

import IPython
map_jpeg = map_at(*london_location)

In [13]:

print("The type of our map result is actually a: ", type(map_jpeg))

The type of our map result is actually a:  <class 'bytes'>

In [14]:

IPython.core.display.Image(map_jpeg)

Out[14]:

In [15]:

IPython.core.display.Image(map_at(*geolocate("New Delhi")))

Out[15]:

Manipulating numbers¶

Now we get to our research project: we want to find out how urbanised the world is, based on satellite imagery, along a line between two cites. We expect the satellite image to be greener in the countryside.

We'll use lots more libraries to count how much green there is in an image.

In [16]:

from io import BytesIO # A library to convert between files and strings
import numpy as np # A library to deal with matrices
import imageio # A library to deal with images

Let's define what we count as green:

In [17]:

def is_green(pixels):
    threshold = 1.1
    greener_than_red = pixels[:,:,1] > threshold * pixels[:,:,0]
    greener_than_blue = pixels[:,:,1] > threshold * pixels[:,:,2]
    green = np.logical_and(greener_than_red, greener_than_blue) 
    return green

This code has assumed we have our pixel data for the image as a 256 × 256 × 3 three-dimensional matrix, with each of the three layers being red, green, and blue pixels.

We find out which pixels are green by comparing, element-by-element, the middle (green, number 1) layer to the top (red, zero) and bottom (blue, 2)

Now we just need to parse in our data, which is a JPEG image, and turn it into our matrix format:

In [18]:

def count_green_in_image(data):
    pixels = imageio.imread(data, format='jpg') # Get our JPEG image as a numpy array
    return np.sum(is_green(pixels))

In [19]:

print(count_green_in_image(map_at(*london_location)))

/tmp/ipykernel_7031/4189313165.py:2: DeprecationWarning: Starting with ImageIO v3 the behavior of this function will switch to that of iio.v3.imread. To keep the current behavior (and make this warning disappear) use `import imageio.v2 as imageio` or call `imageio.v2.imread` directly.
  pixels = imageio.imread(data, format='jpg') # Get our JPEG image as a numpy array

We'll also need a function to get an evenly spaced set of places between two endpoints:

In [20]:

def location_sequence(start, end, steps):
    lats = np.linspace(start[0], end[0], steps) # "Linearly spaced" data
    longs = np.linspace(start[1], end[1], steps)
    return np.vstack([lats, longs]).transpose()

In [21]:

location_sequence(geolocate("London"), geolocate("Cambridge"), 5)

Out[21]:

array([[ 5.15073219e+01, -1.27647400e-01],
       [ 5.16798876e+01, -6.09471157e-02],
       [ 5.18524533e+01,  5.75316868e-03],
       [ 5.20250190e+01,  7.24534530e-02],
       [ 5.21975846e+01,  1.39153737e-01]])

Creating images¶

We should display the green content to check our work:

In [22]:

def show_green_in_image(data):
    pixels = imageio.imread(data, format='jpg') # Get our JPEG image as rows of pixels
    green = is_green(pixels)
    out = green[:, :, np.newaxis] * np.array([0, 255, 0])[np.newaxis, np.newaxis, :]  
    buffer = BytesIO()
    result = imageio.imwrite(buffer, out.astype('uint8'), format='png')
    return buffer.getvalue()

In [23]:

IPython.core.display.Image(
    map_at(*london_location, satellite=True)
)

Out[23]:

In [24]:

IPython.core.display.Image(
    show_green_in_image(
        map_at(*london_location, satellite=True)
    )
)

/tmp/ipykernel_7031/3124252408.py:2: DeprecationWarning: Starting with ImageIO v3 the behavior of this function will switch to that of iio.v3.imread. To keep the current behavior (and make this warning disappear) use `import imageio.v2 as imageio` or call `imageio.v2.imread` directly.
  pixels = imageio.imread(data, format='jpg') # Get our JPEG image as rows of pixels

Out[24]:

Looping¶

We can loop over each element in out list of coordinates, and get a map for that place:

In [25]:

for location in location_sequence(
    geolocate("London"), geolocate("Birmingham"), 4
):
    IPython.display.display( 
        IPython.core.display.Image(map_at(*location))
    )

So now we can count the green from London to Birmingham!

In [26]:

[
    count_green_in_image(map_at(*location))
    for location in location_sequence(
        geolocate("London"), geolocate("Birmingham"), 10
    )
]

/tmp/ipykernel_7031/4189313165.py:2: DeprecationWarning: Starting with ImageIO v3 the behavior of this function will switch to that of iio.v3.imread. To keep the current behavior (and make this warning disappear) use `import imageio.v2 as imageio` or call `imageio.v2.imread` directly.
  pixels = imageio.imread(data, format='jpg') # Get our JPEG image as a numpy array

Out[26]:

[31417, 31417, 62395, 63374, 63405, 63471, 63374, 62907, 59831, 60067]

Plotting graphs¶

Let's plot a graph.

In [27]:

import matplotlib.pyplot as plt

In [28]:

plt.plot([
    count_green_in_image(map_at(*location))
    for location in location_sequence(
        geolocate("London"), geolocate("Birmingham"), 10
    )
])

/tmp/ipykernel_7031/4189313165.py:2: DeprecationWarning: Starting with ImageIO v3 the behavior of this function will switch to that of iio.v3.imread. To keep the current behavior (and make this warning disappear) use `import imageio.v2 as imageio` or call `imageio.v2.imread` directly.
  pixels = imageio.imread(data, format='jpg') # Get our JPEG image as a numpy array

Out[28]:

[<matplotlib.lines.Line2D at 0x7efcc8a12d60>]

From a research perspective, of course, this code needs a lot of work. But I hope the power of using programming is clear.

Composing program elements¶

We built little pieces of useful code, to:

Find latitude and longitude of a place
Get a map at a given latitude and longitude
Decide whether a (red, green, blue) triple is mainly green
Decide whether each pixel is mainly green
Plot a new image showing the green places
Find evenly spaced points between two places

By putting these together, we can make a function which can plot this graph automatically for any two places:

In [29]:

def green_between(start, end,steps):
    return [
        count_green_in_image( map_at(*location) )
        for location in location_sequence(
            geolocate(start), geolocate(end), steps
        )
    ]

In [30]:

plt.plot(green_between('New York', 'Chicago', 20))

/tmp/ipykernel_7031/4189313165.py:2: DeprecationWarning: Starting with ImageIO v3 the behavior of this function will switch to that of iio.v3.imread. To keep the current behavior (and make this warning disappear) use `import imageio.v2 as imageio` or call `imageio.v2.imread` directly.
  pixels = imageio.imread(data, format='jpg') # Get our JPEG image as a numpy array

Out[30]:

[<matplotlib.lines.Line2D at 0x7efcc6955b80>]

And that's it! We've covered, very very quickly, the majority of the Python language, and much of the theory of software engineering.

Now we'll go back, carefully, through all the concepts we touched on, and learn how to use them properly ourselves.