XClose

An introduction to research programming with Python

Home
Menu

Tabular data

Tabular data, that is data that is formatted as a table with a fixed number of rows and columns, is very common in a research context. A particularly simple and also popular file format for such data is delimited-separated value files.

Delimiter-separated values

Let's carry on with our sunspots example. As we saw previously the data is semicolon-separated.

We can request the CSV file text from the URL we used previously:

In [1]:
import requests
# Request sunspots data from URL and extract response content as text
sunspots_csv_text = requests.get('http://www.sidc.be/silso/INFO/snmtotcsv.php').text
# Strip any leading or trailing whitespace from CSV text to skip any empty rows
sunspots_csv_text = sunspots_csv_text.strip()

As a quick check we can split the text on the newline character \n and print the last five entries in the resulting list

In [2]:
print(sunspots_csv_text.split('\n')[-5:])
['2022;05;2022.371;  96.5; 16.0; 1250;0', '2022;06;2022.453;  70.5; 12.9; 1219;0', '2022;07;2022.538;  91.4; 12.2; 1304;0', '2022;08;2022.623;  75.4; 10.5; 1289;0', '2022;09;2022.705;  96.3; 16.2; 1130;0']

We see that each line is a string with numeric calues separted by semicolon delimiters. We want to work programmatically with such delimited-separated value files.

These are files which (typically) have

  • One record per line (row)
  • Each record has multiple fields (columns)
  • Fields are separated by some delimiter

Typical delimiters are spaces, tabs, commas and semicolons, for example

  • Space separated values: field1, "field two", field3
  • Comma separated values: field1, another field, "wow, another field"

Comma-separated values is abbreviated as CSV, and tab-separated values as TSV.

CSV is also sometimes used to refer to all the different sub-kinds of delimiter-separated value files, that is, some people use CSV to refer to tab, space and semicolon separated files.

CSV is not a particularly superb data format, because it forces your data model to only have two 'axes', records and fields, with each record a flat object. As we will see in the next notebook, structured file formats can be used to represent a richer array of data formats, including for example hierarchically structured data where each record may itself have an internal structure.

Nevertheless, because you can always export spreadsheets as CSV files (each cell is a field, each row is a record), CSV files are very popular.

There are some variants on the basic CSV format specified above. For example, some CSV formats define a comment character, so that rows beginning with, for example, a hash character #, are not treated as data, but give additional human-readable information. Another variation is that some CSV formats assume that the first line is a header line which defines the names of the fields.

Python csv module

The Python standard library provides a csv module for reading and writing delimited-separated value files, including, as the name suggests, CSV files. As it is built-in to all Python installations, it is useful to be familiar with the csv module as an option for loading and saving CSV formatted data, though the CSV capabilities in third-party libraries such as NumPy (which we will cover later in the course) and Pandas are more powerful and will often be better options in practice.

In [3]:
import csv

The most straightforward way to read CSV files using the csv module is with the csv.reader function. This accepts an iterable object as its first argument which returns a line of delimited input for each iteration. Commonly this will be an opened file object however it can also for example be a sequence of strings which is what we will use here by using the split method to convert the sunspot CSV text object into a list of per-line strings. The csv.reader function also accepts various optional keyword arguments including importantly a delimiter argument to specify the character used as the delimiter separating the values in each line, with we setting this to a semicolon here.

In [4]:
csv_reader = csv.reader(sunspots_csv_text.split('\n'), delimiter=';')

The object returned by the csv.reader function is an iterator over the rows of the CSV file, with each row being returned as a list of the separated values in the row (with all values being read as strings by default). We can read all of the data in to a nested list-of-lists using a list comprehension:

In [5]:
sunspots_data = [row for row in csv_reader]
print(sunspots_data[-5:])
[['2022', '05', '2022.371', '  96.5', ' 16.0', ' 1250', '0'], ['2022', '06', '2022.453', '  70.5', ' 12.9', ' 1219', '0'], ['2022', '07', '2022.538', '  91.4', ' 12.2', ' 1304', '0'], ['2022', '08', '2022.623', '  75.4', ' 10.5', ' 1289', '0'], ['2022', '09', '2022.705', '  96.3', ' 16.2', ' 1130', '0']]

For this particular CSV file the first column corresponds to the measurement year, the second the measurement month number and the third the measurement date as a 'fractional year'. We can extract a list of the just the (fractional) years converted to floating point values using another list comprehension

In [6]:
fractional_years = [float(row[2]) for row in sunspots_data]

Similarly the fourth column in the CSV file contains the monthly mean total sunspot number, which we can extract with another list comprehension

In [7]:
monthly_mean_total_sunspot_numbers = [float(row[3]) for row in sunspots_data]

We can then for example use Matplotlib to create a plot of how the monthly average sunspot number varies over time, with this highlighting the cyclic nature of sunspot activity

In [8]:
from matplotlib import pyplot as plt
plt.plot(fractional_years, monthly_mean_total_sunspot_numbers)
plt.xlabel('Year')
plt.ylabel('Monthly mean total sunspot number');

Reading rows as dictionaries

Accessing the values in each row by an index corresponding to their column can be unclear and prone to bugs. The csv module also provides the csv.DictReader class to allow reading each record (line) in the CSV file as a dictionary keyed by a set of field names. For the dataset we are using we have the columns correspond to

Column 1-2: Gregorian calendar date
- Year
- Month
Column 3: Date in fraction of year.
Column 4: Monthly mean total sunspot number.
Column 5: Monthly mean standard deviation of the input sunspot numbers.
Column 6: Number of observations used to compute the monthly mean total sunspot number.
Column 7: Definitive/provisional marker. '1' indicates that the value is definitive. '0' indicates that the value is still provisional.

We can create an instance of the csv.DictReader class very similarly to how we called the csv.reader function, but with an additional keyword argument fieldnames specifying a sequence of strings corresponding to the keys to associate the values in each row with. Below we also set the optional keyword argument quoting to the special constant csv.QUOTE_NONNUMERIC which causes all non-quoted values in each line to be automatically converted to floating point values.

In [9]:
csv_reader = csv.DictReader(
    sunspots_csv_text.split('\n'),
    fieldnames=['year', 'month', 'fractional_year', 'mean', 'deviation', 'observations', 'definitive'],
    delimiter=';',
    quoting=csv.QUOTE_NONNUMERIC
)

Similarly to previously we can now extract all the data using a list comprehension, with the difference being that each item in the constructed list is now a dictionary keyed by the field names:

In [10]:
sunspots_data = [record for record in csv_reader]
print(sunspots_data[-1])
{'year': 2022.0, 'month': 9.0, 'fractional_year': 2022.705, 'mean': 96.3, 'deviation': 16.2, 'observations': 1130.0, 'definitive': 0.0}

We can then recreate the same plot as previously as follows, with the intention of the list comprehensions extracting the year and mean values now much more apparent

In [11]:
from matplotlib import pyplot as plt

plt.plot([r['fractional_year'] for r in sunspots_data], [r['mean'] for r in sunspots_data])
plt.xlabel('Year')
plt.ylabel('Monthly mean total sunspot number');

Writing CSV files

The csv module also provides functionality for writing delimiter-separated value files. The csv.writer function provides a simple interface for writing CSV files row by row, with the function accepting a file object (and optional keyword arugments specifying formatting options such as delimiter) and the returned object providing a writerow method to write a sequence of values to the file as a delimiter-separated string. For example we can save a table of data about the planets in the solar system as a comma-separated values file using the following code snippet

In [12]:
planets_data = [
    ['Name', 'Mean distance from sun / AU', 'Orbit period / years', 'Mass / M🜨', 'Radius / R🜨', 'Number of satellites'],
    ['Mercury', 0.39, 0.24, 0.06, 0.38, 0],
    ['Venus', 0.72, 0.62, 0.82, 0.95, 0],
    ['Earth', 1., 1., 1., 1., 1],
    ['Mars', 1.5, 1.9, 0.11, 0.53, 2],
    ['Jupiter', 5.2, 12., 320., 11., 63],
    ['Saturn', 9.5, 29., 95., 9.4, 61],
    ['Uranus', 19., 84., 15., 4.1, 27],
    ['Neptune', 30., 170., 17., 3.9, 14],
]

with open('planets_data.csv', 'w', encoding='utf-8') as f:
    csv_writer = csv.writer(f, delimiter=',')
    for row in planets_data:
        csv_writer.writerow(row)

A csv.DictWriter class is also provided which analogously to the csv.DictReader class, allows writing a CSV file using rows specified by dictionaries mapping from field names to values.