
An introduction to research programming with Python

Loading data from files

Loading data

An important part of this course is about using Python to analyse and visualise data. Most data is supplied to us in one of many formats: spreadsheets, database dumps, or text and binary files (csv, tsv, json, yaml, hdf5, netcdf). It is also stored in some medium: on a local disk, a network drive, or somewhere on the internet. It is important to distinguish the data format (how the data is structured within a file) from the data's storage (where the file is kept).

We'll look first at data transport: loading data from a disk, and downloading data from the internet. Then we'll look at data parsing: building Python structures from the data. These are related but separate questions.

An example datafile

Let's write an example datafile to disk so we can investigate it. We'll just use a plain-text file. The Jupyter notebook provides a way to do this: if we put %%writefile at the top of a cell, the cell contents are saved to disk instead of being interpreted as Python.

In [1]:
%%writefile mydata.txt
A poet once said, 'The whole universe is in a glass of wine.'
We will probably never know in what sense he meant it, 
for poets do not write to be understood. 
But it is true that if we look at a glass of wine closely enough we see the entire universe. 
There are the things of physics: the twisting liquid which evaporates depending
on the wind and weather, the reflection in the glass;
and our imagination adds atoms.
The glass is a distillation of the earth's rocks,
and in its composition we see the secrets of the universe's age, and the evolution of stars. 
What strange array of chemicals are in the wine? How did they come to be? 
There are the ferments, the enzymes, the substrates, and the products.
There in wine is found the great generalization; all life is fermentation.
Nobody can discover the chemistry of wine without discovering, 
as did Louis Pasteur, the cause of much disease.
How vivid is the claret, pressing its existence into the consciousness that watches it!
If our small minds, for some convenience, divide this glass of wine, this universe, 
into parts -- 
physics, biology, geology, astronomy, psychology, and so on -- 
remember that nature does not know it!

So let us put it all back together, not forgetting ultimately what it is for.
Let it give us one more final pleasure; drink it and forget it all!
   - Richard Feynman
Writing mydata.txt

Where did that go? It went to the current folder, which for a notebook, by default, is where the notebook is on disk.

In [2]:
import os  # The 'os' module gives us all the tools we need to search in the file system

os.getcwd()  # Use the 'getcwd' function from the 'os' module to find where we are on disk.
Out[2]:
'/home/runner/work/doctoral-programming-intro/doctoral-programming-intro/01-beginner'

Can we see it is there?

In [3]:
os.listdir(os.getcwd())
Out[3]:
['mydata.txt',
 'index.md',
 '015variables.ipynb',
 '023types.ipynb',
 '038SolutionComprehension.ipynb',
 '016using_functions.ipynb',
 '035looping.ipynb',
 '032conditionality.ipynb',
 '030MazeSolution.ipynb',
 '029structures.ipynb',
 '036MazeSolution2.ipynb',
 '037comprehensions.ipynb',
 'data',
 '037comprehensions.html',
 '060files.nbconvert.ipynb',
 '040functions.ipynb',
 '028dictionaries.ipynb',
 '028dictionaries.html',
 '00pythons.ipynb',
 '010exemplar.ipynb',
 '025containers.ipynb',
 '028dictionaries.nbconvert.ipynb',
 '060MatplotlibPyplot.ipynb',
 '050import.ipynb',
 '037comprehensions.nbconvert.ipynb',
 '060files.ipynb']
In [4]:
[x for x in os.listdir(os.getcwd()) if ".txt" in x]
Out[4]:
['mydata.txt']

Yep! Note how we used a list comprehension to filter out all the extraneous files.
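One caveat about the filter: the test ".txt" in x matches that text anywhere in the name, not only at the end. The sketch below uses a made-up list of names (not the real directory listing) to compare it with str.endswith, which only accepts names that finish with the suffix.

```python
# Hypothetical file names, used only for illustration.
names = ["mydata.txt", "notes.txt.bak", "060files.ipynb", "README.TXT"]

# Substring test: matches ".txt" anywhere in the name.
loose = [x for x in names if ".txt" in x]

# Suffix test: matches only names that end in ".txt" (case-sensitive).
strict = [x for x in names if x.endswith(".txt")]

print(loose)   # ['mydata.txt', 'notes.txt.bak']
print(strict)  # ['mydata.txt']
```

For a folder that only holds notebooks and one text file the two tests agree, so the simpler form is fine here.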

Path independence and os

We can use dirname to get the parent folder of a folder, in a platform-independent way.

In [5]:
os.path.dirname(os.getcwd())
Out[5]:
'/home/runner/work/doctoral-programming-intro/doctoral-programming-intro'

We could do this manually using split:

In [6]:
"/".join(os.getcwd().split("/")[:-1])
Out[6]:
'/home/runner/work/doctoral-programming-intro/doctoral-programming-intro'

But this would not work on Windows, where path elements are separated with a \ instead of a /. So it's important to use os.path for this kind of work.
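To see the difference without switching computers, we can call the two platform-specific implementations directly. On any given machine, os.path is one of the standard library modules posixpath or ntpath, and both can be imported anywhere. The paths below are hypothetical and chosen only for illustration.

```python
import ntpath      # Windows path rules, importable on any platform
import posixpath   # POSIX (Linux, macOS) path rules

# Hypothetical paths for illustration.
posix_dir = "/home/user/project/01-beginner"
windows_dir = r"C:\Users\user\project\01-beginner"

# Manual splitting on "/" only works for the POSIX-style path.
manual_posix = "/".join(posix_dir.split("/")[:-1])
manual_windows = "/".join(windows_dir.split("/")[:-1])

print(manual_posix)          # /home/user/project
print(repr(manual_windows))  # '' (there is no "/" to split on)

# The path modules apply the right rules for each platform.
print(posixpath.dirname(posix_dir))  # /home/user/project
print(ntpath.dirname(windows_dir))   # C:\Users\user\project
```

In everyday code we would simply call os.path.dirname and let Python pick the right rules for the machine it is running on.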

Supplementary Materials: If you're not already comfortable with how files fit into folders, and folders form a tree, with folders containing subfolders, then look at http://swcarpentry.github.io/shell-novice/02-filedir/.

Satisfy yourself that after using %%writefile, you can then find the file on disk with Windows Explorer, the macOS Finder, or the Linux shell.

We can see how in Python we can investigate the file system with functions in the os module, using just the same programming approaches as for anything else.

We'll gradually learn more features of the os module as we go, allowing us to move around the disk, walk around the disk looking for relevant files, and so on. These will be important to master for automating our data analyses.
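As a small preview of "walking around the disk", here is a minimal sketch using os.walk. It builds a throwaway folder tree in a temporary directory so that it is self-contained; the folder and file names are invented for the example.

```python
import os
import tempfile

# Build a small throwaway folder tree so the example is self-contained.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "data", "raw"))
    for relative in ["notes.txt",
                     os.path.join("data", "a.txt"),
                     os.path.join("data", "raw", "b.csv")]:
        with open(os.path.join(root, relative), "w") as target:
            target.write("example\n")

    # os.walk yields (folder, subfolders, files) for every folder under root.
    text_files = []
    for folder, subfolders, files in os.walk(root):
        for name in files:
            if name.endswith(".txt"):
                full_path = os.path.join(folder, name)
                # Store the path relative to root, so it is easier to read.
                text_files.append(os.path.relpath(full_path, root))

text_files.sort()
print(text_files)
```

The printed list holds the two .txt files, with the path separator appropriate to the platform the code runs on.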

File objects

So, let's read our file:

In [7]:
myfile = open("mydata.txt")
In [8]:
type(myfile)
Out[8]:
_io.TextIOWrapper

We can go line-by-line, by treating the file as an iterable:

In [9]:
[x for x in myfile]
Out[9]:
["A poet once said, 'The whole universe is in a glass of wine.'\n",
 'We will probably never know in what sense he meant it, \n',
 'for poets do not write to be understood. \n',
 'But it is true that if we look at a glass of wine closely enough we see the entire universe. \n',
 'There are the things of physics: the twisting liquid which evaporates depending\n',
 'on the wind and weather, the reflection in the glass;\n',
 'and our imagination adds atoms.\n',
 "The glass is a distillation of the earth's rocks,\n",
 "and in its composition we see the secrets of the universe's age, and the evolution of stars. \n",
 'What strange array of chemicals are in the wine? How did they come to be? \n',
 'There are the ferments, the enzymes, the substrates, and the products.\n',
 'There in wine is found the great generalization; all life is fermentation.\n',
 'Nobody can discover the chemistry of wine without discovering, \n',
 'as did Louis Pasteur, the cause of much disease.\n',
 'How vivid is the claret, pressing its existence into the consciousness that watches it!\n',
 'If our small minds, for some convenience, divide this glass of wine, this universe, \n',
 'into parts -- \n',
 'physics, biology, geology, astronomy, psychology, and so on -- \n',
 'remember that nature does not know it!\n',
 '\n',
 'So let us put it all back together, not forgetting ultimately what it is for.\n',
 'Let it give us one more final pleasure; drink it and forget it all!\n',
 '   - Richard Feynman\n']

If we do that again, we get nothing: the file has already been read to the end, so there is no more data.

In [10]:
[x for x in myfile]
Out[10]:
[]

We need to 'rewind' it!

In [11]:
myfile.seek(0)
[len(x) for x in myfile if "ut" in x]
Out[11]:
[94, 94, 64, 78]

It's really important to remember that a file is a different built-in type than a string.

Working with files

We can read one line at a time with readline:

In [12]:
myfile.seek(0)
first = myfile.readline()
In [13]:
first
Out[13]:
"A poet once said, 'The whole universe is in a glass of wine.'\n"
In [14]:
second = myfile.readline()
In [15]:
second
Out[15]:
'We will probably never know in what sense he meant it, \n'

We can read the whole remaining file with read:

In [16]:
rest = myfile.read()
In [17]:
rest
Out[17]:
"for poets do not write to be understood. \nBut it is true that if we look at a glass of wine closely enough we see the entire universe. \nThere are the things of physics: the twisting liquid which evaporates depending\non the wind and weather, the reflection in the glass;\nand our imagination adds atoms.\nThe glass is a distillation of the earth's rocks,\nand in its composition we see the secrets of the universe's age, and the evolution of stars. \nWhat strange array of chemicals are in the wine? How did they come to be? \nThere are the ferments, the enzymes, the substrates, and the products.\nThere in wine is found the great generalization; all life is fermentation.\nNobody can discover the chemistry of wine without discovering, \nas did Louis Pasteur, the cause of much disease.\nHow vivid is the claret, pressing its existence into the consciousness that watches it!\nIf our small minds, for some convenience, divide this glass of wine, this universe, \ninto parts -- \nphysics, biology, geology, astronomy, psychology, and so on -- \nremember that nature does not know it!\n\nSo let us put it all back together, not forgetting ultimately what it is for.\nLet it give us one more final pleasure; drink it and forget it all!\n   - Richard Feynman\n"

This means that when a file is first opened, read is a convenient way to get the whole thing as a single string:

In [18]:
feynman_quote = open("mydata.txt")
In [19]:
feynman_quote.read()
Out[19]:
"A poet once said, 'The whole universe is in a glass of wine.'\nWe will probably never know in what sense he meant it, \nfor poets do not write to be understood. \nBut it is true that if we look at a glass of wine closely enough we see the entire universe. \nThere are the things of physics: the twisting liquid which evaporates depending\non the wind and weather, the reflection in the glass;\nand our imagination adds atoms.\nThe glass is a distillation of the earth's rocks,\nand in its composition we see the secrets of the universe's age, and the evolution of stars. \nWhat strange array of chemicals are in the wine? How did they come to be? \nThere are the ferments, the enzymes, the substrates, and the products.\nThere in wine is found the great generalization; all life is fermentation.\nNobody can discover the chemistry of wine without discovering, \nas did Louis Pasteur, the cause of much disease.\nHow vivid is the claret, pressing its existence into the consciousness that watches it!\nIf our small minds, for some convenience, divide this glass of wine, this universe, \ninto parts -- \nphysics, biology, geology, astronomy, psychology, and so on -- \nremember that nature does not know it!\n\nSo let us put it all back together, not forgetting ultimately what it is for.\nLet it give us one more final pleasure; drink it and forget it all!\n   - Richard Feynman\n"

You can also read just a few characters:

In [20]:
myfile.seek(1335)
Out[20]:
1335
In [21]:
myfile.read(15)
Out[21]:
'\n   - Richard F'
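A word of caution about the number passed to seek: for files opened in text mode, the Python documentation only guarantees offsets that are zero or that were returned by an earlier call to tell. A hand-picked number such as 1335 happens to work for this plain-ASCII file, but it is safer to bookmark a position with tell and return to it later. A minimal sketch, using a short made-up file in a temporary folder:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    # Write a short throwaway file (hypothetical content) to read back.
    path = os.path.join(folder, "lines.txt")
    with open(path, "w") as target:
        target.write("first line\nsecond line\nthird line\n")

    with open(path) as source:
        source.readline()          # consume "first line\n"
        bookmark = source.tell()   # remember where the second line starts
        second = source.readline()
        source.read()              # run to the end of the file
        source.seek(bookmark)      # jump back to the bookmark
        again = source.read(6)     # read just six characters

print(repr(second))  # 'second line\n'
print(repr(again))   # 'second'
```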

Converting Strings to Files

Because files and strings are different types, we CANNOT just treat strings as if they were files:

In [22]:
mystring = "Hello World\n My name is James"
In [23]:
mystring
Out[23]:
'Hello World\n My name is James'
In [24]:
mystring.readline()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In [24], line 1
----> 1 mystring.readline()

AttributeError: 'str' object has no attribute 'readline'

This is important, because some file format parsers expect input from a file and not a string. We can wrap a string so that it behaves like a file using the StringIO class from the io module in the standard library:

In [25]:
from io import StringIO
In [26]:
mystringasafile = StringIO(mystring)
In [27]:
mystringasafile.readline()
Out[27]:
'Hello World\n'
In [28]:
mystringasafile.readline()
Out[28]:
' My name is James'

Note that in a string, \n is used to represent a newline.
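As an illustration of a parser that wants a file, the standard library's json.load reads from any object with a read method. The sketch below wraps an in-memory string (made-up data) in StringIO so that json.load accepts it.

```python
import json
from io import StringIO

# A JSON document held in memory as an ordinary string (made-up data).
text = '{"name": "James", "scores": [1, 2, 3]}'

# json.load expects an object with a read() method, such as an open file.
# Wrapping the string in StringIO provides one.
record = json.load(StringIO(text))

print(record["name"])         # James
print(sum(record["scores"]))  # 6
```

The json module also offers json.loads, which takes a string directly, but not every parser provides both forms, and that is when StringIO is useful.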

Closing files

We really ought to close files when we've finished with them: each open file uses up operating system resources, and data written to a file may not reach the disk until the file is closed. (On a shared computer, this is particularly important.)

In [29]:
feynman_quote.close()

Because it's so easy to forget this, Python provides a context manager to open a file, then close it automatically at the end of an indented block. Compare closing the file by hand with the with statement that follows it:

In [30]:
somefile = open("mydata.txt")
content = somefile.read()
somefile.close()
In [31]:
with open("mydata.txt") as somefile:
    content = somefile.read()
# File automatically closed when the with block ends
print(content)
A poet once said, 'The whole universe is in a glass of wine.'
We will probably never know in what sense he meant it, 
for poets do not write to be understood. 
But it is true that if we look at a glass of wine closely enough we see the entire universe. 
There are the things of physics: the twisting liquid which evaporates depending
on the wind and weather, the reflection in the glass;
and our imagination adds atoms.
The glass is a distillation of the earth's rocks,
and in its composition we see the secrets of the universe's age, and the evolution of stars. 
What strange array of chemicals are in the wine? How did they come to be? 
There are the ferments, the enzymes, the substrates, and the products.
There in wine is found the great generalization; all life is fermentation.
Nobody can discover the chemistry of wine without discovering, 
as did Louis Pasteur, the cause of much disease.
How vivid is the claret, pressing its existence into the consciousness that watches it!
If our small minds, for some convenience, divide this glass of wine, this universe, 
into parts -- 
physics, biology, geology, astronomy, psychology, and so on -- 
remember that nature does not know it!

So let us put it all back together, not forgetting ultimately what it is for.
Let it give us one more final pleasure; drink it and forget it all!
   - Richard Feynman

The code to be done while the file is open is indented, just like for an if statement.

In [32]:
print(somefile)
<_io.TextIOWrapper name='mydata.txt' mode='r' encoding='UTF-8'>

You should pretty much always use this syntax for working with files.
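Printing a file object does not tell us whether it is still open, but its closed attribute does. The sketch below, using a throwaway file in a temporary folder, checks that the file is closed after the with block, and that this still holds when an error is raised inside the block.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "example.txt")
    with open(path, "w") as target:
        target.write("some text\n")

    # Normal exit from the block closes the file.
    with open(path) as source:
        content = source.read()
    closed_after_block = source.closed

    # An exception inside the block still closes the file.
    try:
        with open(path) as failing:
            raise ValueError("simulated failure while reading")
    except ValueError:
        pass
    closed_after_error = failing.closed

print(closed_after_block)  # True
print(closed_after_error)  # True
```

The second case is the main advantage over calling close by hand, where an error partway through would skip the close call.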

Writing files

We might want to create a file from a string in memory. We can't do this with the notebook's %%writefile -- this is just a notebook convenience, and isn't very programmable.

When we open a file, we can specify a 'mode', in this case, 'w' for writing. ('r' for reading is the default.)

In [33]:
with open("mywrittenfile", "w") as target:
    target.write("Hello")
    target.write("World")
In [34]:
with open("mywrittenfile", "r") as source:
    print(source.read())
HelloWorld
In [35]:
import os

print(os.getcwd())
/home/runner/work/doctoral-programming-intro/doctoral-programming-intro/01-beginner

And we can "append" to a file with mode 'a':

In [36]:
with open("mywrittenfile", "a") as target:
    target.write("Hello")
    target.write("James")
In [37]:
with open("mywrittenfile", "r") as source:
    print(source.read())
HelloWorldHelloJames
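Notice that write does not add a newline for us, which is why everything above ended up on a single line. If we want one item per line, we have to include \n ourselves. A minimal sketch, writing to a throwaway file in a temporary folder rather than the current folder:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "lines.txt")

    # Mode 'w' creates the file, or empties it if it already exists.
    with open(path, "w") as target:
        target.write("Hello\n")
        target.write("World\n")

    # Mode 'a' keeps the existing content and adds to the end.
    with open(path, "a") as target:
        target.write("Hello\n")
        target.write("James\n")

    with open(path) as source:
        lines = source.read().splitlines()

print(lines)  # ['Hello', 'World', 'Hello', 'James']
```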