Working with structured data files

CSV files can only model data in which each record has several fields and each field holds a simple datatype, such as a string or a number.

We often want to store data that is more complicated than this, with nested structures of lists and dictionaries. Structured data formats like JSON, YAML, and XML are designed for this.

JSON

JSON is a very common open-standard data format that is used to store structured data in a human-readable way.

It lets us represent data made up of combinations of lists and dictionaries as a text file that looks a bit like a JavaScript (or Python) data literal.

import json

Any nested group of dictionaries and lists can be saved:

mydata =  {'key': ['value1', 'value2'],
           'key2': {'key4':'value3'}}
json.dumps(mydata)

'{"key": ["value1", "value2"], "key2": {"key4": "value3"}}'

If you would like a more readable output, you can use the indent argument.

print(json.dumps(mydata, indent=4))

{
    "key": [
        "value1",
        "value2"
    ],
    "key2": {
        "key4": "value3"
    }
}

Loading data is also really easy:

%%writefile myfile.json
{
    "somekey": ["a list", "with values"]
}

Writing myfile.json

with open('myfile.json', 'r') as f:
    mydataasstring = f.read()
mydataasstring

'{\n    "somekey": ["a list", "with values"]\n}'

mydata = json.loads(mydataasstring)
mydata['somekey']

['a list', 'with values']
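The read-then-parse steps above can also be collapsed: json.load and json.dump (without the trailing "s") work directly on open file objects. A minimal sketch, in which the file name roundtrip.json and the variable names are made up for illustration:

```python
import json
import os
import tempfile

record = {"somekey": ["a list", "with values"]}

# Work in a scratch directory so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as scratch:
    path = os.path.join(scratch, "roundtrip.json")  # hypothetical file name

    # json.dump serialises straight into an open file.
    with open(path, "w") as f:
        json.dump(record, f, indent=4)

    # json.load parses straight from an open file.
    with open(path) as f:
        restored = json.load(f)

print(restored == record)
```

Skipping the intermediate string is mostly a convenience; the result is the same dictionary either way.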

This is a very nice solution for loading and saving Python data structures.

It’s a very common way of transferring data on the internet, and of saving datasets to disk.

There’s good support in most languages, so it’s a nice inter-language file interchange format.
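One caveat to the round trip: JSON only knows about objects, arrays, strings, numbers, booleans, and null, so some Python values come back in a different shape or cannot be saved at all. The sketch below uses made-up sample data to show what that looks like in practice:

```python
import json

# A tuple value, an integer key, and None: all accepted by json.dumps.
original = {"point": (1, 2), 3: "three", "missing": None}
restored = json.loads(json.dumps(original))

# The tuple comes back as a list and the integer key as a string.
print(restored)

# Sets have no JSON equivalent, so json.dumps refuses them.
try:
    json.dumps({"tags": {"a", "b"}})
except TypeError as error:
    set_error = str(error)
    print(set_error)
```

If the exact Python types matter, convert them explicitly before saving and after loading rather than relying on the round trip.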

Unicode

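Supplementary Material: Why do the strings come back with ‘u’ everywhere when running under Python 2? These are Unicode strings, designed to hold all the world’s characters. In Python 3 every string is a Unicode string, so the prefix no longer appears.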
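Under Python 3, the place where Unicode shows up is in the saved text: by default json.dumps escapes non-ASCII characters, and the ensure_ascii argument switches that off. A small sketch with a made-up sample value:

```python
import json

city = {"name": "Zürich"}  # illustrative data, not from the text above

escaped = json.dumps(city)                      # default: ASCII-only output
literal = json.dumps(city, ensure_ascii=False)  # keep the character as typed

print(escaped)
print(literal)

# Both forms load back to the same dictionary.
print(json.loads(escaped) == json.loads(literal) == city)
```

The escaped form is safer to pass through tools that are not Unicode-aware; the literal form is easier for people to read.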

YAML

YAML is a data format very similar to JSON, with some nice additions such as comments.

To work with YAML files, we first need to install the ‘pyaml’ package, which also pulls in PyYAML, the library that provides the yaml module:

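import subprocess
import sys
# Run pip as a subprocess; pip.main is not importable in pip 10 and later.
subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'pyaml'])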

Collecting pyaml
  Using cached pyaml-16.12.2-py2.py3-none-any.whl
Collecting PyYAML (from pyaml)
Installing collected packages: PyYAML, pyaml
Successfully installed PyYAML-3.12 pyaml-16.12.2

import yaml

You can write lists like this:

%%writefile myfile.yaml

somekey:
    # Look, this is a list
    - a list
    - with values
    - [1, 2, 4] # and another list of integers

Writing myfile.yaml

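# safe_load builds only plain Python objects (dicts, lists, strings, numbers).
with open('myfile.yaml') as f:
    yamldata = yaml.safe_load(f)
print(yamldata)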

{'somekey': ['a list', 'with values', [1, 2, 4]]}

YAML is a popular format for ad-hoc data files, but the library doesn’t ship with default Python (though it is part of Anaconda and Canopy), so some people still prefer JSON for its universality.

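Because YAML lets a list be serialised either as newlines with dashes or with square brackets, you can control this choice. The first output below is what the PyYAML 3.12 installed above produces; from PyYAML 5.1 onwards the default changed to block style, so both calls print the dashed form: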

print(yaml.safe_dump(mydata))

somekey: [a list, with values]

print(yaml.safe_dump(mydata, default_flow_style=False))

somekey:
- a list
- with values

default_flow_style=False uses a “block style” (rather than an “inline” or “flow style”) to delineate data structures. See the YAML docs for more details.
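Passing default_flow_style explicitly removes any dependence on the library's default. The sketch below, which assumes PyYAML is installed and uses illustrative variable names, dumps the same dictionary in both styles and checks that they load back to identical data:

```python
import yaml  # provided by the PyYAML package

data = {"somekey": ["a list", "with values"]}

# Block style: one item per line, marked with dashes.
block_text = yaml.safe_dump(data, default_flow_style=False)

# Flow style: inline brackets and braces, much like JSON.
flow_text = yaml.safe_dump(data, default_flow_style=True)

print(block_text)
print(flow_text)

# The two spellings describe the same data.
print(yaml.safe_load(block_text) == yaml.safe_load(flow_text) == data)
```

Block style tends to suit files that people edit by hand, while flow style keeps short leaf lists compact.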

XML

Supplementary material: XML is another popular choice when saving nested data structures. It’s very careful, but verbose. If your field uses XML data, you’ll need to learn a Python XML parser (there are a few) and how XML works.
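As a taste of what that involves, here is a minimal sketch using xml.etree.ElementTree, one of the parsers in the standard library. The document layout (a dataset element holding item elements) is invented for this example and is only one of many ways the same data could be expressed in XML:

```python
import xml.etree.ElementTree as ET

# A made-up XML rendering of {'somekey': ['a list', 'with values']}.
document = """
<dataset name="example">
    <somekey>
        <item>a list</item>
        <item>with values</item>
    </somekey>
</dataset>
""".strip()

root = ET.fromstring(document)

# Walk down to the <item> elements and collect their text.
items = [item.text for item in root.find("somekey").findall("item")]

print(root.tag, root.attrib["name"])
print(items)
```

Unlike JSON and YAML, the parser returns a tree of elements rather than plain lists and dictionaries, so the code has to say how to turn the tree into the structure it wants.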
