DH Bridge

Encouraging Computational Thinking and Digital Skills in the Humanities

Working with Local Data

In this module we will learn to:

  1. create a new script file and load in our JSON data
  2. select data from particular fields and deal with missing data
  3. convert our JSON data to strings of text

Now that we have a very large file of JSON data, we can work locally to find patterns that we could not find using the online interface for the DPLA's holdings. Since we are interested in the language being used in order to investigate how the descriptions of "cooking" items are gendered, our next step is to clean up that data and save it as text only so we can analyize it in later modules.

1. Create a New Script and Load in Our Data

It is time to create a new script file!

Go to Terminal and type:

1
touch my_second_script.py

or on Windows:

1
New-Item -ItemType file my_second_script.py

To open your new script file, type:

1
open my_second_script.py

or on Windows:

1
Start notepad++ my_second_script.py

First, we will need the JSON library again. To add this, add import json to the very top of the file.

1
2
# Load Libraries
import json

Later in this module, we will need an additional library called "kitchen" so let's install and import that now.

In Terminal run:

1
pip install kitchen

"kitchen" is a bit of a "kitchen sink" of libraries - it holds a lot of different things. We will be using it to help with writing our strings, or more specifically, with the translating of the characters we read into the binary the computer reads. For that, we need only a very particular part of the library, the 'to_bytes' method. After "import json", add:

1
2
3
#Load Libraries
import json
from kitchen.text.converters import to_bytes

Now that we have our libraries in place, the next thing is to load up the data from our "search_results.json" file.

The structure for this is:

1
2
3
# Set Variables
with open("search_results.json") as json_file:
    json_data = json.load(json_file)

Here, we tell Python to open our "search_results.json" file and assign it to the variable "json_file". Then we use the "load" method from the JSON library to load up the data and save it as the variable "json_data".

Our second script file should now look like:

1
2
3
4
5
6
7
8
9
10
11
# Load Libraries
import json
from kitchen.text.converters import to_bytes

# Set Variables
with open("search_results.json") as json_file:
    json_data = json.load(json_file)

# Define Functions

# Make Function Calls

To make sure this worked, let's print out one item from the JSON data. Work with your table to add a "print" statement that prints the second item in the "json_data" list.

2. Select the Relevant Text Data

Our research question for this data is how, across the entire DPLA collection, the descriptions of items related to cooking are gendered. To pursue this question, we need to do some basic text analysis, so let's save our search results in a format that makes it easy to do that.

The next step is to select the fields that will be most helpful for analyzing how "cooking" items are described across the items in the DPLA.

Looking at our items, there are three main fields containing description-type information for our different "cooking" items: the title, the description, and the subject headings.

To start, let's make a new function called "get_text" that takes in our "json_data" list:

1
2
3
# Define Functions
def get_text(json_data):
    # Do something to each item in json_data

Next, for each item in that list, we want to look for the "title", the "description", and the "subject heading fields". This means using a "for loop".

1
2
3
4
5
6
# Define Functions
def get_text(json_data):
    for each in json_data:
        # Get the titles
        # Get the descriptions
        # Get the subject headings

We can get the title, description, and subject headings by looking for those "keys" within each item in our "json_data" list.

1
2
3
4
5
6
7
8
9
10
11
# Define Functions
def get_text(json_data):
    for each in json_data:
        # Get the title
        find_titles = each['sourceResource']['title']

        # Get the Description
        find_descriptions = each['sourceResource']['description']

        # Get the Subject Headings
        find_subjects = each['sourceResource']['subject']

So far so good. But what if one of these fields is missing? Programming languages are very literal: if you tell it to do something that it cannot do, it just stops and gives you an error. To see this in action, let's add a line in "Make Function Calls" to call the function.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
# Load Libraries
import json
from kitchen.text.converters import to_bytes

# Set Variables
with open('search_results.json') as json_file:
    json_data = json.load(json_file)

# Define Functions
def get_text(json_data):
    for each in json_data:
        # Get the Titles
        find_titles = each['sourceResource']['title']

        # Get the Descriptions
        find_descriptions = each['sourceResource']['description']

        # Get the Subject Headings
        find_subjects = each['sourceResource']['subject']

# Make Function Calls
print json_data[1]
get_text(json_data)

Towards the end of the error message, you should see a line that says "KeyError: 'sourceResource.description'". This is Python telling you that it cannot find a key "description" in one of the resources. Also, what do we do with when there are multiple values for each key, like with the Subject Headings?

Things are not as simple as they first appeared.

To fix this, let's create a second function that we can use to get the values associated with each key, and while we're at it, let's make sure those values are all in a consistent format. Since our final product will be a file of strings, let's save the values associated with our keys as a list of strings.

In your functions section, add a new function called "get_fields", which will do work on two variables, the "item" and the "key":

1
2
3
4
5
6
# Define Functions

def get_fields(item, key):
    # Return key from the sourceResource as a list of strings.

def get_texts(json_data):

To start, we will assign the variable "values" to hold the values of whichever key we send to this function. To get those values, we will combine the same pattern as before, but add the "get" function. The "get" function will return a default value if the key we are looking for does not exist for an item (this is the second things being passed into the "get" function). We will set the default values to a string with just a space inside:

1
2
3
4
5
6
7
8
# Define Functions

def get_fields(item, key):
    # Return key from the sourceResource as a list of strings.
    values = item[u'sourceResource'].get(key, [u' '])


def get_texts(json_data):

We also want to make sure that what we get back is consistent. Because we know that some fields have multiple entries, we want everything to be a list. So, we will add a conditional statement to check if what we got back for the value is a string, and, if it is, convert it into a list.

1
2
3
4
5
6
7
8
9
10
# Define Functions

def get_fields(item, key):
    # Return key from the sourceResource as a list of strings.
    values = item[u'sourceResource'].get(key, [u' '])
    if isinstance(values, basestring):
        values = [values]
    return values

def get_texts(json_data):

We end with a "return" statement to send whatever the final product is back to whichever process used the "get_fields" function.

Now, let's use our "get_fields()" function inside the "get_texts()" function.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
# Define Functions

def get_fields(item, key):
    # Return key from the sourceResource as a list of strings.
    values = item[u'sourceResource'].get(key, [u' '])
    if isinstance(values, basestring):
        values = [values]
    return values

def get_texts(json_data):
    for each in json_data:
        # Get the title
        title = get_fields(each, u'title')

        # Get the Description
        description = get_fields(each, u'description')

        # Get the Subject Headings
        subject_info = get_fields(each, u'subject')

With your group, talk through what is going on in these functions. Work together to trace how the information is traveling between the functions. Add print statements to inspect the data at different points.

We are almost there! However, we have to do a little extra work on each of the values. Remember how we saved all of the values as lists, because sometimes there are multiple of values for every key? Now that we have all of the values, we need to combine them into a single string for each key (title, description, and subject headings).

For "title" and "description", this means using the "join" function to merge all of the list items into a string.

For the subject headings, however, it is a bit more complicated. We will create a second loop for the subject headings to isolate the values held by the "name" key, and then save each of the subject headings to a new list before joining the list together into a string.

Let's go for it!

First, add a join statement to the title and description sections. We will separate multiple entries with a comma and a space. Work with your table to understand the join statement. Use print statements to see what is happening

1
2
3
4
5
6
7
8
9
10
11
12
def get_texts(json_data):
    for each in json_data:
        # Get the title
        title = get_fields(each, u'title')
        title = (u', '.join(titles))

        # Get the Description
        description = get_fields(each, u'description')
        description = (u', '.join(descriptions))

        # Get the Subject Headings
        subject_info = get_fields(each, u'subject')

Then create an empty list named "subjects" to save the subject headings to:

1
2
3
4
5
6
7
8
9
10
11
12
13
def get_texts(json_data):
    for each in json_data:
        # Get the title
        title = get_fields(each, u'title')
        title = (u', '.join(titles))

        # Get the Description
        description = get_fields(each, u'description')
        description = (u', '.join(descriptions))

        # Get the Subject Headings
        subjects = []
        subject_info = get_fields(each, u'subject')

Now, create a loop to get the subject headings, which are the values associated with the key "name", and add those values to the new "subjects" list.

Because there are a few items with only one subject heading and no key "name", we will use "try" and "except". This way we can tell the computer to try first to get the values associated with the key "name", but if it can't find that key, to take the first item in the list.

Then, we will use "append" to add each value to our list of subjects. And finally, we can join that list of subjects into its own string, just like we did with "title" and "description".

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
def get_texts(json_data):
    for each in json_data:
        
        # Get the title
        titles = get_fields(each, u'title')
        title = (u', '.join(titles))
  
        # Get the Description
        descriptions = get_fields(each, u'description')
        description = (u', '.join(descriptions))


        # Get the Subject Headings
        subjects = []
        subject_info = get_fields(each, u'subject')

        for info in subject_info:
            try:
                subj = info[u'name']
            except:
                subj = info[0]
            subjects.append(subj)

        subject = (u', '.join(subjects))

Now, let's pull all our pieces - the title, the description, and the subject headings - together into a new variable. We will make each result into a sentence, which will allow us to apply interesting text-mining processes to our text in a later module.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
# Define Functions

def get_fields(item, key):
    # Return key from the sourceResource as a list of strings.
    values = item[u'sourceResource'].get(key, [u' '])
    if isinstance(values, basestring):
        values = [values]
    return values

def get_texts(json_data):
    for each in json_data:
        
        # Get the title
        titles = get_fields(each, u'title')
        title = (u', '.join(titles))
  
        # Get the Description
        descriptions = get_fields(each, u'description')
        description = (u', '.join(descriptions))


        # Get the Subject Headings
        subjects = []
        subject_info = get_fields(each, u'subject')

        for info in subject_info:
            try:
                subj = info[u'name']
            except:
                subj = info[0]
            subjects.append(subj)

        subject = (u', '.join(subjects))

        # Combine the data into a single variable in the form of a sentence
        data = u'{}; {}; {}.\n'.format(title, description, subject)

The brackets indicate that some information will be inserted into that location. In this case, order really does matter. The first bracket takes the value held in title because it is the first thing passed into the "format" function. The second bracket takes the value of description because it was passed in next, and so on. Finally, the "\n" adds a "Return" to the end of the line, so that the information for each item appears on a new line.

The last step is to save this data into a text file. Similar to last time, we need to set up the file that will receive that data. At the top of the file, add a variable "f" that holds a new "text_results.txt" file.

1
2
3
4
5
6
7
8
9
# Load Libraries
import json
from kitchen.text.converters import to_bytes

# Set Variables
with open("search_results.json") as json_file:
    json_data = json.load(json_file)

f = open('text_results.txt', 'w')

Now back in our "get_texts()" function, the last step is to write all the information within "data" to our file.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
def get_texts(json_data):
    for each in json_data:
        
        # Get the title
        titles = get_fields(each, u'title')
        title = (u', '.join(titles))
  
        # Get the Description
        descriptions = get_fields(each, u'description')
        description = (u', '.join(descriptions))


        # Get the Subject Headings
        subjects = []
        subject_info = get_fields(each, u'subject')

        for info in subject_info:
            try:
                subj = info[u'name']
            except:
                subj = info[0]
            subjects.append(subj)

        subject = (u', '.join(subjects))

        # Combine the data into a single variable in the form of a sentence
        data = u'{}; {}; {}.\n'.format(title, description, subject)

        # Write the sentence to the 'text_results' file
        f.write(to_bytes(data))

Finally, let's add a call to the function:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
# Load Libraries
import json
from kitchen.text.converters import to_bytes

# Set Variables
with open("search_results.json") as json_file:
    json_data = json.load(json_file)

f = open('text_results.txt', 'w')

# Define Functions
def get_fields(item, key):
    # Return key from the sourceResource as a list of strings.
    values = item[u'sourceResource'].get(key, [u' '])
    if isinstance(values, basestring):
        values = [values]
    return values

def get_texts(json_data):
    for each in json_data:
        
        # Get the title
        titles = get_fields(each, u'title')
        title = (u', '.join(titles))
  
        # Get the Description
        descriptions = get_fields(each, u'description')
        description = (u', '.join(descriptions))


        # Get the Subject Headings
        subjects = []
        subject_info = get_fields(each, u'subject')

        for info in subject_info:
            try:
                subj = info[u'name']
            except:
                subj = info[0]
            subjects.append(subj)

        subject = (u', '.join(subjects))

        # Combine the data into a single variable in the form of a sentence
        data = u'{}; {}; {}.\n'.format(title, description, subject)

        # Write the sentence to the 'text_results' file
        f.write(to_bytes(data))

# Make Function Calls
get_texts(json_data)

Save and run your second Python script. You should now have a file named "text_results.txt" in your "dhb_awesome" directory. If you open that file, you should see lines of beautiful text ready for analysis.

In the next module, we will use the Natural Language ToolKit, a powerful Python library for working with text, to find patterns in the text data.

Bonus Challenge:

Research Python data types with your table. Which data types have we already worked with? Why do data types matter?

Credits

Many thanks to Eric Rochester for his help (https://gist.github.com/erochest/4debe339ec9d2bf5ea5a) with refactoring this module.

Previous Module Next Module