DH Bridge

Encouraging Computational Thinking and Digital Skills in the Humanities

Analyzing the Data (Part 1)

Up to this point, we have been focusing on molding the data we got back from the DPLA into different formats. First, we chose the part of the data we wanted to save locally and looped through all of the search result "pages" to get our large dataset. Then, because our research question involved language use and text, we transformed the data again into a format that suits the kind of analysis we want to do. Now, we get to take the results of our hard work and start to interrogate our data.

In this module we will:

  1. install the NLTK library
  2. use NLTK to see patterns in the text

1. Installing NLTK

We are interested in the language used in the three fields we singled out across all the "cooking" items in the DPLA database. Fortunately, there is good support within Python for text analysis, and one powerful library we can use is the Natural Language Toolkit (or NLTK).

To install NLTK, let's go back to our Terminal and use pip.

Run pip install nltk. On a Mac, you may need to use sudo pip install nltk.

There are also a number of datasets available for use with NLTK. For our purposes, we will only be using the "stopwords" dataset: lists of very common words (such as "the", "of", and "and") that are often filtered out before analysis. You can browse the list of all the datasets you could download and use at http://www.nltk.org/nltk_data/.

To download the stopwords, we are going back into the Python Interactive Interpreter. Run python. Your Terminal window should now look something like this:

Python 2.7.5 (default, Mar  9 2014, 22:15:05)
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.0.68)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>

Type import nltk and press Enter.

Next type nltk.download('stopwords') and press Enter.

Once you see

True
>>>

you have successfully downloaded the stopwords file.
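
To check that the download worked, you can take a quick look at the list while you are still in the interpreter. This is an optional aside, not part of our script:

from nltk.corpus import stopwords
print stopwords.words('english')[0:10]

You should see a list of very common words, beginning with entries such as "i", "me", and "my".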

You will also need to download a tokenizing library. Still in the Python interpreter, run

nltk.download('punkt')

Again, once you see

True
>>>

the download is complete.

You can now exit the Python Interactive Interpreter by running quit().

Let's Start Text Mining

Let's create a third script file, "text_mining.py":

touch text_mining.py

(Windows):

New-Item -ItemType file text_mining.py

And open that script file in your text editor:

open text_mining.py

(Windows):

Start notepad++ text_mining.py

Let's start by importing the NLTK library and the stopwords:

# Import Libraries
import nltk
from nltk.corpus import stopwords

Next let's load in our text file:

# Set Variables
with open('text_results.txt', 'r') as file:
    cooking_text = file.read().decode('utf8')
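
A note on Python versions: this tutorial is written for Python 2 (as in the interpreter banner above), where reading a file gives us bytes that we decode into Unicode. If you happen to be on Python 3, strings are already Unicode and .decode('utf8') will raise an error; a rough equivalent of the block above would be:

# Python 3 equivalent (sketch): pass the encoding to open() instead of decoding
with open('text_results.txt', 'r', encoding='utf8') as file:
    cooking_text = file.read()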

Let's add a "print" command so we can display part of our text to make sure everything is loading correctly so far:

# Import Libraries
import nltk
from nltk.corpus import stopwords

# Set Variables
with open('text_results.txt', 'r') as file:
    cooking_text = file.read().decode('utf8')

# Define Functions

# Make Function Calls
print cooking_text[0:20]

Save and run this script in Terminal. Let's take a look at the results. One thing to note here is that Python treats the text as one long string of characters, so the slice [0:20] gives us the first 20 characters, not the first 20 words.

For our purposes, we want to work with the words, so let's use a function called "word_tokenize" from the NLTK library. NLTK's tokenizer has rules for splitting strings of text into what we view as words. As the tokenizer goes through the text, it evaluates white space and punctuation, and bundles the characters between them into "tokens". When we started, Python considered our text to be a string of characters; after tokenizing, the computer sees the text as a list of "words".
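
To see these rules in action, here is a quick aside you could try in the Python Interactive Interpreter (it is not part of our script):

from nltk import word_tokenize
print word_tokenize("Pease porridge hot, pease porridge cold.")

Notice that the comma and the period come back as tokens of their own, something like ['Pease', 'porridge', 'hot', ',', ...].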

After "from nltk.corpus import stopwords" in the variables section, add:

from nltk import word_tokenize

Now let's comment out the last print statement and transform our text into tokens using the "word_tokenize" function. To see what has happened, let's also print the first 10 tokens.

# Import Libraries
import nltk
from nltk.corpus import stopwords
from nltk import word_tokenize

# Set Variables
with open('text_results.txt', 'r') as file:
    cooking_text = file.read().decode('utf8')

cooking_tokens = word_tokenize(cooking_text)

# Define Functions

# Make Function Calls
#print cooking_text[0:20]
print cooking_tokens[0:10]

Next, we need to do one more transformation so that our words play nicely with NLTK: wrapping the token list in an nltk.Text object, which gives us access to NLTK's text analysis methods. Comment out "print cooking_tokens[0:10]" and add to the variables section, after the cooking_tokens line:

text = nltk.Text(cooking_tokens)

Your file should now look like this:

# Import Libraries
import nltk
from nltk.corpus import stopwords
from nltk import word_tokenize

# Set Variables
with open('text_results.txt', 'r') as file:
    cooking_text = file.read().decode('utf8')

cooking_tokens = word_tokenize(cooking_text)
text = nltk.Text(cooking_tokens)

# Define Functions

# Make Function Calls
#print cooking_text[0:20]
#print cooking_tokens[0:10]

The first thing we can do to get a sense of the words in our dataset is to use the "concordance" method on our new Text object. This prints every instance of a word along with the surrounding words for context. Because concordance displays its results directly (and returns nothing), we call it on its own rather than wrapping it in a print statement.

After #print cooking_tokens[0:10], add:

text.concordance('cooking')

Save and run your script.

Pretty cool! Now change "cooking" to "economics", save and run the script, and see what the output is:

text.concordance('economics')

Try some other words to get a sense of the word usage in the "text_results" file. You can either replace the word or add additional concordance calls.

Another useful method is "collocations". This prints pairs of words that appear together throughout the corpus more often than we would expect by chance.

Comment out text.concordance('your_last_word') and add:

text.collocations()

Save and run your script.

One more method that is useful for surveying our data is "similar". This prints words that appear in similar contexts to the word we give it.

Comment out text.collocations() and add:

text.similar('Pot')

Your script should now look like:

# Import Libraries
import nltk
from nltk.corpus import stopwords
from nltk import word_tokenize

# Set Variables
with open('text_results.txt', 'r') as file:
    cooking_text = file.read().decode('utf8')

cooking_tokens = word_tokenize(cooking_text)
text = nltk.Text(cooking_tokens)

# Define Functions

# Make Function Calls
#print cooking_text[0:20]
#print cooking_tokens[0:10]
#text.concordance('economics')
#text.collocations()

text.similar('Pot')

What other patterns might be interesting to know about the words used to describe objects related to "cooking"?
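
One place to start: many of the most frequent tokens will be stopwords and punctuation rather than meaningful words. As a preview of that kind of filtering, here is a rough sketch (not part of this module's script) that uses the stopwords list we imported at the top of the file to keep only the content words:

# Sketch: filter stopwords and punctuation out of our tokens.
# Assumes cooking_tokens exists as defined earlier in the script.
stops = set(stopwords.words('english'))
content_words = [w for w in cooking_tokens if w.lower() not in stops and w.isalpha()]
print content_words[0:10]

Filtering like this will matter in a moment, since raw counts are dominated by words like "the" and "of".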

In the next module, we will look at word counts to find the most common words used across all of the different DPLA contributors.
