Save Results to a File

We are almost there! We have been creating some interesting data on the word frequencies within the description fields. But so far, all of our results are stuck in Terminal, which makes it difficult for us to reuse them. So for this final module, we will write out the results of our count to a CSV (comma separated value) file.

Create a New CSV File

As we did when we wrote our JSON results, we will start by telling Python to open a CSV file and assign it to a variable.

Currently our text_mining.py file should look like this:

# Import Libraries
import nltk
from nltk.corpus import stopwords
from nltk import word_tokenize
from nltk.probability import *

# Set Variables
with open('text_results.txt', 'r') as file:
    cooking_text = file.read().decode('utf8')

cooking_tokens = word_tokenize(cooking_text)
text = nltk.Text(cooking_tokens)

# Load in Stopwords Library
stopwords = stopwords.words('english')

word_set = []

# Define Functions
def normalize_text(text):
    # Work through all the words in text and filter
    for word in text:
        # Check if word is a word, and not punctuation, AND check against stop words
        if word.isalpha() and word.lower() not in stopwords:
            # If it passes the filters, save to word_set
            word_set.append(word.lower())
    return word_set

# Make Function Calls
#print cooking_text[0:20]
#print cooking_tokens[0:10]
#print text.concordance('economics')
#print text.collocations()
#print text.similar('Pot')

normalize_text(text)

fd = FreqDist(word_set)
print fd.most_common(200)
print fd.hapaxes()

To create CSV files, we need to import the csv library, which comes preinstalled with Python but is not loaded by default. To do that, add import csv to our list of libraries at the top of the file.

# Import Libraries
import nltk
from nltk.corpus import stopwords
from nltk import word_tokenize
from nltk.probability import *
import csv

Now we can create our CSV file right after the lines where we opened our text results file. CSV files open a little differently than text files, in that we open the file with a "writer" helper.

# Set Variables
with open('text_results.txt', 'r') as file:
    cooking_text = file.read().decode('utf8')

file = csv.writer(open('word_frequencies.csv', 'wb'))

Now, at the bottom of text_mining.py, we can save the key (the word) and the count (the frequency) as two columns in the CSV file word_frequencies.csv. If you did the graphing challenge, be sure to comment out fd.plot(50,cumulative=False) as well; the plotting function and the CSV-writing code do not work well together.

for key, count in fd.most_common(200):
    file.writerow([key, count])
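
If you would like the spreadsheet to have labeled columns, you can also write a header row before the loop. This is an optional tweak, not part of the module's final script, and the header labels ("word" and "count") are just suggestions.

# Optional: write a header row so the columns are labeled when the file opens in Excel
file.writerow(['word', 'count'])

for key, count in fd.most_common(200):
    file.writerow([key, count])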

The final product should look like this:

# Import Libraries
import nltk
from nltk.corpus import stopwords
from nltk import word_tokenize
from nltk.probability import *
import csv

# Set Variables
with open('text_results.txt', 'r') as file:
    cooking_text = file.read().decode('utf8')

file = csv.writer(open('word_frequencies.csv', 'wb'))

cooking_tokens = word_tokenize(cooking_text)
text = nltk.Text(cooking_tokens)

# Load in Stopwords Library
stopwords = stopwords.words('english')

word_set = []

# Define Functions
def normalize_text(text):
    # Work through all the words in text and filter
    for word in text:
        # Check if word is a word, and not punctuation, AND check against stop words
        if word.isalpha() and word.lower() not in stopwords:
            # If it passes the filters, save to word_set
            word_set.append(word.lower())
    return word_set

# Make Function Calls
#print cooking_text[0:20]
#print cooking_tokens[0:10]
#print text.concordance('economics')
#print text.collocations()
#print text.similar('Pot')

normalize_text(text)

fd = FreqDist(word_set)
print fd.most_common(200)
#print fd.hapaxes()
#fd.plot(50,cumulative=False)

# Print results to a CSV file
for key, count in fd.most_common(200):
    file.writerow([key, count])

You can open your CSV file from Terminal by typing the command below, or by looking within your "dhb_awesome" directory:

open word_frequencies.csv

The file will most likely open in Excel or a similar program.
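
If you would rather check the results without leaving the command line, a minimal sketch like the one below reads the CSV back in with Python and prints each row. The filename check_csv.py is just a suggestion for a separate, throwaway script; it is not part of text_mining.py.

# check_csv.py -- optional: read the CSV back in to confirm it was written correctly
import csv

with open('word_frequencies.csv', 'r') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        # each row is a [word, count] pair
        print(row)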

Look over your results. What patterns strike you as interesting? As expected? As unexpected? What additional questions do these word frequencies raise? Now that you have this data, what additional information do you need to know to interpret the patterns we see here?
