DH Bridge

Encouraging Computational Thinking and Digital Skills in the Humanities

Analyzing a Subset of the Data (Part 2)

In this module we will learn to:

  1. normalize our text
  2. use NLTK's Frequency Distribution (FreqDist) to find patterns in our text

1. Normalizing the text

While we can get a lot of information from the entirety of the text, we can find additional patterns once we do what is called "normalizing" the text. This entails removing all of the punctuation marks and transforming all the words to lower case. It also involves removing the small connecting words, known as "stopwords," such as "the" and "a", which are very common in a text but carry less semantic meaning than nouns, verbs, and adjectives.
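
To see what this means in practice, here is a toy illustration, separate from the tutorial script, of normalizing a short made-up sentence; the sample words and the tiny stopword list are invented for the example:

# A toy illustration of normalizing a short, made-up sentence
sample = ['The', 'soup', ',', 'a', 'thick', 'one', ',', 'simmered', '.']
small_stopwords = ['the', 'a']
normalized = []
for w in sample:
    # Keep only alphabetic words that are not stopwords, in lowercase
    if w.isalpha() and w.lower() not in small_stopwords:
        normalized.append(w.lower())
print normalized   # prints ['soup', 'thick', 'one', 'simmered']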

To clean up the text, we need to work through each word and run it through a series of checks or filters.

Open your "text_mining.py" file. First, comment out print text.similar('Pot'). Next, let's create a new function that normalizes the text:

# Define Functions
def normalize_text(text):
    # Work through all the words in text and filter

# Make Function Calls
#print cooking_text[0:20]
#print cooking_tokens[0:10]
#print text.concordance('economy')
#print text.collocations()
#print text.similar('Pot')

To work through the words, we can use a new "for loop" and save all of the approved words into a new list named "word_set". Create your new list at the end of the "Set Variables" section.

word_set = []

# Define Functions
def normalize_text(text):
    # Work through all the words in text and filter
    for word in text:

First, for each word, let's check that it is made up of letters only, so that punctuation and numbers are filtered out:

word_set = []

# Define Functions
def normalize_text(text):
    # Work through all the words in text and filter
    for word in text:
        # Check if word is a word, and not punctuation
        if word.isalpha():
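
If you are curious what .isalpha() is doing, you can test it on a few strings in the Python interpreter; these lines are only an illustration and do not need to go in your script:

print 'Pot'.isalpha()    # True -- letters only
print ','.isalpha()      # False -- punctuation
print '1870'.isalpha()   # False -- digits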

Next, let's check the word against the NLTK collection of stopwords. First, let's load in the NLTK stopwords and assign them to the variable "stopwords":

# Import Libraries
import nltk
from nltk.corpus import stopwords
from nltk import word_tokenize

# Set Variables
with open('text_results.txt', 'r') as file:
    cooking_text = file.read().decode('utf8')

cooking_tokens = word_tokenize(cooking_text)
text = nltk.Text(cooking_tokens)

# Load Stopwords
stopwords = stopwords.words('english')

word_set = []
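
If you would like to peek at what is in the stopwords list, you can print its length or a slice of it; the exact contents depend on your NLTK version, and these lines are optional:

print len(stopwords)
print stopwords[:10]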

Next, we need to transform each word to lowercase, because the stopwords list is all lowercase and won't catch capitalized words such as "The". We do this by adding .lower() to word as we check the words against the "stopwords" collection.

word_set = []

# Define Functions
def normalize_text(text):
    # Work through all the words in text and filter
    for word in text:
        # Check if word is a word, and not punctuation, AND check against stop words
        if word.isalpha() and word.lower() not in stopwords:
            # If it passes the filters, save to word_set
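
To convince yourself that the .lower() matters, you can run a quick optional check in the interpreter (again, not part of the script):

print 'The' in stopwords    # False -- the capitalized form is not in the list
print 'the' in stopwords    # True  -- the lowercase form is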

Finally, we will add those words that pass through the filters to the "word_set" list and tell the function to return the whole list once it is finished.

word_set = []

# Define Functions
def normalize_text(text):
    # Work through all the words in text and filter
    for word in text:
        # Check if word is a word, and not punctuation, AND check against stop words
        if word.isalpha() and word.lower() not in stopwords:
            # If it passes the filters, save to word_set
            word_set.append(word.lower())
    return word_set

The last step is to call our "normalize_text" function and pass in our text.

# Make Function Calls
#print cooking_text[0:20]
#print cooking_tokens[0:10]
#print text.concordance('economy')
#print text.collocations()
#print text.similar('Pot')

normalize_text(text)
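
If you want to confirm that the function worked, you can print how many words survived the filters and look at the first few; the exact numbers and words depend on your text file:

print len(word_set)
print word_set[:10]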

Well done! We're now ready to calculate word frequencies using NLTK's Frequency Distribution (FreqDist).

2. Getting word frequencies

Our Python file should now look like this:

# Import Libraries
import nltk
from nltk.corpus import stopwords
from nltk import word_tokenize

# Set Variables
with open('text_results.txt', 'r') as file:
    cooking_text = file.read().decode('utf8')

cooking_tokens = word_tokenize(cooking_text)
text = nltk.Text(cooking_tokens)

# Load in Stopwords Library
stopwords = stopwords.words('english')

word_set = []

# Define Functions
def normalize_text(text):
    # Work through all the words in text and filter
    for word in text:
        # Check if word is a word, and not punctuation, AND check against stop words
        if word.isalpha() and word.lower() not in stopwords:
            # If it passes the filters, save to word_set
            word_set.append(word.lower())
    return word_set

# Make Function Calls
#print cooking_text[0:20]
#print cooking_tokens[0:10]
#print text.concordance('economy')
#print text.collocations()
#print text.similar('Pot')

normalize_text(text)

We need one more library to get word frequencies from our list of approved words. After from nltk import word_tokenize, add:

from nltk.probability import *

Now, we will use the Frequency Distribution (FreqDist) function to get word counts. NLTK is a very powerful library, and by using it, we can draw on the work others have done to build and verify that the functions do what they say they do. There is no need to reinvent the wheel with every new script. Let's first use the function, and then work through what it did.

The first step is to run the Frequency Distribution method and save the results to a variable so they are easier to use. After normalize_text(text) add:

fd = FreqDist(word_set)

To see the 200 most common words, add to the bottom of the 'Make Function Calls' section:

print fd.most_common(200)

and run your script.
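
FreqDist also lets you look up the count for a single word by indexing it like a dictionary; "pot" below is just an example word, so substitute any word from your own results:

print fd['pot']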

Frequency Distribution works through the words in our list and checks whether it has seen each word before: if it has, it adds 1 to that word's count; if not, it records the word as a new entry with a count of 1.
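
If it helps to see that logic spelled out, here is a rough sketch of the same counting done by hand with a plain dictionary; this is only for illustration, since FreqDist already does it for us:

# Count word frequencies by hand with a dictionary
counts = {}
for word in word_set:
    if word in counts:
        # Seen before: add 1 to this word's count
        counts[word] = counts[word] + 1
    else:
        # New word: start its count at 1
        counts[word] = 1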

Grab your drawing materials again, and draw a diagram of what the Frequency Distribution function is doing.

Your file should look like:

# Import Libraries
import nltk
from nltk.corpus import stopwords
from nltk import word_tokenize
from nltk.probability import *

# Set Variables
with open('text_results.txt', 'r') as file:
    cooking_text = file.read().decode('utf8')

cooking_tokens = word_tokenize(cooking_text)
text = nltk.Text(cooking_tokens)

# Load in Stopwords Library
stopwords = stopwords.words('english')

word_set = []

# Define Functions
def normalize_text(text):
    # Work through all the words in text and filter
    for word in text:
        # Check if word is a word, and not punctuation, AND check against stop words
        if word.isalpha() and word.lower() not in stopwords:
            # If it passes the filters, save to word_set
            word_set.append(word.lower())
    return word_set

# Make Function Calls
#print cooking_text[0:20]
#print cooking_tokens[0:10]
#print text.concordance('economy')
#print text.collocations()
#print text.similar('Pot')

normalize_text(text)

fd = FreqDist(word_set)
print fd.most_common(200)

You can comment out print statements along the way if you are seeing too much information.

Bonus Challenge (for the ambitious)

For some super magic, you can also generate a plot of the most frequent words. You will need to install numpy with pip install -U numpy and matplotlib with pip install matplotlib. Remember, if you get an error, try running the command again with sudo (e.g. sudo pip install -U numpy).

Add to the bottom of your script:

fd.plot(50, cumulative=False)

and run.
