Natural Language Processing - Milestone Report

By Tyler Byers
Coursera Data Science Capstone Project Midterm Check
November 16, 2014

Summary

This is our Milestone Report for the Natural Language Processing project for the Coursera Data Science Capstone. We have been given three US English text files containing millions of lines of text as written in blogs, tweets, and news articles, and the eventual goal is to use these data to create a prediction algorithm to fill in a missing word in a sentence. Once the prediction algorithm is created, we will then create a web application for a user to interact with the prediction machine. For this milestone report, we are showing that we have successfully loaded the data and begun processing and exploring the data. We have discovered some basic word frequencies, and created word frequency plots. Finally, at the end of this report we detail some of our next steps.

Basic File Summaries

We are conducting our analysis and building our application using data from three text files: a blog file, a news file, and a Twitter file. The table below shows the number of lines and words within each file.

File          Lines        Words
blogs.txt       899,288   37,334,690
news.txt      1,010,242   34,372,720
twitter.txt   2,360,148   30,374,206
Total         4,269,678  102,081,616

Between the three files, we have roughly 4.3 million lines and 102.1 million words. Of note, the line count is equivalent to the “entry” count: there are 4.3 million entries, where a blog post, news item, or tweet counts as one line/entry, regardless of how many lines of text it would span on a typewritten page.

Sample Data

Because we have such large data sets, it would be computationally expensive to do our analysis on all of the data. So, at this point we plan to use just 1% of the data from each file to build our model and complete the initial analysis. That still leaves us roughly 42,700 lines and about 1.02 million words to create our model. We intend to explore methods for incorporating more data into our model, but in the interest of time we have sampled the data at just 1% for now.
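As a rough illustration of this sampling step, the sketch below reads each file and writes out a random 1% of its lines (the file names, the helper name sample_file, and the seed are placeholders, not necessarily our exact code):

    set.seed(1234)  # placeholder seed, for reproducibility

    # Read a file and keep a random 1% of its lines
    sample_file <- function(infile, outfile, pct = 0.01) {
      lines <- readLines(infile, encoding = "UTF-8", skipNul = TRUE)
      keep  <- sample(length(lines), size = floor(length(lines) * pct))
      writeLines(lines[keep], outfile)
    }

    sample_file("blogs.txt", "sample_blogs.txt")
    sample_file("news.txt", "sample_news.txt")
    sample_file("twitter.txt", "sample_twitter.txt")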

Clean Data

After sampling the data, we prepared it for analysis using a series of commands from the tm (text mining) R package; a sketch of these steps appears after the list below. The cleaning steps we took were:

  1. Convert all letters to lowercase
  2. Remove punctuation
  3. Remove numbers
  4. Remove profanity (we have defined profanity as George Carlin's 7 dirty words)
  5. Strip excess whitespace
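
A minimal sketch of these steps with tm (the object name sample_text and the file profanity.txt are placeholders; we assume the sampled lines are already loaded as a character vector):

    library(tm)

    profanity <- readLines("profanity.txt")                 # placeholder file holding the 7 words

    corpus <- VCorpus(VectorSource(sample_text))             # sampled lines as a corpus
    corpus <- tm_map(corpus, content_transformer(tolower))   # 1. lowercase
    corpus <- tm_map(corpus, removePunctuation)              # 2. remove punctuation
    corpus <- tm_map(corpus, removeNumbers)                  # 3. remove numbers
    corpus <- tm_map(corpus, removeWords, profanity)         # 4. remove profanity
    corpus <- tm_map(corpus, stripWhitespace)                # 5. strip excess whitespace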

Word Frequency

We next looked at the frequencies of various words from our sampled files. The table below shows the 20 most frequent words used (data from all three files combined).

##   the   and   for  that   you  with   was  this  have   are   but   not 
## 47522 24420 11053 10410  9290  7216  6196  5440  5217  4877  4743  4067 
##  from   its   all  will  they  said   his   out 
##  3920  3505  3329  3154  3119  3104  3049  3037
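
For reference, a frequency table along these lines can be produced from the cleaned corpus as sketched below (object names are placeholders; slam is the sparse-matrix package that tm builds on):

    library(slam)

    tdm   <- TermDocumentMatrix(corpus)               # term-by-document counts
    freqs <- sort(row_sums(tdm), decreasing = TRUE)   # total count for each word
    head(freqs, 20)                                   # the 20 most frequent words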

Unsurprisingly, most of the words above are so-called “stop words,” filler words that carry little meaning on their own. The tm package we are using has a list of 174 English stop words. In fact, if we compare the above words with the stopwords() list, the only two that aren't in it are “will” and “said”.
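That comparison is a one-liner against tm's stop word list (again a sketch, reusing the freqs vector from above):

    top20 <- names(head(freqs, 20))
    setdiff(top20, stopwords("english"))   # leaves only "will" and "said"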

We also wanted to plot the frequency of words within our data set. Below are two plots that illustrate the right skew of the word-frequency distribution: a great many words are used only once or twice within our text files, while relatively few words are used with high frequency. Plot 1 shows the entire scale, from the more than 30,000 words that are used only once to the roughly 90 that are used over one thousand times. Plot 2 zooms in on the higher-frequency words, showing only the number of words that are used over 100 times in our corpora of just over 1 million words. Note that Plot 1 uses a log scale for the y-axis, while Plot 2 does not.

[Plot 1: number of words at each usage count, log-scale y-axis. Plot 2: number of words used more than 100 times.]
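Plots of this kind can be drawn with ggplot2 roughly as follows (a sketch only; the binwidth and axis labels here are arbitrary choices, not necessarily those used for the figures above):

    library(ggplot2)

    freq_df <- data.frame(freq = freqs)   # freqs from the frequency step above

    # Plot 1: number of words at each usage count, log-scale y-axis
    ggplot(freq_df, aes(x = freq)) +
      geom_histogram(binwidth = 50) +
      scale_y_log10() +
      labs(x = "Times a word is used", y = "Number of words (log scale)")

    # Plot 2: zoom in on words used more than 100 times, linear y-axis
    ggplot(subset(freq_df, freq > 100), aes(x = freq)) +
      geom_histogram(binwidth = 50) +
      labs(x = "Times a word is used", y = "Number of words")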

Remove Stop Words

Because our table of high-frequency words was somewhat unsurprising, we wanted to see whether patterns of word usage emerge between the three different types of writing – blogs, tweets, and news. So for the next three tables, we removed the 174 stop words from each of the three corpora. Tables of the top 20 most-used words from the resulting corpora are shown below, following a brief sketch of the removal step.
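The removal itself is one additional tm_map call on each corpus, with frequencies then recomputed as above (sketch):

    corpus_nostop <- tm_map(corpus, removeWords, stopwords("english"))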

blogs.txt

##    one   will    can   like   just   time    get    now people   know 
##   1291   1115   1006    990    940    888    735    605    599    583 
##   make   also   even  first    new    day   dont really   much   good 
##    552    539    536    536    529    528    528    520    507    501

twitter.txt

##   just   like    get   love   good   will    can    day   dont thanks 
##   1462   1210   1148   1075   1005    975    920    911    890    885 
##    now    one   know  great  today   time    new    see    lol    got 
##    839    818    782    776    756    752    694    694    680    578

news.txt

##    said    will     one     new     can     two    also   years    last 
##    2559    1064     808     686     581     577     569     536     533 
##    year    just   first    time   state    like  people     get percent 
##     533     525     519     515     500     484     470     439     370 
##   three    city 
##     367     360

There is a fair amount of overlap between the top non-stopwords in the three corpora; however, we can see some distinctions. For instance, the word “make” appears in the blogs table, but not in the other two, perhaps because many blogs are instructions about how to create things. In the Twitter file, it is interesting that the word “just” is the most-used non-stopword, probably because Twitter serves as a vehicle that many people use to describe something that they have just done. And finally, the top words used in the news corpus consist of many words that you would see in news articles, like “said”, “percent”, and “state.”

Next Steps

Following this milestone report, which described our data cleaning and initial exploratory data analysis, we intend to build our prediction machine. This will include building tables of so-called 'n-grams': sequences of words that appear next to each other in a text (for example, 'a text document' is a 3-gram, or trigram). We have not yet looked ahead to many of the task requirements for the end of the project, but we will be building a Shiny application to put on the web so users can interact with our prediction machine. This looks to be an interesting project, and we are excited to dive into the rest of the work.
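
As a first taste of what the n-gram step involves, here is a minimal base-R sketch of building n-grams from a vector of words (illustrative only; our eventual tokenizer may differ):

    # Build all n-grams from a vector of words
    ngrams <- function(words, n) {
      if (length(words) < n) return(character(0))
      sapply(seq_len(length(words) - n + 1),
             function(i) paste(words[i:(i + n - 1)], collapse = " "))
    }

    ngrams(c("words", "in", "a", "text", "document"), 3)
    ## [1] "words in a"      "in a text"       "a text document"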