By Tyler Byers
Coursera Data Science Capstone Project Midterm Check
November 16, 2014
This is our Milestone Report for the Natural Language Processing project for the Coursera Data Science Capstone. We have been given three US English text files containing millions of lines of text as written in blogs, tweets, and news articles, and the eventual goal is to use these data to create a prediction algorithm to fill in a missing word in a sentence. Once the prediction algorithm is created, we will then create a web application for a user to interact with the prediction machine. For this milestone report, we are showing that we have successfully loaded the data and begun processing and exploring the data. We have discovered some basic word frequencies, and created word frequency plots. Finally, at the end of this report we detail some of our next steps.
We are building our analysis and application using data from three text files: a blog file, a news file, and a Twitter file. The table below shows the number of lines and words in each file.
| File | Lines | Words |
|---|---|---|
| blogs.txt | 899,288 | 37,334,690 |
| news.txt | 1,010,242 | 34,372,720 |
| twitter.txt | 2,360,148 | 30,374,206 |
| Total | 4,269,678 | 102,081,616 |
Between the three files, we have about 4.3 million lines and 102.1 million words. Of note, the line count is equivalent to the “entry” count: that is, there are about 4.3 million entries, where a blog post, news item, or tweet counts as one line/entry, irrespective of how many lines of text it would occupy on a typewritten page.
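For reference, counts of this kind can be reproduced in R. The sketch below is one way to do it; it assumes the three files sit in the working directory under the names used in the table, and a different word-splitting rule would give slightly different word counts.

```r
# Sketch: line and word counts like those in the table above.
count_file <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  c(lines = length(lines),
    words = sum(vapply(strsplit(lines, "\\s+"), length, integer(1))))
}
sapply(c("blogs.txt", "news.txt", "twitter.txt"), count_file)
```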
Because we have such large data sets, it would be computationally expensive to run our analysis on all of the data. So, at this point we plan to use just 1% of each file to build our model and complete the initial analysis. That still leaves us with over 42,600 lines and roughly 1.02 million words for building the model. We intend to explore ways of incorporating more data into the model, but in the interest of time we have sampled the data at just 1% for now.
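A minimal sketch of that sampling step, assuming the file names from the table above (the seed value and helper name are illustrative):

```r
set.seed(1234)  # illustrative seed, for a reproducible sample
sample_file <- function(path, rate = 0.01) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  # keep each line independently with probability `rate` (about 1%)
  lines[rbinom(length(lines), size = 1, prob = rate) == 1]
}
blogs_sample   <- sample_file("blogs.txt")
news_sample    <- sample_file("news.txt")
twitter_sample <- sample_file("twitter.txt")
```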
After sampling the data, we prepared it for analysis with a series of cleaning transformations from the tm (text mining) R package.
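A sketch of the kind of tm pipeline used for this cleaning; the transformations shown are typical rather than an exact record of our steps, and the object names carry over from the sampling sketch above.

```r
library(tm)

# Build a corpus from the sampled lines and apply common cleaning steps.
corpus <- VCorpus(VectorSource(c(blogs_sample, news_sample, twitter_sample)))
corpus <- tm_map(corpus, content_transformer(tolower))  # lower-case everything
corpus <- tm_map(corpus, removePunctuation)             # drop punctuation
corpus <- tm_map(corpus, removeNumbers)                 # drop digits
corpus <- tm_map(corpus, stripWhitespace)               # collapse repeated spaces
```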
We next looked at the frequencies of various words from our sampled files. The table below shows the 20 most frequent words used (data from all three files combined).
| Word | Frequency | Word | Frequency |
|---|---|---|---|
| the | 47522 | but | 4743 |
| and | 24420 | not | 4067 |
| for | 11053 | from | 3920 |
| that | 10410 | its | 3505 |
| you | 9290 | all | 3329 |
| with | 7216 | will | 3154 |
| was | 6196 | they | 3119 |
| this | 5440 | said | 3104 |
| have | 5217 | his | 3049 |
| are | 4877 | out | 3037 |
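Frequencies like these can be pulled out of the cleaned corpus through a document-term matrix; a sketch, reusing the corpus object from the cleaning sketch above:

```r
# Sketch: overall word frequencies from the cleaned corpus.
dtm   <- DocumentTermMatrix(corpus)
freqs <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
head(freqs, 20)  # the 20 most frequent words, as in the table above
```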
Unsurprisingly, most of the words above are so-called “stop words”: “filler” words that carry little meaning on their own. The tm package we are using has a list of 174 English stop words, and if we compare the words above against that stopwords() list, the only two that are not on it are “will” and “said”.
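That comparison is a one-liner once the frequency vector exists; a sketch:

```r
# Which of the 20 most frequent words are not on tm's English stop-word list?
top20 <- names(head(freqs, 20))
setdiff(top20, stopwords("en"))  # per the discussion above: "will" and "said"
```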
We also wanted to plot the frequency of words within our data set. Below are two plots that illustrate how right-skewed the word-frequency distribution is: a very large number of words are used only once or twice in our text files, while relatively few words are used with high frequency. Plot 1 shows the entire scale, from the more than 30,000 words that are used only once to the roughly 90 words that are used more than one thousand times. Plot 2 zooms in on the higher-frequency words, showing only the words used more than 100 times in our corpora of just over 1 million words. Note that Plot 1 uses a log scale for the y-axis, while Plot 2 does not.
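A sketch of how a plot like Plot 1 can be built with ggplot2; this illustrates the approach rather than reproducing the exact code behind the plots.

```r
library(ggplot2)

# How many distinct words are used once, twice, three times, and so on,
# with a log scale on the y-axis (uses the `freqs` vector from the sketch above).
freq_of_freq <- as.data.frame(table(freqs), stringsAsFactors = FALSE)
names(freq_of_freq) <- c("times_used", "n_words")
freq_of_freq$times_used <- as.numeric(freq_of_freq$times_used)

ggplot(freq_of_freq, aes(x = times_used, y = n_words)) +
  geom_col() +
  scale_y_log10() +
  labs(x = "Times a word is used", y = "Number of words (log scale)")
```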
Because our table of high-frequency words was somewhat unsurprising, we wanted to see whether patterns of word usage emerge between the three different types of writing: blogs, tweets, and news. So, for the next table, we removed the 174 stop words from each of the three corpora; the top 20 most-used words from the resulting corpora are shown below.
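The removal step itself is a single additional tm transformation; a sketch, reusing the corpus object from the earlier sketches:

```r
# Drop tm's 174 English stop words before recounting word frequencies.
corpus_nostop <- tm_map(corpus, removeWords, stopwords("en"))
```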
| Rank | Blogs | Count | Twitter | Count | News | Count |
|---|---|---|---|---|---|---|
| 1 | one | 1291 | just | 1462 | said | 2559 |
| 2 | will | 1115 | like | 1210 | will | 1064 |
| 3 | can | 1006 | get | 1148 | one | 808 |
| 4 | like | 990 | love | 1075 | new | 686 |
| 5 | just | 940 | good | 1005 | can | 581 |
| 6 | time | 888 | will | 975 | two | 577 |
| 7 | get | 735 | can | 920 | also | 569 |
| 8 | now | 605 | day | 911 | years | 536 |
| 9 | people | 599 | dont | 890 | last | 533 |
| 10 | know | 583 | thanks | 885 | year | 533 |
| 11 | make | 552 | now | 839 | just | 525 |
| 12 | also | 539 | one | 818 | first | 519 |
| 13 | even | 536 | know | 782 | time | 515 |
| 14 | first | 536 | great | 776 | state | 500 |
| 15 | new | 529 | today | 756 | like | 484 |
| 16 | day | 528 | time | 752 | people | 470 |
| 17 | dont | 528 | new | 694 | get | 439 |
| 18 | really | 520 | see | 694 | percent | 370 |
| 19 | much | 507 | lol | 680 | three | 367 |
| 20 | good | 501 | got | 578 | city | 360 |
There is a fair amount of overlap between the top non-stopwords in the three corpora; however, some distinctions do emerge. For instance, the word “make” appears in the blogs column but not in the other two, perhaps because many blogs are instructions about how to create things. In the Twitter corpus it is interesting that “just” is the most-used non-stopword, probably because many people use Twitter to describe something they have just done. And the top words in the news corpus include many words you would expect to see in news articles, like “said”, “percent”, and “state.”
Following this milestone report, which described our data cleaning and initial exploratory data analysis, we intend to build our prediction machine. This will include building tables of so-called “n-grams”: sequences of words that appear next to each other in the text (for example, “a text document” is a 3-gram, or trigram). We have not yet looked ahead to all of the task requirements for the end of the project, but we will be building a Shiny application for the web so that users can interact with our prediction machine. This looks to be an interesting project, and we are excited to dive into the rest of the work.
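As a first look at what those n-gram tables might involve, here is a minimal sketch of counting trigrams in base R; the helper name is illustrative and not yet part of our pipeline.

```r
# Sketch: count n-grams (default trigrams) in a character vector of cleaned lines.
make_ngrams <- function(lines, n = 3) {
  words <- strsplit(tolower(lines), "[^a-z']+")
  ngrams <- unlist(lapply(words, function(w) {
    w <- w[w != ""]
    if (length(w) < n) return(character(0))
    sapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "))
  }))
  sort(table(ngrams), decreasing = TRUE)
}

# Example: the trigram "a text document" appears once in this line.
head(make_ngrams("words that are found next to each other in a text document"))
```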