Introduction

The Coursera Capstone project involves building a predictive text model. When a user types a phrase, the model offers choices for the next word, allowing the user to minimize typing by selecting from those suggestions.

This milestone report summarizes the exploratory analysis of the corpus (the body of text used to build the predictive model) that will be used for the Coursera Data Science Capstone project, and concludes with a description of the proposed strategy for building the predictive text model.

This project will use the R programming language [1] to build the model.

Exploratory Analysis

The predictive model I plan to build will use three data sources: a text file of blog posts, a text file of Twitter posts, and a text file of news stories. The three data sources are available here. All three files are quite large.

The Twitter file is 2,360,148 lines long and has 30,578,933 words (the results from the analysis are included below).

## [1] 2360148
##     CharsWord CharsCmdEnvir    CharsWhite         Words          Cmds 
##     125769474          3033      36047952      30578933           963 
##        Envirs 
##             0

The Blog file is 899,288 lines long and has 37,865,888 words.

## [1] 899288
##     CharsWord CharsCmdEnvir    CharsWhite         Words          Cmds 
##     163325412             9      43302825      37865888             3 
##        Envirs 
##             0

The News file is 77,259 lines long and has 2,665,742 words.

## [1] 77259
##     CharsWord CharsCmdEnvir    CharsWhite         Words          Cmds 
##      12502954             0       3114374       2665742             0 
##        Envirs 
##             0
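
The counts above can be reproduced with the stringi package, whose stri_stats_latex function reports exactly these fields; the sketch below (the file path is an assumption about where the data lives) shows the approach for the Twitter file.

```r
# Minimal sketch, assuming the raw file sits at the path below.
library(stringi)

twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
length(twitter)            # number of lines in the file
stri_stats_latex(twitter)  # CharsWord, CharsWhite, Words, ...
```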

The predictive model will use a corpus built from a random sample of 100,000 lines from each of the three files. This 100,000-line sample size is just a starting point for initial model development and testing: it may be enlarged or reduced based on the accuracy and performance of the model I build over the next few weeks.

The samples from the three files are combined into a corpus using the tm package for R [2][3]. This corpus provides the example phrases that I will use to build the predictive model.
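
A hedged sketch of the sampling and corpus-building steps follows; the seed, file paths, and helper function are illustrative assumptions, not the exact code used for this report.

```r
# Sketch only: sample lines from each raw file and combine them into a tm corpus.
library(tm)
set.seed(1234)  # assumed seed, for reproducibility of the sketch

sample_lines <- function(path, n = 100000) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  sample(lines, min(n, length(lines)))  # guard against files shorter than n
}

combined <- c(sample_lines("final/en_US/en_US.twitter.txt"),
              sample_lines("final/en_US/en_US.blogs.txt"),
              sample_lines("final/en_US/en_US.news.txt"))
corpus <- VCorpus(VectorSource(combined))
```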

In the remainder of this section, I summarize the results of exploratory analysis on this sample corpus. First, however, I use the R tm package to clean up the data; a sketch of the kinds of transformations involved is shown below.
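
The exact cleaning steps are not reproduced here, so the transformations below (lower-casing, removing punctuation and numbers, stripping extra whitespace) are assumptions about the typical tm workflow, continuing from the corpus built above.

```r
# Assumed cleaning steps -- typical tm transformations, not necessarily the
# exact set used for this report.
corpus <- tm_map(corpus, content_transformer(tolower))  # lower-case everything
corpus <- tm_map(corpus, removePunctuation)             # drop punctuation
corpus <- tm_map(corpus, removeNumbers)                 # drop digits
corpus <- tm_map(corpus, stripWhitespace)               # collapse extra spaces
```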

After cleaning the dataset, the corpus is tokenized (i.e., split into a series of individual words or phrases for analysis) and converted into a matrix that summarizes the frequency of each word or phrase.
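
As a sketch of this step, a term-document matrix of single words can be built with tm and the term frequencies read off it; the variable names are illustrative and continue from the sketches above.

```r
# Single-word term-document matrix and the frequencies behind Figures 1 and 2.
tdm  <- TermDocumentMatrix(corpus)
nTerms(tdm)                                      # number of unique words
freq <- sort(slam::row_sums(tdm), decreasing = TRUE)
head(freq, 20)                                   # the 20 most common single words
```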

The matrix built from single words in the “cleaned up” corpus contains 166,904 unique words.

Figure 1 below shows the 20 most common single words. As the chart shows, the most common words are articles such as “the”, conjunctions such as “and”, pronouns such as “you”, forms of the verb “to be”, and prepositions. The most specific verb in the graph is “said”.

Figure 2 illustrates a word cloud of the 500 most common words. The word cloud confirms that the most common words are articles, pronouns, forms of “to be”, and so on; however, the cloud also reveals more specific words such as “university”, “book”, and “morning”.

## [1] 166904

[Figure 1: bar chart of the 20 most common single words]
[Figure 2: word cloud of the 500 most common words]
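
A word cloud like Figure 2 could be drawn, for example, with the wordcloud package; the call below is a sketch using the freq vector from the earlier sketch, with layout options left as assumptions.

```r
# Sketch: word cloud of the 500 most frequent words (uses freq from above).
library(wordcloud)
wordcloud(names(freq), freq, max.words = 500, random.order = FALSE)
```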

The model will also use phrases in addition to individual words. Figures 3-5 show the most common 2-, 3-, and 4-word phrases in the corpus. As with single words, the most common phrases are strings of articles and prepositional phrases. As the outputs below show, the number of unique n-grams grows from just under 2,000,000 to 5,700,000 as the length of the phrase grows. We also see some unusual phrases at the 4-gram size (e.g., “vested interest vested interest”).

## [1] 1991365

[Figure 3: most common 2-word phrases]

## [1] 4533661

[Figure 4: most common 3-word phrases]

## [1] 5714924

[Figure 5: most common 4-word phrases]
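
The report does not state which n-gram tokenizer was used; one common approach combines tm with RWeka, as sketched below for 2-word phrases (3- and 4-grams follow the same pattern with different min/max settings).

```r
# Sketch: a bigram term-document matrix via RWeka's NGramTokenizer.
library(RWeka)
bigram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tdm2 <- TermDocumentMatrix(corpus, control = list(tokenize = bigram_tokenizer))
nTerms(tdm2)   # number of unique 2-word phrases
```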

The above charts of the most frequent phrases raise the issue of how useful the most common words in the corpus will be in predicting words in a specific phrase. I am keeping these common words for now; however, if model testing shows that retaining these words is not useful for prediction, I may remove these “stopwords” later.

Summary of Strategy to Build Predictive Model

I will use the tokenized n-grams from the corpus samples developed above to build an n-gram model. N-gram models estimate the probability of a word occurring in a phrase based on the previous words in the phrase. They calculate this probability by dividing the number of times the full phrase occurs by the number of times the phrase minus the last word occurs. For example, an n-gram model would calculate the probability that “store” is the last word in the phrase “I am going to the store” by:

P(“store” | “I am going to the”) = count(“I am going to the store”) / count(“I am going to the”)
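
As a toy illustration of this maximum-likelihood estimate (the counts below are made up, not taken from the corpus):

```r
# Made-up counts for illustration only.
count_full   <- c("i am going to the store" = 12)  # occurrences of the full phrase
count_prefix <- c("i am going to the"       = 40)  # occurrences of the phrase minus the last word

unname(count_full["i am going to the store"] / count_prefix["i am going to the"])  # 0.3
```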

The model will also have to consider other factors, such as how to handle low-probability phrases and how to use “back-off” models to estimate the probabilities of n-grams for which there are no examples in the corpus.
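
One simple back-off idea (in the spirit of “stupid backoff”) is sketched below; the inputs are hypothetical named count vectors, and the final model may handle unseen n-grams quite differently (e.g., with smoothing).

```r
# Illustrative back-off sketch: if no 4-gram starts with the 3-word prefix,
# drop the first word of the prefix and look among the 3-grams instead.
predict_next <- function(prefix, counts4, counts3) {
  hits <- counts4[startsWith(names(counts4), paste0(prefix, " "))]
  if (length(hits) == 0) {
    shorter <- sub("^\\S+\\s+", "", prefix)   # back off to the shorter prefix
    hits <- counts3[startsWith(names(counts3), paste0(shorter, " "))]
  }
  if (length(hits) == 0) return(NA_character_)
  sub(".*\\s", "", names(which.max(hits)))    # last word of the best match
}

# Toy usage with made-up counts:
counts4 <- c("going to the store" = 7, "going to the park" = 3)
counts3 <- c("to the store" = 12, "to the beach" = 5)
predict_next("going to the", counts4, counts3)  # "store"
```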

These considerations will be addressed based on initial testing of a simple n-gram model. My strategy is to start simple and build in greater complexity as I learn how each progressively more complex model performs against test sets of phrases.


  1. R Core Team (2014). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.

  2. Ingo Feinerer and Kurt Hornik (2014). tm: Text Mining Package. R package version 0.6. http://CRAN.R-project.org/package=tm

  3. Ingo Feinerer, Kurt Hornik, and David Meyer (2008). Text Mining Infrastructure in R. Journal of Statistical Software 25(5): 1-54. URL: http://www.jstatsoft.org/v25/i05/.