The objective of this project is to build a sentence prediction model based on an n-gram predictive text model. The dictionary dataset comes from a corpus called HC Corpora. See the readme file for details on the corpora. The scope of this report is to get acquainted with the data, do the necessary pre-processing, and explore the text to gain insight for building the predictive model.
The corpus consists of three files: LOCALE.blogs.txt, LOCALE.news.txt and LOCALE.twitter.txt, where LOCALE is one of the four locales en_US, de_DE, ru_RU and fi_FI. For this project we will be using the en_US locale.
As part of data exploration, we check the size of the corpus given to us, which will help us determine how much data to sample for building our model.
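The corpus object crps used below is assumed to have been loaded with the tm package, roughly as follows (the directory path is illustrative; adjust it to wherever the en_US files live):

library(tm)

# Load the three en_US documents into a volatile corpus (path is illustrative)
crps <- VCorpus(DirSource("~/coursera/capstone/final/en_US/", encoding = "UTF-8"),
                readerControl = list(language = "en_US"))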
The documents in our corpus are:
meta(crps[[1]])
## Metadata:
## author : character(0)
## datetimestamp: 2014-11-14 18:47:07
## description : character(0)
## heading : character(0)
## id : en_US.blogs.txt
## language : en_US
## origin : character(0)
meta(crps[[2]])
## Metadata:
## author : character(0)
## datetimestamp: 2014-11-14 18:47:38
## description : character(0)
## heading : character(0)
## id : en_US.news.txt
## language : en_US
## origin : character(0)
meta(crps[[3]])
## Metadata:
## author : character(0)
## datetimestamp: 2014-11-14 18:48:10
## description : character(0)
## heading : character(0)
## id : en_US.twitter.txt
## language : en_US
## origin : character(0)
Examining how much memory each document in the corpus consumes: the blogs dataset takes 248.5 Mb, the news dataset 249.6 Mb, and the twitter dataset 301.4 Mb.
Next we check how many lines of text are present in each document of our corpus. The blogs dataset has 899288 lines, the news dataset has 1010242 lines and the twitter dataset has 2360148 lines.
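A minimal sketch of how these figures can be obtained (the reported numbers above come from the full, unsampled corpus):

# Approximate memory footprint of each document
format(object.size(crps[[1]]$content), units = "Mb")
format(object.size(crps[[2]]$content), units = "Mb")
format(object.size(crps[[3]]$content), units = "Mb")

# Number of lines in each document
sapply(1:3, function(i) length(crps[[i]]$content))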
As the dataset is huge, we will use only a 1% sample from each of the news, blogs and twitter datasets as our training data. Depending on the performance and accuracy trade-offs, we might have to sample more or less.
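# Draw a 1% random sample of lines from each document; calling set.seed beforehand
# would make the sample reproducible.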
n <- length(crps[[1]]$content)
smpl1 <- sample(crps[[1]]$content,n * 0.01)
n <- length(crps[[2]]$content)
smpl2 <- sample(crps[[2]]$content, n*0.01)
n <- length(crps[[3]]$content)
smpl3 <- sample(crps[[3]]$content, n*0.01)
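# Write the samples to disk so they can be re-read later as a separate sample corpus.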
write(smpl1,"~/coursera/capstone/sample/en_US/en_US.blogs.txt")
write(smpl2,"~/coursera/capstone/sample/en_US/en_US.news.txt")
write(smpl3,"~/coursera/capstone/sample/en_US/en_US.twitter.txt")
A sample corpus, smpl_crps, is created from the sampled data and used as the training set.
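One way to build smpl_crps is to re-read the sampled files written above as a new tm corpus; this is a sketch, not necessarily the exact call used:

# Re-load the sampled files (same directory as the write() calls above) as a corpus
smpl_crps <- VCorpus(DirSource("~/coursera/capstone/sample/en_US/", encoding = "UTF-8"),
                     readerControl = list(language = "en_US"))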
Once we have sampled the data, the next step is to clean the dataset and prepare n-grams from it. For cleaning, we remove profanity, numbers and stopwords, and we stem the remaining words using the Porter stemmer.
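A sketch of the cleaning steps described above, using tm_map transformations (the profanity list and its file path are placeholders; lower-casing is assumed because the sample output below is lower case):

# Placeholder: a character vector of profane terms to filter out
profanity <- readLines("~/coursera/capstone/profanity.txt")

smpl_crps <- tm_map(smpl_crps, content_transformer(tolower))        # assumed lower-casing
smpl_crps <- tm_map(smpl_crps, removeNumbers)
smpl_crps <- tm_map(smpl_crps, removeWords, stopwords("english"))
smpl_crps <- tm_map(smpl_crps, removeWords, profanity)
smpl_crps <- tm_map(smpl_crps, stemDocument)                        # Porter stemmer (via SnowballC)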
This is an example of what our dataset looks like after basic sentence tokenization and cleaning:
smpl_crps[[1]]$content[1:4]
## [1] " anything, continued, deep sadness hardening features."
## [2] " grandmother escaped , hid guilty fear must've looked years come."
## [3] " google nexus s former lead android device samsung galaxy nexus's immediate predecessor."
## [4] "'s decent phone, "
Now we are ready to create n-grams from our cleaned-up dataset. At this point, a maximum n-gram order of n = 3 seems sufficient for our prediction model. The n-grams are created using the TermDocumentMatrix interface provided by the tm package.
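One common way to do this is to pass n-gram tokenizers to TermDocumentMatrix; the sketch below assumes RWeka's NGramTokenizer, but any tokenizer with the same interface would work:

library(RWeka)  # assumption: RWeka supplies the bigram/trigram tokenizers

BigramTokenizer  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

unigramTDM <- TermDocumentMatrix(smpl_crps)
bigramTDM  <- TermDocumentMatrix(smpl_crps, control = list(tokenize = BigramTokenizer))
trigramTDM <- TermDocumentMatrix(smpl_crps, control = list(tokenize = TrigramTokenizer))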
Let's look at the top n-grams for n = 1..3.
findFreqTerms(unigramTDM,500)
## [1] "'ll" "'re" "'ve" "also" "always"
## [6] "another" "around" "back" "best" "better"
## [11] "big" "can" "city" "come" "day"
## [16] "days" "even" "every" "first" "game"
## [21] "get" "going" "good" "got" "great"
## [26] "happy" "home" "just" "know" "last"
## [31] "life" "like" "little" "long" "look"
## [36] "love" "made" "make" "man" "many"
## [41] "may" "much" "need" "never" "new"
## [46] "next" "night" "now" "old" "one"
## [51] "people" "place" "really" "right" "said"
## [56] "say" "says" "school" "see" "show"
## [61] "since" "something" "state" "still" "take"
## [66] "thanks" "things" "think" "three" "time"
## [71] "today" "two" "use" "want" "way"
## [76] "week" "well" "will" "work" "world"
## [81] "year" "years"
findFreqTerms(bigramTDM,100)
## [1] "'m going" "'ve got" "can get" "first time" "high school"
## [6] "last night" "last year" "new york" "right now" "years ago"
findFreqTerms(trigramTDM,10)
## [1] "'m pretty sure" "'m sure will"
## [3] "caprera hotel venice" "cinco de mayo"
## [5] "first time since" "five years ago"
## [7] "happy mothers day" "happy new year"
## [9] "hotel venice italy" "let us know"
## [11] "new york city" "new york times"
## [13] "president barack obama" "two weeks ago"
## [15] "two years ago" "will take place"
## [17] "world war ii" "yes yes yes"
It is interesting to observe that some of the most frequent bigrams, such as "'m going" and "'ve got", come from the pronoun "I" followed by a verb (the "I" itself has likely been dropped as a stopword).
Let's plot the 20 most frequent words and their frequencies.
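A minimal sketch of how such a plot can be produced from the unigram term-document matrix (the styling choices are ours):

# Sum term frequencies across the three documents and plot the 20 most frequent unigrams
freq  <- sort(rowSums(as.matrix(unigramTDM)), decreasing = TRUE)
top20 <- head(freq, 20)
barplot(top20, las = 2, col = "steelblue",
        main = "Top 20 unigrams in the sample corpus", ylab = "Frequency")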
We can also visualize the unigram frequencies using a word cloud, and this time let's do it per document instead of across the entire corpus. The minimum frequency is kept at 500 for all documents, the same threshold used above to select the most frequently occurring words in the entire corpus.
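A sketch of how per-document clouds can be generated with the wordcloud package (the palette and layout options are illustrative):

library(wordcloud)
library(RColorBrewer)

# One cloud per document, keeping the same minimum frequency of 500
m <- as.matrix(unigramTDM)
for (i in 1:ncol(m)) {
  wordcloud(words = rownames(m), freq = m[, i], min.freq = 500,
            random.order = FALSE, colors = brewer.pal(8, "Dark2"))
}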
We can observe that the twitter and blogs documents have many more frequently occurring unique words than the news document. As part of testing the prediction algorithm's performance and accuracy, it might be useful to use just the twitter and blogs documents and see what accuracy we get. We can also see that most of the unigrams in the document matrix of the corpus are dominated by terms from the twitter and blogs documents.
Let's see how many unique words are required to capture a given percentage of all word occurrences in the language sample.
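This can be computed from the cumulative distribution of the sorted unigram frequencies; a minimal sketch:

# Sort unigram frequencies, then find how many words reach a given coverage level
freq     <- sort(rowSums(as.matrix(unigramTDM)), decreasing = TRUE)
coverage <- cumsum(freq) / sum(freq)
min(which(coverage >= 0.5))   # unique words needed to cover 50% of word occurrences
min(which(coverage >= 0.9))   # unique words needed to cover 90%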
It is interesting to note that only 1024 unique words are required to capture 50% of all word occurrences in the sampled corpus.
Based on this exploratory analysis, the plan is to build the model using the frequency distribution of the n-grams. The performance of the model will determine the sampling trade-off, which in turn may affect the accuracy.