Introduction

The objective of this project is to create a data product that predicts the next word given a word in context. The idea is to build a predictive model from a library of text usage. The model will be trained on the HC Corpora en_US blogs, twitter and news sets.

Loading and cleaning of data set

Each text data set is 200MB in size. Running processes on 600MB of data is not feasible with the computing resources available, so a random sample of 10% of the records in each set is used instead. We can infer the characteristics of the population from the sample. Sampling is done in 10-line chunks by a function named randomSample.
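The chunked sampling described here could be sketched as follows. This is not the randomSample function used in this report, whose source is not shown; sample_chunks, its arguments and the toy input are hypothetical, and the sketch assumes that each block of consecutive lines is kept or dropped as a whole with the given probability.

```r
# Hypothetical helper, NOT the report's randomSample(): keeps each chunk of
# `chunk_size` consecutive lines with probability `prob`.
sample_chunks <- function(lines, prob = 0.1, chunk_size = 10, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  n <- length(lines)
  if (n == 0) return(character(0))
  # Assign every line to a chunk: lines 1-10 -> chunk 1, 11-20 -> chunk 2, ...
  chunk_id <- ceiling(seq_len(n) / chunk_size)
  # One uniform draw per chunk decides whether the whole chunk is kept
  keep_chunk <- runif(max(chunk_id)) < prob
  lines[keep_chunk[chunk_id]]
}

# Toy input standing in for a text file read with readLines()
toy <- sprintf("line %03d", 1:200)
sampled <- sample_chunks(toy, prob = 0.1, chunk_size = 10, seed = 42)
length(sampled)
```

Under this assumption the sample is 10% of the lines on average rather than exactly 10%, and keeping whole chunks preserves short runs of neighbouring lines.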

Loading

##Randomly sample a tenth of each set, in 10 line chunks
randomSample("~/coursera/data scientist/Capstone/final/en_US/en_US.blogs.txt",
             "~/coursera/data scientist/Capstone/final/en_US/en_US.blogs.sample.txt",0.1,10);

randomSample("~/coursera/data scientist/Capstone/final/en_US/en_US.news.upd.txt",
             "~/coursera/data scientist/Capstone/final/en_US/en_US.news.sample.txt",0.1,10);

randomSample("~/coursera/data scientist/Capstone/final/en_US/en_US.twitter.txt",
             "~/coursera/data scientist/Capstone/final/en_US/en_US.twitter.sample.txt",0.1,10);

Preprocessing

To allow for meaningful analysis of the text bodies we need to do some cleaning. This includes removing additional whitespace, converting all text to lower case, and removing stop words such as “and”, “it” and “so”, which add no information (the full list is shown below).

stopwords("english");

##   [1] "i"          "me"         "my"         "myself"     "we"        
##   [6] "our"        "ours"       "ourselves"  "you"        "your"      
##  [11] "yours"      "yourself"   "yourselves" "he"         "him"       
##  [16] "his"        "himself"    "she"        "her"        "hers"      
##  [21] "herself"    "it"         "its"        "itself"     "they"      
##  [26] "them"       "their"      "theirs"     "themselves" "what"      
##  [31] "which"      "who"        "whom"       "this"       "that"      
##  [36] "these"      "those"      "am"         "is"         "are"       
##  [41] "was"        "were"       "be"         "been"       "being"     
##  [46] "have"       "has"        "had"        "having"     "do"        
##  [51] "does"       "did"        "doing"      "would"      "should"    
##  [56] "could"      "ought"      "i'm"        "you're"     "he's"      
##  [61] "she's"      "it's"       "we're"      "they're"    "i've"      
##  [66] "you've"     "we've"      "they've"    "i'd"        "you'd"     
##  [71] "he'd"       "she'd"      "we'd"       "they'd"     "i'll"      
##  [76] "you'll"     "he'll"      "she'll"     "we'll"      "they'll"   
##  [81] "isn't"      "aren't"     "wasn't"     "weren't"    "hasn't"    
##  [86] "haven't"    "hadn't"     "doesn't"    "don't"      "didn't"    
##  [91] "won't"      "wouldn't"   "shan't"     "shouldn't"  "can't"     
##  [96] "cannot"     "couldn't"   "mustn't"    "let's"      "that's"    
## [101] "who's"      "what's"     "here's"     "there's"    "when's"    
## [106] "where's"    "why's"      "how's"      "a"          "an"        
## [111] "the"        "and"        "but"        "if"         "or"        
## [116] "because"    "as"         "until"      "while"      "of"        
## [121] "at"         "by"         "for"        "with"       "about"     
## [126] "against"    "between"    "into"       "through"    "during"    
## [131] "before"     "after"      "above"      "below"      "to"        
## [136] "from"       "up"         "down"       "in"         "out"       
## [141] "on"         "off"        "over"       "under"      "again"     
## [146] "further"    "then"       "once"       "here"       "there"     
## [151] "when"       "where"      "why"        "how"        "all"       
## [156] "any"        "both"       "each"       "few"        "more"      
## [161] "most"       "other"      "some"       "such"       "no"        
## [166] "nor"        "not"        "only"       "own"        "same"      
## [171] "so"         "than"       "too"        "very"       "will"
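As a minimal illustration of these cleaning steps, the base R sketch below lower-cases the text, collapses whitespace and drops stop words. The function clean_text and its short stop-word list are hypothetical and stand in for the full list printed above. Stripping punctuation and digits is an extra assumption made here so that the tokens split cleanly.

```r
# Hypothetical helper: lower-case, strip punctuation and digits, drop stop
# words and collapse whitespace. `stop_words` is a short illustrative list,
# not the full output of stopwords("english").
clean_text <- function(x, stop_words = c("and", "it", "so", "the", "is")) {
  x <- tolower(x)
  x <- gsub("[^a-z' ]", " ", x)   # keep letters, apostrophes and spaces only
  tokens <- strsplit(x, " +")     # split on runs of spaces
  vapply(tokens, function(tok) {
    tok <- tok[nzchar(tok) & !(tok %in% stop_words)]
    paste(tok, collapse = " ")
  }, character(1))
}

clean_text("It  is SO sunny, and the   park is full!")
## [1] "sunny park full"
```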

In addition, to enable meaningful aggregate analysis it is necessary to tokenise the text and reduce each word to its word stem.

#Load the three cleaned tokenised English text body samples
##Remove whitespace, convert to lower case, remove stop words and perform stemming
en_US_blogs<-swiftToken("~/coursera/data scientist/Capstone/final/en_US/en_US.blogs.sample.txt")
en_US_news<-swiftToken("~/coursera/data scientist/Capstone/final/en_US/en_US.news.sample.txt")
en_US_twitter<-swiftToken("~/coursera/data scientist/Capstone/final/en_US/en_US.twitter.sample.txt")

#combine into a single corpus
corpus_cmb<-c(en_US_blogs,en_US_news,en_US_twitter)

Exploratory analysis

For preliminary analysis, examine the number of lines in each of the texts.

#Check number of lines in each set
NROW(en_US_blogs[[1]][1]$content)

## [1] 89080

NROW(en_US_twitter[[1]][1]$content)

## [1] 235300

NROW(en_US_news[[1]][1]$content)

## [1] 100520

Next, examine the words in the combined corpus, focusing on word stems that are between 3 and 20 characters long and occur in all 3 documents of the corpus. We look at the 30 most frequently occurring stems in each document.

##Create the combined term document matrix only including word stems of length 3 to 20
## which occur in all 3 of documents in the corpus.
tdm <-TermDocumentMatrix(corpus_cmb, control=list(wordLengths=c(3, 20),
bounds = list(global = 3)))

##Examine the 30 most frequently used terms
findMostFreqTerms(tdm,30);

## $en_US.blogs.sample.txt
##   one  like  time   can  just   get  make   day  know  year   use  love 
## 13184 10837 10289  9932  9837  9296  7937  7023  6823  6759  6428  6393 
##  work peopl thing  want think   now   see  even  also  look  dont   new 
##  6202  6160  6116  6083  5920  5837  5597  5596  5584  5565  5553  5413 
##   way  well  back first  good  take 
##  5255  5169  5071  5040  4961  4879 
## 
## $en_US.news.sample.txt
##   said   year    one    new   time  state    say    can   also   like 
##  24775  11101   8526   6841   6647   6565   6268   6046   6002   5992 
##    get    two  first   last   just   make  peopl   work   game school 
##   5908   5654   5334   5313   5245   5176   4969   4966   4965   4575 
##   citi   play    day includ   want    use   take   team   back    now 
##   4497   4355   4310   3905   3773   3766   3748   3744   3612   3569 
## 
## $en_US.twitter.sample.txt
##   just    get  thank   like   love    day   good   dont    can    one 
##  15064  14578  12977  12933  12396  10874  10103   9060   8904   8675 
##   know   time    now follow  great    see  today   make    new    lol 
##   8586   8581   8121   7867   7695   7562   7445   7205   6973   6911 
##   look  think   come   work   need   want   back    got   cant  peopl 
##   6556   6425   6388   6335   6293   6175   5706   5679   5371   5301

It’s interesting to note the variation in the ranking of word stems across the 3 corpora. Although many stems are common to the 3 sets and the rankings are similar, they are not exactly the same. For example, “said” tops the news list but does not appear in the top 30 for blogs or twitter, while “lol” and “follow” appear only in the twitter list. This could point to a difference in the type of language used when communicating in a blog as opposed to a twitter post or a news item.

This may also indicate the need to incorporate context into the predictive model. I also suspect that language may differ between locations.
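To make the comparison of rankings concrete, the base R sketch below builds a small term-by-document count matrix, keeps only the terms present in every document, and ranks them within each document. The toy token vectors and the names docs, tdm_toy, shared and ranks are invented for illustration. This is not how the tm package builds its matrix internally.

```r
# Toy "documents" standing in for the cleaned blogs, news and twitter samples
docs <- list(
  blogs   = c("love", "time", "like", "time", "love", "time"),
  news    = c("said", "time", "state", "said", "said", "like"),
  twitter = c("love", "like", "like", "time", "like", "love")
)

# Term-by-document count matrix: one row per term, one column per document
terms <- sort(unique(unlist(docs)))
tdm_toy <- sapply(docs, function(tok) {
  as.integer(table(factor(tok, levels = terms)))
})
rownames(tdm_toy) <- terms

# Keep only terms present in every document, then rank within each document
# (rank 1 = most frequent; ties share the smallest rank)
shared <- tdm_toy[rowSums(tdm_toy > 0) == ncol(tdm_toy), , drop = FALSE]
ranks <- apply(-shared, 2, rank, ties.method = "min")
ranks
```

In this toy data only “like” and “time” occur in all three documents, and their relative order differs between blogs and twitter, which is the kind of variation noted above.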

Looking a bit deeper, we examine the words in the term document matrix in aggregate: the top 100 most frequently occurring stems.

##convert to matrix
tdmaggr<-as.matrix(tdm)

##look at the top 100 most frequently occurring stems
v<-sort(rowSums(tdmaggr), decreasing=TRUE)
head(v, 100)

##    one   said   just    get   like   time    can    day   year   make 
##  30385  30286  30146  29782  29762  25517  24882  22207  22055  20318 
##   love    new   know   good   dont    now   work  peopl   want    say 
##  20230  19227  18301  18201  17561  17527  17503  16430  16031  16007 
##    see  think  thank   look   come   back   need  first    use   also 
##  15775  15331  15186  15055  14437  14389  13892  13427  13382  13210 
##  thing   last   well   take  great    way   even   much  today    two 
##  13030  12817  12715  12616  12530  12398  12004  11730  11535  11302 
##  right realli follow    got   week  start  still   play   game   call 
##  11241  11150  11040  10979  10638  10592  10203  10093  10064   9634 
##   show    tri  state   feel   that   life school   home   mani   cant 
##   9402   9401   9273   9200   8976   8933   8867   8809   8638   8553 
##   live   help  night  littl   made   hope  never    let    may   best 
##   8443   8356   8239   8214   8188   8135   8134   8105   7831   7766 
##   next friend   give    lol someth   book    lot  world   citi  happi 
##   7643   7535   7406   7167   7127   7055   6935   6907   6906   6884 
##    end   find    man  didnt  place   keep better  watch  alway  anoth 
##   6864   6857   6758   6749   6731   6719   6705   6685   6667   6607 
##    run    ive around  everi   team   your    put   talk    big   read 
##   6585   6580   6540   6470   6465   6456   6262   6251   6209   6085

The 983 most frequent word stems in the vocabulary cover just under 60% of the text.

library(dplyr)

##Stems are already sorted by count; add the cumulative share of text covered
dv<-data.frame(v,names(v))
colnames(dv)<-c("count","wordstem")

dvp<-dv %>% 
  mutate(perc_cover = cumsum(count)/sum(count))

tail(dvp[dvp$perc_cover<=0.6,])

##     count wordstem perc_cover
## 978  1111   common  0.5988054
## 979  1110     deep  0.5990116
## 980  1109     fast  0.5992176
## 981  1108    appar  0.5994233
## 982  1107  address  0.5996289
## 983  1106  absolut  0.5998343

The 7,879 most frequent of the 167,786 word stems cover just under 90% of the text. So a very small portion of the vocabulary (less than 5%) covers a large proportion of the text.

tail(dvp[dvp$perc_cover<=0.9,])

##      count wordstem perc_cover
## 7874    51   reboot  0.8999452
## 7875    51   rejoic  0.8999547
## 7876    51      rhp  0.8999642
## 7877    51     roth  0.8999736
## 7878    51 salesman  0.8999831
## 7879    51   scenic  0.8999926
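The same coverage question can be asked directly: what is the smallest number of top-ranked stems whose cumulative count reaches a target share of the text? The sketch below is a hedged illustration on invented counts. The function stems_needed and the vector toy_counts are hypothetical names, not part of the analysis above.

```r
# Hypothetical helper: smallest number of top-ranked terms whose cumulative
# count reaches `target` (a share between 0 and 1) of all term occurrences.
stems_needed <- function(counts, target) {
  counts <- sort(counts, decreasing = TRUE)
  coverage <- cumsum(counts) / sum(counts)
  unname(which(coverage >= target)[1])
}

# Invented counts for five stems, 100 occurrences in total
toy_counts <- c(one = 50, said = 20, just = 15, get = 10, like = 5)
stems_needed(toy_counts, 0.6)   # 2: 50 + 20 = 70 of 100 occurrences
stems_needed(toy_counts, 0.9)   # 4: 50 + 20 + 15 + 10 = 95 of 100
```

Note the difference from the filter used in the report: perc_cover<=0.6 lists the last stems that are still under the threshold, whereas this helper returns the rank of the stem that first reaches it.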

Plotting this relationship, the number of word stems included against the percentage of text covered:

library(ggplot2)
library(scales)

g<-ggplot(dvp,mapping=aes(perc_cover,as.numeric(rownames(dvp)))) +
  geom_line(color="red",size=1) +
  scale_y_continuous(labels=comma) +
  scale_x_continuous(labels = percent) +
  labs(y="Number of Word Stems", x="Percentage of text covered") +
  ggtitle("Word Stems against coverage of text")
g

The number of word stems required grows roughly exponentially with the proportion of the text to be covered. In other words, achieving very high levels of coverage requires a disproportionately large share of the total word stem vocabulary.

Conclusion

Exploratory analysis has shown that the body of text is extremely large. We have learned that each further increase in coverage of the text requires an increasingly large proportion of the vocabulary. In addition, the character of the language seems to depend on the context (blog, news item or twitter post). This may influence the way in which predictive models are constructed.

Capstone Project - Swiftkey - exploratory data analysis

Jason Schmidt

October 30, 2017