##       sysname       release       version      nodename       machine 
##     "Windows"       "7 x64"  "build 9200" "LEARNING-PC"      "x86-64"

Objectives

The goal of this project is to apply Natural Language Processing to develop an application that assists text editing by proposing the next word a user might type. The objective is to analyze the data, define a strategy, and train and apply a machine-learned predictor of (English) words to help expedite text editing, in a way similar to the Shannon game of predicting the next letter or word. This interim report covers tasks 0 through 5 of those outlined in (Ref. 1) and provides a forward work plan for implementing the Shiny Application.

Executive Summary

  1. This milestone report covers the preliminary steps performed for the Coursera-SwiftKey Summer 2015 Capstone project: applying good practice and Data Science rigor in understanding the problem, gathering and reproducibly cleaning the data, and exploring, charting and planning the remaining steps. It prepares for model analysis, machine learning, performance optimization and implementation of a creative Shiny Application solution.

  2. The 3 datasets provided are large (>556 MB in total, spread over 3 files containing between ~30 and ~37 million words each), with alpha, alphanumeric and other content, and with an average line length varying between 13 and 42 words. The last line of the next table tallies these figures and provides the average number of alpha words per line. A minimal sketch of how such file statistics could be gathered follows the table.

kable(df1, format="pandoc", caption="Table 1 - Original en_US Datafile Statistics")
Table 1 - Original en_US Datafile Statistics
| file | size.MB | lines | longest | max.length | alpha.words | anum.words | alpha.words.per.line |
|:------------------|--------:|--------:|--------:|-----------:|------------:|-----------:|---------------------:|
| en_US.blogs.txt | 200.42 | 899288 | 483415 | 40833 | 37334131 | 37874365 | 42 |
| en_US.news.txt | 196.28 | 1010242 | 123628 | 11384 | 34372530 | 34613673 | 35 |
| en_US.twitter.txt | 159.36 | 2360148 | 26 | 140 | 30373543 | 30556095 | 13 |
| all 3 files | 556.06 | 4269678 | 483415 | 40833 | 102080204 | 103044133 | 24 |
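
As an illustration, a minimal base-R sketch of how a subset of these per-file statistics could be gathered; the helper name file_stats, the object file_df and the hard-coded file names are illustrative, not the original processing code.

# Sketch: basic per-file statistics (a subset of the Table 1 columns).
file_stats <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  words <- unlist(strsplit(lines, "\\s+"))
  data.frame(file           = basename(path),
             size.MB        = round(file.info(path)$size / 2^20, 2),
             lines          = length(lines),
             max.length     = max(nchar(lines)),
             alpha.words    = sum(grepl("^[a-zA-Z']+$", words)),
             words.per.line = round(length(words) / length(lines)))
}

files   <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
file_df <- do.call(rbind, lapply(files, file_stats))
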
  3. The data was cleaned using a custom-made aggregate filter to remove profanity and other undesirable content. The filter was assembled from an English dictionary plus bad-word, slang, FBI Twitter shorthand and netlingo lists obtained from online sources (Refs. 6-10), and led to a ~6% reduction in vocabulary, affecting the twitter content more than the news and blogs, as expected. English language contractions were applied to boost the vocabulary unigrams. A sketch of this filtering step follows the table.
kable(df2, format="pandoc", caption="Table 2 - Clean en_US Datafile Statistics")
Table 2 - Clean en_US Datafile Statistics
| file | size.MB | lines | longest | max.length | alpha.words | anum.words | alpha.words.per.line |
|:------------------|--------:|--------:|--------:|-----------:|------------:|-----------:|---------------------:|
| blogs_clean.txt | 190.75 | 886220 | 478314 | 39180 | 36323626 | 37071887 | 41 |
| news_clean.txt | 185.86 | 997109 | 123091 | 10120 | 33022203 | 33860751 | 34 |
| twitter_clean.txt | 147.56 | 2249660 | 9402 | 140 | 28493126 | 29536635 | 13 |
| all 3 files | 524.17 | 4132989 | 478314 | 39180 | 97838955 | 100469273 | 24 |
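
A minimal sketch of the kind of aggregate filtering described above; badwords.txt stands in for the combined dictionary/profanity/FBI-shorthand/netlingo lists and all object names are illustrative, not the actual cleaning code.

# Sketch: drop tokens that appear in the aggregate filter list.
badwords <- tolower(readLines("badwords.txt", encoding = "UTF-8"))

clean_line <- function(line) {
  tokens <- unlist(strsplit(tolower(line), "\\s+"))
  paste(tokens[!tokens %in% badwords], collapse = " ")
}

raw   <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
clean <- vapply(raw, clean_line, character(1), USE.NAMES = FALSE)
writeLines(clean, "twitter_clean.txt")
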
  4. Based on Church and Gale vocabulary law trend estimates, a 10% sample of the corpus was drawn; it matched the full set well, both in word statistics and in specific vocabulary. The complete sample set of 3 documents was reduced to ~52 MB. For statistical sampling purposes we may need to trim further, and also to implement multiple sets (k-fold, k ~ 5), with a target not to exceed a total of 200 MB. A sketch of the sampling step follows the table.
kable(df3, format="pandoc", caption="Table 3 - Sample en_US Datafile Statistics")
Table 3 - Sample en_US Datafile Statistics
| file | size.MB | lines | longest | max.length | alpha.words | anum.words | alpha.words.per.line |
|:-------------------|--------:|-------:|--------:|-----------:|------------:|-----------:|---------------------:|
| blogs_sample.txt | 18.98 | 88611 | 59640 | 13869 | 3613106 | 3687370 | 41 |
| news_sample.txt | 18.48 | 99043 | 2713 | 2746 | 3283095 | 3366239 | 34 |
| twitter_sample.txt | 14.81 | 225525 | 958 | 140 | 2858906 | 2963857 | 13 |
| all 3 files | 52.27 | 413179 | 59640 | 13869 | 9755107 | 10017466 | 24 |
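
A reproducible ~10% line sample, as described above, can be drawn along these lines (a sketch; the actual sampling code may differ):

# Sketch: keep each cleaned line with probability 0.10.
set.seed(103)
lines <- readLines("twitter_clean.txt", encoding = "UTF-8", skipNul = TRUE)
keep  <- rbinom(length(lines), size = 1, prob = 0.10) == 1
writeLines(lines[keep], "twitter_sample.txt")
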
  5. Tokenization, the extraction of the vocabulary (i.e. unique words) from the documents split into words or tokens, was performed successively on individual words and on groups of 2, 3, … up to 5 consecutive words, producing uni-, bi-, … quinti-grams, generically called N-grams. The vocabulary of the global set (corpus) was ranked from the most to the least frequent occurrence of each vocabulary token.
g1

The tokenizer developed by Maciej Szymkiewicz (Ref. 11) was used to derive basic frequency statistics from unigrams to quintigrams for the 3 sampled sets. An example from the twitter set is shown here, comparing the full and sample sets. Also note that contractions are present at positions 23, 48 and 49 of this set's unigram vocabulary ("i'm", "it's", "don't"). A base-R sketch of how such frequency tables can be built follows the listings.

head(names(tw_full_gram1), 50)
##  [1] "the"   "to"    "i"     "a"     "you"   "and"   "in"    "for"  
##  [9] "of"    "is"    "my"    "it"    "on"    "that"  "me"    "be"   
## [17] "at"    "with"  "your"  "have"  "this"  "so"    "i'm"   "are"  
## [25] "just"  "but"   "like"  "all"   "we"    "was"   "out"   "up"   
## [33] "get"   "if"    "love"  "what"  "do"    "good"  "about" "can"  
## [41] "day"   "rt"    "from"  "now"   "when"  "not"   "one"   "it's" 
## [49] "don't" "know"
head(names(tw_gram1), 50)
##  [1] "the"   "to"    "i"     "a"     "you"   "and"   "for"   "in"   
##  [9] "of"    "is"    "my"    "it"    "on"    "that"  "me"    "be"   
## [17] "at"    "with"  "your"  "this"  "have"  "so"    "i'm"   "are"  
## [25] "just"  "but"   "like"  "all"   "was"   "we"    "out"   "get"  
## [33] "up"    "if"    "what"  "love"  "do"    "good"  "about" "day"  
## [41] "can"   "rt"    "from"  "when"  "now"   "not"   "one"   "it's" 
## [49] "don't" "know"
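
The listings above come from the Ngram_Tokenizer of (Ref. 11); the frequency tables themselves can be approximated in base R as sketched below. Object names such as tw_sample are illustrative, and this is not the tokenizer's actual code.

# Sketch: unigram frequency table sorted from most to least frequent.
tw_sample <- readLines("twitter_sample.txt", encoding = "UTF-8", skipNul = TRUE)
tokens    <- unlist(strsplit(tolower(tw_sample), "[^a-z']+"))
tokens    <- tokens[nchar(tokens) > 0]
tw_gram1  <- sort(table(tokens), decreasing = TRUE)
head(names(tw_gram1), 10)

# Higher N-grams can be formed by pasting shifted token vectors; note this
# naive version also pairs words across line boundaries, which the sentence
# separator planned in the forward plan will avoid.
bigrams  <- paste(head(tokens, -1), tail(tokens, -1))
tw_gram2 <- sort(table(bigrams), decreasing = TRUE)
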
  6. A commonality cloud reveals substantial overlap in the vocabulary of the datasets, while the comparison cloud highlights how the vocabularies of the twitter, news and blog sets differentiate, diverging as we move away from the center of the word cloud.
set.seed(103)
commonality.cloud(tdm,commonality.measure=min,max.words=100,colors=brewer.pal(ncol(tdm),"Dark2"))

Note the size of words is proportional to their frequency in the set.

comparison.cloud(tdm,max.words=100,random.order=FALSE,colors=brewer.pal(ncol(tdm),"Dark2"))

Most common words are located in the center of the word cloud (e.g. "The"), while vocabulary specific to a set is pushed to the periphery (e.g. "awsome" for the twitter set, "police" for the news set, or "into" for the blogs set).
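
The tdm object passed to the two word cloud calls above is a term-document (term-frequency) matrix with one column per set; a minimal sketch of how it could be built with the tm and wordcloud packages (Ref. 12), assuming the sample file names used earlier. The actual report may apply further tm transformations (e.g. punctuation and number removal).

# Sketch: term-document matrix with one column per sample set.
library(tm)
library(wordcloud)

docs <- c(blogs   = paste(readLines("blogs_sample.txt",   skipNul = TRUE), collapse = " "),
          news    = paste(readLines("news_sample.txt",    skipNul = TRUE), collapse = " "),
          twitter = paste(readLines("twitter_sample.txt", skipNul = TRUE), collapse = " "))

corpus <- VCorpus(VectorSource(docs))
corpus <- tm_map(corpus, content_transformer(tolower))
tdm    <- as.matrix(TermDocumentMatrix(corpus))
colnames(tdm) <- names(docs)   # labels shown in the comparison cloud
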

  7. There were sufficient differences in the word contents (from unigrams to N-grams) to suggest training each set individually as well as the global corpus, and offering a selector in the final app to that effect.
g3
  8. Specifically, the sets share the same words from the most frequent down to the 8th word, and diverge more and more thereafter. A vocabulary of 71 words is needed to capture the top 50 most frequent words of the 3 sets, and more than 1500 words to capture their top 1000 most frequent words, etc. This means the combined vocabulary diverges as the set size increases; we can predict that, on average, a global dictionary covering the 3 specific vocabulary sets would require roughly twice their individual size. Looking at the vocabulary ratios, the ratio grows from 1X (first 8 words) to 1.42X (first 50 words) to 1.5X (1000 words), and is expected to tend to 2X at the limit. A sketch of this computation follows the figure.
g2
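
A sketch of the vocabulary-union computation behind these ratios, assuming per-set unigram frequency tables named tw_gram1, bl_gram1 and nw_gram1 (the latter two names are illustrative):

# Sketch: size of the combined dictionary needed to cover the top-k words
# of each of the three sets, and the resulting ratio to k.
union_size <- function(k) {
  length(unique(c(head(names(tw_gram1), k),
                  head(names(bl_gram1), k),
                  head(names(nw_gram1), k))))
}

k      <- c(8, 50, 1000)
sizes  <- sapply(k, union_size)   # per the text: 8, 71, >1500
ratios <- sizes / k               # ~1X, ~1.42X, ~1.5X
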
  9. Coverage functions were determined for all sets, from unigrams to quintigrams, as well as for the global corpus, and statistics for the most frequent unigrams were obtained.
g4

The Zipf-Mandelbrot law relating vocabulary frequency to token rank (ranked word index) was followed particularly well for token ranks greater than 5, as illustrated by the linear trend in the log-log plot. A minimal sketch of how such a plot can be produced follows the figure.

g5
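
A minimal sketch of the log-log plot underlying this observation, using the unigram frequency table sketched earlier:

# Sketch: log-log plot of unigram frequency versus rank (Zipf-Mandelbrot).
freq <- as.numeric(tw_gram1)   # frequencies, already sorted decreasing
rank <- seq_along(freq)
plot(log10(rank), log10(freq), type = "l",
     xlab = "log10(rank)", ylab = "log10(frequency)",
     main = "Zipf-Mandelbrot trend, twitter unigrams")
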
  10. It was observed, for example, on the twitter set that the clean-set unigram coverage reaches ~75% with a 1000-word dictionary, requires 5989 words to reach 90% coverage, and contains 41337 non-unique vocabulary tokens. A sketch of the coverage computation follows Table 4.
kable(df4, format="pandoc", caption="Table 4 - N-Grams Data Analytics Performance Comparison")
Table 4 - N-Grams Data Analytics Performance Comparison
| gram | kToken.Coverage | logkTokenCov | Vocab50 | Vocab90 | NonUniqueVocab | NonUniqueCoverage |
|:-------|----------------:|-------------:|--------:|--------:|---------------:|------------------:|
| uni | 0.75 | -0.13 | 133 | 5989 | 41337 | 0.98 |
| bi | 0.18 | -0.74 | 34792 | 779272 | 224585 | 0.71 |
| tri | 0.04 | -1.43 | 774325 | 1917693 | 195139 | 0.30 |
| quadri | 0.01 | -2.01 | 1265325 | 2408693 | 77245 | 0.08 |
| quinti | 0.00 | -2.46 | 1389775 | 2533143 | 24017 | 0.02 |
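
The coverage figures in Table 4 follow from the cumulative relative frequency of the sorted N-gram tables; a sketch for the unigram case, using the tw_gram1 table sketched earlier:

# Sketch: unigram coverage as a function of vocabulary size.
counts   <- as.numeric(tw_gram1)
coverage <- cumsum(counts) / sum(counts)
coverage[1000]              # ~0.75, the kToken.Coverage of Table 4
which(coverage >= 0.90)[1]  # ~5989 words needed for 90% coverage
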
  11. For the same dataset, the quintigrams cover only 0.35% with the 1000 most frequent 5-word groups, require 2.53 million tokens to reach 90% coverage, and would require computational resources (time and memory) beyond what is acceptable for a responsive Shiny Application. This key finding dictates the strategy for developing the App and the algorithm to use.

  12. We also observed that implementing grammar contractions can boost the unigrams (by partially including some bigrams, visible among the top unigrams), and the effect will cascade to the higher N-grams. Retaining the stop words also helps in news and blogs, but perhaps not as much in tweets. The top 50 twitter quintigrams extracted follow as an example.

head(names(tw_gram5), 50)
##  [1] "thank you so much for"         "can't wait to see you"        
##  [3] "i can't wait to see"           "at the end of the"            
##  [5] "hope you have a great"         "let me know if you"           
##  [7] "thanks so much for the"        "for the first time in"        
##  [9] "keep up the good work"         "that awkward moment when you" 
## [11] "for the rest of the"           "happy mother's day to all"    
## [13] "thank you for the follow"      "thanks for the shout out"     
## [15] "i love you so much"            "in the middle of the"         
## [17] "for a chance to win"           "am i the only one"            
## [19] "happy mothers day to all"      "hope you had a great"         
## [21] "i hope you have a"             "looking forward to seeing you"
## [23] "you have a great day"          "thanks for the follow i"      
## [25] "to be a part of"               "look forward to seeing you"   
## [27] "hope to see you there"         "i have to go to"              
## [29] "let me know what you"          "thanks for the heads up"      
## [31] "the rest of the day"           "can't wait to see what"       
## [33] "cake cake cake cake cake"      "can't wait to see the"        
## [35] "the end of the day"            "on my way to the"             
## [37] "so so so so so"                "hope you are having a"        
## [39] "to everyone who came out"      "hope to see you soon"         
## [41] "i have no idea what"           "i wish i had a"               
## [43] "mother's day to all the"       "thank you for the rt"         
## [45] "thanks for the follow and"     "i'm not the only one"         
## [47] "keep up the great work"        "thanks for the kind words"    
## [49] "to figure out how to"          "i have a lot of"

Forward Plan for Shiny Application

  1. The Shiny Application may rely on the simplest implementation possible, using the "Stupid Backoff" described in (Ref. 14); a minimal sketch of its scoring rule follows this list. If time permits, we intend to explore optimizing the back-off coefficient to determine whether 0.4 is best.
  2. We plan to retain specific vocabularies for each set, and one for the global set, to tailor the predictions as well as possible.
  3. We will retain the package and N-gram tokenizer (Ref. 11) used in this preliminary work for performance and compatibility, avoiding the Java dependencies of Weka.
  4. We plan to apply a sentence separator prior to tokenizing, to improve the multi-grams: words separated by sentence punctuation should not be related.
  5. We will eliminate all low-frequency counts in the N-grams, possibly retain a set-specific vocabulary covering 90% of the most frequent occurrences, implement an unknown-word token, and develop our engine on multi-grams.
  6. We plan to implement K-fold sampling (K=10), using 30% for training, holding 30% for tuning the smoothing lambdas, and validating on the remaining 40%.
  7. We will continue to enjoy this challenging discovery, working on the training and the next steps to develop the application!
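
As referenced in item 1 above, here is a minimal sketch of the Stupid Backoff score of (Ref. 14) for a single candidate word given a two-word context. It assumes gram1, gram2 and gram3 are named count vectors keyed by the space-separated N-gram; the function name and table layout are illustrative, not our final implementation.

# Sketch: Stupid Backoff score S(word | w1 w2) with back-off factor alpha.
stupid_backoff <- function(word, w1, w2, gram1, gram2, gram3, alpha = 0.4) {
  tri <- gram3[paste(w1, w2, word)]
  if (!is.na(tri))                       # trigram seen: relative frequency
    return(as.numeric(tri) / as.numeric(gram2[paste(w1, w2)]))
  bi <- gram2[paste(w2, word)]
  if (!is.na(bi))                        # back off once, to the bigram
    return(alpha * as.numeric(bi) / as.numeric(gram1[w2]))
  uni <- gram1[word]                     # back off twice, to the unigram
  if (is.na(uni)) uni <- 0
  alpha^2 * as.numeric(uni) / sum(gram1)
}

Candidates would be ranked by this score over the retained vocabulary; alpha = 0.4 is the value suggested in the reference and is the coefficient we intend to tune.
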

References

  1. Johns Hopkins Coursera Data Science Track Capstone
  2. About Data Corpus
  3. Course Dataset
  4. Stanford Coursera Natural Language Processing
  5. R Programming/Text Processing
  6. SCOLD US_English Dictionary
  7. CMU resource for English profanity words
  8. FBI Twitter Shorthand Guide
  9. netlingo List
  10. Internet Slang Dictionary
  11. Ngram_Tokenizer written by Maciej Szymkiewicz
  12. wordcloud package
  13. Wikipedia Stemming Article
  14. Large Language Models in Machine Translation