##       sysname       release       version      nodename       machine 
##     "Windows"       "7 x64"  "build 9200" "LEARNING-PC"      "x86-64"

Objectives

The goal of this project is to apply Natural Language Processing to develop an application that assists text editing by proposing the next word a user might type. The objective is to analyze the data, define a strategy, and train and apply a machine-learned predictor of (English) words to help expedite text editing, in a way similar to the Shannon game of predicting the next letter or word. This interim report covers tasks 0 through 5 of those outlined in (Ref. 1) and provides a forward work plan for implementing the Shiny Application.

Executive Summary

  1. This milestone report covers the preliminary steps performed for the Coursera-SwiftKey Summer 2015 Capstone project: applying good practice and Data Science rigor in understanding the problem, gathering and reproducibly cleaning the data, and exploring, charting and planning the remaining steps. It prepares for model analysis, machine learning, performance optimization and implementation of a creative Shiny Application solution.

  2. The 3 datasets provided are large (>556 MB in total, spread over 3 files containing between ~30 and ~37 million words each), with alpha, alphanumeric and other content, and with an average line length varying between 13 and 42 words. The last line of the next table tallies these figures and provides the average number of alpha words per line. A minimal sketch of how such file statistics could be gathered follows the table.

kable(df1, format="pandoc", caption="Table 1 - Original en_US Datafile Statistics")
Table 1 - Original en_US Datafile Statistics
| file | size.MB | lines | longest | max.length | alpha.words | anum.words | alpha.words.per.line |
|:------------------|--------:|--------:|--------:|-----------:|------------:|-----------:|---------------------:|
| en_US.blogs.txt | 200.42 | 899288 | 483415 | 40833 | 37334131 | 37874365 | 42 |
| en_US.news.txt | 196.28 | 1010242 | 123628 | 11384 | 34372530 | 34613673 | 35 |
| en_US.twitter.txt | 159.36 | 2360148 | 26 | 140 | 30373543 | 30556095 | 13 |
| all 3 files | 556.06 | 4269678 | 483415 | 40833 | 102080204 | 103044133 | 24 |
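
As an illustration, a minimal base-R sketch of how a subset of these per-file statistics could be gathered; the helper name file_stats, the object file_df and the hard-coded file names are illustrative, not the original processing code.

# Sketch: basic per-file statistics (a subset of the Table 1 columns).
file_stats <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  words <- unlist(strsplit(lines, "\\s+"))
  data.frame(file           = basename(path),
             size.MB        = round(file.info(path)$size / 2^20, 2),
             lines          = length(lines),
             max.length     = max(nchar(lines)),
             alpha.words    = sum(grepl("^[a-zA-Z']+$", words)),
             words.per.line = round(length(words) / length(lines)))
}

files   <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
file_df <- do.call(rbind, lapply(files, file_stats))
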
  3. The data was cleaned using a custom-made aggregate filter to remove profanity and other undesirable content. The filter was assembled from an English dictionary plus bad-word, slang, FBI Twitter shorthand and netlingo lists obtained from online sources (Refs. 6-10), and led to a ~6% reduction in vocabulary, affecting the twitter content more than the news and blogs, as expected. English language contractions were applied to boost the vocabulary unigrams. A sketch of this filtering step follows the table.
kable(df2, format="pandoc", caption="Table 2 - Clean en_US Datafile Statistics")
Table 2 - Clean en_US Datafile Statistics
| file | size.MB | lines | longest | max.length | alpha.words | anum.words | alpha.words.per.line |
|:------------------|--------:|--------:|--------:|-----------:|------------:|-----------:|---------------------:|
| blogs_clean.txt | 190.75 | 886220 | 478314 | 39180 | 36323626 | 37071887 | 41 |
| news_clean.txt | 185.86 | 997109 | 123091 | 10120 | 33022203 | 33860751 | 34 |
| twitter_clean.txt | 147.56 | 2249660 | 9402 | 140 | 28493126 | 29536635 | 13 |
| all 3 files | 524.17 | 4132989 | 478314 | 39180 | 97838955 | 100469273 | 24 |
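
A minimal sketch of the kind of aggregate filtering described above; badwords.txt stands in for the combined dictionary/profanity/FBI-shorthand/netlingo lists and all object names are illustrative, not the actual cleaning code.

# Sketch: drop tokens that appear in the aggregate filter list.
badwords <- tolower(readLines("badwords.txt", encoding = "UTF-8"))

clean_line <- function(line) {
  tokens <- unlist(strsplit(tolower(line), "\\s+"))
  paste(tokens[!tokens %in% badwords], collapse = " ")
}

raw   <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
clean <- vapply(raw, clean_line, character(1), USE.NAMES = FALSE)
writeLines(clean, "twitter_clean.txt")
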
  4. Based on Church and Gale vocabulary law trend estimates, a 10% sample of the corpus was drawn; it matched the full set well, both in word statistics and in specific vocabulary. The complete sample set of 3 documents was reduced to ~52 MB. For statistical sampling purposes we may need to trim further, and also to implement multiple sets (k-fold, k ~ 5), with a target not to exceed a total of 200 MB. A sketch of the sampling step follows the table.
kable(df3, format="pandoc", caption="Table 3 - Sample en_US Datafile Statistics")
Table 3 - Sample en_US Datafile Statistics
| file | size.MB | lines | longest | max.length | alpha.words | anum.words | alpha.words.per.line |
|:-------------------|--------:|-------:|--------:|-----------:|------------:|-----------:|---------------------:|
| blogs_sample.txt | 18.98 | 88611 | 59640 | 13869 | 3613106 | 3687370 | 41 |
| news_sample.txt | 18.48 | 99043 | 2713 | 2746 | 3283095 | 3366239 | 34 |
| twitter_sample.txt | 14.81 | 225525 | 958 | 140 | 2858906 | 2963857 | 13 |
| all 3 files | 52.27 | 413179 | 59640 | 13869 | 9755107 | 10017466 | 24 |
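
A reproducible ~10% line sample, as described above, can be drawn along these lines (a sketch; the actual sampling code may differ):

# Sketch: keep each cleaned line with probability 0.10.
set.seed(103)
lines <- readLines("twitter_clean.txt", encoding = "UTF-8", skipNul = TRUE)
keep  <- rbinom(length(lines), size = 1, prob = 0.10) == 1
writeLines(lines[keep], "twitter_sample.txt")
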
  5. Tokenization, the extraction of the vocabulary (i.e. unique words) from the documents split into words or tokens, was performed successively on individual words and on groups of 2, 3, … up to 5 consecutive words, producing uni-, bi-, … quinti-grams, generically called N-grams. The vocabulary of the global set (corpus) was ranked from the most to the least frequent occurrence of each vocabulary token.
g1

The tokenizer developed by Maciej Szymkiewicz (Ref. 11) was used to derive basic frequency statistics from unigrams to quintigrams for the 3 sampled sets. An example from the twitter set is shown here, comparing the full and sample sets. Also note that contractions are present at positions 23, 48 and 49 of this set's unigram vocabulary ("i'm", "it's", "don't"). A base-R sketch of how such frequency tables can be built follows the listings.

head(names(tw_full_gram1), 50)
##  [1] "the"   "to"    "i"     "a"     "you"   "and"   "in"    "for"  
##  [9] "of"    "is"    "my"    "it"    "on"    "that"  "me"    "be"   
## [17] "at"    "with"  "your"  "have"  "this"  "so"    "i'm"   "are"  
## [25] "just"  "but"   "like"  "all"   "we"    "was"   "out"   "up"   
## [33] "get"   "if"    "love"  "what"  "do"    "good"  "about" "can"  
## [41] "day"   "rt"    "from"  "now"   "when"  "not"   "one"   "it's" 
## [49] "don't" "know"
head(names(tw_gram1), 50)
##  [1] "the"   "to"    "i"     "a"     "you"   "and"   "for"   "in"   
##  [9] "of"    "is"    "my"    "it"    "on"    "that"  "me"    "be"   
## [17] "at"    "with"  "your"  "this"  "have"  "so"    "i'm"   "are"  
## [25] "just"  "but"   "like"  "all"   "was"   "we"    "out"   "get"  
## [33] "up"    "if"    "what"  "love"  "do"    "good"  "about" "day"  
## [41] "can"   "rt"    "from"  "when"  "now"   "not"   "one"   "it's" 
## [49] "don't" "know"
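
The listings above come from the Ngram_Tokenizer of (Ref. 11); the frequency tables themselves can be approximated in base R as sketched below. Object names such as tw_sample are illustrative, and this is not the tokenizer's actual code.

# Sketch: unigram frequency table sorted from most to least frequent.
tw_sample <- readLines("twitter_sample.txt", encoding = "UTF-8", skipNul = TRUE)
tokens    <- unlist(strsplit(tolower(tw_sample), "[^a-z']+"))
tokens    <- tokens[nchar(tokens) > 0]
tw_gram1  <- sort(table(tokens), decreasing = TRUE)
head(names(tw_gram1), 10)

# Higher N-grams can be formed by pasting shifted token vectors; note this
# naive version also pairs words across line boundaries, which the sentence
# separator planned in the forward plan will avoid.
bigrams  <- paste(head(tokens, -1), tail(tokens, -1))
tw_gram2 <- sort(table(bigrams), decreasing = TRUE)
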
  6. A commonality cloud reveals substantial overlap in the vocabulary of the datasets, while the comparison cloud highlights how the vocabularies of the twitter, news and blog sets differentiate, diverging as we move away from the center of the word cloud.
set.seed(103)
commonality.cloud(tdm,commonality.measure=min,max.words=100,colors=brewer.pal(ncol(tdm),"Dark2"))

Note the size of words is proportional to their frequency in the set.

comparison.cloud(tdm,max.words=100,random.order=FALSE,colors=brewer.pal(ncol(tdm),"Dark2"))

Most common words are located in the center of the word cloud (e.g. "The"), while vocabulary specific to a set is pushed to the periphery (e.g. "awsome" for the twitter set, "police" for the news set, or "into" for the blogs set).
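
The tdm object passed to the two word cloud calls above is a term-document (term-frequency) matrix with one column per set; a minimal sketch of how it could be built with the tm and wordcloud packages (Ref. 12), assuming the sample file names used earlier. The actual report may apply further tm transformations (e.g. punctuation and number removal).

# Sketch: term-document matrix with one column per sample set.
library(tm)
library(wordcloud)

docs <- c(blogs   = paste(readLines("blogs_sample.txt",   skipNul = TRUE), collapse = " "),
          news    = paste(readLines("news_sample.txt",    skipNul = TRUE), collapse = " "),
          twitter = paste(readLines("twitter_sample.txt", skipNul = TRUE), collapse = " "))

corpus <- VCorpus(VectorSource(docs))
corpus <- tm_map(corpus, content_transformer(tolower))
tdm    <- as.matrix(TermDocumentMatrix(corpus))
colnames(tdm) <- names(docs)   # labels shown in the comparison cloud
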

  7. There were sufficient differences in the word contents (from unigrams to N-grams) to suggest training each set individually as well as the global corpus, and offering a selector in the final app to that effect.
g3
  8. Specifically, the sets share the same words from the most frequent down to the 8th word, and diverge more and more thereafter. A vocabulary of 71 words is needed to capture the top 50 most frequent words of the 3 sets, and more than 1500 words to capture their top 1000 most frequent words, etc. This means the combined vocabulary diverges as the set size increases; we can predict that, on average, a global dictionary covering the 3 specific vocabulary sets would require roughly twice their individual size. Looking at the vocabulary ratios, the ratio grows from 1X (first 8 words) to 1.42X (first 50 words) to 1.5X (1000 words), and is expected to tend to 2X at the limit. A sketch of this computation follows the figure.
g2
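
A sketch of the vocabulary-union computation behind these ratios, assuming per-set unigram frequency tables named tw_gram1, bl_gram1 and nw_gram1 (the latter two names are illustrative):

# Sketch: size of the combined dictionary needed to cover the top-k words
# of each of the three sets, and the resulting ratio to k.
union_size <- function(k) {
  length(unique(c(head(names(tw_gram1), k),
                  head(names(bl_gram1), k),
                  head(names(nw_gram1), k))))
}

k      <- c(8, 50, 1000)
sizes  <- sapply(k, union_size)   # per the text: 8, 71, >1500
ratios <- sizes / k               # ~1X, ~1.42X, ~1.5X
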
  9. Coverage functions were determined for all sets, from unigrams to quintigrams, as well as for the global corpus, and statistics for the most frequent unigrams were obtained.
g4

The Zipf-Mandelbrot law relating vocabulary frequency to token rank (ranked word index) was followed particularly well for token ranks greater than 5, as illustrated by the linear trend in the log-log plot. A minimal sketch of how such a plot can be produced follows the figure.

g5
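
A minimal sketch of the log-log plot underlying this observation, using the unigram frequency table sketched earlier:

# Sketch: log-log plot of unigram frequency versus rank (Zipf-Mandelbrot).
freq <- as.numeric(tw_gram1)   # frequencies, already sorted decreasing
rank <- seq_along(freq)
plot(log10(rank), log10(freq), type = "l",
     xlab = "log10(rank)", ylab = "log10(frequency)",
     main = "Zipf-Mandelbrot trend, twitter unigrams")
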
  10. It was observed, for example, on the twitter set that the clean-set unigram coverage reaches ~75% with a 1000-word dictionary, requires 5989 words to reach 90% coverage, and contains 41337 non-unique vocabulary tokens. A sketch of the coverage computation follows Table 4.
kable(df4, format="pandoc", caption="Table 4 - N-Grams Data Analytics Performance Comparison")
Table 4 - N-Grams Data Analytics Performance Comparison
| gram | kToken.Coverage | logkTokenCov | Vocab50 | Vocab90 | NonUniqueVocab | NonUniqueCoverage |
|:-------|----------------:|-------------:|--------:|--------:|---------------:|------------------:|
| uni | 0.75 | -0.13 | 133 | 5989 | 41337 | 0.98 |
| bi | 0.18 | -0.74 | 34792 | 779272 | 224585 | 0.71 |
| tri | 0.04 | -1.43 | 774325 | 1917693 | 195139 | 0.30 |
| quadri | 0.01 | -2.01 | 1265325 | 2408693 | 77245 | 0.08 |
| quinti | 0.00 | -2.46 | 1389775 | 2533143 | 24017 | 0.02 |
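
The coverage figures in Table 4 follow from the cumulative relative frequency of the sorted N-gram tables; a sketch for the unigram case, using the tw_gram1 table sketched earlier:

# Sketch: unigram coverage as a function of vocabulary size.
counts   <- as.numeric(tw_gram1)
coverage <- cumsum(counts) / sum(counts)
coverage[1000]              # ~0.75, the kToken.Coverage of Table 4
which(coverage >= 0.90)[1]  # ~5989 words needed for 90% coverage
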
  11. For the same dataset, the quintigrams cover only 0.35% with the 1000 most frequent 5-word groups, require 2.53 million tokens to reach 90% coverage, and would require computational resources (time and memory) beyond what is acceptable for a responsive Shiny Application. This key finding dictates the strategy for developing the App and the algorithm to use.

  12. We also observed that implementing grammar contractions can boost the unigrams (by partially including some bigrams, visible among the top unigrams), and the effect will cascade to the higher N-grams. Retaining the stop words also helps in news and blogs, but perhaps not as much in tweets. The top 50 twitter quintigrams extracted follow as an example.

head(names(tw_gram5), 50)
##  [1] "thank you so much for"         "can't wait to see you"        
##  [3] "i can't wait to see"           "at the end of the"            
##  [5] "hope you have a great"         "let me know if you"           
##  [7] "thanks so much for the"        "for the first time in"        
##  [9] "keep up the good work"         "that awkward moment when you" 
## [11] "for the rest of the"           "happy mother's day to all"    
## [13] "thank you for the follow"      "thanks for the shout out"     
## [15] "i love you so much"            "in the middle of the"         
## [17] "for a chance to win"           "am i the only one"            
## [19] "happy mothers day to all"      "hope you had a great"         
## [21] "i hope you have a"             "looking forward to seeing you"
## [23] "you have a great day"          "thanks for the follow i"      
## [25] "to be a part of"               "look forward to seeing you"   
## [27] "hope to see you there"         "i have to go to"              
## [29] "let me know what you"          "thanks for the heads up"      
## [31] "the rest of the day"           "can't wait to see what"       
## [33] "cake cake cake cake cake"      "can't wait to see the"        
## [35] "the end of the day"            "on my way to the"             
## [37] "so so so so so"                "hope you are having a"        
## [39] "to everyone who came out"      "hope to see you soon"         
## [41] "i have no idea what"           "i wish i had a"               
## [43] "mother's day to all the"       "thank you for the rt"         
## [45] "thanks for the follow and"     "i'm not the only one"         
## [47] "keep up the great work"        "thanks for the kind words"    
## [49] "to figure out how to"          "i have a lot of"

Forward Plan for Shiny Application

  1. The Shiny Application may rely on the simplest implementation possible, using the "Stupid Backoff" described in (Ref. 14); a minimal sketch of its scoring rule follows this list. If time permits, we intend to explore optimizing the back-off coefficient to determine whether 0.4 is best.
  2. We plan to retain specific vocabularies for each set, and one for the global set, to tailor the predictions as well as possible.
  3. We will retain the package and N-gram tokenizer (Ref. 11) used in this preliminary work for performance and compatibility, avoiding the Java dependencies of Weka.
  4. We plan to apply a sentence separator prior to tokenizing, to improve the multi-grams: words separated by sentence punctuation should not be related.
  5. We will eliminate all low-frequency counts in the N-grams, possibly retain a set-specific vocabulary covering 90% of the most frequent occurrences, implement an unknown-word token, and develop our engine on multi-grams.
  6. We plan to implement K-fold sampling (K=10), using 30% for training, holding 30% for tuning the smoothing lambdas, and validating on the remaining 40%.
  7. We will continue to enjoy this challenging discovery, working on the training and the next steps to develop the application!
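
As referenced in item 1 above, here is a minimal sketch of the Stupid Backoff score of (Ref. 14) for a single candidate word given a two-word context. It assumes gram1, gram2 and gram3 are named count vectors keyed by the space-separated N-gram; the function name and table layout are illustrative, not our final implementation.

# Sketch: Stupid Backoff score S(word | w1 w2) with back-off factor alpha.
stupid_backoff <- function(word, w1, w2, gram1, gram2, gram3, alpha = 0.4) {
  tri <- gram3[paste(w1, w2, word)]
  if (!is.na(tri))                       # trigram seen: relative frequency
    return(as.numeric(tri) / as.numeric(gram2[paste(w1, w2)]))
  bi <- gram2[paste(w2, word)]
  if (!is.na(bi))                        # back off once, to the bigram
    return(alpha * as.numeric(bi) / as.numeric(gram1[w2]))
  uni <- gram1[word]                     # back off twice, to the unigram
  if (is.na(uni)) uni <- 0
  alpha^2 * as.numeric(uni) / sum(gram1)
}

Candidates would be ranked by this score over the retained vocabulary; alpha = 0.4 is the value suggested in the reference and is the coefficient we intend to tune.
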

References

  1. Johns Hopkins Coursera Data Science Track Capstone
  2. About Data Corpus
  3. Course Dataset
  4. Stanford Coursera Natural Language Processing
  5. R Programming/Text Processing
  6. SCOLD US_English Dictionary
  7. CMU resource for English profanity words
  8. FBI Twitter Shorthand Guide
  9. netlingo List
  10. Internet Slang Dictionary
  11. Ngram_Tokenizer written by Maciej Szymkiewicz
  12. wordcloud package
  13. Wikipedia Stemming Article
  14. Large Language Models in Machine Translation