The analysis was run on the following system:

## sysname release version nodename machine
## "Windows" "7 x64" "build 9200" "LEARNING-PC" "x86-64"
The goal of this project is to apply Natural Language Processing to build an application that assists text editing by proposing the next word a user is likely to type. In the spirit of the Shannon game of predicting the next letter or word, the approach is to analyze the data, devise a strategy, then train and apply a machine-learned predictor of (English) words to help expedite text editing. This interim report covers tasks 0 through 5 of those outlined in (Ref. 1) and outlines the forward work plan to be implemented in the Shiny application.
This milestone report covers the preliminary steps performed for the Coursera-SwiftKey Summer 2015 Capstone project: applying good practice and Data Science rigor to understanding the problem, gathering and reproducibly cleaning the data, exploring and charting it, and planning the remaining steps. It prepares for model analysis, machine learning, performance optimization and the implementation of a creative Shiny application solution.
The 3 datasets provided are large (over 556 MB in total, split across 3 files each containing between roughly 30 and 37 million words), with alphabetic, alphanumeric and other content, and with the average line length varying between 13 and 42 alphabetic words. The last row of each table below tallies the three files and gives the overall average of alphabetic words per line.
kable(df1, format="pandoc", caption="Table 1 - Original en_US Datafile Statistics")
| file | size.MB | lines | longest | max.length | alpha.words | anum.words | alpha.words.per.line |
|---|---|---|---|---|---|---|---|
| en_US.blogs.txt | 200.42 | 899288 | 483415 | 40833 | 37334131 | 37874365 | 42 |
| en_US.news.txt | 196.28 | 1010242 | 123628 | 11384 | 34372530 | 34613673 | 35 |
| en_US.twitter.txt | 159.36 | 2360148 | 26 | 140 | 30373543 | 30556095 | 13 |
| Total (3 files) | 556.06 | 4269678 | 483415 | 40833 | 102080204 | 103044133 | 24 |
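For reference, a minimal sketch of how the per-file statistics above could be computed is shown below. The column meanings are assumptions read off the table (in particular, longest is taken to be the line index of the longest line and max.length its character count), and the word counts use stri_count_regex from the stringi package.

library(stringi)
file_stats <- function(path) {
  txt <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  nchars <- nchar(txt, type = "chars")
  alpha  <- stri_count_regex(txt, "\\p{L}+")           # purely alphabetic words
  anum   <- stri_count_regex(txt, "[\\p{L}\\p{N}']+")  # alphanumeric words
  data.frame(file = basename(path),
             size.MB = round(file.info(path)$size / 2^20, 2),
             lines = length(txt),
             longest = which.max(nchars),              # assumed: index of the longest line
             max.length = max(nchars),                 # characters in the longest line
             alpha.words = sum(alpha),
             anum.words = sum(anum),
             alpha.words.per.line = round(sum(alpha) / length(txt)))
}
do.call(rbind, lapply(c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"), file_stats))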
kable(df2, format="pandoc", caption="Table 2 - Clean en_US Datafile Statistics")
| file | size.MB | lines | longest | max.length | alpha.words | anum.words | alpha.words.per.line |
|---|---|---|---|---|---|---|---|
| blogs_clean.txt | 190.75 | 886220 | 478314 | 39180 | 36323626 | 37071887 | 41 |
| news_clean.txt | 185.86 | 997109 | 123091 | 10120 | 33022203 | 33860751 | 34 |
| twitter_clean.txt | 147.56 | 2249660 | 9402 | 140 | 28493126 | 29536635 | 13 |
| Total (3 files) | 524.17 | 4132989 | 478314 | 39180 | 97838955 | 100469273 | 24 |
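The exact cleaning rules live in the project scripts; purely as an illustration, the sketch below shows the kind of per-line cleaning that could produce files like those in Table 2. The specific steps (non-ASCII removal, lower-casing, punctuation stripping and whitespace normalization) are assumptions, not the project's actual recipe.

clean_file <- function(infile, outfile) {
  txt <- readLines(infile, encoding = "UTF-8", skipNul = TRUE)
  txt <- iconv(txt, "UTF-8", "ASCII", sub = "")        # drop non-ASCII characters
  txt <- tolower(txt)
  txt <- gsub("[^a-z0-9' ]", " ", txt)                 # keep letters, digits, apostrophes
  txt <- gsub("\\s+", " ", trimws(txt))                # collapse repeated whitespace
  writeLines(txt[nzchar(txt)], outfile)                # drop lines emptied by cleaning
}
clean_file("en_US.twitter.txt", "twitter_clean.txt")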
kable(df3, format="pandoc", caption="Table 3 - Sample en_US Datafile Statistics")
| file | size.MB | lines | longest | max.length | alpha.words | anum.words | alpha.words.per.line |
|---|---|---|---|---|---|---|---|
| blogs_sample.txt | 18.98 | 88611 | 59640 | 13869 | 3613106 | 3687370 | 41 |
| news_sample.txt | 18.48 | 99043 | 2713 | 2746 | 3283095 | 3366239 | 34 |
| twitter_sample.txt | 14.81 | 225525 | 958 | 140 | 2858906 | 2963857 | 13 |
| Total (3 files) | 52.27 | 413179 | 59640 | 13869 | 9755107 | 10017466 | 24 |
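The sample files in Table 3 hold roughly 10% of the cleaned lines. A minimal sketch of such line sampling follows; the sampling fraction and seed shown here are illustrative assumptions, not necessarily those used.

set.seed(20150726)                                     # illustrative seed only
sample_file <- function(infile, outfile, p = 0.10) {
  txt <- readLines(infile, encoding = "UTF-8", skipNul = TRUE)
  keep <- rbinom(length(txt), 1, p) == 1               # keep each line with probability p
  writeLines(txt[keep], outfile)
}
sample_file("twitter_clean.txt", "twitter_sample.txt")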
g1
The tokenizer developed by Maciej Szymkiewicz (Ref. 11) was used to derive basic frequency statistics, from unigrams to quintigrams, for the 3 sampled sets. An example for the twitter set is shown below, comparing the full and sampled sets. Note also that contractions appear at positions 23, 48 and 49 of this set's unigram vocabulary ("i'm", "it's" and "don't").
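For orientation only, the sketch below shows how a unigram frequency vector such as tw_gram1 could be built in base R; the Ref. 11 tokenizer actually used is more complete, notably around punctuation, numbers and sentence boundaries.

make_gram1 <- function(path) {
  txt <- tolower(readLines(path, encoding = "UTF-8", skipNul = TRUE))
  tokens <- unlist(strsplit(txt, "[^a-z']+"))          # keep apostrophes so contractions survive
  tokens <- tokens[nzchar(tokens)]
  sort(table(tokens), decreasing = TRUE)               # named counts, most frequent first
}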
head( names(tw_full_gram1),50)
## [1] "the" "to" "i" "a" "you" "and" "in" "for"
## [9] "of" "is" "my" "it" "on" "that" "me" "be"
## [17] "at" "with" "your" "have" "this" "so" "i'm" "are"
## [25] "just" "but" "like" "all" "we" "was" "out" "up"
## [33] "get" "if" "love" "what" "do" "good" "about" "can"
## [41] "day" "rt" "from" "now" "when" "not" "one" "it's"
## [49] "don't" "know"
head( names(tw_gram1),50)
## [1] "the" "to" "i" "a" "you" "and" "for" "in"
## [9] "of" "is" "my" "it" "on" "that" "me" "be"
## [17] "at" "with" "your" "this" "have" "so" "i'm" "are"
## [25] "just" "but" "like" "all" "was" "we" "out" "get"
## [33] "up" "if" "what" "love" "do" "good" "about" "day"
## [41] "can" "rt" "from" "when" "now" "not" "one" "it's"
## [49] "don't" "know"
set.seed(103)
commonality.cloud(tdm,commonality.measure=min,max.words=100,colors=brewer.pal(ncol(tdm),"Dark2"))
Note that the size of each word is proportional to its frequency across the sets.
comparison.cloud(tdm,max.words=100,random.order=FALSE,colors=brewer.pal(ncol(tdm),"Dark2"))
The most common words are located at the center of the wordcloud (e.g. “the”), while vocabulary specific to one set is pushed to the periphery (e.g. “awsome” for the twitter set, “police” for the news set, or “into” for the blogs set).
g3
g2
g4
The Zipf-Mandelbrot law, which relates vocabulary frequency to word rank (the ranked word index), was followed closely for ranks greater than 5, as illustrated by the linear trend in the log-log plot.
g5
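The Zipf-Mandelbrot relation predicts a frequency roughly proportional to 1/(rank + b)^a, i.e. a near-linear trend on log-log axes. As a cross-check, a minimal sketch on the sampled twitter unigrams (assuming tw_gram1 as above):

freq <- as.numeric(sort(tw_gram1, decreasing = TRUE))
rank <- seq_along(freq)
plot(log10(rank), log10(freq), pch = ".",
     xlab = "log10(rank)", ylab = "log10(frequency)")
fit <- lm(log10(freq[rank > 5]) ~ log10(rank[rank > 5]))  # linear trend past rank 5
abline(fit, col = "red")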
kable(df4, format="pandoc", caption="Table 4 - N-Grams Data Analytics Performance Comparison")
| gram | kToken.Coverage | logkTokenCov | Vocab50 | Vocab90 | NonUniqueVocab | NonUniqueCoverage |
|---|---|---|---|---|---|---|
| uni | 0.75 | -0.13 | 133 | 5989 | 41337 | 0.98 |
| bi | 0.18 | -0.74 | 34792 | 779272 | 224585 | 0.71 |
| tri | 0.04 | -1.43 | 774325 | 1917693 | 195139 | 0.30 |
| quadri | 0.01 | -2.01 | 1265325 | 2408693 | 77245 | 0.08 |
| quinti | 0.00 | -2.46 | 1389775 | 2533143 | 24017 | 0.02 |
For the same dataset, the quintigrams cover only 0.35% of tokens with the top 1,000 five-word groups, require 2.53 million grams to reach 90% coverage, and would demand computational resources (time and memory) beyond what a responsive Shiny application can afford. This key finding dictates the strategy for developing the App and the choice of algorithm.
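For reference, a minimal sketch of how the coverage figures in Table 4 could be derived from a named n-gram count vector follows; the definitions used here (kToken.Coverage as the share of tokens covered by the top 1,000 grams, Vocab50/Vocab90 as the number of grams needed for 50%/90% coverage) are assumptions read off the table.

coverage_stats <- function(gram_freq, k = 1000) {
  freq <- sort(as.numeric(gram_freq), decreasing = TRUE)
  cum  <- cumsum(freq) / sum(freq)                     # cumulative token coverage
  c(kToken.Coverage = round(cum[min(k, length(cum))], 4),
    Vocab50 = which(cum >= 0.50)[1],
    Vocab90 = which(cum >= 0.90)[1])
}
coverage_stats(tw_gram5)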
We also observed that keeping grammatical contractions boosts the unigrams (they partially capture bigrams, as visible in the top unigrams above), an effect that cascades to the higher-order N-grams. Retaining stop words also helps for news and blogs, though perhaps less so for tweets. The top 50 twitter quintigrams extracted follow as an example.
head( names(tw_gram5),50)
## [1] "thank you so much for" "can't wait to see you"
## [3] "i can't wait to see" "at the end of the"
## [5] "hope you have a great" "let me know if you"
## [7] "thanks so much for the" "for the first time in"
## [9] "keep up the good work" "that awkward moment when you"
## [11] "for the rest of the" "happy mother's day to all"
## [13] "thank you for the follow" "thanks for the shout out"
## [15] "i love you so much" "in the middle of the"
## [17] "for a chance to win" "am i the only one"
## [19] "happy mothers day to all" "hope you had a great"
## [21] "i hope you have a" "looking forward to seeing you"
## [23] "you have a great day" "thanks for the follow i"
## [25] "to be a part of" "look forward to seeing you"
## [27] "hope to see you there" "i have to go to"
## [29] "let me know what you" "thanks for the heads up"
## [31] "the rest of the day" "can't wait to see what"
## [33] "cake cake cake cake cake" "can't wait to see the"
## [35] "the end of the day" "on my way to the"
## [37] "so so so so so" "hope you are having a"
## [39] "to everyone who came out" "hope to see you soon"
## [41] "i have no idea what" "i wish i had a"
## [43] "mother's day to all the" "thank you for the rt"
## [45] "thanks for the follow and" "i'm not the only one"
## [47] "keep up the great work" "thanks for the kind words"
## [49] "to figure out how to" "i have a lot of"