The following report examines the data required for the Johns Hopkins Data Science Specialization Capstone Project and explores plans for creating a word prediction app. Of three model types trialled (Recursive Neural Network with TensorFlow, Markov Chain Prediction and Stupid Back-off), the report will find the Stupid Back-off approach to be the best performer in terms of accuracy, speed and size, and simplest to create.
Using data obtained from the coursera website representing publicly available text from twitter, news and blog sources, create a predictive text product that takes as input a phrase (multiple words) and predict the next word.
This project will just look at the en_US data set:
## # A tibble: 10 x 3
## text source id
## <chr> <chr> <int>
## 1 "I hate when people say volleyball isn't a sport." twitt~ 2.02e6
## 2 "Great seieing you, too! Congrats for your wins as well. Every~ twitt~ 3.33e6
## 3 "thanks so much, Lyndsay - for the warm welcome back and for y~ twitt~ 3.27e6
## 4 "Up way too early so I can be at my school by 7 for a literary~ twitt~ 2.78e6
## 5 "Just handed Kraig Karson Lil Wayne's \"Green & Yellow\" to de~ twitt~ 3.72e6
## 6 "EK Success border punch." blog 4.03e5
## 7 "Don’t think in terms of creating for a child; think of creati~ blog 5.83e5
## 8 "Wanna see it go mano-a-mano w/ RT think Monkey Puzzle tree b/~ twitt~ 2.77e6
## 9 "\"Do you close a fire station, or do you close mounted patrol~ news 9.45e5
## 10 "616. Salad @ Tiki Bar (Spring Mt., PA) 4:48 p.m." blog 4.02e5
Sample size after initial filter:
Data set sizes:
The following process is performed on the subsets prepared above:
Prepared data sets for later use in modelling.
The Quanteda package is used to create the corpus representing all the text in the sample and for all subsequent data analysis in this document.
Tokenisation is the process of splitting the corpus into sentences and/or words with relative frequencies.
Sample token collections with and without stop words:
## Tokens consisting of 6 documents and 1 docvar.
## text237392 :
## [1] "i" "am" "feeling" "a" "little" "stressed" "out"
##
## text106390 :
## [1] "my" "friend" "james" "from" "atlanta" "georgia" "had"
## [8] "chosen" "to" "shave" "his" "legs"
## [ ... and 59 more ]
##
## text304108 :
## [1] "haters" "don't" "really" "hate" "you"
## [6] "they" "hate" "themselves" "cuz" "your"
## [11] "a" "reflection"
## [ ... and 6 more ]
##
## text408457 :
## [1] "i" "love" "my" "grandparents" "so"
## [6] "much"
##
## text295846 :
## [1] "omg" "yes" "i" "had" "to" "drink" "a" "ton"
## [9] "of" "water" "before" "mine"
## [ ... and 7 more ]
##
## text126055 :
## [1] "on" "the" "other" "hand" "a"
## [6] "good" "review" "is" "heartwarming" "and"
## [11] "may" "help"
## [ ... and 64 more ]
## Tokens consisting of 6 documents and 1 docvar.
## text382554 :
## [1] "new" "mone" "single" "stars"
## [5] "hands" "www.itunes.com" "search" "mack"
## [9] "mone" "doin" "bad"
##
## text345167 :
## [1] "love" "already" "new" "paltz" "girls" "looks"
##
## text342900 :
## [1] "teeth" "pants" "pants" "human" "centipede" "pants"
##
## text347518 :
## [1] "nice" "weather" "brings" "rebent" "cyclists"
## [6] "austin" "yet" "see" "one" "helmed"
## [11] "smug.looking" "bearded"
## [ ... and 2 more ]
##
## text249732 :
## [1] "little" "boy" "sister" "says" "can" "er" "comes"
##
## text322088 :
## [1] "yes" "eating" "outside" "better" "choice" "jared's"
## [7] "bo" "tweets" "cracking" "btw"
Observation word, sentence and character count by source (observations shorter than 5 characters removed)
feature | Source | mean | sd | p0 | p25 | p50 | p75 | p100 |
---|---|---|---|---|---|---|---|---|
Words | blog | 44.323272 | 48.3291100 | 0 | 9 | 30 | 63 | 1750 |
Words | news | 36.129679 | 23.6054932 | 0 | 20 | 33 | 48 | 489 |
Words | 14.599421 | 7.5825984 | 0 | 8 | 14 | 21 | 56 | |
Sentences | blog | 1.000936 | 0.0525590 | 0 | 1 | 1 | 1 | 9 |
Sentences | news | 1.000357 | 0.0288397 | 0 | 1 | 1 | 1 | 4 |
Sentences | 1.005250 | 0.0733743 | 0 | 1 | 1 | 1 | 5 | |
Characters | blog | 224.663090 | 249.7077200 | 0 | 45 | 152 | 322 | 9079 |
Characters | news | 196.154882 | 129.2513029 | 0 | 106 | 180 | 262 | 2511 |
Characters | 67.087365 | 36.3720376 | 1 | 36 | 63 | 98 | 140 |
Distribution of word count is strongly exponential.
Looking at the size of the word count table gives us the number of unique words in the corpus:
Using the following formula for combinations with repetitions:
\[ (1)\ {}_{n+r-1}C_r={\large\frac{(n+r-1)!}{r!(n-1)!}}\\ \]
We can find the possible number of n-grams for r format(unique_words, big.mark=“,”) unique words:
We’ll look at bi-grams and tri-grams occurring in the sample for both with and without stop words keeping only those that occur frequently enough to consider (40 times with stop words & 25 times without for this analysis).
Samples from the corpus:
[1] "romney_has, has_an, an_enormous, enormous_lead, lead_in, in_delegates, delegates_but, but_still"
[1] "washington_colorado, colorado_attorney, attorney_general, general_john, john_suthers, suthers_called, called_monday's, monday's_opening"
[1] "it's_not, not_uncommon, uncommon_for, for_utilities, utilities_universities, universities_and, and_even, even_state"
[1] "city_controller, controller_wendy, wendy_greuel, greuel_another, another_mayoral, mayoral_candidate, candidate_also, also_spoke"
[1] "dalia_some, some_people, people_take, take_the, the_jeans, jeans_too, too_far, far_they"
[1] ""
[1] "the_spanish, spanish_had, had_contact, contact_with, with_marino's, marino's_people, people_early, early_in"
[1] "forward_carmelo, carmelo_anthony, anthony_said, said_he, he_didn't, didn't_see, see_stoudemire, stoudemire_after"
[1] "french_interior, interior_minister, minister_michele, michele_alliot.marie, alliot.marie_said, said_friday, friday_that, that_having"
[1] "andrew's_father, father_danny, danny_had, had_kept, kept_the, the_secret, secret_of, of_his"
[1] "romney_enormous, enormous_lead, lead_delegates, delegates_still, still_must, must_reach, reach_ure, ure_nomination"
[1] "washington_colorado, colorado_attorney, attorney_general, general_john, john_suthers, suthers_called, called_monday's, monday's_opening"
[1] "uncommon_utilities, utilities_universities, universities_even, even_state, state_tax, tax_departments, departments_charge, charge_convenience"
[1] "city_controller, controller_wendy, wendy_greuel, greuel_another, another_mayoral, mayoral_candidate, candidate_also, also_spoke"
[1] "dalia_people, people_take, take_jeans, jeans_far, far_think, think_means, means_jeans, jeans_tennis"
[1] ""
[1] "spanish_contact, contact_marino's, marino's_people, people_early, early_th, th_century, century_december, december_indians"
[1] "forward_carmelo, carmelo_anthony, anthony_said, said_see, see_stoudemire, stoudemire_game, game_know, know_happened"
[1] "french_interior, interior_minister, minister_michele, michele_alliot.marie, alliot.marie_said, said_friday, friday_parliamentary, parliamentary_commission"
[1] "andrew's_father, father_danny, danny_kept, kept_secret, secret_son's, son's_return, return_home, home_deployment"
The following table displays the top 40 most frequently occurring bi-grams and tri-grams:
word1 | word2 | frequency | word1 | word2 | frequency | word1 | word2 | word3 | frequency | word1 | word2 | word3 | frequency |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
of | the | 43098 | right | now | 2435 | one | of | the | 3508 | new | york | city | 273 |
in | the | 41445 | new | york | 1990 | a | lot | of | 2939 | let | us | know | 265 |
to | the | 21565 | last | year | 1857 | thanks | for | the | 2466 | happy | mother’s | day | 194 |
for | the | 19950 | last | night | 1596 | to | be | a | 1815 | happy | mothers | day | 180 |
on | the | 19650 | high | school | 1408 | going | to | be | 1708 | happy | new | year | 178 |
to | be | 16152 | years | ago | 1388 | the | end | of | 1516 | president | barack | obama | 162 |
at | the | 14362 | last | week | 1301 | as | well | as | 1475 | two | years | ago | 139 |
and | the | 12653 | feel | like | 1273 | i | want | to | 1457 | cinco | de | mayo | 133 |
in | a | 11958 | first | time | 1227 | out | of | the | 1434 | new | york | times | 126 |
with | the | 10638 | looking | forward | 1118 | it | was | a | 1412 | world | war | ii | 125 |
is | a | 10227 | can | get | 1099 | some | of | the | 1395 | st | louis | county | 108 |
it | was | 9651 | looks | like | 1014 | be | able | to | 1333 | looking | forward | seeing | 104 |
for | a | 9395 | make | sure | 1002 | part | of | the | 1288 | gov | chris | christie | 99 |
i | have | 8798 | even | though | 940 | i | have | a | 1267 | first | time | since | 96 |
from | the | 8699 | happy | birthday | 934 | i | have | to | 1162 | two | weeks | ago | 88 |
i | was | 8503 | st | louis | 915 | looking | forward | to | 1100 | three | years | ago | 77 |
with | a | 8280 | let | know | 850 | the | rest | of | 1090 | rock | n | roll | 72 |
it | is | 8160 | good | morning | 848 | i | don’t | know | 1072 | keep | good | work | 69 |
and | i | 8134 | new | jersey | 825 | the | first | time | 1021 | new | year’s | eve | 67 |
will | be | 8071 | united | states | 805 | is | going | to | 1018 | five | years | ago | 66 |
going | to | 8004 | just | got | 794 | thank | you | for | 1014 | martin | luther | king | 66 |
of | a | 7993 | next | week | 779 | there | is | a | 989 | four | years | ago | 65 |
i | am | 7676 | every | day | 765 | a | couple | of | 984 | come | see | us | 65 |
have | a | 7506 | one | day | 744 | you | want | to | 983 | cant | wait | see | 65 |
if | you | 7492 | can | see | 740 | i’m | going | to | 981 | county | sheriff’s | office | 61 |
one | of | 7378 | good | luck | 734 | i | love | you | 969 | thanks | following | us | 60 |
is | the | 7377 | los | angeles | 727 | this | is | a | 968 | past | two | years | 58 |
to | get | 7112 | just | like | 715 | the | fact | that | 957 | couple | years | ago | 58 |
as | a | 6806 | two | years | 709 | end | of | the | 922 | love | love | love | 58 |
want | to | 6346 | look | like | 702 | it | would | be | 909 | high | school | students | 57 |
by | the | 6239 | thanks | follow | 695 | i | need | to | 909 | george | w | bush | 57 |
have | to | 6178 | next | year | 671 | you | have | to | 907 | st | patrick’s | day | 57 |
that | the | 6029 | can | make | 619 | in | the | world | 872 | blah | blah | blah | 57 |
this | is | 5924 | little | bit | 614 | can’t | wait | to | 854 | look | forward | seeing | 55 |
to | do | 5868 | every | time | 599 | to | go | to | 853 | long | time | ago | 54 |
and | a | 5765 | follow | back | 596 | this | is | the | 831 | just | got | back | 53 |
i | think | 5755 | one | thing | 585 | one | of | my | 829 | wall | street | journal | 52 |
the | first | 5676 | long | time | 576 | is | one | of | 818 | county | prosecutor’s | office | 52 |
was | a | 5672 | sounds | like | 569 | for | the | follow | 818 | good | morning | everyone | 52 |
i | don’t | 5535 | get | back | 564 | to | have | a | 815 | world | trade | center | 51 |
Plot connections between words with and without stopwords.
## IGRAPH 8fdd5b4 DN-- 80 108 --
## + attr: name (v/c), word3 (e/c), frequency (e/n)
## + edges from 8fdd5b4 (vertex names):
## [1] of ->the in ->the to ->the for ->the on ->the
## [6] to ->be at ->the and ->the in ->a with ->the
## [11] is ->a it ->was for ->a i ->have from ->the
## [16] i ->was with ->a it ->is and ->i will ->be
## [21] going->to of ->a i ->am have ->a if ->you
## [26] one ->of is ->the to ->get as ->a want ->to
## [31] by ->the have ->to that ->the this ->is to ->do
## [36] and ->a i ->think the ->first was ->a i ->don't
## + ... omitted several edges
## IGRAPH 8fe0228 DN-- 100 87 --
## + attr: name (v/c), word3 (e/c), frequency (e/n)
## + edges from 8fe0228 (vertex names):
## [1] right ->now new ->york last ->year last ->night
## [5] high ->school years ->ago last ->week feel ->like
## [9] first ->time looking->forward can ->get looks ->like
## [13] make ->sure even ->though happy ->birthday st ->louis
## [17] let ->know good ->morning new ->jersey united ->states
## [21] just ->got next ->week every ->day one ->day
## [25] can ->see good ->luck los ->angeles just ->like
## [29] two ->years look ->like thanks ->follow next ->year
## + ... omitted several edges
Another useful analysis is to use Collocation Frequency which analyses co-occurrence of words within the document (as opposed to n-grams which only considers adjacent words).
collocation | count | length | lambda | collocation | count | length | lambda |
---|---|---|---|---|---|---|---|
of the | 43098 | 2 | 1.7897815 | right now | 2435 | 2 | 4.921475 |
in the | 41445 | 2 | 2.0006764 | new york | 1990 | 2 | 9.357920 |
to the | 21565 | 2 | 0.5581746 | last year | 1857 | 2 | 4.518750 |
for the | 19950 | 2 | 1.5374918 | last night | 1596 | 2 | 4.854133 |
on the | 19650 | 2 | 1.9036605 | high school | 1408 | 2 | 6.143525 |
to be | 16152 | 2 | 2.7171194 | years ago | 1388 | 2 | 6.368092 |
at the | 14362 | 2 | 1.9618189 | last week | 1301 | 2 | 4.561878 |
and the | 12653 | 2 | 0.1130091 | feel like | 1273 | 2 | 4.131166 |
in a | 11958 | 2 | 1.1759567 | first time | 1227 | 2 | 3.293214 |
with the | 10638 | 2 | 1.2777167 | looking forward | 1118 | 2 | 6.931798 |
is a | 10227 | 2 | 1.4774927 | can get | 1099 | 2 | 2.476897 |
it was | 9651 | 2 | 3.1409912 | looks like | 1014 | 2 | 5.033158 |
for a | 9395 | 2 | 1.3485377 | make sure | 1002 | 2 | 4.749336 |
i have | 8798 | 2 | 2.4981628 | even though | 940 | 2 | 4.905264 |
from the | 8699 | 2 | 1.7909021 | happy birthday | 934 | 2 | 6.525924 |
i was | 8503 | 2 | 2.2523182 | st louis | 915 | 2 | 8.883394 |
with a | 8280 | 2 | 1.6821337 | let know | 850 | 2 | 4.447955 |
it is | 8160 | 2 | 2.3330645 | good morning | 848 | 2 | 4.376039 |
and i | 8134 | 2 | 0.9401703 | new jersey | 825 | 2 | 6.146440 |
will be | 8071 | 2 | 4.2629453 | united states | 805 | 2 | 9.187266 |
going to | 8004 | 2 | 4.1497442 | just got | 794 | 2 | 2.794885 |
of a | 7993 | 2 | 0.5152602 | next week | 779 | 2 | 4.516142 |
i am | 7676 | 2 | 5.2009930 | every day | 765 | 2 | 3.720341 |
have a | 7506 | 2 | 1.9089028 | one day | 744 | 2 | 2.179035 |
if you | 7492 | 2 | 3.7418625 | can see | 740 | 2 | 2.557233 |
one of | 7378 | 2 | 2.8619950 | good luck | 734 | 2 | 5.989462 |
is the | 7377 | 2 | 0.4116014 | los angeles | 727 | 2 | 14.808728 |
to get | 7112 | 2 | 2.7942008 | just like | 715 | 2 | 1.611962 |
as a | 6806 | 2 | 1.8993038 | two years | 709 | 2 | 3.689750 |
want to | 6346 | 2 | 3.9804638 | look like | 702 | 2 | 3.309687 |
by the | 6239 | 2 | 1.6309077 | thanks follow | 695 | 2 | 4.664213 |
have to | 6178 | 2 | 1.5172676 | next year | 671 | 2 | 3.865891 |
that the | 6029 | 2 | 0.2374993 | can make | 619 | 2 | 2.428675 |
this is | 5924 | 2 | 2.4790413 | little bit | 614 | 2 | 5.067210 |
to do | 5868 | 2 | 2.4771676 | every time | 599 | 2 | 3.223041 |
and a | 5765 | 2 | -0.0252936 | follow back | 596 | 2 | 4.101560 |
i think | 5755 | 2 | 3.9429832 | one thing | 585 | 2 | 3.147544 |
the first | 5676 | 2 | 2.7208205 | long time | 576 | 2 | 3.348739 |
was a | 5672 | 2 | 1.4030182 | sounds like | 569 | 2 | 4.776484 |
i don’t | 5535 | 2 | 3.5512796 | get back | 564 | 2 | 2.343143 |
As expected, stop words and bi-grams rate the highest.
Top features in the FCM’s:
i of and the is a you my it was
3781329 3038974 2953831 2550223 2281944 2215231 1812772 1657482 1604269 1590184
have for we with be this one who about that
1440665 1373846 1223847 1107883 1038409 1032843 927765 890596 877592 876655
like one just things love make every today
213385 182524 146327 124638 122639 122321 108857 99118
great around many time well day keep little
97714 94339 93521 89269 86399 85104 79296 76907
go life god something
76538 74719 74306 72359
Plot relationship between collocating words:
The statistic tf-idf is intended to measure how important a word is to a document in a collection (or corpus) of documents, for example, to one novel in a collection of novels or to one website in a collection of websites.
Looking at the data split by source to see if there is much change in word importance, there is a similar distribution across all sources.
Zipf’s law states that the frequency that a word appears is inversely proportional to its rank.
Plots for the three sources are nearly identical.
There is a near log-linear relationship over much of the distribution as predicted by Zipf’s Law.
Finding the linear relationship for ranks from 10 to 10000:
## (Intercept) log10(rank)
## -0.4374398 -1.1858634
Find Words with Highest TF-IDF Score:
Interesting to note, the highest ranking words for Twitter are mostly “junk” words not found in a standard English dictionary.
The English dictionary from the hunspell
package is
loaded and tokens filtered by dictionary words only.
82.8% of words in the sample are recognised in the US English dictionary (comprising of 49,271 word definitions). The remaining 17.2% will be a mixture of foreign words, abbreviations, shorthand, mispellings and unrecognised slang. Determining the proportion of these that are actually foreign words would require a significant amount of processing.
68% of the words appearing in the US English dictionary are captured by a 10% sample. To capture 90% of vocabulary would likely require a far greater sample as frequency of usage of remaining words becomes increasingly rare.
Everyday words are a much smaller subset of the full dictionary vocabulary however.
“In English, for example, 3000 words make up about 95% of everyday conversation”1
There are 33,482 US English words captured in the sample. Using 3000 as an estimate of common usage vocabulary, it is very likely that nearly all are captured by the sample.
Using the Top 3000 Words list from Education First:
The 10% sample captures 98.2% of common usage words, 68.7% of the content is in the English common usage list.
Below is an estimation of the required sample size to meet common vocabulary coverage. Rather than looking at the required number of words to be sampled, the document frequency count is used to estimate the number of documents (text samples) that would need to be sampled to achieve coverage.
While this looks at word capture rate, n-gram capture rate significantly drops off for lower sample sizes.
Analysis of a 10% sample of the source data consisting of 426,714 examples of blog, news & Twitter items produced the following findings:
Long short-term memory (LSTM) is an artificial recurrent neural network (RNN) architecture available in R through the Tensor Flow Keras package.
The huge advantage of the RNN models is that they use unsupervised learning to take past choices into account, including new vocabulary terms to build a smart, personalised self-teaching model.
I had high hopes for this but quickly ran into memory issues when trying to build the feature matrix for any sample over a few thousand sentences. A Windows laptop is not the right environment for this pursuit, neither is R, where both have tight restrictions on memory usage.
Most text prediction using this model is character based rather than word based and has a tendency to produce nonsense words.
Of the few working models I managed, training the model in small batches, prediction accuracy was low, memory usage high and response times too long to be considered practical.
Final nail in the coffin for this approach was the discovery that the Shiny Apps servers do not support Keras.
This route similarly has memory issues, and still requires large scale n-gram and skip-gram matrices taking too much memory and in the end, not producing great results - 33% prediction rate was the best I saw here and not so quick to respond.
Markov models are the class of probabilistic models that assume we can predict the probability of some future unit without looking too far into the past. Stupid back-off is a smoothing model does away with the need to store vast amounts of probabilistic data and instead uses direct relative frequencies. Katz’s Back-off may require up to 4 searches and calculations per request while the Stupid Back-off will only ever require one and making use of pre-compiled C++ code, the execution time is much quicker.
Stupid back-off takes care of unknown words by splitting chains in to linked segments thus getting rid of the need for skip-grams.
In terms of accuracy, my first model (below) was trained on a 5% sample using default parameters gave me a 46% accuracy and typical response time of 2.5ms.
For training the initial model, the train data is split 50-50 to give a 5% sample of the original data source. It is then split into sentences, stripped of any excess whitespace and any sentences shorter than 4 words are removed (since our aim to predict on the previous 3 words.
Next, an SBO predictor table is trained with 4-grams, a target of 75% coverage of the source vocabulary and a lambda penalization of 0.4. A predictor model is generated from that table.
Below, we show some example predictions and also test the memory allocation and amount of time to respond:
[1] "information about the [ event, new, people ]"
[1] "i'd like to [ see, be, share ]"
[1] "they wanted to [ be, get, know ]"
[1] "Model Memory Allocation: 38.6 MB"
test | replications | elapsed | relative | user.self | sys.self |
---|---|---|---|---|---|
Test1 | 1000 | 2.56 | 1.028 | 2.52 | 0.02 |
Test2 | 1000 | 2.49 | 1.000 | 2.48 | 0.00 |
Test3 | 1000 | 2.89 | 1.161 | 2.89 | 0.00 |
Note, there are 1000 replications in the above benchmark test, the times displayed above can be thought of as the average time for each request in milliseconds.
Evaluating the accuracy for in-sample data:
## # A tibble: 198,772 x 4
## input true preds[,1] [,2] [,3] correct
## <chr> <chr> <chr> <chr> <chr> <lgl>
## 1 "met those limits" before before <EOS> are TRUE
## 2 "team and the" other first other game TRUE
## 3 "dab stencil brush" in <EOS> and the FALSE
## 4 "potential foreclosure of" another the my another TRUE
## 5 "he even ordered" the the to a TRUE
## 6 "of what legislators" want to was <EOS> FALSE
## 7 "pm pm monday" to to through <EOS> TRUE
## 8 "responding to a" conservation new record victory FALSE
## 9 " prepeyton" years is i the FALSE
## 10 "to resolve those" cases who are and FALSE
## # ... with 198,762 more rows
## # A tibble: 1 x 2
## accuracy uncertainty
## <dbl> <dbl>
## 1 0.459 0.00112
There are several avenues to explore to improve model accuracy and performance:
Ideal improvements beyond the scope of this project:
(Benoit et al. 2018)
[@quanteda.textplots]
(Silge and Robinson 2016)
(Gherardi 2020)
Large Language Models in Machine Translation
Backoff Inspired Features for Maximum Entropy Language Models