1 Executive Summary

The following report examines the data required for the Johns Hopkins Data Science Specialization Capstone Project and explores plans for creating a word prediction app. Of three model types trialled (Recurrent Neural Network with TensorFlow, Markov Chain Prediction and Stupid Back-off), the report finds the Stupid Back-off approach to be the best performer in terms of accuracy, speed and size, and the simplest to create.

2 Project Task

Using data obtained from the Coursera website representing publicly available text from Twitter, news and blog sources, create a predictive text product that takes a phrase (multiple words) as input and predicts the next word.

3 Obtaining the Data

3.1 Downloading the Source Text

3.2 Load the data and combine the sources.

This project will just look at the en_US data set:

  • The three data sources are loaded and combined into a single data set
## # A tibble: 10 x 3
##    text                                                            source     id
##    <chr>                                                           <chr>   <int>
##  1 "I hate when people say volleyball isn't a sport."              twitt~ 2.02e6
##  2 "Great seieing you, too! Congrats for your wins as well. Every~ twitt~ 3.33e6
##  3 "thanks so much, Lyndsay - for the warm welcome back and for y~ twitt~ 3.27e6
##  4 "Up way too early so I can be at my school by 7 for a literary~ twitt~ 2.78e6
##  5 "Just handed Kraig Karson Lil Wayne's \"Green & Yellow\" to de~ twitt~ 3.72e6
##  6 "EK Success border punch."                                      blog   4.03e5
##  7 "Don’t think in terms of creating for a child; think of creati~ blog   5.83e5
##  8 "Wanna see it go mano-a-mano w/ RT think Monkey Puzzle tree b/~ twitt~ 2.77e6
##  9 "\"Do you close a fire station, or do you close mounted patrol~ news   9.45e5
## 10 "616. Salad @ Tiki Bar (Spring Mt., PA) 4:48 p.m."              blog   4.02e5

3.3 Initial Filter

  • The blog set contains many garbage entries with fewer than 5 characters. All text lines shorter than 5 characters are filtered out of the data set.
  • Duplicate lines are removed from the data set.
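The two filter steps above can be illustrated with a minimal Python sketch. This is not the code used for the report (which was produced in R); the function name `initial_filter`, the whitespace trimming, and the sample lines are assumptions made for illustration.

```python
def initial_filter(lines, min_chars=5):
    """Drop lines shorter than min_chars and exact duplicates (first occurrence kept)."""
    seen = set()
    kept = []
    for line in lines:
        text = line.strip()  # assumption: surrounding whitespace is not counted
        if len(text) < min_chars or text in seen:
            continue
        seen.add(text)
        kept.append(text)
    return kept


sample = ["ok", "EK Success border punch.", "EK Success border punch.", "   ", "Great day!"]
filtered = initial_filter(sample)
print(filtered)
```

Keeping the first occurrence of each duplicate preserves the original ordering of the data, which makes the filter easy to audit against the source files.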

Sample size after initial filter:

  • Blogs: 897,476
  • News: 1,009,807
  • Twitter: 2,359,859

3.4 Create Sample Sets for Training, Testing and Validation

  • Each set is taken proportionally from the source data.
  • With 4,267,142 rows in the combined data set, 10% is a sufficient size for training, with 5% each for testing and validation.

Data set sizes:

  • Training: 426,714 rows
  • Testing: 213,357 rows
  • Validation: 213,357 rows
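A proportional (stratified) split of this kind might look like the following sketch. It is written in Python for illustration only; the function name `stratified_split`, the seed, and the toy rows are hypothetical, and the report's own sampling code may differ.

```python
import random


def stratified_split(rows, fractions, seed=42):
    """Split (text, source) rows into non-overlapping subsets, sampling each source proportionally."""
    rng = random.Random(seed)
    by_source = {}
    for row in rows:
        by_source.setdefault(row[1], []).append(row)

    splits = {name: [] for name in fractions}
    for source_rows in by_source.values():
        shuffled = source_rows[:]
        rng.shuffle(shuffled)
        start = 0
        for name, fraction in fractions.items():
            size = round(len(shuffled) * fraction)
            # Consecutive slices of one shuffled list cannot overlap.
            splits[name].extend(shuffled[start:start + size])
            start += size
    return splits


rows = [(f"blog {i}", "blog") for i in range(200)] + [(f"tweet {i}", "twitter") for i in range(400)]
splits = stratified_split(rows, {"train": 0.10, "test": 0.05, "validation": 0.05})
print({name: len(subset) for name, subset in splits.items()})
```

Slicing each source separately keeps the blog, news and Twitter proportions of the full data set in every subset, and guarantees that no row appears in more than one subset.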

4 Preparing the Data Sample Sets

4.1 Clean Sample

The following process is performed on the subsets prepared above:

  1. convert to lower-case and strip out profanity
  2. replace smart single quote ’ with a straight apostrophe '
  3. replace all sentence terminating characters with a full stop
  4. remove all remaining punctuation except single quote and full stop
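The four cleaning steps can be sketched as below. This is an illustrative Python version under stated assumptions: the profanity list is a two-word placeholder, and "sentence terminating characters" is taken to mean `.`, `!` and `?`. The report's actual R code and word list are not shown in the source.

```python
import re

PROFANITY = {"darn", "heck"}  # hypothetical placeholder list


def clean_text(text, profanity=PROFANITY):
    # Step 1: lower-case and strip profane words (whole-word matches only).
    text = text.lower()
    pattern = r"\b(?:" + "|".join(map(re.escape, sorted(profanity))) + r")\b"
    text = re.sub(pattern, " ", text)
    # Step 2: smart single quote -> straight apostrophe.
    text = text.replace("\u2019", "'")
    # Step 3: any run of sentence terminators -> a single full stop.
    text = re.sub(r"[.!?]+", ".", text)
    # Step 4: remove remaining punctuation except apostrophe and full stop.
    text = re.sub(r"[^\w\s'.]", " ", text)
    # Collapse the whitespace left behind by the removals.
    return " ".join(text.split())


cleaned = clean_text("Don\u2019t stop!! Heck, it's GREAT: really?")
print(cleaned)
```

Keeping the apostrophe and full stop lets later stages treat contractions such as "don't" as single tokens and still split the text into sentences.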

4.2 Save Data Sets

The prepared data sets are saved for later use in modelling.

4.3 Corpus

The Quanteda package is used to create the corpus representing all the text in the sample and for all subsequent data analysis in this document.

4.4 Create Tokens

Tokenisation is the process of splitting the corpus into sentences and/or words, from which relative frequencies can then be counted.

  • Two sets are created, with and without stop words. The set without stop words is useful for analysis, but stop words will be needed for text prediction later on.
  • A filter for profanity is applied first.
  • URLs and any remaining symbols are removed when creating the tokens.

Sample token collections with and without stop words:

## Tokens consisting of 6 documents and 1 docvar.
## text237392 :
## [1] "i"        "am"       "feeling"  "a"        "little"   "stressed" "out"     
## 
## text106390 :
##  [1] "my"      "friend"  "james"   "from"    "atlanta" "georgia" "had"    
##  [8] "chosen"  "to"      "shave"   "his"     "legs"   
## [ ... and 59 more ]
## 
## text304108 :
##  [1] "haters"     "don't"      "really"     "hate"       "you"       
##  [6] "they"       "hate"       "themselves" "cuz"        "your"      
## [11] "a"          "reflection"
## [ ... and 6 more ]
## 
## text408457 :
## [1] "i"            "love"         "my"           "grandparents" "so"          
## [6] "much"        
## 
## text295846 :
##  [1] "omg"    "yes"    "i"      "had"    "to"     "drink"  "a"      "ton"   
##  [9] "of"     "water"  "before" "mine"  
## [ ... and 7 more ]
## 
## text126055 :
##  [1] "on"           "the"          "other"        "hand"         "a"           
##  [6] "good"         "review"       "is"           "heartwarming" "and"         
## [11] "may"          "help"        
## [ ... and 64 more ]
## Tokens consisting of 6 documents and 1 docvar.
## text382554 :
##  [1] "new"            "mone"           "single"         "stars"         
##  [5] "hands"          "www.itunes.com" "search"         "mack"          
##  [9] "mone"           "doin"           "bad"           
## 
## text345167 :
## [1] "love"    "already" "new"     "paltz"   "girls"   "looks"  
## 
## text342900 :
## [1] "teeth"     "pants"     "pants"     "human"     "centipede" "pants"    
## 
## text347518 :
##  [1] "nice"         "weather"      "brings"       "rebent"       "cyclists"    
##  [6] "austin"       "yet"          "see"          "one"          "helmed"      
## [11] "smug.looking" "bearded"     
## [ ... and 2 more ]
## 
## text249732 :
## [1] "little" "boy"    "sister" "says"   "can"    "er"     "comes" 
## 
## text322088 :
##  [1] "yes"      "eating"   "outside"  "better"   "choice"   "jared's" 
##  [7] "bo"       "tweets"   "cracking" "btw"
  • There are 9,967,883 words in the corpus, 5,472,644 without stop words.

5 Exploratory Data Analysis

5.1 Corpus Structure

Observation word, sentence and character count by source (observations shorter than 5 characters removed)

Feature     Source   mean        sd           p0  p25  p50  p75  p100
Words       blog     44.323272   48.3291100   0   9    30   63   1750
Words       news     36.129679   23.6054932   0   20   33   48   489
Words       twitter  14.599421   7.5825984    0   8    14   21   56
Sentences   blog     1.000936    0.0525590    0   1    1    1    9
Sentences   news     1.000357    0.0288397    0   1    1    1    4
Sentences   twitter  1.005250    0.0733743    0   1    1    1    5
Characters  blog     224.663090  249.7077200  0   45   152  322  9079
Characters  news     196.154882  129.2513029  0   106  180  262  2511
Characters  twitter  67.087365   36.3720376   1   36   63   98   140
  • Blogs and news have similar distributions in the mid range; however, the news word count is higher at the lower quartile, while the blog word count is much higher at the upper quartile and the maximum.
  • Tweets have the lowest word count at every quantile above the minimum.
  • The sentence distribution is similar across all sources until the maximum, where news is the lowest (4) and blogs the highest (9).
  • The 140-character limit on Twitter is evident in the table above. Blogs are also significantly longer than news items, although the lower-quartile character count for news items is higher.

5.2 Distribution of Word Count

The distribution of word count is strongly exponential.

  • Taking the log of word count normalises the distribution somewhat.
  • Tweets are heavily skewed due to the 140-character limit and peak at around 14 words.
  • News shows a strong peak at around 60 words.
  • Blogs have a greater spread and show a main peak at around 70 words with a slight secondary peak at 8 words.

5.3 Most Common Words

Looking at the size of the word count table gives us the number of unique words in the corpus:

  • 211,231 unique words
  • 211,056 unique words without stop words

5.4 Wordcloud

5.5 N-gram Analysis

5.5.1 Create a list of bi-grams and tri-grams.

Using the following formula for combinations with repetitions:

\[ {}_{n+r-1}C_r = \frac{(n+r-1)!}{r!\,(n-1)!} \tag{1} \]

We can find the possible number of n-grams of length r for n = 211,231 unique words:

  • 22,309,373,296 possible bi-grams
  • 1,570,825,283,144,700 possible tri-grams
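As a quick check of the figures above, the combinations-with-repetition formula can be evaluated with exact integer arithmetic. The sketch below uses Python's standard `math.comb`; the helper name `ngram_combinations` is introduced here for illustration. Exact arithmetic may differ from the tri-gram figure quoted above in the last few digits, which would be consistent with that figure having been rounded by floating-point formatting.

```python
from math import comb


def ngram_combinations(n_unique, r):
    """Combinations with repetition: C(n + r - 1, r)."""
    return comb(n_unique + r - 1, r)


unique_words = 211_231
bigrams = ngram_combinations(unique_words, 2)
trigrams = ngram_combinations(unique_words, 3)
print(f"{bigrams:,} possible bi-grams")
print(f"{trigrams:,} possible tri-grams")
```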

We’ll look at the bi-grams and tri-grams occurring in the sample, both with and without stop words, keeping only those that occur frequently enough to consider (at least 40 times with stop words and 25 times without, for this analysis).
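The counting and thresholding described here can be sketched in a few lines of Python. This is an illustration rather than the report's quanteda code: the stop-word list is a tiny hypothetical one, the documents are invented, and whitespace splitting stands in for proper tokenisation.

```python
from collections import Counter

STOP_WORDS = {"the", "of", "a", "to", "in"}  # tiny hypothetical stop list


def ngrams(tokens, n):
    """Join each run of n adjacent tokens with an underscore."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def frequent_ngrams(documents, n, min_count, stop_words=None):
    """Count n-grams across documents and keep those seen at least min_count times."""
    counts = Counter()
    for doc in documents:
        tokens = doc.lower().split()
        if stop_words:
            # Stop words are dropped before n-grams are formed.
            tokens = [t for t in tokens if t not in stop_words]
        counts.update(ngrams(tokens, n))
    return {gram: c for gram, c in counts.items() if c >= min_count}


docs = ["one of the best", "one of the worst", "the end of the day"]
with_stop = frequent_ngrams(docs, 2, min_count=2)
without_stop = frequent_ngrams(docs, 2, min_count=1, stop_words=STOP_WORDS)
print(with_stop)
print(without_stop)
```

Because stop words are removed before the n-grams are built, the "without stop words" n-grams can join words that were not adjacent in the original text, which is the pattern seen in samples such as "romney_enormous".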

Samples from the corpus:

[1] "romney_has, has_an, an_enormous, enormous_lead, lead_in, in_delegates, delegates_but, but_still"
[1] "washington_colorado, colorado_attorney, attorney_general, general_john, john_suthers, suthers_called, called_monday's, monday's_opening"
[1] "it's_not, not_uncommon, uncommon_for, for_utilities, utilities_universities, universities_and, and_even, even_state"
[1] "city_controller, controller_wendy, wendy_greuel, greuel_another, another_mayoral, mayoral_candidate, candidate_also, also_spoke"
[1] "dalia_some, some_people, people_take, take_the, the_jeans, jeans_too, too_far, far_they"
[1] ""
[1] "the_spanish, spanish_had, had_contact, contact_with, with_marino's, marino's_people, people_early, early_in"
[1] "forward_carmelo, carmelo_anthony, anthony_said, said_he, he_didn't, didn't_see, see_stoudemire, stoudemire_after"
[1] "french_interior, interior_minister, minister_michele, michele_alliot.marie, alliot.marie_said, said_friday, friday_that, that_having"
[1] "andrew's_father, father_danny, danny_had, had_kept, kept_the, the_secret, secret_of, of_his"
[1] "romney_enormous, enormous_lead, lead_delegates, delegates_still, still_must, must_reach, reach_ure, ure_nomination"
[1] "washington_colorado, colorado_attorney, attorney_general, general_john, john_suthers, suthers_called, called_monday's, monday's_opening"
[1] "uncommon_utilities, utilities_universities, universities_even, even_state, state_tax, tax_departments, departments_charge, charge_convenience"
[1] "city_controller, controller_wendy, wendy_greuel, greuel_another, another_mayoral, mayoral_candidate, candidate_also, also_spoke"
[1] "dalia_people, people_take, take_jeans, jeans_far, far_think, think_means, means_jeans, jeans_tennis"
[1] ""
[1] "spanish_contact, contact_marino's, marino's_people, people_early, early_th, th_century, century_december, december_indians"
[1] "forward_carmelo, carmelo_anthony, anthony_said, said_see, see_stoudemire, stoudemire_game, game_know, know_happened"
[1] "french_interior, interior_minister, minister_michele, michele_alliot.marie, alliot.marie_said, said_friday, friday_parliamentary, parliamentary_commission"
[1] "andrew's_father, father_danny, danny_kept, kept_secret, secret_son's, son's_return, return_home, home_deployment"

5.5.2 Frequency of n-grams

The following table displays the top 40 most frequently occurring bi-grams and tri-grams:

Bigrams with Stopwords
Bigrams w/o Stopwords
Trigrams with Stopwords
Trigrams w/o Stopwords
word1 word2 frequency word1 word2 frequency word1 word2 word3 frequency word1 word2 word3 frequency
of the 43098 right now 2435 one of the 3508 new york city 273
in the 41445 new york 1990 a lot of 2939 let us know 265
to the 21565 last year 1857 thanks for the 2466 happy mother’s day 194
for the 19950 last night 1596 to be a 1815 happy mothers day 180
on the 19650 high school 1408 going to be 1708 happy new year 178
to be 16152 years ago 1388 the end of 1516 president barack obama 162
at the 14362 last week 1301 as well as 1475 two years ago 139
and the 12653 feel like 1273 i want to 1457 cinco de mayo 133
in a 11958 first time 1227 out of the 1434 new york times 126
with the 10638 looking forward 1118 it was a 1412 world war ii 125
is a 10227 can get 1099 some of the 1395 st louis county 108
it was 9651 looks like 1014 be able to 1333 looking forward seeing 104
for a 9395 make sure 1002 part of the 1288 gov chris christie 99
i have 8798 even though 940 i have a 1267 first time since 96
from the 8699 happy birthday 934 i have to 1162 two weeks ago 88
i was 8503 st louis 915 looking forward to 1100 three years ago 77
with a 8280 let know 850 the rest of 1090 rock n roll 72
it is 8160 good morning 848 i don’t know 1072 keep good work 69
and i 8134 new jersey 825 the first time 1021 new year’s eve 67
will be 8071 united states 805 is going to 1018 five years ago 66
going to 8004 just got 794 thank you for 1014 martin luther king 66
of a 7993 next week 779 there is a 989 four years ago 65
i am 7676 every day 765 a couple of 984 come see us 65
have a 7506 one day 744 you want to 983 cant wait see 65
if you 7492 can see 740 i’m going to 981 county sheriff’s office 61
one of 7378 good luck 734 i love you 969 thanks following us 60
is the 7377 los angeles 727 this is a 968 past two years 58
to get 7112 just like 715 the fact that 957 couple years ago 58
as a 6806 two years 709 end of the 922 love love love 58
want to 6346 look like 702 it would be 909 high school students 57
by the 6239 thanks follow 695 i need to 909 george w bush 57
have to 6178 next year 671 you have to 907 st patrick’s day 57
that the 6029 can make 619 in the world 872 blah blah blah 57
this is 5924 little bit 614 can’t wait to 854 look forward seeing 55
to do 5868 every time 599 to go to 853 long time ago 54
and a 5765 follow back 596 this is the 831 just got back 53
i think 5755 one thing 585 one of my 829 wall street journal 52
the first 5676 long time 576 is one of 818 county prosecutor’s office 52
was a 5672 sounds like 569 for the follow 818 good morning everyone 52
i don’t 5535 get back 564 to have a 815 world trade center 51
  • Frequently occurring bi-gram count: 23,701 with stop words, 8,054 without stop words.
  • Frequently occurring tri-gram count: 7,839 with stop words, 212 without stop words.
  • Bigrams have a much higher frequency than trigrams (top trigram counts are approximately 10% of those for bigrams).
  • N-grams without stopwords have a much lower frequency, particularly trigrams where the highest score (not counting proper nouns) occurred only 265 times in 427K observations (0.06%).

5.5.3 Relationships Between Words in Bigrams

Plot connections between words with and without stopwords.

## IGRAPH 8fdd5b4 DN-- 80 108 -- 
## + attr: name (v/c), word3 (e/c), frequency (e/n)
## + edges from 8fdd5b4 (vertex names):
##  [1] of   ->the   in   ->the   to   ->the   for  ->the   on   ->the  
##  [6] to   ->be    at   ->the   and  ->the   in   ->a     with ->the  
## [11] is   ->a     it   ->was   for  ->a     i    ->have  from ->the  
## [16] i    ->was   with ->a     it   ->is    and  ->i     will ->be   
## [21] going->to    of   ->a     i    ->am    have ->a     if   ->you  
## [26] one  ->of    is   ->the   to   ->get   as   ->a     want ->to   
## [31] by   ->the   have ->to    that ->the   this ->is    to   ->do   
## [36] and  ->a     i    ->think the  ->first was  ->a     i    ->don't
## + ... omitted several edges
## IGRAPH 8fe0228 DN-- 100 87 -- 
## + attr: name (v/c), word3 (e/c), frequency (e/n)
## + edges from 8fe0228 (vertex names):
##  [1] right  ->now      new    ->york     last   ->year     last   ->night   
##  [5] high   ->school   years  ->ago      last   ->week     feel   ->like    
##  [9] first  ->time     looking->forward  can    ->get      looks  ->like    
## [13] make   ->sure     even   ->though   happy  ->birthday st     ->louis   
## [17] let    ->know     good   ->morning  new    ->jersey   united ->states  
## [21] just   ->got      next   ->week     every  ->day      one    ->day     
## [25] can    ->see      good   ->luck     los    ->angeles  just   ->like    
## [29] two    ->years    look   ->like     thanks ->follow   next   ->year    
## + ... omitted several edges

5.6 Feature Co-occurrence Matrix (FCM)

Another useful analysis is collocation frequency, which analyses the co-occurrence of words within a document (as opposed to n-grams, which only consider adjacent words).
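To make the idea of document-level co-occurrence concrete, the sketch below counts, for each unordered pair of words, how many documents contain both. It is a simplified Python illustration, not the quanteda FCM used in the report: the function name and documents are hypothetical, and counting each pair once per document is an assumption.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents):
    """For each unordered word pair, count the documents in which both words appear."""
    counts = Counter()
    for doc in documents:
        vocabulary = sorted(set(doc.lower().split()))
        counts.update(combinations(vocabulary, 2))
    return counts


docs = ["i love new york", "new york is great", "i love it"]
pairs = cooccurrence_counts(docs)
print(pairs[("new", "york")], pairs[("i", "love")], pairs[("great", "love")])
```

Unlike a bi-gram count, a pair such as ("i", "york") is recorded here even though the two words are never adjacent.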

Collocations with Stopwords
Collocations without Stopwords
collocation count length lambda collocation count length lambda
of the 43098 2 1.7897815 right now 2435 2 4.921475
in the 41445 2 2.0006764 new york 1990 2 9.357920
to the 21565 2 0.5581746 last year 1857 2 4.518750
for the 19950 2 1.5374918 last night 1596 2 4.854133
on the 19650 2 1.9036605 high school 1408 2 6.143525
to be 16152 2 2.7171194 years ago 1388 2 6.368092
at the 14362 2 1.9618189 last week 1301 2 4.561878
and the 12653 2 0.1130091 feel like 1273 2 4.131166
in a 11958 2 1.1759567 first time 1227 2 3.293214
with the 10638 2 1.2777167 looking forward 1118 2 6.931798
is a 10227 2 1.4774927 can get 1099 2 2.476897
it was 9651 2 3.1409912 looks like 1014 2 5.033158
for a 9395 2 1.3485377 make sure 1002 2 4.749336
i have 8798 2 2.4981628 even though 940 2 4.905264
from the 8699 2 1.7909021 happy birthday 934 2 6.525924
i was 8503 2 2.2523182 st louis 915 2 8.883394
with a 8280 2 1.6821337 let know 850 2 4.447955
it is 8160 2 2.3330645 good morning 848 2 4.376039
and i 8134 2 0.9401703 new jersey 825 2 6.146440
will be 8071 2 4.2629453 united states 805 2 9.187266
going to 8004 2 4.1497442 just got 794 2 2.794885
of a 7993 2 0.5152602 next week 779 2 4.516142
i am 7676 2 5.2009930 every day 765 2 3.720341
have a 7506 2 1.9089028 one day 744 2 2.179035
if you 7492 2 3.7418625 can see 740 2 2.557233
one of 7378 2 2.8619950 good luck 734 2 5.989462
is the 7377 2 0.4116014 los angeles 727 2 14.808728
to get 7112 2 2.7942008 just like 715 2 1.611962
as a 6806 2 1.8993038 two years 709 2 3.689750
want to 6346 2 3.9804638 look like 702 2 3.309687
by the 6239 2 1.6309077 thanks follow 695 2 4.664213
have to 6178 2 1.5172676 next year 671 2 3.865891
that the 6029 2 0.2374993 can make 619 2 2.428675
this is 5924 2 2.4790413 little bit 614 2 5.067210
to do 5868 2 2.4771676 every time 599 2 3.223041
and a 5765 2 -0.0252936 follow back 596 2 4.101560
i think 5755 2 3.9429832 one thing 585 2 3.147544
the first 5676 2 2.7208205 long time 576 2 3.348739
was a 5672 2 1.4030182 sounds like 569 2 4.776484
i don’t 5535 2 3.5512796 get back 564 2 2.343143

As expected, stop words and bi-grams rate the highest.

Top features in the FCMs:

      i      of     and     the      is       a     you      my      it     was 
3781329 3038974 2953831 2550223 2281944 2215231 1812772 1657482 1604269 1590184 
   have     for      we    with      be    this     one     who   about    that 
1440665 1373846 1223847 1107883 1038409 1032843  927765  890596  877592  876655 
     like       one      just    things      love      make     every     today 
   213385    182524    146327    124638    122639    122321    108857     99118 
    great    around      many      time      well       day      keep    little 
    97714     94339     93521     89269     86399     85104     79296     76907 
       go      life       god something 
    76538     74719     74306     72359 

Plot relationship between collocating words:

5.7 Term Frequency - Inverse Document Frequency (TF-IDF)

The statistic tf-idf is intended to measure how important a word is to a document in a collection (or corpus) of documents, for example, to one novel in a collection of novels or to one website in a collection of websites.
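One common formulation of tf-idf is term frequency (a word's share of a document's tokens) multiplied by the natural log of the number of documents divided by the number of documents containing the word. The Python sketch below assumes that formulation; the function name and the toy documents are hypothetical, and the report's own calculation was done in R.

```python
import math
from collections import Counter


def tf_idf(documents):
    """documents: dict of name -> token list. Returns dict of name -> {term: tf-idf score}."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents.values():
        doc_freq.update(set(tokens))  # each document counts a term once

    scores = {}
    for name, tokens in documents.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores[name] = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return scores


documents = {
    "twitter": ["lol", "the", "game"],
    "news": ["the", "council", "the", "vote"],
}
scores = tf_idf(documents)
print(scores["twitter"])
```

A word that appears in every document, such as "the" here, scores zero, which is why tf-idf tends to surface source-specific vocabulary rather than stop words.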

5.7.1 Term Frequency Calculation

Looking at the data split by source to see if there is much change in word importance, there is a similar distribution across all sources.

5.7.2 Zipf’s law

Zipf’s law states that the frequency with which a word appears is inversely proportional to its rank.

Plots for the three sources are nearly identical.

There is a near log-linear relationship over much of the distribution as predicted by Zipf’s Law.

Finding the linear relationship for ranks from 10 to 10000:

## (Intercept) log10(rank) 
##  -0.4374398  -1.1858634
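A fit of this kind can be reproduced with ordinary least squares on the log-log data. The Python sketch below assumes the response is log10 of term frequency (count divided by total count) and the predictor is log10 of rank; the function name `fit_zipf` and the synthetic counts are illustrative, not the report's data.

```python
import math


def fit_zipf(frequencies, min_rank=1, max_rank=None):
    """Least-squares fit of log10(term frequency) on log10(rank); returns (intercept, slope)."""
    ordered = sorted(frequencies, reverse=True)
    total = sum(ordered)
    max_rank = max_rank or len(ordered)
    xs, ys = [], []
    for rank, freq in enumerate(ordered, start=1):
        if min_rank <= rank <= max_rank:
            xs.append(math.log10(rank))
            ys.append(math.log10(freq / total))
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Synthetic counts that follow an exact 1/rank law, so the slope should be -1.
counts = [1000 / rank for rank in range(1, 201)]
intercept, slope = fit_zipf(counts, min_rank=10, max_rank=200)
print(round(slope, 6))
```

A slope of exactly -1 corresponds to the classic statement of Zipf's law; the fitted slope of about -1.19 reported above indicates that frequency falls off somewhat faster with rank in this corpus.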

5.7.3 TF-IDF

Find Words with Highest TF-IDF Score:

Interesting to note, the highest ranking words for Twitter are mostly “junk” words not found in a standard English dictionary.

5.8 Analysing Language

The English dictionary from the hunspell package is loaded and tokens filtered by dictionary words only.

82.8% of words in the sample are recognised in the US English dictionary (comprising 49,271 word definitions). The remaining 17.2% will be a mixture of foreign words, abbreviations, shorthand, misspellings and unrecognised slang. Determining the proportion of these that are actually foreign words would require a significant amount of processing.

68% of the words appearing in the US English dictionary are captured by a 10% sample. Capturing 90% of the vocabulary would likely require a far larger sample, as the remaining words are used increasingly rarely.

Everyday words, however, are a much smaller subset of the full dictionary vocabulary.

“In English, for example, 3000 words make up about 95% of everyday conversation” [1]

There are 33,482 US English words captured in the sample. Using 3000 as an estimate of common usage vocabulary, it is very likely that nearly all are captured by the sample.

Using the Top 3000 Words list from Education First:

The 10% sample captures 98.2% of common usage words, and 68.7% of the content is in the English common usage list.

Below is an estimation of the required sample size to meet common vocabulary coverage. Rather than looking at the required number of words to be sampled, the document frequency count is used to estimate the number of documents (text samples) that would need to be sampled to achieve coverage.

  • Samples required to capture 50% of common usage words: 69
  • Samples required to capture 90% of common usage words: 1018
  • The number of samples required grows exponentially with coverage required.

While this looks at word capture rate, n-gram capture rate significantly drops off for lower sample sizes.

5.9 EDA Summary

Analysis of a 10% sample of the source data consisting of 426,714 examples of blog, news & Twitter items produced the following findings:

  • 82.8% of words used were found in the US English dictionary, 68.7% coming from the 3000 most used words in English. The sample covers 68% of the US English dictionary and 98.2% of the 3000 most used words.
  • The sample produces 211,231 unique words with 22,309,373,296 possible bi-grams and 1,570,825,283,144,700 possible tri-grams.
  • 23,701 bi-grams occurring at least 40 times were found, and 7,839 tri-grams with a frequency of at least 25.

5.10 Further Thoughts

  • Even with a sample size of 426,714, the number of encountered tri-grams is relatively low.
  • Vectorising the vocabulary in the model will be necessary (words are indexed and represented as integers).
  • Constructing a probability matrix of expected n-grams is not feasible as the memory required for such a matrix would run into hundreds of gigabytes (a test of a vectorised feature sequence matrix on only 11,000 words needed 176GB memory).
  • Basic n-gram matrices do not take unknown words into consideration; some form of algorithm to deal with this is required.

6 Creating a Prediction Algorithm

6.1 Trialled Model Types

6.1.1 Long Short-term Memory (LSTM)

Long short-term memory (LSTM) is an artificial recurrent neural network (RNN) architecture available in R through the TensorFlow Keras package.

The huge advantage of the RNN models is that they use unsupervised learning to take past choices into account, including new vocabulary terms to build a smart, personalised self-teaching model.

I had high hopes for this but quickly ran into memory issues when trying to build the feature matrix for any sample over a few thousand sentences. Neither a Windows laptop nor R is the right environment for this pursuit, as both have tight restrictions on memory usage.

Most text prediction using this model is character based rather than word based and has a tendency to produce nonsense words.

Of the few working models I managed to build by training in small batches, prediction accuracy was low, memory usage was high and response times were too long to be considered practical.

The final nail in the coffin for this approach was the discovery that the Shiny Apps servers do not support Keras.

6.1.2 Markov-Chain with Katz’s Back-off

This route has similar memory issues: it still requires large-scale n-gram and skip-gram matrices that take too much memory and, in the end, does not produce great results. A 33% prediction rate was the best I saw here, and it was not quick to respond.

6.1.3 Stupid Back-off Model

Markov models are the class of probabilistic models that assume we can predict the probability of some future unit without looking too far into the past. Stupid back-off is a smoothing model that does away with the need to store vast amounts of probabilistic data and instead uses direct relative frequencies. Katz’s Back-off may require up to 4 searches and calculations per request, while Stupid Back-off will only ever require one; because it also makes use of pre-compiled C++ code, execution is much quicker.

Stupid back-off takes care of unknown words by splitting chains into linked segments, thus getting rid of the need for skip-grams.

In terms of accuracy, my first model (below), trained on a 5% sample using default parameters, gave 46% accuracy and a typical response time of 2.5ms.

6.2 Testing the Initial Model

For training the initial model, the training data is split 50-50 to give a 5% sample of the original data source. It is then split into sentences, stripped of any excess whitespace, and any sentences shorter than 4 words are removed (since our aim is to predict on the previous 3 words).

Next, an SBO predictor table is trained with 4-grams, a target of 75% coverage of the source vocabulary and a lambda penalization of 0.4. A predictor model is generated from that table.
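To show the scoring idea behind Stupid Back-off, here is a minimal pure-Python sketch. It is not the sbo package's implementation (which the report uses, and which is compiled C++); the function names, the toy sentences and the brute-force lookup are assumptions made for illustration. The order of 4 and the lambda of 0.4 are taken from the text above. A word's score is its relative frequency after the longest matching context; each time the context is shortened by one word, the score is multiplied by lambda.

```python
from collections import Counter


def train_counts(sentences, max_n=4):
    """Count every n-gram of order 1..max_n, stored as tuples of words."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


def predict(counts, context, top_k=3, max_n=4, lam=0.4):
    """Rank next-word candidates by Stupid Back-off score."""
    context = tuple(context.lower().split())[-(max_n - 1):]
    total_unigrams = sum(c for gram, c in counts.items() if len(gram) == 1)
    scores = {}
    penalty = 1.0
    while True:
        denominator = counts[context] if context else total_unigrams
        if denominator:
            for gram, c in counts.items():
                if len(gram) == len(context) + 1 and gram[:-1] == context:
                    # Keep the score from the longest context in which the word was seen.
                    scores.setdefault(gram[-1], penalty * c / denominator)
        if not context:
            break
        context = context[1:]  # back off to a shorter context
        penalty *= lam         # and penalise the lower-order score
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


sentences = [
    "i would like to see you",
    "i would like to be there",
    "they wanted to be here",
    "i would like to see it",
]
counts = train_counts(sentences)
print(predict(counts, "i would like to"))
```

Because the scores are plain relative frequencies scaled by a fixed penalty, no normalised probability table has to be stored, which is the property the report relies on for a small, fast model. A context containing unseen words simply falls through to shorter contexts.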

Below, we show some example predictions and also test the memory allocation and amount of time to respond:

[1] "information about the [ event, new, people ]"
[1] "i'd like to [ see, be, share ]"
[1] "they wanted to [ be, get, know ]"
[1] "Model Memory Allocation:  38.6 MB"
test replications elapsed relative user.self sys.self
Test1 1000 2.56 1.028 2.52 0.02
Test2 1000 2.49 1.000 2.48 0.00
Test3 1000 2.89 1.161 2.89 0.00

Note that each test in the above benchmark runs 1000 replications, so the elapsed times (reported in seconds) can be read as the average time per request in milliseconds.

Evaluating the accuracy for in-sample data:

## # A tibble: 198,772 x 4
##    input                      true         preds[,1] [,2]    [,3]    correct
##    <chr>                      <chr>        <chr>     <chr>   <chr>   <lgl>  
##  1 "met those limits"         before       before    <EOS>   are     TRUE   
##  2 "team and the"             other        first     other   game    TRUE   
##  3 "dab stencil brush"        in           <EOS>     and     the     FALSE  
##  4 "potential foreclosure of" another      the       my      another TRUE   
##  5 "he even ordered"          the          the       to      a       TRUE   
##  6 "of what legislators"      want         to        was     <EOS>   FALSE  
##  7 "pm pm monday"             to           to        through <EOS>   TRUE   
##  8 "responding to a"          conservation new       record  victory FALSE  
##  9 "  prepeyton"              years        is        i       the     FALSE  
## 10 "to resolve those"         cases        who       are     and     FALSE  
## # ... with 198,762 more rows
## # A tibble: 1 x 2
##   accuracy uncertainty
##      <dbl>       <dbl>
## 1    0.459     0.00112
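The accuracy figure counts a prediction as correct when the true word is among the three suggestions. The uncertainty value appears consistent with a binomial standard error, since sqrt(0.459 × 0.541 / 198,772) ≈ 0.00112; treating it as such is an assumption. A minimal Python sketch, with a hypothetical function name and the first four rows of the table above as input:

```python
import math


def top_k_accuracy(true_words, predictions):
    """Share of cases where the true word is among the suggestions, with its binomial standard error."""
    n = len(true_words)
    hits = sum(truth in preds for truth, preds in zip(true_words, predictions))
    accuracy = hits / n
    std_error = math.sqrt(accuracy * (1 - accuracy) / n)
    return accuracy, std_error


truths = ["before", "other", "in", "another"]
preds = [
    ["before", "<EOS>", "are"],
    ["first", "other", "game"],
    ["<EOS>", "and", "the"],
    ["the", "my", "another"],
]
accuracy, std_error = top_k_accuracy(truths, preds)
print(accuracy, round(std_error, 4))

# Standard error implied by the reported accuracy and row count.
reported_se = math.sqrt(0.459 * (1 - 0.459) / 198772)
print(round(reported_se, 5))
```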

6.3 Considerations for Improvement

There are several avenues to explore to improve model accuracy and performance:

  • Consider training the model with news and blog data only - the Twitter data tends to contain a lot of slang and abbreviations that could be inflating the vocabulary size.
  • Trial a larger sample size and also greater target dictionary coverage.
  • Test different values for the penalty value lambda.
  • Detect end-of-sentence punctuation in the user input and capitalize suggestions where appropriate.

Ideal improvements beyond the scope of this project:

  • Self-learning dictionary with weighting for user defined words and previously selected predictions.

7 References

(Benoit et al. 2018)

(quanteda.textplots)

(Silge and Robinson 2016)

(Gherardi 2020)

Large Language Models in Machine Translation

N-gram Language Models

Backoff Inspired Features for Maximum Entropy Language Models

Benoit, Kenneth, Kohei Watanabe, Haiyan Wang, Paul Nulty, Adam Obeng, Stefan Müller, and Akitaka Matsuo. 2018. “Quanteda: An R Package for the Quantitative Analysis of Textual Data.” Journal of Open Source Software 3: 774. https://doi.org/10.21105/joss.00774.
Gherardi, Valerio. 2020. “Sbo: Text Prediction via Stupid Back-Off n-Gram Models.” https://CRAN.R-project.org/package=sbo.
Silge, Julia, and David Robinson. 2016. “Tidytext: Text Mining and Analysis Using Tidy Data Principles in R.” Journal of Open Source Software 1. https://doi.org/10.21105/joss.00037.

  1. https://www.fluentu.com/blog/how-many-words-do-i-need-to-know/