1 Executive Summary

The following report examines the data required for the Johns Hopkins Data Science Specialization Capstone Project and explores plans for creating a word prediction app. Of three model types trialled (Recursive Neural Network with TensorFlow, Markov Chain Prediction and Stupid Back-off), the report will find the Stupid Back-off approach to be the best performer in terms of accuracy, speed and size, and simplest to create.

2 Project Task

Using data obtained from the coursera website representing publicly available text from twitter, news and blog sources, create a predictive text product that takes as input a phrase (multiple words) and predict the next word.

3 Obtaining the Data

3.1 Downloading the Source Text

The Coursera material is downloaded and extracted
An English profanity filter is downloaded, required for cleaning the data.

3.2 Load the data and combine the sources.

This project will just look at the en_US data set:

The three data sources are loaded and combined into a single data set

## # A tibble: 10 x 3
##    text                                                            source     id
##    <chr>                                                           <chr>   <int>
##  1 "I hate when people say volleyball isn't a sport."              twitt~ 2.02e6
##  2 "Great seieing you, too! Congrats for your wins as well. Every~ twitt~ 3.33e6
##  3 "thanks so much, Lyndsay - for the warm welcome back and for y~ twitt~ 3.27e6
##  4 "Up way too early so I can be at my school by 7 for a literary~ twitt~ 2.78e6
##  5 "Just handed Kraig Karson Lil Wayne's \"Green & Yellow\" to de~ twitt~ 3.72e6
##  6 "EK Success border punch."                                      blog   4.03e5
##  7 "Don’t think in terms of creating for a child; think of creati~ blog   5.83e5
##  8 "Wanna see it go mano-a-mano w/ RT think Monkey Puzzle tree b/~ twitt~ 2.77e6
##  9 "\"Do you close a fire station, or do you close mounted patrol~ news   9.45e5
## 10 "616. Salad @ Tiki Bar (Spring Mt., PA) 4:48 p.m."              blog   4.02e5

3.3 Initial Filter

The blog set contains many garbage entries with less than 5 characters. Filter all text lines from the data set shorter than 5 characters.
Duplicate lines are removed from the data set.

Sample size after initial filter:

Blogs: 897,476
News: 1,009,807
Twitter: 2,359,859

3.4 Create Sample Sets for Training, Testing and Validation

Each set is taken proportionally from the source data.
With 4,267,142 rows in the combined data set, 10% is a sufficient size for training, with 5% each for testing and validation.

Data set sizes:

Training: 426,714 rows
Testing: 213,357 rows
Validation: 213,357 rows

4 Preparing the Data Sample Sets

4.1 Clean Sample

The following process is performed on the subsets prepared above:

convert to lower-case and strip out profanity
replace smart single quote ’ with ’
replace all sentence terminating characters with a full stop
remove all remaining punctuation except single quote and full stop

4.2 Save Data Sets

Prepared data sets for later use in modelling.

4.3 Corpus

The Quanteda package is used to create the corpus representing all the text in the sample and for all subsequent data analysis in this document.

4.4 Create Tokens

Tokenisation is the process of splitting the corpus into sentences and/or words with relative frequencies.

Two sets are created, with and without stop words. Without stop words is useful for analysis, however they will be needed for text prediction later on.
A filter for profanity is applied first.
URL’s removed and any remaining symbols when creating the tokens.

Sample token collections with and without stop words:

## Tokens consisting of 6 documents and 1 docvar.
## text237392 :
## [1] "i"        "am"       "feeling"  "a"        "little"   "stressed" "out"     
## 
## text106390 :
##  [1] "my"      "friend"  "james"   "from"    "atlanta" "georgia" "had"    
##  [8] "chosen"  "to"      "shave"   "his"     "legs"   
## [ ... and 59 more ]
## 
## text304108 :
##  [1] "haters"     "don't"      "really"     "hate"       "you"       
##  [6] "they"       "hate"       "themselves" "cuz"        "your"      
## [11] "a"          "reflection"
## [ ... and 6 more ]
## 
## text408457 :
## [1] "i"            "love"         "my"           "grandparents" "so"          
## [6] "much"        
## 
## text295846 :
##  [1] "omg"    "yes"    "i"      "had"    "to"     "drink"  "a"      "ton"   
##  [9] "of"     "water"  "before" "mine"  
## [ ... and 7 more ]
## 
## text126055 :
##  [1] "on"           "the"          "other"        "hand"         "a"           
##  [6] "good"         "review"       "is"           "heartwarming" "and"         
## [11] "may"          "help"        
## [ ... and 64 more ]

## Tokens consisting of 6 documents and 1 docvar.
## text382554 :
##  [1] "new"            "mone"           "single"         "stars"         
##  [5] "hands"          "www.itunes.com" "search"         "mack"          
##  [9] "mone"           "doin"           "bad"           
## 
## text345167 :
## [1] "love"    "already" "new"     "paltz"   "girls"   "looks"  
## 
## text342900 :
## [1] "teeth"     "pants"     "pants"     "human"     "centipede" "pants"    
## 
## text347518 :
##  [1] "nice"         "weather"      "brings"       "rebent"       "cyclists"    
##  [6] "austin"       "yet"          "see"          "one"          "helmed"      
## [11] "smug.looking" "bearded"     
## [ ... and 2 more ]
## 
## text249732 :
## [1] "little" "boy"    "sister" "says"   "can"    "er"     "comes" 
## 
## text322088 :
##  [1] "yes"      "eating"   "outside"  "better"   "choice"   "jared's" 
##  [7] "bo"       "tweets"   "cracking" "btw"

There are 9,967,883 words in the corpus, 5,472,644 without stop words.

5 Exploratory Data Analysis

5.1 Corpus Structure

Observation word, sentence and character count by source (observations shorter than 5 characters removed)

feature	Source	mean	sd	p0	p25	p50	p75	p100
Words	blog	44.323272	48.3291100	0	9	30	63	1750
Words	news	36.129679	23.6054932	0	20	33	48	489
Words	twitter	14.599421	7.5825984	0	8	14	21	56
Sentences	blog	1.000936	0.0525590	0	1	1	1	9
Sentences	news	1.000357	0.0288397	0	1	1	1	4
Sentences	twitter	1.005250	0.0733743	0	1	1	1	5
Characters	blog	224.663090	249.7077200	0	45	152	322	9079
Characters	news	196.154882	129.2513029	0	106	180	262	2511
Characters	twitter	67.087365	36.3720376	1	36	63	98	140

blogs and news have similar distribution in the mid range, however news word count is higher for the lower quantile while blogs is much higher for the upper two quantiles.
tweets are lower for word count across all quantiles
sentence distribution is similar across all formats until the upper quantile, with tweets the lowest & blogs the highest.
the 140 character limitation on twitter is evident in the above table, while the blogs are also significantly longer than the news items, although the lower quantile count for news items is higher.

5.2 Distribution of Word Count

Distribution of word count is strongly exponential.

Log of word count normalises the distribution somewhat.
Tweets are heavily skewed due to the 140 character limit and peak at around 14 words
News shows a strong peak at around 60 words.
Blogs have a greater spread and shows a main peak at around 70 words with a slight secondary peak at 8 words.

5.3 Most Common Words

Looking at the size of the word count table gives us the number of unique words in the corpus:

211,231 unique words
211,056 unique words without stop words

5.4 Wordcloud

5.5 N-gram Analysis

5.5.1 Create a list of bi-grams and tri-grams.

Using the following formula for combinations with repetitions:

\[ (1)\ {}_{n+r-1}C_r={\large\frac{(n+r-1)!}{r!(n-1)!}}\\ \]

We can find the possible number of n-grams for r format(unique_words, big.mark=“,”) unique words:

22,309,373,296 possible bi-grams
1,570,825,283,144,700 possible tri-grams

We’ll look at bi-grams and tri-grams occurring in the sample for both with and without stop words keeping only those that occur frequently enough to consider (40 times with stop words & 25 times without for this analysis).

Samples from the corpus:

[1] "romney_has, has_an, an_enormous, enormous_lead, lead_in, in_delegates, delegates_but, but_still"
[1] "washington_colorado, colorado_attorney, attorney_general, general_john, john_suthers, suthers_called, called_monday's, monday's_opening"
[1] "it's_not, not_uncommon, uncommon_for, for_utilities, utilities_universities, universities_and, and_even, even_state"
[1] "city_controller, controller_wendy, wendy_greuel, greuel_another, another_mayoral, mayoral_candidate, candidate_also, also_spoke"
[1] "dalia_some, some_people, people_take, take_the, the_jeans, jeans_too, too_far, far_they"
[1] ""
[1] "the_spanish, spanish_had, had_contact, contact_with, with_marino's, marino's_people, people_early, early_in"
[1] "forward_carmelo, carmelo_anthony, anthony_said, said_he, he_didn't, didn't_see, see_stoudemire, stoudemire_after"
[1] "french_interior, interior_minister, minister_michele, michele_alliot.marie, alliot.marie_said, said_friday, friday_that, that_having"
[1] "andrew's_father, father_danny, danny_had, had_kept, kept_the, the_secret, secret_of, of_his"

[1] "romney_enormous, enormous_lead, lead_delegates, delegates_still, still_must, must_reach, reach_ure, ure_nomination"
[1] "washington_colorado, colorado_attorney, attorney_general, general_john, john_suthers, suthers_called, called_monday's, monday's_opening"
[1] "uncommon_utilities, utilities_universities, universities_even, even_state, state_tax, tax_departments, departments_charge, charge_convenience"
[1] "city_controller, controller_wendy, wendy_greuel, greuel_another, another_mayoral, mayoral_candidate, candidate_also, also_spoke"
[1] "dalia_people, people_take, take_jeans, jeans_far, far_think, think_means, means_jeans, jeans_tennis"
[1] ""
[1] "spanish_contact, contact_marino's, marino's_people, people_early, early_th, th_century, century_december, december_indians"
[1] "forward_carmelo, carmelo_anthony, anthony_said, said_see, see_stoudemire, stoudemire_game, game_know, know_happened"
[1] "french_interior, interior_minister, minister_michele, michele_alliot.marie, alliot.marie_said, said_friday, friday_parliamentary, parliamentary_commission"
[1] "andrew's_father, father_danny, danny_kept, kept_secret, secret_son's, son's_return, return_home, home_deployment"

5.5.2 Frequency of n-grams

The following table displays the top 40 most frequently occurring bi-grams and tri-grams:

Bigrams with Stopwords			Bigrams w/o Stopwords			Trigrams with Stopwords				Trigrams w/o Stopwords
word1	word2	frequency	word1	word2	frequency	word1	word2	word3	frequency	word1	word2	word3	frequency
of	the	43098	right	now	2435	one	of	the	3508	new	york	city	273
in	the	41445	new	york	1990	a	lot	of	2939	let	us	know	265
to	the	21565	last	year	1857	thanks	for	the	2466	happy	mother’s	day	194
for	the	19950	last	night	1596	to	be	a	1815	happy	mothers	day	180
on	the	19650	high	school	1408	going	to	be	1708	happy	new	year	178
to	be	16152	years	ago	1388	the	end	of	1516	president	barack	obama	162
at	the	14362	last	week	1301	as	well	as	1475	two	years	ago	139
and	the	12653	feel	like	1273	i	want	to	1457	cinco	de	mayo	133
in	a	11958	first	time	1227	out	of	the	1434	new	york	times	126
with	the	10638	looking	forward	1118	it	was	a	1412	world	war	ii	125
is	a	10227	can	get	1099	some	of	the	1395	st	louis	county	108
it	was	9651	looks	like	1014	be	able	to	1333	looking	forward	seeing	104
for	a	9395	make	sure	1002	part	of	the	1288	gov	chris	christie	99
i	have	8798	even	though	940	i	have	a	1267	first	time	since	96
from	the	8699	happy	birthday	934	i	have	to	1162	two	weeks	ago	88
i	was	8503	st	louis	915	looking	forward	to	1100	three	years	ago	77
with	a	8280	let	know	850	the	rest	of	1090	rock	n	roll	72
it	is	8160	good	morning	848	i	don’t	know	1072	keep	good	work	69
and	i	8134	new	jersey	825	the	first	time	1021	new	year’s	eve	67
will	be	8071	united	states	805	is	going	to	1018	five	years	ago	66
going	to	8004	just	got	794	thank	you	for	1014	martin	luther	king	66
of	a	7993	next	week	779	there	is	a	989	four	years	ago	65
i	am	7676	every	day	765	a	couple	of	984	come	see	us	65
have	a	7506	one	day	744	you	want	to	983	cant	wait	see	65
if	you	7492	can	see	740	i’m	going	to	981	county	sheriff’s	office	61
one	of	7378	good	luck	734	i	love	you	969	thanks	following	us	60
is	the	7377	los	angeles	727	this	is	a	968	past	two	years	58
to	get	7112	just	like	715	the	fact	that	957	couple	years	ago	58
as	a	6806	two	years	709	end	of	the	922	love	love	love	58
want	to	6346	look	like	702	it	would	be	909	high	school	students	57
by	the	6239	thanks	follow	695	i	need	to	909	george	w	bush	57
have	to	6178	next	year	671	you	have	to	907	st	patrick’s	day	57
that	the	6029	can	make	619	in	the	world	872	blah	blah	blah	57
this	is	5924	little	bit	614	can’t	wait	to	854	look	forward	seeing	55
to	do	5868	every	time	599	to	go	to	853	long	time	ago	54
and	a	5765	follow	back	596	this	is	the	831	just	got	back	53
i	think	5755	one	thing	585	one	of	my	829	wall	street	journal	52
the	first	5676	long	time	576	is	one	of	818	county	prosecutor’s	office	52
was	a	5672	sounds	like	569	for	the	follow	818	good	morning	everyone	52
i	don’t	5535	get	back	564	to	have	a	815	world	trade	center	51

Frequently occurring bi-gram count: 23,701 with stop words, 8,054 without stop words.
Frequently occurring tri-gram count: 7,839 with stop words, 212 without stop words.
Bigrams have a much higher frequency than trigrams (top trigram counts are approximately 10% of those for bigrams).
N-grams without stopwords have a much lower frequency, particularly trigrams where the highest score (not counting proper nouns) occurred only 265 times in 427K observations (0.06%).

5.5.3 Relationships Between Words in Bigrams

Plot connections between words with and without stopwords.

## IGRAPH 8fdd5b4 DN-- 80 108 -- 
## + attr: name (v/c), word3 (e/c), frequency (e/n)
## + edges from 8fdd5b4 (vertex names):
##  [1] of   ->the   in   ->the   to   ->the   for  ->the   on   ->the  
##  [6] to   ->be    at   ->the   and  ->the   in   ->a     with ->the  
## [11] is   ->a     it   ->was   for  ->a     i    ->have  from ->the  
## [16] i    ->was   with ->a     it   ->is    and  ->i     will ->be   
## [21] going->to    of   ->a     i    ->am    have ->a     if   ->you  
## [26] one  ->of    is   ->the   to   ->get   as   ->a     want ->to   
## [31] by   ->the   have ->to    that ->the   this ->is    to   ->do   
## [36] and  ->a     i    ->think the  ->first was  ->a     i    ->don't
## + ... omitted several edges

## IGRAPH 8fe0228 DN-- 100 87 -- 
## + attr: name (v/c), word3 (e/c), frequency (e/n)
## + edges from 8fe0228 (vertex names):
##  [1] right  ->now      new    ->york     last   ->year     last   ->night   
##  [5] high   ->school   years  ->ago      last   ->week     feel   ->like    
##  [9] first  ->time     looking->forward  can    ->get      looks  ->like    
## [13] make   ->sure     even   ->though   happy  ->birthday st     ->louis   
## [17] let    ->know     good   ->morning  new    ->jersey   united ->states  
## [21] just   ->got      next   ->week     every  ->day      one    ->day     
## [25] can    ->see      good   ->luck     los    ->angeles  just   ->like    
## [29] two    ->years    look   ->like     thanks ->follow   next   ->year    
## + ... omitted several edges

5.6 Feature Co-occurrence Matrix (FCM)

Another useful analysis is to use Collocation Frequency which analyses co-occurrence of words within the document (as opposed to n-grams which only considers adjacent words).

Collocations with Stopwords				Collocations without Stopwords
collocation	count	length	lambda	collocation	count	length	lambda
of the	43098	2	1.7897815	right now	2435	2	4.921475
in the	41445	2	2.0006764	new york	1990	2	9.357920
to the	21565	2	0.5581746	last year	1857	2	4.518750
for the	19950	2	1.5374918	last night	1596	2	4.854133
on the	19650	2	1.9036605	high school	1408	2	6.143525
to be	16152	2	2.7171194	years ago	1388	2	6.368092
at the	14362	2	1.9618189	last week	1301	2	4.561878
and the	12653	2	0.1130091	feel like	1273	2	4.131166
in a	11958	2	1.1759567	first time	1227	2	3.293214
with the	10638	2	1.2777167	looking forward	1118	2	6.931798
is a	10227	2	1.4774927	can get	1099	2	2.476897
it was	9651	2	3.1409912	looks like	1014	2	5.033158
for a	9395	2	1.3485377	make sure	1002	2	4.749336
i have	8798	2	2.4981628	even though	940	2	4.905264
from the	8699	2	1.7909021	happy birthday	934	2	6.525924
i was	8503	2	2.2523182	st louis	915	2	8.883394
with a	8280	2	1.6821337	let know	850	2	4.447955
it is	8160	2	2.3330645	good morning	848	2	4.376039
and i	8134	2	0.9401703	new jersey	825	2	6.146440
will be	8071	2	4.2629453	united states	805	2	9.187266
going to	8004	2	4.1497442	just got	794	2	2.794885
of a	7993	2	0.5152602	next week	779	2	4.516142
i am	7676	2	5.2009930	every day	765	2	3.720341
have a	7506	2	1.9089028	one day	744	2	2.179035
if you	7492	2	3.7418625	can see	740	2	2.557233
one of	7378	2	2.8619950	good luck	734	2	5.989462
is the	7377	2	0.4116014	los angeles	727	2	14.808728
to get	7112	2	2.7942008	just like	715	2	1.611962
as a	6806	2	1.8993038	two years	709	2	3.689750
want to	6346	2	3.9804638	look like	702	2	3.309687
by the	6239	2	1.6309077	thanks follow	695	2	4.664213
have to	6178	2	1.5172676	next year	671	2	3.865891
that the	6029	2	0.2374993	can make	619	2	2.428675
this is	5924	2	2.4790413	little bit	614	2	5.067210
to do	5868	2	2.4771676	every time	599	2	3.223041
and a	5765	2	-0.0252936	follow back	596	2	4.101560
i think	5755	2	3.9429832	one thing	585	2	3.147544
the first	5676	2	2.7208205	long time	576	2	3.348739
was a	5672	2	1.4030182	sounds like	569	2	4.776484
i don’t	5535	2	3.5512796	get back	564	2	2.343143

As expected, stop words and bi-grams rate the highest.

Top features in the FCM’s:

      i      of     and     the      is       a     you      my      it     was 
3781329 3038974 2953831 2550223 2281944 2215231 1812772 1657482 1604269 1590184 
   have     for      we    with      be    this     one     who   about    that 
1440665 1373846 1223847 1107883 1038409 1032843  927765  890596  877592  876655

     like       one      just    things      love      make     every     today 
   213385    182524    146327    124638    122639    122321    108857     99118 
    great    around      many      time      well       day      keep    little 
    97714     94339     93521     89269     86399     85104     79296     76907 
       go      life       god something 
    76538     74719     74306     72359

Plot relationship between collocating words:

5.7 Term Frequency - Inverse Document Frequency (TF-IDF)

The statistic tf-idf is intended to measure how important a word is to a document in a collection (or corpus) of documents, for example, to one novel in a collection of novels or to one website in a collection of websites.

5.7.1 Term Frequency Calculation

Looking at the data split by source to see if there is much change in word importance, there is a similar distribution across all sources.

5.7.2 Zipf’s law

Zipf’s law states that the frequency that a word appears is inversely proportional to its rank.

Plots for the three sources are nearly identical.

There is a near log-linear relationship over much of the distribution as predicted by Zipf’s Law.

Finding the linear relationship for ranks from 10 to 10000:

## (Intercept) log10(rank) 
##  -0.4374398  -1.1858634

5.7.3 TF-IDF

Find Words with Highest TF-IDF Score:

Interesting to note, the highest ranking words for Twitter are mostly “junk” words not found in a standard English dictionary.

5.8 Analysing Language

The English dictionary from the hunspell package is loaded and tokens filtered by dictionary words only.

82.8% of words in the sample are recognised in the US English dictionary (comprising of 49,271 word definitions). The remaining 17.2% will be a mixture of foreign words, abbreviations, shorthand, mispellings and unrecognised slang. Determining the proportion of these that are actually foreign words would require a significant amount of processing.

68% of the words appearing in the US English dictionary are captured by a 10% sample. To capture 90% of vocabulary would likely require a far greater sample as frequency of usage of remaining words becomes increasingly rare.

Everyday words are a much smaller subset of the full dictionary vocabulary however.

“In English, for example, 3000 words make up about 95% of everyday conversation”¹

There are 33,482 US English words captured in the sample. Using 3000 as an estimate of common usage vocabulary, it is very likely that nearly all are captured by the sample.

Using the Top 3000 Words list from Education First:

The 10% sample captures 98.2% of common usage words, 68.7% of the content is in the English common usage list.

Below is an estimation of the required sample size to meet common vocabulary coverage. Rather than looking at the required number of words to be sampled, the document frequency count is used to estimate the number of documents (text samples) that would need to be sampled to achieve coverage.

Samples required to capture 50% of common usage words: 69
Samples required to capture 90% of common usage words: 1018
The number of samples required grows exponentially with coverage required.

While this looks at word capture rate, n-gram capture rate significantly drops off for lower sample sizes.

5.9 EDA Summary

Analysis of a 10% sample of the source data consisting of 426,714 examples of blog, news & Twitter items produced the following findings:

82.8% of words used were found in the US English dictionary, 68.7% coming from the 3000 most used words in English. The sample covers 68% of the US English dictionary and 98.2% of the 3000 most used words.
The sample produces 211,231 unique words with 22,309,373,296 possible bi-grams and 1,570,825,283,144,700 possible tri-grams.
23,701 bi-grams occurring at least 40 times were found, and 7,839 tri-grams with a frequency of at least 25.

5.10 Further Thoughts

Even with a sample size of 426,714, the number of encountered tri-grams is relatively low.
Vectorising the vocabulary in the model will be necessary (words are indexed and represented as digits).
Constructing a probability matrix of expected n-grams is not feasible as the memory required for such a matrix would run into hundreds of gigabytes (a test of a vectorised feature sequence matrix on only 11,000 words needed 176GB memory).
Basic n-gram matrices do not take unknown words into consideration, some form of algorithm to deal with this is required.

6 Creating a Prediction Algorithm

6.1 Trialled Model Types

6.1.1 Long Short-term Memory (LTSM)

Long short-term memory (LSTM) is an artificial recurrent neural network (RNN) architecture available in R through the Tensor Flow Keras package.

The huge advantage of the RNN models is that they use unsupervised learning to take past choices into account, including new vocabulary terms to build a smart, personalised self-teaching model.

I had high hopes for this but quickly ran into memory issues when trying to build the feature matrix for any sample over a few thousand sentences. A Windows laptop is not the right environment for this pursuit, neither is R, where both have tight restrictions on memory usage.

Most text prediction using this model is character based rather than word based and has a tendency to produce nonsense words.

Of the few working models I managed, training the model in small batches, prediction accuracy was low, memory usage high and response times too long to be considered practical.

Final nail in the coffin for this approach was the discovery that the Shiny Apps servers do not support Keras.

6.1.2 Markov-Chain with Katz’s Back-off

This route similarly has memory issues, and still requires large scale n-gram and skip-gram matrices taking too much memory and in the end, not producing great results - 33% prediction rate was the best I saw here and not so quick to respond.

6.1.3 Stupid Back-off Model

Markov models are the class of probabilistic models that assume we can predict the probability of some future unit without looking too far into the past. Stupid back-off is a smoothing model does away with the need to store vast amounts of probabilistic data and instead uses direct relative frequencies. Katz’s Back-off may require up to 4 searches and calculations per request while the Stupid Back-off will only ever require one and making use of pre-compiled C++ code, the execution time is much quicker.

Stupid back-off takes care of unknown words by splitting chains in to linked segments thus getting rid of the need for skip-grams.

In terms of accuracy, my first model (below) was trained on a 5% sample using default parameters gave me a 46% accuracy and typical response time of 2.5ms.

6.2 Testing the Initial Model

For training the initial model, the train data is split 50-50 to give a 5% sample of the original data source. It is then split into sentences, stripped of any excess whitespace and any sentences shorter than 4 words are removed (since our aim to predict on the previous 3 words.

Next, an SBO predictor table is trained with 4-grams, a target of 75% coverage of the source vocabulary and a lambda penalization of 0.4. A predictor model is generated from that table.

Below, we show some example predictions and also test the memory allocation and amount of time to respond:

[1] "information about the [ event, new, people ]"

[1] "i'd like to [ see, be, share ]"

[1] "they wanted to [ be, get, know ]"

[1] "Model Memory Allocation:  38.6 MB"

test	replications	elapsed	relative	user.self	sys.self
Test1	1000	2.56	1.028	2.52	0.02
Test2	1000	2.49	1.000	2.48	0.00
Test3	1000	2.89	1.161	2.89	0.00

Note, there are 1000 replications in the above benchmark test, the times displayed above can be thought of as the average time for each request in milliseconds.

Evaluating the accuracy for in-sample data:

## # A tibble: 198,772 x 4
##    input                      true         preds[,1] [,2]    [,3]    correct
##    <chr>                      <chr>        <chr>     <chr>   <chr>   <lgl>  
##  1 "met those limits"         before       before    <EOS>   are     TRUE   
##  2 "team and the"             other        first     other   game    TRUE   
##  3 "dab stencil brush"        in           <EOS>     and     the     FALSE  
##  4 "potential foreclosure of" another      the       my      another TRUE   
##  5 "he even ordered"          the          the       to      a       TRUE   
##  6 "of what legislators"      want         to        was     <EOS>   FALSE  
##  7 "pm pm monday"             to           to        through <EOS>   TRUE   
##  8 "responding to a"          conservation new       record  victory FALSE  
##  9 "  prepeyton"              years        is        i       the     FALSE  
## 10 "to resolve those"         cases        who       are     and     FALSE  
## # ... with 198,762 more rows

## # A tibble: 1 x 2
##   accuracy uncertainty
##      <dbl>       <dbl>
## 1    0.459     0.00112

6.3 Considerations for Improvement

There are several avenues to explore to improve model accuracy and performance:

Consider training the data with news and blog data only - the Twitter data tends to contain a lot of slang and abbreviations that could be inflating the vocabulary size.
Trial a larger sample size and also greater target dictionary coverage.
Test different values for the penalty value lambda.
Detect end-of-sentence punctuation in the user input and capitalize suggestions where appropriate.

Ideal improvements beyond the scope of this project:

Self-learning dictionary with weighting for user defined words and previously selected predictions.

7 References

(Benoit et al. 2018)

[@quanteda.textplots]

(Silge and Robinson 2016)

(Gherardi 2020)

Large Language Models in Machine Translation

N-gram Language Models

Backoff Inspired Features for Maximum Entropy Language Models

Benoit, Kenneth, Kohei Watanabe, Haiyan Wang, Paul Nulty, Adam Obeng, Stefan Müller, and Akitaka Matsuo. 2018. “Quanteda: An r Package for the Quantitative Analysis of Textual Data” 3: 774. https://doi.org/10.21105/joss.00774.

Gherardi, Valerio. 2020. “Sbo: Text Prediction via Stupid Back-Off n-Gram Models.” https://CRAN.R-project.org/package=sbo.

Silge, Julia, and David Robinson. 2016. “Tidytext: Text Mining and Analysis Using Tidy Data Principles in r” 1. https://doi.org/10.21105/joss.00037.

https://www.fluentu.com/blog/how-many-words-do-i-need-to-know/↩︎

Johns Hopkins University Data Science Capstone Project
Milestone Report

Richard Allen

2022-04-24

1 Executive Summary

2 Project Task

3 Obtaining the Data

3.1 Downloading the Source Text

3.2 Load the data and combine the sources.

3.3 Initial Filter

3.4 Create Sample Sets for Training, Testing and Validation

4 Preparing the Data Sample Sets

4.1 Clean Sample

4.2 Save Data Sets

4.3 Corpus

4.4 Create Tokens

5 Exploratory Data Analysis

5.1 Corpus Structure

5.2 Distribution of Word Count

5.3 Most Common Words

5.4 Wordcloud

5.5 N-gram Analysis

5.5.1 Create a list of bi-grams and tri-grams.

5.5.2 Frequency of n-grams

5.5.3 Relationships Between Words in Bigrams

5.6 Feature Co-occurrence Matrix (FCM)

5.7 Term Frequency - Inverse Document Frequency (TF-IDF)

5.7.1 Term Frequency Calculation

5.7.2 Zipf’s law

5.7.3 TF-IDF

5.8 Analysing Language

5.9 EDA Summary

5.10 Further Thoughts

6 Creating a Prediction Algorithm

6.1 Trialled Model Types

6.1.1 Long Short-term Memory (LTSM)

6.1.2 Markov-Chain with Katz’s Back-off

6.1.3 Stupid Back-off Model

6.2 Testing the Initial Model

6.3 Considerations for Improvement

7 References

Johns Hopkins University Data Science Capstone ProjectMilestone Report

Richard Allen

2022-04-24

1 Executive Summary

2 Project Task

3 Obtaining the Data

3.1 Downloading the Source Text

3.2 Load the data and combine the sources.

3.3 Initial Filter

3.4 Create Sample Sets for Training, Testing and Validation

4 Preparing the Data Sample Sets

4.1 Clean Sample

4.2 Save Data Sets

4.3 Corpus

4.4 Create Tokens

5 Exploratory Data Analysis

5.1 Corpus Structure

5.2 Distribution of Word Count

5.3 Most Common Words

5.4 Wordcloud

5.5 N-gram Analysis

5.5.1 Create a list of bi-grams and tri-grams.

5.5.2 Frequency of n-grams

5.5.3 Relationships Between Words in Bigrams

5.6 Feature Co-occurrence Matrix (FCM)

5.7 Term Frequency - Inverse Document Frequency (TF-IDF)

5.7.1 Term Frequency Calculation

5.7.2 Zipf’s law

5.7.3 TF-IDF

5.8 Analysing Language

5.9 EDA Summary

5.10 Further Thoughts

6 Creating a Prediction Algorithm

6.1 Trialled Model Types

6.1.1 Long Short-term Memory (LTSM)

6.1.2 Markov-Chain with Katz’s Back-off

6.1.3 Stupid Back-off Model

6.2 Testing the Initial Model

6.3 Considerations for Improvement

7 References

Johns Hopkins University Data Science Capstone Project
Milestone Report