Synopsis

The Shiny Predictive Text App1 is a web-based application that suggests words the end user may wish to insert next in a text field. This report describes the technical aspects of the product.

The Text Predictive Shiny App

We analyze a large corpus of text documents (more than 4 million lines and over 102 million words) to discover relationships between words. The basic training data for this project was provided by SwiftKey2. The data3 comes from a corpus called HC Corpora, which contains four databases, one per language (German, English, Finnish, and Russian). In this project, we deal only with the English database, which consists of the following text files:

  • en_us.blogs.txt
  • en_us.news.txt
  • en_us.twitter.txt
Raw Data - Information

| Data    | Size      | Lines     | Words       | Range nchars | Avg nchars |
|---------|-----------|-----------|-------------|--------------|------------|
| Blogs   | 210.16 MB | 899,288   | 37,334,131  | 1 - 40,833   | 229.99     |
| Twitter | 167.11 MB | 2,360,148 | 30,373,583  | 2 - 213      | 68.8       |
| News    | 205.81 MB | 1,010,242 | 34,372,530  | 1 - 11,384   | 201.16     |
| Corpus  | 583.08 MB | 4,269,678 | 102,080,244 | 1 - 40,833   | 499.95     |
Note:
Range nchars: range of the number of characters per line.
Avg nchars: average number of characters per line (the Corpus row reports the sum of the three averages).

One of the main challenges in this project is the limited computational resources (memory and time). We need to find the right balance between model size and runtime in order to provide a good experience to the user. I tried nine different models, varying two factors:

  • A fraction of the original corpus (50%, 60%, or 70%)
  • An n-gram model (tri-gram, quad-gram, or quint-gram)
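
For concreteness, the nine evaluated configurations are simply the cross product of these two factors; an illustrative one-liner, not part of the original scripts:

# the nine model configurations: 3 corpus fractions x 3 n-gram orders
expand.grid(fraction = c(0.50, 0.60, 0.70), ngram = c(3, 4, 5))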

I evaluated the quality of the models (top-1 and top-3 precision, average runtime, and memory consumption) using the benchmark scripts and testing data provided by Hernán Foffani4. The testing data includes two datasets:

  • Dataset blogs (599 lines, 14,587 words)
  • Dataset tweets (793 lines, 14,071 words)
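
The benchmark repository provides the actual evaluation scripts. Purely to illustrate what top-1 and top-3 precision mean here, a minimal sketch of my own, where predict_top is a stand-in for any function returning ranked suggestions:

# Illustrative only -- not the benchmark code from the dsci-benchmark repository.
# contexts: character vector of input phrases; truths: the actual next words;
# predict_top(ctx): assumed to return the model's suggestions, best first.
precision_at <- function(contexts, truths, predict_top, k = 3) {
  hits <- mapply(function(ctx, truth) truth %in% head(predict_top(ctx), k),
                 contexts, truths)
  mean(hits) # fraction of test cases whose true next word is among the top-k suggestions
}
# top-1 precision: precision_at(contexts, truths, predict_top, k = 1)
# top-3 precision: precision_at(contexts, truths, predict_top, k = 3)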

The results of the evaluation are shown in the following table. Each row reports the evaluation of one model: the row "X% fraction & n-Gram", for \(X \in \{70, 60, 50\}\) and \(n \in \{5, 4, 3\}\), denotes an n-gram model trained on an X% fraction of the original corpus.

Model Evaluation

| Model | Top-3 Precision | Precision | Memory Used | Avg Runtime | Size |
|-------|-----------------|-----------|-------------|-------------|------|
| 70% fraction & 5-Gram | 21.87% | 13.83% | 148.03 MB | 32.45 msec | 40.7 MB |
| 70% fraction & 4-Gram | 21.84% | 13.68% | 147.97 MB | 31.61 msec | 33.9 MB |
| 70% fraction & 3-Gram | 21.20% | 12.86% | 147.91 MB | 26.83 msec | 17.4 MB |
| 60% fraction & 5-Gram | 21.80% | 13.68% | 131.50 MB | 33.95 msec | 35.8 MB |
| 60% fraction & 4-Gram | 21.75% | 13.56% | 131.44 MB | 28.97 msec | 29.0 MB |
| 60% fraction & 3-Gram | 21.25% | 12.74% | 131.39 MB | 31.61 msec | 15.2 MB |
| 50% fraction & 5-Gram | 21.48% | 13.45% | 109.23 MB | 38.03 msec | 29.3 MB |
| 50% fraction & 4-Gram | 21.46% | 13.37% | 109.17 MB | 30.40 msec | 24.0 MB |
| 50% fraction & 3-Gram | 21.06% | 12.65% | 109.12 MB | 30.15 msec | 13.0 MB |
Note:
Precision denotes the top-1 precision.
Size denotes the size of the corresponding n-gram files.
Since we apply the back-off method, an n-gram model needs to store the 1-, …, n-grams.
Avg denotes average.

As we see in the table, there is no significant difference between the precisions (both top-1 and top-3) of the models. Therefore, based on this evaluation, I decided to work with a 50% fraction of the original corpus and to implement a quad-gram model on it.

After taking a random sample of the corpus, we perform the following preprocessing steps on the sample data:

  • lower-case conversion
  • removing hyphens
  • removing Twitter and other symbols
  • removing separators (white-spaces)
  • removing punctuation
  • removing numbers
  • removing profanities
  • removing non-English words

We then extract uni-grams (words), bi-grams (two consecutive words), tri-grams (three consecutive words), and quad-grams (four consecutive words) from the clean data and present several interesting results about them. We also perform some exploratory analysis to understand the distributions of term frequencies in the n-grams.

Next, we build an n-gram model to predict the next word given a phrase. To further optimize memory usage, we apply several more preprocessing steps to the n-grams. We follow the Stupid Backoff method5 in building our model. That is, to predict the next word, we first use the quad-gram probability. If there are not enough quad-gram counts to do so, we back off and use the tri-gram probability. If there are still not enough tri-gram counts, we use the bi-gram probability. If we fail again, the algorithm falls back to the uni-gram probabilities. To calculate the n-gram probabilities, we use the Kneser-Ney smoothing method6.
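
For reference, the Stupid Backoff score of Brants et al. is not a normalized probability; it backs off with a fixed penalty \(\alpha\) (the paper suggests \(\alpha = 0.4\)):

\[ S(w_i \mid w^{i-1}_{i-k+1}) = \begin{cases} \dfrac{c(w^{i}_{i-k+1})}{c(w^{i-1}_{i-k+1})} & \text{if } c(w^{i}_{i-k+1}) > 0 \\ \alpha\, S(w_i \mid w^{i-1}_{i-k+2}) & \text{otherwise} \end{cases} \]

In our implementation, the back-off order is the same, but the score at each level is the Kneser-Ney probability described in the Language Modeling section.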

Preprocessing and Uni-Grams

As discussed in the Introduction, we take a 50% random sample of the original corpus. The following table summarizes this sample; since the average number of characters per line has barely changed, the sample looks reasonable.
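
The report does not show its package setup. Based on the functions used in the code chunks, they presumably assume something like the following attachments (my reconstruction, with an older quanteda version that still supports the remove_twitter/remove_hyphens arguments of tokens()):

#============================================#

# Presumed Package Setup (not shown in the original report)

#============================================#

library(readr)       # read_lines()
library(quanteda)    # corpus(), tokens(), dfm(), tokens_ngrams(), textplot_wordcloud()
library(ngram)       # wordcount()
library(tidytext)    # tidy() method for dfm objects
library(data.table)  # setDT(), setkey(), data.table syntax
library(knitr)       # kable()
library(kableExtra)  # kable_styling(), footnote()
library(ggplot2)     # frequency plots
library(stringr)     # str_squish() in app.R
library(magrittr)    # the %>% pipe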

#============================================#

# Sampling the Data & Making a Corpus & 
# Getting Some General Info of the sample corpus

#============================================#

# Reading the original Corpus
blogs <- read_lines("rawData/blogs.txt") # reading blogs
twitter <- readLines("rawData/twitter.txt", skipNul = TRUE) # reading twitter
news <- read_lines("rawData/news.txt") # reading news


# Randomly Sampling the Corpus
set.seed(2019) # make it reproducible
# random sampling the blogs
blogs.ind <- sample(length(blogs), length(blogs) * 0.5)
blogs.sample <- blogs[blogs.ind]
# random sampling the twitter
twitter.ind <- sample(length(twitter), length(twitter) * 0.5)
twitter.sample <- twitter[twitter.ind]
# random sampling the news
news.ind <- sample(length(news), length(news) * 0.5)
news.sample <- news[news.ind]

# create a corpus        
sample.corpus <- corpus(c(blogs.sample, 
                          twitter.sample,
                          news.sample))

# compute the number of lines in sample blogs
lines.blogs.sample <- length(blogs.sample)
# compute the number of lines in sample twitter
lines.twitter.sample <- length(twitter.sample)
# compute the number of lines in sample news
lines.news.sample <- length(news.sample)
# compute the number of lines in sample corpus
lines.corpus.sample <- lines.blogs.sample + 
        lines.twitter.sample + 
        lines.news.sample

# compute the number of words in sample blogs
words.blogs.sample <- format(wordcount(blogs.sample), big.mark=",",  scientific=FALSE)
# compute the number of words in sample twitter
words.twitter.sample <- format(wordcount(twitter.sample), big.mark=",", scientific=FALSE)
# compute the number of words in sample news
words.news.sample <- format(wordcount(news.sample), big.mark=",", scientific=FALSE)
# compute the number of words in sample corpus
words.corpus.sample <- format(wordcount(blogs.sample) + 
                                wordcount(twitter.sample) + 
                                wordcount(news.sample),
                              big.mark=",", 
                              scientific=FALSE)


# compute the number of characters in each line of sample blogs
length.blogs.sample <- sapply(blogs.sample, nchar)
# range of lengths in sample blogs
range.chars.blogs.sample <- paste(as.character(min(length.blogs.sample)), " - ",
                             as.character(max(length.blogs.sample)), sep = "")

# compute the number of characters in each line of sample twitter
length.twitter.sample <- sapply(twitter.sample, nchar)
# range of lengths in sample twitter
range.chars.twitter.sample <- paste(as.character(min(length.twitter.sample)), " - ", 
                            as.character(max(length.twitter.sample)), sep = "")

# compute the number of characters in each line of sample news
length.news.sample <- sapply(news.sample, nchar)
# range of lengths in sample news
range.chars.news.sample <- paste(as.character(min(length.news.sample)), " - ",
                            as.character(max(length.news.sample)), sep = "")

# minimum length in the corpus
min.sample.corp <- min(min(length.blogs.sample), 
                       min(length.twitter.sample),
                       min(length.news.sample))

# maximum length in the sample corpus
max.sample.corp <- max(max(length.blogs.sample), 
                       max(length.twitter.sample),
                       max(length.news.sample))

# range of lengths in the sample corpus
range.chars.corpus.sample <- paste(as.character(min.sample.corp), " - ",
                                   as.character(max.sample.corp), sep = "")

# average length in sample news
avg.nchar.news.sample <- round(mean(length.news.sample), 2)
# average length in sample twitter
avg.nchar.twitter.sample <- round(mean(length.twitter.sample), 2)
# average length in sample blogs
avg.nchar.blogs.sample <- round(mean(length.blogs.sample), 2)
# "average" length reported for the sample corpus (the sum of the three per-source averages)
avg.nchar.corpus.sample <- avg.nchar.news.sample + 
        avg.nchar.twitter.sample + 
        avg.nchar.blogs.sample

# a vector of info in sample blogs
info.blogs.sample <- c("Blogs", 
                format(lines.blogs.sample, big.mark=",", scientific=FALSE),
                words.blogs.sample, 
                range.chars.blogs.sample,
                avg.nchar.blogs.sample)

# a vector of info in sample twitter
info.twitter.sample <- c("Twitter", 
                  format(lines.twitter.sample, big.mark=",", scientific=FALSE),
                  words.twitter.sample, 
                  range.chars.twitter.sample,
                  avg.nchar.twitter.sample)

# a vector of info in sample news
info.news.sample <- c("News", 
               format(lines.news.sample, big.mark=",", scientific=FALSE),
               words.news.sample, 
               range.chars.news.sample, 
               avg.nchar.news.sample)

# a vector of info in sample corpus
info.corpus.sample <- c("Corpus", 
                format(lines.corpus.sample, big.mark=",", scientific=FALSE),
               words.corpus.sample, 
               range.chars.corpus.sample, 
               avg.nchar.corpus.sample)

# a data frame of info of the sample data 
raw.info.sample <- as.data.frame(rbind(info.blogs.sample, 
                                       info.twitter.sample, 
                                       info.news.sample,
                                       info.corpus.sample))
colnames(raw.info.sample) = c("Data", 
                       "Lines", 
                       "Words", 
                       "Range nchars",
                       "Avg nchars") # set the column names
rownames(raw.info.sample) = NULL # set the row names NULL


# represent the info as a table
raw.info.sample %>% 
  kable(booktabs = T,
        caption = "Random Sampled Raw Data - Information", align = "c") %>%
  kable_styling(full_width = F) %>%
  footnote(general =  c("Range nchars: range of number of chars in lines", 
                        "Avg nchars: avgerage number of chars in lines"))
Random Sampled Raw Data - Information

| Data    | Lines     | Words      | Range nchars | Avg nchars |
|---------|-----------|------------|--------------|------------|
| Blogs   | 449,644   | 18,675,157 | 1 - 40,833   | 230.11     |
| Twitter | 1,180,074 | 15,184,022 | 2 - 196      | 68.79      |
| News    | 505,121   | 17,184,596 | 1 - 11,384   | 201.17     |
| Corpus  | 2,134,839 | 51,043,775 | 1 - 40,833   | 500.07     |
Note:
Range nchars: range of the number of characters per line.
Avg nchars: average number of characters per line (the Corpus row reports the sum of the three averages).

We extract the uni-grams (words), excluding numbers, hyphens, URLs, separators, punctuation, and (Twitter) symbols. Moreover, we convert the words to lower case and exclude words containing both numbers and letters. We also remove profanities7 and non-English words from the corpus, and we clean nonsense words (e.g., zzzzzzz and bbbbb) out of the data as much as possible. Furthermore, we keep only the words that occur at least three times in the corpus (min_termfreq = 3 below). A summary of the results is presented in the Coverage Table below. To speed up the computation, we trim the corresponding Document-Feature Matrix (DFM), tidy it into a data table, and index it.

#============================================#

# Uni-Grams

#============================================#


# Tokenizing 
unigrams <- tokens(x = tolower(sample.corpus), # to lower-case 
                   remove_numbers = TRUE, # remove numbers
                   remove_hyphens = TRUE, # remove hyphens
                   remove_url = TRUE, # remove URLs
                   remove_symbols = TRUE, # remove symbols
                   remove_separators = TRUE, # remove separators 
                   remove_punct = TRUE, # remove punctuations
                   remove_twitter = TRUE) # remove Twitter symbols (@, #)

# Remove words containing numbers
unigrams <- tokens_remove(unigrams, 
                          pattern = "\\w*[0-9]+\\w*\\s*", 
                          valuetype = "regex")


# a regular expression for nonsense words in the corpus
# (built with paste0 so the pattern contains no line breaks)
pattern <- paste0("\\w*(kk|nn|zz|aaa|bbb|ccc|ddd|eee|fff|ggg|hhh|jjj|lll|",
                  "ppp|qqq|rrr|vvv|xxx|iiii|ssss|wwww|mmmmm|ooooo|ttttt|",
                  "uuuuu|yyyyy)+\\w*\\s*")
# remove nonsense words
unigrams <- tokens_remove(unigrams, pattern = pattern, valuetype = "regex")



# profanities filtering
profanities <- read_lines("rawData/bad_words.txt") # read the profanity dataset
unigrams <- tokens_remove(unigrams, 
                          pattern = profanities) # remove profanities

# remove non-English words
unigrams <- tokens_remove(unigrams,
                          pattern = "[A-z]*[^\x01-\x7F]+[A-z]*",
                          valuetype = "regex")
#============================================#

#  A Tidy Data Table for Unigrams 

#============================================#


# create a document-feature matrix (DFM)
dfm.uni <- dfm(unigrams)
# trim the DFM 
dfm.uni <- dfm_trim(dfm.uni, min_termfreq = 3)

# create a tidy data table of unigrams
dt.uni <- tidy(dfm.uni) # tidy
setDT(dt.uni) # set data table
# convert to 'word' & 'count'
dt.uni <- dt.uni[, document := NULL][, .(word = term, count)]
# aggregate by word
dt.uni <- dt.uni[, .(count = sum(count)), by = word]
# index the unigrams data table
setkey(dt.uni, word)
# order by count 
dt.uni <- dt.uni[order(-count)]

The following figure represents the top 30 most frequent words in the clean corpus, together with their frequencies.

#============================================#

#  Frequency Plot - Unigrams 

#============================================#

ggplot(dt.uni[order(-count)][1:30, ], # the 30 most frequent
       aes(x = reorder(word, count), # reorder word by count 
           y = count)) + 
  geom_col(width = 0.4) + # width of bars
  xlab(NULL) + # no label on the X axis
  ylab("Frequency") + # set label on Y axis
  coord_flip() +  # flip cartesian coordinates
  ggtitle("Top 30 most frequent words in the clean corpus") # title

As we see in the frequency plot above, the most frequent words in the sample corpus are stop-words. The following figure represents the top 30 most frequent words excluding stop-words.8 It is followed by a word cloud of the data, excluding profanities, hyphens, URLs, symbols, stop-words, and numbers.

#============================================#

#  Frequency Plot - Unigrams excluding Stop-Words

#============================================#

# get non-stop words ('%!in%' is not in base R; define it as the negation of %in%)
`%!in%` <- Negate(`%in%`)
dt.uni.wostp <- dt.uni[word %!in% stopwords("english")]

# plot
ggplot(dt.uni.wostp[order(-count)][1:30, ],# the 30 most frequent
       aes(x = reorder(word, count), # reorder word by count
           y = count))  +
  geom_col(width = 0.4) + # width of bars
  xlab(NULL) + # no label on the X axis
  ylab("Frequency") + # set label on Y axis
  coord_flip() + # flip cartesian coordinates
  ggtitle("Top 30 most frequent words excluding stopwords") # title

#============================================#

#  Wordcloud - Unigrams without Stop-Words

#============================================#

# build a DFM for unigrams without stop-words
dfm.uni.wostp <- dfm_remove(dfm.uni, stopwords("english"))


set.seed(1000) # to make it reproducible
# word cloud
textplot_wordcloud(dfm.uni.wostp, 
                   rotation = .25, 
                   color = RColorBrewer::brewer.pal(8,"Dark2"))
A Word-Cloud for unigrams (excl. stopwords, numbers, profanities, URLs, symbols)

Bi-, Tri-, and Quad-Grams

In this section, we extract the bi-grams, tri-grams, and quad-grams from the clean unigrams and analyze their frequencies. We keep only those grams that occur at least three times in our corpus (min_termfreq = 3). Moreover, we remove grams that contain empty tokens or adjacent repeated words, e.g., "a a".

#============================================#

# Bi-Grams & Tidy Data Table for Bi-Grams

#============================================#

# compute bigrams
bigrams <- tokens_ngrams(unigrams, n = 2)

# create a DFM for bigrams
dfm.bi <- dfm(bigrams)
# trim the DFM
dfm.bi <- dfm_trim(dfm.bi, min_termfreq = 3)

# tidy the DFM
dt.bi <- tidy(dfm.bi)
# set data table
setDT(dt.bi) 
# remove 'document'
dt.bi <- dt.bi[, document := NULL] 
# aggregate frequencies based on the terms
dt.bi <- dt.bi[, .(count = sum(count)), by = term]
# split the terms into two words
dt.bi[, c("word1", "word2") := tstrsplit(term, "_", fixed=TRUE)]
# keep only word1, word2, & the frequency (count)
dt.bi <- dt.bi[, .(word1, word2, count)]
# clean empty words
dt.bi <- dt.bi[word1 != ""][word2 != ""]
# remove the terms with the same words
dt.bi <- dt.bi[(word1 != word2)]
# set index
setkey(dt.bi, word1, word2)
# order by count 
dt.bi <- dt.bi[order(-count)]

The following figure represents the 30 most frequent bi-grams:

#============================================#

# Frequency Plot for Bi-Grams

#============================================#


ggplot(dt.bi[1:30, ], # the 30 most frequent
       aes(x = reorder(paste(word1, word2, sep = " "), # paste two words  
                       count), # reorder by count
           y = count)) + 
  geom_col(width = 0.4) + # width of bars
  xlab(NULL) + # no label on the X axis
  ylab("Frequency") + # set label on Y axis
  coord_flip() + # flip cartesian coordinates
  ggtitle("Top 30 most frequent bigrams") # title

#============================================#

# Tri-Grams & Tidy Data Table for Tri-Grams

#============================================#


# compute trigrams
trigrams <- tokens_ngrams(unigrams, n = 3)

# create a DFM for trigrams
dfm.tri <- dfm(trigrams)
# trim the DFM
dfm.tri <- dfm_trim(dfm.tri, min_termfreq = 3)

# tidy the DFM
dt.tri <- tidy(dfm.tri) 
# set data table
setDT(dt.tri)
# remove 'document'
dt.tri <- dt.tri[, document := NULL]
# aggregate frequencies by terms
dt.tri <- dt.tri[, .(count = sum(count)), by = term]
# split the terms into three words
dt.tri[, c("word1", "word2", "word3") := tstrsplit(term, "_", fixed=TRUE)]
# keep only word1, word2, word3, & the frequency (count)
dt.tri <- dt.tri[, .(word1, word2, word3, count)]
# remove the terms with the same words
dt.tri <- dt.tri[!((word1 == word2) | (word2 == word3))]
# remove empty words
dt.tri <- dt.tri[word1!="" & word2!="" & word3!=""]
# set index
setkey(dt.tri, word1, word2, word3)
# order by frequency (count)
dt.tri <- dt.tri[order(-count)]

The following figure represents the 30 most frequent tri-grams in the corpus:

#============================================#

# Frequency Plot for Tri-Grams

#============================================#

ggplot(dt.tri[1:30, ],
       aes(x = reorder(paste(word1, word2, word3, sep = " "), # paste two words
                       count), # reorder by count
           y = count)) + 
  geom_col(width = 0.4) + # width of bars
  xlab(NULL) + # no label on the X axis
  ylab("Frequency") + # set label on Y axis 
  coord_flip() + # flip cartesian coordinates
  ggtitle("Top 30 most frequent tri-grams") # title

#============================================#

# Quad-Grams & Tidy Data Table for Quad-Grams

#============================================#

# compute quad-grams
quadgrams <- tokens_ngrams(unigrams, n = 4)

# create a DFM for quad-grams
dfm.quad <- dfm(quadgrams)
# trim the DFM
dfm.quad <- dfm_trim(dfm.quad, min_termfreq = 3)

# tidy the DFM
dt.quad <- tidy(dfm.quad) 
# set data table
setDT(dt.quad)
# remove 'document'
dt.quad <- dt.quad[, document := NULL]
# aggregate by term
dt.quad <- dt.quad[, .(count = sum(count)), by = term]
# split the terms into four words
dt.quad[, c("word1", "word2", "word3", "word4") := tstrsplit(term, "_", fixed=TRUE)]
# keep only word1, word2, word3, word4, & the frequency (count)
dt.quad <- dt.quad[, .(word1, word2, word3, word4, count)]
# remove the terms with the same words
dt.quad <- dt.quad[!((word1 == word2) | (word2 == word3) | (word3 == word4))]
# remove terms containing empty words
dt.quad <- dt.quad[word1!="" & word2!="" & word3!="" & word4!=""]
# set index
setkey(dt.quad, word1, word2, word3, word4)
# order by count (frequency)
dt.quad <- dt.quad[order(-count)]

The following figure represents the 30 most frequent quad-grams:

#============================================#

# Frequency Plot for Quad-Grams

#============================================#

ggplot(dt.quad[1:30, ], # the 30 most frequent
       aes(x = reorder(paste(word1, word2, word3, word4, sep = " "), # paste the words
                       count), # reorder by count 
           y = count)) + 
  geom_col(width = 0.4) + # width of bars 
  xlab(NULL) + # no label on X axis
  ylab("Frequency") + # set the label of Y axis
  coord_flip() + # flip cartesian coordinates
  ggtitle("Top 30 most frequent quad-grams") # title

The following table reports, for each n-gram type (uni-, bi-, tri-, and quad-grams), the number of distinct terms, the number of instances, and the maximum frequency of a term. Moreover, it shows how many unique terms, taken in decreasing order of frequency, are needed to cover 50% and 90% of all term instances in the corpus.
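
The helper findMany() used below is not defined in this report. A minimal sketch of what it presumably computes, namely the number of most-frequent terms whose cumulative count reaches a fraction p of all instances (my assumption based on how its results are labeled):

# Presumed definition of findMany() -- not shown in the original report.
# dt: a data table with a 'count' column; p: the coverage fraction (e.g., 0.5 or 0.9)
findMany <- function(dt, p) {
  counts <- sort(dt$count, decreasing = TRUE)  # most frequent terms first
  sum(cumsum(counts) < p * sum(counts)) + 1    # number of terms needed to reach coverage p
}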

#============================================#

# General Information about the N-grams

#============================================#

dt.coverage <- data.frame(
  # number of terms
  Terms = c(format(nrow(dt.uni), big.mark=",", scientific=FALSE), # unigrams
            format(nrow(dt.bi), big.mark=",", scientific=FALSE), # bigrams
            format(nrow(dt.tri), big.mark=",", scientific=FALSE), # trigrams
            format(nrow(dt.quad), big.mark=",", scientific=FALSE)), # quadgrams
  # number of instances
  Instances = c(format(sum(dt.uni$count), big.mark=",", scientific=FALSE), # unigrams 
                format(sum(dt.bi$count), big.mark=",", scientific=FALSE), # bigrams
                format(sum(dt.tri$count), big.mark=",", scientific=FALSE), # trigrams
                format(sum(dt.quad$count), big.mark=",", scientific=FALSE)), # quadgrams
  # maximum frequency
  MaxFreq = c(format(max(dt.uni$count), big.mark=",", scientific=FALSE), # unigrams
               format(max(dt.bi$count), big.mark=",", scientific=FALSE), # bigrams
               format(max(dt.tri$count), big.mark=",", scientific=FALSE), # trigrams
               format(max(dt.quad$count), big.mark=",", scientific=FALSE)), # quadgrams
  # Number of terms covering 50% of the whole
  Cov50 = c(format(findMany(dt.uni, .5), big.mark=",", scientific=FALSE), # unigrams
             format(findMany(dt.bi, .5), big.mark=",", scientific=FALSE), # bigrams
             format(findMany(dt.tri, .5), big.mark=",", scientific=FALSE), # trigrams
             format(findMany(dt.quad, .5), big.mark=",", scientific=FALSE)), # quadgrams
  # Number of terms covering 90% of the whole
  Cov90 = c(format(findMany(dt.uni, .9), big.mark=",", scientific=FALSE), # unigrams
             format(findMany(dt.bi, .9), big.mark=",", scientific=FALSE), # bigrams
             format(findMany(dt.tri, .9), big.mark=",", scientific=FALSE), # trigrams
             format(findMany(dt.quad, .9), big.mark=",", scientific=FALSE))) # quadgrams

# set column names
colnames(dt.coverage) = c("Terms", 
                          "Instances", 
                          "Max Frequency", 
                          "50% Coverage", 
                          "90% Coverage")
# set row names
rownames(dt.coverage) = c("Uni-Grams", 
                          "Bi-Grams", 
                          "Tri-Grams", 
                          "Quad-Grams")

# represent the info as a table 
dt.coverage %>% 
  kable(booktabs = T, caption = "Coverage Table", align = "c") %>%
  kable_styling(full_width = F)
Coverage Table

| N-Grams    | Terms     | Instances  | Max Frequency | 50% Coverage | 90% Coverage |
|------------|-----------|------------|---------------|--------------|--------------|
| Uni-Grams  | 138,437   | 49,562,317 | 2,386,198     | 73,814       | 127,396      |
| Bi-Grams   | 1,468,786 | 39,949,050 | 216,281       | 16,814       | 525,330      |
| Tri-Grams  | 1,912,143 | 20,825,724 | 17,420        | 127,927      | 1,227,004    |
| Quad-Grams | 1,004,342 | 6,963,399  | 3,903         | 153,266      | 772,229      |

Language Modeling

This section discusses the n-gram model I used to predict next words. It models sequences of words using the statistical properties of n-grams. For simplicity and practicality, I follow the Markov assumption9 (or independence assumption). That is, in an n-gram model, each word depends only on the previous \(n-1\) words. This assumption is important because it massively simplifies the problem of estimating the language model from data.
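
In symbols, for an n-gram model the assumption reads:

\[ P(w_i \mid w_1 \ldots w_{i-1}) \approx P(w_i \mid w_{i-n+1} \ldots w_{i-1}) \]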

As for the probabilities, I use smoothing to assign probability mass to words we have not seen in the training data. There are several smoothing methods, including Good-Turing Smoothing10 and Kneser-Ney Smoothing11. I use the latter, as it works better in most cases.12

The Kneser-Ney (KN) probability of a given n-gram \(w_1 \ldots w_n\) is the conditional probability \(P_{KN}(w_n|w_1 \ldots w_{n-1})\), i.e., the probability of the \(n\)th word given the preceding \(n-1\) words.

The uni-gram KN probability is as follows: \[ P_{KN}(w_i) = \frac{|\{w': 0 < c(w', w_i)\}|}{|\{(w', w''): 0 < c(w', w'')\}|} \]

The bi-gram KN probability is as follows: \[ P_{KN}(w_i | w_{i-1}) = \frac{\max(c(w_{i-1}, w_i) - \delta, 0)}{\sum_{w'} c(w_{i-1}, w')} + \lambda_{w_{i-1}} P_{KN}(w_i) \] \[ \lambda_{w_{i-1}} = \frac{\delta}{\sum_{w'} c(w_{i-1}, w')} |\{w': 0 < c(w_{i-1}, w')\}| \]

The extension of this equation to the n-grams is as follows: \[ P_{KN}(w_i | w^{i-1}_{i-n+1}) = \frac{\max(c(w^{i-1}_{i-n+1}, w_i) - \delta, 0)}{\sum_{w'} c(w^{i-1}_{i-n+1}, w')} + \delta\ \frac{|\{w': 0 < c(w^{i-1}_{i-n+1}, w')\}|}{\sum_{w_i} c(w^i_{i-n+1})} P_{KN}(w_i | w^{i-1}_{i-n+2}) \]

where,

  • \(\delta\) is called the discount weight; we choose \(0.75\) for this constant.
  • \(c(w, w')\) is the number of occurrences of the expression \(w\) followed by the word \(w'\) in the corpus.
  • \(w^{i-1}_{i-n+1}\) denotes the \(n-1\) words preceding \(w_i\).
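
As a quick numeric check of the bi-gram equation, here is a toy computation in R; all counts are made up purely for illustration and do not come from the corpus:

# Toy numbers, purely illustrative: suppose the bigram (w1, w2) occurs 300 times,
# w1 is followed by some word 1,000 times in total, w1 has 120 distinct continuations,
# and the unigram continuation probability P_KN(w2) is 0.002.
D       <- 0.75                 # discount weight (delta)
c_w1w2  <- 300                  # c(w1, w2)
c_w1    <- 1000                 # sum over w' of c(w1, w')
n_w1    <- 120                  # |{w' : c(w1, w') > 0}|
p_kn_w2 <- 0.002                # P_KN(w2)

lambda_w1 <- (D / c_w1) * n_w1  # the normalizing constant lambda_{w1}
p_kn_bi   <- max(c_w1w2 - D, 0) / c_w1 + lambda_w1 * p_kn_w2
p_kn_bi                         # 0.29925 + 0.09 * 0.002 = 0.29943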

The following code chunks show how I implemented the KN probabilities for uni-, bi-, tri-, and quad-grams.

#============================================#

# Uni- and Bi-Grams Probability

#============================================#

# Discount Weight
D <- 0.75

# Total Number of Bigrams
total.bi <- nrow(dt.bi)


# create a data table representing the number of occurrences of a word as the first word in bigrams 
count.Biw1 <- dt.bi[ , .(count.Biw1 = sum(count)), by = word1]
setkey(count.Biw1, word1) # index 
# add a corresponding column into dt.bi (the bigrams data table)
dt.bi[, count.w1 := count.Biw1[word1, count.Biw1]] 

# create a data table representing the number of unique bigrams with w1 as their first word
num.Biw1 <- dt.bi[, .(num = .N), by = word1]
setkey(num.Biw1, word1) # index
# add a corresponding column into dt.bi (the bigrams data table)
dt.bi[, num.w1 := num.Biw1[word1, num]]

# create a data table representing the number of unique bigrams with w2 as their second word 
num.Biw2 <- dt.bi[, .(num = .N), by = word2]
setkey(num.Biw2, word2) # index
dt.bi[, num.w2 := num.Biw2[word2, num]]

# Compute Uni-Grams Probability
# compute & add a column into the bigrams data table (dt.bi)
dt.bi[, prob.w2 := num.w2/total.bi] # see the unigram KN equation 
# get a corresponding subset
prob.uni <- unique(dt.bi[, .(word = word2, Prob = prob.w2)])
setkey(prob.uni, word) # index
# add unigram probabilities into the unigrams data table (dt.uni)
dt.uni[, Prob := prob.uni[word, Prob]]
dt.uni <- dt.uni[!is.na(Prob)] # drop the NA values
dt.uni <- dt.uni[order(-Prob)] # order unigrams by probability

# Compute the Bi-Grams Probability
dt.bi[, Prob := ((count - D)/ count.w1) +
        ((D/count.w1) * num.w1 * prob.w2)] # see the  bigram KN equation

# clean the bigrams data table (dt.bi)
dt.bi[, count.w1:=NULL][,num.w1:=NULL][,num.w2:=NULL][,prob.w2:=NULL] # remove intermediate results
dt.bi <- dt.bi[order(-Prob)] # order by probability 

The following table represents the most and the least likely bi-grams:

#============================================#

# Most and Least Likely Bi-Grams

#============================================#

# get the 10-most likely bigrams
bi.most <- head(dt.bi[order(-Prob)], 10)  
# paste the two words
bi.most <- bi.most[, bigram := paste(word1, word2)]
# clean the data table
bi.most <- bi.most[, .(bigram, Prob)]

# get the 10-least likely bigrams
bi.least <- tail(dt.bi[order(-Prob)], 10)
# paste the two words
bi.least <- bi.least[, bigram := paste(word1, word2)]
# clean the data table
bi.least <- bi.least[, .(bigram, Prob)]

# represent as a table 
kable(list(bi.most, bi.least), booktabs = T, 
      caption = "10-Most (Left) and 10-Least (Right) Likely Bi-Grams", 
      align = "c") %>%
  kable_styling()          
10-Most (Left) and 10-Least (Right) Likely Bi-Grams

| bigram | Prob | bigram | Prob |
|--------|------|--------|------|
| reminiscent of | 0.9972347 | the ywca’s | 1e-06 |
| gearing up | 0.9972166 | the zephyr | 1e-06 |
| accounted for | 0.9970479 | the zimmerli | 1e-06 |
| specializes in | 0.9969995 | the zingaro | 1e-06 |
| conjunction with | 0.9965980 | the zirconium | 1e-06 |
| according to | 0.9965018 | the zirkles | 1e-06 |
| assortment of | 0.9964912 | the zohan | 1e-06 |
| mardi gras | 0.9962312 | the zohar | 1e-06 |
| incapable of | 0.9957002 | the zona | 1e-06 |
| cinco de | 0.9948983 | the zydeco | 1e-06 |
#============================================#

# Tri-Grams Probability

#============================================#


# create a data table representing the number of occurrences of a bigram as the 1st & 2nd words in trigrams 
count.Triw1w2 <- dt.tri[ , .(count = sum(count)), by = .(word1, word2)]
count.Triw1w2[, term := paste(word1, word2)] # paste the two words
count.Triw1w2[, word1 := NULL][, word2 := NULL] # remove the separate words
setkey(count.Triw1w2, term) # index
# add a corresponding column into the trigrams table (dt.tri)
dt.tri[, term := paste(word1, word2)] # prepare dt.tri
dt.tri[, count.w1w2 := count.Triw1w2[term, count]] # add a column

# create a data table representing the number of unique trigrams for given first two words
num.Triw1w2 <- dt.tri[, .(num = .N), by = .(word1, word2)]
num.Triw1w2[, term := paste(word1, word2)] # paste two words
num.Triw1w2[, word1 := NULL][, word2 := NULL] # remove separate words
setkey(num.Triw1w2, term) # index
# add a corresponding column into the trigrams data table (dt.tri)
dt.tri[, num.w1w2 := num.Triw1w2[term, num]]

# Compute the normalizing constants
dt.tri[, lambda.w1w2 := (D/count.w1w2)*num.w1w2]

# Compute the Tri-Grams Probability 
bi.temp <- dt.bi[, term := paste(word1, word2)] # a helper data table (':=' also modifies dt.bi by reference)
setkey(bi.temp, term) # index the helper data table
dt.tri[, term2 := paste(word2, word3)] # prepare the trigrams data (dt.tri)
# compute the trigram KN probability 
dt.tri[, Prob := ((count - D)/count.w1w2) +
         (lambda.w1w2 * bi.temp[term2, Prob])] # see the KN equation

# clean the bigrams
dt.bi[, term := NULL] 
# clean the trigrams data table
dt.tri <- dt.tri[, .(word1, word2, word3, count, Prob)] # remove intermediate results
dt.tri <- dt.tri[!is.na(Prob)] # drop NA values
dt.tri <- dt.tri[order(-Prob)] # order by probability

The following table represents the most and the least likely tri-grams:

#============================================#

# Most and Least Likely Tri-Grams

#============================================#

# get the 10-most likely trigrams
tri.most <- head(dt.tri[order(-Prob)], 10)
# paste three words
tri.most <- tri.most[, trigram := paste(word1, word2, word3)]
# clean the data table
tri.most <- tri.most[, .(trigram, Prob)]

# get the 10-least likely trigrams
tri.least <- tail(dt.tri[order(-Prob)], 10)
# paste three words
tri.least <- tri.least[, trigram := paste(word1, word2, word3)]  
# clean the data table
tri.least <- tri.least[, .(trigram, Prob)]

# represent as a table        
kable(list(tri.most, tri.least), booktabs = T,
      caption = "10-Most (Left) and 10-Least (Right) Likely Tri-Grams", 
      align = "c") %>%
  kable_styling()          
10-Most (Left) and 10-Least (Right) Likely Tri-Grams

| trigram | Prob | trigram | Prob |
|---------|------|---------|------|
| been able to | 0.9999935 | of the lepus | 1.15e-05 |
| being able to | 0.9999935 | of the looking | 1.15e-05 |
| was supposed to | 0.9999922 | of the naming | 1.15e-05 |
| were able to | 0.9999906 | of the rugged | 1.15e-05 |
| year according to | 0.9999895 | of the sandman | 1.15e-05 |
| is supposed to | 0.9999889 | of the saturn | 1.15e-05 |
| in conjunction with | 0.9999883 | of the send | 1.15e-05 |
| in according to | 0.9999871 | of the similarities | 1.15e-05 |
| and according to | 0.9999841 | of the skilled | 1.15e-05 |
| are supposed to | 0.9999839 | of the unschooling | 1.15e-05 |
#============================================#

# Quad-Grams Probability

#============================================#


# create a data table representing the number of occurrences of a trigram as the 1st, 2nd & 3rd words in quadgrams 
count.Quadw1w2w3 <- dt.quad[ , .(count = sum(count)), by = .(word1, word2, word3)]
count.Quadw1w2w3[, term := paste(word1, word2, word3)] # paste words
count.Quadw1w2w3[, word1 := NULL][, word2 := NULL][, word3 := NULL] # remove the separate words
setkey(count.Quadw1w2w3, term) # index
# add a corresponding column into the quadgrams table (dt.quad)
dt.quad[, term := paste(word1, word2, word3)] # prepare dt.quad
dt.quad[, count.w1w2w3 := count.Quadw1w2w3[term, count]] # add a column


# create a data table representing the number of unique quadgrams for given first three words
num.Quadw1w2w3 <- dt.quad[, .(num = .N), by = .(word1, word2, word3)]
num.Quadw1w2w3[, term := paste(word1, word2, word3)]  # paste three words
num.Quadw1w2w3[, word1 := NULL][, word2 := NULL][, word3 := NULL] # remove separate words
setkey(num.Quadw1w2w3, term) # index 
# add a corresponding column into the quadgrams data table (dt.quad)
dt.quad[, num.w1w2w3 := num.Quadw1w2w3[term, num]] 

# Compute the normalizing constants (lambda_{w1w2w3})
dt.quad[, lambda.w1w2w3 := (D/count.w1w2w3)*num.w1w2w3]


# Compute the Quad-Grams Probability 
tri.temp <- dt.tri[, term := paste(word1, word2, word3)] # a helper data table (':=' also modifies dt.tri by reference)
setkey(tri.temp, term) # index the helper table
dt.quad[, term2 := paste(word2, word3, word4)] # prepare the quadgrams data (dt.quad)
# compute the quadgram KN probability 
dt.quad[, Prob := ((count - D)/count.w1w2w3) + 
          (lambda.w1w2w3 * tri.temp[term2, Prob])]

# clean the trigrams data table
dt.tri[, term := NULL]
# clean the quadgrams data table
dt.quad <- dt.quad[, .(word1, word2, word3, word4, count, Prob)] # remove intermediate results
dt.quad <- dt.quad[!is.na(Prob)] # drop NA values
dt.quad <- dt.quad[order(-Prob)] # order by probability 

The following table represents the most and the least likely quad-grams:

#============================================#

# Most and Least Likely Quad-Grams

#============================================#

# get the 10-most likely quadgrams
quad.most <- head(dt.quad[order(-Prob)], 10)
# paste four words
quad.most <- quad.most[, quadgram := paste(word1, word2, word3, word4)]
# clean the data table
quad.most <- quad.most[, .(quadgram, Prob)]


# get the 10-least likely quadgrams
quad.least <- tail(dt.quad[order(-Prob)], 10)
# paste four words
quad.least <- quad.least[, quadgram := paste(word1, word2, word3, word4)]
# clean the data table
quad.least <- quad.least[, .(quadgram, Prob)]

# represent as a table
kable(list(quad.most, quad.least), booktabs = T,  
      caption = "10-Most (Left) and 10-Least (Right) Likely Quad-Grams", 
      align = "c") %>%
  kable_styling()          
10-Most (Left) and 10-Least (Right) Likely Quad-Grams

| quadgram | Prob | quadgram | Prob |
|----------|------|----------|------|
| have been able to | 1.0000000 | one of the tests | 0.0001576 |
| not being able to | 1.0000000 | one of the back | 0.0001576 |
| haven’t been able to | 1.0000000 | one of the blessings | 0.0001576 |
| we were able to | 1.0000000 | one of the dozen | 0.0001576 |
| in the midst of | 1.0000000 | one of the book | 0.0001576 |
| i was supposed to | 1.0000000 | one of the deadly | 0.0001576 |
| has been able to | 1.0000000 | one of the tour | 0.0001576 |
| not been able to | 1.0000000 | one of the critters | 0.0001576 |
| of being able to | 0.9999999 | one of the missing | 0.0001576 |
| it was supposed to | 0.9999999 | one of the listed | 0.0001576 |

Since our app returns at most the top-5 most likely words for a given sequence of words, it does not make sense to keep lower-ranked n-grams. Moreover, we do not need to keep all uni-grams; we keep only the top-50. This makes the web-based app much faster and more memory-efficient.

#============================================#

# Further Cleaning the N-Grams Data Tables

#============================================#

# keep top-50 unigrams
dt.uni <- dt.uni[order(-Prob)][1:50]

# keep top-5 bigrams for given first word
dt.bi <- dt.bi[, .SD[1:5], by = word1 ][!is.na(word2)]

# keep top-5 trigrams for given first two words
dt.tri <- dt.tri[, .SD[1:5], by = .(word1, word2) ][!is.na(word3)]

# keep top-5 quadgrams for given first three words
dt.quad <- dt.quad[, .SD[1:5], by = .(word1, word2, word3) ][!is.na(word4)]

Finally, we save the processed data tables to disk. These objects will be uploaded to the server so that they can be used by the prediction algorithm.

#============================================#

# Store the Processed Objects

#============================================#


saveRDS(dt.uni, "unigrams.rsd") # unigrams
saveRDS(dt.bi, "bigrams.rsd") # bigrams
saveRDS(dt.tri, "trigrams.rsd") # trigrams
saveRDS(dt.quad, "quadgrams.rsd") # quadgrams

The Shiny Application

In this section, I describe the implementation of the Shiny application. The web-based application can be found at https://asafilian.shinyapps.io/as_txtpredict/.

The app consists of three source components:

  • The Prediction Algorithm (saved into predict.R)
  • The User Interface (function ui in app.R)
  • The Server Side (function server in app.R)

predict.R

First, this source file loads the processed data objects.

#============================================#

# Reading the Objects

#============================================#


# load the unigrams
unigrams <- readRDS("unigrams.rsd")
setkey(unigrams, word) # index

# load the bigrams
bigrams <- readRDS("bigrams.rsd")
setkey(bigrams, word1, word2) # index

# load the trigrams
trigrams <- readRDS("trigrams.rsd") 
setkey(trigrams, word1, word2, word3) # index

# load the quadgrams
quadgrams <- readRDS("quadgrams.rsd")
setkey(quadgrams, word1, word2, word3, word4) # index

# load profanities
profanities <- readRDS("profanities.rds")

Now, let us see how I implemented the prediction algorithm. The algorithm follows the Stupid Backoff back-off scheme.

We need a function that suggests \(n\) words (for a given \(n\in\{1, \ldots,5\}\)) based on the three previous words. The function takes a string, tokenizes and preprocesses it, and then extracts the last three words. To predict the next word, it first uses the quad-gram probabilities.

If the function cannot find \(n\) quad-grams starting with the three given words, the algorithm backs off to the tri-gram model to find the missing suggestions.

If the algorithm still cannot make \(n\) suggestions with the tri-gram model, it backs off to the bi-grams.

Finally, if the algorithm cannot even find a matching bi-gram, it randomly selects words from the high-probability uni-grams.

The app allows the end user to control the algorithm by choosing between the tri- and quad-gram models and by specifying the number of suggestions the app should make.

The following functions together implement our algorithm:

#============================================#

# Prediction Algorithm

#============================================#


Preprocess_str <- function(str){
  # Tokenize and Preprocess the input
  #
  # Args:
  #   str: the string expression read from the text field
  # 
  # Returns:
  #   preprocessed tokens
  
  
  
  if (str == "") {  # if str is empty, return NULL
    return(NULL) 
  }
  
  # tokenize
  toks <-  tokens(x = char_tolower(str), # lower-case 
                  remove_hyphens = TRUE, # remove hyphens
                  remove_url = TRUE, # remove URLs
                  remove_symbols = TRUE, # remove symbols
                  remove_separators = TRUE, # remove separators
                  remove_punct = TRUE, # remove punctuations
                  remove_twitter = TRUE, # remove twitters
                  remove_numbers = TRUE) # remove numbers
  # remove profanities
  toks <- tokens_remove(toks, pattern = profanities)
  # remove non-English words      
  toks <- tokens_remove(toks, 
                        pattern = "[A-z]*[^\x01-\x7F]+[A-z]*", 
                        valuetype = "regex")
  # remove words containing numbers       
  toks <- tokens_remove(toks, 
                        pattern = "\\w*[0-9]+\\w*\\s*", 
                        valuetype = "regex")
  toks
}


GetWords_4grams <- function(str, n = 5){
  # Is called when the Quad-Gram model is requested 
  #
  # Args:
  #   str: the string expression read from the text field
  #   n: number of suggestions the user wishes to have as possible next word; the default is 5
  #
  # Returns:
  #   a data frame containing the suggestions
  
  
  
  if (str == "") { # if str is empty, return NULL
    return(NULL) 
  }
  
  # tokenize & preprocess the input
  toks <- Preprocess_str(str)
  
  # extract the last three words      
  toks <- rev(rev(unlist(toks))[1:3])
  
  # call predict_quad to predict based on the 4-gram model      
  words <- Predict_quad(toks[1], toks[2], toks[3],  n)
  
  # return the suggestions as a data frame
  data.frame(Words = unique(words))
}



Predict_quad <- function(w1, w2, w3, n = 5){
  # Implements the Quad-Gram model
  #
  # Args:
  #   w1: the first word
  #   w2: the second word
  #   w3: the third word
  #   n: number of suggestions the user wishes to have as possible next word
  #
  # Returns
  #   a character vector with length n of words
  
  
  
  # extract the most likely words following w1w2w3 from the quadgram data table
  selected.quad <- quadgrams[.(w1, w2, w3)][order(-Prob)][1:n]
  
  # left_num: there may not be enough terms in quadgrams
  left.num <- sum(is.na(selected.quad$word4))
  
  # The case in which all n words are obtained from quadgrams
  if (left.num == 0) {
    return(selected.quad$word4)
  }
  
  # The case where nothing is found in the quadgram data table
  if (left.num == n){
    return(Predict_tri(w2, w3, n)) # try the trigram model
  }
  
  
  # The case where some (not all) suggestions have been made from quadgrams
  selected.tri = Predict_tri(w2, w3, left.num)
  return(c(selected.quad$word4[1:(n - left.num)], selected.tri))  
}


Predict_tri <- function(w1, w2, n = 5){
  # Implements the Tri-Gram model
  #
  # Args:
  #   w1: the first word
  #   w2: the second word
  #   n: number of suggestions the user wishes to have as possible next word
  #
  # Returns
  #   a character vector with length n of words
  
  
  
  # extract the most likely words following w1w2 from the trigrams data table
  selected.tri <- trigrams[.(w1, w2)][order(-Prob)][1:n]
  
  # left.num: there may not be enough terms in trigrams
  left.num <- sum(is.na(selected.tri$word3))
        
  # The case in which all n words are obtained from trigrams
  if (left.num == 0){
    return(selected.tri$word3)
  } 
                
  # The case where nothing is found in the trigram model
  if (left.num == n){
     return(Predict_bi(w2, n))  # try the bigram model
  }
               
  # The case where some (not all) suggestions have been made from trigrams
  selected.bi = Predict_bi(w2, left.num)
  return(c(selected.tri$word3[1:(n - left.num)], selected.bi))
}



Predict_bi <- function(w1, n = 5){
  # Implements the Bi-Gram model
  #
  # Args:
  #   w1: the first word
  #   n: number of suggestions the user wishes to have as possible next word
  #
  # Returns
  #   a character vector with length n of words
  
  
  
  # extract the most likely words following w1 from the bigram model
  selected.bi <- bigrams[w1][order(-Prob)][1:n]
  
  # left.num: there may not be enough terms in bigrams
  left.num <- sum(is.na(selected.bi$word2))
        
  # The case in which all n words are obtained from bigrams
  if (left.num == 0){ 
    return(selected.bi$word2) 
  } 
  
  # The case where nothing is found in the bigram model
  if (left.num == n) {
    return(Predict_uni(n)) # try the unigram model
  } 
  
  # The case where some (not all) suggestions have been made from bigrams
  selected.uni <- Predict_uni(left.num) 
  return(c(selected.bi$word2[1:(n - left.num)], selected.uni))
}


Predict_uni <- function(n = 5) {
  # Returns randomly selected words from the unigrams data table 
  #
  # Args:
  #   n: number of suggestions the user wishes to have as possible next word
  #
  # Returns
  #   a character vector with length n of words
  
  
  return(sample(unigrams$word, size = n))
}

app.R

#============================================#

# User Interface

#============================================#




# Define UI for application 
ui <- fluidPage(tabsetPanel(
            
        tabPanel("App",
                 
                 titlePanel("Predictive Text App"),
                 
                 sidebarLayout(
                         sidebarPanel(
                                 radioButtons("grams", "N-Gram Model:",
                                              c("Tri-Grams" = "tri",
                                                "Quad-Grams" = "quad"
                                                ),
                                              selected = "quad"), 
                                 hr(),
                                 radioButtons("nums", "Number of Suggestions:",
                                              c("One" = "1",
                                                "Two" = "2",
                                                "Three" = "3",
                                                "Four" = "4",
                                                "Five" = "5"),
                                              selected = "5")
                                 
                         ), # End of sidebarPanel
                         mainPanel(
                                 textInput("txt",
                                           label = h4("Type your input phrase below:"),
                                           placeholder="Enter text...",
                                           width = "100%"),
                                 htmlOutput("textOut"), 
                                 actionButton("clear",label="Clear"),
                                 actionButton("auto",label="Auto Next"),
                                 hr(),
                                 h4("Top Suggestions:"),
                                 tableOutput("options")
                         ) # End of mainPanel
                 ) # End of sidebarLayout
                 
                         
                 
                 ), # End of tabPanel: App
        
        
        tabPanel("Help",
                 
                 hr(),
                 
                 includeMarkdown("help.md")
                 
        ),
        
        tabPanel("About",
                 
                 hr(),
                 
                 includeMarkdown("about.md")
                 
        )
        
)
)


#============================================#

# Server

#============================================#


# Define server logic 
server <- function(input, output, session) {
        session$onSessionEnded(stopApp)
        observe({
                grm <- input$grams
                n <- as.numeric(input$nums)
                if(grm == "tri"){
                        sug <- getWords_3grams(input$txt, n)
                } else{
                        sug <- getWords_4grams(input$txt, n)
                }
                
                output$textOut <- renderText({HTML(
                        paste0("<div style='background-color:#E8E8E8'>",
                               str_squish(input$txt),
                               " ", "<mark>",
                               sug$Words[1],"</mark></body>"
                               )
                        )
                })
                output$options <- renderTable({ sug }) 
        })
        
        observeEvent(
                
                # Clear button 
                input$clear, { 
                        updateTextInput(session, "txt", value = "")
                }
                
                
                
       )
        
        observeEvent(
                
                # Auto button
                
                input$auto, { 
                        grm <- input$grams 
                        n <- as.numeric(input$nums)
                        if(grm == "tri"){
                                sug <- getWords_3grams(input$txt, n)
                        } else if(grm == "quad"){
                                sug <- getWords_4grams(input$txt, n)
                        } else{
                                sug <- getWords_5grams(input$txt, n)
                        }
                        
                        updateTextInput(session, "txt", 
                                        value = paste(str_squish(input$txt),
                                                      sug$Words[1])
                        )
                }
        )
        
}

# Run the application 
shinyApp(ui = ui, server = server)
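
To run the app locally, assuming app.R, predict.R, the four .rsd files, profanities.rds, help.md, and about.md sit together in one app directory (the directory name below is only a placeholder):

#============================================#

# Running the App Locally (illustrative)

#============================================#

library(shiny)
runApp("as_txtpredict")  # "as_txtpredict" stands in for the app directory

# publishing to shinyapps.io requires a configured rsconnect account, e.g.:
# rsconnect::deployApp("as_txtpredict")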

  1. https://asafilian.shinyapps.io/as_txtpredict

  2. https://www.microsoft.com/en-us/swiftkey/about-us

  3. The data can be downloaded at https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip.

  4. The Git repository at https://github.com/jan-san/dsci-benchmark

  5. Brants et al. (2007). "Large language models in machine translation." https://www.aclweb.org/anthology/D07-1090

  6. H. Ney, U. Essen, and R. Kneser (1994). "On structuring probabilistic dependences in stochastic language modelling." https://www.sciencedirect.com/science/article/pii/S0885230884710011

  7. We have used https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words as the reference list of profanities; it contains 376 items.

  8. However, note that we do not remove the stop-words from the unigrams themselves.

  9. The Markov assumption is that the next state depends only on the current state and is independent of the previous history: https://en.wikipedia.org/wiki/Markov_chain

  10. https://en.wikipedia.org/wiki/Good-Turing_frequency_estimation

  11. https://en.wikipedia.org/wiki/Kneser-Ney_smoothing

  12. The Kneser-Ney smoothing algorithm has a notion of continuation probability. It also saves us from having to recalculate all our counts using Good-Turing smoothing.