This is the capstone project for the Data Science Specialization offered by Johns Hopkins University through Coursera. The capstone project is developed in partnership with SwiftKey, a leading company in the field of predictive text analytics.
SwiftKey builds smart keyboards that make it easier for people to type on their mobile devices, with versions for both Android and iOS. One cornerstone of these keyboards is a predictive text model built with Natural Language Processing (NLP) techniques.
The goal of this Capstone project is to understand and build predictive text models like those used by SwiftKey. The product will be an application that takes in a word phrase and returns next-word choices.
Analysis of text data with natural language processing is a new application area introduced at the end of the Data Science Specialization. A rationale for this choice of capstone topic is the belief that a practicing data scientist is frequently confronted with new data types and problems, and that a big part of the fun and challenge of the job is figuring out how to work with these new data types to build data products people love.
The results of the capstone will be reported in phases. The current report covers the first phase: exploratory analysis.
The data come from a corpus called HC Corpora, which was collected from publicly available sources by a web crawler. The crawler checks for language so that the collected texts consist mainly of the desired language. The data are organized into collections of texts in four languages.
The code below downloads the dataset.
if(!file.exists("data")){dir.create("data")}
fileUrl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
fileDest <- "C:/Users/...../Coursera_SwiftKey.zip"
download.file(fileUrl, destfile = fileDest)
dateDownloaded <- date()
list.files()
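The archive then needs to be unzipped so that the text files sit under final/en_US/ in the working directory. A minimal sketch of that step (assuming the download above completed and the working directory is the project folder):
# extract the archive into the working directory (creates the final/ folder)
unzip(fileDest, exdir = ".")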
Large databases comprising text in a target language are commonly used when building language models. In the current exercise only the English database is used; the other three databases, in German, Russian, and Finnish, are not considered further in this project.
The code below reads in only the English text files.
library(readtext)
## After setting the working directory, get the data directory
data_dir <- getwd()
## read in text data: all English files
en_texts <- readtext(paste0(data_dir, "/final/en_US/*"))
Using quanteda, a corpus object is built from the English texts and explored. The code below accomplishes this task.
library(quanteda)
# create a corpus
en_corpus <- corpus(en_texts)
# summarize the texts from the corpus
summ <- as.data.frame(summary(en_corpus))
summ$Size <- c(format(object.size(en_texts[1,2]), "GB", digits = 3),
format(object.size(en_texts[2,2]), "GB", digits = 3),
format(object.size(en_texts[3,2]), "GB", digits = 3))
# extract the numerical values from character string of sizes
x1 <- summ$Size[1]
x1 <- as.numeric(stringr::word(x1, 1, 1))
x2 <- summ$Size[2]
x2 <- as.numeric(stringr::word(x2, 1, 1))
x3 <- summ$Size[3]
x3 <- as.numeric(stringr::word(x3, 1, 1))
# organize in a table
summ <- rbind(summ,
data.frame(Text = "Corpus",
Types = sum(summ$Types),
Tokens = sum(summ$Tokens),
Sentences = sum(summ$Sentences),
Size = paste(sum(x1,x2,x3), "Gb", sep = " ")))
summ$TTR <- round(c(summ$Types/summ$Tokens), 3)
# output summary table
knitr::kable(summ, caption = "Summary description of the English corpus")
| Text | Types | Tokens | Sentences | Size | TTR |
|---|---|---|---|---|---|
| en_US.blogs.txt | 530014 | 44346797 | 2020072 | 0.195 Gb | 0.012 |
| en_US.news.txt | 118765 | 3113070 | 140515 | 0.015 Gb | 0.038 |
| en_US.twitter.txt | 581869 | 36898241 | 2583841 | 0.153 Gb | 0.016 |
| Corpus | 1230648 | 84358108 | 4744428 | 0.363 Gb | 0.015 |
The English corpus includes three text document types: Blogs, News, and Twitter. The numbers of words (tokens) and word types are shown for these documents. The type-token ratio (TTR) is the number of different words (types) in a document divided by the total number of words (tokens); it is an index of lexical diversity or degree of word variation.
Compared to the other two documents, the News document appears to show more word variation, with a TTR of 0.038, roughly two to three times that of the Blogs and Twitter documents. However, TTR is known to be sensitive to text length and may not be a reliable comparative index (https://en.wikipedia.org/wiki/Lexical_diversity).
Together the three documents consist of over 84 million words and use 0.36 Gb of computer memory.
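As a quick arithmetic check of the TTR values, using the counts from the table above:
# TTR for the News document: types divided by tokens
round(118765 / 3113070, 3)
## [1] 0.038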
Given the fairly large size of the English corpus, it is decided to build the predictive model from a random sample of the text data, which should closely approximate the results that would be obtained using all the data while minimizing computer memory use.
To get a representative sample of the population of texts specific to Blogs, News, and Twitter, random sampling is performed within each document type. Furthermore, because the next word is to be predicted within the context of a sentence, the sentence is treated as the unit of sampling.
The next task is to write a function to randomly sample sentences.
randSampleFN <- function(corp, p = 0.1) {
corp_sent <- corpus_reshape(corp, to = 'sentences')
n1 <- ndoc(corp_sent) # get number of sentence documents
n <- round(n1*p) # size of the random sample (p = proportion of sentences)
corp_samp <- corpus_sample(corp_sent, n, replace = FALSE)
return(corp_samp)
}
Next, apply the function to randomly select 10% of the sentences from each document, and display some sample texts.
x <- corpus(en_texts[1,2])
set.seed(4321)
blogs_corp <- randSampleFN(x, p=0.1)
x <- corpus(en_texts[2,2])
set.seed(4321)
news_corp <- randSampleFN(x, p=0.1)
x <- corpus(en_texts[3,2])
set.seed(4321)
tweet_corp <- randSampleFN(x, p=0.1)
rm(x)
# print some example texts
blogs_corp$documents[1:5,1]
## [1] "(For Oswald Plummer it is Delia's bottom that is of especial interest â\200\" as evidenced by the fact that his hand is at present handling it â\200\" but we will come to that.)"
## [2] "Keep some space for try another food in another shop as well!!!"
## [3] "â\200\231 nonsense then maybe thereâ\200\231d be an interesting show here."
## [4] "Practice at least one."
## [5] "It's such an outrageous story that one is reminded of Josef Goebbels' famous dictum that \"the bigger the lie, the more people will believe it.\""
news_corp$documents[1:5,1]
## [1] "But only in recent years, experts say, have they begun routinely disenrolling Indians deemed inauthentic members of a group."
## [2] "My daughter is a modern dancer, working in New York."
## [3] "During our conversation, I learned he was the first in his family to go to college, and that his undergraduate studies were made possible in part by a Pell Grant from the federal government."
## [4] "I really enjoyed this evening.\""
## [5] "Adams was taken Saturday afternoon in critical condition to St. Louis University Hospital, after he shot himself in the head."
tweet_corp$documents[1:5,1]
## [1] "you're back only twitches if you're sexy DUH mine does it all the time ;) Maybe it'sme: can stay off forever!"
## [2] "Happy Birthday to !"
## [3] "Definitely something to be afraid of. should bring back Assy McGee."
## [4] "Foundation honors muckraking journalists."
## [5] "Starts 5:30 #Fulfillment takes resolve."
The code below provides a summary of the final analytic sample.
summ2 <- data.frame(Corpus = c("blogs", "news", "twitter"),
Size = c(format(object.size(blogs_corp), "GB", digits = 3),
format(object.size(news_corp), "GB", digits = 3),
format(object.size(tweet_corp), "GB", digits = 3)),
Sentences = c(ndoc(blogs_corp), ndoc(news_corp),
ndoc(tweet_corp)) )
knitr::kable(summ2, caption = "Summary description of the analytic sample")
| Corpus | Size | Sentences |
|---|---|---|
| blogs | 0.05 Gb | 202007 |
| news | 0.004 Gb | 14051 |
| twitter | 0.053 Gb | 258384 |
The analytic text data are now divided into training and testing sets in a ratio of 80% to 20%. The training set will be used to estimate the statistical parameters of the algorithm, which will then be used to compute probabilities on the test set. The split is done separately by document type to minimize potential bias of one set against the other with respect to data source domain.
# blogs
n <- length(blogs_corp$documents[,1])
n.train <- round(0.8*n)
train.blogs <- blogs_corp$documents[1:n.train, 1]
test.blogs <- blogs_corp$documents[(n.train + 1):n, 1]
# news
n <- length(news_corp$documents[,1])
n.train <- round(0.8*n)
train.news <- news_corp$documents[1:n.train, 1]
test.news <- news_corp$documents[(n.train + 1):n, 1]
# twitter
n <- length(tweet_corp$documents[,1])
n.train <- round(0.8*n)
train.tweet <- tweet_corp$documents[1:n.train, 1]
test.tweet <- tweet_corp$documents[(n.train + 1):n, 1]
The Blogs, News, and Twitter text documents come from different domains, as reflected, for example, by their different TTRs. Nevertheless, to further reduce the overall data size, they are combined into one dataset for modeling purposes.
The following code combines the documents, currently represented as vectors of sentences, separately for the training and testing data.
### Combine sentence vectors for training, testing samples
train <- c(train.blogs, train.news, train.tweet)
length(train)
## [1] 379554
train[1:10]
## [1] "(For Oswald Plummer it is Delia's bottom that is of especial interest â\200\" as evidenced by the fact that his hand is at present handling it â\200\" but we will come to that.)"
## [2] "Keep some space for try another food in another shop as well!!!"
## [3] "â\200\231 nonsense then maybe thereâ\200\231d be an interesting show here."
## [4] "Practice at least one."
## [5] "It's such an outrageous story that one is reminded of Josef Goebbels' famous dictum that \"the bigger the lie, the more people will believe it.\""
## [6] "His wood working studio was adjacent and he had to walk through the pea gravel to come and go."
## [7] "Also it can be helpful to play the melody on a piano and sing with it to get the notes right in the head."
## [8] "In Genesis 15: God told Abraham that he would have a son."
## [9] "I am sure I can't be the only paranoid wife of a deployed service member."
## [10] "Mim is a top position now Sarika."
rm("train.blogs", "train.news", "train.tweet")
test <- c(test.blogs, test.news, test.tweet)
length(test)
## [1] 94888
rm("test.blogs", "test.news", "test.tweet")
The data cleaning approach chosen for the free text content is influenced by the nature of the NLP task at hand. The primary interest is in sequential words within a sentence, so the sentence structure, a string of consecutive words, is preserved through preprocessing.
To improve computational efficiency, specified text characters are removed, retaining only the 26 letters of the English alphabet.
It is common practice in text analytics to exclude certain frequently used words, the standard stopwords (e.g. the, and, for), to reduce the size of the working data object, especially since removing these functional words often does not change the meaning of a sentence. However, precisely because they are so commonly used, it is decided not to exclude them from the next-word prediction dictionary: end users should benefit most from being helped to type these frequent words.
A function is written to clean the sentences; it is adapted from http://amunategui.github.io/speak-like-a-doctor/. Specific details of the preprocessing are documented in the code comments.
Clean_SentencesFn <- function(text_blob) {
# swap all sentence ends with code 'ootoo'
text_blob <- gsub(pattern=';|\\.|!|\\?', x=text_blob, replacement='ootoo')
# convert encoding from Latin-1 to ASCII
text_blob <- iconv(text_blob, "latin1", "ASCII", sub="")
# avoid the conversion of e.g. "U.S." to "us"
text_blob <- gsub(pattern="U.S.A.| U.S.A | U.S | U.S.| U S | U.S | u s | u.s | u.s.a |united states |United States", x=text_blob, replacement='USA')
# remove all non-alpha text (numbers etc)
text_blob <- gsub(pattern="[^[:alpha:]]", x=text_blob, replacement = ' ')
# force all characters to lower case
text_blob <- tolower(text_blob)
# Restore instances of e.g. "can't" when not appropriately converted
text_blob <- gsub("\u0092", "'", text_blob)
text_blob <- gsub("\u0093|\u0094", "", text_blob)
# remove any single-character words
text_blob <- gsub(pattern="\\W*\\b\\w{1}\\b", x=text_blob, replacement=' ')
# remove contiguous spaces
text_blob <- gsub(pattern="\\s+", x=text_blob, replacement=' ')
# split sentences by split code
sentence_vector <- unlist(strsplit(x=text_blob, split='ootoo',fixed = TRUE))
return (sentence_vector)
}
Both the training and testing data are cleaned by applying Clean_SentencesFn. Six sample sentences are printed to check that the cleaning worked.
# Apply function
train.clean <- Clean_SentencesFn(paste(train, collapse = " "))
test.clean <- Clean_SentencesFn(paste(test, collapse = " "))
train.clean[1:6]
## [1] " for oswald plummer it is delia bottom that is of especial interest as evidenced by the fact that his hand is at present handling it but we will come to that"
## [2] " keep some space for try another food in another shop as well"
## [3] ""
## [4] ""
## [5] " nonsense then maybe thered be an interesting show here"
## [6] " practice at least one"
The results show that there are a few empty elements in the text vectors. The following code removes them:
# delete empty vector elements: ''
train.clean <- train.clean[train.clean != ""]
test.clean <- test.clean[test.clean != ""]
table(train.clean == "")
##
## FALSE
## 463200
table(test.clean == "")
##
## FALSE
## 115590
Next, the train.clean dataset should contain only English words. This requires an English vocabulary list against which to check each word in the training data.
A free online English word list, suitable for building a basic algorithm, is available for download at https://raw.githubusercontent.com/dwyl/english-words/master/words_alpha.txt.
# download dictionary
fileUrl <- "https://raw.githubusercontent.com/dwyl/english-words/master/words_alpha.txt"
fileDest <- "C:/Users/Peter/DataScience/MachineLearningCourse/TM_CapstoneProj/words_alpha.txt"
download.file(fileUrl, destfile = fileDest)
dateDownloaded <- date()
Next, read in the dictionary file.
dictEn <- read.csv(file = "words_alpha.txt", header = FALSE, sep = ",", stringsAsFactors = FALSE)
dim(dictEn)
## [1] 370099 1
head(dictEn)
## V1
## 1 a
## 2 aa
## 3 aaa
## 4 aah
## 5 aahed
## 6 aahing
There are over 370,000 words in the downloaded English dictionary.
Once loaded into R, the English word list is converted to a quanteda dictionary object, which will be used to remove non-English tokens from the training data.
mydict <- dictionary(list(engwords = dictEn[,1]))
The constructed mydict will be applied in the next step when creating n-grams.
There are 463,200 sentences in the final training data and 115,590 sentences in the test data.
The next step in this NLP pipeline is to split the text into smaller units (tokens) corresponding to words or combinations of words (e.g. pairs, triplets) in their natural sequence as used in a sentence.
Specifically, the algorithm will use 2-grams (two adjacent words) and 3-grams (three adjacent words) created from the cleaned training data, with non-English words removed, to predict the next word based on the previous one or two words in the phrase.
These tokens (1-grams, 2-grams, 3-grams) are stored in the n-gram dictionary that will be part of the predictive text app to be developed.
To evaluate the performance of predictive algorithm, the test data will also be subjected to exactly the same preprocessing to ensure consistency between training and testing.
Here a function is written to tokenize input text data into n-grams, with the desired number of words (n) specified, while removing non-English words (i.e. words not found in mydict). The function outputs a document-feature matrix (dfm).
# Write function that creates a dfm for only english words
dfm_usedict_Fn <- function(text_set, dict, n_gram = 1) {
if (n_gram < 2) {
dtm <- text_set %>% tokens() %>% tokens_select(pattern = dict, selection = "keep",
case_insensitive = TRUE) %>% dfm()
} else {
dtm <- text_set %>% tokens() %>% tokens_select(pattern = dict, selection = "keep",
case_insensitive = TRUE) %>% tokens_ngrams(n = n_gram, concatenator = " ") %>%
dfm()
}
return(dtm)
}
Next, apply the above function to create n-grams from the training data, then generate frequency tables ordered with the most frequent n-grams at the top.
dfm_1gram_train <- dfm_usedict_Fn(text_set = train.clean, dict = mydict, n_gram = 1)
format(object.size(dfm_1gram_train), "GB", digits = 3)
## [1] "0.119 Gb"
head(textstat_frequency(dfm_1gram_train)[, 1:4])
## feature frequency rank docfreq
## 1 the 235811 1 153531
## 2 to 154406 2 120147
## 3 and 127667 3 102207
## 4 of 103850 4 82405
## 5 in 82467 5 71313
## 6 you 73832 6 60882
tail(textstat_frequency(dfm_1gram_train)[, 1:4])
## feature frequency rank docfreq
## 56088 impermanence 1 56088 1
## 56089 proctologist 1 56089 1
## 56090 consults 1 56090 1
## 56091 vitae 1 56091 1
## 56092 cristi 1 56092 1
## 56093 biomechanical 1 56093 1
dfm_2gram_train <- dfm_usedict_Fn(text_set = train.clean, dict = mydict, n_gram = 2)
format(object.size(dfm_2gram_train), "GB", digits = 3)
## [1] "0.212 Gb"
head(textstat_frequency(dfm_2gram_train)[, 1:4])
## feature frequency rank docfreq
## 1 of the 20794 1 19092
## 2 in the 19535 2 18603
## 3 for the 11047 3 10811
## 4 to the 10839 4 10516
## 5 on the 10454 5 10162
## 6 to be 9561 6 9292
tail(textstat_frequency(dfm_2gram_train)[, 1:4])
## feature frequency rank docfreq
## 1354618 was suspect 1 1354618 1
## 1354619 suspect because 1 1354619 1
## 1354620 because ideas 1 1354620 1
## 1354621 grow big 1 1354621 1
## 1354622 grow because 1 1354622 1
## 1354623 on slippery 1 1354623 1
dfm_3gram_train <- dfm_usedict_Fn(text_set = train.clean, dict = mydict, n_gram = 3)
format(object.size(dfm_3gram_train), "GB", digits = 3)
## [1] "0.348 Gb"
head(textstat_frequency(dfm_3gram_train)[, 1:4])
## feature frequency rank docfreq
## 1 thanks for the 1832 1 1831
## 2 one of the 1720 2 1708
## 3 going to be 1050 3 1047
## 4 thank you for 837 4 836
## 5 looking forward to 802 5 801
## 6 the end of 780 6 775
tail(textstat_frequency(dfm_3gram_train)[, 1:4])
## feature frequency rank docfreq
## 3044466 they tried not 1 3044466 1
## 3044467 show their shock 1 3044467 1
## 3044468 their shock as 1 3044468 1
## 3044469 shock as they 1 3044469 1
## 3044470 did as she 1 3044470 1
## 3044471 as she asked 1 3044471 1
Some words are more frequent than others. Word clouds provide a visual analytics approach to identifying the most frequent words.
textplot_wordcloud(dfm_1gram_train, min_size = 2, max_size = 5, max_words = 100,
rotation = 0.25, color = rev(RColorBrewer::brewer.pal(11, "RdBu")))
As expected, the stopwords are the most frequently typed words.
A quantitative alternative to word clouds is to plot the distributions of word frequencies.
library(ggplot2)
library(gridExtra)
freqData <- head(textstat_frequency(dfm_1gram_train), 30)
ggplot(freqData, aes(x = reorder(feature, frequency), y = frequency/1000)) +
geom_bar(stat = "identity") + coord_flip() + labs(title = "30 most frequent words",
x = "1-gram")
rm(freqData)
freqData <- head(textstat_frequency(dfm_2gram_train), 30)
g1 <- ggplot(freqData, aes(x = reorder(feature, frequency), y = frequency/1000)) +
geom_bar(stat = "identity") + coord_flip() + labs(title = "Bigrams", x = "2-gram")
freqData <- head(textstat_frequency(dfm_3gram_train), 30)
g2 <- ggplot(freqData, aes(x = reorder(feature, frequency), y = frequency/1000)) +
geom_bar(stat = "identity") + coord_flip() + labs(title = "Trigrams", x = "3-gram")
grid.arrange(g1, g2, ncol = 2)
rm(freqData)
dat <- textstat_frequency(dfm_1gram_train)
x1.1 <- nrow(dat) # number of tokens
x1.2 <- nrow(dat[dat$frequency == 1, ])
x1.3 <- round(x1.2/x1.1, 2)
dat <- textstat_frequency(dfm_2gram_train)
x2.1 <- nrow(dat) # number of tokens
x2.2 <- nrow(dat[dat$frequency == 1, ])
x2.3 <- round(x2.2/x2.1, 2)
dat <- textstat_frequency(dfm_3gram_train)
x3.1 <- nrow(dat)
x3.2 <- nrow(dat[dat$frequency == 1, ])
x3.3 <- round(x3.2/x3.1, 2)
singles <- data.frame(n_tokens = c(x1.1, x2.1, x3.1), n_freq_of_1 = c(x1.2,
x2.2, x3.2), p_freq_of_1 = c(x1.3, x2.3, x3.3))
row.names(singles) <- c("1-gram", "2-gram", "3-gram")
rm(dat)
knitr::kable(singles, caption = "Proportion of n-grams that appear only once")
|  | n_tokens | n_freq_of_1 | p_freq_of_1 |
|---|---|---|---|
| 1-gram | 56093 | 15212 | 0.27 |
| 2-gram | 1354623 | 992816 | 0.73 |
| 3-gram | 3044471 | 2712279 | 0.89 |
Many of the n-grams are rarely used, yet they contribute substantially to the size of the model dictionary. About a quarter (27%) of the 1-grams occur only once; worse still, nearly 90% of the 3-grams occur only once. It may be more computationally efficient, and no less accurate, to re-classify all these singletons as unseen n-grams in the dictionary.
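One way to implement this is quanteda's dfm_trim(), which drops features below a frequency threshold. The sketch below is not part of the current pipeline and assumes a recent quanteda version (older versions, such as the 1.2.0 used here, name the argument min_count rather than min_termfreq):
# keep only n-grams observed at least twice; singletons are dropped
# and will later be handled as unseen n-grams
dfm_1gram_trim <- dfm_trim(dfm_1gram_train, min_termfreq = 2)
dfm_2gram_trim <- dfm_trim(dfm_2gram_train, min_termfreq = 2)
dfm_3gram_trim <- dfm_trim(dfm_3gram_train, min_termfreq = 2)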
It is also of interest to know how many unique words are needed in a frequency-sorted dictionary to cover, say, 50% of all word instances in the corpus.
A function is written to calculate coverage statistics; it accepts a frequency-sorted dataframe and a label as input.
# Write coverage function
coverageFn <- function(dat, datName) {
n <- sum(dat$frequency)
dat$prop <- dat$frequency/n
dat$cumprop <- cumsum(dat$prop)
# calculate coverage statistics
x <- nrow(dat)
x.5 <- nrow(dat[dat$cumprop <= 0.5, ])
x.9 <- nrow(dat[dat$cumprop <= 0.9, ])
x.95 <- nrow(dat[dat$cumprop <= 0.95, ])
x.975 <- nrow(dat[dat$cumprop <= 0.975, ])
x.99 <- nrow(dat[dat$cumprop <= 0.99, ])
# organize into datatable
d <- data.frame(Number = c(x.5, x.9, x.95, x.975, x.99, x), Proportion = c(round(x.5/x,
3), round(x.9/x, 3), round(x.95/x, 3), round(x.975/x, 3), round(x.99/x,
3), 1))
row.names(d) <- c("Cover 50% of all word instances", "Cover 90% of all word instances",
"Cover 95% of all word instances", "Cover 975% of all word instances",
"Cover 99% of all word instances", "Cover 100% of all word instances")
names(d) <- c(paste0(datName, "_N"), paste0(datName, "_Prop"))
return(d)
}
First, quanteda's textstat_frequency() function is applied to a dfm to output a dataframe of n-grams ranked by frequency. Then the coverageFn function is applied to this output, separately for each n-gram size:
dfm_1gram_train %>% textstat_frequency() %>% coverageFn(., datName = "Unigram")
## Unigram_N Unigram_Prop
## Cover 50% of all word instances 121 0.002
## Cover 90% of all word instances 4821 0.086
## Cover 95% of all word instances 9916 0.177
## Cover 97.5% of all word instances 16752 0.299
## Cover 99% of all word instances 27374 0.488
## Cover 100% of all word instances 56093 1.000
dfm_2gram_train %>% textstat_frequency() %>% coverageFn(., datName = "Bigram")
## Bigram_N Bigram_Prop
## Cover 50% of all word instances 28731 0.021
## Cover 90% of all word instances 888927 0.656
## Cover 95% of all word instances 1121775 0.828
## Cover 97.5% of all word instances 1238199 0.914
## Cover 99% of all word instances 1308053 0.966
## Cover 100% of all word instances 1354623 1.000
dfm_3gram_train %>% textstat_frequency() %>% coverageFn(., datName = "Trigram")
## Trigram_N Trigram_Prop
## Cover 50% of all word instances 928958 0.305
## Cover 90% of all word instances 2621368 0.861
## Cover 95% of all word instances 2832919 0.931
## Cover 97.5% of all word instances 2938695 0.965
## Cover 99% of all word instances 3002160 0.986
## Cover 100% of all word instances 3044471 1.000
While less than 50% of the unique words cover 99% of all word instances in the corpus, essentially all of the unique bigrams and trigrams are required for the same coverage. This suggests that it may be more efficient, in terms of dictionary size, to keep only the top 50% of the most frequent 1-grams. For the rarely seen bigrams and trigrams, a better approach may be to treat them as unseen n-grams.
A look-up n-gram dictionary with an acceptable data size (< 1.0 Gb) will be produced from the current 1-grams (0.1 Gb), 2-grams (0.2 Gb) and 3-grams (0.35 Gb). This will also limit the amount of memory (physical RAM) required to run the model in R. The dictionary will be saved as a .csv datafile, with each n-gram occupying a row along with its frequency proportion.
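A possible way to assemble and save such a look-up table is sketched below; the object and file names (ngram_dict, ngram_dictionary.csv) are placeholders, not part of the report's code:
# one row per n-gram, with its raw frequency and its proportion within each n-gram order
freq1 <- textstat_frequency(dfm_1gram_train)[, c("feature", "frequency")]
freq2 <- textstat_frequency(dfm_2gram_train)[, c("feature", "frequency")]
freq3 <- textstat_frequency(dfm_3gram_train)[, c("feature", "frequency")]
ngram_dict <- rbind(cbind(n = 1, freq1), cbind(n = 2, freq2), cbind(n = 3, freq3))
ngram_dict$proportion <- ave(ngram_dict$frequency, ngram_dict$n,
                             FUN = function(f) f/sum(f))
write.csv(ngram_dict, "ngram_dictionary.csv", row.names = FALSE)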
A Markov assumption will be applied: the probability of the next word depends only on the previous n - 1 words (the history), where n = 2 (bigram) or n = 3 (trigram) in this case.
For example, when the input is a 2-word phrase (w1 w2), that is, a bigram, the algorithm will apply the chain rule to estimate the probability of the next word w3 to complete the string w1 w2 w3 (trigram):
P(w3 | w1 w2) = P(w1 w2 w3) / [P(w1) P(w2 | w1)]
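With maximum-likelihood estimates this reduces to the trigram count divided by the bigram count. The sketch below (cond_prob is a hypothetical helper, not part of the report's code) shows how such an estimate could be read off the frequency tables created earlier:
# P(w3 | w1 w2) estimated as count(w1 w2 w3) / count(w1 w2)
freq2 <- textstat_frequency(dfm_2gram_train)
freq3 <- textstat_frequency(dfm_3gram_train)
cond_prob <- function(w1, w2, w3) {
    num <- freq3$frequency[freq3$feature == paste(w1, w2, w3)]
    den <- freq2$frequency[freq2$feature == paste(w1, w2)]
    if (length(num) == 0 || length(den) == 0) return(NA_real_)
    num/den
}
cond_prob("one", "of", "the")   # how often "one of" is followed by "the"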
The proportion (probability) column of the n-gram dictionary will be adjusted by smoothing, using the [Good-Turing](https://en.wikipedia.org/wiki/Good%E2%80%93Turing_frequency_estimation) estimation method. Smoothing assigns non-zero probabilities to tokens that are absent from the dictionary but supplied as an input phrase, e.g. from the test data. The number of n-grams (and hence estimated probabilities) in the dictionary is the number of parameters to work with; to reduce it, n-grams seen only once will be treated the same as unseen n-grams and represented by a single parameter.
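A minimal sketch of the Good-Turing adjustment, assuming the standard formula c* = (c + 1) N_{c+1} / N_c and applying no smoothing to the frequency-of-frequencies (which a fuller implementation would add):
# adjusted counts from the frequency-of-frequencies
gt_adjust <- function(freqs) {
    Nc <- table(freqs)                 # N_c: number of n-grams seen exactly c times
    adjusted <- sapply(freqs, function(c) {
        n_c  <- Nc[as.character(c)]
        n_c1 <- Nc[as.character(c + 1)]
        if (is.na(n_c1)) return(c)     # keep the raw count where N_{c+1} is empty
        (c + 1) * n_c1/n_c
    })
    as.numeric(adjusted)
}
# e.g. freq1 <- textstat_frequency(dfm_1gram_train)
#      freq1$gt_count <- gt_adjust(freq1$frequency)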
In addition, the Katz back-off approach will be considered for estimating the probabilities of unobserved n-grams, whose numbers are expected to be high for the bigrams and trigrams. For the basic model, if no matching 3-gram is found the algorithm backs off to the 2-gram with the highest count, and if no matching 2-gram is found it backs off to the most frequent 1-gram.
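A bare-bones sketch of that back-off rule, using the frequency tables built earlier (predict_next is a hypothetical helper and ignores discounting):
freq1 <- textstat_frequency(dfm_1gram_train)
freq2 <- textstat_frequency(dfm_2gram_train)
freq3 <- textstat_frequency(dfm_3gram_train)
# try trigrams starting with "w1 w2 ", then bigrams starting with "w2 ",
# then fall back to the single most frequent unigram
predict_next <- function(w1, w2) {
    hits3 <- freq3$feature[startsWith(freq3$feature, paste0(w1, " ", w2, " "))]
    if (length(hits3) > 0) return(sub(".* ", "", hits3[1]))
    hits2 <- freq2$feature[startsWith(freq2$feature, paste0(w2, " "))]
    if (length(hits2) > 0) return(sub(".* ", "", hits2[1]))
    freq1$feature[1]
}
predict_next("thanks", "for")   # should return "the", the most frequent completion above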
The testing data will be tokenized in the same way. The bigrams and trigrams will be split into an input phrase and the known next word; for example, a trigram is split into a two-word input phrase and its next word. The algorithm will take each input phrase and predict the most likely next word, and the percentage correctly predicted will measure its accuracy.
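A simplified sketch of this evaluation, reusing the hypothetical predict_next helper sketched above and scoring each distinct test trigram once (ignoring frequency weights):
# split each test trigram into a two-word input phrase and its observed next word
test_tri <- textstat_frequency(dfm_usedict_Fn(test.clean, mydict, n_gram = 3))$feature
input  <- sub(" \\S+$", "", test_tri)   # first two words
actual <- sub(".* ", "", test_tri)      # observed third word
preds  <- vapply(strsplit(input, " "), function(w) predict_next(w[1], w[2]),
                 character(1))
mean(preds == actual)                   # proportion of next words correctly predicted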
sessionInfo()
## R version 3.5.0 (2018-04-23)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17134)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] gridExtra_2.3 tibble_1.4.2 ggplot2_2.2.1 rio_0.5.10
## [5] dplyr_0.7.4 readtext_0.50 quanteda_1.2.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.16 RColorBrewer_1.1-2 formatR_1.5
## [4] highr_0.6 cellranger_1.1.0 compiler_3.5.0
## [7] pillar_1.2.2 plyr_1.8.4 bindr_0.1.1
## [10] forcats_0.3.0 tools_3.5.0 stopwords_0.9.0
## [13] digest_0.6.15 lubridate_1.7.4 evaluate_0.10.1
## [16] gtable_0.2.0 lattice_0.20-35 pkgconfig_2.0.1
## [19] rlang_0.2.0 openxlsx_4.0.17 Matrix_1.2-14
## [22] fastmatch_1.1-0 curl_3.2 yaml_2.1.19
## [25] haven_1.1.1 bindrcpp_0.2.2 stringr_1.3.0
## [28] httr_1.3.1 knitr_1.20 rprojroot_1.3-2
## [31] grid_3.5.0 glue_1.2.0 data.table_1.11.0
## [34] R6_2.2.2 readxl_1.1.0 foreign_0.8-70
## [37] rmarkdown_1.9 spacyr_0.9.9 magrittr_1.5
## [40] backports_1.1.2 scales_0.5.0 htmltools_0.3.6
## [43] assertthat_0.2.0 colorspace_1.3-2 labeling_0.3
## [46] stringi_1.1.7 lazyeval_0.2.1 RcppParallel_4.4.0
## [49] munsell_0.4.3