In this report, an actual model and a word predictor are built. First, the process of data retrieval, cleaning and tidying from the exploratory data analysis is repeated in a more efficient manner. Then, multiple models with different parameters are trained on a training set and tested on a validation set. Finally, some accuracy of the best performing model is given up in favor of performance.
First of all, we read our corpus from three files, store them in variables, and close the connections.
con1 <- file("corpusdata/en_US/en_US.twitter.txt", "r")
con2 <- file("corpusdata/en_US/en_US.news.txt", "r")
con3 <- file("corpusdata/en_US/en_US.blogs.txt", "r")
tweets <- readLines(con1)
news <- readLines(con2)
blogs <- readLines(con3)
close(con1); close(con2); close(con3)
Then we sample 10% of the tweets and blogposts, and 100% of the news items. As mentioned in the exploratory analysis, this keeps the ratio somewhat balanced. The seed is set for reproducibility.
set.seed(50)
samplelines <- c(sample(tweets, length(tweets) * 0.1),
                 sample(news, length(news) * 1),
                 sample(blogs, length(blogs) * 0.1))
The sampled lines are now shuffled (after setting the seed again), so that tweets, news items and blog posts are mixed. Next, we determine two indices: one for splitting between the training and validation set, and one for splitting between the validation and testing set. Afterwards, we perform the actual splits.
set.seed(50)
samplelines <- sample(samplelines)
validationIndex <- floor(length(samplelines) * 0.8)
testingIndex <- floor(length(samplelines) * 0.9)
training <- samplelines[1:validationIndex]
validation <- samplelines[(validationIndex+1):testingIndex]
testing <- samplelines[(testingIndex+1):length(samplelines)]
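As a quick, optional sanity check on the split sizes (this snippet is not part of the pipeline itself), the proportions should come out at roughly 80/10/10:
round(c(training   = length(training),
        validation = length(validation),
        testing    = length(testing)) / length(samplelines), 2)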
The training, validation and testing data are converted to lower case, apostrophes are normalized, and sentence endings are marked with a ''split'' token so that n-grams never span two sentences. Next, each line is split on anything that is not a letter or an apostrophe, and empty tokens are removed.
tokenizer <- function(lines) {
  lines <- tolower(lines)
  lines <- gsub("’", "'", lines)                          # normalize curly apostrophes
  lines <- gsub("[.!?]$|[.!?] |$", " ''split'' ", lines)   # mark sentence and line endings
  tokens <- unlist(strsplit(lines, "[^a-z']"))             # split on anything but letters and apostrophes
  tokens <- tokens[tokens != ""]                           # drop empty tokens
  return(tokens)
}
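As a quick illustration on a made-up line (not taken from the corpus), the tokenizer lower-cases the text, marks the sentence boundary and the line ending with the ''split'' token, and splits on everything that is not a letter or apostrophe:
tokenizer("Hello there! How are you today")
# gives: "hello" "there" "''split''" "how" "are" "you" "today" "''split''"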
tokens <- tokenizer(training)
vtokens <- tokenizer(validation)
ttokens <- tokenizer(testing)
We would like n-gram counts: how many times does each n-gram appear in the training corpus? This takes four steps: creating shifted copies of the token vector, pasting them together into n-grams, removing n-grams that cross a sentence boundary, and tabulating the counts.
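To see the shift-and-paste trick used in the first two steps, here is a toy example with made-up tokens (not from the corpus):
toy  <- c("to", "be", "or", "not")
toy2 <- c(toy[-1], ".")    # shifted by one word: "be" "or" "not" "."
paste(toy, toy2)           # bigrams: "to be" "be or" "or not" "not ."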
tokens2 <- c(tokens[-1], ".")
tokens3 <- c(tokens2[-1], ".")
tokens4 <- c(tokens3[-1], ".")
tokens5 <- c(tokens4[-1], ".")
tokens6 <- c(tokens5[-1], ".")
unigrams <- tokens
bigrams <- paste(tokens, tokens2)
trigrams <- paste(tokens, tokens2, tokens3)
quadgrams <- paste(tokens, tokens2, tokens3, tokens4)
fivegrams <- paste(tokens, tokens2, tokens3, tokens4, tokens5)
sixgrams <- paste(tokens, tokens2, tokens3, tokens4, tokens5, tokens6)
unigrams <- unigrams[!grepl("''split''", unigrams)]
bigrams <- bigrams[!grepl("''split''", bigrams)]
trigrams <- trigrams[!grepl("''split''", trigrams)]
quadgrams <- quadgrams[!grepl("''split''", quadgrams)]
fivegrams <- fivegrams[!grepl("''split''", fivegrams)]
sixgrams <- sixgrams[!grepl("''split''", sixgrams)]
We use the table function to get the n-gram counts and sort so the most frequent n-grams come first.
unigrams <- sort(table(unigrams), decreasing=T)
bigrams <- sort(table(bigrams), decreasing=T)
trigrams <- sort(table(trigrams), decreasing=T)
quadgrams <- sort(table(quadgrams), decreasing=T)
fivegrams <- sort(table(fivegrams), decreasing=T)
sixgrams <- sort(table(sixgrams), decreasing=T)
Here are some plots of the n-grams, showing the 15 most frequent of each.
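A minimal base-R sketch for one such plot (the report's actual plotting code may differ):
# Bar chart of the 15 most frequent bigrams
barplot(head(bigrams, 15), las=2, cex.names=0.7,
        main="15 most frequent bigrams", ylab="Count")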
The following functions will be helpful: getLastWords() retrieves the last n words of a string, and removeLastWord() removes the last word from a string. Both use regular expressions and the stringr package.
library(stringr)
getLastWords <- function(string, words) {
  # Match the last `words` space-separated words and return them as a string
  pattern <- paste("[a-z']+( [a-z']+){", words - 1, "}$", sep="")
  return(substring(string, str_locate(string, pattern)[,1]))
}
removeLastWord <- function(string) {
  # Strip the final word (and its preceding space) from a string
  sub(" [a-z']+$", "", string)
}
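A quick illustration on a made-up phrase:
getLastWords("thanks for the follow", 2)    # "the follow"
removeLastWord("thanks for the follow")     # "thanks for the"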
From Wikipedia:
Kneser-Ney smoothing is a method primarily used to calculate the probability distribution of n-grams in a document based on their histories. It is widely considered the most effective method of smoothing due to its use of absolute discounting by subtracting a fixed value from the probability’s lower order terms to omit n-grams with lower frequencies. This approach has been considered equally effective for both higher and lower order n-grams.
A common example that illustrates the concept behind this method is the frequency of the bigram “San Francisco”. If it appears several times in a training corpus, the frequency of the unigram “Francisco” will also be high. Relying on only the unigram frequency to predict the frequencies of n-grams leads to skewed results; however, Kneser-Ney smoothing corrects this by considering the frequency of the unigram in relation to possible words preceding it.
Let $c(w_1 \dots w_n)$ denote how often the n-gram $w_1 \dots w_n$ occurs in the training corpus, let $w_1 \dots w_{n-1}$ be the (n-1)-gram with the last word excluded, and let $w_2 \dots w_n$ be the (n-1)-gram with the first word excluded. The formula for the Kneser-Ney smoothed probabilities is then:
\[ \begin{aligned} &P_{KN}(w_1 \dots w_n) = \frac{\max(c(w_1 \dots w_n) - d,\ 0)}{c(w_1 \dots w_{n-1})} + \frac{d \cdot \#\{\text{n-grams starting with } w_1 \dots w_{n-1}\}}{c(w_1 \dots w_{n-1})} \cdot P_{KN}(w_2 \dots w_n) \\ &P_{KN}(w_2 \dots w_n) = \frac{\max(c(w_2 \dots w_n) - d,\ 0)}{c(w_2 \dots w_{n-1})} + \frac{d \cdot \#\{\text{(n-1)-grams starting with } w_2 \dots w_{n-1}\}}{c(w_2 \dots w_{n-1})} \cdot P_{KN}(w_3 \dots w_n) \\ &\dots \\ &P_{KN}(w_n) = \frac{\#\{\text{bigrams ending with } w_n\}}{\#\{\text{bigrams}\}} \end{aligned} \]
The discount value d is a chosen constant. Notice how the formula works recursively on the n-gram with the first word excluded, all the way down to a separate base case for unigrams. Here, we implement Kneser-Ney smoothing as a function that takes the n-gram counts from the training set and calculates a probability for each of them.
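To make the recursion concrete, here is the first step worked out with made-up counts (purely illustrative, not taken from the corpus). Suppose the bigram "of the" occurs 50 times, the unigram "of" occurs 200 times, 30 distinct bigrams start with "of", and d = 0.75. Then:
\[ P_{KN}(\text{of the}) = \frac{\max(50 - 0.75,\ 0)}{200} + \frac{0.75 \cdot 30}{200} \cdot P_{KN}(\text{the}) \approx 0.246 + 0.113 \cdot P_{KN}(\text{the}) \]
where $P_{KN}(\text{the})$ is the number of distinct bigrams ending in "the" divided by the total number of distinct bigrams.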
kneserNay <- function(ngrams, d) {
  n <- length(strsplit(names(ngrams[1]), " ")[[1]])
  # Special case for unigrams
  if(n==1) {
    noFirst <- unigrams[getLastWords(names(bigrams), 1)]
    pContinuation <- table(names(noFirst))[names(unigrams)] / length(bigrams)
    return(pContinuation)
  }
  # Get needed counts
  nMinusOne <- list(unigrams, bigrams, trigrams, quadgrams, fivegrams, sixgrams)[[n-1]]
  noLast <- nMinusOne[removeLastWord(names(ngrams))]
  noFirst <- nMinusOne[getLastWords(names(ngrams), n-1)]
  # Calculate discounts, lambda and pContinuation
  discounts <- ngrams - d
  discounts[discounts < 0] <- 0
  lambda <- d * table(names(noLast))[names(noLast)] / noLast
  if(n == 2) pContinuation <- table(names(noFirst))[names(noFirst)] / length(ngrams)
  else pContinuation <- kneserNay(noFirst, d)
  # Put it all together
  probabilities <- discounts / noLast + lambda * pContinuation / length(ngrams)
  return(probabilities)
}
We immediately apply this function on all our n-gram counts.
unigramProbs <- kneserNay(unigrams, 0.75)
bigramProbs <- kneserNay(bigrams, 0.75)
trigramProbs <- kneserNay(trigrams, 0.75)
quadgramProbs <- kneserNay(quadgrams, 0.75)
fivegramProbs <- kneserNay(fivegrams, 0.75)
sixgramProbs <- kneserNay(sixgrams, 0.75)
Now that we have probabilities for every n-gram, we can start building a model that predicts a word based on a given input. One idea is simply to retrieve the last n-1 words of the input and look for the n-gram with the highest probability that starts with these n-1 words. The last word of that particular n-gram would be our prediction. Two problems arise, however: an n-gram starting with these exact n-1 words may not occur in the training corpus at all, and even when it does, its probability may be so low that a shorter context would yield a more reliable prediction.
A back-off model solves both problems. Whenever an n-gram starting with the last n-1 words of the input does not occur, or its probability does not exceed a certain threshold, we 'back off' to the lower-tier n-grams and hope that the relevant (n-1)-gram does exist or yields a higher probability.
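To make the idea concrete before tuning anything, here is a hypothetical sketch for a single input string. It assumes the *Probs objects built above are in memory; predictWord is an illustrative helper, not part of the final application, and scanning all n-gram names like this is far too slow for interactive use:
# Hypothetical back-off sketch: try the longest context first, then shorter ones
predictWord <- function(input, threshold = 0.0001) {    # threshold is an arbitrary placeholder
  tables <- list(bigramProbs, trigramProbs, quadgramProbs, fivegramProbs, sixgramProbs)
  for (n in 6:2) {
    context <- getLastWords(input, n - 1)                # last n-1 words of the input
    probs <- tables[[n - 1]]
    # n-grams in the training data that start with this context
    candidates <- probs[grepl(paste0("^", context, " "), names(probs))]
    if (length(candidates) > 0 && max(candidates, na.rm=TRUE) >= threshold)
      return(getLastWords(names(which.max(candidates)), 1))
  }
  names(unigrams)[1]                                     # fall back to the most frequent word
}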
We have two parameters to choose when building a model: the threshold value (how soon do we prefer backing off to the lower-tier n-grams?) and the starting tier (in which set of n-gram probabilities do we start looking?). We determine both by creating multiple models with different parameters and seeing how well they perform on our validation corpus. First, we split the validation corpus in the same way we split the training corpus. The idea is to treat every validation sixgram as if it were input typed by a user.
vtokens2 <- c(vtokens[-1], ".")
vtokens3 <- c(vtokens2[-1], ".")
vtokens4 <- c(vtokens3[-1], ".")
vtokens5 <- c(vtokens4[-1], ".")
vtokens6 <- c(vtokens5[-1], ".")
vsixgrams <- paste(vtokens, vtokens2, vtokens3, vtokens4, vtokens5, vtokens6)
For every set of six words in the validation set, we would like to calculate the probability of correctly predicting the last word based on the five (or fewer) previous words. The higher these probabilities are, the better the model. We write a function that takes as arguments the n-gram tier in which to start looking for a probability, and the threshold value below which we back off to a lower tier. Notice how NA values (for n-grams not found in the training corpus) and values below the threshold are replaced by lower-tier probabilities.
createModel <- function(n, threshold) {
  # Start at tier n: look up the probability of the last n words of each validation sixgram
  ngrams <- list(bigramProbs, trigramProbs, quadgramProbs, fivegramProbs, sixgramProbs)[[n-1]]
  model <- ngrams[getLastWords(vsixgrams[1:10000], n)]
  names(model) <- vsixgrams[1:10000]
  # Back off: wherever the probability is missing or below the threshold, use the next lower tier
  if(n > 5) model[is.na(model) | model < threshold] <-
    fivegramProbs[getLastWords(names(model[is.na(model) | model < threshold]), 5)]
  if(n > 4) model[is.na(model) | model < threshold] <-
    quadgramProbs[getLastWords(names(model[is.na(model) | model < threshold]), 4)]
  if(n > 3) model[is.na(model) | model < threshold] <-
    trigramProbs[getLastWords(names(model[is.na(model) | model < threshold]), 3)]
  if(n > 2) model[is.na(model) | model < threshold] <-
    bigramProbs[getLastWords(names(model[is.na(model) | model < threshold]), 2)]
  if(n > 1) model[is.na(model) | model < threshold] <-
    unigramProbs[getLastWords(names(model[is.na(model) | model < threshold]), 1)]
  return(model)
}
Say we want to create a model that only takes the last four words of the input into account (which means we start by looking for fivegram probabilities). Whenever the probability is lower than a threshold value of 0.5% (which is rather high), we only take the last three words into account, and so on.
model <- createModel(5, 0.005)
Now, how good is this model? Perplexity is an internal quality measure that gives us an idea of the quality of a model. It is essentially the product of all probabilities, raised to the power of minus one over the number of probabilities:
\[ \text{perplexity} = \left( \prod_{i=1}^N \text{probability}_i \right)^{-\frac{1}{N}} \]
One issue with perplexity is that a direct implementation of the formula above easily causes numerical underflow: the product of many small probabilities quickly rounds off to zero. The following, equivalent formula avoids the issue. We write a function that calculates the perplexity from a vector of probabilities.
\[ \text{perplexity} = \exp\left( -\frac{1}{N} \sum_{i=1}^N \log(\text{probability}_i) \right) \]
perplexity <- function(probabilities) {
  return(exp(-sum(log(probabilities)) / length(probabilities)))
}
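For intuition: if a model assigned every word a probability of 1/1000, its perplexity would be exactly 1000, as if it were guessing uniformly among 1000 words.
perplexity(rep(1/1000, 5))    # 1000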
We immediately apply the function to our previously built model. The lower the perplexity, the better. This is just an example; perplexity values become more meaningful when compared with one another.
perplexity(model[!is.na(model)])
## [1] 444.7638
Now that we can build a model and measure how well it performs, we can build multiple models to get the two parameters right: the threshold and the starting tier. The following function builds a series of models with evenly spaced threshold values up to maxT and a fixed n-gram tier as a starting point. For every model, the perplexity is calculated and stored in a data frame. The idea is to pick the parameters that result in the lowest perplexity.
perplexityFor10Models <- function(minT, maxT, ngram) {
  # The thresholds actually tried are i*maxT/10 for i = 0..10, so this assumes minT = 0
  perplexities <- data.frame("Threshold" = seq(minT, maxT, by=maxT/10), "Perplexity" = seq(0, 10, by=1))
  for(i in seq(0, 10, by=1)) {
    model <- createModel(ngram, i*maxT/10)
    perplexities[i+1, 2] <- perplexity(model[!is.na(model)])
  }
  return(perplexities)
}
Now we use this function with every possible n-gram tier as a starting point. I have gone ahead and found that the best threshold lies somewhere in the interval between 0 and 0.001. Note that a threshold value of 0 simply means that we do not back off as long as the n-gram is present. Per n-gram model, the perplexity values for the different thresholds are plotted (we do not bother with a unigram model).
sixgramModels <- perplexityFor10Models(minT=0, maxT=0.001, ngram=6)
fivegramModels <- perplexityFor10Models(minT=0, maxT=0.001, ngram=5)
quadgramModels <- perplexityFor10Models(minT=0, maxT=0.001, ngram=4)
trigramModels <- perplexityFor10Models(minT=0, maxT=0.001, ngram=3)
bigramModels <- perplexityFor10Models(minT=0, maxT=0.001, ngram=2)
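The perplexity curves can be sketched in base R as follows (the report's own plotting code may differ):
# Perplexity as a function of the threshold, for the sixgram starting tier
plot(sixgramModels$Threshold, sixgramModels$Perplexity, type="b",
     xlab="Threshold", ylab="Perplexity", main="Sixgram starting tier")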
We can already conclude several things from the plots: perplexity drops as the starting tier increases, and the gains beyond the quadgram tier are marginal.
rbind(bigramModels[which.min(bigramModels$Perplexity),],
      trigramModels[which.min(trigramModels$Perplexity),],
      quadgramModels[which.min(quadgramModels$Perplexity),],
      fivegramModels[which.min(fivegramModels$Perplexity),],
      sixgramModels[which.min(sixgramModels$Perplexity),])
## Threshold Perplexity
## 2 1e-04 471.7109
## 21 1e-04 364.9897
## 22 1e-04 348.6243
## 23 1e-04 347.3109
## 24 1e-04 347.1130
These numbers show, for each model, the threshold resulting in the lowest perplexity. They confirm what we saw in the plots: the sixgram model with a threshold value of 0.0001 achieves the lowest perplexity overall.
To turn these probabilities into a predictor, we store them in data frames, splitting each n-gram into its first words and its last word so that candidate completions can be looked up quickly.
unigramDF <- data.frame("Words" = names(unigrams), "Probability" = unigramProbs, stringsAsFactors=F)
bigramsDF <- data.frame("FirstWords" = removeLastWord(names(bigrams)),
                        "LastWord" = getLastWords(names(bigrams), 1),
                        "Probability" = bigramProbs, stringsAsFactors=F)
trigramsDF <- data.frame("FirstWords" = removeLastWord(names(trigrams)),
                         "LastWord" = getLastWords(names(trigrams), 1),
                         "Probability" = trigramProbs, stringsAsFactors=F)
quadgramsDF <- data.frame("FirstWords" = removeLastWord(names(quadgrams)),
                          "LastWord" = getLastWords(names(quadgrams), 1),
                          "Probability" = quadgramProbs, stringsAsFactors=F)
library(dplyr)
# Sort by probability and drop the rarest n-grams to keep the lookup tables small and fast
unigramDF <- (unigramDF %>% arrange(desc(Probability)))
bigramsDF <- bigramsDF %>% arrange(desc(Probability)) %>% filter(Probability > 0.0001)
trigramsDF <- trigramsDF %>% arrange(desc(Probability)) %>% filter(Probability > 0.0001)
quadgramsDF <- quadgramsDF %>% arrange(desc(Probability)) %>% filter(Probability > 0.0001)
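As a quick illustration of a lookup (the context "one of" is just an example; the rows returned depend on the training sample):
# Top candidate completions for the context "one of"
head(filter(trigramsDF, FirstWords == "one of"), 3)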
predictor <- function(input) {
  n <- length(strsplit(input, " ")[[1]])
  prediction <- c()
  # Try the longest available context first, then back off until we have three candidates
  if(n >= 3 && length(prediction) < 3)
    prediction <- c(prediction, filter(quadgramsDF, getLastWords(input, 3) == FirstWords)$LastWord)
  if(n >= 2 && length(prediction) < 3)
    prediction <- c(prediction, filter(trigramsDF, getLastWords(input, 2) == FirstWords)$LastWord)
  if(n >= 1 && length(prediction) < 3)
    prediction <- c(prediction, filter(bigramsDF, getLastWords(input, 1) == FirstWords)$LastWord)
  # Fall back to the most frequent unigrams if still fewer than three candidates
  if(length(prediction) < 3) prediction <- c(prediction, unigramDF$Words)
  return(unique(prediction)[1:3])
}
Let’s see the predictor in action on some examples. Many models might generate the same predictions for the following two inputs, because their last two words are the same. Here, the different word preceding those two apparently results in different predictions, which matches our intuition.
| Input | Top 3 predictions |
|---|---|
| predictor(“He sings from time to”) | time, get, go |
| predictor(“It is time to”) | get, start, take |
Here are some more examples of results we would like to see pop up while typing.
| Input | Top 3 predictions |
|---|---|
| predictor(“in the united”) | states, kingdom, states’ |
| predictor(“oh my”) | god, gosh, goodness |
| predictor(“i’m from new”) | jersey, york, orleans |
| predictor(“in the middle of the”) | night, street, game |
In the above examples, we used the best performing model (the one with the lowest perplexity) in our predictor function. Note, however, that some accuracy might be given up in favor of performance in the application. When developing the application, we will search for a sweet spot that gives seemingly instant predictions with sufficient accuracy.
At this point, we have written all the necessary parts for a predictor application. The only thing left to do is to develop the app itself.