Synopsis

In this report, an actual model and a word predictor are built. First, the process of data retrieval, cleaning and tidying from the exploratory data analysis is repeated in a more efficient manner. Then, multiple models with different parameters are trained on a training set and tested on a validation set. Finally, some accuracy of the best-performing model is given up in favor of performance.

From raw to tidy data

Getting the raw data

First of all, we read our corpus from three files, store them in variables, and close the connections.

con1 <- file("corpusdata/en_US/en_US.twitter.txt", "r")
con2 <- file("corpusdata/en_US/en_US.news.txt", "r")
con3 <- file("corpusdata/en_US/en_US.blogs.txt", "r")

tweets <- readLines(con1)
news <- readLines(con2)
blogs <- readLines(con3)

close(con1); close(con2); close(con3)

Then we sample 10% of the tweets and blogposts, and 100% of the news items. As mentioned in the exploratory analysis, this keeps the ratio somewhat balanced. The seed is set for reproducibility.

set.seed(50)
samplelines <- c(sample(tweets, length(tweets) * 0.1),
                 sample(news, length(news) * 1),
                 sample(blogs, length(blogs) * 0.1))

Training, validation and testing set

The sampled lines are now shuffled, after setting the seed again. Next, we determine two indices: one for splitting between the training and validation set, and one for splitting between the validation and testing set. Afterwards, we perform the actual splits.

set.seed(50)
samplelines <- sample(samplelines)
validationIndex <- floor(length(samplelines) * 0.8)
testingIndex <- floor(length(samplelines) * 0.9)

training <- samplelines[1:validationIndex]
validation <- samplelines[(validationIndex+1):testingIndex]
testing <- samplelines[(testingIndex+1):length(samplelines)]

Tidying the data

The training, validation and testing data are converted to lower case and curly apostrophes are normalized to straight ones. Sentence boundaries are marked with a ''split'' token, after which the whole set is split on anything that is not a letter or an apostrophe. Empty tokens are removed.

tokenizer <- function(lines) {
    # Convert to lower case
    lines <- tolower(lines)
    # Normalize curly apostrophes to straight ones
    lines <- gsub("’", "'", lines)
    # Mark sentence boundaries with a ''split'' token
    lines <- gsub("[.!?]$|[.!?] |$", " ''split'' ", lines)
    # Split on anything that is not a letter or an apostrophe and drop empty tokens
    tokens <- unlist(strsplit(lines, "[^a-z']"))
    tokens <- tokens[tokens != ""]
    return(tokens)
}

tokens <- tokenizer(training)
vtokens <- tokenizer(validation)
ttokens <- tokenizer(testing)

Generating n-grams counts

From tokens to counts

We would like n-gram counts: how many times does a certain n-gram appear in the training corpus? This takes four steps.

  1. Shift the tokens by one position at a time, removing the first token and adding a dummy token (.) at the end.
tokens2 <- c(tokens[-1], ".")
tokens3 <- c(tokens2[-1], ".")
tokens4 <- c(tokens3[-1], ".")
tokens5 <- c(tokens4[-1], ".")
tokens6 <- c(tokens5[-1], ".")
  2. Paste these lists together to get the n-gram names.
unigrams <- tokens
bigrams <- paste(tokens, tokens2)
trigrams <- paste(tokens, tokens2, tokens3)
quadgrams <- paste(tokens, tokens2, tokens3, tokens4)
fivegrams <- paste(tokens, tokens2, tokens3, tokens4, tokens5)
sixgrams <- paste(tokens, tokens2, tokens3, tokens4, tokens5, tokens6)
  3. Remove the n-grams that cross sentence boundaries.
unigrams <- unigrams[!grepl("''split''", unigrams)]
bigrams <- bigrams[!grepl("''split''", bigrams)]
trigrams <- trigrams[!grepl("''split''", trigrams)]
quadgrams <- quadgrams[!grepl("''split''", quadgrams)]
fivegrams <- fivegrams[!grepl("''split''", fivegrams)]
sixgrams <- sixgrams[!grepl("''split''", sixgrams)]
  4. Apply the table() function to get the n-gram counts. We sort to put the most frequent n-grams on top.
unigrams <- sort(table(unigrams), decreasing=T)
bigrams <- sort(table(bigrams), decreasing=T)
trigrams <- sort(table(trigrams), decreasing=T)
quadgrams <- sort(table(quadgrams), decreasing=T)
fivegrams <- sort(table(fivegrams), decreasing=T)
sixgrams <- sort(table(sixgrams), decreasing=T)

Visualizing n-grams

Here are some plots of the n-grams, showing the 15 most frequent of each.
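The plots themselves are not reproduced here. As a minimal sketch (not the original plotting code), one of these bar charts could be drawn in base R from the sorted count tables above:

# Sketch only: bar chart of the 15 most frequent bigrams
par(mar = c(5, 10, 2, 1))  # widen the left margin for the long n-gram labels
barplot(head(bigrams, 15), horiz = TRUE, las = 1,
        main = "Top 15 bigrams", xlab = "Count")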

Generating n-gram probabilities

The following functions will be helpful: getLastWords() retrieves the last n words of a string, and removeLastWord() removes the last word from a string. Both use regular expressions and the stringr package.

library(stringr)

getLastWords <- function(string, words) {
    pattern <- paste("[a-z']+( [a-z']+){", words - 1, "}$", sep="")
    return(substring(string, str_locate(string, pattern)[,1]))
}

removeLastWord <- function(string) {
    sub(" [a-z']+$", "", string)
}
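
For example, under these definitions we would expect the following behaviour (a quick illustration, not part of the original analysis):

getLastWords("at the end of the day", 2)
## [1] "the day"
removeLastWord("at the end of the day")
## [1] "at the end of the"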

Kneser-Ney Smoothing

From Wikipedia:

Kneser-Ney smoothing is a method primarily used to calculate the probability distribution of n-grams in a document based on their histories. It is widely considered the most effective method of smoothing due to its use of absolute discounting by subtracting a fixed value from the probability’s lower order terms to omit n-grams with lower frequencies. This approach has been considered equally effective for both higher and lower order n-grams.

A common example that illustrates the concept behind this method is the frequency of the bigram “San Francisco”. If it appears several times in a training corpus, the frequency of the unigram “Francisco” will also be high. Relying on only the unigram frequency to predict the frequencies of n-grams leads to skewed results; however, Kneser-Ney smoothing corrects this by considering the frequency of the unigram in relation to possible words preceding it.

Let \(c(w_1 \dots w_n)\) be the number of occurrences of the n-gram \(w_1 \dots w_n\). Then \(c(w_1 \dots w_{n-1})\) is the count of the (n-1)-gram with the last word excluded, and \(c(w_2 \dots w_n)\) is the count of the (n-1)-gram with the first word excluded. The formula for the Kneser-Ney smoothed probabilities is then:

\[ \begin{aligned} P_{KN}(w_1 \dots w_n) &= \frac{\max(c(w_1 \dots w_n) - d,\ 0)}{c(w_1 \dots w_{n-1})} + d \cdot \frac{\#\{\text{n-grams starting with } w_1 \dots w_{n-1}\}}{c(w_1 \dots w_{n-1})} \cdot P_{KN}(w_2 \dots w_n) \\ P_{KN}(w_2 \dots w_n) &= \frac{\max(c(w_2 \dots w_n) - d,\ 0)}{c(w_2 \dots w_{n-1})} + d \cdot \frac{\#\{\text{(n-1)-grams starting with } w_2 \dots w_{n-1}\}}{c(w_2 \dots w_{n-1})} \cdot P_{KN}(w_3 \dots w_n) \\ &\;\;\vdots \\ P_{KN}(w_n) &= \frac{\#\{\text{bigrams ending with } w_n\}}{\text{total number of distinct bigrams}} \end{aligned} \]

The discount value d is a chosen constant. Notice how the formula works recursively on the n-gram with the first word excluded, all the way down to a different formula for unigrams. Here, we implement Kneser-Ney smoothing as a function that takes the n-gram counts of the training set and calculates a probability for every one of them.

kneserNay <- function(ngrams, d) {
    n <- length(strsplit(names(ngrams[1]), " ")[[1]])
    
    # Special case for unigrams
    if(n==1) {
        noFirst <- unigrams[getLastWords(names(bigrams), 1)]
        pContinuation <- table(names(noFirst))[names(unigrams)] / length(bigrams)
        return(pContinuation)
    }
    
    # Get needed counts
    nMinusOne <- list(unigrams, bigrams, trigrams, quadgrams, fivegrams, sixgrams)[[n-1]]
    noLast <- nMinusOne[removeLastWord(names(ngrams))]
    noFirst <- nMinusOne[getLastWords(names(ngrams), n-1)]
    
    # Calculate discounts, lambda and pContinuation
    discounts <- ngrams - d
    discounts[discounts < 0] <- 0
    lambda <- d * table(names(noLast))[names(noLast)] / noLast
    if(n == 2) pContinuation <- table(names(noFirst))[names(noFirst)] / length(ngrams)
    else pContinuation <- kneserNay(noFirst, d)
    
    # Put it all together
    probabilities <- discounts / noLast + lambda * pContinuation / length(ngrams)
    return(probabilities)
}

Calculating probabilities

We immediately apply this function to all our n-gram counts.

unigramProbs <- kneserNay(unigrams, 0.75)
bigramProbs <- kneserNay(bigrams, 0.75)
trigramProbs <- kneserNay(trigrams, 0.75)
quadgramProbs <- kneserNay(quadgrams, 0.75)
fivegramProbs <- kneserNay(fivegrams, 0.75)
sixgramProbs <- kneserNay(sixgrams, 0.75)

Building a model

Now that we have probabilities for every n-gram, we can start building a model that predicts a word based on a given input. One idea is simply to retrieve the last n-1 words of the input and look for the n-gram with the highest probability that starts with these n-1 words. The last word of that particular n-gram would be our prediction. However, two problems arise.

  1. There might be no n-gram present starting with the last n-1 words of the input. Especially for the higher-order n-grams, this will often be the case.
  2. When the n-gram is present, its probability might be so low that we would prefer a prediction from a lower-order n-gram with a much higher probability.

A back-off model solves both problems. Whenever an n-gram starting with the last n-1 words of the input doesn’t occur, or its probability doesn’t exceed a certain threshold, we ‘back off’ to the lower-tier n-grams and hope that the relevant (n-1)-gram does exist or results in a higher probability.

We have two parameters to choose when building a model. The threshold value: how soon do we prefer backing off to the lower-tier n-grams? The starting tier: in which set of n-gram probabilities do we start looking? We determine both by creating multiple models with different parameters and seeing how well they perform on our validation corpus. First, we split the validation corpus in the same way we split the training corpus. The idea is to treat all possible validation sixgrams as if they were inputs by a user.

vtokens2 <- c(vtokens[-1], ".")
vtokens3 <- c(vtokens2[-1], ".")
vtokens4 <- c(vtokens3[-1], ".")
vtokens5 <- c(vtokens4[-1], ".")
vtokens6 <- c(vtokens5[-1], ".")
vsixgrams <- paste(vtokens, vtokens2, vtokens3, vtokens4, vtokens5, vtokens6)

For every set of six words in the validation set, we would like to calculate the probability of correctly predicting the last word, based on the five (or fewer) preceding words. The higher these probabilities are, the better the model. We write a function that takes as arguments the ‘n-gram tier’ in which to start looking for a probability, and the threshold value below which we back off to a lower tier. Notice how NA values (for n-grams not found in the training corpus) and values lower than the threshold are substituted.

createModel <- function(n, threshold) {
    # Probabilities of the n-gram tier we start looking in
    ngrams <- list(bigramProbs, trigramProbs, quadgramProbs, fivegramProbs, sixgramProbs)[[n-1]]
    
    # For each validation sixgram, look up the probability of its last n words
    model <- ngrams[getLastWords(vsixgrams[1:10000], n)]
    names(model) <- vsixgrams[1:10000]
    
    # Back off: replace missing or below-threshold probabilities with lower-tier ones
    if(n > 5) model[is.na(model) | model < threshold] <- 
        fivegramProbs[getLastWords(names(model[is.na(model) | model < threshold]), 5)]
    if(n > 4) model[is.na(model) | model < threshold] <- 
        quadgramProbs[getLastWords(names(model[is.na(model) | model < threshold]), 4)]
    if(n > 3) model[is.na(model) | model < threshold] <- 
        trigramProbs[getLastWords(names(model[is.na(model) | model < threshold]), 3)]
    if(n > 2) model[is.na(model) | model < threshold] <- 
        bigramProbs[getLastWords(names(model[is.na(model) | model < threshold]), 2)]
    if(n > 1) model[is.na(model) | model < threshold] <- 
       unigramProbs[getLastWords(names(model[is.na(model) | model < threshold]), 1)]
    return(model)
}

Say we want to create a model that only takes the last four words of the input into account (which means we start looking among the fivegram probabilities). Whenever the probability is lower than a threshold value of 0.5% (which is rather high), we would only take the last three words into account, and so on.

model <- createModel(5, 0.005)

Perplexity as quality measure

Now, how good is this model? Perplexity is an intrinsic quality measure that gives us an idea of the quality of the model. It is the product of all probabilities, raised to the power of minus one over the number of probabilities. Or:

\[ \text{perplexity} = \left( \prod_{i=1}^N \text{probability i} \right)^{-\frac{1}{N}} \]

One issue with perplexity is that a direct implementation of the formula above easily runs into numerical underflow: the product of many small probabilities quickly rounds off to zero. The following formula is equivalent to the one above and avoids the issue. We write a function that calculates the perplexity when given a list of probabilities.

\[ \text{perplexity} = \exp(- \dfrac{\sum_{i=1}^N \log(\text{probability i})}{N}) \]

perplexity <- function(probabilities) {
    return(exp(-sum(log(probabilities)) / length(probabilities)))
}
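
As a quick sanity check (not from the original report): three probabilities whose product is 0.001 have a geometric mean of 0.1, so their perplexity should come out as exactly 10.

# 0.1 * 0.2 * 0.05 = 0.001, geometric mean 0.1, so perplexity = 1 / 0.1 = 10
perplexity(c(0.1, 0.2, 0.05))
## [1] 10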

We immediately apply the function to our previously built model. The lower the perplexity, the better. This is just an example; perplexity values become more meaningful when compared to others.

perplexity(model[!is.na(model)])
## [1] 444.7638

Exploring parameter values

Now that we can build a model and see how well it performs, we can build multiple models to get the two parameters right: the threshold and the starting tier. The following function builds a series of models with threshold values ranging from minT to maxT and a fixed n-gram tier as a starting point. For every model, the perplexity is calculated and stored in a data frame. The idea is to pick the parameters that result in the lowest perplexity.

perplexityFor10Models <- function(minT, maxT, ngram) {
    perplexities <- data.frame("Threshold" = seq(minT, maxT, by=maxT/10), "Perplexity" = seq(0, 10, by=1))
    for(i in seq(0, 10, by=1)) {
        model <- createModel(ngram, i*maxT/10)
        perplexities[i+1, 2] <- perplexity(model[!is.na(model)])
    }
    return(perplexities)
}

Now we use this function with every possible n-gram list as a starting point. I have gone ahead and found that the best threshold lies somewhere in the interval between 0 and 0.001. Note that a threshold value of 0 simply means that we never back off as long as the n-gram is present. For each n-gram model, the perplexity values for the different threshold values are plotted (we don’t bother with a unigram model).

sixgramModels <- perplexityFor10Models(minT=0, maxT=0.001, ngram=6)
fivegramModels <- perplexityFor10Models(minT=0, maxT=0.001, ngram=5)
quadgramModels <- perplexityFor10Models(minT=0, maxT=0.001, ngram=4)
trigramModels <- perplexityFor10Models(minT=0, maxT=0.001, ngram=3)
bigramModels <- perplexityFor10Models(minT=0, maxT=0.001, ngram=2)
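
The plots are not reproduced here. As a minimal sketch (the original plotting code may differ), one of the perplexity-versus-threshold curves could be drawn like this:

# Sketch only: perplexity as a function of the back-off threshold, sixgram model
plot(sixgramModels$Threshold, sixgramModels$Perplexity, type = "b",
     xlab = "Threshold", ylab = "Perplexity", main = "Sixgram model")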

We can already conclude several things based on the plots.

  1. The threshold values resulting in the lowest perplexity lie around 1e-4 for all n-gram models.
  2. The perplexity increases as we take fewer preceding words into account: bigram perplexities are highest, sixgram ones are lowest. This is also shown in the last plot, displaying the five minima of all previous plots.
  3. However, the differences in perplexity between the six-, five- and quadgram models are truly minimal, and the three plots look very similar. Although the quadgram model has a slightly higher perplexity, the fact that it is computationally less intensive makes it interesting.
rbind(bigramModels[which.min(bigramModels$Perplexity),], 
      trigramModels[which.min(trigramModels$Perplexity),],
      quadgramModels[which.min(quadgramModels$Perplexity),],
      fivegramModels[which.min(fivegramModels$Perplexity),],
      sixgramModels[which.min(sixgramModels$Perplexity),])
##    Threshold Perplexity
## 2      1e-04   471.7109
## 21     1e-04   364.9897
## 22     1e-04   348.6243
## 23     1e-04   347.3109
## 24     1e-04   347.1130

These numbers show, per model, the threshold resulting in the lowest perplexity. They confirm what we saw in the plots: the sixgram model with a threshold value of 0.0001 has the lowest perplexity.

Predicting next words

Building a function

To make lookups fast in the final predictor, we first convert the n-gram probability tables into data frames with one column for the leading words, one for the last word (the candidate prediction) and one for the probability.

unigramDF <- data.frame("Words" = (names(unigrams)), "Probability" = unigramProbs, stringsAsFactors=F)

bigramsDF <- data.frame("FirstWords" = removeLastWord(names(bigrams)), 
                        "LastWord" = getLastWords(names(bigrams), 1), 
                        "Probability" = bigramProbs, stringsAsFactors=F)

trigramsDF <- data.frame("FirstWords" = removeLastWord(names(trigrams)), 
                         "LastWord" = getLastWords(names(trigrams), 1), 
                         "Probability" = trigramProbs, stringsAsFactors=F)

quadgramsDF <- data.frame("FirstWords" = removeLastWord(names(quadgrams)), 
                          "LastWord" = getLastWords(names(quadgrams), 1), 
                          "Probability" = quadgramProbs, stringsAsFactors=F)

Using dplyr, we sort each data frame by probability and drop n-grams with very low probabilities to keep the lookup tables small.

library(dplyr)
unigramDF <- (unigramDF %>% arrange(desc(Probability)))
bigramsDF <- bigramsDF %>% arrange(desc(Probability)) %>% filter(Probability > 0.0001)
trigramsDF <- trigramsDF %>% arrange(desc(Probability)) %>% filter(Probability > 0.0001)
quadgramsDF <- quadgramsDF %>% arrange(desc(Probability)) %>% filter(Probability > 0.0001)

The predictor function then looks up the last three, two, or one word(s) of the input in these data frames, moving to a lower-order table whenever fewer than three candidates have been found, and finally falling back on the most probable unigrams. It returns the top three distinct candidates.

predictor <- function(input) {
    n <- length(strsplit(input, " ")[[1]])
    prediction <- c()
    if(n >= 3 && length(prediction)<3) 
        prediction <- c(prediction, filter(quadgramsDF, getLastWords(input, 3) == FirstWords)$LastWord)
    if(n >= 2 && length(prediction)<3) 
        prediction <- c(prediction, filter(trigramsDF, getLastWords(input, 2) == FirstWords)$LastWord)
    if(n >= 1 && length(prediction)<3) 
        prediction <- c(prediction, filter(bigramsDF, getLastWords(input, 1) == FirstWords)$LastWord)
    if(length(prediction)<3 ) prediction <- c(prediction, unigramDF$Words)
    
    return(unique(prediction)[1:3])
}

Prediction examples

Let’s see the predictor in action on some examples. Many models would generate the same predictions for the following two inputs because the last two words are the same. Here, the different word preceding them apparently results in different predictions, which matches our intuition.

Input                                   Top 3 predictions
predictor("He sings from time to")      time, get, go
predictor("It is time to")              get, start, take

Here are some more examples of results we would like to see popping up while typing.

Input                                   Top 3 predictions
predictor("in the united")              states, kingdom, states'
predictor("oh my")                      god, gosh, goodness
predictor("i'm from new")               jersey, york, orleans
predictor("in the middle of the")       night, street, game

Final thoughts

In the above examples, we used the best-performing model (the one with the lowest perplexity) in our predictor function. Note, however, that some accuracy might be given up in favor of performance in the application. When developing the application, we will search for a sweet spot that gives seemingly instant predictions with sufficient accuracy.

At this point, we have written all the necessary parts for a predictor application. The only thing left to do is to develop the app itself.