Introduction

As part of the Data Science Specialization on Coursera, we are tasked with building a predictive text product.

After an initial exploratory data analysis, I decided to combine the conventional Kneser-Ney smoothing algorithm with the Stupid Backoff algorithm to predict the next word. In this case, we will rely on n-grams of up to the 5th order.

Data

We will be using data provided by SwiftKey to build our prediction algorithm. For more information, you can take the Data Science Specialization offered by Johns Hopkins University on Coursera.

Methodology

When the user enters a text input of length 1 (for example, the word ‘I’), the prediction algorithm recognizes that only one word was provided and uses the Stupid Backoff algorithm to search for the most common bigrams, conditional on the word provided by the user. If such bigrams are found, it outputs the most common one; if not, it returns the most common unigrams.

For text inputs of length 2 or more, the prediction algorithm first tries to predict the next word using the Kneser-Ney smoothing algorithm. For example, if the user enters a text input of length 3 (‘I am not’), our prediction algorithm attempts to find the most common 4-grams, conditional on the 3 words provided by the user, using Kneser-Ney smoothing. If no such 4-grams are found, it takes the last 2 words of the input (‘am not’) and searches for the most common 3-grams, conditional on those 2 words. If no such 3-grams are found, it turns to the last word and finds the most common bigrams using the Stupid Backoff algorithm, as sketched below.
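To make the backoff chain concrete, here is a simplified sketch of the control flow. For clarity, assume each helper returns NULL when it finds no match; the actual implementations in the Word Prediction Model section below return NA instead.

# Illustrative sketch of the backoff chain -- not the production code.
# Assumes each kn_* helper returns NULL when no matching n-gram exists.
predict_next_word <- function(words) {
        n <- length(words)
        if (n >= 4 && !is.null(res <- kn_quintgram(tail(words, 4)))) return(res)
        if (n >= 3 && !is.null(res <- kn_quadgram(tail(words, 3)))) return(res)
        if (n >= 2 && !is.null(res <- kn_trigram(tail(words, 2)))) return(res)
        stupid_backoff_kn(tail(words, 1)) # bigram search, then unigram fallback
}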

Rationale

I wanted a prediction algorithm which was fast, yet accurate. The Kneser-Ney smoothing algorithm is the most popular word prediction algorithm and is widely considered the most effective smoothing method, as its final term accounts for the ‘novelty’ of the predicted word (how many distinct contexts it completes) in addition to how common it is. You can read more about it here and here.
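To make the two terms concrete, the interpolated Kneser-Ney estimate for a bigram takes the standard form below, where $c(\cdot)$ is a corpus count and $d$ is the discount:

$$P_{KN}(w_i \mid w_{i-1}) = \frac{\max\big(c(w_{i-1} w_i) - d,\, 0\big)}{c(w_{i-1})} + \lambda(w_{i-1})\, P_{cont}(w_i), \qquad P_{cont}(w_i) = \frac{\lvert\{w' : c(w' w_i) > 0\}\rvert}{\lvert\{(w', w) : c(w' w) > 0\}\rvert}$$

The first term is the discounted relative frequency; the second is the continuation (‘novelty’) term, which rewards words that complete many distinct contexts. $\lambda(w_{i-1})$ is the normalising weight that redistributes the discounted probability mass.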

At the same time, for bigrams, the algorithm simply took too much time. This is because the corpus contained a very large number of distinct bigrams occurring at least 3 times, and it was computationally expensive to calculate the ‘novelty factor’ of the smoothing algorithm (i.e. the second term of Kneser-Ney), especially when the user enters a common word as the first word of the bigram (e.g. ‘to’, ‘the’).

Hence, I decided to use the Stupid Backoff algorithm to predict the most common bigram conditional on the input provided by the user.
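Concretely, the score used for the bigram backoff is just the observed relative frequency, with no discounting:

$$S(w_2 \mid w_1) = \frac{c(w_1 w_2)}{\sum_{w} c(w_1 w)}$$

(The canonical Stupid Backoff also multiplies in a constant penalty of roughly 0.4 per backoff step; since only the ranking of candidates matters here, that constant is omitted.)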

Word Prediction Model

We begin by reading the files which were provided by the course. Since these files are very large, we will then work with a subset.

setwd("~/R_Final_Project")
options(java.parameters = "-Xmx8000m")
library(RWeka); library(tm); library(NLP); library(SnowballC)
library(magrittr); library(plyr); library(ggplot2)

# Previously, we read the raw text files and stored the results.

### con_blogs <- file("final/en_US/en_US.blogs.txt", 'r')
### con_news <- file("final/en_US/en_US.news.txt", 'r')
### con_twitter <- file("final/en_US/en_US.twitter.txt", 'r')

### blogs <- readLines(con_blogs)
### news <- readLines(con_news)
### twitter <- readLines(con_twitter)

### close(con_blogs)
### close(con_news)
### close(con_twitter)

To reduce computational complexity, let’s work with 200,000 lines from each of the news, blogs, and Twitter datasets. Let’s set a seed to ensure reproducibility.

### set.seed(1212)
### subset_blogs = sample(blogs, 200000, replace = F)
### subset_news = sample(news, 200000, replace = F)
### subset_twitter = sample(twitter, 200000, replace = F)

Now, let’s save the new files in another folder, and remove the huge files from our working environment to save on memory.

### writeLines(subset_blogs, con = "~/R_Final_Project/final/subset_EN/subset_blogs.txt")
### writeLines(subset_news, con = "~/R_Final_Project/final/subset_EN/subset_news.txt")
### writeLines(subset_twitter, con = "~/R_Final_Project/final/subset_EN/subset_twitter.txt")

### rm(list=ls())

After saving these new files, let’s read them back in. This time, we will read the files as a Corpus.

# The corpus has been read and cleaned. Following that, the unigram to quintgram
# frequency tables were derived from the corpus and saved to a file.

### docs <- VCorpus(DirSource("~/R_Final_Project/final/subset_EN/"))

Data Cleaning

After loading the corpus, we proceed to clean the data. In this phase, we convert the words in the corpus to lower case and strip punctuation, numbers, and extra whitespace. After cleaning, we remove the ‘dirty’ corpus from our environment, again to save memory.

### cleaned <- docs %>% tm_map(content_transformer(tolower)) %>% # Converting words to lower case
###          tm_map(stripWhitespace) %>% # Remove Whitespaces
###          tm_map(removePunctuation) %>% # Remove Punctuations
###          tm_map(removeNumbers) # Remove Numbers

### rm(docs)

Creating Unigram - Quintgram Dataframes

With the clean corpus, let’s use it to create dataframes containing the frequencies of the unigrams, bigrams, trigrams, quadgrams, and quintgrams! To do so, we first create a DocumentTermMatrix using the DocumentTermMatrix function from the tm library.

### UnigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
### BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
### TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
### QuadgramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 4, max = 4))
### QuintgramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 5, max = 5))

### dtm_unigram <- DocumentTermMatrix(cleaned, control = list(tokenize = UnigramTokenizer))
### word_counts_unigram <- as.data.frame(sort(colSums(as.matrix(dtm_unigram)), 
###                                         decreasing=TRUE))
### rm(dtm_unigram); rm(UnigramTokenizer)

### dtm_bigram <- DocumentTermMatrix(cleaned, control = list(tokenize = BigramTokenizer))
### word_counts_bigram <- as.data.frame(sort(colSums(as.matrix(dtm_bigram)), 
###                                         decreasing=TRUE))   
### rm(dtm_bigram); rm(BigramTokenizer)

### dtm_trigram <- DocumentTermMatrix(cleaned, control = list(tokenize = TrigramTokenizer))
### word_counts_trigram <- as.data.frame(sort(colSums(as.matrix(dtm_trigram)), 
###                                         decreasing=TRUE))
### rm(dtm_trigram); rm(TrigramTokenizer)

### dtm_quadgram <- DocumentTermMatrix(cleaned, control = list(tokenize = QuadgramTokenizer))
### word_counts_quadgram <- as.data.frame(sort(colSums(as.matrix(dtm_quadgram)), 
###                                           decreasing=TRUE))
### rm(dtm_quadgram); rm(QuadgramTokenizer)

### dtm_quintgram <- DocumentTermMatrix(cleaned, control = list(tokenize = QuintgramTokenizer))
### word_counts_quintgram <- as.data.frame(sort(colSums(as.matrix(dtm_quintgram)), 
###                                            decreasing=TRUE))
### rm(dtm_quintgram); rm(QuintgramTokenizer)

### rm(cleaned)
### save.image("~/R_Final_Project/ngram_data_train.RData")

### rm(list = ls())

After obtaining the term frequency tables, let’s clean up the dataframes. First, we split each n-gram term (bigram through quintgram) into separate columns, one per word, and append them to the respective dataframes. Lastly, we apply a term frequency threshold of 3 to reduce the dimensionality of the dataframes.

### load("~/R_Final_Project/ngram_data_train.RData")

### colnames(word_counts_unigram)[1] <- 'Counts'
### word_counts_unigram <- subset(word_counts_unigram, Counts >= 3)
### word_counts_unigram$Word1 <- row.names(word_counts_unigram)
### row.names(word_counts_unigram) <- NULL

### colnames(word_counts_bigram)[1] <- 'Counts'
### word_counts_bigram <- subset(word_counts_bigram, Counts >= 3)
### word_counts_bigram$Word1 <- sapply(strsplit(as.character(row.names(word_counts_bigram)), 
###                                            split = ' '), `[`, 1)
### word_counts_bigram$Word2 <- sapply(strsplit(as.character(row.names(word_counts_bigram)), 
###                                            split = ' '), `[`, 2)
### row.names(word_counts_bigram) <- NULL

### colnames(word_counts_trigram)[1] <- 'Counts'
### word_counts_trigram <- subset(word_counts_trigram, Counts >= 3)
### word_counts_trigram$Word1 <- sapply(strsplit(as.character(row.names(word_counts_trigram)), 
###                                            split = ' '), `[`, 1)
### word_counts_trigram$Word2 <- sapply(strsplit(as.character(row.names(word_counts_trigram)), 
###                                            split = ' '), `[`, 2)
### word_counts_trigram$Word3 <- sapply(strsplit(as.character(row.names(word_counts_trigram)), 
###                                            split = ' '), `[`, 3)
### row.names(word_counts_trigram) <- NULL

### colnames(word_counts_quadgram)[1] <- 'Counts'
### word_counts_quadgram <- subset(word_counts_quadgram, Counts >= 3)
### word_counts_quadgram$Word1 <- sapply(strsplit(as.character(row.names(word_counts_quadgram)), 
###                                              split = ' '), `[`, 1)
### word_counts_quadgram$Word2 <- sapply(strsplit(as.character(row.names(word_counts_quadgram)), 
###                                              split = ' '), `[`, 2)
### word_counts_quadgram$Word3 <- sapply(strsplit(as.character(row.names(word_counts_quadgram)), 
###                                              split = ' '), `[`, 3)
### word_counts_quadgram$Word4 <- sapply(strsplit(as.character(row.names(word_counts_quadgram)), 
###                                             split = ' '), `[`, 4)
### row.names(word_counts_quadgram) <- NULL

### colnames(word_counts_quintgram)[1] <- 'Counts'
### word_counts_quintgram <- subset(word_counts_quintgram, Counts >= 3)
### word_counts_quintgram$Word1 <- sapply(strsplit(as.character(row.names(word_counts_quintgram)), 
###                                              split = ' '), `[`, 1)
### word_counts_quintgram$Word2 <- sapply(strsplit(as.character(row.names(word_counts_quintgram)), 
###                                              split = ' '), `[`, 2)
### word_counts_quintgram$Word3 <- sapply(strsplit(as.character(row.names(word_counts_quintgram)), 
###                                              split = ' '), `[`, 3)
### word_counts_quintgram$Word4 <- sapply(strsplit(as.character(row.names(word_counts_quintgram)), 
###                                            split = ' '), `[`, 4)
### word_counts_quintgram$Word5 <- sapply(strsplit(as.character(row.names(word_counts_quintgram)), 
###                                            split = ' '), `[`, 5)
### row.names(word_counts_quintgram) <- NULL

### save.image("~/R_Final_Project/final_project.RData")

With the dataframes in place, let’s implement the modified Kneser-Ney smoothing algorithm as our prediction algorithm, using a 5-gram model.

Modified Kneser-Ney Smoothing Algorithm

Noting that the dataframes only contain terms that occur at least 3 times in the corpus, we set the Kneser-Ney discount factor to 3. Intuitively, this prevents the probability of any term occurring as the next word from becoming negative: a term with the minimum count of 3 receives a discounted first term of (3 − 3)/total = 0, never less.

### rm(list = ls())
load("final_project.RData")

word_counts_unigram$Prob <- word_counts_unigram$Counts/sum(word_counts_unigram$Counts) # Unigram probability = relative frequency

kn_trigram <- function(input){
        # Anchored regexes for an exact match on the two preceding words
        exp1 <- paste0('^', input[1], '$')
        exp2 <- paste0('^', input[2], '$')
        trigram_row <- word_counts_trigram[
                grepl(exp1, word_counts_trigram$Word1) &
                grepl(exp2, word_counts_trigram$Word2) == T,]
        if (dim(trigram_row)[1] > 0){
                # Keep only the top few candidates (at most 3) to limit computation
                trigram_row <- trigram_row[1:min(as.integer(0.1*dim(trigram_row)[1]), 
                                                 3), ]        
                totalCounts <- sum(trigram_row$Counts)
                # First term: discounted relative frequency (discount = 3)
                trigram_row$firstTerm <- (trigram_row$Counts - 3)/totalCounts
                
                # Second term: the 'novelty factor' -- for each candidate, count how
                # many trigrams it completes, and scale by the discounted mass
                discFactor <- 3/totalCounts * dim(trigram_row)[1]
                novel_trigram <- sapply(trigram_row$Word3, function(x) {
                        dim(word_counts_trigram[grepl(x,  
                                                      word_counts_trigram$Word3) == T,])[1]
                        })
                trigram_row$secondTerm <- ((discFactor * novel_trigram)/
                                                   (dim(word_counts_trigram)[1]))
                trigram_row$Prob <- trigram_row$firstTerm + trigram_row$secondTerm
                trigram_row <- trigram_row[order(trigram_row$Prob, decreasing = T),]
                return(data.frame(Word = trigram_row$Word3, Prob = trigram_row$Prob))
        } else { return(NA) }
}

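# Analogous to kn_trigram above, but conditioning on the three preceding words
# and predicting Word4.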
kn_quadgram <- function(input){
        exp1 <- paste0('^', input[1], '$')
        exp2 <- paste0('^', input[2], '$')
        exp3 <- paste0('^', input[3], '$')
        quadgram_row <- word_counts_quadgram[
                grepl(exp1, word_counts_quadgram$Word1) &
                grepl(exp2, word_counts_quadgram$Word2) & 
                grepl(exp3, word_counts_quadgram$Word3) == T,]
        if (dim(quadgram_row)[1] > 0){
                quadgram_row <- quadgram_row[1:min(as.integer(0.2*dim(quadgram_row)[1]), 
                                                   3), ]
                totalCounts <- sum(quadgram_row$Counts)
                quadgram_row$firstTerm <- (quadgram_row$Counts - 3)/totalCounts
                
                discFactor <- 3/totalCounts * dim(quadgram_row)[1]
                novel_quadgram <- sapply(quadgram_row$Word4, function(x) {
                        dim(word_counts_quadgram[grepl(x,  
                                                       word_counts_quadgram$Word4) == T,])[1]
                        })
                quadgram_row$secondTerm <- ((discFactor * novel_quadgram)/
                                                    (dim(word_counts_quadgram)[1]))
                quadgram_row$Prob <- quadgram_row$firstTerm + quadgram_row$secondTerm
                quadgram_row <- quadgram_row[order(quadgram_row$Prob, decreasing = T),]
                return(data.frame(Word = quadgram_row$Word4, Prob = quadgram_row$Prob))
        } else { return(NA) }
}

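# Analogous to kn_trigram above, but conditioning on the four preceding words
# and predicting Word5.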
kn_quintgram <- function(input){
        exp1 <- paste0('^', input[1], '$')
        exp2 <- paste0('^', input[2], '$')
        exp3 <- paste0('^', input[3], '$')
        exp4 <- paste0('^', input[4], '$')
        quintgram_row <- word_counts_quintgram[
                grepl(exp1, word_counts_quintgram$Word1) &
                grepl(exp2, word_counts_quintgram$Word2) & 
                grepl(exp3, word_counts_quintgram$Word3) & 
                grepl(exp4, word_counts_quintgram$Word4) == T,]
        if (dim(quintgram_row)[1] > 0){
                quintgram_row <- quintgram_row[1:min(as.integer(0.2*dim(quintgram_row)[1]), 
                                                     3), ]
                totalCounts <- sum(quintgram_row$Counts)
                quintgram_row$firstTerm <- (quintgram_row$Counts - 3)/totalCounts
                
                discFactor <- 3/totalCounts * dim(quintgram_row)[1]
                novel_quintgram <- sapply(quintgram_row$Word5, function(x) {
                        dim(word_counts_quintgram[grepl(x,  
                                                        word_counts_quintgram$Word5) == T,])[1]
                        })
                quintgram_row$secondTerm <- ((discFactor * novel_quintgram)/
                                                     (dim(word_counts_quintgram)[1]))
                quintgram_row$Prob <- quintgram_row$firstTerm + quintgram_row$secondTerm
                quintgram_row <- quintgram_row[order(quintgram_row$Prob, decreasing = T),]
                return(data.frame(Word = quintgram_row$Word5, Prob = quintgram_row$Prob))
        } else { return(NA) }
}

stupid_backoff_kn <- function(input){
        # Condition on the last word of the (cleaned) input, stemmed
        last_word <- tail(unlist(strsplit(input, split = ' ')), 1)
        exp <- paste0('^', wordStem(last_word), '$')
        bigram_row <- word_counts_bigram[grepl(exp, word_counts_bigram$Word1),]
        if (dim(bigram_row)[1] == 0) {
                # No matching bigrams: fall back to the most common unigrams
                return(data.frame(Word = word_counts_unigram$Word1, 
                                  Prob = word_counts_unigram$Prob)[1:5,])
        } else {
                # Relative frequency of each bigram completion
                bigram_row$Prob <- bigram_row$Counts/sum(bigram_row$Counts)
                return(head(data.frame(Word = bigram_row$Word2,
                                       Prob = bigram_row$Prob), 10))
        }        
}

kneser_ney <- function(input = NULL){
        
        if (is.null(input)) {
                return(data.frame(Word = word_counts_unigram$Word1,
                                  Prob = word_counts_unigram$Prob)[1:5,]) 
        }
        
        cleaned_input <- gsub("[[:punct:][:blank:]]+", " ", input) # Replace punctuation/blanks with a space
        cleaned_input <- gsub("[']", " ", cleaned_input) # Remove stray apostrophes
        cleaned_input <- gsub("[[:digit:]]", "", cleaned_input) # Remove numbers
        cleaned_input <- cleaned_input %>% 
                tolower() %>% # Convert words to lower case
                trimws() # Trim leading/trailing whitespace
        list_of_words <- unlist(strsplit(cleaned_input, split = ' '))
        list_of_words <- list_of_words[list_of_words != ''] # Drop empty tokens
        
        if (length(list_of_words) == 0) {
                return(data.frame(Word = word_counts_unigram$Word1,
                                  Prob = word_counts_unigram$Prob)[1:5,])
        } else if (length(list_of_words) == 1) {
                return(stupid_backoff_kn(cleaned_input))
        }
        
        # Back off from the highest-order model the input allows; the kn_*
        # functions return NA (not a dataframe) when no matching n-gram exists
        if (length(list_of_words) >= 4) {
                result <- kn_quintgram(tail(list_of_words, 4))
                if (is.data.frame(result)) return(result)
        }
        if (length(list_of_words) >= 3) {
                result <- kn_quadgram(tail(list_of_words, 3))
                if (is.data.frame(result)) return(result)
        }
        result <- kn_trigram(tail(list_of_words, 2))
        if (is.data.frame(result)) return(result)
        stupid_backoff_kn(cleaned_input)
}

save.image("~/R_Final_Project/kneser_ney.RData")

Model Testing

Let’s test our model before we create an application for our word prediction algorithm! We can provide some text input, and see whether its predictions are any good.

Example 1

kneser_ney('She has a')
##    Word      Prob
## 1 great 0.5720937

Example 2

kneser_ney('ajwgl5 2k5 2nwf')
##   Word        Prob
## 1  the 0.065147355
## 2  and 0.033455767
## 3 that 0.014036700
## 4  for 0.013541972
## 5  you 0.009715564

Example 3

kneser_ney('.6^@ 15mtrvd dsv09v 09and i')
##   Word      Prob
## 1 have 0.3609884
## 2  was 0.3238642
## 3   am 0.3113528

Real Example 1

kneser_ney('I think she stood me')
##   Word        Prob
## 1  the 0.065147355
## 2  and 0.033455767
## 3 that 0.014036700
## 4  for 0.013541972
## 5  you 0.009715564

Real Example 2

kneser_ney('They screech until I make them')
##   Word      Prob
## 1    a 0.3466612
## 2 more 0.3029629
## 3 feel 0.2095249

Real Example 3

kneser_ney('No matter what you do in life, I"ll always be there for')
##   Word      Prob
## 1  you 0.8029777