About this project


Objective

The aim of this project was to put into practice the knowledge acquired in the Coursera Data Science Specialization by building an application that suggests the next word to the user.

How the algorithm works

  1. The learning process is done by processing a large text dataset into ngrams.
  2. The text input by the user is processed the same way and then searched for in the ngram lists. The 3 most frequent pentagrams that ‘start’ with the words input by the user are returned, and their last words are displayed as suggestions.
  3. If no pentagrams are found, a stupid backoff model is applied: the application searches for matching quadgrams; if no quadgrams are found, trigrams are tested, and finally bigrams.
  4. If no ngrams are found at any level, the app returns a message asking the user to check for spelling errors.

The Shiny App

A simple Shiny application was also developed. It lets the user experience the algorithm's accuracy and response time first-hand. Developing an algorithm that worked within the Shiny app proved a nice challenge, since it added a significant restriction on the final response time and required the code to be optimized as much as possible.

Please feel free to try the application.

Data Import


The data was collected from publicly available news sources by a web crawler. It is a subset of HC Corpora, a collection of free text corpora available for download in 20 different languages, and can be found here.
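
For reference, here is a minimal sketch of how the raw file might be imported; the file name and path are assumptions for illustration, not the project's actual layout:

# Importing the raw text data (assumed file path)
data.raw <- readLines("./data/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)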

Input file Summary

  Class                  character
  Length                 77,259
  File Size (MB)         196.3
  Number of Characters   15,683,768
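
The figures above can be reproduced along these lines (a sketch; info.table and the file path are assumed names):

# Sketch of how the input file summary can be computed
info.table <- data.frame(
        Class = class(data.raw),        ## "character"
        Length = length(data.raw),      ## number of lines
        File.Size.MB = round(file.size("./data/en_US.news.txt") / 1024^2, 1),
        Number.of.Characters = sum(nchar(data.raw)))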

Data preparation


The raw data comes in a txt file; it needs to be prepared, tokenized and made into ngrams.

library(quanteda)   ## Tokenization and ngram generation
library(data.table) ## Fast, memory-efficient tables

# Removing unwanted characters
data.raw <- iconv(data.raw, from = "UTF-8", to = "ASCII", sub = "")

# The raw data comes with a lot of unwanted characters; they need to be removed and the text tokenized.
# (remove_twitter and remove_hyphens are quanteda v1 arguments; later versions dropped or renamed them)
data.tokens <- tokens(char_tolower(data.raw), ## All characters to lower case
                 remove_punct = TRUE,         ## Removing punctuation
                 remove_symbols = TRUE,       ## Removing symbols
                 remove_separators = TRUE,    ## Removing separators
                 remove_twitter = TRUE,       ## Removing twitter symbols (e.g. "@" and "#")
                 remove_hyphens = TRUE,       ## Removing hyphens
                 remove_numbers = TRUE)       ## Removing numbers

# Once the data is tokenized, different ngrams are generated. The process is shown here for the trigrams; the same steps apply to uni-, bi-, quad- and pentagrams (a helper sketch follows this block)
trigrams <- tokens_ngrams(data.tokens, 3, concatenator = " ") ## Generating trigrams
trigrams <- dfm(trigrams) ## Creating document-feature matrix
trigrams <- data.table(gram = featnames(trigrams), freq = colSums(trigrams))[order(-freq)] ## Creating and ordering data.table
trigrams <- trigrams[, gram] ## Removing freq columns to reduce table size

# Also, let's save the table that will be loaded in the shiny application
write.csv(trigrams, "./Ngram tables/trigrams.csv")
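
Since identical steps are repeated for every ngram order, they can be wrapped in a small helper. This is just a sketch; make.ngrams and the loop are illustrative additions, not part of the original code:

# A helper that generates and saves any ngram order (sketch)
make.ngrams <- function(tokens.obj, n, out.file) {
        grams <- tokens_ngrams(tokens.obj, n, concatenator = " ") ## Generating ngrams
        grams <- dfm(grams)                                       ## Document-feature matrix
        grams <- data.table(gram = featnames(grams),
                            freq = colSums(grams))[order(-freq)]  ## Ordering by frequency
        grams <- grams[, gram]                                    ## Keeping only the gram column
        write.csv(grams, out.file)
}

# Generating everything from unigrams to pentagrams
for (n in 1:5) {
        make.ngrams(data.tokens, n, paste0("./Ngram tables/", n, "grams.csv"))
}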

The Predictor Algorithm


Once the data has been prepared and the ngram tables generated, we can work on the algorithm. The main idea is to take the last 4 words input by the user, form a phrase with them, and check whether there is a pentagram that starts with this phrase. If the search is positive, we print the last word of the pentagram; if negative, we try again, this time with the last 3 words of the input, searching the quadgrams. The process is repeated down through the trigrams and bigrams. If nothing is found, the message “We haven’t found any suggestion, please check for spelling errors.” is displayed.

predictor <- function(text.input) {
        
        # The same preparation that was done to generate the ngrams is done with the user input
        text.input <- iconv(text.input, from = "UTF-8", to = "ASCII", sub = "")
        text.token <- tokens(char_tolower(text.input), 
                                 remove_punct = TRUE, 
                                 remove_symbols = TRUE, 
                                 remove_separators = TRUE, 
                                 remove_twitter = TRUE, 
                                 remove_hyphens = TRUE, 
                                 remove_numbers = TRUE)

        text.token <- as.vector(unlist(text.token))
        
        n <- length(text.token)
        
        # If the user input has length 0 (the user hasn't input anything), then return nothing
        if (n == 0) {
                return() 
        }
        
        # If the user input has length 4 or more, try to find a corresponding pentagram
        if (n >= 4) {
                n <- 4
                
                ## Create a search criterion from the last 4 words input by the user
                ## (the trailing "" appends a final space so only whole words match)
                search.criteria <- paste(text.token[length(text.token)-3], 
                                text.token[length(text.token)-2],
                                text.token[length(text.token)-1],
                                text.token[length(text.token)],
                                "", sep = " ")
                ## Find all pentagrams that start with the search criterion
                ## ("^" anchors the pattern to the beginning of the gram)
                search.result <- grep(paste0("^", search.criteria), pentagrams, value = TRUE)
                
                ## If no gram is found, then repeat the same process with a lower ngram.
                if (identical(character(0),search.result)) { n <- n - 1 } 
        }
        
        if (n == 3) {
                search.criteria <- paste(text.token[length(text.token)-2],
                                text.token[length(text.token)-1],
                                text.token[length(text.token)],
                                "", sep = " ")
                search.result <- grep(paste0("^", search.criteria), quadgrams, value = TRUE)
                if (identical(character(0),search.result)) { n <- n - 1 }
        }

        if (n == 2) {
                search.criteria <- paste(text.token[length(text.token)-1],
                                text.token[length(text.token)],
                                "", sep = " ")
                search.result <- grep(paste0("^", search.criteria), trigrams, value = TRUE)
                if (identical(character(0),search.result)) { n <- n - 1 }

        }

        if (n == 1) {
                search.criteria <- paste(text.token[length(text.token)],
                                "", sep = " ")
                search.result <- grep(paste0("^", search.criteria), bigrams, value = TRUE)
        }
        
        if (identical(character(0),search.result)) {
                print("We haven't found any suggestion, please check for spelling errors.") ## If all levels are searched and no ngram is found, this message is returned
        } else {
                search.result <- strsplit(search.result, split = " ") ## Splitting all words of all ngrams found
        
                ## Picking the last word of the first, second and third most frequent results
                fst <- search.result[[1]][length(search.result[[1]])] 
                if (length(search.result) > 1) {snd <- search.result[[2]][length(search.result[[2]])]} else {snd <- ""} ## If there is a second result, return the last word of it
                if (length(search.result) > 2) {trd <- search.result[[3]][length(search.result[[3]])]} else {trd <- ""} ## If there is a third result, return the last word of it
                sug.table <- data.table(fst, snd, trd)
                return(sug.table)
        }
}
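
A quick usage sketch; the input phrase is arbitrary and the actual suggestions depend on the ngram tables that are loaded:

# predictor() returns a one-row data.table with columns fst, snd and trd,
# holding up to three suggestions (empty strings when fewer matches exist)
suggestions <- predictor("I would like to")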

Output Examples


Here are a few examples to give you an idea of the algorithm’s output.

  Input             Suggestion #1   Suggestion #2   Suggestion #3
  just              a               as              the
  AS OF             the             p.m             this
  .The end is       nigh
  Don’t be such A   succulent       treat           lovely

The “Accuracy vs Response time” challenge


As stated before, the “accuracy vs response time” trade-off was a constant during development. Using larger datasets and higher-order grams increased the accuracy of the application, but also increased its response time.
Imagine having a phone app that takes 2 seconds to suggest a word: when it finally suggests the nth word of a phrase, you are already typing the (n+5)th word. A long response time is highly undesirable because it can make the application useless.

The key approaches to overcome the challenge

The solution was to optimize the code as much as possible, which allowed the use of a large dataset and higher-order ngrams while keeping the response time low. Three key approaches made this viable:

  1. The quanteda package: benchmark tests showed that quanteda could handle more data, and handle it faster, than alternative packages such as tm and RWeka.

  2. data.table: data.tables have lower memory usage and higher performance than data frames. This allowed faster response times in the shiny application and the use of bigger ngram tables.

  3. fread: one of the most important features, the fread function, together with its nrows argument, allowed a dramatic decrease in data-import time in the shiny application (see the sketch below). This reduced the loading phase of the app, enabling the use of quadgrams and pentagrams.
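
For illustration, a minimal sketch of that loading step; the file path and the 500,000-row cap are assumptions, and the column name x comes from the write.csv output shown earlier:

library(data.table)

# The tables are saved in descending order of frequency, so nrows
# keeps only the most frequent ngrams (the cap value is an assumption)
pentagrams <- fread("./Ngram tables/pentagrams.csv", nrows = 500000)$x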