In part 01 of the series I covered the theory I will be using in the application; now let’s see how to put it to use. In this part I won’t go through the exact details of the theory, just the implementation. Let’s dive in.
For a working application sample: https://shining-thiloshon.shinyapps.io/NextIT/
Requirements
The dataset is in 4 languages, and each language contains text documents from three sources: tweets, news articles and blogs. Let’s try to quantify the raw English dataset.
| TextDocument | Lines | Words | MaxWords | AvgWords | Characters | MaxChars | AvgChars |
|---|---|---|---|---|---|---|---|
| Twitter | 2360148 | 30373545 | 47 | 12.86934 | 164744972 | 213 | 68.80281 |
| News | 899288 | 37334131 | 6630 | 41.51521 | 209260725 | 40835 | 231.69601 |
| Blogs | 77259 | 2643969 | 1031 | 34.22215 | 15761023 | 5760 | 203.00243 |
Wow! We have around 70 million words in total from around 3.3 million lines of text. That is a huge dataset!
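For reference, once the files are read in (as shown next), the per-document figures in the table can be computed along these lines. This is only a sketch; the original counting code is not shown, and the use of stringi::stri_count_words is my assumption.
library(stringi)
# per-line word and character counts for one document
words_per_line <- stri_count_words(US_Twitter)
chars_per_line <- nchar(US_Twitter)
c(Lines = length(US_Twitter),
  Words = sum(words_per_line), MaxWords = max(words_per_line), AvgWords = mean(words_per_line),
  Characters = sum(chars_per_line), MaxChars = max(chars_per_line), AvgChars = mean(chars_per_line))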
The dataset was loaded into R using the file() and readLines() functions.
con1 <- file("SwiftKey/en_US/en_US.twitter.txt", "r")
US_Twitter <- readLines(con1)
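The blog and news files were read in the same way. A sketch, assuming the standard SwiftKey file names sit next to the Twitter file:
con2 <- file("SwiftKey/en_US/en_US.blogs.txt", "r")
US_Blogs <- readLines(con2)
con3 <- file("SwiftKey/en_US/en_US.news.txt", "r")
US_News <- readLines(con3)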
The main constraint in text mining is resources: it requires a lot of computational power and, in particular, memory. Even though the dataset I have is a couple of million records long, I could only work with around 50,000 records in memory in one go. So I decided to randomly subsample the data and take a fraction (10%) of it to continue the project.
sampleHolderTwitter <- sample(length(US_Twitter), length(US_Twitter) * 0.1)
sampleHolderBlog <- sample(length(US_Blogs), length(US_Blogs) * 0.1)
sampleHolderNews <- sample(length(US_News), length(US_News) * 0.1)
US_Twitter_Sample <- US_Twitter[sampleHolderTwitter]
US_Blogs_Sample <- US_Blogs[sampleHolderBlog]
US_News_Sample <- US_News[sampleHolderNews]
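One note: sample() is random, so calling set.seed() before the sampling code above makes the subsample (and the statistics below) reproducible. The seed value itself is arbitrary:
set.seed(12345)  # run before the sample() calls; any fixed value works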
Now the statistics are as follows:
| TextDocument | Lines | Words | MaxWords | AvgWords | Characters | MaxChars | AvgChars |
|---|---|---|---|---|---|---|---|
| Twitter | 236014 | 3036570 | 40 | 12.86605 | 16465547 | 152 | 68.76513 |
| News | 89928 | 3727711 | 1219 | 41.45217 | 20880702 | 7034 | 231.19357 |
| Blogs | 7725 | 264458 | 319 | 34.23405 | 1573004 | 1634 | 202.62524 |
Interesting! The word count is now around 7 million, which makes sense since we sampled 10% of 70 million. But the average words and average characters per line haven’t changed much, so we can conclude our sample is reasonably representative.
And finally, the text documents were combined and converted into a single corpus object.
master_vector <- c(US_Twitter_Sample, US_Blogs_Sample, US_News_Sample)
corp <- corpus(master_vector)
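A quick note on packages: corpus() above, and the tokens()/dfm() functions used below, come from quanteda, while the later steps rely on data.table and ggplot2. A minimal setup, assuming the packages are installed:
library(quanteda)    # corpus, tokens, dfm and n-gram functions
library(data.table)  # fast keyed tables for the n-gram counts
library(ggplot2)     # plots of the most frequent n-grams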
Before continuing with processing, there are a few concerns to be addressed in the dataset we gathered. These are simple cleaning procedures that make the data easier to use in subsequent steps. The following steps show how these concerns were addressed.
# punctuation and numbers in the texts are removed, as there is no need to predict punctuation or numbers
master_Tokens <- tokens(
x = tolower(corp),
remove_punct = TRUE,
remove_twitter = TRUE,
remove_numbers = TRUE,
remove_hyphens = TRUE,
remove_symbols = TRUE,
remove_url = TRUE
)
After cleaning the data, we stem it, i.e., reduce each word to its root form so that different inflections are counted together.
stemed_words <- tokens_wordstem(master_Tokens, language = "english")
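To see what stemming does, here is a small illustration using quanteda’s character-level stemmer (the example words are mine):
char_wordstem(c("running", "runs"), language = "english")
# both forms reduce to the same stem, "run"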
Now that we have a cleaned and stemmed text corpus, let’s get into the modelling.
Let’s create N-grams and Document Frequency Matrices from the corpus.
bi_gram <- tokens_ngrams(stemed_words, n = 2)
tri_gram <- tokens_ngrams(stemed_words, n = 3)
uni_DFM <- dfm(stemed_words)
bi_DFM <- dfm(bi_gram)
tri_DFM <- dfm(tri_gram)
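By default, tokens_ngrams() joins the words of each n-gram with an underscore, which is why the counts are split on "_" a little further down. A tiny illustration (the sentence is my own, not from the dataset):
tokens_ngrams(tokens("the quick brown fox"), n = 2)
# gives the bigram tokens "the_quick", "quick_brown" and "brown_fox"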
To speed up calculations, let’s trim the DFMs, keeping only the n-grams that occur at least three times.
uni_DFM <- dfm_trim(uni_DFM, 3)
bi_DFM <- dfm_trim(bi_DFM, 3)
tri_DFM <- dfm_trim(tri_DFM, 3)
Now let’s sum the columns of the DFMs to get the count of each token in the corpus.
# Create named vectors with counts of words
sums_U <- colSums(uni_DFM)
sums_B <- colSums(bi_DFM)
sums_T <- colSums(tri_DFM)
# Create data tables with individual words as columns
uni_words <- data.table(word_1 = names(sums_U), count = sums_U)
bi_words <- data.table(
word_1 = sapply(strsplit(names(sums_B), "_", fixed = TRUE), '[[', 1),
word_2 = sapply(strsplit(names(sums_B), "_", fixed = TRUE), '[[', 2),
count = sums_B)
tri_words <- data.table(
word_1 = sapply(strsplit(names(sums_T), "_", fixed = TRUE), '[[', 1),
word_2 = sapply(strsplit(names(sums_T), "_", fixed = TRUE), '[[', 2),
word_3 = sapply(strsplit(names(sums_T), "_", fixed = TRUE), '[[', 3),
count = sums_T)
The 20 most used words in the dataset:
graph.data <- uni_words[order(uni_words$count, decreasing = T), ]
graph.data <- graph.data[1:20, ]
graph.data$word_1 <- factor(graph.data$word_1, levels = graph.data$word_1)
ggplot(data=graph.data, aes(x=word_1, y=count)) + geom_bar(stat="identity") +
theme(axis.text.x = element_text(angle = 40, hjust = 1))
The 20 most used bi-grams in the dataset:
graph.data <- bi_words[order(bi_words$count, decreasing = T), ]
graph.data <- graph.data[1:20, ]
graph.data$word <- paste(graph.data$word_1, graph.data$word_2)
graph.data$word <- factor(graph.data$word, levels = graph.data$word)
ggplot(data=graph.data, aes(x=word, y=count)) + geom_bar(stat="identity")
The 20 most used tri-grams in the dataset:
graph.data <- tri_words[order(tri_words$count, decreasing = T), ]
graph.data <- graph.data[1:20, ]
graph.data$word <- paste(graph.data$word_1, graph.data$word_2, graph.data$word_3)
graph.data$word <- factor(graph.data$word, levels = graph.data$word)
ggplot(data=graph.data, aes(x=word, y=count)) + geom_bar(stat="identity") +
theme(axis.text.x = element_text(angle = 40, hjust = 1))
From the N-grams created, we can get an idea of how to predict text. The 1-grams can be used to suggest the most frequent words, while the 2-grams and 3-grams can be used to predict the next word from the preceding ones.
To speed up lookups even more, let’s index the N-gram tables with data.table keys.
setkey(uni_words, word_1)
setkey(bi_words, word_1, word_2)
setkey(tri_words, word_1, word_2, word_3)
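Setting keys sorts each table and lets data.table answer lookups with fast binary-search joins, which is exactly what the prediction functions further down do. When a lookup finds no match, it returns a row of NAs by default, which is what the is.na() checks in those functions rely on. For example (the words here are hypothetical placeholders):
bi_words[.("some", "word")]   # all bigram rows with word_1 == "some" and word_2 == "word"
tri_words[.("some", "word")]  # all trigram rows whose first two words match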
Let’s add Kneser-Ney smoothing to the model. First we will compute the smoothed bi-gram probabilities, then the tri-gram probabilities.
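In equation form, what the code below computes for each observed bigram is an interpolated Kneser-Ney estimate (a sketch of my reading of the code, with d = 0.75 as the discount):

$$P_{KN}(w_2 \mid w_1) = \frac{c(w_1 w_2) - d}{c(w_1)} + \frac{d}{c(w_1)}\, N_{1+}(w_1\,\bullet)\, P_{cont}(w_2), \qquad P_{cont}(w_2) = \frac{N_{1+}(\bullet\, w_2)}{\text{total distinct bigrams}}$$

where N1+(w1 •) is the number of distinct bigrams starting with w1 and N1+(• w2) is the number of distinct bigrams ending in w2.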
discount_value <- 0.75
######## Finding Bi-Gram Probability #################
# Total number of distinct bigrams (each row of bi_words is one distinct bigram)
numOfBiGrams <- nrow(bi_words)
# Continuation probability of word 2: the number of distinct bigrams that end
# with word 2, divided by the total number of distinct bigrams
ckn <- bi_words[, .(Prob = ((.N) / numOfBiGrams)), by = word_2]
setkey(ckn, word_2)
# Assigning these continuation probabilities to the unigram table
uni_words[, Prob := ckn[word_1, Prob]]
uni_words <- uni_words[!is.na(uni_words$Prob)]
# Number of distinct bigrams that start with each word 1
n1wi <- bi_words[, .(N = .N), by = word_1]
setkey(n1wi, word_1)
# Assigning the unigram count of word 1 to the bigram table as Cn1
bi_words[, Cn1 := uni_words[word_1, count]]
# Kneser-Ney smoothed probability of the bigram
bi_words[, Prob := ((count - discount_value) / Cn1 + discount_value / Cn1 * n1wi[word_1, N] * uni_words[word_2, Prob])]
######## End of Finding Bi-Gram Probability #################
Then let’s find tri-gram probabilities and add smoothing.
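The trigram step has the same shape, with the (word_1, word_2) history in place of the single word. Note that, as the lower-order term, the code below plugs in the smoothed probability of the (word_1, word_2) bigram itself (again, a sketch matching my reading of the code):

$$P_{KN}(w_3 \mid w_1 w_2) = \frac{c(w_1 w_2 w_3) - d}{c(w_1 w_2)} + \frac{d}{c(w_1 w_2)}\, N_{1+}(w_1 w_2\,\bullet)\, P_{KN}(w_2 \mid w_1)$$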
######## Finding Tri-Gram Probability #################
# Finding count of word1-word2 combination in bigram
suppressWarnings(
tri_words[, Cn2 := bi_words[.(word_1, word_2), count]])
# Number of distinct trigrams that start with each word1-word2 pair
n1w12 <- tri_words[, .N, by = .(word_1, word_2)]
setkey(n1w12, word_1, word_2)
# Kneser-Ney smoothed probability of the trigram
suppressWarnings(
tri_words[, Prob := (count - discount_value) / Cn2 + discount_value / Cn2 * n1w12[.(word_1, word_2), N] *
bi_words[.(word_1, word_2), Prob]]
)
######## End of Finding Tri-Gram Probability #################
Finally, let’s tweak the unigrams so they can be used more effectively. Here we keep only the 50 most probable unigrams, as these are the most likely to occur. They will be used as the last resort when backing off.
# Keeping the 50 most probable unigrams
uni_words <- uni_words[order(-Prob)][1:50]
With that, we have found the word probabilities for the words in the dataset. Now let’s create a few functions to build the prediction app.
First, a function to predict the third word given the two previous words.
# function to return the most probable next words given the two preceding words
triWords <- function(w1, w2, n = 5) {
pwords <- tri_words[.(w1, w2)][order(-Prob)]
if (any(is.na(pwords)))
return(biWords(w2, n))
if (nrow(pwords) >= n)
return(pwords[1:n, word_3])
count <- nrow(pwords)
bwords <- biWords(w2, n)[1:(n - count)]
return(c(pwords[, word_3], bwords))
}
If we don’t find a tri-gram starting with the two given words, we back off to the bi-grams and find the next word given just the one previous word.
# function to return the most probable next words given one preceding word
biWords <- function(w1, n = 5) {
pwords <- bi_words[w1][order(-Prob)]
if (any(is.na(pwords)))
return(uniWords(n))
if (nrow(pwords) >= n)
return(pwords[1:n, word_2])
count <- nrow(pwords)
unWords <- uniWords(n)[1:(n - count)]
return(c(pwords[, word_2], unWords))
}
If we can’t even find a corresponding bigram, we randomly pick words from the high-probability unigrams. This is the last resort for n-grams that are not found in the dataset.
# function to return random words from unigrams
uniWords <- function(n = 5) {
return(sample(uni_words[, word_1], size = n))
}
Finally, a function to tie all of these together:
# The prediction app
getWords <- function(str){
require(quanteda)
tokens <- tokens(x = char_tolower(str))
tokens <- char_wordstem(rev(rev(tokens[[1]])[1:2]), language = "english")
words <- triWords(tokens[1], tokens[2], 5)
chain_1 <- paste(tokens[1], tokens[2], words[1], sep = " ")
print(words[1])
}
Let’s see how our model has worked so far.
getWords("Shall we go to")
## [1] "the"
getWords(getWords("Shall we go to"))
## [1] "the"
## [1] "same"
getWords(getWords(getWords("Shall we go to")))
## [1] "the"
## [1] "same"
## [1] "time"
Whooh. It works really well!!!
Congratulations, you’ve built a prediction app.