NextIt (Word/Sentence Prediction System)

In part 01 of the series I covered the theories I will be using in the application; now let’s see how to put them to use. In this part I won’t go through the exact details of the theories, just the implementations. Let’s dive in.

For a working application sample: https://shining-thiloshon.shinyapps.io/NextIT/

Requirements

  1. R
  2. R Studio
  3. quanteda, data.table and ggplot2 packages (loaded in the snippet after this list)
  4. Carnegie Mellon University’s Luis von Ahn research group’s bad-word collection (optional)
  5. Dataset: the dataset I am using is provided by the Coursera course and was created by the folks at SwiftKey. You don’t need the exact dataset to follow the blog, as I will use sample texts to explain, but you will need it to reproduce the exact work. The data was downloaded from the Coursera course page at https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip.
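
These packages are assumed to be loaded at the top of the script (ggplot2 is only needed for the frequency plots further down):

library(quanteda)    # tokenization, n-grams, document-feature matrices
library(data.table)  # fast keyed lookups for the n-gram tables
library(ggplot2)     # frequency plots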

Understanding the Dataset

The dataset is in 4 languages,

  • English
  • German
  • Finnish
  • Russian

and each language contains text documents from different sources,

  • Twitter Tweets
  • Blogs
  • News

Let’s try to quantify the raw English dataset.

RAW Data
Text document   Lines      Words       Max words/line   Avg words/line   Characters    Max chars/line   Avg chars/line
Twitter         2360148    30373545    47               12.86934         164744972     213              68.80281
News            899288     37334131    6630             41.51521         209260725     40835            231.69601
Blogs           77259      2643969     1031             34.22215         15761023      5760             203.00243

Wow! We have about 70 million words in total from around 3.3 million lines of text. That is a huge dataset!

Loading Data

The dataset was loaded into R using the file() and readLines() functions.

con1 <- file("SwiftKey/en_US/en_US.twitter.txt", "r")
US_Twitter <- readLines(con1)
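
The blogs and news files can be loaded the same way (the file names below follow the dataset’s en_US naming; adjust the paths if yours differ):

con2 <- file("SwiftKey/en_US/en_US.blogs.txt", "r")
US_Blogs <- readLines(con2)

con3 <- file("SwiftKey/en_US/en_US.news.txt", "r")
US_News <- readLines(con3)

# close the connections once the files have been read
close(con1); close(con2); close(con3)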

The main constraint in text mining is resources: it demands a lot of computational power and, in particular, memory. Even though my dataset is a couple of million records long, I could only load around 50,000 records into memory in one go. So I decided to randomly subsample the data and keep a fraction (10%) of it for the rest of the project.

sampleHolderTwitter <- sample(length(US_Twitter), length(US_Twitter) * 0.1)
sampleHolderBlog <- sample(length(US_Blogs), length(US_Blogs) * 0.1)
sampleHolderNews <- sample(length(US_News), length(US_News) * 0.1)

US_Twitter_Sample <- US_Twitter[sampleHolderTwitter]
US_Blogs_Sample <- US_Blogs[sampleHolderBlog]
US_News_Sample <- US_News[sampleHolderNews]
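
Since sample() is random, calling set.seed() before the three sample() calls above makes the 10% subsample reproducible (the seed value here is arbitrary):

# fix the random seed so the subsample can be reproduced
set.seed(1234)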

Now the statistics are as follows,

Sample Data
Text document   Lines      Words       Max words/line   Avg words/line   Characters    Max chars/line   Avg chars/line
Twitter         236014     3036570     40               12.86605         16465547      152              68.76513
News            89928      3727711     1219             41.45217         20880702      7034             231.19357
Blogs           7725       264458      319              34.23405         1573004       1634             202.62524

Interesting! The word count is now around 7 million, which is expected since we sampled 10% of 70 million. But the average words and characters per line have barely changed, so we can conclude our sampling is representative.
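
For reference, summary statistics like the ones in these tables can be produced with a small helper along these lines (a sketch; the function name and the whitespace-based word split are my own choices):

# per-document line, word and character statistics (sketch)
text_stats <- function(lines) {
    words_per_line <- sapply(strsplit(lines, "\\s+"), length)
    chars_per_line <- nchar(lines)
    c(Lines = length(lines),
      Words = sum(words_per_line),
      MaxWords = max(words_per_line),
      AvgWords = mean(words_per_line),
      Characters = sum(chars_per_line),
      MaxChars = max(chars_per_line),
      AvgChars = mean(chars_per_line))
}

text_stats(US_Twitter_Sample)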

Finally, the three sampled text documents were combined and converted into a corpus object.

master_vector <- c(US_Twitter_Sample, US_Blogs_Sample, US_News_Sample)
corp <- corpus(master_vector)
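
A quick sanity check: corpus() turns each element of the character vector into one document, so the corpus should contain one document per sampled line.

# each element of master_vector becomes one document
ndoc(corp)    # should equal length(master_vector)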

Preprocessing

Before continuing to process the data, there are a few concerns to address in the dataset we gathered. These are simple cleaning procedures that make the data easier to work with in the subsequent steps. The next few steps show how they were handled.

# punctuation and numbers in the texts are removed, as there is no need to predict punctuation or numbers
# (note: remove_twitter and remove_hyphens come from the quanteda release used at the time of writing;
# newer releases have renamed or dropped these options)
master_Tokens <- tokens(
    x = tolower(corp),
    remove_punct = TRUE,
    remove_twitter = TRUE,
    remove_numbers = TRUE,
    remove_hyphens = TRUE,
    remove_symbols = TRUE,
    remove_url = TRUE
)

After cleaning the data, we have to stem it.

stemed_words <- tokens_wordstem(master_Tokens, language = "english")
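
To see what stemming does, here is a tiny illustration on a few standalone words (the example words are mine; char_wordstem is quanteda's character-level counterpart of tokens_wordstem):

# related word forms are reduced to a common stem
char_wordstem(c("running", "runs", "easily"), language = "english")
## expected to give stems along the lines of "run" "run" "easili"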

Tokenization and N-Gram Modelling

Now that we have a cleaned and stemmed text corpus, let’s get into the modelling.

Let’s create N-grams and Document Frequency Matrices from the corpus.

# bi-gram and tri-gram token sequences
bi_gram <- tokens_ngrams(stemed_words, n = 2)
tri_gram <- tokens_ngrams(stemed_words, n = 3)

# document-feature matrices for each n-gram order
uni_DFM <- dfm(stemed_words)
bi_DFM <- dfm(bi_gram)
tri_DFM <- dfm(tri_gram)

To speed up calculations, let’s trim the DFMs so that only n-grams appearing at least three times are kept.

uni_DFM <- dfm_trim(uni_DFM, 3)
bi_DFM <- dfm_trim(bi_DFM, 3)
tri_DFM <- dfm_trim(tri_DFM, 3)

Now let’s sum the DFM columns to get the count of each n-gram in the corpus, and store the results in data.tables with one column per word.

# Create named vectors with counts of words 
sums_U <- colSums(uni_DFM)
sums_B <- colSums(bi_DFM)
sums_T <- colSums(tri_DFM)

# Create data tables with individual words as columns
uni_words <- data.table(word_1 = names(sums_U), count = sums_U)

bi_words <- data.table(
        word_1 = sapply(strsplit(names(sums_B), "_", fixed = TRUE), '[[', 1),
        word_2 = sapply(strsplit(names(sums_B), "_", fixed = TRUE), '[[', 2),
        count = sums_B)

tri_words <- data.table(
        word_1 = sapply(strsplit(names(sums_T), "_", fixed = TRUE), '[[', 1),
        word_2 = sapply(strsplit(names(sums_T), "_", fixed = TRUE), '[[', 2),
        word_3 = sapply(strsplit(names(sums_T), "_", fixed = TRUE), '[[', 3),
        count = sums_T)

The 20 most used words in the dataset:

graph.data <- uni_words[order(uni_words$count, decreasing = T), ]
graph.data <- graph.data[1:20, ]
graph.data$word_1 <- factor(graph.data$word_1, levels = graph.data$word_1)

ggplot(data=graph.data, aes(x=word_1, y=count)) + geom_bar(stat="identity") +
    theme(axis.text.x = element_text(angle = 40, hjust = 1))

The 20 most used bi-grams in the dataset:

graph.data <- bi_words[order(bi_words$count, decreasing = T), ]
graph.data <- graph.data[1:20, ]
graph.data$word <- paste(graph.data$word_1, graph.data$word_2)
graph.data$word <- factor(graph.data$word, levels = graph.data$word)

ggplot(data=graph.data, aes(x=word, y=count)) + geom_bar(stat="identity")

The 20 most used tri-grams in the dataset:

graph.data <- tri_words[order(tri_words$count, decreasing = T), ]
graph.data <- graph.data[1:20, ]
graph.data$word <- paste(graph.data$word_1, graph.data$word_2, graph.data$word_3)
graph.data$word <- factor(graph.data$word, levels = graph.data$word)

ggplot(data=graph.data, aes(x=word, y=count)) + geom_bar(stat="identity") + 
    theme(axis.text.x = element_text(angle = 40, hjust = 1))

From the n-grams we have created, we can get an idea of how to predict text: the unigrams give us the most frequent words overall, while the bi-grams and tri-grams let us predict the next word from the previous one or two words.

To speed up lookups even more, let’s index the n-gram tables on their word columns.

setkey(uni_words, word_1)
setkey(bi_words, word_1, word_2)
setkey(tri_words, word_1, word_2, word_3)
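
With the keys set, prefix lookups become fast binary searches. For example, the most frequent tri-grams starting with a given word pair can be pulled out like this (the word pair is illustrative and the output depends on your sample):

# five most frequent tri-grams whose first two (stemmed) words are "at" and "the"
head(tri_words[.("at", "the")][order(-count)], 5)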

Kneser-Ney Smoothing

Let’s apply Kneser-Ney smoothing to the counts. First we will compute the smoothed bi-gram probabilities, and then the tri-gram probabilities.
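
As a quick recap of part 01 (written out here because the code below maps onto it term by term), the interpolated Kneser-Ney estimate for a bi-gram with discount d is

P_{KN}(w_i \mid w_{i-1}) = \frac{c(w_{i-1} w_i) - d}{c(w_{i-1})} + \frac{d}{c(w_{i-1})} \, \big|\{w : c(w_{i-1} w) > 0\}\big| \cdot P_{cont}(w_i),
\qquad P_{cont}(w_i) = \frac{\big|\{w' : c(w' w_i) > 0\}\big|}{\text{number of distinct bi-grams}}

In the code, count is c(w_{i-1} w_i), Cn1 is c(w_{i-1}), n1wi holds the number of distinct continuations of word 1, and the Prob column of uni_words holds the continuation probability. The max(c - d, 0) guard from the textbook formula is dropped, which is harmless here since trimming already removed counts below 3.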

discount_value <- 0.75

######## Finding Bi-Gram Probability #################

# Total number of distinct bi-grams (bi_words has one row per bi-gram type)
numOfBiGrams <- nrow(bi_words[, .N, by = .(word_1, word_2)])

# Continuation probability of each word: the number of distinct bi-grams it completes,
# divided by the total number of distinct bi-grams
ckn <- bi_words[, .(Prob = ((.N) / numOfBiGrams)), by = word_2]
setkey(ckn, word_2)

# Assigning each unigram its continuation probability (as the second word of a bi-gram)
uni_words[, Prob := ckn[word_1, Prob]]
uni_words <- uni_words[!is.na(uni_words$Prob)]

# Number of distinct bi-gram types that start with each word 1
n1wi <- bi_words[, .(N = .N), by = word_1]
setkey(n1wi, word_1)

# Assigning the unigram count of word 1 to the bi-gram table as Cn1
bi_words[, Cn1 := uni_words[word_1, count]]

# Kneser-Ney smoothed bi-gram probability
bi_words[, Prob := ((count - discount_value) / Cn1 + discount_value / Cn1 * n1wi[word_1, N] * uni_words[word_2, Prob])]

######## End of Finding Bi-Gram Probability #################

Then let’s find tri-gram probabilities and add smoothing.
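
The textbook form of the recursion is the same expression one order up, backing off to a smoothed bi-gram probability:

P_{KN}(w_i \mid w_{i-2} w_{i-1}) = \frac{c(w_{i-2} w_{i-1} w_i) - d}{c(w_{i-2} w_{i-1})} + \frac{d}{c(w_{i-2} w_{i-1})} \, \big|\{w : c(w_{i-2} w_{i-1} w) > 0\}\big| \cdot P_{KN}(w_i \mid w_{i-1})

In the code, Cn2 plays the role of c(w_{i-2} w_{i-1}) and n1w12 counts the distinct continuations of each word-1/word-2 pair, with the Prob column of bi_words supplying the lower-order term.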

######## Finding Tri-Gram Probability #################

# Finding count of word1-word2 combination in bigram 
suppressWarnings(
    tri_words[, Cn2 := bi_words[.(word_1, word_2), count]])

# Number of distinct tri-gram types that start with each word1-word2 pair
n1w12 <- tri_words[, .N, by = .(word_1, word_2)]
setkey(n1w12, word_1, word_2)

# Kneser-Ney Algorithm
suppressWarnings(
    tri_words[, Prob := (count - discount_value) / Cn2 + discount_value / Cn2 * n1w12[.(word_1, word_2), N] *
              bi_words[.(word_1, word_2), Prob]]
)

######## End of Finding Tri-Gram Probability #################

Finally, let’s trim the unigram table so it can be used more effectively. We keep only the 50 most probable unigrams, since these are the words most likely to occur. They will serve as the last resort when backing off.

# Keeping only the 50 most probable unigrams
uni_words <- uni_words[order(-Prob)][1:50]

With that, we have the smoothed probabilities for the n-grams in the dataset. Now let’s create a few functions to build the prediction app.

The Prediction App

First, the function that predicts the third word given the two previous words.

# function to return the most probable next words given the two previous words
triWords <- function(w1, w2, n = 5) {
    pwords <- tri_words[.(w1, w2)][order(-Prob)]
    if (any(is.na(pwords)))
        return(biWords(w2, n))
    if (nrow(pwords) > n)
        return(pwords[1:n, word_3])
    count <- nrow(pwords)
    bwords <- biWords(w2, n)[1:(n - count)]
    return(c(pwords[, word_3], bwords))
}

If we don’t find a tri-gram starting with the two given words, we back off to the bi-grams: we find the next word given only the last word.

# function to return the most probable next words given one previous word
biWords <- function(w1, n = 5) {
    pwords <- bi_words[w1][order(-Prob)]
    if (any(is.na(pwords)))
        return(uniWords(n))
    if (nrow(pwords) > n)
        return(pwords[1:n, word_2])
    count <- nrow(pwords)
    unWords <- uniWords(n)[1:(n - count)]
    return(c(pwords[, word_2], unWords))
}

If we can’t find the corresponding bi-gram either, we fall back to randomly picking from the high-probability unigrams. This is the last resort for n-grams that are not found in the dataset.

# function to return random words from the 50 most probable unigrams
uniWords <- function(n = 5) {  
    return(sample(uni_words[, word_1], size = n))
}

Finally, a function to tie all of these together:

# The prediction app
getWords <- function(str){
    require(quanteda)
    # tokenize the input, keep the last two tokens and stem them
    tokens <- tokens(x = char_tolower(str))
    tokens <- char_wordstem(rev(rev(tokens[[1]])[1:2]), language = "english")
    
    words <- triWords(tokens[1], tokens[2], 5)
    chain_1 <- paste(tokens[1], tokens[2], words[1], sep = " ")

    print(words[1])
}

Let’s see how our model has worked so far.

getWords("Shall we go to")
## [1] "the"
getWords(getWords("Shall we go to"))
## [1] "the"
## [1] "same"
getWords(getWords(getWords("Shall we go to")))
## [1] "the"
## [1] "same"
## [1] "time"

Whooh. It works really well!!!

Congratulations, you’ve built a prediction app.