In part 01 of the series I covered the theory I will be using in the application; now let’s see how to put it to use. In this part I won’t go through the exact details of the theory, just the implementation. Let’s dive in.
For a working application sample: https://shining-thiloshon.shinyapps.io/NextIT/
Requirements
The dataset is in 4 languages, and each language contains text documents from three sources: tweets, news articles and blogs. Let’s try to quantify the raw English dataset.
| TextDocument | Lines | Words | MaxWords | AvgWords | Characters | MaxChars | AvgChars |
|---|---|---|---|---|---|---|---|
| Twitter | 2360148 | 30373545 | 47 | 12.86934 | 164744972 | 213 | 68.80281 |
| News | 899288 | 37334131 | 6630 | 41.51521 | 209260725 | 40835 | 231.69601 |
| Blogs | 77259 | 2643969 | 1031 | 34.22215 | 15761023 | 5760 | 203.00243 |
Wow! We have around 70 million words in total from around 3.3 million lines of text. That is a huge dataset!
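For reference, once the files are read in (as shown next), the per-document figures in the table can be computed along these lines. This is only a sketch; the original counting code is not shown, and the use of stringi::stri_count_words is my assumption.
library(stringi)
# per-line word and character counts for one document
words_per_line <- stri_count_words(US_Twitter)
chars_per_line <- nchar(US_Twitter)
c(Lines = length(US_Twitter),
  Words = sum(words_per_line), MaxWords = max(words_per_line), AvgWords = mean(words_per_line),
  Characters = sum(chars_per_line), MaxChars = max(chars_per_line), AvgChars = mean(chars_per_line))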
The dataset was loaded into R using the file() and readLines() functions.
con1 <- file("SwiftKey/en_US/en_US.twitter.txt", "r")
US_Twitter <- readLines(con1)
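The blog and news files were read in the same way. A sketch, assuming the standard SwiftKey file names sit next to the Twitter file:
con2 <- file("SwiftKey/en_US/en_US.blogs.txt", "r")
US_Blogs <- readLines(con2)
con3 <- file("SwiftKey/en_US/en_US.news.txt", "r")
US_News <- readLines(con3)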
The main constraint in text mining is resources: it requires a lot of computational power and, in particular, memory. Even though the dataset I have is a couple of million records long, I could only work with around 50,000 records in memory in one go. So I decided to randomly subsample the data and take a fraction (10%) of it to continue the project.
sampleHolderTwitter <- sample(length(US_Twitter), length(US_Twitter) * 0.1)
sampleHolderBlog <- sample(length(US_Blogs), length(US_Blogs) * 0.1)
sampleHolderNews <- sample(length(US_News), length(US_News) * 0.1)
US_Twitter_Sample <- US_Twitter[sampleHolderTwitter]
US_Blogs_Sample <- US_Blogs[sampleHolderBlog]
US_News_Sample <- US_News[sampleHolderNews]
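One note: sample() is random, so calling set.seed() before the sampling code above makes the subsample (and the statistics below) reproducible. The seed value itself is arbitrary:
set.seed(12345)  # run before the sample() calls; any fixed value works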
Now the statistics are as follows:
| TextDocument | Lines | Words | MaxWords | AvgWords | Characters | MaxChars | AvgChars |
|---|---|---|---|---|---|---|---|
| Twitter | 236014 | 3036570 | 40 | 12.86605 | 16465547 | 152 | 68.76513 |
| News | 89928 | 3727711 | 1219 | 41.45217 | 20880702 | 7034 | 231.19357 |
| Blogs | 7725 | 264458 | 319 | 34.23405 | 1573004 | 1634 | 202.62524 |
Interesting! The word count is now around 7 million, which makes sense since we sampled 10% of 70 million. But the average words and average characters per line haven’t changed much, so we can conclude our sample is reasonably representative.
And finally, the text documents were combined and converted into a single corpus object.
master_vector <- c(US_Twitter_Sample, US_Blogs_Sample, US_News_Sample)
corp <- corpus(master_vector)
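A quick note on packages: corpus() above, and the tokens()/dfm() functions used below, come from quanteda, while the later steps rely on data.table and ggplot2. A minimal setup, assuming the packages are installed:
library(quanteda)    # corpus, tokens, dfm and n-gram functions
library(data.table)  # fast keyed tables for the n-gram counts
library(ggplot2)     # plots of the most frequent n-grams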
Before continuing with processing, there are a few concerns to be addressed in the dataset we gathered. These are simple cleaning procedures that make the data easier to use in subsequent steps. The following steps show how these concerns were addressed.
# punctuation and numbers in the texts are removed, as there is no need to predict punctuation or numbers
master_Tokens <- tokens(
x = tolower(corp),
remove_punct = TRUE,
remove_twitter = TRUE,
remove_numbers = TRUE,
remove_hyphens = TRUE,
remove_symbols = TRUE,
remove_url = TRUE
)
After cleaning the data, we stem it, i.e., reduce each word to its root form so that different inflections are counted together.
stemed_words <- tokens_wordstem(master_Tokens, language = "english")
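To see what stemming does, here is a small illustration using quanteda’s character-level stemmer (the example words are mine):
char_wordstem(c("running", "runs"), language = "english")
# both forms reduce to the same stem, "run"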
Now that we have a cleaned and stemmed text corpus, let’s get into the modelling.
Let’s create N-grams and Document Frequency Matrices from the corpus.
bi_gram <- tokens_ngrams(stemed_words, n = 2)
tri_gram <- tokens_ngrams(stemed_words, n = 3)
uni_DFM <- dfm(stemed_words)
bi_DFM <- dfm(bi_gram)
tri_DFM <- dfm(tri_gram)
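By default, tokens_ngrams() joins the words of each n-gram with an underscore, which is why the counts are split on "_" a little further down. A tiny illustration (the sentence is my own, not from the dataset):
tokens_ngrams(tokens("the quick brown fox"), n = 2)
# gives the bigram tokens "the_quick", "quick_brown" and "brown_fox"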
To speed up calculations, let’s trim the DFMs, keeping only the n-grams that occur at least three times.
uni_DFM <- dfm_trim(uni_DFM, 3)
bi_DFM <- dfm_trim(bi_DFM, 3)
tri_DFM <- dfm_trim(tri_DFM, 3)
Now let’s sum the columns of the DFMs to get the count of each token in the corpus.
# Create named vectors with counts of words
sums_U <- colSums(uni_DFM)
sums_B <- colSums(bi_DFM)
sums_T <- colSums(tri_DFM)
# Create data tables with individual words as columns
uni_words <- data.table(word_1 = names(sums_U), count = sums_U)
bi_words <- data.table(
word_1 = sapply(strsplit(names(sums_B), "_", fixed = TRUE), '[[', 1),
word_2 = sapply(strsplit(names(sums_B), "_", fixed = TRUE), '[[', 2),
count = sums_B)
tri_words <- data.table(
word_1 = sapply(strsplit(names(sums_T), "_", fixed = TRUE), '[[', 1),
word_2 = sapply(strsplit(names(sums_T), "_", fixed = TRUE), '[[', 2),
word_3 = sapply(strsplit(names(sums_T), "_", fixed = TRUE), '[[', 3),
count = sums_T)
The 20 most used words in the dataset:
graph.data <- uni_words[order(uni_words$count, decreasing = T), ]
graph.data <- graph.data[1:20, ]
graph.data$word_1 <- factor(graph.data$word_1, levels = graph.data$word_1)
ggplot(data=graph.data, aes(x=word_1, y=count)) + geom_bar(stat="identity") +
theme(axis.text.x = element_text(angle = 40, hjust = 1))
The 20 most used bi-grams in the dataset:
graph.data <- bi_words[order(bi_words$count, decreasing = T), ]
graph.data <- graph.data[1:20, ]
graph.data$word <- paste(graph.data$word_1, graph.data$word_2)
graph.data$word <- factor(graph.data$word, levels = graph.data$word)
ggplot(data=graph.data, aes(x=word, y=count)) + geom_bar(stat="identity")
The 20 most used tri-grams in the dataset:
graph.data <- tri_words[order(tri_words$count, decreasing = T), ]
graph.data <- graph.data[1:20, ]
graph.data$word <- paste(graph.data$word_1, graph.data$word_2, graph.data$word_3)
graph.data$word <- factor(graph.data$word, levels = graph.data$word)
ggplot(data=graph.data, aes(x=word, y=count)) + geom_bar(stat="identity") +
theme(axis.text.x = element_text(angle = 40, hjust = 1))
From the N-grams created, we can get an idea of how to predict text. The 1-grams can be used to suggest the most frequent words, while the 2-grams and 3-grams can be used to predict the next word from the preceding ones.
To speed up lookups even more, let’s index the N-gram tables with data.table keys.
setkey(uni_words, word_1)
setkey(bi_words, word_1, word_2)
setkey(tri_words, word_1, word_2, word_3)
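Setting keys sorts each table and lets data.table answer lookups with fast binary-search joins, which is exactly what the prediction functions further down do. When a lookup finds no match, it returns a row of NAs by default, which is what the is.na() checks in those functions rely on. For example (the words here are hypothetical placeholders):
bi_words[.("some", "word")]   # all bigram rows with word_1 == "some" and word_2 == "word"
tri_words[.("some", "word")]  # all trigram rows whose first two words match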
Let’s add Kneser-Ney smoothing to the model. First we will compute the smoothed bi-gram probabilities, then the tri-gram probabilities.
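In equation form, what the code below computes for each observed bigram is an interpolated Kneser-Ney estimate (a sketch of my reading of the code, with d = 0.75 as the discount):

$$P_{KN}(w_2 \mid w_1) = \frac{c(w_1 w_2) - d}{c(w_1)} + \frac{d}{c(w_1)}\, N_{1+}(w_1\,\bullet)\, P_{cont}(w_2), \qquad P_{cont}(w_2) = \frac{N_{1+}(\bullet\, w_2)}{\text{total distinct bigrams}}$$

where N1+(w1 •) is the number of distinct bigrams starting with w1 and N1+(• w2) is the number of distinct bigrams ending in w2.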
discount_value <- 0.75
######## Finding Bi-Gram Probability #################
# Total number of distinct bigrams (each row of bi_words is one distinct bigram)
numOfBiGrams <- nrow(bi_words)
# Continuation probability of word 2: the number of distinct bigrams that end
# with word 2, divided by the total number of distinct bigrams
ckn <- bi_words[, .(Prob = ((.N) / numOfBiGrams)), by = word_2]
setkey(ckn, word_2)
# Assigning these continuation probabilities to the unigram table
uni_words[, Prob := ckn[word_1, Prob]]
uni_words <- uni_words[!is.na(uni_words$Prob)]
# Number of distinct bigrams that start with each word 1
n1wi <- bi_words[, .(N = .N), by = word_1]
setkey(n1wi, word_1)
# Assigning the unigram count of word 1 to the bigram table as Cn1
bi_words[, Cn1 := uni_words[word_1, count]]
# Kneser-Ney smoothed probability of the bigram
bi_words[, Prob := ((count - discount_value) / Cn1 + discount_value / Cn1 * n1wi[word_1, N] * uni_words[word_2, Prob])]
######## End of Finding Bi-Gram Probability #################
Then let’s find tri-gram probabilities and add smoothing.
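The trigram step has the same shape, with the (word_1, word_2) history in place of the single word. Note that, as the lower-order term, the code below plugs in the smoothed probability of the (word_1, word_2) bigram itself (again, a sketch matching my reading of the code):

$$P_{KN}(w_3 \mid w_1 w_2) = \frac{c(w_1 w_2 w_3) - d}{c(w_1 w_2)} + \frac{d}{c(w_1 w_2)}\, N_{1+}(w_1 w_2\,\bullet)\, P_{KN}(w_2 \mid w_1)$$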
######## Finding Tri-Gram Probability #################
# Finding count of word1-word2 combination in bigram
suppressWarnings(
tri_words[, Cn2 := bi_words[.(word_1, word_2), count]])
# Number of distinct trigrams that start with each word1-word2 pair
n1w12 <- tri_words[, .N, by = .(word_1, word_2)]
setkey(n1w12, word_1, word_2)
# Kneser-Ney smoothed probability of the trigram
suppressWarnings(
tri_words[, Prob := (count - discount_value) / Cn2 + discount_value / Cn2 * n1w12[.(word_1, word_2), N] *
bi_words[.(word_1, word_2), Prob]]
)
######## End of Finding Tri-Gram Probability #################
Finally, let’s tweak the unigrams so they can be used more effectively. Here we keep only the 50 most probable unigrams, as these are the most likely to occur. They will be used as the last resort when backing off.
# Keeping the 50 most probable unigrams
uni_words <- uni_words[order(-Prob)][1:50]
With that, we have found the word probabilities for the words in the dataset. Now let’s create a few functions to build the prediction app.
First, a function to predict the third word given the two previous words.
# function to return the most probable next words given the two preceding words
triWords <- function(w1, w2, n = 5) {
pwords <- tri_words[.(w1, w2)][order(-Prob)]
if (any(is.na(pwords)))
return(biWords(w2, n))
if (nrow(pwords) >= n)
return(pwords[1:n, word_3])
count <- nrow(pwords)
bwords <- biWords(w2, n)[1:(n - count)]
return(c(pwords[, word_3], bwords))
}
If we don’t find a tri-gram starting with the two given words, we back off to the bi-grams and find the next word given just the one previous word.
# function to return the most probable next words given one preceding word
biWords <- function(w1, n = 5) {
pwords <- bi_words[w1][order(-Prob)]
if (any(is.na(pwords)))
return(uniWords(n))
if (nrow(pwords) >= n)
return(pwords[1:n, word_2])
count <- nrow(pwords)
unWords <- uniWords(n)[1:(n - count)]
return(c(pwords[, word_2], unWords))
}
If we can’t even find a corresponding bigram, we randomly pick words from the high-probability unigrams. This is the last resort for n-grams that are not found in the dataset.
# function to return random words from unigrams
uniWords <- function(n = 5) {
return(sample(uni_words[, word_1], size = n))
}
Finally, a function to tie all of these together:
# The prediction app
getWords <- function(str){
require(quanteda)
tokens <- tokens(x = char_tolower(str))
tokens <- char_wordstem(rev(rev(tokens[[1]])[1:2]), language = "english")
words <- triWords(tokens[1], tokens[2], 5)
chain_1 <- paste(tokens[1], tokens[2], words[1], sep = " ")
print(words[1])
}
Let’s see how our model has worked so far.
getWords("Shall we go to")
## [1] "the"
getWords(getWords("Shall we go to"))
## [1] "the"
## [1] "same"
getWords(getWords(getWords("Shall we go to")))
## [1] "the"
## [1] "same"
## [1] "time"
Whooh. It works really well!!!
Congratulations, you’ve built a prediction app.