Before we start building the prediction model, we will import the tokenized corpora from “blogtokenized.txt”, “twittertokenized.txt”, and “newstokenized.txt”.
#Word counts
con <- file("blogtokenized.txt", "r")
blogchar<- readLines(con)
blogword <- length(blogchar)
close(con)
con <- file("twittertokenized.txt", "r")
twitterchar <- readLines(con)
twitterword <- length(twitterchar)
close(con)
con <- file("newstokenized.txt", "r")
newschar <- readLines(con)
newsword <- length(newschar)
close(con)
In this project, we will build a word prediction model that provides the most likely next word given the words that precede it. Such a model is called an n-gram model, where an n-gram is a sequence of \(n\) words. The specific n-gram model we will use is the Stupid Backoff model, described below.
The model works in the following steps: given the two preceding words, it looks up all trigrams that start with that two-word history and scores each candidate word by its relative frequency among them; if the trigram has not been seen, it backs off to bigrams starting with the last preceding word; and if that bigram is also unseen, it falls back to the candidate word’s overall unigram frequency.
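For reference, the Stupid Backoff score of a candidate word \(w_i\) given the two preceding words can be written as below (this is the standard formulation from Brants et al. (2007), in which the backoff factor \(\alpha\) is typically set to 0.4; the implementation in this report backs off between levels but scores each level by its plain relative frequency):
\[
S(w_i \mid w_{i-2}, w_{i-1}) =
\begin{cases}
\dfrac{\mathrm{count}(w_{i-2}, w_{i-1}, w_i)}{\mathrm{count}(w_{i-2}, w_{i-1})} & \text{if } \mathrm{count}(w_{i-2}, w_{i-1}, w_i) > 0 \\[2ex]
\alpha \, S(w_i \mid w_{i-1}) & \text{otherwise,}
\end{cases}
\]
where \(S(w_i \mid w_{i-1})\) is defined analogously from bigram counts and the recursion ends with \(S(w_i) = \mathrm{count}(w_i)/N\), \(N\) being the total number of tokens.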
The term MLE (maximum likelihood estimation), mentioned in the previous section, deserves a more detailed explanation. An n-gram model predicts \(x_i\) based on \(x_{i-(n-1)},\ldots,x_{i-1}\). MLE assigns a probability to each word that appears in the corpus: the count of a particular n-gram divided by the total count of n-grams sharing the same history.
In the n-gram model, an independence assumption (the Markov assumption) is made so that each word depends only on the last \(n-1\) words; in probability terms, we estimate \(P(x_{i}\mid x_{i-(n-1)},\ldots,x_{i-1})\). In practice, we simply count the occurrences of a particular word or combination of words and divide by the total number of occurrences.
In equation form, the MLE can be written as below:
\[
P_{\mathrm{MLE}}(x_i \mid x_{i-(n-1)},\ldots,x_{i-1}) = \frac{\mathrm{count}(x_{i-(n-1)},\ldots,x_{i-1},x_i)}{\mathrm{count}(x_{i-(n-1)},\ldots,x_{i-1})}
\]
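As a purely illustrative example with made-up counts (not taken from this corpus): if the history “i love” occurs 1,000 times in the training data and the trigram “i love you” occurs 400 of those times, then
\[
P_{\mathrm{MLE}}(\text{you} \mid \text{i love}) = \frac{400}{1000} = 0.4 .
\]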
Now we will build the Stupid Backoff model.
First we need to generate unigrams, bigrams, and trigrams from the corpus. These n-gram tables will be used to construct the backoff model throughout this project. The code below creates the unigram, bigram, and trigram tables with their frequencies.
library(quanteda)   # tokens(), tokens_ngrams()
library(stringr)    # str_c(), str_split()
library(plyr)       # count() is assumed here to be plyr::count(), which tabulates a vector into columns x and freq

# Unigrams: tokenize the combined corpus, count each token, and sort by descending frequency
unigs <- tokens_ngrams(tokens(str_c(c(blogchar,twitterchar,newschar),collapse = " ")),n=1)[[1]]
unigs <- count(unigs)
unigs <- unigs[order(-unigs$freq),]
row.names(unigs) <- 1:nrow(unigs)
# Bigrams
bigrs <- tokens_ngrams(tokens(str_c(c(blogchar,twitterchar,newschar),collapse = " ")),n=2)[[1]]
bigrs <- count(bigrs)
bigrs <- bigrs[order(-bigrs$freq),]
row.names(bigrs) <- 1:nrow(bigrs)
# Trigrams
trigs <- tokens_ngrams(tokens(str_c(c(blogchar,twitterchar,newschar),collapse = " ")),n=3)[[1]]
trigs <- count(trigs)
trigs <- trigs[order(-trigs$freq),]
row.names(trigs) <- 1:nrow(trigs)
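The three blocks above repeat the same pipeline for n = 1, 2, 3. As an optional refactor sketch (not part of the original code, and assuming the same quanteda/stringr/plyr functions loaded above), the pipeline could be wrapped in a single helper:
# Hypothetical helper (sketch only): build a frequency table of n-grams from a character vector of lines
build_ngram_table <- function(lines, n) {
  ngrams <- tokens_ngrams(tokens(str_c(lines, collapse = " ")), n = n)[[1]]
  tbl <- count(ngrams)                # data frame with columns x (n-gram) and freq
  tbl <- tbl[order(-tbl$freq), ]      # most frequent n-grams first
  row.names(tbl) <- 1:nrow(tbl)
  tbl
}
# Equivalent to the three blocks above:
# unigs <- build_ngram_table(c(blogchar, twitterchar, newschar), 1)
# bigrs <- build_ngram_table(c(blogchar, twitterchar, newschar), 2)
# trigs <- build_ngram_table(c(blogchar, twitterchar, newschar), 3)
With the n-gram tables in place, we can define the backoff look-up function.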
# Score a single candidate word ("choice") given a two-word history ("bigPre", e.g. "i_love"),
# backing off from trigrams to bigrams to unigrams.
getNextWord <- function(bigPre, unigs, bigrs, trigs, choice){
  # All trigrams that begin with the two-word history
  Trigs <- trigs[grep(paste0("^", bigPre, "_"), trigs$x), ]
  if (paste0(bigPre, "_", choice) %in% Trigs$x) {
    # Trigram observed: relative frequency among trigrams sharing this history
    Trigs <- Trigs[order(-Trigs$freq), ]
    output <- data.frame(ngram = choice, prob = Trigs[Trigs$x == paste0(bigPre, "_", choice), ]$freq/sum(Trigs$freq))
  }
  else {
    # Back off to bigrams: keep only the last word of the history
    uniPre <- tail(str_split(bigPre, '_')[[1]], 1)
    Bigrs <- bigrs[grep(paste0("^", uniPre, "_"), bigrs$x), ]
    if (paste0(uniPre, "_", choice) %in% Bigrs$x){
      Bigrs <- Bigrs[order(-Bigrs$freq), ]
      output <- data.frame(ngram = choice, prob = Bigrs[Bigrs$x == paste0(uniPre, "_", choice), ]$freq/sum(Bigrs$freq))
    }
    else {
      # Back off to unigrams: overall relative frequency of the candidate word
      output <- data.frame(ngram = choice, prob = unigs[unigs$x == choice, ]$freq/sum(unigs$freq))
    }
  }
  return(output)
}
Let’s try using the model with i_love as the bigram prefix.
bigPre <- 'i_love'
choices <- c('you', 'him', 'her', 'dog')
# Score every candidate word and return the one with the highest backoff probability
predictNextWord <- function(choices, bigPre, unigs, bigrs, trigs){
  scores <- do.call(rbind, lapply(choices, FUN = function(y) getNextWord(bigPre, unigs, bigrs, trigs, y)))
  as.character(scores$ngram[which.max(scores$prob)])
}
NextWord <- predictNextWord(choices,bigPre, unigs,bigrs, trigs)
We can see that the suggested word after i love is you.
Now that the model is constructed, we will assess how well it performs.
First, we measure how fast the model runs.
gamma <- 0.5   # defined but not used by the prediction functions below
bigPre <- "i_love"
start_time <- Sys.time()
# Time a full prediction across the four candidate words
NextWord <- predictNextWord(choices, bigPre, unigs, bigrs, trigs)
end_time <- Sys.time()
round(end_time - start_time, 2)
## Time difference of 0.89 secs
You can see that it takes about 0.89 seconds to complete, which is not particularly fast.
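Much of this time is spent in the grep() scans over the full trigram and bigram tables on every call. One possible speed-up, shown only as a sketch here (it is not part of the original model and assumes the data.table package), is to pre-compute the two-word history of each trigram and key the table on it, so matching trigrams are retrieved by a keyed binary search rather than a regular-expression scan:
library(data.table)
# Hypothetical indexing sketch: key the trigram table by its two-word history
trigs_dt <- as.data.table(trigs)
trigs_dt[, history := sub("_[^_]+$", "", x)]   # "i_love_you" -> "i_love"
setkey(trigs_dt, history)
# All trigrams with the history "i_love", found via a keyed lookup:
# trigs_dt["i_love"]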
Now we will measure accuracy, defined simply as the proportion of correct answers provided by the model. We will randomly select 4 words from the 100 most frequent unigrams in the test dataset and then sample 50 random trigrams from the test dataset that end with one of those 4 words.
#Word counts
con <- file("blogtokenized_test.txt", "r")
blogchar_test<- readLines(con)
blogword_test <- length(blogchar_test)
close(con)
con <- file("twittertokenized_test.txt", "r")
twitterchar_test <- readLines(con)
twitterword_test <- length(twitterchar_test)
close(con)
con <- file("newstokenized_test.txt", "r")
newschar_test <- readLines(con)
newsword_test <- length(newschar_test)
close(con)
# Build the n-gram frequency tables for the test set in the same way as for the training corpus
unigs_test <- tokens_ngrams(tokens(str_c(c(blogchar_test,twitterchar_test,newschar_test),collapse = " ")),n=1)[[1]]
unigs_test <- count(unigs_test)
unigs_test <- unigs_test[order(-unigs_test$freq),]
row.names(unigs_test) <- 1:nrow(unigs_test)
bigrs_test <- tokens_ngrams(tokens(str_c(c(blogchar_test,twitterchar_test,newschar_test),collapse = " ")),n=2)[[1]]
bigrs_test <- count(bigrs_test)
bigrs_test <- bigrs_test[order(-bigrs_test$freq),]
row.names(bigrs_test) <- 1:nrow(bigrs_test)
trigs_test <- tokens_ngrams(tokens(str_c(c(blogchar_test,twitterchar_test,newschar_test),collapse = " ")),n=3)[[1]]
trigs_test <- count(trigs_test)
trigs_test <- trigs_test[order(-trigs_test$freq),]
row.names(trigs_test) <- 1:nrow(trigs_test)
set.seed(1234)
# Pick 4 words from the 100 most frequent test-set unigrams
SampleWords <- sample(unigs_test$x[1:100], 4)
# Keep the test trigrams whose last word is one of the sampled words
TestTrigrams <- trigs_test[(sapply(trigs_test$x, FUN = function(y) paste(tail(strsplit(as.character(y), "_")[[1]], 1))) %in% SampleWords), ]
set.seed(1234)
SampleTrigs <- sample(TestTrigrams$x, 50)
# Two-word history of each sampled trigram, e.g. "i_love_you" -> "i_love"
BigPreList <- sapply(SampleTrigs, FUN = function(y) paste(strsplit(as.character(y), "_")[[1]][1], strsplit(as.character(y), "_")[[1]][2], sep = '_'))
# Predict the next word for each history, choosing among the 4 sampled words
prediction <- sapply(BigPreList, FUN = function(y) predictNextWord(SampleWords, y, unigs_test, bigrs_test, trigs_test))
# Proportion of predictions that reconstruct the observed trigram
sum(paste0(BigPreList, '_', prediction) == SampleTrigs)/length(SampleTrigs)
## [1] 0.64
We can see that the accuracy of this model is about 64%. This is not very reliable, so we need to develop a better model.