Import Corpus

Before we start building the prediction model. We will import the tokenized corpus from “blogtokenized.txt”, “twittertokenized.txt”, and “newstokenized.txt”.

#Word counts
con <- file("blogtokenized.txt", "r")
blogchar<- readLines(con)
blogword <- length(blogchar)
close(con) 

con <- file("twittertokenized.txt", "r")
twitterchar <- readLines(con)
twitterword <- length(twitterchar)
close(con) 

con <- file("newstokenized.txt", "r")
newschar <- readLines(con)
newsword <- length(newschar)
close(con) 

N-gram Prediction Model

In this project, we will build a word prediction model which provides the most likely next word followed by its previous n words. This model is so called n-gram model, and n-gram refers to a \(n\) sequence of words. The specific n-gram model we will use is called Stupid Back-off Model, and will be described below.

Stupid Back-off Model

This model follows these step:

Maximum Likelihood Estimate (MLE)

In previous section a term MLE was metioned which deserves more detailed explanations. N-gram model predicts \(x_i\) based on $ x_{i-(n-1)},…,x_{i-1}$. MLE assigns probability on each word that exist in corpus. The probability is the number of n-gram events divided by the number of total n-gram events.

In the n-gram model, independence assumptions (Markov assumption) are made so that each word depends only on the last \(n-i\) words. In probability terms, this is \(P(x_{i}\mid x_{i-(n-1)},...,x_{i-1})\). Simply we will count up the number of particular word or combination of words and divide by the total number of occurances.

In equations MLE can be written as below

Model Building

Now we will build stupid back off model.

Step 1. Save unigram, bigram and trigram info

First we need to generate unigrams, bigrams and trigrams from the corpus. These n-grams will be used to construct the Katz’s BO model throughout this project. The code below will create unigram, bigram, and trigram with frequencies.

unigs <- tokens_ngrams(tokens(str_c(c(blogchar,twitterchar,newschar),collapse = " ")),n=1)[[1]]
unigs <- count(unigs)
unigs <- unigs[order(-unigs$freq),]
row.names(unigs) <- 1:nrow(unigs)
bigrs <- tokens_ngrams(tokens(str_c(c(blogchar,twitterchar,newschar),collapse = " ")),n=2)[[1]]
bigrs <- count(bigrs)
bigrs <- bigrs[order(-bigrs$freq),]
row.names(bigrs) <- 1:nrow(bigrs)
trigs <- tokens_ngrams(tokens(str_c(c(blogchar,twitterchar,newschar),collapse = " ")),n=3)[[1]]
trigs <- count(trigs)
trigs <- trigs[order(-trigs$freq),]
row.names(trigs) <- 1:nrow(trigs)

Step 2: Build Model

getNextWord <- function(bigPre=bigPre, unigs=unigs, bigrs = bigrs, trigs=trigs, choice=choice){
  Trigs <- trigs[grep(paste0("^", bigPre,"_"), trigs$x),]
  if (paste0(bigPre,"_", choice) %in% Trigs$x) {
    Trigs <- Trigs[order(-Trigs$freq),]
    output <- data.frame(ngram=choice, prob=Trigs[Trigs$x==paste0(bigPre,"_", choice),]$freq/sum(Trigs$freq))
  }
  else {
    uniPre <- tail(str_split(bigPre, '_')[[1]],1)
    Bigrs <- bigrs[grep(paste0("^", uniPre,"_"), bigrs$x),]
    if (paste0(uniPre,"_", choice) %in% Bigrs$x){
      Bigrs <- Bigrs[order(-Bigrs$freq),]
      output <- data.frame(ngram=choice, prob=Bigrs[Bigrs$x==paste0(uniPre,"_", choice),]$freq/sum(Bigrs$freq))
    }
    else {
      output <- data.frame(ngram=choice, prob=unigs[unigs$x=='choice',]$freq/sum(unigs$freq))
    }
  }
  return(output)
}

Using the model

Let’s try using the model with i_love as the bigram prefix.

bigPre <- 'i_love'
choices <- c('you', 'him', 'her', 'dog')
predictNextWord <- function(choices, bigPre,unigs,bigrs,trigs){
  output <- (lapply(choices, FUN= function(y) getNextWord(bigPre, unigs, bigrs, trigs, y))[[1]])$ngram
}
NextWord <- predictNextWord(choices,bigPre, unigs,bigrs, trigs)

We can see that the suggested word after i love is you.

Assessing model performance

Since the model is constructed now we will assess how well the model performs.

Speed

We measure how fast the model runs.

gamma <- 0.5
bigPre <- "i_love"
start_time <- Sys.time()
NextWord <- getNextWord(bigPre, unigs, bigrs, trigs, choices)
end_time <- Sys.time()
round(end_time - start_time, 2)
## Time difference of 0.89 secs

You can see it takes a out 0.27 seconds to get completed. This is not quite quick.

Accuracy

Now we will measure accuracy. Simply we will measure the proportion of correct answers provided by the model. We will randomly select 4 unigram words with top 100 frequencies from the test dataset and sample random 100 trigrams that ends with those 4 unigram words.

#Word counts
con <- file("blogtokenized_test.txt", "r")
blogchar_test<- readLines(con)
blogword_test <- length(blogchar_test)
close(con) 

con <- file("twittertokenized_test.txt", "r")
twitterchar_test <- readLines(con)
twitterword_test <- length(twitterchar_test)
close(con) 

con <- file("newstokenized_test.txt", "r")
newschar_test <- readLines(con)
newsword_test <- length(newschar_test)
close(con) 

unigs_test <- tokens_ngrams(tokens(str_c(c(blogchar_test,twitterchar_test,newschar_test),collapse = " ")),n=1)[[1]]
unigs_test <- count(unigs_test)
unigs_test <- unigs_test[order(-unigs_test$freq),]
row.names(unigs_test) <- 1:nrow(unigs_test)

bigrs_test <- tokens_ngrams(tokens(str_c(c(blogchar_test,twitterchar_test,newschar_test),collapse = " ")),n=2)[[1]]
bigrs_test <- count(bigrs_test)
bigrs_test <- bigrs_test[order(-bigrs_test$freq),]
row.names(bigrs_test) <- 1:nrow(bigrs_test)

trigs_test <- tokens_ngrams(tokens(str_c(c(blogchar_test,twitterchar_test,newschar_test),collapse = " ")),n=3)[[1]]
trigs_test <- count(trigs_test)
trigs_test <- trigs_test[order(-trigs_test$freq),]
row.names(trigs_test) <- 1:nrow(trigs_test)
set.seed(1234)
SampleWords<-sample(unigs_test$x[1:100], 4)

TestTrigrams <- trigs_test[(sapply(trigs_test$x, FUN= function(y) paste(tail(strsplit(as.character(y), "_")[[1]],1))) %in% SampleWords),]
set.seed(1234)
SampleTrigs <- sample(TestTrigrams$x, 50)

BigPreList<- sapply(SampleTrigs, FUN= function(y) paste(strsplit(as.character(y), "_")[[1]][1], strsplit(as.character(y), "_")[[1]][2],sep='_')) 

prediction <- sapply(BigPreList, FUN = function(y) predictNextWord(SampleWords, y, unigs_test,bigrs_test, trigs_test))
sum(paste0(BigPreList,'_', prediction)==SampleTrigs)/length(SampleTrigs)
## [1] 0.64

We can see that the accuracy of this model is about 46%. This is not very reliable so we need to develop a better model.