Word prediction is a technology that suggests words as one types on a keyboard, tablet, or phone. It is the technology behind smartphone autocorrect that allows users to select a predicted word or phrase. Using this as a backdrop, the key takeaways of this project answer the following questions:
* Which approach works best when predicting the next word from the previous n words?
* How do the results of these approaches differ?
* How can we improve our model?
I investigated predicting the next word in a sequence as entered by the user. The word predictor takes a phrase (a word or multiple words) as input and, after running the model, predicts the next word as output. The data used for this project is the SwiftKey dataset, which comprises the English versions of three documents: Twitter, blogs, and news. Combined, the three documents amount to around 500 MB. A brief description of the dataset is given below:
My approach to building this word predictor is based on a Markov-chain (n-gram) approach. I have built and compared two models:
* A baseline model that applies a simple Katz-Backoff scheme to the n-gram frequency tables.
* A second model (the GTModel) that smooths the n-gram frequencies with Good-Turing smoothing and combines unigram, bigram, and trigram probabilities by simple linear interpolation.
Finally, I have validated both models by calculating their accuracy.
Using the blogs and the Twitter documents, I built a single corpus, which was used to train the models. Before building the models, I performed some basic pre-processing steps on the text corpus. These steps were necessary in order to clean the data:
* Removing duplicates
* Removing URLs
* Removing Twitter symbols
* Profanity filtering: removing bad words
* Removing contractions
* Removing single letters except 'a' and 'I'
* Removing numbers
* Removing punctuation
* Removing stop-words
* Removing whitespace
For pre-processing the dataset I have made extensive use of the 'quanteda' library in R. Besides quanteda, I have also used the following R packages: tm, RWeka, SnowballC, readtext, doParallel, wordcloud, ggplot2, stringr, stringi, and dplyr.
After pre-processing, I generated 1- to 4-gram tokens from the text corpus. After generating these n-gram tokens, I performed a basic exploratory analysis.
The n-gram frequency tables generated in the previous section are the prerequisite for building our predictive models. See Appendix 1 for the complete code.
Since the text corpus is massive, I randomly sampled the dataset to include just 1% of the total data available for training the models. Also, I have only considered the blogs and Twitter data to train the models; 1% of the news document is used to validate them.
I first modelled the training data set with a simple application of the Katz-Backoff model. The baseline model roughly follows these steps:
* Take the last three words of the input phrase and look them up in the 4-gram frequency table; if a match is found, predict the last word of the most frequent matching 4-gram.
* Otherwise, back off to the last two words and the trigram table, and then to the last word and the bigram table.
* If no match is found at any level, predict the most frequent unigram in the corpus.
In this approach, along with the frequencies, I also calculated the probabilities of each of the n-grams as they appear in the training corpus. The bigram and trigram probabilities were calculated by maximum likelihood estimation.
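In count form (these are the same ratios computed by split2Gram and split3Gram in the appendix), the estimates are:

$$
P_{ML}(w_2 \mid w_1) = \frac{\text{count}(w_1, w_2)}{\text{count}(w_1)}, \qquad
P_{ML}(w_3 \mid w_1, w_2) = \frac{\text{count}(w_1, w_2, w_3)}{\text{count}(w_1, w_2)}
$$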
The next word's trigram probability was then estimated by the simple linear interpolation method (Daniel & Martin, 2018), in which the unigram, bigram, and trigram probabilities of the candidate word are weighted and combined (New University of Lisbon, n.d.).
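Concretely, with weights $\lambda_1 + \lambda_2 + \lambda_3 = 1$ (corresponding to coef[1], coef[2], and coef[3] in the appendix code, with default values 0, 0.3, and 0.7):

$$
\hat{P}(w_3 \mid w_1, w_2) = \lambda_1\, P(w_3) + \lambda_2\, P(w_3 \mid w_2) + \lambda_3\, P(w_3 \mid w_1, w_2)
$$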
The optimal weighting factors can be determined by searching with the simplex (Nelder-Mead) method (Singer & Nelder, 2009).
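A hypothetical sketch of such a search, using base R's optim() with the calculateScore2() validation function from the appendix as the objective (the experiments reported below instead compare a handful of fixed coefficient sets, since each objective evaluation requires a full pass over the validation data):

```r
# Hypothetical: search the interpolation weights with the Nelder-Mead simplex
# method, maximising validation accuracy as returned by calculateScore2().
objective <- function(par) {
  w <- abs(par) / sum(abs(par))                      # non-negative weights summing to 1
  -calculateScore2(c("news"), "GTModel", coef = w)   # minimise negative accuracy
}
fit <- optim(par = c(0.1, 0.3, 0.6), fn = objective, method = "Nelder-Mead")
abs(fit$par) / sum(abs(fit$par))                     # estimated optimal lambda weights
```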
An important factor to consider when listing the probabilities of all the n-grams is that there may be unobserved n-grams that have a zero frequency in the corpus. To mitigate this issue, I smoothed the probability distribution to account for n-grams that have not been observed in the corpus, by smoothing the n-gram frequencies with the Good-Turing method (Gale & Sampson, 1995).
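For reference, Good-Turing smoothing replaces an observed frequency $r$ by an adjusted frequency $r^*$ computed from $N_r$, the number of n-grams seen exactly $r$ times, and reserves probability mass for unseen n-grams:

$$
r^* = (r + 1)\,\frac{N_{r+1}}{N_r}, \qquad P_0 = \frac{N_1}{N},
$$

where $N$ is the total number of n-gram tokens. The simpleGT() function in the appendix implements the Simple Good-Turing variant of this idea (Gale & Sampson, 1995).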
The accuracy of both models was calculated using a custom accuracy function, which runs each model against a validation data set consisting of 1% of the news data. Each sentence of this data set is placed in a row of a data frame and split into two parts: all the words of the sentence except the last word (2nd column), and the last word of the sentence (3rd column). The data in the 2nd column is fed to the model, which predicts the last word. The prediction is then checked against the actual last word in the 3rd column, and a score of 1 is assigned if the prediction and the actual word are the same. The accuracy is the sum of these scores divided by the number of rows in the data frame.
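A condensed sketch of this scoring procedure (the full implementation is calculateScore2() in the appendix; the two sentences below are placeholders, and the appendix functions together with their saved n-gram tables are assumed to be loaded):

```r
library(stringi)
sentences <- c("thanks for the memories", "to be or not to be")      # placeholder validation sentences
lastWord  <- stri_extract_last_words(sentences)                       # 3rd column: the word to predict
context   <- substr(sentences, 1, nchar(sentences) - nchar(lastWord)) # 2nd column: everything before it
predicted <- sapply(context, function(x) GTModel(x)[1])               # the base model would call baselineModel() instead
accuracy  <- sum(predicted == lastWord, na.rm = TRUE) / length(sentences)
```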
Two variants of each model were considered: one in which the English stop-words were removed and another in which the stop-words were not removed. For the second model (the GTModel), I used the default lambda coefficients (0, 0.3, and 0.7 for the unigram, bigram, and trigram probabilities respectively). The following results were obtained:
Thus, I saw that the model accuracy decreased (albeit marginally) when the stop-words were included. This is intuitive: the frequency of stop-words is disproportionately high, so both models are more likely to predict a stop-word as the next word, even in cases where the correct prediction is not a stop-word.
Comparing the accuracy of the base model and the GTModel, I found that the accuracy of the GTModel (~6%) is higher than that of the base model (~3%). This is consistent with the intuition that the second model also accounts for the probability of n-grams that are not present in the training corpus (through Good-Turing smoothing), whereas the base model simply sets the probability of unseen n-grams to zero.
Having established that the GTModel performs better than the base model, I also experimented with various values of lambda to determine the coefficients that yield the highest probability for the predicted word and, in turn, the highest accuracy of the predictions made using the GTModel. Note that for these experiments I used the GTModel with stop-words removed, for the reasons mentioned above.
From the above table, we can infer that the higher the weight of the third coefficient, which corresponds to the trigram probability, the higher the accuracy of the model. Conversely, higher values of the first coefficient result in decreased accuracy of the GTModel. This suggests that the probability of the predicted word depends far more on the trigrams than on the unigrams.
From this work, we can conclude that Good-Turing smoothing is an effective method for smoothing the probabilities of the observed n-grams while accounting for the unobserved n-grams in the corpus. Also, for a word-predictor algorithm, the accuracy scores indicate that it is best to remove the stop-words while training the model.
The accuracy scores for this project are generally low. This can be attributed to a variety of factors, including but not limited to:
A small training corpus, amounting to only about 1% of the total data available. Although computational limitations prevented me from experimenting with different training-data sizes and their corresponding accuracies, the richer the vocabulary, the better the models should work; this applies particularly to the GTModel, because it would contain a larger number of n-grams and the smoothing would be more pronounced.
Imperfect pre-processing of the training corpus: the pre-processed data still contains improper contractions, abbreviations, and short-hands, whose presence reduces the significance of valid n-grams. Such edge cases are best handled by applying regular expressions manually rather than relying solely on the existing libraries, as sketched below.
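As a purely illustrative example (these patterns are not part of the pipeline above), a few hand-written substitutions can expand common contractions before tokenization:

```r
# Illustrative only: expand a few common contractions with regular expressions
# before tokenizing, so that "can't" does not survive as the malformed token "cant".
expandContractions <- function(x) {
  x <- gsub("won't", "will not", x, ignore.case = TRUE)
  x <- gsub("can't", "cannot", x, ignore.case = TRUE)
  x <- gsub("n't\\b", " not", x, ignore.case = TRUE)
  x <- gsub("'re\\b", " are", x, ignore.case = TRUE)
  x <- gsub("'ll\\b", " will", x, ignore.case = TRUE)
  gsub("'ve\\b", " have", x, ignore.case = TRUE)
}
expandContractions("I can't believe they won't go")
# [1] "I cannot believe they will not go"
```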
The bigrams and trigrams generated from the text corpus also contain some incoherent combinations because of the lack of beginning- and end-of-sentence markers. For example, consider the string "Hi! Nice to meet you. I am a student at the University of Washington". One of the 3-grams generated from this sentence after pre-processing is "meet you I", which crosses a sentence boundary and makes no sense. One way to avoid this is shown below.
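A possible remedy (a sketch, not used in the pipeline above) is to split the corpus into sentences first and only then form the n-grams, so that every n-gram stays within a single sentence:

```r
library(quanteda)
txt <- "Hi! Nice to meet you. I am a student at the University of Washington"
# Split into sentences first; each sentence becomes its own document
sentences <- unlist(tokens(corpus(txt), what = "sentence"))
# Tokenize each sentence into words and build trigrams within sentences only,
# so a boundary-crossing trigram like "meet_you_I" can no longer be produced
wordToks <- tokens(sentences, what = "word", remove_punct = TRUE)
tokens_ngrams(wordToks, n = 3)
```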
Another major limitation of the n-gram based Katz-Backoff approach is that the semantic relationship between words is not taken into consideration. To build a word predictor with more accurate predictions, we would probably need to implement a hybrid model that captures the semantic relationships between words (as in the Word2Vec approach) in addition to basing the predictions on past data through an n-gram approach such as the Katz-Backoff algorithm.
Daniel, J., & Martin, J. H. (2018). Naive Bayes and Sentiment. In Speech and Language Processing (pp. 14-15).
Gale, W. A., & Sampson, G. (1995). Good-Turing Frequency Estimation Without Tears. Journal of Quantitative Linguistics, 217-237.
Katz, S. M. (1987). Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer. IEEE Transactions on Acoustics, Speech, and Signal Processing, 400-401.
New University of Lisbon. (n.d.). Language Models LM Jelinek-Mercer Smoothing and LM Dirichlet Smoothing. Retrieved from New University of Lisbon Web Site: http://ctp.di.fct.unl.pt/~jmag/ir/slides/a05%20Language%20models.pdf
Singer, S., & Nelder, J. (2009). Nelder-Mead algorithm. Retrieved from Scholarpedia Web site: http://www.scholarpedia.org/article/Nelder-Mead_algorithm
Benoit, K., Watanabe, K., Wang, H., Nulty, P., Obeng, A., Müller, S., & Matsuo, A. (2018). quanteda: An R package for the quantitative analysis of textual data. Journal of Open Source Software, 3(30), 774. https://doi.org/10.21105/joss.00774
#The following libraries are used in this analysis
library(tm)
library(RWeka)
library(SnowballC)
library(quanteda)
library(readtext)
library(doParallel)
library(wordcloud)
library(ggplot2)
library(stringr)
library(stringi)
library(dplyr)
The following tasks are accomplished in this report:
For this project, I am using the “quanteda” package as it provides a rich set of text analysis features coupled with excellent performance relative to Java-based R packages for text analysis. More information here [https://github.com/lgreski/datasciencectacontent/blob/master/markdown/capstone-choosingATextPackage.md].
samplingFun <- function(filename, p, type="train")
{
con <- file(paste0("en_US.",filename,".txt"))
tempData <- readLines(con)
numrows <- length(tempData)
close(con)
#Random sampling of the file
set.seed(100)
sampleFile <- tempData[rbinom(n=numrows, size = 1, prob = p) == 1]
#Making a new file
if(type=="train")
con <- file(paste0("train_en_US.",filename,".txt"))
else if (type=="validate")
con <- file(paste0("validate_en_US.",filename,".txt"))
else
con <- file(paste0("test_en_US.",filename,".txt"))
writeLines(sampleFile, con)
close(con)
}
Here I am combining the data from the training sample files created above into a single corpus. This will subsequently be used for further analysis.
generateCorpus <- function(docsTrain, docsValidate, probTrain, probValidate){
require(stringi)
#Calling the function to sample each of the three files
for(i in 1:length(docsTrain)) {
samplingFun(docsTrain[i],probTrain, type="train")
}
for(i in 1:length(docsValidate)) {
samplingFun(docsValidate[i], probValidate, type="validate")
}
#After the above code runs successfully, three sample files are created in the same existing directory.
# Creating a vector of the sample files
#samplefiles <- c("train_en_US.blogs.txt", "train_en_US.news.txt", "train_en_US.twitter.txt")
samplefiles <- c()
for (i in 1:length(docsTrain))
samplefiles[i] <- paste0("train_en_US.",docsTrain[i],".txt")
# Creating a quanteda corpus of the three sample documents created above
textC <- stringi::stri_trans_general(readtext(samplefiles), "latin-ascii")
#myCorpus <- corpus(readtext(samplefiles))
myCorpus <- corpus(textC)
return(myCorpus)
}
In this section I will clean the sample data set (the corpus created above), followed by profanity filtering of the corpus, i.e., removal of the bad words. This section performs the following functions:
1. Removal of numbers
2. Removal of whitespace
3. Transforming content to lower case
4. Removal of punctuation
5. Removal of stop-words
6. Profanity filtering
tokenizeCorpus <- function(myCorpus, ngrm, removeStopWords=T, verbose = T)
{
#Tokenizing the corpus
myTokens <- tokens(myCorpus,
what = "word",
remove_punct = TRUE, #Remove Punctuation
remove_hyphens = FALSE, #preserve_intra_word_dashes
remove_numbers = TRUE, #Remove Numbers
remove_symbols = TRUE, #Remove Symbols
remove_twitter = TRUE, #Remove Twitter Symbols
remove_url = TRUE, #Remove URLs
remove_separators = TRUE,
concatenator = " ")
#Tokenizing the corpus
#myTokens <- tokens(myCorpus, what = "sentence")
#Getting the profane words data
profanity <- read.table(url("https://www.cs.cmu.edu/~biglou/resources/bad-words.txt"))
profanity <- as.character(unlist(profanity))
#removing all single letters except 'a' and 'i'
alphabets <- letters
alphabets <- alphabets[!alphabets %in% c("a","i")]
#alphabets
#Creating a vector of nGrams
#For this report, we are considering unigrams, bigrams and trigrams
nGram = 1:ngrm
if(removeStopWords == TRUE)
removeList <- c(stopwords("english"),profanity, alphabets)
else
removeList <- c(profanity, alphabets)
#This stores a list of data frames (term document matrix for each unigram, bigram and trigram
myNgramDF <- lapply(nGram, function(x){
#Creating a document feature matrix
mydfm <- dfm(tokens_ngrams(myTokens, n=x),
tolower=TRUE, #Covert to lowercase
remove = removeList) #remove stopwords and profanity
#print(head(mydfm))
#Trimming the DFM; min_termfreq = 1 keeps every term (increase this value to drop rare terms)
mydfm <- dfm_trim(mydfm, min_termfreq = 1)
#Converting the DFM to a data-frame
mydf <- convert(mydfm, to = "data.frame")
#Creating a Document Term Matrix from DFM
#A Document Term Matrix is a transposed DFM
mydf <- t(mydf)
#Setting column names of DTM as the name of the documents
colnames(mydf) <- mydf[1,]
mydf <- mydf[-1,]
mydf <- as.data.frame(mydf)
#colnames(mydf) <- c("sample_en_US.blogs.txt", "sample_en_US.news.txt", "sample_en_US.twitter.txt")
#print(colnames(mydf))
})
#Printing what each Document Term Matrix looks like
if(verbose == T){
for(i in 1:length(myNgramDF)){
cat("\n\n\t\tThis is how the Document Term Matrix of",i,"-gram looks like.\n\n")
print(head(myNgramDF[[i]]))}
}
return(myNgramDF)
}
#This function plots the Frequency Histogram of the n-grams for each of the three documents
plotFreqWords <- function(x, myNgramDF){
#Getting the m-gram data frame from the list and storing it temporarily
a <- myNgramDF[[x]]
#Converting each of the columns into numeric data-type
#We need this since the data frame returned from the DFM function stores numbers as factors
a[] <- lapply(a, function(x) {
if(is.factor(x)) as.numeric(as.character(x)) else x
})
#Removing row names and making them a separate column
a <- data.frame(DocTerms = row.names(a), a, row.names = NULL)
#A vector containing the document names. This will be used for naming each of the histograms
doc = colnames(a)[-1]
#For each n-gram we need one histogram per document in the corpus
for(i in 1:length(doc)){
main = paste0("15 Most Frequent ",x,"-grams ", "in ",doc[i], " article")
b <- a[,c(1,i+1)]
#b <- b[order(-b$sample_en_US.news.txt),]
b <- b[order(-b[,2]),]
colnames(b) <- c("word","Frequency")
#b
p <- ggplot(data=b[1:15,],
aes(x=reorder(word,Frequency),
y=Frequency,
fill=Frequency))+
geom_bar(stat="identity")+
xlab("Word")+
labs(title = main) +
#theme(legend.title=element_blank()) +
coord_flip()
print(p)
cat("\n\n\tWord-cloud of 15 Most Frequent ",x,"-grams ", "in ",doc[i], " article")
w <- wordcloud(b$word[1:15], b$Frequency[1:15],
colors=brewer.pal(8, "Dark2"))
print(w)
}
}
#temp <- myNgramDF[[1]]
corpusNGramGenerator <- function(nGramDF, n){
#nGramDF <- temp
#colnames(nGramDF) <- c("sample_en_US.blogs.txt", "sample_en_US.news.txt", "sample_en_US.twitter.txt")
#colnames(nGramDF) <- c("sample_en_US.blogs.txt", "sample_en_US.twitter.txt")
#
#Converting each of the columns into numeric data-type
#We need this since the data frame returned from the DFM function stores numbers as factors
nGramDF[] <- lapply(nGramDF, function(x) {
if(is.factor(x)) as.numeric(as.character(x)) else x
})
#Two document columns indicate the training corpus (blogs + twitter); otherwise it is the validation corpus (news)
if (ncol(nGramDF) == 2){
colnames(nGramDF) <- c("sample_en_US.blogs.txt", "sample_en_US.twitter.txt")
nGramDF$Frequency <- nGramDF$sample_en_US.blogs.txt + nGramDF$sample_en_US.twitter.txt
}
else
{
colnames(nGramDF) <- c("sample_en_US.news.txt")
nGramDF$Frequency <- nGramDF$sample_en_US.news.txt
}
#Removing row names and making them a separate column
nGramDF <- data.frame(Terms = row.names(nGramDF), nGramDF, row.names = NULL)
nGramDF <- subset(nGramDF, select = c("Terms", "Frequency"))
#sorting the data frame based on frequencies
nGramDF <- nGramDF[order(nGramDF$Frequency, decreasing = TRUE),]
#nGramDF$RelativeFrequency <- round(100*(nGramDF$Frequency/sum(nGramDF$Frequency)),2)
#nGramDF$CumulativeFrequency <- cumsum(nGramDF$Frequency)
#nGramDF$RelativeCumulativeFrequency <- 100*(nGramDF$CumulativeFrequency/sum(nGramDF$Frequency))
nGramDF[] <- lapply(nGramDF, function(x) {
if(is.factor(x)) as.character(x) else x
})
nGramDF$Terms <- gsub("_", " ", nGramDF$Terms, fixed=TRUE)
return(nGramDF)
}
#The baseline model running off simple Katz-Backoff model
baselineModel <- function(inStr)
{
fDF1 <- readRDS("C:/Users/akhan/OneDrive/University of Washington/Final Project/fDF1.rds");
fDF2 <- readRDS("C:/Users/akhan/OneDrive/University of Washington/Final Project/fDF2.rds");
fDF3 <- readRDS("C:/Users/akhan/OneDrive/University of Washington/Final Project/fDF3.rds");
fDF4 <- readRDS("C:/Users/akhan/OneDrive/University of Washington/Final Project/fDF4.rds");
assign("mesg", "in PredNextTerm", envir = .GlobalEnv)
# Clean up the input string and extract only the words with no leading and trailing white spaces
inStr <- CleanInputString(inStr);
# Split the input string across white spaces and then extract the length
inStr <- unlist(strsplit(inStr, split=" "));
inStrLen <- length(inStr);
nxtTermFound <- FALSE;
predNxtTerm <- as.character(NULL);
#mesg <<- as.character(NULL);
# 1. First test the Four Gram using the four gram data frame
if (inStrLen >= 3 & !nxtTermFound)
{
# Assemble the terms of the input string separated by one white space each
inStr1 <- paste(inStr[(inStrLen-2):inStrLen], collapse=" ");
# Subset the Four Gram data frame
searchStr <- paste("^",inStr1, sep = "");
fDF4Temp <- fDF4[grep (searchStr, fDF4$Terms), ];
# Check to see if any matching record returned
if ( length(fDF4Temp[, 1]) > 1 )
{
predNxtTerm <- fDF4Temp[1,1];
nxtTermFound <- TRUE;
mesg <<- "Next word is predicted using 4-gram."
}
fDF4Temp <- NULL;
}
# 2. Next test the Three Gram using the three gram data frame
if (inStrLen >= 2 & !nxtTermFound)
{
# Assemble the terms of the input string separated by one white space each
inStr1 <- paste(inStr[(inStrLen-1):inStrLen], collapse=" ");
# Subset the Three Gram data frame
searchStr <- paste("^",inStr1, sep = "");
fDF3Temp <- fDF3[grep (searchStr, fDF3$Terms), ];
# Check to see if any matching record returned
if ( length(fDF3Temp[, 1]) > 1 )
{
predNxtTerm <- fDF3Temp[1,1];
nxtTermFound <- TRUE;
mesg <<- "Next word is predicted using 3-gram."
}
fDF3Temp <- NULL;
}
# 3. Next test the Two Gram using the two gram data frame
if (inStrLen >= 1 & !nxtTermFound)
{
# Assemble the terms of the input string separated by one white space each
inStr1 <- inStr[inStrLen];
# Subset the Two Gram data frame
searchStr <- paste("^",inStr1, sep = "");
fDF2Temp <- fDF2[grep (searchStr, fDF2$Terms), ];
# Check to see if any matching record returned
if ( length(fDF2Temp[, 1]) > 1 )
{
predNxtTerm <- fDF2Temp[1,1];
nxtTermFound <- TRUE;
mesg <<- "Next word is predicted using 2-gram.";
}
fDF2Temp <- NULL;
}
# 4. If no next term found in Four, Three and Two Grams return the most
# frequently used term from the One Gram using the one gram data frame
if (!nxtTermFound & inStrLen > 0)
{
predNxtTerm <- fDF1$Terms[1];
mesg <- "No next word found, the most frequent word is selected as next word."
}
nextTerm <- word(predNxtTerm, -1);
if (inStrLen > 0){
dfTemp1 <- data.frame(nextTerm, mesg);
return(dfTemp1);
} else {
nextTerm <- "";
mesg <-"";
dfTemp1 <- data.frame(nextTerm, mesg);
return(dfTemp1);
}
}
# This function "Cleans up" the user input string
CleanInputString <- function(inStr)
{
# Test sentence
#inStr <- "This is. the; - . use's 12"
# First remove the non-alphabetical characters
inStr <- iconv(inStr, "latin1", "ASCII", sub=" ");
inStr <- gsub("[^[:alpha:][:space:][:punct:]]", "", inStr);
# Then convert to a Corpus
inStrCrps <- VCorpus(VectorSource(inStr))
# Convert the input sentence to lower case
# Remove punctuations, numbers, white spaces
# non alphabets characters
inStrCrps <- tm_map(inStrCrps, content_transformer(tolower))
inStrCrps <- tm_map(inStrCrps, removePunctuation)
inStrCrps <- tm_map(inStrCrps, removeNumbers)
inStrCrps <- tm_map(inStrCrps, stripWhitespace)
inStr <- as.character(inStrCrps[[1]])
inStr <- gsub("(^[[:space:]]+|[[:space:]]+$)", "", inStr)
# Return the cleaned resulting sentence
# If the resulting string is empty, return an empty string.
# if (nchar(inStr) > 0 && !is.null(instr)) {
# return(inStr);
# } else {
# return("");
# }
return(inStr)
}
## Simple Good-Turing method
## code adapted from S codes: gtfuncs.S (nrzest, rstest) and gtanal.S
## by William A. Gale
## https://faculty.cs.byu.edu/~ringger/CS479/papers/Gale-SimpleGoodTuring.pdf
## Simple Good-Turing method:
## 1. Use Turing estimate for small r
## 2. Use Linear Good-Turing estimate for large r
## 3. Switch to LGT when r(T) and r(LGT) no longer differ by more than 1.65*sd
## 4. Once switched to LGT, keep using LGT
##
## simpleGT takes a data table of frequencies r and frequencies of frequencies
## Nr and calculates the smoothed frequencies by the Simple Good-Turing method.
## freqhist = input, a data table of r (freq) and Nr (NFreq)
## rrstar = output, a data table of r (freq) and rstar (freqGT)
## where freq = original freq, freqGT = freq smoothed by simple GT method.
## the first row is the pair of 0 freq = (0, freqGT_at_0)
##
require(data.table)
## Loading required package: data.table
simpleGT <- function(freqhist) {
## nrzest: Averaging transformation
## Replace nr by zr = nr/(0.5*(t - q))
## where q, r, t are successive indices of non-zero values
## in gtfuncs.S by William A. Gale
##
nrzest <- function(r, nr) {
d <- c(1, diff(r))
dr <- c(0.5 * (d[-1] + d[-length(d)]), d[length(d)])
return(nr/dr)
}
## rstest: Linear Good-Turing estimate
## log(nr) = a + b * log(r)
## b = coef[2]
## rstest r(star)est = r *(1 + 1/r)^(b + 1)
## b < -1
## in gtfuncs.S by William A. Gale
##
rstest <- function(r, coef) {
return(r * (1 + 1/r)^(1 + coef[2]))
}
## The following code comes from gtanal.S by William A. Gale
## Get the input xr and xnr
xm <- freqhist
xr <- xm[, 1]
xnr <- xm[, 2]
xN <- sum(xr * xnr)
## make averaging transform
xnrz <- nrzest(xr, xnr)
## get Linear Good-Turing estimate
xf <- lsfit(log(xr), log(xnrz))
xcoef <- xf$coef
xrst <- rstest(xr, xcoef)
xrstrel <- xrst/xr
## get Turing estimate
xrtry <- xr == c(xr[-1]-1, 0)
xrstarel <- rep(0, length(xr))
xrstarel[xrtry] <- (xr[xrtry]+1) / xr[xrtry] * c(xnr[-1], 0)[xrtry] / xnr[xrtry]
## make switch from Turing to LGT estimates
tursd <- rep(1, length(xr))
for (i in 1:length(xr)) {
tursd[i] <- (i+1) / xnr[i] * sqrt(xnr[i+1] * (1 + xnr[i+1] / xnr[i]))
}
xrstcmbrel <- rep(0, length(xr))
useturing <- TRUE
for (r in 1:length(xr)) {
if (!useturing) {
xrstcmbrel[r] <- xrstrel[r]
} else if (abs(xrstrel - xrstarel)[r] * r / tursd[r] > 1.65) {
xrstcmbrel[r] <- xrstarel[r]
} else {
useturing <- FALSE
xrstcmbrel[r] <- xrstrel[r]
}
}
## renormalize the probabilities for observed objects
sumpraw <- sum(xrstcmbrel * xr * xnr / xN)
xrstcmbrel <- xrstcmbrel * (1 - xnr[1]/xN) / sumpraw
## output to file
# cat(xN, sum(xnr), file="gtanal", sep=",")
# cat(0, xnr[1]/xN, file="gtanal", sep=",", append=TRUE)
# for (i in 1:length(xr)) {
# cat(xr[i], xr[i]*xrstcmbrel[i], file="gtanal", append=TRUE)
# }
## output matrix (0, r0est) + (xr, xnrstarnormalized)
rrstar <- cbind(c(0, xr), c(xnr[1]/xN, xr*xrstcmbrel))
## output data table by pairs = (r = freq, rstar = freqGT)
## keyed (ordered) by freq.
rrstar <- data.table(rrstar)
colnames(rrstar) <- c("freq", "freqGT")
setkey(rrstar, freq)
return(rrstar)
}
# This function converts a data frame of ngrams to a data table and calculates
## the frequencies smoothed by the Simple Good-Turing method (an optional
## minimum-frequency cutoff is left commented out below).
cleanNgram <- function(ngramDf) {
library(data.table)
## Convert the data frame to a data table and rename columns
colnames(ngramDf) <- c("ngram", "freq")
ngramDt <- data.table(ngramDf)
## Remove ngrams with frequencies below the cutoff minFreq
#ngramDt <- ngramDt[freq >= minFreq]
## Get frequency (Nr) of frequency r,
freqNf <- data.table(table(ngramDt[, "freq"]))
colnames(freqNf) <- c("freq", "NFreq")
freqNf <- sapply(freqNf, as.numeric)
#print(head(freqNf))
freqSimpleGT <- simpleGT(freqNf)
#print(head(freqSimpleGT))
## Merge ngramDt with freqSimpleGT with bgramDt
setkey(ngramDt, freq)
ngramDt <- merge(ngramDt, freqSimpleGT)
#print(head(ngramDt))
## Calculate probability of zero frequency
pZero <- freqSimpleGT[1, 2]/sum(c(1, freqNf[, "NFreq"]) * freqSimpleGT[, 2])
#print(pZero)
## Output the clean smoothed ngrams and probability of zero frequency
ngramAndPzero <- list("ngramDt"=ngramDt, "pZero"=pZero)
return(ngramAndPzero)
}
# split1gram: trims single words and recalculates frequencies and probability:
# p(w1) = count(w1)/count(all w1)
# output is a data table where each row is a set of (unigram, word1,
# freq, prob, freqGT, probGT)
split1Gram <- function(ugramDt) {
require(stringr)
## Trim trailing spaces
ugramDt[, word1 := str_trim(ngram)]
setkey(ugramDt, word1)
## Reset frequencies and calculate words' probability (unsmoothed)
ugramDt <- ugramDt[, freq := sum(freq), by=c("word1")]
ugramTotalFreq <- sum(ugramDt$freq)
ugramDt[, prob := freq/ugramTotalFreq]
## Reset frequencies and calculate words' probability (smoothed)
ugramDt <- ugramDt[, freqGT := sum(freqGT), by=c("word1")]
ugramTotalFreqGT <- sum(ugramDt$freqGT)
ugramDt[, probGT := freqGT/ugramTotalFreqGT]
## Set key column
setkey(ugramDt, word1)
## Reorder the columns in bigrams
setcolorder(ugramDt, c("ngram", "word1",
"freq", "prob", "freqGT", "probGT"))
return(ugramDt)
}
# split2gram: splits the 2-grams into words, recalculates frequencies and probability:
# p(w2|w1) = count(w1,w2)/count(w1)
# output is a data table where each row is a set of (bigram, word1,
# word2, freq, prob, freqGT, probGT)
split2Gram <- function(bgramDt){
require(stringr)
## Split the bigram into words
bgramSplits <- str_split(bgramDt$ngram, boundary("word"))
bgramDt[, word1 := sapply(bgramSplits, function(m) m[1])]
bgramDt[, word2 := sapply(bgramSplits, function(m) m[2])]
## Count instances of word1-word2 and word1 by freq (unsmoothed)
bgramDt[, count_w1_w2 := sum(freq), by=c("word1", "word2")]
bgramDt[, count_w1 := sum(freq), by=c("word1")]
## Calculate p(w2|w1) = count(w1,w2)/count(w1)
bgramDt[, prob := count_w1_w2/count_w1]
## Count instances of word1-word2 and word1 by freqGT (smoothed)
bgramDt[, count_w1_w2_GT := sum(freqGT), by=c("word1", "word2")]
bgramDt[, count_w1_GT := sum(freqGT), by=c("word1")]
## Calculate p(w2|w1) = count(w1,w2)/count(w1) by freqGT
bgramDt[, probGT := count_w1_w2_GT/count_w1_GT]
## Remove temporary columns
bgramDt[, c("count_w1_w2", "count_w1", "count_w1_w2_GT", "count_w1_GT") := NULL]
## Set key columns
setkey(bgramDt, word1, word2)
## Reorder the columns in bigrams
setcolorder(bgramDt, c("ngram", "word1", "word2",
"freq", "prob", "freqGT", "probGT"))
return(bgramDt)
}
# split3gram: splits the 3-grams into words, recalculates frequencies and probability:
# p(w3|w1w2) = count(w1,w2,w3)/count(w1,w2)
# output is a data table where each row is a set of (trigram, word1,
# word2, word3, freq, prob, freqGT, probGT)
split3Gram <- function(tgramDt) {
if (!require(stringr)) {
stop("Library stringr is missing.")
}
## Split the trigram into words
tgramSplits <- str_split(tgramDt$ngram, boundary("word"))
tgramDt[, word1 := sapply(tgramSplits, function(m) m[1])]
tgramDt[, word2 := sapply(tgramSplits, function(m) m[2])]
tgramDt[, word3 := sapply(tgramSplits, function(m) m[3])]
## Count instances of word1-word2-word3 and word1-word2 by freq (unsmoothed)
tgramDt[, count_w1_w2_w3 := sum(freq), by=c("word1", "word2", "word3")]
tgramDt[, count_w1_w2 := sum(freq), by=c("word1", "word2")]
## Calculate p(w3|w1w2) = count(w1,w2,w3)/count(w1,w2)
tgramDt[, prob := count_w1_w2_w3/count_w1_w2]
## Count instances of word1-word2-word3 and word1-word2 by freqGT (smoothed)
tgramDt[, count_w1_w2_w3_GT := sum(freqGT), by=c("word1", "word2", "word3")]
tgramDt[, count_w1_w2_GT := sum(freqGT), by=c("word1", "word2")]
## Calculate p(w3|w1w2) = count(w1,w2,w3)/count(w1,w2) by freqGT
tgramDt[, probGT := count_w1_w2_w3_GT/count_w1_w2_GT]
## Remove temporary columns
tgramDt[, c("count_w1_w2_w3", "count_w1_w2",
"count_w1_w2_w3_GT", "count_w1_w2_GT") := NULL]
setkey(tgramDt, word1, word2, word3)
## Reorder the columns in bigrams
setcolorder(tgramDt, c("ngram", "word1", "word2", "word3",
"freq", "prob", "freqGT", "probGT"))
return(tgramDt)
}
# Now converting n-gram data frame into a table such that the n-gram splits into
# n words and this data table also has the smoothed frequencies of each n-gram
unigram <- function(ugramDf) {
ugramPzero <- cleanNgram(ugramDf)
ugramPzero$ngramDt <- split1Gram(ugramPzero$ngramDt)
return(ugramPzero)
}
bigram <- function(bgramDf) {
bgramPzero <- cleanNgram(bgramDf)
bgramPzero$ngramDt <- split2Gram(bgramPzero$ngramDt)
return(bgramPzero)
}
trigram <- function(tgramDf) {
tgramPzero <- cleanNgram(tgramDf)
tgramPzero$ngramDt <- split3Gram(tgramPzero$ngramDt)
return(tgramPzero)
}
# bimodel function predicts the second word of a bigram; if no matching bigram is
# found, it reports the most frequent unigram.
# Calculate trigram probabilities as the sum of three weighted
# probabilities: unigram, bigram, and trigram.
# The Jelinek & Mercer method
# with predictProb = coef[1]*uprobGT + coef[2]*probGT + coef[3]*tprobGT
#
bimodel <- function(lastWord, coef, ugramPzero, bgramPzero, tgramPzero) {
newWord1 <- lastWord
#if ("<UNK>" %in% newWord1) return(data.table())
## Get bigrams containing the input first word
#bNextWords <- biNextWords(bgramPzero$ngramDt, newWord1)
bNextWords <- subset(bgramPzero$ngramDt, word1==newWord1)
#bNextWords <- bNextWords[word2 != "<UNK>", ]
if (dim(bNextWords)[1] > 0) {
## Get probabilities of unigrams = nextWord
setkey(bNextWords, word2)
setkey(ugramPzero$ngramDt, word1)
bNextWords <- bNextWords[ugramPzero$ngramDt, nomatch=0L]
names(bNextWords) <- gsub("i.", "u", names(bNextWords))
## Add probability of trigram at zero frequency
bNextWords[, tprobGT := tgramPzero$pZero]
## Calculate trigram probabilities
bNextWords[, predictProb := coef[1]*uprobGT + coef[2]*probGT
+ coef[3]*tprobGT]
## Sort predicted probabilities in decreasing order
setorder(bNextWords, -predictProb)
predictions <- bNextWords
} else {
## Get the most frequent word if trigrams and bigrams not found
uNextWords <- ugramPzero$ngramDt[order(-probGT)][1]
## Add probabilities of bigrams and trigrams at zero frequency
uNextWords[, bprobGT := bgramPzero$pZero]
uNextWords[, tprobGT := tgramPzero$pZero]
## Calculate trigram probabilities
uNextWords[, predictProb := coef[1]*probGT + coef[2]*bprobGT
+ coef[3]*tprobGT]
predictions <- uNextWords
}
return(predictions)
}
# trimodel function predicts the 3rd word of a trigram; if no matching trigram is
# found, it backs off to bimodel.
# Calculate trigram probabilities as the sum of three weighted
# probabilities: unigram, bigram, and trigram.
# The Jelinek & Mercer method
# with predictProb = coef[1]*uprobGT + coef[2]*probGT + coef[3]*tprobGT
trimodel <- function(lastWords, coef, ugramPzero, bgramPzero, tgramPzero, nKeep) {
newWord1 = lastWords[[1]]
newWord2 = lastWords[[2]]
#if ("<UNK>" %in% newWord2) return(data.table())
## Get trigrams of two new words
#tNextWords <- triNextWords(tgramPzero$ngramDt, newWord1, newWord2)
tNextWords <- subset(tgramPzero$ngramDt, word1==newWord1 & word2==newWord2)
if (nrow(tNextWords) > 0) {
## Get probabilities of bigrams = newWord2-nextWord
setkey(tNextWords, word2, word3)
setkey(bgramPzero$ngramDt, word1, word2)
tNextWords <- tNextWords[bgramPzero$ngramDt, nomatch=0L]
names(tNextWords) <- gsub("i.", "b", names(tNextWords))
## Get probabilities of unigrams = nextWord
setkey(tNextWords, word3)
setkey(ugramPzero$ngramDt, word1)
tNextWords <- tNextWords[ugramPzero$ngramDt, nomatch=0L]
names(tNextWords) <- gsub("i.", "u", names(tNextWords))
tNextWords[, predictProb := coef[1]*uprobGT + coef[2]*bprobGT
+ coef[3]*probGT]
## Sort predicted probabilities in decreasing order
setorder(tNextWords, -predictProb)
predictions <- tNextWords
} else {
## Get bigrams if trigrams not found
predictions <- bimodel(newWord2, coef, ugramPzero, bgramPzero, tgramPzero)
}
if (nrow(predictions) > nKeep) {
predictions <- predictions[1:nKeep, ]
}
return(predictions)
}
# GTModel function gets the next words for the input string from the ugramPzero/bgramPzero/tgramPzero tables
GTModel <- function(inputStr, coef = c(0.0, 0.3, 0.7)) {
require(tm)
require(ngram)
library(stringr)
cleanStr <- CleanInputString(inputStr)
uGram <- readRDS("C:/Users/akhan/OneDrive/University of Washington/Final Project/uGram.rds")
bGram <- readRDS("C:/Users/akhan/OneDrive/University of Washington/Final Project/bGram.rds")
tGram <- readRDS("C:/Users/akhan/OneDrive/University of Washington/Final Project/tGram.rds")
## Split the clean string into words
inputWords <- str_split(cleanStr, boundary("word"))[[1]]
if (length(inputWords) > 1) {
testWords <- tail(inputWords, 2)
} else if (length(inputWords) > 0) {
testWords <- c("<UNK>", tail(inputWords, 1))
} else {
return("<UNK>")
}
predictions <- trimodel(testWords, coef,
uGram,
bGram,
tGram, 3)
cnames <- colnames(predictions)
if ("word3" %in% cnames) {
nextWords <- predictions$word3
} else if ("word2" %in% cnames) {
nextWords <- predictions$word2
} else {
nextWords <- predictions$word1
}
#print(head(predictions))
return(nextWords)
}
# Validation using n-grams
# CalculateScore <- function(validateDocs, removeSW=T, model="baseline"){
#
# # Creating a vector of the validation files
# #samplefiles <- c("train_en_US.blogs.txt", "train_en_US.news.txt", "train_en_US.twitter.txt")
# samplefilesValidation <- c()
#
# for (i in 1:length(validateDocs))
# samplefilesValidation[i] <- paste0("validate_en_US.",validateDocs[i],".txt")
#
# textCValidate <- stringi::stri_trans_general(readtext(samplefilesValidation), "latin-ascii")
# myCorpusValidation <- corpus(textCValidate)
#
# myNgramDFValidation <- tokenizeCorpus(myCorpusValidation, 4, removeStopWords=removeSW, verbose = F)
#
# threeGramValidation <- corpusNGramGenerator(myNgramDFValidation[[3]])
#
# df <- threeGramValidation
# df <- df[1:500,]
# df$lastTerms <- stri_extract_last_words(df$Terms)
# df$predictTrems <- str_sub(df$Terms, start = 1L, end = (nchar(df$Terms) - nchar(df$lastTerms)))
#
# if (as.character(model) == "baseline"){
# pre <- sapply(df$predictTrems, function(x){ t = baselineModel(x)
# return(word(t[1,1], -1))})
# }else{
# pre <- sapply(df$predictTrems, function(x){ t = GTModel(x)
# return(t[1])})
# }
#
# df$predictions <- pre
# df$score <- df$predictions == df$lastTerms
#
# return(sum(df$score/nrow(df)))
# }
calculateScore2 <- function(validateDocs, model="baseline", coef = c(0.0, 0.3, 0.7))
{
samplefilesValidation <- c()
for (i in 1:length(validateDocs))
samplefilesValidation[i] <- paste0("validate_en_US.",validateDocs[i],".txt")
textData <- readtext(samplefilesValidation)
sent <- tokens(stri_trans_general(textData[1,2], "latin-ascii"), what = "sentence")
df2 <- as.data.frame(sent$text1)
#df2 <- df2[1:500,]
df2[] <- lapply(df2, function(x) {
if(is.factor(x)) as.character(x) else x
})
cs <- unname(sapply(df2$`sent$text1`, function(x) CleanInputString(x)))
df2$cleanString <- cs
df2$lastWord <- stri_extract_last_words(df2$cleanString)
#words <- strsplit(df2$`sent$text1`, " ")
df2$testString <- str_sub(df2$cleanString, start = 1L, end = (nchar(df2$cleanString) - nchar(df2$lastWord)))
if (as.character(model) == "baseline"){
pre2 <- sapply(df2$testString, function(x){ t = baselineModel(x)
return(word(t[1,1], -1))})
}else{
pre2 <- sapply(df2$testString, function(x){ t = GTModel(x, coef)
return(t[1])})
}
df2$predictions <- pre2
df2$score <- df2$predictions == df2$lastWord
return(sum(df2$score/nrow(df2), na.rm = T))
}
##############################################################
# Executing the baseline model; Stop words removed
##############################################################
#generating training corpus from blogs and twitter and validation corpus from news documents
#both the training and the validation samples use about 1% of the corresponding documents
myCorpus <- generateCorpus(c("blogs","twitter"), c("news"), 0.01, 0.01)
#tokenizing corpus with 1,2,3,and 4 grams
myNgramDF <- tokenizeCorpus(myCorpus, 4, removeStopWords=T, verbose = T)
#Exploratory Data Analysis
for(i in 1:4)
plotFreqWords(i, myNgramDF)
#Building the frequency matrices for 1/2/3/4-grams and saving it
oneGram <- corpusNGramGenerator(myNgramDF[[1]])
twoGram <- corpusNGramGenerator(myNgramDF[[2]])
threeGram <- corpusNGramGenerator(myNgramDF[[3]])
fourGram <- corpusNGramGenerator(myNgramDF[[4]])
saveRDS(oneGram, file = "fDF1.rds")
saveRDS(twoGram, file = "fDF2.rds")
saveRDS(threeGram, file = "fDF3.rds")
saveRDS(fourGram, file = "fDF4.rds")
#Testing the model
#String for testing the code
i1 <- "to be or not to"
i2 <- "thanks for the"
i3 <- "I am"
i4 <- "He"
unname(sapply(c(i1,i2,i3,i4), function(x){ t = baselineModel(x)
return(word(t[1,1], -1))}))
#accuracy.baseline.stopRemoved <- CalculateScore(c("news"),T, "baseline")
accuracy.baseline.stopRemoved2 <- calculateScore2(c("news"), "baseline")
#accuracy.baseline.stopRemoved
accuracy.baseline.stopRemoved2
##############################################################
# Executing the baseline model; Stop words NOT removed
##############################################################
#tokenizing corpus with 1,2,3,and 4 grams
myNgramDF2 <- tokenizeCorpus(myCorpus, 4, removeStopWords=F, verbose = F)
#Building the frequency matrices for 1/2/3/4-grams and saving it
oneGram <- corpusNGramGenerator(myNgramDF2[[1]])
twoGram <- corpusNGramGenerator(myNgramDF2[[2]])
threeGram <- corpusNGramGenerator(myNgramDF2[[3]])
fourGram <- corpusNGramGenerator(myNgramDF2[[4]])
saveRDS(oneGram, file = "fDF1.rds")
saveRDS(twoGram, file = "fDF2.rds")
saveRDS(threeGram, file = "fDF3.rds")
saveRDS(fourGram, file = "fDF4.rds")
#accuracy.baseline.stopNotRemoved <- CalculateScore(c("news",F, "baseline"))
accuracy.baseline.stopNotRemoved2 <- calculateScore2(c("news"), "baseline")
#accuracy.baseline.stopNotRemoved
accuracy.baseline.stopNotRemoved2
##############################################################
# Executing the GTmodel; Stop words removed | Default Values of Lambda
##############################################################
#tokenizing corpus with 1,2,3,and 4 grams
myNgramDF <- tokenizeCorpus(myCorpus, 4, removeStopWords=T, verbose = F)
#Building the frequency matrices for 1/2/3/4-grams and saving it
oneGram <- corpusNGramGenerator(myNgramDF[[1]])
twoGram <- corpusNGramGenerator(myNgramDF[[2]])
threeGram <- corpusNGramGenerator(myNgramDF[[3]])
uGram <- unigram(oneGram)
bGram <- bigram(twoGram)
tGram <- trigram(threeGram)
saveRDS(uGram, file = "uGram.rds")
saveRDS(bGram, file = "bGram.rds")
saveRDS(tGram, file = "tGram.rds")
# Test Phrases
q1 <- "Happy Mothers"
q2 <- "Hey sunshine, can you follow me and make me the"
q3 <- "I love"
q4 <- "He went on a romantic date at the"
q5 <- "I haven't seen it in quite some"
q6 <- "His long wet hair fell out of his eyes with his little"
q7 <- "To be or not to"
lapply(c(q1,q2,q3,q4,q5,q6,q7), function(x){ GTModel(x)})
#accuracy.GTModel.stopRemoved <- CalculateScore(c("news"),F, "GTModel")
accuracy.GTModel.stopRemoved2 <- calculateScore2(c("news"), "GTModel")
#accuracy.GTModel.stopRemoved
accuracy.GTModel.stopRemoved2
##############################################################
# Executing the GTmodel; Stop words NOT removed | Default Values of Lambda
##############################################################
#tokenizing corpus with 1,2,3,and 4 grams
myNgramDF <- tokenizeCorpus(myCorpus, 4, removeStopWords=F, verbose = F)
#Building the frequency matrices for 1/2/3/4-grams and saving it
oneGram <- corpusNGramGenerator(myNgramDF[[1]])
twoGram <- corpusNGramGenerator(myNgramDF[[2]])
threeGram <- corpusNGramGenerator(myNgramDF[[3]])
uGram <- unigram(oneGram)
bGram <- bigram(twoGram)
tGram <- trigram(threeGram)
saveRDS(uGram, file = "uGram.rds")
saveRDS(bGram, file = "bGram.rds")
saveRDS(tGram, file = "tGram.rds")
accuracy.GTModel.stopNotRemoved2 <- calculateScore2(c("news"), "GTModel")
accuracy.GTModel.stopNotRemoved2
##############################################################
# Executing the GTmodel; Stop words removed | Differing Coefficients
##############################################################
#tokenizing corpus with 1,2,3,and 4 grams
myNgramDF <- tokenizeCorpus(myCorpus, 4, removeStopWords=T, verbose = F)
#Building the frequency matrices for 1/2/3/4-grams and saving it
oneGram <- corpusNGramGenerator(myNgramDF[[1]])
twoGram <- corpusNGramGenerator(myNgramDF[[2]])
threeGram <- corpusNGramGenerator(myNgramDF[[3]])
uGram <- unigram(oneGram)
bGram <- bigram(twoGram)
tGram <- trigram(threeGram)
saveRDS(uGram, file = "uGram.rds")
saveRDS(bGram, file = "bGram.rds")
saveRDS(tGram, file = "tGram.rds")
coef1 <- c(0,0,1)
coef2 <- c(0.1,0.2,0.7)
coef3 <- c(0.3,0.5,0.2)
coef4 <- c(0.1,0.1,0.8)
coef5 <- c(0.7,0.2,0.1)
accuracy.GTModel.coef1 <- calculateScore2(c("news"), "GTModel", coef1)
accuracy.GTModel.coef2 <- calculateScore2(c("news"), "GTModel", coef2)
accuracy.GTModel.coef3 <- calculateScore2(c("news"), "GTModel", coef3)
accuracy.GTModel.coef4 <- calculateScore2(c("news"), "GTModel", coef4)
accuracy.GTModel.coef5 <- calculateScore2(c("news"), "GTModel", coef5)
print(accuracy.GTModel.coef1)
print(accuracy.GTModel.coef2)
print(accuracy.GTModel.coef3)
print(accuracy.GTModel.coef4)
print(accuracy.GTModel.coef5)