Word prediction is a technology that suggests words as one types on a keyboard, tablet, or phone. It is the technology behind smartphone autocorrect that allows users to select a predicted word or phrase. Using this as a backdrop, the key takeaways of this project answer the following questions:
* Which approach works best when predicting the next word from the previous n words?
* How do the results of these approaches differ?
* How can we improve our model?
I investigated predicting the next word in a sequence as entered by the user. The word predictor takes a phrase (a word or multiple words) as input and, after running the model, predicts the next word as output. The data used for this project is the SwiftKey dataset, which comprises the English versions of three documents: Twitter, blogs, and news. Combined, the three documents amount to around 500 MB. A brief description of the dataset is given below:
My approach to building this word predictor is based on a Markov-chain (n-gram) approach. I have built and compared two models:
* A baseline model that applies a simple Katz-Backoff scheme to the n-gram frequency tables.
* A second model (the GTModel) that smooths the n-gram frequencies with Good-Turing smoothing and combines unigram, bigram, and trigram probabilities by simple linear interpolation.
Finally, I have validated both models by calculating their accuracy.
Using the blogs and the Twitter documents, I built a single corpus, which was used to train the models. Before building the models, I performed some basic pre-processing steps on the text corpus. These steps were necessary in order to clean the data:
* Removing duplicates
* Removing URLs
* Removing Twitter symbols
* Profanity filtering: removing bad words
* Removing contractions
* Removing single letters except 'a' and 'I'
* Removing numbers
* Removing punctuation
* Removing stop-words
* Removing whitespace
For pre-processing the dataset I have made extensive use of the 'quanteda' library in R. Besides quanteda, I have also used the following R packages: tm, RWeka, SnowballC, readtext, doParallel, wordcloud, ggplot2, stringr, stringi, and dplyr.
After pre-processing, I generated 1- to 4-gram tokens from the text corpus. After generating these n-gram tokens, I performed a basic exploratory analysis.
The n-gram frequency tables generated in the previous section are the prerequisite for building our predictive models. See Appendix 1 for the complete code.
Since the text corpus is massive, I randomly sampled the dataset to include just 1% of the total data available for training the models. Also, I have only considered the blogs and Twitter data to train the models; 1% of the news document is used to validate them.
I first modelled the training data set with a simple application of the Katz-Backoff model. The baseline model roughly follows these steps:
* Take the last three words of the input phrase and look them up in the 4-gram frequency table; if a match is found, predict the last word of the most frequent matching 4-gram.
* Otherwise, back off to the last two words and the trigram table, and then to the last word and the bigram table.
* If no match is found at any level, predict the most frequent unigram in the corpus.
In this approach, along with the frequencies, I also calculated the probabilities of each of the n-grams as they appear in the training corpus. The bigram and trigram probabilities were calculated by maximum likelihood estimation.
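In count form (these are the same ratios computed by split2Gram and split3Gram in the appendix), the estimates are:

$$
P_{ML}(w_2 \mid w_1) = \frac{\text{count}(w_1, w_2)}{\text{count}(w_1)}, \qquad
P_{ML}(w_3 \mid w_1, w_2) = \frac{\text{count}(w_1, w_2, w_3)}{\text{count}(w_1, w_2)}
$$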
The next word's trigram probability was then estimated by the simple linear interpolation method (Daniel & Martin, 2018), in which the unigram, bigram, and trigram probabilities of the candidate word are weighted and combined (New University of Lisbon, n.d.).
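Concretely, with weights $\lambda_1 + \lambda_2 + \lambda_3 = 1$ (corresponding to coef[1], coef[2], and coef[3] in the appendix code, with default values 0, 0.3, and 0.7):

$$
\hat{P}(w_3 \mid w_1, w_2) = \lambda_1\, P(w_3) + \lambda_2\, P(w_3 \mid w_2) + \lambda_3\, P(w_3 \mid w_1, w_2)
$$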
The optimal weighting factors can be determined by searching with the simplex (Nelder-Mead) method (Singer & Nelder, 2009).
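A hypothetical sketch of such a search, using base R's optim() with the calculateScore2() validation function from the appendix as the objective (the experiments reported below instead compare a handful of fixed coefficient sets, since each objective evaluation requires a full pass over the validation data):

```r
# Hypothetical: search the interpolation weights with the Nelder-Mead simplex
# method, maximising validation accuracy as returned by calculateScore2().
objective <- function(par) {
  w <- abs(par) / sum(abs(par))                      # non-negative weights summing to 1
  -calculateScore2(c("news"), "GTModel", coef = w)   # minimise negative accuracy
}
fit <- optim(par = c(0.1, 0.3, 0.6), fn = objective, method = "Nelder-Mead")
abs(fit$par) / sum(abs(fit$par))                     # estimated optimal lambda weights
```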
An important factor to consider when listing the probabilities of all the n-grams is that there may be unobserved n-grams that have a zero frequency in the corpus. To mitigate this issue, I smoothed the probability distribution to account for n-grams that have not been observed in the corpus, by smoothing the n-gram frequencies with the Good-Turing method (Gale & Sampson, 1995).
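For reference, Good-Turing smoothing replaces an observed frequency $r$ by an adjusted frequency $r^*$ computed from $N_r$, the number of n-grams seen exactly $r$ times, and reserves probability mass for unseen n-grams:

$$
r^* = (r + 1)\,\frac{N_{r+1}}{N_r}, \qquad P_0 = \frac{N_1}{N},
$$

where $N$ is the total number of n-gram tokens. The simpleGT() function in the appendix implements the Simple Good-Turing variant of this idea (Gale & Sampson, 1995).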
The accuracy of both models was calculated using a custom accuracy function, which runs each model against a validation data set consisting of 1% of the news data. Each sentence of this data set is placed in a row of a data frame and split into two parts: all the words of the sentence except the last word (2nd column), and the last word of the sentence (3rd column). The data in the 2nd column is fed to the model, which predicts the last word. The prediction is then checked against the actual last word in the 3rd column, and a score of 1 is assigned if the prediction and the actual word are the same. The accuracy is the sum of these scores divided by the number of rows in the data frame.
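A condensed sketch of this scoring procedure (the full implementation is calculateScore2() in the appendix; the two sentences below are placeholders, and the appendix functions together with their saved n-gram tables are assumed to be loaded):

```r
library(stringi)
sentences <- c("thanks for the memories", "to be or not to be")      # placeholder validation sentences
lastWord  <- stri_extract_last_words(sentences)                       # 3rd column: the word to predict
context   <- substr(sentences, 1, nchar(sentences) - nchar(lastWord)) # 2nd column: everything before it
predicted <- sapply(context, function(x) GTModel(x)[1])               # the base model would call baselineModel() instead
accuracy  <- sum(predicted == lastWord, na.rm = TRUE) / length(sentences)
```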
Two variants of each model were considered: one in which the English stop-words were removed and another in which the stop-words were not removed. For the second model (the GTModel), I used the default lambda coefficients (0, 0.3, and 0.7 for the unigram, bigram, and trigram probabilities respectively). The following results were obtained:
Thus, I saw that the model accuracy decreased (albeit marginally) when the stop-words were included. This is intuitive: the frequency of stop-words is disproportionately high, so both models are more likely to predict a stop-word as the next word, even in cases where the correct prediction is not a stop-word.
Comparing the accuracy of the base model and the GTModel, I found that the accuracy of the GTModel (~6%) is higher than that of the base model (~3%). This is consistent with the intuition that the second model also accounts for the probability of n-grams that are not present in the training corpus (through Good-Turing smoothing), whereas the base model simply sets the probability of unseen n-grams to zero.
Having established that the GTModel performs better than the base model, I also experimented with various values of lambda to determine the coefficients that yield the highest probability for the predicted word and, in turn, the highest accuracy of the predictions made using the GTModel. Note that for these experiments I used the GTModel with stop-words removed, for the reasons mentioned above.
From the above table, we can infer that the higher the weight of the third coefficient, which corresponds to the trigram probability, the higher the accuracy of the model. Conversely, higher values of the first coefficient result in decreased accuracy of the GTModel. This suggests that the probability of the predicted word depends far more on the trigrams than on the unigrams.
From this work, we can conclude that Good-Turing smoothing is an effective method for smoothing the probabilities of the observed n-grams while accounting for the unobserved n-grams in the corpus. Also, for a word-predictor algorithm, the accuracy scores indicate that it is best to remove the stop-words while training the model.
The accuracy scores for this project are generally low. This can be attributed to a variety of factors, including but not limited to:
A small training corpus, amounting to only about 1% of the total data available. Although computational limitations prevented me from experimenting with different training-data sizes and their corresponding accuracies, the richer the vocabulary, the better the models should work; this applies particularly to the GTModel, because it would contain a larger number of n-grams and the smoothing would be more pronounced.
Imperfect pre-processing of the training corpus: the pre-processed data still contains improper contractions, abbreviations, and short-hands, whose presence reduces the significance of valid n-grams. Such edge cases are best handled by applying regular expressions manually rather than relying solely on the existing libraries, as sketched below.
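As a purely illustrative example (these patterns are not part of the pipeline above), a few hand-written substitutions can expand common contractions before tokenization:

```r
# Illustrative only: expand a few common contractions with regular expressions
# before tokenizing, so that "can't" does not survive as the malformed token "cant".
expandContractions <- function(x) {
  x <- gsub("won't", "will not", x, ignore.case = TRUE)
  x <- gsub("can't", "cannot", x, ignore.case = TRUE)
  x <- gsub("n't\\b", " not", x, ignore.case = TRUE)
  x <- gsub("'re\\b", " are", x, ignore.case = TRUE)
  x <- gsub("'ll\\b", " will", x, ignore.case = TRUE)
  gsub("'ve\\b", " have", x, ignore.case = TRUE)
}
expandContractions("I can't believe they won't go")
# [1] "I cannot believe they will not go"
```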
The bigrams and trigrams generated from the text corpus also contain some incoherent combinations because of the lack of beginning- and end-of-sentence markers. For example, consider the string "Hi! Nice to meet you. I am a student at the University of Washington". One of the 3-grams generated from this sentence after pre-processing is "meet you I", which crosses a sentence boundary and makes no sense. One way to avoid this is shown below.
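A possible remedy (a sketch, not used in the pipeline above) is to split the corpus into sentences first and only then form the n-grams, so that every n-gram stays within a single sentence:

```r
library(quanteda)
txt <- "Hi! Nice to meet you. I am a student at the University of Washington"
# Split into sentences first; each sentence becomes its own document
sentences <- unlist(tokens(corpus(txt), what = "sentence"))
# Tokenize each sentence into words and build trigrams within sentences only,
# so a boundary-crossing trigram like "meet_you_I" can no longer be produced
wordToks <- tokens(sentences, what = "word", remove_punct = TRUE)
tokens_ngrams(wordToks, n = 3)
```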
Another major limitation of the n-gram based Katz-Backoff approach is that the semantic relationship between words is not taken into consideration. To build a word predictor with more accurate predictions, we would probably need to implement a hybrid model that captures the semantic relationships between words (as in the Word2Vec approach) in addition to basing the predictions on past data through an n-gram approach such as the Katz-Backoff algorithm.
Daniel, J., & Martin, J. H. (2018). Naive Bayes and Sentiment. In Speech and Language Processing (pp. 14-15).
Gale, W. A., & Sampson, G. (1995). Good-Turing Frequency Estimation Without Tears. Journal of Quantitative Linguistics, 217-237.
Katz, S. M. (1987). Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer. IEEE Transactions on Acoustics, Speech, and Signal Processing, 400-401.
New University of Lisbon. (n.d.). Language Models LM Jelinek-Mercer Smoothing and LM Dirichlet Smoothing. Retrieved from New University of Lisbon Web Site: http://ctp.di.fct.unl.pt/~jmag/ir/slides/a05%20Language%20models.pdf
Singer, S., & Nelder, J. (2009). Nelder-Mead algorithm. Retrieved from Scholarpedia Web site: http://www.scholarpedia.org/article/Nelder-Mead_algorithm
Benoit, K., Watanabe, K., Wang, H., Nulty, P., Obeng, A., Müller, S., & Matsuo, A. (2018). quanteda: An R package for the quantitative analysis of textual data. Journal of Open Source Software, 3(30), 774. https://doi.org/10.21105/joss.00774
#The following libraries are used in this analysis
library(tm)
library(RWeka)
library(SnowballC)
library(quanteda)
library(readtext)
library(doParallel)
library(wordcloud)
library(ggplot2)
library(stringr)
library(stringi)
library(dplyr)
The following tasks are accomplished in this report:
For this project, I am using the “quanteda” package as it provides a rich set of text analysis features coupled with excellent performance relative to Java-based R packages for text analysis. More information here [https://github.com/lgreski/datasciencectacontent/blob/master/markdown/capstone-choosingATextPackage.md].
samplingFun <- function(filename, p, type="train")
{
con <- file(paste0("en_US.",filename,".txt"))
tempData <- readLines(con)
numrows <- length(tempData)
close(con)
#Random sampling of the file
set.seed(100)
sampleFile <- tempData[rbinom(n=numrows, size = 1, prob = p) == 1]
#Making a new file
if(type=="train")
con <- file(paste0("train_en_US.",filename,".txt"))
else if (type=="validate")
con <- file(paste0("validate_en_US.",filename,".txt"))
else
con <- file(paste0("test_en_US.",filename,".txt"))
writeLines(sampleFile, con)
close(con)
}
Here I am combining the data from the training sample files created above into a single corpus. This will subsequently be used for further analysis.
generateCorpus <- function(docsTrain, docsValidate, probTrain, probValidate){
require(stringi)
#Calling the function to sample each of the three files
for(i in 1:length(docsTrain)) {
samplingFun(docsTrain[i],probTrain, type="train")
}
for(i in 1:length(docsValidate)) {
samplingFun(docsValidate[i], probValidate, type="validate")
}
#After the above code runs successfully, three sample files are created in the same existing directory.
# Creating a vector of the sample files
#samplefiles <- c("train_en_US.blogs.txt", "train_en_US.news.txt", "train_en_US.twitter.txt")
samplefiles <- c()
for (i in 1:length(docsTrain))
samplefiles[i] <- paste0("train_en_US.",docsTrain[i],".txt")
# Creating a quanteda corpus of the three sample documents created above
textC <- stringi::stri_trans_general(readtext(samplefiles), "latin-ascii")
#myCorpus <- corpus(readtext(samplefiles))
myCorpus <- corpus(textC)
return(myCorpus)
}
In this section I will clean the sample data set (the corpus created above), followed by profanity filtering of the corpus, i.e., removal of the bad words. This section performs the following functions:
1. Removal of numbers
2. Removal of whitespace
3. Transforming content to lower case
4. Removal of punctuation
5. Removal of stop-words
6. Profanity filtering
tokenizeCorpus <- function(myCorpus, ngrm, removeStopWords=T, verbose = T)
{
#Tokenizing the corpus
myTokens <- tokens(myCorpus,
what = "word",
remove_punct = TRUE, #Remove Punctuation
remove_hyphens = FALSE, #preserve_intra_word_dashes
remove_numbers = TRUE, #Remove Numbers
remove_symbols = TRUE, #Remove Symbols
remove_twitter = TRUE, #Remove Twitter Symbols
remove_url = TRUE, #Remove URLs
remove_separators = TRUE,
concatenator = " ")
#Tokenizing the corpus
#myTokens <- tokens(myCorpus, what = "sentence")
#Getting the profane words data
profanity <- read.table(url("https://www.cs.cmu.edu/~biglou/resources/bad-words.txt"))
profanity <- as.character(unlist(profanity))
#removing all single letters except 'a' and 'i'
alphabets <- letters
alphabets <- alphabets[!alphabets %in% c("a","i")]
#alphabets
#Creating a vector of nGrams
#For this report, we are considering unigrams, bigrams and trigrams
nGram = 1:ngrm
if(removeStopWords == TRUE)
removeList <- c(stopwords("english"),profanity, alphabets)
else
removeList <- c(profanity, alphabets)
#This stores a list of data frames (term document matrix for each unigram, bigram and trigram
myNgramDF <- lapply(nGram, function(x){
#Creating a document feature matrix
mydfm <- dfm(tokens_ngrams(myTokens, n=x),
tolower=TRUE, #Covert to lowercase
remove = removeList) #remove stopwords and profanity
#print(head(mydfm))
#Trimming the DFM; min_termfreq = 1 keeps every term (increase this value to drop rare terms)
mydfm <- dfm_trim(mydfm, min_termfreq = 1)
#Converting the DFM to a data-frame
mydf <- convert(mydfm, to = "data.frame")
#Creating a Document Term Matrix from DFM
#A Document Term Matrix is a transposed DFM
mydf <- t(mydf)
#Setting column names of DTM as the name of the documents
colnames(mydf) <- mydf[1,]
mydf <- mydf[-1,]
mydf <- as.data.frame(mydf)
#colnames(mydf) <- c("sample_en_US.blogs.txt", "sample_en_US.news.txt", "sample_en_US.twitter.txt")
#print(colnames(mydf))
})
#Printing what each Document Term Matrix looks like
if(verbose == T){
for(i in 1:length(myNgramDF)){
cat("\n\n\t\tThis is how the Document Term Matrix of",i,"-gram looks like.\n\n")
print(head(myNgramDF[[i]]))}
}
return(myNgramDF)
}
#This function plots the Frequency Histogram of the n-grams for each of the three documents
plotFreqWords <- function(x, myNgramDF){
#Getting the m-gram data frame from the list and storing it temporarily
a <- myNgramDF[[x]]
#Converting each of the columns into numeric data-type
#We need this since the data frame returned from the DFM function stores numbers as factors
a[] <- lapply(a, function(x) {
if(is.factor(x)) as.numeric(as.character(x)) else x
})
#Removing row names and making them a separate column
a <- data.frame(DocTerms = row.names(a), a, row.names = NULL)
#A vector containing the document names. This will be used for naming each of the histograms
doc = colnames(a)[-1]
#For each n-gram we need one histogram per document in the corpus
for(i in 1:length(doc)){
main = paste0("15 Most Frequent ",x,"-grams ", "in ",doc[i], " article")
b <- a[,c(1,i+1)]
#b <- b[order(-b$sample_en_US.news.txt),]
b <- b[order(-b[,2]),]
colnames(b) <- c("word","Frequency")
#b
p <- ggplot(data=b[1:15,],
aes(x=reorder(word,Frequency),
y=Frequency,
fill=Frequency))+
geom_bar(stat="identity")+
xlab("Word")+
labs(title = main) +
#theme(legend.title=element_blank()) +
coord_flip()
print(p)
cat("\n\n\tWord-cloud of 15 Most Frequent ",x,"-grams ", "in ",doc[i], " article")
w <- wordcloud(b$word[1:15], b$Frequency[1:15],
colors=brewer.pal(8, "Dark2"))
print(w)
}
}
#temp <- myNgramDF[[1]]
corpusNGramGenerator <- function(nGramDF, n){
#nGramDF <- temp
#colnames(nGramDF) <- c("sample_en_US.blogs.txt", "sample_en_US.news.txt", "sample_en_US.twitter.txt")
#colnames(nGramDF) <- c("sample_en_US.blogs.txt", "sample_en_US.twitter.txt")
#
#Converting each of the columns into numeric data-type
#We need this since the data frame returned from the DFM function stores numbers as factors
nGramDF[] <- lapply(nGramDF, function(x) {
if(is.factor(x)) as.numeric(as.character(x)) else x
})
#Two document columns indicate the training corpus (blogs + twitter); otherwise it is the validation corpus (news)
if (ncol(nGramDF) == 2){
colnames(nGramDF) <- c("sample_en_US.blogs.txt", "sample_en_US.twitter.txt")
nGramDF$Frequency <- nGramDF$sample_en_US.blogs.txt + nGramDF$sample_en_US.twitter.txt
}
else
{
colnames(nGramDF) <- c("sample_en_US.news.txt")
nGramDF$Frequency <- nGramDF$sample_en_US.news.txt
}
#Removing row names and making them a separate column
nGramDF <- data.frame(Terms = row.names(nGramDF), nGramDF, row.names = NULL)
nGramDF <- subset(nGramDF, select = c("Terms", "Frequency"))
#sorting the data frame based on frequencies
nGramDF <- nGramDF[order(nGramDF$Frequency, decreasing = TRUE),]
#nGramDF$RelativeFrequency <- round(100*(nGramDF$Frequency/sum(nGramDF$Frequency)),2)
#nGramDF$CumulativeFrequency <- cumsum(nGramDF$Frequency)
#nGramDF$RelativeCumulativeFrequency <- 100*(nGramDF$CumulativeFrequency/sum(nGramDF$Frequency))
nGramDF[] <- lapply(nGramDF, function(x) {
if(is.factor(x)) as.character(x) else x
})
nGramDF$Terms <- gsub("_", " ", nGramDF$Terms, fixed=TRUE)
return(nGramDF)
}
#The baseline model running off simple Katz-Backoff model
baselineModel <- function(inStr)
{
fDF1 <- readRDS("C:/Users/akhan/OneDrive/University of Washington/Final Project/fDF1.rds");
fDF2 <- readRDS("C:/Users/akhan/OneDrive/University of Washington/Final Project/fDF2.rds");
fDF3 <- readRDS("C:/Users/akhan/OneDrive/University of Washington/Final Project/fDF3.rds");
fDF4 <- readRDS("C:/Users/akhan/OneDrive/University of Washington/Final Project/fDF4.rds");
assign("mesg", "in PredNextTerm", envir = .GlobalEnv)
# Clean up the input string and extract only the words with no leading and trailing white spaces
inStr <- CleanInputString(inStr);
# Split the input string across white spaces and then extract the length
inStr <- unlist(strsplit(inStr, split=" "));
inStrLen <- length(inStr);
nxtTermFound <- FALSE;
predNxtTerm <- as.character(NULL);
#mesg <<- as.character(NULL);
# 1. First test the Four Gram using the four gram data frame
if (inStrLen >= 3 & !nxtTermFound)
{
# Assemble the terms of the input string separated by one white space each
inStr1 <- paste(inStr[(inStrLen-2):inStrLen], collapse=" ");
# Subset the Four Gram data frame
searchStr <- paste("^",inStr1, sep = "");
fDF4Temp <- fDF4[grep (searchStr, fDF4$Terms), ];
# Check to see if any matching record returned
if ( length(fDF4Temp[, 1]) > 1 )
{
predNxtTerm <- fDF4Temp[1,1];
nxtTermFound <- TRUE;
mesg <<- "Next word is predicted using 4-gram."
}
fDF4Temp <- NULL;
}
# 2. Next test the Three Gram using the three gram data frame
if (inStrLen >= 2 & !nxtTermFound)
{
# Assemble the terms of the input string separated by one white space each
inStr1 <- paste(inStr[(inStrLen-1):inStrLen], collapse=" ");
# Subset the Three Gram data frame
searchStr <- paste("^",inStr1, sep = "");
fDF3Temp <- fDF3[grep (searchStr, fDF3$Terms), ];
# Check to see if any matching record returned
if ( length(fDF3Temp[, 1]) > 1 )
{
predNxtTerm <- fDF3Temp[1,1];
nxtTermFound <- TRUE;
mesg <<- "Next word is predicted using 3-gram."
}
fDF3Temp <- NULL;
}
# 3. Next test the Two Gram using the two gram data frame
if (inStrLen >= 1 & !nxtTermFound)
{
# Assemble the terms of the input string separated by one white space each
inStr1 <- inStr[inStrLen];
# Subset the Two Gram data frame
searchStr <- paste("^",inStr1, sep = "");
fDF2Temp <- fDF2[grep (searchStr, fDF2$Terms), ];
# Check to see if any matching record returned
if ( length(fDF2Temp[, 1]) > 1 )
{
predNxtTerm <- fDF2Temp[1,1];
nxtTermFound <- TRUE;
mesg <<- "Next word is predicted using 2-gram.";
}
fDF2Temp <- NULL;
}
# 4. If no next term found in Four, Three and Two Grams return the most
# frequently used term from the One Gram using the one gram data frame
if (!nxtTermFound & inStrLen > 0)
{
predNxtTerm <- fDF1$Terms[1];
mesg <- "No next word found, the most frequent word is selected as next word."
}
nextTerm <- word(predNxtTerm, -1);
if (inStrLen > 0){
dfTemp1 <- data.frame(nextTerm, mesg);
return(dfTemp1);
} else {
nextTerm <- "";
mesg <-"";
dfTemp1 <- data.frame(nextTerm, mesg);
return(dfTemp1);
}
}
# This function "Cleans up" the user input string
CleanInputString <- function(inStr)
{
# Test sentence
#inStr <- "This is. the; - . use's 12"
# First remove the non-alphabetical characters
inStr <- iconv(inStr, "latin1", "ASCII", sub=" ");
inStr <- gsub("[^[:alpha:][:space:][:punct:]]", "", inStr);
# Then convert to a Corpus
inStrCrps <- VCorpus(VectorSource(inStr))
# Convert the input sentence to lower case
# Remove punctuations, numbers, white spaces
# non alphabets characters
inStrCrps <- tm_map(inStrCrps, content_transformer(tolower))
inStrCrps <- tm_map(inStrCrps, removePunctuation)
inStrCrps <- tm_map(inStrCrps, removeNumbers)
inStrCrps <- tm_map(inStrCrps, stripWhitespace)
inStr <- as.character(inStrCrps[[1]])
inStr <- gsub("(^[[:space:]]+|[[:space:]]+$)", "", inStr)
# Return the cleaned resulting sentence
# If the resulting string is empty, return an empty string.
# if (nchar(inStr) > 0 && !is.null(instr)) {
# return(inStr);
# } else {
# return("");
# }
return(inStr)
}
## Simple Good-Turing method
## code adapted from S codes: gtfuncs.S (nrzest, rstest) and gtanal.S
## by William A. Gale
## https://faculty.cs.byu.edu/~ringger/CS479/papers/Gale-SimpleGoodTuring.pdf
## Simple Good-Turing method:
## 1. Use Turing estimate for small r
## 2. Use Linear Good-Turing estimate for large r
## 3. Switch to LGT when r(T) and r(LGT) no longer differ by more than 1.65*sd
## 4. Once switched to LGT, keep using LGT
##
## simpleGT takes a data table of frequencies r and frequencies of frequencies
## Nr and calculates the smoothed frequencies by the Simple Good-Turing method.
## freqhist = input, a data table of r (freq) and Nr (NFreq)
## rrstar = output, a data table of r (freq) and rstar (freqGT)
## where freq = original freq, freqGT = freq smoothed by simple GT method.
## the first row is the pair of 0 freq = (0, freqGT_at_0)
##
require(data.table)
## Loading required package: data.table
simpleGT <- function(freqhist) {
## nrzest: Averaging transformation
## Replace nr by zr = nr/(0.5*(t - q))
## where q, r, t are successive indices of non-zero values
## in gtfuncs.S by William A. Gale
##
nrzest <- function(r, nr) {
d <- c(1, diff(r))
dr <- c(0.5 * (d[-1] + d[-length(d)]), d[length(d)])
return(nr/dr)
}
## rstest: Linear Good-Turing estimate
## log(nr) = a + b * log(r)
## b = coef[2]
## rstest r(star)est = r *(1 + 1/r)^(b + 1)
## b < -1
## in gtfuncs.S by William A. Gale
##
rstest <- function(r, coef) {
return(r * (1 + 1/r)^(1 + coef[2]))
}
## The following code comes from gtanal.S by William A. Gale
## Get the input xr and xnr
xm <- freqhist
xr <- xm[, 1]
xnr <- xm[, 2]
xN <- sum(xr * xnr)
## make averaging transform
xnrz <- nrzest(xr, xnr)
## get Linear Good-Turing estimate
xf <- lsfit(log(xr), log(xnrz))
xcoef <- xf$coef
xrst <- rstest(xr, xcoef)
xrstrel <- xrst/xr
## get Turing estimate
xrtry <- xr == c(xr[-1]-1, 0)
xrstarel <- rep(0, length(xr))
xrstarel[xrtry] <- (xr[xrtry]+1) / xr[xrtry] * c(xnr[-1], 0)[xrtry] / xnr[xrtry]
## make switch from Turing to LGT estimates
tursd <- rep(1, length(xr))
for (i in 1:length(xr)) {
tursd[i] <- (i+1) / xnr[i] * sqrt(xnr[i+1] * (1 + xnr[i+1] / xnr[i]))
}
xrstcmbrel <- rep(0, length(xr))
useturing <- TRUE
for (r in 1:length(xr)) {
if (!useturing) {
xrstcmbrel[r] <- xrstrel[r]
} else if (abs(xrstrel - xrstarel)[r] * r / tursd[r] > 1.65) {
xrstcmbrel[r] <- xrstarel[r]
} else {
useturing <- FALSE
xrstcmbrel[r] <- xrstrel[r]
}
}
## renormalize the probabilities for observed objects
sumpraw <- sum(xrstcmbrel * xr * xnr / xN)
xrstcmbrel <- xrstcmbrel * (1 - xnr[1]/xN) / sumpraw
## output to file
# cat(xN, sum(xnr), file="gtanal", sep=",")
# cat(0, xnr[1]/xN, file="gtanal", sep=",", append=TRUE)
# for (i in 1:length(xr)) {
# cat(xr[i], xr[i]*xrstcmbrel[i], file="gtanal", append=TRUE)
# }
## output matrix (0, r0est) + (xr, xnrstarnormalized)
rrstar <- cbind(c(0, xr), c(xnr[1]/xN, xr*xrstcmbrel))
## output data table by pairs = (r = freq, rstar = freqGT)
## keyed (ordered) by freq.
rrstar <- data.table(rrstar)
colnames(rrstar) <- c("freq", "freqGT")
setkey(rrstar, freq)
return(rrstar)
}
# This function converts a data frame of ngrams to a data table and calculates
## the frequencies smoothed by the Simple Good-Turing method (an optional
## minimum-frequency cutoff is left commented out below).
cleanNgram <- function(ngramDf) {
library(data.table)
## Convert the data frame to a data table and rename columns
colnames(ngramDf) <- c("ngram", "freq")
ngramDt <- data.table(ngramDf)
## Remove ngrams with frequencies below the cutoff minFreq
#ngramDt <- ngramDt[freq >= minFreq]
## Get frequency (Nr) of frequency r,
freqNf <- data.table(table(ngramDt[, "freq"]))
colnames(freqNf) <- c("freq", "NFreq")
freqNf <- sapply(freqNf, as.numeric)
#print(head(freqNf))
freqSimpleGT <- simpleGT(freqNf)
#print(head(freqSimpleGT))
## Merge ngramDt with freqSimpleGT with bgramDt
setkey(ngramDt, freq)
ngramDt <- merge(ngramDt, freqSimpleGT)
#print(head(ngramDt))
## Calculate probability of zero frequency
pZero <- freqSimpleGT[1, 2]/sum(c(1, freqNf[, "NFreq"]) * freqSimpleGT[, 2])
#print(pZero)
## Output the clean smoothed ngrams and probability of zero frequency
ngramAndPzero <- list("ngramDt"=ngramDt, "pZero"=pZero)
return(ngramAndPzero)
}
# split1gram: trims single words and recalculates frequencies and probability:
# p(w1) = count(w1)/count(all w1)
# output is a data table where each row is a set of (unigram, word1,
# freq, prob, freqGT, probGT)
split1Gram <- function(ugramDt) {
require(stringr)
## Trim trailing spaces
ugramDt[, word1 := str_trim(ngram)]
setkey(ugramDt, word1)
## Reset frequencies and calculate words' probability (unsmoothed)
ugramDt <- ugramDt[, freq := sum(freq), by=c("word1")]
ugramTotalFreq <- sum(ugramDt$freq)
ugramDt[, prob := freq/ugramTotalFreq]
## Reset frequencies and calculate words' probability (smoothed)
ugramDt <- ugramDt[, freqGT := sum(freqGT), by=c("word1")]
ugramTotalFreqGT <- sum(ugramDt$freqGT)
ugramDt[, probGT := freqGT/ugramTotalFreqGT]
## Set key column
setkey(ugramDt, word1)
## Reorder the columns in bigrams
setcolorder(ugramDt, c("ngram", "word1",
"freq", "prob", "freqGT", "probGT"))
return(ugramDt)
}
# split2gram: splits the 2-grams into words, recalculates frequencies and probability:
# p(w2|w1) = count(w1,w2)/count(w1)
# output is a data table where each row is a set of (bigram, word1,
# word2, freq, prob, freqGT, probGT)
split2Gram <- function(bgramDt){
require(stringr)
## Split the bigram into words
bgramSplits <- str_split(bgramDt$ngram, boundary("word"))
bgramDt[, word1 := sapply(bgramSplits, function(m) m[1])]
bgramDt[, word2 := sapply(bgramSplits, function(m) m[2])]
## Count instances of word1-word2 and word1 by freq (unsmoothed)
bgramDt[, count_w1_w2 := sum(freq), by=c("word1", "word2")]
bgramDt[, count_w1 := sum(freq), by=c("word1")]
## Calculate p(w2|w1) = count(w1,w2)/count(w1)
bgramDt[, prob := count_w1_w2/count_w1]
## Count instances of word1-word2 and word1 by freqGT (smoothed)
bgramDt[, count_w1_w2_GT := sum(freqGT), by=c("word1", "word2")]
bgramDt[, count_w1_GT := sum(freqGT), by=c("word1")]
## Calculate p(w2|w1) = count(w1,w2)/count(w1) by freqGT
bgramDt[, probGT := count_w1_w2_GT/count_w1_GT]
## Remove temporary columns
bgramDt[, c("count_w1_w2", "count_w1", "count_w1_w2_GT", "count_w1_GT") := NULL]
## Set key columns
setkey(bgramDt, word1, word2)
## Reorder the columns in bigrams
setcolorder(bgramDt, c("ngram", "word1", "word2",
"freq", "prob", "freqGT", "probGT"))
return(bgramDt)
}
# split3gram: splits the 3-grams into words, recalculates frequencies and probability:
# p(w3|w1w2) = count(w1,w2,w3)/count(w1,w2)
# output is a data table where each row is a set of (trigram, word1,
# word2, word3, freq, prob, freqGT, probGT)
split3Gram <- function(tgramDt) {
if (!require(stringr)) {
stop("Library stringr is missing.")
}
## Split the trigram into words
tgramSplits <- str_split(tgramDt$ngram, boundary("word"))
tgramDt[, word1 := sapply(tgramSplits, function(m) m[1])]
tgramDt[, word2 := sapply(tgramSplits, function(m) m[2])]
tgramDt[, word3 := sapply(tgramSplits, function(m) m[3])]
## Count instances of word1-word2-word3 and word1-word2 by freq (unsmoothed)
tgramDt[, count_w1_w2_w3 := sum(freq), by=c("word1", "word2", "word3")]
tgramDt[, count_w1_w2 := sum(freq), by=c("word1", "word2")]
## Calculate p(w3|w1w2) = count(w1,w2,w3)/count(w1,w2)
tgramDt[, prob := count_w1_w2_w3/count_w1_w2]
## Count instances of word1-word2-word3 and word1-word2 by freqGT (smoothed)
tgramDt[, count_w1_w2_w3_GT := sum(freqGT), by=c("word1", "word2", "word3")]
tgramDt[, count_w1_w2_GT := sum(freqGT), by=c("word1", "word2")]
## Calculate p(w3|w1w2) = count(w1,w2,w3)/count(w1,w2) by freqGT
tgramDt[, probGT := count_w1_w2_w3_GT/count_w1_w2_GT]
## Remove temporary columns
tgramDt[, c("count_w1_w2_w3", "count_w1_w2",
"count_w1_w2_w3_GT", "count_w1_w2_GT") := NULL]
setkey(tgramDt, word1, word2, word3)
## Reorder the columns in bigrams
setcolorder(tgramDt, c("ngram", "word1", "word2", "word3",
"freq", "prob", "freqGT", "probGT"))
return(tgramDt)
}
# Now converting n-gram data frame into a table such that the n-gram splits into
# n words and this data table also has the smoothed frequencies of each n-gram
unigram <- function(ugramDf) {
ugramPzero <- cleanNgram(ugramDf)
ugramPzero$ngramDt <- split1Gram(ugramPzero$ngramDt)
return(ugramPzero)
}
bigram <- function(bgramDf) {
bgramPzero <- cleanNgram(bgramDf)
bgramPzero$ngramDt <- split2Gram(bgramPzero$ngramDt)
return(bgramPzero)
}
trigram <- function(tgramDf) {
tgramPzero <- cleanNgram(tgramDf)
tgramPzero$ngramDt <- split3Gram(tgramPzero$ngramDt)
return(tgramPzero)
}
# bimodel function predicts the second word of a bigram; if no matching bigram is
# found, it reports the most frequent unigram.
# Calculate trigram probabilities as the sum of three weighted
# probabilities: unigram, bigram, and trigram.
# The Jelinek & Mercer method
# with predictProb = coef[1]*uprobGT + coef[2]*probGT + coef[3]*tprobGT
#
bimodel <- function(lastWord, coef, ugramPzero, bgramPzero, tgramPzero) {
newWord1 <- lastWord
#if ("<UNK>" %in% newWord1) return(data.table())
## Get bigrams containing the input first word
#bNextWords <- biNextWords(bgramPzero$ngramDt, newWord1)
bNextWords <- subset(bgramPzero$ngramDt, word1==newWord1)
#bNextWords <- bNextWords[word2 != "<UNK>", ]
if (dim(bNextWords)[1] > 0) {
## Get probabilities of unigrams = nextWord
setkey(bNextWords, word2)
setkey(ugramPzero$ngramDt, word1)
bNextWords <- bNextWords[ugramPzero$ngramDt, nomatch=0L]
names(bNextWords) <- gsub("i.", "u", names(bNextWords))
## Add probability of trigram at zero frequency
bNextWords[, tprobGT := tgramPzero$pZero]
## Calculate trigram probabilities
bNextWords[, predictProb := coef[1]*uprobGT + coef[2]*probGT
+ coef[3]*tprobGT]
## Sort predicted probabilities in decreasing order
setorder(bNextWords, -predictProb)
predictions <- bNextWords
} else {
## Get the most frequent word if trigrams and bigrams not found
uNextWords <- ugramPzero$ngramDt[order(-probGT)][1]
## Add probabilities of bigrams and trigrams at zero frequency
uNextWords[, bprobGT := bgramPzero$pZero]
uNextWords[, tprobGT := tgramPzero$pZero]
## Calculate trigram probabilities
uNextWords[, predictProb := coef[1]*probGT + coef[2]*bprobGT
+ coef[3]*tprobGT]
predictions <- uNextWords
}
return(predictions)
}
# trimodel function predicts the 3rd word of a trigram; if no matching trigram is
# found, it backs off to bimodel.
# Calculate trigram probabilities as the sum of three weighted
# probabilities: unigram, bigram, and trigram.
# The Jelinek & Mercer method
# with predictProb = coef[1]*uprobGT + coef[2]*probGT + coef[3]*tprobGT
trimodel <- function(lastWords, coef, ugramPzero, bgramPzero, tgramPzero, nKeep) {
newWord1 = lastWords[[1]]
newWord2 = lastWords[[2]]
#if ("<UNK>" %in% newWord2) return(data.table())
## Get trigrams of two new words
#tNextWords <- triNextWords(tgramPzero$ngramDt, newWord1, newWord2)
tNextWords <- subset(tgramPzero$ngramDt, word1==newWord1 & word2==newWord2)
if (nrow(tNextWords) > 0) {
## Get probabilities of bigrams = newWord2-nextWord
setkey(tNextWords, word2, word3)
setkey(bgramPzero$ngramDt, word1, word2)
tNextWords <- tNextWords[bgramPzero$ngramDt, nomatch=0L]
names(tNextWords) <- gsub("i.", "b", names(tNextWords))
## Get probabilities of unigrams = nextWord
setkey(tNextWords, word3)
setkey(ugramPzero$ngramDt, word1)
tNextWords <- tNextWords[ugramPzero$ngramDt, nomatch=0L]
names(tNextWords) <- gsub("i.", "u", names(tNextWords))
tNextWords[, predictProb := coef[1]*uprobGT + coef[2]*bprobGT
+ coef[3]*probGT]
## Sort predicted probabilities in decreasing order
setorder(tNextWords, -predictProb)
predictions <- tNextWords
} else {
## Get bigrams if trigrams not found
predictions <- bimodel(newWord2, coef, ugramPzero, bgramPzero, tgramPzero)
}
if (nrow(predictions) > nKeep) {
predictions <- predictions[1:nKeep, ]
}
return(predictions)
}
# GTModel function gets the next words for the input string from the ugramPzero/bgramPzero/tgramPzero tables
GTModel <- function(inputStr, coef = c(0.0, 0.3, 0.7)) {
require(tm)
require(ngram)
library(stringr)
cleanStr <- CleanInputString(inputStr)
uGram <- readRDS("C:/Users/akhan/OneDrive/University of Washington/Final Project/uGram.rds")
bGram <- readRDS("C:/Users/akhan/OneDrive/University of Washington/Final Project/bGram.rds")
tGram <- readRDS("C:/Users/akhan/OneDrive/University of Washington/Final Project/tGram.rds")
## Split the clean string into words
inputWords <- str_split(cleanStr, boundary("word"))[[1]]
if (length(inputWords) > 1) {
testWords <- tail(inputWords, 2)
} else if (length(inputWords) > 0) {
testWords <- c("<UNK>", tail(inputWords, 1))
} else {
return("<UNK>")
}
predictions <- trimodel(testWords, coef,
uGram,
bGram,
tGram, 3)
cnames <- colnames(predictions)
if ("word3" %in% cnames) {
nextWords <- predictions$word3
} else if ("word2" %in% cnames) {
nextWords <- predictions$word2
} else {
nextWords <- predictions$word1
}
#print(head(predictions))
return(nextWords)
}
# Validation using n-grams
# CalculateScore <- function(validateDocs, removeSW=T, model="baseline"){
#
# # Creating a vector of the validation files
# #samplefiles <- c("train_en_US.blogs.txt", "train_en_US.news.txt", "train_en_US.twitter.txt")
# samplefilesValidation <- c()
#
# for (i in 1:length(validateDocs))
# samplefilesValidation[i] <- paste0("validate_en_US.",validateDocs[i],".txt")
#
# textCValidate <- stringi::stri_trans_general(readtext(samplefilesValidation), "latin-ascii")
# myCorpusValidation <- corpus(textCValidate)
#
# myNgramDFValidation <- tokenizeCorpus(myCorpusValidation, 4, removeStopWords=removeSW, verbose = F)
#
# threeGramValidation <- corpusNGramGenerator(myNgramDFValidation[[3]])
#
# df <- threeGramValidation
# df <- df[1:500,]
# df$lastTerms <- stri_extract_last_words(df$Terms)
# df$predictTrems <- str_sub(df$Terms, start = 1L, end = (nchar(df$Terms) - nchar(df$lastTerms)))
#
# if (as.character(model) == "baseline"){
# pre <- sapply(df$predictTrems, function(x){ t = baselineModel(x)
# return(word(t[1,1], -1))})
# }else{
# pre <- sapply(df$predictTrems, function(x){ t = GTModel(x)
# return(t[1])})
# }
#
# df$predictions <- pre
# df$score <- df$predictions == df$lastTerms
#
# return(sum(df$score/nrow(df)))
# }
calculateScore2 <- function(validateDocs, model="baseline", coef = c(0.0, 0.3, 0.7))
{
samplefilesValidation <- c()
for (i in 1:length(validateDocs))
samplefilesValidation[i] <- paste0("validate_en_US.",validateDocs[i],".txt")
textData <- readtext(samplefilesValidation)
sent <- tokens(stri_trans_general(textData[1,2], "latin-ascii"), what = "sentence")
df2 <- as.data.frame(sent$text1)
#df2 <- df2[1:500,]
df2[] <- lapply(df2, function(x) {
if(is.factor(x)) as.character(x) else x
})
cs <- unname(sapply(df2$`sent$text1`, function(x) CleanInputString(x)))
df2$cleanString <- cs
df2$lastWord <- stri_extract_last_words(df2$cleanString)
#words <- strsplit(df2$`sent$text1`, " ")
df2$testString <- str_sub(df2$cleanString, start = 1L, end = (nchar(df2$cleanString) - nchar(df2$lastWord)))
if (as.character(model) == "baseline"){
pre2 <- sapply(df2$testString, function(x){ t = baselineModel(x)
return(word(t[1,1], -1))})
}else{
pre2 <- sapply(df2$testString, function(x){ t = GTModel(x, coef)
return(t[1])})
}
df2$predictions <- pre2
df2$score <- df2$predictions == df2$lastWord
return(sum(df2$score/nrow(df2), na.rm = T))
}
##############################################################
# Executing the baseline model; Stop words removed
##############################################################
#generating training corpus from blogs and twitter and validation corpus from news documents
#both the training and the validation samples use about 1% of the corresponding documents
myCorpus <- generateCorpus(c("blogs","twitter"), c("news"), 0.01, 0.01)
#tokenizing corpus with 1,2,3,and 4 grams
myNgramDF <- tokenizeCorpus(myCorpus, 4, removeStopWords=T, verbose = T)
#Exploratory Data Analysis
for(i in 1:4)
plotFreqWords(i, myNgramDF)
#Building the frequency matrices for 1/2/3/4-grams and saving it
oneGram <- corpusNGramGenerator(myNgramDF[[1]])
twoGram <- corpusNGramGenerator(myNgramDF[[2]])
threeGram <- corpusNGramGenerator(myNgramDF[[3]])
fourGram <- corpusNGramGenerator(myNgramDF[[4]])
saveRDS(oneGram, file = "fDF1.rds")
saveRDS(twoGram, file = "fDF2.rds")
saveRDS(threeGram, file = "fDF3.rds")
saveRDS(fourGram, file = "fDF4.rds")
#Testing the model
#String for testing the code
i1 <- "to be or not to"
i2 <- "thanks for the"
i3 <- "I am"
i4 <- "He"
unname(sapply(c(i1,i2,i3,i4), function(x){ t = baselineModel(x)
return(word(t[1,1], -1))}))
#accuracy.baseline.stopRemoved <- CalculateScore(c("news"),T, "baseline")
accuracy.baseline.stopRemoved2 <- calculateScore2(c("news"), "baseline")
#accuracy.baseline.stopRemoved
accuracy.baseline.stopRemoved2
##############################################################
# Executing the baseline model; Stop words NOT removed
##############################################################
#tokenizing corpus with 1,2,3,and 4 grams
myNgramDF2 <- tokenizeCorpus(myCorpus, 4, removeStopWords=F, verbose = F)
#Building the frequency matrices for 1/2/3/4-grams and saving it
oneGram <- corpusNGramGenerator(myNgramDF2[[1]])
twoGram <- corpusNGramGenerator(myNgramDF2[[2]])
threeGram <- corpusNGramGenerator(myNgramDF2[[3]])
fourGram <- corpusNGramGenerator(myNgramDF2[[4]])
saveRDS(oneGram, file = "fDF1.rds")
saveRDS(twoGram, file = "fDF2.rds")
saveRDS(threeGram, file = "fDF3.rds")
saveRDS(fourGram, file = "fDF4.rds")
#accuracy.baseline.stopNotRemoved <- CalculateScore(c("news",F, "baseline"))
accuracy.baseline.stopNotRemoved2 <- calculateScore2(c("news"), "baseline")
#accuracy.baseline.stopNotRemoved
accuracy.baseline.stopNotRemoved2
##############################################################
# Executing the GTmodel; Stop words removed | Default Values of Lambda
##############################################################
#tokenizing corpus with 1,2,3,and 4 grams
myNgramDF <- tokenizeCorpus(myCorpus, 4, removeStopWords=T, verbose = F)
#Building the frequency matrices for 1/2/3/4-grams and saving it
oneGram <- corpusNGramGenerator(myNgramDF[[1]])
twoGram <- corpusNGramGenerator(myNgramDF[[2]])
threeGram <- corpusNGramGenerator(myNgramDF[[3]])
uGram <- unigram(oneGram)
bGram <- bigram(twoGram)
tGram <- trigram(threeGram)
saveRDS(uGram, file = "uGram.rds")
saveRDS(bGram, file = "bGram.rds")
saveRDS(tGram, file = "tGram.rds")
# Test Phrases
q1 <- "Happy Mothers"
q2 <- "Hey sunshine, can you follow me and make me the"
q3 <- "I love"
q4 <- "He went on a romantic date at the"
q5 <- "I haven't seen it in quite some"
q6 <- "His long wet hair fell out of his eyes with his little"
q7 <- "To be or not to"
lapply(c(q1,q2,q3,q4,q5,q6,q7), function(x){ GTModel(x)})
#accuracy.GTModel.stopRemoved <- CalculateScore(c("news"),F, "GTModel")
accuracy.GTModel.stopRemoved2 <- calculateScore2(c("news"), "GTModel")
#accuracy.GTModel.stopRemoved
accuracy.GTModel.stopRemoved2
##############################################################
# Executing the GTmodel; Stop words NOT removed | Default Values of Lambda
##############################################################
#tokenizing corpus with 1,2,3,and 4 grams
myNgramDF <- tokenizeCorpus(myCorpus, 4, removeStopWords=F, verbose = F)
#Building the frequency matrices for 1/2/3/4-grams and saving it
oneGram <- corpusNGramGenerator(myNgramDF[[1]])
twoGram <- corpusNGramGenerator(myNgramDF[[2]])
threeGram <- corpusNGramGenerator(myNgramDF[[3]])
uGram <- unigram(oneGram)
bGram <- bigram(twoGram)
tGram <- trigram(threeGram)
saveRDS(uGram, file = "uGram.rds")
saveRDS(bGram, file = "bGram.rds")
saveRDS(tGram, file = "tGram.rds")
accuracy.GTModel.stopNotRemoved2 <- calculateScore2(c("news"), "GTModel")
accuracy.GTModel.stopNotRemoved2
##############################################################
# Executing the GTmodel; Stop words removed | Differing Coefficients
##############################################################
#tokenizing corpus with 1,2,3,and 4 grams
myNgramDF <- tokenizeCorpus(myCorpus, 4, removeStopWords=T, verbose = F)
#Building the frequency matrices for 1/2/3/4-grams and saving it
oneGram <- corpusNGramGenerator(myNgramDF[[1]])
twoGram <- corpusNGramGenerator(myNgramDF[[2]])
threeGram <- corpusNGramGenerator(myNgramDF[[3]])
uGram <- unigram(oneGram)
bGram <- bigram(twoGram)
tGram <- trigram(threeGram)
saveRDS(uGram, file = "uGram.rds")
saveRDS(bGram, file = "bGram.rds")
saveRDS(tGram, file = "tGram.rds")
coef1 <- c(0,0,1)
coef2 <- c(0.1,0.2,0.7)
coef3 <- c(0.3,0.5,0.2)
coef4 <- c(0.1,0.1,0.8)
coef5 <- c(0.7,0.2,0.1)
accuracy.GTModel.coef1 <- calculateScore2(c("news"), "GTModel", coef1)
accuracy.GTModel.coef2 <- calculateScore2(c("news"), "GTModel", coef2)
accuracy.GTModel.coef3 <- calculateScore2(c("news"), "GTModel", coef3)
accuracy.GTModel.coef4 <- calculateScore2(c("news"), "GTModel", coef4)
accuracy.GTModel.coef5 <- calculateScore2(c("news"), "GTModel", coef5)
print(accuracy.GTModel.coef1)
print(accuracy.GTModel.coef2)
print(accuracy.GTModel.coef3)
print(accuracy.GTModel.coef4)
print(accuracy.GTModel.coef5)