Summary

In this project, N-gram probabilities are used to predict the next word given a sequence of words in a sentence. First an exploratory data analysis is carried out on the text data obtained from various online web sources - news, blogs and twitter. Second N-gram probabilities are calculated and associated models are built to predict the next word given a sequence of words. Finally the models are compared and the best model is selected. The model built out of this project is ultimately used in a shiny app that simulates a predictive keyboard.

Exploratory data analysis

In this section exploratory data analysis is carried out to undersand word, bigram and trigram frequencies.

Data set

Lets first figure out how big the data set is.

# Read number of lines,words in each file and measure its size
system("wc -lwc final/en_US/*.txt > temp.txt")
a <- read.table("temp.txt"); 
a <- data.frame("Filename" = a$V4, "Lines" = prettyNum(a$V1,big.mark = ","),
                "Words" = prettyNum(a$V2,big.mark = ","), "Size.MB" = round(a$V3/2^20,2))
library(knitr); kable(a,align="c")
Filename Lines Words Size.MB
final/en_US/en_US.blogs.txt 899,288 37,334,690 200.42
final/en_US/en_US.news.txt 1,010,242 34,372,720 196.28
final/en_US/en_US.twitter.txt 2,360,148 30,374,206 159.36
total 4,269,678 102,081,616 556.07

There are about 4 million lines of text containing about 100 million words spread in files of size 550MB. Lets read a portion of this data into workspace and carry out analysis on that.

Training, Validation and Test Data sets

Lets split 60% of data into training, 20% of data into validation and the remaning 20% of data into testing data sets. The idea is to grab a set of lines from each of the sources of data - blogs, news and twitter to for a particular type of data set - training/validation/testing sets.

# Process each input file. 
Nlines <- 600000
d.b <- readLines("final/en_US/en_US.blogs.txt",Nlines)
d.n <- readLines("final/en_US/en_US.news.txt",Nlines)
d.t <- readLines("final/en_US/en_US.twitter.txt",Nlines)

# Split data into training, validation and test sets
set.seed(100)
idx <- sample(seq(Nlines)); N1 <- round(0.6*Nlines); N2 <- round(0.8*Nlines)
train <- c(d.b[idx[1:N1]],          d.n[idx[1:N1]],          d.t[idx[1:N1]])
valid <- c(d.b[idx[(N1+1):N2]],     d.n[idx[(N1+1):N2]],     d.t[idx[(N1+1):N2]])
test  <- c(d.b[idx[(N2+1):Nlines]], d.n[idx[(N2+1):Nlines]], d.t[idx[(N2+1):Nlines]])
Data Set Num lines Percentage lines
Training Set 1,080,000 60
Validation Set 360,000 20
Testing Set 360,000 20
Total 1,800,000 100

Lets now sample the training set and carry out analysis on uni-gram, bi-gram and tri-grams. Since modeling these N-grams is a computationally expensive step, I would like to take the approach of sampling the training set - obtain some metrics - infer the statistics of the population (ie., entire training set).

Data cleansing and pre-processing

The following steps are carried out as part of ngram_tokenize() function while carrying out the analysis. This is included as part of ngram_compute() function. Refer to appendix section for implementation details.

  1. Split the raw text sentences into words
  2. Filter out punctuations.
  3. Convert all words to lowercase.
  4. Filter out profane words. List of profane words are taken from CMU website
  5. Filter out stop words. List of stop word taken from MIT Paper

Uni-gram analysis

Lets look at the unigrams and their frequency of occurence in the training sets. The way to go about is to sample a small subset of data from the training set - compute the unigrams and their counts - repeat it for many such small subsets - average across all sample sets. To start with a sample size of 100 lines of text and 100 such samples are chosen. This equates to 10,000 lines of text.

M <- 10000
# Compute unigrams
unigrams <- ngram_compute(train[1:M],n=1,mat=1) # Refer appendix for function definition

Top 10 unigrams and their average counts are

um <- unigrams %*% matrix(1,M,1) #row sums
names(um) <- rownames(unigrams); um <- sort(um, decreasing = TRUE)
um[1:10]
##   time people   back   make    day   good   love   life things   work 
##   1031    638    581    535    529    527    514    438    435    418
library(wordcloud)
wordcloud(names(um),um,max.words=200,scale = c(3,0.5),rot.per = 0.2, colors = brewer.pal(6,"Dark2"),random.order = FALSE)

Statistical Inference

Using the central limit theorem to infer the population statistics from sample statistics, the average counts from the samples estimates to the counts seen in population. Note the counts are not integers as they are averaged over all sample sets and hence they are estimates.

Ms <- 100; Mcol = floor(M/Ms)
mask_mat <- matrix(0,Mcol*Ms,Mcol); for(i in 1:Mcol) mask_mat[(((i-1)*Ms+1):(i*Ms)),i] <- 1
ums <- unigrams[names(um[1:10]),] %*% mask_mat; rownames(ums) <- names(um[1:10])
boxplot(t(ums[10:1,]),xlab = "frequency (Number of unigrams per 100 lines)", main = "Unigram frequencies",horizontal = T,las=1)

Bi-gram and Tri-gram analysis

Dictionary selection

sum(um); length(um)
## [1] 162977
## [1] 28582
round(c(sum(um[1:1400]), sum(um[1:13500]))/sum(um)*100,2)
## [1] 50.95 90.39

In the test sample set considered with sample size of 100 lines and 100 such samples, there are 10,000 lines of text containig 162,977 words and 28,582 unique unigrams. Data shows that in a frequency sorted word dictionary, only 1,400 unique words are neeed to cover about 50% of all the word instance and 13,500 unique words are needed to cover about 90% of all word instances. Lets consider the top 15,000 unigrams to be part of the dictionary.

dict <- names(um[1:13500])

Lets look at the bigrams as well as trigrams and their frequency of occurence in the training sets. Here words that are not present in the dictionary are filtered out.

# Compute bigrams and trigrams
bigrams <- ngram_compute(train[1:M],n=2,mat=1,dict)
trigrams <- ngram_compute(train[1:M],n=3,mat=1,dict)

bm <- bigrams %*% matrix(1,ncol(bigrams),1); names(bm) <- rownames(bigrams); bm <- sort(bm,decreasing = TRUE)
tm <- trigrams %*% matrix(1,ncol(trigrams),1); names(tm) <- rownames(trigrams); tm <- sort(tm,decreasing = TRUE)

The top 5 bigrams and trigrams and their total counts

## vested interests interests vested        years ago        long time 
##              253              251               59               39 
##      high school 
##               25
##    vested interests vested interests vested interests 
##                        251                        251 
##     santelena hotel venice         hotel venice italy 
##                         11                         11 
##        amazon services llc 
##                          8

Prediction model

In this section various models would be computed and evaluated to select the best model.

Model Building

Lets buid Unigram, Bigram and Trigram models on a larger data set. The following options are considered in model building.

  1. Smoothing : Laplace smoothing of adding 1 to all the counts is used here to calculate probabilities for those missing unigrams. Out of vocaboulary words are marked as “UNK”.
  2. Stop words : Stop words are removed from unigram, bigram and trigram models.
  3. Pruning : Any ngrams with lesser the two counts are not considered into the model directly. They are indirectly included via smoothing option above.
M <- 100000
prune <- 2
u.count <- ngram_compute(train[1:M],n=1)   # Unigram model
dict <- names(u.count[u.count>prune])    # Dictionary
b.count <- ngram_compute(train[1:M],n=2,dictionary = dict)   # Bigram model
t.count <- ngram_compute(train[1:M],n=3,dictionary = dict)   # Trigram model

Unigram, Bigram and Trigram probabilities

Compute unigram, bigram and trigram probability tables in this step. They are used to evaluate models and also predict the next word.

# Calculate unigram log likelihood
N <- sum(u.count) # Total number of tokens
V <- length(dict) # Number of words in the dictionary
u.model <- rbind(as.matrix(u.count[dict]),"UNK" = 0) + 1 # Laplacian smoothing
u.model <- cbind(u.model,log(u.model[,1]/(N+V)) ); colnames(u.model) <- c("count","MLE")

# Calculate bigram log likelihood
bc.prune <- b.count[b.count > prune]
PQ <- names(bc.prune)
P <- sapply(strsplit(PQ, split = " "), function(x) x[1])
bm.1 <- as.matrix(bc.prune) + 1 # Laplacian smoothing
bm.1 <- cbind(bm.1,log(bm.1[,1]/u.model[P,1]) ) # "Word1" "Word2"
bm.2 <- as.matrix(cbind(rep(1,V),log(1/u.model[1:V,1]))) # "Word1" "UNK"
rownames(bm.2) <- paste(dict,"UNK")
bm.3 <- as.matrix(cbind(1,u.model["UNK",2])); rownames(bm.3) <- "UNK UNK" # "UNK" "UNK"
b.model <- rbind(bm.1,bm.2,bm.3); colnames(b.model) <- c("count","MLE")

# Calculate trigram log likelihood
tc.prune <- t.count[t.count > prune]
PQR <- rownames(tc.prune)
PQ <- sapply(strsplit(PQR, split = " "), function(x) paste(x[1:2],collapse = " "))
t.model <- cbind(as.matrix(tc.prune+1), log(as.matrix(tc.prune+1)/b.model[PQ,1]) )
colnames(t.model) <- c("count","MLE")
N-gram Number of ngrams Unique no. of ngrams Unique no. after pruning
unigrams 1,639,936 95,886 37,124
bigrams 1,471,002 1,177,380 48,210
trigrams 1,380,041 1,362,337 2,218

The number of tokens in the model N = 1,639,936 and the number of words in the dictionary V = 37,124.

# Save model to disk
u.mle <- sort(u.model[,2], decreasing = TRUE)
b.mle <- sort(b.model[,2], decreasing = TRUE)
t.mle <- sort(t.model[,2], decreasing = TRUE)
save(u.mle,b.mle,t.mle,file="Model.RData")

Top 10 unigrams, bigrams and trigrams from the corpus are listed below

Unigrams Count MLE Bigrams Count MLE Trigrams Count MLE
time 10144 -5.1 years ago 580 -1.9 interests vested interests 252 0.0
people 6463 -5.6 long time 336 -2.2 vested interests vested 252 0.0
day 5708 -5.7 united states 324 -0.4 amazon services llc 61 0.0
back 5703 -5.7 high school 285 -1.7 couple weeks ago 45 -1.2
make 5687 -5.7 vested interests 255 0.0 incorporated item pp 43 -0.2
good 5500 -5.7 interests vested 252 -0.5 world war ii 37 -0.8
love 5058 -5.8 ice cream 180 -0.9 couple years ago 34 -1.1
life 4502 -5.9 weeks ago 175 -2.0 long time ago 32 -2.4
things 4313 -6.0 spend time 169 -1.4 preheat oven degrees 32 -0.8
work 4198 -6.0 couple weeks 153 -2.1 advertising fees advertising 31 0.0

A cursory check : The log probability values calculated fall in between -inf to 0 that maps to probability values of 0 to 1 as expected.

summary(u.model[,2]); summary(b.model[,2]); summary(t.model[,2])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -14.340 -12.950 -12.550 -12.120 -11.510  -5.115
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -14.340  -4.945  -3.277  -3.420  -1.609   0.000
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -4.9770 -1.4470 -0.4700 -0.9242  0.0000  0.0000
par(mfrow=c(1,3))
hist(u.model[,2],xlab="MLE", main= "Unigrams", col = "blue")
hist(b.model[,2],xlab="MLE", main="Bigrams", col = "green")
hist(t.model[,2],xlab="MLE", main="Trigrams",col = "salmon")

Model comparison and selection

Lets compare the three models individually and as a group to see how best they work on the validation set. Perplexity would be calculated on the validation set and the one with the least perplexity wins.

u_perp <- ngram_perplexity(valid[1:10000],u.model,b.model,t.model,1) # Unigram model perplexity
b_perp <- ngram_perplexity(valid[1:10000],u.model,b.model,t.model,2) # Bigram model perplexity
t_perp <- ngram_perplexity(valid[1:10000],u.model,b.model,t.model,3) # Trigram model perplexity
Model Type Number of unique n-grams Perplexity (validation set)
Unigram 49,483 10304.68
Bigram 97,693 238.65
Trigram 2,218 143.57

From the perplexity numbers, the model of choice would be trigram model as it has the least perplexity on validation data set. However since the number of trigram samples obtained from the training set is not significant, back off approach is used to predit the next word. As such all three models are considered with stupid backoff to roll back to the next best model if an ngram doesn’t exist in the language model.

Example prediction

Lets run through example test cases

print(ngram_predict("To all the mothers out there, wish you a Happy Mother's"))
## [1] "day"
print(ngram_predict("Wish you a very Happy"))
## [1] "birthday"
print(ngram_predict("Processed foods contain high fructose"))
## [1] "corn"
print(ngram_predict("Processed foods contain high fructose corn"))
## [1] "syrup"
print(ngram_predict("This is United"))
## [1] "states"
print(ngram_predict("This is United states of"))
## [1] "america"
print(ngram_predict("05th of May is Cinco"))
## [1] "de"
print(ngram_predict("05th of May is Cinco de"))
## [1] "mayo"
print(ngram_predict("What did you get as a Valentine's"))
## [1] "day"
print(ngram_predict("What did you get as a Valentine's day"))
## [1] "gift"

Since the stop words are removed from the dictionary and ngrams, this model wouldn’t predict any stop words.

Appendix - Code

ngram_tokenize()

Function that tokenizes lines of text into words

ngram_tokenize <- function(data) {
    for(i in 1:length(data)) {
        temp <- data[i]
        temp <- gsub("[.-]"," ", temp) # replace . - with space
        temp <- gsub("[[:punct:]]","",temp) # remove punctuation
        temp <- gsub("[0-9]","",temp) # remove numbers
        data[i] <- tolower(temp) # to lower case
    }
    data <- lapply(data, function(x) unlist(strsplit(x,split=" ")) ) # into words
    data <- lapply(data, function(x) grep("^[a-z]+$",x,value=TRUE) ) # select only english
    
    # Remove profane and remove stop words
    profanity <- readLines("bad-words.txt"); 
    stopWords <- readLines("stop-words.txt"); stopWords <- gsub("[[:punct:]]","",stopWords)
    remWords <- c(profanity[-1],stopWords)
    data <- lapply(data, function(x) { x[!(x %in% remWords)] })
}

ngram_compute()

Function to compute counts of ngrams. This is equivalent to NGramTokenizer() and TermDocumentMatrix() combined.

ngram_compute <- function(data,n,mat = 0,dictionary = character()) {
    data <- ngram_tokenize(data) # Tokenize the data
    # Create n-grams
    if(n>1) { data <- sapply(data,function(x) x[x %in% dictionary]) } # select words in dictionary
    if(n==2) { # Bigrams
        idx2 <- sapply(data, function(x) ifelse(length(x)>1,TRUE,FALSE)) # rows with atleast 2 words
        data <- lapply(data[idx2],function(x) { paste(x[1:(length(x)-1)],x[2:length(x)]) })
    }
    if(n==3) { # Trigrams
        idx3 <- sapply(data, function(x) ifelse(length(x)>2,TRUE,FALSE)) # rows with atleast 3 words
        data <- lapply(data[idx3],function(x) { paste(x[1:(length(x)-2)],x[2:(length(x)-1)],x[3:length(x)]) })
    }
    # Count unique n-grams
    if(mat==1) { # simulates term-document-matrix
        L <- length(data)
        unique_ngrams <- unique(unlist(data))
        ngram <- matrix(0,length(unique_ngrams),L); rownames(ngram) <- unique_ngrams; colnames(ngram) <- 1:L
        for(i in 1:L) { ns <- table(data[[i]]);  ngram[names(ns),i] <- ns }
    }else { # obtains ngram counts
        ngram <- sort(table(unlist(data)),decreasing = TRUE)
    }
    ngram
}

bigram_mle()

Function to compute bigram MLE of a test data set. It can also compute MLE against a dictionary which can be used to predict the next word.

bigram_mle <- function(D,u.model,b.mle,flag=0) {
    D[!(D %in% rownames(u.model))] <- "UNK"; 
    L <- length(D)
    if(flag == 0) { 
        Y <- D[1:(L-1)]
        Z <- D[2:L]
    } else { # special case for predicted word
        Z <- rownames(u.model); Z <- Z[1:(length(Z)-1)] # Map against dictionary  
        Y <- rep(D[L],length(Z))
    }
    YZ <- paste(Y,Z)
    id1 <- (YZ %in% rownames(b.model))
    id2 <- (!id1) & (Y != "UNK")
    id3 <- (!id1) & (Y == "UNK")
    a <- matrix(NA,length(YZ),1)
    if(sum(id1)>0) { a[id1] <- b.model[YZ[id1],2] }
    if(sum(id2)>0) { a[id2] <- log(1/u.model[Y[id2],1]) }
    if(sum(id3)>0) { a[id3] <- u.model["UNK",2] }
    rownames(a) <- YZ; colnames(a) <- "MLE"
    a
}

trigram_mle()

Function to compute trigram MLE of a test data set. It can also compute MLE against a dictionary which can be used to predict the next word.

trigram_mle <- function(D,u.model,b.model,t.model,flag = 0) {
    D[!(D %in% rownames(u.model))] <- "UNK"; L <- length(D)
    if(flag == 0) {
        P <- D[1:(L-2)]; Q <- D[2:(L-1)]; R <- D[3:L]
    }else { # special case for predicted word
        R <- rownames(u.model); R <- R[1:(length(R)-1)] # Map against dictionary
        P <- rep(D[L-1],length(R)) 
        Q <- rep(D[L],length(R))
    }
    PQR <- paste(P,Q,R)
    id1 <- (PQR %in% rownames(t.model))
    PQ <- paste(P,Q)
    id2 <- (!id1) & (PQ %in% rownames(b.model))
    id3 <- (!id1) & (!(PQ %in% rownames(b.model))) & (P != "UNK")
    id4 <- (!id1) & (!(PQ %in% rownames(b.model))) & (P == "UNK")
    
    a <- matrix(NA,length(PQR),1)
    if(sum(id1)>0) { a[id1] <- t.model[PQR[id1],2] }
    if(sum(id2)>0) { a[id2] <- log(1/b.model[PQ[id2],1]) }
    if(sum(id3)>0) { a[id3] <- log(1/u.model[P[id3],1]) }
    if(sum(id4)>0) { a[id4] <- u.model["UNK",2] }
    
    rownames(a) <- PQR; colnames(a) <- "MLE"
    a
}

ngram_perplexity()

Function to compute perplexity of a test data set. Loops are unrolled for fast performance

ngram_perplexity <- function(data,u.model,b.model,t.model,n) {
    D <- ngram_tokenize(data); D <- unlist(D); 
    if(n==1) { # Unigram
        D[!(D %in% rownames(u.model))] <- "UNK"
        a <- u.model[D,2]
        n_perp <- exp(-sum(a)/length(a))
    }
    if(n==2) { # Bigram model
        a <- bigram_mle(D,u.model,b.model)
        n_perp <- exp(-sum(a)/length(a))
    }
    if(n==3) { # Trigram model
        a <- trigram_mle(D,u.model,b.model,t.model)
        n_perp <- exp(-sum(a)/length(a))
    }
    n_perp
}

ngram_predict()

Function to predict the next word given a sequence of words

ngram_predict <- function(data) {
    
    D <- ngram_tokenize(data) # Tokenize
    D <- unlist(D); D[!(D %in% names(u.mle))] <- "UNK" # mark the ones not in dictionary as UNK
    L <- length(D)
    
    if(L==0) { P <- "UNK"; Q <- "UNK"}
    if(L==1) { P <- "UNK"; Q <- D[L] }
    if(L>1) { P <- D[L-1]; Q <- D[L] }
    R <- names(u.mle); R <- R[-length(R)] # Map against dictionary
    
    PQR <- paste(P,Q,R)
    id1 <- (PQR %in% names(t.mle))
    PQ <- paste(P,Q)
    QR <- paste(Q,R)
    id2 <- (!id1) & (PQ %in% names(b.mle))    &   (QR %in% names(b.mle))
    id3 <- (!id1) & (PQ %in% names(b.mle))    & (!(QR %in% names(b.mle)))
    id4 <- (!id1) & (!(PQ %in% names(b.mle))) &   (QR %in% names(b.mle))
    id5 <- (!id1) & (!(PQ %in% names(b.mle))) & (!(QR %in% names(b.mle)))
    
    a <- matrix(NA,length(PQR),1)
    if(sum(id1)>0) { a[id1] <- t.mle[PQR[id1]] }
    if(sum(id2)>0) { a[id2] <- b.mle[PQ] + b.mle[QR[id2]]            + log(2*0.4) }
    if(sum(id3)>0) { a[id3] <- b.mle[PQ] + u.mle[Q] + u.mle[R[id3]]  + log(3*0.4) }
    if(sum(id4)>0) { a[id4] <- u.mle[P]  + u.mle[Q] + b.mle[QR[id4]] + log(3*0.4) }
    if(sum(id5)>0) { a[id5] <- u.mle[P]  + u.mle[Q] + u.mle[R[id5]]  + log(4*0.4) }
    
    pred_word <- R[which.max(a)]
}