Summary

In this project, N-gram probabilities are used to predict the next word given a sequence of words in a sentence. First an exploratory data analysis is carried out on the text data obtained from various online web sources - news, blogs and twitter. Second N-gram probabilities are calculated and associated models are built to predict the next word given a sequence of words. Finally the models are compared and the best model is selected. The model built out of this project is ultimately used in a shiny app that simulates a predictive keyboard.

Exploratory data analysis

In this section exploratory data analysis is carried out to undersand word, bigram and trigram frequencies.

Data set

Lets first figure out how big the data set is.

# Read number of lines,words in each file and measure its size
system("wc -lwc final/en_US/*.txt > temp.txt")
a <- read.table("temp.txt"); 
a <- data.frame("Filename" = a$V4, "Lines" = prettyNum(a$V1,big.mark = ","),
                "Words" = prettyNum(a$V2,big.mark = ","), "Size.MB" = round(a$V3/2^20,2))
library(knitr); kable(a,align="c")

Filename	Lines	Words	Size.MB
final/en_US/en_US.blogs.txt	899,288	37,334,690	200.42
final/en_US/en_US.news.txt	1,010,242	34,372,720	196.28
final/en_US/en_US.twitter.txt	2,360,148	30,374,206	159.36
total	4,269,678	102,081,616	556.07

There are about 4 million lines of text containing about 100 million words spread in files of size 550MB. Lets read a portion of this data into workspace and carry out analysis on that.

Training, Validation and Test Data sets

Lets split 60% of data into training, 20% of data into validation and the remaning 20% of data into testing data sets. The idea is to grab a set of lines from each of the sources of data - blogs, news and twitter to for a particular type of data set - training/validation/testing sets.

# Process each input file. 
Nlines <- 600000
d.b <- readLines("final/en_US/en_US.blogs.txt",Nlines)
d.n <- readLines("final/en_US/en_US.news.txt",Nlines)
d.t <- readLines("final/en_US/en_US.twitter.txt",Nlines)

# Split data into training, validation and test sets
set.seed(100)
idx <- sample(seq(Nlines)); N1 <- round(0.6*Nlines); N2 <- round(0.8*Nlines)
train <- c(d.b[idx[1:N1]],          d.n[idx[1:N1]],          d.t[idx[1:N1]])
valid <- c(d.b[idx[(N1+1):N2]],     d.n[idx[(N1+1):N2]],     d.t[idx[(N1+1):N2]])
test  <- c(d.b[idx[(N2+1):Nlines]], d.n[idx[(N2+1):Nlines]], d.t[idx[(N2+1):Nlines]])

Data Set	Num lines	Percentage lines
Training Set	1,080,000	60
Validation Set	360,000	20
Testing Set	360,000	20
Total	1,800,000	100

Lets now sample the training set and carry out analysis on uni-gram, bi-gram and tri-grams. Since modeling these N-grams is a computationally expensive step, I would like to take the approach of sampling the training set - obtain some metrics - infer the statistics of the population (ie., entire training set).

Data cleansing and pre-processing

The following steps are carried out as part of ngram_tokenize() function while carrying out the analysis. This is included as part of ngram_compute() function. Refer to appendix section for implementation details.

Split the raw text sentences into words
Filter out punctuations.
Convert all words to lowercase.
Filter out profane words. List of profane words are taken from CMU website
Filter out stop words. List of stop word taken from MIT Paper

Uni-gram analysis

Lets look at the unigrams and their frequency of occurence in the training sets. The way to go about is to sample a small subset of data from the training set - compute the unigrams and their counts - repeat it for many such small subsets - average across all sample sets. To start with a sample size of 100 lines of text and 100 such samples are chosen. This equates to 10,000 lines of text.

M <- 10000
# Compute unigrams
unigrams <- ngram_compute(train[1:M],n=1,mat=1) # Refer appendix for function definition

Top 10 unigrams and their average counts are

um <- unigrams %*% matrix(1,M,1) #row sums
names(um) <- rownames(unigrams); um <- sort(um, decreasing = TRUE)

um[1:10]

##   time people   back   make    day   good   love   life things   work 
##   1031    638    581    535    529    527    514    438    435    418

library(wordcloud)
wordcloud(names(um),um,max.words=200,scale = c(3,0.5),rot.per = 0.2, colors = brewer.pal(6,"Dark2"),random.order = FALSE)

Statistical Inference

Using the central limit theorem to infer the population statistics from sample statistics, the average counts from the samples estimates to the counts seen in population. Note the counts are not integers as they are averaged over all sample sets and hence they are estimates.

Ms <- 100; Mcol = floor(M/Ms)
mask_mat <- matrix(0,Mcol*Ms,Mcol); for(i in 1:Mcol) mask_mat[(((i-1)*Ms+1):(i*Ms)),i] <- 1
ums <- unigrams[names(um[1:10]),] %*% mask_mat; rownames(ums) <- names(um[1:10])

boxplot(t(ums[10:1,]),xlab = "frequency (Number of unigrams per 100 lines)", main = "Unigram frequencies",horizontal = T,las=1)

Bi-gram and Tri-gram analysis

Dictionary selection

sum(um); length(um)

## [1] 162977

## [1] 28582

round(c(sum(um[1:1400]), sum(um[1:13500]))/sum(um)*100,2)

## [1] 50.95 90.39

In the test sample set considered with sample size of 100 lines and 100 such samples, there are 10,000 lines of text containig 162,977 words and 28,582 unique unigrams. Data shows that in a frequency sorted word dictionary, only 1,400 unique words are neeed to cover about 50% of all the word instance and 13,500 unique words are needed to cover about 90% of all word instances. Lets consider the top 15,000 unigrams to be part of the dictionary.

dict <- names(um[1:13500])

Lets look at the bigrams as well as trigrams and their frequency of occurence in the training sets. Here words that are not present in the dictionary are filtered out.

# Compute bigrams and trigrams
bigrams <- ngram_compute(train[1:M],n=2,mat=1,dict)
trigrams <- ngram_compute(train[1:M],n=3,mat=1,dict)

bm <- bigrams %*% matrix(1,ncol(bigrams),1); names(bm) <- rownames(bigrams); bm <- sort(bm,decreasing = TRUE)
tm <- trigrams %*% matrix(1,ncol(trigrams),1); names(tm) <- rownames(trigrams); tm <- sort(tm,decreasing = TRUE)

The top 5 bigrams and trigrams and their total counts

## vested interests interests vested        years ago        long time 
##              253              251               59               39 
##      high school 
##               25

##    vested interests vested interests vested interests 
##                        251                        251 
##     santelena hotel venice         hotel venice italy 
##                         11                         11 
##        amazon services llc 
##                          8

Prediction model

In this section various models would be computed and evaluated to select the best model.

Model Building

Lets buid Unigram, Bigram and Trigram models on a larger data set. The following options are considered in model building.

Smoothing : Laplace smoothing of adding 1 to all the counts is used here to calculate probabilities for those missing unigrams. Out of vocaboulary words are marked as “UNK”.
Stop words : Stop words are removed from unigram, bigram and trigram models.
Pruning : Any ngrams with lesser the two counts are not considered into the model directly. They are indirectly included via smoothing option above.

M <- 100000
prune <- 2
u.count <- ngram_compute(train[1:M],n=1)   # Unigram model
dict <- names(u.count[u.count>prune])    # Dictionary
b.count <- ngram_compute(train[1:M],n=2,dictionary = dict)   # Bigram model
t.count <- ngram_compute(train[1:M],n=3,dictionary = dict)   # Trigram model

Unigram, Bigram and Trigram probabilities

Compute unigram, bigram and trigram probability tables in this step. They are used to evaluate models and also predict the next word.

# Calculate unigram log likelihood
N <- sum(u.count) # Total number of tokens
V <- length(dict) # Number of words in the dictionary
u.model <- rbind(as.matrix(u.count[dict]),"UNK" = 0) + 1 # Laplacian smoothing
u.model <- cbind(u.model,log(u.model[,1]/(N+V)) ); colnames(u.model) <- c("count","MLE")

# Calculate bigram log likelihood
bc.prune <- b.count[b.count > prune]
PQ <- names(bc.prune)
P <- sapply(strsplit(PQ, split = " "), function(x) x[1])
bm.1 <- as.matrix(bc.prune) + 1 # Laplacian smoothing
bm.1 <- cbind(bm.1,log(bm.1[,1]/u.model[P,1]) ) # "Word1" "Word2"
bm.2 <- as.matrix(cbind(rep(1,V),log(1/u.model[1:V,1]))) # "Word1" "UNK"
rownames(bm.2) <- paste(dict,"UNK")
bm.3 <- as.matrix(cbind(1,u.model["UNK",2])); rownames(bm.3) <- "UNK UNK" # "UNK" "UNK"
b.model <- rbind(bm.1,bm.2,bm.3); colnames(b.model) <- c("count","MLE")

# Calculate trigram log likelihood
tc.prune <- t.count[t.count > prune]
PQR <- rownames(tc.prune)
PQ <- sapply(strsplit(PQR, split = " "), function(x) paste(x[1:2],collapse = " "))
t.model <- cbind(as.matrix(tc.prune+1), log(as.matrix(tc.prune+1)/b.model[PQ,1]) )
colnames(t.model) <- c("count","MLE")

N-gram	Number of ngrams	Unique no. of ngrams	Unique no. after pruning
unigrams	1,639,936	95,886	37,124
bigrams	1,471,002	1,177,380	48,210
trigrams	1,380,041	1,362,337	2,218

The number of tokens in the model N = 1,639,936 and the number of words in the dictionary V = 37,124.

# Save model to disk
u.mle <- sort(u.model[,2], decreasing = TRUE)
b.mle <- sort(b.model[,2], decreasing = TRUE)
t.mle <- sort(t.model[,2], decreasing = TRUE)
save(u.mle,b.mle,t.mle,file="Model.RData")

Top 10 unigrams, bigrams and trigrams from the corpus are listed below

Unigrams	Count	MLE	Bigrams	Count	MLE	Trigrams	Count	MLE
time	10144	-5.1	years ago	580	-1.9	interests vested interests	252	0.0
people	6463	-5.6	long time	336	-2.2	vested interests vested	252	0.0
day	5708	-5.7	united states	324	-0.4	amazon services llc	61	0.0
back	5703	-5.7	high school	285	-1.7	couple weeks ago	45	-1.2
make	5687	-5.7	vested interests	255	0.0	incorporated item pp	43	-0.2
good	5500	-5.7	interests vested	252	-0.5	world war ii	37	-0.8
love	5058	-5.8	ice cream	180	-0.9	couple years ago	34	-1.1
life	4502	-5.9	weeks ago	175	-2.0	long time ago	32	-2.4
things	4313	-6.0	spend time	169	-1.4	preheat oven degrees	32	-0.8
work	4198	-6.0	couple weeks	153	-2.1	advertising fees advertising	31	0.0

A cursory check : The log probability values calculated fall in between -inf to 0 that maps to probability values of 0 to 1 as expected.

summary(u.model[,2]); summary(b.model[,2]); summary(t.model[,2])

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -14.340 -12.950 -12.550 -12.120 -11.510  -5.115

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -14.340  -4.945  -3.277  -3.420  -1.609   0.000

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -4.9770 -1.4470 -0.4700 -0.9242  0.0000  0.0000

par(mfrow=c(1,3))
hist(u.model[,2],xlab="MLE", main= "Unigrams", col = "blue")
hist(b.model[,2],xlab="MLE", main="Bigrams", col = "green")
hist(t.model[,2],xlab="MLE", main="Trigrams",col = "salmon")

Model comparison and selection

Lets compare the three models individually and as a group to see how best they work on the validation set. Perplexity would be calculated on the validation set and the one with the least perplexity wins.

u_perp <- ngram_perplexity(valid[1:10000],u.model,b.model,t.model,1) # Unigram model perplexity
b_perp <- ngram_perplexity(valid[1:10000],u.model,b.model,t.model,2) # Bigram model perplexity
t_perp <- ngram_perplexity(valid[1:10000],u.model,b.model,t.model,3) # Trigram model perplexity

Model Type	Number of unique n-grams	Perplexity (validation set)
Unigram	49,483	10304.68
Bigram	97,693	238.65
Trigram	2,218	143.57

From the perplexity numbers, the model of choice would be trigram model as it has the least perplexity on validation data set. However since the number of trigram samples obtained from the training set is not significant, back off approach is used to predit the next word. As such all three models are considered with stupid backoff to roll back to the next best model if an ngram doesn’t exist in the language model.

Example prediction

Lets run through example test cases

print(ngram_predict("To all the mothers out there, wish you a Happy Mother's"))

## [1] "day"

print(ngram_predict("Wish you a very Happy"))

## [1] "birthday"

print(ngram_predict("Processed foods contain high fructose"))

## [1] "corn"

print(ngram_predict("Processed foods contain high fructose corn"))

## [1] "syrup"

print(ngram_predict("This is United"))

## [1] "states"

print(ngram_predict("This is United states of"))

## [1] "america"

print(ngram_predict("05th of May is Cinco"))

## [1] "de"

print(ngram_predict("05th of May is Cinco de"))

## [1] "mayo"

print(ngram_predict("What did you get as a Valentine's"))

## [1] "day"

print(ngram_predict("What did you get as a Valentine's day"))

## [1] "gift"

Since the stop words are removed from the dictionary and ngrams, this model wouldn’t predict any stop words.

Appendix - Code

ngram_tokenize()

Function that tokenizes lines of text into words

ngram_tokenize <- function(data) {
    for(i in 1:length(data)) {
        temp <- data[i]
        temp <- gsub("[.-]"," ", temp) # replace . - with space
        temp <- gsub("[[:punct:]]","",temp) # remove punctuation
        temp <- gsub("[0-9]","",temp) # remove numbers
        data[i] <- tolower(temp) # to lower case
    }
    data <- lapply(data, function(x) unlist(strsplit(x,split=" ")) ) # into words
    data <- lapply(data, function(x) grep("^[a-z]+$",x,value=TRUE) ) # select only english
    
    # Remove profane and remove stop words
    profanity <- readLines("bad-words.txt"); 
    stopWords <- readLines("stop-words.txt"); stopWords <- gsub("[[:punct:]]","",stopWords)
    remWords <- c(profanity[-1],stopWords)
    data <- lapply(data, function(x) { x[!(x %in% remWords)] })
}

ngram_compute()

Function to compute counts of ngrams. This is equivalent to NGramTokenizer() and TermDocumentMatrix() combined.

ngram_compute <- function(data,n,mat = 0,dictionary = character()) {
    data <- ngram_tokenize(data) # Tokenize the data
    # Create n-grams
    if(n>1) { data <- sapply(data,function(x) x[x %in% dictionary]) } # select words in dictionary
    if(n==2) { # Bigrams
        idx2 <- sapply(data, function(x) ifelse(length(x)>1,TRUE,FALSE)) # rows with atleast 2 words
        data <- lapply(data[idx2],function(x) { paste(x[1:(length(x)-1)],x[2:length(x)]) })
    }
    if(n==3) { # Trigrams
        idx3 <- sapply(data, function(x) ifelse(length(x)>2,TRUE,FALSE)) # rows with atleast 3 words
        data <- lapply(data[idx3],function(x) { paste(x[1:(length(x)-2)],x[2:(length(x)-1)],x[3:length(x)]) })
    }
    # Count unique n-grams
    if(mat==1) { # simulates term-document-matrix
        L <- length(data)
        unique_ngrams <- unique(unlist(data))
        ngram <- matrix(0,length(unique_ngrams),L); rownames(ngram) <- unique_ngrams; colnames(ngram) <- 1:L
        for(i in 1:L) { ns <- table(data[[i]]);  ngram[names(ns),i] <- ns }
    }else { # obtains ngram counts
        ngram <- sort(table(unlist(data)),decreasing = TRUE)
    }
    ngram
}

bigram_mle()

Function to compute bigram MLE of a test data set. It can also compute MLE against a dictionary which can be used to predict the next word.

bigram_mle <- function(D,u.model,b.mle,flag=0) {
    D[!(D %in% rownames(u.model))] <- "UNK"; 
    L <- length(D)
    if(flag == 0) { 
        Y <- D[1:(L-1)]
        Z <- D[2:L]
    } else { # special case for predicted word
        Z <- rownames(u.model); Z <- Z[1:(length(Z)-1)] # Map against dictionary  
        Y <- rep(D[L],length(Z))
    }
    YZ <- paste(Y,Z)
    id1 <- (YZ %in% rownames(b.model))
    id2 <- (!id1) & (Y != "UNK")
    id3 <- (!id1) & (Y == "UNK")
    a <- matrix(NA,length(YZ),1)
    if(sum(id1)>0) { a[id1] <- b.model[YZ[id1],2] }
    if(sum(id2)>0) { a[id2] <- log(1/u.model[Y[id2],1]) }
    if(sum(id3)>0) { a[id3] <- u.model["UNK",2] }
    rownames(a) <- YZ; colnames(a) <- "MLE"
    a
}

trigram_mle()

Function to compute trigram MLE of a test data set. It can also compute MLE against a dictionary which can be used to predict the next word.

trigram_mle <- function(D,u.model,b.model,t.model,flag = 0) {
    D[!(D %in% rownames(u.model))] <- "UNK"; L <- length(D)
    if(flag == 0) {
        P <- D[1:(L-2)]; Q <- D[2:(L-1)]; R <- D[3:L]
    }else { # special case for predicted word
        R <- rownames(u.model); R <- R[1:(length(R)-1)] # Map against dictionary
        P <- rep(D[L-1],length(R)) 
        Q <- rep(D[L],length(R))
    }
    PQR <- paste(P,Q,R)
    id1 <- (PQR %in% rownames(t.model))
    PQ <- paste(P,Q)
    id2 <- (!id1) & (PQ %in% rownames(b.model))
    id3 <- (!id1) & (!(PQ %in% rownames(b.model))) & (P != "UNK")
    id4 <- (!id1) & (!(PQ %in% rownames(b.model))) & (P == "UNK")
    
    a <- matrix(NA,length(PQR),1)
    if(sum(id1)>0) { a[id1] <- t.model[PQR[id1],2] }
    if(sum(id2)>0) { a[id2] <- log(1/b.model[PQ[id2],1]) }
    if(sum(id3)>0) { a[id3] <- log(1/u.model[P[id3],1]) }
    if(sum(id4)>0) { a[id4] <- u.model["UNK",2] }
    
    rownames(a) <- PQR; colnames(a) <- "MLE"
    a
}

ngram_perplexity()

Function to compute perplexity of a test data set. Loops are unrolled for fast performance

ngram_perplexity <- function(data,u.model,b.model,t.model,n) {
    D <- ngram_tokenize(data); D <- unlist(D); 
    if(n==1) { # Unigram
        D[!(D %in% rownames(u.model))] <- "UNK"
        a <- u.model[D,2]
        n_perp <- exp(-sum(a)/length(a))
    }
    if(n==2) { # Bigram model
        a <- bigram_mle(D,u.model,b.model)
        n_perp <- exp(-sum(a)/length(a))
    }
    if(n==3) { # Trigram model
        a <- trigram_mle(D,u.model,b.model,t.model)
        n_perp <- exp(-sum(a)/length(a))
    }
    n_perp
}

ngram_predict()

Function to predict the next word given a sequence of words

ngram_predict <- function(data) {
    
    D <- ngram_tokenize(data) # Tokenize
    D <- unlist(D); D[!(D %in% names(u.mle))] <- "UNK" # mark the ones not in dictionary as UNK
    L <- length(D)
    
    if(L==0) { P <- "UNK"; Q <- "UNK"}
    if(L==1) { P <- "UNK"; Q <- D[L] }
    if(L>1) { P <- D[L-1]; Q <- D[L] }
    R <- names(u.mle); R <- R[-length(R)] # Map against dictionary
    
    PQR <- paste(P,Q,R)
    id1 <- (PQR %in% names(t.mle))
    PQ <- paste(P,Q)
    QR <- paste(Q,R)
    id2 <- (!id1) & (PQ %in% names(b.mle))    &   (QR %in% names(b.mle))
    id3 <- (!id1) & (PQ %in% names(b.mle))    & (!(QR %in% names(b.mle)))
    id4 <- (!id1) & (!(PQ %in% names(b.mle))) &   (QR %in% names(b.mle))
    id5 <- (!id1) & (!(PQ %in% names(b.mle))) & (!(QR %in% names(b.mle)))
    
    a <- matrix(NA,length(PQR),1)
    if(sum(id1)>0) { a[id1] <- t.mle[PQR[id1]] }
    if(sum(id2)>0) { a[id2] <- b.mle[PQ] + b.mle[QR[id2]]            + log(2*0.4) }
    if(sum(id3)>0) { a[id3] <- b.mle[PQ] + u.mle[Q] + u.mle[R[id3]]  + log(3*0.4) }
    if(sum(id4)>0) { a[id4] <- u.mle[P]  + u.mle[Q] + b.mle[QR[id4]] + log(3*0.4) }
    if(sum(id5)>0) { a[id5] <- u.mle[P]  + u.mle[Q] + u.mle[R[id5]]  + log(4*0.4) }
    
    pred_word <- R[which.max(a)]
}

N-gram model based word prediction for smart keyboard

Sridhar Pilli