In this project, N-gram probabilities are used to predict the next word given a sequence of words in a sentence. First an exploratory data analysis is carried out on the text data obtained from various online web sources - news, blogs and twitter. Second N-gram probabilities are calculated and associated models are built to predict the next word given a sequence of words. Finally the models are compared and the best model is selected. The model built out of this project is ultimately used in a shiny app that simulates a predictive keyboard.
In this section exploratory data analysis is carried out to undersand word, bigram and trigram frequencies.
Lets first figure out how big the data set is.
# Read number of lines,words in each file and measure its size
system("wc -lwc final/en_US/*.txt > temp.txt")
a <- read.table("temp.txt");
a <- data.frame("Filename" = a$V4, "Lines" = prettyNum(a$V1,big.mark = ","),
"Words" = prettyNum(a$V2,big.mark = ","), "Size.MB" = round(a$V3/2^20,2))
library(knitr); kable(a,align="c")
| Filename | Lines | Words | Size.MB |
|---|---|---|---|
| final/en_US/en_US.blogs.txt | 899,288 | 37,334,690 | 200.42 |
| final/en_US/en_US.news.txt | 1,010,242 | 34,372,720 | 196.28 |
| final/en_US/en_US.twitter.txt | 2,360,148 | 30,374,206 | 159.36 |
| total | 4,269,678 | 102,081,616 | 556.07 |
There are about 4 million lines of text containing about 100 million words spread in files of size 550MB. Lets read a portion of this data into workspace and carry out analysis on that.
Lets split 60% of data into training, 20% of data into validation and the remaning 20% of data into testing data sets. The idea is to grab a set of lines from each of the sources of data - blogs, news and twitter to for a particular type of data set - training/validation/testing sets.
# Process each input file.
Nlines <- 600000
d.b <- readLines("final/en_US/en_US.blogs.txt",Nlines)
d.n <- readLines("final/en_US/en_US.news.txt",Nlines)
d.t <- readLines("final/en_US/en_US.twitter.txt",Nlines)
# Split data into training, validation and test sets
set.seed(100)
idx <- sample(seq(Nlines)); N1 <- round(0.6*Nlines); N2 <- round(0.8*Nlines)
train <- c(d.b[idx[1:N1]], d.n[idx[1:N1]], d.t[idx[1:N1]])
valid <- c(d.b[idx[(N1+1):N2]], d.n[idx[(N1+1):N2]], d.t[idx[(N1+1):N2]])
test <- c(d.b[idx[(N2+1):Nlines]], d.n[idx[(N2+1):Nlines]], d.t[idx[(N2+1):Nlines]])
| Data Set | Num lines | Percentage lines |
|---|---|---|
| Training Set | 1,080,000 | 60 |
| Validation Set | 360,000 | 20 |
| Testing Set | 360,000 | 20 |
| Total | 1,800,000 | 100 |
Lets now sample the training set and carry out analysis on uni-gram, bi-gram and tri-grams. Since modeling these N-grams is a computationally expensive step, I would like to take the approach of sampling the training set - obtain some metrics - infer the statistics of the population (ie., entire training set).
The following steps are carried out as part of ngram_tokenize() function while carrying out the analysis. This is included as part of ngram_compute() function. Refer to appendix section for implementation details.
Lets look at the unigrams and their frequency of occurence in the training sets. The way to go about is to sample a small subset of data from the training set - compute the unigrams and their counts - repeat it for many such small subsets - average across all sample sets. To start with a sample size of 100 lines of text and 100 such samples are chosen. This equates to 10,000 lines of text.
M <- 10000
# Compute unigrams
unigrams <- ngram_compute(train[1:M],n=1,mat=1) # Refer appendix for function definition
Top 10 unigrams and their average counts are
um <- unigrams %*% matrix(1,M,1) #row sums
names(um) <- rownames(unigrams); um <- sort(um, decreasing = TRUE)
um[1:10]
## time people back make day good love life things work
## 1031 638 581 535 529 527 514 438 435 418
library(wordcloud)
wordcloud(names(um),um,max.words=200,scale = c(3,0.5),rot.per = 0.2, colors = brewer.pal(6,"Dark2"),random.order = FALSE)
Statistical Inference
Using the central limit theorem to infer the population statistics from sample statistics, the average counts from the samples estimates to the counts seen in population. Note the counts are not integers as they are averaged over all sample sets and hence they are estimates.
Ms <- 100; Mcol = floor(M/Ms)
mask_mat <- matrix(0,Mcol*Ms,Mcol); for(i in 1:Mcol) mask_mat[(((i-1)*Ms+1):(i*Ms)),i] <- 1
ums <- unigrams[names(um[1:10]),] %*% mask_mat; rownames(ums) <- names(um[1:10])
boxplot(t(ums[10:1,]),xlab = "frequency (Number of unigrams per 100 lines)", main = "Unigram frequencies",horizontal = T,las=1)
Dictionary selection
sum(um); length(um)
## [1] 162977
## [1] 28582
round(c(sum(um[1:1400]), sum(um[1:13500]))/sum(um)*100,2)
## [1] 50.95 90.39
In the test sample set considered with sample size of 100 lines and 100 such samples, there are 10,000 lines of text containig 162,977 words and 28,582 unique unigrams. Data shows that in a frequency sorted word dictionary, only 1,400 unique words are neeed to cover about 50% of all the word instance and 13,500 unique words are needed to cover about 90% of all word instances. Lets consider the top 15,000 unigrams to be part of the dictionary.
dict <- names(um[1:13500])
Lets look at the bigrams as well as trigrams and their frequency of occurence in the training sets. Here words that are not present in the dictionary are filtered out.
# Compute bigrams and trigrams
bigrams <- ngram_compute(train[1:M],n=2,mat=1,dict)
trigrams <- ngram_compute(train[1:M],n=3,mat=1,dict)
bm <- bigrams %*% matrix(1,ncol(bigrams),1); names(bm) <- rownames(bigrams); bm <- sort(bm,decreasing = TRUE)
tm <- trigrams %*% matrix(1,ncol(trigrams),1); names(tm) <- rownames(trigrams); tm <- sort(tm,decreasing = TRUE)
The top 5 bigrams and trigrams and their total counts
## vested interests interests vested years ago long time
## 253 251 59 39
## high school
## 25
## vested interests vested interests vested interests
## 251 251
## santelena hotel venice hotel venice italy
## 11 11
## amazon services llc
## 8
In this section various models would be computed and evaluated to select the best model.
Lets buid Unigram, Bigram and Trigram models on a larger data set. The following options are considered in model building.
M <- 100000
prune <- 2
u.count <- ngram_compute(train[1:M],n=1) # Unigram model
dict <- names(u.count[u.count>prune]) # Dictionary
b.count <- ngram_compute(train[1:M],n=2,dictionary = dict) # Bigram model
t.count <- ngram_compute(train[1:M],n=3,dictionary = dict) # Trigram model
Compute unigram, bigram and trigram probability tables in this step. They are used to evaluate models and also predict the next word.
# Calculate unigram log likelihood
N <- sum(u.count) # Total number of tokens
V <- length(dict) # Number of words in the dictionary
u.model <- rbind(as.matrix(u.count[dict]),"UNK" = 0) + 1 # Laplacian smoothing
u.model <- cbind(u.model,log(u.model[,1]/(N+V)) ); colnames(u.model) <- c("count","MLE")
# Calculate bigram log likelihood
bc.prune <- b.count[b.count > prune]
PQ <- names(bc.prune)
P <- sapply(strsplit(PQ, split = " "), function(x) x[1])
bm.1 <- as.matrix(bc.prune) + 1 # Laplacian smoothing
bm.1 <- cbind(bm.1,log(bm.1[,1]/u.model[P,1]) ) # "Word1" "Word2"
bm.2 <- as.matrix(cbind(rep(1,V),log(1/u.model[1:V,1]))) # "Word1" "UNK"
rownames(bm.2) <- paste(dict,"UNK")
bm.3 <- as.matrix(cbind(1,u.model["UNK",2])); rownames(bm.3) <- "UNK UNK" # "UNK" "UNK"
b.model <- rbind(bm.1,bm.2,bm.3); colnames(b.model) <- c("count","MLE")
# Calculate trigram log likelihood
tc.prune <- t.count[t.count > prune]
PQR <- rownames(tc.prune)
PQ <- sapply(strsplit(PQR, split = " "), function(x) paste(x[1:2],collapse = " "))
t.model <- cbind(as.matrix(tc.prune+1), log(as.matrix(tc.prune+1)/b.model[PQ,1]) )
colnames(t.model) <- c("count","MLE")
| N-gram | Number of ngrams | Unique no. of ngrams | Unique no. after pruning |
|---|---|---|---|
| unigrams | 1,639,936 | 95,886 | 37,124 |
| bigrams | 1,471,002 | 1,177,380 | 48,210 |
| trigrams | 1,380,041 | 1,362,337 | 2,218 |
The number of tokens in the model N = 1,639,936 and the number of words in the dictionary V = 37,124.
# Save model to disk
u.mle <- sort(u.model[,2], decreasing = TRUE)
b.mle <- sort(b.model[,2], decreasing = TRUE)
t.mle <- sort(t.model[,2], decreasing = TRUE)
save(u.mle,b.mle,t.mle,file="Model.RData")
Top 10 unigrams, bigrams and trigrams from the corpus are listed below
| Unigrams | Count | MLE | Bigrams | Count | MLE | Trigrams | Count | MLE |
|---|---|---|---|---|---|---|---|---|
| time | 10144 | -5.1 | years ago | 580 | -1.9 | interests vested interests | 252 | 0.0 |
| people | 6463 | -5.6 | long time | 336 | -2.2 | vested interests vested | 252 | 0.0 |
| day | 5708 | -5.7 | united states | 324 | -0.4 | amazon services llc | 61 | 0.0 |
| back | 5703 | -5.7 | high school | 285 | -1.7 | couple weeks ago | 45 | -1.2 |
| make | 5687 | -5.7 | vested interests | 255 | 0.0 | incorporated item pp | 43 | -0.2 |
| good | 5500 | -5.7 | interests vested | 252 | -0.5 | world war ii | 37 | -0.8 |
| love | 5058 | -5.8 | ice cream | 180 | -0.9 | couple years ago | 34 | -1.1 |
| life | 4502 | -5.9 | weeks ago | 175 | -2.0 | long time ago | 32 | -2.4 |
| things | 4313 | -6.0 | spend time | 169 | -1.4 | preheat oven degrees | 32 | -0.8 |
| work | 4198 | -6.0 | couple weeks | 153 | -2.1 | advertising fees advertising | 31 | 0.0 |
A cursory check : The log probability values calculated fall in between -inf to 0 that maps to probability values of 0 to 1 as expected.
summary(u.model[,2]); summary(b.model[,2]); summary(t.model[,2])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -14.340 -12.950 -12.550 -12.120 -11.510 -5.115
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -14.340 -4.945 -3.277 -3.420 -1.609 0.000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -4.9770 -1.4470 -0.4700 -0.9242 0.0000 0.0000
par(mfrow=c(1,3))
hist(u.model[,2],xlab="MLE", main= "Unigrams", col = "blue")
hist(b.model[,2],xlab="MLE", main="Bigrams", col = "green")
hist(t.model[,2],xlab="MLE", main="Trigrams",col = "salmon")
Lets compare the three models individually and as a group to see how best they work on the validation set. Perplexity would be calculated on the validation set and the one with the least perplexity wins.
u_perp <- ngram_perplexity(valid[1:10000],u.model,b.model,t.model,1) # Unigram model perplexity
b_perp <- ngram_perplexity(valid[1:10000],u.model,b.model,t.model,2) # Bigram model perplexity
t_perp <- ngram_perplexity(valid[1:10000],u.model,b.model,t.model,3) # Trigram model perplexity
| Model Type | Number of unique n-grams | Perplexity (validation set) |
|---|---|---|
| Unigram | 49,483 | 10304.68 |
| Bigram | 97,693 | 238.65 |
| Trigram | 2,218 | 143.57 |
From the perplexity numbers, the model of choice would be trigram model as it has the least perplexity on validation data set. However since the number of trigram samples obtained from the training set is not significant, back off approach is used to predit the next word. As such all three models are considered with stupid backoff to roll back to the next best model if an ngram doesn’t exist in the language model.
Lets run through example test cases
print(ngram_predict("To all the mothers out there, wish you a Happy Mother's"))
## [1] "day"
print(ngram_predict("Wish you a very Happy"))
## [1] "birthday"
print(ngram_predict("Processed foods contain high fructose"))
## [1] "corn"
print(ngram_predict("Processed foods contain high fructose corn"))
## [1] "syrup"
print(ngram_predict("This is United"))
## [1] "states"
print(ngram_predict("This is United states of"))
## [1] "america"
print(ngram_predict("05th of May is Cinco"))
## [1] "de"
print(ngram_predict("05th of May is Cinco de"))
## [1] "mayo"
print(ngram_predict("What did you get as a Valentine's"))
## [1] "day"
print(ngram_predict("What did you get as a Valentine's day"))
## [1] "gift"
Since the stop words are removed from the dictionary and ngrams, this model wouldn’t predict any stop words.
Function that tokenizes lines of text into words
ngram_tokenize <- function(data) {
for(i in 1:length(data)) {
temp <- data[i]
temp <- gsub("[.-]"," ", temp) # replace . - with space
temp <- gsub("[[:punct:]]","",temp) # remove punctuation
temp <- gsub("[0-9]","",temp) # remove numbers
data[i] <- tolower(temp) # to lower case
}
data <- lapply(data, function(x) unlist(strsplit(x,split=" ")) ) # into words
data <- lapply(data, function(x) grep("^[a-z]+$",x,value=TRUE) ) # select only english
# Remove profane and remove stop words
profanity <- readLines("bad-words.txt");
stopWords <- readLines("stop-words.txt"); stopWords <- gsub("[[:punct:]]","",stopWords)
remWords <- c(profanity[-1],stopWords)
data <- lapply(data, function(x) { x[!(x %in% remWords)] })
}
Function to compute counts of ngrams. This is equivalent to NGramTokenizer() and TermDocumentMatrix() combined.
ngram_compute <- function(data,n,mat = 0,dictionary = character()) {
data <- ngram_tokenize(data) # Tokenize the data
# Create n-grams
if(n>1) { data <- sapply(data,function(x) x[x %in% dictionary]) } # select words in dictionary
if(n==2) { # Bigrams
idx2 <- sapply(data, function(x) ifelse(length(x)>1,TRUE,FALSE)) # rows with atleast 2 words
data <- lapply(data[idx2],function(x) { paste(x[1:(length(x)-1)],x[2:length(x)]) })
}
if(n==3) { # Trigrams
idx3 <- sapply(data, function(x) ifelse(length(x)>2,TRUE,FALSE)) # rows with atleast 3 words
data <- lapply(data[idx3],function(x) { paste(x[1:(length(x)-2)],x[2:(length(x)-1)],x[3:length(x)]) })
}
# Count unique n-grams
if(mat==1) { # simulates term-document-matrix
L <- length(data)
unique_ngrams <- unique(unlist(data))
ngram <- matrix(0,length(unique_ngrams),L); rownames(ngram) <- unique_ngrams; colnames(ngram) <- 1:L
for(i in 1:L) { ns <- table(data[[i]]); ngram[names(ns),i] <- ns }
}else { # obtains ngram counts
ngram <- sort(table(unlist(data)),decreasing = TRUE)
}
ngram
}
Function to compute bigram MLE of a test data set. It can also compute MLE against a dictionary which can be used to predict the next word.
bigram_mle <- function(D,u.model,b.mle,flag=0) {
D[!(D %in% rownames(u.model))] <- "UNK";
L <- length(D)
if(flag == 0) {
Y <- D[1:(L-1)]
Z <- D[2:L]
} else { # special case for predicted word
Z <- rownames(u.model); Z <- Z[1:(length(Z)-1)] # Map against dictionary
Y <- rep(D[L],length(Z))
}
YZ <- paste(Y,Z)
id1 <- (YZ %in% rownames(b.model))
id2 <- (!id1) & (Y != "UNK")
id3 <- (!id1) & (Y == "UNK")
a <- matrix(NA,length(YZ),1)
if(sum(id1)>0) { a[id1] <- b.model[YZ[id1],2] }
if(sum(id2)>0) { a[id2] <- log(1/u.model[Y[id2],1]) }
if(sum(id3)>0) { a[id3] <- u.model["UNK",2] }
rownames(a) <- YZ; colnames(a) <- "MLE"
a
}
Function to compute trigram MLE of a test data set. It can also compute MLE against a dictionary which can be used to predict the next word.
trigram_mle <- function(D,u.model,b.model,t.model,flag = 0) {
D[!(D %in% rownames(u.model))] <- "UNK"; L <- length(D)
if(flag == 0) {
P <- D[1:(L-2)]; Q <- D[2:(L-1)]; R <- D[3:L]
}else { # special case for predicted word
R <- rownames(u.model); R <- R[1:(length(R)-1)] # Map against dictionary
P <- rep(D[L-1],length(R))
Q <- rep(D[L],length(R))
}
PQR <- paste(P,Q,R)
id1 <- (PQR %in% rownames(t.model))
PQ <- paste(P,Q)
id2 <- (!id1) & (PQ %in% rownames(b.model))
id3 <- (!id1) & (!(PQ %in% rownames(b.model))) & (P != "UNK")
id4 <- (!id1) & (!(PQ %in% rownames(b.model))) & (P == "UNK")
a <- matrix(NA,length(PQR),1)
if(sum(id1)>0) { a[id1] <- t.model[PQR[id1],2] }
if(sum(id2)>0) { a[id2] <- log(1/b.model[PQ[id2],1]) }
if(sum(id3)>0) { a[id3] <- log(1/u.model[P[id3],1]) }
if(sum(id4)>0) { a[id4] <- u.model["UNK",2] }
rownames(a) <- PQR; colnames(a) <- "MLE"
a
}
Function to compute perplexity of a test data set. Loops are unrolled for fast performance
ngram_perplexity <- function(data,u.model,b.model,t.model,n) {
D <- ngram_tokenize(data); D <- unlist(D);
if(n==1) { # Unigram
D[!(D %in% rownames(u.model))] <- "UNK"
a <- u.model[D,2]
n_perp <- exp(-sum(a)/length(a))
}
if(n==2) { # Bigram model
a <- bigram_mle(D,u.model,b.model)
n_perp <- exp(-sum(a)/length(a))
}
if(n==3) { # Trigram model
a <- trigram_mle(D,u.model,b.model,t.model)
n_perp <- exp(-sum(a)/length(a))
}
n_perp
}
Function to predict the next word given a sequence of words
ngram_predict <- function(data) {
D <- ngram_tokenize(data) # Tokenize
D <- unlist(D); D[!(D %in% names(u.mle))] <- "UNK" # mark the ones not in dictionary as UNK
L <- length(D)
if(L==0) { P <- "UNK"; Q <- "UNK"}
if(L==1) { P <- "UNK"; Q <- D[L] }
if(L>1) { P <- D[L-1]; Q <- D[L] }
R <- names(u.mle); R <- R[-length(R)] # Map against dictionary
PQR <- paste(P,Q,R)
id1 <- (PQR %in% names(t.mle))
PQ <- paste(P,Q)
QR <- paste(Q,R)
id2 <- (!id1) & (PQ %in% names(b.mle)) & (QR %in% names(b.mle))
id3 <- (!id1) & (PQ %in% names(b.mle)) & (!(QR %in% names(b.mle)))
id4 <- (!id1) & (!(PQ %in% names(b.mle))) & (QR %in% names(b.mle))
id5 <- (!id1) & (!(PQ %in% names(b.mle))) & (!(QR %in% names(b.mle)))
a <- matrix(NA,length(PQR),1)
if(sum(id1)>0) { a[id1] <- t.mle[PQR[id1]] }
if(sum(id2)>0) { a[id2] <- b.mle[PQ] + b.mle[QR[id2]] + log(2*0.4) }
if(sum(id3)>0) { a[id3] <- b.mle[PQ] + u.mle[Q] + u.mle[R[id3]] + log(3*0.4) }
if(sum(id4)>0) { a[id4] <- u.mle[P] + u.mle[Q] + b.mle[QR[id4]] + log(3*0.4) }
if(sum(id5)>0) { a[id5] <- u.mle[P] + u.mle[Q] + u.mle[R[id5]] + log(4*0.4) }
pred_word <- R[which.max(a)]
}