With advances in computing power and data science techniques, natural language processing (NLP) now relies heavily on machine learning algorithms and statistical learning methods. These algorithms and methods take as input a large set of ‘features’ generated from blocks of text. The bag-of-words and n-gram models are among the most commonly used methods for predicting the next word in a sentence.
The bag-of-words model will be used to generate features from the data. The most common feature is term frequency, namely the number of times a term/word appears in a line of text. Term frequency alone is not the best representation of the text, because it discards word order.
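As a minimal sketch of this feature, term frequencies for a single (made-up) line can be computed with base R’s table():
# toy example (made-up sentence): term frequency via table()
ex.line <- "the cat sat on the mat because the mat was warm"
ex.tokens <- strsplit(tolower(ex.line), split = " ")[[1]]
sort(table(ex.tokens), decreasing = TRUE)  # "the" occurs 3 times, "mat" twice, the rest once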
The n-gram method of prediction retains more of the information within the text: an n-gram model predicts each word from the (n-1) words that precede it. This model is widely used in NLP. Each n-gram is composed of n words, i.e. a 1-gram is one word, a 2-gram is two sequential words, a 3-gram is three sequential words, etc. The bag-of-words model can be considered a 1-gram model.
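As a toy illustration (the sentence is made up), the 1-, 2-, and 3-grams of a line are simply adjacent tokens pasted together:
# toy illustration (made-up sentence): n-grams are pasted adjacent tokens
toks <- strsplit("i love new york city", split = " ")[[1]]
toks                                    # 1-grams: "i" "love" "new" "york" "city"
paste(head(toks, -1), tail(toks, -1))   # 2-grams: "i love" "love new" "new york" "york city"
paste(head(toks, -2), toks[-c(1, length(toks))], tail(toks, -2))  # 3-grams: "i love new" ...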
For the purpose of this exercise, we’ll use the dataset provided for this project. SwiftKey is a corporate partner for this project, and the dataset is a zip file containing blog posts, news articles, and Twitter tweets. Here are some statistics about the corpus.
| File | Lines | Words |
|---|---|---|
| en_US.twitter.txt | 2,360,149 | 30,373,583 |
| en_US.blogs.txt | 899,289 | 3,733,413 |
| en_US.news.txt | 77,260 | 2,643,969 |
As the data files are too big to analyze on a machine with 8 GB of RAM and an i7 processor, I randomly picked 15K lines from the en_US.news.txt file. Here are some of the most frequently used 1-gram words from the sample data.
The initial exploration via simple visualization and the 1-gram model indicated that the raw data required a number of transformations before it could be used for n-gram modeling. For the purpose of building simple 2-gram and 3-gram models, the following assumptions and data cleanups were chosen:

- convert all text to lowercase
- remove punctuation and digits
- collapse repeated whitespace
- optionally drop standard English stop words
We’ll use the R text mining (tm) package along with base R’s grep/gsub functions to perform the cleanup above; a sketch follows. Understanding the distribution of the word tokens helps shape our expectations. As part of this assignment, we’ll work on a 3-gram model. The basic building blocks of the model are unigrams (n=1), bigrams (n=2), and trigrams (n=3). The sample contains fewer unique unigrams (31,681) than bigrams (195,847) or trigrams (307,742).
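A minimal sketch of the cleanup using tm’s helper functions (the function name clean.lines is ours; the set of steps mirrors the regex cleanup in the appendix):
library(tm)
# sketch: the cleanup as a reusable function over a character vector of lines
clean.lines <- function(lines) {
  lines <- tolower(lines)            # normalize case
  lines <- removePunctuation(lines)  # drop punctuation
  lines <- removeNumbers(lines)      # drop digits
  stripWhitespace(lines)             # collapse repeated spaces
}
clean.lines("He said: 'Call me at 10 AM!'")  # "he said call me at am"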
The following plots show the distribution of frequencies for the bigrams and trigrams.
For additional plots showing frequent n-gram words, please see the appendix.
After tokenizing the sample data and creating the n-gram model, we can predict the next word using the nextword function defined in the appendix. Sample calls and their output follow.
nextword("thank you")
nextword("first")
nextword("one of the")
## tokens Freq
## 242084 thank you for 4
## 242083 thank you cleveland 1
## 242085 thank you muny 1
## 242086 thank you or 1
## 242087 thank you to 1
## tokens Freq
## 63710 first time 47
## 63568 first half 15
## 63666 first round 14
## 63716 first two 13
## 63706 first three 11
## tokens Freq
## 201865 one of the most 15
## 201829 one of the best 6
## 201833 one of the biggest 6
## 201852 one of the first 4
## 201866 one of the nations 4
Next steps are to use the tm package to clean the data and to apply smoothing methods to the prediction model; a sketch of one option follows. We can also use a Shiny application to take input text and display the list of predicted next words.
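One option is add-one (Laplace) smoothing, which keeps unseen n-grams from getting zero probability. A minimal sketch for bigrams (the function name and the counts in the example call are hypothetical; 31681 is the unigram count reported above):
# sketch of add-one (Laplace) smoothing for a bigram "w1 w2":
# P(w2 | w1) = (count(w1 w2) + 1) / (count(w1) + V), with V = vocabulary size
laplace.bigram <- function(freq2, freq1, V) {
  (freq2 + 1) / (freq1 + V)
}
laplace.bigram(0, 100, 31681)  # hypothetical counts: an unseen bigram keeps prob > 0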
library(ggplot2)  # for the frequency plots below
library(tm)       # only needed for the (commented) stop-word removal option
rnd.lines <- sample(1:77000, 10000)  # line numbers to sample from the news file
cnt <- 0        # current line number; must exist before the loop increments it
tokens <- NULL  # accumulated word tokens
#con=file('capstone/final/en_US/en_US.twitter.txt',"r")
#con=file('capstone/final/en_US/en_US.blogs.txt',"r")
con=file('capstone/final/en_US/en_US.news.txt',"r")
#length(readLines(con,warn=FALSE ))
while (TRUE) {
  text <- readLines(con, n = 1, skipNul = TRUE)
  # if end of file, exit out of the loop
  if (length(text) == 0) {
    break
  }
  cnt <- cnt + 1
  if (cnt %in% rnd.lines) {
    # keep this line's tokens; the commented variant also removes standard stop words
    #tokens <- c(tokens, strsplit(stripWhitespace(removeWords(text, stopwords("en"))), split = " ")[[1]])
    tokens <- c(tokens, strsplit(text, split = " ")[[1]])
  }
}
close(con)
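A sketch of a simpler alternative, assuming the whole file fits in memory: read every line once and subset the sampled line numbers directly.
# simpler alternative if the file fits in memory: read all lines, then subset
all.lines <- readLines('capstone/final/en_US/en_US.news.txt', skipNul = TRUE, warn = FALSE)
tokens <- unlist(strsplit(all.lines[rnd.lines], split = " "))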
# lowercase everything and strip punctuation and digits from the tokens
all.tokens <- tolower(gsub("[[:punct:]]", "", tokens))
all.tokens <- gsub("[[:digit:]]", "", all.tokens)
all.len <- length(all.tokens)
#### construct 1-gram.. up to 4-gram words from all tokens..
word.dist <- data.frame(table(all.tokens))  # unigram frequency table
word.dist <- cbind(word.dist, prob = word.dist$Freq / length(all.tokens))
word.dist <- word.dist[order(word.dist$Freq, decreasing = TRUE), ]
two.gram <- NULL
three.gram <- NULL
four.gram <- NULL
# plot the 20 most frequent unigrams, ordered by frequency
g <- ggplot(word.dist[1:20, ], aes(x = reorder(all.tokens, Freq), y = Freq)) + geom_bar(stat = 'identity', position = 'dodge')
g <- g + ggtitle("1-gram") + xlab("token") + coord_flip()
g <- g + geom_text(aes(label = Freq), position = position_dodge(.5), hjust = 0, color = 'black')
g
# build 2-, 3-, and 4-grams by pasting adjacent tokens; the loop stops
# 3 short of the end so that all.tokens[i + 3] stays in range
for (i in 1:(all.len - 3)) {
  two.gram[i]   <- paste(all.tokens[i], all.tokens[i + 1])
  three.gram[i] <- paste(two.gram[i], all.tokens[i + 2])
  four.gram[i]  <- paste(three.gram[i], all.tokens[i + 3])
}
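The same n-grams can be built without an explicit loop by pasting shifted copies of the token vector, which is much faster on large samples; a sketch:
# vectorized equivalent of the loop above
two.gram   <- paste(all.tokens[1:(all.len - 1)], all.tokens[2:all.len])
three.gram <- paste(two.gram[1:(all.len - 2)], all.tokens[3:all.len])
four.gram  <- paste(three.gram[1:(all.len - 3)], all.tokens[4:all.len])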
# n-gram frequency tables; prob is each n-gram's share of the total n-gram count
word.dist2 <- data.frame(table(two.gram))
word.dist2 <- cbind(word.dist2, prob = word.dist2$Freq / sum(word.dist2$Freq))
word.dist2 <- word.dist2[order(word.dist2$Freq, decreasing = TRUE), ]
word.dist3 <- data.frame(table(three.gram))
word.dist3 <- cbind(word.dist3, prob = word.dist3$Freq / sum(word.dist3$Freq))
word.dist3 <- word.dist3[order(word.dist3$Freq, decreasing = TRUE), ]
word.dist4 <- data.frame(table(four.gram))
word.dist4 <- cbind(word.dist4, prob = word.dist4$Freq / sum(word.dist4$Freq))
word.dist4 <- word.dist4[order(word.dist4$Freq, decreasing = TRUE), ]
# store the token column as character (not factor) for the string matching below
word.dist2$two.gram <- as.character(word.dist2$two.gram)
word.dist3$three.gram <- as.character(word.dist3$three.gram)
word.dist4$four.gram <- as.character(word.dist4$four.gram)
## all tokens with 15K sample (news file) 392172
### function to get nextword
names(word.dist)  <- c("tokens", "Freq", "prob")
names(word.dist2) <- c("tokens", "Freq", "prob")
names(word.dist3) <- c("tokens", "Freq", "prob")
names(word.dist4) <- c("tokens", "Freq", "prob")
## word prediction function: match the input against the n-gram tables,
## backing off to a shorter context when there are too few matches
nextword <- function(txt) {
  txt <- tolower(txt)
  words <- strsplit(txt, split = " ")[[1]]
  nwords <- length(words)
  wlist <- NULL
  if (nwords > 3) {
    return('Error')
  }
  if (nwords == 3) {
    # 4-grams whose first three words match the input exactly
    # (the trailing space keeps "thank you" from matching "thank yourself")
    wlist <- word.dist4[grep(paste0("^", txt, " "), word.dist4$tokens), 1:2]
    if (nrow(wlist) == 0) {
      txt <- paste(words[2], words[3])  # back off to the last two words
      nwords <- 2
    }
  }
  if (nwords == 2) {
    wlist <- word.dist3[grep(paste0("^", txt, " "), word.dist3$tokens), 1:2]
    if (nrow(wlist) < 6) {
      txt <- ifelse(is.na(words[3]), words[2], words[3])  # back off to the last word
      nwords <- 1
    }
  }
  if (nwords == 1) {
    wlist <- rbind(wlist, word.dist2[grep(paste0("^", txt, " "), word.dist2$tokens), 1:2])
    if (nrow(wlist) < 6) {
      wlist <- rbind(wlist, word.dist[1:5, 1:2])  # pad with the top unigrams
    }
  }
  return(wlist[1:min(5, nrow(wlist)), ])
}
nextword("thank you")
nextword("the first")
nextword("one of the")