The goal of this project is to build an application that predicts the next word in a sentence from the words that have already been typed. With mobile devices increasingly becoming the device of choice for email, social networking and similar tasks, helping users type better with a word prediction application has gained relevance.
This milestone report explores how to work with the dataset available to us and draws insights about word statistics and relations from the data.
The data used in this project is available at Capstone Data. It contains text from news, blogs and Twitter and is available in four languages: English, French, Russian and German. For the purpose of this exercise, I will be using the English dataset.
Considering that the language used on Twitter is very different from the language used in blogs or news, I will be analyzing the three forms of data separately before combining them.
The three text files are loaded separately. To understand the structure of the data and get an idea of the basic statistics, the files are processed to extract information about how many lines they have, how many words they have and the number of unique words in each.
suppressWarnings(library(tm))
## Loading required package: NLP
suppressWarnings(library(SnowballC))
suppressWarnings(library(RWeka))
suppressWarnings(library(knitr))
suppressWarnings(library(wordcloud))
## Loading required package: RColorBrewer
suppressWarnings(library(RColorBrewer))
# Read each English file in full (n = -1) as UTF-8 text
blog = readLines("en_US.blogs.txt",n=-1,warn=FALSE,encoding="UTF-8")
news = readLines("en_US.news.txt",n=-1,warn=FALSE,encoding="UTF-8")
twitter = readLines("en_US.twitter.txt",n=-1,warn=FALSE,encoding="UTF-8")
blog_l <- length(blog)
news_l <- length(news)
twitter_l <- length(twitter)
news_words = strsplit(news,"\\W+",perl=TRUE)
news_words = unlist(news_words)
num_news_words = length(news_words)
news_uniq = length(unique(news_words))
blog_words = strsplit(blog,"\\W+",perl=TRUE)
blog_words = unlist(blog_words)
num_blog_words = length(blog_words)
blog_uniq = length(unique(blog_words))
twitter_words = strsplit(twitter,"\\W+",perl=TRUE)
twitter_words = unlist(twitter_words)
num_twitter_words = length(twitter_words)
twitter_uniq = length(unique(twitter_words))
# Every column lists its values in row order: Blog, News, Twitter
df_stats = data.frame("Number of Lines"=c(blog_l,news_l,twitter_l),
                      "Number of words"=c(num_blog_words,num_news_words,num_twitter_words),
                      "Number of Unique words"=c(blog_uniq,news_uniq,twitter_uniq))
row.names(df_stats) = c("Blog","News","Twitter")
df_stats
##         Number.of.Lines Number.of.words Number.of.Unique.words
## Blog             899288        38378991                 342790
## News              77259         2754345                  90443
## Twitter         2360148        31151206                 439291
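The three per-file computations above follow the same steps, so they can be folded into one helper. The sketch below is only an illustration: `file_stats` is a hypothetical name that does not appear in the original code, and the demo runs on a small temporary file rather than the real data. It also assumes that reading through a binary-mode connection is acceptable; this is sometimes suggested when a text-mode read stops early at an embedded control character, which may or may not apply to these files.

```r
# Hypothetical helper (not part of the original report): summarise one text file.
file_stats <- function(path) {
  # Assumption: a binary-mode connection avoids a text-mode read ending early
  # at an embedded control character on some platforms.
  con <- file(path, open = "rb")
  on.exit(close(con))
  lines <- readLines(con, warn = FALSE, encoding = "UTF-8", skipNul = TRUE)
  # Same tokenisation as above: split on runs of non-word characters
  words <- unlist(strsplit(lines, "\\W+", perl = TRUE))
  words <- words[nzchar(words)]  # drop empty strings left by leading punctuation
  c(lines = length(lines),
    words = length(words),
    unique_words = length(unique(words)))
}

# Toy demonstration on a temporary file
tmp <- tempfile(fileext = ".txt")
writeLines(c("Hello, world!", "hello again world"), tmp)
stats <- file_stats(tmp)
print(stats)
```

Because empty strings are dropped here, word counts from this helper can differ slightly from the ones reported in the table.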
As the table shows, the raw files contain an extremely large number of lines and words, or "tokens". To further understand the frequencies and relations of words in the corpus, I will be building bigrams and trigrams. With this much data, however, the available memory cannot support all the computations, so it is better to train the model on a smaller sample of the text. For further processing, I will use a random sample of 5% of the lines in each file and build my n-gram model from it.
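A random sample changes on every run unless the random seed is fixed. The sketch below shows one way to draw a reproducible 5% sample; the helper name `sample_lines` and the seed value are assumptions made for illustration, and toy data stands in for the real files.

```r
# Hypothetical helper: draw a fixed fraction of the lines in a character vector
sample_lines <- function(x, frac = 0.05) {
  n <- floor(length(x) * frac)   # whole number of lines to keep
  x[sample.int(length(x), n)]    # sample without replacement
}

toy <- paste("line", 1:200)      # stand-in for one of the text files

set.seed(2024)                   # arbitrary seed, chosen only for this example
s1 <- sample_lines(toy)
set.seed(2024)                   # same seed again gives the same sample
s2 <- sample_lines(toy)

print(length(s1))                # 10 lines, i.e. 5% of 200
print(identical(s1, s2))         # TRUE
```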
The first step is to create a corpus for analyzing the text. As mentioned earlier, I will be analyzing the three texts separately because each type of data differs in nature. Rather than deleting profanity from the text now, I plan to filter it out in the final step of the model, once the n-grams have been generated.
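Filtering profanity after the n-grams exist amounts to dropping every n-gram that contains a listed word. The following is a minimal sketch of that idea, not the filter that will finally be used: the word list `banned` is a harmless placeholder and the n-gram table is toy data laid out like the frequency data frames built later in this report.

```r
# Placeholder word list; a real profanity list would be loaded from a file
banned <- c("darn", "heck")

# Toy n-gram table with the same columns as the data frames used in this report
ngrams <- data.frame(
  word = c("what the heck", "one of the", "darn it", "heckle the"),
  Frequency = c(5, 51, 3, 2),
  stringsAsFactors = FALSE
)

# Match banned words only as whole words, so "heckle" is not caught by "heck"
pattern <- paste0("\\b(", paste(banned, collapse = "|"), ")\\b")
clean_ngrams <- ngrams[!grepl(pattern, ngrams$word, perl = TRUE), ]
print(clean_ngrams$word)         # "one of the" "heckle the"
```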
news_sample = sample(news,news_l*0.05)
blog_sample = sample(blog,blog_l*0.05)
twitter_sample = sample(twitter,twitter_l*0.05)
news_sample = gsub("[^[:alpha:][:space:]]","",news_sample)
blog_sample = gsub("[^[:alpha:][:space:]]","",blog_sample)
twitter_sample = gsub("[^[:alpha:][:space:]]","",twitter_sample)
ncorp = Corpus(VectorSource(news_sample))
bcorp = Corpus(VectorSource(blog_sample))
tcorp = Corpus(VectorSource(twitter_sample))
Next, the three corpora are cleaned. I also create a version of each corpus with stopwords removed, in order to find the most frequent words among the unigrams. For bigrams and trigrams the stopwords are kept, as they are needed to form relevant word sequences.
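To make the effect of these cleaning steps concrete, here is a small base R sketch that mimics them on a single string. It is not the tm pipeline used in this report: `clean_text`, `remove_words` and the four-word stopword list are hypothetical stand-ins chosen for illustration.

```r
# Lower-case, keep only letters and whitespace, then collapse repeated spaces
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("[^[:alpha:][:space:]]", "", x)
  x <- gsub("[[:space:]]+", " ", x)
  trimws(x)
}

# Remove whole words listed in `words`, then tidy the leftover spaces
remove_words <- function(x, words) {
  pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")
  x <- gsub(pattern, "", x, perl = TRUE)
  trimws(gsub("[[:space:]]+", " ", x))
}

toy_stopwords <- c("the", "of", "to", "a")   # tiny stand-in for stopwords("english")

cleaned <- clean_text("The  end of 2015: I don't know!")
no_stop <- remove_words(cleaned, toy_stopwords)
print(cleaned)                   # "the end of i dont know"
print(no_stop)                   # "end i dont know"
```

The first result corresponds to the text used for bigrams and trigrams, the second to the stopword-free text used for unigram counts.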
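# Basic tm transformations; tm_reduce applies the list from right to left
funs = list(tolower,removeNumbers, stripWhitespace, removePunctuation)
# Drop any remaining character that is neither a letter nor whitespace
f = content_transformer(function(x) gsub("[^[:alpha:][:space:]]","",x))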
ncorp = tm_map(ncorp,FUN = tm_reduce, tmFuns = funs)
ncorp = tm_map(ncorp,PlainTextDocument)
ncorp = tm_map(ncorp,FUN = f)
ncorpUni = tm_map(ncorp,removeWords,stopwords("english"))
bcorp = tm_map(bcorp,FUN = tm_reduce, tmFuns = funs)
bcorp = tm_map(bcorp,PlainTextDocument)
bcorp = tm_map(bcorp,FUN = f)
bcorpUni = tm_map(bcorp,removeWords,stopwords("english"))
tcorp = tm_map(tcorp,FUN = tm_reduce, tmFuns = funs)
tcorp = tm_map(tcorp,PlainTextDocument)
tcorp = tm_map(tcorp,FUN = f)
tcorpUni = tm_map(tcorp,removeWords,stopwords("english"))
ntdm = TermDocumentMatrix(ncorpUni)
btdm = TermDocumentMatrix(bcorpUni)
ttdm = TermDocumentMatrix(tcorpUni)
topfreq = findFreqTerms(ntdm, lowfreq=100)
nFreq = sort(rowSums(as.matrix(ntdm[topfreq,])),decreasing=TRUE)
df_nuni = data.frame(words=names(nFreq),freq=nFreq)
topfreq = findFreqTerms(btdm, lowfreq=1000)
bFreq = sort(rowSums(as.matrix(btdm[topfreq,])),decreasing=TRUE)
df_buni = data.frame(words=names(bFreq),freq=bFreq)
topfreq = findFreqTerms(ttdm, lowfreq=1000)
tFreq = sort(rowSums(as.matrix(ttdm[topfreq,])),decreasing=TRUE)
df_tuni = data.frame(words=names(tFreq),freq=tFreq)
nFreq[1:10] #Unigram frequency in News
##   said    will     one     new    year    last    just     two     can    also
##    963     434     305     288     265     219     216     216     212     209
bFreq[1:10] #Unigram frequency in Blogs
##    one    will    just     can    like    time     get    know     now  people
##   6006    5589    4995    4851    4783    4428    3504    3064    2958    2918
tFreq[1:10] #Unigram frequency on Twitter
##   just    like     get    love    good    will     day     can  thanks    dont
##   7566    6098    5631    5372    5017    4676    4573    4532    4464    4427
# RWeka tokenizers that emit only two-word (bigram) or three-word (trigram) sequences
bigram = function(x) NGramTokenizer(x,Weka_control(min = 2, max = 2))
trigram = function(x) NGramTokenizer(x,Weka_control(min = 3, max = 3))
bitdm_news <- TermDocumentMatrix(ncorp, control = list(tokenize = bigram))
bitdm_blog <- TermDocumentMatrix(bcorp, control = list(tokenize = bigram))
bitdm_twitter <- TermDocumentMatrix(tcorp, control = list(tokenize = bigram))
tritdm_news <- TermDocumentMatrix(ncorp, control = list(tokenize = trigram))
tritdm_blog <- TermDocumentMatrix(bcorp, control = list(tokenize = trigram))
tritdm_twitter <- TermDocumentMatrix(tcorp, control = list(tokenize = trigram))
bifreq_news <- findFreqTerms(bitdm_news, lowfreq=100)
bifreq_blog <- findFreqTerms(bitdm_blog, lowfreq=100)
bifreq_twitter <- findFreqTerms(bitdm_twitter, lowfreq=100)
nbiFreq <- sort(rowSums(as.matrix(bitdm_news[bifreq_news,])),decreasing=TRUE)
bbiFreq <- sort(rowSums(as.matrix(bitdm_blog[bifreq_blog,])),decreasing=TRUE)
tbiFreq <- sort(rowSums(as.matrix(bitdm_twitter[bifreq_twitter,])),decreasing=TRUE)
trifreq_news <- findFreqTerms(tritdm_news, lowfreq=5)
trifreq_blog <- findFreqTerms(tritdm_blog, lowfreq=50)
trifreq_twitter <- findFreqTerms(tritdm_twitter, lowfreq=50)
ntriFreq <- sort(rowSums(as.matrix(tritdm_news[trifreq_news,])),decreasing=TRUE)
btriFreq <- sort(rowSums(as.matrix(tritdm_blog[trifreq_blog,])),decreasing=TRUE)
ttriFreq <- sort(rowSums(as.matrix(tritdm_twitter[trifreq_twitter,])),decreasing=TRUE)
df_nbigram = data.frame(word=names(nbiFreq),Frequency=nbiFreq)
df_bbigram = data.frame(word=names(bbiFreq),Frequency=bbiFreq)
df_tbigram = data.frame(word=names(tbiFreq),Frequency=tbiFreq)
nbiFreq #Bigram frequency in News
##   of the    in the   for the    to the    on the    at the     to be      in a
##      713       644       302       302       285       217       198       197
## with the   and the  from the      as a      of a   he said    it was   will be
##      184       181       150       146       132       125       125       123
##   with a     for a      is a    he was  that the    by the     and a      to a
##      118       111       111       109       106       103       102       102
bbiFreq[1:20] #Bigram frequency in Blogs
##   of the    in the    to the    on the     to be   for the   and the    i have
##     9260      7725      4269      3735      3279      2928      2836      2454
##    and i     i was     it is    at the    it was      is a      in a  with the
##     2422      2384      2349      2348      2316      2284      2257      2182
##     i am    that i  from the      of a
##     2092      1972      1902      1733
tbiFreq[1:20] #Bigram frequency on Twitter
##     in the     for the      of the      on the       to be      to the
##       3852        3665        2840        2476        2406        2236
## thanks for      at the      i love    going to      have a   thank you
##       2122        1934        1837        1718        1680        1653
##     if you      i have        i am      i dont       for a      to see
##       1647        1493        1476        1466        1428        1347
##       is a     will be
##       1345        1323
df_ntrigram = data.frame(word=names(ntriFreq),Frequency=ntriFreq)
df_btrigram = data.frame(word=names(btriFreq),Frequency=btriFreq)
df_ttrigram = data.frame(word=names(ttriFreq),Frequency=ttriFreq)
ntriFreq[1:20] #Trigram frequency in News
##       one of the          a lot of       part of the        be able to
##               51                49                28                22
##      going to be  according to the     for the first        out of the
##               20                19                19                19
##      some of the        the end of    the first time      in the first
##               19                19                18                17
##      of the year         is one of       said it was        as well as
##               17                15                15                14
##       at the end       in the same       more than a     of the season
##               14                14                14                14
btriFreq[1:20] #Trigram frequency in Blogs
##    one of the       a lot of     the end of    some of the       it was a
##           652            610            339            333            326
##    out of the        to be a     be able to     as well as      i want to
##           326            323            307            304            298
##   a couple of    i have been      i have to     there is a    part of the
##           297            260            260            259            256
##   i dont know       i have a  the fact that        it is a    the rest of
##           254            253            251            250            248
ttriFreq[1:20] #Trigram frequency on Twitter
##     thanks for the       thank you for  looking forward to
##               1160                 465                 435
##         i love you        cant wait to           i want to
##                421                 411                 393
##     for the follow         going to be            a lot of
##                368                 365                 324
##            to be a         im going to           i need to
##                305                 296                 295
##         one of the         i dont know          to see you
##                264                 256                 254
##       have a great            i have a           i have to
##                251                 243                 233
##         you have a            i wish i
##                233                 221
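The n-gram counts above come from RWeka tokenizers applied through tm. For readers who want to see what such a count involves, here is a dependency-free sketch in base R. The function names `make_ngrams` and `count_ngrams` are hypothetical, the input is assumed to be already cleaned text with words separated by spaces, and the three toy lines are invented for the example.

```r
# Split one cleaned line into its overlapping n-word sequences
make_ngrams <- function(line, n) {
  tokens <- strsplit(line, " +")[[1]]
  tokens <- tokens[nzchar(tokens)]                 # ignore empty tokens
  if (length(tokens) < n) return(character(0))     # line too short for an n-gram
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Count every n-gram over all lines, most frequent first
count_ngrams <- function(lines, n) {
  sort(table(unlist(lapply(lines, make_ngrams, n = n))), decreasing = TRUE)
}

toy <- c("thanks for the follow", "thanks for the help", "one of the best")
bi  <- count_ngrams(toy, 2)
tri <- count_ngrams(toy, 3)
print(bi)
print(tri)
```

In this toy input, "thanks for" and "for the" each occur twice and "thanks for the" is the only trigram seen more than once.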
For the prediction model, I plan to implement a simple back-off method first. From the counts of unigrams, bigrams and trigrams that have already been built, we can estimate the probabilities of the various trigrams, bigrams and unigrams. The most probable trigram will be used for next-word prediction. In the absence of a matching trigram, the model will back off to bigrams, and so on. I also hope to implement a smoothing method such as Kneser-Ney, as it is one of the most frequently used smoothing techniques in next-word prediction algorithms.
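The back-off idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the final model: it ranks candidates by raw counts at the highest n-gram order that has a match and applies no discounting or back-off weights. The function names and the tiny count vectors are hypothetical, with the vectors laid out like the named frequency vectors printed earlier.

```r
# Toy counts shaped like the named frequency vectors in this report
tri <- c("one of the" = 51, "a lot of" = 49, "one of my" = 7)
bi  <- c("of the" = 713, "in the" = 644, "of a" = 132)
uni <- c("said" = 963, "will" = 434)

# Up to k words that follow `prefix` in the n-gram counts `freq`, by count
candidates <- function(prefix, freq, k) {
  key  <- paste0(prefix, " ")
  hits <- sort(freq[startsWith(names(freq), key)], decreasing = TRUE)
  head(substring(names(hits), nchar(key) + 1), k)
}

# Try trigrams first, then bigrams, then fall back to the top unigrams
predict_next <- function(text, tri, bi, uni, k = 3) {
  words <- strsplit(tolower(trimws(text)), " +")[[1]]
  n <- length(words)
  if (n >= 2) {
    out <- candidates(paste(words[n - 1], words[n]), tri, k)
    if (length(out) > 0) return(out)
  }
  if (n >= 1) {
    out <- candidates(words[n], bi, k)
    if (length(out) > 0) return(out)
  }
  head(names(sort(uni, decreasing = TRUE)), k)
}

p1 <- predict_next("He is one of", tri, bi, uni)   # trigram match: "the" "my"
p2 <- predict_next("part of", tri, bi, uni)        # backs off to bigrams: "the" "a"
p3 <- predict_next("hello", tri, bi, uni)          # backs off to unigrams: "said" "will"
print(list(p1, p2, p3))
```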
In the Shiny app, users will be able to type in a sentence and, based on the last word typed, the three most probable next words will be displayed in the output. If I manage to implement Kneser-Ney smoothing, I would like to offer an option that lets users switch between the back-off and Kneser-Ney algorithms to see how the predictions vary.
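For the Kneser-Ney option, the sketch below shows interpolated Kneser-Ney for bigrams only. Each bigram count is reduced by a fixed discount D, and the freed probability mass is redistributed according to how many distinct words each candidate follows (its continuation probability). Everything here is an assumption made for illustration: the function name `kn_bigram`, the discount D = 0.75, and the toy table, which must hold one row per distinct bigram.

```r
# Toy bigram table: one row per distinct (w1, w2) pair with its count n
bigrams <- data.frame(
  w1 = c("of", "of", "in", "in", "to"),
  w2 = c("the", "a", "the", "a", "the"),
  n  = c(6, 2, 4, 1, 3),
  stringsAsFactors = FALSE
)

# Interpolated Kneser-Ney probabilities of each next word given `prev`
kn_bigram <- function(prev, bigrams, D = 0.75) {
  vocab <- unique(bigrams$w2)
  # Continuation probability: share of distinct bigram types ending in w
  p_cont <- vapply(vocab,
                   function(w) as.numeric(sum(bigrams$w2 == w)),
                   numeric(1)) / nrow(bigrams)
  ctx <- bigrams[bigrams$w1 == prev, ]
  if (nrow(ctx) == 0) return(sort(p_cont, decreasing = TRUE))  # unseen context
  total  <- sum(ctx$n)
  lambda <- D * nrow(ctx) / total          # mass freed by discounting
  discounted <- setNames(numeric(length(vocab)), vocab)
  discounted[ctx$w2] <- pmax(ctx$n - D, 0) / total
  sort(discounted + lambda * p_cont, decreasing = TRUE)
}

p <- kn_bigram("of", bigrams)
print(p)        # the: 0.76875, a: 0.23125
print(sum(p))   # probabilities sum to 1
```

A full model would extend the same recursion to trigrams and estimate D from the data.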