Coursera's Data Science Capstone Project

Milestone Report

Background

The goal of this project is to build an application that predicts the next word in a sentence from the words that have already been typed. As mobile devices increasingly become the device of choice for email, social networking and similar tasks, a word prediction application that helps users type more efficiently has gained relevance.

This milestone report aims to understand how to work with the available dataset and to gain insights into word statistics and relations in the data.

The data used in this project is available at Capstone Data. It contains text from news, blogs and twitter and is available in four languages: English, German, Finnish and Russian. For the purpose of this exercise, I will be using the English dataset.

Considering that the language used on twitter is very different from the language used in blogs or news, I will be analyzing the three forms of data separately before combining them.

Loading the data and basic processing

The three text files are loaded separately. To understand the structure of the data and get an idea of the basic statistics, the files are processed to extract information about how many lines they have, how many words they have and the number of unique words in each.

suppressWarnings(library(tm))

## Loading required package: NLP

suppressWarnings(library(SnowballC))
suppressWarnings(library(RWeka))
suppressWarnings(library(knitr))
suppressWarnings(library(wordcloud))

## Loading required package: RColorBrewer

suppressWarnings(library(RColorBrewer))

blog = readLines("en_US.blogs.txt",n=-1,warn=FALSE,encoding="UTF-8")
news = readLines("en_US.news.txt",n=-1,warn=FALSE,encoding="UTF-8")
twitter = readLines("en_US.twitter.txt",n=-1,warn=FALSE,encoding="UTF-8")

blog_l <- length(blog)
news_l <- length(news)
twitter_l <- length(twitter)

news_words = strsplit(news,"\\W+",perl=TRUE)
news_words = unlist(news_words)
num_news_words = length(news_words)
news_uniq = length(unique(news_words))

blog_words = strsplit(blog,"\\W+",perl=TRUE)
blog_words = unlist(blog_words)
num_blog_words = length(blog_words)
blog_uniq = length(unique(blog_words))

twitter_words = strsplit(twitter,"\\W+",perl=TRUE)
twitter_words = unlist(twitter_words)
num_twitter_words = length(twitter_words)
twitter_uniq = length(unique(twitter_words))
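The three near-identical blocks above could be folded into one helper. The following is a minimal sketch, not the code used to produce this report: `text_stats` and the toy input are hypothetical names introduced for illustration. Unlike the code above, it also drops the empty strings that `strsplit` emits when a line starts with punctuation, so its word counts would come out slightly lower on the real files.

```r
# Hypothetical helper: line, token and unique-token counts for one character vector.
text_stats <- function(lines) {
  tokens <- unlist(strsplit(lines, "\\W+", perl = TRUE))
  tokens <- tokens[nzchar(tokens)]  # drop empty strings left by leading punctuation
  c(lines = length(lines),
    words = length(tokens),
    unique_words = length(unique(tokens)))  # case-sensitive, as in the code above
}

# Toy input standing in for the real blog, news and twitter vectors
toy <- list(
  Blog    = c("The cat sat.", "The dog sat too!"),
  News    = c("\"Rain\" is expected today."),
  Twitter = c("so good", "so so good")
)

# One row per source, one column per statistic
stats <- t(sapply(toy, text_stats))
print(stats)
```

Because every row comes from the same function call, the columns cannot end up in a different source order from the row names.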

# Every column lists the sources in the same order as the row names: Blog, News, Twitter
df_stats = data.frame("Number of Lines"=c(blog_l,news_l,twitter_l),
                      "Number of words"=c(num_blog_words,num_news_words,num_twitter_words),
                      "Number of Unique words"=c(blog_uniq,news_uniq,twitter_uniq))
row.names(df_stats) = c("Blog","News","Twitter")

Summary of basic statistics of the dataset

##         Number.of.Lines Number.of.words Number.of.Unique.words
## Blog             899288        38378991                 342790
## News              77259         2754345                  90443
## Twitter         2360148        31151206                 439291

As the table shows, the raw files contain a very large number of lines and words, or “tokens”. To further understand the frequencies and relations of words in the corpus, I will build bigrams and trigrams. With this much data, however, the n-gram computations would not fit in the available memory, so it is better to train the model on a smaller sample of the text. For further processing, I will use a random sample of 5% of each file and build the n-gram model from that sample.
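Since the sample is random, the n-gram counts change from run to run unless the random seed is fixed. A minimal sketch of a reproducible 5% sample follows; `sample_lines` and the seed value are assumptions made for illustration, and the figures reported here were not generated with this seed.

```r
# Hypothetical helper: draw a reproducible random sample of lines.
sample_lines <- function(lines, fraction = 0.05, seed = 1234) {
  set.seed(seed)                          # arbitrary seed, only so reruns match
  n <- floor(length(lines) * fraction)    # sample size must be a whole number
  sample(lines, n)
}

toy <- paste("line", 1:200)   # stand-in for one of the text files
s1 <- sample_lines(toy)
s2 <- sample_lines(toy)

length(s1)          # 10, i.e. 5% of 200 lines
identical(s1, s2)   # TRUE: same seed, same sample
```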

Cleaning data and building n grams

The first step is creating a corpus to analyze the text. As mentioned earlier, I will be analyzing the three texts separately because each type of data is different in nature. For profanity filtering, instead of deleting profane words from the text now, I plan to filter them out in the final step of the model, once the n-grams have been generated.

news_sample = sample(news,news_l*0.05)
blog_sample = sample(blog,blog_l*0.05)
twitter_sample = sample(twitter,twitter_l*0.05)

news_sample = gsub("[^[:alpha:][:space:]]","",news_sample)
blog_sample = gsub("[^[:alpha:][:space:]]","",blog_sample)
twitter_sample = gsub("[^[:alpha:][:space:]]","",twitter_sample)

ncorp = Corpus(VectorSource(news_sample))
bcorp = Corpus(VectorSource(blog_sample))
tcorp = Corpus(VectorSource(twitter_sample))
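The deferred profanity filter mentioned above could work directly on the n-gram tables once they exist. This is only a sketch under assumptions: `filter_ngrams` is a hypothetical helper, the block list is a placeholder rather than a real profanity list, and the data frame mirrors the `word` / `Frequency` layout used for the n-gram data frames later in this report.

```r
# Hypothetical helper: drop n-grams that contain any word from a block list.
filter_ngrams <- function(df, banned) {
  has_banned <- vapply(
    strsplit(as.character(df$word), " ", fixed = TRUE),
    function(tokens) any(tokens %in% banned),   # TRUE if any token is banned
    logical(1)
  )
  df[!has_banned, , drop = FALSE]
}

# Toy n-gram table; "darn" stands in for an entry of a real profanity list
ngrams <- data.frame(word = c("of the", "darn it", "to be"),
                     Frequency = c(713, 5, 198),
                     stringsAsFactors = FALSE)
clean <- filter_ngrams(ngrams, banned = c("darn"))
clean$word
```

Filtering at this stage keeps the original word sequences intact while the n-grams are counted, and only suppresses offending entries before they can be offered as predictions.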

Next, the three corpora are cleaned. I create a version of each corpus with stopwords removed, to understand which words are most frequent as unigrams. For bigrams and trigrams the stopwords are kept, as they are needed to form meaningful word sequences.

funs = list(tolower,removeNumbers, stripWhitespace, removePunctuation)
f = content_transformer(function(x) gsub("[^[:alpha:][:space:]]","",x))

ncorp = tm_map(ncorp,FUN = tm_reduce, tmFuns = funs)
ncorp = tm_map(ncorp,PlainTextDocument)
ncorp = tm_map(ncorp,FUN = f)
ncorpUni = tm_map(ncorp,removeWords,stopwords("english"))

bcorp = tm_map(bcorp,FUN = tm_reduce, tmFuns = funs)
bcorp = tm_map(bcorp,PlainTextDocument)
bcorp = tm_map(bcorp,FUN = f)
bcorpUni = tm_map(bcorp,removeWords,stopwords("english"))

tcorp = tm_map(tcorp,FUN = tm_reduce, tmFuns = funs)
tcorp = tm_map(tcorp,PlainTextDocument)
tcorp = tm_map(tcorp,FUN = f)
tcorpUni = tm_map(tcorp,removeWords,stopwords("english"))

ntdm = TermDocumentMatrix(ncorpUni)
btdm = TermDocumentMatrix(bcorpUni)
ttdm = TermDocumentMatrix(tcorpUni)

topfreq = findFreqTerms(ntdm, lowfreq=100)
nFreq = sort(rowSums(as.matrix(ntdm[topfreq,])),decreasing=TRUE)
df_nuni = data.frame(words=names(nFreq),freq=nFreq)

topfreq = findFreqTerms(btdm, lowfreq=1000)
bFreq = sort(rowSums(as.matrix(btdm[topfreq,])),decreasing=TRUE)
df_buni = data.frame(words=names(bFreq),freq=bFreq)

topfreq = findFreqTerms(ttdm, lowfreq=1000)
tFreq = sort(rowSums(as.matrix(ttdm[topfreq,])),decreasing=TRUE)
df_tuni = data.frame(words=names(tFreq),freq=tFreq)
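To illustrate why stopwords are removed only for the unigram view, here is a small base-R sketch. The five-word `stop_words` vector and the example sentence are made up for illustration; the report itself uses the much longer list from `stopwords("english")` in tm.

```r
# Tiny stand-in stopword list (the report uses tm's stopwords("english"))
stop_words <- c("the", "of", "to", "a", "in")

text   <- "the end of the day is the start of a long night"
tokens <- strsplit(text, " ", fixed = TRUE)[[1]]

# Unigram counts with and without stopwords
top_with_stops <- sort(table(tokens), decreasing = TRUE)
top_without    <- sort(table(tokens[!tokens %in% stop_words]), decreasing = TRUE)

names(top_with_stops)[1]   # "the" dominates when stopwords are kept
names(top_without)         # only content words remain
```

With stopwords kept, function words crowd out everything else in a frequency list, but a phrase such as "end of the day" cannot be reconstructed without them, which is why the bigram and trigram corpora retain them.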

List of most frequent words in news

## said will  one  new year last just  two  can also 
##  963  434  305  288  265  219  216  216  212  209

[Figure: most frequent words in news]

List of most frequent words in blog

##    one   will   just    can   like   time    get   know    now people 
##   6006   5589   4995   4851   4783   4428   3504   3064   2958   2918

[Figure: most frequent words in blogs]

List of most frequent words in twitter

##   just   like    get   love   good   will    day    can thanks   dont 
##   7566   6098   5631   5372   5017   4676   4573   4532   4464   4427

[Figure: most frequent words in twitter]

Next, we build bigrams and trigrams

bigram = function(x) NGramTokenizer(x,Weka_control(min = 2, max = 2))
trigram = function(x) NGramTokenizer(x,Weka_control(min = 3, max = 3))

bitdm_news <- TermDocumentMatrix(ncorp, control = list(tokenize = bigram))
bitdm_blog <- TermDocumentMatrix(bcorp, control = list(tokenize = bigram))
bitdm_twitter <- TermDocumentMatrix(tcorp, control = list(tokenize = bigram))
tritdm_news <- TermDocumentMatrix(ncorp, control = list(tokenize = trigram))
tritdm_blog <- TermDocumentMatrix(bcorp, control = list(tokenize = trigram))
tritdm_twitter <- TermDocumentMatrix(tcorp, control = list(tokenize = trigram))

bifreq_news <- findFreqTerms(bitdm_news, lowfreq=100)
bifreq_blog <- findFreqTerms(bitdm_blog, lowfreq=100)
bifreq_twitter <- findFreqTerms(bitdm_twitter, lowfreq=100)
nbiFreq <- sort(rowSums(as.matrix(bitdm_news[bifreq_news,])),decreasing=TRUE)
bbiFreq <- sort(rowSums(as.matrix(bitdm_blog[bifreq_blog,])),decreasing=TRUE)
tbiFreq <- sort(rowSums(as.matrix(bitdm_twitter[bifreq_twitter,])),decreasing=TRUE)

trifreq_news <- findFreqTerms(tritdm_news, lowfreq=5)
trifreq_blog <- findFreqTerms(tritdm_blog, lowfreq=50)
trifreq_twitter <- findFreqTerms(tritdm_twitter, lowfreq=50)

ntriFreq <- sort(rowSums(as.matrix(tritdm_news[trifreq_news,])),decreasing=TRUE)
btriFreq <- sort(rowSums(as.matrix(tritdm_blog[trifreq_blog,])),decreasing=TRUE)
ttriFreq <- sort(rowSums(as.matrix(tritdm_twitter[trifreq_twitter,])),decreasing=TRUE)
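To make concrete what the tokenizers above produce, the following dependency-free sketch builds n-grams in base R. It is an illustration only: `make_ngrams` is a hypothetical function, not part of tm or RWeka, and its tokenization (lowercase, letters only) is a simplification of the cleaning steps used in this report.

```r
# Hypothetical base-R n-gram builder (the report itself uses RWeka's NGramTokenizer).
make_ngrams <- function(text, n) {
  tokens <- strsplit(tolower(text), "[^a-z]+")[[1]]
  tokens <- tokens[nzchar(tokens)]
  if (length(tokens) < n) return(character(0))   # line too short for an n-gram
  starts <- seq_len(length(tokens) - n + 1)
  # Slide a window of n tokens over the line and join each window with spaces
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

lines <- c("Thanks for the follow", "thanks for the love")

# N-grams are built per line, so they never span two lines
bigram_counts <- sort(table(unlist(lapply(lines, make_ngrams, n = 2))),
                      decreasing = TRUE)
bigram_counts
```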

Most frequent bigrams in news, blogs and twitter in that order

df_nbigram = data.frame(word=names(nbiFreq),Frequency=nbiFreq)
df_bbigram = data.frame(word=names(bbiFreq),Frequency=bbiFreq)
df_tbigram = data.frame(word=names(tbiFreq),Frequency=tbiFreq)
nbiFreq       #Bigram frequency in News

##   of the   in the  for the   to the   on the   at the    to be     in a 
##      713      644      302      302      285      217      198      197 
## with the  and the from the     as a     of a  he said   it was  will be 
##      184      181      150      146      132      125      125      123 
##   with a    for a     is a   he was that the   by the    and a     to a 
##      118      111      111      109      106      103      102      102

bbiFreq[1:20] #Bigram frequency in Blogs

##   of the   in the   to the   on the    to be  for the  and the   i have 
##     9260     7725     4269     3735     3279     2928     2836     2454 
##    and i    i was    it is   at the   it was     is a     in a with the 
##     2422     2384     2349     2348     2316     2284     2257     2182 
##     i am   that i from the     of a 
##     2092     1972     1902     1733

tbiFreq[1:20] #Bigram frequency on Twitter

##     in the    for the     of the     on the      to be     to the 
##       3852       3665       2840       2476       2406       2236 
## thanks for     at the     i love   going to     have a  thank you 
##       2122       1934       1837       1718       1680       1653 
##     if you     i have       i am     i dont      for a     to see 
##       1647       1493       1476       1466       1428       1347 
##       is a    will be 
##       1345       1323

[Figure: most frequent bigrams in news, blogs and twitter]

Most frequent trigrams in news, blogs and twitter in that order

df_ntrigram = data.frame(word=names(ntriFreq),Frequency=ntriFreq)
df_btrigram = data.frame(word=names(btriFreq),Frequency=btriFreq)
df_ttrigram = data.frame(word=names(ttriFreq),Frequency=ttriFreq)
ntriFreq[1:20] #Trigram frequency in News

##       one of the         a lot of      part of the       be able to 
##               51               49               28               22 
##      going to be according to the    for the first       out of the 
##               20               19               19               19 
##      some of the       the end of   the first time     in the first 
##               19               19               18               17 
##      of the year        is one of      said it was       as well as 
##               17               15               15               14 
##       at the end      in the same      more than a    of the season 
##               14               14               14               14

btriFreq[1:20] #Trigram frequency in Blogs

##    one of the      a lot of    the end of   some of the      it was a 
##           652           610           339           333           326 
##    out of the       to be a    be able to    as well as     i want to 
##           326           323           307           304           298 
##   a couple of   i have been     i have to    there is a   part of the 
##           297           260           260           259           256 
##   i dont know      i have a the fact that       it is a   the rest of 
##           254           253           251           250           248

ttriFreq[1:20] #Trigram frequency on Twitter

##     thanks for the      thank you for looking forward to 
##               1160                465                435 
##         i love you       cant wait to          i want to 
##                421                411                393 
##     for the follow        going to be           a lot of 
##                368                365                324 
##            to be a        im going to          i need to 
##                305                296                295 
##         one of the        i dont know         to see you 
##                264                256                254 
##       have a great           i have a          i have to 
##                251                243                233 
##         you have a           i wish i 
##                233                221

[Figure: most frequent trigrams in news, blogs and twitter]

Conclusion and Next steps

For the prediction model, I plan to implement a simple back-off method first. From the counts of unigrams, bigrams and trigrams that have already been built, we can estimate the probabilities of the various trigrams, bigrams and unigrams. The most probable trigram that matches the words typed so far will be used for next word prediction. In the absence of a matching trigram, the model will back off to bigrams, and then to unigrams. I also hope to implement a smoothing method such as Kneser-Ney, as it is one of the most frequently used smoothing techniques in next word prediction algorithms.
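The back-off idea can be sketched in a few lines of base R. This is a rough illustration under stated assumptions, not the final model: `predict_next` is a hypothetical function, the count vectors are toy data shaped like `ttriFreq` and `tbiFreq` (names are n-grams, values are counts), and candidates are ranked by raw count with no discounting or smoothing. For a fixed prefix, ranking by count gives the same order as ranking by the maximum-likelihood probability.

```r
# Hypothetical back-off lookup over named count vectors (n-gram -> count).
predict_next <- function(input, tri, bi, uni, k = 3) {
  tokens <- strsplit(tolower(trimws(input)), "\\s+")[[1]]
  tokens <- tokens[nzchar(tokens)]
  last_word <- function(x) sub(".* ", "", x)   # final word of each n-gram
  lookup <- function(counts, prefix) {
    hits <- counts[startsWith(names(counts), paste0(prefix, " "))]
    last_word(names(sort(hits, decreasing = TRUE)))
  }
  out <- character(0)
  # 1. Try trigrams that start with the last two words typed
  if (length(tokens) >= 2)
    out <- lookup(tri, paste(tail(tokens, 2), collapse = " "))
  # 2. Back off to bigrams that start with the last word
  if (length(out) < k && length(tokens) >= 1)
    out <- unique(c(out, lookup(bi, tail(tokens, 1))))
  # 3. Back off to the most frequent unigrams
  if (length(out) < k)
    out <- unique(c(out, names(sort(uni, decreasing = TRUE))))
  head(out, k)
}

# Toy counts for illustration only
tri <- c("thanks for the" = 1160, "thanks for your" = 40, "one of the" = 264)
bi  <- c("for the" = 3665, "for a" = 1428, "of the" = 2840)
uni <- c("just" = 7566, "like" = 6098, "get" = 5631)

predict_next("Thanks for", tri, bi, uni)    # "the" "your" "a"
predict_next("waiting for", tri, bi, uni)   # "the" "a" "just"
predict_next("zzz", tri, bi, uni)           # "just" "like" "get"
```

A prefix scan with `startsWith` is linear in the size of each table; for the full n-gram tables a keyed lookup, for example splitting each n-gram into a prefix column and a next-word column, would likely be needed for interactive use.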

In the Shiny app, users will type in a sentence and, based on the last one or two words typed, the three most probable next words will be displayed in the output. If I manage to implement Kneser-Ney smoothing, I would like to add an option that lets users switch between the back-off and Kneser-Ney algorithms to see how the predictions vary.