Capstone Project

Summary of Report

To build a model that predicts the next word, one has to understand the distribution of, and relationships between, the words, tokens and phrases in the text. In this exercise we are given three types of text, from blogs, news and Twitter, to study and analyse. We have to carry out a thorough exploratory analysis of the data to understand the basic distribution of words and the relationships between words in the corpora.

This report contains tables of results showing the variation in the frequencies of single words (unigrams), word pairs (bigrams) and word triples (trigrams) in the data. (The plots that appeared in the knitted HTML file disappeared when the file was published to RPubs, so I had to make do with tabulated data.)

Loading of software packages and the dataset

The following software packages were loaded in RStudio:

## Loading required package: NLP
## Loading required package: qdapDictionaries
## Loading required package: qdapRegex
## Loading required package: qdapTools
## Loading required package: RColorBrewer
## 
## Attaching package: 'qdap'
## The following objects are masked from 'package:tm':
## 
##     as.DocumentTermMatrix, as.TermDocumentMatrix
## The following object is masked from 'package:NLP':
## 
##     ngrams
## The following object is masked from 'package:base':
## 
##     Filter
## 
## Attaching package: 'stringr'
## The following object is masked from 'package:qdap':
## 
##     %>%
## 
## Attaching package: 'pryr'
## The following object is masked from 'package:tm':
## 
##     inspect

Download the 3 original texts and sample a small subset for Exploratory Analysis

blogs = readLines("~/Capstone/final/en_US/en_US.blogs.txt",encoding="UTF-8",skipNul = TRUE)
news = readLines("~/Capstone/final/en_US/en_US.news.txt",encoding="UTF-8", skipNul = TRUE)
## Warning in readLines("~/Capstone/final/en_US/en_US.news.txt", encoding
## = "UTF-8", : incomplete final line found on '~/Capstone/final/en_US/
## en_US.news.txt'
twitter = readLines("~/Capstone/final/en_US/en_US.twitter.txt",encoding="UTF-8", skipNul = TRUE)

# Sample a smaller number of lines in the 3 texts
set.seed(123)
blogs <- blogs[rbinom(length(blogs)*.003, length(blogs), .5)]
news <- news[rbinom(length(news)*.030, length(news), .5)]
twitter <- twitter[rbinom(length(twitter)*.004, length(twitter), .5)]

Word Count for Original and Sampled Text Documents

The word counts for the original texts and for the smaller sampled texts were obtained with stri_stats_latex() from the stringi package: stri_stats_latex(blogs), stri_stats_latex(news) and stri_stats_latex(twitter).
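
A sketch of how the summary table in the next section could be assembled from the "Words" element returned by stri_stats_latex(). The names blogs_full, news_full and twitter_full for the unsampled vectors are illustrative, since the original vectors were overwritten by the sampling step above:

library(stringi)
# Hypothetical *_full objects hold the unsampled texts; blogs/news/twitter are the sampled ones
word_counts <- data.frame(
    doc.text = c("blogs", "news", "twitter"),
    original = c(stri_stats_latex(blogs_full)["Words"],
                 stri_stats_latex(news_full)["Words"],
                 stri_stats_latex(twitter_full)["Words"]),
    sampled  = c(stri_stats_latex(blogs)["Words"],
                 stri_stats_latex(news)["Words"],
                 stri_stats_latex(twitter)["Words"])
)
word_counts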

Summary of Word Count of Original and Sampled texts

The word counts of the original and sampled texts are given in the following table:

##   doc.text original sampled
## 1    blogs 37570839  109560
## 2     news  2651432   76958
## 3  twitter 30451170  122711

Corpus of Sampled Dataset

In the following section we set up a corpus of the sampled data for cleaning and analysis. Five examples of the cleaned text, with stop words removed, are shown below.

bnttext = c(blogs,news,twitter)
bnttext_sent = sent_detect(bnttext, language = "en", model = NULL)
rm(blogs, news, twitter)

corpus <- VCorpus(VectorSource(bnttext_sent))
corpus <- tm_map(corpus, content_transformer(tolower), lazy=TRUE)
# Helper to strip non-UTF-8 characters and set lower case
# (defined for reuse; not invoked in the steps shown below)
CleanCorpora <- function(corpus){
    corpus <- tm_map(corpus, content_transformer(function(x) iconv(enc2utf8(x), sub = "byte")))
    corpus <- tm_map(corpus, content_transformer(function(x) iconv(x, 'latin1', 'ASCII', sub = '')))
    corpus <- tm_map(corpus, content_transformer(tolower))
    corpus
}

corpus <- tm_map(corpus, removeNumbers)
remover <- content_transformer(function(x,pattern)gsub(pattern,' ',x))
corpus <- tm_map(corpus, remover, '[@][a-zA-Z0-9_]{1,15}')  # remove twitter usernames
corpus <- tm_map(corpus, remover, 'Ã|½Ã|¸¥')                # remove garbled (mis-encoded) byte sequences
corpus <- tm_map(corpus, remover, 'í ½í¸¥')                 # remove mojibake left by emoji characters
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removePunctuation)
bad_words <- readLines("~/Capstone/Terms-to-Block.csv")  # plain character vector; removeWords expects a character vector, not a VectorSource
corpus <- tm_map(corpus, removeWords, bad_words)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stemDocument, language = "english")  # stem the words


fulldata<-data.frame(text=unlist(sapply(corpus, `[`, "content")), stringsAsFactors=F)
fulldata[1:5,1]
## [1] " friend fill  empti worri  fear"                                                                                                                                                                     
## [2] " trek  jungl  wild anim  torrid desert     travers    danger eventu  get hors gallop  mountain    love villag fill  music  song…    love  stay      carri   blind snow fight wolv   bite bitter wind"
## [3] "  speci  man grown one stage poorer  longer possess  strength  interpret  creat fiction produc nihilist"                                                                                             
## [4] " nihilist   man  judg   world             world          exist"                                                                                                                                      
## [5] "accord   view  exist   mean  patho   vain   nihilist pathosat   time  patho  inconsist   part   nihilist"

The size of the cleaned corpus without stop words is 28.9 MB.

# Tokenise into n-grams with RWeka's NGramTokenizer
one_grams <- NGramTokenizer(fulldata, Weka_control(min = 1, max = 1))
bi_grams <- NGramTokenizer(fulldata, Weka_control(min = 2, max = 2, delimiters = " \\r\\n\\t.,;:\"()?!"))
tri_grams <- NGramTokenizer(fulldata, Weka_control(min = 3, max = 3, delimiters = " \\r\\n\\t.,;:\"()?!"))
one_gramsDF <- data.frame(table(one_grams))
bi_gramsDF <- data.frame(table(bi_grams))
tri_gramsDF <- data.frame(table(tri_grams))
unigrams_sorted <- one_gramsDF[order(one_gramsDF$Freq,decreasing = TRUE),]
bigrams_sorted <- bi_gramsDF[order(bi_gramsDF$Freq,decreasing = TRUE),]
trigrams_sorted <- tri_gramsDF[order(tri_gramsDF$Freq,decreasing = TRUE),]
top10unigram <- unigrams_sorted[1:10,]
top10bigram <- bigrams_sorted[1:10,]
top10trigram <- trigrams_sorted[1:10,]

Tables of top10unigram, top10bigram and top10trigram without stop words

(The plots disappeared when the knitted HTML file was published to RPubs, so only the tabulated frequencies are shown.)

par(oma=c(0,0,5,0),mfrow = c(2,2), mar=c(5,2,3,2))
barplot(top10unigram$Freq,names.arg=top10unigram$one_grams,las=3)
top10unigram
##       one_grams Freq
## 7742       said  705
## 9965       will  695
## 6368        one  632
## 3729        get  614
## 1440        can  544
## 9144       time  537
## 10135      year  530
## 4850       just  486
## 5227       like  448
## 6130        new  443
title("Top10 Unigram")
barplot(top10bigram$Freq,names.arg=top10bigram$bi_grams,las=3)
top10bigram
##         bi_grams Freq
## 23617  last year   91
## 45933        u s   66
## 49767   year ago   36
## 40911    st loui   35
## 25352  look like   33
## 25429  los angel   31
## 26234  make sure   29
## 36516  right now   29
## 12181 dont think   28
## 45854   two year   26
title("top10 Bigram")
barplot(top10trigram$Freq,names.arg=top10trigram$tri_grams, las=3) # las=3 rotates the axis labels to vertical
top10trigram
##                 tri_grams Freq
## 48980       there way can   20
## 3819    averag sale price   16
## 3821   averag time market   16
## 9130  close escrow averag   16
## 15090  escrow averag sale   16
## 38096   price averag time   16
## 42105   sale price averag   16
## 49874     time market day   16
## 18826   game playoff seri   15
## 27118         let us know   14
title("Top10 Trigram")

object_size(corpus) # size of the corpus in MB
## 28.9 MB

Keeping stop words in the cleaning and analysis

In this case, the size of the corpus is 29.1 MB.
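
The code chunk that produced this second corpus is not shown in the published report; a minimal sketch of the stop-words-kept variant is given below, reusing the objects defined above. The name corpus_sw is illustrative, and the pipeline is essentially the same as before with the stop-word removal step omitted:

corpus_sw <- VCorpus(VectorSource(bnttext_sent))
corpus_sw <- tm_map(corpus_sw, content_transformer(tolower))
corpus_sw <- tm_map(corpus_sw, removeNumbers)
corpus_sw <- tm_map(corpus_sw, remover, '[@][a-zA-Z0-9_]{1,15}')   # remove twitter usernames
corpus_sw <- tm_map(corpus_sw, stripWhitespace)
corpus_sw <- tm_map(corpus_sw, removePunctuation)
corpus_sw <- tm_map(corpus_sw, removeWords, bad_words)             # profanity filter
corpus_sw <- tm_map(corpus_sw, stemDocument, language = "english") # stem the words, keep stop words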

# If we keep the stop words, the cleaned examples below are more readable:
## [1] "our friend fill with empti worri and fear"                                                                                                                                                                                                                                                                         
## [2] "the trek through jungl with wild anim the torrid desert that have to be travers with all their danger eventu they get hors gallop up mountain and down into love villag fill with music and song… he would have love to stay there but they had to carri on through blind snow fight wolv and the bite bitter wind"
## [3] "this same speci of man grown one stage poorer no longer possess the strength to interpret to creat fiction produc nihilist"                                                                                                                                                                                        
## [4] "a nihilist is a man who judg of the world as it is that it ought not to be and of the world as it ought to be that it doe not exist"                                                                                                                                                                               
## [5] "accord to this view our exist has no mean the patho of in vain is the nihilist pathosat the same time as patho an inconsist on the part of the nihilist"

Tables of top10unigram, top10bigram and top10trigram with stop words included

(The three plots disappeared when the knitted HTML file was published to RPubs.)

The size of this corpus, in MB, is reported by object_size() at the end of this section.

The following tables list the top-10 unigrams, bigrams and trigrams with stop words included:

par(oma=c(0,0,5,0),mfrow = c(2,2), mar=c(5,2,3,2))
barplot(top10unigram$Freq,names.arg=top10unigram$one_grams, las=3) # las=3 rotates the axis labels to vertical
top10unigram
##      one_grams  Freq
## 9159       the 10400
## 9316        to  5604
## 538        and  5323
## 216          a  5068
## 6423        of  4449
## 4535        in  3532
## 4459         i  2918
## 9153      that  2429
## 4773        it  2212
## 4751        is  2118
title("Top10 Unigram")
barplot(top10bigram$Freq,names.arg=top10bigram$bi_grams,las=3)
top10bigram
##       bi_grams Freq
## 36658   of the 1006
## 26200   in the  904
## 37293   on the  483
## 55100   to the  483
## 19555  for the  371
## 54391    to be  333
## 25732     in a  316
## 4703   and the  311
## 6466    at the  294
## 20244 from the  246
title("top10 Bigram")
barplot(top10trigram$Freq,names.arg=top10trigram$tri_grams,las=3)
top10trigram
##           tri_grams Freq
## 1580       a lot of   83
## 55389    one of the   61
## 76163       the u s   36
## 36084     i want to   33
## 73434 the fact that   33
## 76342    the way to   31
## 35504      i have a   29
## 38060   in the last   29
## 80262     to have a   29
## 3135  accord to the   28
title("Top10 Trigram")

object_size(corpus) # size of the corpus in MB
## 29.1 MB

Conclusions

The original dataset is very large, containing more than 70 million words across en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt. It took a long time to download and would take far longer to analyse in full, so a sample of about 300 thousand words drawn from the three original texts was used in the exploratory analysis to avoid a very long runtime.

The study also compared the results of including and excluding stop words in the sampled corpus. The results show that the most frequent stop words occur many times more often than the most frequent non-stop words. Since the predictive model is intended for normal everyday language, stop words should not be excluded from the data used to build it.

The five example sentences shown for each of the two cleaned versions of the sampled data also show that the text-mining and cleaning steps need further improvement.

Further Work Required for the Predictive Model

First, the sampled dataset needs to be divided into training and testing sets, for building the predictive model and subsequently evaluating it.
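
A minimal sketch of such a split on the sampled sentences (the 80/20 proportion and the train_set/test_set names are assumptions):

# Split the sampled sentences into training (80%) and testing (20%) sets
set.seed(123)
train_idx <- sample(seq_along(bnttext_sent), size = round(0.8 * length(bnttext_sent)))
train_set <- bnttext_sent[train_idx]
test_set  <- bnttext_sent[-train_idx]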

As mentioned earlier, stop words should be kept. The cleaning process also needs fine-tuning before the final model is built.

The final product will be a Shiny app that takes a phrase (one or more words) as input and, when the user clicks Submit, predicts the next word.
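
A minimal sketch of such an app, assuming a predict_next() function like the back-off sketch in the next paragraph; the UI element names are illustrative:

library(shiny)

ui <- fluidPage(
    titlePanel("Next Word Prediction"),
    textInput("phrase", "Enter a phrase:"),
    actionButton("submit", "Submit"),
    textOutput("prediction")
)

server <- function(input, output) {
    # Predict only when the Submit button is clicked
    prediction <- eventReactive(input$submit, predict_next(input$phrase))
    output$prediction <- renderText(prediction())
}

shinyApp(ui = ui, server = server)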

Future efforts will evaluate both a back-off model and an interpolated smoothing (Kneser-Ney) model for predicting the next word before deciding which to adopt. It may also be worthwhile to try the simpler stupid back-off method. Finally, I would appreciate feedback on which of these models has proven accurate and efficient in terms of memory usage and response time. Thank you.
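
As a starting point, a minimal back-off sketch (not yet stupid back-off scoring or Kneser-Ney smoothing) could use the sorted n-gram tables built earlier: look for the most frequent trigram whose first two words match the end of the phrase, back off to bigrams, and finally fall back to the most frequent unigram. It assumes the input phrase has been cleaned and stemmed in the same way as the corpus.

predict_next <- function(phrase) {
    words <- tail(strsplit(tolower(phrase), "\\s+")[[1]], 2)
    last_word <- function(ngram) tail(strsplit(as.character(ngram), " ")[[1]], 1)
    # Trigram match: first two words of the trigram equal the last two words of the phrase
    if (length(words) == 2) {
        hits <- trigrams_sorted[grepl(paste0("^", words[1], " ", words[2], " "),
                                      trigrams_sorted$tri_grams), ]
        if (nrow(hits) > 0) return(last_word(hits$tri_grams[1]))
    }
    # Back off to bigrams starting with the last word of the phrase
    hits <- bigrams_sorted[grepl(paste0("^", tail(words, 1), " "),
                                 bigrams_sorted$bi_grams), ]
    if (nrow(hits) > 0) return(last_word(hits$bi_grams[1]))
    # Final fallback: the single most frequent unigram
    as.character(unigrams_sorted$one_grams[1])
}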

References

The Terms-to-Block list was downloaded from: http://www.frontgatemedia.com/a-list-of-723-bad-words-to-blacklist-and-how-to-use-facebooks-moderation-tool/