To build a model that predicts the next word, one has to understand the distribution of, and the relationships between, the words, tokens and phrases in the text. In this exercise we are given three types of text, from blogs, news and Twitter, to study and analyse. We have to carry out a thorough exploratory analysis of the data and understand the basic distribution of words and the relationships between words in the corpora.
This report contains tables of results showing the variation in the frequencies of single words (unigrams) and word pairs and triples (bigrams and trigrams) in the data. (The plots that appeared in the knitted HTML file disappeared when the file was published to RPubs, so tabulated data are shown instead.)
The following software packages were loaded in RStudio:
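The list itself is not shown above; based on the functions used in this report, the library() calls would presumably be along these lines (an assumption inferred from the code, not the author's original list):
# Assumed package list, inferred from the functions called below:
library(stringi)    # stri_stats_latex()
library(tm)         # VCorpus(), tm_map(), removeWords(), stopwords(), ...
library(SnowballC)  # stemming backend used by stemDocument()
library(qdap)       # sent_detect()
library(RWeka)      # NGramTokenizer(), Weka_control()
library(pryr)       # object_size()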
blogs = readLines("~/Capstone/final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news  = readLines("~/Capstone/final/en_US/en_US.news.txt",  encoding = "UTF-8", skipNul = TRUE)
## Warning in readLines("~/Capstone/final/en_US/en_US.news.txt", encoding = "UTF-8", :
##   incomplete final line found on '~/Capstone/final/en_US/en_US.news.txt'
twitter = readLines("~/Capstone/final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
# Sample a smaller number of lines in the 3 texts
set.seed(123)
blogs   <- blogs[rbinom(length(blogs) * .003, length(blogs), .5)]
news    <- news[rbinom(length(news) * .030, length(news), .5)]
twitter <- twitter[rbinom(length(twitter) * .004, length(twitter), .5)]
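Note that rbinom() here draws line indices that cluster tightly around the middle of each file and may repeat lines. A more uniform alternative (a sketch only, not the approach used in this report) would use sample():
# Alternative sketch (instead of the rbinom() calls above):
# draw a uniform random sample of whole lines without replacement.
set.seed(123)
blogs   <- blogs[sample(length(blogs), round(length(blogs) * 0.003))]
news    <- news[sample(length(news), round(length(news) * 0.030))]
twitter <- twitter[sample(length(twitter), round(length(twitter) * 0.004))]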
The following code was used to obtain the word counts; it was run on the original texts before sampling and again on the sampled texts:
stri_stats_latex(blogs)
stri_stats_latex(news)
stri_stats_latex(twitter)
The word counts of the original texts and the sampled texts are given in the following table:
##   doc.text original sampled
## 1    blogs 37570839  109560
## 2     news  2651432   76958
## 3  twitter 30451170  122711
In the following section we set up a corpus of the sampled data for cleaning and analysis. We also show five examples of the cleaned text with stop words removed.
bnttext = c(blogs, news, twitter)
bnttext_sent = sent_detect(bnttext, language = "en", model = NULL)
rm(blogs, news, twitter)
corpus <- VCorpus(VectorSource(bnttext_sent))
corpus <- tm_map(corpus, content_transformer(tolower), lazy = TRUE)
CleanCorpora <- function(corpus){ # defined here but not called below
  # remove non-UTF-8 characters and set lower case
  corpus <- tm_map(corpus, content_transformer(function(x) iconv(enc2utf8(x), sub = "byte")))
  corpus <- tm_map(corpus, content_transformer(function(x) iconv(x, 'latin1', 'ASCII', sub = '')))
  corpus <- tm_map(corpus, content_transformer(tolower))
}
corpus <- tm_map(corpus, removeNumbers)
remover <- content_transformer(function(x, pattern) gsub(pattern, ' ', x))
corpus <- tm_map(corpus, remover, '[@][a-zA-Z0-9_]{1,15}') # remove twitter usernames
corpus <- tm_map(corpus, remover, 'Ã|½Ã|¸¥')               # remove mis-encoded (mojibake) characters
corpus <- tm_map(corpus, remover, 'í ½í¸¥')                # remove mis-encoded emoji residue
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removePunctuation)
bad_words <- VectorSource(readLines("~/Capstone/Terms-to-Block.csv"))
corpus <- tm_map(corpus, removeWords, bad_words) # note: removeWords() expects a character vector; see the error below
## Error in rank(x, ties.method = "min", na.last = "keep"): unimplemented type 'list' in 'greater'
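The error above appears to come from passing a VectorSource object to removeWords(), which expects a plain character vector of words. A minimal sketch of a possible fix, assuming the same Terms-to-Block.csv file:
# Possible fix (sketch): pass the word list as a character vector, not a VectorSource.
bad_words <- readLines("~/Capstone/Terms-to-Block.csv", skipNul = TRUE)
corpus <- tm_map(corpus, removeWords, bad_words)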
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stemDocument, language = "english") # stemming the words
fulldata <- data.frame(text = unlist(sapply(corpus, `[`, "content")), stringsAsFactors = FALSE)
fulldata[1:5, 1]
## [1] " friend fill empti worri fear" ## [2] " trek jungl wild anim torrid desert travers danger eventu get hors gallop mountain love villag fill music song love stay carri blind snow fight wolv bite bitter wind" ## [3] " speci man grown one stage poorer longer possess strength interpret creat fiction produc nihilist" ## [4] " nihilist man judg world world exist" ## [5] "accord view exist mean patho vain nihilist pathosat time patho inconsist part nihilist"
The size of the cleaned corpus without stop words is 28.9 MB.
one_grams <- NGramTokenizer(fulldata, Weka_control(min = 1, max = 1))
bi_grams  <- NGramTokenizer(fulldata, Weka_control(min = 2, max = 2, delimiters = " \\r\\n\\t.,;:\"()?!"))
tri_grams <- NGramTokenizer(fulldata, Weka_control(min = 3, max = 3, delimiters = " \\r\\n\\t.,;:\"()?!"))
one_gramsDF <- data.frame(table(one_grams))
bi_gramsDF  <- data.frame(table(bi_grams))
tri_gramsDF <- data.frame(table(tri_grams))
unigrams_sorted <- one_gramsDF[order(one_gramsDF$Freq, decreasing = TRUE), ]
bigrams_sorted  <- bi_gramsDF[order(bi_gramsDF$Freq, decreasing = TRUE), ]
trigrams_sorted <- tri_gramsDF[order(tri_gramsDF$Freq, decreasing = TRUE), ]
top10unigram <- unigrams_sorted[1:10, ]
top10bigram  <- bigrams_sorted[1:10, ]
top10trigram <- trigrams_sorted[1:10, ]
As noted above, the plots disappeared when the knitted HTML file was published to RPubs, so the top-10 frequencies are shown here as tables.
par(oma = c(0, 0, 5, 0), mfrow = c(2, 2), mar = c(5, 2, 3, 2))
barplot(top10unigram$Freq, names.arg = top10unigram$one_grams, las = 3)
top10unigram
##       one_grams Freq
## 7742       said  705
## 9965       will  695
## 6368        one  632
## 3729        get  614
## 1440        can  544
## 9144       time  537
## 10135      year  530
## 4850       just  486
## 5227       like  448
## 6130        new  443
title("Top10 Unigram") barplot(top10bigram$Freq,names.arg=top10bigram$bi_grams,las=3) top10bigram
##         bi_grams Freq
## 23617  last year   91
## 45933        u s   66
## 49767   year ago   36
## 40911    st loui   35
## 25352  look like   33
## 25429  los angel   31
## 26234  make sure   29
## 36516  right now   29
## 12181 dont think   28
## 45854   two year   26
title("top10 Bigram") barplot(top10trigram$Freq,names.arg=top10trigram$tri_grams, las=3) # the texts becomes vertical top10trigram
##                 tri_grams Freq
## 48980       there way can   20
## 3819    averag sale price   16
## 3821   averag time market   16
## 9130  close escrow averag   16
## 15090  escrow averag sale   16
## 38096   price averag time   16
## 42105   sale price averag   16
## 49874     time market day   16
## 18826   game playoff seri   15
## 27118         let us know   14
title("Top10 Trigram") object_size(corpus) # Size of the corpus in Mbs.
## 28.9 MB

The analysis was then repeated with the stop words retained. In this case, the size of the cleaned corpus is 29.1 MB.
If we keep the stop words, the following five examples of cleaned text are more readable:
## Error in rank(x, ties.method = "min", na.last = "keep"): unimplemented type 'list' in 'greater'
## [1] "our friend fill with empti worri and fear" ## [2] "the trek through jungl with wild anim the torrid desert that have to be travers with all their danger eventu they get hors gallop up mountain and down into love villag fill with music and song he would have love to stay there but they had to carri on through blind snow fight wolv and the bite bitter wind" ## [3] "this same speci of man grown one stage poorer no longer possess the strength to interpret to creat fiction produc nihilist" ## [4] "a nihilist is a man who judg of the world as it is that it ought not to be and of the world as it ought to be that it doe not exist" ## [5] "accord to this view our exist has no mean the patho of in vain is the nihilist pathosat the same time as patho an inconsist on the part of the nihilist"
(Again, the three plots disappeared when the knitted HTML file was published to RPubs.)
The size of this corpus, with stop words retained, is reported in MB at the end of this section.
The following three plots (tabulated below) show the top-10 unigrams, bigrams and trigrams:
par(oma = c(0, 0, 5, 0), mfrow = c(2, 2), mar = c(5, 2, 3, 2))
barplot(top10unigram$Freq, names.arg = top10unigram$one_grams, las = 3) # las = 3 rotates the axis labels to vertical
top10unigram
##      one_grams  Freq
## 9159       the 10400
## 9316        to  5604
## 538        and  5323
## 216          a  5068
## 6423        of  4449
## 4535        in  3532
## 4459         i  2918
## 9153      that  2429
## 4773        it  2212
## 4751        is  2118
title("Top10 Unigram") barplot(top10bigram$Freq,names.arg=top10bigram$bi_grams,las=3) top10bigram
##       bi_grams Freq
## 36658   of the 1006
## 26200   in the  904
## 37293   on the  483
## 55100   to the  483
## 19555  for the  371
## 54391    to be  333
## 25732     in a  316
## 4703   and the  311
## 6466    at the  294
## 20244 from the  246
title("top10 Bigram") barplot(top10trigram$Freq,names.arg=top10trigram$tri_grams,las=3) top10trigram
##           tri_grams Freq
## 1580       a lot of   83
## 55389    one of the   61
## 76163       the u s   36
## 36084     i want to   33
## 73434 the fact that   33
## 76342    the way to   31
## 35504      i have a   29
## 38060   in the last   29
## 80262     to have a   29
## 3135  accord to the   28
title("Top10 Trigram") object_size(corpus) # Size of the corpus in Mbs.
## 29.1 MB

The original dataset is very large: it contains more than 70 million words across en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt. It took a long time to download and would take far longer to analyse in full. A sample of about 300 thousand words drawn from the three original texts was therefore used in the exploratory analysis to keep the runtime manageable.
The study also compared the results of including and excluding stop words in the sampled corpus. The results show that stop words occur many times more frequently than non-stop words. Since the goal is a predictive model for normal everyday language, stop words should not be excluded from the data used to build the model.
The five examples from each of the two cleaned samples also show that the text-mining and cleaning algorithm needs further improvement.
First, the sampled dataset needs to be divided into a training set and a testing set, for building the predictive model and subsequently testing it.
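A minimal sketch of such a split, assuming the sentence vector bnttext_sent created above and an 80/20 ratio:
# Sketch (assumed 80/20 split): hold out 20% of the sampled sentences for testing.
set.seed(123)
train_idx <- sample(length(bnttext_sent), round(0.8 * length(bnttext_sent)))
train_set <- bnttext_sent[train_idx]
test_set  <- bnttext_sent[-train_idx]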
As mentioned earlier, stop words should be kept in the study. The cleaning process also needs to be fine-tuned before arriving at the final model.
The final product will be a Shiny app that takes a phrase (one or more words) as input; when the user clicks Submit, the app predicts the next word.
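A minimal sketch of such an app, assuming a hypothetical prediction function predict_next_word() like the one sketched further below:
# Minimal sketch of the planned Shiny interface (layout assumed, not final):
library(shiny)
ui <- fluidPage(
  textInput("phrase", "Enter a phrase:"),
  actionButton("submit", "Submit"),
  textOutput("prediction")
)
server <- function(input, output) {
  output$prediction <- renderText({
    input$submit                              # re-render when Submit is clicked
    isolate(predict_next_word(input$phrase))  # predict_next_word() is a placeholder
  })
}
shinyApp(ui, server)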
Future efforts will compare a backoff model and an interpolated-smoothing (Kneser-Ney) model for predicting the next word before deciding which to adopt. It may also be worthwhile to try the stupid backoff approach. Finally, I would appreciate feedback on which of these models has proven to be accurate and efficient in terms of memory usage and response time. Thank you.
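As a starting point, here is a much-simplified backoff-style sketch built on the sorted n-gram tables above (an assumption, not the final model); a full stupid backoff would additionally apply a fixed discount, typically 0.4, when comparing candidates across n-gram orders:
# Simplified backoff sketch: try the most frequent trigram continuation of the
# last two words, back off to bigrams, then to the most frequent unigram.
predict_next_word <- function(phrase) {
  words <- tail(strsplit(tolower(phrase), "\\s+")[[1]], 2)
  if (length(words) == 0) return(as.character(unigrams_sorted$one_grams[1]))
  if (length(words) == 2) {
    pre <- paste0("^", words[1], " ", words[2], " ")
    tri <- trigrams_sorted[grepl(pre, trigrams_sorted$tri_grams), ]
    if (nrow(tri) > 0) return(sub(pre, "", as.character(tri$tri_grams[1])))
  }
  pre <- paste0("^", tail(words, 1), " ")
  bi  <- bigrams_sorted[grepl(pre, bigrams_sorted$bi_grams), ]
  if (nrow(bi) > 0) return(sub(pre, "", as.character(bi$bi_grams[1])))
  as.character(unigrams_sorted$one_grams[1])  # no match: most frequent unigram
}
predict_next_word("last year")  # returns the most frequent word observed after "last year"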
The Terms-to-Block list was downloaded from the following website: http://www.frontgatemedia.com/a-list-of-723-bad-words-to-blacklist-and-how-to-use-facebooks-moderation-tool/