With this report I would like to show my progress on the project and briefly explain my approach to building a prediction model. My ultimate aim is to build a model that predicts a user's next word based on the previously written ones. To that end, three large data sets consisting of tweets, blog posts and news articles will be used to train the model. Here I present the results of my exploratory analysis and a brief plan for the next steps in building the model.
To describe the data, the Cygwin command line on Windows is used to obtain the file sizes, line counts and word counts.
Our data set consists of the following text files:
# C:\Users\mturgal>ls -1 /cygdrive/d/PROGRAMCILIK/'Data Science'/CAPSTONE/final/en_US
# en_US.blogs.txt
# en_US.news.txt
# en_US.twitter.txt
Our files are relatively large:
# C:\Users\mturgal>ls -sh /cygdrive/d/PROGRAMCILIK/'Data Science'/CAPSTONE/final/en_US
# total 557M
# 201M en_US.blogs.txt
# 197M en_US.news.txt
# 160M en_US.twitter.txt
with the following line counts:
# C:\Users\mturgal>wc -l /cygdrive/d/PROGRAMCILIK/'Data Science'/CAPSTONE/final/en_US/*
#   899288 /cygdrive/d/PROGRAMCILIK/Data Science/CAPSTONE/final/en_US/en_US.blogs.txt
#  1010242 /cygdrive/d/PROGRAMCILIK/Data Science/CAPSTONE/final/en_US/en_US.news.txt
#  2360148 /cygdrive/d/PROGRAMCILIK/Data Science/CAPSTONE/final/en_US/en_US.twitter.txt
#  4269678 total
and the following word counts:
# C:\Users\mturgal>wc -w /cygdrive/d/PROGRAMCILIK/'Data Science'/CAPSTONE/final/en_US/*
#  37334114 /cygdrive/d/PROGRAMCILIK/Data Science/CAPSTONE/final/en_US/en_US.blogs.txt
#  34365936 /cygdrive/d/PROGRAMCILIK/Data Science/CAPSTONE/final/en_US/en_US.news.txt
#  30359852 /cygdrive/d/PROGRAMCILIK/Data Science/CAPSTONE/final/en_US/en_US.twitter.txt
# 102059902 total
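The same figures can also be reproduced from within R, although reading the full files just to count them is slow, which is one more reason to work with samples below (a quick sketch, assuming the working directory is the en_US folder):
files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
# approximate file sizes in megabytes
round(file.info(files)$size / 1024^2)
# line counts; reading the full files is slow, hence the shell commands above
sapply(files, function(f) length(readLines(f, skipNul = TRUE)))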
As seen above, the data sets are quite large, and a preliminary attempt to work with them in full proved impractical. As also suggested in Task 1, to overcome this problem I take a sample from each data set:
setwd("D:/PROGRAMCILIK/Data Science/CAPSTONE/final/en_US")
library(tm)
## Warning: package 'tm' was built under R version 3.1.3
## Loading required package: NLP
## Warning: package 'NLP' was built under R version 3.1.3
library(SnowballC)
## Warning: package 'SnowballC' was built under R version 3.1.3
library(RWeka)
## Warning: package 'RWeka' was built under R version 3.1.3
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.1.3
##
## Attaching package: 'ggplot2'
##
## The following object is masked from 'package:NLP':
##
## annotate
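Because sample() draws lines at random, the exact samples below will differ between runs; to make the report reproducible a seed can be fixed first (the seed value here is arbitrary):
set.seed(1234)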
twitterCon <- file("en_US.twitter.txt", "r")
blogsCon <- file("en_US.blogs.txt", "r")
newsCon <- file("en_US.news.txt", "r")
twitter<-readLines(twitterCon,n=20000,skipNul=TRUE)
close(twitterCon)
twitter<-sample(twitter,size=4000)
blogs<-readLines(blogsCon,n=20000,skipNul=TRUE)
close(blogsCon)
blogs<-sample(blogs,size=4000)
news<-readLines(newsCon,n=20000,skipNul=TRUE)
close(newsCon)
news<-sample(news,size=4000)
The file connections were already closed right after reading. The samples can optionally be written to disk so that the sampling step does not have to be repeated later:
#write(twitter,"twitter_sample.txt")
#write(news,"news_sample.txt")
#write(blogs,"blogs_sample.txt")
To analyse the word sets we need to transform them into corpora; we can then use the tm package's functions to preprocess them.
twitterCorpus=Corpus(VectorSource(twitter))
newsCorpus=Corpus(VectorSource(news))
blogsCorpus=Corpus(VectorSource(blogs))
All three corpora are stripped of unnecessary white space, punctuation and numbers, and also of profanity, which is necessary to prevent the model from predicting obscene words.
tweetsPP <-tm_map(twitterCorpus,content_transformer(tolower))
tweetsPP <- tm_map(tweetsPP, stripWhitespace)
tweetsPP <- tm_map(tweetsPP, removePunctuation)
tweetsPP <- tm_map(tweetsPP, removeNumbers)
newsPP <-tm_map(newsCorpus,content_transformer(tolower))
newsPP <- tm_map(newsPP, stripWhitespace)
newsPP <- tm_map(newsPP, removePunctuation)
newsPP <- tm_map(newsPP, removeNumbers)
blogsPP <-tm_map(blogsCorpus,content_transformer(tolower))
blogsPP <- tm_map(blogsPP, stripWhitespace)
blogsPP <- tm_map(blogsPP, removePunctuation)
blogsPP <- tm_map(blogsPP, removeNumbers)
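Since the same transformations are applied to all three corpora, the chains above could equally be wrapped in a small helper function; this sketch simply bundles the same tm_map calls:
preprocessCorpus <- function(corpus) {
  # lower-case, then strip white space, punctuation and numbers
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, stripWhitespace)
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeNumbers)
  corpus
}
# e.g. tweetsPP <- preprocessCorpus(twitterCorpus)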
A profanity list is obtained from http://www.cs.cmu.edu/~biglou/resources/, which provides a list of 1,300+ English terms that could be found offensive. We load this file from the working directory and remove these words from the corpora.
profanity <- file("bad-words.txt","r")
profanitywords <- readLines(profanity)
close(profanity)
tweetsPP <-tm_map(tweetsPP,removeWords,profanitywords)
newsPP <-tm_map(newsPP,removeWords,profanitywords)
blogsPP <-tm_map(blogsPP,removeWords,profanitywords)
The next step is tokenizing the corpora into uni-grams, bi-grams and tri-grams and measuring their frequencies. Here we take a visual approach and plot the most frequent terms.
#source:https://gist.github.com/benmarwick/5370329
onegramTokenizer<-function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
bigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
tweets1Gram <- TermDocumentMatrix(tweetsPP, control = list(tokenize = onegramTokenizer))
tweets2Gram <- TermDocumentMatrix(tweetsPP, control = list(tokenize = bigramTokenizer))
tweets3Gram <- TermDocumentMatrix(tweetsPP, control = list(tokenize = trigramTokenizer))
frequentTweets <-findFreqTerms(tweets1Gram,lowfreq=120)
tweets1matrix<-as.matrix(tweets1Gram[frequentTweets,])
rm(tweets1Gram)
frequencies<-rowSums(tweets1matrix)
tweets1DF <-data.frame(words=frequentTweets,count=frequencies)
rm(tweets1matrix)
g <-ggplot(tweets1DF,aes(reorder(words,count),count))+geom_bar(stat="identity",fill="#a50f15")+theme(axis.text.x = element_text(angle = 60, hjust = 1))+coord_flip()+labs(title="Unigrams in Tweets",y="Word Counts",x="Words")
# plotting 2-gram frequencies
frequentTweets2 <-findFreqTerms(tweets2Gram,lowfreq=40)
tweets2matrix<-as.matrix(tweets2Gram[frequentTweets2,])
frequencies2<-rowSums(tweets2matrix)
tweets2DF <-data.frame(words=frequentTweets2,count=frequencies2)
g2 <-ggplot(tweets2DF,aes(reorder(words,count),count))+geom_bar(stat="identity",fill="#de2d26")+theme(axis.text.x = element_text(angle = 60, hjust = 1))+coord_flip()+labs(title="Bigrams in Tweets",y="Word Counts",x="Words")
rm(tweets2matrix)
rm(tweets2Gram)
# plotting 3-gram frequencies
(frequentTweets3 <-findFreqTerms(tweets3Gram,lowfreq=8))
## [1] "be able to" "cant wait to" "for the follow"
## [4] "for the rt" "going to be" "he lovz me"
## [7] "i dont know" "i feel like" "i have a"
## [10] "i love you" "i need a" "i want to"
## [13] "if you want" "is going to" "it would be"
## [16] "looking forward to" "of the day" "one of the"
## [19] "thank you for" "thanks for the" "to be a"
## [22] "to get a" "to see you" "you for the"
## [25] "you have to"
tweets3matrix<-as.matrix(tweets3Gram[frequentTweets3,])
frequencies3<-rowSums(tweets3matrix)
tweets3DF <-data.frame(words=frequentTweets3,count=frequencies3)
g3 <-ggplot(tweets3DF,aes(reorder(words,count),count))+geom_bar(stat="identity",fill="#fb6a4a")+theme(axis.text.x = element_text(angle = 60, hjust = 1))+coord_flip()+labs(title="Trigrams in Tweets",y="Word Counts",x="Words")
rm(tweets3matrix)
rm(tweets3Gram)
#plotting them side by side
require(gridExtra)
## Loading required package: gridExtra
## Warning: package 'gridExtra' was built under R version 3.1.3
grid.arrange(g,g2,g3, ncol=3)
news1Gram <- TermDocumentMatrix(newsPP, control = list(tokenize = onegramTokenizer))
news2Gram <- TermDocumentMatrix(newsPP, control = list(tokenize = bigramTokenizer))
news3Gram <- TermDocumentMatrix(newsPP, control = list(tokenize = trigramTokenizer))
frequentNews <-findFreqTerms(news1Gram,lowfreq=250)
news1matrix<-as.matrix(news1Gram[frequentNews,])
frequencies<-rowSums(news1matrix)
news1DF <-data.frame(words=frequentNews,count=frequencies)
n <-ggplot(news1DF,aes(reorder(words,count),count))+geom_bar(stat="identity",fill="#54278f")+theme(axis.text.x = element_text(angle = 60, hjust = 1))+coord_flip()+labs(title="Unigrams in News",y="Word Counts",x="Words")
# rm(news1matrix)
# rm(news1Gram)
# plotting 2-gram frequencies
frequentNews2 <-findFreqTerms(news2Gram,lowfreq=80)
news2matrix<-as.matrix(news2Gram[frequentNews2,])
frequencies2<-rowSums(news2matrix)
news2DF <-data.frame(words=frequentNews2,count=frequencies2)
n2 <-ggplot(news2DF,aes(reorder(words,count),count))+geom_bar(stat="identity",fill="#756bb1")+theme(axis.text.x = element_text(angle = 60, hjust = 1))+coord_flip()+labs(title="Bigrams in News",y="Word Counts",x="Words")
# rm(news2matrix)
# rm(news2Gram)
# plotting 3-gram frequencies
frequentNews3 <-findFreqTerms(news3Gram,lowfreq=14)
news3matrix<-as.matrix(news3Gram[frequentNews3,])
frequencies3<-rowSums(news3matrix)
news3DF <-data.frame(words=frequentNews3,count=frequencies3)
n3 <-ggplot(news3DF,aes(reorder(words,count),count))+geom_bar(stat="identity",fill="#9e9ac8")+theme(axis.text.x = element_text(angle = 60, hjust = 1))+coord_flip()+labs(title="Trigrams in News",y="Word Counts",x="Words")
# rm(news3matrix)
# rm(news3Gram)
#plotting them side by side
grid.arrange(n,n2,n3, ncol=3)
blogs1Gram <- TermDocumentMatrix(blogsPP, control = list(tokenize = onegramTokenizer))
blogs2Gram <- TermDocumentMatrix(blogsPP, control = list(tokenize = bigramTokenizer))
blogs3Gram <- TermDocumentMatrix(blogsPP, control = list(tokenize = trigramTokenizer))
# rm(blogsPP)
frequentBlogs <-findFreqTerms(blogs1Gram,lowfreq=380)
blogs1matrix<-as.matrix(blogs1Gram[frequentBlogs,])
# rm(blogs1Gram)
frequencies<-rowSums(blogs1matrix)
blogs1DF <-data.frame(words=frequentBlogs,count=frequencies)
# rm(blogs1matrix)
b <-ggplot(blogs1DF,aes(reorder(words,count),count))+geom_bar(stat="identity",fill="#006d2c")+theme(axis.text.x = element_text(angle = 60, hjust = 1))+coord_flip()+labs(title="Unigrams in Blogs",y="Word Counts",x="Words")
# rm(blogs1matrix)
# rm(blogs1Gram)
# plotting 2-gram frequencies
frequentBlogs2 <-findFreqTerms(blogs2Gram,lowfreq=110)
blogs2matrix<-as.matrix(blogs2Gram[frequentBlogs2,])
frequencies2<-rowSums(blogs2matrix)
blogs2DF <-data.frame(words=frequentBlogs2,count=frequencies2)
b2 <-ggplot(blogs2DF,aes(reorder(words,count),count))+geom_bar(stat="identity",fill="#2ca25f")+theme(axis.text.x = element_text(angle = 60, hjust = 1))+coord_flip()+labs(title="Bigrams in Blogs",y="Word Counts",x="Words")
# rm(blogs2matrix)
# rm(blogs2Gram)
# plotting 3-gram frequencies
frequentBlogs3 <-findFreqTerms(blogs3Gram,lowfreq=19)
blogs3matrix<-as.matrix(blogs3Gram[frequentBlogs3,])
frequencies3<-rowSums(blogs3matrix)
blogs3DF <-data.frame(words=frequentBlogs3,count=frequencies3)
b3 <-ggplot(blogs3DF,aes(reorder(words,count),count))+geom_bar(stat="identity",fill="#66c2a4")+theme(axis.text.x = element_text(angle = 60, hjust = 1))+coord_flip()+labs(title="Trigrams in Blogs",y="Word Counts",x="Words")
# rm(blogs3matrix)
# rm(blogs3Gram)
#plotting them side by side
grid.arrange(b,b2,b3, ncol=3)
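The per-corpus plotting code above repeats the same pattern with different frequency thresholds and colours; for later milestones it could be folded into a helper such as the sketch below, which uses the same calls as above (parameters in the example are illustrative):
plotNGrams <- function(tdm, lowfreq, title, fill) {
  # build a frequency data frame from the term-document matrix
  terms <- findFreqTerms(tdm, lowfreq = lowfreq)
  counts <- rowSums(as.matrix(tdm[terms, ]))
  df <- data.frame(words = terms, count = counts)
  # horizontal bar chart of the most frequent terms
  ggplot(df, aes(reorder(words, count), count)) +
    geom_bar(stat = "identity", fill = fill) +
    coord_flip() +
    labs(title = title, y = "Word Counts", x = "Words")
}
# e.g. plotNGrams(blogs3Gram, lowfreq = 19, "Trigrams in Blogs", "#66c2a4")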
Inspecting the results, we can see some similarities between the different categories (tweets, blogs and news), but there are clear differences as well. So when building the model we will take the category into account in order to increase its accuracy.
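One simple way to keep the category information available for modelling is to tag each n-gram table with its source before combining them (a sketch based on the trigram data frames built above):
tweets3DF$source <- "twitter"
news3DF$source   <- "news"
blogs3DF$source  <- "blogs"
allTrigrams <- rbind(tweets3DF, news3DF, blogs3DF)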
Since we will predict the next word(s) based on the preceding words, a Bayesian approach using the Naive Bayes algorithm will be my first choice.
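As a first illustration of how the n-gram counts above could feed such a predictor, the most frequent trigram starting with a given two-word prefix already gives a crude next-word suggestion (a frequency lookup only; the actual model will add a proper probabilistic treatment and smoothing):
predictNext <- function(prefix, trigramDF) {
  # keep trigrams whose first two words match the given prefix
  hits <- trigramDF[grepl(paste0("^", prefix, " "), trigramDF$words), ]
  if (nrow(hits) == 0) return(NA_character_)
  # take the most frequent matching trigram and return its last word
  best <- as.character(hits$words[which.max(hits$count)])
  tail(strsplit(best, " ")[[1]], 1)
}
# e.g. with the frequent tweet trigrams listed above this suggests "the"
predictNext("thanks for", tweets3DF)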