Load the data and show a summary. Currently only the en_US folder is considered for the analysis.
# Read the three English (en_US) source files line by line
blogs <- readLines("D:\\Lab\\Final\\final\\en_US\\en_US.blogs.txt", warn = FALSE)
news <- readLines("D:\\Lab\\Final\\final\\en_US\\en_US.news.txt", warn = FALSE)
twitter <- readLines("D:\\Lab\\Final\\final\\en_US\\en_US.twitter.txt", warn = FALSE)
summary(blogs)
## Length Class Mode
## 899288 character character
summary(news)
## Length Class Mode
## 77259 character character
summary(twitter)
## Length Class Mode
## 2360148 character character
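The news file looks suspiciously short (about 77 thousand lines versus roughly 900 thousand for blogs). A common cause on Windows is an embedded control character that stops readLines in text mode; a hedged workaround is to open the file through a binary connection (same path as above):
# Reading through a binary connection avoids stopping at an embedded
# Ctrl-Z byte, and skipNul drops embedded NUL characters
con <- file("D:\\Lab\\Final\\final\\en_US\\en_US.news.txt", open = "rb")
news <- readLines(con, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
close(con)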
The data sets are very large, so we take a 20% random sample of each (a fixed seed keeps the sample reproducible).
set.seed(1234) # fix the random seed so the same sample is drawn on every run
lessBlogs <- sample(blogs, 0.2 * length(blogs))
lessNews <- sample(news, 0.2 * length(news))
lessTwitter <- sample(twitter, 0.2 * length(twitter))
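Since the full files are slow to read, a minimal sketch of caching the drawn samples to disk so that later sessions can reuse them (the output file names are illustrative, not part of the original analysis):
# Write the 20% samples out once so later sessions can load the small files
# instead of the full corpora
writeLines(lessBlogs, "sample_en_US.blogs.txt")
writeLines(lessNews, "sample_en_US.news.txt")
writeLines(lessTwitter, "sample_en_US.twitter.txt")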
Clean the data with the tm text-mining package.
library(tm)
## Loading required package: NLP
corBlogs <- Corpus(VectorSource(lessBlogs))
cleanBlogs <- tm_map(corBlogs, removePunctuation)
# Note: stop words are removed before lower-casing, so capitalised stop words
# such as "The" survive and appear in the frequency lists below
cleanBlogs <- tm_map(cleanBlogs, removeWords, stopwords("english"))
cleanBlogs <- tm_map(cleanBlogs, removeNumbers)
cleanBlogs <- tm_map(cleanBlogs, stripWhitespace)
cleanBlogs <- tm_map(cleanBlogs, content_transformer(tolower)) # wrapper keeps the corpus structure intact
cleanBlogs <- tm_map(cleanBlogs, stemDocument) # stemming needs the SnowballC package
corNews <- Corpus(VectorSource(lessNews))
cleanNews <- tm_map(corNews, removePunctuation)
cleanNews <- tm_map(cleanNews, removeWords, stopwords("english"))
cleanNews <- tm_map(cleanNews, removeNumbers)
cleanNews <- tm_map(cleanNews, stripWhitespace)
# Unlike the blogs corpus, the news and twitter corpora are not stemmed
cleanNews <- tm_map(cleanNews, content_transformer(tolower))
corTwitter <- Corpus(VectorSource(lessTwitter))
cleanTwitter <- tm_map(corTwitter, removePunctuation)
cleanTwitter <- tm_map(cleanTwitter, removeWords, stopwords("english"))
cleanTwitter <- tm_map(cleanTwitter, removeNumbers)
cleanTwitter <- tm_map(cleanTwitter, stripWhitespace)
cleanTwitter <- tm_map(cleanTwitter, content_transformer(tolower))
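The same cleaning pipeline is written out three times above; a minimal sketch of a helper that factors it out (the function name and the stem argument are illustrative, not from the original code):
# Apply the shared tm cleaning steps to a character vector of text;
# stemming is optional because only the blogs corpus is stemmed above
cleanCorpus <- function(text, stem = FALSE) {
  corp <- Corpus(VectorSource(text))
  corp <- tm_map(corp, removePunctuation)
  corp <- tm_map(corp, removeWords, stopwords("english"))
  corp <- tm_map(corp, removeNumbers)
  corp <- tm_map(corp, stripWhitespace)
  corp <- tm_map(corp, content_transformer(tolower))
  if (stem) corp <- tm_map(corp, stemDocument)
  corp
}
# cleanBlogs   <- cleanCorpus(lessBlogs, stem = TRUE)
# cleanNews    <- cleanCorpus(lessNews)
# cleanTwitter <- cleanCorpus(lessTwitter)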
# Build a document-term matrix for each cleaned corpus
dtmBlogs <- DocumentTermMatrix(cleanBlogs)
dtmNews <- DocumentTermMatrix(cleanNews)
dtmTwitter <- DocumentTermMatrix(cleanTwitter)
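These matrices are extremely sparse; if memory becomes an issue, tm's removeSparseTerms() can drop rare terms before further analysis (the 0.999 threshold below is an illustrative choice, not from the original):
# Keep only terms appearing in at least ~0.1% of documents
dtmBlogsSmall <- removeSparseTerms(dtmBlogs, 0.999)
dtmNewsSmall <- removeSparseTerms(dtmNews, 0.999)
dtmTwitterSmall <- removeSparseTerms(dtmTwitter, 0.999)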
The most frequently occurring terms in each corpus:
findFreqTerms(dtmBlogs, lowfreq = 10000)
## [1] "also" "make" "one" "the" "can" "know" "love" "peopl"
## [9] "want" "and" "day" "just" "like" "now" "see" "thing"
## [17] "think" "get" "look" "new" "year" "back" "good" "first"
## [25] "will" "even" "time" "use" "work" "this" "way"
findFreqTerms(dtmNews, lowfreq = 500)
## [1] "last" "year" "game" "new" "will" "can" "said"
## [8] "the" "state" "get" "one" "two" "years" "make"
## [15] "now" "people" "and" "just" "first" "city" "also"
## [22] "time" "back" "but" "three" "like" "school" "percent"
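A sorted frequency table is more useful than a flat term list for plotting. Converting a DocumentTermMatrix with as.matrix() can exhaust memory, so the sketch below uses column sums from the slam package (a dependency of tm); the variable names are illustrative:
library(slam)
# Sum term counts over all documents without densifying the sparse matrix
blogFreq <- sort(col_sums(dtmBlogs), decreasing = TRUE)
head(blogFreq, 20)
# If the wordcloud package is installed, a word cloud is one line more:
# wordcloud::wordcloud(names(blogFreq), blogFreq, max.words = 100)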
Because the full data set is so large, processing it needs more memory than my system has. So far I have only done a 1-gram analysis; next I will look for a more memory-efficient way to do 2-gram (and higher-order) analysis, plot word clouds, and build a better predictive model so that new text can be handled well. Finally, I will create a Shiny app and make the whole analysis available on a server.
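For the planned 2-gram analysis, one memory-friendly option is to count bigrams directly on the sampled lines with the NLP package (loaded alongside tm) instead of building another document-term matrix; a minimal sketch, with illustrative names:
library(NLP) # provides ngrams()
# Lower-case a line, split it into word tokens, and paste consecutive pairs
bigrams_of <- function(line) {
  toks <- unlist(strsplit(tolower(line), "[^a-z']+"))
  toks <- toks[nzchar(toks)]
  if (length(toks) < 2) return(character(0))
  vapply(ngrams(toks, 2L), paste, character(1), collapse = " ")
}
# Bigram counts over the sampled blog lines (may take a while on 20% of blogs)
blogBigrams <- sort(table(unlist(lapply(lessBlogs, bigrams_of))), decreasing = TRUE)
head(blogBigrams, 10)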