Assignment 1: Milestone Report

Summary

Load the data and show a summary. Currently only the en_US folder is considered for the analysis.

blogs <- readLines("D:\\Lab\\Final\\final\\en_US\\en_US.blogs.txt", warn = F)
news <- readLines("D:\\Lab\\Final\\final\\en_US\\en_US.news.txt", warn = F)
twitter <- readLines("D:\\Lab\\Final\\final\\en_US\\en_US.twitter.txt", warn = F)
summary(blogs)
##    Length     Class      Mode 
##    899288 character character
summary(news)
##    Length     Class      Mode 
##     77259 character character
summary(twitter)
##    Length     Class      Mode 
##   2360148 character character
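
summary() on a character vector only reports its length. As an optional, illustrative sketch (not part of the original analysis), line counts and rough whitespace-delimited word counts give a fuller picture of the three files:

lineCounts <- c(blogs = length(blogs), news = length(news), twitter = length(twitter))
wordCounts <- c(blogs   = sum(lengths(strsplit(blogs,   "\\s+"))),
                news    = sum(lengths(strsplit(news,    "\\s+"))),
                twitter = sum(lengths(strsplit(twitter, "\\s+"))))
rbind(lines = lineCounts, words = wordCounts)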

The full data set is very large, so a 20% random sample of each file is taken.

# Fix the random seed so the sampling is reproducible
set.seed(1234)
lessBlogs   <- sample(blogs,   round(0.2 * length(blogs)))
lessNews    <- sample(news,    round(0.2 * length(news)))
lessTwitter <- sample(twitter, round(0.2 * length(twitter)))

Clean the data

The sampled text is cleaned with the tm text-mining package: punctuation, English stop words, and numbers are removed, extra whitespace is stripped, and everything is converted to lower case. The blogs sample is additionally stemmed.

library(tm)
## Loading required package: NLP
# Build a corpus from the blogs sample and apply the cleaning steps
corBlogs <- Corpus(VectorSource(lessBlogs))
cleanBlogs <- tm_map(corBlogs, removePunctuation)
cleanBlogs <- tm_map(cleanBlogs, removeWords, stopwords("english"))
cleanBlogs <- tm_map(cleanBlogs, removeNumbers)
cleanBlogs <- tm_map(cleanBlogs, stripWhitespace)
cleanBlogs <- tm_map(cleanBlogs, content_transformer(tolower))
cleanBlogs <- tm_map(cleanBlogs, stemDocument)

# Same steps for the news sample (no stemming)
corNews <- Corpus(VectorSource(lessNews))
cleanNews <- tm_map(corNews, removePunctuation)
cleanNews <- tm_map(cleanNews, removeWords, stopwords("english"))
cleanNews <- tm_map(cleanNews, removeNumbers)
cleanNews <- tm_map(cleanNews, stripWhitespace)
cleanNews <- tm_map(cleanNews, content_transformer(tolower))

# Same steps for the twitter sample (no stemming)
corTwitter <- Corpus(VectorSource(lessTwitter))
cleanTwitter <- tm_map(corTwitter, removePunctuation)
cleanTwitter <- tm_map(cleanTwitter, removeWords, stopwords("english"))
cleanTwitter <- tm_map(cleanTwitter, removeNumbers)
cleanTwitter <- tm_map(cleanTwitter, stripWhitespace)
cleanTwitter <- tm_map(cleanTwitter, content_transformer(tolower))
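
The three cleaning pipelines above repeat the same tm_map calls. As an illustrative refactor (a sketch, not the code used to produce the results in this report), they can be wrapped in a single helper whose stem argument toggles the extra stemming applied only to the blogs sample:

cleanCorpus <- function(lines, stem = FALSE) {
  corp <- Corpus(VectorSource(lines))
  corp <- tm_map(corp, removePunctuation)
  corp <- tm_map(corp, removeWords, stopwords("english"))
  corp <- tm_map(corp, removeNumbers)
  corp <- tm_map(corp, stripWhitespace)
  corp <- tm_map(corp, content_transformer(tolower))
  if (stem) corp <- tm_map(corp, stemDocument)   # stem only where wanted
  corp
}
# e.g. cleanBlogs <- cleanCorpus(lessBlogs, stem = TRUE)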

Find the most frequent words

dtmBlogs<- DocumentTermMatrix(cleanBlogs)
dtmNews <- DocumentTermMatrix(cleanNews)
dtmTwitter <- DocumentTermMatrix(cleanTwitter)

Terms that appear more often than the chosen threshold (10,000 times in the blogs sample, 500 times in the news sample):

findFreqTerms(dtmBlogs, lowfreq =  10000)
##  [1] "also"  "make"  "one"   "the"   "can"   "know"  "love"  "peopl"
##  [9] "want"  "and"   "day"   "just"  "like"  "now"   "see"   "thing"
## [17] "think" "get"   "look"  "new"   "year"  "back"  "good"  "first"
## [25] "will"  "even"  "time"  "use"   "work"  "this"  "way"
findFreqTerms(dtmNews, lowfreq =  500)
##  [1] "last"    "year"    "game"    "new"     "will"    "can"     "said"   
##  [8] "the"     "state"   "get"     "one"     "two"     "years"   "make"   
## [15] "now"     "people"  "and"     "just"    "first"   "city"    "also"   
## [22] "time"    "back"    "but"     "three"   "like"    "school"  "percent"
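
findFreqTerms only lists terms above a fixed cutoff. As an alternative sketch (not part of the original results), term counts can be ranked directly from the document-term matrix using slam, which is installed as a dependency of tm and keeps the matrix sparse; the twitter sample is used here as an example:

library(slam)
freqTwitter <- sort(col_sums(dtmTwitter), decreasing = TRUE)
head(freqTwitter, 20)   # top 20 terms in the twitter sample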

Conclusion

Because of the size of the full data set, processing all of it requires more memory than my system has, which is why only a 20% sample was used. So far only a 1-gram analysis has been done; the next steps are to add 2-gram (and higher-order n-gram) analysis, plot word clouds, and build a better predictive model so that new text can be handled more accurately. Finally, I will create a Shiny app and make the analysis available on a server.
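
As a first step toward the planned 2-gram analysis, here is a minimal base-R sketch of bigram counting on the sampled text (the helper name countBigrams is illustrative and not from this report; pairs that span line boundaries add a little noise):

countBigrams <- function(lines, top = 20) {
  words <- unlist(strsplit(tolower(lines), "[^a-z']+"))   # crude tokenizer
  words <- words[words != ""]
  bigrams <- paste(head(words, -1), tail(words, -1))      # adjacent word pairs
  head(sort(table(bigrams), decreasing = TRUE), top)
}
countBigrams(lessBlogs)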