Loading the News text file:
setwd("~/coursera/data scientist/Capstone/Coursera-SwiftKey/final/en_US")
conr<-file("en_US.news.txt", "rb")
text<-readLines(conr)
close(conr)
The downloaded News text file contained 1010242 lines and 34372598 words. Prior to writing this report I loaded and cleaned all the data (News, Blogs and Twitter). However, the full data files were too large to plot: I ran the PC overnight and the analysis still did not finish. Therefore, as instructed in the lectures, I reduce the data size by randomly sampling lines with the rbinom() function.
set.seed(123456)
i<-rbinom(length(text), 1, 0.5)
text<-text[which(i>0)]
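As a minimal sketch of how this sampling behaves, the snippet below applies the same coin-flip idea to a small made-up vector. The `lines` vector is a hypothetical stand-in for the real text, not part of the original analysis.

```r
set.seed(123456)

# Hypothetical stand-in for the lines read from the text file
lines <- paste("line", 1:1000)

# One independent coin flip per line: TRUE means the line is kept
keep <- rbinom(length(lines), size = 1, prob = 0.5) == 1
sampled <- lines[keep]

# Fraction kept: close to 0.5, but generally not exactly 0.5
length(sampled) / length(lines)
```

Because every line gets its own coin flip, the size of the sample is itself random. Lowering `prob` keeps a smaller share of the lines.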
During the initial data cleaning, control characters, punctuation and digits were removed. Letters were then converted to lower case and profanity was removed, using a list of profanity words obtained online. The resulting text was saved to the file en_US2.news.txt in the textdata directory.
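# replace control characters, punctuation and digits with a space
t<-gsub("[[:cntrl:][:punct:][:digit:]]", " ", text)
# drop any remaining non-ASCII characters
t<-iconv(t, "latin1", "ASCII", sub="")
t<-tolower(t) # convert to lower case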
conr<-file("profanity.txt", "rb")
profanity<-readLines(conr)
close(conr)
profanity<-tolower(profanity)
pattern<-paste(profanity, collapse = "|")
t<-gsub(pattern, "", t)
# drop words in which a sequence repeats three or more times in a row (e.g. "hahaha")
t<-gsub("\\b\\S*(\\S+?)\\1{2}\\S*\\b", " ", t, perl=TRUE)
conw<-file("textdata/en_US2.news.txt","w")
writeLines(t, conw)
close(conw)
rm(conr, conw, text, t)
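One point worth noting about the profanity step: a pattern built by joining the words with "|" matches them anywhere, including inside longer, harmless words. A possible refinement, shown here only as a sketch, is to wrap the alternation in word boundaries. The `bad_words` list below is a mild, hypothetical stand-in for the real profanity list.

```r
# Hypothetical stand-in for the profanity list
bad_words <- c("darn", "heck")

x <- c("what the heck", "checking the heckler", "darn it")

# Plain alternation: also matches inside "checking" and "heckler"
naive_pattern <- paste(bad_words, collapse = "|")
naive <- gsub(naive_pattern, "", x)

# Alternation wrapped in word boundaries: whole words only
bounded_pattern <- paste0("\\b(", paste(bad_words, collapse = "|"), ")\\b")
bounded <- gsub(bounded_pattern, "", x, perl = TRUE)

naive    # second element is mangled
bounded  # second element is left unchanged
```

The bounded version leaves "checking the heckler" untouched, while the plain alternation cuts "heck" out of both words.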
The same steps were repeated for the Blogs and Twitter text files.
The downloaded Blogs text file contained 449266 lines and 18645366 words.
The downloaded Twitter text file contained 1179849 lines and 15186130 words.
The next cleanup step uses the tm package. First, stop words are removed, since these very common words carry little significance for the analysis. The text is then stemmed, meaning that endings such as -ing and -s are removed. Finally, extra white space is stripped. A new directory called cleandata must be created manually inside the textdata directory; the cleaned data is saved inside that cleandata directory.
library(tm) # text mining package; stemDocument relies on the SnowballC package
setwd("~/coursera/data scientist/Capstone/Coursera-SwiftKey/final/en_US")
docs<-Corpus(DirSource("textdata"))
docs<-tm_map(docs, removeWords, stopwords("english")) # remove stop words
docs<-tm_map(docs, stemDocument)
docs<-tm_map(docs, stripWhitespace) # remove white space
setwd("textdata/cleandata")
writeCorpus(docs)
After all the cleanup, some single-letter words remain in the text. These will be deleted.
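The code for this step is not shown in the report. As a rough sketch of one way to do it, assuming the text is already lower case and free of punctuation, single letters can be removed with a word-boundary pattern. The vector `x` is a made-up example; stray letters like these typically appear when "dog's" becomes "dog s" after punctuation is replaced by spaces.

```r
# Made-up example lines, already lower-cased and without punctuation
x <- c("a dog s toy", "i saw u there")

# Replace every stand-alone single letter with a space
y <- gsub("\\b[a-z]\\b", " ", x, perl = TRUE)

# Collapse the runs of spaces left behind and trim the ends
y <- trimws(gsub("\\s+", " ", y))
y
```

Note that this also removes the legitimate one-letter words "a" and "i", which matches the stated goal of deleting all single-letter words.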
Loading the data into a corpus, using the tm package:
setwd("~/coursera/data scientist/Capstone/Coursera-SwiftKey/final/en_US")
docs<-Corpus(DirSource("textdata/cleandata/gooddata"))
docs<-tm_map(docs, stripWhitespace)
meta(docs, "id")
## $en_US2b.blogs.txt.txt
## [1] "en_US2b.blogs.txt.txt"
##
## $en_US2b.news.txt.txt
## [1] "en_US2b.news.txt.txt"
##
## $en_US2b.twitter.txt.txt
## [1] "en_US2b.twitter.txt.txt"
dtm<-DocumentTermMatrix(docs)
After all the filtering, the total numbers of words in the Blogs, News and Twitter files were 9360261, 738333 and 7971506, respectively.
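The report does not show how these totals were computed. In a document-term matrix the rows are documents and the columns are terms, so row sums give the number of words per document and column sums give the total count of each term. The tiny matrix `m` below is invented purely to illustrate this; it is not taken from the real dtm.

```r
# Invented 2 x 3 document-term matrix (rows = documents, columns = terms)
m <- matrix(c(2, 0, 1,
              1, 3, 0),
            nrow = 2, byrow = TRUE,
            dimnames = list(Docs = c("blogs", "news"),
                            Terms = c("can", "just", "like")))

words_per_doc <- rowSums(m)  # total words in each document
term_totals   <- colSums(m)  # total count of each term across documents

words_per_doc
term_totals
```

On the real data the same idea would presumably be applied to as.matrix(dtm), which can need a lot of memory for a large vocabulary.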
The 20 most frequent words were:
freq <- sort(colSums(as.matrix(dtm)), decreasing=TRUE)
head(freq, 20)
##    can   just   like    get    one   will   time   love    day   make
## 131206 127875 123965 122755 116182 110301 100833  95553  95060  79701
##   know   good  thank    now    don    see   work    new  think   look
##  79158  78873  75554  74025  68767  67444  66462  64864  63960  63457
A histogram of words that were used more than 50000 times:
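A chart of this kind could be drawn along the following lines. This is only a sketch using base graphics: the `freq` vector is typed in from the top-20 output above rather than computed from dtm, and the original plotting code may well have used a different package.

```r
# Counts copied from the top-20 output above (stand-in for the full freq vector)
freq <- c(can = 131206, just = 127875, like = 123965, get = 122755,
          one = 116182, will = 110301, time = 100833, love = 95553,
          day = 95060, make = 79701, know = 79158, good = 78873,
          thank = 75554, now = 74025, don = 68767, see = 67444,
          work = 66462, new = 64864, think = 63960, look = 63457)

# Keep only the words used more than 50000 times
frequent <- freq[freq > 50000]

# One bar per word, with labels rotated so they stay readable
barplot(frequent, las = 2, cex.names = 0.8,
        ylab = "Frequency", main = "Words used more than 50000 times")
```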
Below, the 50 most popular words are plotted in a color word cloud:
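A word cloud like this can be produced with the wordcloud package. The sketch below assumes that wordcloud and RColorBrewer are installed and skips the plot otherwise. The `freq` vector holds only the first ten counts from the top-20 output above, as a stand-in for the full vector computed from dtm, and the color palette is an arbitrary choice.

```r
# First ten counts from the top-20 output above (stand-in for the full freq vector)
freq <- c(can = 131206, just = 127875, like = 123965, get = 122755,
          one = 116182, will = 110301, time = 100833, love = 95553,
          day = 95060, make = 79701)

# Take at most the 50 most frequent words
top <- head(sort(freq, decreasing = TRUE), 50)

# Draw the cloud only if the package is available
if (requireNamespace("wordcloud", quietly = TRUE)) {
  set.seed(123456)  # word placement is random, so fix the seed
  wordcloud::wordcloud(words = names(top), freq = top, min.freq = 1,
                       random.order = FALSE,
                       colors = RColorBrewer::brewer.pal(8, "Dark2"))
}
```

With random.order = FALSE the most frequent words are placed first, near the center of the figure.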
Building 2-grams and 3-grams:
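The report's own n-gram code is not included in this section. To illustrate the idea, here is a minimal base R sketch that splits each line into words and pastes every run of n consecutive words together. The helper `make_ngrams` and the two example lines are hypothetical, and a dedicated tokenizer would likely be a better fit for the full data set.

```r
# Hypothetical helper: return all n-grams (as strings) from one line of text
make_ngrams <- function(line, n) {
  words <- strsplit(trimws(line), "\\s+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))  # line too short for an n-gram
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Made-up example lines
lines <- c("thank you for the follow", "thank you so much")

bigrams  <- unlist(lapply(lines, make_ngrams, n = 2))
trigrams <- unlist(lapply(lines, make_ngrams, n = 3))

# Frequency tables, most common n-gram first
bigram_freq  <- sort(table(bigrams), decreasing = TRUE)
trigram_freq <- sort(table(trigrams), decreasing = TRUE)

head(bigram_freq)
```

N-grams are built per line, so no n-gram spans the boundary between two lines.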