Loading and cleaning data

Loading the News text file:

setwd("~/coursera/data scientist/Capstone/Coursera-SwiftKey/final/en_US")
conr<-file("en_US.news.txt", "rb") 
text<-readLines(conr)
close(conr)

The downloaded News text file contained 1010242 lines and 34372598 words. Before writing this report I loaded and cleaned all the data (News, Blogs and Twitter). However, the files were too large for any plotting: I ran the analysis overnight on my PC without success. Therefore, as instructed in the lecture, I reduce the file size by sampling lines with the rbinom() function, keeping each line with probability 0.5.

set.seed(123456)
i<-rbinom(length(text), 1, 0.5)
text<-text[which(i>0)]
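
To illustrate what this sampling step does, here is a minimal sketch on a toy vector. The names `toy_lines`, `keep` and `toy_sample` are hypothetical and used only for this example. Each line is kept independently with probability 0.5, so roughly half of the lines survive.

```r
set.seed(123456)
toy_lines <- paste("line", 1:10000)        # hypothetical stand-in for the real text
keep <- rbinom(length(toy_lines), 1, 0.5)  # one coin flip (0 or 1) per line
toy_sample <- toy_lines[keep == 1]         # keep only the lines whose flip came up 1
length(toy_sample) / length(toy_lines)     # fraction kept, close to 0.5
```

Lowering the third argument of rbinom() keeps fewer lines; for example, 0.1 would keep about 10% of them.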

During the initial cleaning, control characters, punctuation and digits were removed. Letters were then converted to lower case, and profane words were removed. The profanity word list was obtained online. The resulting text was saved to the file en_US2.news.txt in the textdata directory.

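# replace control characters, punctuation and digits with spaces
t<-gsub("[[:cntrl:][:punct:][:digit:]]", " ", text)
# drop characters that cannot be represented in ASCII
t<-iconv(t, "latin1", "ASCII", sub="")
t<-tolower(t) # convert to lower case
# read the profanity list and remove every match from the text
conr<-file("profanity.txt", "rb")
profanity<-readLines(conr)
close(conr)
profanity<-tolower(profanity)
pattern<-paste(profanity, collapse = "|")
t<-gsub(pattern, "", t)
# remove words in which a character sequence repeats three times in a row
t<-gsub("\\b\\S*(\\S+?)\\1{2}\\S*\\b", " ", t, perl=TRUE)
# save the cleaned text
conw<-file("textdata/en_US2.news.txt","w")
writeLines(t, conw)
close(conw)
rm(conr, conw, text, t)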
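
One thing to be aware of: a pattern built by pasting the words together with "|" also matches inside longer words. Below is a minimal sketch of a word-boundary variant, assuming the list contains plain words without regex metacharacters. `bad_words` holds harmless hypothetical stand-ins, not the real list, and `cleaned` is a name used only here.

```r
bad_words <- c("darn", "heck")  # hypothetical stand-ins for the profanity list
# wrap the alternatives in \b ... \b so that only whole words match
pattern <- paste0("\\b(", paste(bad_words, collapse = "|"), ")\\b")
x <- c("well darn it", "a checkered flag", "heckle the speaker")
cleaned <- gsub(pattern, "", x, perl = TRUE)
cleaned  # "darn" is removed; "checkered" and "heckle" are left intact
```

Without the word boundaries, "heck" would also be cut out of "checkered" and "heckle".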

The same steps were repeated for the Blogs and Twitter text files:

The downloaded Blogs text file contained 449266 lines and 18645366 words.

The downloaded Twitter text file contained 1179849 lines and 15186130 words.

The next clean-up step uses the tm package. First, stop words are removed, since these very common words carry little meaning. Then the text is stemmed, meaning that endings like -ing and -s are removed. Finally, extra white space is stripped. A new directory called cleandata must be created manually inside the textdata directory; the cleaned data is saved there.

library(tm) # stemDocument also needs the SnowballC package installed
setwd("~/coursera/data scientist/Capstone/Coursera-SwiftKey/final/en_US")
docs<-Corpus(DirSource("textdata"))
docs<-tm_map(docs, removeWords, stopwords("english")) # remove stop words
docs<-tm_map(docs, stemDocument) # reduce words to their stems
docs<-tm_map(docs, stripWhitespace)  # remove white space
setwd("textdata/cleandata")
writeCorpus(docs) # one file per document, written to the current directory

After all the clean-up, some single-letter words remain. These will be deleted.
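
The report does not show the code for this step, so the following is only a minimal base R sketch of one way to do it. The helper name `remove_single_letters` is hypothetical, and the sketch assumes the text is already lower case and free of punctuation.

```r
# hypothetical helper: drop words that consist of a single letter
remove_single_letters <- function(x) {
  x <- gsub("\\b[a-z]\\b", " ", x, perl = TRUE)  # blank out lone letters
  gsub("\\s+", " ", trimws(x))                   # collapse the leftover spaces
}
remove_single_letters("don t know a thing")
```

Single letters such as "t" and "s" are typically left behind when punctuation is replaced by spaces, for example in "don't" or "it's".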

Text data analysis

Loading the data into a corpus. The tm package is used here; the cleaned files are loaded into the corpus docs:

setwd("~/coursera/data scientist/Capstone/Coursera-SwiftKey/final/en_US")
docs<-Corpus(DirSource("textdata/cleandata/gooddata"))
docs<-tm_map(docs, stripWhitespace)

meta(docs, "id")
## $en_US2b.blogs.txt.txt
## [1] "en_US2b.blogs.txt.txt"
## 
## $en_US2b.news.txt.txt
## [1] "en_US2b.news.txt.txt"
## 
## $en_US2b.twitter.txt.txt
## [1] "en_US2b.twitter.txt.txt"
dtm<-DocumentTermMatrix(docs)

After all the filtering, the total numbers of words in the Blogs, News and Twitter files were 9360261, 738333 and 7971506, respectively.

Cleaned data characteristics

The 20 most frequent words were:

freq <- sort(colSums(as.matrix(dtm)), decreasing=TRUE)
head(freq, 20)
##    can   just   like    get    one   will   time   love    day   make 
## 131206 127875 123965 122755 116182 110301 100833  95553  95060  79701 
##   know   good  thank    now    don    see   work    new  think   look 
##  79158  78873  75554  74025  68767  67444  66462  64864  63960  63457

Discussion

  1. This data will be used to build a text prediction algorithm and a Shiny app.
  2. I don't know yet how to proceed from the DocumentTermMatrix to prediction:
     1. 2-gram or 3-gram word sequences?
     2. A neural network?
  3. Difficulties:
     1. DocumentTermMatrix takes several hours to execute, which is too long to try many options.
     2. I was not able to launch RStudio on AWS.
  4. Comments and advice are appreciated.
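
For the n-gram option raised above, here is a minimal base R sketch of how word sequences could be counted. It is not the method used in this report; the function name `make_ngrams` and the two example lines are hypothetical, and the sketch assumes the text is already cleaned and that words are separated by white space.

```r
# hypothetical helper: split one line into its n-grams
make_ngrams <- function(line, n) {
  words <- strsplit(trimws(line), "\\s+")[[1]]
  if (length(words) < n) return(character(0))  # line too short for an n-gram
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

toy_lines <- c("thank you for the follow", "thank you so much")  # made-up input
bigrams <- unlist(lapply(toy_lines, make_ngrams, n = 2))
bigram_freq <- sort(table(bigrams), decreasing = TRUE)  # most frequent first
head(bigram_freq, 3)
```

A frequency table like this could then serve as a lookup: given the last word typed, suggest the most frequent 2-gram that starts with it.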

Building 2-grams and 3-grams: