The purpose of this report is to present the exploratory data analysis for the Johns Hopkins and Coursera Data Science Specialization Capstone project, built with the tm package. The first step consists of getting the data from the source supplied by the Specialization and taking a sample of it. The second step is the data cleaning (removing punctuation symbols and undesired characters), tokenization, and generation of the Document-Term Matrix.
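The corpus and Document-Term Matrix work below relies on the tm package, and stemDocument() additionally needs SnowballC installed. A minimal setup chunk, assuming both packages are already installed, would be:
library(tm)        # corpus handling, cleaning transformations, DocumentTermMatrix
library(SnowballC) # stemming backend used by stemDocument()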
The data comes from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip. It gathers text from three web sources (Blogs, News, and Twitter) in four different languages (de_DE, en_US, fi_FI, ru_RU); we use en_US for this report.
# Download and extract the raw data only if it is not already present
if (!file.exists("./datasource/")) {
  dir.create("./datasource", recursive = TRUE)
  download.file(url = "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
                destfile = "./datasource/Coursera-SwiftKey.zip")
  unzip(zipfile = "./datasource/Coursera-SwiftKey.zip", exdir = "./datasource/")
  file.remove("./datasource/Coursera-SwiftKey.zip")  # keep only the extracted files
}
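The archive extracts to ./datasource/final/, and the read step below expects the English files under en_US/. An optional quick check of the extracted files:
list.files("./datasource/final/en_US")  # should list the blogs, news and twitter .txt files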
Once the data is downloaded, we read each file with readLines() to get a large character vector per source that we can process in R.
rawBlogs <- readLines("./datasource/final/en_US/en_US.blogs.txt", encoding = "UTF-8")
rawNews <- readLines("./datasource/final/en_US/en_US.news.txt", encoding = "UTF-8")
rawTwitter <- readLines("./datasource/final/en_US/en_US.twitter.txt", encoding = "UTF-8")
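For reference, the in-memory size and the number of lines of each source can be checked with base R before deciding how much to sample; a quick sketch (the exact values depend on the download and are not reproduced here):
format(object.size(rawBlogs), units = "MB")    # in-memory size of the blogs vector
format(object.size(rawNews), units = "MB")
format(object.size(rawTwitter), units = "MB")
length(rawTwitter)                             # number of lines (tweets) read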
With the data loaded, we can see that the files are large, which makes them slow to process; for this reason we take a roughly 1% random sample of each of the three sources.
# Keep each line independently with probability 0.01 (roughly a 1% sample)
sample_blogs   <- rawBlogs[rbinom(n = length(rawBlogs), size = 1, prob = 0.01) == 1]
sample_news    <- rawNews[rbinom(n = length(rawNews), size = 1, prob = 0.01) == 1]
sample_twitter <- rawTwitter[rbinom(n = length(rawTwitter), size = 1, prob = 0.01) == 1]
# Free the memory held by the full data sets
rm(rawBlogs, rawNews, rawTwitter)
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 647803 34.6 5644814 301.5 4877527 260.5
## Vcells 6071217 46.4 89648842 684.0 100540667 767.1
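Note that rbinom() makes the sample random, so the exact lines kept change on every run; fixing a seed before the sampling step would make the report reproducible. A sketch, with an arbitrary seed value chosen only for illustration:
set.seed(1234)  # arbitrary fixed seed; run this before the rbinom() sampling above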
With the data sampled, we are ready to transform the plain text into a corpus object so we can work with the tm package.
corpora <- VCorpus(VectorSource(list(sample_blogs,sample_news,sample_twitter)))
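A quick sanity check on the corpus, assuming the samples were passed in the order blogs, news, twitter:
length(corpora)  # expected: 3 documents (1 = blogs, 2 = news, 3 = twitter)
corpora          # prints a short <<VCorpus>> summary with the document count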
The corpus lets us manipulate the data with the tm transformations (cleaning) and build a Document-Term Matrix that summarizes the most frequent terms.
# Character class covering punctuation plus quote and dash symbols that
# removePunctuation() can miss (curly quotes and the em dash are not ASCII)
punctuation_symbols <- "[(,.!?;:\")'‘“_—-]"
clean_symbols <- content_transformer(function(x) gsub(punctuation_symbols, "", x))
corpora <- tm_map(corpora, clean_symbols)
corpora <- tm_map(corpora, removePunctuation)
corpora <- tm_map(corpora, removeNumbers)
corpora <- tm_map(corpora, removeWords, stopwords("english"))
corpora <- tm_map(corpora, stripWhitespace)
corpora <- tm_map(corpora, stemDocument)
corpora <- tm_map(corpora, content_transformer(tolower))
head(corpora[[1]]$content)
## [1] "contain dmae provid firm tone benefit enhanc facial contour an antiinflammatori enhanc strength antioxid therapiesvitamin c ester patent a power antioxid improv firm elast diminish appear fine line wrinkl discolor impart radiant smooth lumin appear"
## [2] "ok yall ja you guys read the one i usual shame ignor the one i seldom address direct im go lay line"
## [3] "in word use mantra god pleas god want anyon approach him senseless manner he want us right use mind he creat us to approach holi lord without use gift full sens abomin gross sign disrespect it’s silli — act silli god almighti that sin"
## [4] "i know i post pictur back i didn’t write i think i tri post iphon text didn’t show anyway ’s i think i snap photo"
## [5] "i was a professor of law not exactli senior lectur on"
## [6] "ps my italian vocabgrammar knowledg now practic nonexist sad will fix immedi upon return us"
Finally, with the corpus fully processed (cleaned and normalized), we can build the Document-Term Matrix and examine the most frequent terms that will feed the machine learning model later on.
dtm <- DocumentTermMatrix(corpora)
findMostFreqTerms(dtm, n = 10)
## $`1`
## the one will like time can just get make day
## 1956 1274 1242 1083 1049 984 958 940 749 690
##
## $`2`
## the said will year one new time state say can
## 2644 2427 1119 1003 823 712 654 646 637 613
##
## $`3`
## just get like thank love day good will the dont
## 1502 1483 1347 1265 1243 1080 1042 945 939 891
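findMostFreqTerms() reports frequencies per document; to aggregate the counts over the three sources, one option is to sum the columns of the matrix. A sketch (as.matrix() is fine here only because the sampled DTM is small; for larger data the sparse slam::col_sums() would be preferable):
term_freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)  # total count per term
head(term_freq, 10)                                            # overall top 10 terms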
From the three documents we can pick out common terms such as one, day, the, like, can, will, time, and get. These terms can help us build the model.