Text mining using the tm package

The purpose of this report is to present the exploratory data analysis for the Johns Hopkins and Coursera Data Science Specialization Capstone project. The first step consists of getting the data from the source supplied by the Specialization and sampling it. The second step is data cleaning (removing punctuation symbols and undesired characters) and generating the Document-Term Matrix through tokenization.

1. Getting and Sampling Data

The data source is the following page: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip. We get data from three web sources (Blogs, News and Twitter) in four different languages (de_DE, en_US, fi_FI, ru_RU); we chose en_US for this report.

# download and unpack the data set only if it is not already present
if(!file.exists("./datasource/")){
      dir.create("./datasource", recursive = TRUE)
      download.file(url = "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
                    destfile = "./datasource/Coursera-SwiftKey.zip")
      unzip(zipfile = "./datasource/Coursera-SwiftKey.zip", exdir = "./datasource/")
      file.remove("./datasource/Coursera-SwiftKey.zip")
}

Once the data is downloaded we read each file with the readLines() function, obtaining one large character vector per source to process in R.

rawBlogs <- readLines("./datasource/final/en_US/en_US.blogs.txt", encoding = "UTF-8")
rawNews <- readLines("./datasource/final/en_US/en_US.news.txt", encoding = "UTF-8")
rawTwitter <- readLines("./datasource/final/en_US/en_US.twitter.txt", encoding = "UTF-8",
                        skipNul = TRUE)  # the Twitter file contains embedded nul characters

With the data loaded we can see that the files are large, which makes processing slow; for this reason we sample roughly 1% of the lines from each of the three sources.

set.seed(1234)  # fix the RNG seed so the 1% sample is reproducible (seed value is arbitrary)
sample_blogs <- rawBlogs[rbinom(n = length(rawBlogs), size = 1, prob = 0.01) == 1]
sample_news <- rawNews[rbinom(n = length(rawNews), size = 1, prob = 0.01) == 1]
sample_twitter <- rawTwitter[rbinom(n = length(rawTwitter), size = 1, prob = 0.01) == 1]
rm(rawBlogs, rawNews, rawTwitter)  # free the full data sets
gc()
##           used (Mb) gc trigger  (Mb)  max used  (Mb)
## Ncells  647803 34.6    5644814 301.5   4877527 260.5
## Vcells 6071217 46.4   89648842 684.0 100540667 767.1
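
As a quick sanity check on the sampling, we can count lines and words in each sample (a minimal sketch; the sample_stats helper is ours, not part of the original pipeline):

# hypothetical helper: rough size of each 1% sample (lines and words)
sample_stats <- function(x) c(lines = length(x),
                              words = sum(lengths(strsplit(x, "\\s+"))))
sapply(list(blogs = sample_blogs, news = sample_news, twitter = sample_twitter),
       sample_stats)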

2. Obtaining the Corpus for Exploratory Data Analysis

The data has been sampled and is now ready to be transformed from plain character vectors into a corpus object so we can use the tm package.

library(tm)  # also needs the SnowballC package installed for stemDocument()
corpora <- VCorpus(VectorSource(list(sample_blogs, sample_news, sample_twitter)))
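
A quick inspection confirms the corpus now holds one document per source (a sketch; the exact summary depends on the sample):

inspect(corpora)    # should report a VCorpus with 3 documents
meta(corpora[[1]])  # metadata of the first document (blogs)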

The corpus lets us manipulate the data with the tm methods: cleaning transformations and a Document-Term Matrix that summarizes the top-frequency terms.

# one regular-expression character class with every symbol to strip
# ('-' goes last so it is read as a literal, not a range)
punctuation_symbols <- "[(,.!?;:\")'‘“_—-]"
clean_symbols <- content_transformer(function(x) gsub(punctuation_symbols, "", x))
corpora <- tm_map(corpora, clean_symbols)                      # custom symbols, incl. curly quotes
corpora <- tm_map(corpora, removePunctuation)                  # remaining ASCII punctuation
corpora <- tm_map(corpora, removeNumbers)
corpora <- tm_map(corpora, removeWords, stopwords("english"))  # drops lowercase stop words only
corpora <- tm_map(corpora, stripWhitespace)
corpora <- tm_map(corpora, stemDocument)                       # Porter stemming
corpora <- tm_map(corpora, content_transformer(tolower))
# note: tolower runs after removeWords, so capitalized stop words such as "The"
# survive removal, which is why "the" still appears among the top terms below

head(corpora[[1]]$content)
## [1] "contain dmae provid firm tone benefit enhanc facial contour an antiinflammatori enhanc strength antioxid therapiesvitamin c ester patent a power antioxid improv firm elast diminish appear fine line wrinkl discolor impart radiant smooth lumin appear"
## [2] "ok yall ja you guys read the one i usual shame ignor the one i seldom address direct im go lay line"                                                                                                                                                     
## [3] "in word use mantra god pleas god want anyon approach him senseless manner he want us right use mind he creat us to approach holi lord without use gift full sens abomin gross sign disrespect it’s silli — act silli god almighti that sin"              
## [4] "i know i post pictur back i didn’t write i think i tri post iphon text didn’t show anyway ’s i think i snap photo"                                                                                                                                       
## [5] "i was a professor of law not exactli senior lectur on"                                                                                                                                                                                                   
## [6] "ps my italian vocabgrammar knowledg now practic nonexist sad will fix immedi upon return us"

Finally, with the corpus fully processed (cleaned and normalized), we can obtain the Document-Term Matrix and list the top-frequency terms that will feed the machine learning model in the future.

dtm <- DocumentTermMatrix(corpora)
findMostFreqTerms(dtm, n = 10)
## $`1`
##  the  one will like time  can just  get make  day 
## 1956 1274 1242 1083 1049  984  958  940  749  690 
## 
## $`2`
##   the  said  will  year   one   new  time state   say   can 
##  2644  2427  1119  1003   823   712   654   646   637   613 
## 
## $`3`
##  just   get  like thank  love   day  good  will   the  dont 
##  1502  1483  1347  1265  1243  1080  1042   945   939   891

From the three documents we can extract common terms like one, day, the, like, can, will, time and get. These terms can help us build the model. (As noted above, the stop word "the" survives only because removeWords ran before tolower.)
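
Since the prediction model will need word sequences rather than single words, the same matrix can be rebuilt over bigrams with a custom tokenizer (a sketch following the tm FAQ recipe; dtm2 is our name, not from the report):

# bigram tokenizer: every pair of consecutive words becomes one term
BigramTokenizer <- function(x)
      unlist(lapply(NLP::ngrams(NLP::words(x), 2), paste, collapse = " "),
             use.names = FALSE)
dtm2 <- DocumentTermMatrix(corpora, control = list(tokenize = BigramTokenizer))
findMostFreqTerms(dtm2, n = 10)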