Abstract

We complete the first steps towards constructing a prediction app for Coursera’s Data Science capstone project. We download the data sets that will be used to train the app. We clean the data, construct corpora, and perform some exploratory data analysis. We begin to think about how to build the algorithm for our app.

For ease of reading, I have suppressed the display of most of the code.

Data Processing

Download the Data Sets

The data is stored in a zip file which may be downloaded here.
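The download itself is not shown. A minimal sketch of how it might be done is below; the URL value and the local file name Coursera-SwiftKey.zip are assumptions, so substitute the link referenced above if it differs.

# download and extract the capstone data (URL assumed; see the link above)
data_url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!dir.exists("final")) {
  download.file(data_url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
  unzip("Coursera-SwiftKey.zip")   # extracts into the 'final' directory
}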

Let’s see which files we’ve downloaded.

We consider only the English-language files.

# list.files("final")
list.files("final/en_US")
## [1] "en_US.blogs.txt"   "en_US.news.txt"    "en_US.twitter.txt"
## Warning in readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", :
## incomplete final line found on 'final/en_US/en_US.twitter.txt'
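The reading step itself is suppressed; it amounts to something like the sketch below. The encoding argument appears in the warning above, while skipNul = TRUE is an assumption (the warning line is truncated after the encoding argument).

blogs   <- readLines("final/en_US/en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("final/en_US/en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)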

Convert All Characters to ASCII and Save to Text Files

This was necessary since the news file had characters (emoticons) that were causing the program to crash.

blogs <- iconv(blogs, "latin1", "ASCII", sub="")
news <- iconv(news, "latin1", "ASCII", sub="")
twitter <- iconv(twitter, "latin1", "ASCII", sub="")

# write the converted data to plain .txt files
# (writeLines rather than save(), which would produce binary RData files)
writeLines(blogs, "blogs.txt")
writeLines(news, "news.txt")
writeLines(twitter, "twitter.txt")

Basic Statistics

First, we look at properties of the files themselves.

##    Source Size_in_MB Total_Lines Total_Words
## 1   Blogs     200.42      899288    37510767
## 2 Twitter      63.16      935554    11929481
## 3    News     196.28     1010242    34749565
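The code behind this table is suppressed; a sketch of how it might be assembled is below. Which files were measured for the sizes is not stated, so the paths are assumptions, and stringi's word counter is assumed for the word totals.

library(stringi)

data.frame(Source      = c("Blogs", "Twitter", "News"),
           Size_in_MB  = round(file.size(c("blogs.txt", "twitter.txt", "news.txt")) / 1024^2, 2),
           Total_Lines = c(length(blogs), length(twitter), length(news)),
           Total_Words = c(sum(stri_count_words(blogs)),
                           sum(stri_count_words(twitter)),
                           sum(stri_count_words(news))))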

Next, we compute line counts, character counts, and summary statistics for the number of words per line in each file.
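The counts shown below match the output format of stringi::stri_stats_general(), so the summaries were presumably produced along these lines; treating the second block as a summary of words per line is an assumption.

library(stringi)

file_summary <- function(x) {
  print(stri_stats_general(x))    # Lines, LinesNEmpty, Chars, CharsNWhite
  summary(stri_count_words(x))    # distribution of words per line
}

file_summary(blogs)    # likewise for news and twitter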

For blogs, we have

##       Lines LinesNEmpty       Chars CharsNWhite 
##      899288      899165   206043906   169609063
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    9.00   28.00   41.71   60.00 6725.00

For news, we have

##       Lines LinesNEmpty       Chars CharsNWhite 
##     1010242     1010241   202917604   169555316
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    19.0    31.0    34.4    46.0  1796.0

For twitter, we have

##       Lines LinesNEmpty       Chars CharsNWhite 
##     2360148     2360148   161961555   133948120
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   12.00   12.75   18.00   47.00

Data Sampling

Given the large sizes of these files, we sample 10,000 lines from each file to improve processing efficiency. The combined sample, 30,000 lines in total, is saved as all_samp.txt.

##        Source Size_in_MB Total_Lines Total_Words
## 1 All Samples       2.17       30000      890042
##       Lines LinesNEmpty       Chars CharsNWhite 
##       30000       29998     4995589     4139868
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    9.00   20.00   29.67   39.00  813.00
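The sampling code is suppressed; one way to produce all_samp.txt is sketched below. The seed value and the use of sample() without replacement are assumptions.

set.seed(1234)   # assumed seed, for reproducibility
all_samp <- c(sample(blogs,   10000),
              sample(news,    10000),
              sample(twitter, 10000))
writeLines(all_samp, "all_samp.txt")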

Data Cleaning and Corpus Building

We create a corpus from the all_samp.txt file and then clean it. Using the text mining package tm, we apply the following transformations: conversion to lower case, whitespace stripping, removal of punctuation, numbers, special characters (#, /, @, |), English stop words, URLs, and profanity, then stemming and conversion back to plain text documents.

## Warning in tm_map.SimpleCorpus(corp, content_transformer(tolower)):
## transformation drops documents
## Warning in tm_map.SimpleCorpus(corp, stripWhitespace): transformation drops
## documents
## Warning in tm_map.SimpleCorpus(corp, removePunctuation): transformation drops
## documents
## Warning in tm_map.SimpleCorpus(corp, removeNumbers): transformation drops
## documents
## Warning in tm_map.SimpleCorpus(corp, special_chars, "#|/|@|\\|"):
## transformation drops documents
## Warning in tm_map.SimpleCorpus(corp, removeWords, stopwords("english")):
## transformation drops documents
## Warning in tm_map.SimpleCorpus(corp, content_transformer(removeURL)):
## transformation drops documents
## Warning in tm_map.SimpleCorpus(corp, removeWords, profanities): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(corp, stemDocument): transformation drops
## documents
## Warning in tm_map.SimpleCorpus(corp, PlainTextDocument): transformation drops
## documents
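The cleaning code is suppressed, but the warnings above record the calls that were made. The sketch below reconstructs that pipeline; the definitions of special_chars, removeURL, and the profanities word list are assumptions, since the originals are not shown.

library(tm)
library(SnowballC)   # provides stemDocument

corp <- Corpus(VectorSource(readLines("all_samp.txt")))

# assumed helper transformations
special_chars <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
removeURL     <- function(x) gsub("http[[:alnum:][:punct:]]*", "", x)

corp <- tm_map(corp, content_transformer(tolower))        # lower-case all text
corp <- tm_map(corp, stripWhitespace)                     # collapse repeated whitespace
corp <- tm_map(corp, removePunctuation)                   # remove punctuation
corp <- tm_map(corp, removeNumbers)                       # remove digits
corp <- tm_map(corp, special_chars, "#|/|@|\\|")          # remove #, /, @ and | characters
corp <- tm_map(corp, removeWords, stopwords("english"))   # remove English stop words
corp <- tm_map(corp, content_transformer(removeURL))      # remove URLs
corp <- tm_map(corp, removeWords, profanities)            # remove profanity (word list assumed loaded)
corp <- tm_map(corp, stemDocument)                        # stem words to their roots
corp <- tm_map(corp, PlainTextDocument)                   # convert back to plain text documents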

N-Gram Tokenization

We use unigrams, bigrams, and trigrams to find word frequencies and correlations between words.

# For more information, see: https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf

dtm <- DocumentTermMatrix(corp)
dtm <- removeSparseTerms(dtm, 0.75)   # drop terms that are zero in more than 75% of documents

# n-gram tokenizers built with RWeka's NGramTokenizer (note: tokenize the argument x, not the whole corpus)
uni_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
unidtm <- DocumentTermMatrix(corp, control = list(tokenize = uni_tokenizer))
## Warning in TermDocumentMatrix.SimpleCorpus(x, control): custom functions are
## ignored
## Warning in TermDocumentMatrix.SimpleCorpus(x, control): custom tokenizer is
## ignored
bi_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
bidtm <- DocumentTermMatrix(corp, control = list(tokenize = bi_tokenizer))
## Warning in TermDocumentMatrix.SimpleCorpus(x, control): custom functions are
## ignored
## Warning in TermDocumentMatrix.SimpleCorpus(x, control): custom tokenizer is
## ignored
tri_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
tridtm <- DocumentTermMatrix(corp, control = list(tokenize = tri_tokenizer))
## Warning in TermDocumentMatrix.SimpleCorpus(x, control): custom functions are
## ignored
## Warning in TermDocumentMatrix.SimpleCorpus(x, control): custom tokenizer is
## ignored
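As the warnings indicate, DocumentTermMatrix() ignores custom tokenizers when the corpus is a SimpleCorpus, so bidtm and tridtm above end up as unigram matrices. One common workaround, sketched below under the assumption that RWeka is loaded, is to rebuild the cleaned text as a VCorpus, which does honour the tokenize control:

library(tm)
library(RWeka)

# rebuild the cleaned text as a VCorpus so the custom tokenizers are used
corp_v <- VCorpus(VectorSource(sapply(corp, as.character)))

bi_tokenizer  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tri_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

bidtm  <- DocumentTermMatrix(corp_v, control = list(tokenize = bi_tokenizer))
tridtm <- DocumentTermMatrix(corp_v, control = list(tokenize = tri_tokenizer))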

Exploratory Data Analysis

We use histograms and word clouds to explore the frequencies of words in our corpus. Let us start by looking at the words with the highest frequencies. Note that, because the custom tokenizers were ignored (see the warnings above), the bigram and trigram tables below simply repeat the unigram frequencies:

## [1] "Unigrams - 10 Most Frequent"
##      word freq
## said said 2931
## will will 2860
## one   one 2719
## like like 2390
## get   get 2264
## time time 2253
## just just 2235
## can   can 2089
## year year 2050
## make make 1780

## [1] "Bigrams - 10 Most Frequent"
##      word freq
## said said 2931
## will will 2860
## one   one 2719
## like like 2390
## get   get 2264
## time time 2253
## just just 2235
## can   can 2089
## year year 2050
## make make 1780

## [1] "Trigrams - 10 Most Frequent"
##      word freq
## said said 2931
## will will 2860
## one   one 2719
## like like 2390
## get   get 2264
## time time 2253
## just just 2235
## can   can 2089
## year year 2050
## make make 1780
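The frequency tables above, and the unifreq vector used for the word cloud below, can be obtained from the document-term matrices by summing and sorting the term counts. A sketch for the unigram case follows; the data-frame construction is an assumption.

unifreq <- sort(colSums(as.matrix(unidtm)), decreasing = TRUE)
unifreq_df <- data.frame(word = names(unifreq), freq = unifreq)
head(unifreq_df, 10)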

Let us look at the word cloud for unigrams, as word clouds tend to be more visually interesting than histograms.

Top 50 Unigrams

set.seed(666)
wordcloud(names(unifreq), unifreq, max.words=50, scale=c(5, .1), colors=brewer.pal(8, "Dark2"))
## Warning in wordcloud(names(unifreq), unifreq, max.words = 50, scale = c(5, :
## new could not be fit on page. It will not be plotted.

Observations and Next Steps for the Prediction App

Even after cleaning the corpus, processing the data still takes a noticeable amount of time. We need to find ways to handle the data sets more efficiently if our app is going to be useful.