Abstract

We complete the first steps towards constructing a prediction app for Coursera’s Data Science capstone project. We download the data sets that will be used to train the app. We clean the data, construct corpora, and perform some exploratory data analysis. We begin to think about how to build the algorithm for our app.

For ease of reading, I have suppressed the display of most of the code.

Data Processing

Download the Data Sets

The data is stored in a zip file linked from the course; the download step is sketched below.
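
A minimal sketch of that step follows. The URL is the Coursera-SwiftKey archive usually linked from the capstone instructions; if the course link differs, substitute it here.

# download the zip archive (URL assumed from the capstone instructions) and extract it
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("Coursera-SwiftKey.zip")) {
    download.file(url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
}
if (!dir.exists("final")) {
    unzip("Coursera-SwiftKey.zip")   # creates the final/ directory
}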

Let’s see which files we’ve downloaded.

We consider only the English language files.

# list.files("final")
list.files("final/en_US")
## [1] "en_US.blogs.txt"   "en_US.news.txt"    "en_US.twitter.txt"

Convert all characters to ASCII and save to text files

This was necessary since the news file had characters (emoticons) that were causing the program to crash.
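
The conversion below assumes the three files were read in with readLines; a sketch of that step (skipNul guards against embedded nulls in the raw files):

blogs   <- readLines("final/en_US/en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("final/en_US/en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)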

blogs <- iconv(blogs, "latin1", "ASCII", sub="")
news <- iconv(news, "latin1", "ASCII", sub="")
twitter <- iconv(twitter, "latin1", "ASCII", sub="")

# write the cleaned text back out as plain .txt files
writeLines(blogs, "blogs.txt")
writeLines(news, "news.txt")
writeLines(twitter, "twitter.txt")

Basic Statistics

First, we look at properties of the files themselves.

##    Source Size_in_MB Total_Lines Total_Words
## 1   Blogs     200.42      899288    37510168
## 2 Twitter     159.36     2360148    30088605
## 3    News     196.28     1010242    34749301
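
These figures can be reproduced along the following lines (a sketch, assuming the stringi package for word counts; object names are illustrative):

library(stringi)

# file size in MB, line count, and word count for each source
data.frame(
    Source      = c("Blogs", "Twitter", "News"),
    Size_in_MB  = round(file.size(c("final/en_US/en_US.blogs.txt",
                                    "final/en_US/en_US.twitter.txt",
                                    "final/en_US/en_US.news.txt")) / 1024^2, 2),
    Total_Lines = c(length(blogs), length(twitter), length(news)),
    Total_Words = c(sum(stri_count_words(blogs)),
                    sum(stri_count_words(twitter)),
                    sum(stri_count_words(news)))
)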

Get data about the line counts, character counts, and 5-number summary for words for each file.
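
These per-file summaries combine stringi's general statistics with a five-number summary of words per line; a sketch (shown here for blogs, with news and twitter handled the same way):

library(stringi)

# line/character counts and the distribution of words per line
stri_stats_general(blogs)
summary(stri_count_words(blogs))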

For blogs, we have

##       Lines LinesNEmpty       Chars CharsNWhite 
##      899288      899165   206043906   169609063
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    9.00   28.00   41.71   60.00 6725.00

For news, we have

##       Lines LinesNEmpty       Chars CharsNWhite 
##     1010242     1010241   202917604   169555316
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    19.0    31.0    34.4    46.0  1796.0

For twitter, we have

##       Lines LinesNEmpty       Chars CharsNWhite 
##     2360148     2360148   161961555   133948120
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   12.00   12.75   18.00   47.00

Data Sampling

Given the large sizes of these files, we sample 10,000 lines from each file to improve data processing efficiency. The combined sample is saved as all_samp.txt.
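
A sketch of the sampling step (the seed value is illustrative; any fixed seed makes the sample reproducible):

set.seed(1234)   # illustrative seed for reproducibility

# draw 10,000 lines from each source, combine, and write out the sample
all_samp <- c(sample(blogs,   10000),
              sample(news,    10000),
              sample(twitter, 10000))
writeLines(all_samp, "all_samp.txt")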

##        Source Size_in_MB Total_Lines Total_Words
## 1 All Samples       2.18       30000      896843
##       Lines LinesNEmpty       Chars CharsNWhite 
##       30000       30000     5030312     4167940
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    9.00   20.00   29.89   39.00 1796.00

Data Cleaning and Corpus Building

We create a corpus from the all_samp.txt file and then clean it using the text mining package tm; a sketch of the cleaning transformations is given below.

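A sketch of the cleaning pipeline, assuming the usual tm transformations; the exact set was suppressed with the rest of the code, but the stemmed tokens in the n-gram tables below (e.g. "candid", "citi", "happi") indicate that Snowball stemming was applied:

library(tm)
library(SnowballC)   # supplies the Snowball stemmer used by stemDocument()

corp <- VCorpus(VectorSource(readLines("all_samp.txt")))

corp <- tm_map(corp, content_transformer(tolower))       # lowercase everything
corp <- tm_map(corp, removePunctuation)                  # e.g. "don't" -> "dont"
corp <- tm_map(corp, removeNumbers)                      # drop digits
corp <- tm_map(corp, removeWords, stopwords("english"))  # drop common English stopwords
corp <- tm_map(corp, stripWhitespace)                    # collapse extra spaces
corp <- tm_map(corp, stemDocument)                       # Snowball stemming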

N-Gram Tokenization

We use unigrams, bigrams, and trigrams to find word frequencies and correlations between words.

# For more information, see: https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf
library(RWeka)   # provides NGramTokenizer and Weka_control

# baseline document-term matrix with very sparse terms removed
dtm <- DocumentTermMatrix(corp)
dtm <- removeSparseTerms(dtm, 0.75)

# each tokenizer operates on the text x handed in by DocumentTermMatrix, not on the whole corpus
uni_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
unidtm <- DocumentTermMatrix(corp, control = list(tokenize = uni_tokenizer))

bi_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
bidtm <- DocumentTermMatrix(corp, control = list(tokenize = bi_tokenizer))

tri_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
tridtm <- DocumentTermMatrix(corp, control = list(tokenize = tri_tokenizer))
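
From these document-term matrices we can tabulate sorted n-gram frequencies. A sketch (it assumes the matrices fit in memory; unifreq is the object used for the word cloud later on, while bifreq and trifreq are illustrative names):

# sorted term frequencies for each n-gram size
unifreq <- sort(colSums(as.matrix(unidtm)), decreasing = TRUE)
bifreq  <- sort(colSums(as.matrix(bidtm)),  decreasing = TRUE)
trifreq <- sort(colSums(as.matrix(tridtm)), decreasing = TRUE)

# frequency tables like the ones printed in the next section
head(data.frame(word = names(unifreq), freq = unifreq), 10)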

Exploratory Data Analysis

We use histograms and word clouds to explore the frequencies of words in our corpus. Let us start by looking at words with high frequency:

## [1] "Unigrams - 10 Most Frequent"
##      word freq
## said said 3024
## will will 2941
## one   one 2719
## like like 2405
## just just 2331
## get   get 2286
## time time 2190
## can   can 2072
## year year 2007
## make make 1791

## [1] "Bigrams - 10 Most Frequent"
##                          word freq
## last year           last year  198
## new york             new york  181
## dont know           dont know  171
## look like           look like  160
## high school       high school  159
## right now           right now  157
## year ago             year ago  147
## feel like           feel like  137
## last week           last week  136
## board district board district  126

## [1] "Trigrams - 10 Most Frequent"
##                                            word freq
## township board district township board district  126
## district candid file       district candid file   67
## board district candid     board district candid   59
## state repres district     state repres district   30
## cant wait see                     cant wait see   24
## luck luck luck                   luck luck luck   22
## new york citi                     new york citi   21
## two year ago                       two year ago   20
## happi mother day               happi mother day   18
## let us know                         let us know   17
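
The histograms mentioned above are built from the same frequency tables; a minimal sketch for the top unigrams (plotting choices are illustrative):

# bar plot of the ten most frequent unigrams
barplot(head(unifreq, 10), las = 2, col = "steelblue",
        main = "Unigrams - 10 Most Frequent", ylab = "Frequency")
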

Let us look at the word cloud for unigrams, as these tend to be more visually interesting than histograms.

Top 50 Unigrams

library(wordcloud)      # wordcloud()
library(RColorBrewer)   # brewer.pal()
set.seed(666)
wordcloud(names(unifreq), unifreq, max.words=50, scale=c(5, .1), colors=brewer.pal(8, "Dark2"))

Observations and Next Steps for the Prediction App

Even after cleaning, the corpus still takes a noticeable amount of time to process. We need to find ways to handle the data more quickly if our app is going to be useful; one possible direction is sketched below.
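
A sketch of that direction, under the assumption that the frequency tables built above are what the app will query: precompute the n-gram counts once and store them, so the app only loads compact lookup objects instead of rebuilding the corpus.

# save the precomputed n-gram frequency tables (names follow the sketch above)
saveRDS(unifreq, "unifreq.rds")
saveRDS(bifreq,  "bifreq.rds")
saveRDS(trifreq, "trifreq.rds")

# the app would then start with, for example:
# unifreq <- readRDS("unifreq.rds")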