We complete the first steps towards constructing a prediction app for Coursera’s Data Science capstone project. We download the data sets that will be used to train the app. We clean the data, construct corpora, and perform some exploratory data analysis. We begin to think about how to build the algorithm for our app.
For ease of reading, I have suppressed the display of most of the code.
The data is stored in a zip file which may be downloaded here.
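For reproducibility, a minimal download-and-unzip sketch follows; the URL shown is the one commonly used for this capstone and is an assumption, since the original link is not reproduced here.
# Minimal sketch: download and extract the data set (URL is assumed; adjust if needed)
if (!dir.exists("final")) {
  url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
  download.file(url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
  unzip("Coursera-SwiftKey.zip")   # extracts into the "final" directory
}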
Let’s see which files we’ve downloaded.
We consider only the English language files.
# list.files("final")
list.files("final/en_US")
## [1] "en_US.blogs.txt" "en_US.news.txt" "en_US.twitter.txt"
## Warning in readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", :
## incomplete final line found on 'final/en_US/en_US.twitter.txt'
Next, we convert the text to ASCII, dropping any characters that cannot be converted. This was necessary because the news file contained characters (emoticons) that were causing the program to crash.
blogs <- iconv(blogs, "latin1", "ASCII", sub="")
news <- iconv(news, "latin1", "ASCII", sub="")
twitter <- iconv(twitter, "latin1", "ASCII", sub="")
# save the cleaned data (note: save() writes R's binary format, even though the files are named .txt)
save(blogs, file="blogs.txt")
save(news, file="news.txt")
save(twitter, file="twitter.txt")
First, we look at properties of the files themselves.
## Source Size_in_MB Total_Lines Total_Words
## 1 Blogs 200.42 899288 37510767
## 2 Twitter 63.16 935554 11929481
## 3 News 196.28 1010242 34749565
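The summary table could be built along these lines (a sketch, since the original code is suppressed; it assumes the blogs, news, and twitter vectors read above):
# Sketch: file sizes plus line and word counts (requires the stringi package)
library(stringi)
files <- file.path("final/en_US",
                   c("en_US.blogs.txt", "en_US.twitter.txt", "en_US.news.txt"))
texts <- list(blogs, twitter, news)
data.frame(Source      = c("Blogs", "Twitter", "News"),
           Size_in_MB  = round(file.size(files) / 1024^2, 2),
           Total_Lines = sapply(texts, length),
           Total_Words = sapply(texts, function(x) sum(stri_count_words(x))))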
Next, we compute line counts, character counts, and a five-number summary of words per line for each file.
For blogs, we have
## Lines LinesNEmpty Chars CharsNWhite
## 899288 899165 206043906 169609063
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 9.00 28.00 41.71 60.00 6725.00
For news, we have
## Lines LinesNEmpty Chars CharsNWhite
## 1010242 1010241 202917604 169555316
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 19.0 31.0 34.4 46.0 1796.0
For twitter, we have
## Lines LinesNEmpty Chars CharsNWhite
## 2360148 2360148 161961555 133948120
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 12.00 12.75 18.00 47.00
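The statistics above match the output of stringi's helper functions; a sketch for the blogs file (the other two are analogous, and the original code is suppressed):
# Sketch: general statistics and a five-number summary of words per line
library(stringi)
stri_stats_general(blogs)         # Lines, LinesNEmpty, Chars, CharsNWhite
summary(stri_count_words(blogs))  # Min., 1st Qu., Median, Mean, 3rd Qu., Max.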
Given the large sizes of these files, we sample 10,000 lines from each file in order to improve data processing efficiency. The resulting file is called all_samp.
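The sampling step might look like the following; only the 10,000-line draw per source is stated above, so the seed and the file handling are assumptions.
# Sketch of the sampling step (seed value is an assumption)
set.seed(1234)
all_samp <- c(sample(blogs,   10000),
              sample(news,    10000),
              sample(twitter, 10000))
writeLines(all_samp, "all_samp.txt")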
## Source Size_in_MB Total_Lines Total_Words
## 1 All Samples 2.17 30000 890042
## Lines LinesNEmpty Chars CharsNWhite
## 30000 29998 4995589 4139868
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 9.00 20.00 29.67 39.00 813.00
We create a corpus from the all_samp.txt file and then clean it. We use the text mining library tm to perform the following transformations: conversion to lower case, whitespace stripping, punctuation removal, number removal, removal of special characters (#, /, @, |), removal of English stop words, URL removal, profanity filtering, stemming, and conversion to plain text documents.
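A sketch of the cleaning pipeline, consistent with the warnings below; the definitions of special_chars, removeURL, and the profanities word list are assumptions, since the original code is suppressed.
# Sketch of the tm cleaning pipeline (helper definitions are assumptions)
library(tm)
special_chars <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
removeURL     <- function(x) gsub("http[[:alnum:][:punct:]]*", "", x)
profanities   <- readLines("profanities.txt")  # hypothetical list of words to filter

corp <- Corpus(VectorSource(readLines("all_samp.txt")))
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, stripWhitespace)
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, removeNumbers)
corp <- tm_map(corp, special_chars, "#|/|@|\\|")
corp <- tm_map(corp, removeWords, stopwords("english"))
corp <- tm_map(corp, content_transformer(removeURL))
corp <- tm_map(corp, removeWords, profanities)
corp <- tm_map(corp, stemDocument)       # requires the SnowballC package
corp <- tm_map(corp, PlainTextDocument)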
## Warning in tm_map.SimpleCorpus(corp, content_transformer(tolower)):
## transformation drops documents
## Warning in tm_map.SimpleCorpus(corp, stripWhitespace): transformation drops
## documents
## Warning in tm_map.SimpleCorpus(corp, removePunctuation): transformation drops
## documents
## Warning in tm_map.SimpleCorpus(corp, removeNumbers): transformation drops
## documents
## Warning in tm_map.SimpleCorpus(corp, special_chars, "#|/|@|\\|"):
## transformation drops documents
## Warning in tm_map.SimpleCorpus(corp, removeWords, stopwords("english")):
## transformation drops documents
## Warning in tm_map.SimpleCorpus(corp, content_transformer(removeURL)):
## transformation drops documents
## Warning in tm_map.SimpleCorpus(corp, removeWords, profanities): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(corp, stemDocument): transformation drops
## documents
## Warning in tm_map.SimpleCorpus(corp, PlainTextDocument): transformation drops
## documents
We use unigrams, bigrams, and trigrams to find word frequencies and correlations between words.
# For more information, see: https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf
dtm <- DocumentTermMatrix(corp)
dtm <- removeSparseTerms(dtm, 0.75)
uni_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))  # NGramTokenizer comes from the RWeka package
unidtm <- DocumentTermMatrix(corp, control = list(tokenize = uni_tokenizer))
## Warning in TermDocumentMatrix.SimpleCorpus(x, control): custom functions are
## ignored
## Warning in TermDocumentMatrix.SimpleCorpus(x, control): custom tokenizer is
## ignored
bi_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
bidtm <- DocumentTermMatrix(corp, control = list(tokenize = bi_tokenizer))
## Warning in TermDocumentMatrix.SimpleCorpus(x, control): custom functions are
## ignored
## Warning in TermDocumentMatrix.SimpleCorpus(x, control): custom tokenizer is
## ignored
tri_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
tridtm <- DocumentTermMatrix(corp, control = list(tokenize = tri_tokenizer))
## Warning in TermDocumentMatrix.SimpleCorpus(x, control): custom functions are
## ignored
## Warning in TermDocumentMatrix.SimpleCorpus(x, control): custom tokenizer is
## ignored
We use histograms and word clouds to explore the frequencies of words in our corpus. Let us start by looking at words with high frequency. (Because the custom tokenizers were ignored, as the warnings above indicate, the bigram and trigram tables below simply repeat the unigram frequencies.)
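The frequency tables are derived from the document-term matrices; one way to do this is sketched here (the original code is suppressed, and unifreq is reused by the word cloud further down).
# Sketch: term frequencies sorted in decreasing order
unifreq <- sort(colSums(as.matrix(unidtm)), decreasing = TRUE)
bifreq  <- sort(colSums(as.matrix(bidtm)),  decreasing = TRUE)
trifreq <- sort(colSums(as.matrix(tridtm)), decreasing = TRUE)
print("Unigrams - 10 Most Frequent")
head(data.frame(word = names(unifreq), freq = unifreq), 10)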
## [1] "Unigrams - 10 Most Frequent"
## word freq
## said said 2931
## will will 2860
## one one 2719
## like like 2390
## get get 2264
## time time 2253
## just just 2235
## can can 2089
## year year 2050
## make make 1780
## [1] "Bigrams - 10 Most Frequent"
## word freq
## said said 2931
## will will 2860
## one one 2719
## like like 2390
## get get 2264
## time time 2253
## just just 2235
## can can 2089
## year year 2050
## make make 1780
## [1] "Trigrams - 10 Most Frequent"
## word freq
## said said 2931
## will will 2860
## one one 2719
## like like 2390
## get get 2264
## time time 2253
## just just 2235
## can can 2089
## year year 2050
## make make 1780
Let us look at the word cloud for unigrams, as these tend to be visually more interesting than histograms.
set.seed(666)
wordcloud(names(unifreq), unifreq, max.words=50, scale=c(5, .1), colors=brewer.pal(8, "Dark2"))
## Warning in wordcloud(names(unifreq), unifreq, max.words = 50, scale = c(5, :
## new could not be fit on page. It will not be plotted.
Stemming has made building the corpus more efficient, but we need to address the potential for awkward choices in the App. For example, “happi mother day” appears in the corpus instead of “happy mothers day”.
Removing stop words also makes for a clean corpus, but we should not exclude them from the App.
Even after cleaning the corpus, it still takes some time to process the data. We need to find ways to process the data sets more quickly if our App is going to be useful.