Summarize Assignment Goals

  1. Demonstrate that you’ve downloaded the data and have successfully loaded it in.
  2. Create a basic report of summary statistics about the data sets.
  3. Report any interesting findings that you amassed so far.
  4. Get feedback on your plans for creating a prediction algorithm and Shiny app.

Load Data and Create Samples

SwiftKey and the team at Johns Hopkins University made the capstone project dataset available for download. To avoid repeatedly downloading the 500+ MB file, my script checks for the archive, downloads it if not found, and extracts the dataset.
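
A minimal sketch of that check-download-extract step; the URL is the commonly cited course download link and the file names are assumptions, not necessarily the exact ones in my script:

```r
# Download the capstone dataset only if it is not already present,
# then unzip it once (URL and paths are assumptions for illustration).
zip_url  <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
zip_file <- "Coursera-SwiftKey.zip"

if (!file.exists(zip_file)) {
  download.file(zip_url, destfile = zip_file, mode = "wb")
}
if (!dir.exists("final")) {  # the archive unpacks into a 'final' directory
  unzip(zip_file)
}
```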

Initial exploration proved the dataset too large to process on my hardware. Though I later moved to Amazon Web Services (see Appendix A), sampling the data sped development. Randomly sampling 10% of the corpus lines yielded sub-100 MB files small enough to process and to upload to GitHub.
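
A minimal sketch of the sampling step, assuming base R and illustrative file paths:

```r
# Keep roughly 10% of the lines from each corpus file (paths illustrative).
set.seed(1234)
dir.create("sample", showWarnings = FALSE)

sample_lines <- function(in_file, out_file, fraction = 0.1) {
  lines <- readLines(in_file, encoding = "UTF-8", skipNul = TRUE)
  keep  <- rbinom(length(lines), size = 1, prob = fraction) == 1
  writeLines(lines[keep], out_file)
}

sample_lines("final/en_US/en_US.blogs.txt",   "sample/en_US.blogs.txt")
sample_lines("final/en_US/en_US.news.txt",    "sample/en_US.news.txt")
sample_lines("final/en_US/en_US.twitter.txt", "sample/en_US.twitter.txt")
```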

Summarize Text

The three English text files - blogs, news, and Twitter - varied in size. I investigated further by summarizing simple statistics on line counts and characters per line:
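
A minimal sketch of how such a summary can be computed with base R (the sample paths are carried over from the previous step and remain assumptions):

```r
# Line counts and per-line character statistics for each sample file.
summarize_file <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  chars <- nchar(lines)
  data.frame(File.Name        = basename(path),
             Line.Count       = length(lines),
             Mean.Char.Line   = mean(chars),
             Median.Char.Line = median(chars),
             Stddev.Char.Line = sd(chars),
             Min.Char.Line    = min(chars),
             Max.Char.Line    = max(chars))
}

files <- file.path("sample", c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"))
do.call(rbind, lapply(files, summarize_file))
```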

| File              | Lines  | Mean chars/line | Median chars/line | Std. dev. chars/line | Min chars/line | Max chars/line |
|-------------------|--------|-----------------|-------------------|----------------------|----------------|----------------|
| en_US.blogs.txt   | 90098  | 229.4905        | 154               | 251.35918            | 1              | 4714           |
| en_US.news.txt    | 101228 | 201.0837        | 185               | 134.56986            | 1              | 8949           |
| en_US.twitter.txt | 235727 | 68.6378         | 64                | 37.21169             | 3              | 140            |

The results make clear that the Twitter file contains much shorter lines (tweets were capped at 140 characters), while the blogs and news files align more closely. Their average line length, however, suggests multiple sentences per line, which may need consideration in developing the predictive model.

Term-Document Matrix

After characters come words, or so my elementary school teachers always told me! Building a term-document matrix began with cleaning the sample corpus (a sketch of the cleaning step follows the list below). My two immediate concerns:

  1. How would including/excluding common words (“stop words”) affect counts?
  2. What to do with punctuation? Twitter is notorious for “l33t” speak, smileys, and other non-words…
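
A sketch of the cleaning pass, assuming the tm package and the sample directory from earlier; the exact transformations in my script may differ:

```r
library(tm)

# Read the three sample files into a corpus and normalize the text.
corpus <- VCorpus(DirSource("sample", encoding = "UTF-8"),
                  readerControl = list(language = "en"))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)  # also drops smileys and most "l33t" artifacts
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
```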

Including Stop Words

I chose to keep stop words in the first term-document matrix. Anticipating that they would dominate, I graphed the top 10 terms by count, which revealed much:
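
A sketch of how that matrix and chart can be produced with tm and ggplot2 (variable names illustrative; the original plotting code may differ):

```r
library(ggplot2)

# Term-document matrix with stop words retained.
tdm <- TermDocumentMatrix(corpus)

# Ten most frequent terms across all three documents.
term_freq <- sort(slam::row_sums(tdm), decreasing = TRUE)
top10 <- data.frame(term = names(term_freq)[1:10], count = term_freq[1:10])

ggplot(top10, aes(x = reorder(term, count), y = count)) +
  geom_col() +
  coord_flip() +
  labs(x = "Term", y = "Count", title = "Top 10 terms, stop words included")
```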

[Figure: bar chart of the top 10 terms by count, stop words included]

The stop words overwhelm all others. Removing them will help exploration.

Excluding Stop Words

With stop words removed, key terms in each source stand out, such as “said” in news articles and “im” in Twitter. Possible pitfalls for prediction also surface: “im” instead of “I’m”.
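
For reference, rebuilding the matrix without English stop words is a small change in tm (a sketch):

```r
# Remove English stop words, then rebuild the term-document matrix.
corpus_nostop <- tm_map(corpus, removeWords, stopwords("english"))
tdm_nostop    <- TermDocumentMatrix(corpus_nostop)
```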

Explore N-Grams

Given the clear differences that stop words make for single terms, I started by comparing bi-grams with stop words included:

Bi-grams without stop words:

The take-away: stop words still dominate. However, the more significant finding comes from examining tri-grams with prediction as the goal. Example: a user types “case of” and the program must predict the next word. The 10% sample corpus returns…
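
A sketch of the trigram search behind the output below; RWeka's n-gram tokenizer is one common approach, and the helper and variable names are illustrative:

```r
library(RWeka)

# Build a trigram term-document matrix from the (stop-word-retaining) corpus.
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
tdm_tri <- TermDocumentMatrix(corpus, control = list(tokenize = TrigramTokenizer))

# Keep only trigrams that start with the typed phrase "case of".
idx     <- which(grepl("^case of ", Terms(tdm_tri)))
case_of <- tdm_tri[idx, ]

print(paste(length(idx), "possible trigrams. Top 10 results:"))
inspect(case_of[order(slam::row_sums(case_of), decreasing = TRUE)[1:10], ])
```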

## [1] "190 possible trigrams. Top 10 results:"
## <<TermDocumentMatrix (terms: 10, documents: 3)>>
## Non-/sparse entries: 22/8
## Sparsity           : 27%
## Maximal term length: 17
## Weighting          : term frequency (tf)
## 
##                    Docs
## Terms               en_US.blogs.txt en_US.news.txt en_US.twitter.txt
##   case of the                    21             15                15
##   case of a                       6             13                 2
##   case of an                      2              7                 0
##   case of one                     2              2                 0
##   case of beer                    0              2                 1
##   case of rain                    1              1                 1
##   case of any                     1              1                 0
##   case of chen                    1              1                 0
##   case of continued               0              2                 0
##   case of divorce                 0              1                 1

“Case of the” and “case of a” represent common phrases. Yet, our user typing “The guy in front of me just bought a pound of bacon, a bouquet, and a case of” would view both results as useless. “Case of beer” with only three hits makes perfect sense!

Plan Predictive Model

Simply build longer n-grams? That threatens over-fitting and long processing times. Instead, I plan - and it is just a plan at present - to:

  1. Stem the corpus to build n-grams with broader scope; stemming will also speed processing (see the sketch after this list).
  2. Increase the length of n-grams to 5 or 7 word stems. Borrowing from the example above, the input “bacon, a bouquet, and a case of” should eliminate many low-count possibilities.
  3. Pair word-stem prediction with stem-to-word likelihood. Presenting a user with the choices “run”, “running”, and “runner” would flow from predicting “run” as the word stem.
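
A minimal sketch of the stemming step (item 1), assuming tm with the SnowballC stemmer; the stem-to-word lookup in item 3 would be built separately:

```r
library(tm)
library(SnowballC)

# Stem the cleaned corpus so n-grams are built over word stems.
corpus_stemmed <- tm_map(corpus, stemDocument)

# SnowballC illustrates why stems broaden coverage: several surface forms
# collapse to one stem that the model can predict.
wordStem(c("run", "running", "runs"))  # all map to "run"
```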

These steps address problems made evident in my exploration. Several others I still need to explore: sentence identification and punctuation handling. Will I discover more? Most assuredly!

Appendix A: Turnkey RStudio in Amazon Web Services

I must give credit to Louis Aslett! He built, published, and thoroughly documented a turnkey Amazon Machine Image for RStudio. I booted my copy in minutes, changed my password (as he recommended), and forever changed how I work with R! Never again need I worry about ‘memory exceeded’ errors. Never again need I stare, eyes watering, as a process runs for hours. Better still: I have access on any computer with an Internet connection!