SwiftKey and the team at Johns Hopkins University made the capstone project dataset available for download. To avoid repeatedly downloading the 500+ MB file, my script checks for the dataset, downloads it if not found, and extracts it.
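A minimal sketch of that check-then-download logic follows; the URL and file names are assumptions based on the course materials:

```r
# Sketch: fetch the dataset only if it is not already on disk.
zip_file <- "Coursera-SwiftKey.zip"
data_url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"

if (!file.exists(zip_file)) {
  download.file(data_url, destfile = zip_file, mode = "wb")
}
if (!file.exists("final")) {  # the archive unpacks into a "final/" directory
  unzip(zip_file)
}
```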
Initial exploration proved the dataset too large to process on my hardware. Though I later moved to Amazon Web Services (see details in Appendix A), sampling the data sped development. Randomly sampling 10% of the corpus lines yielded sub-100 MB files small enough to process and upload to GitHub.
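The sampling itself can be as simple as a per-line coin flip; a sketch, with illustrative file paths:

```r
# Sketch: keep each line with probability 0.10 (rbinom as a coin flip)
set.seed(42)
lines <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
keep  <- rbinom(length(lines), size = 1, prob = 0.10) == 1
writeLines(lines[keep], "sample/en_US.blogs.txt")
```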
The three English text files (blogs, news, and Twitter) varied in size. I investigated further by summarizing simple statistics for lines and characters:
| File Name         | Line Count | Mean Chars/Line | Median Chars/Line | Std Dev Chars/Line | Min Chars/Line | Max Chars/Line |
|-------------------|-----------:|----------------:|------------------:|-------------------:|---------------:|---------------:|
| en_US.blogs.txt   | 90098      | 229.4905        | 154               | 251.35918          | 1              | 4714           |
| en_US.news.txt    | 101228     | 201.0837        | 185               | 134.56986          | 1              | 8949           |
| en_US.twitter.txt | 235727     | 68.6378         | 64                | 37.21169           | 3              | 140            |
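A summary like the one above can be computed per file along these lines (a sketch; the helper name and the "sample/" path are mine):

```r
# Sketch: per-file line and character statistics
char_stats <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  n     <- nchar(lines)
  data.frame(File.Name        = basename(path),
             Line.Count       = length(lines),
             Mean.Char.Line   = mean(n),
             Median.Char.Line = median(n),
             Stddev.Char.Line = sd(n),
             Min.Char.Line    = min(n),
             Max.Char.Line    = max(n))
}

do.call(rbind, lapply(list.files("sample", full.names = TRUE), char_stats))
```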
The results make clear that the Twitter file contains shorter lines (tweets carry a 140-character limit), while the blogs and news files align more closely with each other. Their average lengths, however, suggest multiple sentences per line, which may need consideration when developing the predictive model.
After characters come words, or so my elementary school teachers always told me! Building a term-document matrix began with cleaning the sample corpus. My two immediate concerns:
I chose to keep stop words in the first term-document matrix. Anticipating that they would dominate, I graphed the top-10 terms by count, which revealed much:
(Figure: top-10 terms by count in each file, stop words retained.)
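The matrix and chart could be produced roughly as follows. This is a sketch using tm's standard transformations; the plotting code is my reconstruction, not the original:

```r
library(tm)
library(ggplot2)

# Sketch: clean the sampled corpus, keeping stop words for now
corpus <- VCorpus(DirSource("sample/", encoding = "UTF-8"))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)

tdm <- TermDocumentMatrix(corpus)

# Top-10 terms by total count across the three files
# (a dense matrix is fine at this sample size)
freqs <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
top10 <- data.frame(term  = reorder(names(freqs)[1:10], freqs[1:10]),
                    count = freqs[1:10])

ggplot(top10, aes(x = term, y = count)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(x = "Term", y = "Count")
```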
The stop words overwhelm all others. Removing them will help exploration.
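Removing them takes one more transformation (sketch):

```r
# Sketch: drop English stop words, then rebuild the matrix
corpus_ns <- tm_map(corpus, removeWords, stopwords("en"))
tdm_ns    <- TermDocumentMatrix(corpus_ns)
```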
New insight appears with keywords in each document, such as “said” in news articles and “im” in tweets. Possible errors for prediction also arise: “im” instead of “I’m”.
Considering the clear differences stop words made in the single-word evaluation, I started by comparing bi-grams with stop words retained:

Bi-grams without stop words:
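Both bi-gram matrices can be built with a custom tokenizer. Here is a sketch using RWeka's NGramTokenizer, my choice for illustration; the original tokenizer is not shown:

```r
library(tm)
library(RWeka)

# Sketch: bi-gram term-document matrices, with and without stop words
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

tdm_bi    <- TermDocumentMatrix(corpus,    control = list(tokenize = BigramTokenizer))
tdm_bi_ns <- TermDocumentMatrix(corpus_ns, control = list(tokenize = BigramTokenizer))
```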
The take-away: stop words still dominate. However, the more significant finding comes from examining tri-grams with prediction as the goal. Example: a user types “case of” and the program must predict the next word. The 10% sample corpus returns…
## [1] "190 possible trigrams. Top 10 results:"
## <<TermDocumentMatrix (terms: 10, documents: 3)>>
## Non-/sparse entries: 22/8
## Sparsity : 27%
## Maximal term length: 17
## Weighting : term frequency (tf)
##
## Docs
## Terms en_US.blogs.txt en_US.news.txt en_US.twitter.txt
## case of the 21 15 15
## case of a 6 13 2
## case of an 2 7 0
## case of one 2 2 0
## case of beer 0 2 1
## case of rain 1 1 1
## case of any 1 1 0
## case of chen 1 1 0
## case of continued 0 2 0
## case of divorce 0 1 1
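A lookup like the one behind that output might be written as follows (a sketch; TrigramTokenizer mirrors the bi-gram helper above):

```r
# Sketch: find tri-grams that begin with the user's typed phrase
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
tdm_tri <- TermDocumentMatrix(corpus, control = list(tokenize = TrigramTokenizer))

matches <- grep("^case of ", Terms(tdm_tri), value = TRUE)
print(paste(length(matches), "possible trigrams. Top 10 results:"))

# Rank matching tri-grams by total count and inspect the top 10
counts <- sort(rowSums(as.matrix(tdm_tri[matches, ])), decreasing = TRUE)
inspect(tdm_tri[names(counts)[1:10], ])
```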
“Case of the” and “case of a” represent common phrases. Yet our user, typing “The guy in front of me just bought a pound of bacon, a bouquet, and a case of”, would find both results useless. “Case of beer”, with only three hits, makes perfect sense!
Simply build longer n-grams? That threatens over-fitting and long processing times. I plan (and it is just a plan at present) to instead:
These steps address problems made evident in my exploration. Several others I still need to explore: sentence identification and punctuation control. Will I discover more? Most assuredly!
Appendix A

I must give credit to Louis Aslett! He built, published, and thoroughly documented a turnkey Amazon Machine Image for RStudio. I booted my copy in minutes, changed my password (as he recommended), and forever changed how I work with R! Never again need I worry about ‘memory exceeded’ errors. Never again need I stare, eyes watering, as a process runs for hours. Better still: I have access from any computer with an Internet connection!