The purpose of this report is to provide a concise non-technical overview of the project data and to summarize plans for creating an algorithm for Next Word Prediction. The following topics will be addressed:
* Load and Describe the Raw Data
* Sampling
* Pre-Processing/Data Cleaning
* Exploratory Analysis
* N-Gram Tokenization
* Preliminary N-Grams
* Initial Observations and Next Steps
Three English language files were downloaded from the course website and loaded into R. The files are sourced from: 1) blog entries, 2) news items, and 3) Twitter tweets. Summary statistics for each file are provided below.
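A minimal sketch of how the files could be loaded and summarized with base R follows; the file names are assumptions based on the standard course download, and the word counts are approximate (lines are split on whitespace).

```r
files <- c(Blogs   = "en_US.blogs.txt",
           News    = "en_US.news.txt",
           Twitter = "en_US.twitter.txt")

summarize_file <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  data.frame(R_Object_Bytes = as.numeric(object.size(lines)),
             Line_Count     = length(lines),
             # Approximate word count: split each line on whitespace.
             Word_Count     = sum(lengths(strsplit(lines, "\\s+"))))
}

data.frame(File = names(files),
           do.call(rbind, lapply(files, summarize_file)),
           row.names = NULL)
```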
## File R_Object_Bytes Line_Count Word_Count
## 1 Blogs 260564320 899288 37334441
## 2 News 261722608 1010242 34372579
## 3 Twitter 316037344 2360148 30373792
Given the size of the raw data files, random sampling was used to subset 10 percent of the blog and news files and 15 percent of the Twitter file for initial analysis. A larger portion of the Twitter file was selected on the expectation that tweets may contain more variation in terminology (i.e., “noise”) than the more formal data sources.
The three samples were combined into a single text corpus, which forms the basis for all subsequent analysis in this report.
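A minimal sketch of the sampling and corpus-building steps, assuming `blogs`, `news`, and `twitter` hold the raw lines read in above (the seed and the use of `rbinom` are assumptions):

```r
library(tm)

set.seed(1234)  # assumed seed for reproducibility

# Keep each line with the stated probability (10% blogs/news, 15% Twitter).
sample_lines <- function(lines, rate) {
  lines[as.logical(rbinom(length(lines), size = 1, prob = rate))]
}

# Each sampled source becomes one document, giving a three-document corpus.
corpus <- VCorpus(VectorSource(c(
  Blogs   = paste(sample_lines(blogs,   0.10), collapse = " "),
  News    = paste(sample_lines(news,    0.10), collapse = " "),
  Twitter = paste(sample_lines(twitter, 0.15), collapse = " ")
)))
```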
The following procedures are applied to the corpus to prepare it for tokenization (a code sketch follows the list):
* Remove punctuation
* Remove special characters
* Remove numbers
* Convert to lowercase
* Stem words to their root or base form
* Strip the whitespace
* Reformat the corpus to a plain text document
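A minimal sketch of this cleaning pipeline with the tm package (the ordering of the steps and the regular expression used for special characters are assumptions):

```r
library(tm)
library(SnowballC)  # supplies the stemmer used by stemDocument

corpus <- tm_map(corpus, removePunctuation)
# Replace remaining special (non-alphanumeric) characters with spaces.
corpus <- tm_map(corpus, content_transformer(
  function(x) gsub("[^[:alnum:][:space:]]", " ", x)))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, stemDocument)     # stem words to their root form
corpus <- tm_map(corpus, stripWhitespace)
# With content_transformer() the documents remain plain text documents,
# so no separate reformatting step is required in this sketch.
```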
Note that I elected to retain stopwords and profanity in the corpus for this phase of the analysis. Profanity will be excluded from predicted word lists, but further analysis will be needed to understand the utility of stopwords versus how they impact file size when retained.
The plain text version of the corpus is transformed into a document-term matrix (dtm), which is a matrix with documents as the rows and terms as the columns. Word frequency counts are contained in the cells of the matrix.
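A minimal sketch; printing the resulting object produces the summary shown below.

```r
# Build the document-term matrix from the cleaned corpus.
dtm <- DocumentTermMatrix(corpus)
dtm
```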
## <<DocumentTermMatrix (documents: 3, terms: 27113)>>
## Non-/sparse entries: 40511/40828
## Sparsity : 50%
## Maximal term length: 30
## Weighting : term frequency (tf)
We can see that the corpus contains 3 documents (i.e., blogs, news, Twitter) and 27,113 unique terms. The sparsity value of 50% indicates that roughly half of the term-document cells are zero; that is, many terms do not appear in all three sources.
Word frequencies in the corpus are displayed below. Recall that stopwords have not been excluded from the corpus, which explains the prevalence of words such as “the”.
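The frequency views below could be generated from the dtm along these lines (a sketch, assuming the `dtm` object created above):

```r
# Total count of each term across the three documents, sorted ascending.
freq <- sort(colSums(as.matrix(dtm)))

head(freq)             # least frequently appearing terms
tail(freq)             # most frequently appearing terms
head(table(freq), 10)  # number of terms occurring 1, 2, 3, ... times
tail(table(freq), 10)  # the largest observed term frequencies
```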
## abolished abomination absconding acknowlegding adjustments admin
##         1           1          1             1           1     1
## with you that for and the
## 82939 115847 116788 131985 260796 522989
## freq
## 1 2 3 4 5 6 7 8 9 10
## 1340 900 669 454 369 402 330 334 263 261
## freq
## 58204 62303 64017 73420 82939 115847 116788 131985 260796 522989
## 1 1 1 1 1 1 1 1 1 1
From the information displayed above we can see that:
* The least frequently appearing terms are real words, not gibberish.
* There are 1,340 terms that occur just once.
* The most frequently appearing word is “the” (522,989 times).
Additional graphical representations of relative word frequencies are shown below.
Tokenization is the process of breaking a stream of text (i.e., the corpus) up into words or phrases called tokens. N-grams were created using the NGramTokenizer from the RWeka package.
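A minimal sketch of how the bigram tokenizer might be set up (the wrapper name is an assumption; trigram and quadgram tokenizers follow the same pattern with different `min`/`max` values):

```r
library(RWeka)
library(tm)

# Tokenizer that extracts contiguous two-word sequences.
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

# Term-document matrix of bigram counts built from the cleaned corpus.
tdm_bigram <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))
```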
For this application, an n-gram represents a contiguous sequence of n words derived from the text corpus as follows:
## N_Gram Words
## 1 unigram 1
## 2 bigram 2
## 3 trigram 3
## 4 quadgram 4
The number of distinct n-grams of each order in the initial set is shown below.
## N_Gram Words Frequency
## 1 unigram 1 27113
## 2 bigram 2 159556
## 3 trigram 3 240753
## 4 quadgram 4 251540
N-gram top 10 frequencies are shown below for bigrams, trigrams, and quadgrams. Recall that the most frequently observed unigrams are displayed above.
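The top-10 tables could be produced along these lines, shown here for bigrams (a sketch, assuming the `tdm_bigram` object from the tokenization sketch above):

```r
# Sum bigram counts across the three documents and keep the ten most frequent.
bigram_freq <- sort(rowSums(as.matrix(tdm_bigram)), decreasing = TRUE)
head(bigram_freq, 10)
```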
## Bigram Count
## in the 47373
## of the 43197
## for the 25693
## to the 24368
## on the 21109
## at the 17389
## to be 17117
## and the 14780
## in a 13434
## is a 12442
## Trigram Count
## thanks for the 4620
## a lot of 3976
## one of the 3024
## to be a 2341
## going to be 2297
## looking forward to 2234
## i don t 2035
## i love you 1996
## for the follow 1870
## i have a 1663
## Quadgram Count
## thanks for the follow 1385
## the end of the 1029
## in the middle of 801
## have a great day 720
## the rest of the 663
## at the end of 655
## its going to be 618
## at the same time 607
## in the u s 576
## was one of the 559
It is clear that a significant amount of work remains to build a predictive model that balances the potentially competing priorities of application speed and predictive accuracy. I anticipate that tradeoffs will be necessary to provide users with an acceptable response time.
It will also be crucial that data pre-processing cleans the corpus, the n-grams, and user input in exactly the same way. At this point it appears that alternate forms of apostrophes and/or quotation marks need closer examination. Other open issues at this time include:
* profanity: leave in corpus but filter from prediction?
* stopwords: in or out?
* stemming: yes or no?
* removeSparseTerms value: high, low, or mid-range?
The development of a robust training/test plan will be needed.
Model features:
* To deal with unseen n-grams (a simplified back-off sketch follows this list):
+ Implement the Katz back-off method,
+ Coupled with a smoothing/discounting method (e.g., Good-Turing)
* Will fivegrams be needed?
* Evaluate methods for reducing n-gram data table size.
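For reference, a highly simplified back-off sketch is shown below. It illustrates only the fall-back from longer to shorter histories; it is not Katz back-off and applies no Good-Turing discounting. The lookup tables and their `prefix`/`prediction`/`count` columns are assumptions.

```r
# Simplified back-off lookup: try the longest available history first and
# fall back to shorter histories when no match is found. Each table is
# assumed to be a data frame with columns `prefix`, `prediction`, and `count`.
predict_next_word <- function(input, quadgrams, trigrams, bigrams, default = "the") {
  words <- unlist(strsplit(tolower(input), "\\s+"))

  lookup <- function(tbl, n_prefix) {
    if (length(words) < n_prefix) return(character(0))
    key  <- paste(tail(words, n_prefix), collapse = " ")
    hits <- tbl[tbl$prefix == key, ]
    if (nrow(hits) == 0) return(character(0))
    hits$prediction[which.max(hits$count)]  # most frequent continuation
  }

  result <- lookup(quadgrams, 3)
  if (length(result) == 0) result <- lookup(trigrams, 2)
  if (length(result) == 0) result <- lookup(bigrams, 1)
  if (length(result) == 0) result <- default
  result
}
```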