The purpose of this report is to provide a concise non-technical overview of the project data and to summarize plans for creating an algorithm for Next Word Prediction. The following topics will be addressed:
* Load and Describe the Raw Data
* Sampling
* Pre-Processing/Data Cleaning
* Exploratory Analysis
* N-Gram Tokenization
* Preliminary N-Grams
* Initial Observations and Next Steps
Three English language files were downloaded from the course website and loaded into R. The files are sourced from: 1) blog entries, 2) news items, and 3) Twitter tweets. Summary statistics for each file are provided below.
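A minimal sketch of how the files could be loaded and summarized with base R follows; the file names are assumptions based on the standard course download, and the word counts are approximate (lines are split on whitespace).

```r
files <- c(Blogs   = "en_US.blogs.txt",
           News    = "en_US.news.txt",
           Twitter = "en_US.twitter.txt")

summarize_file <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  data.frame(R_Object_Bytes = as.numeric(object.size(lines)),
             Line_Count     = length(lines),
             # Approximate word count: split each line on whitespace.
             Word_Count     = sum(lengths(strsplit(lines, "\\s+"))))
}

data.frame(File = names(files),
           do.call(rbind, lapply(files, summarize_file)),
           row.names = NULL)
```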
## File R_Object_Bytes Line_Count Word_Count
## 1 Blogs 260564320 899288 37334441
## 2 News 261722608 1010242 34372579
## 3 Twitter 316037344 2360148 30373792
Given the size of the raw data files, random sampling was used to subset 10 percent of the blog and news files and 15 percent of the Twitter file for initial analysis. A larger portion of the Twitter file was selected on the expectation that tweets may contain more variation in terminology (i.e., “noise”) than the more formal data sources.
The three samples were combined into a single text corpus, which forms the basis for all subsequent analysis in this report.
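A minimal sketch of the sampling and corpus-building steps, assuming `blogs`, `news`, and `twitter` hold the raw lines read in above (the seed and the use of `rbinom` are assumptions):

```r
library(tm)

set.seed(1234)  # assumed seed for reproducibility

# Keep each line with the stated probability (10% blogs/news, 15% Twitter).
sample_lines <- function(lines, rate) {
  lines[as.logical(rbinom(length(lines), size = 1, prob = rate))]
}

# Each sampled source becomes one document, giving a three-document corpus.
corpus <- VCorpus(VectorSource(c(
  Blogs   = paste(sample_lines(blogs,   0.10), collapse = " "),
  News    = paste(sample_lines(news,    0.10), collapse = " "),
  Twitter = paste(sample_lines(twitter, 0.15), collapse = " ")
)))
```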
The following procedures are applied to the corpus to prepare it for tokenization (a code sketch follows the list):
* Remove punctuation
* Remove special characters
* Remove numbers
* Convert to lowercase
* Stem words to their root or base form
* Strip the whitespace
* Reformat the corpus to a plain text document
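A minimal sketch of this cleaning pipeline with the tm package (the ordering of the steps and the regular expression used for special characters are assumptions):

```r
library(tm)
library(SnowballC)  # supplies the stemmer used by stemDocument

corpus <- tm_map(corpus, removePunctuation)
# Replace remaining special (non-alphanumeric) characters with spaces.
corpus <- tm_map(corpus, content_transformer(
  function(x) gsub("[^[:alnum:][:space:]]", " ", x)))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, stemDocument)     # stem words to their root form
corpus <- tm_map(corpus, stripWhitespace)
# With content_transformer() the documents remain plain text documents,
# so no separate reformatting step is required in this sketch.
```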
Note that I elected to retain stopwords and profanity in the corpus for this phase of the analysis. Profanity will be excluded from predicted word lists, but further analysis will be needed to understand the utility of stopwords versus how they impact file size when retained.
The plain text version of the corpus is transformed into a document-term matrix (dtm), which is a matrix with documents as the rows and terms as the columns. Word frequency counts are contained in the cells of the matrix.
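A minimal sketch; printing the resulting object produces the summary shown below.

```r
# Build the document-term matrix from the cleaned corpus.
dtm <- DocumentTermMatrix(corpus)
dtm
```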
## <<DocumentTermMatrix (documents: 3, terms: 27113)>>
## Non-/sparse entries: 40511/40828
## Sparsity : 50%
## Maximal term length: 30
## Weighting : term frequency (tf)
We can see that the corpus contains 3 documents (i.e., blogs, news, Twitter) and 27,113 unique terms. The sparsity value of 50% indicates that roughly half of the term-document cells are zero; that is, many terms do not appear in all three sources.
Word frequencies in the corpus are displayed below. Recall that stopwords have not been excluded from the corpus, which explains the prevalence of words such as “the”.
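The frequency views below could be generated from the dtm along these lines (a sketch, assuming the `dtm` object created above):

```r
# Total count of each term across the three documents, sorted ascending.
freq <- sort(colSums(as.matrix(dtm)))

head(freq)             # least frequently appearing terms
tail(freq)             # most frequently appearing terms
head(table(freq), 10)  # number of terms occurring 1, 2, 3, ... times
tail(table(freq), 10)  # the largest observed term frequencies
```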
## abolished abomination absconding acknowlegding adjustments admin
##         1           1          1             1           1     1
## with you that for and the
## 82939 115847 116788 131985 260796 522989
## freq
## 1 2 3 4 5 6 7 8 9 10
## 1340 900 669 454 369 402 330 334 263 261
## freq
## 58204 62303 64017 73420 82939 115847 116788 131985 260796 522989
## 1 1 1 1 1 1 1 1 1 1
From the information displayed above we can see that:
* The least frequently appearing terms are real words, not gibberish.
* There are 1,340 terms that occur just once.
* The most frequently appearing word is “the” (522,989 times).
Additional graphical representations of relative word frequencies are shown below.
Tokenization is the process of breaking a stream of text (i.e., the corpus) up into words or phrases called tokens. N-grams were created using the NGramTokenizer from the RWeka package.
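A minimal sketch of how the bigram tokenizer might be set up (the wrapper name is an assumption; trigram and quadgram tokenizers follow the same pattern with different `min`/`max` values):

```r
library(RWeka)
library(tm)

# Tokenizer that extracts contiguous two-word sequences.
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

# Term-document matrix of bigram counts built from the cleaned corpus.
tdm_bigram <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))
```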
For this application, an n-gram represents a contiguous sequence of n words derived from the text corpus as follows:
## N_Gram Words
## 1 unigram 1
## 2 bigram 2
## 3 trigram 3
## 4 quadgram 4
The number of distinct n-grams of each order in the initial set is shown below.
## N_Gram Words Frequency
## 1 unigram 1 27113
## 2 bigram 2 159556
## 3 trigram 3 240753
## 4 quadgram 4 251540
N-gram top 10 frequencies are shown below for bigrams, trigrams, and quadgrams. Recall that the most frequently observed unigrams are displayed above.
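The top-10 tables could be produced along these lines, shown here for bigrams (a sketch, assuming the `tdm_bigram` object from the tokenization sketch above):

```r
# Sum bigram counts across the three documents and keep the ten most frequent.
bigram_freq <- sort(rowSums(as.matrix(tdm_bigram)), decreasing = TRUE)
head(bigram_freq, 10)
```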
## Bigram Count
## in the 47373
## of the 43197
## for the 25693
## to the 24368
## on the 21109
## at the 17389
## to be 17117
## and the 14780
## in a 13434
## is a 12442
## Trigram Count
## thanks for the 4620
## a lot of 3976
## one of the 3024
## to be a 2341
## going to be 2297
## looking forward to 2234
## i don t 2035
## i love you 1996
## for the follow 1870
## i have a 1663
## Quadgram Count
## thanks for the follow 1385
## the end of the 1029
## in the middle of 801
## have a great day 720
## the rest of the 663
## at the end of 655
## its going to be 618
## at the same time 607
## in the u s 576
## was one of the 559
It is clear that a significant amount of work remains to build a predictive model that balances the potentially competing priorities of application speed and predictive accuracy. I anticipate that tradeoffs will be necessary to provide users with an acceptable response time.
It will also be crucial that data pre-processing cleans the corpus, the n-grams, and user input in exactly the same way. At this point it appears that alternate forms of apostrophes and/or quotation marks need closer examination. Other open issues at this time include:
* profanity: leave in corpus but filter from prediction?
* stopwords: in or out?
* stemming: yes or no?
* removeSparseTerms value: high, low, or mid-range?
The development of a robust training/test plan will be needed.
Model features:
* To deal with unseen n-grams (a simplified back-off sketch follows this list):
+ Implement the Katz back-off method,
+ Coupled with a smoothing/discounting method (e.g., Good-Turing)
* Will fivegrams be needed?
* Evaluate methods for reducing n-gram data table size.
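For reference, a highly simplified back-off sketch is shown below. It illustrates only the fall-back from longer to shorter histories; it is not Katz back-off and applies no Good-Turing discounting. The lookup tables and their `prefix`/`prediction`/`count` columns are assumptions.

```r
# Simplified back-off lookup: try the longest available history first and
# fall back to shorter histories when no match is found. Each table is
# assumed to be a data frame with columns `prefix`, `prediction`, and `count`.
predict_next_word <- function(input, quadgrams, trigrams, bigrams, default = "the") {
  words <- unlist(strsplit(tolower(input), "\\s+"))

  lookup <- function(tbl, n_prefix) {
    if (length(words) < n_prefix) return(character(0))
    key  <- paste(tail(words, n_prefix), collapse = " ")
    hits <- tbl[tbl$prefix == key, ]
    if (nrow(hits) == 0) return(character(0))
    hits$prediction[which.max(hits$count)]  # most frequent continuation
  }

  result <- lookup(quadgrams, 3)
  if (length(result) == 0) result <- lookup(trigrams, 2)
  if (length(result) == 0) result <- lookup(bigrams, 1)
  if (length(result) == 0) result <- default
  result
}
```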