This report will explain my current progress towards building a Shiny app that will predict the next word given an input text string. I was given 3 files of English text to use as a basis for my prediction app. The data is from a corpus called HC Corpora (www.corpora.heliohost.org). See the readme file at http://www.corpora.heliohost.org/aboutcorpus.html for details on the corpora available. The files have been language filtered but may still contain some foreign text. The download link to the data is https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip, and I used the 3 en_US files.
The three data files contain a large amount of English text. Simply reading the three files in as text consumes around 800 MB of system memory. The table below shows a breakdown of the files.
| File Name | File Size (MB) | R Object Size (MB) | Lines | Words | Characters |
|---|---:|---:|---:|---:|---:|
| en_US.blogs.txt | 200 | 248.5 | 899288 | 37334441 | 208361438 |
| en_US.news.txt | 196 | 249.5 | 1010242 | 34372595 | 203763865 |
| en_US.twitter.txt | 159 | 301.4 | 2360148 | 30373792 | 162384825 |
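For reference, the sketch below shows one way the counts in this table could be reproduced; the file paths and the use of the stringi package for word counting are my own assumptions, not necessarily how the original figures were generated.

```r
# A rough sketch of how the counts in the table above might be reproduced.
# The file paths and the use of stringi for word counting are assumptions.
library(stringi)

files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")

summarise_file <- function(f) {
  lines <- readLines(f, encoding = "UTF-8", skipNul = TRUE)
  data.frame(
    file        = basename(f),
    file_mb     = round(file.size(f) / 1024^2, 1),
    r_object_mb = round(as.numeric(object.size(lines)) / 1024^2, 1),
    lines       = length(lines),
    words       = sum(stri_count_words(lines)),
    chars       = sum(nchar(lines))
  )
}

do.call(rbind, lapply(files, summarise_file))
```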
Since this will eventually need to run in a resource-constrained environment (a Shiny app), some sampling and data reduction will be needed. I therefore sampled the data set twice, drawing 1% and 5% of the total data, before continuing to process it.
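A minimal sketch of this sampling step is shown below; the use of rbinom() to flag lines and the particular seed are illustrative choices rather than the report's actual code.

```r
# A sketch of the 1% / 5% line-level sampling; rbinom() flags each line
# independently, and the seed is an illustrative choice.
set.seed(1234)

blogs   <- readLines("en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

sample_lines <- function(lines, rate) {
  lines[rbinom(length(lines), size = 1, prob = rate) == 1]
}

all_text    <- c(blogs, news, twitter)
sample_1pct <- sample_lines(all_text, 0.01)
sample_5pct <- sample_lines(all_text, 0.05)
```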
My next step was to preprocess the data. Most notably, I transformed all of the text to lower case and removed punctuation and stop words. Stop words are very common words that tend not to aid in text prediction (e.g. a, an, the, is, etc.). I used the list of 33 words provided by the tokenizers package. I have chosen not to apply a profanity filter at this time, as I am curious what effect this will have; I may add profanity filtering later.
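The snippet below sketches this preprocessing with tokenize_words() from the tokenizers package; the short stop word vector shown is only a placeholder for the 33-word list mentioned above, and sample_1pct comes from the sampling sketch.

```r
# A minimal preprocessing sketch using the tokenizers package.
# The stop word vector is only a placeholder for the 33-word list
# referenced in the text.
library(tokenizers)

stop_words <- c("a", "an", "the", "is", "of", "and", "to", "in")  # placeholder

clean_tokens <- tokenize_words(
  sample_1pct,
  lowercase   = TRUE,        # fold everything to lower case
  strip_punct = TRUE,        # drop punctuation
  stopwords   = stop_words   # remove very common words
)
```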
I then generated word frequency lists for the 1% and 5% samples to identify the common words in each data set. These are as follows:
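The frequency lists can be built with a simple table over the cleaned tokens; the sketch below shows one way to do that, with sort() and head() as my own display choices.

```r
# Word frequency table from the cleaned tokens; sort() and head() are
# simply display choices for inspecting the most common words.
word_freq <- sort(table(unlist(clean_tokens)), decreasing = TRUE)
head(word_freq, 10)
```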
I noticed that the two lists are very similar. This is encouraging, though not surprising: the 1% sample alone contains over 645,000 words, a large enough sample to be representative of the full data set.
Since the end goal is to generate word predictions, it is necessary to find out what kinds of word groupings exist in the data. I took the sample texts and divided them up into N-grams. An N-gram is a sequence of N words that occur together in a sentence. For example, the sentence “I want a cookie” contains 3 unique 2-grams: “I want”, “want a”, and “a cookie”. Similarly, the same sentence contains 2 unique 3-grams: “I want a” and “want a cookie”. Using this technique, I created similar bar graphs of the 2-, 3-, and 4-grams from the two sample data sets below.
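The sketch below shows how such N-grams can be extracted with tokenize_ngrams() from the tokenizers package, using the example sentence above; the same call with n = 3 or n = 4 yields the longer N-grams. Note that the tokenizer lower-cases by default, so “I want” comes back as “i want”.

```r
# Extracting N-grams with tokenize_ngrams() from the tokenizers package.
# n = 2 reproduces the bigram example from the text.
library(tokenizers)

tokenize_ngrams("I want a cookie", n = 2)
# returns list(c("i want", "want a", "a cookie"))

# the same call over the 1% sample, for N = 2, 3, 4
bigrams_1pct   <- tokenize_ngrams(sample_1pct, n = 2)
trigrams_1pct  <- tokenize_ngrams(sample_1pct, n = 3)
fourgrams_1pct <- tokenize_ngrams(sample_1pct, n = 4)
```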
One final note on the N-gram analysis is that there are millions of records even in these sampled data sets, especially in the lists of 2-, 3-, and 4-grams.
| Type | Total Items | Unique Items |
|---|---:|---:|
| Words in 1% sample | 645863 | 59420 |
| Words in 5% sample | 3251146 | 154537 |
| 2-grams in 1% sample | 663511 | 464026 |
| 2-grams in 5% sample | 3344958 | 1866870 |
| 3-grams in 1% sample | 621296 | 592824 |
| 3-grams in 5% sample | 3133929 | 2843271 |
| 4-grams in 1% sample | 580917 | 577358 |
| 4-grams in 5% sample | 2931984 | 2886904 |
Since I anticipate needing at least one lookup table for each N-gram length, including every occurrence found in the 5% sample may take up too much room in the Shiny app. It may be necessary to keep only the top X records from each list; for example, my 5% list of 4-grams consumes over 240 MB of RAM on my computer. I do plan on using recursive searching (backing off to shorter N-grams when no match is found), so leaving out records for the larger values of N will not prevent the app from producing a prediction.
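To illustrate the recursive back-off idea, here is a toy sketch; the lookup tables, their contents, and the predict_next() function are entirely hypothetical and only demonstrate falling back from longer to shorter N-grams.

```r
# A toy sketch of the recursive (back-off) lookup described above.
# The lookup tables and predict_next() are hypothetical examples.
four_gram_next <- c("thanks for the" = "follow")
tri_gram_next  <- c("for the"        = "first")
bi_gram_next   <- c("the"            = "best")

predict_next <- function(phrase) {
  words <- tail(strsplit(tolower(phrase), "\\s+")[[1]], 3)
  for (k in seq(length(words), 1)) {           # try the longest prefix first
    prefix <- paste(tail(words, k), collapse = " ")
    lookup <- switch(k, bi_gram_next, tri_gram_next, four_gram_next)
    if (prefix %in% names(lookup)) return(unname(lookup[prefix]))
  }
  NA_character_                                # no match at any level
}

predict_next("Thanks for the")   # "follow"
```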