This report will explain my current progress towards building a Shiny app that will predict the next word given an input text string. I was given 3 files of English text to use as a basis for my prediction app. The data is from a corpus called HC Corpora (www.corpora.heliohost.org). See the readme file at http://www.corpora.heliohost.org/aboutcorpus.html for details on the corpora available. The files have been language filtered but may still contain some foreign text. The download link to the data is https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip, and I used the 3 en_US files.
The three data files contain a large amount of English text. Simply reading the three files in as text consumes around 800 MB of system memory. The table below shows a breakdown of the files.
| File Name | File Size (MB) | R Object Size (MB) | Lines | Words | Characters |
|---|---:|---:|---:|---:|---:|
| en_US.blogs.txt | 200 | 248.5 | 899288 | 37334441 | 208361438 |
| en_US.news.txt | 196 | 249.5 | 1010242 | 34372595 | 203763865 |
| en_US.twitter.txt | 159 | 301.4 | 2360148 | 30373792 | 162384825 |
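For reference, the sketch below shows one way the counts in this table could be reproduced; the file paths and the use of the stringi package for word counting are my own assumptions, not necessarily how the original figures were generated.

```r
# A rough sketch of how the counts in the table above might be reproduced.
# The file paths and the use of stringi for word counting are assumptions.
library(stringi)

files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")

summarise_file <- function(f) {
  lines <- readLines(f, encoding = "UTF-8", skipNul = TRUE)
  data.frame(
    file        = basename(f),
    file_mb     = round(file.size(f) / 1024^2, 1),
    r_object_mb = round(as.numeric(object.size(lines)) / 1024^2, 1),
    lines       = length(lines),
    words       = sum(stri_count_words(lines)),
    chars       = sum(nchar(lines))
  )
}

do.call(rbind, lapply(files, summarise_file))
```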
Since this will eventually need to run in a resource-constrained environment (a Shiny app), some sampling and data reduction will be needed. I therefore sampled the data set twice, drawing 1% and 5% of the total data, before continuing to process it.
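A minimal sketch of this sampling step is shown below; the use of rbinom() to flag lines and the particular seed are illustrative choices rather than the report's actual code.

```r
# A sketch of the 1% / 5% line-level sampling; rbinom() flags each line
# independently, and the seed is an illustrative choice.
set.seed(1234)

blogs   <- readLines("en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

sample_lines <- function(lines, rate) {
  lines[rbinom(length(lines), size = 1, prob = rate) == 1]
}

all_text    <- c(blogs, news, twitter)
sample_1pct <- sample_lines(all_text, 0.01)
sample_5pct <- sample_lines(all_text, 0.05)
```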
My next step was to preprocess the data. Most notably, I transformed all of the text to lower case and removed punctuation and stop words. Stop words are very common words that tend not to aid in text prediction (e.g. a, an, the, is, etc.). I used the list of 33 words provided by the tokenizers package. I have chosen not to apply a profanity filter at this time, as I am curious what effect this will have; I may add profanity filtering later.
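The snippet below sketches this preprocessing with tokenize_words() from the tokenizers package; the short stop word vector shown is only a placeholder for the 33-word list mentioned above, and sample_1pct comes from the sampling sketch.

```r
# A minimal preprocessing sketch using the tokenizers package.
# The stop word vector is only a placeholder for the 33-word list
# referenced in the text.
library(tokenizers)

stop_words <- c("a", "an", "the", "is", "of", "and", "to", "in")  # placeholder

clean_tokens <- tokenize_words(
  sample_1pct,
  lowercase   = TRUE,        # fold everything to lower case
  strip_punct = TRUE,        # drop punctuation
  stopwords   = stop_words   # remove very common words
)
```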
I then generated word frequency lists for the 1% and 5% samples to identify the common words in each data set. These are as follows:
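The frequency lists can be built with a simple table over the cleaned tokens; the sketch below shows one way to do that, with sort() and head() as my own display choices.

```r
# Word frequency table from the cleaned tokens; sort() and head() are
# simply display choices for inspecting the most common words.
word_freq <- sort(table(unlist(clean_tokens)), decreasing = TRUE)
head(word_freq, 10)
```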
I noticed that the two lists are very similar. This is encouraging, though not surprising: the 1% sample alone contains over 645,000 words, a large enough sample to be representative of the full data set.
Since the end goal is to generate word predictions, it is necessary to find out what kinds of word groupings exist in the data. I took the sample texts and divided them up into N-grams. An N-gram is a sequence of N words that occur together in a sentence. For example, the sentence “I want a cookie” contains 3 unique 2-grams: “I want”, “want a”, and “a cookie”. Similarly, the same sentence contains 2 unique 3-grams: “I want a” and “want a cookie”. Using this technique, I created similar bar graphs of the 2-, 3-, and 4-grams from the two sample data sets below.
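The sketch below shows how such N-grams can be extracted with tokenize_ngrams() from the tokenizers package, using the example sentence above; the same call with n = 3 or n = 4 yields the longer N-grams. Note that the tokenizer lower-cases by default, so “I want” comes back as “i want”.

```r
# Extracting N-grams with tokenize_ngrams() from the tokenizers package.
# n = 2 reproduces the bigram example from the text.
library(tokenizers)

tokenize_ngrams("I want a cookie", n = 2)
# returns list(c("i want", "want a", "a cookie"))

# the same call over the 1% sample, for N = 2, 3, 4
bigrams_1pct   <- tokenize_ngrams(sample_1pct, n = 2)
trigrams_1pct  <- tokenize_ngrams(sample_1pct, n = 3)
fourgrams_1pct <- tokenize_ngrams(sample_1pct, n = 4)
```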
One final note on the N-gram analysis is that there are millions of records even in these sampled data sets, especially in the lists of 2-, 3-, and 4-grams.
| Type | Total Items | Unique Items |
|---|---:|---:|
| Words in 1% sample | 645863 | 59420 |
| Words in 5% sample | 3251146 | 154537 |
| 2-grams in 1% sample | 663511 | 464026 |
| 2-grams in 5% sample | 3344958 | 1866870 |
| 3-grams in 1% sample | 621296 | 592824 |
| 3-grams in 5% sample | 3133929 | 2843271 |
| 4-grams in 1% sample | 580917 | 577358 |
| 4-grams in 5% sample | 2931984 | 2886904 |
Since I anticipate needing at least one lookup table for each N-gram length, including every occurrence found in the 5% sample may take up too much room in the Shiny app. It may be necessary to keep only the top X records from each list; for example, my 5% list of 4-grams consumes over 240 MB of RAM on my computer. I do plan on using recursive searching (backing off to shorter N-grams when no match is found), so leaving out records for the larger values of N will not prevent the app from producing a prediction.
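To illustrate the recursive back-off idea, here is a toy sketch; the lookup tables, their contents, and the predict_next() function are entirely hypothetical and only demonstrate falling back from longer to shorter N-grams.

```r
# A toy sketch of the recursive (back-off) lookup described above.
# The lookup tables and predict_next() are hypothetical examples.
four_gram_next <- c("thanks for the" = "follow")
tri_gram_next  <- c("for the"        = "first")
bi_gram_next   <- c("the"            = "best")

predict_next <- function(phrase) {
  words <- tail(strsplit(tolower(phrase), "\\s+")[[1]], 3)
  for (k in seq(length(words), 1)) {           # try the longest prefix first
    prefix <- paste(tail(words, k), collapse = " ")
    lookup <- switch(k, bi_gram_next, tri_gram_next, four_gram_next)
    if (prefix %in% names(lookup)) return(unname(lookup[prefix]))
  }
  NA_character_                                # no match at any level
}

predict_next("Thanks for the")   # "follow"
```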