The data provided by the class owners is proving challenging to manipulate in a timely manner. I have found effective ways to deal with the development cycle: subsetting the data and storing complete R objects. The next challenge is to develop a functionally correct model, which will again rely on stored R objects for efficiency and runtime speed. A further challenge will be to pick the appropriate R packages and data structures to implement that strategy. A final task in the modeling phase is to measure success, which I believe will require segregating training data from validation data and comparing predicted words against actual word sequences.
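As a rough sketch of how that last measurement could work, assuming an 80/20 split: predict_next_word() is a placeholder for the eventual model, and the sample file name is illustrative, not a file I have created.

# Sketch: hold out a validation set and score next-word predictions
# (predict_next_word() is hypothetical; en_US.sample.txt is an assumed file name)
set.seed(42)
lines <- readLines("en_US.sample.txt")
train_idx  <- sample(seq_along(lines), floor(0.8 * length(lines)))
training   <- lines[train_idx]
validation <- lines[-train_idx]

score_line <- function(line) {
  words <- strsplit(tolower(line), "\\s+")[[1]]
  if (length(words) < 3) return(NA)
  hits <- 0
  for (i in 3:length(words)) {
    guess <- predict_next_word(words[i - 2], words[i - 1])  # hypothetical model call
    hits  <- hits + identical(guess, words[i])
  }
  hits / (length(words) - 2)   # fraction of words predicted correctly
}
accuracy <- mean(sapply(validation, score_line), na.rm = TRUE)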
It soon became clear that the original datasets were too large even to subset efficiently on my R machine. Line counts range from 0.9M to 2.3M:
jeff@jeff-ThinkPad-T440s:~/NLPCapstone/final/en_US$ wc -l *
899288 en_US.blogs.txt
1010242 en_US.news.txt
2360148 en_US.twitter.txt
A Perl script was used to subset the data by line count to 10% of this size. The 10% ratio may need to be adjusted, depending on the accuracy and speed of the final product.
#!/usr/bin/perl -an
# Select every 10th line of stdin and write it to stdout;
# -n wraps the script in an implicit read loop (-a autosplit is not actually needed here)
use strict;
print unless $. % 10;
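Assuming the script is saved as subset.pl (the file name is mine), it is run over each source file from the shell:

perl subset.pl < en_US.blogs.txt > en_US.blogs.sample.txt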
Using the tm package, the first step in data cleansing was to apply removeNumbers, stripWhitespace, removePunctuation, and removeWords with the standard stopword list. Under consideration are adding stopwords back into the mix and being more selective about punctuation (use a regexp to remove commas but leave apostrophes). I also think an end-of-sentence marker will be necessary, unless the period/semicolon can serve that function.
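A sketch of that cleansing pass, assuming the 10% samples have been collected into a directory and loaded as a tm corpus (the directory name and corpus variable are illustrative):

library(tm)

# Build a corpus from the sampled files (directory name is an assumption)
corpus <- VCorpus(DirSource("final/en_US/sample", encoding = "UTF-8"))

# The four cleansing steps described above
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("en"))
corpus <- tm_map(corpus, stripWhitespace)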
The next step was to build term-document matrices containing bigrams and trigrams.
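The exact tokenizer is not shown in this report, but one way to build these matrices is the approach from the tm FAQ, using NLP::ngrams; the variable names follow the later code, and the saveRDS call reflects the cached-object strategy mentioned above:

library(tm)
library(NLP)

# n-gram tokenizers built on NLP::ngrams
BigramTokenizer  <- function(x)
  unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
TrigramTokenizer <- function(x)
  unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE)

dtmBigram  <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))
dtmTrigram <- TermDocumentMatrix(corpus, control = list(tokenize = TrigramTokenizer))

# Cache the expensive-to-build matrix for later sessions
saveRDS(dtmTrigram, file = "NLP.dtmTrigram.RDS")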
An important step is to apply some method of weeding out infrequent word occurrences. I chose the tm function removeSparseTerms with a sparsity threshold of 0.5. In practical terms, this means that only n-grams that appear in at least half the documents (2 of 3, in this case) are retained. This had the fortunate effect of shrinking the memory footprint for trigrams from 502M to 7.3M while still containing 73,878 terms. In other words, every retained n-gram appears at least twice in the overall corpus, so singletons are automatically removed.
I hope I do not need to further reduce the richness of the data, perhaps by setting an occurrence threshold, because I really enjoy seeing a large selection of unusual English appear as prediction candidates. (I was an English major!)
library(tm)
dtmTrigram <- readRDS(file = "NLP.dtmTrigram.RDS")  # reload the cached trigram matrix
dtmTrigram <- removeSparseTerms(dtmTrigram, 0.5)    # drop terms missing from more than half the documents
> str(dtmTrigram)
List of 6
$ i : int [1:157376] 1 2 3 4 5 6 7 8 9 10 ...
$ j : int [1:157376] 1 1 1 1 1 1 1 1 1 1 ...
$ v : num [1:157376] 1 1 1 3 2 1 2 3 2 1 ...
$ nrow : int 73878
$ ncol : int 3
...
Terms en_US.blogs.txt en_US.news.txt en_US.twitter.txt
day youll see 1 0 1
Nonetheless, even after removing uncommon terms, the frequency distribution remains wildly skewed:
[Figure: frequency distribution of the pruned trigram counts]
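One quick way to see that skew from the pruned matrix (a sketch; slam is the sparse-matrix package underlying tm, and terms are the rows here, as the str() output above shows):

library(slam)

# Total corpus frequency of each trigram
freq <- sort(row_sums(dtmTrigram), decreasing = TRUE)
head(freq, 5)   # a handful of trigrams dominate
summary(freq)   # the long tail sits near the minimum count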
My intention, going forward, is to stay with the bigram/trigram level of stateful history, but to be able to scale up to more states, perhaps to 5-grams. An important consideration is being able to address the different n-grams separately, so that back-off processing can be performed in a simple way with procedural logic.
The highest technical risk is the choice of a run-time data structure. The requirements are obviously:
I have spent a lot of time looking at Markov chain implementations in R. I do not think they are the appropriate data structure for me, because
Accordingly, I will probably build data frames, data.tables, or native environment lists to map from the antecedent words to the multiple rows of successor words and probabilities, as sketched below. There are ways to make this very fast, but I really only want to make it fast enough.
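A minimal sketch of the kind of lookup I have in mind, using data.table keyed on the antecedent words; the toy counts, column names, and the simple back-off rule are all illustrative, not a final design:

library(data.table)

# Toy trigram and bigram tables; in practice these would be built from the
# term-document matrices above
tri <- data.table(w1 = c("one", "one"), w2 = c("of", "of"),
                  successor = c("the", "my"), count = c(42L, 7L))
bi  <- data.table(w1 = "of", successor = "the", count = 100L)
setkey(tri, w1, w2)   # keyed lookup: two-word antecedent -> candidate successors
setkey(bi, w1)

predict_next <- function(a, b) {
  hits <- tri[.(a, b)]                 # trigram match on the two-word history
  if (all(is.na(hits$successor)))      # no trigram evidence: back off to bigrams
    hits <- bi[.(b)]
  hits <- hits[!is.na(successor)]
  if (nrow(hits) == 0) return(NA_character_)
  hits[which.max(count), successor]    # most frequent successor wins
}

predict_next("one", "of")       # "the"
predict_next("nothing", "of")   # backs off to the bigram table, still "the"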