The data provided by the class owners is proving challenging to manipulate in a timely manner. I have found effective ways to deal with the development cycle: subsetting the data and storing complete R objects. The next challenge is to develop a functionally correct model, which will again rely on stored R objects for efficiency and runtime speed. A further challenge will be to pick the appropriate R packages and data structures to implement that strategy. A final task in the modeling phase is to measure success, which I believe will require segregating training data from validation data and comparing predicted words against actual word sequences.
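As a rough sketch of how that last measurement could work, assuming an 80/20 split: predict_next_word() is a placeholder for the eventual model, and the sample file name is illustrative, not a file I have created.

# Sketch: hold out a validation set and score next-word predictions
# (predict_next_word() is hypothetical; en_US.sample.txt is an assumed file name)
set.seed(42)
lines <- readLines("en_US.sample.txt")
train_idx  <- sample(seq_along(lines), floor(0.8 * length(lines)))
training   <- lines[train_idx]
validation <- lines[-train_idx]

score_line <- function(line) {
  words <- strsplit(tolower(line), "\\s+")[[1]]
  if (length(words) < 3) return(NA)
  hits <- 0
  for (i in 3:length(words)) {
    guess <- predict_next_word(words[i - 2], words[i - 1])  # hypothetical model call
    hits  <- hits + identical(guess, words[i])
  }
  hits / (length(words) - 2)   # fraction of words predicted correctly
}
accuracy <- mean(sapply(validation, score_line), na.rm = TRUE)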
It soon became clear that the original datasets were too large even to subset efficiently on my R machine. Line counts range from 0.9M to 2.3M:
jeff@jeff-ThinkPad-T440s:~/NLPCapstone/final/en_US$ wc -l *
899288 en_US.blogs.txt
1010242 en_US.news.txt
2360148 en_US.twitter.txt
A Perl script was used to subset the data by line count to 10% of this size. The 10% ratio may need to be adjusted, depending on the accuracy and speed of the final product.
#!/usr/bin/perl -an
# Select every 10th line of stdin and write it to stdout;
# -n wraps the script in an implicit read loop (-a autosplit is not actually needed here)
use strict;
print unless $. % 10;
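Assuming the script is saved as subset.pl (the file name is mine), it is run over each source file from the shell:

perl subset.pl < en_US.blogs.txt > en_US.blogs.sample.txt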
Using the tm package, the first step in data cleansing was to apply removeNumbers, stripWhitespace, removePunctuation, and removeWords with the standard stopword list. Under consideration are adding stopwords back into the mix and being more selective about punctuation (use a regexp to remove commas but leave apostrophes). I also think an end-of-sentence marker will be necessary, unless the period/semicolon can serve that function.
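A sketch of that cleansing pass, assuming the 10% samples have been collected into a directory and loaded as a tm corpus (the directory name and corpus variable are illustrative):

library(tm)

# Build a corpus from the sampled files (directory name is an assumption)
corpus <- VCorpus(DirSource("final/en_US/sample", encoding = "UTF-8"))

# The four cleansing steps described above
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("en"))
corpus <- tm_map(corpus, stripWhitespace)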
The next step was to build term-document matrices containing bigrams and trigrams.
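The exact tokenizer is not shown in this report, but one way to build these matrices is the approach from the tm FAQ, using NLP::ngrams; the variable names follow the later code, and the saveRDS call reflects the cached-object strategy mentioned above:

library(tm)
library(NLP)

# n-gram tokenizers built on NLP::ngrams
BigramTokenizer  <- function(x)
  unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
TrigramTokenizer <- function(x)
  unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE)

dtmBigram  <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))
dtmTrigram <- TermDocumentMatrix(corpus, control = list(tokenize = TrigramTokenizer))

# Cache the expensive-to-build matrix for later sessions
saveRDS(dtmTrigram, file = "NLP.dtmTrigram.RDS")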
An important step is to apply some method of weeding out infrequent word occurrences. I chose the tm function removeSparseTerms with a sparsity threshold of 0.5. In practical terms, this means that only n-grams that appear in at least half the documents (2 of 3, in this case) are retained. This had the fortunate effect of shrinking the memory footprint for trigrams from 502M to 7.3M while still containing 73,878 terms. In other words, every retained n-gram appears at least twice in the overall corpus, so singletons are automatically removed.
I hope I do not need to further reduce the richness of the data, perhaps by setting an occurrence threshold, because I really enjoy seeing a large selection of unusual English appear as prediction candidates. (I was an English major!)
library(tm)
dtmTrigram <- readRDS(file = "NLP.dtmTrigram.RDS")  # reload the cached trigram matrix
dtmTrigram <- removeSparseTerms(dtmTrigram, 0.5)    # drop terms missing from more than half the documents
> str(dtmTrigram)
List of 6
$ i : int [1:157376] 1 2 3 4 5 6 7 8 9 10 ...
$ j : int [1:157376] 1 1 1 1 1 1 1 1 1 1 ...
$ v : num [1:157376] 1 1 1 3 2 1 2 3 2 1 ...
$ nrow : int 73878
$ ncol : int 3
...
Terms en_US.blogs.txt en_US.news.txt en_US.twitter.txt
day youll see 1 0 1
Nonetheless, even after removing uncommon terms, the frequency distribution remains wildly skewed:
[Figure: frequency distribution of the pruned trigram counts]
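One quick way to see that skew from the pruned matrix (a sketch; slam is the sparse-matrix package underlying tm, and terms are the rows here, as the str() output above shows):

library(slam)

# Total corpus frequency of each trigram
freq <- sort(row_sums(dtmTrigram), decreasing = TRUE)
head(freq, 5)   # a handful of trigrams dominate
summary(freq)   # the long tail sits near the minimum count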
My intention, going forward, is to stay with the bigram/trigram level of stateful history, but to be able to scale up to more states, perhaps to 5-grams. An important consideration is being able to address the different n-grams separately, so that back-off processing can be performed in a simple way with procedural logic.
The highest technical risk is the choice of a run-time data structure. The requirements are obviously:
I have spent a lot of time looking at Markov chain implementations in R. I do not think they are the appropriate data structure for me, because
Accordingly, I will probably build data frames, data.tables, or native environment lists to map from the antecedent words to the multiple rows of successor words and probabilities, as sketched below. There are ways to make this very fast, but I really only want to make it fast enough.
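A minimal sketch of the kind of lookup I have in mind, using data.table keyed on the antecedent words; the toy counts, column names, and the simple back-off rule are all illustrative, not a final design:

library(data.table)

# Toy trigram and bigram tables; in practice these would be built from the
# term-document matrices above
tri <- data.table(w1 = c("one", "one"), w2 = c("of", "of"),
                  successor = c("the", "my"), count = c(42L, 7L))
bi  <- data.table(w1 = "of", successor = "the", count = 100L)
setkey(tri, w1, w2)   # keyed lookup: two-word antecedent -> candidate successors
setkey(bi, w1)

predict_next <- function(a, b) {
  hits <- tri[.(a, b)]                 # trigram match on the two-word history
  if (all(is.na(hits$successor)))      # no trigram evidence: back off to bigrams
    hits <- bi[.(b)]
  hits <- hits[!is.na(successor)]
  if (nrow(hits) == 0) return(NA_character_)
  hits[which.max(count), successor]    # most frequent successor wins
}

predict_next("one", "of")       # "the"
predict_next("nothing", "of")   # backs off to the bigram table, still "the"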