Slide 5 – Building the data.table and Conclusion:
The data.table, which is the heart of this algorithm, is built in several stages:
- Each of the 3 Corpus files was pre-processed to change upper case characters to lower case
and to remove every non-alphabetic character except apostrophes contained within words
(contractions like “it's” and “you'll”).
This step is fairly fast: a few tens of seconds per file on a 2015 iMac with 24GB of memory.
- Those 3 pre-processed files were read in, and N-Grams were produced with Quanteda
for 2-6 word phrases; each set was saved separately. Phrases which appeared fewer than
3 times in a given Corpus file were dropped. This step takes a bit more time, on the order of minutes per file.
- Those files (5 files for each of 3 Corpus files) were combined, and a rough data.table was written
out as one file. This step is limited more by computer memory than by time, but it still takes on the order of minutes.
- Lastly, the final file was “trimmed” to remove some profanity and doubled words.
Phrases which appeared fewer than 4 times in the combined Corpus were removed.
Only the top 10 (or fewer) most frequent results were kept for each preamble.
The number of words in each preamble was stored in the table, as was the “stopword”
category for each prediction result. These calculations again take minutes.
- The final data.table was then saved as an RDS file for rapid loading. The finished file is 13MB
and takes about 2 sec to load from a local disk.
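The stages above can be sketched in R on a toy example. This is only an illustration under stated assumptions, not the code used for the app: the function names (clean_text, trim_table) and the column names (preamble, prediction, count, n_words) are hypothetical, spaces are assumed to be kept as word separators, and the N-Gram counts are typed in by hand rather than produced with Quanteda. The profanity, doubled-word, and “stopword” steps are left out.

```r
library(data.table)

# Stage 1 (sketch): lower-case the text and keep only letters plus
# apostrophes that sit inside a word. Spaces are kept as separators (assumption).
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z' ]", " ", x)                              # drop digits, punctuation, etc.
  x <- gsub("(?<![a-z])'|'(?![a-z])", " ", x, perl = TRUE)   # drop apostrophes not inside a word
  trimws(gsub(" +", " ", x))                                 # collapse repeated spaces
}

# Final stage (sketch): drop rare phrases, keep the top results per preamble,
# and store the number of words in each preamble.
trim_table <- function(dt, min_count = 4L, top_n = 10L) {
  dt <- dt[count >= min_count]                    # phrases seen fewer than 4 times are removed
  dt <- dt[order(preamble, -count)]               # most frequent first within each preamble
  dt <- dt[, head(.SD, top_n), by = preamble]     # top 10 (or fewer) per preamble
  dt[, n_words := lengths(strsplit(preamble, " ", fixed = TRUE))]
  dt[]
}

cleaned <- clean_text("It's 9 o'clock, 'Hello' World!")

# Hand-made stand-in for the combined N-Gram counts (hypothetical data).
ngrams <- data.table(
  preamble   = c("thanks for", "thanks for", "thanks for", "i'll be"),
  prediction = c("the", "your", "nothing", "there"),
  count      = c(12L, 7L, 2L, 5L)
)
final <- trim_table(ngrams)

# Save as RDS for rapid loading, then read it back as the app would.
path <- tempfile(fileext = ".rds")
saveRDS(final, path)
reloaded <- readRDS(path)
```

Storing the word count of each preamble alongside the predictions means the app could filter on preamble length directly instead of re-splitting strings at lookup time.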
Conclusion
The Shiny app does what was assigned, quickly and simply, using the entire Corpus.