Coursera JHU Capstone Slide Deck

JRobertsDS
1 Aug 2020

Project

Slide 1 – Goals:

  • Develop a Shiny app in R that performs “Next Word” prediction on user-input phrases
  • Include as much of the supplied Corpus of documents as possible in the prediction
  • Make it Fast
  • Make it Simple

Results:

Algorithm Results, Speed, and Size
Overall top-3 score: 19.86 %
Overall top-1 precision: 15.32 %
Overall top-3 precision: 23.79 %
Average runtime: 1.73 msec
Total memory used: 130.89 MB

Slide 2 – Application Description:

  • There is a single input text box at the top, into which the user types phrases
  • There are two radio buttons on the left, with which the user can choose to order predictions “By Strict Word Probability” or with “Interesting Words First”
  • A few sentences of help explain that “Interesting Words First” drops “stopwords” like “and”, “the”, and “at” to the end of the list
  • Word predictions appear in a table of 8 suggestions. The first is the “Single Word Prediction”: the algorithm's best-choice “next word” (a minimal UI sketch follows this list)
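To make that layout concrete, here is a minimal Shiny UI sketch of the page described above. The widget IDs, the labels, and the predictNext() helper are illustrative assumptions, not the app's actual source.

library(shiny)

# Minimal UI sketch; widget IDs and labels are assumptions, not the app's real code
ui <- fluidPage(
  textInput("phrase", "Type a phrase:", width = "100%"),
  sidebarLayout(
    sidebarPanel(
      radioButtons("mode", "Prediction order:",
                   choices = c("By Strict Word Probability",
                               "Interesting Words First")),
      helpText("'Interesting Words First' drops stopwords like",
               "'and', 'the', and 'at' to the end of the list.")
    ),
    # Table of 8 suggestions; the first row is the Single Word Prediction
    mainPanel(tableOutput("predictions"))
  )
)

server <- function(input, output) {
  output$predictions <- renderTable({
    req(input$phrase)
    # predictNext() is the lookup sketched on Slide 4 (hypothetical helper)
    data.frame(
      Suggestion = predictNext(input$phrase,
                               interestingFirst = input$mode == "Interesting Words First"))
  })
}

shinyApp(ui, server)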

Slide 3 – Application Screenshot:

[Screenshot of the running application]

Slide 4 – Algorithm Description:

  • The prediction algorithm constructs a data.table in which “next word” prediction is a simple lookup: the tail of the input phrase is matched against stored preambles, and ranked predictions are read straight from the table. Here are the first few rows from the default lookup:
               preamble count          word numWords stopword
1: the united states of   208       america        4    FALSE
2:     united states of   228       america        3    FALSE
3:            states of   249       america        2    FALSE
4:            states of    35           the        2     TRUE
5:            states of    19 consciousness        2    FALSE
6:            states of     9     emergency        2    FALSE
  • The results are ordered by the number of words in the matching phrase (the “preamble”), then by the number of times that phrase appeared in the Corpus (the “count”); if the user has chosen “Interesting Words First”, stopwords are sorted into last place
  • The final results are then simply the unique words from that subset, already in sorted order
  • Note that the number of words in the preamble (“numWords”), and whether the resulting prediction (“word”) is a stopword, are already encoded in the data.table stored on disk
  • The prediction itself, as outlined above, is therefore very simple: there are no probability calculations at all, just a sort of the data.table after a phrase lookup (see the sketch below)
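Here is a minimal sketch of that lookup, assuming the table uses the columns shown above and is keyed on “preamble”. The function name, the file name, and the 5-word cap on preamble length are assumptions for illustration, not the app's actual code.

library(data.table)

ngrams <- readRDS("ngrams.rds")   # hypothetical file name; see Slide 5
setkey(ngrams, preamble)

predictNext <- function(phrase, interestingFirst = FALSE, n = 8) {
  # Assumes a non-empty input phrase
  words <- strsplit(tolower(phrase), "\\s+")[[1]]
  # Try the longest preamble first (up to 5 words), then shorter tails
  tails <- sapply(min(length(words), 5):1,
                  function(k) paste(tail(words, k), collapse = " "))
  hits  <- ngrams[.(tails), nomatch = 0]   # keyed lookup; drop misses

  # Rank: longer preamble matches first, then higher Corpus counts;
  # "Interesting Words First" pushes stopwords (TRUE) to the end
  if (interestingFirst) {
    setorder(hits, stopword, -numWords, -count)
  } else {
    setorder(hits, -numWords, -count)
  }
  head(unique(hits$word), n)
}

predictNext("the united states of")   # "america" ranks first, per the rows above

Because the stopword column is logical, sorting it in ascending order is enough to drop stopwords to the end of the list, with no extra bookkeeping.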

Slide 5 – Building the data.table and Conclusion:

The data.table at the heart of this algorithm is built in several stages (a sketch of the full pipeline follows the list):

  • Each of the 3 Corpus files was pre-processed to change upper-case characters to lower case, and to remove every non-alphabet character except for apostrophes contained within words (contractions like “it's” and “you'll”). This step is fast: a few tens of seconds per file on a 2015 iMac with 24GB of memory
  • The 3 pre-processed files were read in, N-grams of 2-6 words were produced with Quanteda, and each set was saved separately. Phrases which appeared fewer than 3 times in a given Corpus file were dropped. This step takes a bit more time, on the order of minutes per file
  • Those files (5 N-gram files for each of the 3 Corpus files) were combined, and a rough data.table written out as a single file. This step is bound more by memory than by time, but still takes on the order of minutes
  • Lastly, the final file was “trimmed” to remove some profanity and doubled words. Phrases which had appeared fewer than 4 times in the combined Corpus were removed, and only the top 10 (or fewer) most frequent results were kept for each preamble. The number of words in each preamble was stored in the table, as was the “stopword” flag for each prediction result. Again, those calculations take minutes
  • The final data.table was then saved as an RDS file for rapid loading. The finished file is 13MB and loads from a local disk in about 2 seconds
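The sketch below strings those stages together, assuming one cleaned plain-text file per Corpus source. The file names, the helper names, and the exact cleaning regexes are illustrative assumptions (the thresholds are the ones described above), and the profanity/doubled-word cleanup is omitted for brevity.

library(quanteda)
library(data.table)

# Stage 1: lower-case and strip everything except letters, spaces, and
# in-word apostrophes (contractions like "it's")
cleanText <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z' ]", " ", x)
  x <- gsub("(^|\\s)'+|'+(\\s|$)", " ", x)   # drop apostrophes not inside words
  gsub("\\s+", " ", x)
}

# Stage 2: 2-6 word N-grams per file, counted, keeping phrases seen >= 3 times
countNgrams <- function(file) {
  txt <- cleanText(readLines(file, warn = FALSE))
  ngr <- tokens_ngrams(tokens(txt, what = "word"), n = 2:6, concatenator = " ")
  cnt <- colSums(dfm(ngr))
  data.table(phrase = names(cnt), count = as.integer(cnt))[count >= 3]
}

# Stage 3: combine the per-file tables, re-aggregate, keep phrases seen >= 4 times
combined <- rbindlist(lapply(c("blogs.txt", "news.txt", "twitter.txt"),
                             countNgrams))
combined <- combined[, .(count = sum(count)), by = phrase][count >= 4]

# Stage 4: split each phrase into preamble + predicted word, keep the top 10
# predictions per preamble, and pre-compute numWords and the stopword flag
combined[, word     := sub("^.* ", "", phrase)]
combined[, preamble := sub(" [^ ]+$", "", phrase)]
combined[, numWords := lengths(strsplit(preamble, " "))]
combined[, stopword := word %in% stopwords("en")]
setorder(combined, preamble, -count)
final <- combined[, head(.SD, 10), by = preamble,
                  .SDcols = c("count", "word", "numWords", "stopword")]

# Stage 5: save for fast loading by the Shiny app
saveRDS(final, "ngrams.rds")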

Conclusion

The Shiny app does what was assigned: it draws on the entire Corpus, and it is fast and simple.