1. Text Prediction

Todd Rimes
July 29, 2018

This text prediction model and demo app were built in R. The provided data was processed to build a model behind a demo app that takes text entered by a user and predicts “candidate” next words, given the last few words typed.

  • 3 data sources with 4.27 million total lines from texts, Twitter, and news
  • Sampled 20,000 lines from each source for a total of 60,000 sampled lines (a sampling sketch follows below)
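
A minimal sketch of that sampling step; the file names, encoding, and seed below are assumptions for illustration, not taken from the project:

    # Sketch of sampling 20,000 lines from each of the three provided sources.
    set.seed(1234)                       # assumed seed, for reproducibility only

    sample_source <- function(path, n = 20000) {
      lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
      sample(lines, n)
    }

    # Hypothetical file names for the three sources.
    sampled_lines <- c(
      sample_source("texts.txt"),
      sample_source("twitter.txt"),
      sample_source("news.txt")
    )
    length(sampled_lines)                # 60,000 sampled lines in total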

2. How It Works

  • The combined corpus (aggregated text lines) was tokenized into recurring word combinations (a tokenization sketch follows this list), yielding:
    • 2,000,000 quadgrams: 4-word combinations used to predict the 4th word given the 3 leading words, ranked by probability
    • 40,000 trigrams: 3-word combinations used to predict the 3rd word given the 2 leading words
    • 2,800 bigrams: 2-word combinations used to predict the 2nd word given one leading word
    • 3,400 unigrams: single words ranked by overall recurrence
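
A minimal sketch of the quadgram tokenization using the tidytext package and the sampled_lines vector from the sampling sketch above; the column names w1–w4 and n are my own, and the project may tokenize differently:

    # Sketch of quadgram tokenization; trigrams, bigrams, and unigrams are built
    # the same way with n = 3, 2, and 1.
    library(dplyr)
    library(tidyr)
    library(tidytext)

    corpus <- tibble(text = sampled_lines)

    quadgrams <- corpus %>%
      unnest_tokens(ngram, text, token = "ngrams", n = 4) %>%  # 4-word windows per line
      filter(!is.na(ngram)) %>%                                # drop lines shorter than 4 words
      count(ngram, sort = TRUE) %>%                            # recurrence count per quadgram
      separate(ngram, into = c("w1", "w2", "w3", "w4"), sep = " ")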

3. Performance

In the demo UI, the entered text is trimmed and parsed down to its last three (or fewer) words. The model is then “queried” for 4-word phrases that begin with those last three words; a minimal sketch of this backoff lookup follows the list below.

  • If any phrases are found, the top 4 (or fewer) are returned in order of overall probability, calculated as each phrase's recurrence across all memoized 4-word phrases parsed from the corpus.
  • If no 4-word phrases are found, 3-word phrases are queried, and so on.
  • If no 2-word phrases are found, single words are queried.
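
A minimal sketch of this backoff lookup, assuming n-gram tables shaped like the tokenization sketch above (quadgrams, trigrams, bigrams, and unigrams with word columns w1–w4 and a recurrence count n); none of these names come from the project itself:

    # Minimal sketch of the backoff lookup; table and column names are assumptions.
    library(dplyr)
    library(stringr)

    predict_next <- function(input, n_candidates = 4) {
      # Trim, lower-case, and keep at most the last three words typed.
      words <- str_split(str_squish(tolower(input)), " ")[[1]]
      words <- tail(words, 3)

      # 1) Quadgrams: match the three leading words, rank by recurrence.
      if (length(words) == 3) {
        hits <- filter(quadgrams, w1 == words[1], w2 == words[2], w3 == words[3])
        if (nrow(hits) > 0) return(head(arrange(hits, desc(n))$w4, n_candidates))
      }
      # 2) Trigrams: match the last two words.
      if (length(words) >= 2) {
        last2 <- tail(words, 2)
        hits <- filter(trigrams, w1 == last2[1], w2 == last2[2])
        if (nrow(hits) > 0) return(head(arrange(hits, desc(n))$w3, n_candidates))
      }
      # 3) Bigrams: match the last word.
      hits <- filter(bigrams, w1 == tail(words, 1))
      if (nrow(hits) > 0) return(head(arrange(hits, desc(n))$w2, n_candidates))

      # 4) Unigrams: fall back to the most frequent single words overall.
      head(arrange(unigrams, desc(n))$w1, n_candidates)
    }

    predict_next("thanks for the")   # returns up to 4 candidate next words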

Performance ranges from 7 milliseconds (a 4-word match) to 7 microseconds (no match), indicating that the more iterative searches performed (from 4-word phrases down to 3-word phrases and so on), the longer the server response time.
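
A sketch of how such response times might be measured with the microbenchmark package, using the hypothetical predict_next() from the sketch above; the example inputs are my own:

    # Sketch of timing the lookup; predict_next() and the inputs are hypothetical.
    library(microbenchmark)

    microbenchmark(
      quadgram_path = predict_next("at the end of"),   # likely resolved by a 4-word match
      backoff_path  = predict_next("qwfp zxcv arst"),  # likely falls through to unigrams
      times = 100
    )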

4. Demo

A demo of the app in action.

5. Next Steps

Ideas for improvements and optimizations.

  1. More closely tune the number of sampled lines without sacrificing predictive power. I tried a 1% sample and 10,000 lines from each source: 1% was too much (too slow to tokenize) and 10,000 was too few (poor predictive power). 20,000 was my last guess, but a lower number might predict just as well and perform better, with smaller datasets.
  2. Could remove the news data altogether, since this model's expected utility is in short-form use (e.g. texting and tweeting), not long-form article composition.
  3. Could remove the trigram and bigram searches, as their respective n-gram tables are substantially smaller than the quadgram table, which carries more innate “context”.