Text Prediction with R & R Shiny
Benjamin S. Knight
October 9th, 2016
- Created in R Shiny, the application utilizes a custom input widget.
- Simply enter the text for which you want to predict the next word - no further input is required.
- Leverages the top 1,000,000 most frequent 5-grams and 4-grams from the Corpus of Contemporary American English (COCA).
- Lower-order n-grams are derived from a corpus of U.S. news article extracts provided by SwiftKey.
- The selected subsets of the 5-grams and 4-grams maximize the frequency of the predicted word.
- Lower-order n-grams are subset by Kneser-Ney probability (see the sketch below).
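A minimal sketch of how this kind of highest-order-first (backoff) lookup can work in R is shown below. The table and column names (`prefix`, `prediction`, `frequency`) are illustrative assumptions, not the application's actual objects.

```r
# Minimal backoff-style lookup sketch (illustrative only; table and
# column names are assumptions, not the app's actual code).
library(data.table)

# Each table maps an (n - 1)-word prefix to candidate next words,
# with columns: prefix, prediction, frequency.
predict_next_word <- function(input, fivegrams, fourgrams, trigrams, bigrams) {
  words  <- tolower(unlist(strsplit(trimws(input), "\\s+")))
  tables <- list(fivegrams, fourgrams, trigrams, bigrams)
  orders <- c(4, 3, 2, 1)   # prefix lengths for 5-, 4-, 3-, and 2-grams

  for (i in seq_along(tables)) {
    n <- orders[i]
    if (length(words) < n) next
    ctx <- paste(tail(words, n), collapse = " ")
    hit <- tables[[i]][prefix == ctx]
    if (nrow(hit) > 0) {
      # Return the highest-frequency completion found at this n-gram order
      return(hit[which.max(frequency), prediction])
    }
  }
  NA_character_  # no match at any order
}
```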
| N-gram Order | Source | N-grams Derived | N-grams Utilized |
|---|---|---|---|
| 5-Grams | COCA | 1,000,000 | 635,763 |
| 4-Grams | COCA | 1,000,000 | 462,563 |
| Trigrams | SwiftKey - US News (English) | 14,307,749 | 500,000 |
| Bigrams | SwiftKey - US News (English) | 4,365,811 | 60,948 |
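The gap between the "Derived" and "Utilized" columns reflects the subsetting described above. A hedged sketch of how such pruning might be done with data.table follows; the column names (`prefix`, `frequency`, `kn_prob`) and the cutoff are assumptions for illustration.

```r
# Illustrative pruning sketch: reduce "Derived" n-grams to the "Utilized"
# subset. Column names and the top_n cutoff are assumptions.
library(data.table)

prune_by_frequency <- function(ngrams) {
  # For each unique prefix, retain the completion with the highest raw count
  ngrams[ngrams[, .I[which.max(frequency)], by = prefix]$V1]
}

prune_by_kn <- function(ngrams, top_n = 500000) {
  # For lower-order n-grams, rank rows by Kneser-Ney probability instead
  head(ngrams[order(-kn_prob)], top_n)
}
```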
Accuracy was estimated by applying the algorithm to a 63-sentence New York Times article and assessing the rate of successful prediction of the 5th word and the last word of each sentence.
http://www.nytimes.com/2016/10/09/us/politics/donald-trump-campaign.html
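A rough sketch of this evaluation loop is below, assuming a character vector `sentences` holding the article's sentences and a prediction function such as the hypothetical `predict_next_word()` sketched earlier.

```r
# Evaluation sketch: for each sentence, predict the word at `position`
# ("last" or a numeric index) from the preceding words and score the guess.
evaluate_position <- function(sentences, position = 5, predict_fn) {
  hits <- vapply(sentences, function(s) {
    words <- unlist(strsplit(tolower(s), "\\s+"))
    idx   <- if (identical(position, "last")) length(words) else position
    if (length(words) < idx || idx < 2) return(NA)  # skip sentences that are too short
    guess <- predict_fn(paste(words[seq_len(idx - 1)], collapse = " "))
    identical(guess, words[idx])
  }, logical(1))
  c(trials    = sum(!is.na(hits)),
    successes = sum(hits, na.rm = TRUE),
    accuracy  = mean(hits, na.rm = TRUE))
}

# Usage (assumes the n-gram tables from the earlier sketch are loaded):
# pf <- function(x) predict_next_word(x, fivegrams, fourgrams, trigrams, bigrams)
# evaluate_position(sentences, 5,      predict_fn = pf)   # 5th-word accuracy
# evaluate_position(sentences, "last", predict_fn = pf)   # last-word accuracy
```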
| Test | Trials | Successes | Failures | Accuracy |
|---|---|---|---|---|
| NYT Article (5th Word) | 59 | 11 | 48 | 0.19 |
| NYT Article (Last Word) | 68 | 8 | 60 | 0.12 |
| Overall | 127 | 19 | 108 | 0.15 |