Text Prediction with R & R Shiny

Benjamin S. Knight

October 9th, 2016





The Application

- Created in R Shiny, the application uses a custom input widget.
- Simply enter the text whose next word you want predicted; no further input is required.

The Data

- Uses the 1,000,000 most frequent 5-grams and 4-grams from the Corpus of Contemporary American English (COCA).
- Lower-order n-grams are derived from a corpus of US news article extracts provided by SwiftKey.
- The selected subsets of the 5-grams and 4-grams maximize the frequency of the predicted word.
- Lower-order n-grams are subset by Kneser-Ney probability.

| Ngram | Source | Derived | Utilized |
|---|---|---|---|
| 5-Grams | COCA | 1,000,000 | 635,763 |
| 4-Grams | COCA | 1,000,000 | 462,563 |
| Trigrams | SwiftKey - US News (English) | 14,307,749 | 500,000 |
| Bigrams | SwiftKey - US News (English) | 4,365,811 | 60,948 |
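
As a rough illustration of how lower-order n-grams might be scored and subset, the sketch below counts bigrams in a toy corpus and ranks them by interpolated Kneser-Ney probability. The toy sentences, the function name `kneser_ney_bigram`, and the discount of 0.75 are assumptions made for illustration; the presentation does not state which Kneser-Ney variant or discount was used.

```r
# Toy corpus (hypothetical); the real input would be the SwiftKey US news text.
sentences <- c("the cat sat on the mat",
               "the dog sat on the rug",
               "the cat ate the fish")

# Split each sentence into lower-case word tokens.
tokens <- strsplit(tolower(sentences), "\\s+")

# Build every adjacent word pair (bigram) within each sentence.
pairs <- do.call(rbind, lapply(tokens, function(w) {
  if (length(w) < 2) return(NULL)
  data.frame(first = head(w, -1), second = tail(w, -1),
             stringsAsFactors = FALSE)
}))

# Count each distinct bigram.
bigrams <- aggregate(count ~ first + second,
                     data = cbind(pairs, count = 1L), FUN = sum)

# Interpolated Kneser-Ney probability P(second | first) for seen bigrams.
# The discount value is an assumption, not taken from the presentation.
kneser_ney_bigram <- function(bigrams, discount = 0.75) {
  first  <- as.character(bigrams$first)
  second <- as.character(bigrams$second)

  context_total <- tapply(bigrams$count, first, sum)      # c(first)
  context_types <- tapply(bigrams$count, first, length)   # distinct followers
  continuation  <- tapply(bigrams$count, second, length)  # distinct predecessors

  # Continuation probability: share of bigram types that end in `second`.
  p_continuation <- continuation / nrow(bigrams)

  total  <- as.numeric(context_total[first])
  lambda <- discount * as.numeric(context_types[first]) / total

  bigrams$p_kn <- pmax(bigrams$count - discount, 0) / total +
    lambda * as.numeric(p_continuation[second])
  bigrams
}

scored <- kneser_ney_bigram(bigrams)

# Keep only the highest-scoring bigrams (5 here; the table above keeps 60,948).
top_bigrams <- head(scored[order(-scored$p_kn), ], 5)
print(top_bigrams)
```

Scoring only the bigrams that were actually observed means the probabilities for a given first word sum to less than one; the remainder is the mass that Kneser-Ney reserves for unseen continuations.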

The Algorithm
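
The n-gram tables in the data section (5-grams down to bigrams) suggest a back-off style lookup: try the longest available context first and fall back to shorter ones when there is no match. The following is a minimal sketch under that assumption, not the application's actual code. The table contents, the `ngram_tables` structure, and the function name `predict_next_word` are all hypothetical.

```r
# Hypothetical miniature lookup tables, keyed by context length in words.
# Each named vector maps a context string to the stored next word.
ngram_tables <- list(
  "4" = c("at the end of" = "the"),                # from 5-grams
  "3" = c("the end of" = "the"),                   # from 4-grams
  "2" = c("end of" = "the", "of the" = "year"),    # from trigrams
  "1" = c("the" = "first", "of" = "the")           # from bigrams
)

predict_next_word <- function(text, tables = ngram_tables) {
  # Normalize the input and split it into words.
  words <- strsplit(trimws(tolower(text)), "\\s+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) == 0) return(NA_character_)

  # Try the longest context available (at most 4 words), then back off.
  for (n in min(4L, length(words)):1) {
    context <- paste(tail(words, n), collapse = " ")
    hit <- tables[[as.character(n)]][context]
    if (!is.na(hit)) return(unname(hit))
  }

  # No table contains a matching context.
  NA_character_
}

print(predict_next_word("We met at the end of"))
print(predict_next_word("kind of"))
print(predict_next_word("zebra"))
```

Storing one predicted word per context keeps each lookup to a single named-vector access, which matters when the tables hold hundreds of thousands of rows.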

The Results

Accuracy was estimated by applying the algorithm to a 63-sentence New York Times article and measuring how often it correctly predicted the fifth word and the last word of each sentence.

http://www.nytimes.com/2016/10/09/us/politics/donald-trump-campaign.html
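
The evaluation procedure described above can be sketched as a small harness. This is an illustrative reconstruction, not the author's script: the function name `evaluate_position`, the punctuation handling, and the rule of skipping sentences too short to supply a target word are assumptions, and `predict_fn` stands in for whatever prediction function is being tested.

```r
# Score a predictor on the fifth or last word of each sentence.
# `predict_fn` is a hypothetical function: context string -> predicted word.
evaluate_position <- function(sentences, predict_fn,
                              position = c("fifth", "last")) {
  position  <- match.arg(position)
  trials    <- 0L
  successes <- 0L

  for (s in sentences) {
    # Strip punctuation (assumed preprocessing) and split into words.
    words <- strsplit(tolower(gsub("[^A-Za-z' ]", "", s)), "\\s+")[[1]]
    words <- words[nzchar(words)]

    # Assumption: skip sentences without enough words for a context and target.
    min_words <- if (position == "fifth") 5L else 2L
    if (length(words) < min_words) next

    target_index <- if (position == "fifth") 5L else length(words)
    context <- paste(words[seq_len(target_index - 1L)], collapse = " ")
    guess   <- predict_fn(context)

    trials <- trials + 1L
    if (!is.na(guess) && identical(guess, words[target_index])) {
      successes <- successes + 1L
    }
  }

  list(trials = trials, successes = successes,
       failures = trials - successes,
       accuracy = successes / trials)
}

# Usage with a deliberately naive predictor that always answers "the".
always_the <- function(context) "the"
demo <- c("We walked down to the river today.",
          "Hello there.",
          "She sat at the edge of the bed.")

fifth_result <- evaluate_position(demo, always_the, "fifth")
last_result  <- evaluate_position(demo, always_the, "last")
str(fifth_result)
str(last_result)
```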

| Article | Trials | Successes | Failures | Accuracy |
|---|---|---|---|---|
| NYT Article (5th Word) | 59 | 11 | 48 | 0.19 |
| NYT Article (Last Word) | 68 | 8 | 60 | 0.12 |
| Overall | 127 | 19 | 108 | 0.15 |
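
The accuracy column is simply successes divided by trials, rounded to two decimals. The short sketch below recomputes the table from the reported trial and success counts; the data frame and its column names are illustrative.

```r
# Trial and success counts as reported in the results table.
results <- data.frame(
  article   = c("NYT Article (5th Word)", "NYT Article (Last Word)"),
  trials    = c(59, 68),
  successes = c(11, 8)
)
results$failures <- results$trials - results$successes

# The overall row pools both positions.
overall <- data.frame(
  article   = "Overall",
  trials    = sum(results$trials),
  successes = sum(results$successes),
  failures  = sum(results$failures)
)
results <- rbind(results, overall)

# Accuracy = successes / trials, rounded to two decimal places.
results$accuracy <- round(results$successes / results$trials, 2)
print(results)
```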