Text Prediction with R & R Shiny
Benjamin S. Knight
October 9th, 2016
- Created in R Shiny, the application utilizes a custom input widget.
- Simply enter the text for which you want to predict the next word - no further input is required.
- Leverages the top 1,000,000 most frequent 5-grams and 4-grams from the Corpus of Contemporary American English (COCA).
- Lower-order n-grams are derived from a corpus of U.S. news article extracts provided by SwiftKey.
- The selected subsets of the 5-grams and 4-grams maximize the frequency of the predicted word.
- Lower-order n-grams are subset by Kneser-Ney probability (see the sketch below).
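A minimal sketch of how this kind of highest-order-first (backoff) lookup can work in R is shown below. The table and column names (`prefix`, `prediction`, `frequency`) are illustrative assumptions, not the application's actual objects.

```r
# Minimal backoff-style lookup sketch (illustrative only; table and
# column names are assumptions, not the app's actual code).
library(data.table)

# Each table maps an (n - 1)-word prefix to candidate next words,
# with columns: prefix, prediction, frequency.
predict_next_word <- function(input, fivegrams, fourgrams, trigrams, bigrams) {
  words  <- tolower(unlist(strsplit(trimws(input), "\\s+")))
  tables <- list(fivegrams, fourgrams, trigrams, bigrams)
  orders <- c(4, 3, 2, 1)   # prefix lengths for 5-, 4-, 3-, and 2-grams

  for (i in seq_along(tables)) {
    n <- orders[i]
    if (length(words) < n) next
    ctx <- paste(tail(words, n), collapse = " ")
    hit <- tables[[i]][prefix == ctx]
    if (nrow(hit) > 0) {
      # Return the highest-frequency completion found at this n-gram order
      return(hit[which.max(frequency), prediction])
    }
  }
  NA_character_  # no match at any order
}
```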
| N-gram Order | Source | N-grams Derived | N-grams Utilized |
|---|---|---|---|
| 5-Grams | COCA | 1,000,000 | 635,763 |
| 4-Grams | COCA | 1,000,000 | 462,563 |
| Trigrams | SwiftKey - US News (English) | 14,307,749 | 500,000 |
| Bigrams | SwiftKey - US News (English) | 4,365,811 | 60,948 |
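The gap between the "Derived" and "Utilized" columns reflects the subsetting described above. A hedged sketch of how such pruning might be done with data.table follows; the column names (`prefix`, `frequency`, `kn_prob`) and the cutoff are assumptions for illustration.

```r
# Illustrative pruning sketch: reduce "Derived" n-grams to the "Utilized"
# subset. Column names and the top_n cutoff are assumptions.
library(data.table)

prune_by_frequency <- function(ngrams) {
  # For each unique prefix, retain the completion with the highest raw count
  ngrams[ngrams[, .I[which.max(frequency)], by = prefix]$V1]
}

prune_by_kn <- function(ngrams, top_n = 500000) {
  # For lower-order n-grams, rank rows by Kneser-Ney probability instead
  head(ngrams[order(-kn_prob)], top_n)
}
```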
Accuracy was estimated by applying the algorithm to a 63-sentence New York Times article and assessing the rate of successful prediction of the 5th word and the last word of each sentence.
http://www.nytimes.com/2016/10/09/us/politics/donald-trump-campaign.html
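A rough sketch of this evaluation loop is below, assuming a character vector `sentences` holding the article's sentences and a prediction function such as the hypothetical `predict_next_word()` sketched earlier.

```r
# Evaluation sketch: for each sentence, predict the word at `position`
# ("last" or a numeric index) from the preceding words and score the guess.
evaluate_position <- function(sentences, position = 5, predict_fn) {
  hits <- vapply(sentences, function(s) {
    words <- unlist(strsplit(tolower(s), "\\s+"))
    idx   <- if (identical(position, "last")) length(words) else position
    if (length(words) < idx || idx < 2) return(NA)  # skip sentences that are too short
    guess <- predict_fn(paste(words[seq_len(idx - 1)], collapse = " "))
    identical(guess, words[idx])
  }, logical(1))
  c(trials    = sum(!is.na(hits)),
    successes = sum(hits, na.rm = TRUE),
    accuracy  = mean(hits, na.rm = TRUE))
}

# Usage (assumes the n-gram tables from the earlier sketch are loaded):
# pf <- function(x) predict_next_word(x, fivegrams, fourgrams, trigrams, bigrams)
# evaluate_position(sentences, 5,      predict_fn = pf)   # 5th-word accuracy
# evaluate_position(sentences, "last", predict_fn = pf)   # last-word accuracy
```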
| Test | Trials | Successes | Failures | Accuracy |
|---|---|---|---|---|
| NYT Article (5th Word) | 59 | 11 | 48 | 0.19 |
| NYT Article (Last Word) | 68 | 8 | 60 | 0.12 |
| Overall | 127 | 19 | 108 | 0.15 |