"Next word" application

Nadia Stavisky
08 December, 2019

Introduction

I built “the next word” prediction model using the data from three Capstone Dataset files:

name                 size
en_US.blogs.txt      200M
en_US.news.txt       196M
en_US.twitter.txt    159M

I sampled the files to approximate the results that would be obtained using all the data.

name size
sample.txt 12.2M
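
The sampling step can be sketched as below. This is a minimal illustration, not the code used for this project: the function name `sample_lines`, the 2% fraction, and the fixed seed are all assumptions chosen for the example.

```python
import random


def sample_lines(lines, fraction, seed=42):
    """Keep each line independently with probability `fraction`.

    `fraction` and `seed` are illustrative assumptions; a fixed seed
    makes the sample reproducible between runs.
    """
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]


if __name__ == "__main__":
    # Stand-in for the lines of a corpus file such as en_US.blogs.txt.
    corpus = [f"line {i}" for i in range(10_000)]
    sample = sample_lines(corpus, fraction=0.02)
    print(len(sample), "of", len(corpus), "lines kept")
```

Sampling line by line keeps each sentence intact, so the n-gram counts built from the sample are not distorted by cut-off phrases.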

Modeling

I used the relationships between words in n-grams to build a predictive text model (Katz back-off modeling): the next word is predicted from how frequently it follows the previous 1, 2, or 3 words, and the model falls back to a shorter context when the longer one has not been seen.
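
The back-off idea can be sketched as follows. This is a simplified, frequency-only version: it omits the discounting that full Katz back-off applies when redistributing probability mass, and the names `build_model` and `predict_next` are hypothetical, not taken from the app.

```python
from collections import Counter, defaultdict


def build_model(sentences, max_order=4):
    """Count next-word frequencies for every prefix of 0 to max_order-1 words."""
    model = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            for n in range(max_order):
                if i - n < 0:
                    break
                prefix = tuple(tokens[i - n:i])  # the n words before `word`
                model[prefix][word] += 1
    return model


def predict_next(model, phrase, k=3, max_order=4):
    """Return up to k candidates, backing off from the longest known prefix.

    Simplification (assumption): candidates are ranked by raw counts,
    without Katz discounting.
    """
    tokens = phrase.lower().split()
    longest = min(max_order - 1, len(tokens))
    for n in range(longest, -1, -1):
        prefix = tuple(tokens[len(tokens) - n:])
        candidates = model.get(prefix)
        if candidates:
            return [word for word, _ in candidates.most_common(k)]
    return []


if __name__ == "__main__":
    toy = ["thank you very much", "thank you very much", "thank you for coming"]
    model = build_model(toy)
    print(predict_next(model, "thank you", k=2))
```

When no prefix of the phrase is known, the loop reaches the empty prefix and returns the most frequent single words, which matches the fallback to top 1-grams.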

3-gram barplot

3-gram cloud

Evaluation

I evaluated the model for accuracy and efficiency on a test set of 100 4-grams from unseen data.
Accuracy - the percentage of correct predictions of the next word given the preceding 1, 2, or 3 words (the (n-1)-gram prefix).
Efficiency - request execution time.

Note that when there is no history, i.e. a 4-gram has not appeared in the training data set, the model offers the list of top 1-grams. Rare phrases (i.e. those that appear only once in the test data) inevitably produce errors and thus significantly degrade the model's measured accuracy. I therefore ran the test using only 4-grams that appear at least twice in the test data. I believe this is a more relevant measure of accuracy.

Results: the top-1 prediction is correct in 48% of cases; the correct word is among the top 2 predictions in 59% of cases and among the top 3 in 64%.
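
A top-k accuracy measurement of this kind can be sketched as below. It is an illustration under stated assumptions rather than the evaluation script itself: `top_k_accuracy` is a hypothetical name, `predict` stands for any function that maps a 3-word prefix and a count k to a ranked list of words, and each distinct 4-gram is scored once.

```python
from collections import Counter


def top_k_accuracy(fourgrams, predict, ks=(1, 2, 3), min_count=2):
    """Share of test 4-grams whose last word is among the top-k predictions.

    Only 4-grams occurring at least `min_count` times in the test data are
    scored. Assumption: each distinct 4-gram counts once.
    """
    counts = Counter(fourgrams)
    kept = [gram for gram, count in counts.items() if count >= min_count]
    if not kept:
        return {}
    hits = {k: 0 for k in ks}
    for *prefix, target in kept:
        ranked = predict(tuple(prefix), max(ks))
        for k in ks:
            if target in ranked[:k]:
                hits[k] += 1
    return {k: hits[k] / len(kept) for k in ks}


if __name__ == "__main__":
    test_grams = (
        [("a", "b", "c", "d")] * 2
        + [("a", "b", "c", "e")] * 2
        + [("x", "y", "z", "w")]  # appears once, so it is excluded
    )

    def toy_predict(prefix, k):
        # Hypothetical predictor that ignores the prefix.
        return ["d", "e", "f"][:k]

    print(top_k_accuracy(test_grams, toy_predict))
```

Execution time per request could be measured in the same loop by wrapping the `predict` call with `time.perf_counter()`.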

Shiny app

The Shiny app “Next Word” takes a phrase (one or more words) from a text box and predicts the next word. The phrase can be built either by selecting a word from the list of predicted words or by typing words manually. You can configure the app by choosing the maximum number of predicted words to display.

Next_word_app