Next Word Prediction

Patrick Machado
18 Aug 2019

This is a pitch for a Shiny application that uses Natural Language Processing to predict the next word of a phrase entered by the user.

It was constructed as the final project of the Coursera Data Science Specialization.

The dataset

The data used for training the algorithm was downloaded from here. This large volume of data was cleaned and sampled to build the quadgrams used by the application, which were then classified and aggregated into five features:

  • Base input word 3
  • Base input word 2
  • Base input word 1
  • Prediction word
  • Frequency of each unique base - prediction word combination
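
As an illustration of the table described above, here is a minimal Python sketch of how such quadgram rows could be counted. It is not the app's R code: the function name `build_quadgram_table`, the toy sentences, and the lowercase-and-split cleaning step are assumptions, as is the reading that "base input word 1" is the word immediately before the prediction.

```python
from collections import Counter


def build_quadgram_table(sentences):
    """Count every run of four consecutive words.

    Returns rows of (word3, word2, word1, prediction, frequency).
    Assumption: word1 is the word immediately before the prediction.
    """
    counts = Counter()
    for sentence in sentences:
        # Assumption: a simple lowercase-and-split stands in for the real cleaning.
        words = sentence.lower().split()
        for i in range(len(words) - 3):
            counts[tuple(words[i:i + 4])] += 1
    return [(w3, w2, w1, pred, freq)
            for (w3, w2, w1, pred), freq in counts.items()]


# Toy corpus, invented for this example.
rows = build_quadgram_table([
    "thanks for the follow",
    "thanks for the follow back",
    "thanks for the help",
])
```

With this toy corpus, "thanks for the" followed by "follow" is counted twice, so it outranks "help" as a prediction for the same three base words.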

More details about the dataset can be found in this RPubs presentation.

The algorithm

Prediction uses a modified back-off model. The algorithm works as follows:

  1. The user input is cleaned with the same function that was applied to the original data, and the three rightmost words are selected.

  2. These words are looked up in the app database, and if there is a match, the prediction word is returned.

  3. If no exact match is found, the app database is filtered, when possible, by each of the three input words, and the prediction word with the greatest quadgram frequency is returned.
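
The three steps above can be sketched in Python as follows. This is one possible reading rather than the app's actual R implementation: the names `clean` and `predict_next_word` are hypothetical, the cleaning step is a simple lowercase-and-split, and "filtered, when possible" is interpreted as applying each word filter in turn (nearest word first) and keeping it only if at least one row remains.

```python
def clean(text):
    # Hypothetical stand-in for the app's cleaning function.
    return text.lower().split()


def predict_next_word(text, rows):
    """rows: list of (word3, word2, word1, prediction, frequency) tuples."""
    words = clean(text)[-3:]  # step 1: the three rightmost words
    if not words or not rows:
        return None
    # Left-pad so the context always lines up as (word3, word2, word1).
    context = [None] * (3 - len(words)) + words

    # Step 2: exact match on all three base words.
    candidates = [r for r in rows if list(r[:3]) == context]

    # Step 3 (assumed reading): filter by each word in turn, nearest first,
    # keeping a filter only when it leaves at least one row.
    if not candidates:
        candidates = rows
        for position in (2, 1, 0):
            narrowed = [r for r in candidates if r[position] == context[position]]
            if narrowed:
                candidates = narrowed

    # Return the prediction with the greatest quadgram frequency.
    return max(candidates, key=lambda r: r[4])[3]


# Toy database, invented for this example.
rows = [
    ("thanks", "for", "the", "follow", 2),
    ("thanks", "for", "the", "help", 1),
    ("for", "the", "follow", "back", 1),
    ("one", "of", "the", "most", 3),
]
exact = predict_next_word("Thanks for the", rows)
backed_off = predict_next_word("all of the", rows)
```

Here `exact` comes from step 2, while `backed_off` has no exact match and falls through to step 3, where the filters on "the" and "of" succeed and the filter on "all" is skipped. Under this reading, an input that matches none of the filters still returns the most frequent prediction in the whole table.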

The application

  1. Type something…
  2. Wait a little for the app to predict a word…
  3. If you want to use the word in your phrase, click the USE WORD button…
    TIP: You can also press Ctrl+X to use the word

The performance

The application:

  • Has a responsive material design, thanks to the shinymaterial package
  • Returns a prediction even for odd input words
  • Has a clean environment
  • Is easy to use
  • Has the option to append the predicted word
  • Uses a Shiny RData database of 31.7 MB

The testing:

  • On my laptop, the app makes 1000 predictions in 201 s: 201 ms per prediction
  • 6% accuracy
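
For reference, figures like these can be produced with a small harness such as the Python sketch below. The source does not say how the test was run, so everything here is an assumption: the function name `evaluate`, the toy predictor `always_the`, the invented test cases, and the definition of accuracy as the share of cases where the predicted word equals the held-out next word.

```python
import time


def evaluate(predict, test_cases):
    """test_cases: list of (input_phrase, expected_next_word) pairs."""
    start = time.perf_counter()
    hits = sum(1 for phrase, expected in test_cases if predict(phrase) == expected)
    elapsed = time.perf_counter() - start  # seconds for all predictions
    return {
        "accuracy": hits / len(test_cases),
        "ms_per_prediction": 1000 * elapsed / len(test_cases),
    }


def always_the(phrase):
    # Toy predictor, invented for this example.
    return "the"


result = evaluate(always_the, [
    ("one of", "the"),
    ("thanks for", "the"),
    ("see you", "soon"),
    ("at", "home"),
])
```

The same conversion gives the timing quoted above: 201 s over 1000 predictions is 1000 × 201 / 1000 = 201 ms per prediction.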

The developer:

Thanks and enjoy typing!

Patrick Machado