Next Word Prediction App

Johan
July 11th, 2018

Summary

This small piece of software takes a sequence of words and predicts what the next word could be. The model has been trained on a sample of blog articles, news and tweets. Its prediction performance is poor, but the result is encouraging given the simplicity of the prediction model.

How it works - ngrams dataset

  1. An algorithm loads a sample (20%) of the structured list of texts from various sources (blogs, news and tweets)
  2. It cleans the text so that only words remain
  3. It tokenizes the text to create ngrams
  4. It counts how many times the ngrams appear in the text and ranks them

head(head5GramsModel)
                   input prediction Freq
1      at the end of the        day  102
2   on the other side of        the   74
3   i just finished a mi        run   60
4 just finished a mi run       with   59
5  thank you so much for        the   59
6   in the middle of the      night   58
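
The four steps above can be sketched in base R. This is a minimal illustration on a made-up three-line corpus, not the code behind the app: the names `clean_text`, `make_ngrams` and `build_model` are hypothetical, and the real pipeline may well rely on a dedicated tokenizer package.

```r
# Toy corpus standing in for the blog, news and tweet samples (assumption).
corpus <- c("At the end of the day, we went home.",
            "At the end of the day I was tired!",
            "In the middle of the night...")

# Step 2: lowercase and keep only letters, apostrophes and spaces.
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z' ]", " ", x)
  trimws(gsub(" +", " ", x))
}

# Step 3: split one cleaned line into words and slide a window of size n.
make_ngrams <- function(line, n) {
  words <- strsplit(line, " ", fixed = TRUE)[[1]]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Step 4: count the ngrams, split each into input / prediction, rank by count.
build_model <- function(corpus, n) {
  grams <- unlist(lapply(clean_text(corpus), make_ngrams, n = n))
  counts <- as.data.frame(table(gram = grams), stringsAsFactors = FALSE)
  words <- strsplit(counts$gram, " ", fixed = TRUE)
  model <- data.frame(
    input = vapply(words, function(w) paste(w[-n], collapse = " "), character(1)),
    prediction = vapply(words, function(w) w[n], character(1)),
    Freq = counts$Freq,
    stringsAsFactors = FALSE
  )
  model <- model[order(-model$Freq), ]
  rownames(model) <- NULL
  model
}

toy5 <- build_model(corpus, 5)
head(toy5, 3)
```

The last word of each ngram becomes the prediction and the preceding words become the input, which matches the layout of the table shown above.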

How it works - the predictive model

  1. Decodes the sequence of words that is entered
  2. Looks in the ngrams dataset for this sequence of words
  3. Returns the ngrams with the highest frequencies and their associated predicted words

word <- "This is a short sequence of"
gramsPred(word,
          model1=grams1model,
          model2=grams2model,
          model3=grams3model,
          model4=grams4model,
          model5=grams5model,
          top=3)
              input prediction Freq    acc
138361  sequence of     events    6 23.077
176548  sequence of        the    5 19.231
5179014 sequence of         10    1  3.846
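
In the output above, a six-word input is matched on its last two words ("sequence of"), which suggests that the lookup falls back from longer to shorter ngrams. The sketch below illustrates that idea under stated assumptions: `predict_next` is a hypothetical stand-in for `gramsPred`, the toy tables contain made-up counts, and `acc` is assumed to be each prediction's share of the total count for the matched input (consistent with 6/26 = 23.077% above).

```r
# Toy ngram tables in the same input / prediction / Freq layout (made-up counts).
toy_models <- list(
  `2` = data.frame(input = c("of", "of", "of"),
                   prediction = c("the", "a", "course"),
                   Freq = c(50, 20, 10), stringsAsFactors = FALSE),
  `3` = data.frame(input = c("sequence of", "sequence of", "sequence of"),
                   prediction = c("events", "the", "numbers"),
                   Freq = c(6, 5, 2), stringsAsFactors = FALSE)
)

# Hypothetical stand-in for gramsPred(): try the longest context first,
# then back off to shorter contexts until a match is found.
predict_next <- function(text, models, top = 3) {
  words <- strsplit(trimws(tolower(text)), " +")[[1]]
  orders <- sort(as.integer(names(models)), decreasing = TRUE)
  for (n in orders) {
    k <- n - 1                      # number of context words for an n-gram
    if (k < 1 || length(words) < k) next
    context <- paste(tail(words, k), collapse = " ")
    model <- models[[as.character(n)]]
    hits <- model[model$input == context, , drop = FALSE]
    if (nrow(hits) > 0) {
      hits <- hits[order(-hits$Freq), , drop = FALSE]
      # Assumed definition of acc: share of this context's total count, in percent.
      hits$acc <- round(100 * hits$Freq / sum(hits$Freq), 3)
      return(head(hits, top))
    }
  }
  NULL  # no context matched in any table
}

predict_next("This is a short sequence of", toy_models, top = 2)
```

A context that is missing from the 3-gram table, such as "piece of", falls through to the 2-gram table and is matched on "of" alone. The sketch returns `NULL` when nothing matches; a 1-gram table could serve as a final fallback instead.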

Predictive performance

      time         correctPred    
 Min.   :0.02898   Mode :logical  
 1st Qu.:0.40270   FALSE:797      
 Median :0.56838   TRUE :174      
 Mean   :1.76139   NA's :12       
 3rd Qu.:3.51138                  
 Max.   :8.01202                  

  • correctPred: whether the predicted word was correct in out-of-sample testing
  • time: elapsed time in seconds, measured by the system clock before and after the call to the prediction function
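
A summary like the one above could be produced by a loop of the following shape. This is a sketch only: the held-out cases are invented, `toy_predict` is a hypothetical stand-in for the prediction function, and treating a missing prediction as `NA` is an assumption about how the NA's in the summary arose.

```r
# Hypothetical held-out cases: a context and the word that actually followed.
test_cases <- data.frame(
  context = c("at the end of the", "in the middle of the", "thank you so much for"),
  actual  = c("day", "night", "your"),
  stringsAsFactors = FALSE
)

# Stand-in predictor (assumption): a fixed lookup, NA when the context is unknown.
lookup <- c("at the end of the" = "day", "thank you so much for" = "the")
toy_predict <- function(context) unname(lookup[context])

evaluate <- function(cases, predict_fun) {
  n <- nrow(cases)
  time <- numeric(n)
  correctPred <- logical(n)
  for (i in seq_len(n)) {
    start <- proc.time()[["elapsed"]]            # clock before the call
    guess <- predict_fun(cases$context[i])
    time[i] <- proc.time()[["elapsed"]] - start  # clock after the call
    correctPred[i] <- guess == cases$actual[i]   # NA when no prediction was made
  }
  data.frame(time = time, correctPred = correctPred)
}

results <- evaluate(test_cases, toy_predict)
summary(results)
mean(results$correctPred, na.rm = TRUE)  # accuracy over cases with a prediction
```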

This relatively low performance (174 correct predictions out of 971 scored cases, about 18%) can be explained by the simplicity of the model, which relies only on ngrams. The ngrams range from 1 to 5 words and are built from a 20% sample of the original text documents. The model also takes too long to run: the mean time is 1.76 secs and the median 0.57 secs, while an acceptable time would be below 0.5 secs. Keep in mind that this is an exploratory exercise; with more resources, better results can be expected.

The Shiny App

The side panel: Adjusting the performance

  • User selects how fast the app should run; accuracy is the trade-off. It is recommended to keep the slider on “2” for the best balance between the two.
  • User has to wait for the dataset to be loaded when the app first boots and when the performance slider is adjusted.

The main panel: Entering the word sequence and getting the prediction

  • User types the sequence of words in the text box
  • The app displays the top 3 results and additional performance info
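
The two panels described above could be laid out roughly as follows. This is a sketch under assumptions, not the app's source: `load_model` is a hypothetical loader returning a made-up table, the slider range of 1 to 3 is a guess, and the lookup uses only the last two words for brevity.

```r
# Hypothetical loader: the real app would read the ngram dataset that matches
# the slider position; here it returns a tiny made-up table.
load_model <- function(level) {
  data.frame(input = "sequence of",
             prediction = c("events", "the", "numbers"),
             Freq = c(6, 5, 2),
             stringsAsFactors = FALSE)
}

# Side panel: performance slider. Main panel: text box and top-3 table.
build_ui <- function() {
  shiny::fluidPage(
    shiny::sidebarLayout(
      shiny::sidebarPanel(
        # The range 1 to 3 is an assumption; only the value "2" is documented.
        shiny::sliderInput("level", "Performance", min = 1, max = 3,
                           value = 2, step = 1)
      ),
      shiny::mainPanel(
        shiny::textInput("text", "Type a sequence of words"),
        shiny::tableOutput("top3")
      )
    )
  )
}

server <- function(input, output, session) {
  # Re-evaluated only when the slider changes, which is when the user waits.
  model <- shiny::reactive(load_model(input$level))
  output$top3 <- shiny::renderTable({
    words <- strsplit(trimws(tolower(input$text)), " +")[[1]]
    context <- paste(tail(words, 2), collapse = " ")
    hits <- model()[model()$input == context, , drop = FALSE]
    head(hits[order(-hits$Freq), , drop = FALSE], 3)
  })
}

# Launch only in an interactive session with the shiny package installed.
if (interactive() && requireNamespace("shiny", quietly = TRUE)) {
  shiny::runApp(shiny::shinyApp(build_ui(), server))
}
```

Wrapping the dataset in a reactive expression means it is reloaded when the slider moves but not on every keystroke in the text box.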