Data Science Capstone: Word Prediction

Christopher Han
March 18, 2019

Introduction

This data product takes a word or a sentence as input and predicts the next word. The model is trained on 70% of the data and uses a stupid backoff model with n-grams of order 1 to 5. The application is deployed at https://chrishan.shinyapps.io/finalwordprediction/

Method

The algorithm uses a stupid backoff model. It first looks for a 5-gram match, provided the input is long enough to supply four words of context. If a match is found, the probability of the candidate word is calculated from the 5-gram match. If not, the model backs off to the 4-gram, then the 3-gram, and so on down to the unigram.
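The backoff procedure can be sketched in Python as follows. This is a minimal illustration, not the application's actual code: the names `build_counts` and `predict` are hypothetical, and the backoff factor of 0.4 is the value commonly used with stupid backoff, assumed here because the slides do not state it.

```python
from collections import Counter

ALPHA = 0.4  # assumed backoff factor; the slides do not state the value used


def build_counts(tokens, max_n=5):
    """Count every n-gram of order 1..max_n in a list of tokens."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def predict(counts, context, max_n=5, top_k=3):
    """Return up to top_k (word, score) pairs using stupid backoff."""
    # Keep at most max_n - 1 words of context (4 words for a 5-gram match).
    context = tuple(context)[-(max_n - 1):] if max_n > 1 else ()
    total_unigrams = sum(c for g, c in counts.items() if len(g) == 1)
    scores = {}
    penalty = 1.0
    while True:
        denom = counts.get(context, 0) if context else total_unigrams
        if denom:
            for gram, c in counts.items():
                if len(gram) == len(context) + 1 and gram[:-1] == context:
                    # A word keeps the score from its longest matching order.
                    scores.setdefault(gram[-1], penalty * c / denom)
        if not context:
            break
        context = context[1:]   # back off to the next shorter context
        penalty *= ALPHA        # discount scores from lower orders
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


counts = build_counts("the cat sat on the mat the cat ate".split())
print(predict(counts, ["on", "the"]))
```

The linear scan over all counts keeps the sketch short; a real application would index n-grams by their context for fast lookup. Note that stupid backoff produces relative scores rather than normalized probabilities.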

Stopwords

Stopwords are very common words in a language, such as 'I', 'a', and 'you'. Removing them can either improve or worsen the prediction, depending on the complexity of the sentence.
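As a rough illustration, stopword removal before prediction might look like the Python sketch below. The stopword set and the function name `remove_stopwords` are assumptions made for this example; the application's actual list and tokenization are not given in the slides.

```python
# Illustrative stopword set; the application's actual list is not shown here.
STOPWORDS = {"i", "a", "you", "the", "to", "of", "and"}


def remove_stopwords(text, stopwords=STOPWORDS):
    """Lowercase the text, split on whitespace, and drop stopwords."""
    return [token for token in text.lower().split() if token not in stopwords]


print(remove_stopwords("I want a cup of tea"))
```

With stopwords removed, the context used for n-gram matching consists only of content words, which can help on some sentences and hurt on others.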

For more detailed documentation, check the documentation tab on the application.

The Application Interface

The Shiny application consists of the following elements:

  • input textbox
  • checkbox to indicate whether to remove stopwords
  • predict button
  • data table that shows all the ngram matches and their probabilities

Performance

Using the provided benchmark, we measured how the model performs on a test set.

Result                    3-gram      4-gram      5-gram
Overall top-3 score       17.18%      17.57%      17.56%
Overall top-1 precision   12.77%      13.41%      13.45%
Overall top-3 precision   20.92%      21.09%      21.02%
Average runtime           18.20 msec  20.08 msec  23.84 msec
Total memory used         105.32 MB   106.51 MB   106.88 MB

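The 5-gram model provides the best overall top-1 precision, predicting the next word correctly on the first try 13.45% of the time. The 4-gram model is marginally ahead on the top-3 measures and runs slightly faster, but the differences are small. The final deployed application uses the 5-gram model on the basis of the top-1 result.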
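To make the precision figures concrete, the Python sketch below shows one way to compute top-k precision. It assumes that top-k precision means the fraction of test cases in which the true next word appears among the first k predictions; the benchmark's exact scoring rules may differ, and the function name `top_k_precision` and the sample data are hypothetical.

```python
def top_k_precision(predictions, targets, k):
    """Fraction of cases where the target is among the first k predictions."""
    if not targets:
        return 0.0
    hits = sum(
        1 for preds, target in zip(predictions, targets) if target in preds[:k]
    )
    return hits / len(targets)


# Made-up example data: ranked predictions and the true next word per case.
predictions = [["the", "a", "to"], ["cat", "dog", "mat"], ["on", "in", "at"]]
targets = ["the", "mat", "by"]

print(top_k_precision(predictions, targets, k=1))
print(top_k_precision(predictions, targets, k=3))
```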