Data Science Capstone: Word Prediction

Christopher Han
March 18, 2019

Introduction

This data product takes a word or a sentence as input and predicts the next word. The model is trained on 70% of the data and uses a stupid backoff model over n-grams of order 1 through 5. The application is deployed at https://chrishan.shinyapps.io/finalwordprediction/

Method

The algorithm uses a stupid backoff model. It first looks for a 5-gram match, using the last four words of the input as context, provided the input is long enough. If a match is found, the probability of each candidate word is calculated from the 5-gram counts. If there is no match, the model backs off to 4-grams, then to 3-grams, and so on down to unigrams.
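The back-off procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the deployed application's code: the function names `build_counts` and `predict`, the whitespace tokenizer, and the discount factor of 0.4 (the value commonly used with stupid backoff) are all assumptions made for the example.

```python
from collections import Counter

ALPHA = 0.4      # assumed back-off discount; the report does not state the value used
MAX_ORDER = 5    # highest n-gram order, as in the report


def build_counts(sentences, max_order=MAX_ORDER):
    """Count every n-gram of order 1..max_order in a list of tokenized sentences."""
    counts = Counter()
    for tokens in sentences:
        for n in range(1, max_order + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


def predict(counts, text, top_k=3, max_order=MAX_ORDER, alpha=ALPHA):
    """Return up to top_k (word, score) pairs from the highest-order n-gram that matches."""
    tokens = text.lower().split()  # hypothetical tokenizer: lowercase, split on whitespace
    total_unigrams = sum(c for gram, c in counts.items() if len(gram) == 1)

    # Start with the longest n-gram the input can support, then back off.
    start = min(max_order, len(tokens) + 1)
    for order in range(start, 0, -1):
        # The context is the last (order - 1) words; it is empty for unigrams.
        context = tuple(tokens[len(tokens) - (order - 1):])
        context_count = counts[context] if context else total_unigrams
        matches = {
            gram[-1]: c
            for gram, c in counts.items()
            if len(gram) == order and gram[:-1] == context
        }
        if matches:
            # Each back-off step multiplies the relative frequency by alpha.
            penalty = alpha ** (start - order)
            scored = [(w, penalty * c / context_count) for w, c in matches.items()]
            scored.sort(key=lambda pair: (-pair[1], pair[0]))
            return scored[:top_k]
    return []
```

With counts built from a tokenized training corpus, `predict(counts, "i love new york")` would try the 5-gram first and fall back to shorter n-grams only when no match exists. Scanning every stored n-gram on each call keeps the sketch short; a real application would index n-grams by context for speed.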

The Application Interface

The Shiny application consists of the following elements:

  • input textbox
  • checkbox to indicate whether to remove stopwords
  • predict button
  • data table that shows all the n-gram matches and their probabilities

Performance

Using the provided benchmark, we measured how the model performs on a test set.

Result                     3-gram       4-gram       5-gram
Overall top-3 score        17.18%       17.57%       17.56%
Overall top-1 precision    12.77%       13.41%       13.45%
Overall top-3 precision    20.92%       21.09%       21.02%
Average runtime            18.20 msec   20.08 msec   23.84 msec
Total memory used          105.32 MB    106.51 MB    106.88 MB

The 5-gram model achieves the best overall top-1 precision, predicting the next word correctly on the first try 13.45% of the time. The final deployed application uses the 5-gram model on the basis of this result.
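For readers unfamiliar with the metric, top-k precision can be computed roughly as follows. This is a generic sketch, not the benchmark's own code: the benchmark's exact scoring rules are not described here, and the names `top_k_precision`, `predict_fn`, and `test_pairs` are hypothetical.

```python
def top_k_precision(predict_fn, test_pairs, k):
    """Fraction of (context, next_word) pairs whose true next word
    appears among the first k predictions returned by predict_fn."""
    if not test_pairs:
        return 0.0
    hits = sum(
        1
        for context, target in test_pairs
        if target in predict_fn(context)[:k]
    )
    return hits / len(test_pairs)


# Toy predictor (an assumption for illustration): always ranks the same three words.
def toy_predictor(context):
    return ["the", "a", "of"]


# Tiny made-up test set of (context, true next word) pairs.
pairs = [("in", "the"), ("one", "of"), ("x", "zebra"), ("y", "a")]
top1 = top_k_precision(toy_predictor, pairs, 1)
top3 = top_k_precision(toy_predictor, pairs, 3)
```

Under this definition, top-1 precision counts only first-try hits, while top-3 precision also credits a correct word ranked second or third, which is why the top-3 figures in the table are higher.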