Christopher Han
March 18, 2019
This data product takes a word or a sentence as input and predicts the next word. The model is trained on 70% of the data and uses stupid backoff over n-grams of order 1 through 5. The application is deployed at https://chrishan.shinyapps.io/finalwordprediction/
Method
The algorithm uses a stupid backoff model. It first looks for a 5-gram match, provided the input contains at least four words. If a match is found, the candidate word is scored from the 5-gram counts. If not, the model backs off to the 4-gram, then the 3-gram, and so on down to the unigram.
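The backoff procedure can be sketched as follows. This is a minimal Python illustration, not the application's actual code: the function names `build_counts` and `predict_next` are hypothetical, and the backoff factor of 0.4 is the value commonly used with stupid backoff, assumed here because the slides do not state one.

```python
from collections import Counter

ALPHA = 0.4  # assumed backoff factor; the slides do not state the value used


def build_counts(tokens, max_n=5):
    """Count every n-gram of order 1..max_n in a list of tokens."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def predict_next(context, counts, max_n=5, top_k=3):
    """Rank candidate next words by their stupid backoff score."""
    total_unigrams = sum(c for g, c in counts.items() if len(g) == 1)
    scores = {}
    penalty = 1.0
    # Start with the longest usable history (max_n - 1 words), then shorten it.
    for hist_len in range(min(max_n - 1, len(context)), -1, -1):
        history = tuple(context[len(context) - hist_len:])
        denom = counts[history] if hist_len > 0 else total_unigrams
        if denom > 0:
            for gram, c in counts.items():
                if len(gram) == hist_len + 1 and gram[:hist_len] == history:
                    # Keep the score from the highest order that saw this word.
                    scores.setdefault(gram[-1], penalty * c / denom)
        # Each backoff step multiplies the score by ALPHA.
        penalty *= ALPHA
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]


corpus = "the cat sat on the mat the cat sat on the sofa the cat ran".split()
counts = build_counts(corpus)
print(predict_next(["the", "cat"], counts))
```

The scores are relative frequencies discounted by the backoff factor, not normalized probabilities, which is what keeps the method cheap. The linear scan over all counts is for readability only; a real implementation would index n-grams by their history.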
Stopwords
Stopwords are very common words in a language, such as 'I', 'a', and 'you'. Removing them can either improve or worsen the prediction, depending on the complexity of the sentence.
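To illustrate how stopword removal changes the context the model sees, here is a small Python sketch. The stopword set is a tiny hypothetical list chosen for the example, and `remove_stopwords` is an illustrative helper, not the application's own filter.

```python
# Tiny illustrative stopword list; the application's actual list is not shown in the slides.
STOPWORDS = {"i", "a", "you", "the", "to", "of"}


def remove_stopwords(sentence, stopwords=STOPWORDS):
    """Lowercase the sentence, split on whitespace, and drop stopwords."""
    return [w for w in sentence.lower().split() if w not in stopwords]


print(remove_stopwords("I want to see you at the park"))
```

With stopwords removed, the n-gram history becomes "want see at park" rather than the full sentence, so the model matches on content words but loses the function words that often determine the next word.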
For more detailed documentation, check the documentation tab on the application.
The shiny application consists of the following elements:
Using the provided Benchmark, we measured how the model performs on a test set.
| Result | 3-gram | 4-gram | 5-gram |
|---|---|---|---|
| Overall top-3 score | 17.18% | 17.57% | 17.56% |
| Overall top-1 precision | 12.77% | 13.41% | 13.45% |
| Overall top-3 precision | 20.92% | 21.09% | 21.02% |
| Average runtime | 18.20 msec | 20.08 msec | 23.84 msec |
| Total memory used | 105.32 MB | 106.51 MB | 106.88 MB |
The 5-gram model gives the best overall top-1 precision, predicting the next word correctly on the first try 13.45% of the time, although the 4-gram model is marginally ahead on both top-3 metrics. The final deployed application uses the 5-gram model on the basis of the top-1 result.
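As a reading aid for the precision rows, the sketch below computes top-k precision in Python. It assumes the metric is the share of test cases whose true next word appears among the model's first k predictions; the benchmark's exact definition is not given in the slides, and `top_k_precision` is a hypothetical helper.

```python
def top_k_precision(predictions, targets, k):
    """Share of test cases whose true next word is among the first k guesses."""
    if not targets:
        return 0.0
    hits = sum(1 for preds, target in zip(predictions, targets) if target in preds[:k])
    return hits / len(targets)


# Made-up ranked predictions and true next words, for illustration only.
predictions = [["the", "a", "to"], ["cat", "dog", "sat"], ["on", "in", "at"], ["x", "y", "z"]]
targets = ["the", "sat", "up", "y"]
print(top_k_precision(predictions, targets, 1))
print(top_k_precision(predictions, targets, 3))
```

Under this definition, top-3 precision can never be lower than top-1 precision, which matches the pattern in the table.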