Capstone Project
Axel Schwanke
2016-01-24
The application predicts the most likely next word of an incomplete sentence.
It also helps the user complete the word currently being typed.
The prediction algorithm is based on an n-gram language model
with interpolation and discounting techniques.
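The slides do not say which interpolation or discounting scheme is used. As a minimal sketch, assuming interpolated absolute discounting on a bigram model (the application itself goes up to 4-grams), the idea can be written in base R. The toy corpus, the discount `d = 0.75`, and the function name `predict_next` are all hypothetical and chosen only for illustration.

```r
# Toy corpus (hypothetical); the real application uses n-grams up to n = 4.
tokens <- c("i", "like", "tea", "i", "like", "coffee", "i", "drink", "tea")
vocab <- sort(unique(tokens))

# Unigram counts as a named numeric vector.
unigram_counts <- vapply(vocab, function(w) as.numeric(sum(tokens == w)), numeric(1))

# Bigram counts as a matrix: rows are the history word, columns the next word.
bigram_counts <- matrix(0, nrow = length(vocab), ncol = length(vocab),
                        dimnames = list(vocab, vocab))
for (i in seq_len(length(tokens) - 1)) {
  h <- tokens[i]
  w <- tokens[i + 1]
  bigram_counts[h, w] <- bigram_counts[h, w] + 1
}

# Interpolated absolute discounting (assumed scheme, not confirmed by the slides):
#   P(w | h) = max(c(h, w) - d, 0) / c(h) + lambda(h) * P_unigram(w)
#   lambda(h) = d * (number of distinct words seen after h) / c(h)
predict_next <- function(history, d = 0.75) {
  p_unigram <- unigram_counts / sum(unigram_counts)
  seen <- history %in% vocab && sum(bigram_counts[history, ]) > 0
  if (!seen) {
    # Back off entirely to the unigram distribution for unseen histories.
    return(sort(p_unigram, decreasing = TRUE))
  }
  row <- bigram_counts[history, ]
  total <- sum(row)
  lambda <- d * sum(row > 0) / total
  p <- pmax(row - d, 0) / total + lambda * p_unigram[vocab]
  sort(p, decreasing = TRUE)
}

# The first name of the result is the most likely next word.
names(predict_next("i"))[1]
```

The discounted mass taken from seen bigrams is redistributed through the unigram term, so the probabilities for a seen history still sum to one and unseen continuations receive a non-zero score.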
The application was developed in R using the R shiny package.
N-grams are the basis of the word prediction application (Wikipedia).
The n-grams (n=1..4) were created from text corpora (news, blogs, Twitter) containing 5 million lines of text (HC Corpora).
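How the n-grams were extracted is not described in the slides. One possible approach, sketched in base R under the assumption of simple lowercase tokenization on non-letter characters, is shown here. The helper `make_ngrams` and the two sample lines are hypothetical.

```r
# Split one line of text into lowercase words and return all of its n-grams.
# The tokenization rule (letters and apostrophes only) is an assumption.
make_ngrams <- function(text, n) {
  words <- strsplit(tolower(text), "[^a-z']+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) {
    return(character(0))
  }
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Hypothetical sample lines standing in for the news, blog, and Twitter text.
sample_lines <- c("The quick brown fox", "the quick red fox")

# Collect trigrams over all lines and tabulate their frequencies.
trigrams <- unlist(lapply(sample_lines, make_ngrams, n = 3))
trigram_counts <- sort(table(trigrams), decreasing = TRUE)
```

Building the n-grams line by line keeps them from spanning line boundaries, which is one reasonable choice for short texts such as tweets.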
N-gram tables for the Shiny application:
| | # n-grams | # n-grams for 50% coverage | # n-grams for 90% coverage |
|---|---|---|---|
| 1-gram | 75,002 | 100 | 4,141 |
| 2-gram | 2,325,469 | 16,514 | 655,272 |
| 3-gram | 3,802,948 | 175,497 | 2,275,668 |
| 4-gram | 5,812,920 | 731,747 | 4,518,792 |
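The slides do not define coverage. A common reading, assumed here, is the share of all n-gram occurrences in the corpus that is accounted for by the most frequent n-grams. Under that assumption, the coverage columns of the table could be computed along these lines. The function name `ngrams_for_coverage` and the toy counts are hypothetical.

```r
# Number of distinct n-grams needed, taken in order of decreasing frequency,
# to account for at least the requested share of all n-gram occurrences.
ngrams_for_coverage <- function(counts, coverage) {
  sorted <- sort(counts, decreasing = TRUE)
  reached <- cumsum(sorted) >= coverage * sum(sorted)
  unname(which(reached)[1])
}

# Hypothetical frequency table for five unigrams.
toy_counts <- c(the = 50, of = 20, and = 15, fox = 10, quux = 5)

needed_50 <- ngrams_for_coverage(toy_counts, 0.5)
needed_90 <- ngrams_for_coverage(toy_counts, 0.9)
```

A steep frequency distribution explains why, in the table, 100 unigrams are enough for 50% coverage while 4-grams need a large fraction of all entries.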
Some Facts:
Benchmark results: