The goal of this exercise was to build a predictive model of English text: for example, if somebody types two words, the algorithm predicts the next word.
The prediction is based on:
- a text dataset of nearly 5 million lines of English text.
- an n-gram model (see Wikipedia) with roughly 645,000 unigrams, 10 million bigrams, and 30 million trigrams.
- the Stupid Backoff scoring algorithm (Brants et al., 2007), sketched below.
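
For orientation, here is a minimal R sketch of Stupid Backoff scoring over toy count tables. The table names (`tri`, `bi`, `uni`), the toy counts, and the keying by space-separated n-grams are illustrative assumptions, not the app's actual data structures:

```r
# Toy count tables keyed by space-separated n-grams (illustrative only).
tri <- c("a cup of" = 30, "a cup full" = 2)
bi  <- c("a cup" = 40, "cup of" = 35)
uni <- c("of" = 500, "tea" = 60, "full" = 20)
N   <- sum(uni)   # total unigram count
lambda <- 0.4     # fixed backoff penalty from Brants et al. (2007)

# Score candidate word w given the two preceding words w1 w2.
stupid_backoff <- function(w1, w2, w) {
  tg  <- paste(w1, w2, w)
  ctx <- paste(w1, w2)
  if (!is.na(tri[tg]) && !is.na(bi[ctx]))
    return(unname(tri[tg] / bi[ctx]))           # trigram relative frequency
  bg <- paste(w2, w)
  if (!is.na(bi[bg]) && !is.na(uni[w2]))
    return(unname(lambda * bi[bg] / uni[w2]))   # back off to bigram
  if (!is.na(uni[w]))
    return(unname(lambda^2 * uni[w] / N))       # back off to unigram
  0                                             # word unseen at every order
}

stupid_backoff("a", "cup", "of")   # trigram hit: 30/40 = 0.75
stupid_backoff("a", "cup", "tea")  # falls back to the unigram score
```

The fixed penalty of 0.4 is the value recommended in the paper. Because the scores are not normalized probabilities, Stupid Backoff can only rank candidates, which is all a next-word predictor needs.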
The Shiny app returns a plot of the five candidate next words with the highest overall scores, indicating whether each prediction is drawn from the trigrams, bigrams, or unigrams.
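
A compact Shiny sketch of that display layer is shown below. The hard-coded `top5` data frame stands in for the scorer's output; the real app would rank candidates for the typed phrase (e.g. with `stupid_backoff()` from the sketch above), and all names here are illustrative:

```r
library(shiny)
library(ggplot2)

ui <- fluidPage(
  textInput("phrase", "Type a phrase:"),
  plotOutput("candidates")
)

server <- function(input, output) {
  output$candidates <- renderPlot({
    req(input$phrase)
    # Hard-coded stand-in for the scorer's output on input$phrase.
    top5 <- data.frame(
      word   = c("of", "tea", "coffee", "with", "on"),
      score  = c(0.75, 0.017, 0.012, 0.008, 0.005),
      source = c("trigram", "unigram", "unigram", "bigram", "bigram")
    )
    # Horizontal bar chart of the five best candidates, colored by the
    # n-gram order the prediction was drawn from.
    ggplot(top5, aes(x = reorder(word, score), y = score, fill = source)) +
      geom_col() +
      coord_flip() +
      labs(x = "candidate next word", y = "score", fill = "drawn from")
  })
}

shinyApp(ui, server)
```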