Predicting the next word using n-grams

Matias Thayer
jul/05/2016

Final Project. Johns Hopkins University.
Data Science Specialization.

The predictions are given by the Kneser-Ney smoothing probability distribution.
The app is public, and you can try it on Shiny:
- https://chechir.shinyapps.io/Wordict/
You only need to enter some text; optionally, you can set a few options:
- Number of words to return
- Whether to order the results by their Kneser-Ney probability
The app takes around 12 seconds to load.
A preliminary exploratory data analysis can be found here

Kneser-Ney smoothing accounts for unseen n-grams and includes clever ideas such as the continuation probability (P continuation).
It also has an elegant intuition and good performance.
I based my implementation on the work of Daniel Jurafsky & James H. Martin: here
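- Formula for bigrams: \[ P_{KN}(w_i|w_{i-1}) = \frac{\max(c(w_{i-1}, w_i) - d, 0)}{\sum_{w'}{c(w_{i-1}, w')}} + \lambda_{w_{i-1}}P_{cont}(w_i) \] where d is the discount, c(w_{i-1}) is the total count of bigrams starting with w_{i-1}, and lambda is \[ \lambda_{w_{i-1}}=\frac{d}{c(w_{i-1})}\,\left|\{w' : c(w_{i-1}, w') > 0\}\right| \]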
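
To make the bigram formula concrete, here is a minimal Python sketch. It is not the author's R implementation; the names `train_kn_bigram` and `predict_next`, the toy sentence and the discount value `d=0.75` are assumptions chosen for illustration. The weight lambda follows Jurafsky & Martin: the discount times the number of distinct words seen after the context, divided by the context count. P continuation is the share of distinct bigram types that end in the word.

```python
from collections import Counter, defaultdict

def train_kn_bigram(tokens, d=0.75):
    """Return prob(word, prev): interpolated Kneser-Ney bigram probability."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    context_count = Counter()      # c(w_{i-1}): total bigrams starting with the context
    followers = defaultdict(set)   # distinct words seen after each context
    preceders = defaultdict(set)   # distinct contexts seen before each word
    for (prev, word), n in bigrams.items():
        context_count[prev] += n
        followers[prev].add(word)
        preceders[word].add(prev)
    n_bigram_types = len(bigrams)

    def p_cont(word):
        # Continuation probability: share of bigram types that end in `word`.
        return len(preceders.get(word, ())) / n_bigram_types

    def prob(word, prev):
        total = context_count[prev]
        if total == 0:
            # Unseen context (an assumption of this sketch): use P_cont alone.
            return p_cont(word)
        discounted = max(bigrams[(prev, word)] - d, 0) / total
        lam = d * len(followers[prev]) / total
        return discounted + lam * p_cont(word)

    return prob

def predict_next(prob, vocab, prev, k=5):
    """Return the k most probable next words after `prev`."""
    return sorted(vocab, key=lambda w: prob(w, prev), reverse=True)[:k]

# Toy usage on a made-up sentence.
tokens = "the cat sat on the mat the cat ate".split()
prob = train_kn_bigram(tokens)
suggestions = predict_next(prob, set(tokens), "the", k=3)
```

Words never seen after "the" still receive a non-zero probability through the lambda term, which is how the method accounts for unseen n-grams.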

15% accuracy for the first predicted word
30% accuracy when counting a hit among the first 5 predicted words
Measured on test texts with some of their words removed
- Randomly selected from tweets, blogs and news
- Excluded from the train set
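
The two figures above are top-1 and top-5 accuracy. A minimal Python sketch of that measurement follows; the function `top_k_accuracy`, the stand-in predictor `toy_predict` and the example pairs are hypothetical and are not the evaluation data used for the app.

```python
def top_k_accuracy(predict, examples, k=1):
    """Fraction of (context, target) pairs whose target is among the first k predictions."""
    if not examples:
        return 0.0
    hits = sum(1 for context, target in examples if target in predict(context)[:k])
    return hits / len(examples)

def toy_predict(context):
    # Hypothetical stand-in for the real model: ranked guesses keyed on the last word.
    table = {"good": ["morning", "night", "luck"], "thank": ["you", "goodness"]}
    words = context.split()
    return table.get(words[-1], []) if words else []

# Each pair is (text before the removed word, removed word).
examples = [("good", "night"), ("thank", "you"), ("see", "you")]
top1 = top_k_accuracy(toy_predict, examples, k=1)
top3 = top_k_accuracy(toy_predict, examples, k=3)
```

Top-k accuracy can only grow as k grows, which is consistent with the top-5 figure being higher than the top-1 figure.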

The app was trained using the HC corpora
- It uses tweets, blog entries and news (English only)
Some data was dropped in order to make the app more responsive.
- I had to sacrifice some accuracy
The n-grams were calculated using quanteda
The counts and probabilities were calculated using the dplyr library
The code for the Shiny app can be found here: https://github.com/chechir/WordPredictor
The code for calculating the n-grams and Kneser-Ney is not provided, but if you need it leave a comment!
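
As a rough illustration of the counting step, here is a minimal Python sketch. It is not the author's quanteda/dplyr code. The helper names `count_ngrams` and `prune` are hypothetical, and dropping n-grams below a minimum count is only one possible way to shrink the tables; the slides do not say how the data was actually reduced.

```python
from collections import Counter

def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def prune(counts, min_count=2):
    """Keep only n-grams seen at least min_count times (an assumed pruning rule)."""
    return Counter({gram: c for gram, c in counts.items() if c >= min_count})

# Toy usage on a made-up token sequence.
tokens = "a b a b c".split()
bigram_counts = count_ngrams(tokens, 2)
small_table = prune(bigram_counts, min_count=2)
```

Pruning rare n-grams trades some accuracy for a smaller lookup table, the same kind of trade-off described above.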

Improve accuracy by adding semantics to the algorithm
Implement it in a mobile environment
Narrow the training corpus to a particular context (for example, train the model using only your business data)