Predicting the next word using n-grams
Matias Thayer
jul/05/2016
Final Project. Johns Hopkins University.
Data Science Specialization.
Predicting the next word
Predictions are ranked by their Kneser-Ney smoothed probabilities.
The app is public, and you can try it on Shiny here:
https://chechir.shinyapps.io/Wordict/
Just enter some text. Optionally, you can set options such as:
Number of words to return
Whether the results are ordered by their Kneser-Ney probability
The app takes around 12 seconds to load
A preliminary exploratory data analysis can be found
Here
Why the Kneser-Ney smoothing algorithm?
It accounts for unseen n-grams and includes clever ideas such as the continuation probability (P continuation).
It also has an elegant intuition and good performance.
I based my implementation on the work of Daniel Jurafsky & James H. Martin:
Here
Formula for bigrams:
\[ P_{KN}(w_i|w_{i-1}) = \frac{\max(c(w_{i-1}, w_i) - d, 0)}{\sum_{w'}{c(w_{i-1}, w')}} + \lambda_{w_{i-1}}P_{cont}(w_i) \] where \(d\) is the discount and lambda is \[ \lambda_{w_{i-1}}=\frac{d}{\sum_{w'}{c(w_{i-1}, w')}} \, |\{w' : c(w_{i-1}, w') > 0\}| \]
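A minimal sketch of the bigram formula, written in Python for illustration only; it is not the app's R code. It assumes a discount of d = 0.75 (a common default, not a value stated here), scales lambda by the number of distinct words seen after the context as in Jurafsky & Martin, and takes P continuation of a word to be the number of distinct words that precede it divided by the number of distinct bigrams. The names `train_bigram_kn` and `predict_next` and the toy sentence are made up for this example.

```python
from collections import Counter, defaultdict


def train_bigram_kn(tokens, d=0.75):
    """Return prob(word, prev): the Kneser-Ney bigram probability P(word | prev)."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    context_totals = Counter()      # sum over w' of c(prev, w')
    followers = defaultdict(set)    # distinct words seen after each context
    preceders = defaultdict(set)    # distinct contexts seen before each word
    for (prev, word), count in bigrams.items():
        context_totals[prev] += count
        followers[prev].add(word)
        preceders[word].add(prev)
    n_bigram_types = len(bigrams)

    def prob(word, prev):
        # P continuation: share of distinct bigram types that end in `word`
        p_cont = len(preceders.get(word, ())) / n_bigram_types
        total = context_totals.get(prev, 0)
        if total == 0:
            # Unseen context: fall back to the continuation probability alone
            return p_cont
        discounted = max(bigrams.get((prev, word), 0) - d, 0) / total
        lam = d * len(followers[prev]) / total
        return discounted + lam * p_cont

    return prob


def predict_next(prob, vocab, prev, k=3):
    """Return the k most probable next words after `prev`."""
    return sorted(vocab, key=lambda w: prob(w, prev), reverse=True)[:k]


# Toy corpus, purely illustrative
tokens = "the cat sat on the mat the cat ate the fish".split()
prob = train_bigram_kn(tokens)
vocab = sorted(set(tokens))
top_words = predict_next(prob, vocab, "the", k=3)
```

The discounted mass removed from the seen bigrams is exactly what lambda redistributes, so the probabilities for a seen context sum to one over the vocabulary.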
Performance of the model
15% accuracy for the first predicted word
30% accuracy when the correct word is among the first 5 predictions
Measured against text messages with some of their words removed
Randomly selected from tweets, blogs and news
Excluded from the training set
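The two accuracy figures above are top-1 and top-5 accuracy. A rough Python sketch of how such a measurement could be computed is shown below; the predictor, the lookup table and the held-out pairs are toy placeholders, not the app's model or the evaluation data.

```python
def top_k_accuracy(predict, examples, k):
    """Fraction of (context, target) pairs whose target is among the top-k predictions."""
    if not examples:
        return 0.0
    hits = sum(1 for context, target in examples if target in predict(context, k))
    return hits / len(examples)


# Hypothetical predictor backed by a fixed lookup table (stand-in for a real model)
toy_table = {"the": ["cat", "dog", "end"], "a": ["dog", "cat", "bird"]}


def toy_predict(context, k):
    return toy_table.get(context, [])[:k]


# Hypothetical held-out pairs: (previous word, word that was removed)
held_out = [("the", "cat"), ("the", "dog"), ("a", "bird"), ("an", "apple")]

top1 = top_k_accuracy(toy_predict, held_out, 1)
top3 = top_k_accuracy(toy_predict, held_out, 3)
```

Keeping the held-out pairs out of the training data, as described above, is what makes the figure an honest estimate.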
Implementation
The app was trained using the
HC corpora
It uses tweets, blog entries and news (English only)
Some data was dropped to make the app more responsive.
I had to sacrifice some accuracy
The n-grams were calculated using quanteda
The counts and probabilities were calculated using the dplyr library
The code for the Shiny app can be found here:
https://github.com/chechir/WordPredictor
The code for calculating the n-grams and Kneser-Ney probabilities is not provided, but if you need it, leave a comment!
Possible next steps
Improve accuracy by adding semantic information to the algorithm
Implement in a mobile environment
Narrow the training corpus to a particular context (for example, train only on your business data)
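Since the n-gram code is not included, here is a small Python sketch of the counting step only. It is an illustration under simple assumptions (lowercasing, a regex tokenizer, no sentence-boundary handling) and does not reproduce the quanteda and dplyr pipeline used for the app; `tokenize` and `ngram_counts` are names invented for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes as tokens."""
    return re.findall(r"[a-z']+", text.lower())


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens (n-grams may span sentence breaks)."""
    return Counter(zip(*(tokens[i:] for i in range(n))))


sample_tokens = tokenize("I like tea. I like coffee!")
bigram_counts = ngram_counts(sample_tokens, 2)
trigram_counts = ngram_counts(sample_tokens, 3)
```

Counts like these are the c(...) terms that the Kneser-Ney formula consumes.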