Predicting the next word using n-grams

Matias Thayer
jul/05/2016

Final Project. Johns Hopkins University.
Data Science Specialization.

Predicting the next word

  • The predictions are given by the Kneser-Ney smoothing probability distribution.
  • The app is public; you can try it in this Shiny app:
  • You only need to enter some text; optionally, you can set options such as:
    • Number of words to return
    • Whether to order the results by Kneser-Ney probability
  • The app takes around 12 seconds to load
  • A preliminary exploratory data analysis can be found Here

Why the Kneser-Ney smoothing algorithm?

  • It accounts for unseen n-grams and includes clever ideas such as the continuation probability (P continuation).
  • It also has an elegant intuition and performs well
  • I based my implementation on the work of Daniel Jurafsky & James H. Martin: Here
    • Formula for bigrams: \[ P_{KN}(w_i|w_{i-1}) = \frac{\max(c(w_{i-1}, w_i) - d, 0)}{\sum_{w'}{c(w_{i-1}, w')}} + \lambda_{w_{i-1}}P_{cont}(w_i) \] where d is the discount and lambda is the normalizing weight \[ \lambda_{w_{i-1}}=\frac{d}{\sum_{w'}{c(w_{i-1}, w')}} \left|\{w' : c(w_{i-1}, w') > 0\}\right| \]
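
The bigram formula can be sketched in a few lines. This is not the app's code (which was written in R and is not published); it is a minimal Python illustration that assumes a fixed discount d = 0.75 and takes P_cont(w) to be the number of distinct words preceding w divided by the number of distinct bigram types, following Jurafsky & Martin. The function name `kneser_ney_bigram` is hypothetical.

```python
from collections import Counter

def kneser_ney_bigram(tokens, d=0.75):
    """Build an interpolated Kneser-Ney bigram model from a token list.

    Returns a function prob(word, prev) -> P_KN(word | prev).
    The discount d=0.75 is an assumed default, not a value from the app.
    """
    bigrams = Counter(zip(tokens, tokens[1:]))
    context_count = Counter()  # sum over w' of c(prev, w')
    followers = Counter()      # number of distinct words seen after prev
    preceders = Counter()      # number of distinct words seen before word
    for (prev, word), count in bigrams.items():
        context_count[prev] += count
        followers[prev] += 1
        preceders[word] += 1
    n_bigram_types = len(bigrams)

    def prob(word, prev):
        # Continuation probability: how many different contexts word completes
        p_cont = preceders[word] / n_bigram_types
        if context_count[prev] == 0:
            # Unseen context: fall back to the continuation probability alone
            return p_cont
        discounted = max(bigrams[(prev, word)] - d, 0) / context_count[prev]
        lam = d * followers[prev] / context_count[prev]
        return discounted + lam * p_cont

    return prob
```

With this definition of lambda, the probabilities over the vocabulary sum to 1 for any context that was seen in training, which is a quick sanity check on an implementation.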

Performance of the model

  • 15% accuracy for the first predicted word
  • 30% accuracy when the correct word is among the first 5 predicted words
  • Measured against text excerpts with some of their words removed
    • Randomly selected from tweets, blogs and news
    • Excluded from the train set
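
A top-k accuracy of this kind could be computed roughly as follows. This is a hedged sketch, not the evaluation script used for the figures above: `top_k_accuracy` and the `predict` callable (which is assumed to take a context string and return a ranked list of candidate words) are hypothetical names.

```python
def top_k_accuracy(predict, examples, k):
    """Fraction of (context, target) pairs whose target is in the first k predictions.

    predict:  assumed callable, context -> list of words ranked best first
    examples: list of (context, target) pairs held out from training
    """
    if not examples:
        return 0.0
    hits = sum(1 for context, target in examples if target in predict(context)[:k])
    return hits / len(examples)
```

Calling it with k = 1 and k = 5 on the same held-out pairs gives the two kinds of figures reported on this slide.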

Implementation

  • The app was trained using the HC Corpora
    • It uses tweets, blog entries and news (English only)
  • Some data was dropped in order to make the app more responsive.
    • I had to sacrifice some accuracy
  • The n-grams were calculated using quanteda
  • The counts and probabilities were calculated using the dplyr library
  • The code for the Shiny app can be found here: https://github.com/chechir/WordPredictor
  • The code for calculating the n-grams and Kneser-Ney is not provided, but if you need it, leave a comment!
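
For readers unfamiliar with n-gram counting, the idea can be illustrated as below. The app itself used quanteda and dplyr in R; this Python sketch is only an illustration of the concept, and the `count_ngrams` helper is hypothetical.

```python
from collections import Counter

def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
```

For example, `count_ngrams("to be or not to be".split(), 2)` counts the bigram `("to", "be")` twice and every other bigram once.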

Possible next steps

  • Improve accuracy by adding semantic information to the algorithm
  • Implement in a mobile environment
  • Narrow the training corpus to a particular context (for example, train the model using only your business data)