Predicting the Next Word in a Sentence

Astrid Deschenes
January 12th, 2019

Introduction

The goal of this project is to implement an algorithm to predict the next word in a sentence, given one or more words as input.

  • This is the capstone project of the Coursera Data Science Specialization.
  • This project is related to the natural language processing field.
  • The training dataset is provided by SwiftKey.
  • The algorithm should achieve good predictive accuracy with minimal computational runtime.
  • The algorithm will be implemented in a Shiny application.

Next Word Prediction Algorithm

The algorithm implements the Kneser-Ney smoothing model [1]:

Equation
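
For reference, the bigram form of interpolated Kneser-Ney smoothing, as presented in Jurafsky and Martin [1], can be written as follows. The notation is an assumption and may differ from the equation shown on the original slide.

```latex
P_{KN}(w_i \mid w_{i-1}) =
  \frac{\max\bigl(c(w_{i-1} w_i) - d,\, 0\bigr)}{c(w_{i-1})}
  + \lambda(w_{i-1})\, P_{CONT}(w_i)

\lambda(w_{i-1}) =
  \frac{d}{c(w_{i-1})}\, \bigl|\{ w : c(w_{i-1} w) > 0 \}\bigr|

P_{CONT}(w_i) =
  \frac{\bigl|\{ v : c(v\, w_i) > 0 \}\bigr|}
       {\bigl|\{ (u', w') : c(u' w') > 0 \}\bigr|}
```

Here c(·) is a count in the training data, d is the discount, λ is the weight that redistributes the discounted probability mass, and P_CONT is the continuation probability, which measures in how many distinct contexts a word appears.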

  • The discount parameter d is set to 0.75.
  • The probabilities are calculated for trigrams, bigrams and unigrams using the training dataset.
  • The algorithm is paired with a back-off strategy. When 2 or more words are given, the trigram probabilities are used first. If the given words are not found among the trigrams, the algorithm backs off to the bigram probabilities, and so on down to the unigram probabilities.
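
The points above can be illustrated with a minimal sketch. It is written in Python for illustration only and is not the project's code. To stay short, it covers only the bigram and unigram levels; the trigram level described above would follow the same pattern. The toy corpus and the function names (`count_bigrams`, `continuation_prob`, `kn_bigram_prob`, `predict_next`) are hypothetical.

```python
from collections import Counter

D = 0.75  # discount value given in the text


def count_bigrams(tokens):
    """Count adjacent word pairs in a list of tokens."""
    return Counter(zip(tokens, tokens[1:]))


def continuation_prob(word, bigrams):
    """Kneser-Ney unigram level: share of distinct bigram types ending in `word`."""
    return sum(1 for (_, w) in bigrams if w == word) / len(bigrams)


def kn_bigram_prob(word, prev, bigrams, d=D):
    """Interpolated Kneser-Ney estimate of P(word | prev)."""
    # Total count of bigrams that start with `prev`.
    context_count = sum(c for (u, _), c in bigrams.items() if u == prev)
    # Number of distinct words observed after `prev`.
    followers = sum(1 for (u, _) in bigrams if u == prev)
    discounted = max(bigrams[(prev, word)] - d, 0) / context_count
    lam = d * followers / context_count
    return discounted + lam * continuation_prob(word, bigrams)


def predict_next(words, bigrams, vocab):
    """Use bigram probabilities if the last word was seen as a context, else back off."""
    prev = words[-1] if words else None
    if any(u == prev for (u, _) in bigrams):
        return max(vocab, key=lambda w: kn_bigram_prob(w, prev, bigrams))
    # Back-off level: rank candidates by continuation probability alone.
    return max(vocab, key=lambda w: continuation_prob(w, bigrams))


# Toy corpus, purely illustrative (not the SwiftKey data).
tokens = "the cat sat on the mat and the cat ran".split()
bigrams = count_bigrams(tokens)
vocab = sorted(set(tokens))

print(predict_next(["on", "the"], bigrams, vocab))  # cat
print(predict_next(["unseen"], bigrams, vocab))     # the
```

In the first call the context "the" is known, so the bigram estimates decide. In the second call the context was never observed, so the sketch backs off to the unigram level.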

Next Word Prediction in Application

The algorithm is implemented in a Shiny [3] application. The application displays the predicted next word from a fragment of a sentence given by the user.
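
Before a prediction can be looked up, the user's fragment has to be reduced to the last few words that serve as context. The sketch below shows one way this could be done. It is an illustration in Python, not the application's code, and the cleaning rules (lowercasing, keeping only letters and apostrophes) and the name `last_words` are assumptions, since the source does not describe how the input is tokenized.

```python
import re


def last_words(fragment, n=2):
    """Lowercase the fragment, split it into words, and return the last n words."""
    tokens = re.findall(r"[a-z']+", fragment.lower())
    return tokens[-n:]


print(last_words("I would like to GO to the"))  # ['to', 'the']
print(last_words("Hello"))                      # ['hello']
```

With n=2 the result is the two-word context needed for a trigram lookup. A shorter result means the lookup has to start at a lower-order level.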

References

[1] Daniel Jurafsky and James H. Martin (2018) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Third Edition Draft.

[2] Smitha Milli (2015) Kneser-Ney Smoothing. http://smithamilli.com/blog/kneser-ney/

[3] Winston Chang, Joe Cheng, JJ Allaire, Yihui Xie and Jonathan McPherson (2018). shiny: Web Application Framework for R. R package version 1.2.0. https://CRAN.R-project.org/package=shiny