Predicting the Next Word in a Sequence

Mahesh Arumugam
June 5, 2016

Introduction

  • In this project, we build an algorithm that predicts the next word in a sequence of words.
  • Steps involved in the project:
    • Data acquisition and preprocessing (Coursera site)
      • download, sample, and clean the datasets
      • build a corpus and apply transformations (e.g., strip whitespace, filter profanity); see the sketch after this list
    • Exploratory Analysis (report)
      • n-gram tokenization and coverage analysis
    • Prediction Model and Application Development
      • n-gram model and handling of unseen tokens (via smoothing algorithms)
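
As a concrete illustration, here is a minimal sketch of the corpus-building and tokenization steps using the `tm` package (the file names `sample.txt` and `profanity.txt` are hypothetical placeholders; the actual project uses the blog, news, and twitter files from the Coursera site):

```r
library(tm)

# Hypothetical inputs: a sampled text file and a profanity word list.
lines     <- readLines("sample.txt", encoding = "UTF-8", warn = FALSE)
profanity <- readLines("profanity.txt", warn = FALSE)

# Build a corpus and apply the cleaning transformations.
corpus <- VCorpus(VectorSource(lines))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, profanity)  # profanity filtering
corpus <- tm_map(corpus, stripWhitespace)

# Simple base-R bigram tokenization over the cleaned text.
text    <- sapply(corpus, as.character)
words   <- unlist(strsplit(text, "\\s+"))
words   <- words[nchar(words) > 0]
bigrams <- paste(head(words, -1), tail(words, -1))
head(sort(table(bigrams), decreasing = TRUE), 5)  # most frequent bigrams
```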

Prediction Model

  • We build unigram, bigram, trigram, and quadgram models from the sampled data.
  • We use raw frequency counts from the highest-order n-gram model that matches the typed context to predict the next word (see the sketch after this list).
  • To deal with unseen tokens, we implemented two smoothing algorithms that redistribute probability mass to unseen events (formulas below):
    • Add-one (Laplace) smoothing: add one to every count and renormalize the probabilities.
    • Kneser-Ney smoothing: a backoff algorithm that uses lower-order n-gram probabilities to handle unseen higher-order n-grams.
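
For reference, these are the standard bigram forms of the two estimators (textbook formulas, not taken from the app; the implementation generalizes them to higher orders). Add-one smoothing:

$$P_{\text{add-1}}(w_i \mid w_{i-1}) = \frac{c(w_{i-1} w_i) + 1}{c(w_{i-1}) + V}$$

where $c(\cdot)$ is a raw count and $V$ is the vocabulary size. Interpolated Kneser-Ney:

$$P_{\text{KN}}(w_i \mid w_{i-1}) = \frac{\max\!\big(c(w_{i-1} w_i) - d,\ 0\big)}{c(w_{i-1})} + \lambda(w_{i-1})\, P_{\text{cont}}(w_i)$$

where $d$ is a fixed discount, $\lambda(w_{i-1})$ redistributes the discounted mass, and $P_{\text{cont}}(w_i)$ is the continuation probability: the fraction of distinct bigram types that end in $w_i$.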
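
And a minimal sketch of the lookup itself, with add-one smoothing and a simple back-off from bigrams to unigrams (the data frames `bigrams` and `unigrams` and their columns are illustrative names, not the app's actual data structures):

```r
# Sketch: rank candidate next words by add-one smoothed probability.
# `bigrams`: data frame with columns w1, w2, count (raw bigram counts).
# `unigrams`: data frame with columns word, count (raw unigram counts).
predict_next <- function(context, bigrams, unigrams, top_n = 3) {
  cand <- bigrams[bigrams$w1 == context, ]  # observed continuations of context
  V <- nrow(unigrams)                       # vocabulary size
  if (nrow(cand) > 0) {
    # add-one bigram estimate: (c(w1 w2) + 1) / (c(w1) + V)
    cand$prob <- (cand$count + 1) / (sum(cand$count) + V)
    head(cand[order(-cand$prob), c("w2", "prob")], top_n)
  } else {
    # unseen context: back off to the smoothed unigram distribution
    unigrams$prob <- (unigrams$count + 1) / (sum(unigrams$count) + V)
    head(unigrams[order(-unigrams$prob), c("word", "prob")], top_n)
  }
}

# Toy example
bigrams  <- data.frame(w1 = c("new", "new"), w2 = c("york", "car"),
                       count = c(5, 2), stringsAsFactors = FALSE)
unigrams <- data.frame(word = c("new", "york", "car", "the"),
                       count = c(7, 5, 2, 10), stringsAsFactors = FALSE)
predict_next("new", bigrams, unigrams)  # "york" ranks first
```

Restricting the candidate set to observed continuations keeps the lookup fast, while the add-one denominator still uses the full vocabulary size $V$.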

Next Word Prediction Application

Try the application here

  • Select the smoothing algorithm in the sidebar panel of the Shiny app.
  • Start typing your text in the main panel. The predictions appear as radio buttons.
  • The “Predicted Result Summary” tab shows the predicted probabilities.

Add-one smoothing screenshot

[Screenshots: add-one smoothing predictions; add-one probabilities]

Conclusion

  • The predictions take some time to appear; please allow 10-20 seconds for the session to load and the radio buttons to appear.
  • The prediction accuracy is low (approximately 10-20%).
  • In the current implementation, the Kneser-Ney algorithm does not appear to yield better predictions than the add-one smoothing algorithm.
  • Scope for future work:
    • Improve the accuracy of the algorithm
    • Develop the model with a larger sample (currently limited by hardware)
    • Implement other discounting models