"Words Predictions using n-gram with Good-Turing Smoothing"

Aadesh Neupane
2016-03-19

For completion of the Coursera Data Science Specialization from Johns Hopkins University

Project Overview

We need to build a predictive algorithm that can predict upcoming words given some words as input. For this task, Coursera, in collaboration with SwiftKey, has provided a data set from which we build an English-language corpus and a predictive model.

Steps followed for this project:

  • Data acquisition and cleaning (a minimal sketch follows this list)
  • Algorithm research
  • Algorithm selection (n-gram with Good-Turing smoothing)
  • Model training
  • Model evaluation
  • Shiny application built on the trained models
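
A minimal R sketch of the acquisition and cleaning step. The file path and the 10% sampling rate are illustrative assumptions, not the project's actual values:

```r
# Read one of the SwiftKey corpus files and sample a subset for training.
lines <- readLines("data/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
set.seed(1)
lines <- sample(lines, round(0.1 * length(lines)))   # 10% sample (assumed rate)

# Normalize: lowercase, strip everything except letters and apostrophes,
# collapse repeated whitespace.
lines  <- tolower(lines)
lines  <- gsub("[^a-z' ]", " ", lines)
lines  <- gsub("\\s+", " ", trimws(lines))
tokens <- strsplit(lines, " ")

# Count bigrams as "w1 w2" strings; trigrams and higher follow the same pattern.
bigrams <- unlist(lapply(tokens, function(w)
  if (length(w) > 1) paste(head(w, -1), tail(w, -1))))
bigram_counts <- sort(table(bigrams), decreasing = TRUE)
```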

Algorithm

n-gram with Good-Turing

Good-Turing smoothing reassigns the probability mass of all events that occur n times in the training data to all events that occur n-1 times; in particular, the mass of events seen exactly once is reassigned to events never seen in training.
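
In symbols, with N_c the number of distinct n-grams observed exactly c times and N the total number of observed n-grams, the adjusted count and the mass reserved for unseen events are:

    c* = (c + 1) * N_{c+1} / N_c        P_GT(unseen) = N_1 / N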

Good-Turing
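
A minimal R sketch of the adjusted-count computation. The function names and the input format (a named integer vector of raw n-gram frequencies) are illustrative assumptions, not the app's actual code:

```r
# Good-Turing adjusted counts: c* = (c + 1) * N_{c+1} / N_c
good_turing <- function(ngram_counts) {
  n_c <- c(table(ngram_counts))            # N_c: number of n-grams seen exactly c times
  sapply(ngram_counts, function(cnt) {
    n_next <- n_c[as.character(cnt + 1)]   # N_{c+1}
    if (is.na(n_next)) cnt                 # no higher count observed: keep raw count
    else (cnt + 1) * n_next / n_c[as.character(cnt)]
  })
}

# Probability mass reserved for unseen n-grams: P0 = N_1 / N
p_unseen <- function(ngram_counts) {
  sum(ngram_counts == 1) / sum(ngram_counts)
}

# Example: good_turing(c("of the" = 3, "in a" = 2, "to be" = 2, "new word" = 1))
```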

Instructions for Application Use

Features

  • Prediction of the next word almost instantly (see the lookup sketch below)
  • Language model of around 400 MB for accurate prediction
  • Word clouds of the most probable next words
  • List of probable words with frequency and likelihood values
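
Near-instant prediction comes from looking the input up in precomputed n-gram tables rather than scoring at request time. A minimal sketch with simple backoff, assuming hypothetical data frames `trigrams` and `bigrams` with columns `prefix`, `word`, and `prob`, each sorted by probability:

```r
# Return the top-n candidate next words for the user's input, backing off
# from the trigram table to the bigram table when no trigram prefix matches.
predict_next <- function(input, trigrams, bigrams, n = 5) {
  words <- tolower(unlist(strsplit(trimws(input), "\\s+")))
  k <- length(words)
  if (k >= 2) {
    hits <- trigrams[trigrams$prefix == paste(words[k - 1], words[k]), ]
    if (nrow(hits) > 0) return(head(hits[, c("word", "prob")], n))
  }
  hits <- bigrams[bigrams$prefix == words[k], ]   # back off to the bigram table
  head(hits[, c("word", "prob")], n)
}
```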

Words Predict App

Conclusion

This capstone project gave me the opportunity to explore NLP and to train and build models. It also allowed me to apply the skills I learned in the previous nine courses of this specialization. The resulting Shiny application is the product of the knowledge and skills gained during this Data Science Specialization.

Future Enhancements

  • Use 20% of the corpus to build the model for higher accuracy
  • Explore other models and smoothing techniques to improve performance

References

  • The Good-Turing Estimate, Ellis Weng and Andrew Owens
  • NLP Lunch Tutorial: Smoothing, Bill MacCartney
  • Language Models: Statistical Machine Translation
  • CS546: Learning NLP
  • CS498JH: Introduction to NLP, Julia Hockenmaier
  • Speech and Language Processing, Daniel Jurafsky & James H. Martin