"Words Predictions using n-gram with Good-Turing Smoothing"

Aadesh Neupane
2016-03-19

For completion of the Coursera Data Science Specialization from Johns Hopkins University

Project Overview

We need to build a predictive algorithm that can predict upcoming words given some words as input. For this task, Coursera, in collaboration with SwiftKey, has provided a data set from which we build an English-language corpus and a predictive model.

Steps followed for this project:

  • Data acquisition and cleaning (a minimal sketch follows this list)
  • Algorithm research
  • Algorithm selection (n-gram with Good-Turing smoothing)
  • Model training
  • Model evaluation
  • Shiny application built on the trained models
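
A minimal R sketch of the acquisition and cleaning step. The file path and the 10% sampling rate are illustrative assumptions, not the project's actual values:

```r
# Read one of the SwiftKey corpus files and sample a subset for training.
lines <- readLines("data/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
set.seed(1)
lines <- sample(lines, round(0.1 * length(lines)))   # 10% sample (assumed rate)

# Normalize: lowercase, strip everything except letters and apostrophes,
# collapse repeated whitespace.
lines  <- tolower(lines)
lines  <- gsub("[^a-z' ]", " ", lines)
lines  <- gsub("\\s+", " ", trimws(lines))
tokens <- strsplit(lines, " ")

# Count bigrams as "w1 w2" strings; trigrams and higher follow the same pattern.
bigrams <- unlist(lapply(tokens, function(w)
  if (length(w) > 1) paste(head(w, -1), tail(w, -1))))
bigram_counts <- sort(table(bigrams), decreasing = TRUE)
```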

Algorithm

n-gram with Good-Turing

Good-Turing smoothing reassigns the probability mass of all events that occur n times in the training data to all events that occur n-1 times; in particular, the mass of events seen exactly once is reassigned to events never seen in training.
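
In symbols, with N_c the number of distinct n-grams observed exactly c times and N the total number of observed n-grams, the adjusted count and the mass reserved for unseen events are:

    c* = (c + 1) * N_{c+1} / N_c        P_GT(unseen) = N_1 / N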

Good-Turing
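
A minimal R sketch of the adjusted-count computation. The function names and the input format (a named integer vector of raw n-gram frequencies) are illustrative assumptions, not the app's actual code:

```r
# Good-Turing adjusted counts: c* = (c + 1) * N_{c+1} / N_c
good_turing <- function(ngram_counts) {
  n_c <- c(table(ngram_counts))            # N_c: number of n-grams seen exactly c times
  sapply(ngram_counts, function(cnt) {
    n_next <- n_c[as.character(cnt + 1)]   # N_{c+1}
    if (is.na(n_next)) cnt                 # no higher count observed: keep raw count
    else (cnt + 1) * n_next / n_c[as.character(cnt)]
  })
}

# Probability mass reserved for unseen n-grams: P0 = N_1 / N
p_unseen <- function(ngram_counts) {
  sum(ngram_counts == 1) / sum(ngram_counts)
}

# Example: good_turing(c("of the" = 3, "in a" = 2, "to be" = 2, "new word" = 1))
```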

Instructions for Application Use

Features

  • Prediction of the next word almost instantly (see the lookup sketch below)
  • Language model of around 400 MB for accurate prediction
  • Word clouds of the most probable next words
  • List of probable words with frequency and likelihood values
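
Near-instant prediction comes from looking the input up in precomputed n-gram tables rather than scoring at request time. A minimal sketch with simple backoff, assuming hypothetical data frames `trigrams` and `bigrams` with columns `prefix`, `word`, and `prob`, each sorted by probability:

```r
# Return the top-n candidate next words for the user's input, backing off
# from the trigram table to the bigram table when no trigram prefix matches.
predict_next <- function(input, trigrams, bigrams, n = 5) {
  words <- tolower(unlist(strsplit(trimws(input), "\\s+")))
  k <- length(words)
  if (k >= 2) {
    hits <- trigrams[trigrams$prefix == paste(words[k - 1], words[k]), ]
    if (nrow(hits) > 0) return(head(hits[, c("word", "prob")], n))
  }
  hits <- bigrams[bigrams$prefix == words[k], ]   # back off to the bigram table
  head(hits[, c("word", "prob")], n)
}
```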

Words Predict App

Conclusion

This capstone project gave me the opportunity to explore NLP and to train and build models. It also allowed me to apply the skills I learned in the previous nine courses of this specialization. The resulting Shiny application is the product of the knowledge and skills gained during this Data Science Specialization.

Future Enhancements

  • Use 20% of the corpus to build the model for higher accuracy
  • Explore other models and smoothing techniques to improve performance

References

  • The Good-Turing Estimate, Ellis Weng and Andrew Owens
  • NLP Lunch Tutorial: Smoothing, Bill MacCartney
  • Language Models: Statistical Machine Translation
  • CS546: Learning NLP
  • CS498JH: Introduction to NLP, Julia Hockenmaier
  • Speech and Language Processing, Daniel Jurafsky & James H. Martin