Data Science Capstone Project - Next Word Prediction

Md. Rajib Hossain
4-17-2016

Application Overview

This is a simple, learning-based next-word prediction application.

The input box takes a sequence of words, and the application predicts the next word from that sequence.

Each predicted word is the most probable continuation of the given input.

The prediction model is trained on the Twitter, blogs, and news data provided by Coursera.

Algorithm

  • Load the data from the current directory.
  • Combine the blogs, news, and Twitter data.
  • Apply the data cleaning operations to the combined data.
  • Remove punctuation, numbers, stop words, and extra white space.
  • Convert all words to lower case.
  • Apply Katz's back-off algorithm across the single-word (unigram), bigram, trigram, and n-gram models.
  • Save the models, look up the input in each of the four models, take the most probable result, and display it in the output box.
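
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the application's actual code: the function names (clean, train, predict), the tiny STOP_WORDS set, and the toy corpus are all hypothetical. It also backs off only by dropping the oldest context word when a longer context is unseen, and it omits the discounting and back-off weights that full Katz back-off uses.

```python
import re
from collections import Counter, defaultdict

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"a", "an", "the", "of", "to", "and"}


def clean(text):
    """Lowercase, replace punctuation and digits with spaces, drop stop words."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    # split() with no argument also collapses repeated white space.
    return [w for w in text.split() if w not in STOP_WORDS]


def train(lines, max_order=4):
    """For every context of 0..max_order-1 words, count the words that follow it."""
    followers = defaultdict(Counter)
    for line in lines:
        tokens = clean(line)
        for i, word in enumerate(tokens):
            for k in range(max_order):
                if i - k < 0:
                    break
                followers[tuple(tokens[i - k:i])][word] += 1
    return followers


def predict(followers, text, max_order=4):
    """Try the longest context first, then back off to shorter ones."""
    tokens = clean(text)
    for k in range(max_order - 1, -1, -1):
        if k > len(tokens):
            continue
        context = tuple(tokens[len(tokens) - k:])
        candidates = followers.get(context)
        if candidates:
            return candidates.most_common(1)[0][0]
    return None  # only reached when the model is empty


# Toy corpus standing in for the combined blogs, news, and Twitter data.
corpus = [
    "I love New York City",
    "New York City is big",
    "I love New York pizza",
]
model = train(corpus)
print(predict(model, "visit New York"))
```

Here the three-word context "visit new york" never occurs in the toy corpus, so the lookup falls back to the two-word context "new york" and returns its most frequent follower.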

Future Work and Its Applications

This application currently works only with the blogs, news, and Twitter data from Coursera. In the future I want to support any type of text data and to work with larger-scale data sets. The learning step also needs smoothing for each model, which should improve accuracy and reduce the rate of wrong next-word predictions. The application works well on the blogs, Twitter, and news data sets. I also tested it on the given data, where it predicted the next word reliably.

Application Performance

The accuracy of this application varies from model to model. When the input sequence is longer than four words, the accuracy of the predicted next word drops. The application is almost 95% accurate for input sequences of two or three words, and in some cases its performance is better. Overall, the next-word prediction is tuned for the data set provided by Coursera.
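
The slides do not say how accuracy was measured. One common way, sketched here as an assumption, is top-1 accuracy on held-out sentences: for each word after the first, predict it from the preceding words and count exact matches. The names next_word_accuracy, toy_predict, and TOY_TABLE are hypothetical, and the lookup-table predictor only stands in for a trained model.

```python
def next_word_accuracy(predict, sentences, max_context=3):
    """Fraction of positions where predict(context) equals the actual next word."""
    hits = total = 0
    for sentence in sentences:
        words = sentence.lower().split()
        for i in range(1, len(words)):
            # Use at most max_context preceding words as the input sequence.
            context = words[max(0, i - max_context):i]
            if predict(context) == words[i]:
                hits += 1
            total += 1
    return hits / total if total else 0.0


# Hypothetical stand-in predictor: looks only at the last word of the context.
TOY_TABLE = {"new": "york", "york": "city"}


def toy_predict(context):
    return TOY_TABLE.get(context[-1])


held_out = ["new york city", "new york pizza"]
accuracy = next_word_accuracy(toy_predict, held_out)
print(accuracy)  # 3 of 4 predictions match -> 0.75
```

Changing max_context and re-running the same measurement is one way to check how accuracy changes with the length of the input sequence.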

Thanks to all.