Capstone Project: Next Word Prediction Shiny App Report

Bhaskar Bandaru
24th January 2016

Data Science Coursera Capstone Project.

Johns Hopkins Bloomberg School of Public Health Data Science course series, offered on Coursera.

Overview

Natural language processing (NLP) techniques in R were used to perform the analysis and build the predictive model.

The source data is available at Project Data Link.

Key features of the app:

– Simple, interactive, and fast-responding predictive model

– NLP techniques can be used at work for product sensitivity

– Customer or User Experience classification

Next Word Prediction App Overview

The data exploration steps are detailed in the Project Initial Report.

The app's home screen is shown below; the app is hosted at Shinyapp.

App Screen Shot

Prediction Model and Details

– An n-gram model [Ref 5] predicts the next word based on the previous 1, 2, or 3 word tokens.

– Katz's back-off algorithm [Ref 6] predicts the next word after the user enters a partial sentence.

– A “smoothing” technique based on the Simple Good-Turing estimator of William A. Gale and Geoffrey Sampson [Ref 7 & 8] was implemented to handle unseen n-grams.

– A part-of-speech (PoS or POS) model for word categories (or, more generally, lexical items) was also developed but not used.
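
The back-off idea in the list above can be sketched in a few lines. This is an illustrative Python sketch, not the app's R code: the names `build_ngram_tables` and `predict_next` are hypothetical, and the back-off here is deliberately simplified. It falls back to shorter contexts when a longer one was never seen, but it omits the discounting and probability redistribution that Katz's method and Good-Turing smoothing provide.

```python
from collections import Counter, defaultdict


def build_ngram_tables(tokens, max_n=4):
    """Count continuations for every n-gram order up to max_n.

    tables[n][context] is a Counter of next words, where context is a
    tuple of the n - 1 tokens preceding that word.
    """
    tables = {n: defaultdict(Counter) for n in range(1, max_n + 1)}
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            context = tuple(tokens[i:i + n - 1])
            tables[n][context][tokens[i + n - 1]] += 1
    return tables


def predict_next(tables, words, max_n=4):
    """Try the longest usable context first, then back off to shorter ones.

    Simplification (assumption of this sketch): no discounting is applied;
    the most frequent continuation of the first matching context wins.
    """
    for n in range(max_n, 0, -1):
        context = tuple(words[-(n - 1):]) if n > 1 else ()
        if len(context) < n - 1:
            continue  # the input is too short for this n-gram order
        followers = tables[n].get(context)
        if followers:
            return followers.most_common(1)[0][0]
    return None  # only reached when the training data is empty


tokens = "i want to eat . i want to sleep . i want to eat".split()
tables = build_ngram_tables(tokens)
print(predict_next(tables, ["i", "want", "to"]))  # uses the 4-gram table
print(predict_next(tables, ["unseen", "to"]))     # backs off to the 2-gram table
```

With `max_n=4`, the lookup uses at most the previous three words, matching the 1, 2, or 3 word contexts described above. A full Katz implementation would replace the raw counts with discounted probabilities and weight the lower-order estimates accordingly.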

Conclusions and Improvements

1. In general, the prediction algorithm is more accurate with 2-grams or longer n-grams.

  1. The corpus sample used for training should be larger; only 1% of the data was used.

  2. The Simple Good-Turing estimator is applied. It improves prediction quality, but performance is slow.

  3. The model could be improved with deep learning and Elasticsearch-based models to improve performance and prediction.
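
To make the sampling point concrete, here is a minimal Python sketch of drawing a reproducible random sample of corpus lines. The function name `sample_lines`, the seed, and the in-memory list of lines are assumptions for illustration; the report does not describe how the 1% sample was actually drawn.

```python
import random


def sample_lines(lines, fraction=0.01, seed=1234):
    """Keep each line independently with probability `fraction`.

    A fixed seed makes the sample reproducible between runs.
    """
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]


corpus = [f"line {i}" for i in range(10000)]  # stand-in for real corpus lines
small = sample_lines(corpus, fraction=0.01)   # roughly 1% of the lines
larger = sample_lines(corpus, fraction=0.10)  # roughly 10% of the lines
print(len(small), len(larger))
```

Raising `fraction` gives the n-gram tables more evidence at the cost of memory and build time, which is the trade-off behind the first improvement listed above.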

References

[1] Natural language processing Wikipedia page: NLP

[2] Text mining infrastructure in R: Text Mining using R

[3] CRAN Task View: Natural Language Processing: CRAN NLP

[4] Profanity words from Luis von Ahn's Research Group - bad words

[5] N-gram language model Wikipedia page: N-Gram Model

[6] Katz's back-off model Wikipedia page: Katz's Back-off

[7] Simple Good-Turing Estimator by William A. Gale and Geoffrey Sampson Paper: Good–Turing frequency estimation without tears

[8] Simple Good-Turing (proposed by Gale and Sampson) algorithm: C language code

[9] Stanford Natural Language Processing course on Coursera: Coursera NLP

Acknowledgements

Many thanks to Professors Brian Caffo, Jeff Leek, and Roger Peng of the Johns Hopkins Bloomberg School of Public Health for the Data Science course series on Coursera.

Thanks also to the Coursera teaching assistants and staff, classmates, and peer reviewers for their help and support.