Bhaskar Bandaru
24th January 2016
Data Science Coursera Capstone Project.
Johns Hopkins Bloomberg School of Public Health Data Science course series, offered through Coursera.
Natural language processing techniques in R were used to perform the analysis and build the predictive model.
The data is obtained from the Project Data Link.
Key features of this app:
– Simple, interactive, and fast-responding predictive model
– NLP techniques can be used at work for product sensitivity
– Customer or user experience classification
Data exploration steps are detailed in the Project Initial Report.
Following is the app's home screen; the app is hosted at Shinyapp.
– An N-gram model [Ref 5] is used to predict the next word based on the previous 1, 2, or 3 word tokens.
– Katz's back-off algorithm [Ref 6] is used to predict the next word after the user enters a partial sentence.
– A “smoothing” technique based on the Simple Good-Turing estimator developed by William A. Gale and Geoffrey Sampson [Ref 7 & 8] has been implemented to handle unseen n-grams.
– A part-of-speech (abbreviated PoS or POS) model for the category of words (or, more generally, of lexical items) has also been developed but is not used.
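The n-gram lookup with back-off described above can be illustrated with a minimal sketch. It is written in Python rather than the R used for the app, and it is not the app's actual code: the toy corpus, the function names `build_ngram_counts` and `predict_next`, and the plain "fall back to a shorter context" rule are assumptions made for illustration. In particular, the sketch omits the discounting step that Katz's back-off uses to redistribute probability mass to lower-order n-grams.

```python
from collections import Counter, defaultdict


def build_ngram_counts(tokens, max_order=4):
    """Map each context (a tuple of 0 to max_order-1 words) to a Counter of next words.

    The empty context () holds plain unigram counts.
    """
    counts = defaultdict(Counter)
    for i, word in enumerate(tokens):
        for k in range(max_order):
            if i - k < 0:
                break  # not enough preceding words for a context of length k
            context = tuple(tokens[i - k:i])
            counts[context][word] += 1
    return counts


def predict_next(counts, history, max_order=4):
    """Return the most frequent next word, trying the longest context first.

    Simplified back-off (an assumption of this sketch): if the 3-word context
    was never seen, try the last 2 words, then the last word, then unigrams.
    """
    words = history.lower().split()
    for k in range(max_order - 1, -1, -1):
        if k > len(words):
            continue  # history is shorter than this context length
        context = tuple(words[len(words) - k:])
        if context in counts:
            return counts[context].most_common(1)[0][0]
    return None  # only reached when the model is empty


# Hypothetical toy corpus, for illustration only.
corpus = "the cat sat on the mat the cat sat on the sofa the dog ate".split()
model = build_ngram_counts(corpus)

print(predict_next(model, "the cat"))       # sat
print(predict_next(model, "a purple cat"))  # sat (backs off to the context "cat")
print(predict_next(model, "zzz"))           # the (backs off to unigram counts)
```

With `max_order=4` the sketch uses at most the previous three words, matching the 1, 2, or 3 word contexts listed above.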
1. In general, prediction accuracy is better with 2-grams or higher-order n-grams.
2. The corpora sample used for training should be larger; only 1% of the data was considered.
3. The Simple Good-Turing estimator is applied. Predictability is improved, but performance is slow.
4. The model needs further work, for example with deep learning and Elasticsearch models, to improve performance and prediction.
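The Good-Turing idea behind the smoothing mentioned in these findings can be sketched briefly. This Python sketch (again, not the app's R code) computes only the basic, unsmoothed Good-Turing quantities: the probability mass reserved for unseen n-grams, N1/N, and the adjusted count r* = (r + 1) N(r+1) / N(r), where N(r) is the number of distinct n-grams seen exactly r times. The Simple Good-Turing estimator of Gale and Sampson additionally smooths the N(r) values with a log-linear fit, which is omitted here. The function name `good_turing` and the toy counts are hypothetical.

```python
from collections import Counter


def good_turing(ngram_counts):
    """Return (p_unseen, adjusted) for a dict of n-gram -> observed count.

    p_unseen is N1 / N, the total probability assigned to unseen n-grams.
    adjusted maps each observed count r to r* = (r + 1) * N(r+1) / N(r).
    """
    if not ngram_counts:
        raise ValueError("need at least one observed n-gram")
    n_total = sum(ngram_counts.values())           # N: total observations
    freq_of_freq = Counter(ngram_counts.values())  # N(r) for each count r
    p_unseen = freq_of_freq.get(1, 0) / n_total
    adjusted = {}
    for r, n_r in freq_of_freq.items():
        n_next = freq_of_freq.get(r + 1, 0)
        # Assumption of this sketch: keep the raw count when N(r+1) is zero.
        # Simple Good-Turing would use a smoothed N(r+1) here instead.
        adjusted[r] = (r + 1) * n_next / n_r if n_next else float(r)
    return p_unseen, adjusted


# Hypothetical bigram counts, for illustration only.
toy_counts = {"a b": 1, "b c": 1, "c d": 1, "d e": 2, "e f": 2, "f g": 3}
p_unseen, adjusted = good_turing(toy_counts)
print(p_unseen)     # 0.3 (three singletons out of ten observations)
print(adjusted[2])  # 1.5 (3 * N(3) / N(2) = 3 * 1 / 2)
```

Bigrams seen once or twice have their counts discounted, and the freed mass is what lets the model assign non-zero probability to n-grams it never saw in training.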
[1] Natural language processing Wikipedia page: NLP
[2] Text mining infrastructure in R: Text Mining using R
[3] CRAN Task View: Natural Language Processing: CRAN NLP
[4] Profanity words from Luis von Ahn's Research Group: bad words
[5] N-gram language model Wikipedia page: N-Gram Model
[6] Katz's back-off model Wikipedia page: Katz's Back-off
[7] Simple Good-Turing Estimator paper by William A. Gale and Geoffrey Sampson: Good–Turing frequency estimation without tears
[8] Simple Good-Turing algorithm (proposed by Gale and Sampson): C Language Code
[9] Stanford Natural Language Processing course on Coursera: Coursera NLP
Thanks very much to Professors Brian Caffo, Jeff Leek, and Roger Peng of the Johns Hopkins Bloomberg School of Public Health for the Data Science using R course on Coursera.
Thanks also to the Coursera course teaching assistants and staff, classmates, and peer reviewers for their help and support.