2019/03/26
Capstone Project Overview
- Goal: create an easy to use Word Prediction web application
- Prediction Model: Maximum Likelihoood estimation with Kneser-Ney Smoothing
- The application URL: Click Here
Data Cleansing and Modeling
- Source Data: Coursera-Swiftkey Capstone Project
- Cleansing Method: Remove punctuation, twitter hashtags, numbers, hyphens, symbols other than English alphabet, URLs.
- Model: N-Gram Modeling and Kneser-Ney smoothing
Maximum Likelihood Estimation(MLE) and Kneser-Ney Smoothing
- MLE: When words A-B-C are present, the predicted word C is determined by the conditional probability of P(C | A and B). When there are multiple options, the application uses the most likely word as predicted word.
- KN Smoothing
- The problem with MLE is that when the word combination does not exist it can't predict the word.
- To mitigate the problem the application uses Kneser-Ney Smoothing.
- Discount the probability from the word appeared frequently and distribute the probabilty to N-1 gram words based on N-1 gram words frequency.
How to Use the Application

Further Consideration
- More N-gram: current model has up to quad-gram. For more precision, it can be extended to more. The limitation is that the more N-gram there is more computation power required.
- Deep Learning(RNN): Known to be good at predicting next words. However R is not suitable for Deep Learning algorithm and most of us don't have computing resource for good RNN model. If resource allows, it could be the best solution.