Capstone Project

2019/03/26

Capstone Project Overview

Source Data: Coursera-Swiftkey Capstone Project
Cleansing Method: Remove punctuation, twitter hashtags, numbers, hyphens, symbols other than English alphabet, URLs.
Model: N-Gram Modeling and Kneser-Ney smoothing

MLE: When words A-B-C are present, the predicted word C is determined by the conditional probability of P(C | A and B). When there are multiple options, the application uses the most likely word as predicted word.
KN Smoothing
The problem with MLE is that when the word combination does not exist it can't predict the word.
To mitigate the problem the application uses Kneser-Ney Smoothing.
Discount the probability from the word appeared frequently and distribute the probabilty to N-1 gram words based on N-1 gram words frequency.

More N-gram: current model has up to quad-gram. For more precision, it can be extended to more. The limitation is that the more N-gram there is more computation power required.
Deep Learning(RNN): Known to be good at predicting next words. However R is not suitable for Deep Learning algorithm and most of us don't have computing resource for good RNN model. If resource allows, it could be the best solution.