Rongbin Ye
5/28/2020
At the request of SwiftKey and the JHU data science concentration, an input application is to be developed. Using three datasets of common English text drawn from tweets, blogs, and news, a predictive model needs to be embedded in the app.
Considering explainability and perplexity, an N-gram model was chosen as the main model for application development (Jurafsky & Martin, 2019).
An N-gram model with Kneser-Ney smoothing is used. The basic idea is a lazy-loading model that uses a given chunk of text to estimate the probability of a word given the combination of words preceding it. Furthermore, with Kneser-Ney smoothing, higher-order n-grams take priority when simulating the language context, and lower-order n-grams fill in when the longer context has not been seen.
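The smoothing idea can be illustrated with a minimal sketch of interpolated Kneser-Ney for bigrams only. This is not the implementation used in the product: the function names (`train_bigram_kn`, `kn_probability`, `predict_next`), the whitespace tokenization, and the discount of 0.75 are all assumptions made for illustration. The observed bigram count is discounted, and the freed probability mass is handed to a lower-order "continuation" estimate based on how many distinct words precede a candidate.

```python
from collections import Counter, defaultdict


def train_bigram_kn(sentences, discount=0.75):
    """Collect the counts needed for interpolated Kneser-Ney bigrams.

    `discount` is an assumed value; it should be tuned on held-out data.
    """
    bigram_counts = Counter()
    for sentence in sentences:
        # Naive lowercase + whitespace tokenization (an assumption).
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        bigram_counts.update(zip(tokens, tokens[1:]))

    context_totals = Counter()    # how often v occurs as a context
    followers = defaultdict(set)  # distinct words seen after v
    preceders = defaultdict(set)  # distinct words seen before w
    for (v, w), count in bigram_counts.items():
        context_totals[v] += count
        followers[v].add(w)
        preceders[w].add(v)

    return {
        "discount": discount,
        "bigram_counts": bigram_counts,
        "context_totals": context_totals,
        "followers": followers,
        "preceders": preceders,
        "bigram_types": len(bigram_counts),
    }


def kn_probability(model, v, w):
    """Interpolated Kneser-Ney estimate of P(w | v)."""
    # Continuation probability: share of bigram types that end in w.
    p_cont = len(model["preceders"].get(w, ())) / model["bigram_types"]
    total = model["context_totals"].get(v, 0)
    if total == 0:
        # Unseen context: fall back to the lower-order estimate alone.
        return p_cont
    d = model["discount"]
    discounted = max(model["bigram_counts"].get((v, w), 0) - d, 0) / total
    # Mass removed by discounting, redistributed via p_cont.
    backoff_weight = d * len(model["followers"][v]) / total
    return discounted + backoff_weight * p_cont


def predict_next(model, v, k=3):
    """Return the k most probable next words after v with their scores."""
    candidates = model["preceders"].keys()
    scored = [(w, kn_probability(model, v, w)) for w in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


if __name__ == "__main__":
    toy_corpus = ["the cat sat", "the cat ran", "the dog sat"]
    toy_model = train_bigram_kn(toy_corpus)
    print(predict_next(toy_model, "the"))
```

Extending this to trigrams and beyond applies the same discount-and-interpolate step recursively, so the longest matching context contributes first and shorter contexts only receive the discounted remainder.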
This data product contains three major components, including a main dashboard with predictions and a straightforward GUI for interaction.
Based on this prototype, the model's predictive capacity will improve as more data is used to train it. This is a screenshot of the product developed.
Indeed, there is still room for improvement in this product. Although the N-gram model provides good predictability, it has three major flaws:
To tackle these flaws, the process has been optimized through data cleaning. Yet if an LSTM RNN were applied, I would expect better prediction results given these considerations, at the cost of less interpretability.