12 January 2019
Data Product Overview
- SmartPredict is an application that uses the HC Corpora to build a model that simulates assisted word prediction through a web interface.
- HC Corpora is a combination of blog, Twitter, and news datasets, which the author processed, cleaned, and sampled for use in this application.
- The model uses a 13% sample of the cleaned corpora (7% from blogs, 2% from Twitter, and 4% from news, each tokenized and cleaned).
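The sampling step can be sketched as follows. This is a minimal illustration in Python, not the author's code: it assumes the 7%, 2%, and 4% figures are per-source sampling rates and that each corpus is already a list of cleaned lines. The names `sample_lines`, `build_sample`, and `RATES` are hypothetical.

```python
import random

# Assumed per-source sampling rates, read from the percentages quoted above.
RATES = {"blog": 0.07, "twitter": 0.02, "news": 0.04}


def sample_lines(lines, fraction, seed=2019):
    """Return a reproducible random sample holding `fraction` of `lines`."""
    rng = random.Random(seed)           # fixed seed so the sample is repeatable
    k = round(len(lines) * fraction)    # number of lines to keep
    return rng.sample(lines, k)


def build_sample(corpora, rates=RATES, seed=2019):
    """Sample each source at its own rate and pool the results."""
    sample = []
    for name, lines in corpora.items():
        sample.extend(sample_lines(lines, rates[name], seed))
    return sample


if __name__ == "__main__":
    # Toy stand-in for the cleaned corpora: 100 placeholder lines per source.
    toy = {name: [f"{name} line {i}" for i in range(100)] for name in RATES}
    print(len(build_sample(toy)))
```

A fixed seed keeps the sample reproducible, which matters when the same sample is later used to compare model variants.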
Model Building and Performance
- The application uses an n-gram model with frequency lookup.
- I first generated a four-gram model from frequency-sorted tokens to predict the next word.
- The top three candidates are kept, in order of decreasing probability of being the next word.
- If no matching four-gram is found, the algorithm backs off to three-grams, then two-grams, then one-grams.
- On 100 samples drawn from the Twitter dataset, the model achieves 22% accuracy with a runtime of 32.6784 seconds.
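The lookup-with-backoff idea can be sketched as follows. This is a minimal illustration in Python under stated assumptions, not the application's actual implementation: tokenization is a plain lowercase whitespace split, and the sketch returns candidates from the first n-gram level that matches rather than topping up from lower orders. The names `build_ngram_tables` and `predict_next` are hypothetical.

```python
from collections import Counter, defaultdict


def build_ngram_tables(sentences, max_n=4):
    """Map each context (a tuple of 0 to max_n-1 words) to next-word counts."""
    tables = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()   # assumed tokenization
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                *context, word = tokens[i:i + n]
                tables[tuple(context)][word] += 1
    return tables


def predict_next(tables, text, k=3, max_n=4):
    """Return up to k candidates, backing off from the longest context."""
    tokens = text.lower().split()
    for ctx_len in range(max_n - 1, -1, -1):
        if ctx_len > len(tokens):
            continue                         # not enough words typed yet
        context = tuple(tokens[len(tokens) - ctx_len:]) if ctx_len else ()
        candidates = tables.get(context)
        if candidates:
            # Most frequent continuations first.
            return [word for word, _ in candidates.most_common(k)]
    return []


if __name__ == "__main__":
    corpus = [
        "i want to go home",
        "i want to go out",
        "i want to go home now",
        "we want to eat",
    ]
    tables = build_ngram_tables(corpus)
    print(predict_next(tables, "want to go"))    # four-gram level matches
    print(predict_next(tables, "they want to"))  # backs off to three-grams
```

The empty context `()` holds plain word frequencies, so an unseen phrase still yields the most common words as a last resort.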
The Application

- The web app contains a single text box with three buttons above it.
- The most probable next word returned from the server is shown on the middle button; the second and third most probable words are shown on the left and right buttons, respectively.
- The user can click these buttons to type faster.
- A demo of the SmartPredict application can be found here.
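The button layout described above can be sketched as a small mapping. This is an illustrative Python snippet, not the app's Shiny code; the function name `assign_buttons` and the choice to leave a button blank when fewer than three predictions are available are assumptions.

```python
def assign_buttons(predictions):
    """Place ranked predictions on the buttons: best in the middle,
    second on the left, third on the right."""
    # Pad with empty labels in case fewer than three words were predicted.
    first, second, third = (list(predictions) + ["", "", ""])[:3]
    return {"left": second, "middle": first, "right": third}


if __name__ == "__main__":
    print(assign_buttons(["the", "a", "to"]))
```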
Conclusion
- Using a decent amount of cleaned sample data, we can build a model that powers useful applications like SmartPredict.
- The application is deployed via Shiny, a free-tier data product platform from RStudio, with 1 GB of application memory, so it can run in environments with a small footprint.
- We can improve the app's predictive accuracy by obtaining more recent corpora from a wider variety of sources and by increasing the sample size, while keeping memory and storage constraints in mind.
- We can also develop a multilingual SmartPredict application using corpora from sources in different languages.