Sébastien Lievain
16/02/2017
The data product created for this Capstone Project aims to reproduce the well-known SwiftKey keyboard prediction feature.
It is hosted on shinyapps.io, so the 1 GB constraint was taken into account when optimizing the accuracy / performance tradeoff.
The data product is organised with a traditional two-column layout.
The left panel holds parameters the user can interact with:
Our prediction model is based on a trigram language model (second-order Markov assumption).
To deal with unobserved trigrams, two different back-off models have been implemented: Stupid back-off and Katz back-off.
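As an illustration of how backing off handles an unobserved trigram, here is a minimal Python sketch of Stupid back-off scoring. It is not the code of the data product: the function names (`train`, `score`, `predict`) are hypothetical, and the back-off factor of 0.4 is the value commonly quoted for Stupid back-off, assumed here rather than taken from this project.

```python
from collections import Counter

ALPHA = 0.4  # assumed back-off factor, not taken from this project


def train(tokens):
    """Count unigrams, bigrams and trigrams in a list of tokens."""
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    return uni, bi, tri


def score(w1, w2, w3, uni, bi, tri):
    """Stupid back-off score of w3 after (w1, w2): a relative score, not a probability."""
    if tri[(w1, w2, w3)] > 0:
        # Trigram observed: use its relative frequency.
        return tri[(w1, w2, w3)] / bi[(w1, w2)]
    if bi[(w2, w3)] > 0:
        # Trigram unobserved: back off to the bigram, discounted by ALPHA.
        return ALPHA * bi[(w2, w3)] / uni[w2]
    # Bigram unobserved too: back off to the unigram frequency.
    return ALPHA * ALPHA * uni[w3] / sum(uni.values())


def predict(w1, w2, uni, bi, tri, k=3):
    """Return the k highest-scoring candidates for the next word."""
    ranked = sorted(uni, key=lambda w: score(w1, w2, w, uni, bi, tri), reverse=True)
    return ranked[:k]
```

Katz back-off follows the same fall-through structure but redistributes discounted probability mass so that the result is a proper probability distribution; it is not shown here.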
The accuracy of the model was first measured using perplexity.
Unfortunately, the result was not stable as the test dataset size increased and was therefore not interpretable.
I finally decided to estimate accuracy as the number of hits (true next word among the top 3 predicted words) divided by the total number of tests.
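The hit-based accuracy measure can be sketched as follows. This is only an illustration: `top3_hit_rate` and the `predict` callable it receives are hypothetical names, and the test cases are assumed to be pairs of a two-word context and the true next word.

```python
def top3_hit_rate(test_cases, predict):
    """Fraction of test cases whose true next word is among the top 3 predictions.

    test_cases: iterable of ((w1, w2), next_word) pairs.
    predict: hypothetical callable mapping (w1, w2) to a ranked list of words.
    """
    hits = 0
    total = 0
    for (w1, w2), target in test_cases:
        total += 1
        if target in predict(w1, w2)[:3]:
            hits += 1
    # Avoid dividing by zero on an empty test set.
    return hits / total if total else 0.0
```

Unlike perplexity, this measure only looks at the ranking of the candidates, so it does not depend on the scores being normalized probabilities.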
Performance was measured by timing the execution of both the Stupid and Katz back-off algorithms.
The acceptable limit was arbitrarily set to 15 seconds.
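A timing measurement of this kind might look like the sketch below. The helper names are hypothetical, and whether the 15-second limit applied to a single prediction or to a whole test run is not stated, so the check simply compares the mean time per call against the limit.

```python
import time

TIME_LIMIT_S = 15.0  # acceptable limit stated above


def time_call(fn, *args, repeats=5):
    """Return the mean wall-clock time in seconds of fn(*args) over several runs."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn(*args)
    return (time.perf_counter() - start) / repeats


def within_limit(fn, *args):
    """True if the mean time per call stays within the acceptable limit."""
    return time_call(fn, *args) <= TIME_LIMIT_S
```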
Parameters of the model used to optimize the Performance vs Accuracy tradeoff were:
The optimization led to selecting a ratio of 20% and a cutoff of 90%, with bigrams and trigrams that occur only once filtered out.
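Filtering out n-grams with a frequency of one can be sketched as below. The function name and the `min_count` parameter are hypothetical, and the sketch covers only the frequency filter; how the 20% ratio and the 90% cutoff are applied is not shown.

```python
from collections import Counter


def prune_singletons(counts, min_count=2):
    """Drop n-grams seen fewer than min_count times (by default, those seen once)."""
    return Counter({ngram: c for ngram, c in counts.items() if c >= min_count})
```

Removing singletons usually shrinks the bigram and trigram tables considerably, which matters under a fixed memory budget, at the cost of backing off more often.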
Link to my data product on shinyapps.io:
http://84.39.36.234:3838/capstone_project/
The code of the data product, reports and scripts can be found on GitHub:
https://github.com/slievain/dataScienceCapstoneProject
Some videos from Michael Collins of Columbia, who gave an excellent class on NLP:
https://www.youtube.com/playlist?list=PLO9y7hOkmmSH7-p6que1MYbhBx74AzH7-
https://www.youtube.com/playlist?list=PLO9y7hOkmmSHE2v_oEUjULGg20gyb-v1u