Coursera Data Science Capstone Project

Renato Pedroso Neto
September 30, 2016

A next word prediction project, based on SwiftKey data.

Objective

The objective of the project was to develop a prediction algorithm, from the ground up, that shows the user the most probable next word they intend to type.

To complete this task we carried out data cleaning, exploratory analysis, R development, modeling, and the creation of the data products (the App and the R presentation pitch).

The data used in this project comes from SwiftKey and can be downloaded HERE.

The final product, the App, can be accessed HERE.

Methodology

The model is a simple n-gram model, from order 1 to order 4, developed in R. It calculates probabilities and conditional probabilities using the Katz backoff strategy and add-one (Laplace) smoothing.
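To make the idea concrete, here is a minimal sketch of backoff over n-grams of order 1 to 4 with add-one smoothing. It is written in Python for illustration only and is not the project's R code. It is also a simplified backoff: it falls back to a shorter context when the longer one has never been seen, and it does not compute the discounting and backoff weights of full Katz backoff. The function names `build_ngram_counts` and `predict_next` are hypothetical.

```python
from collections import Counter


def build_ngram_counts(tokens, max_n=4):
    """Count every n-gram of order 1..max_n in a list of tokens."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[n][tuple(tokens[i:i + n])] += 1
    return counts


def predict_next(counts, context, vocab_size, max_n=4, top_k=3):
    """Return up to top_k (word, probability) pairs for the next word.

    Tries the longest usable context first and backs off to shorter
    contexts when no n-gram starts with that context. Probabilities use
    add-one smoothing: (count + 1) / (context_count + vocab_size).
    """
    context = tuple(context)
    for n in range(max_n, 1, -1):
        prefix = context[-(n - 1):]
        if len(prefix) < n - 1:
            continue  # not enough typed words for this order
        candidates = {
            gram[-1]: c for gram, c in counts[n].items() if gram[:-1] == prefix
        }
        if candidates:
            prefix_count = counts[n - 1][prefix]
            scored = [
                (word, (c + 1) / (prefix_count + vocab_size))
                for word, c in candidates.items()
            ]
            scored.sort(key=lambda pair: (-pair[1], pair[0]))
            return scored[:top_k]
    # Last resort: smoothed unigram probabilities.
    total = sum(counts[1].values())
    scored = [
        (gram[0], (c + 1) / (total + vocab_size))
        for gram, c in counts[1].items()
    ]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_k]


if __name__ == "__main__":
    tokens = "the cat sat on the mat the cat ran".split()
    counts = build_ngram_counts(tokens)
    vocab_size = len(counts[1])
    print(predict_next(counts, ["on", "the"], vocab_size))
    print(predict_next(counts, ["the"], vocab_size))
```

The linear scan over `counts[n]` keeps the sketch short; a real application would index n-grams by their prefix so that each lookup is fast.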

To ensure a low response time, the code was developed using data tables instead of data frames, with keys for indexed lookups. To make the n-gram and probability calculations run fast enough, the set() function of the data.table package was used to insert data.

The model was built using 30% of the total available data. After testing and validation, the amount of data had to be reduced to 3% so that the code could run on a free ShinyApps account.
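The text does not say how the 30% and 3% subsets were drawn. One common approach, shown here only as an assumption, is to keep each line of the corpus with a fixed probability, using a seeded random generator so the subset is reproducible. The function name `sample_lines` is hypothetical, and the sketch is in Python rather than the project's R.

```python
import random


def sample_lines(lines, fraction, seed=42):
    """Keep each line independently with probability `fraction`.

    A fixed seed makes the same subset come out on every run.
    The result is roughly, not exactly, `fraction` of the input.
    """
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]


if __name__ == "__main__":
    corpus = [f"line {i}" for i in range(10000)]
    model_data = sample_lines(corpus, 0.30)   # size used to build the model
    deploy_data = sample_lines(corpus, 0.03)  # size used for the hosted App
    print(len(model_data), len(deploy_data))
```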

How to use

The App can be accessed HERE.
When the user opens the website listed above, a simple layout appears and the R code immediately starts running in the background. A wait message is shown; once it disappears, an initial table is displayed.

Type any English sentence in the text box and check the suggested next word on the right side. A list of further candidate words also appears on the right side; this list can be adjusted to show more or fewer values.

Brief documentation is also available: just click “Documentation” in the top menu.

Final Considerations

The full code can be found on GITHUB.
Any suggestions are welcome; please contact me through LINKEDIN.