Coursera Data Science Specialization - SwiftKey Capstone Project

Tho Duy Nguyen
2016/01/20

Introduction

The project aims to develop an application that predicts the next word a user will type, based on the preceding sequence of word(s). Throughout the course, we went through the following steps:

Read and cleaned the training data provided by SwiftKey.
Tokenized, sampled, and explored many attributes of the data.
Researched and chose a suitable prediction model: the n-gram model.
Developed a demo Shiny application and deployed it to shinyapps.io.
Prepared a short slide deck to describe the project.
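
The cleaning and tokenization steps listed above could look roughly like the following. This is only a minimal sketch in Python, not the project's actual code; the function name `clean_and_tokenize` and the specific cleaning rules (lowercasing, keeping only letters and apostrophes) are assumptions made for illustration.

```python
import re

def clean_and_tokenize(text):
    """Lowercase the text and return a list of word tokens (illustrative rules)."""
    text = text.lower()
    # Keep letters and apostrophes; every other run of characters becomes a space.
    text = re.sub(r"[^a-z']+", " ", text)
    # Drop apostrophes left dangling at token edges and skip empty tokens.
    return [tok.strip("'") for tok in text.split() if tok.strip("'")]

tokens = clean_and_tokenize("Hello, World! It's 2016.")
```

Real cleaning pipelines usually also handle profanity, URLs, and non-English text; those steps are left out here to keep the example short.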

Prediction model

A subset of the English data was used for training because of limited resources: 50,000 blog entries, 50,000 news entries, and 100,000 Twitter entries.
N-gram models were built from the training data to store each candidate next word and its probability given the preceding sequence of input word(s).
2-gram, 3-gram, 4-gram, and 5-gram models are used to predict the next word from the last 1, 2, 3, or 4 input words, respectively.
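
To make the n-gram idea concrete, here is a small Python sketch that counts which words follow each context and turns the counts into maximum-likelihood probabilities. The function names and the toy sentence are hypothetical; the original project may have built its tables differently.

```python
from collections import Counter, defaultdict

def build_ngram_model(tokens, n):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        next_word = tokens[i + n - 1]
        model[context][next_word] += 1
    return model

def next_word_probabilities(model, context):
    """Maximum-likelihood estimate of P(next word | context)."""
    counts = model.get(tuple(context), Counter())
    total = sum(counts.values())
    if total == 0:
        return {}  # context never seen in the training data
    return {word: count / total for word, count in counts.items()}

# Toy training data, for illustration only.
tokens = "i love you i love cake i like you".split()
bigrams = build_ngram_model(tokens, 2)   # predicts from the last 1 word
trigrams = build_ngram_model(tokens, 3)  # predicts from the last 2 words
```

With this toy data, the 2-gram model sees "love" after "i" in two of three cases, while the 3-gram model splits evenly between "you" and "cake" after "i love".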

Improve prediction results

The prediction model implements simple linear interpolation: it constructs a linear combination of the multiple probability estimates, weighting each contribution so that the result is another probability function.

[Figure: simple linear interpolation formula]

The weights sum to 1.
Assuming that a more detailed input should lead to a more accurate prediction, the application uses pre-defined weights:

[Figure: pre-defined weights]
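
As an illustration of simple linear interpolation, the Python sketch below combines the probability tables from several n-gram orders using one weight per order. The weight values shown are hypothetical placeholders, chosen only so that higher orders count more and the total is 1; they are not the weights used by the application. The function name `interpolate` is also an assumption.

```python
def interpolate(estimates, weights):
    """Combine per-order probability tables into one interpolated table.

    estimates: {order: {word: probability}}, e.g. order 2 for the 2-gram model.
    weights:   {order: weight}; the weights must sum to 1.
    """
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    combined = {}
    for order, table in estimates.items():
        for word, prob in table.items():
            combined[word] = combined.get(word, 0.0) + weights[order] * prob
    return combined

# Hypothetical weights: longer contexts get a larger share.
WEIGHTS = {2: 0.1, 3: 0.2, 4: 0.3, 5: 0.4}

# Toy estimates: only the 2-gram and 3-gram models matched the input.
estimates = {
    2: {"you": 0.5, "cake": 0.5},
    3: {"you": 1.0},
    4: {},
    5: {},
}
scores = interpolate(estimates, WEIGHTS)
best = max(scores, key=scores.get)  # word with the highest combined score
```

Note that the combined scores only form a full probability distribution when every order contributes one; when a higher-order model has no match, as in this toy case, the scores still rank the candidates but sum to less than 1.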

Optimize performance

SQLite was used to store the n-gram models, which optimized RAM usage and reduced the initial loading time.
Each n-gram model was stored in a separate *.db file (partitioned).
The Shiny application handled trained model data of up to 500 MB in size with a reasonable response time.
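
The partitioned storage described above can be sketched with Python's built-in `sqlite3` module. The table layout, the column names, the index, and the file names (`ngram2.db`, `ngram3.db`) are all assumptions for illustration; the project's actual schema is not given here.

```python
import os
import sqlite3
import tempfile

def save_ngrams(db_path, rows):
    """Write (context, next_word, probability) rows to one SQLite file."""
    conn = sqlite3.connect(db_path)
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "CREATE TABLE IF NOT EXISTS ngram ("
            "context TEXT, next_word TEXT, probability REAL)"
        )
        # An index on the context keeps lookups fast without loading data into RAM.
        conn.execute("CREATE INDEX IF NOT EXISTS idx_context ON ngram (context)")
        conn.executemany("INSERT INTO ngram VALUES (?, ?, ?)", rows)
    conn.close()

def lookup(db_path, context, limit=3):
    """Return the most probable next words for a context from one file."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT next_word, probability FROM ngram "
        "WHERE context = ? ORDER BY probability DESC LIMIT ?",
        (context, limit),
    ).fetchall()
    conn.close()
    return rows

# One database file per n-gram order (hypothetical file names).
db_dir = tempfile.mkdtemp()
paths = {n: os.path.join(db_dir, f"ngram{n}.db") for n in (2, 3)}
save_ngrams(paths[2], [("i", "love", 0.6), ("i", "like", 0.4)])
save_ngrams(paths[3], [("i love", "you", 0.7), ("i love", "cake", 0.3)])
top = lookup(paths[3], "i love")
```

Because each query reads only the rows that match the context, the application does not need to hold the whole model in memory at startup, which is the benefit the section describes.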

How to use the Shiny application

The demo Shiny application is located here
Screenshot of the application:

[Figure: screenshot of the application]

shinyapps.io is sometimes unstable, which can lead to a poor user experience.