Coursera Data Science Specialization - SwiftKey Capstone Project

Tho Duy Nguyen
2016/01/20

Introduction

The project aims to develop an application that predicts the next word a user will type, based on the preceding word(s). Throughout the course, we went through the following steps:

  • Read and cleaned the training data provided by SwiftKey.
  • Tokenized, sampled, and explored many attributes of the data.
  • Researched and chose a suitable prediction model: the n-gram model.
  • Developed a demo Shiny application and deployed it to shinyapps.io.
  • Prepared a short slide deck to describe the project.

Prediction model

  • Used a subset of the English data for training because of limited resources: 50,000 blog entries, 50,000 news entries, and 100,000 Twitter entries.
  • N-gram models were built from the training data to store each possible next word and its probability given a sequence of input word(s).
  • 2-gram, 3-gram, 4-gram, and 5-gram models are used to predict the next word based on the last 1, 2, 3, or 4 input words, respectively.
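The n-gram construction described above can be sketched as follows. This is a minimal illustration in Python, not the project's own code (the original application was built with R and Shiny); the function name `build_ngram_model`, the whitespace tokenizer, and the toy corpus are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def build_ngram_model(sentences, n):
    """Map each (n-1)-word context to {next_word: relative frequency}.

    Hypothetical helper: tokenization here is a plain lowercase split,
    which is far simpler than real text cleaning.
    """
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i in range(len(tokens) - n + 1):
            context = tuple(tokens[i:i + n - 1])
            next_word = tokens[i + n - 1]
            counts[context][next_word] += 1

    model = {}
    for context, followers in counts.items():
        total = sum(followers.values())
        model[context] = {w: c / total for w, c in followers.items()}
    return model


# Toy corpus (made up for illustration only).
corpus = ["i love new york", "i love new shoes", "i love you"]
bigrams = build_ngram_model(corpus, 2)   # predicts from the last 1 word
trigrams = build_ngram_model(corpus, 3)  # predicts from the last 2 words

print(bigrams[("love",)])
print(trigrams[("love", "new")])
```

Each model only answers for contexts it has seen, which is one reason to combine several orders rather than rely on the longest one.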

Improve prediction results

  • The prediction model implements simple linear interpolation: it constructs a linear combination of the multiple probability estimates, weighting each contribution so that the result is another probability function.

simple linear interpolation
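The original formula image is not available here. As a reference, the standard form of simple linear interpolation, written for the 2-gram to 5-gram models this project uses, would look like the following; the symbols $\lambda_n$ and $P_n$ are notation assumed for this sketch.

```latex
\hat{P}(w_i \mid w_{i-4}, \dots, w_{i-1})
  = \sum_{n=2}^{5} \lambda_n \, P_n(w_i \mid w_{i-n+1}, \dots, w_{i-1}),
\qquad \sum_{n=2}^{5} \lambda_n = 1, \quad \lambda_n \ge 0
```

Here $P_n$ is the estimate from the $n$-gram model, which conditions on the last $n-1$ words.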

  • The weights sum to 1.
  • Assuming that more detailed input (a longer context) should lead to a more accurate prediction, the application uses pre-defined weights:

weight
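A minimal Python sketch of how interpolated scores could be computed. The weight values, the toy probabilities, and the function name `interpolate` are hypothetical; the application's actual pre-defined weights are not reproduced here.

```python
def interpolate(models, weights, history):
    """Rank candidate next words by a weighted sum of n-gram estimates.

    models:  {n: {context_tuple: {word: probability}}}
    weights: {n: weight}, expected to sum to 1
    history: list of previously typed words
    """
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")

    scores = {}
    for n, model in models.items():
        k = n - 1  # an n-gram model conditions on the last n-1 words
        if len(history) < k:
            continue
        context = tuple(history[-k:])
        for word, prob in model.get(context, {}).items():
            scores[word] = scores.get(word, 0.0) + weights[n] * prob
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy probabilities and weights, made up for illustration only.
models = {
    2: {("new",): {"york": 0.5, "shoes": 0.3, "car": 0.2}},
    3: {("love", "new"): {"shoes": 0.8, "york": 0.2}},
}
weights = {2: 0.4, 3: 0.6}  # the longer context gets the larger weight

ranked = interpolate(models, weights, ["i", "love", "new"])
print(ranked)
```

In this toy case the 2-gram model alone would suggest "york", but the more heavily weighted 3-gram evidence moves "shoes" to the top.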

Optimize performance

  • SQLite was used to store the n-gram models, which optimized RAM usage and reduced initial loading time.
  • Each n-gram model was stored in a separate *.db file (partitioned).
  • The Shiny application used trained model data up to 500 MB in size with reasonable response time.
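The storage layout above could be sketched as follows, using Python's built-in `sqlite3` module. The table name `ngram`, its columns, the file names, and the helper functions are assumptions for illustration; the project's actual schema is not shown in this deck.

```python
import os
import sqlite3
import tempfile


def save_ngrams(db_path, rows):
    """Write (context, next_word, prob) rows into one SQLite file."""
    conn = sqlite3.connect(db_path)
    with conn:  # commits on success
        conn.execute(
            "CREATE TABLE IF NOT EXISTS ngram ("
            "context TEXT NOT NULL, next_word TEXT NOT NULL, prob REAL NOT NULL)"
        )
        # An index on context lets lookups avoid scanning the whole table.
        conn.execute("CREATE INDEX IF NOT EXISTS idx_context ON ngram (context)")
        conn.executemany("INSERT INTO ngram VALUES (?, ?, ?)", rows)
    conn.close()


def lookup(db_path, context, limit=3):
    """Return the most probable next words for a context, read from disk."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT next_word, prob FROM ngram "
        "WHERE context = ? ORDER BY prob DESC LIMIT ?",
        (context, limit),
    ).fetchall()
    conn.close()
    return rows


# One file per n-gram order (hypothetical file names, toy data).
workdir = tempfile.mkdtemp()
bigram_db = os.path.join(workdir, "ngram2.db")
trigram_db = os.path.join(workdir, "ngram3.db")
save_ngrams(bigram_db, [("new", "york", 0.5), ("new", "shoes", 0.3)])
save_ngrams(trigram_db, [("love new", "shoes", 0.8), ("love new", "york", 0.2)])

print(lookup(trigram_db, "love new"))
```

Because each query reads only the matching rows from disk, the full model does not need to be loaded into memory when the application starts.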

How to use the Shiny application

  • The demo Shiny application is located here

  • Screenshot of the application

Screen shot

  • shinyapps.io is sometimes unstable, which can lead to a bad user experience.