Data Science Specialization SwiftKey Capstone

Yuan Hu
November 15 2014

Introduction

This data science capstone is run in cooperation with SwiftKey, which builds a smart keyboard that predicts words so that people can type more easily on their mobile devices.

In this capstone we will work on understanding and building predictive text models like those used by SwiftKey.

Data Source

The data set, the Capstone Dataset, was downloaded from the Coursera site.

To predict words through text mining, I acquire and clean the data sets and then build a plausible model.

Build Prediction Algorithm

  1. Obtain the data
  2. Clean the data
  3. Filter out profanity
  4. Tokenize the sentences
  5. Perform exploratory analysis
  6. Build a basic n-gram model
  7. Build a predictive model
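
The cleaning, profanity filtering, tokenizing and n-gram counting steps above can be sketched roughly as follows. This is a minimal Python illustration, not the capstone's own R code: the function names, the cleaning rules and the tiny `PROFANITY` placeholder list are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical placeholder list; a real profanity list would be loaded from a file.
PROFANITY = {"darn", "heck"}

def clean(text):
    """Lowercase the text and keep only letters, apostrophes and single spaces."""
    text = text.lower()
    text = re.sub(r"[^a-z' ]+", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def tokenize(text):
    """Split cleaned text into word tokens, dropping words on the profanity list."""
    return [w for w in clean(text).split() if w not in PROFANITY]

def ngram_counts(sentences, n):
    """Count n-grams (stored as tuples of words) across a list of sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = tokenize(sentence)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# Toy corpus, for illustration only.
corpus = ["I love data science!", "I love text mining.", "Heck, I love R"]
trigrams = ngram_counts(corpus, 3)
```

Calling `ngram_counts(corpus, 2)` on the same toy corpus gives the bigram table; the frequency tables for different values of n are what a basic n-gram model is built from.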

The initial training dataset is about 560 MB in total. Surprisingly, the final smoothed 3-gram model RData file is only 22 MB.

Shiny App Algorithm and Performance

1. Preprocess the input sentence

The user's input sentence is cleaned and tokenized.

2. Predict the next single word

The algorithm searches the model and returns a single predicted word.
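
One common way to implement such a lookup is to try the longest available context first and back off to shorter ones. The Python sketch below illustrates that idea under stated assumptions: the slides do not say which search or smoothing scheme the app uses, so the simple "most frequent continuation with backoff" rule and the names `build_model` and `predict_next` are hypothetical.

```python
from collections import Counter, defaultdict

def build_model(sentences, max_n=3):
    """Map each context (a tuple of 0 to max_n-1 words) to a Counter of next words."""
    model = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            for k in range(max_n):  # context lengths 0, 1, ..., max_n-1
                if i - k >= 0:
                    model[tuple(tokens[i - k:i])][word] += 1
    return model

def predict_next(model, sentence, max_n=3):
    """Return the most frequent next word, backing off from long to short contexts."""
    tokens = sentence.lower().split()
    for k in range(max_n - 1, -1, -1):
        if k > len(tokens):
            continue  # not enough input words for a context of this length
        context = tuple(tokens[len(tokens) - k:])
        if context in model:
            return model[context].most_common(1)[0][0]
    return None  # the model is empty

# Toy training data, for illustration only.
model = build_model(["i love data science", "i love text mining", "i love data"])
print(predict_next(model, "I love"))   # two-word context is found directly
print(predict_next(model, "we love"))  # backs off to the one-word context
```

Keeping only the top continuation for each context, rather than the full count tables, is one way a model file can stay small relative to the training data.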

3. Performance of word prediction

Prediction completes within 1 second.

# typical prediction time cost
   user  system elapsed 
   0.27    0.01    0.30 
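
The three columns above are the user CPU, system CPU and elapsed wall-clock times in seconds, as reported by R's `system.time`. A comparable measurement can be sketched in Python as follows; the helper name `time_call` is hypothetical and the timed function is only a stand-in for a real prediction call.

```python
import time

def time_call(fn, *args):
    """Run fn(*args) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Stand-in workload; a real check would time the prediction function instead.
result, elapsed = time_call(sum, [1, 2, 3])
print(f"elapsed: {elapsed:.4f} s")
```

`time.perf_counter` measures wall-clock time, which corresponds to the `elapsed` column; `time.process_time` would give the combined user and system CPU time instead.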

Shiny App