SwiftKey Capstone Project

Anh Nguyen
Mar 8, 2017

Coursera Data Science Capstone Project

Project Goal

On mobile devices, text prediction helps users type faster and more accurately. With this challenge in mind, we set out to build a fast and accurate app that predicts the next word in a given sentence.

Product Introduction

Features:

  • Reactive: The app updates its predictions as the user types.
  • 3 predictions: The app offers three candidate words to increase the chance of covering the intended word.
  • Profanity filter: To avoid potential legal exposure, the app also censors profanity.
  • Accurate and fast: The app reaches 25.1% prediction accuracy on out-of-sample data while keeping load and prediction times short.

Explanation of the internal algorithm

The first prototype of our product used a complex design whose logic depended on the presence of stopwords. That model had a low prediction accuracy of 10.1%. Through research, we found that the biggest improvements in accuracy came from:

  1. selecting the right data (type of data, eliminating outliers)
  2. cleaning the data source (punctuation, ordinal numbers)
  3. the amount of data used in the training set
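
As an illustration of the cleaning step, here is a minimal Python sketch. The original pipeline's exact rules and language are not stated, so the function name `clean_line` and both regular expressions are assumptions, not the project's actual code. It lowercases a line, drops ordinal numbers such as "3rd", removes punctuation and digits, and splits the result into tokens.

```python
import re

# Hypothetical cleaning rules; the project's real rules are not documented here.
ORDINAL = re.compile(r"\b\d+(?:st|nd|rd|th)\b")   # matches 1st, 2nd, 3rd, 10th, ...
NON_WORD = re.compile(r"[^a-z' ]+")               # anything except letters, apostrophes, spaces

def clean_line(text: str) -> list[str]:
    """Lowercase, drop ordinal numbers, punctuation, and digits, then tokenize."""
    text = text.lower()
    text = ORDINAL.sub(" ", text)
    text = NON_WORD.sub(" ", text)
    return text.split()
```

For example, `clean_line("On the 3rd day, we WON!")` yields the tokens `on`, `the`, `day`, `we`, `won`.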

Explanation of the internal algorithm

This release of the prediction model uses a simple backoff model. Depending on the number of words provided, we first try to match the last 3 words against a 4-gram data set. If no match is found, we repeat the process with the last 2 words against the 3-gram data set, and then with the last word against the 2-gram data set. The n-gram data sets were also trimmed to increase prediction speed by keeping at most 3 predictions for each preceding word sequence; since we only provide 3 predictions, any excess is unnecessary.
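
The backoff lookup described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the app's implementation: the function names `build_tables` and `predict` are hypothetical, and ranking candidates by raw frequency is an assumption, since the source does not say how the 3 retained predictions are chosen.

```python
from collections import Counter, defaultdict

def build_tables(tokens, max_n=4, top_k=3):
    """Map each (n-1)-word prefix to at most top_k next words.

    Assumption: candidates are ranked by how often they follow the prefix.
    """
    counts = {n: defaultdict(Counter) for n in range(2, max_n + 1)}
    for n in range(2, max_n + 1):
        for i in range(len(tokens) - n + 1):
            prefix = tuple(tokens[i:i + n - 1])
            counts[n][prefix][tokens[i + n - 1]] += 1
    # Trim every prefix to its top_k continuations to keep the tables small.
    return {
        n: {p: [w for w, _ in c.most_common(top_k)] for p, c in table.items()}
        for n, table in counts.items()
    }

def predict(tables, words, max_n=4):
    """Try the longest usable prefix first, then back off to shorter ones."""
    for n in range(max_n, 1, -1):
        prefix = tuple(words[-(n - 1):])
        # Skip this order if the user has not typed enough words for it.
        if len(prefix) == n - 1 and prefix in tables[n]:
            return tables[n][prefix]
    return []  # no match at any order
```

With tables built from a cleaned token list, `predict(tables, ["i", "love", "new"])` first consults the 4-gram table and only falls back to the 3-gram and 2-gram tables when the longer prefix is unseen.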

Our accuracy of 25.1% was achieved by using 50% of the training set and then cutting back dramatically on single-occurrence tokens. This produces a data set that stays accurate while keeping a small file size, which gives fast load and prediction speeds. Prediction accuracy would most likely go up with a larger training set, but we were constrained by time and resources. All accuracy measurements use out-of-sample error (OOSE) validation data sets with 10k observations.
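
To make the pruning and validation steps concrete, here is a hedged Python sketch. The helper names `prune_singletons` and `top_k_accuracy` are hypothetical, the cutoff of "seen more than once" is one reading of "single-occurrence", and counting a hit when the true word appears among any of the first 3 predictions is an assumption; the source does not state how a correct prediction was scored.

```python
from collections import Counter

def prune_singletons(ngram_counts: Counter) -> Counter:
    """Drop entries that were observed only once in the training data."""
    return Counter({gram: c for gram, c in ngram_counts.items() if c > 1})

def top_k_accuracy(predict_fn, samples, k=3):
    """Fraction of held-out (context, next_word) pairs where next_word
    is among the first k predictions returned for that context.

    Assumption: a prediction counts as correct if any of the k words matches.
    """
    if not samples:
        return 0.0
    hits = sum(1 for context, target in samples if target in predict_fn(context)[:k])
    return hits / len(samples)
```

Here `predict_fn` stands for any function that takes a list of preceding words and returns a ranked list of candidates, and `samples` would be drawn from data that was never used for training.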

Link to the Shiny App and Instructions

Link to app:
https://tudinhhuong303.shinyapps.io/swiftkey/

Instructions:
After the app loads, enter your text into the input box. The app displays what you entered along with a suggested completion of the word you are currently typing. Enjoy!

Shiny Application Developed by:

  • Anh Nguyen