Data Science Capstone: SwiftKey
Di Zhu
3/2/2021
Background
- This Data Science Capstone project is provided by SwiftKey, which builds a smart keyboard that makes it easier for people to type on their mobile devices.
- Our goal is to understand and build predictive text models like those used by SwiftKey, which suggest possible next words as the user types.
- In this project, we first did an exploratory analysis to understand the variation in frequencies of words and word pairs in the data.
- Then we built a Shiny App with a prediction model which gives you the predicted next word based on the N-grams extracted from the training data sets.
Prediction Model
- The model is trained on 70% of the three data sets (blogs, news and Twitter) provided by SwiftKey.
- The data is cleaned, then 3-grams, 2-grams and 1-grams are extracted and their frequencies are counted.
- Only the N-grams that appear more than 3 times are kept to reduce the size of the app and the processing time.
- The model searches the 3-grams first, then the 2-grams and 1-grams, to find possible next words.
- Since many N-grams never appear in the training data, some form of smoothing is necessary. Here I use Katz's back-off model with Good-Turing smoothing.
- Katz's back-off model: link. Good-Turing frequency estimation: link
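The counting, pruning and back-off search described above can be sketched as follows. This is a minimal illustration in Python, not the code behind the app: the function names (`tokenize`, `count_ngrams`, `prune`, `predict_next`) and the toy corpus are hypothetical, the cleaning step is reduced to lowercasing and dropping non-alphabetic tokens, and candidates are ranked by raw counts instead of Katz probabilities with Good-Turing discounting.

```python
from collections import Counter

def tokenize(text):
    # Stand-in for the real cleaning step: lowercase, keep alphabetic tokens only.
    return [w for w in text.lower().split() if w.isalpha()]

def count_ngrams(sentences, n):
    # Count every run of n consecutive tokens within each sentence.
    counts = Counter()
    for sentence in sentences:
        toks = tokenize(sentence)
        for i in range(len(toks) - n + 1):
            counts[tuple(toks[i:i + n])] += 1
    return counts

def prune(counts, min_count=3):
    # Keep only the N-grams that appear more than min_count times.
    return {gram: c for gram, c in counts.items() if c > min_count}

def predict_next(text, tables, k=3):
    # Search 3-grams first, then 2-grams, then 1-grams, until k words are found.
    # Simplification: ranks by raw count, not by Katz back-off probability.
    toks = tokenize(text)
    predictions = []
    for n in (3, 2, 1):
        context = tuple(toks[-(n - 1):]) if n > 1 else ()
        if len(context) != n - 1:
            continue  # not enough typed words for this order
        matches = [(gram[-1], c) for gram, c in tables[n].items()
                   if gram[:-1] == context]
        matches.sort(key=lambda m: (-m[1], m[0]))  # most frequent first
        for word, _ in matches:
            if word not in predictions:
                predictions.append(word)
            if len(predictions) == k:
                return predictions
    return predictions

# Toy corpus (made up for illustration), repeated so counts pass the threshold.
corpus = (["i love data science"] * 5
          + ["i love data analysis"] * 4
          + ["we love coffee"] * 4)
tables = {n: prune(count_ngrams(corpus, n)) for n in (1, 2, 3)}

print(predict_next("i love data", tables, k=3))  # ['science', 'analysis', 'love']
print(predict_next("they love", tables, k=2))    # ['data', 'coffee']
```

In the second call no 3-gram starts with "they love", so the search falls back to the 2-grams starting with "love". A full Katz back-off model would also discount the higher-order counts and redistribute the freed probability mass to the lower-order estimates.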
The Shiny App
- The Shiny App is designed to be simple to use.
- Enter a few words or a sentence, select how many next-word predictions you want, and press the submit button.
- Within a few seconds, the App returns a table with the predicted next words and their Katz probabilities.
- The 'details' tab shows instructions for the App and a brief description of the algorithm.
- The Shiny App: Link
Interface of The App

- Thanks, and congratulations on your accomplishment!