Next Word Prediction: Simple SwiftKey

Sufia Khatun
September, 2016

Coursera Data Science Capstone Project

In association with SwiftKey

App Overview

It is a user interface word predictor app. This app performs the following task:

  • It takes any number of words as user input.
  • Perform prediction algorithm for predicting next words.
  • Predict possible next three words.

This app will be found here

Instructions

  • Initially this app will take approximately 50 sec to upload the data files.
  • This app has navigation panel with five tab panels:
    1. Word Predictor
    2. User Guide
    3. Documentation
    4. References
    5. Contact
  • The user will type a text and click on 'submit' button
  • App will show the three possible next words within 2 seconds.

Prediction Algorithm

Different algorithms were investigated for this app such as Katz's back-off model, Stupid backoff model and Kneser-Ney smoothing. It is found from the study that stupid backoff model is much easier and simpler for the web app.

N-gram model was used to predict next words. This app uses 5-grams to 2-grams to predict words based on Stupid backoff model. The algorithm defines following scoring function:

\[ \begin{aligned} S(w^{i}|w^{i−1}_{i-k+1}) = \begin{cases} \frac{freq\left ( w^{i}_{i-k+1} \right )}{freq\left ( w^{i-1}_{i-k+1} \right )}, & \text{ if } freq\left ( w^{i}_{i-k+1} \right ) >0 \\ \alpha*s(w^{i}|w^{i-1}_{i-k+2}), & \text{} Otherwise \end{cases} \end{aligned} \]

Where, S( ) = score of canditate \( w^{i} \) for a given condition \( w^{i-1}_{i-k+1} \) in n-gram. \( s(w^{i}|w^{i-1}_{i-k+2}) \) is the score of canditate \( w^{i} \) for a given condition \( w^{i-1}_{i-k+2} \) in (n-1)-gram and \( \alpha \) is equal to 0.4.

Example of Stupid Backoff Model

  • Suppose, the user typed “It would mean the”. One of the possible next words is “world”.
  • This model will look for freq('world' | 'it would mean the') in 5-gram and divide by the total freq('it would mean the') from 4-gram to get the score for 'world'. This process will continue looking back (n-1) grams until the possible next three words found. In equation,

\[ Score\left (world\right ) = \frac{freq\left (\text{world}\right )_{n=5}}{freq(\text{it would mean the})_{n=4}},\ \ \text{if}\ freq( \text{it would mean the world})_{n=5} >0\\ \] The score of possible next three words of “it would mean the” are:

Possible Next Word Freq-5gram Freq-4gram Freq-3gram Score
world 187 202 301 0.94
difference 0 0 40 0.03
same 0 0 35 0.03