SwiftKey Data Science Capstone Project

Rithesh Kumar
Sat Apr 18 14:25:38 2015

The goal of this project is to let a user enter a phrase into the application and have it predict the next word they most likely want to type.
The primary use case for this application is text messaging on mobile phones.
The data available for training the predictive model consists of millions of English tweets, blog posts, and news articles.
Milestone Report Link : Milestone Report
Application link : Shiny App - Next Word Prediction
Github Link : Codes

Preprocessing the text (e.g. filter non-English words, symbols)
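
A minimal sketch of this cleaning step, written in Python for illustration only (it is not the project's own code). The function name `clean_text` and the rule "drop any word containing non-ASCII characters" are assumptions standing in for the non-English filter, which the slides do not specify.

```python
import re

def clean_text(line):
    """Lowercase a line, drop URLs, non-ASCII words, digits, and symbols."""
    line = line.lower()
    # Remove URLs first so their fragments do not survive as words.
    line = re.sub(r"https?://\S+", " ", line)
    kept = []
    for word in line.split():
        # Crude stand-in for "non-English": skip words with non-ASCII characters.
        if not word.isascii():
            continue
        # Keep only letters and apostrophes inside each remaining word.
        word = re.sub(r"[^a-z']", "", word)
        if word:
            kept.append(word)
    return " ".join(kept)
```

For example, `clean_text("Check THIS out: http://t.co/abc #wow!! I'm 100% sure")` yields `check this out wow i'm sure`.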

Tokenization
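
Tokenization could look roughly like the Python sketch below. The regular expression and the name `tokenize` are illustrative assumptions; the project may split words differently.

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens, keeping contractions such as "i'm" whole."""
    # One or more letters, optionally followed by apostrophe + letters groups.
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
```

With this pattern, `tokenize("I'm going to the store, aren't you?")` returns the seven tokens `i'm`, `going`, `to`, `the`, `store`, `aren't`, `you`.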

Prepare unigrams, bigrams, trigrams and quadgrams from the data

Count the occurrences of each unique unigram, bigram, trigram and quadgram
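
As a rough illustration of building and counting n-grams, the Python sketch below uses `collections.Counter`. The helper names `ngrams` and `count_ngrams` are hypothetical, and the input is assumed to be a list of already tokenized sentences.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all consecutive n-token tuples in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_ngrams(sentences, max_n=4):
    """Count unigrams up to quadgrams over a list of token lists."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for tokens in sentences:
        for n in counts:
            counts[n].update(ngrams(tokens, n))
    return counts

sentences = [["i", "love", "new", "york"], ["i", "love", "you"]]
counts = count_ngrams(sentences)
# counts[2][("i", "love")] is 2: the bigram occurs once in each sentence.
```

Counting per sentence keeps n-grams from spanning sentence boundaries, which is one possible design choice among several.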

Calculate probabilities for each n-gram using Maximum Likelihood Estimation and simple linear interpolation
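
The two estimates can be sketched in Python as follows. This is an illustration under stated assumptions, not the project's implementation: the interpolation weights `lambdas` are made-up values (in practice they would be tuned on held-out data), and when the context is shorter than three words the weights of the unusable orders are dropped and the rest renormalized.

```python
from collections import Counter

def count_ngrams(tokens, max_n=4):
    """Return {n: Counter of n-gram tuples} for n = 1..max_n."""
    return {
        n: Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
        for n in range(1, max_n + 1)
    }

def mle(counts, context, word):
    """Maximum likelihood estimate: count(context + word) / count(context)."""
    context = tuple(context)
    if not context:
        total = sum(counts[1].values())
        return counts[1][(word,)] / total if total else 0.0
    seen = counts[len(context)][context]
    return counts[len(context) + 1][context + (word,)] / seen if seen else 0.0

def interpolated(counts, context, word, lambdas=(0.1, 0.2, 0.3, 0.4)):
    """Weighted sum of unigram..quadgram MLEs (weights are assumed values)."""
    context = tuple(context)[-(len(lambdas) - 1):]
    score, used = 0.0, 0.0
    for order, lam in enumerate(lambdas, start=1):
        needed = order - 1  # context words required by this n-gram order
        if needed > len(context):
            break
        ctx = context[len(context) - needed:]
        score += lam * mle(counts, ctx, word)
        used += lam
    # Renormalize by the weights actually used when the context is short.
    return score / used if used else 0.0

tokens = "the cat sat on the mat the cat ran".split()
counts = count_ngrams(tokens)
p = interpolated(counts, ("the",), "cat")
```

Mixing in the lower-order estimates means a word can still receive a non-zero score when its exact trigram or quadgram never appeared in the training sample.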

Get the text phrase from the user

Extract the last three tokens (e.g. prev1, prev2) from the phrase. If the phrase is not long enough, extract the last two tokens or the last token.
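
A short Python sketch of the context extraction described above. The name `last_tokens` is hypothetical, and whitespace splitting is a simplifying assumption in place of the full tokenizer.

```python
def last_tokens(phrase, max_tokens=3):
    """Return up to the last `max_tokens` words of the phrase, lowercased."""
    tokens = phrase.lower().split()
    # Negative slicing returns the whole list when the phrase is shorter.
    return tokens[-max_tokens:]
```

A five-word phrase such as `"I would like to go"` gives `["like", "to", "go"]`, while `"hello there"` gives both of its words and a single word gives just that word.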

Screenshot Of The App
Instructions
- Wait 10 seconds for the app to load
- Enter text in input textbox
- Top 3 most probable next words are displayed in the output textbox
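
To illustrate how a top-3 list might be produced, the Python sketch below ranks candidate next words by how often they followed the previous word. It is deliberately simplified to bigram counts only; the app itself is described as using interpolated n-gram probabilities, and the names `build_followers` and `top_next_words` are hypothetical.

```python
from collections import Counter

def build_followers(tokens):
    """Map each word to a Counter of the words that directly follow it."""
    followers = {}
    for prev, nxt in zip(tokens, tokens[1:]):
        followers.setdefault(prev, Counter())[nxt] += 1
    return followers

def top_next_words(followers, prev_word, k=3):
    """Return the k most frequent followers of prev_word (empty if unseen)."""
    ranked = followers.get(prev_word, Counter()).most_common(k)
    return [word for word, _ in ranked]
```

An unseen previous word returns an empty list here; a real application would need some fallback, such as the most frequent unigrams.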

Limitations

The laptop's built-in RAM was not enough to handle the sheer size of the data
Only a representative sample of ~1% of the data was used to train the model
Sparse terms were removed during term-document matrix creation
The prediction model is biased towards the training data, so prediction of words not seen in training is not very accurate
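
Drawing a ~1% sample without loading the whole corpus into memory can be sketched in Python as below. The function name `sample_lines`, the fixed seed, and per-line Bernoulli sampling are assumptions; the slides do not say how the sample was drawn.

```python
import random

def sample_lines(lines, fraction=0.01, seed=42):
    """Keep each line independently with probability `fraction`."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return [line for line in lines if rng.random() < fraction]
```

Because `lines` can be any iterable, passing an open file object lets the corpus be read one line at a time, which keeps memory use proportional to the sample rather than to the full data.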
