SwiftKey Text Prediction Application

Kapil Malik
25 Apr, 2015

Introduction

This application predicts the next word by looking at the previous 1, 2 or 3 words of the user's input text. It builds an n-gram model from an English text corpus consisting of:

Blogs (about 900,000 blog posts)

News articles (about 1 million news articles)

Twitter feed (about 2 million tweets)

About the text corpus : The data comes from a corpus called HC Corpora. It is analyzed in depth in the SwiftKey Data Analysis report, submitted as part of the milestone submission. I studied the distribution of unigrams, bigrams and trigrams in the data to understand how many n-grams are needed to represent 90% of the total data.
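As a rough illustration of that coverage question, the sketch below counts how many of the most frequent n-grams are needed to account for a given share of all occurrences. The function name `ngrams_needed` and the toy counts are hypothetical and are not taken from the milestone report.

```python
from collections import Counter

def ngrams_needed(counts, coverage=0.9):
    """Return how many of the most frequent n-grams are needed so that
    their occurrences reach at least `coverage` of all occurrences."""
    total = sum(counts.values())
    running = 0
    # Walk the n-grams from most to least frequent, accumulating counts.
    for needed, (_, count) in enumerate(counts.most_common(), start=1):
        running += count
        if running >= coverage * total:
            return needed
    return len(counts)

# Toy unigram counts (made-up numbers, not from HC Corpora).
toy_counts = Counter({"the": 50, "of": 30, "and": 12, "cat": 5, "zebra": 3})
print(ngrams_needed(toy_counts, coverage=0.9))
```

On real data the counts would come from the full corpus, and the same function could be run separately for unigrams, bigrams and trigrams.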

Algorithm : NGram Models

Clean the data by keeping only alphabetic characters and the apostrophe (') character.

Generate a list of all bigrams available in the data, along with their counts.
Example (“This is”, 20000)

Group the bigrams by their first word (i.e. use only a unigram as the lookup key). Sort the second words in descending order of count.
Example (“This”, Array([(“is”,20000), (“has”,5000)]))

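Keep the first entry in the sorted second-word list as the most probable suggestion, along with its confidence, computed as #count/#total.
Example (“This”, “is”, 0.8)
This gives the unigram model. Similarly, build the bigram and trigram models, keyed on the previous 2 and 3 words.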
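The steps above can be sketched in plain Python as follows. This is only a minimal illustration on a toy corpus, not the Spark job that was actually used; the names `clean` and `build_model` are hypothetical, and case is left untouched because the source does not say whether text was lowercased.

```python
import re
from collections import Counter, defaultdict

def clean(text):
    """Keep only letters and apostrophes; everything else becomes a space."""
    return re.sub(r"[^A-Za-z']+", " ", text).split()

def build_model(lines, n):
    """Map each n-word context to (most frequent next word, confidence)."""
    followers = defaultdict(Counter)
    for line in lines:
        words = clean(line)
        # Count every (n-word context, next word) pair in the line.
        for i in range(len(words) - n):
            context = tuple(words[i:i + n])
            followers[context][words[i + n]] += 1
    model = {}
    for context, counts in followers.items():
        # Keep only the top follower, with confidence = count / total.
        word, count = counts.most_common(1)[0]
        model[context] = (word, count / sum(counts.values()))
    return model

# Toy corpus (made up for illustration).
corpus = ["This is fine.", "This is good", "This is it",
          "This is ok!", "This has gone"]
unigram_model = build_model(corpus, 1)
print(unigram_model[("This",)])  # ('is', 0.8)
```

Calling `build_model(corpus, 2)` or `build_model(corpus, 3)` would give the corresponding bigram and trigram models in the same shape.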

Note : I used Apache Spark to process the raw data and output CSV files. These were in turn converted to RDS files using R, to be loaded by the application.

Algorithm : Prediction

The application uses all 500,000+ unigram models, but only 1 million bigram models (out of 13 million, representing 80% of all bigrams) and only 200,000 trigram models.

Please note that an n-gram model here looks up the last n words to predict the next word.
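One way to cut a model down to a fixed size is to keep only the most frequent contexts and check what share of all occurrences they still cover. The sketch below assumes selection by frequency, which the source does not state explicitly; `prune_by_count` and the toy counts are hypothetical.

```python
def prune_by_count(context_counts, keep):
    """Keep the `keep` most frequent contexts and report the share of
    all occurrences that the kept contexts account for."""
    ranked = sorted(context_counts.items(), key=lambda kv: kv[1], reverse=True)
    kept = dict(ranked[:keep])
    total = sum(context_counts.values())
    share = sum(kept.values()) / total if total else 0.0
    return kept, share

# Toy bigram-context counts (made-up numbers).
toy_contexts = {
    ("this", "is"): 6,
    ("is", "a"): 2,
    ("a", "test"): 1,
    ("test", "case"): 1,
}
kept, share = prune_by_count(toy_contexts, keep=2)
print(len(kept), share)  # 2 0.8
```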

Backoff Prediction

Look at the last 3 words of the input. If they are available in the trigram model, return the word and confidence.

Else, look at the last 2 words of the input. If they are available in the bigram model, return the word and confidence.

Else, look at the last word of the input. If it is available in the unigram model, return the word and confidence.

Else, return the most commonly occurring word (which is “the”).
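The backoff steps above might look like the following sketch. It assumes the models are held in a dictionary keyed by context length, each mapping a tuple of words to a (word, confidence) pair; that layout, the name `predict`, the toy entries and the 0.0 confidence attached to the fallback word are all illustrative assumptions.

```python
def predict(text, models, fallback="the"):
    """Try the longest context first, then back off to shorter ones."""
    words = text.split()
    for n in sorted(models, reverse=True):
        if len(words) >= n:
            context = tuple(words[-n:])
            if context in models[n]:
                return models[n][context]
    # No model matched: fall back to the most common word.
    # The confidence for this case is not given in the source; 0.0 is a placeholder.
    return fallback, 0.0

# Toy models (made-up entries and confidences).
models = {
    3: {("one", "of", "the"): ("most", 0.4)},
    2: {("of", "the"): ("year", 0.3)},
    1: {("the",): ("first", 0.1)},
}
print(predict("this is one of the", models))  # ('most', 0.4)
print(predict("end of the", models))          # ('year', 0.3)
print(predict("see the", models))             # ('first', 0.1)
print(predict("hello", models))               # ('the', 0.0)
```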

Using SwiftKey Text Prediction Application

The application is available here
Input Panel

Enter the text in the input box.

Select the n-gram complexity: trigram (default), bigram or unigram.
If you select, say, unigram, then only the unigram model is used and the bigram and trigram models are ignored.

Text Prediction Output

Prediction : Prediction of next word for the input text

Full Sentence : Full sentence by concatenating input text with prediction

Confidence : Confidence percentage of word prediction

Useful links

You can find out more about the application and dataset here:

Application GitHub link : Application Code
Data processing script (Spark) GitHub link : Data Processing Script
Application (Shiny server) link : Shiny Application
Milestone report link : Milestone Report