Coursera Capstone: SwiftKey Word Predictor

Sam Koon, August 2015

Introduction

The SwiftKey Word Predictor App was built from a corpus of blog posts, news articles and tweets that Coursera provided as the training set for this project. Natural Language Processing (NLP) techniques are used to predict the word that the user is most likely to enter next.

The App takes a word or phrase as input and generates predictions for the next word. One of the challenges encountered during development was ensuring that both the source code and the data files were small enough to be loaded onto the Shiny server while still running responsively.

Algorithm

The App relies on N-grams of size 1 through 4 only. A simple back-off strategy has also been implemented to allow for unobserved combinations of words:

First, rare combinations (i.e. those observed only once in a dataset), corrupted words and profanity were discarded.
Then the N-gram frequencies calculated for each dataset (blogs, news and Twitter) were averaged to obtain the final frequency table. Averaging was needed because the datasets have different characteristics, which would have been diluted had the frequencies been calculated over a single aggregated set of texts.
Finally, based on the frequencies obtained, only the 5 most probable words following the last 1 to 3 words of the text input are returned as suggestions to the user.
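
The three steps above can be sketched roughly as follows. This is an illustrative Python sketch, not the App's actual source: the function names (ngram_counts, relative_freqs, averaged_table, build_model, predict), the whitespace tokenization and the toy corpora are all assumptions made for this example, and the filtering of corrupted words and profanity is left out. It also assumes that when no context matches, the predictor falls back to the most frequent single words.

```python
from collections import Counter, defaultdict

MAX_N = 4  # N-grams of size 1 through 4, as described above


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def relative_freqs(tokens, n, min_count=2):
    """Relative frequencies of the n-grams seen at least min_count times."""
    counts = ngram_counts(tokens, n)
    kept = {g: c for g, c in counts.items() if c >= min_count}  # drop singletons
    total = sum(kept.values())
    return {g: c / total for g, c in kept.items()} if total else {}


def averaged_table(corpora, n, min_count=2):
    """Average the per-source frequencies; a source lacking an n-gram counts as 0."""
    summed = defaultdict(float)
    for tokens in corpora:
        for gram, freq in relative_freqs(tokens, n, min_count).items():
            summed[gram] += freq
    return {g: s / len(corpora) for g, s in summed.items()}


def build_model(corpora, min_count=2):
    """Map each context (a tuple of 0 to 3 words) to {next_word: averaged frequency}."""
    model = defaultdict(dict)
    for n in range(1, MAX_N + 1):
        for gram, freq in averaged_table(corpora, n, min_count).items():
            model[gram[:-1]][gram[-1]] = freq
    return model


def predict(model, text, k=5):
    """Back off from the longest matching context to shorter ones."""
    words = text.lower().split()
    for size in range(min(MAX_N - 1, len(words)), -1, -1):
        context = tuple(words[len(words) - size:])  # last `size` words
        candidates = model.get(context)
        if candidates:
            # Highest frequency first; ties broken alphabetically (an assumption).
            ranked = sorted(candidates.items(), key=lambda kv: (-kv[1], kv[0]))
            return [word for word, _ in ranked[:k]]
    return []


# Toy stand-ins for the blogs and news datasets (hypothetical data).
blogs = "thank you so much for the help thank you so much for the food".split()
news = "thank you so much indeed thank you so much indeed".split()
model = build_model([blogs, news])

print(predict(model, "you so much"))  # ['indeed', 'for']
print(predict(model, "zzz thank"))    # backs off to the 1-word context: ['you']
```

Averaging per-source relative frequencies, rather than pooling raw counts, keeps a large source from drowning out a smaller one, which is the motivation given in the second step.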

Features of the App

The Word Predictor App is available at the following link. Please allow a few moments for it to load after you click the link.
https://samkoon.shinyapps.io/wordpredictor
Start typing a sentence in the text input on the left side of the Word Predictor App, and it will begin suggesting the most probable next words.

References

Natural language processing Wikipedia page:
http://en.wikipedia.org/wiki/Natural_language_processing
Text mining infrastructure in R:
http://www.jstatsoft.org/v25/i05/
CRAN Task View: Natural Language Processing:
http://cran.r-project.org/web/views/NaturalLanguageProcessing.html
N-gram Wikipedia page:
https://en.wikipedia.org/wiki/N-gram