Next Word Prediction App

Suresh Subramaniam
03/14/2018

Introduction and Background

This app displays the most probable next word based on the words the user has typed so far. As the user types, the app suggests the next word. Some practical applications of this framework are:

Speed up text input and typing in constrained environments like mobile phones or small keyboards
Help people with writing difficulties
Help people with inadequate language skills or those writing in a foreign language
Use as a supervised or unsupervised text generator
Improve the quality of written communication by detecting mistakes and reducing grammatical and spelling errors

Approach and Methodology

The high-level details of how the app was built are below:

The training and testing data were sampled from Twitter, blog and news text. The source data is available at https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
Due to the large volume of data, a 10% random sample of each of the three datasets was extracted, and the samples were combined to train the model. A 1% sample of the combined dataset was used to test the model, quantify its accuracy and perform cross-validation.
Various optimization techniques were used to find the best balance between accuracy, performance and data volume.
The prediction was done after reducing the data to n-grams of up to five words (5-grams). An n-gram is a contiguous sequence of n words in a document. More details at https://en.wikipedia.org/wiki/N-gram
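The sampling and n-gram steps above can be sketched as follows. The app itself is written in R; this sketch uses Python purely for illustration. The function names, the seed, the whitespace tokenizer and the two-line corpus are all assumptions made for the example, not the author's code, and the exact sampling method used in the app is not stated.

```python
import random
from collections import Counter


def sample_lines(lines, fraction, seed=42):
    """Keep each line independently with probability `fraction`.

    Hypothetical helper: yields roughly `fraction` of the input lines.
    """
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]


def tokenize(line):
    """Lowercase and split on whitespace (real cleaning would do more)."""
    return line.lower().split()


def ngram_counts(lines, n):
    """Count every run of n contiguous words across the given lines."""
    counts = Counter()
    for line in lines:
        words = tokenize(line)
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts


# Tiny made-up corpus standing in for the Twitter, blog and news text.
corpus = ["the cat sat on the mat", "the cat ate the fish"]

bigrams = ngram_counts(corpus, 2)     # 2-grams
fivegrams = ngram_counts(corpus, 5)   # 5-grams, the highest order used

# On the real data a call such as sample_lines(all_lines, 0.10)
# would keep roughly 10% of the lines.
```

Counting n-grams as tuples of words keeps the lookup tables simple: the first n-1 words act as the context, and the last word is the candidate prediction.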

Approach and Methodology (Continued)

A backoff method was used to predict the probable next word. The phrase typed by the user is first searched for in the highest-order n-gram table. If it is found, the corresponding next word is presented; otherwise the first word of the phrase is dropped and the shortened phrase is searched for in the next lower-order n-gram table, and so on until the unigram level is reached. If the phrase is not found in any n-gram table, the most frequently occurring single word is presented. More details at https://www.quora.com/What-is-backoff-in-NLP
The UI for the app was built using the Shiny package and deployed on shinyapps.io.
A more detailed explanation of the methodology, with suggestions on how to improve accuracy and performance, is available at http://rpubs.com/joresh/368348
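The backoff lookup described above can be sketched as follows. As before, this is illustrative Python rather than the app's R code. The table layout (n-gram order mapped to a dictionary from context words to one next word), the name predict_next and the sample entries are assumptions for the example. It is a plain backoff with no probability discounting.

```python
def predict_next(phrase, tables, fallback, max_order=5):
    """Return a next-word guess using simple backoff.

    tables: {n: {tuple of n-1 context words: next word}} (assumed layout)
    fallback: the most frequent single word, used when nothing matches
    """
    words = phrase.lower().split()
    # An n-gram table is keyed by n-1 context words, so keep at most
    # max_order - 1 of the most recent words.
    keep = max_order - 1
    context = words[-keep:] if keep > 0 else []
    while context:
        table = tables.get(len(context) + 1, {})
        hit = table.get(tuple(context))
        if hit is not None:
            return hit
        # Not found: drop the first word and try the next lower order.
        context = context[1:]
    return fallback


# Made-up lookup tables for illustration only.
tables = {
    3: {("thanks", "for"): "your"},
    2: {("happy",): "birthday"},
}

first = predict_next("many thanks for", tables, fallback="the")
second = predict_next("so happy", tables, fallback="the")
third = predict_next("xylophone", tables, fallback="the")
```

Here the first phrase misses at the 4-gram level and matches in the 3-gram table, the second backs off to the 2-gram table, and the third matches nothing and falls through to the most frequent single word.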

Architecture and Tools Used

The code for data loading, cleaning, transformation and reporting is all written in R.
Generic functions were developed for each step in the process, and they can be applied to any dataset (within the constraints mentioned in the document at http://rpubs.com/joresh/368348).
The R packages used include base R, dplyr, tm, tidytext, data.table and stringr.
The app is deployed to the cloud using Shiny.

The link to the application is https://joresh.shinyapps.io/wordpred/