Coursera Data Science Capstone Final Project

Text Prediction Algorithm and Web Application

Jan Tatham
June 5th, 2016

This presentation introduces an NLP application that predicts the next word a user will type. The application is the Capstone Project for the Coursera Data Science Specialization, offered by Johns Hopkins University in cooperation with SwiftKey.

Objective

The goal of this project is to develop a Shiny web application that predicts the next word based on the text entered by the user.

To guide the development of the algorithm and application, the work was divided into seven subtasks:

Understanding the Problem
Data Acquisition and Cleaning
Exploratory Data Analysis
Statistical Modeling
Building a Prediction Model
Creative Exploration
Data Product

Data Analysis and Model Training

For the project, SwiftKey provided a large sample of text drawn from news articles, tweets and blog posts. Because of the size of the data, a very small random subset was extracted from each of the three sources. A corpus was created from this subset and then cleaned by converting it to lowercase and removing punctuation, numbers, extra whitespace, non-alphabetic characters, and profanity.
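The cleaning steps described above might look roughly like the following Python sketch. This is an illustration only, not the code used in the project: the function name `clean_text`, the two-word `PROFANITY` placeholder list, and the decision to drop apostrophes so contractions stay as one token are all assumptions.

```python
import re

# Hypothetical placeholder list; a real filter would load a full profanity word list.
PROFANITY = {"darn", "heck"}

def clean_text(text, banned=PROFANITY):
    """Lowercase the text, keep only letters, drop banned words, collapse whitespace."""
    text = text.lower()
    # Assumption: drop apostrophes so "don't" becomes "dont" rather than "don t".
    text = text.replace("'", "")
    # Replace punctuation, digits and any other non-alphabetic character with a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    # split() with no arguments also collapses runs of whitespace.
    words = [w for w in text.split() if w not in banned]
    return " ".join(words)

print(clean_text("Well, DARN it!  I have 2 dogs & 3 cats."))
```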

The corpus was then tokenized into n-grams, i.e., contiguous sequences of n words, and the frequency of each was counted. This produced four n-gram tables: 2-grams, 3-grams, 4-grams and 5-grams.
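As a rough sketch of the tokenization step, the Python snippet below splits cleaned sentences into n-grams and counts how often each one occurs. The helper names `ngrams` and `ngram_counts` and the two sample sentences are made up for illustration and are not taken from the project.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_counts(sentences, n):
    """Count n-grams sentence by sentence so they never span two sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(ngrams(sentence.split(), n))
    return counts

# Hypothetical sample; the real corpus is the cleaned SwiftKey subset.
sample = ["i love new york", "i love new shoes"]
bigrams = ngram_counts(sample, 2)
print(bigrams.most_common(2))
```

Sentences shorter than n words simply contribute no n-grams of that size.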

The model used for prediction is a simple back-off model. The algorithm extracts the last four words typed by the user and looks them up in the 5-gram table to find the most common next word. If no match is found, it falls back to the 4-gram table using the last three words, and so on down to the 2-gram table.
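The back-off lookup can be sketched in Python as shown below. This is a minimal illustration under stated assumptions, not the project's implementation: `build_model` and `predict` are hypothetical names, the toy corpus is invented, and candidates are ranked by raw counts with no smoothing or discounting.

```python
from collections import Counter, defaultdict

def build_model(sentences, max_n=5):
    """Map each context of 1 to max_n - 1 words to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for n in range(2, max_n + 1):
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])
                model[context][tokens[i + n - 1]] += 1
    return model

def predict(model, text, max_n=5, k=3):
    """Try the longest available context first, then back off to shorter ones."""
    tokens = text.lower().split()
    for size in range(min(max_n - 1, len(tokens)), 0, -1):
        context = tuple(tokens[-size:])
        if context in model:
            return [word for word, _ in model[context].most_common(k)]
    return []  # nothing matched; a real app might fall back to the most frequent words

# Hypothetical toy corpus standing in for the cleaned sample.
corpus = [
    "thanks for the follow",
    "thanks for the support",
    "thanks for the follow back",
    "see you in the morning",
]
model = build_model(corpus)
print(predict(model, "many thanks for the"))
```

In this example the four-word context is not in the model, so the lookup backs off to the three-word context "thanks for the" and ranks the words seen after it.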

The Application

To use the application, simply type a word, phrase or sentence into the text box located in the top left. The predicted word will appear in red on the right side of the screen. Underneath the predicted word is a table of other suggestions, ordered by probability. The number of predicted words returned can be increased or decreased using the slider on the left.

The Next Word Prediction Application can be found at:

https://sebity.shinyapps.io/Capstone-Final-Project/

Known Limitations

The application has limitations that could be addressed with further development. In its current form, the prediction algorithm depends entirely on, at most, the last four words entered. It does not take advantage of the wider context of the sentence, which is vital for communication. A few ways the application can be improved are:

The application should learn from the person using it. Learning the user's most frequently used words and phrases, as well as their sentence construction, would improve accuracy.
Because of hardware limitations, the current version of the app uses only a very small sample of the data provided. Using a larger sample of the dataset would likely increase prediction accuracy.