Johns Hopkins Coursera Data Science Capstone Project: Next Word Prediction

Vincent Amedekah
2016-09-21

Project Overview

The goal of this project is to create an application that uses a prediction algorithm to predict the next word after a user enters some words. The Shiny application takes a phrase as input and returns the predicted next word.

Data Processing Overview

The data source for the application is the SwiftKey data set, which contains text from blogs, Twitter and news. The data is read in as text and a corpus is created. The corpus is cleaned to remove profanity, stop words, numbers, etc. The corpus is then tokenized into 2-, 3-, 4- and 5-grams, which are used for prediction.
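The cleaning and tokenizing steps can be sketched roughly as follows. This is a minimal illustration in Python, not the project's actual code: the function names (clean, ngrams, build_tables) are hypothetical, and the tiny PROFANITY and STOP_WORDS sets are placeholders, since the real word lists are not given here.

```python
import re
from collections import Counter

# Placeholder word lists (assumptions); the real lists are not specified.
PROFANITY = {"darn"}
STOP_WORDS = {"the", "a", "an", "to", "of", "in"}

def clean(text):
    """Lowercase the text, drop digits and punctuation, then remove
    profanity and stop words. Returns a list of tokens."""
    text = text.lower()
    # Replace anything that is not a lowercase letter or whitespace.
    text = re.sub(r"[^a-z\s]", " ", text)
    return [w for w in text.split()
            if w not in PROFANITY and w not in STOP_WORDS]

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_tables(lines, orders=(2, 3, 4, 5)):
    """Count 2-, 3-, 4- and 5-grams over all cleaned lines."""
    tables = {n: Counter() for n in orders}
    for line in lines:
        tokens = clean(line)
        for n in orders:
            tables[n].update(ngrams(tokens, n))
    return tables

tables = build_tables(["Looking forward to seeing you in 2016!"])
```

With these placeholder lists, the sample sentence is reduced to the tokens looking, forward, seeing, you, so it contributes one 4-gram and no 5-gram.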

Application Algorithm Overview

The first task is to filter the user input using the same text cleaning process applied to the SwiftKey data. This includes removing numbers, punctuation, foreign characters, profanity, single-letter words, contractions, etc. Next we search for matches based on the user input. For example, for the input 'looking forward seeing', a match is 'looking forward seeing you'. If matches are found for the phrase shortened to its last 3 words, the algorithm returns the matching fourth word from the stored n-grams. Each match is assigned a log probability based on the back-off strategy implemented.
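The matching and scoring step might look something like the sketch below. The exact back-off strategy is not specified here, so this example assumes a simple "stupid backoff" style scheme: the longest context is tried first, and each step down to a shorter context adds a fixed log penalty. The function name predict, the penalty value 0.4, and the table layout (a dict mapping n to a Counter of n-gram tuples) are all assumptions made for illustration, and the scores are relative log scores rather than normalized probabilities.

```python
import math
from collections import Counter

def predict(tokens, tables, top_k=3, penalty=0.4):
    """Rank candidate next words for already-cleaned input tokens.

    tables maps n -> Counter of n-gram tuples. Contexts are tried from
    longest to shortest; each back-off step adds log(penalty).
    """
    scores = {}
    discount = 0.0  # accumulated log penalty for backing off
    max_order = max(tables)
    for n in range(min(max_order, len(tokens) + 1), 1, -1):
        context = tuple(tokens[-(n - 1):])
        # Linear scan for clarity; a real table would be indexed by context.
        matches = {g: c for g, c in tables.get(n, {}).items()
                   if g[:-1] == context}
        total = sum(matches.values())
        for gram, count in matches.items():
            word = gram[-1]
            # Keep the score from the longest context that saw this word.
            if word not in scores:
                scores[word] = math.log(count / total) + discount
        discount += math.log(penalty)
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]

# Toy tables (made-up counts) for the example phrase used above.
toy_tables = {
    4: Counter({("looking", "forward", "seeing", "you"): 1}),
    2: Counter({("seeing", "you"): 2, ("seeing", "them"): 1}),
}
ranked = predict(["looking", "forward", "seeing"], toy_tables)
```

In this toy example the word found through the full 3-word context ranks above the word found only after backing off to a 1-word context, which is the behavior the back-off penalty is meant to produce.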

Application Interface Overview

A Shiny application is created with a user input section containing an input box where the user enters the phrase, a control to select the number of predicted words to return, and a submit button below that triggers the prediction. An output panel to the right of the input section shows the phrase entered by the user, the cleaned form of the phrase, and a table of the predicted words with the log probability assigned to each prediction. The application can be accessed at https://mccosby2020.shinyapps.io/wordpredict/