Text Prediction: an application for predicting the next word

mpmartins1970
2017-01-02

Final Project Presentation
Coursera Data Science Capstone

Executive Summary

The main objective of this project is to build a web-based data product (a Shiny application) that reads a phrase entered by a user and uses predictive analytics to predict the next word to be typed.
The data used for this word prediction model were English texts from HC Corpora (Twitter, news and blog sources).
Principles of data science, text mining and natural language processing were required to complete this project.

Data Analysis and Manipulation

The HC Corpora dataset was cleaned, tokenized and normalized: text was converted to lowercase; special characters, punctuation, numbers and stopwords were removed; extra whitespace was stripped; and words were stemmed.
A sample of this cleaned corpus was used to generate frequency-sorted lists of n-grams (2-grams, 3-grams and 4-grams). The result was limited to the 300k most frequent n-grams to optimize the response time of the prediction model.
These n-grams were stored in data frames and then used to predict the next word for the user input.
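
The n-gram step above can be sketched in a few lines. This is only an illustration in Python, not the code of the application itself: the function names `normalize` and `top_ngrams` and the sample sentences are made up for the example, and stopword removal and stemming are left out to keep it short.

```python
import re
from collections import Counter


def normalize(text):
    """Lowercase the text, replace anything that is not a letter or
    whitespace with a space, and split on whitespace.
    (Simplified: no stopword removal and no stemming.)"""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    return text.split()


def top_ngrams(lines, n, limit):
    """Count n-grams line by line and return the `limit` most frequent
    ones as (ngram_tuple, count) pairs, sorted by descending count."""
    counts = Counter()
    for line in lines:
        tokens = normalize(line)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts.most_common(limit)


# Hypothetical sample standing in for the cleaned corpus sample.
sample = [
    "Thanks for the follow!",
    "Thanks for the update, 2 days late.",
    "Waiting for the bus",
]

bigrams = top_ngrams(sample, n=2, limit=3)
for ngram, count in bigrams:
    print(" ".join(ngram), count)
```

The `limit` argument plays the role of the 300k cap: only the most frequent n-grams are kept, so the lookup table stays small and fast.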

Prediction Algorithm

The user input is preprocessed (cleaned, tokenized and normalized) in the same way as the corpus dataset.
If the user typed more than three words, only the last three are used.
The algorithm looks up possible endings and returns the top suggestions for the next word.
When no predicted word is found, the algorithm backs off from 4-grams down to 1-grams.
If no matches are found, the most common unigrams will be shown.
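
The back-off lookup described above could look roughly like the following. This is a minimal Python sketch under assumptions, not the application's actual code: the function name `predict_next`, the layout of `tables` (context length mapped to a dictionary of context tuples and frequency-sorted candidate words) and all example entries are hypothetical.

```python
def predict_next(tokens, tables, unigrams, k=3):
    """Return (suggestions, n-gram order used).

    tokens   -- preprocessed words typed by the user
    tables   -- hypothetical layout: {context_length: {context_tuple: [words]}},
                with candidate words already sorted by frequency
    unigrams -- most common single words, used as the last fallback
    """
    context = tokens[-3:]  # only the last three words are used
    while context:
        candidates = tables.get(len(context), {}).get(tuple(context))
        if candidates:
            # A context of length m comes from the (m + 1)-gram table.
            return candidates[:k], len(context) + 1
        context = context[1:]  # back off: drop the oldest word
    return unigrams[:k], 1


# Made-up tables for illustration only.
tables = {
    3: {("love", "new", "york"): ["city", "times"]},
    2: {("new", "york"): ["city", "times", "giants"]},
    1: {("york",): ["city"]},
}
unigrams = ["just", "like", "get", "time"]

print(predict_next(["really", "love", "new", "york"], tables, unigrams))
print(predict_next(["visit", "new", "york"], tables, unigrams))
print(predict_next(["hello"], tables, unigrams))
```

The first call is answered from the 4-gram table, the second backs off to the 3-gram table, and the third finds no match at all and falls back to the most common unigrams.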

Shiny Application - Word Predictor

Using the prediction algorithm described, a web-based application was built and is available here: Word Predictor
When the app is launched, simply enter the text in the “Your Sentence:” input box and press the Predict button.

In the Algorithm Results tab you will see:

-> The Next Single Word Prediction
-> More Complete Word Predictions
-> The Original Sentence
-> Cleansed Text
-> Which data frame was used for the prediction and the time elapsed