Text Prediction: an application for predicting the next word

mpmartins1970
2017-01-02

Final Project Presentation
Coursera Data Science Capstone

Executive Summary

  • The main objective of this project is to build a web-based data product (a Shiny application) that reads a phrase entered by a user and predicts the next word to be typed.

  • The word prediction model was trained on English texts from HC Corpora (Twitter, news and blog sources).

  • Principles of data science, text mining and natural language processing were required to complete this project.

Data Analysis and Manipulation

  • The HC Corpora dataset was cleaned, tokenized and normalized: text was converted to lowercase; special characters, punctuation, numbers and stopwords were removed; extra whitespace was stripped; and words were stemmed.

  • A sample of this cleaned corpus was used to generate frequency-sorted lists of n-grams (2-grams, 3-grams and 4-grams). The n-grams were limited to the 300k most frequent to keep the prediction model's response time low.

  • These n-grams were stored in data frames and then used to predict the next word for the user's input.
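
The cleaning and n-gram counting steps above can be sketched as follows. This is a minimal Python illustration, not the project's own code (the slides do not show it). The function names, the tiny stopword list and the regular expression are all assumptions made for the example, and stemming is left out to keep it self-contained.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; the slides do not give the real one).
STOPWORDS = {"the", "a", "an", "of", "to", "and", "is", "in"}

def clean_text(text, remove_stopwords=True):
    """Lowercase, keep only letters, collapse whitespace, optionally drop stopwords."""
    text = text.lower()
    # Replace numbers, punctuation and special characters with spaces.
    text = re.sub(r"[^a-z\s]", " ", text)
    # split() with no argument also strips and collapses extra whitespace.
    tokens = text.split()
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens

def ngram_counts(token_lists, n, top_k=300_000):
    """Return the top_k most frequent n-grams as (ngram_tuple, count) pairs."""
    counts = Counter()
    for tokens in token_lists:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts.most_common(top_k)

# Usage: clean a few lines, then build the frequency-sorted bigram list.
lines = ["I love New York!", "I love new york, 2 times.", "We love the new car"]
token_lists = [clean_text(line) for line in lines]
bigrams = ngram_counts(token_lists, n=2)
```

The `top_k` cap plays the role of the 300k limit described above: keeping only the most frequent n-grams bounds the size of the lookup table and therefore the lookup time.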

Prediction Algorithm

  • The user's input is first preprocessed (cleaned, tokenized and normalized, exactly as was done for the corpus dataset).

  • If the user typed more than three words, only the last three are used.

  • The algorithm looks up possible endings and returns the top suggestions for the next word.

  • When no candidate word is found, the algorithm backs off step by step from 4-grams down to 1-grams.

  • If no matches are found, the most common unigrams will be shown.
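
The back-off lookup described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the application's actual implementation: the names `build_model` and `predict_next` are hypothetical, and a dictionary of counters stands in for the data frames mentioned earlier.

```python
from collections import Counter, defaultdict

def build_model(sentences, max_n=4):
    """Map each context (a tuple of 0 to max_n-1 words) to a Counter of next words."""
    model = defaultdict(Counter)
    for tokens in sentences:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                *context, word = tokens[i:i + n]
                model[tuple(context)][word] += 1
    return model

def predict_next(model, tokens, top_k=3, max_n=4):
    """Return (suggestions, n-gram order used), backing off to shorter contexts."""
    # Only the last max_n - 1 words (three, for 4-grams) are used as context.
    context = tuple(tokens[-(max_n - 1):])
    while True:
        candidates = model.get(context)
        if candidates:
            words = [w for w, _ in candidates.most_common(top_k)]
            return words, len(context) + 1
        if not context:
            return [], 0  # empty model: nothing to suggest
        context = context[1:]  # back off: drop the oldest word

# Usage with already-cleaned token lists.
sentences = [
    ["i", "love", "new", "york"],
    ["i", "love", "new", "shoes"],
    ["we", "love", "new", "york"],
]
model = build_model(sentences)
words, order = predict_next(model, ["they", "adore", "new"], top_k=1)
```

In the usage example, neither the three-word nor the two-word context occurs in the training data, so the lookup falls back to the bigram context `("new",)`. The empty context holds the unigram counts, which gives the final fallback to the most common single words.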

Shiny Application - Word Predictor

  • Using the prediction algorithm described, a web-based application was built and is available here: Word Predictor

  • When the app is launched, simply enter text in the “Your Sentence:” input box and press Enter or click the Predict button.

  • In the Algorithm Results tab you will see:

    -> The Next Single Word Prediction
    -> A More Complete List of Word Predictions
    -> The Original Sentence
    -> Cleansed Text
    -> Which data frame was used for the prediction, and the time elapsed