Capstone Project: Next Word Text Prediction ShinyApp

Jasmin Pielorz
August 23rd, 2015

Project Idea

Background: This capstone project is part of the Johns Hopkins Data Science Specialization offered by Coursera.

Aim: Create a ShinyApp that takes a phrase from a text box as input and outputs a prediction of the next word.

The presentation gives an overview of:

  • How the provided data was preprocessed.
  • How the final prediction model was built.
  • The basic functionalities of the ShinyApp.

Training Data and Preprocessing

The training data come from English news articles, blogs and Twitter messages, which provide the basis for building a text corpus.

To analyze n-gram frequencies, the following preprocessing steps were performed:

  • Remove punctuation from the text corpus.
  • Transform words to lower case.
  • Strip extra whitespace from the text.
  • Keep stopwords and numbers (they are intentionally included).
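
The preprocessing steps above can be sketched in a few lines. This is a minimal illustration in Python, not the code used for the app; the function name `preprocess` is hypothetical, and the sketch assumes that only ASCII punctuation needs to be removed.

```python
import re
import string


def preprocess(text: str) -> str:
    """Remove punctuation, lowercase, and collapse whitespace.

    Stopwords and numbers are deliberately left in place.
    Assumption: only ASCII punctuation is stripped; curly quotes
    and other Unicode symbols would need extra handling.
    """
    # Delete every ASCII punctuation character.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Transform to lower case.
    text = text.lower()
    # Collapse runs of whitespace and trim both ends.
    return re.sub(r"\s+", " ", text).strip()


print(preprocess("  The 3 Cats,   they SAT on the mat!  "))
# prints: the 3 cats they sat on the mat
```

Note that stripping apostrophes turns "don't" into "dont", so the same function should also be applied to the user's input phrase before prediction.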

Prediction Algorithm

Steps for building a prediction model:

  • Use preprocessed text corpus to calculate bigrams, trigrams and fourgrams.
  • Consider all bigrams whose first word occurs at least 10 times.
  • Consider all trigrams and fourgrams whose first two (respectively three) words occur at least 5 times.
  • Use a simplified Katz back-off model for the next word prediction.
  • Optimise the algorithm by saving and loading the n-gram data in R data formats.
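
The model-building steps above might look roughly like the following Python sketch. It is an illustration under stated assumptions, not the app's implementation: the names `build_model` and `predict_next` are hypothetical, a context's frequency is approximated by how often it is followed by another word, and the back-off simply tries the longest matching context first without the discounting used in full Katz back-off.

```python
from collections import Counter, defaultdict


def build_model(tokens, max_n=4, min_context_count=None):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    if min_context_count is None:
        # Thresholds from the slides: 10 for bigrams, 5 for tri- and fourgrams.
        min_context_count = {2: 10, 3: 5, 4: 5}
    model = {}
    for n in range(2, max_n + 1):
        followers = defaultdict(Counter)
        for i in range(len(tokens) - n + 1):
            context = tuple(tokens[i:i + n - 1])
            followers[context][tokens[i + n - 1]] += 1
        # Keep only contexts that were seen often enough.
        model[n] = {
            ctx: counts
            for ctx, counts in followers.items()
            if sum(counts.values()) >= min_context_count[n]
        }
    return model


def predict_next(model, phrase, max_n=4):
    """Back off from the longest context to shorter ones (no discounting)."""
    words = phrase.lower().split()
    for n in range(max_n, 1, -1):
        if len(words) < n - 1:
            continue  # phrase too short for this n-gram order
        context = tuple(words[-(n - 1):])
        candidates = model.get(n, {}).get(context)
        if candidates:
            return candidates.most_common(1)[0][0]
    return None  # no known context at any order


# Toy corpus; thresholds lowered to 1 so the tiny example is not pruned away.
tokens = "i want to eat i want to sleep i want to eat".split()
model = build_model(tokens, min_context_count={2: 1, 3: 1, 4: 1})

print(predict_next(model, "I want to"))     # fourgram match: eat
print(predict_next(model, "they need to"))  # backs off to the bigram "to": eat
print(predict_next(model, "hello"))         # unknown word: None
```

Pruning rare contexts keeps the lookup tables small, which matters when they have to be loaded at app start-up.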

Next Word Prediction ShinyApp

  1. Wait 15 seconds for the n-grams to load.
  2. Type a phrase into the upper text box.
  3. Press the “Predict” button.
  4. If you like the prediction, press “Add”; otherwise press “Clear”.
  5. Go back to step 3 or change the model in the side panel.

Acknowledgements

I would like to thank the entire team from Johns Hopkins University and Coursera for offering a very interesting and inspiring specialization in Data Science. A special thanks goes to C.H. Lampert for introducing me to the Python Natural Language Toolkit.

Useful References for the Capstone: