Text Prediction App

A project showcase for the Johns Hopkins University Data Science Specialization Capstone, delivered through Coursera

Objectives

  1. Build an algorithm that can predict the next word when a user enters a phrase.

  2. Showcase the prediction algorithm with a Shiny App.

Process Overview and Details

[Figure: process flowchart]

  1. Sample Corpus: The provided corpus is randomly sampled to produce training, validation and test sets.
  2. Process Data and Create N-grams: The training set is then cleaned (HTML tags, email addresses, Twitter handles, punctuation, etc. are removed) and tokenized into N-grams.
  3. Create Text Model: N-gram frequency tables are created.
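
Steps 1 to 3 can be sketched roughly as follows. This is a minimal Python illustration, not the project's own code: the function names (split_corpus, clean, ngram_tables), the 60/20/20 split, the lowercasing step and the regular expressions are all assumptions made for the example.

```python
import random
import re
from collections import Counter

def split_corpus(lines, train=0.6, valid=0.2, seed=42):
    """Shuffle lines and split them into training, validation and test sets.
    The 60/20/20 proportions are an assumption, not taken from the project."""
    rng = random.Random(seed)
    shuffled = list(lines)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_valid = int(len(shuffled) * valid)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_valid],
            shuffled[n_train + n_valid:])

def clean(text):
    """Strip HTML tags, emails, Twitter handles and punctuation, then tokenize.
    Lowercasing is an extra assumption of this sketch."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)        # HTML tags
    text = re.sub(r"\S+@\S+\.\S+", " ", text)   # email addresses
    text = re.sub(r"@\w+", " ", text)           # Twitter handles
    text = re.sub(r"[^a-z' ]+", " ", text)      # punctuation, digits, symbols
    return text.split()

def ngram_tables(tokenized_lines, max_n=4):
    """Return {n: Counter of n-gram tuples} for n = 1..max_n."""
    tables = {n: Counter() for n in range(1, max_n + 1)}
    for tokens in tokenized_lines:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                tables[n][tuple(tokens[i:i + n])] += 1
    return tables
```

Calling ngram_tables([clean(line) for line in train_lines]) on the training split would give one frequency table per N-gram order. Removing sparse terms, mentioned in the refinement step, is not shown here.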

Process Details

4. Build Prediction Function: After the user input is processed, a lookup is performed to determine the most probable next word. A simple backoff is implemented across the N-gram tables: if the longest available context has no match, progressively shorter contexts are tried. If no word is found, the phrase “No word found” is returned.
5. Validate Model and Prediction: Validation is performed with the validation set. Last-word prediction accuracy is used as the metric (i.e., the percentage of cases in which the predicted last word matches the actual last word).
6. Refine Model: The final model consists of 1-, 2-, 3- and 4-grams. Words are not stemmed, and sparse terms are removed. Last-word prediction accuracy on the test set is 14.4%.
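
The lookup with backoff and the last-word accuracy metric described above might look something like the sketch below. It is a simplified Python illustration under stated assumptions, not the project's implementation: build_model, predict_next and last_word_accuracy are hypothetical names, the model is stored as a single context-to-counts dictionary for brevity, and the most frequent continuation is taken as the "most probable" word.

```python
from collections import Counter, defaultdict

def build_model(tokenized_lines, max_n=4):
    """Map each context (the first n-1 words of an n-gram) to next-word counts.
    The empty context () holds the plain unigram counts."""
    model = defaultdict(Counter)
    for tokens in tokenized_lines:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                gram = tuple(tokens[i:i + n])
                model[gram[:-1]][gram[-1]] += 1
    return model

def predict_next(tokens, model, max_n=4):
    """Try the longest context first, then back off to shorter ones."""
    for k in range(min(max_n - 1, len(tokens)), -1, -1):
        context = tuple(tokens[len(tokens) - k:])
        if context in model:
            return model[context].most_common(1)[0][0]
    return "No word found"

def last_word_accuracy(tokenized_lines, model, max_n=4):
    """Share of lines whose held-out last word is predicted correctly."""
    scored = [t for t in tokenized_lines if len(t) >= 2]
    if not scored:
        return 0.0
    hits = sum(predict_next(t[:-1], model, max_n) == t[-1] for t in scored)
    return hits / len(scored)

# Toy data, invented purely for illustration.
train = [["i", "like", "green", "tea"],
         ["i", "like", "green", "tea"],
         ["i", "like", "black", "coffee"]]
model = build_model(train)
print(predict_next(["they", "like"], model))  # backs off to the context ("like",)
```

In this toy example the context ("they", "like") was never seen, so the function backs off to ("like",) and returns its most frequent continuation. Because the unigram table is the last fallback, “No word found” is only returned when no table has a candidate at all.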

Application Instructions

  1. Access the Shiny app here.
  2. Type your desired phrase into the text box.
  3. The application will then try to predict the next word and display it.