Data Science Capstone Project

Giovanni Melo Carvalho Viglioni
April 13th, 2016

Project Assignment

The goal of this exercise is to create a product to highlight the prediction algorithm that you have built and to provide an interface that can be accessed by others. For this project you must submit:

  • A Shiny app that takes as input a phrase (multiple words) in a text box input and outputs a prediction of the next word.
  • A slide deck of no more than 5 slides, created with RStudio Presenter, pitching your algorithm and app as if you were presenting to your boss or an investor.

A key point here is that the predictive model must be small enough to load onto the Shiny server.

Algorithm Function - Flowchart

Functionality of the text prediction algorithm:

  • Step 1: Remove punctuation, foreign characters, and profanity;
  • Step 2: Search for matches and, if any are found, skip to Step 4;
  • Step 3: Shorten the input, calculate a penalty value, and search again;
  • Step 4: Calculate a score for each match found and sort the results.
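
The four steps above can be sketched roughly as follows. This is a minimal illustration in Python, not the application's actual code: the n-gram table, the profanity list, the maximum context length of three words, and the function names are all assumptions made for the example, and raw counts stand in for the final score.

```python
import re

# Hypothetical profanity list and n-gram table, for illustration only.
PROFANITY = {"darn"}
NGRAMS = {
    ("thanks", "for", "the"): {"follow": 12, "help": 7},
    ("for", "the"): {"first": 30, "follow": 12},
    ("the",): {"same": 50, "first": 40},
}

def clean(phrase):
    """Step 1: lowercase, drop non-ASCII characters, punctuation
    (apostrophes are kept), and profanity."""
    ascii_only = phrase.lower().encode("ascii", "ignore").decode("ascii")
    letters_only = re.sub(r"[^a-z' ]+", " ", ascii_only)
    return [w for w in letters_only.split() if w not in PROFANITY]

def find_matches(words, max_order=3):
    """Steps 2-3: look up the longest context, shortening it until it matches."""
    context = tuple(words[-max_order:])
    shortenings = 0
    while context:
        if context in NGRAMS:
            return NGRAMS[context], shortenings
        context = context[1:]  # Step 3: drop the oldest word and search again
        shortenings += 1       # each shortening increases the penalty
    return {}, shortenings

def predict(phrase, top_n=3):
    """Step 4 (simplified): rank the matches, here by raw count."""
    matches, shortenings = find_matches(clean(phrase))
    ranked = sorted(matches.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ranked[:top_n]], shortenings
```

Calling `predict("Thanks, for the!")` matches the full three-word context without any shortening, while a phrase whose longer contexts are unknown falls back to shorter ones and reports how many times the input was shortened.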

Algorithm Function - Probability Model

The image shows how the algorithm calculates the score of a predicted word, employing the Markov assumption (only the last few words of the input are taken to determine the next word).

  • This model was chosen for its speed in returning predicted words and because of the memory restrictions of the shinyApps website;
  • When the algorithm falls back on “Stupid Backoff” to find more matches, it applies a penalty to the log-probability score.
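
One way to express a penalised log-probability score is sketched below in Python. It is an illustration under stated assumptions rather than the formula used by the application: the factor 0.4 is the value commonly cited for Stupid Backoff, and the function names, the use of relative frequency as the base probability, and one penalty per shortening of the input are all hypothetical choices.

```python
import math

# Commonly cited Stupid Backoff factor; assumed here, not taken from the app.
BACKOFF_FACTOR = 0.4

def score(match_count, context_count, shortenings):
    """Log relative frequency plus an additive log penalty per shortening.

    Multiplying a probability by BACKOFF_FACTOR once per backoff step
    becomes adding log(BACKOFF_FACTOR) in log space.
    """
    log_probability = math.log(match_count / context_count)
    return log_probability + shortenings * math.log(BACKOFF_FACTOR)

def rank(matches, context_count, shortenings, top_n=3):
    """Score every candidate word and return the best ones, highest first."""
    scored = [
        (word, score(count, context_count, shortenings))
        for word, count in matches.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]
```

Because the penalty is negative in log space, a word found only after shortening the input always scores lower than the same relative frequency found with the full context.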

Application Interface

To use the application, enter a phrase to be analyzed, select the maximum number of results to return, and press the “Analyze Text” button.

Output: The original user phrase, the filtered phrase that the algorithm analyzes, and a table showing predicted words.

Click here to see the application and full documentation. Click here to see the source code.

Thanks.