Data Science Capstone Slide Presentation

Zetch Cruz-Ram, MD FPCOM
02 March 2021

The Project

The goal is to build a data product that showcases the next-word prediction algorithm and gives others an interface to use it.

Deliverables:

  1. A Shiny app that takes a phrase (multiple words) from a text box and outputs a prediction of the next word.

  2. A slide deck of no more than 5 slides, created with RStudio Presenter (https://support.rstudio.com/hc/en-us/articles/200486468-Authoring-R-Presentations), pitching the algorithm and app as if presenting to a boss or an investor.

Word Prediction Model

The next-word prediction model applies the principles of “tidy data” to text mining in R.

Key model steps:

  1. Input: raw text files for model training
  2. Clean the training data; split it into 2-, 3-, and 4-word n-grams; save them as tibbles
  3. Sort the n-gram tibbles by frequency; save them as the repo
  4. N-grams function: uses a “back-off” type prediction model
    • user supplies an input phrase
    • model uses the last 3, 2, or 1 words of the phrase to find the most frequent matching 4-, 3-, or 2-word n-gram in the repo
  5. Output: next word prediction
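
The steps above can be sketched in a few lines of R. This is a minimal illustration rather than the app's actual code: the toy corpus is made up, and the names `training`, `build_ngrams`, `repo`, and `predict_next` are hypothetical. It assumes the `dplyr` and `tidytext` packages are installed.

```r
library(dplyr)
library(tidytext)

# Toy training text (invented for illustration; the real model trains on raw text files).
training <- tibble(text = c(
  "thank you for the update",
  "thank you for the help",
  "thank you for your time",
  "see you for lunch"
))

# Build a frequency-sorted n-gram tibble and split each n-gram
# into its leading words (prefix) and its final word (next_word).
build_ngrams <- function(data, n) {
  data %>%
    unnest_tokens(ngram, text, token = "ngrams", n = n) %>%
    filter(!is.na(ngram)) %>%          # lines shorter than n words yield NA
    count(ngram, sort = TRUE) %>%      # most frequent n-grams first
    mutate(
      prefix    = sub(" [^ ]+$", "", ngram),
      next_word = sub("^.* ", "", ngram)
    )
}

# repo[[k]] holds the (k + 1)-word n-grams, i.e. those with a k-word prefix.
repo <- list(
  build_ngrams(training, 2),
  build_ngrams(training, 3),
  build_ngrams(training, 4)
)

# Back-off lookup: try the last 3 words, then the last 2, then the last 1.
predict_next <- function(phrase, repo) {
  words <- tibble(text = phrase) %>%
    unnest_tokens(word, text) %>%      # lowercases and strips punctuation
    pull(word)
  if (length(words) == 0) return(NA_character_)

  for (k in min(3, length(words)):1) {
    context <- paste(tail(words, k), collapse = " ")
    hits <- filter(repo[[k]], prefix == context)
    if (nrow(hits) > 0) return(hits$next_word[1])  # rows are already sorted by count
  }
  NA_character_                        # no match at any n-gram order
}

predict_next("Thank you for", repo)    # "the" (matched on the 4-word n-grams)
predict_next("nice to see you", repo)  # "for" (backs off to the 3-word n-grams)
```

Because each table is sorted by frequency once, at build time, a lookup only has to filter on the prefix and take the first row.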

Benefits: easy-to-read code; uses “pipes”; fast processing of training data; able to sample up to 25% of the original corpus; relatively small output repo
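
As an illustration of the sampling step, the base R sketch below draws a random 25% of the lines from a corpus. The helper name `sample_lines`, the seed, and the stand-in corpus are assumptions made for this example; in practice the lines would come from the raw text files (for instance via `readLines()`).

```r
set.seed(2021)  # arbitrary seed so the sample is reproducible

# Keep a random fraction of the lines, preserving their original order.
sample_lines <- function(lines, fraction = 0.25) {
  keep <- sample.int(length(lines), size = floor(length(lines) * fraction))
  lines[sort(keep)]
}

corpus <- paste("line", 1:1000)        # stand-in for the real corpus lines
corpus_sample <- sample_lines(corpus)  # 25% of the lines
length(corpus_sample)                  # 250
```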

Word Predictor App

Key Features:

  1. Text box for user input
  2. Side panel with user instructions
  3. Predicted next word is shown below the user input
  4. Tabs with an overview, instructions, and a plot of the n-gram distribution
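
A Shiny layout with these features might look roughly like the sketch below. It is not the app's actual source: the input and output IDs, the tab names, the toy plot data, and the placeholder `predict_next` function are all assumptions, and the real app would call the n-gram back-off model instead.

```r
library(shiny)

# Placeholder predictor (hypothetical); stands in for the back-off model.
predict_next <- function(phrase) {
  if (nchar(trimws(phrase)) == 0) "" else "the"
}

ui <- fluidPage(
  titlePanel("Word Predictor"),
  sidebarLayout(
    # Side panel with user instructions
    sidebarPanel(
      helpText("Type a phrase in the text box. The predicted next word appears below it.")
    ),
    mainPanel(
      tabsetPanel(
        tabPanel(
          "Predict",
          textInput("phrase", "Enter a phrase:"),  # text box for user input
          textOutput("prediction")                 # prediction shown below the input
        ),
        tabPanel("Overview", p("Overview text goes here.")),
        tabPanel("Instructions", p("Instruction text goes here.")),
        tabPanel("N-gram plot", plotOutput("ngram_plot"))
      )
    )
  )
)

server <- function(input, output, session) {
  # Re-run the prediction whenever the input phrase changes.
  output$prediction <- renderText(predict_next(input$phrase))

  # Toy counts (made up) standing in for the real n-gram distribution.
  output$ngram_plot <- renderPlot(
    barplot(c("of the" = 30, "in the" = 25, "to the" = 12), ylab = "Count")
  )
}

# shinyApp(ui, server)  # uncomment to launch the app locally
```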

Key Benefits:

  1. Fast response
  2. The method accommodates large training sets, which leads to better next-word predictions

Word Predictor App Screenshot

[Figure: screenshot of the Word Predictor App]