6/24/2019

Introduction

This presentation demonstrates PredictApp, the final project submission for the Coursera Data Science Specialization by Johns Hopkins University.

The project submission consists of two parts:

  1. This slide deck
  2. A Shiny app located at https://lauripiispanen.shinyapps.io/PredictApp/

Algorithm

The application uses Katz's back-off model. Compared to simple n-gram models, a back-off model distributes its probability mass across n-gram models of several orders, backing off to progressively shorter contexts whenever a longer n-gram has not been observed often enough.

\[ P_{bo}(w_i \mid w_{i-n+1} \cdots w_{i-1}) = \begin{cases} d_{w_{i-n+1} \cdots w_i} \frac{C(w_{i-n+1} \cdots w_{i-1} w_i)}{C(w_{i-n+1} \cdots w_{i-1})} & \quad \text{if } C(w_{i-n+1} \cdots w_i) > k \\ \alpha_{w_{i-n+1} \cdots w_{i-1}} P_{bo}(w_i \mid w_{i-n+2} \cdots w_{i-1}) & \quad \text{otherwise} \end{cases} \]

This allows back-off models to capture a more realistic representation of the probability distribution of the source material. Some caveats remain, however: for example, the absence of a specific trigram may itself be significant (it may not be grammatically valid), even though both of its constituent bigrams are valid.
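To make the recursion concrete, below is a minimal R sketch of the back-off computation. This is not the app's actual code: it assumes n-gram counts are stored as a list of named numeric vectors keyed by space-separated words, and it substitutes a single absolute discount `d` for the Good-Turing discount ratios used in full Katz back-off.

```r
# Illustrative sketch only; assumes counts[[n]] is a named numeric vector
# of n-gram counts keyed by space-separated words (e.g. counts[[2]]["the cat"]).
p_backoff <- function(word, context, counts, d = 0.5, k = 0) {
  n <- length(context) + 1
  if (n == 1) {
    # Base case: maximum-likelihood unigram probability
    # (unseen words simply fall through as NA in this sketch).
    return(unname(counts[[1]][word] / sum(counts[[1]])))
  }
  ngram  <- paste(c(context, word), collapse = " ")
  prefix <- paste(context, collapse = " ")
  c_ngram  <- counts[[n]][ngram]
  c_prefix <- counts[[n - 1]][prefix]
  if (!is.na(c_ngram) && c_ngram > k) {
    # Seen often enough: discounted maximum-likelihood estimate.
    unname((c_ngram - d) / c_prefix)
  } else {
    # Back off: weight the shorter model by the left-over probability mass.
    seen  <- counts[[n]][startsWith(names(counts[[n]]), paste0(prefix, " "))]
    alpha <- if (is.na(c_prefix)) 1 else 1 - sum(seen - d) / c_prefix
    alpha * p_backoff(word, context[-1], counts, d, k)
  }
}

# Toy usage:
uni <- c(the = 5, cat = 3, sat = 2)
bi  <- c("the cat" = 2, "cat sat" = 2)
p_backoff("cat", "the", list(uni, bi))  # (2 - 0.5) / 5 = 0.3
```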

PredictApp

PredictApp is an R Shiny app powered by the back-off algorithm described above.

After the user has typed a few words, the app suggests likely next words. As the user keeps typing, more suggestions appear, and clicking one appends it to the sentence. A table of estimated probabilities for the next word is also shown.
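A minimal Shiny skeleton for this interaction could look like the sketch below. It is illustrative only; `predict_next(text, n)`, assumed to return a data frame of candidate words and their estimated probabilities, is a hypothetical stand-in for the app's model code.

```r
library(shiny)

ui <- fluidPage(
  textInput("phrase", "Type a sentence:"),
  uiOutput("suggestions"),
  tableOutput("probs")
)

server <- function(input, output, session) {
  preds <- reactive(predict_next(input$phrase, n = 5))

  output$suggestions <- renderUI({
    # One button per suggested word; clicking reports the word back to R.
    lapply(preds()$word, function(w) {
      actionButton(paste0("sug_", w), w, onclick = sprintf(
        "Shiny.setInputValue('picked', '%s', {priority: 'event'})", w))
    })
  })

  # Append the clicked suggestion to the sentence.
  observeEvent(input$picked, {
    updateTextInput(session, "phrase",
                    value = paste(input$phrase, input$picked))
  })

  # Table of estimated next-word probabilities.
  output$probs <- renderTable(preds())
}

shinyApp(ui, server)
```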

BONUS! An additional “rambling” mode produces randomly generated sentences from the model, with a user-selectable number of words.
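Repeatedly sampling the next word from the model's estimated distribution is enough to implement such a mode. The sketch below reuses the hypothetical `predict_next` helper from above; the seed word is likewise an assumption.

```r
# Illustrative only: grow a sentence by weighted random draws from the model.
ramble <- function(n_words, start = "the") {
  sentence <- start
  for (i in seq_len(n_words - 1)) {
    cand <- predict_next(sentence, n = 10)
    sentence <- paste(sentence, sample(cand$word, 1, prob = cand$prob))
  }
  sentence
}

ramble(10)  # e.g. ten words of model-generated "rambling"
```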

Deficiencies

To fit within the memory constraints of shinyapps.io, the back-off model has been severely reduced in size. The resulting model is relatively small, and its predictive performance is therefore limited.

The algorithm itself remains valid for larger, more elaborate models, but implementing a more memory-efficient data structure for storing the model data is cumbersome in R; for production use it would be implemented in another language.

Katz’s back-off, while better than simpler n-gram models, has been superseded by more advanced smoothing methods, such as Kneser-Ney smoothing.