11/15/2020

Executive Summary

  • Natural Language Processing App
  • Goal: Predict the third word of a tri-gram given the two leading words
  • Use Katz Back-Off Method for predictions
    • Provide list of word predictions along with probabilities

Training Corpus & Prediction Method

  • Three (very large) starter files
    • en_US.blogs.txt
    • en_US.news.txt
    • en_US.twitter.txt
  • Create a sample dataset using 5% of each .txt file
    • Corpus created and cleaned using the quanteda package.
  • Katz Back-Off Prediction Method
    • A generative n-gram language model that estimates the conditional probability of a word given the words that precede it (its history). When an n-gram was not seen in the training corpus, the model backs off to the next shorter n-gram.
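
The back-off idea above can be sketched in a few lines. This is a minimal illustration, not the app's quanteda-based code: it is written in Python with only the standard library, and it swaps the Good-Turing discounts of the original Katz method for a fixed absolute discount (a common simplification). The names `ngram_followers`, `KatzBackoff`, and `discount`, as well as the toy corpus, are hypothetical.

```python
from collections import Counter, defaultdict


def ngram_followers(tokens, n):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    followers = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        followers[context][tokens[i + n - 1]] += 1
    return followers


class KatzBackoff:
    """Tri-gram back-off model with a fixed absolute discount (assumed 0.5)."""

    def __init__(self, tokens, discount=0.5):
        self.d = discount
        self.uni = Counter(tokens)
        self.bi = ngram_followers(tokens, 2)
        self.tri = ngram_followers(tokens, 3)

    def _bigram_probs(self, w2):
        """P(w | w2) for every vocabulary word, backing off to unigrams."""
        seen = self.bi.get((w2,))
        if not seen:
            # Context never observed: fall back to plain unigram frequencies.
            total_uni = sum(self.uni.values())
            return {w: c / total_uni for w, c in self.uni.items()}
        total = sum(seen.values())
        # Discount every observed bi-gram count to free up probability mass.
        probs = {w: (c - self.d) / total for w, c in seen.items()}
        alpha = self.d * len(seen) / total  # mass left over for unseen words
        unseen_mass = sum(c for w, c in self.uni.items() if w not in seen)
        for w, c in self.uni.items():
            if w not in seen:
                probs[w] = alpha * c / unseen_mass
        return probs

    def predict(self, w1, w2, top=3):
        """Return the `top` most likely third words with their probabilities."""
        lower = self._bigram_probs(w2)
        seen = self.tri.get((w1, w2))
        if not seen:
            probs = lower  # no tri-gram evidence: use the bi-gram estimate
        else:
            total = sum(seen.values())
            probs = {w: (c - self.d) / total for w, c in seen.items()}
            alpha = self.d * len(seen) / total
            unseen_mass = sum(p for w, p in lower.items() if w not in seen)
            # Share the left-over mass among unseen words, in proportion
            # to their backed-off bi-gram probabilities.
            for w, p in lower.items():
                if w not in seen:
                    probs[w] = alpha * p / unseen_mass
        return sorted(probs.items(), key=lambda kv: (-kv[1], kv[0]))[:top]


# Toy corpus standing in for the cleaned 5% sample.
tokens = "the cat sat on the mat the cat ate the fish the cat sat down".split()
model = KatzBackoff(tokens)
for word, prob in model.predict("the", "cat"):
    print(f"{word}: {prob:.3f}")
```

With this toy corpus, "sat" follows "the cat" twice and "ate" once, so "sat" ranks first with probability 0.5. The mass removed by the discount is shared among words never seen after "the cat", which is what lets the model return a ranked list of predictions with probabilities even for unobserved tri-grams.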

Shiny App Interface

References