Introduction to the NLP Capstone Project

Ben Apple
28 March 2015


  • Prediction Algorithm
    • Markov chain and Katz back-off as the primary modeling algorithms
    • The tm package for cleaning and compressing the raw text
    • Response time kept down to 0.000 - 0.003
  • Instructions
    • The user's input sentence is truncated to its last 1 to 4 words
    • The app outputs the top five words and a word cloud for visualization
  • Experience of Application
    • The user interface is built with Shiny
    • Manual / Documents

Prediction Algorithm

  • Markov Chain
    The term “Markov chain” refers to the sequence of random variables such a process moves through, with the Markov property defining serial dependence only between adjacent periods (as in a “chain”). It can thus be used for describing systems that follow a chain of linked events, where what happens next depends only on the current state of the system.

  • Katz back-off
    A generative n-gram language model that estimates the conditional probability of a word given its history. When an n-gram has too few observations, the model backs off to a shorter history.
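
To make the two ideas concrete, here is a minimal Python sketch (not the app's own R code). It assumes a bigram model backing off to unigrams and a fixed absolute discount of 0.5 in place of the Good-Turing discounts used in the original Katz formulation. The names train, markov_prob and katz_prob, and the toy sentence, are hypothetical.

```python
from collections import Counter, defaultdict


def train(tokens):
    """Count unigrams and bigrams; the bigram table acts as a first-order Markov chain."""
    unigrams = Counter(tokens)
    bigrams = defaultdict(Counter)
    for prev, word in zip(tokens, tokens[1:]):
        bigrams[prev][word] += 1
    return unigrams, bigrams


def markov_prob(word, prev, bigrams):
    """Relative-frequency estimate of P(word | prev); 0.0 for unseen pairs."""
    followers = bigrams.get(prev)
    if not followers:
        return 0.0
    return followers[word] / sum(followers.values())


def katz_prob(word, prev, unigrams, bigrams, discount=0.5):
    """Simplified Katz back-off (assumption: absolute discount, not Good-Turing)."""
    total = sum(unigrams.values())
    followers = bigrams.get(prev)
    if not followers:
        # History never seen: fall back to the unigram estimate.
        return unigrams[word] / total
    history_count = sum(followers.values())
    if word in followers:
        # Seen bigram: use the discounted relative frequency.
        return (followers[word] - discount) / history_count
    # Probability mass freed by discounting the seen bigrams.
    alpha = discount * len(followers) / history_count
    # Unigram mass of the words never seen after this history.
    unseen_mass = sum(c for w, c in unigrams.items() if w not in followers) / total
    if unseen_mass == 0:
        return 0.0
    return alpha * (unigrams[word] / total) / unseen_mass


tokens = "the cat sat on the mat and the cat slept".split()
unigrams, bigrams = train(tokens)
print(markov_prob("sat", "the", bigrams))           # 0.0: "sat" never follows "the"
print(katz_prob("sat", "the", unigrams, bigrams))   # small but non-zero after back-off
```

The plain Markov estimate gives zero to any word not yet seen after the current word, while the back-off version redistributes the discounted mass so the probabilities over the vocabulary still sum to one.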



Instructions

Users type a sentence into the app's text field. The system truncates the sentence to its last 1 to 4 words and uses them as input to the prediction algorithm.
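
The truncation step can be sketched as follows, in Python rather than the app's R code. The cleaning rule (lowercase, keep only letters and apostrophes) is an assumption, and the function names last_words and candidate_histories are hypothetical.

```python
import re


def last_words(sentence, max_words=4):
    """Lowercase the input, tokenize it, and keep at most the last max_words tokens."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return tokens[-max_words:]


def candidate_histories(words):
    """List the histories to try, from the longest down to a single word."""
    return [tuple(words[i:]) for i in range(len(words))]


words = last_words("I would like to go to the")
print(words)                       # ['to', 'go', 'to', 'the']
print(candidate_histories(words))  # longest history first, ('the',) last
```

Ordering the histories from longest to shortest matches the back-off idea: the longest context is tried first, and shorter ones are used only when it yields nothing.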



  • Word prediction
  • Wordcloud
    Users must click the update word cloud bar to get an accurate word cloud.
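
As a rough illustration of the two outputs, the Python sketch below ranks candidate words and derives relative sizes that a word cloud could use. The scores are made-up values, and top_predictions and cloud_weights are hypothetical names, not functions from the app.

```python
def top_predictions(scores, k=5):
    """Return the k highest-scoring candidate words, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


def cloud_weights(scores, max_size=10):
    """Scale scores to integer sizes from 1 to max_size for a word cloud."""
    top = max(scores.values(), default=0)
    if top <= 0:
        return {}
    return {w: max(1, round(max_size * s / top)) for w, s in scores.items()}


# Hypothetical next-word scores for some input phrase.
scores = {"day": 0.30, "time": 0.25, "way": 0.15, "year": 0.12,
          "world": 0.08, "thing": 0.06, "man": 0.04}
print(top_predictions(scores))  # ['day', 'time', 'way', 'year', 'world']
print(cloud_weights(scores))
```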

Experience of App

App Structure

R, RStudio, Shiny

Manual / Documents

  • Predictive Model / App Workflow / Terminology
  • Interim Report / Final Presentation