Capstone Final Project Presentation

Bill Zhang

Problem Statement

The goal of this exercise is to create a product to highlight the prediction algorithm that you have built and to provide an interface that can be accessed by others. For this project you must submit:

  1. A Shiny app that takes as input a phrase (multiple words) in a text box input and outputs a prediction of the next word.
  2. A slide deck consisting of no more than 5 slides created with R Studio Presenter (https://support.rstudio.com/hc/en-us/articles/200486468-Authoring-R-Presentations) pitching your algorithm and app as if you were presenting to your boss or an investor.

Getting & Cleaning the Data

  • A subset of the original data was sampled from the three sources (blogs, Twitter, and news) and merged into a single corpus.
  • Next, the data is cleaned by converting to lowercase, stripping whitespace, and removing punctuation and numbers.
  • The corresponding n-grams are then created (quadgrams, trigrams, and bigrams).
  • Next, term-count tables are extracted from the n-grams and sorted by frequency in descending order.
  • Lastly, the n-gram frequency tables are saved as compressed R data files (.RData); a minimal sketch of the whole pipeline follows this list.
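As an illustration, the pipeline above might look like the following base-R sketch; the file names, the 5% sampling rate, and the helper names are assumptions for illustration, not the actual build parameters.

```r
## Minimal sketch of the sampling/cleaning/n-gram pipeline (illustrative).
set.seed(42)

## Sample a fraction of each source file (5% is an arbitrary choice here)
sample_lines <- function(path, frac = 0.05) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  sample(lines, size = floor(length(lines) * frac))
}

corpus <- c(sample_lines("en_US.blogs.txt"),
            sample_lines("en_US.twitter.txt"),
            sample_lines("en_US.news.txt"))

## Cleaning: lowercase, drop punctuation and numbers, collapse whitespace
clean <- tolower(corpus)
clean <- gsub("[[:punct:]]", " ", clean)
clean <- gsub("[[:digit:]]", " ", clean)
clean <- gsub("\\s+", " ", trimws(clean))

## Build an n-gram frequency table, sorted in descending order
ngram_freq <- function(text, n) {
  tokens <- strsplit(text, " ", fixed = TRUE)
  grams <- unlist(lapply(tokens, function(w) {
    if (length(w) < n) return(character(0))
    sapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "))
  }))
  sort(table(grams), decreasing = TRUE)
}

bigram   <- ngram_freq(clean, 2)
trigram  <- ngram_freq(clean, 3)
quadgram <- ngram_freq(clean, 4)

## Persist the frequency tables as a compressed .RData file for the app
save(bigram, trigram, quadgram, file = "ngrams.RData")
```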

Shiny Application

  • The Shiny application predicts the three most likely next words for a given sentence.
  • The user enters text in an input box, and a second box displays the statement as written.
  • The predicted words are obtained from the n-gram frequency tables: the tokenized input is matched against the 4-, 3-, and 2-gram tables in turn, and the three most frequent completions are returned (see the lookup sketch below).
  • As the user types, the predicted-word field refreshes instantly, and the suggestions are offered for the user to choose from.
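The lookup described in the third bullet might look like the sketch below, assuming the bigram/trigram/quadgram tables saved earlier. This is a simplified backoff that prefers higher-order matches rather than computing full Stupid Backoff scores; `predict_next` and `ngrams.RData` are illustrative names.

```r
load("ngrams.RData")  # bigram, trigram, quadgram frequency tables

## Return up to `k` candidate next words for an input phrase, backing
## off from 4-grams to 3-grams to 2-grams.
predict_next <- function(phrase, k = 3) {
  words <- strsplit(tolower(trimws(phrase)), "\\s+")[[1]]
  candidates <- character(0)
  for (n in c(4, 3, 2)) {
    if (length(words) < n - 1) next
    prefix <- paste(tail(words, n - 1), collapse = " ")
    tbl <- switch(as.character(n), "4" = quadgram, "3" = trigram, "2" = bigram)
    hits <- tbl[startsWith(names(tbl), paste0(prefix, " "))]
    if (length(hits) > 0) {
      ## last word of each matching n-gram; tables are already sorted
      ## by frequency, so earlier candidates are more likely
      candidates <- unique(c(candidates, sub(".*\\s", "", names(hits))))
    }
    if (length(candidates) >= k) break
  }
  head(candidates, k)
}

predict_next("thanks for the")  # returns up to three candidate words
```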
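For the interface in the first two bullets, a minimal Shiny skeleton could be wired as follows (the real app's layout and widget names may differ):

```r
library(shiny)

ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Enter a phrase:"),
  verbatimTextOutput("echo"),        # shows the text as entered
  verbatimTextOutput("predictions")  # up to three predicted words
)

server <- function(input, output) {
  output$echo <- renderText(input$phrase)
  output$predictions <- renderText({
    req(nzchar(input$phrase))
    paste(predict_next(input$phrase, k = 3), collapse = " | ")
  })
}

shinyApp(ui, server)
```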

Further Work

  • Further exploration could address the main weakness of this approach: long-range context.
    1. The current algorithm discards contextual information beyond 4-grams.
    2. Future work could incorporate it by clustering the underlying training corpus and predicting which cluster the entire sentence falls into.
    3. This would allow prediction using ONLY the data subset that fits the long-range context of the sentence, while still preserving the performance characteristics of an n-gram model with Stupid Backoff (a hypothetical sketch follows).
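Purely as a hypothetical sketch of item 2, the corpus could be clustered over a TF-IDF document-term matrix, with a separate set of n-gram tables kept per cluster; only the cluster-assignment step is shown, and every name here (including the choice of k = 8) is an illustrative assumption.

```r
library(tm)

## Cluster the cleaned training lines (the `clean` vector from earlier)
dtm <- DocumentTermMatrix(VCorpus(VectorSource(clean)),
                          control = list(weighting = weightTfIdf))
m <- as.matrix(removeSparseTerms(dtm, 0.99))

set.seed(42)
km <- kmeans(m, centers = 8)  # k = 8 is an arbitrary illustrative choice

## At prediction time, map the user's sentence to its nearest centroid,
## then predict with the n-gram tables built from that cluster only
assign_cluster <- function(sentence) {
  v <- numeric(ncol(m))
  names(v) <- colnames(m)
  toks <- intersect(strsplit(tolower(sentence), "\\s+")[[1]], colnames(m))
  v[toks] <- 1
  which.min(colSums((t(km$centers) - v)^2))
}
```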