Capstone Final Project Presentation

Bill Zhang

Problem Statement

The goal of this exercise is to create a product to highlight the prediction algorithm that you have built and to provide an interface that can be accessed by others. For this project you must submit:

  1. A Shiny app that takes as input a phrase (multiple words) in a text box input and outputs a prediction of the next word.
  2. A slide deck consisting of no more than 5 slides created with R Studio Presenter (https://support.rstudio.com/hc/en-us/articles/200486468-Authoring-R-Presentations) pitching your algorithm and app as if you were presenting to your boss or an investor.

Getting & Cleaning the Data

  • A subset of the original data was sampled from the three sources (blogs, Twitter, and news) and merged into a single corpus.
  • Next, the data is cleaned by converting to lowercase, stripping whitespace, and removing punctuation and numbers.
  • The corresponding n-grams are then created (quadgrams, trigrams, and bigrams).
  • Next, term-count tables are extracted from the n-grams and sorted by frequency in descending order.
  • Lastly, the n-gram frequency tables are saved as compressed R data files (.RData); a minimal sketch of the whole pipeline follows this list.
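As an illustration, the pipeline above might look like the following base-R sketch; the file names, the 5% sampling rate, and the helper names are assumptions for illustration, not the actual build parameters.

```r
## Minimal sketch of the sampling/cleaning/n-gram pipeline (illustrative).
set.seed(42)

## Sample a fraction of each source file (5% is an arbitrary choice here)
sample_lines <- function(path, frac = 0.05) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  sample(lines, size = floor(length(lines) * frac))
}

corpus <- c(sample_lines("en_US.blogs.txt"),
            sample_lines("en_US.twitter.txt"),
            sample_lines("en_US.news.txt"))

## Cleaning: lowercase, drop punctuation and numbers, collapse whitespace
clean <- tolower(corpus)
clean <- gsub("[[:punct:]]", " ", clean)
clean <- gsub("[[:digit:]]", " ", clean)
clean <- gsub("\\s+", " ", trimws(clean))

## Build an n-gram frequency table, sorted in descending order
ngram_freq <- function(text, n) {
  tokens <- strsplit(text, " ", fixed = TRUE)
  grams <- unlist(lapply(tokens, function(w) {
    if (length(w) < n) return(character(0))
    sapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "))
  }))
  sort(table(grams), decreasing = TRUE)
}

bigram   <- ngram_freq(clean, 2)
trigram  <- ngram_freq(clean, 3)
quadgram <- ngram_freq(clean, 4)

## Persist the frequency tables as a compressed .RData file for the app
save(bigram, trigram, quadgram, file = "ngrams.RData")
```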

Shiny Application

  • The Shiny application predicts the three most likely next words for a given sentence.
  • The user enters text in an input box, and a second box displays the statement as written.
  • The predicted words are obtained from the n-gram frequency tables: the tokenized input is matched against the 4-, 3-, and 2-gram tables in turn, and the three most frequent completions are returned (see the lookup sketch below).
  • As the user types, the predicted-word field refreshes instantly, and the suggestions are offered for the user to choose from.
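The lookup described in the third bullet might look like the sketch below, assuming the bigram/trigram/quadgram tables saved earlier. This is a simplified backoff that prefers higher-order matches rather than computing full Stupid Backoff scores; `predict_next` and `ngrams.RData` are illustrative names.

```r
load("ngrams.RData")  # bigram, trigram, quadgram frequency tables

## Return up to `k` candidate next words for an input phrase, backing
## off from 4-grams to 3-grams to 2-grams.
predict_next <- function(phrase, k = 3) {
  words <- strsplit(tolower(trimws(phrase)), "\\s+")[[1]]
  candidates <- character(0)
  for (n in c(4, 3, 2)) {
    if (length(words) < n - 1) next
    prefix <- paste(tail(words, n - 1), collapse = " ")
    tbl <- switch(as.character(n), "4" = quadgram, "3" = trigram, "2" = bigram)
    hits <- tbl[startsWith(names(tbl), paste0(prefix, " "))]
    if (length(hits) > 0) {
      ## last word of each matching n-gram; tables are already sorted
      ## by frequency, so earlier candidates are more likely
      candidates <- unique(c(candidates, sub(".*\\s", "", names(hits))))
    }
    if (length(candidates) >= k) break
  }
  head(candidates, k)
}

predict_next("thanks for the")  # returns up to three candidate words
```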
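For the interface in the first two bullets, a minimal Shiny skeleton could be wired as follows (the real app's layout and widget names may differ):

```r
library(shiny)

ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Enter a phrase:"),
  verbatimTextOutput("echo"),        # shows the text as entered
  verbatimTextOutput("predictions")  # up to three predicted words
)

server <- function(input, output) {
  output$echo <- renderText(input$phrase)
  output$predictions <- renderText({
    req(nzchar(input$phrase))
    paste(predict_next(input$phrase, k = 3), collapse = " | ")
  })
}

shinyApp(ui, server)
```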

Further Work

  • Further exploration could address the main weakness of this approach: long-range context.
    1. The current algorithm discards contextual information beyond 4-grams.
    2. Future work could incorporate it by clustering the underlying training corpus and predicting which cluster the entire sentence falls into.
    3. This would allow prediction using ONLY the data subset that fits the long-range context of the sentence, while still preserving the performance characteristics of an n-gram model with Stupid Backoff (a hypothetical sketch follows).
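Purely as a hypothetical sketch of item 2, the corpus could be clustered over a TF-IDF document-term matrix, with a separate set of n-gram tables kept per cluster; only the cluster-assignment step is shown, and every name here (including the choice of k = 8) is an illustrative assumption.

```r
library(tm)

## Cluster the cleaned training lines (the `clean` vector from earlier)
dtm <- DocumentTermMatrix(VCorpus(VectorSource(clean)),
                          control = list(weighting = weightTfIdf))
m <- as.matrix(removeSparseTerms(dtm, 0.99))

set.seed(42)
km <- kmeans(m, centers = 8)  # k = 8 is an arbitrary illustrative choice

## At prediction time, map the user's sentence to its nearest centroid,
## then predict with the n-gram tables built from that cluster only
assign_cluster <- function(sentence) {
  v <- numeric(ncol(m))
  names(v) <- colnames(m)
  toks <- intersect(strsplit(tolower(sentence), "\\s+")[[1]], colnames(m))
  v[toks] <- 1
  which.min(colSums((t(km$centers) - v)^2))
}
```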