Capstone Project Presentation

2/28/2020

Introduction

This presentation was created for Coursera’s Data Science Capstone Project.

The goal of this project was to build a prediction algorithm and deploy it as a Shiny app that predicts the next word as the user types a sentence.

Here is a link to the Shiny app: https://tarski.shinyapps.io/CapstoneProject/

Getting and Cleaning the Data

A subset of the original data was sampled from each of the three sources (blogs, Twitter, and news), and the samples were merged into one corpus.
Next, the data was cleaned by converting it to lowercase, stripping extra white space, and removing punctuation and numbers.
The corresponding n-grams were then created (quadgrams, trigrams, and bigrams).
Next, term-count tables were extracted from the n-grams and sorted by frequency in descending order.
Lastly, the n-gram tables were saved as compressed R data files (.RData files).
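
The cleaning and counting steps above can be sketched in base R. This is a minimal illustration, not the project's actual code: the function names clean_text and ngram_table are hypothetical, and the original project may have used a text-mining package instead.

```r
# Clean raw text: lowercase, drop punctuation and digits, collapse whitespace.
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]+", "", x)  # remove punctuation
  x <- gsub("[[:digit:]]+", "", x)  # remove numbers
  x <- gsub("\\s+", " ", x)         # collapse runs of whitespace
  trimws(x)
}

# Build a table of n-gram counts, sorted by descending frequency.
# (Hypothetical helper; n = 2 gives bigrams, n = 3 trigrams, n = 4 quadgrams.)
ngram_table <- function(lines, n) {
  tokens <- strsplit(clean_text(lines), " ", fixed = TRUE)
  grams <- unlist(lapply(tokens, function(w) {
    w <- w[nzchar(w)]
    if (length(w) < n) return(character(0))
    starts <- seq_len(length(w) - n + 1)
    vapply(starts,
           function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  if (length(grams) == 0) {
    return(data.frame(ngram = character(0), freq = integer(0),
                      stringsAsFactors = FALSE))
  }
  counts <- sort(table(grams), decreasing = TRUE)
  data.frame(ngram = names(counts), freq = as.integer(counts),
             stringsAsFactors = FALSE)
}

# Toy input standing in for the sampled corpus.
sample_lines <- c("The cat sat.", "the cat ran 2 miles!")
bigram <- ngram_table(sample_lines, 2)
print(head(bigram))
```

A table built this way could then be written to disk with save(bigram, file = "bigram.RData"), where the file name is only an example.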

Word Prediction Model

The compressed data sets containing the n-grams, sorted by descending frequency, are loaded first.
The user's input is cleaned in the same way as the training data before the next word is predicted.
To predict the next word, the quadgram table is tried first (the first three words of the quadgram must match the last three words of the user's sentence).
If no quadgram is found, the model backs off to the trigram table (the first two words of the trigram must match the last two words of the sentence).
If no trigram is found, the model backs off to the bigram table (the first word of the bigram must match the last word of the sentence).
If no bigram is found, the most frequent word, ‘the’, is returned.
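
The back-off procedure above can be sketched as follows. This is a simplified illustration under stated assumptions: predict_next and the tables list are hypothetical names, the n-gram counts are made up, and each table is assumed to be a data frame with an ngram column already sorted by descending frequency.

```r
# Toy n-gram tables with made-up counts, keyed by n-gram order.
tables <- list(
  "4" = data.frame(ngram = c("thanks for the follow", "thanks for the help"),
                   freq = c(9L, 4L), stringsAsFactors = FALSE),
  "3" = data.frame(ngram = c("one of the", "a lot of"),
                   freq = c(12L, 10L), stringsAsFactors = FALSE),
  "2" = data.frame(ngram = c("of the", "in the", "happy birthday"),
                   freq = c(30L, 25L, 7L), stringsAsFactors = FALSE)
)

# Hypothetical predictor: try quadgrams, then trigrams, then bigrams,
# and fall back to a fixed word when nothing matches.
predict_next <- function(sentence, tables, fallback = "the") {
  # Clean the input the same way as the training text.
  s <- tolower(sentence)
  s <- gsub("[[:punct:][:digit:]]+", "", s)
  s <- trimws(gsub("\\s+", " ", s))
  words <- strsplit(s, " ", fixed = TRUE)[[1]]
  words <- words[nzchar(words)]

  labels <- c("4" = "quadgram", "3" = "trigram", "2" = "bigram")
  for (n in c(4, 3, 2)) {
    k <- n - 1                      # number of context words needed
    if (length(words) < k) next     # not enough words for this order
    prefix <- paste(utils::tail(words, k), collapse = " ")
    tab <- tables[[as.character(n)]]
    hits <- tab$ngram[startsWith(tab$ngram, paste0(prefix, " "))]
    if (length(hits) > 0) {
      # Tables are frequency-sorted, so the first hit is the most frequent.
      return(list(word = sub(".* ", "", hits[1]),
                  source = labels[[as.character(n)]]))
    }
  }
  list(word = fallback, source = "fallback")
}

print(predict_next("Thanks for the", tables))
print(predict_next("zebra", tables))
```

Returning the matched n-gram order alongside the word makes it easy for an app to report which table produced the prediction.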

Shiny Application

The user types a phrase into the input box; on the right, the app displays the predicted next word, the input sentence, and which n-gram was used to predict the next word.