2/28/2020

Introduction

This presentation was created for Coursera’s Data Science Capstone Project.

The goal of this project was to build a Shiny app that predicts the next word as the user types a sentence.

The Shiny app is available here: https://tarski.shinyapps.io/CapstoneProject/

Getting and Cleaning the Data

  • A subset of the original data was sampled from the three sources (blogs, Twitter, and news) and merged into a single data set.

  • Next, the data was cleaned by converting it to lowercase, stripping extra white space, and removing punctuation and numbers.

  • The corresponding n-grams were then created (Quadgram, Trigram, and Bigram).

  • Next, term-count tables were extracted from the n-grams and sorted by frequency in descending order.

  • Lastly, the n-gram objects were saved as compressed R data files (.RData files).
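
The cleaning and counting steps above can be sketched in a few lines. This is only an illustration in Python, not the project's R code: the function names `clean` and `ngram_counts` and the sample sentences are hypothetical, and the exact handling of punctuation (deleted rather than replaced by a space) is an assumption.

```python
import string
from collections import Counter

# Characters to delete: ASCII punctuation and digits (an assumption about
# what "removing punctuation and numbers" covers).
_DROP = str.maketrans("", "", string.punctuation + string.digits)

def clean(text):
    """Lowercase, remove punctuation and numbers, collapse white space."""
    text = text.lower().translate(_DROP)
    return " ".join(text.split())

def ngram_counts(lines, n):
    """Return (n-gram, count) pairs sorted by count in descending order."""
    counts = Counter()
    for line in lines:
        tokens = clean(line).split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts.most_common()

# Hypothetical sample standing in for the merged blogs/Twitter/news subset.
sample = ["The cat sat.", "The cat ran!", "A dog sat 2 times"]
bigrams = ngram_counts(sample, 2)
trigrams = ngram_counts(sample, 3)
```

Here `bigrams` starts with the pair `the cat`, which occurs twice in the sample; every other bigram occurs once.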

Word Prediction Model

  • The compressed data sets, which contain the n-grams sorted by descending frequency, are loaded first.

  • The words entered by the user are cleaned in the same way as the training data before the next word is predicted.

  • To predict the next word, the Quadgram table is searched first (the first three words of the Quadgram must match the last three words of the sentence entered by the user).

  • If no matching Quadgram is found, the model backs off to the Trigram table (the first two words of the Trigram must match the last two words of the sentence).

  • If no matching Trigram is found, the model backs off to the Bigram table (the first word of the Bigram must match the last word of the sentence).

  • If no matching Bigram is found, the most frequent word, ‘the’, is returned.
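
The back-off order described above can be sketched as follows. This is an illustration in Python under stated assumptions, not the app's R code: the function name `predict_next`, the layout of `tables` (a dict mapping n to a list of (n-gram, count) pairs already sorted by descending count), and the toy entries are all hypothetical. Input cleaning is reduced to lowercasing and splitting here.

```python
def predict_next(sentence, tables, default="the"):
    """Return (predicted word, n-gram order used), backing off 4 -> 3 -> 2.

    An order of 1 means no n-gram matched and the default word was returned.
    """
    tokens = sentence.lower().split()
    for n in (4, 3, 2):
        if len(tokens) < n - 1:
            continue  # not enough words to form the context for this order
        context = tuple(tokens[-(n - 1):])
        # Tables are sorted by frequency, so the first match is the most common.
        for gram, _count in tables.get(n, []):
            if gram[:-1] == context:
                return gram[-1], n
    return default, 1

# Toy frequency tables (hypothetical counts), highest count first.
tables = {
    4: [(("thanks", "for", "the", "help"), 3)],
    3: [(("for", "the", "first"), 5), (("for", "the", "help"), 3)],
    2: [(("the", "same"), 9), (("of", "the"), 8)],
}

result_quad = predict_next("thanks for the", tables)
result_tri = predict_next("looking for the", tables)
result_bi = predict_next("over the", tables)
result_default = predict_next("hello zebra", tables)
```

With these toy tables, "thanks for the" is resolved by the Quadgram, "looking for the" falls back to the Trigram, "over the" to the Bigram, and "hello zebra" to the default word. The linear scan keeps the sketch short; a lookup keyed on the context words would be the usual choice for large tables.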

Shiny Application

The user types a phrase into the input box. On the right, the app displays the predicted next word, the input sentence, and which n-gram was used to make the prediction.