Final Presentation

February 21, 2018

Brief Introduction

This project is the capstone of the Data Science Specialization, designed by Johns Hopkins University and offered on Coursera.
The ultimate goal is to build a predictive model that suggests the next word a user will type while writing a sentence.
The data sets used are Twitter, news, and blogs.
Because of the size limit on Shiny apps, only a subset of the n-grams was chosen for building this model.

Getting & Cleaning Data

Before the n-grams can be analyzed, the raw text needs some cleaning.
Convert the text to lowercase, strip extra white space, and remove punctuation and numbers.
Create n-grams: Bigram, Trigram, and Quadgram.
Separate the data into two categories, Twitter only and all sources combined, for specific uses.
Sort the n-gram data by frequency in descending order.
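
The cleaning and counting steps above can be sketched in a few lines of Python. This is only an illustration, not the code behind the app: the function names (clean_text, ngram_counts, sorted_ngrams) are hypothetical, and it assumes that punctuation and digits are deleted outright rather than replaced by spaces.

```python
import re
from collections import Counter


def clean_text(text):
    """Lowercase, delete punctuation and digits, collapse white space."""
    text = text.lower()
    # Assumption: every character that is not a-z or white space is dropped.
    text = re.sub(r"[^a-z\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()


def ngram_counts(lines, n):
    """Count every run of n consecutive words across the cleaned lines."""
    counts = Counter()
    for line in lines:
        words = clean_text(line).split()
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts


def sorted_ngrams(counts):
    """Return (ngram, frequency) pairs, most frequent first (ties alphabetical)."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Tiny made-up corpus, for illustration only.
corpus = ["The cat sat.", "the cat ran 2 miles!"]
bigrams = ngram_counts(corpus, 2)
print(sorted_ngrams(bigrams)[0])  # (('the', 'cat'), 2)
```

Calling ngram_counts with n set to 2, 3, and 4 would give the Bigram, Trigram, and Quadgram tables; running it once on tweets alone and once on all sources would give the two categories.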

Prediction Model is based on the Katz Back-off algorithm

User input is cleaned in the same way as the training data before the next word is predicted.
The Quadgram table is tried first (the first three words of the Quadgram must match the last three words of the user's sentence).
If no Quadgram is found, back off to the Trigram (the first two words of the Trigram must match the last two words of the sentence).
If no Trigram is found, back off to the Bigram (the first word of the Bigram must match the last word of the sentence).
If no Bigram is found, return 'the', the most frequent word.
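
The back-off order described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the app's implementation: it returns the most frequent continuation at the first level that matches and applies no Katz discounting. The function predict_next, the layout of TABLES, and all counts in it are hypothetical.

```python
import re

# Hypothetical lookup tables: n -> {context words: {next word: frequency}}.
# The counts are made up for illustration.
TABLES = {
    4: {("thanks", "for", "the"): {"follow": 5, "help": 2}},
    3: {("for", "the"): {"first": 7, "follow": 5}},
    2: {("the",): {"same": 9, "first": 7}},
}


def predict_next(sentence, tables, default="the"):
    """Try Quadgram, then Trigram, then Bigram; fall back to the default word."""
    # Clean the input the same way as the training text.
    words = re.sub(r"[^a-z\s]", "", sentence.lower()).split()
    for n in (4, 3, 2):
        context = tuple(words[-(n - 1):])
        if len(context) < n - 1:
            continue  # sentence too short for this n-gram level
        candidates = tables.get(n, {}).get(context)
        if candidates:
            # Pick the continuation with the highest frequency.
            return max(candidates, key=candidates.get)
    return default


print(predict_next("Thanks for the", TABLES))   # follow (Quadgram match)
print(predict_next("waiting for the", TABLES))  # first  (backs off to Trigram)
print(predict_next("see the", TABLES))          # same   (backs off to Bigram)
print(predict_next("zebra", TABLES))            # the    (no match at any level)
```

Storing each table as a dictionary keyed by the context words keeps every lookup to a single hash access, which matters when the tables have to fit within the size limit of a hosted app.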

Could be used in a mobile input method, so that users can select the most likely next word without typing the whole word.
Could be used for analyzing the effectiveness of tweets.