February 21, 2018

Brief Introduction

  • This project is the capstone, the final part of the Data Science Specialization offered by Johns Hopkins University on Coursera.
  • The ultimate goal is to build a model that predicts the next word a user will type while writing a sentence.
  • The data sets used are Twitter, news, and blog texts.
  • Because of the size limit on Shiny apps, only a subset of the n-grams was chosen to build this model.

Getting & Cleaning Data

  • Before the n-grams can be analyzed, the raw text needs to be cleaned.
  • Convert the text to lowercase, strip extra white space, and remove punctuation and numbers.
  • Create n-grams: bigrams, trigrams, and quadgrams.
  • Separate the data into two categories for specific uses: Twitter only, and all sources combined.
  • Sort the n-gram data by frequency in descending order.
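
The cleaning and counting steps above can be sketched roughly as follows. This is a minimal Python illustration, not the code behind the app; the function names `clean_text` and `ngram_counts` are hypothetical, and the sketch assumes English text where every non-letter character can simply be dropped.

```python
import re
from collections import Counter


def clean_text(text):
    """Lowercase, delete everything except letters and whitespace, collapse spaces."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", "", text)      # removes punctuation and digits
    return re.sub(r"\s+", " ", text).strip()  # strips extra white space


def ngram_counts(lines, n):
    """Return (n-gram, frequency) pairs sorted by frequency, highest first."""
    counts = Counter()
    for line in lines:
        words = clean_text(line).split()
        # Slide a window of n words across the cleaned line.
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts.most_common()


if __name__ == "__main__":
    sample = ["The cat sat.", "The cat ran 2 miles!"]
    for gram, freq in ngram_counts(sample, 2):
        print(" ".join(gram), freq)
```

Calling `ngram_counts` with `n` set to 2, 3, and 4 would give the bigram, trigram, and quadgram tables described above.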

Prediction Model is based on the Katz Back-off algorithm

  • The user's input is cleaned in the same way as the training data before the next word is predicted.
  • The quadgrams are tried first: the first three words of a quadgram must match the last three words of the user's sentence.
  • If no quadgram is found, back off to the trigrams (the first two words of a trigram must match the last two words of the sentence).
  • If no trigram is found, back off to the bigrams (the first word of a bigram must match the last word of the sentence).
  • If no bigram is found, back off to the most frequent word overall, and 'the' is returned.
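
The lookup order above can be sketched as follows. This is a simplified Python illustration under stated assumptions: the names `build_tables` and `predict_next` are hypothetical, the tiny corpus is made up for the example, and the sketch only picks the most frequent continuation at the longest matching context. A full Katz back-off model would additionally discount counts and redistribute probability mass to the lower-order n-grams, which is not shown here.

```python
import re
from collections import defaultdict


def clean_text(text):
    """Lowercase, keep only letters and whitespace, collapse repeated spaces."""
    text = re.sub(r"[^a-z\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def build_tables(lines, max_context=3):
    """Map context length k -> {context tuple: {next word: count}}."""
    tables = {k: defaultdict(lambda: defaultdict(int))
              for k in range(1, max_context + 1)}
    for line in lines:
        words = clean_text(line).split()
        for k in tables:
            for i in range(len(words) - k):
                tables[k][tuple(words[i:i + k])][words[i + k]] += 1
    return tables


def predict_next(sentence, tables, default="the"):
    """Try the longest context first, then back off to shorter ones."""
    words = clean_text(sentence).split()
    for k in sorted(tables, reverse=True):   # 3 (quadgram), 2 (trigram), 1 (bigram)
        if len(words) < k:
            continue
        followers = tables[k].get(tuple(words[-k:]))
        if followers:
            return max(followers, key=followers.get)
    return default                           # nothing matched at any order


if __name__ == "__main__":
    corpus = ["I would like to go home", "I would like to eat",
              "I would like to go out", "you like tea"]
    tables = build_tables(corpus)
    print(predict_next("I would like", tables))
    print(predict_next("they like to", tables))
```

Storing the tables by context rather than as flat sorted lists is a design choice made for this sketch; it makes each back-off step a single dictionary lookup.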

  • Could be used in mobile input methods, so that users can select the most likely next word without typing the whole word.
  • Could be used to analyze the effectiveness of tweets.