Coursera Data Science Capstone Project

Predicting the Next Word

Ashish Veera
21st January 2018

Introduction

The goal of this project is to build a predictive text model that suggests the next word as the user types a sentence into the Shiny app.

The Shiny app is available at https://ashishveera.shinyapps.io/ShinyApp/

The GitHub repository is available at https://github.com/AshishVeera/Data-Science-Capstone

Preprocessing the data

  • A subset of the original data (downloaded from Coursera) was sampled from each of the three sources (blogs, Twitter and news), and the samples were merged into a single corpus.
  • The corpus was then cleaned by converting it to lowercase, stripping extra whitespace, and removing punctuation and numbers.
  • The n-grams were then created (quadgrams, trigrams and bigrams).
  • Term-count tables were extracted from the n-grams and sorted by frequency in descending order.
  • Finally, the n-gram objects were saved as compressed R data files (.RData).
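
The cleaning and counting steps above can be sketched as follows. This is a minimal illustration in Python, not the project's R code: the function names `clean_text` and `ngram_table` and the two sample sentences are assumptions made for the example, and the sampling and .RData saving steps are left out.

```python
import re
from collections import Counter

def clean_text(text):
    """Lowercase, delete digits and punctuation, collapse whitespace."""
    text = text.lower()
    # Keep only a-z and whitespace; this also deletes apostrophes
    # and any non-ASCII characters.
    text = re.sub(r"[^a-z\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()

def ngram_table(lines, n):
    """Return (n-gram, count) pairs sorted by descending count."""
    counts = Counter()
    for line in lines:
        words = clean_text(line).split()
        # Slide a window of n words over the cleaned line.
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts.most_common()

# Hypothetical two-line corpus standing in for the merged sample.
sample = ["The cat sat on the mat.", "The cat sat on 2 chairs!"]
bigrams = ngram_table(sample, 2)
print(bigrams[:3])
```

Calling `ngram_table(sample, 3)` or `ngram_table(sample, 4)` would build the trigram and quadgram tables in the same way.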

Word Prediction Model

The next-word prediction model is based on the Katz back-off algorithm. It works as follows:

  • The compressed data sets, which contain the n-grams sorted by descending frequency, are loaded first.
  • The user's input is cleaned in the same way as the training data before the next word is predicted.
  • To predict the next word, the quadgrams are tried first (the first three words of the quadgram must match the last three words of the user's sentence).
  • If no quadgram is found, the model backs off to the trigrams (the first two words of the trigram must match the last two words of the sentence).
  • If no trigram is found, the model backs off to the bigrams (the first word of the bigram must match the last word of the sentence).
  • If no bigram is found, the most frequent word, 'the', is returned.
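
The back-off lookup described above can be sketched as follows. This is a minimal illustration in Python under stated assumptions, not the app's R implementation: `predict_next`, `clean_text` and the tiny `tables` dictionary are hypothetical, and the sketch only performs the frequency-ordered lookup from the bullets. A full Katz back-off model would additionally discount counts and redistribute probability mass to lower-order n-grams.

```python
import re

def clean_text(text):
    """Lowercase, delete digits and punctuation, collapse whitespace."""
    text = re.sub(r"[^a-z\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()

def predict_next(sentence, tables, default="the"):
    """Return (predicted word, order of the n-gram that matched).

    `tables` maps an n-gram order (4, 3, 2) to a list of
    (n-gram, count) pairs sorted by descending count, so the first
    match in a list is the most frequent continuation.
    """
    words = clean_text(sentence).split()
    for n in (4, 3, 2):
        if len(words) < n - 1:
            continue  # not enough context for this order
        prefix = " ".join(words[-(n - 1):]) + " "
        for ngram, _count in tables.get(n, []):
            if ngram.startswith(prefix):
                return ngram[len(prefix):], n
    # Nothing matched at any order: fall back to the default word.
    return default, 1

# Hypothetical miniature tables, made up for the example.
tables = {
    4: [("thanks for the follow", 3)],
    3: [("one of the", 5), ("a lot of", 4)],
    2: [("of the", 9), ("in the", 8), ("lot of", 2)],
}

print(predict_next("Thanks for the", tables))  # quadgram match
print(predict_next("He is one of", tables))    # backs off to trigram
print(predict_next("A whole lot", tables))     # backs off to bigram
print(predict_next("zebra", tables))           # falls back to default
```

Returning the matched order alongside the word makes it easy to report which n-gram produced the prediction.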

Shiny Application

  • The user enters a partial sentence and clicks the submit button.
  • The predicted word is shown in the dedicated text box under the title “The predicted next word is:”.
  • The n-gram used to predict the next word is indicated in the note section.