Capstone Presentation

JHU Coursera Data Science Capstone Project (Presentation)

Maninder Khurana, Ph.D.
Sun Nov 03 17:26:56 2019


Smart Keyboard App to predict the next word

Introduction

This presentation is submitted in partial fulfillment of the Coursera Data Science Capstone course.

The project builds a predictive text model, together with a Shiny app UI, that predicts the next word as the user types, similar to the technology behind SwiftKey.

[Shiny App] - [https://phng.shinyapps.io/capstone]

[Github Repo] - [https://github.com/maninderkhurana/Capstone.git]

Collecting & Preparing the Data

App foundations: the data is collected, processed and cleaned in the following steps:

  • Data Collection: A sample subset of the original data from the three sources (blogs, Twitter and news) is collected and merged into one corpus.
  • Data Cleaning: The text is converted to lowercase, extra white space is stripped, and punctuation and numbers are removed.
  • N-grams: The corresponding n-grams (quadgrams, trigrams and bigrams) are then created.
  • Term-count Tables: Term-count tables are extracted from the n-grams and sorted by frequency in descending order.
  • Store: Lastly, the n-gram objects are saved as compressed R data files (.RData).
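The cleaning and term-count steps above can be sketched roughly as follows. This is a minimal illustration in Python, not the project's R code: the function names `clean_text` and `ngram_table`, the toy sample lines, and the choice to keep only ASCII letters are all assumptions made for the example. The storage step is omitted.

```python
import re
from collections import Counter


def clean_text(text):
    """Lowercase, remove punctuation and digits, and collapse white space."""
    text = text.lower()
    # Assumption: anything that is not an ASCII letter or white space is dropped.
    text = re.sub(r"[^a-z\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()


def ngram_table(lines, n):
    """Return (n-gram, count) pairs sorted by descending count."""
    counts = Counter()
    for line in lines:
        tokens = clean_text(line).split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts.most_common()


# Hypothetical toy corpus standing in for the sampled blogs/Twitter/news data.
sample = ["I love data science!", "I love R, and I love 2 Shiny apps."]
bigrams = ngram_table(sample, 2)
trigrams = ngram_table(sample, 3)
```

Here `bigrams` begins with the pair `("i", "love")`, which occurs three times in the toy corpus, mirroring the descending-frequency ordering of the term-count tables.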

Model Formulation

Algorithm: The prediction model for next word is based on the Katz Back-off algorithm as explained next:

  • Load the data sets: The compressed data sets containing the n-grams, sorted by descending frequency, are loaded first.
  • Words Preprocessing (Maintain Consistency): The user's input is cleaned in the same way as the training data before the next word is predicted.
  • Model Prediction: The quadgram table is tried first (the first three words of the quadgram must match the last three words of the user's sentence). If no quadgram is found, the trigram table is tried; if no trigram is found, the bigram table is tried.
  • If no bigram is found, the most frequent word, 'the', is returned.
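The lookup order described above can be sketched as below. This is a hedged Python illustration rather than the app's R code: `predict_next`, its parameters, and the toy tables are hypothetical, and each table is assumed to be a list of (n-gram, count) pairs already sorted by descending frequency. The sketch follows only the back-off order given in the list; it does not compute the discounted probabilities used in full Katz back-off.

```python
import re


def predict_next(sentence, quadgrams, trigrams, bigrams, default="the"):
    """Back off from quadgrams to trigrams to bigrams, then to a default word."""
    # Clean the input the same way as the training text (assumed rules).
    words = re.sub(r"[^a-z\s]", "", sentence.lower()).split()
    for table, k in ((quadgrams, 3), (trigrams, 2), (bigrams, 1)):
        if len(words) < k:
            continue  # not enough context for this n-gram order
        context = tuple(words[-k:])
        # Tables are sorted by descending count, so the first match
        # is the most frequent continuation of the context.
        for ngram, _count in table:
            if ngram[:-1] == context:
                return ngram[-1]
    return default


# Hypothetical toy tables, each sorted by descending count.
quadgrams = [(("thanks", "for", "the", "follow"), 5)]
trigrams = [(("one", "of", "the"), 9), (("for", "the", "first"), 4)]
bigrams = [(("of", "the"), 20), (("data", "science"), 7)]

print(predict_next("Thanks for the", quadgrams, trigrams, bigrams))
print(predict_next("ready for the", quadgrams, trigrams, bigrams))
print(predict_next("big data", quadgrams, trigrams, bigrams))
print(predict_next("hello zebra", quadgrams, trigrams, bigrams))
```

With these toy tables, the first call is answered by the quadgram table, the second backs off to the trigram table, the third to the bigram table, and the last falls through to the default word.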

Shiny Application

For the project, a Shiny application was also developed based on the model presented here; it works in the way described above. Do check it out.