Data Science Capstone Project

Djoko Soehartono
28 December 2016

Predicting Next Word

Introduction

This presentation was created as part of the requirements for the Coursera Data Science Capstone course.

The goal of the project is to build a predictive text model, combined with a Shiny app UI, that predicts the next word as the user types a sentence. This is similar to the way most smartphone keyboards work today, for example those built on SwiftKey technology.

[Shiny App] - [https://dsoehartono.shinyapps.io/capstone]

[Github Repo] - [https://github.com/dsoehartono/capstone]

Getting & Cleaning the Data

Before building the word prediction algorithm, the data were processed and cleaned in the following steps:

  • A subset of the original data was sampled from the three sources (blogs, news and Twitter), and the samples were merged into one corpus.
  • Next, the data were cleaned by converting to lowercase, stripping white space, and removing punctuation and numbers.
  • The corresponding N-grams were then created (Quadgram, Trigram and Bigram).
  • Next, term-count tables were extracted from the N-grams and sorted by frequency in descending order.
  • Lastly, the N-gram objects were saved as compressed R data files (.RData files).
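
The cleaning and counting steps above can be sketched roughly as follows. This is a minimal illustration in Python, not the project's actual R code: the function names `clean_text` and `ngram_counts` and the three sample sentences are hypothetical, and only ASCII punctuation is removed.

```python
import re
import string
from collections import Counter

def clean_text(text):
    """Lowercase, drop digits and ASCII punctuation, collapse white space."""
    text = text.lower()
    text = re.sub(r"[0-9]+", " ", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

def ngram_counts(lines, n):
    """Return (ngram_tuple, count) pairs sorted by descending count."""
    counts = Counter()
    for line in lines:
        tokens = clean_text(line).split()
        # Slide a window of n tokens across the cleaned line.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts.most_common()

# Hypothetical stand-in for the merged blogs/news/Twitter sample.
sample = ["I love New York!", "I love 2 cats.", "We love New York"]
bigrams = ngram_counts(sample, 2)
trigrams = ngram_counts(sample, 3)
```

Each table is a list of N-gram tuples with their counts, most frequent first, which corresponds to the sorted term-count tables described above. Saving the tables to disk is left out of the sketch.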

Model for Word Prediction

The next-word prediction model is a simple N-gram back-off scheme based on the idea behind the Katz Back-off algorithm, as explained below:

  • The compressed datasets containing the N-grams, sorted by descending frequency, are loaded first.
  • The user's input words are cleaned in the same way as the training data before the next word is predicted.
  • To predict the next word, the Quadgram is used first (the first three words of the Quadgram are the last three words of the sentence provided by the user).
  • If no Quadgram is found, back off to the Trigram (the first two words of the Trigram are the last two words of the sentence).
  • If no Trigram is found, back off to the Bigram (the first word of the Bigram is the last word of the sentence).
  • If no Bigram is found, back off to the most frequent word overall, so 'the' is returned.
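
The back-off steps above can be sketched as a lookup that tries the longest context first. This is a minimal Python illustration under stated assumptions, not the project's R implementation: `build_lookup`, `predict_next` and the tiny N-gram tables are hypothetical, and the sketch returns only the single most frequent continuation without the probability discounting of the full Katz method.

```python
import re

def build_lookup(ngram_table):
    """Map each context (all words but the last) to its most frequent next word.

    ngram_table is a list of (ngram_tuple, count) pairs sorted by descending
    count, so the first entry seen for a context is the most frequent one.
    """
    lookup = {}
    for ngram, _count in ngram_table:
        lookup.setdefault(ngram[:-1], ngram[-1])
    return lookup

def predict_next(sentence, lookups, default="the"):
    """Return (predicted_word, n), where n is the N-gram order that matched."""
    # Clean the input like the training data: lowercase, letters only.
    words = re.sub(r"[^a-z\s]", " ", sentence.lower()).split()
    for n in (4, 3, 2):  # Quadgram, then Trigram, then Bigram
        context = tuple(words[-(n - 1):])
        if len(context) == n - 1 and context in lookups.get(n, {}):
            return lookups[n][context], n
    return default, 1  # nothing matched: fall back to the most common word

# Hypothetical tables, already sorted by descending count.
lookups = {
    4: build_lookup([(("thanks", "for", "the", "follow"), 3)]),
    3: build_lookup([(("one", "of", "the"), 5), (("one", "of", "my"), 2)]),
    2: build_lookup([(("of", "the"), 9), (("new", "york"), 4)]),
}

print(predict_next("Thanks for the", lookups))
print(predict_next("He is one of", lookups))
print(predict_next("brand new", lookups))
print(predict_next("zzz", lookups))
```

Returning the matched order alongside the word makes it easy to report which back-off level produced the prediction.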

Shiny Application

A Shiny application was developed based on the next-word prediction model described above. It works as follows:

  • The user enters a partially complete sentence in the input box and presses the “Submit” button.
  • The predicted next word is shown in the “Predicted Next Word” text box.
  • The N-gram back-off step used for the prediction is also shown.