6/17/2018

Introduction

The goal of this project is to build a predictive text model, paired with a Shiny app UI, that predicts the next word as the user types a sentence. This is similar to how most smartphone keyboards work today, such as those built on SwiftKey technology.

Shiny App: https://jtsou.shinyapps.io/FindNextWord/

Getting and Cleaning Data

  1. The raw US data sets (blogs, Twitter, and news) are loaded and merged into a single corpus.
  2. The data is cleaned by converting it to lowercase, stripping extra white space, and removing punctuation and numbers.
  3. The corresponding n-grams are then created (unigram, bigram, trigram, and quadgram).
  4. Term-count tables are extracted from the n-grams and sorted by frequency in descending order.
  5. The n-gram objects are saved as compressed R data files (.rdata files).
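
The cleaning and counting steps above can be sketched as follows. This is a minimal, language-neutral illustration in Python, not the project's actual code: the function names `clean_text` and `ngram_counts`, the two-line sample corpus, and the rule of keeping only the letters a-z are all assumptions made for the example.

```python
import re
from collections import Counter


def clean_text(text):
    """Lowercase, remove punctuation and digits, collapse white space."""
    text = text.lower()
    # Assumption: anything that is not a-z or white space is dropped.
    text = re.sub(r"[^a-z\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()


def ngram_counts(lines, n):
    """Return (ngram, count) pairs sorted by count, highest first."""
    counts = Counter()
    for line in lines:
        tokens = clean_text(line).split()
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts.most_common()


# Hypothetical two-line corpus standing in for the merged data.
corpus = ["The cat sat on the mat.", "The cat ate 2 fish!"]
bigrams = ngram_counts(corpus, 2)
print(bigrams[0])
```

Calling `ngram_counts` with n set to 1 through 4 would give the four sorted tables described in the steps; saving them to disk is left out of the sketch.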

Word Prediction Model

The model is based on the Katz back-off algorithm.

  1. The compressed data sets, which contain the n-grams sorted by descending frequency, are loaded first.
  2. The user's input is cleaned in the same way as the training data before the next word is predicted.
  3. Prediction starts with the quadgram table: the last three words of the user's sentence are matched against the first three words of each quadgram.
  4. If no quadgram is found, the model backs off to the trigram table.
  5. If no trigram is found, the model backs off to the bigram table.
  6. If no bigram is found, the model backs off to the unigram table.
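
The back-off order in the steps above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the app's actual code: `predict_next`, the `tables` dictionary, and its toy counts are hypothetical, and input cleaning is reduced to lowercasing. The sketch only walks down the tables and returns the most frequent match; a full Katz back-off model would also discount counts and redistribute probability mass to the lower-order n-grams.

```python
def predict_next(sentence, tables, max_order=4):
    """Predict the next word by backing off from the highest-order table.

    tables maps an n-gram order to a list of (ngram, count) pairs
    sorted by count in descending order.
    """
    words = sentence.lower().split()
    for n in range(max_order, 1, -1):
        context = words[-(n - 1):]
        if len(context) < n - 1:
            continue  # not enough words typed for this order
        prefix = " ".join(context) + " "
        for ngram, _count in tables.get(n, []):
            if ngram.startswith(prefix):
                return ngram.split()[-1]  # first hit is the most frequent
    # Nothing matched: fall back to the most frequent unigram.
    unigrams = tables.get(1, [])
    return unigrams[0][0] if unigrams else None


# Hypothetical toy tables, already sorted by frequency.
tables = {
    1: [("the", 5), ("cat", 3)],
    2: [("the cat", 2), ("cat sat", 1)],
    3: [("the cat sat", 1)],
    4: [],
}
print(predict_next("I saw the cat", tables))
```

Because each table is sorted by descending frequency, the first matching entry is the most frequent continuation, so the inner loop can stop at the first hit.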

Shiny Application