Word Predictor Application

Jonathan Mallia
27th May, 2017

Coursera Data Science Capstone Project by Johns Hopkins University

Project Overview

The Word Predictor application predicts the next word based on the last words in the sentence.

First, the last 4 words in the sentence are examined to determine whether there is a suitable prediction for that phrase. If there is none, the last 3 words are fed to the prediction function to find a match. This process repeats with progressively shorter phrases until there is a match.

The application shows the top 3 suggested words along with their respective probabilities.
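
The back-off lookup described above can be sketched roughly as follows. This is a minimal illustration in Python, not the application's actual R code: the `NGRAMS` table, its counts, and the `predict_next` function are hypothetical, and the scores are plain relative frequencies rather than smoothed probabilities.

```python
from collections import Counter

# Hypothetical toy table: context tuple -> counts of the words that followed it.
# A real model would be built from a large corpus.
NGRAMS = {
    ("thanks", "for", "the"): Counter({"follow": 5, "help": 3, "update": 2}),
    ("for", "the"): Counter({"first": 6, "follow": 5, "help": 3}),
    ("the",): Counter({"same": 9, "first": 8, "best": 7}),
}

def predict_next(phrase, max_context=4, top_n=3):
    """Return up to top_n (word, relative frequency) pairs using simple back-off."""
    words = phrase.lower().split()
    # Try the longest context first, then drop the earliest word each round.
    for size in range(min(max_context, len(words)), 0, -1):
        context = tuple(words[-size:])
        followers = NGRAMS.get(context)
        if followers:
            total = sum(followers.values())
            return [(w, c / total) for w, c in followers.most_common(top_n)]
    return []  # no context matched at any length

print(predict_next("many thanks for the"))
```

With this toy table, the 4-word context is not found, so the lookup falls through to the 3-word context and returns its three most frequent followers.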

Data and processing

The biggest challenge in this project is the amount of data. The number of possible word and phrase combinations in a language is vast, so large corpora (collections of text) are needed. This was a limitation given the available computer resources (especially mine, with only 4 GB of RAM).

Therefore, only 6% of the original corpus (blogs, news and tweets) was used to build the prediction engine with R's tm package. Such a small sample can reduce the accuracy of the predictions.
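
One way to draw such a sample is to keep each line of the corpus with a fixed probability. The sketch below is illustrative Python rather than the R code used in the project; the `sample_lines` function and the seed value are assumptions.

```python
import random

def sample_lines(lines, fraction=0.06, seed=2017):
    """Keep each line independently with probability `fraction`."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return [line for line in lines if rng.random() < fraction]

# Stand-in corpus; the real input would be the blogs, news and tweets files.
corpus = [f"line {i}" for i in range(10000)]
sample = sample_lines(corpus)
print(len(sample))  # roughly 6% of the input
```

Sampling line by line keeps memory use low because the full corpus never has to be tokenized at once.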

Subsequent steps

  • Remove word contractions, clean the documents, remove stop words, etc. Stemming was not performed, as it did not prove useful
  • Build bigrams, trigrams and quadgrams and save them to disk to reduce application load times
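
The cleaning and n-gram steps above might look roughly like this. It is a hedged Python sketch, not the tm-based pipeline from the project: the `CONTRACTIONS` and `STOP_WORDS` lists are tiny placeholders, "removing contractions" is assumed to mean expanding them, and `clean` and `ngram_counts` are hypothetical helper names.

```python
import re
from collections import Counter

# Small illustrative lists; a real pipeline would use fuller ones.
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "i'm": "i am", "it's": "it is"}
STOP_WORDS = {"the", "a", "an", "of", "to"}

def clean(text):
    """Lowercase, expand contractions, keep letters only, drop stop words."""
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    tokens = re.findall(r"[a-z]+", text)
    return [t for t in tokens if t not in STOP_WORDS]

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = clean("I'm sure it's the best!")
bigrams = ngram_counts(tokens, 2)
trigrams = ngram_counts(tokens, 3)
quadgrams = ngram_counts(tokens, 4)
```

The resulting count tables are what would be written to disk once, so the application only has to load them at startup.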

The prediction algorithms

A Katz back-off model with Simple Good-Turing smoothing was used for this project. Katz lets probabilities compete: the probability of an N-gram versus that of an (N-1)-gram, adjusted by a factor. If an N-gram is unobserved, we back off to the (N-1)-grams.
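
For reference, the standard textbook form of the Katz back-off estimate is given below. The notation is an assumption, since the project does not state its own: C(·) is an observed count, d is a discount factor (here it would come from Good-Turing), α is the back-off weight that keeps the distribution normalized, and k is a count threshold, commonly 0.

```latex
P_{bo}(w_i \mid w_{i-n+1}^{i-1}) =
\begin{cases}
d_{w_{i-n+1}^{i}} \, \dfrac{C(w_{i-n+1}^{i})}{C(w_{i-n+1}^{i-1})}
  & \text{if } C(w_{i-n+1}^{i}) > k \\[2ex]
\alpha(w_{i-n+1}^{i-1}) \, P_{bo}(w_i \mid w_{i-n+2}^{i-1})
  & \text{otherwise}
\end{cases}
```

The second case is the "fall back" step: the earliest word of the context is dropped and the shorter context is scored, scaled by α.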

Simple Good-Turing (SGT) smoothing was implemented to obtain better probability estimates and, most importantly, a reasonable probability estimate for unseen words and N-grams.

The application responds quickly because of the way the phrase cleaning and prediction models were written.
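
The core idea behind Good-Turing can be illustrated with the basic Turing estimate, where a count r is adjusted to r* = (r + 1) N_{r+1} / N_r and N_r is the number of items seen exactly r times. The Python sketch below is a simplification and not the project's implementation: full SGT additionally smooths the N_r values with a log-linear fit, which is omitted here, and the `good_turing` function name is hypothetical.

```python
from collections import Counter

def good_turing(counts):
    """Return (adjusted count r* for each r, probability mass for unseen items)."""
    total = sum(counts.values())
    if total == 0:
        return {}, 0.0
    freq_of_freq = Counter(counts.values())  # N_r: number of items seen r times
    adjusted = {}
    for r, n_r in freq_of_freq.items():
        n_next = freq_of_freq.get(r + 1, 0)
        # Basic Turing estimate; keep the raw count when N_{r+1} is zero.
        adjusted[r] = (r + 1) * n_next / n_r if n_next else float(r)
    p_unseen = freq_of_freq.get(1, 0) / total  # N_1 / N
    return adjusted, p_unseen

toy_counts = Counter({"a": 1, "b": 1, "c": 1, "d": 2, "e": 2, "f": 3})
adjusted, p_unseen = good_turing(toy_counts)
```

In this toy example three of the ten observations are singletons, so 30% of the probability mass is set aside for items that were never seen.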

Subsequent steps

  • Develop Shiny application as interface to the algorithm
  • Deploy to ShinyApps.io application farm

How does it work?

  1. Visit link https://jonimatix.shinyapps.io/wordpredictor/
  2. Type a phrase, sentence, or word
  3. The application uses the trained model to predict the three most probable next words as you type
  4. The application outputs the results