Coursera Data Science Capstone: Next Word Prediction Web Application

Bryan Scheiderer
March 12, 2017

Introduction

This Shiny web application was developed as part of the capstone course of the Coursera Data Science Specialization, offered in conjunction with SwiftKey.

The assignment was to build an app that uses natural language processing to predict the next word from a user's input of a word or short phrase. A large corpus of blog, news, and Twitter data was used to build the model. Using the R programming language and packages, the data set was tokenized (split into individual words), and ngram frequency tables were built from the tokens. Ngrams are sequences of consecutive words; in this case only 1-, 2-, and 3-gram tables were used. Models built on larger ngrams are generally more accurate. The frequency with which each ngram appears in the corpus was calculated and saved. The model then searches these saved frequency tables to predict the word that follows the user's input text.
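To make the idea of an ngram frequency table concrete, here is a minimal sketch in Python. It is an illustration only, not the app's R code: the function names `tokenize` and `ngram_counts`, the whitespace tokenizer, and the two-sentence toy corpus are all assumptions made for this example.

```python
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep only purely alphabetic tokens (illustrative rule)."""
    return [word for word in text.lower().split() if word.isalpha()]


def ngram_counts(sentences, n):
    """Count every run of n consecutive tokens across all sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = tokenize(sentence)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


# Toy corpus standing in for the blog, news, and Twitter data (hypothetical).
corpus = ["the cat sat on the mat", "the cat ate the fish"]

unigrams = ngram_counts(corpus, 1)
bigrams = ngram_counts(corpus, 2)
trigrams = ngram_counts(corpus, 3)

print(bigrams.most_common(1))
```

Each table maps a tuple of words to the number of times that sequence occurs, which is the information the prediction step needs when it looks up what usually follows a given word or pair of words.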

The Shiny Web App

The Next Word Prediction App is located at: https://bdscheiderer.shinyapps.io/WordPrediction/

[Screenshot of the Next Word Prediction App]

Prediction Algorithm Key Components

  • The user inputs a string of words; only English is supported at this time, and only the last two words are used for prediction
  • The app “cleans” the user input by removing punctuation, numbers, profane words, stopwords, and extra whitespace; the cleaned text is returned to the user interface for comparison
  • The prediction model then takes the cleaned word or words and searches the stored 2- and 3-gram frequency tables for matches
  • The model is a simplified “backoff” prediction algorithm: it first searches the trigram table for a match; if no match is found, it searches the bigram table; if neither table has a match, it randomly chooses three of the top 20 words in the unigram frequency table. While this random fallback is not statistically justified, it is more interesting than offering the same top three words every time there are no other matches!
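The steps in the bullets above can be sketched as follows. This is a hedged Python illustration rather than the app's actual R implementation: the stopword and profanity sets, the toy frequency tables, and the helper names `clean_input`, `top_matches`, and `predict` are all hypothetical, and the tie-breaking rule (alphabetical) is an assumption.

```python
import random
import re

# Tiny placeholder word lists; the real app's lists are not given in the source.
STOPWORDS = {"the", "a", "of"}
PROFANITY = {"darn"}


def clean_input(text):
    """Strip punctuation and numbers, drop stopwords/profanity, keep the last two words."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    words = [w for w in text.split() if w not in STOPWORDS and w not in PROFANITY]
    return words[-2:]


def top_matches(table, prefix, k=3):
    """Return up to k final words of ngrams that start with prefix, most frequent first."""
    matches = [(ngram[-1], count) for ngram, count in table.items()
               if ngram[:-1] == prefix]
    matches.sort(key=lambda m: (-m[1], m[0]))  # ties broken alphabetically (assumption)
    return [word for word, _ in matches[:k]]


def predict(text, unigrams, bigrams, trigrams, rng=random):
    """Simplified backoff: trigrams, then bigrams, then a random pick from top unigrams."""
    words = tuple(clean_input(text))
    if len(words) == 2:
        found = top_matches(trigrams, words)
        if found:
            return found
    if words:
        found = top_matches(bigrams, words[-1:])
        if found:
            return found
    ranked = sorted(unigrams.items(), key=lambda kv: (-kv[1], kv[0]))
    top20 = [ngram[0] for ngram, _ in ranked[:20]]
    return rng.sample(top20, min(3, len(top20)))


# Toy frequency tables keyed by word tuples (hypothetical counts).
trigrams = {("happy", "new", "year"): 5, ("happy", "new", "day"): 2}
bigrams = {("new", "york"): 7, ("new", "year"): 6, ("good", "morning"): 4}
unigrams = {("time",): 9, ("day",): 8, ("year",): 7, ("love",): 6}

print(predict("Happy new", unigrams, bigrams, trigrams))
print(predict("brand new", unigrams, bigrams, trigrams))
```

In this sketch, "Happy new" is resolved from the trigram table, while "brand new" has no trigram match and backs off to bigrams that begin with "new". Input with no match in either table falls through to the random choice among the most frequent unigrams.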

Future Improvements

Possible improvements for future versions:

  • Develop a prediction model that uses stemming
  • Develop a prediction model with stop words included
  • Use 4- and 5-word ngrams for better accuracy
  • Improve error detection from user input
  • Use a different and larger word corpus for training
  • Create larger ngram frequency tables for better accuracy
  • Improve user interface