Data Science Capstone Presentation:

By: An Aspiring Data Scientist
Date: Jan. 23, 2016

A Text Prediction 'Shiny' Application (in R), Inspired by SwiftKey

App Overview

My Shiny app uses standard NLP (natural language processing) methods to predict the user's next word from up to the last 4 words of their input phrase. Applications like this one save time for users of products with text-entry interfaces, and can also help organizations fill in missing data in related datasets.

This presentation describes the following…

  • The Prediction Algorithm's Background
  • How the Shiny App Works (workflow: front and back-end)

Algorithm Background

  • After 'cleaning' a corpus of millions of phrases sampled from Twitter, blogs, and news sites, I created a term matrix recording the corpus's n-gram frequencies.
  • With a respectable sample size in hand, I removed all n-grams that occurred only once, then built models based on the resulting Markov chains (essentially, conditional probability models in which the next word depends only on the previous N-1 words).
    • For example: A bigram model approximates P(Wi | W1..Wi-1) ≈ P(Wi | Wi-1), while a trigram model approximates P(Wi | W1..Wi-1) ≈ P(Wi | Wi-2, Wi-1), where 'W' = word.
    • Naturally, some accuracy was lost by removing single-instance n-grams, but doing so significantly improved loading and processing speeds.
  • Overall, my algorithmic methods were probably very standard, but I did manage to get a little creative with the Shiny app itself.
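
The counting, pruning, and conditional-probability steps above can be sketched in base R roughly as follows. This is a minimal illustration under stated assumptions, not the app's actual code: the toy corpus and the names `toy_corpus`, `count_ngrams`, and `cond_prob` are hypothetical, and the probability is a plain maximum-likelihood estimate (count of the n-gram divided by the count of all retained n-grams sharing its context).

```r
# Hypothetical toy corpus standing in for the cleaned Twitter/blog/news sample.
toy_corpus <- c("i love data science",
                "i love data",
                "i love r",
                "you love data science")

# Count every n-gram of size n; returns a named integer vector of frequencies.
count_ngrams <- function(phrases, n) {
  grams <- unlist(lapply(strsplit(phrases, " ", fixed = TRUE), function(words) {
    if (length(words) < n) return(character(0))
    starts <- seq_len(length(words) - n + 1)
    vapply(starts,
           function(i) paste(words[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  tab <- table(grams)
  setNames(as.integer(tab), names(tab))
}

# Maximum-likelihood estimate of P(word | context) from n-gram counts.
# `context` holds the first n-1 words of the n-gram, separated by spaces.
cond_prob <- function(counts, context, word) {
  same_context <- counts[startsWith(names(counts), paste0(context, " "))]
  target <- paste(context, word)
  if (!(target %in% names(same_context))) return(0)
  unname(same_context[target] / sum(same_context))
}

all_bigrams    <- count_ngrams(toy_corpus, 2)
pruned_bigrams <- all_bigrams[all_bigrams > 1]  # drop n-grams seen only once

p_full   <- cond_prob(all_bigrams, "love", "data")     # estimate before pruning
p_pruned <- cond_prob(pruned_bigrams, "love", "data")  # estimate after pruning
```

In this toy case, pruning removes the single-instance bigram "love r", so the estimate for "data" following "love" shifts upward. That is a small-scale version of the accuracy-for-speed trade-off described above.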

How the App Works

  • First, the user enters a word or phrase and presses 'predict'; ShinyJS then hides the div while the app computes the prediction.
  • Next, the app identifies the last 1 to 4 words in that phrase using strsplit (a base R function, which helps with load time), then turns to the matching model to find the word with the highest conditional probability of coming next at the target n-gram size (e.g., if only 3 words are available, the app looks for the 4-gram that starts with those 3 words and has the highest probability).
    • Please note: The app's predictive capability is capped at 5-grams.
  • The result is stored and presented as a UI output within a div that ShinyJS toggles to visible, and the user then has the option to try again.
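
The back-end lookup described above might look something like the following base-R sketch. The model layout (a list of data frames keyed by n-gram size, with `context`, `word`, and `prob` columns), the toy probabilities, and the function names `last_words` and `predict_next` are all assumptions made for illustration; the app's real data structures may differ, and the Shiny/ShinyJS front end is left out.

```r
# Hypothetical models: one lookup table per n-gram size (toy values only).
models <- list(
  "2" = data.frame(context = c("love", "love"),
                   word    = c("data", "r"),
                   prob    = c(0.75, 0.25),
                   stringsAsFactors = FALSE),
  "3" = data.frame(context = c("i love", "i love"),
                   word    = c("data", "r"),
                   prob    = c(0.6, 0.4),
                   stringsAsFactors = FALSE)
)

# Grab up to the last `max_words` words of the phrase with base strsplit.
last_words <- function(phrase, max_words = 4) {
  words <- strsplit(trimws(phrase), "\\s+")[[1]]
  words <- words[nzchar(words)]
  tail(words, max_words)
}

# Look up the most probable next word at the target n-gram size.
# Returns NA when no model or no matching context is available.
predict_next <- function(phrase, models, max_words = 4) {
  context_words <- last_words(phrase, max_words)
  if (length(context_words) == 0) return(NA_character_)
  n <- length(context_words) + 1          # e.g. 3 context words -> 4-gram model
  model <- models[[as.character(n)]]
  if (is.null(model)) return(NA_character_)
  hits <- model[model$context == paste(context_words, collapse = " "), ]
  if (nrow(hits) == 0) return(NA_character_)
  hits$word[which.max(hits$prob)]
}

prediction <- predict_next("i love", models)
```

With `max_words = 4`, the largest model this sketch would ever consult is the 5-gram one, which matches the cap noted above. In the app, the returned value would be what gets rendered in the output div.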

Thank You

I'd like to take this time to thank all of my classmates in the Johns Hopkins Data Science Specialization for a fun year of challenging projects and interesting conversations.

Good luck in your future endeavors!