Data Science Capstone

Zach Eisenstein
July 2018

Summary


  • The objective of this project was to develop a predictive model of English text, delivered via a Shiny app.
  • The model was built on a large corpus of text scraped from tweets, blogs, and news articles, providing a broad sample of English-language expression.
  • The predictive text app can be accessed here

Applied Methods

  • The model employed within the app is a probabilistic n-gram language model with backoff (a Markov model).
  • The next word is predicted by looking back at the preceding word groupings (n-grams).

Example
Input text: “I am going to the”
trigram = “going to the”, bigram = “to the”, unigram = “the”
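
A minimal R sketch of extracting these lookback contexts from the input (the function name and tokenization are illustrative, not the app's exact code):

get_contexts <- function(text) {
  words <- strsplit(tolower(trimws(text)), "\\s+")[[1]]   # simple whitespace tokenizer
  list(
    trigram = paste(tail(words, 3), collapse = " "),
    bigram  = paste(tail(words, 2), collapse = " "),
    unigram = tail(words, 1)
  )
}

get_contexts("I am going to the")
# $trigram "going to the", $bigram "to the", $unigram "the"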

n-gram next-word frequencies retrieved from the training set:

context        next word   freq
going to the   gym           41
going to the   game          19
going to the   movies        17
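
A minimal R sketch of the backoff lookup over such frequency tables (a simplified scheme; the app's exact scoring isn't shown here, and tri, bi, uni, and predict_next are illustrative names):

# tri and bi are data frames of (context, word, freq); uni is (word, freq)
predict_next <- function(tri, bi, uni, ctx) {
  hits <- subset(tri, context == ctx$trigram)                     # try the trigram first
  if (nrow(hits) == 0) hits <- subset(bi, context == ctx$bigram)  # back off to the bigram
  if (nrow(hits) == 0) hits <- uni                                # last resort: unigrams
  hits$score <- hits$freq / sum(hits$freq)                        # relative frequency
  head(hits[order(-hits$score), c("word", "score")], 3)           # top 3 candidates
}

With the trigram table above, this would rank gym (41/77), game (19/77), and movies (17/77).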

About the Application

The app has a simple interface suitable for mobile devices.

As the user types, the interface updates in real time, using the n-gram backoff algorithm to predict the next word.
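
A minimal sketch of that loop in Shiny, reusing the hypothetical get_contexts() and predict_next() sketches above and assuming the frequency tables are already loaded:

library(shiny)

ui <- fluidPage(
  textInput("phrase", "Type a phrase:"),
  tableOutput("preds")                      # top-3 words and their scores
)

server <- function(input, output) {
  output$preds <- renderTable({
    req(nzchar(input$phrase))               # wait until something is typed
    predict_next(tri, bi, uni, get_contexts(input$phrase))
  })
}

shinyApp(ui, server)

Because renderTable() depends on input$phrase, the prediction table refreshes on every keystroke.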

The top 3 choices and their respective “scores” are shown for reference.

The app recognizes sentence-ending punctuation and stops the lookback at that point.
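
One plausible way to implement that cutoff in R (the app's exact rule isn't shown; the function name is illustrative):

trim_at_sentence_end <- function(text) {
  sub(".*[.!?]\\s*", "", text)   # drop everything up to the last ., !, or ?
}

trim_at_sentence_end("That was fun. I am going to the")
# [1] "I am going to the"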

References

  • The predictive text app can be accessed here
  • The source data can be accessed from the following link
  • The ngram package was used in building this application
  • Click here to learn more about the Johns Hopkins Data Science Specialization from Coursera

Contact:
Zach Eisenstein
z.eisenstein2@gmail.com
LinkedIn