Capstone Project

Jeremy Fraenkel
10/22/2018

Project overview

This is a short presentation for the final capstone project of the JHU Data Science course. More information is available here: https://www.coursera.org/learn/data-science-project/home/welcome

  • The goal of the project was to build a Shiny app that predicts the next word in a sentence
  • To predict the next word, the algorithm analyzes word sequences from 3 English-language sources: news articles, blogs and Twitter. Because the full dataset is large, only a small portion of it was used for this project
  • The 3 predicted words are the candidate words with the highest n-gram probabilities
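
The idea of picking the 3 words with the highest n-gram probabilities can be sketched as follows. This is a minimal illustration in Python using plain relative frequencies on a toy corpus, not the project's actual code or data; the function names `build_ngram_counts` and `predict_next` are hypothetical, and no smoothing is applied here.

```python
from collections import Counter, defaultdict

def build_ngram_counts(tokens, n):
    """Map each (n-1)-word context to a Counter of the words that follow it (n >= 2)."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        counts[context][tokens[i + n - 1]] += 1
    return counts

def predict_next(tokens_so_far, counts, n, k=3):
    """Return up to k (word, probability) pairs for the last n-1 input words."""
    context = tuple(tokens_so_far[-(n - 1):])
    followers = counts.get(context)
    if not followers:
        return []  # unseen context; a fuller model would back off to a shorter n-gram
    total = sum(followers.values())
    return [(word, count / total) for word, count in followers.most_common(k)]

# Toy corpus (an assumption for illustration only).
sentences = (["i like tea"] * 4 + ["i like coffee"] * 3
             + ["i like juice"] * 2 + ["i like milk"])
tokens = " . ".join(sentences).split()

trigram_counts = build_ngram_counts(tokens, 3)
top3 = predict_next("i like".split(), trigram_counts, 3)
print(top3)
```

With this toy corpus, the context "i like" has four possible continuations, and only the three most frequent ones are returned.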

Shiny Application

[Image: the Shiny application]

  • On the left side you can input words
  • On the right side you get the 3 predicted words as well as a word cloud of other candidate words

Project Methodology

  • The model was built using unigrams, bigrams, trigrams and quadgrams
  • The algorithm applied was Kneser-Ney smoothing (see page 53, https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf)
  • The accuracy of the model was assessed with several methods, including perplexity
  • Note that the capabilities of the model are limited by the small amount of data used in the analysis
  • Further improvements could include: refining the model, increasing the amount of data used, having the model learn new words whenever a user enters one, and suggesting spelling corrections
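
To make the methodology more concrete, here is a minimal sketch of interpolated Kneser-Ney smoothing together with a perplexity calculation. It is written in Python and restricted to bigrams for brevity, whereas the project's model goes up to quadgrams; the discount of 0.75, the toy corpus, and the names `train_kn_bigram` and `perplexity` are assumptions for illustration, not the project's implementation. The sketch also assumes every evaluated word was seen in training, since it has no unknown-word handling.

```python
import math
from collections import Counter, defaultdict

def train_kn_bigram(tokens, discount=0.75):
    """Return prob(w, v): an interpolated Kneser-Ney estimate of P(w | v)."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    context_totals = Counter()       # c(v): how often v starts a bigram
    followers = defaultdict(set)     # distinct words seen after v
    predecessors = defaultdict(set)  # distinct words seen before w
    for (v, w), count in bigrams.items():
        context_totals[v] += count
        followers[v].add(w)
        predecessors[w].add(v)
    n_bigram_types = len(bigrams)

    def prob(w, v):
        # Continuation probability: in how many distinct contexts does w appear?
        p_cont = len(predecessors.get(w, ())) / n_bigram_types
        total = context_totals[v]
        if total == 0:
            return p_cont  # unseen context: fall back to the continuation probability
        discounted = max(bigrams[(v, w)] - discount, 0) / total
        # Weight given to the lower-order model: the mass removed by discounting.
        lam = discount * len(followers.get(v, ())) / total
        return discounted + lam * p_cont

    return prob

def perplexity(tokens, prob):
    """Perplexity over the bigrams of tokens; lower means a better fit."""
    pairs = list(zip(tokens, tokens[1:]))
    log_sum = sum(math.log(prob(w, v)) for v, w in pairs)
    return math.exp(-log_sum / len(pairs))

# Toy data (an assumption for illustration only).
train = "the cat sat on the mat . the dog sat on the rug .".split()
held_out = "the dog sat on the mat .".split()

kn_prob = train_kn_bigram(train)
print(perplexity(held_out, kn_prob))
```

The discounting step takes a fixed amount of probability away from each observed bigram and redistributes it according to how many distinct contexts a word appears in, so the probabilities for a given context still sum to 1.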

References