Capstone Project

Jeremy Fraenkel
10/22/2018

This is a short presentation of the final capstone project for the JHU Data Science course. More information is available here: https://www.coursera.org/learn/data-science-project/home/welcome

The goal of the project was to build a Shiny app that predicts the next word in a sentence.
To predict the next word, the algorithm analyzes word sequences from 3 English-language sources: news articles, blogs and Twitter. Because the full dataset is large, only a small portion of it was used for this project.
The app returns 3 predicted words: those with the highest n-gram probabilities.
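
To make the idea of "highest n-gram probabilities" concrete, here is a minimal Python sketch (the app itself is a Shiny app, so this is not its code). It counts n-grams up to length 4 on a toy corpus and returns the top 3 continuations for the longest context it has seen, falling back to shorter contexts otherwise. The function names, the toy corpus, and the simple back-off rule are assumptions made for illustration; the probabilities are raw relative frequencies, without smoothing.

```python
from collections import Counter, defaultdict


def build_ngram_counts(sentences, max_n=4):
    """Return counts[context_tuple][next_word] for n-grams of length 1..max_n."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                *context, word = tokens[i:i + n]
                counts[tuple(context)][word] += 1
    return counts


def predict_next(counts, text, k=3, max_n=4):
    """Return up to k (word, probability) pairs for the word following text."""
    tokens = text.lower().split()
    # Try the longest available context first, then back off to shorter ones.
    for size in range(min(max_n - 1, len(tokens)), -1, -1):
        context = tuple(tokens[len(tokens) - size:])
        if context in counts:
            total = sum(counts[context].values())
            return [(w, c / total) for w, c in counts[context].most_common(k)]
    return []


# Hypothetical toy corpus, standing in for the news, blog and Twitter samples.
corpus = [
    "i want to eat",
    "i want to sleep",
    "i want to eat pizza",
    "you want to go",
]
counts = build_ngram_counts(corpus)
print(predict_next(counts, "i want to"))
```

With an unseen context such as a single unknown word, the function falls all the way back to the empty context and returns the most frequent single words.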

[Image: the app]

On the left side, you enter words.
On the right side, you get 3 predicted words as well as a word cloud of other options.

The model was built using unigrams, bigrams, trigrams and quadgrams (4-grams).
The algorithm applied was Kneser-Ney smoothing (see page 53, https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf).
The accuracy of the model was tested with several methods, including perplexity.
Note that the model's capabilities are limited by the small amount of data used in the analysis.
Further improvements could include: refining the model, increasing the amount of data used, having the model learn new words whenever a user enters one, suggesting spelling corrections, etc.
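
As an illustration of the two techniques named above, the following Python sketch implements interpolated Kneser-Ney smoothing for bigrams only, plus a perplexity function. It is a simplified stand-in, not the project's model, which goes up to quadgrams. The class name, the `<s>` start token, the toy corpus and the discount of 0.75 are all assumptions. The probability of `w` after `v` is the discounted bigram frequency plus a back-off weight times the continuation probability of `w` (the share of distinct bigram types that end in `w`).

```python
import math
from collections import Counter, defaultdict


class KneserNeyBigram:
    """Interpolated Kneser-Ney for bigrams (simplified; discount is assumed)."""

    def __init__(self, sentences, discount=0.75):
        self.d = discount
        self.bigrams = Counter()
        for s in sentences:
            tokens = ["<s>"] + s.lower().split()
            self.bigrams.update(zip(tokens, tokens[1:]))
        self.context_total = Counter()       # how often v starts a bigram
        self.followers = defaultdict(set)    # distinct words seen after v
        self.preceders = defaultdict(set)    # distinct words seen before w
        for (v, w), c in self.bigrams.items():
            self.context_total[v] += c
            self.followers[v].add(w)
            self.preceders[w].add(v)
        self.n_bigram_types = len(self.bigrams)

    def prob(self, v, w):
        # Continuation probability: in how many distinct contexts w appears.
        p_cont = len(self.preceders.get(w, ())) / self.n_bigram_types
        total = self.context_total.get(v, 0)
        if total == 0:
            return p_cont  # unseen context: rely on the continuation term
        discounted = max(self.bigrams.get((v, w), 0) - self.d, 0) / total
        backoff_weight = self.d * len(self.followers[v]) / total
        return discounted + backoff_weight * p_cont


def perplexity(model, sentences):
    """exp of the average negative log-probability per predicted word."""
    log_sum, n = 0.0, 0
    for s in sentences:
        tokens = ["<s>"] + s.lower().split()
        for v, w in zip(tokens, tokens[1:]):
            p = model.prob(v, w)
            if p == 0.0:
                return math.inf  # a word never seen in training
            log_sum += math.log(p)
            n += 1
    if n == 0:
        raise ValueError("no words to evaluate")
    return math.exp(-log_sum / n)


# Hypothetical toy data; a real evaluation would use held-out text.
model = KneserNeyBigram(["i want to eat", "i want to sleep", "you want to go"])
print(model.prob("to", "eat"), perplexity(model, ["you want to sleep"]))
```

Lower perplexity means the model finds the test text less surprising. Because this sketch gives zero probability to words that never appeared in training, a practical version would also need an unknown-word token.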