Data Science Capstone: Natural Language Processing

Christopher Papanicolas

July 29, 2017

Introduction

Johns Hopkins, along with SwiftKey, sponsored a project in which students analyze three corpora, explore the data with various data analysis tools, and build an app that predicts the next word for a given sentence, phrase, or single word.

The Web application is simple to use and very cool! (https://cpapanicolas.shinyapps.io/prediction/)

Delete the word "prediction" from the input box. (An error will appear; ignore it.)
Add a sentence, phrase, or word to the box and a prediction will be returned.
The word predicted is based on frequency in the index we built for our model.

A sample of US blogs, US news, and Twitter text was extracted and cleaned (with tm_map from the tm package).
Exploratory analysis was done on the text.
The corpus was cleaned, tokenized, and indexed into dataframes, with the entries listed by their respective probabilities in descending order.
Each dataframe represented either unigrams, bigrams, trigrams, or quadgrams.
The dataframes for each n-gram order were combined into one dataframe.
The Katz Backoff Model and the Markov Chain model were used to create a function that predicts the next word using the combined n-gram dataframe.
The prediction was based on the frequency of the next word in the given n-gram dataframe.
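
The steps above can be sketched in a few lines of code. This is an illustration only, written in Python rather than the R used for the project, and it is not the author's implementation. It assumes a very simple cleaning rule (lowercase, letters and apostrophes only), stores counts in a dictionary instead of a dataframe, and uses a plain frequency back-off (try the longest matching context, then shorter ones) without the discounting that a full Katz back-off model applies. The names clean, build_ngram_index, and predict_next, and the sample sentences, are hypothetical.

```python
import re
from collections import Counter, defaultdict


def clean(text):
    """Lowercase the text, keep only letters and apostrophes, and split into tokens."""
    text = text.lower()
    text = re.sub(r"[^a-z' ]+", " ", text)
    return text.split()


def build_ngram_index(lines, max_n=4):
    """Map each context (tuple of 0 to max_n-1 words) to a Counter of following words."""
    index = defaultdict(Counter)
    for line in lines:
        tokens = clean(line)
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                *context, word = tokens[i:i + n]
                index[tuple(context)][word] += 1
    return index


def predict_next(index, phrase, max_n=4):
    """Return the most frequent next word, backing off from long contexts to short ones."""
    tokens = clean(phrase)
    longest = min(max_n - 1, len(tokens))
    for k in range(longest, -1, -1):
        # Last k tokens of the phrase; k == 0 gives the empty (unigram) context.
        context = tuple(tokens[len(tokens) - k:])
        followers = index.get(context)
        if followers:
            return followers.most_common(1)[0][0]
    return None  # Only reached when the index is empty.


# Hypothetical sample text standing in for the blogs/news/Twitter sample.
sample = [
    "I love data science",
    "I love data analysis",
    "I love data science projects",
    "We love pizza",
]
index = build_ngram_index(sample)

print(predict_next(index, "I love data"))  # science (matched on the full three-word context)
print(predict_next(index, "they love"))    # data (backs off to the one-word context "love")
```

Because the function walks from the longest context to the shortest, an unseen phrase still returns a prediction as long as the index contains at least one word; the empty context plays the role of the unigram table.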

Limitations

Future Steps