Capstone Project Next Word Prediction App

Ramiro Caro
October 7th, 2016


Summary

The objective of the Data Science Specialization Capstone Project is to integrate all the knowledge obtained in the previous courses.

In this presentation I introduce the “Word Prediction” app. It implements an NLP algorithm that estimates the most probable word to follow a given sentence by interpolating 2-, 3-, 4- and 5-gram probabilities.

The application is implemented with Shiny and published on shinyapps.io.

Algorithm Implementation

Before implementing the prediction algorithm, I carried out an exploratory analysis in which the data was cleaned and organized into a usable form.

The prediction approach is based on n-grams; I used bigrams through 5-grams. To build them I used the “quanteda” package, which makes this very simple.
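To make the n-gram step concrete, here is a minimal, language-neutral sketch in Python. It is not the quanteda code used in the app: the whitespace tokenizer and the function names `tokenize`, `ngrams` and `count_ngrams` are assumptions made for illustration only.

```python
from collections import Counter

def tokenize(text):
    # Assumption: lowercase and split on whitespace; the app's real cleaning is richer.
    return text.lower().split()

def ngrams(tokens, n):
    # Slide a window of width n over the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_ngrams(sentences, orders=(2, 3, 4, 5)):
    # One frequency table per n-gram order, from bigrams up to 5-grams.
    counts = {n: Counter() for n in orders}
    for sentence in sentences:
        tokens = tokenize(sentence)
        for n in orders:
            counts[n].update(ngrams(tokens, n))
    return counts
```

Calling `count_ngrams(["the cat sat on the mat", "the cat ran"])` returns one counter per order; sentences shorter than a given order simply contribute nothing to that table.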

I used interpolation to select the best prediction. I take the 3 most probable words from each n-gram order, multiply each by a factor, and then choose the word with the highest total score. That way every prediction uses all the information available.
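The scoring idea can be sketched as follows. This is an illustration in Python, not the app's actual code: the weights in `WEIGHTS` are hypothetical (the factors used in the app are not stated here), and I assume the quantity being weighted is the relative frequency of each candidate given its context.

```python
from collections import Counter, defaultdict

# Hypothetical factors, one per n-gram order; longer contexts get more weight.
WEIGHTS = {2: 0.1, 3: 0.2, 4: 0.3, 5: 0.4}

def build_tables(sentences, orders=(2, 3, 4, 5)):
    # Map order -> context (first n-1 words) -> counts of the word that follows.
    tables = {n: defaultdict(Counter) for n in orders}
    for sentence in sentences:
        tokens = sentence.lower().split()
        for n in orders:
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])
                tables[n][context][tokens[i + n - 1]] += 1
    return tables

def predict(tables, sentence, weights=WEIGHTS, top_k=3):
    tokens = sentence.lower().split()
    scores = Counter()
    for n, table in tables.items():
        if len(tokens) < n - 1:
            continue  # not enough words to form this order's context
        followers = table.get(tuple(tokens[-(n - 1):]))
        if not followers:
            continue  # context never seen at this order
        total = sum(followers.values())
        # Keep the top_k candidates of this order and add their weighted probability.
        for word, count in followers.most_common(top_k):
            scores[word] += weights[n] * count / total
    return scores.most_common(1)[0][0] if scores else None
```

With a toy corpus such as `["i want to eat pizza", "i want to eat pasta", "i want to eat pizza now", "we like to sleep"]`, the input "like to" shows why the orders are combined: the bigram table favours "eat", but the trigram context "like to" has only ever been followed by "sleep", and its larger weight decides the result.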

Utilization

The interface is quite minimal. To get a prediction, first type a sentence into the text box and then press the “Predict” button. The predicted word appears lower on the screen. Done!

Additional Resources

To complete the project I had to review a lot of material on NLP and text processing. Here is a list of the packages and material that I found most helpful.

Packages:

quanteda: Corpus processing and n-gram generation.
dplyr / tidyr: Awesome data manipulation.
stringi: Very fast string processing and parsing.

Courses:

Natural Language Processing: Awesome NLP course by Dan Jurafsky & Chris Manning, from Stanford.