Ramiro Caro
October 7th, 2016
The Data Science Specialization Capstone Project aims to integrate all the knowledge obtained in the previous courses.
In this presentation I introduce “The Word Prediction” app. The app implements an NLP algorithm that calculates the most probable word to follow a given sentence, applying interpolated 2-, 3-, 4- and 5-gram probabilities.
The application is implemented with Shiny and published on shinyapps.io.
Before implementing the prediction algorithm, I carried out an exploratory analysis, in which I cleaned and organized the data in a meaningful way.
The prediction approach is based on n-grams. In my case, I worked with bi-grams through 5-grams. To produce them I used the “quanteda” package, which makes this very simple.
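To make the n-gram step concrete, here is a minimal sketch of how bi-grams through 5-grams can be counted from a set of sentences. It is written in plain Python for illustration only and is not the quanteda code used in the app; the function names (`tokenize`, `ngrams`, `count_ngrams`) and the simple lowercase-and-split tokenizer are assumptions.

```python
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase and split on whitespace.
    # The real cleaning step in the app is likely more thorough.
    return text.lower().split()


def ngrams(tokens, n):
    # Slide a window of size n over the tokens.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(sentences, orders=(2, 3, 4, 5)):
    # One frequency table per n-gram order (bi-grams to 5-grams).
    counts = {n: Counter() for n in orders}
    for sentence in sentences:
        tokens = tokenize(sentence)
        for n in orders:
            counts[n].update(ngrams(tokens, n))
    return counts


if __name__ == "__main__":
    tables = count_ngrams(["I like green tea", "I like black tea"])
    print(tables[2].most_common(3))
```

A sentence shorter than n words simply contributes nothing to the table of order n, so short inputs are handled without special cases.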
I used interpolation to select the best prediction. I take the 3 most probable words from each n-gram order, multiply each one by a factor, and then choose the word with the highest total score. That way I use all the information available in each prediction.
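The scoring idea can be sketched as follows. This is an illustration in plain Python under stated assumptions, not the app's actual code: the function name `predict`, the candidate probabilities, and the weights are all hypothetical, since the factors used in the app are not given here. The sketch assumes the score of a word is the sum of its weighted probabilities across n-gram orders.

```python
from collections import defaultdict


def predict(candidates_by_order, weights, top_k=3):
    # candidates_by_order: {n: {word: probability}} for each n-gram order.
    # weights: {n: factor}; hypothetical values, not the ones used in the app.
    scores = defaultdict(float)
    for n, candidates in candidates_by_order.items():
        # Keep only the top_k most probable words for this order.
        top = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
        for word, prob in top:
            # Weight the probability and add it to the word's total score.
            scores[word] += weights[n] * prob
    if not scores:
        return None  # no candidates at any order
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # Made-up probabilities for the word following "i like green".
    candidates = {
        2: {"the": 0.5, "a": 0.3, "tea": 0.2},
        3: {"tea": 0.6, "the": 0.4},
    }
    weights = {2: 0.25, 3: 0.75}
    print(predict(candidates, weights))
```

In this made-up example "the" wins at the bi-gram level, but the higher weight on the tri-gram gives "tea" the highest total score, which shows how the longer context can override the shorter one.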
To complete the project I had to review a lot of material on NLP and text processing. Here is a list of packages and material that I found very helpful.
Packages:
Courses: