Data Science Capstone Project - Natural Language Processing

Maurizio M. Murino
13/05/2016

The Objective

The main goal is to build a Shiny application that predicts the next word given a preceding one.

This exercise covered:

data cleaning;
exploratory analysis;
predictive modelling;
model optimization;
web application development.

Issues

Such a task is too computationally demanding for a PC such as the one I use. The first tests were performed on a 1% sample of the data. Later, the sample was increased to 5%.
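As an illustration of the sampling step, here is a minimal sketch in Python. It is not the project's actual code: the function name `sample_lines`, the fixed seed and the toy corpus are all assumptions made for the example. It keeps each line of a corpus independently with a given probability, which yields roughly a 1% or 5% sample.

```python
import random

def sample_lines(lines, fraction, seed=42):
    """Keep each line independently with probability `fraction`.

    `sample_lines` and `seed` are hypothetical names used for illustration.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return [line for line in lines if rng.random() < fraction]

# Toy corpus standing in for the real text data (assumption).
corpus = [f"line {i}" for i in range(10_000)]

small = sample_lines(corpus, 0.01)   # roughly 1% of the lines
larger = sample_lines(corpus, 0.05)  # roughly 5% of the lines
```

Because both calls use the same seed, every line in the 1% sample also appears in the 5% sample, so results obtained on the smaller sample remain comparable after the increase.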

The prediction works on an n-gram association rule: it checks the n-grams associated with the chosen term in descending order, from the largest to the smallest.
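The descending lookup can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the app's actual implementation: the function names, the toy sentences and the maximum n-gram size of 3 are all hypothetical. It counts which word follows each context, then tries the longest context first and falls back to shorter ones, returning the candidates with their relative frequencies.

```python
from collections import Counter, defaultdict

def build_ngram_tables(sentences, max_n=3):
    """Map each context (a tuple of n-1 words) to a Counter of next words."""
    tables = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for n in range(2, max_n + 1):
            for i in range(len(words) - n + 1):
                context = tuple(words[i:i + n - 1])
                tables[context][words[i + n - 1]] += 1
    return tables

def predict_next(tables, text, max_n=3, top=3):
    """Try the longest available context first, then back off to shorter ones."""
    words = text.lower().split()
    for k in range(min(max_n - 1, len(words)), 0, -1):
        context = tuple(words[-k:])
        counts = tables.get(context)
        if counts:
            total = sum(counts.values())
            # Relative frequency of each candidate within this context.
            return [(w, c / total) for w, c in counts.most_common(top)]
    return []  # no n-gram matches the input at any size

# Toy training data (assumption); the real app uses a sample of a large corpus.
sentences = [
    "thank you very much",
    "thank you for the help",
    "see you very soon",
]
tables = build_ngram_tables(sentences)

print(predict_next(tables, "see you"))    # matched on the two-word context
print(predict_next(tables, "hello you"))  # backs off to the one-word context
```

In this sketch an unseen two-word context such as "hello you" falls back to the single word "you", and an input that matches nothing at any size returns an empty list.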

The app

Simply enter a word of your choice and push the button. Because of the small sample, some words may be too rare to produce a match. If this occurs, add a second word to the rare one. The app also produces a probability table with the most likely following words.

References

The app, the data, the presentation and the development tests are hosted on GitHub: https://github.com/Maurizio-Mario/CP_Natural_Language.git

R predictions: https://www.youtube.com/watch?v=0le0ijNVP5M