My Capstone Project

Fernando Crema
April 2015

The Problem

The goal of this project is to complete sentences by predicting the next word from the words that precede it.

  1. We need an incomplete sentence as input.
  2. Where do we learn?

    • Input from Twitter, Blogs and News provided by the instructors.
  3. Used library(e1071) for the Naive Bayes classifier.

  4. Natural Language Processing:

    • Used the tm library for corpus handling and text cleaning.
    • Used RWeka to tokenize the input and generate n-grams (see the sketch after this list).
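
A minimal sketch of this preprocessing step is shown below. It assumes the three corpus files sit in the working directory; the file names, the 1% sample size, and the cleaning steps are illustrative assumptions, not details taken from the slides.

    library(tm)
    library(RWeka)

    set.seed(2015)

    # Read a small random sample of each file; the full corpus is too large.
    read_sample <- function(path, fraction = 0.01) {
      lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
      sample(lines, size = floor(length(lines) * fraction))
    }

    corpus_text <- c(read_sample("en_US.twitter.txt"),
                     read_sample("en_US.blogs.txt"),
                     read_sample("en_US.news.txt"))

    # Basic cleaning with tm, then 2-gram tokenization with RWeka.
    corpus <- VCorpus(VectorSource(corpus_text))
    corpus <- tm_map(corpus, content_transformer(tolower))
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, removeNumbers)
    corpus <- tm_map(corpus, stripWhitespace)

    bigram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
    bigrams <- unlist(lapply(corpus, function(doc) bigram_tokenizer(content(doc))))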

The Algorithm: Naive Bayes Classifier

We need to calculate the probability of the n-th word given the n-1 previous words, so from the corpus we build a set of n-grams; the 2-gram case conditions on just the single previous word.

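As a reference, the 2-gram case can be written as the standard maximum-likelihood estimate below; the formula is stated here for clarity and is not copied from the original slides.

    P(w_n \mid w_{n-1}) = \frac{\mathrm{count}(w_{n-1}\, w_n)}{\mathrm{count}(w_{n-1})}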

  1. Once we have the n-grams, we build a table of frequencies.
  2. Using the frequencies, we can obtain all the conditional probabilities we need.
  3. Among all candidate words, we choose the one with the maximum conditional probability given the preceding words (see the sketch after this list).
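
The sketch below shows how the frequency table, the conditional probabilities and the argmax step could look in R. The bigrams vector is assumed to come from the earlier preprocessing sketch, and the helper name top_candidates is hypothetical.

    # Split each bigram "w1 w2" into its two words.
    parts <- strsplit(bigrams, " ", fixed = TRUE)
    bigram_df <- data.frame(
      w1 = vapply(parts, `[`, character(1), 1),
      w2 = vapply(parts, `[`, character(1), 2),
      stringsAsFactors = FALSE
    )

    # Frequency table of (w1, w2) pairs and of w1 alone.
    pair_counts <- aggregate(list(n = rep(1, nrow(bigram_df))),
                             by = bigram_df[c("w1", "w2")], FUN = sum)
    w1_counts <- aggregate(list(total = pair_counts$n),
                           by = pair_counts["w1"], FUN = sum)

    # Conditional probability P(w2 | w1) = count(w1 w2) / count(w1).
    probs <- merge(pair_counts, w1_counts, by = "w1")
    probs$p <- probs$n / probs$total

    # For a given previous word, pick the candidates with the highest probability.
    top_candidates <- function(prev, k = 5) {
      rows <- probs[probs$w1 == prev, ]
      head(rows[order(-rows$p), c("w2", "p")], k)
    }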


The Flow


  1. We use data from Twitter, Blogs and News.
  2. We sample the data because the full corpus is too large.
  3. We apply Natural Language Processing techniques such as n-gram tokenization.
  4. We run Naive Bayes classification using the n-grams as input (the sketch below ties these steps together).
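
The whole flow can be wrapped in one prediction function. This is a sketch under the assumption that probs and top_candidates exist as in the previous sketch; predict_next_word is a hypothetical helper, not the function used in the project.

    # Clean the input sentence the same way as the corpus, take its last
    # word, and look up the most likely next words.
    predict_next_word <- function(sentence, k = 5) {
      cleaned <- tolower(gsub("[[:punct:][:digit:]]", "", sentence))
      words <- strsplit(trimws(cleaned), "\\s+")[[1]]
      if (length(words) == 0) return(NULL)
      top_candidates(words[length(words)], k)
    }

    predict_next_word("I would like to")   # top 5 candidate next words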

The APP

http://belgrades.shinyapps.io/data_product/

  1. Basic input of a sentence.
  2. A Submit button to run the prediction.
  3. On the left, an explanation of the problem and the algorithm.
  4. An image of the flow used to build the whole application.
  5. Output: the top 5 probabilities given the input (a minimal interface sketch follows this list).
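
A minimal Shiny sketch of how such an interface could be wired up is shown below; the layout and the predict_next_word helper are assumptions for illustration, not the code behind the deployed app.

    library(shiny)

    ui <- fluidPage(
      sidebarLayout(
        sidebarPanel(
          helpText("Type an incomplete sentence and press Submit to predict the next word.")
        ),
        mainPanel(
          textInput("sentence", "Sentence:"),
          actionButton("submit", "Submit"),
          tableOutput("top5")
        )
      )
    )

    server <- function(input, output) {
      # Recompute the top 5 candidates only when the Submit button is pressed.
      output$top5 <- renderTable({
        input$submit
        isolate(predict_next_word(input$sentence, k = 5))
      })
    }

    shinyApp(ui = ui, server = server)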
