Data Science Capstone Project

A. Stefan
April 24, 2016

Introduction

  • The purpose of this project is to develop a prediction algorithm based on a data set from a corpus called HC Corpora (www.corpora.heliohost.org).
  • The model must (1) accept a phrase as its input and (2) return a prediction for the most likely next word.
  • The application applies principles of Natural Language Processing and Text Mining.

Approach

  • The model is built using 1% of the original data set; the available RAM limited the size of the sample.
  • N-grams were constructed, with n = 1, 2, 3. The small sample size did not justify the construction of n-grams of higher order
  • The entries (word combinations) in each of the n-grams are assigned probabilities (\( w_i \) = i-th word)
  • Trigram: \( P(w_{i}|w_{i-2}w_{i-1}) = \frac{count(w_{i-2}w_{i-1}w_i)}{count(w_{i-2}w_{i-1})} \)
  • Bigram: \( P(w_{i}|w_{i-1}) = \frac{count(w_{i-1}w_i)}{count(w_{i-1})} \)
  • Unigram: \( P(w_{i}) = \frac{count(w_i)}{corpus\ size} \)
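
The three formulas above can be illustrated with a minimal Python sketch. This is not the project's code: the toy corpus and the helper names (`ngram_counts`, `p_unigram`, `p_bigram`, `p_trigram`) are assumptions made for illustration, and an unseen context is assumed to yield probability 0.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Toy corpus, invented for illustration only
tokens = "the cat sat on the mat the cat ran".split()

unigrams = ngram_counts(tokens, 1)
bigrams = ngram_counts(tokens, 2)
trigrams = ngram_counts(tokens, 3)

def p_unigram(w):
    # count(w_i) / corpus size
    return unigrams[(w,)] / len(tokens)

def p_bigram(w, prev):
    # count(w_{i-1} w_i) / count(w_{i-1}); 0.0 if the context was never seen
    denom = unigrams[(prev,)]
    return bigrams[(prev, w)] / denom if denom else 0.0

def p_trigram(w, prev2, prev1):
    # count(w_{i-2} w_{i-1} w_i) / count(w_{i-2} w_{i-1}); 0.0 if unseen context
    denom = bigrams[(prev2, prev1)]
    return trigrams[(prev2, prev1, w)] / denom if denom else 0.0

print(p_unigram("the"))                # 3 of 9 tokens
print(p_bigram("cat", "the"))          # "the cat" occurs 2 times, "the" 3 times
print(p_trigram("sat", "the", "cat"))  # "the cat sat" once, "the cat" twice
```

In this toy corpus, "the cat" is followed once by "sat" and once by "ran", so each continuation receives a trigram probability of 0.5.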

Approach (cont'd)

  • When given a phrase as input, the last two words are selected and matches are sought first in the trigram
  • If only one word is given, then the bigram is used
  • If the first step returns no result (the two-word sequence is not found in the trigram, or the single word entered by the user is not found in the bigram), a simple "stupid backoff" approach is applied. If the trigram search returns NA, the last word of the input phrase is looked up in the bigram; if that search also returns NA, the unigram probability is used.
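
The lookup order described above might be sketched in Python as follows. This is a hedged illustration, not the application's implementation: `build_tables`, `predict_next`, and the toy corpus are hypothetical, and `None` stands in for NA. Because only the top word is returned, raw counts are compared directly; within a single context they rank words the same way as the probabilities do.

```python
from collections import Counter, defaultdict

def build_tables(tokens):
    """For n = 1, 2, 3, map each context (the n-1 preceding words) to next-word counts."""
    tables = {n: defaultdict(Counter) for n in (1, 2, 3)}
    for n in (1, 2, 3):
        for i in range(len(tokens) - n + 1):
            *context, word = tokens[i:i + n]
            tables[n][tuple(context)][word] += 1
    return tables

def predict_next(phrase, tables):
    """Return the most frequent next word, backing off trigram -> bigram -> unigram."""
    words = phrase.lower().split()
    for n in (3, 2, 1):
        if len(words) < n - 1:
            continue  # too few input words for this n-gram order
        # Last n-1 words of the input; the empty tuple for the unigram table
        context = tuple(words[len(words) - (n - 1):])
        candidates = tables[n].get(context)
        if candidates:
            return candidates.most_common(1)[0][0]
    return None  # only reached when the tables are empty

# Toy corpus, invented for illustration only
tables = build_tables("the cat sat on the mat the cat ran".split())

print(predict_next("on the", tables))      # found in the trigram table
print(predict_next("the", tables))         # single word: bigram table
print(predict_next("purple dog", tables))  # unseen words: falls back to unigrams
```

With this toy corpus, the three calls exercise the trigram, bigram, and unigram levels in turn.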

Description of the Application

[Figure: screenshot of the application]

  • The user types a phrase in the Input box at the left and the predicted word is shown in the Output box.
  • An example of a prediction is shown in the figure.

Comments and Additional Information