Data Science Capstone Project

A. Stefan
April 24, 2016

Introduction

  • The purpose of this project is to develop a prediction algorithm based on a data set from a corpus called HC Corpora (www.corpora.heliohost.org).
  • The model must (1) accept a phrase as its input and (2) return a prediction for the most likely next word
  • This application makes use of principles of Natural Language Processing and Text Mining

Approach

  • The model was built using a 1% sample of the original data set; the available RAM limited the sample size
  • N-grams were constructed, with n = 1, 2, 3. The small sample size did not justify the construction of n-grams of higher order
  • Each entry (word combination) in the n-gram tables was assigned a probability, where $w_i$ denotes the i-th word
  • Trigram: \[ P(w_i \mid w_{i-2} w_{i-1}) = \frac{count(w_{i-2} w_{i-1} w_i)}{count(w_{i-2} w_{i-1})} \]
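
The count-based trigram estimate can be illustrated with a minimal Python sketch. This is not the project's actual code: the helper names (`ngram_counts`, `trigram_probability`, `predict_next`) and the toy token list are assumptions made for illustration, and the sketch uses plain relative frequencies with no smoothing or back-off.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens (hypothetical helper)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def trigram_probability(tokens, w1, w2, w3):
    """Estimate P(w3 | w1 w2) as count(w1 w2 w3) / count(w1 w2)."""
    trigrams = ngram_counts(tokens, 3)
    bigrams = ngram_counts(tokens, 2)
    context = bigrams[(w1, w2)]
    if context == 0:
        # Unseen context: no estimate is available without back-off.
        return 0.0
    return trigrams[(w1, w2, w3)] / context


def predict_next(tokens, w1, w2):
    """Return the most frequent word following (w1, w2), or None if unseen."""
    trigrams = ngram_counts(tokens, 3)
    candidates = {t[2]: c for t, c in trigrams.items() if t[:2] == (w1, w2)}
    return max(candidates, key=candidates.get) if candidates else None


# Toy corpus, invented for this example only.
tokens = ("i love data science and i love data mining "
          "and i love data science").split()

print(trigram_probability(tokens, "love", "data", "science"))
print(predict_next(tokens, "love", "data"))
```

In the toy corpus the context "love data" occurs three times and is followed by "science" twice, so the estimate is 2/3 and "science" is the predicted next word. A context that never occurs yields no prediction here; a fuller model would fall back to the bigram or unigram tables in that case.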

Approach (cont'd)

Description of the Application