Data Science Capstone Project
A. Stefan
April 24, 2016
Introduction
- The purpose of this project is to develop a prediction algorithm based on a
data set from a corpus called HC Corpora (www.corpora.heliohost.org).
- The model must (1) accept a phrase as its input and (2) return a prediction for the most likely next word
- This application makes use of principles of Natural Language Processing and Text Mining
Approach
- The model was built using 1% of the original data set; the available RAM limited the size of the sample
- N-grams were constructed, with n = 1, 2, 3. The small sample size did not justify the construction of n-grams of higher order
- The entries (word combinations) in each of the n-grams were assigned probabilities ($ w_i $ = i-th word)
- Trigram \[
P(w_{i}|w_{i-2}w_{i-1}) = \frac{count(w_{i-2}w_{i-1}w_i)}{count(w_{i-2}w_{i-1})}
\]
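The trigram estimate above can be illustrated with a minimal sketch. It is written in Python for illustration only and is not the project's own code; the toy sentence and the helper names `ngrams` and `trigram_prob` are assumptions introduced here, and the sentence merely stands in for the 1% sample.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every contiguous n-word tuple in a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical toy corpus (assumption: stands in for the sampled text).
tokens = "i like green tea and i like green apples".split()

# Frequency tables for the bigrams and trigrams found in the corpus.
bigram_counts = Counter(ngrams(tokens, 2))
trigram_counts = Counter(ngrams(tokens, 3))

def trigram_prob(w1, w2, w3):
    """Estimate P(w3 | w1 w2) as count(w1 w2 w3) / count(w1 w2)."""
    context = bigram_counts[(w1, w2)]
    if context == 0:
        return 0.0  # unseen context: no estimate is available from counts alone
    return trigram_counts[(w1, w2, w3)] / context

print(trigram_prob("like", "green", "tea"))
```

In this toy corpus "like green" occurs twice and is followed by "tea" once, so the estimate is 0.5. An unseen context returns 0 here; how such cases should be handled (for example by falling back to a lower-order n-gram) is a separate design choice that this sketch does not address.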
Description of the Application