Nicolais Guevara
April 23, 2015
What is Next Word Prediction?
Why do we need a prediction?
How?
Public dataset
1) Load the data
2) Clean the data (remove profanity)
3) Generate n-grams from the data (see the sketch below)
4) Implement the predictor model
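The following is a minimal sketch of steps 1 to 3, assuming plain-text corpus files. The file names, the profanity list, and the helper names are placeholders for illustration, not the deck's actual implementation.

```python
import re
from collections import Counter

# Placeholder inputs: the deck does not name the corpus files or the profanity list.
CORPUS_FILES = ["corpus_part1.txt", "corpus_part2.txt"]
PROFANITY = {"badword1", "badword2"}

def load_lines(paths):
    """Step 1: read every line of every corpus file."""
    lines = []
    for path in paths:
        with open(path, encoding="utf-8", errors="ignore") as f:
            lines.extend(f)
    return lines

def clean(line):
    """Step 2: lowercase, keep only letters/apostrophes, drop profanity tokens."""
    tokens = re.findall(r"[a-z']+", line.lower())
    return [t for t in tokens if t not in PROFANITY]

def ngram_counts(token_lists, n):
    """Step 3: count n-grams, stored as space-joined strings."""
    counts = Counter()
    for tokens in token_lists:
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts
```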
Most frequent words (stopwords included): the (368), you (250), for (158), and (157)
Most frequent words (stopwords removed): just (59), like (51), day (48), will (46)
Most frequent 2-grams (keeping stopwords): in the (44), for the (35)
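As an illustration of how frequency tables like the ones above can be produced from the n-gram counts in the previous sketch (the stopword list here is a tiny illustrative subset, not the set actually used):

```python
from collections import Counter

# Illustrative stopword subset; the deck does not list the stopword set it used.
STOPWORDS = {"the", "you", "for", "and", "a", "of", "to", "in", "is", "it"}

def top_unigrams(unigram_counts: Counter, k=4, drop_stopwords=False):
    """Most frequent single words, optionally excluding stopwords."""
    if drop_stopwords:
        unigram_counts = Counter({w: c for w, c in unigram_counts.items()
                                  if w not in STOPWORDS})
    return unigram_counts.most_common(k)

def top_bigrams(bigram_counts: Counter, k=2):
    """Most frequent 2-grams (stopwords kept)."""
    return bigram_counts.most_common(k)
```

For example, `top_unigrams(counts1, drop_stopwords=True)` would return a table in the same form as the "stopwords removed" row above.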
For this sample:
- Total number of distinct 1-grams: 3357
- Percentage of 1-grams with frequency 1 or 2: 82.84%
- Total number of distinct 2-grams: 9179
- Percentage of 2-grams with frequency 1 or 2: 96.39%
For our model we keep 1- to 4-grams with frequency greater than 2.
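A small sketch of both the frequency-1-or-2 statistic and the pruning step, assuming the Counter objects produced by the earlier sketch (function names are illustrative):

```python
from collections import Counter

def rare_share(counts: Counter) -> float:
    """Fraction of distinct n-grams that occur only once or twice."""
    rare = sum(1 for c in counts.values() if c <= 2)
    return rare / len(counts)

def prune(counts: Counter) -> dict:
    """Keep only the n-grams with frequency greater than 2, as the model does."""
    return {gram: c for gram, c in counts.items() if c > 2}
```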
Probability of a sentence (chain rule):
\[ p_1(I\,love\,the\,car) = p(I) p(love|I) p(the|I\,love) p(car|I\,love\,the) \\ p_2(I\,love\,the\,house) = p(I) p(love|I) p(the|I\,love) p(house|I\,love\,the) \]
For two sentences that are equal except for the last word, the first three factors are identical, so the comparison reduces to the last conditional term:
\[ p(car|I\,love\,the) = count(I\,love\,the\,car)/count(I\,love\,the) \] \[ p(house|I\,love\,the) = count(I\,love\,the\,house)/count(I\,love\,the) \]
The denominators are equal (both are \( count(I\,love\,the) \)), so ranking the two sentences reduces to comparing the numerators \( count(I\,love\,the\,car) \) and \( count(I\,love\,the\,house) \).
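Since every candidate next word shares the same denominator, ranking candidates only requires comparing the numerator counts. A minimal sketch of that comparison, assuming pruned 4-gram counts keyed by space-joined strings as in the sketches above (not the deck's actual code):

```python
def predict_next(four_gram_counts: dict, prefix: str, k: int = 3):
    """Rank candidate next words for a 3-word prefix.

    Every candidate shares the denominator count(prefix), so comparing
    count(prefix + candidate) is enough to order them.
    """
    prefix = " ".join(prefix.lower().split())
    candidates = {}
    for gram, count in four_gram_counts.items():
        head, _, last = gram.rpartition(" ")
        if head == prefix:
            candidates[last] = count
    return sorted(candidates, key=candidates.get, reverse=True)[:k]

# Example: predict_next(counts4, "I love the") ranks "car" above "house"
# whenever count("i love the car") > count("i love the house").
```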
User should provide: a phrase (the start of a sentence).
The Application will provide: the predicted next word.