A model and website for Natural Language Prediction.
Created for the Data Science Capstone project
run by Coursera and Johns Hopkins University.
Chris Lill
21 January 2016
The objective is to use a collection of 4 million tweets, blogs and news articles to predict the next word that a user will type.
The text was simplified, and every sequence of 2, 3 or 4 consecutive words (bigrams, trigrams and quadgrams) was counted. For each phrase, the probability of each of the top 5 answers was calculated from these counts.
| Model | word-3 | word-2 | word-1 | answer | probability |
|---|---|---|---|---|---|
| Bigram | | | beautiful | day | 9% |
| Trigram | | a | beautiful | day | 22% |
| Quadgram | what | a | beautiful | day | 44% |
| Interpolated | what | a | beautiful | day | 34% |
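The counting step described above can be sketched roughly as follows. This is a minimal illustration in Python, not the project's actual implementation: the function names `ngram_counts` and `top_answers` and the toy corpus are assumptions made for the example, and the probability is taken to be the simple relative frequency of each answer given the preceding words.

```python
from collections import Counter, defaultdict

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens (hypothetical helper)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def top_answers(counts, k=5):
    """Map each context (all words but the last) to its k most frequent
    next words, with probability = count / total count for that context."""
    by_context = defaultdict(Counter)
    for ngram, count in counts.items():
        by_context[ngram[:-1]][ngram[-1]] += count
    table = {}
    for context, answers in by_context.items():
        total = sum(answers.values())
        table[context] = [(word, c / total) for word, c in answers.most_common(k)]
    return table

# Toy corpus, made up for illustration only.
tokens = "what a beautiful day what a beautiful view what a day".split()
bigram_counts = ngram_counts(tokens, 2)
trigram_table = top_answers(ngram_counts(tokens, 3))
print(trigram_table[("what", "a")])
```

In this toy corpus, "what a" is followed by "beautiful" twice and "day" once, so "beautiful" is ranked first for that context.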
Optimization of the model to fit the validation set gave:
\[ P_{\text{interpolated}} = 0.8 \, P_{\text{quadgram}} + 0.16 \, P_{\text{trigram}} + 0.04 \, P_{\text{bigram}} \]
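As a small sketch of how the weighted sum is applied, the snippet below combines three per-model probabilities using the weights from the formula. The function name `interpolate` and the input probabilities are hypothetical and chosen only to make the arithmetic easy to follow.

```python
# Weights from the interpolation formula; they sum to 1.
WEIGHTS = {"quadgram": 0.8, "trigram": 0.16, "bigram": 0.04}

def interpolate(p_quadgram, p_trigram, p_bigram):
    """Linear interpolation of the three n-gram probabilities."""
    return (WEIGHTS["quadgram"] * p_quadgram
            + WEIGHTS["trigram"] * p_trigram
            + WEIGHTS["bigram"] * p_bigram)

# Hypothetical probabilities for one candidate word, not taken from the data.
p = interpolate(p_quadgram=0.5, p_trigram=0.25, p_bigram=0.1)
print(round(p, 3))  # 0.8*0.5 + 0.16*0.25 + 0.04*0.1 = 0.444
```

Because the weights sum to 1, the interpolated value stays a valid probability whenever the three inputs are.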
Ideas for improving the accuracy of the model include:
Introducing a <BOS> (beginning-of-sentence) tag into the model to improve the prediction of the first three words.
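One possible way to use such a tag is sketched below. The source does not say how the tag would be applied; the sketch assumes the common convention of padding each sentence with n-1 tags so that even the first word has a full-length context. The helper names `pad_sentence` and `contexts` are hypothetical.

```python
BOS = "<BOS>"

def pad_sentence(tokens, n=4):
    """Prepend n-1 BOS tags (assumed convention) to a tokenized sentence."""
    return [BOS] * (n - 1) + list(tokens)

def contexts(tokens, n=4):
    """Return (context, answer) pairs, one for every word in the sentence."""
    padded = pad_sentence(tokens, n)
    return [(tuple(padded[i:i + n - 1]), padded[i + n - 1])
            for i in range(len(tokens))]

pairs = contexts("what a beautiful day".split())
for context, answer in pairs:
    print(context, "->", answer)
```

With this padding, the first three words of a sentence each get a three-token context that records how close they are to the start of the sentence, rather than falling back to a shorter n-gram.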