Predicting the next word

Marco M M
August 8, 2015

Data processing

This project used three text datasets, drawn from:

Twitter
Blogs
News

I sampled 3% of these three datasets to build matrices of trigram, bigram, and single-word counts. The packages used for this were tm, RWeka, and slam. The cleaning step removed stopwords because doing so improved the accuracy of the predictions (on the quizzes!).
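The original work was done in R with the packages named above. As a rough illustration only, the same sample, clean, and count steps can be sketched in Python. The stopword list, the function names (`clean`, `sample_lines`, `ngram_counts`), and the toy corpus are all hypothetical and are not taken from the project.

```python
import random
import re
from collections import Counter

# Hypothetical, very small stopword list (assumption); the project relied on
# the stopword handling of its R packages instead.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "is", "in", "you"}

def clean(line):
    """Lowercase the text, strip punctuation, and drop stopwords."""
    tokens = re.findall(r"[a-z']+", line.lower())
    return [t for t in tokens if t not in STOPWORDS]

def sample_lines(lines, fraction=0.03, seed=1):
    """Keep each line with probability `fraction` (3% in the text above)."""
    rng = random.Random(seed)
    return [ln for ln in lines if rng.random() < fraction]

def ngram_counts(lines, n):
    """Count n-grams (stored as tuples of words) over the cleaned lines."""
    counts = Counter()
    for line in lines:
        tokens = clean(line)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# Toy corpus (assumption), small enough to check by hand.
corpus = [
    "When you meet someone new, say hello.",
    "I meet someone new every day.",
]
trigrams = ngram_counts(corpus, 3)
bigrams = ngram_counts(corpus, 2)
unigrams = ngram_counts(corpus, 1)
```

On a real corpus, `sample_lines` would be applied to each file before counting; the toy corpus here is counted in full so that the tables are not empty.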

How to predict

When you enter a phrase, the app applies a prediction algorithm based on a 3-gram model:

First, the application cleans your input (removes stopwords and punctuation).
If the cleaned input has 2 or more words (not counting stopwords), the prediction algorithm looks for a match in the 3-grams.
If there is no 3-gram prediction, the algorithm continues looking in the 2-grams.
Finally, if the algorithm does not find the words in the 2-grams, it predicts the most frequent word(s) in the unigrams.
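The back-off steps above can be sketched as follows. This is a minimal Python illustration, not the app's R code. It assumes that the last two cleaned words form the 3-gram context and that the last word forms the 2-gram context. The stopword list, the `predict` function, and the toy count tables are hypothetical.

```python
import re
from collections import Counter

# Hypothetical stopword list (assumption), used only for this sketch.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "is", "in", "you"}

def clean(text):
    """Lowercase the text, strip punctuation, and drop stopwords."""
    return [t for t in re.findall(r"[a-z']+", text.lower())
            if t not in STOPWORDS]

def predict(text, trigrams, bigrams, unigrams, k=1):
    """Return (top-k predicted words, name of the table that was used)."""
    tokens = clean(text)
    # Step 1: with 2 or more words, try the 3-gram table first.
    if len(tokens) >= 2:
        context = tuple(tokens[-2:])
        hits = Counter({g[2]: c for g, c in trigrams.items()
                        if g[:2] == context})
        if hits:
            return [w for w, _ in hits.most_common(k)], "3-gram"
    # Step 2: back off to the 2-gram table using the last word only.
    if tokens:
        hits = Counter({g[1]: c for g, c in bigrams.items()
                        if g[0] == tokens[-1]})
        if hits:
            return [w for w, _ in hits.most_common(k)], "2-gram"
    # Step 3: fall back to the most frequent unigram(s).
    return [w for w, _ in Counter(unigrams).most_common(k)], "unigram"

# Toy count tables (assumption), standing in for the real n-gram matrices.
trigrams = {("meet", "someone", "new"): 2, ("meet", "someone", "special"): 1}
bigrams = {("someone", "new"): 3, ("hello", "world"): 1}
unigrams = {"time": 5, "day": 3}

words, source = predict("When you meet someone", trigrams, bigrams, unigrams)
```

Returning the table name alongside the prediction mirrors the way the app reports whether the 3-gram, 2-gram, or unigram data was used.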

Example. How does it work?

I entered a phrase from Twitter, "When you meet someone" (line 2 of the Twitter file), and pressed Submit. The algorithm correctly predicted the next word using the 3-grams! Moreover, the app tells you which data frame was used for the prediction: 3-gram, 2-gram, or unigram.

Conclusions and lessons learned!

With R we can build a prediction model and publish it with slidify: https://marcomtzmtz.shinyapps.io/AppfinalCapstone
The main problem during the process was the size of the dataset, which I reduced by cleaning it.