Data Science Capstone Project

Web Application

We have created a web application that predicts the next word a person is going to write. It works very simply:

Data

The training data contains about 550 MB of text from tweets, news, and blogs. It comprises about 4,000,000 lines, which we process by:

Predictive Algorithm: Simple model

The most popular word is “the”, so if there is no other data we provide it as the default next word. We treat the 2-grams as follows: we group them by the first word and then find the most popular second word. This word is the prediction when the previous word is provided. We treat 3-grams similarly: we group them by the first and second words and find the most popular third word.
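The lookup described above can be sketched in a few lines of Python. This is only an illustration, not the application's actual code: the names build_model and predict are hypothetical, the toy corpus is made up, and the order of fallback (3-gram table first, then 2-gram table, then the default word) is an assumption about how the three cases are combined.

```python
from collections import Counter, defaultdict


def build_model(tokens):
    """Build most-popular-next-word tables from a list of tokens."""
    unigrams = Counter(tokens)
    bigrams = defaultdict(Counter)   # first word -> counts of second word
    trigrams = defaultdict(Counter)  # (first, second) -> counts of third word
    for first, second in zip(tokens, tokens[1:]):
        bigrams[first][second] += 1
    for first, second, third in zip(tokens, tokens[1:], tokens[2:]):
        trigrams[(first, second)][third] += 1
    # Keep only the single most popular continuation for each group.
    default = unigrams.most_common(1)[0][0]
    best2 = {key: c.most_common(1)[0][0] for key, c in bigrams.items()}
    best3 = {key: c.most_common(1)[0][0] for key, c in trigrams.items()}
    return default, best2, best3


def predict(model, previous_words):
    """Predict the next word; the fallback order is an assumption."""
    default, best2, best3 = model
    words = [w.lower() for w in previous_words]
    if len(words) >= 2 and tuple(words[-2:]) in best3:
        return best3[tuple(words[-2:])]
    if words and words[-1] in best2:
        return best2[words[-1]]
    return default


# Toy corpus, for illustration only.
corpus = "the cat sat on the mat the cat ate the fish".split()
model = build_model(corpus)
print(predict(model, []))             # no context: default word
print(predict(model, ["the"]))        # one previous word: 2-gram table
print(predict(model, ["on", "the"]))  # two previous words: 3-gram table
```

Storing only the winning word per group keeps the tables small, which matters when the files have to fit on a hosted server.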

Since the free Shiny server restricts file sizes, we consider only n-grams that have appeared at least twice.
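A minimal sketch of this pruning step, assuming the n-gram counts are held in a Counter. The function name prune_ngrams and the sample sentence are hypothetical and serve only to show the effect of the threshold.

```python
from collections import Counter


def prune_ngrams(counts, min_count=2):
    """Keep only the n-grams that appeared at least min_count times."""
    return Counter({gram: n for gram, n in counts.items() if n >= min_count})


# Toy example: count 2-grams, then drop those seen only once.
tokens = "to be or not to be".split()
bigram_counts = Counter(zip(tokens, tokens[1:]))
kept = prune_ngrams(bigram_counts)
print(len(bigram_counts), "2-grams before pruning,", len(kept), "after")
```

In natural text most n-grams occur only once, so even this low threshold is likely to shrink the tables considerably.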

Accuracy

The accuracy of the model is 20%. We did not remove stop words, since keeping them leads to better accuracy.
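One way such a figure could be computed is sketched below. It assumes that accuracy means the share of words in held-out text that the model predicts exactly from the preceding words; the text above does not state the definition used, so this is only an assumption. The names next_word_accuracy and always_the are hypothetical, and the sample sentence is made up.

```python
def next_word_accuracy(predict_fn, tokens, context_size=2):
    """Share of positions where predict_fn guesses the actual next word."""
    hits = 0
    total = 0
    for i in range(1, len(tokens)):
        # Pass up to context_size preceding words to the predictor.
        context = tokens[max(0, i - context_size):i]
        if predict_fn(context) == tokens[i]:
            hits += 1
        total += 1
    return hits / total if total else 0.0


def always_the(context):
    """Baseline predictor that ignores the context (illustration only)."""
    return "the"


held_out = "we saw the cat and the dog".split()
score = next_word_accuracy(always_the, held_out)
print(f"accuracy: {score:.2f}")
```

Under this definition stop words count as targets like any other word, so a predictor that handles them well scores higher.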