Web Application

We have created a web application that predicts a word that a preson is going to write. It works as very simple:

in the text input a user enters text,
then press the button “Predict next word”,
on the right hand panel, one can see the next word.

Data

Train data contains about 550MB of text that are tweets, news, blogs. It is 4,000,000 lines that we process by:

removing everything that is not a letter,
replace upper case letters for lower case,
from each line of data we create 1, 2, 3, 4 grams (it was hard to procces data using R so we have written a C code that creates Binary Search Trees that contains grams),
we create csv files with created grams and frequecies of each gram.

Predictive Algorithm: Simple model

The most popular word is “the”. So if there is no other data we provide it as the default next word. The 2-grams we treat as follows. We group then by the first word and then we find the most popular second word. This word is the predicted word if the prevous word is provided. Similary we treat 3-grams. We group them by the first and second word and we find the third most popular word.

Since the free shiny server has restictions of size of files we consider only grams that have appeared at least twice.

Accuracy

The accuracy of the model is 20%. We did not remove stop words since that leads to better accuracy.

Data Science Capstone Project

Web Application

Data

Predictive Algorithm: Simple model

Accuracy