Data Science Capstone Project

Web Application

We have created a web application that predicts the next word a person is going to write. It is very simple to use:

the user enters text in the text input, and the predicted next word appears on the left-hand side,
pressing the button “Add prediction to text” appends the predicted next word to the text.

Data

The training data contains about 550 MB of text from tweets, news, and blogs. It consists of 4,000,000 lines, which we process by:

removing everything that is not a letter,
replacing upper-case letters with lower-case ones,
creating 1-, 2-, 3-, and 4-grams from each line of data (the data was hard to process in R, so we wrote C code that builds binary search trees containing the grams),
writing CSV files with the grams and the frequency of each gram.
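
The processing steps above can be sketched in a few lines of Python. This is only an illustration, not the project's C implementation: the function names clean_line and count_ngrams are hypothetical, a plain in-memory counter stands in for the binary search trees, and it assumes that non-letters are replaced by spaces so that word boundaries survive.

```python
import re
from collections import Counter

def clean_line(line):
    # Lowercase the line, then replace every run of non-letters with one space.
    # Assumption: only the ASCII letters a-z count as letters.
    return re.sub(r"[^a-z]+", " ", line.lower()).strip()

def count_ngrams(lines, max_n=4):
    # One counter per gram size; keys are tuples of words, values are frequencies.
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for line in lines:
        words = clean_line(line).split()
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[n][tuple(words[i:i + n])] += 1
    return counts
```

For example, count_ngrams(["The cat sat.", "The cat ran!"]) counts the 2-gram ("the", "cat") twice and the 3-gram ("the", "cat", "sat") once.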

Predictive Algorithm: Simple model

The most popular word is “the”, so if there is no other data we provide it as the default next word.
We treat the 2-grams as follows: we group them by the first word and find the most popular second word in each group. This word is the prediction when the previous word is provided.
We treat 3-grams and 4-grams similarly: we group them by their first words (the first two or three, respectively) and find the most popular next word.
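
A minimal sketch of this lookup in Python follows. The names build_table and predict_next are hypothetical, and the order in which contexts are tried (longest first, then shorter ones, then the default) is an assumption; the description above does not spell it out.

```python
from collections import Counter, defaultdict

def build_table(ngram_counts):
    # ngram_counts maps a tuple of words to its frequency (assumed format).
    # Group grams by all words except the last, then keep the most frequent last word.
    followers = defaultdict(Counter)
    for gram, freq in ngram_counts.items():
        if len(gram) >= 2:
            followers[gram[:-1]][gram[-1]] += freq
    return {ctx: c.most_common(1)[0][0] for ctx, c in followers.items()}

def predict_next(text, table, default="the", max_context=3):
    words = text.lower().split()
    # Try the last three words (4-grams), then two (3-grams), then one (2-grams).
    for k in range(min(max_context, len(words)), 0, -1):
        ctx = tuple(words[-k:])
        if ctx in table:
            return table[ctx]
    # No matching context: fall back to the most popular word.
    return default
```

With a table built from {("i", "am"): 5, ("i", "was"): 3}, predict_next("I", table) returns "am", and an empty input returns the default "the".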

Since the free Shiny server restricts file sizes, we keep only the grams that appeared at least twice in the training data.
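
The frequency cut-off is a simple filter. A sketch in Python, where the function name prune and the dictionary format (tuple of words mapped to frequency) are assumptions made for illustration:

```python
def prune(ngram_counts, min_count=2):
    # Keep only the grams seen at least min_count times in the training data.
    return {gram: freq for gram, freq in ngram_counts.items() if freq >= min_count}
```

Grams that occur only once tend to make up a large share of the distinct grams, so dropping them can shrink the files considerably while changing few predictions.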

Accuracy

The accuracy of the model is 20%. We did not remove stop words, since keeping them leads to better accuracy.
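
How the accuracy was measured is not stated here. One common way, shown below as an assumption rather than as the procedure actually used, is to walk through held-out text and count how often the predicted next word equals the word that really follows. The name next_word_accuracy is hypothetical, and predict can be any function that maps a text to one predicted word.

```python
def next_word_accuracy(lines, predict):
    # For every word after the first in each line, predict it from the preceding
    # words and count exact matches.
    correct = 0
    total = 0
    for line in lines:
        words = line.lower().split()
        for i in range(1, len(words)):
            total += 1
            if predict(" ".join(words[:i])) == words[i]:
                correct += 1
    return correct / total if total else 0.0
```

Under this measure, stop words such as “the” count as ordinary targets, which is one reason keeping them can raise the score.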

Links

Git repo:

https://github.com/sbartek/textProcessing

Application:

https://sbartek.shinyapps.io/predictnextword