ProDict

A predictive word application written by Carnac, the Magnificent, which also happens to be a submission for the Coursera/JHU Data Science Capstone Project.

January 19, 2016

note: here is the application link. It is also linked on the last (fifth) slide.
also note: the annoying blinking logo only appears on this slide.

What the application does

[Screenshot of the application]

How it works

ProDict™ uses a backoff algorithm over 2-, 3-, 4-, and 5-gram phrases to predict the next word from the prior words you typed (up to four). Specifically, it suggests three possible words, ordered from highest to lowest confidence, using the following logic:

It first looks for the highest-confidence words that follow the prior four words.
If fewer than three recommendations are found, it backs off to the prior three words.
…and keeps backing off until only the last word is used.
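
The backoff logic above can be sketched in a few lines of Python. This is only an illustration, not the ProDict™ implementation: the function names (`build_ngram_table`, `predict`) are hypothetical, "confidence" is assumed to be the raw count of a word after a given context, and matches from longer contexts are assumed to rank ahead of matches from shorter ones.

```python
from collections import Counter, defaultdict


def build_ngram_table(tokens, max_n=5):
    """Map each context (1 to max_n - 1 words) to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for n in range(2, max_n + 1):
        for i in range(len(tokens) - n + 1):
            context = tuple(tokens[i:i + n - 1])
            next_word = tokens[i + n - 1]
            table[context][next_word] += 1
    return table


def predict(table, typed_words, k=3, max_context=4):
    """Suggest up to k next words, backing off from longer to shorter contexts."""
    suggestions = []
    longest = min(max_context, len(typed_words))
    for size in range(longest, 0, -1):
        context = tuple(typed_words[-size:])
        # Assumption: candidates within one context are ranked by raw count.
        for word, _count in table.get(context, Counter()).most_common():
            if word not in suggestions:
                suggestions.append(word)
            if len(suggestions) == k:
                return suggestions
    return suggestions


# Toy corpus, purely for illustration.
tokens = "i love new york i love new shoes i love you".split()
table = build_ngram_table(tokens)
print(predict(table, ["york", "i", "love"]))
```

In this toy run, the three-word context "york i love" yields only one candidate, so the sketch backs off to "i love" to find a second one. Only two suggestions come back because the corpus contains no third candidate at any level.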

Performance

The initial backoff model used up to trigrams, randomly sampled 25% of the source data from Twitter, news, and blog posts, kept the top 50% of phrases/words by frequency, and had an accuracy rate of roughly 9%.
A secondary model was developed using Kneser-Ney smoothing. It had an accuracy rate of 12% but was substantially slower on the Shiny server.
Given the noticeable performance difference, the initial backoff model was modified to use up to 5-grams, 95% of the data, and the top 90% of phrases/words by frequency. Accuracy improved to 11%. This model was implemented after adding back the test data. Server performance was similar to that of the initial backoff model.
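
For readers wondering how an accuracy rate like the ones above might be computed, here is a minimal sketch. The slides do not say whether a prediction counted as correct only when the top suggestion matched or when any of the three did, so the cutoff `k` is left as a parameter. The function name `top_k_accuracy` and the toy predictor are hypothetical.

```python
def top_k_accuracy(predict_fn, test_phrases, k=3):
    """Fraction of phrases whose final word is among the top k suggestions.

    Each phrase is a list of words: all but the last are the typed context,
    and the last is the word the model should predict.
    """
    hits = 0
    total = 0
    for words in test_phrases:
        if len(words) < 2:
            continue  # no context to predict from
        context, target = words[:-1], words[-1]
        if target in predict_fn(context)[:k]:
            hits += 1
        total += 1
    return hits / total if total else 0.0


# Toy stand-in for a real model: always suggests the same three words.
def toy_predictor(context):
    return ["the", "a", "to"]


held_out = [["go", "to"], ["see", "the"], ["eat", "cake"], ["solo"]]
print(top_k_accuracy(toy_predictor, held_out, k=3))
print(top_k_accuracy(toy_predictor, held_out, k=1))
```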

Advantages

Here is the link to the final application. The main advantages of ProDict™ are:

Speedy, snappy performance.
Visually appealing, even with the horrible logo.
Easy to understand and use, even for the novice user.

We are asking for your investment to purchase an upgraded Shiny host so that we can implement a more robust algorithm AND have acceptable performance while improving accuracy. We would also like to hire a graphic designer for the logo.