Coursera Data Science Capstone Project

19/08/2015

The goal of this capstone project is to build a Shiny application that takes input text and predicts the next word.

The initial data came from the HC Corpora corpus and contained text from three sources: news, Twitter and blogs.

The prediction model was built using a random sample of text from news, Twitter and blogs.
All texts were divided into N-grams (N={1,2,3,4,5}), and a frequency dictionary was built for each N.
For input text we look at the last M words (M=4 if the user has written 4 or more words, 3 if 3, and so on) and compute a score for each candidate next word. If any 5-gram starts with these 4 words, we take the last words of those 5-grams, weighted by their frequencies and the coefficient k5. We then look at the 4-grams that start with the last 3 written words and score them with coefficient k4, and likewise use k3 for trigrams and k2 for bigrams. Finally we sum the scores for each word and show the top-scoring words. If the user has written no text, or nothing matches even among the bigrams, we take the top unigrams from the whole corpus.
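
The scoring scheme described above can be sketched roughly as follows. This is a minimal illustration in Python, not the code behind the app: the function names, the whitespace tokenizer, the coefficient values in WEIGHTS, and the use of relative (rather than raw) frequencies are all assumptions made for the example.

```python
from collections import Counter, defaultdict

# Hypothetical coefficients k2..k5; the values used in the app are not stated.
WEIGHTS = {2: 1.0, 3: 2.0, 4: 4.0, 5: 8.0}


def build_ngram_counts(sentences, max_n=5):
    """Count unigrams and, for n = 2..max_n, the words following each (n-1)-word prefix."""
    unigrams = Counter()
    tables = {n: defaultdict(Counter) for n in range(2, max_n + 1)}
    for sentence in sentences:
        words = sentence.lower().split()  # naive tokenizer, an assumption
        unigrams.update(words)
        for n in range(2, max_n + 1):
            for i in range(len(words) - n + 1):
                prefix = tuple(words[i:i + n - 1])
                tables[n][prefix][words[i + n - 1]] += 1
    return unigrams, tables


def predict(text, unigrams, tables, weights=WEIGHTS, top=3):
    """Sum weighted n-gram scores per candidate word; fall back to top unigrams."""
    words = text.lower().split()
    scores = Counter()
    for n in sorted(tables):
        if len(words) < n - 1:
            break  # not enough typed words for this or any longer n-gram
        prefix = tuple(words[-(n - 1):])
        followers = tables[n].get(prefix)
        if not followers:
            continue
        total = sum(followers.values())
        for word, count in followers.items():
            # Relative frequency times the coefficient for this n (assumed form).
            scores[word] += weights[n] * count / total
    if not scores:
        return [w for w, _ in unigrams.most_common(top)]
    return [w for w, _ in scores.most_common(top)]


sample = [
    "the cat sat on the mat",
    "the cat sat on the sofa",
    "the cat ate the fish",
]
unigrams, tables = build_ngram_counts(sample)
print(predict("the cat", unigrams, tables, top=2))
print(predict("", unigrams, tables, top=1))
```

With this toy sample, "sat" outranks "ate" after "the cat" because it is more frequent in both the bigram and the trigram tables, and an empty input falls back to the most frequent unigram.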

The Shiny app is hosted at http://bit.ly/1MxJlfQ. Overall accuracy of next-word prediction within the top n words, for n={1,2,3,4,5}: