Data Science Capstone Project: Word Prediction

Herve Yu
August 10, 2015

This is the Capstone project for:

From the SwiftKey files (Twitter, news, and blogs) in the English language, create a data product that attempts to predict the next word. Tasks:

Train the data with a Markov n-gram model: tokenize the text and weight word occurrences. Reference: https://www.youtube.com/watch?v=o-CvoOkVrnY
Additional filtering was required for performance: more than 7 million texts caused performance problems for the product hosted on shinyapps.io. Criteria taken from discounted Kneser-Ney smoothing (http://mkoerner.de/media/bachelor-thesis.pdf) help with the filtering, for example keeping the prior 1, 2, or 3 words fixed and looking for maximum variability in the 4th word. The dataset was reduced to 100,000 lines.
A backoff mechanism is implemented to find a match first with the five-gram, then the four-gram, and so on down to the unigram.
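
The counting and backoff steps above can be sketched as follows. This is a minimal illustration in Python, not the code behind the hosted product: the function names (tokenize, train, predict), the whitespace tokenizer, and the toy corpus are all assumptions made for the example. It scores candidates by raw relative frequency and does not implement Kneser-Ney smoothing or the filtering step.

```python
from collections import Counter, defaultdict

MAX_ORDER = 5  # five-gram down to unigram, as in the task list


def tokenize(line):
    """Lowercase and split on whitespace (an assumed, deliberately simple tokenizer)."""
    return line.lower().split()


def train(lines, max_order=MAX_ORDER):
    """Count next-word occurrences for every context of 0 to max_order - 1 words."""
    counts = defaultdict(Counter)  # context tuple -> Counter of following words
    for line in lines:
        tokens = tokenize(line)
        for i, word in enumerate(tokens):
            for n in range(max_order):  # n = number of context words
                if i - n < 0:
                    break  # not enough preceding words for a longer context
                context = tuple(tokens[i - n:i])
                counts[context][word] += 1
    return counts


def predict(counts, text, k=5, max_order=MAX_ORDER):
    """Back off from the longest available context to the unigram until one matches."""
    tokens = tokenize(text)
    longest = min(max_order - 1, len(tokens))
    for n in range(longest, -1, -1):
        context = tuple(tokens[len(tokens) - n:])  # n == 0 gives the empty context
        followers = counts.get(context)
        if followers:
            total = sum(followers.values())
            # Relative frequency within the matched context only (no smoothing).
            return [(w, c / total) for w, c in followers.most_common(k)]
    return []


# Hypothetical toy corpus, for illustration only.
corpus = [
    "the cat sat on the mat",
    "the cat sat on the sofa",
    "the cat ate the fish",
]
model = train(corpus)
print(predict(model, "the cat"))         # matched on the two-word context
print(predict(model, "dog sat on the"))  # four-word context fails, backs off to three
```

Calling predict with k=5 corresponds to the short list of suggestions, and k=30 to the larger set of words used for the cloud. Stopping at the first matching context keeps lookups cheap, at the cost of ignoring evidence from shorter contexts.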

In the sidebar, enter your text.
The 5 words with the highest probability are shown below it.
In the main panel, up to 30 of the highest-probability words are displayed in a word cloud.
Access the product at: https://yuhrvfr.shinyapps.io/wordpredict