Data Science Capstone Project: Word Prediction

Herve Yu
August 11, 2015

Presentation of Capstone project in partnership with:

Johns Hopkins Bloomberg School of Public Health (Prof. Brian Caffo, Prof. Roger Peng, Prof. Jeff Leek) and Coursera
SwiftKey Corporation, which provides the data files
RStudio Corporation, which provides the hosting and development tool platforms

From the SwiftKey English-language files (Twitter, news, blogs), create a data product that predicts the next word. Tasks:

Train a Markov n-gram model on the data: tokenize the text and weight word occurrences - https://www.youtube.com/watch?v=o-CvoOkVrnY
Additional filtering for scalability: the 7 million+ texts cause performance problems for a product with limited resources. Criteria taken from discounted Kneser-Ney smoothing - http://mkoerner.de/media/bachelor-thesis.pdf help filter on word variability, e.g. with the four preceding words fixed, retain fifth words that show high variation, which increases the variety of combinations. The dataset is reduced to 100,000 lines.
Simple online backoff mechanism that looks for a match first among 5-grams, then 4-grams, and so on down to unigrams - https://www.youtube.com/watch?v=t-TZ0YrrIDA
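The training and backoff tasks above can be sketched as follows. This is a minimal illustration in Python, not the app's actual code: the names tokenize, train and predict are hypothetical, the tokenizer is a plain whitespace split, candidates are ranked by raw counts rather than Kneser-Ney probabilities, and the filtering step is left out.

```python
from collections import Counter, defaultdict

def tokenize(line):
    # Assumption: lowercase + whitespace split; a real tokenizer would
    # also handle punctuation, numbers, profanity, and so on.
    return line.lower().split()

def train(lines, max_n=5):
    # model[n][context] counts the words seen right after `context`,
    # where `context` is a tuple of the n-1 preceding words.
    model = {n: defaultdict(Counter) for n in range(1, max_n + 1)}
    for line in lines:
        tokens = tokenize(line)
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])
                model[n][context][tokens[i + n - 1]] += 1
    return model

def predict(model, text, k=5, max_n=5):
    # Back off from the longest n-gram to the unigram and return the
    # k most frequent continuations of the first context that matches.
    tokens = tokenize(text)
    for n in range(max_n, 0, -1):
        if n - 1 > len(tokens):
            continue  # not enough typed words for this order
        context = tuple(tokens[len(tokens) - (n - 1):])
        candidates = model[n].get(context)
        if candidates:
            return [word for word, _ in candidates.most_common(k)]
    return []

lines = ["i love new york", "i love new shoes", "we love new york"]
model = train(lines)
# No 4-gram starts with "they love new", so the lookup backs off
# to the 3-gram context ("love", "new").
print(predict(model, "they love new"))  # ['york', 'shoes']
```

Storing counts per context keeps each lookup to a single dictionary access per n-gram order, which is in line with the sub-second response target of a hosted app with limited resources.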

Enter your text in the sidebar; after initialization, the “words” display in the main panel within a reasonable time.
The 5 highest-ranked predictions display below your text.
A slider controls the number of words shown in the word cloud plot, from 1 to 30.
In the main panel, the word cloud plot appears along with the most probable next word.
Overall response time should be under 0.5 seconds per interaction after initialization, which is expected to take less than 10 seconds.
The layout refreshes only when new word predictions are detected.
Access: https://yuherve.shinyapps.io/wordpredictor

Training of the dataset is key: the algorithm can be extended with more sophisticated data filtering and additional processing such as word similarity.
Self-learning system: unknown words and texts can be stored and evaluated as potential new entries for the training dataset in an automated fashion.
Specialization of the text prediction mechanism based on the type of text being analyzed.