Data Science Capstone Project: Word Prediction

Herve Yu
August 10, 2015

Presentation

This Capstone project involves:

  • Johns Hopkins Bloomberg School of Public Health and Coursera, which run the specialization
  • SwiftKey, which provides the data files
  • RStudio, which provides the hosting and development tool platforms

Objective

Using the English-language Twitter, news, and blog files from SwiftKey, create a data product that attempts to predict the next word. Tasks:

  • Explore the data
  • Train a natural language processing model on the data
  • Build a Shiny app around the model

Realization

  • Train the model with Markov n-grams: tokenize the text and weight word occurrences (reference: https://www.youtube.com/watch?v=o-CvoOkVrnY)
  • Additional filtering was required for performance: the 7 million+ texts caused performance problems for the product hosted on shinyapps.io. Criteria from discounted Kneser-Ney smoothing (http://mkoerner.de/media/bachelor-thesis.pdf) guide the filtering, for example fixing the prior 1, 2, or 3 words and looking for maximum variability on the fourth word. The dataset was reduced to 100,000 lines.
  • A backoff mechanism looks for a match with the five-gram first, then the four-gram, and so on down to the unigram.
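
The training and backoff steps above can be sketched as follows. This is a minimal illustration in Python, not the R code behind the app: the whitespace tokenizer, the function names train and predict, and the use of raw counts as weights are all assumptions, and no Kneser-Ney filtering is applied.

```python
from collections import Counter, defaultdict

MAX_ORDER = 5  # five-gram is the highest order named in the slides


def train(lines, max_order=MAX_ORDER):
    """Count, for every context of 0..max_order-1 words, the words that follow it."""
    table = defaultdict(Counter)
    for line in lines:
        tokens = line.lower().split()  # naive whitespace tokenizer (assumption)
        for i, word in enumerate(tokens):
            for n in range(max_order):  # n = number of context words
                if i - n < 0:
                    break  # not enough preceding words for a longer context
                context = tuple(tokens[i - n:i])  # empty tuple = unigram table
                table[context][word] += 1
    return table


def predict(table, text, k=5, max_order=MAX_ORDER):
    """Back off from the longest available context down to the unigram table."""
    tokens = text.lower().split()
    for n in range(min(max_order - 1, len(tokens)), -1, -1):
        context = tuple(tokens[len(tokens) - n:])
        followers = table.get(context)
        if followers:
            # Raw counts stand in for the weights here (assumption)
            return [word for word, _ in followers.most_common(k)]
    return []
```

For example, after training on a few lines, predict(table, "the cat") first tries the two-word context ("the", "cat"); if that context was never seen, it retries with ("cat",) and finally falls back to the most frequent single words.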

Data Product Description

  • Enter your text in the sidebar
  • The 5 words with the highest probability are shown below
  • In the main panel, up to 30 of the highest-probability words are displayed in a word cloud
  • Access the product at: https://yuhrvfr.shinyapps.io/wordpredict
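
One way to derive both displays from a single ranked result is sketched below. It is a hypothetical Python illustration, not the app's R code: the function name split_for_display and the dictionary of word probabilities are assumptions; only the sizes 5 and 30 come from the description above.

```python
import heapq


def split_for_display(probabilities, list_size=5, cloud_size=30):
    """Return the top words for the sidebar list and the word-cloud weights."""
    # Rank once, keeping only as many candidates as the cloud can show
    ranked = heapq.nlargest(cloud_size, probabilities.items(), key=lambda kv: kv[1])
    top_list = [word for word, _ in ranked[:list_size]]
    cloud = dict(ranked)  # word -> probability, used to size words in the cloud
    return top_list, cloud
```

Ranking once and slicing keeps the sidebar list consistent with the cloud: the five listed words are always the five largest words in the cloud.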