Data Science Capstone Project: Word Prediction

Herve Yu
August 10, 2015

Presentation of the Capstone project, in partnership with:

  • Johns Hopkins Bloomberg School of Public Health (Prof. Brian Caffo, Prof. Roger Peng, Prof. Jeff Leek) and Coursera
  • SwiftKey, which provides the data files
  • RStudio, which provides the hosting and development platforms

Objective

Using the SwiftKey English-language files (Twitter, news, and blogs), create a data product that predicts the next word. Tasks:

  • Explore the data
  • Train a model on the data using natural language processing
  • Build a Shiny app with the model
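
As a rough illustration of the training task, the sketch below counts word pairs (bigrams) and suggests the most frequent followers of the last word typed. It is written in Python rather than the project's R code, and the bigram approach, the tokenizer, and the names `tokenize`, `train_bigrams` and `predict_next` are assumptions made for this example, not the method used in the project.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes as tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train_bigrams(lines):
    """For each word, count the words that directly follow it (hypothetical helper)."""
    followers = defaultdict(Counter)
    for line in lines:
        tokens = tokenize(line)
        for prev, nxt in zip(tokens, tokens[1:]):
            followers[prev][nxt] += 1
    return followers


def predict_next(followers, text, k=5):
    """Return up to k most frequent followers of the last word in text."""
    tokens = tokenize(text)
    if not tokens:
        return []
    # .get avoids creating an empty entry for a word never seen in training
    counts = followers.get(tokens[-1], Counter())
    return [word for word, _ in counts.most_common(k)]


# Toy corpus standing in for the Twitter, news, and blog files
model = train_bigrams(["I love data science", "I love R", "I love data"])
print(predict_next(model, "I love", k=2))
```

A real model would likely need longer n-grams and a fallback for unseen contexts; the sketch only shows the count-then-rank idea.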

Realization

Data Product Description

  • Enter your text in the sidebar. Once initialization completes, which takes a reasonable time, the word “words” appears in the main panel.
  • The 5 highest-ranked predicted words are displayed below your text.
  • In the main panel, up to 30 of the most probable words are displayed in a word cloud.
  • After initialization, each reaction should take less than 0.5 second.
  • The product is available at https://yuhrvfr.shinyapps.io/wordpredict
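
The two displays described above (a top-5 list and a cloud of at most 30 words) can be sketched as a single ranking step. This is a minimal Python illustration, not the app's R code: the function name `rank_candidates`, its parameters, and the idea of turning raw counts into relative frequencies are assumptions made for the example.

```python
import heapq


def rank_candidates(counts, n_list=5, n_cloud=30):
    """Split candidate words into a short list and a word-cloud set.

    counts maps each candidate word to a non-negative score (hypothetical input).
    Returns (top_words, cloud), where cloud is a list of (word, share) pairs
    sorted from most to least probable and capped at n_cloud entries.
    """
    total = sum(counts.values())
    if total == 0:
        return [], []
    ranked = heapq.nlargest(n_cloud, counts.items(), key=lambda kv: kv[1])
    cloud = [(word, score / total) for word, score in ranked]
    top_words = [word for word, _ in cloud[:n_list]]
    return top_words, cloud


# Made-up counts for illustration only
example = {"the": 50, "a": 30, "to": 10, "of": 6, "in": 3, "on": 1}
top_words, cloud = rank_candidates(example)
print(top_words)
print(cloud)
```

Ranking once and slicing the same result for both displays keeps the per-keystroke work small, which matters for the half-second response target.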

Future directions

  • Training on the dataset is key; the algorithm can be extended with more sophisticated data filtering.
  • Self-learning: unknown words and texts can be stored and evaluated as potential new entries for the training dataset.
  • Specialization: the prediction mechanism can be adapted to the type of text being analyzed.
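
The self-learning idea could look roughly like the Python sketch below: words missing from the known vocabulary are logged, and those seen often enough become candidates for the training data. The class `UnknownWordLog`, the `min_count` threshold, and the promotion rule are all hypothetical; the source does not specify how such entries would be stored or evaluated.

```python
from collections import Counter


class UnknownWordLog:
    """Track words missing from the vocabulary (hypothetical helper)."""

    def __init__(self, vocabulary, min_count=3):
        self.vocabulary = set(vocabulary)
        self.min_count = min_count  # assumed threshold before a word is proposed
        self.unseen = Counter()

    def observe(self, tokens):
        """Count every token that the current vocabulary does not contain."""
        for token in tokens:
            if token not in self.vocabulary:
                self.unseen[token] += 1

    def candidates(self):
        """Return unknown words seen at least min_count times, alphabetically."""
        return sorted(w for w, c in self.unseen.items() if c >= self.min_count)


log = UnknownWordLog({"the", "cat"}, min_count=2)
log.observe(["the", "selfie", "selfie", "yolo"])
print(log.candidates())
```

A count threshold is only one possible filter; it is used here because it keeps one-off typos out of the candidate list.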