Data Science Capstone Project: Word Prediction

Herve Yu
August 11, 2015

Presentation of Capstone project in partnership with:

  • Johns Hopkins Bloomberg School of Public Health (Prof. Brian Caffo, Prof. Roger Peng, Prof. Jeff Leek) and Coursera
  • SwiftKey Corporation provides the datasets
  • RStudio Corporation provides the hosting and development tool platforms

Objective

Using the English-language SwiftKey files (Twitter, news, blogs), create a data product that predicts the next word. Tasks:

  • Explore the data
  • Train a model on the data using natural language processing
  • Build a Shiny app with the model

Realization

  • Train the model with Markov n-grams: tokenize the text and weight word occurrences - https://www.youtube.com/watch?v=o-CvoOkVrnY
  • Additional filtering for scalability: the 7 million+ texts cause performance problems for a product with limited resources. Criteria drawn from discounted Kneser-Ney smoothing - http://mkoerner.de/media/bachelor-thesis.pdf - help filter on word variability, e.g. with the four preceding words fixed, fifth words that show high variation are retained to increase the variety of combinations. The dataset is reduced to 100,000 lines
  • A simple online back-off mechanism looks for a match first with the 5-gram, then the 4-gram, and so on down to the unigram - https://www.youtube.com/watch?v=t-TZ0YrrIDA
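
The n-gram counting and back-off lookup described above can be sketched as follows. This is a minimal illustration in Python, not the project's own implementation: the function names (tokenize, build_ngram_tables, predict_next) are hypothetical, the tokenizer is deliberately naive, and candidates are ranked by raw counts without any Kneser-Ney discounting or pruning.

```python
from collections import Counter, defaultdict

def tokenize(line):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return line.lower().split()

def build_ngram_tables(lines, max_order=5):
    """Map each context tuple (0 to max_order-1 words) to a Counter of next words."""
    tables = defaultdict(Counter)
    for line in lines:
        tokens = tokenize(line)
        for i, word in enumerate(tokens):
            # Record the word under every context length that fits before it.
            for n in range(max_order):
                if i - n < 0:
                    break
                context = tuple(tokens[i - n:i])
                tables[context][word] += 1
    return tables

def predict_next(tables, text, max_order=5, top_k=5):
    """Back off from the longest context down to the unigram table."""
    tokens = tokenize(text)
    for n in range(min(max_order - 1, len(tokens)), -1, -1):
        # n == 0 is the empty context, i.e. the unigram table.
        context = tuple(tokens[len(tokens) - n:]) if n else ()
        candidates = tables.get(context)
        if candidates:
            return [word for word, _ in candidates.most_common(top_k)]
    return []

# Toy corpus, assumed for illustration only.
corpus = [
    "the cat sat on the mat",
    "the cat sat on the sofa",
    "the cat sat on the mat again",
    "a dog sat on the rug",
]
tables = build_ngram_tables(corpus)
```

With this toy corpus, predict_next(tables, "the cat sat on the") is answered by the 5-gram table, while predict_next(tables, "a bird sat on the") finds no 5-gram match and falls back to the 4-gram context "sat on the". Returning the first match, rather than interpolating across orders, keeps each lookup to a handful of dictionary accesses.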

Data Product Description

  • Enter your text in the sidebar; after initialization, “words” will display in the main panel within a reasonable time.
  • The 5 highest-ranked predicted words display below your text.
  • A slider controls the number of words displayed in the word cloud plot, from 1 to 30.
  • In the main panel, the word cloud plot appears along with the most probable next word.
  • Each interaction should take less than 0.5 seconds after initialization, which is expected to take less than 10 seconds.
  • The layout refreshes only when new word predictions are detected.
  • Access: https://yuherve.shinyapps.io/wordpredictor.

Some suggestions for future direction

  • Training of the dataset is key: the algorithm can be extended with more sophisticated filtering of data and additional data processing such as word similarity…
  • Self-learning system: unknown words and texts can be stored and evaluated to become potential new entries for the training dataset in an automated fashion
  • Specialization of the text prediction mechanism based on the type of text being analyzed