Data Science Capstone Pitch

YuXuan Tay
Tuesday, August 04, 2015

Prediction Model

  • based on the N-Gram Language Model on the HC Corpora
    • blogs, news and Twitter corpora of the English language
  • preprocessing, such as sentence detection, punctuation removal and converting to lowercase were performed
  • word splitting was then done, with sentence beginning, numbers and rare words (with counts <= 5) represented by special symbols, to create a vectors of words for each corpus
  • n-grams of size up to 5 were generated by binding the word vector repeatedly with index displaced
  • n-grams were then counted and normalised into proportion based on the first (n-1) words of the n-gram
  • packages such as stringi and data.table were used

Prediction Steps

  • input text is cleaned in the similar manner as the corpora and the last 5 words extracted
  • predictions based on different n-gram sizes are obtained from each corpora
  • prediction confidence are combined based on a smoothing function for different n-gram sizes and based on preset weights for the different corpora
  • previous step incorporates backoff automatically in the event only small n-grams can be found
  • predicted words with top 5 confidence are presented as suggestions

Features

  • suggestion(s) displayed as button(s) for user interaction
  • user input text updates with user selected suggestion
  • processing steps presented to offer hints of underlying workings
  • prediction word cloud gives idea of word prediction confidence
  • number of suggestions can be increased up to 6
  • adjustable corpus and ngram weights for advance users