Data Science Capstone Pitch

YuXuan Tay
Tuesday, August 04, 2015

based on an n-gram language model built from the HC Corpora
- English-language blogs, news and Twitter corpora
preprocessing, such as sentence detection, punctuation removal and conversion to lowercase, was performed
word splitting was then done, with sentence beginnings, numbers and rare words (with counts <= 5) represented by special symbols, to create a vector of words for each corpus
n-grams of size up to 5 were generated by repeatedly binding the word vector to copies of itself with the index displaced
n-grams were then counted and normalised into proportions conditioned on the first (n-1) words of the n-gram
packages such as stringi and data.table were used
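
The model-building steps above can be sketched roughly as follows. This is a minimal illustration in Python, not the R code (stringi, data.table) used for the project. The symbol names `<s>`, `<num>` and `<unk>`, the regex-based sentence splitter and all function names are assumptions made for the example. For brevity, n-grams here are allowed to span sentence boundaries.

```python
import re
from collections import Counter, defaultdict

# Special symbols; the exact names are assumptions for this sketch.
BOS = "<s>"    # sentence beginning
NUM = "<num>"  # any number
UNK = "<unk>"  # rare word


def tokenize(text):
    """Detect sentences, lowercase, strip punctuation and split into words."""
    tokens = []
    # Naive sentence detection: split after ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        words = re.sub(r"[^a-z0-9']+", " ", sentence.lower()).split()
        if not words:
            continue
        tokens.append(BOS)
        tokens.extend(NUM if w.isdigit() else w for w in words)
    return tokens


def replace_rare(tokens, min_count=6):
    """Replace words seen fewer than min_count times (counts <= 5 by default)."""
    counts = Counter(tokens)
    special = {BOS, NUM, UNK}
    return [t if t in special or counts[t] >= min_count else UNK for t in tokens]


def ngrams(tokens, n):
    """Bind the token vector with copies of itself shifted by 0..n-1 positions."""
    return list(zip(*(tokens[i:] for i in range(n))))


def conditional_proportions(tokens, n):
    """Map each (n-1)-word context to {next_word: count / context_total}."""
    counts = Counter(ngrams(tokens, n))
    context_totals = Counter()
    for gram, c in counts.items():
        context_totals[gram[:-1]] += c
    table = defaultdict(dict)
    for gram, c in counts.items():
        table[gram[:-1]][gram[-1]] = c / context_totals[gram[:-1]]
    return dict(table)


# Tiny demo corpus; min_count is lowered so that some words survive.
tokens = replace_rare(
    tokenize("The cat sat. The cat ran. The dog sat 2 times!"), min_count=2
)
bigram_table = conditional_proportions(tokens, 2)
```

In the demo, `bigram_table[("the",)]` holds the proportions of each word observed after "the". Calling `conditional_proportions` for n = 1 to 5 would give one table per n-gram size.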

input text is cleaned in the same manner as the corpora and the last 5 words are extracted
predictions based on different n-gram sizes are obtained from each corpus
prediction confidences are combined using a smoothing function across the different n-gram sizes and preset weights across the different corpora
the previous step incorporates backoff automatically when only smaller n-grams can be found
the 5 predicted words with the highest confidence are presented as suggestions
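
The prediction steps above might look roughly like the Python sketch below. The pitch does not specify the smoothing function or the corpus weights, so the `order_weight` function (here 2^n, favouring longer n-grams), the example weights, the nested-dictionary layout of `models` and the function name `predict` are all assumptions made for illustration. The input is assumed to be already cleaned and split into words.

```python
from collections import defaultdict


def predict(models, corpus_weights, context,
            order_weight=lambda n: 2.0 ** n, top_k=5):
    """Combine per-corpus, per-order predictions into ranked suggestions.

    models: {corpus: {n: {context_tuple: {word: proportion}}}}
    corpus_weights: {corpus: weight} (preset; values here are hypothetical)
    context: list of cleaned input words, most recent last
    """
    scores = defaultdict(float)
    for corpus, tables in models.items():
        for n, table in tables.items():
            # An n-gram table is keyed by the last (n-1) input words.
            ctx = tuple(context[-(n - 1):]) if n > 1 else ()
            if len(ctx) < n - 1:
                continue  # not enough input words for this n-gram size
            # A missing context contributes nothing, so smaller n-grams
            # take over without an explicit backoff branch.
            for word, p in table.get(ctx, {}).items():
                scores[word] += corpus_weights[corpus] * order_weight(n) * p
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_k]]


# Hand-built toy tables standing in for the real corpus models.
models = {
    "blogs": {
        1: {(): {"the": 0.6, "cat": 0.4}},
        2: {("the",): {"cat": 0.75, "dog": 0.25}},
    },
    "news": {
        2: {("the",): {"dog": 1.0}},
        3: {("see", "the"): {"dog": 1.0}},
    },
}
weights = {"blogs": 0.5, "news": 0.5}

suggestions = predict(models, weights, ["i", "see", "the"])
fallback = predict(models, weights, ["unseen"])
```

With these toy tables, `suggestions` ranks "dog" first because the trigram and both bigram tables support it, while `fallback` relies only on the unigram table since no larger n-gram matches the input.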