TZiegler
July 2016
This presentation briefly describes a Shiny application that predicts the next word of a sentence using Machine Learning (ML). The project was carried out in cooperation with SwiftKey (http://swiftkey.com/en/) in the area of Natural Language Processing (NLP).
The Data Science/ML Process:
Major R packages used: “quanteda”, “stringi”, and “data.table”
Demo data (blogs, news and Twitter) were used as the word library for prediction. To limit memory use and keep prediction fast, subsets of the three data sets (about 10% each) were merged into one corpus. Data cleaning involved splitting the text into sentences, converting to lower case, and removing punctuation and swear words.
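The cleaning steps can be illustrated with a minimal sketch. It is written in plain Python for illustration only and is not the app's R code; the function name `clean_text` and the tiny `PROFANITY` list are hypothetical stand-ins, and dropping digits is an assumption of this sketch.

```python
import re

# Hypothetical placeholder list; the app's actual swear-word list is not given.
PROFANITY = {"darn", "heck"}

def clean_text(text, banned=PROFANITY):
    """Split text into sentences, lower-case them, drop punctuation and banned words."""
    cleaned = []
    # Treat runs of '.', '!' or '?' as sentence boundaries (a simplification).
    for sentence in re.split(r"[.!?]+", text):
        # Lower-case, then replace everything except letters, apostrophes
        # and whitespace with a space (this also drops digits).
        sentence = re.sub(r"[^a-z'\s]", " ", sentence.lower())
        tokens = [tok for tok in sentence.split() if tok not in banned]
        if tokens:
            cleaned.append(tokens)
    return cleaned
```

Each sentence comes back as a list of tokens, which is a convenient input for building n-grams.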
N-grams are the basis of the word prediction application. Therefore, the next steps were:
The probability of the next word in a sentence can be estimated from the previous words. To predict the next word, the algorithm looks for all n-grams whose first (n-1) words match the last (n-1) words of the sentence. The predicted word is the last word of the matching n-gram (n = 2..4) with the highest weighted frequency.
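A rough sketch of this lookup, in plain Python rather than the app's R code, might look as follows. The names `build_ngram_tables` and `predict_next` are hypothetical, and the `WEIGHTS` values are an assumption chosen only to favour longer matches; the app's actual weighting is not stated here.

```python
from collections import Counter, defaultdict

def build_ngram_tables(sentences, orders=(2, 3, 4)):
    """Map each (n-1)-word prefix to a Counter of the words that follow it."""
    tables = {n: defaultdict(Counter) for n in orders}
    for tokens in sentences:
        for n in orders:
            for i in range(len(tokens) - n + 1):
                prefix = tuple(tokens[i:i + n - 1])
                tables[n][prefix][tokens[i + n - 1]] += 1
    return tables

# Hypothetical weights favouring longer matches; not taken from the app.
WEIGHTS = {2: 0.1, 3: 0.3, 4: 0.6}

def predict_next(tables, words, weights=WEIGHTS):
    """Return the candidate with the highest weighted relative frequency, or None."""
    scores = Counter()
    for n, table in tables.items():
        if len(words) < n - 1:
            continue  # sentence too short for this n-gram order
        prefix = tuple(words[-(n - 1):])
        followers = table.get(prefix)
        if not followers:
            continue  # no n-gram starts with these words
        total = sum(followers.values())
        for word, count in followers.items():
            scores[word] += weights[n] * count / total
    if not scores:
        return None
    return scores.most_common(1)[0][0]
```

For a corpus containing "i love new york", "i love new shoes" and "we love new york", the input "i love new" matches 2-, 3- and 4-grams, and "york" wins because it is the more frequent continuation at the lower orders.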
Two algorithms were tested: Naive Bayes and Kneser-Ney smoothing. In the end I used Kneser-Ney smoothing, a state-of-the-art technique for word prediction, because it gave better predictions.
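To give a feel for Kneser-Ney smoothing, here is a minimal sketch of the standard interpolated form for bigrams only. It is plain Python, not the app's implementation (which works with n-grams up to n = 4); the function name `kneser_ney_bigram`, the discount of 0.75, and the fallback for unseen contexts are assumptions of this sketch.

```python
from collections import Counter, defaultdict

def kneser_ney_bigram(sentences, discount=0.75):
    """Return prob(word, prev): an interpolated Kneser-Ney bigram estimate."""
    bigrams = Counter()
    for tokens in sentences:
        bigrams.update(zip(tokens, tokens[1:]))

    context_count = Counter()     # how often prev occurs as the first word of a bigram
    followers = defaultdict(set)  # distinct words seen after prev
    preceders = defaultdict(set)  # distinct words seen before word
    for (prev, word), count in bigrams.items():
        context_count[prev] += count
        followers[prev].add(word)
        preceders[word].add(prev)
    n_bigram_types = len(bigrams)

    def prob(word, prev):
        # Continuation probability: share of distinct bigram types ending in word.
        p_cont = len(preceders.get(word, ())) / n_bigram_types
        c_prev = context_count[prev]
        if c_prev == 0:
            # Assumption of this sketch: unseen context falls back to p_cont.
            return p_cont
        discounted = max(bigrams[(prev, word)] - discount, 0) / c_prev
        # Mass removed by discounting is redistributed via p_cont.
        lam = discount * len(followers.get(prev, ())) / c_prev
        return discounted + lam * p_cont

    return prob
```

The discount takes a little probability away from every observed bigram and hands it to words that appear after many different contexts, so the estimates for a given previous word still sum to one over the vocabulary.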
Detailed formulae applied in the algorithm can be found here: