Lim Kah Kheng
6th Jan 2016
The purpose is to analyse three corpora (news, blogs and Twitter data) and use them to build an algorithm that predicts the next word in a sentence.
This covers cleaning and analysing the data, taking a sample of it, and building a predictive model.
The Shiny application is hosted at https://jkklim.shinyapps.io/swiftkey/
Process
Determine the size of each corpus and select 1% of its data to speed up loading into Shiny.
Clean the data and extract all unigrams, bigrams and trigrams.
Create a model for each of the unigrams, bigrams and trigrams, sorted by number of occurrences.
Take the input and compare it with the different models. Return the first three matches.
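The first three steps above can be sketched roughly as follows. This is a minimal Python illustration, not the code behind the Shiny application: the helper names (sample_lines, clean, ngram_counts, build_models), the fixed random seed and the cleaning rule (lowercase, letters and apostrophes only) are all assumptions made for the example.

```python
import random
import re
from collections import Counter


def sample_lines(lines, fraction=0.01, seed=42):
    """Keep roughly `fraction` of the lines (hypothetical helper)."""
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]


def clean(line):
    """Lowercase the line and keep runs of letters and apostrophes.

    This cleaning rule is an assumption; the original does not
    say which rule it uses.
    """
    return re.findall(r"[a-z']+", line.lower())


def ngram_counts(token_lists, n):
    """Count every run of n consecutive words."""
    counts = Counter()
    for tokens in token_lists:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def build_models(lines):
    """Return {1: unigrams, 2: bigrams, 3: trigrams}."""
    token_lists = [clean(line) for line in lines]
    # most_common() sorts each model by occurrence, highest first.
    return {n: ngram_counts(token_lists, n).most_common() for n in (1, 2, 3)}
```

On a real corpus the call would look like build_models(sample_lines(all_lines)), where all_lines stands for the lines read from the news, blogs and Twitter files.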
Algorithm
Katz's back-off model (https://en.wikipedia.org/wiki/Katz's_back-off_model) is used.
The sentence is split into an array of words and compared with the different models. If there is a match, the most frequent one is returned first, and the algorithm tries to return the first three matches.
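A back-off lookup of this kind might look like the Python sketch below. It is a simplified illustration under assumptions, not the application's own code. The predict function, the table layout (context mapped to a Counter of next words) and the counts are all made up for the example. It also assumes the order trigram, then bigram, then unigram, and it omits the discounting and redistribution of probability mass that the full Katz model applies.

```python
from collections import Counter


def predict(text, trigrams, bigrams, unigrams, k=3):
    """Suggest up to k next words, backing off to shorter contexts."""
    words = text.lower().split()
    suggestions = []

    def add(candidates):
        # Walk candidates from most to least frequent, skipping duplicates.
        for word, _count in candidates.most_common():
            if word not in suggestions:
                suggestions.append(word)
            if len(suggestions) == k:
                return

    # 1. Try the trigram model with the last two words as context.
    if len(words) >= 2:
        add(trigrams.get(tuple(words[-2:]), Counter()))
    # 2. Back off to the bigram model with the last word as context.
    if len(suggestions) < k and words:
        add(bigrams.get(words[-1], Counter()))
    # 3. Back off to the most frequent single words.
    if len(suggestions) < k:
        add(unigrams)
    return suggestions


# Made-up counts, for illustration only.
trigrams = {("i", "like"): Counter({"tea": 2, "coffee": 1})}
bigrams = {"like": Counter({"tea": 3, "cake": 1})}
unigrams = Counter({"the": 5, "tea": 3, "like": 2})

print(predict("I like", trigrams, bigrams, unigrams))
```

With these toy tables, the input "I like" gets two suggestions from the trigram model, and the third comes from the bigram model after the duplicate "tea" is skipped.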