The goal of this project was to create a Shiny application that predicts the next word of an (incomplete) phrase typed by the user. The result can be found here:
https://archnae.shinyapps.io/DataScienceCapstone/
The implementation consists of three steps:
This project uses a 4-gram language model with Kneser-Ney smoothing.
The next word is predicted as the last word of the most probable 4-gram whose first 3 words are known (they are the last 3 words of the phrase typed so far).
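The lookup described above can be sketched as follows. This is only an illustration in Python, not the project's R code: the table `FOURGRAM_PROBS`, the function `predict_next`, and all probabilities in it are hypothetical and invented for the example.

```python
# Toy table of 4-gram probabilities: (w1, w2, w3) -> {w4: P(w4 | w1 w2 w3)}.
# All values are invented for illustration; they are not from the project's data.
FOURGRAM_PROBS = {
    ("thanks", "for", "the"): {"follow": 0.40, "help": 0.35, "update": 0.25},
    ("one", "of", "the"): {"most": 0.60, "best": 0.40},
}


def predict_next(phrase, table=FOURGRAM_PROBS):
    """Return the most probable next word, or None if the context is unknown."""
    words = phrase.lower().split()
    if len(words) < 3:
        return None  # a 4-gram model needs three words of context
    context = tuple(words[-3:])  # only the last 3 typed words matter
    candidates = table.get(context)
    if not candidates:
        return None  # context never seen in the training data
    # Pick the 4th word with the highest conditional probability.
    return max(candidates, key=candidates.get)
```

For example, `predict_next("This is one of the")` looks up the context `("one", "of", "the")` and returns the highest-scoring candidate from the toy table. A real application would also need a fallback for unseen contexts, which this sketch leaves out.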
The N-gram probabilities were calculated from 200,000 blog records provided by SwiftKey. The data were cleaned before use.
Calculated probabilities are saved in R data files to be used for prediction.
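As a rough illustration of how Kneser-Ney-smoothed probabilities can be derived from raw counts, here is a minimal sketch for the bigram case. It is not the project's code: the project used 4-grams and R, while this example uses bigrams and Python to stay short. The interpolated variant, the function name `kneser_ney_bigram`, and the discount of 0.75 are assumptions made for the example.

```python
from collections import Counter, defaultdict


def kneser_ney_bigram(tokens, discount=0.75):
    """Build P(word | prev) with interpolated Kneser-Ney smoothing (bigram case)."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    if not bigrams:
        raise ValueError("need at least two tokens")
    context_total = Counter()     # c(prev): total count of bigrams starting with prev
    followers = defaultdict(set)  # distinct words seen after prev
    preceders = defaultdict(set)  # distinct words seen before word
    for (prev, word), count in bigrams.items():
        context_total[prev] += count
        followers[prev].add(word)
        preceders[word].add(prev)
    n_bigram_types = len(bigrams)

    def prob(word, prev):
        # Continuation probability: share of bigram types that end in `word`.
        p_cont = len(preceders.get(word, ())) / n_bigram_types
        total = context_total.get(prev, 0)
        if total == 0:
            return p_cont  # unseen context: rely on the lower-order estimate
        discounted = max(bigrams.get((prev, word), 0) - discount, 0) / total
        # Probability mass removed by discounting, redistributed via p_cont.
        backoff_weight = discount * len(followers[prev]) / total
        return discounted + backoff_weight * p_cont

    return prob
```

Calling `p = kneser_ney_bigram("the cat sat on the mat the cat ate".split())` returns a function where `p("cat", "the")` is larger than `p("mat", "the")`, and the probabilities over the vocabulary sum to 1 for any seen context. The 4-gram case applies the same discount-and-interpolate step recursively over shorter contexts.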
The Shiny application uses the probability data saved in the previous step to predict the next word of a partially typed phrase.
Using this app has shown that the 4-gram model often degrades into a tri- or even bi-gram model because of the omnipresent articles (“a” and “the”) and other overly common “noise” words. While staying within the N-gram language model, the algorithm can hopefully be improved by: