Yasneen Ashroff
Jan 5 2017
The goal of this exercise was to build an app that takes an input phrase and predicts the next word.
It reads in a corpus of 1,000,000 blog posts, news articles and tweets and builds 1-gram, 2-gram, 3-gram and 4-gram models. It uses the models to calculate the frequency of each individual word encountered in the corpus, each 2-word phrase (bigram), each 3-word phrase (trigram) and each 4-word phrase. These frequencies are stored in document-term matrices and used to predict the word that follows the phrase the user enters.
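The counting step can be sketched as follows. This is a minimal illustration in Python, not the app's own code: the function names `tokenize` and `ngram_counts`, the whitespace tokenizer and the two-sentence toy corpus are all assumptions, and plain counters stand in for the document-term matrices.

```python
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and split on whitespace. Real preprocessing
    # (punctuation, numbers, profanity filtering) would be more involved.
    return text.lower().split()


def ngram_counts(documents, max_n=4):
    # One Counter per n-gram order, keyed by a tuple of words.
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for doc in documents:
        tokens = tokenize(doc)
        for n in range(1, max_n + 1):
            # Slide a window of width n over the token list.
            for i in range(len(tokens) - n + 1):
                counts[n][tuple(tokens[i:i + n])] += 1
    return counts


# Toy corpus, for illustration only.
corpus = ["the cat sat on the mat", "the cat ate the fish"]
counts = ngram_counts(corpus)
```

Here `counts[2][("the", "cat")]` holds how often the bigram "the cat" occurs, and `counts[1]` holds the single-word frequencies.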
The current model uses the Stupid Backoff algorithm with Good-Turing smoothing to calculate the probability of each possible next word.
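A minimal sketch of the Stupid Backoff scoring step is shown here, in Python rather than the app's own code. It assumes the commonly used backoff factor of 0.4, a whitespace tokenizer and a toy corpus; the names `build_counts`, `score` and `predict_next` are hypothetical. Good-Turing smoothing is deliberately left out, so the raw counts are used unadjusted. Note that Stupid Backoff yields relative scores rather than normalized probabilities.

```python
from collections import Counter

ALPHA = 0.4  # Assumed backoff factor; the app's actual value is not stated.


def build_counts(sentences, max_n=4):
    # Count every n-gram (n = 1..max_n) in one Counter keyed by word tuples.
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


def score(word, context, counts, total_words):
    # Stupid Backoff: use the relative frequency at the longest matching
    # context; otherwise drop the oldest context word and multiply by ALPHA.
    context = tuple(context)
    if not context:
        return counts[(word,)] / total_words
    seen = counts[context + (word,)]
    if seen > 0:
        return seen / counts[context]
    return ALPHA * score(word, context[1:], counts, total_words)


def predict_next(phrase, counts, vocab, total_words, max_n=4):
    # Keep at most the last (max_n - 1) words of the phrase as context.
    context = tuple(phrase.lower().split()[-(max_n - 1):])
    return max(vocab, key=lambda w: score(w, context, counts, total_words))


# Toy corpus, for illustration only.
sentences = ["the cat sat on the mat", "the cat ate the fish"]
counts = build_counts(sentences)
vocab = sorted(key[0] for key in counts if len(key) == 1)
total_words = sum(c for key, c in counts.items() if len(key) == 1)

prediction = predict_next("sat on", counts, vocab, total_words)
backed_off = score("mat", ("ate", "the"), counts, total_words)
```

In this toy corpus "sat on" is always followed by "the", so that word gets the top score. The trigram "ate the mat" never occurs, so its score backs off to the bigram "the mat" and is scaled by the backoff factor.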
The current accuracy rate is around 15%. While this is low, the SwiftKey accuracy rate is roughly 35%. To improve my algorithm towards the 35% mark:
The shiny app can be found here: https://yashroff.shinyapps.io/text_prediction/