Philippe Jette
10/19/2018
Thanks for checking out my project! This deck represents the culmination of my Coursera Data Science specialization, which I'm ashamed to admit took me 3 years of on-and-off work to complete.
The idea was simply to create a basic predictive text algorithm, which suggests the most likely next word as you type a sentence.
You can go straight to the application here: https://philjette.shinyapps.io/WordPrediction/
The text data comes from an English-language corpus of blog, news, and Twitter data, located here: http://www.corpora.heliohost.org/aboutcorpus.html. Given the size of the corpus (over 4 million lines combined), a 10% random sample was used.
Furthermore, I kept only the top 80% most frequently occurring n-grams. This didn't seem to affect prediction performance, while reducing memory requirements.
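To make the sampling step concrete, here is a minimal sketch in base R; the file names and the data/ folder are illustrative assumptions, not the exact paths used in the project.

```r
# Minimal sketch of the 10% sampling step (file paths are assumptions)
set.seed(42)
files <- c("data/en_US.blogs.txt", "data/en_US.news.txt", "data/en_US.twitter.txt")
lines <- unlist(lapply(files, readLines, encoding = "UTF-8", skipNul = TRUE))

# keep a 10% random sample of the combined lines
sample_lines <- sample(lines, size = floor(0.10 * length(lines)))
```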
Clean-up and tokenization tasks were then performed to prepare the sampled text for n-gram extraction (sketched below).
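The sketch below shows one way the clean-up, tokenization, trigram counting, and the 80% frequency cut could look in base R. The actual project may have used a text-mining package instead, and the 80% cut here is one reading of "top 80% most frequent n-grams" (by rank of unique n-grams), so treat it as an illustration under those assumptions.

```r
# Minimal sketch of clean-up, tokenization, and trigram counting in base R
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z' ]", " ", x)       # strip punctuation, digits, symbols
  gsub("\\s+", " ", trimws(x))        # collapse repeated whitespace
}

tokens <- strsplit(clean_text(sample_lines), " ", fixed = TRUE)

# build trigram strings "w1 w2 w3" from each line
trigrams <- unlist(lapply(tokens, function(w) {
  n <- length(w)
  if (n < 3) return(character(0))
  paste(w[1:(n - 2)], w[2:(n - 1)], w[3:n])
}))

freq <- sort(table(trigrams), decreasing = TRUE)
freq <- freq[seq_len(floor(0.80 * length(freq)))]   # keep most frequent 80%

# lookup table: first two words (prefix) -> candidate third word
ngram_df <- data.frame(
  prefix   = sub(" \\S+$", "", names(freq)),
  nextword = sub("^.* ", "", names(freq)),
  count    = as.integer(freq),
  stringsAsFactors = FALSE
)
```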
The model works as follows: the last two words typed are matched against the 3-gram table, and prediction is based on frequency of occurrence of the n-gram. We could produce a full list of candidates ranked by frequency (from most to least likely), but for this model we simply grab the single most frequently occurring n-gram.
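Below is a minimal sketch of that lookup, reusing the ngram_df table and clean_text() helper from the earlier sketches (both of which are my illustrative assumptions rather than the app's actual code); it returns NA when no matching 3-gram exists.

```r
# Minimal sketch of the prediction lookup (assumes ngram_df and clean_text above)
predict_word <- function(phrase) {
  w <- strsplit(clean_text(phrase), " ", fixed = TRUE)[[1]]
  if (length(w) < 2) return(NA_character_)
  prefix <- paste(tail(w, 2), collapse = " ")
  hits <- ngram_df[ngram_df$prefix == prefix, ]
  if (nrow(hits) == 0) return(NA_character_)
  hits$nextword[which.max(hits$count)]   # most frequent matching trigram wins
}

predict_word("thanks for checking out")
```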
In the future I'd like to move to 4-grams (rather than the current 3-grams), in order to utilize the last 3 words typed (rather than the last 2). With the current method, some results are a bit wonky.
Thanks for reading and enjoy the app here: https://philjette.shinyapps.io/WordPrediction/