7/2/2021
The problem
- We were assigned the task of designing an R Shiny app that predicts the word most likely to follow a phrase. The project builds on skills learned across the Data Science Specialization.
- SwiftKey has provided a corpus of text scraped from blogs, news sources, and Twitter to serve as the foundation for the prediction algorithm.
My approach
- I use a relatively simple prediction algorithm. It accepts a phrase of N words (where N is 1, 2, or 3), searches the corpus for all (N+1)-grams that begin with that phrase, and returns the final word of the most common one.
- What happens when the phrase entered by the user does not exist in the corpus? My algorithm takes the words of the entered phrase, builds the sub-corpus of all lines that contain any of those words, and returns the word that is most over-represented in that sub-corpus relative to the full corpus.
- Intermediate steps include removing profanity and punctuation from the text.
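The steps above can be sketched in base R as follows. This is a minimal illustration, not the app's actual code: the toy corpus, the function names (`clean_text`, `predict_next`, `fallback_word`, `predict_word`), and the placeholder banned-word list are all hypothetical. Measuring "disproportionately represented" as a difference in relative frequency, and excluding the phrase's own words from the fallback, are also assumptions.

```r
# Toy corpus standing in for the cleaned SwiftKey text (illustrative only).
toy_corpus <- c(
  "i love new york", "i love new shoes", "we love new york",
  "new york is big", "coffee is hot", "hot coffee is good"
)

# Lowercase, strip punctuation, and drop banned words.
# The banned list is a placeholder, not a real profanity list.
clean_text <- function(lines, banned = c("darn", "heck")) {
  lines <- gsub("[[:punct:]]", "", tolower(lines))
  words <- strsplit(trimws(lines), "\\s+")
  vapply(words, function(w) paste(w[!w %in% banned], collapse = " "),
         character(1))
}

# Split each line into lowercase words.
tokenize <- function(lines) strsplit(tolower(trimws(lines)), "\\s+")

# Build every n-gram of size n, one string per n-gram.
make_ngrams <- function(lines, n) {
  grams <- lapply(tokenize(lines), function(words) {
    if (length(words) < n) return(character(0))
    starts <- seq_len(length(words) - n + 1)
    vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = " "),
           character(1))
  })
  as.character(unlist(grams))
}

# Return the last word of the most common (N+1)-gram that begins with
# the phrase, or NA if no such (N+1)-gram exists.
predict_next <- function(phrase, lines) {
  words <- tokenize(phrase)[[1]]
  n <- length(words)
  stopifnot(n >= 1, n <= 3)
  grams <- make_ngrams(lines, n + 1)
  prefix <- paste0(paste(words, collapse = " "), " ")
  hits <- grams[startsWith(grams, prefix)]
  if (length(hits) == 0) return(NA_character_)
  best <- names(which.max(table(hits)))
  tail(strsplit(best, " ")[[1]], 1)
}

# Fallback: among lines sharing any word with the phrase, return the word
# whose share of the sub-corpus most exceeds its share of the full corpus.
# (Assumption: difference in relative frequency; phrase words excluded.)
fallback_word <- function(phrase, lines) {
  words <- tokenize(phrase)[[1]]
  tokens <- tokenize(lines)
  keep <- vapply(tokens, function(w) any(w %in% words), logical(1))
  if (!any(keep)) return(NA_character_)
  all_words <- unlist(tokens)
  sub_words <- unlist(tokens[keep])
  vocab <- setdiff(unique(sub_words), words)
  if (length(vocab) == 0) return(NA_character_)
  sub_share <- vapply(vocab, function(w) mean(sub_words == w), numeric(1))
  all_share <- vapply(vocab, function(w) mean(all_words == w), numeric(1))
  vocab[which.max(sub_share - all_share)]
}

# Try the n-gram lookup first, then the fallback.
predict_word <- function(phrase, lines) {
  guess <- predict_next(phrase, lines)
  if (is.na(guess)) fallback_word(phrase, lines) else guess
}
```

With this toy corpus, `predict_word("love new", toy_corpus)` uses the 3-gram lookup, while `predict_word("purple coffee", toy_corpus)` finds no matching 3-gram and drops to the fallback. On a corpus of realistic size, the n-gram counts would normally be computed once and stored rather than rebuilt on every call.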
My app
- In my R Shiny app, the user enters a phrase between 1 and 3 words long.
- The app returns one word based on the algorithm.
- An area for improvement in my app is that it always returns a lowercase result. I have not determined how to vary the capitalization of the prediction based on what the user is likely to want.
- I tried to let the user select which corpus (blogs, news, or Twitter) to use as the universe of data for the algorithm. Unfortunately, I encountered very stubborn errors, so I removed this feature from the app.
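For illustration, the input check and the app wiring might look roughly like the sketch below. It is not the deployed app: `parse_phrase` and `result_text` are hypothetical helper names, the input and output IDs are made up, and `predict_word` here is a stub standing in for the real prediction function. The Shiny part is only built when the shiny package is installed.

```r
# Return the lowercase words of the input, or NULL unless there are 1 to 3.
parse_phrase <- function(text) {
  words <- strsplit(trimws(tolower(text)), "\\s+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < 1 || length(words) > 3) return(NULL)
  words
}

# Hypothetical stand-in for the real prediction function.
predict_word <- function(words) "york"

# Text shown to the user for a given input.
result_text <- function(text) {
  words <- parse_phrase(text)
  if (is.null(words)) return("Please enter a phrase of 1 to 3 words.")
  predict_word(words)
}

# Wire the helpers into a minimal Shiny app (only if shiny is available).
if (requireNamespace("shiny", quietly = TRUE)) {
  ui <- shiny::fluidPage(
    shiny::textInput("phrase", "Enter a phrase (1 to 3 words):"),
    shiny::textOutput("prediction")
  )
  server <- function(input, output, session) {
    output$prediction <- shiny::renderText(result_text(input$phrase))
  }
  app <- shiny::shinyApp(ui, server)  # launch with shiny::runApp(app)
}
```

Keeping the validation in a plain function makes it possible to check the 1-to-3-word rule without starting the app.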
Check it out for yourself!