7/2/2021

The Problem

  • We are assigned the task of designing an R Shiny app that predicts the next word following a phrase. This project builds on the skills we learned across the Data Science Specialization.
  • SwiftKey has provided a corpus of text scraped from blogs, news sources, and Twitter to serve as the foundation for our prediction algorithm.

My approach

  • I use a relatively simple algorithm for my prediction. It accepts a phrase of N words (where 1 ≤ N ≤ 3), searches the corpus for all (N+1)-grams that begin with that phrase, and returns the final word of the most common one.
  • What happens when the phrase entered by the user does not exist in the corpus? My algorithm takes the words of the entered phrase, finds the sub-corpus of all lines that contain any of those words, and returns the word that is most disproportionately represented in that sub-corpus relative to the main corpus.
  • Other intermediate steps include removing profanity and punctuation from the text.
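
The approach described above can be sketched in a few lines. This is a minimal illustration in Python rather than the R code behind the app, and several details are assumptions: the function names, the placeholder BANNED list (the profanity list actually used is not stated), keeping apostrophes when stripping punctuation, and measuring "disproportionately represented" as a word's share of the sub-corpus divided by its share of the full corpus.

```python
import re
from collections import Counter

# Hypothetical placeholder; the real profanity list is not given in the source.
BANNED = {"darn"}

def clean(line):
    """Lowercase, replace everything except letters and apostrophes, drop banned words."""
    words = re.sub(r"[^a-z\s']", " ", line.lower()).split()
    return [w for w in words if w not in BANNED]

def ngram_prediction(lines, phrase):
    """Final word of the most common (N+1)-gram that begins with the phrase."""
    n = len(phrase)
    counts = Counter()
    for words in lines:
        for i in range(len(words) - n):
            if words[i:i + n] == phrase:
                counts[words[i + n]] += 1
    return counts.most_common(1)[0][0] if counts else None

def fallback_prediction(lines, phrase):
    """Word most over-represented in lines sharing any word with the phrase."""
    targets = set(phrase)
    overall = Counter(w for words in lines for w in words)
    sub = Counter(w for words in lines if targets & set(words) for w in words)
    for w in targets:
        sub.pop(w, None)  # do not suggest a word the user already typed
    if not sub:
        return None
    total_all, total_sub = sum(overall.values()), sum(sub.values())
    # Assumed measure: share in the sub-corpus divided by share in the full corpus.
    return max(sub, key=lambda w: (sub[w] / total_sub) / (overall[w] / total_all))

def predict(corpus, text):
    """Try the (N+1)-gram lookup first, then fall back to the sub-corpus method."""
    lines = [clean(line) for line in corpus]
    phrase = clean(text)
    if not 1 <= len(phrase) <= 3:
        raise ValueError("phrase must be 1 to 3 words")
    return ngram_prediction(lines, phrase) or fallback_prediction(lines, phrase)
```

Calling predict(corpus, "love green") on a list of text lines returns a single lowercase word, or None if neither method finds a candidate. Ties are broken by whichever candidate was seen first, which is a choice of this sketch and not something the source specifies.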

My app

  • In my R Shiny app, the user enters a phrase between 1 and 3 words long.
  • The app returns one word based on the algorithm.
  • One area for improvement is that the app always returns a lowercase result. I have not yet determined how to vary capitalization based on what the user is likely to want.
  • I tried to let the user select which corpus (blogs, news, or Twitter) to use as the universe of data for the algorithm. Unfortunately, I ran into persistent errors, so I removed this feature from the app.

Check it out for yourself!