Natural Language Processing with R

7/2/2021

The Problem

We are assigned the task of designing an R Shiny app that predicts the next word following a phrase. This project builds on the skills we learned across the Data Science Specialization.
SwiftKey has provided us with a corpus of text scraped from blogs, news sources, and Twitter to serve as the foundation for our prediction algorithm.

I use a relatively simple algorithm for my prediction. It accepts a phrase of length N (where N < 4), searches for all (N+1)-grams in the corpus that contain that phrase, and returns the final word of the most common (N+1)-gram.
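
As a rough illustration of this lookup, here is a minimal sketch in base R. It is not the app's actual code: the helper names `make_ngrams` and `predict_next` and the toy corpus are made up for the example, and it assumes that "contain that phrase" means the (N+1)-gram begins with the phrase. Ties go to whichever (N+1)-gram sorts first alphabetically.

```r
# Hypothetical helper: build all n-grams (as strings) from one line of text.
make_ngrams <- function(line, n) {
  words <- strsplit(tolower(line), "\\s+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Hypothetical helper: count the (N+1)-grams that begin with the phrase
# (an assumption about what "contain" means) and return the last word
# of the most frequent one, or NA if the phrase never appears.
predict_next <- function(phrase, corpus) {
  phrase <- tolower(trimws(phrase))
  n <- length(strsplit(phrase, "\\s+")[[1]])
  grams <- unlist(lapply(corpus, make_ngrams, n = n + 1))
  hits <- grams[startsWith(grams, paste0(phrase, " "))]
  if (length(hits) == 0) return(NA_character_)
  best <- names(which.max(table(hits)))
  tail(strsplit(best, " ")[[1]], 1)
}

# Toy corpus, invented for this example only.
toy_corpus <- c("thanks for the follow",
                "thanks for the help",
                "thanks for the follow back")

predict_next("for the", toy_corpus)
```

On this toy corpus, "for the follow" appears twice and "for the help" once, so the sketch returns "follow".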
What happens when the phrase entered by the user does not exist in the corpus? My algorithm takes the words of the entered phrase, finds the sub-corpus of all lines that contain any of those words, and returns the word that is most disproportionately represented in that sub-corpus relative to the main corpus.
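
The fallback can be sketched the same way. This is only one possible reading, not the app's actual code: the names `tokenize`, `rel_freq`, and `fallback_word` are hypothetical, "disproportionately represented" is assumed to mean the largest gap between a word's share of the sub-corpus and its share of the full corpus, and the entered words themselves are assumed to be excluded from the result.

```r
# Hypothetical helper: split lines into lowercase words.
tokenize <- function(lines) {
  words <- unlist(strsplit(tolower(lines), "\\s+"))
  words[nzchar(words)]
}

# Hypothetical helper: share of all words taken up by each distinct word.
rel_freq <- function(words) {
  counts <- table(words)
  setNames(as.numeric(counts) / length(words), names(counts))
}

fallback_word <- function(phrase, corpus) {
  query <- tokenize(phrase)

  # Sub-corpus: lines that share at least one word with the phrase.
  has_query <- vapply(corpus,
                      function(line) any(tokenize(line) %in% query),
                      logical(1), USE.NAMES = FALSE)
  sub_corpus <- corpus[has_query]
  if (length(sub_corpus) == 0) return(NA_character_)

  sub_freq  <- rel_freq(tokenize(sub_corpus))
  main_freq <- rel_freq(tokenize(corpus))

  # Assumption: over-representation = sub-corpus share minus overall share.
  gap <- sub_freq - main_freq[names(sub_freq)]
  gap <- gap[!names(gap) %in% query]  # assumption: do not echo the input
  if (length(gap) == 0) return(NA_character_)
  names(gap)[which.max(gap)]
}

# Toy corpus, invented for this example only.
toy_corpus <- c("the coffee shop serves espresso",
                "espresso and coffee every morning",
                "the weather is cold today",
                "the train is late today")

fallback_word("strong coffee", toy_corpus)
```

Here "strong coffee" never occurs as a phrase, but two lines mention "coffee". Within those lines "espresso" makes up 20% of the words versus 10% overall, the largest gap, so it is returned.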
Other intermediate steps included the removal of profanity and the removal of punctuation.
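
A minimal sketch of this kind of cleaning in base R follows. The function name `clean_lines` and the `banned_words` argument are hypothetical, the banned list is a stand-in for a real profanity list, and lowercasing is an assumption added here to match the app's lowercase output.

```r
# Hypothetical helper: lowercase, strip punctuation, and drop banned words.
clean_lines <- function(lines, banned_words) {
  lines <- tolower(lines)                  # assumption: work in lowercase
  lines <- gsub("[[:punct:]]", "", lines)  # remove punctuation

  # Remove each banned word only where it appears as a whole word.
  pattern <- paste0("\\b(", paste(banned_words, collapse = "|"), ")\\b")
  lines <- gsub(pattern, "", lines, perl = TRUE)

  # Collapse the gaps left behind by the removals.
  trimws(gsub("\\s+", " ", lines))
}

# "darn" stands in for an entry from a real profanity list.
clean_lines("Well, darn it! That's great.", banned_words = c("darn"))
```

The example line comes back as "well it thats great". Note that removing punctuation also removes apostrophes, so contractions are merged into single tokens.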

In my R Shiny app, the user enters a phrase between 1 and 3 words long.
The app returns one word based on the algorithm.
An area for improvement in my app is that it always returns a lowercase result. I have not determined how to vary the capitalization of the prediction based on what the user is likely to want.
I tried to let the user select which corpus (blogs, news, or Twitter) the algorithm uses as its universe of data. Unfortunately, I encountered very stubborn errors, so I removed this feature from the app.

My app is located at https://petergranville.shinyapps.io/DSCproject/
My milestone report, detailing my exploratory data analysis, is located at https://rpubs.com/PeterGranville/DataScienceCapstoneMilestoneReport
If you’re reading this you’re probably in the Data Science Specialization and have also made it to the end. Congrats to you!