Patrick Rotzetter
2022-04-04
The objective of this capstone project is to build a Shiny app to allow a user to enter a phrase and have the application predict what the next word will be.
Input for this project was provided as text data from Twitter, blogs and news feeds. An exploratory data analysis phase was completed.
A 10% sample of the data was drawn and cleaned by converting the text to lowercase and removing punctuation, extra white space, numbers and special characters (such as quotes and hyphens).
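The sampling and cleaning step can be sketched roughly as follows. This is an illustration in Python, not the project's own code (the project is written in R). The function names sample_lines and clean_text, the fixed seed, and the choice to replace removed characters with a space rather than delete them are all assumptions made for this example.

```python
import random
import re


def sample_lines(lines, fraction=0.1, seed=42):
    """Return a reproducible random sample holding `fraction` of the lines.

    The seed value is an arbitrary assumption; it only makes the sample repeatable.
    """
    k = round(len(lines) * fraction)
    return random.Random(seed).sample(lines, k)


def clean_text(line):
    """Lowercase a line and remove numbers, punctuation and special characters."""
    line = line.lower()
    # Assumption: anything that is not a-z or whitespace becomes a space,
    # so "well-done" turns into "well done" instead of "welldone".
    line = re.sub(r"[^a-z\s]", " ", line)
    # Collapse runs of whitespace and trim both ends.
    return re.sub(r"\s+", " ", line).strip()


raw_lines = [f"Line number {i}: Hello, World!" for i in range(100)]
sampled = [clean_text(line) for line in sample_lines(raw_lines)]
```

Replacing with a space keeps hyphenated words as two tokens but also splits contractions ("it's" becomes "it s"); deleting apostrophes instead would be an equally reasonable choice.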
A predictive language model was built to estimate the next word after the user enters a few words.
The data from the three input sources were combined to create a single corpus. The corpus was tokenized and reduced to the most common n-grams of 1, 2, 3 and 4 words (unigrams, bigrams, trigrams and quadgrams).
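As a rough illustration of the tokenization step, the Python sketch below counts n-grams of 1 to 4 words over a tiny made-up token list. The helper name ngram_counts and the sample sentence are hypothetical; the project itself is written in R and may build its tables differently.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens; returns an empty Counter if n > len(tokens)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Toy corpus standing in for the cleaned, combined text (an assumption for this example).
tokens = "thanks for the follow and thanks for the retweet".split()

# One frequency table per n-gram order: unigrams, bigrams, trigrams, quadgrams.
tables = {n: ngram_counts(tokens, n) for n in (1, 2, 3, 4)}

# Keeping only the most frequent entries keeps the tables small.
top_bigrams = tables[2].most_common(2)
```

Calling most_common with a cutoff is one simple way to keep only the most common n-grams, as described above; the actual cutoff used in the project is not stated here.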
A backoff strategy was then employed to match the user's input against common 4-word phrases. If no match is found, the model backs off and looks for matching 3-word phrases, and then 2-word phrases.
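The backoff lookup can be sketched as below. This is a minimal Python illustration of plain backoff ranked by raw counts, without any discounting or smoothing; whether the project applies such weighting is not stated. The names build_model and predict_next and the toy corpus are assumptions.

```python
from collections import Counter, defaultdict


def build_model(tokens, max_n=4):
    """Map each context of 1 to max_n - 1 words to a Counter of the words that followed it."""
    model = defaultdict(Counter)
    for n in range(2, max_n + 1):
        for i in range(len(tokens) - n + 1):
            *context, next_word = tokens[i:i + n]
            model[tuple(context)][next_word] += 1
    return model


def predict_next(model, phrase, max_n=4, k=3):
    """Try the longest available context first, then back off to shorter ones."""
    words = phrase.lower().split()
    for size in range(max_n - 1, 0, -1):
        if len(words) < size:
            continue  # not enough input words for this n-gram order
        context = tuple(words[-size:])
        if context in model:
            return [word for word, _ in model[context].most_common(k)]
    return []  # no n-gram of any order matched


tokens = "thanks for the follow and thanks for the retweet and thanks for the follow".split()
model = build_model(tokens)
suggestions = predict_next(model, "many thanks for the")
```

With the last three input words as context, the 4-word phrases are consulted first; an input such as "see the" has no 3-word match in this toy corpus and falls back to the 2-word phrases that start with "the".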
Finally, a Shiny application was built to allow reviewers to test the project code.
The Shiny app can be found at the link listed under Resources below.
The app is kept as straightforward as possible. The user enters one or more words in the input field, and the app predicts the word that follows.
Simply start typing in the text field, and up to 4 possible next words are displayed automatically below the field. The 10 most likely next words are then shown.
Resources
Shiny app: (https://patrick-rotzetter.shinyapps.io/NextWordPrediction/?_ga=2.81638679.1612771637.1649048980-103564281.1649048980)
Source code: (https://github.com/protzetter/Data-Science-Specialization-Capstone-Project)
Presentation: (https://rpubs.com/protzetter/capstone)