William Green
Data Science Specialization: Capstone Project
The goal of this project was to build an application that takes a phrase typed by the user and predicts the next word they “most likely” want to type.
The primary use case for this application is text messaging on mobile phones, where correctly predicting the next word saves the user from having to type it and so increases their overall typing speed.
The data available for training the predictive model consists of millions of tweets, blog posts, and news articles in English. (Files in other languages were available but were not used.)
https://dskswu.shinyapps.io/myapp/ is an R Shiny application exploring the topic of text prediction, in partial fulfillment of the Coursera JHU Data Science Specialization Capstone Project.
The application was designed with the following goals in mind:
The first step in model training was learning all of the 2-grams (word pairs), 3-grams (word triplets), and 4-grams (word quadruplets) in about half of the training data, as well as their frequencies.
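The counting step can be illustrated with a minimal Python sketch. It is not the code behind the application (which is written in R); the whitespace tokenizer, the function names `tokenize` and `count_ngrams`, and the three-line toy corpus are all assumptions made for illustration.

```python
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple, assumed tokenizer)."""
    return text.lower().split()

def count_ngrams(lines, n):
    """Count every run of n consecutive words within each line."""
    counts = Counter()
    for line in lines:
        words = tokenize(line)
        # A line with fewer than n words contributes no n-grams.
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts

# Toy corpus standing in for the tweets, blog posts, and news articles.
corpus = ["see you later today", "see you later tonight", "see you soon"]
bigrams = count_ngrams(corpus, 2)
trigrams = count_ngrams(corpus, 3)
fourgrams = count_ngrams(corpus, 4)
```

In this toy corpus the pair ("see", "you") is counted three times, while the three-word line "see you soon" contributes no 4-grams at all.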
Each 4-gram was then broken into a 3-gram (its first 3 words) and the final word. For each of the resulting 3-grams, the most common final word was calculated.
This process was repeated for the original set of 3-grams, producing a set of 2-grams and the most common next word for each 2-gram.
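The reduction from n-gram frequencies to a "most common next word" table might look like the following Python sketch. The function name `most_common_next` and the example counts are hypothetical, and the source does not say how ties between equally frequent final words were resolved, so the example avoids ties.

```python
from collections import Counter, defaultdict

def most_common_next(ngram_counts):
    """Map each prefix (all words but the last) to its most frequent final word."""
    by_prefix = defaultdict(Counter)
    for ngram, freq in ngram_counts.items():
        prefix, last = ngram[:-1], ngram[-1]
        by_prefix[prefix][last] += freq
    # Keep only the single most frequent final word for each prefix.
    return {prefix: finals.most_common(1)[0][0]
            for prefix, finals in by_prefix.items()}

# Made-up 4-gram frequencies, for illustration only.
fourgram_counts = Counter({
    ("see", "you", "later", "today"): 5,
    ("see", "you", "later", "tonight"): 2,
    ("thanks", "for", "the", "follow"): 3,
})
next_after_3gram = most_common_next(fourgram_counts)
```

The same function applies unchanged to 3-gram counts, yielding a table keyed by 2-grams. Storing one word per prefix, rather than the full frequency distribution, keeps the lookup tables small.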
When a user types a phrase into the application, the application quickly makes a single prediction for the next word. The prediction algorithm is simple:
Examine the final three words typed by the user. If that 3-gram was present in the training data, predict the most common next word. If not, continue:
Examine the final two words typed. If that 2-gram was present in the training data, predict the most common next word. If not, continue:
Examine the last word typed. If that word was present in the training data, predict the most common next word. If not, predict the word “the”.
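The back-off steps above can be sketched in Python as follows. This is an illustration under assumptions rather than the application's actual code: the function name `predict_next`, the layout of `tables` (context length mapped to a lookup dictionary), and the table contents are all hypothetical.

```python
def predict_next(phrase, tables, default="the"):
    """Back off from the longest available context to shorter ones.

    tables maps a context length (3, 2, or 1) to a dict of
    {context tuple: most common next word}.
    Returns (predicted word, context length used); a length of 0
    means no context matched and the default word was returned.
    """
    words = phrase.lower().split()
    for n in (3, 2, 1):
        if len(words) >= n:
            context = tuple(words[-n:])
            word = tables.get(n, {}).get(context)
            if word is not None:
                return word, n
    return default, 0

# Tiny hand-made lookup tables, for illustration only.
tables = {
    3: {("see", "you", "later"): "today"},
    2: {("thanks", "for"): "the"},
    1: {("good",): "morning"},
}

prediction, context_length = predict_next("I will see you later", tables)
```

Returning the context length alongside the word makes it easy to report which n-gram produced the prediction.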
*The user enters a sequence of words in the text box, then presses the “Next Word” button.
*The predicted next word is displayed with a note indicating which specific n-gram was used for the prediction.
*The sentence entered by the user is also displayed in the Shiny GUI.