Data Science Capstone

Riley Visscher

2025-12-16

Predictive Language Model

The first step in creating the predictive language model is processing the raw data, which was done by turning each word in the raw text file into an individual token. Any token considered profanity was then removed from the data set.
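
The report does not include the original code, so the following is a minimal sketch of this step in Python; the file names (corpus.txt, profanity.txt) and the exact tokenization rule are assumptions.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z']+", text.lower())

def remove_profanity(tokens, profanity):
    """Drop any token that appears in the profanity set."""
    return [t for t in tokens if t not in profanity]

# Hypothetical inputs: 'corpus.txt' holds the raw text,
# 'profanity.txt' holds one banned word per line.
with open("profanity.txt", encoding="utf-8") as f:
    profanity = {line.strip().lower() for line in f if line.strip()}

with open("corpus.txt", encoding="utf-8") as f:
    tokens = remove_profanity(tokenize(f.read()), profanity)
```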

The next step is to build the context for the predictive model, which in this case is bi-grams. A bi-gram is a pair of consecutive words; by counting how often each pair occurs, the model can estimate the most likely word to follow a given word. For example, the sentence “Read this book soon!” is split into the bi-grams “read this”, “this book”, and “book soon”.
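
A short sketch of bi-gram construction, using the example sentence from the report (again in Python for illustration only):

```python
def bigrams(tokens):
    """Return consecutive word pairs, e.g. ('read', 'this'), ('this', 'book')."""
    return list(zip(tokens, tokens[1:]))

print(bigrams(["read", "this", "book", "soon"]))
# [('read', 'this'), ('this', 'book'), ('book', 'soon')]
```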

Only the top 10,000 most common tokens/words were considered in this model, and no punctuation marks were included. This trade-off in accuracy was made to speed up the final model: the more bi-grams or tri-grams it has to search, the slower it runs.
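Continuing the sketch above (reusing the hypothetical tokens list and bigrams helper), one way to apply this restriction is to keep only bi-grams whose words both fall in the 10,000-word vocabulary; the exact filtering rule used in the original model is an assumption.

```python
from collections import Counter

# Vocabulary of the 10,000 most frequent tokens.
vocab = {w for w, _ in Counter(tokens).most_common(10_000)}

# Discard bi-grams containing any out-of-vocabulary word.
kept_bigrams = [(w1, w2) for w1, w2 in bigrams(tokens)
                if w1 in vocab and w2 in vocab]
```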

The final step is to count the frequency of the most common tokens and bi-grams and return them in a table.
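
A sketch of the frequency table and next-word lookup, continuing from the bi-grams kept in the previous step; the table layout and lookup interface of the original model are assumptions.

```python
from collections import Counter

# Frequency table of the retained bi-grams.
bigram_counts = Counter(kept_bigrams)

def most_likely_next(word, counts, k=3):
    """Return the k words that most often follow 'word' in the counted bi-grams."""
    followers = Counter({w2: n for (w1, w2), n in counts.items() if w1 == word})
    return [w for w, _ in followers.most_common(k)]

print(most_likely_next("love", bigram_counts))
```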

Quantitative Performance

The predictive language model runs slowly, taking up to 5 minutes depending on the sentence it is given. That said, its accuracy is high. For the partial phrase “I love”, the model found 7,298,131 total bi-grams, with “you”, “for”, and “it” as the top 3 most likely next words.

The App

Please find the final Predictive Language Model app at the following link: https://rileyvicoursera.shinyapps.io/datasci-capstone/