Johns Hopkins University - Data Science Specialization - Capstone Project Presentation

Patricia Londono
January 15, 2021

Project Goal

Develop a predictive model of text, starting from a very large, unstructured corpus of the English language, and build a Shiny app that takes a phrase (multiple words) as input and, once the user clicks Submit, predicts the next word.

Data Cleaning and Sampling

The provided data was collected by a web crawler from publicly available tweets, blogs, and news articles.

Data Summary

File Name   File Size (MB)   Total Lines   Total Characters   Total Words
blogs       200.4242         899,288       208,361,438        37,334,131
news        196.2775         77,259        15,683,765         2,643,969
twitter     159.3641         2,360,148     162,385,035        30,373,583
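For reference, a minimal sketch of how these summary statistics can be computed in R (the file names assume the standard en_US.* capstone files; the word count is a rough whitespace-based tally):

```r
summarize_file <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  data.frame(
    File_Name        = basename(path),
    File_Size_MB     = file.size(path) / 1024^2,
    Total_Lines      = length(lines),
    Total_Characters = sum(nchar(lines)),
    Total_Words      = sum(lengths(strsplit(lines, "\\s+")))  # rough word count
  )
}

do.call(rbind, lapply(
  c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"),
  summarize_file
))
```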

Given the size of the data, 40% of the corpus was selected at random to train the model. All punctuation, extra whitespace, numbers, and email patterns were then removed, profanity was filtered out, and the text was converted to lowercase.
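A minimal sketch of the sampling and cleaning step (the seed and the regex patterns are illustrative assumptions, not necessarily the exact ones used):

```r
set.seed(123)  # assumed seed; any fixed seed makes the sample reproducible
lines       <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
train_lines <- sample(lines, size = round(0.4 * length(lines)))

clean <- tolower(train_lines)                # drop uppercase letters
clean <- gsub("\\S+@\\S+", " ", clean)       # email patterns
clean <- gsub("[0-9]+", " ", clean)          # numbers
clean <- gsub("[[:punct:]]+", " ", clean)    # punctuation
clean <- gsub("\\s+", " ", trimws(clean))    # collapse extra whitespace
# profanity is filtered against a banned-word list (list not shown here)
```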

Text Prediction Algorithm

The model was trained using the Stupid Backoff algorithm over n-grams. N-grams are used to estimate the probability of a word given the words that precede it. If a word with probability 0 is encountered at the current n-gram level, the model backs off to the (n-1)-gram level, where the score is multiplied by lambda (0.4), so the new probability is calculated as:

0.4 * P(“Desired Output”|“Text Input”)

The final probability is found by multiplying by 0.4^n, where n is the number of back-off levels down to the unigram.
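To make the back-off recursion concrete, here is a self-contained toy sketch in R (the counts and the function are illustrative only, not the app's actual implementation):

```r
# Toy n-gram counts; the real model uses counts from the sampled corpus
counts <- c("i love you" = 2, "i love" = 4, "love you" = 3,
            "love" = 5, "you" = 10, "i" = 6)

sbo_score <- function(context, word, lambda = 0.4) {
  if (length(context) == 0) {
    # Unigram base case: relative frequency among unigrams
    unigrams <- counts[!grepl(" ", names(counts))]
    if (is.na(counts[word])) return(0)
    return(unname(counts[word]) / sum(unigrams))
  }
  ngram <- paste(c(context, word), collapse = " ")
  hist  <- paste(context, collapse = " ")
  if (!is.na(counts[ngram]))
    return(unname(counts[ngram] / counts[hist]))
  lambda * sbo_score(context[-1], word, lambda)  # back off one level
}

sbo_score(c("i", "love"), "you")   # seen trigram: 2 / 4 = 0.5
sbo_score(c("we", "love"), "you")  # backs off once: 0.4 * (3 / 5) = 0.24
```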

I used the R sbo package and trained the model with the following parameters: 5-grams, 80% target dictionary coverage, and lambda = 0.4.
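A hedged sketch of what that training call can look like with the sbo package (train_lines stands for the cleaned sample from the previous slide; the exact arguments used here are assumptions):

```r
library(sbo)

p <- sbo_predictor(object = train_lines,          # character vector of cleaned text
                   N = 5,                         # 5-grams
                   dict = target ~ 0.8,           # dictionary covering 80% of word occurrences
                   .preprocess = sbo::preprocess, # package's default text cleanup
                   EOS = ".?!:;",                 # end-of-sentence characters
                   lambda = 0.4)                  # back-off penalty

predict(p, "i can't wait to")  # returns the top next-word candidates
```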

Shiny App
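A minimal sketch of the app's structure, assuming the trained sbo predictor p from the previous slide (widget names and layout are illustrative):

```r
library(shiny)

ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Enter a phrase:"),
  actionButton("submit", "Submit"),
  h4(textOutput("prediction"))
)

server <- function(input, output) {
  # Recompute only when the Submit button is clicked
  next_word <- eventReactive(input$submit, {
    predict(p, input$phrase)[1]  # 'p' is the trained sbo predictor
  })
  output$prediction <- renderText(next_word())
}

shinyApp(ui, server)
```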