Johns Hopkins University - Data Science Specialization - Capstone Project Presentation

Patricia Londono
January 15, 2021

Project Goal

Develop a predictive model of text, starting from a very large, unstructured corpus of the English language, and build a Shiny app that takes a phrase (multiple words) as input and, once the user clicks Submit, predicts the next word.

Data Cleaning and Sampling

The provided data was collected by a web crawler from publicly available tweets, blogs, and news articles.

Data Summary

File Name   File Size (MB)   Total Lines   Total Characters   Total Words
blogs       200.4242         899,288       208,361,438        37,334,131
news        196.2775         77,259        15,683,765         2,643,969
twitter     159.3641         2,360,148     162,385,035        30,373,583
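For reference, a minimal sketch of how these summary statistics can be computed in R (the file names assume the standard en_US.* capstone files; the word count is a rough whitespace-based tally):

```r
summarize_file <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  data.frame(
    File_Name        = basename(path),
    File_Size_MB     = file.size(path) / 1024^2,
    Total_Lines      = length(lines),
    Total_Characters = sum(nchar(lines)),
    Total_Words      = sum(lengths(strsplit(lines, "\\s+")))  # rough word count
  )
}

do.call(rbind, lapply(
  c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"),
  summarize_file
))
```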

Given the size of the data, 40% of the corpus was selected at random to train the model. All punctuation, extra whitespace, numbers, and email patterns were then removed, profanity was filtered out, and the text was converted to lowercase.
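A minimal sketch of the sampling and cleaning step (the seed and the regex patterns are illustrative assumptions, not necessarily the exact ones used):

```r
set.seed(123)  # assumed seed; any fixed seed makes the sample reproducible
lines       <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
train_lines <- sample(lines, size = round(0.4 * length(lines)))

clean <- tolower(train_lines)                # drop uppercase letters
clean <- gsub("\\S+@\\S+", " ", clean)       # email patterns
clean <- gsub("[0-9]+", " ", clean)          # numbers
clean <- gsub("[[:punct:]]+", " ", clean)    # punctuation
clean <- gsub("\\s+", " ", trimws(clean))    # collapse extra whitespace
# profanity is filtered against a banned-word list (list not shown here)
```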

Text Prediction Algorithm

The model was trained using the Stupid Backoff algorithm over n-grams. N-grams are used to estimate the probability of a word given the words that precede it. If a word with probability 0 is encountered at the current n-gram level, the model backs off to the (n-1)-gram level, where the score is multiplied by lambda (0.4), so the new probability is calculated as:

0.4 * P(“Desired Output”|“Text Input”)

The final probability is found by multiplying by 0.4^n, where n is the number of back-off levels down to the unigram.
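To make the back-off recursion concrete, here is a self-contained toy sketch in R (the counts and the function are illustrative only, not the app's actual implementation):

```r
# Toy n-gram counts; the real model uses counts from the sampled corpus
counts <- c("i love you" = 2, "i love" = 4, "love you" = 3,
            "love" = 5, "you" = 10, "i" = 6)

sbo_score <- function(context, word, lambda = 0.4) {
  if (length(context) == 0) {
    # Unigram base case: relative frequency among unigrams
    unigrams <- counts[!grepl(" ", names(counts))]
    if (is.na(counts[word])) return(0)
    return(unname(counts[word]) / sum(unigrams))
  }
  ngram <- paste(c(context, word), collapse = " ")
  hist  <- paste(context, collapse = " ")
  if (!is.na(counts[ngram]))
    return(unname(counts[ngram] / counts[hist]))
  lambda * sbo_score(context[-1], word, lambda)  # back off one level
}

sbo_score(c("i", "love"), "you")   # seen trigram: 2 / 4 = 0.5
sbo_score(c("we", "love"), "you")  # backs off once: 0.4 * (3 / 5) = 0.24
```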

I used the R sbo package and trained the model with the following parameters: 5-grams, 80% target dictionary coverage, and lambda = 0.4.
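A hedged sketch of what that training call can look like with the sbo package (train_lines stands for the cleaned sample from the previous slide; the exact arguments used here are assumptions):

```r
library(sbo)

p <- sbo_predictor(object = train_lines,          # character vector of cleaned text
                   N = 5,                         # 5-grams
                   dict = target ~ 0.8,           # dictionary covering 80% of word occurrences
                   .preprocess = sbo::preprocess, # package's default text cleanup
                   EOS = ".?!:;",                 # end-of-sentence characters
                   lambda = 0.4)                  # back-off penalty

predict(p, "i can't wait to")  # returns the top next-word candidates
```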

Shiny App
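A minimal sketch of the app's structure, assuming the trained sbo predictor p from the previous slide (widget names and layout are illustrative):

```r
library(shiny)

ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Enter a phrase:"),
  actionButton("submit", "Submit"),
  h4(textOutput("prediction"))
)

server <- function(input, output) {
  # Recompute only when the Submit button is clicked
  next_word <- eventReactive(input$submit, {
    predict(p, input$phrase)[1]  # 'p' is the trained sbo predictor
  })
  output$prediction <- renderText(next_word())
}

shinyApp(ui, server)
```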