Patricia Londono
January 15, 2021
The data was collected by a web crawler from publicly available tweets, blogs, and news articles.
Data Summary
| File Name | File Size (MB) | Lines | Characters | Words |
|---|---|---|---|---|
| blogs | 200.4242 | 899,288 | 208,361,438 | 37,334,131 |
| news | 196.2775 | 77,259 | 15,683,765 | 2,643,969 |
| twitter | 159.3641 | 2,360,148 | 162,385,035 | 30,373,583 |
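For reference, a minimal sketch of how these summary statistics could be computed in R; the file names follow the original dataset's naming and the `summarise_file()` helper is illustrative, not the code actually used here:

```r
# Hypothetical helper to reproduce the per-file summary statistics above.
files <- c(blogs   = "en_US.blogs.txt",
           news    = "en_US.news.txt",
           twitter = "en_US.twitter.txt")

summarise_file <- function(path) {
  lines <- readLines(path, skipNul = TRUE)
  data.frame(size_mb    = file.size(path) / 1024^2,          # size in MB
             lines      = length(lines),                      # line count
             characters = sum(nchar(lines)),                  # character count
             words      = sum(lengths(strsplit(lines, "\\s+"))))  # word count
}

do.call(rbind, lapply(files, summarise_file))
```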
Given the data size, 40% of the corpus was sampled at random to train the model. The text was then cleaned: punctuation, numbers, email patterns, extra whitespace, and profanity were removed, and all text was converted to lowercase.
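A minimal sketch of the sampling and cleaning step (profanity filtering omitted for brevity; the file path and the `clean_text()` helper are assumptions, not the actual preprocessing code):

```r
# Sample 40% of the lines at random and apply basic cleaning.
set.seed(123)

lines   <- readLines("en_US.blogs.txt", skipNul = TRUE)
sampled <- sample(lines, size = floor(0.4 * length(lines)))  # 40% random sample

clean_text <- function(x) {
  x <- tolower(x)                    # fold uppercase letters to lowercase
  x <- gsub("\\S+@\\S+", " ", x)     # strip email patterns
  x <- gsub("[0-9]+", " ", x)        # strip numbers
  x <- gsub("[[:punct:]]+", " ", x)  # strip punctuation
  gsub("\\s+", " ", trimws(x))       # collapse extra whitespace
}

sampled <- clean_text(sampled)
```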
The model was trained using the Stupid Backoff algorithm over n-grams. An n-gram model estimates the probability of the next word from the preceding words. When the current context has no matching n-gram (probability 0 at that level), the algorithm backs off to the (n-1)-gram level and multiplies the score by a fixed penalty lambda = 0.4, so the backed-off score is:

0.4 * P(next word | shortened context)

The final score is obtained by multiplying by 0.4^k, where k is the number of back-off steps taken (down to the unigram level if necessary).
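To make the back-off scoring concrete, here is a toy sketch with made-up trigram/bigram/unigram counts; the counts and the `sbo_score()` helper are purely illustrative:

```r
# Illustrative Stupid Backoff scorer over toy n-gram count tables.
lambda <- 0.4

counts <- list(
  "3" = c("i love you" = 8, "i love it" = 2),
  "2" = c("i love" = 10, "love you" = 12, "love it" = 5),
  "1" = c("i" = 120, "love" = 30, "you" = 80, "it" = 60)
)

sbo_score <- function(context, word) {
  tokens  <- c(strsplit(context, " ")[[1]], word)
  penalty <- 1
  while (length(tokens) >= 1) {
    n     <- length(tokens)
    ngram <- paste(tokens, collapse = " ")
    hist  <- paste(tokens[-n], collapse = " ")
    num   <- counts[[as.character(n)]][ngram]
    den   <- if (n > 1) counts[[as.character(n - 1)]][hist] else sum(counts[["1"]])
    if (!is.na(num) && !is.na(den) && den > 0) {
      return(penalty * unname(num) / unname(den))  # score found at this level
    }
    penalty <- penalty * lambda                    # apply the 0.4 back-off penalty
    tokens  <- tokens[-1]                          # drop the oldest context word
  }
  0
}

sbo_score("i love", "you")   # seen trigram: 8 / 10 = 0.8
sbo_score("we love", "you")  # backs off once: 0.4 * 12 / 30 = 0.16
```

The first call scores a seen trigram directly; the second finds no matching trigram, backs off to the bigram level, and applies the 0.4 penalty.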
I used the R sbo package and trained the model with the following parameters: 5-grams, an 80% dictionary coverage target, and lambda = 0.4.
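A hedged sketch of what that training call could look like with the sbo package; `training_corpus` is a placeholder character vector, and the argument names follow the package documentation as I recall it, so they should be checked against the installed version:

```r
library(sbo)

# Build a Stupid Backoff predictor with the parameters stated above.
model <- sbo_predictor(object = training_corpus,   # cleaned training lines
                       N = 5,                      # 5-gram model
                       dict = target ~ 0.8,        # dictionary covering 80% of words
                       .preprocess = sbo::preprocess,
                       EOS = ".?!:;",              # end-of-sentence tokens
                       lambda = 0.4)               # back-off penalty

predict(model, "thanks for the")  # returns the top predicted next words
```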
Shiny App: https://patrickl086.shinyapps.io/Text_Prediction_App/

GitHub Repository: https://github.com/patrickl086/datasciencecoursera/tree/master/Capstone%20Project/Text%20Prediction%20App
Thanks!