TextPrediction for DS Capstone

Jayesh Gokhale
5th June 2021

Technologies Used: R, quanteda, data.table, Shiny

Dictionary Used: qdapDictionaries::GradyAugmented

Issues with Corpus Data

Non-English Words: Removed Based on Dictionary
Numbers and Dates: Removed every character that is not a letter of the alphabet
Special and Unicode Characters: Removed every character that is not a letter of the alphabet
Profane and Insensitive Words: Removed based on “Bad Words” list published on CMU Portal
Internet Vocabulary: Popular slang terms manually replaced with standard English words
Non-Dictionary Words: Excluded based on Dictionary (qdapDictionaries)
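The clean-up rules above can be sketched in base R as follows. This is a minimal illustration, not the capstone code: `clean_tokens`, the toy dictionary, the toy bad-words list and the slang map are all hypothetical stand-ins (the real inputs would be GradyAugmented and the CMU list), and dropping a whole token when it contains a non-letter is an assumption about how the removal was done.

```r
# Hypothetical helper: clean a character vector of tokens.
# `dictionary`, `bad_words` and `slang_map` are toy stand-ins for the real lists.
clean_tokens <- function(tokens, dictionary, bad_words = character(0),
                         slang_map = character(0)) {
  tokens <- tolower(tokens)
  # Garbage clean-up: keep only tokens made purely of the letters a-z
  # (assumption: tokens with digits, punctuation or Unicode are dropped whole)
  tokens <- tokens[grepl("^[a-z]+$", tokens, perl = TRUE)]
  # Profanity: drop every word that appears on the bad-words list
  tokens <- tokens[!tokens %in% bad_words]
  # Slang: names of slang_map are slang terms, values are the replacements
  is_slang <- tokens %in% names(slang_map)
  tokens[is_slang] <- unname(slang_map[tokens[is_slang]])
  # Dictionary filter: keep only words found in the dictionary
  tokens[tokens %in% dictionary]
}

toy_dictionary <- c("you", "are", "thanks", "the", "year")
toy_slang <- c(u = "you", r = "are", thx = "thanks")

cleaned <- clean_tokens(
  c("U", "r", "thx", "2021", "caf\u00e9", "badword", "zzyx", "the", "year"),
  dictionary = toy_dictionary,
  bad_words  = "badword",
  slang_map  = toy_slang
)
cleaned
```

The order of the steps follows the clean-up list in this deck; slang that contains digits would need to be replaced before the letters-only filter.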

Random Sampling of Data (Test Results below)
- 20% Sample gives around 80% of Unique Tokens
- 44% Sample gives around 90% of Unique Tokens
- 68% Sample gives around 95% of Unique Tokens
A 20% sample would be too aggressive (about one in five unique tokens is lost) and a 68% sample may not help much beyond 44%, so 44% is the chosen balance (around 90% of Unique Tokens)
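A coverage test of this kind could look like the sketch below. It uses a synthetic corpus, so the numbers it prints are not the capstone results; `unique_tokens` and `coverage_at` are hypothetical helpers, and sampling whole lines (rather than individual tokens) is an assumption.

```r
set.seed(42)  # reproducible sampling

# Synthetic stand-in for corpus lines (skewed word frequencies); not the capstone data
vocab <- paste0("w", 1:500)
lines <- replicate(
  2000,
  paste(sample(vocab, 8, replace = TRUE, prob = 1 / (1:500)), collapse = " ")
)

# Distinct whitespace-separated tokens in a character vector of lines
unique_tokens <- function(x) unique(unlist(strsplit(x, " ", fixed = TRUE)))

# Share of the full vocabulary that survives in a random sample of the lines
coverage_at <- function(lines, fraction) {
  n <- ceiling(length(lines) * fraction)
  sampled <- sample(lines, n)
  length(unique_tokens(sampled)) / length(unique_tokens(lines))
}

fractions <- c(0.20, 0.44, 0.68, 1.00)
coverage <- vapply(fractions, function(f) coverage_at(lines, f), numeric(1))
data.frame(fraction = fractions, coverage = round(coverage, 3))
```

Reading the resulting table row by row shows where extra sample size stops adding much vocabulary, which is the trade-off described above.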

Generate Combined Corpus from Blogs, News and Twitter
Tokenize Combined Corpus
Clean Up Tokens
- Garbage Clean Up
- Profanity and Insensitive Words – Bad Words List has some grey-area words like “amateur”. I am not a Subject Matter Expert and hence have excluded ALL the words from the list.
- Internet Slang Replacement
- Non-Dictionary Words Removal – The dictionary itself (qdapDictionaries::GradyAugmented) may not be exhaustive – Proper Nouns are excluded as a result
Sampling Tests (44% of Tokens)
Generate n-Grams (2,3,4,5,6)
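The n-gram generation step can be illustrated with a small base R sketch. The project lists quanteda, which offers `tokens_ngrams()` for this purpose; the version below avoids package dependencies so it stays self-contained, and `make_ngrams` and the toy token vector are hypothetical.

```r
# Build all n-grams of size n from a token vector (base R only)
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Toy cleaned tokens; the real input would be the sampled, cleaned corpus
tokens <- c("thanks", "for", "the", "follow", "thanks", "for", "the", "help")

# Frequency tables for n = 2..6, most frequent n-grams first
ngram_counts <- lapply(2:6, function(n) {
  sort(table(make_ngrams(tokens, n)), decreasing = TRUE)
})
names(ngram_counts) <- paste0("n", 2:6)

ngram_counts$n3
```

On a full corpus these tables get large, which is where a keyed data.table of n-gram counts becomes a natural storage choice.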

Two algorithms
- Interpolation
- Kneser-Ney Smoothing
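As an illustration of the first approach, here is a minimal sketch of linear interpolation over unigram, bigram and trigram estimates. It is not the app's implementation: the helper names, the toy tokens and the weights in `lambdas` are all assumptions, and the weights are not tuned.

```r
tokens <- c("i", "love", "you", "i", "love", "cake", "i", "like", "you")

# Hypothetical helper: n-grams as space-joined strings
ngrams <- function(x, n) {
  idx <- seq_len(length(x) - n + 1)
  vapply(idx, function(i) paste(x[i:(i + n - 1)], collapse = " "), character(1))
}

uni <- table(tokens)
bi  <- table(ngrams(tokens, 2))
tri <- table(ngrams(tokens, 3))

# Count lookup that returns 0 for unseen keys
count_of <- function(tbl, key) {
  if (key %in% names(tbl)) as.numeric(tbl[key]) else 0
}

# Interpolated P(w | w1 w2) = l1 * P(w) + l2 * P(w | w2) + l3 * P(w | w1 w2)
# lambdas are illustrative weights that sum to 1
p_interp <- function(w, w1, w2, lambdas = c(0.1, 0.3, 0.6)) {
  p1 <- count_of(uni, w) / sum(uni)
  c2 <- count_of(uni, w2)
  p2 <- if (c2 > 0) count_of(bi, paste(w2, w)) / c2 else 0
  c3 <- count_of(bi, paste(w1, w2))
  p3 <- if (c3 > 0) count_of(tri, paste(w1, w2, w)) / c3 else 0
  sum(lambdas * c(p1, p2, p3))
}

# Score every known word and keep the k best candidates
predict_next <- function(w1, w2, k = 5) {
  scores <- vapply(names(uni), function(w) p_interp(w, w1, w2), numeric(1))
  head(sort(scores, decreasing = TRUE), k)
}

top <- predict_next("i", "love")
top
```

Kneser-Ney smoothing differs in that it discounts observed counts and backs off to a continuation probability (how many distinct contexts a word follows) instead of mixing raw frequencies with fixed weights.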
Shiny Web App Deployment
- Time Taken for each algorithm
- Top 5 Predictions for next word from each algorithm
Validation R-Pubs Link
- Accuracy is defined as the ratio of target words “caught” in the top 5 ranks: Measured at around 33%
- Time Taken: 0.44 to 0.52 seconds per prediction (all 5 ranks)
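The two validation measures can be computed along the lines of the sketch below. Everything in it is a toy: `toy_predict` is a hypothetical stand-in for the real predictor and the four test cases are invented, so the resulting figures say nothing about the measured 33% or the reported timings.

```r
# Hypothetical stand-in predictor: returns up to 5 candidate next words
toy_predict <- function(context) {
  lookup <- list(
    "thanks for" = c("the", "your", "a", "all", "being"),
    "i love"     = c("you", "it", "this", "the", "my")
  )
  if (context %in% names(lookup)) lookup[[context]] else character(0)
}

# Invented held-out pairs of (context, true next word)
test_cases <- data.frame(
  context = c("thanks for", "i love", "see you", "i love"),
  target  = c("the", "cake", "soon", "you"),
  stringsAsFactors = FALSE
)

hits <- logical(nrow(test_cases))
elapsed <- system.time({
  for (i in seq_len(nrow(test_cases))) {
    # A hit means the true next word appears anywhere in the top 5
    hits[i] <- test_cases$target[i] %in% toy_predict(test_cases$context[i])
  }
})[["elapsed"]]

top5_accuracy <- mean(hits)                           # share of targets found in the top 5
seconds_per_prediction <- elapsed / nrow(test_cases)  # average wall-clock time per call
```

Timing the whole loop and dividing by the number of cases smooths out per-call noise, which matters when a single prediction takes well under a second.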
“Feel Good Feeling” - Generally at least one Sensible Prediction
Concern - n-Gram Models do not capture long-range dependencies