TextPrediction for DS Capstone

Jayesh Gokhale
5th June 2021

Technologies Used: R, quanteda, data.table, Shiny

Dictionary Used: qdapDictionaries::GradyAugmented

Corpus Data Cleaning

Issues with Corpus Data

  • Non-English Words: Removed Based on Dictionary
  • Numbers and Dates: Removed every character that is not a letter
  • Special and Unicode Characters: Removed every character that is not a letter
  • Profane and Insensitive Words: Removed based on the “Bad Words” list published on the CMU portal
  • Internet Vocabulary: Popular slang terms manually replaced with standard English words
  • Non-Dictionary Words: Excluded based on Dictionary (qdapDictionaries)
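The cleaning rules above can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not the project's R/quanteda code. The tiny `DICTIONARY`, `BAD_WORDS` and `SLANG` tables are hypothetical stand-ins for GradyAugmented, the CMU list and the manual slang mapping, and the order of the steps is an assumption.

```python
import re

# Hypothetical stand-ins for the real resources named on the slide.
DICTIONARY = {"i", "love", "you", "day"}   # stands in for GradyAugmented
BAD_WORDS = {"darn"}                       # stands in for the CMU "Bad Words" list
SLANG = {"u": "you", "luv": "love"}        # stands in for the manual slang mapping


def clean_tokens(text):
    """Return the tokens of `text` that survive all four cleaning rules."""
    cleaned = []
    for token in text.lower().split():
        # 1. Garbage clean-up: drop every character that is not a letter a-z.
        token = re.sub(r"[^a-z]", "", token)
        if not token:
            continue
        # 2. Profanity: drop any word on the bad-words list.
        if token in BAD_WORDS:
            continue
        # 3. Slang: replace with the standard English word, if one is mapped.
        token = SLANG.get(token, token)
        # 4. Dictionary filter: drop anything the dictionary does not contain.
        if token not in DICTIONARY:
            continue
        cleaned.append(token)
    return cleaned


print(clean_tokens("I luv u 2day!! 123 darn xyzzy"))
```

Note that stripping non-letters turns "2day" into "day", which then passes the dictionary filter; whether that side effect is desirable depends on the corpus.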

Technology

Technical Challenges

  • Insufficient RAM
  • Data pre-processing takes around 2-3 hours

Workaround

  • Random Sampling of Data (Test Results below)
    • 20% Sample gives around 80% of Unique Tokens
    • 44% Sample gives around 90% of Unique Tokens
    • 68% Sample gives around 95% of Unique Tokens
  • A 20% sample would be too aggressive and a 68% sample would save little processing time or memory: 44% is the chosen balance (retaining around 90% of Unique Tokens)
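One way such a sampling test could be run is sketched below, in Python for illustration only (the project itself uses R). The function name `unique_token_coverage`, the whitespace tokenizer and the toy corpus are all assumptions; the percentages quoted above come from the real corpus and will not be reproduced by this toy data.

```python
import random


def unique_token_coverage(lines, fraction, seed=0):
    """Share of the corpus's unique tokens that appear in a random sample of lines."""
    rng = random.Random(seed)                      # fixed seed for repeatability
    k = max(1, round(len(lines) * fraction))       # number of lines to sample
    sample = rng.sample(lines, k)
    all_types = {tok for line in lines for tok in line.split()}
    sample_types = {tok for line in sample for tok in line.split()}
    return len(sample_types) / len(all_types)


# Toy corpus (hypothetical); a real test would use the Blogs, News and Twitter lines.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "a cat and a dog",
    "the mat and the log",
    "birds fly over the house",
]

for fraction in (0.20, 0.44, 0.68, 1.00):
    print(fraction, round(unique_token_coverage(corpus, fraction), 2))
```

Coverage grows more slowly than the sample size because frequent words are seen early, which is why a sample well under 100% can still retain most unique tokens.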

Solution

Model Building

  • Generate Combined Corpus from Blogs, News and Twitter
  • Tokenize Combined Corpus
  • Clean Up Tokens
    • Garbage Clean Up
    • Profanity and Insensitive Words – The Bad Words list contains some grey-area words such as “amateur”. As I am not a subject matter expert, I excluded ALL the words on the list.
    • Internet Slang Replacement
    • Non-Dictionary Words Removal – The dictionary itself (qdapDictionaries::GradyAugmented) may not be exhaustive – Proper Nouns are excluded as a result
  • Sampling Tests (44% of Tokens)
  • Generate n-Grams (n = 2, 3, 4, 5, 6)
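The n-gram generation step can be sketched as follows. This is an illustrative Python version, not the project's quanteda code; `ngram_counts` is a hypothetical helper and the token list is toy data.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Toy cleaned token stream (hypothetical).
tokens = "the cat sat on the mat the cat ran".split()

# One frequency table per n-gram order used in the deck: 2 through 6.
tables = {n: ngram_counts(tokens, n) for n in range(2, 7)}

print(tables[2].most_common(1))
```

Each table maps an n-gram to its frequency; the last word of an n-gram is the candidate prediction for the preceding n-1 words.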

Solution

Prediction & Validation

  • Two algorithms
  • Shiny Web App Deployment
    • Time Taken for each algorithm
    • Top 5 Predictions for next word from each algorithm
  • Validation R-Pubs Link
    • Accuracy is defined as the ratio of target words “caught” within the top 5 ranks: measured at around 33%
    • Time Taken: 0.44 to 0.52 seconds per prediction (all 5 ranks)
  • “Feel Good Feeling” - Generally at least one Sensible Prediction
  • Concern - n-Gram Models do not capture long-range dependencies
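The slides do not name the two algorithms, so the sketch below is only an assumption of how a next-word lookup and the top-5 accuracy measure could work. It uses a simple longest-context-first backoff over n-gram counts, written in Python for illustration; `build_model`, `predict` and `top_k_accuracy` are hypothetical names, and `max_n=3` is chosen only to keep the example small.

```python
from collections import Counter, defaultdict


def build_model(tokens, max_n=3):
    """Map each context (0 to max_n-1 preceding words) to counts of the next word."""
    model = defaultdict(Counter)
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            *context, nxt = tokens[i:i + n]
            model[tuple(context)][nxt] += 1
    return model


def predict(model, history, k=5, max_n=3):
    """Return up to k candidates, trying the longest context first, then backing off."""
    ranked = []
    for size in range(max_n - 1, -1, -1):
        context = tuple(history[-size:]) if size else ()
        for word, _count in model.get(context, Counter()).most_common():
            if word not in ranked:
                ranked.append(word)
            if len(ranked) == k:
                return ranked
    return ranked


def top_k_accuracy(model, test_tokens, k=5, max_n=3):
    """Share of target words that appear among the top k predictions."""
    hits = total = 0
    for i in range(1, len(test_tokens)):
        history = test_tokens[max(0, i - (max_n - 1)):i]
        if test_tokens[i] in predict(model, history, k, max_n):
            hits += 1
        total += 1
    return hits / total if total else 0.0


# Toy training data (hypothetical).
model = build_model("the cat sat on the mat the cat ran".split())
print(predict(model, ["the"]))
print(top_k_accuracy(model, ["the", "cat", "sat"]))
```

Because the context is capped at `max_n - 1` words, any cue further back in the sentence is invisible to the lookup, which is the long-range dependency concern noted above.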