Numbers and Dates: Removed every character that is not a letter
Special and Unicode Characters: Removed every character that is not a letter
Profane and Insensitive Words: Removed based on the “Bad Words” list published on the CMU Portal
Internet Vocabulary: Popular slang terms manually replaced with standard English words
Non-Dictionary Words: Excluded based on Dictionary (qdapDictionaries)
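The character-level rules and the slang replacement above could be sketched roughly as follows. This is a minimal Python illustration, not the project's actual code: the `SLANG` map is hypothetical (the hand-picked replacements are not listed here), and `normalize_token` / `normalize` are names invented for this example.

```python
import re

# Hypothetical slang map; the real replacements were chosen by hand
# and are not listed in these slides.
SLANG = {"u": "you", "thx": "thanks", "pls": "please"}

# Anything that is not a lower-case ASCII letter: digits, punctuation,
# accented and other Unicode characters.
NON_LETTERS = re.compile(r"[^a-z]+")


def normalize_token(raw):
    """Lower-case a token, strip non-letters, then replace known slang."""
    token = NON_LETTERS.sub("", raw.lower())
    return SLANG.get(token, token)


def normalize(tokens):
    """Normalize every token and drop those that end up empty."""
    cleaned = []
    for raw in tokens:
        token = normalize_token(raw)
        if token:  # e.g. "2015" becomes "" and is dropped
            cleaned.append(token)
    return cleaned
```

For example, `normalize(["Thx!!", "2015", "U", "hello"])` keeps three tokens: the pure-number token disappears, and the two slang terms are mapped to their standard forms.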
Technology
Technical Challenges
Insufficient RAM
Data Pre-processing takes around 2-3 hours
Workaround
Random Sampling of Data (Test Results below)
20% Sample gives around 80% of Unique Tokens
44% Sample gives around 90% of Unique Tokens
68% Sample gives around 95% of Unique Tokens
A 20% Sample would be too aggressive and a 68% Sample saves little over the full data: 44% is the right balance (it retains around 90% of Unique Tokens)
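One way to measure how many unique tokens a random sample retains is sketched below. It is an assumption-laden illustration: it samples whole lines, tokenizes on whitespace, and uses a fixed seed for repeatability, none of which is stated in the slides. The function name `unique_token_coverage` is made up for this example.

```python
import random


def unique_token_coverage(lines, fraction, seed=0):
    """Return the share of the corpus's unique tokens found in a random sample.

    lines    -- list of text lines (the full corpus)
    fraction -- share of lines to sample, between 0 and 1
    seed     -- fixed seed so the measurement is repeatable
    """
    full_vocab = {tok for line in lines for tok in line.split()}
    if not full_vocab:
        return 0.0

    rng = random.Random(seed)
    sample_size = round(len(lines) * fraction)
    sample = rng.sample(lines, sample_size)  # sampling without replacement

    sample_vocab = {tok for line in sample for tok in line.split()}
    return len(sample_vocab) / len(full_vocab)
```

Calling this for several fractions (for instance 0.20, 0.44 and 0.68) on the combined corpus would produce a coverage curve of the kind summarized in the test results above. Coverage grows more slowly than the sample size because frequent words are captured early and rare words arrive late.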
Solution
Model Building
Generate Combined Corpus from Blogs, News and Twitter
Tokenize Combined Corpus
Clean Up Tokens
Garbage Clean Up
Profanity and Insensitive Words
– The Bad Words List includes some grey-area words like “amateur”. I am not a Subject Matter Expert and hence have excluded ALL the words on the list.
Internet Slang Replacement
Non-Dictionary Words Removal
– The dictionary itself may not be exhaustive
– Proper Nouns are excluded as a result
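The two word-list filters and their side effects can be illustrated with a short sketch. The word sets below are tiny hypothetical stand-ins for the real Bad Words list and dictionary, and `filter_tokens` is a name invented for this example. Tokens are assumed to be already lower-cased and stripped of non-letters.

```python
# Hypothetical stand-ins for the real word lists.
BAD_WORDS = {"amateur"}
DICTIONARY = {"an", "the", "player", "visited", "amateur"}


def filter_tokens(tokens, bad_words, dictionary):
    """Split tokens into kept and dropped, recording why each was dropped."""
    kept, dropped = [], []
    for token in tokens:
        if token in bad_words:
            # Every listed word is excluded, grey-area or not.
            dropped.append((token, "bad-word list"))
        elif token not in dictionary:
            # Proper nouns and rare words fall out here.
            dropped.append((token, "not in dictionary"))
        else:
            kept.append(token)
    return kept, dropped
```

With the tokens `["an", "amateur", "player", "visited", "london"]`, the sketch drops "amateur" because of the blanket exclusion and "london" because a proper noun is not a dictionary entry. Returning the dropped tokens with a reason makes it easy to review how much vocabulary each filter costs.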