This report describes the data cleaning and sampling strategies employed and presents an exploratory analysis of the corpus. Although the corpus has only three sources (Twitter feeds, blog posts, and news articles), these sources are diverse, widely used by much of the population, and span much of the language we use when texting.
Below are the raw file length, word count, and size of all three files before any processing is performed.
| Source  | File Length (lines) | Word Count | File Size (MB) |
|---------|---------------------|------------|----------------|
| Twitter | 2,360,148           | 53,438     | 159.3641       |
| Blogs   | 899,288             | 16,482     | 200.4242       |
| News    | 77,259              | 269,578    | 196.2775       |
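For reference, the snippet below is one way such per-file statistics can be gathered in R; the file names and the use of stringi's word counter are assumptions on my part, so its word counts may not match the exact tool used for the table above.

```r
# Hypothetical sketch: line count, word count and size for each raw file.
# File names are assumed; stri_count_words() may count differently than
# whatever produced the table above.
library(stringi)

files <- c(Twitter = "en_US.twitter.txt",
           Blogs   = "en_US.blogs.txt",
           News    = "en_US.news.txt")

stats <- lapply(names(files), function(src) {
  lines <- readLines(files[src], skipNul = TRUE, warn = FALSE)
  data.frame(Source       = src,
             File.Length  = length(lines),
             Word.Count   = sum(stri_count_words(lines)),
             File.Size.MB = file.size(files[src]) / 1024^2)
})
do.call(rbind, stats)
```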
To process this corpus I had to think about the data types included and try to home in on my overall goals. Below I list the processing steps performed and my reasoning for each; a rough sketch of this cleaning pipeline follows the list.
1). Split contractions: I wanted to take words closer to their roots. Ex: ‘Weren’t’ is now counted as ‘were’ and ‘not’ instead of as its own separate token.
2). Removed everything starting with ‘@’ or ‘#’ and all email addresses using regex. The regex chosen may not be perfect, but it is not over-complicated and does not take much memory to process. These tokens would only add one-off n-grams to our algorithm.
3). Removed everything except alphanumeric characters and the punctuation symbols “.”, “!”, and “?”. I am not convinced this is the best way to handle punctuation, and in my simulations below I actually remove all punctuation.
4). Removed all numbers: these are too ambiguous for our prediction model. I may need to think about what kinds of n-grams this step leaves behind.
5). Split hyphens: hyphenated words should be reduced to their root forms and counted as such. Ex: ‘Over-complicated’ is counted as ‘over’ and ‘complicated’ instead of as its own word.
6). Made everything lowercase: we do not want the algorithm to distinguish between two words just because one is capitalized and the other is not. Ex: ‘Just’ and ‘just’ should be treated the same.
7). Removed profanity using Google’s profanity list found here: https://code.google.com/archive/p/badwordslist/downloads
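The following is a minimal, hypothetical sketch of the pipeline described above, written with plain regexes; it is not the exact code used for this report, and the contraction handling in particular is only a crude illustration (a full solution needs a lookup table). The `badwords` argument stands in for Google's profanity list.

```r
# Hypothetical sketch of the cleaning pipeline described above (not the exact
# code used for this report). `badwords` stands in for Google's profanity list.
clean_text <- function(x, badwords = character(0)) {
  x <- gsub("\\S+@\\S+\\.\\S+", " ", x, perl = TRUE)   # drop email addresses
  x <- gsub("(^|\\s)[@#]\\S+", " ", x, perl = TRUE)    # drop @mentions and #hashtags
  x <- gsub("n't\\b", " not", x, ignore.case = TRUE,
            perl = TRUE)                               # crude contraction split
  x <- gsub("-", " ", x, fixed = TRUE)                 # split hyphenated words
  x <- gsub("[0-9]+", " ", x)                          # remove numbers
  x <- gsub("[^[:alnum:].!? ]", " ", x)                # keep letters, digits, . ! ?
  x <- tolower(x)                                      # lowercase everything
  if (length(badwords) > 0) {                          # remove profanity
    x <- gsub(paste0("\\b(", paste(badwords, collapse = "|"), ")\\b"),
              " ", x, perl = TRUE)                     # assumes plain-word entries
  }
  gsub("\\s+", " ", trimws(x))                         # collapse whitespace
}

clean_text("Weren't you @user over-complicating #this in 2019?!")
## [1] "were not you over complicating in ?!"
```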
After processing the full data set (without sampling), there are a total of 3,336,695 texts, 804,969 unique unigrams, and 11,162,827 unique bigrams in the corpus; when I tried to obtain the number of unique trigrams, my machine ran out of memory. I also made summary tables of the top 1,000 most frequent unigrams and bigrams, but I could not include them here because of the way Rmd knits: all code must be compiled within the document and will not compile if it exists only in the global environment. This makes Rmd documents completely reproducible, but it makes handling large objects very difficult; the bigram DFM contained over 300 trillion elements.
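For context, here is a rough sketch of how these counts can be obtained with quanteda (assumed here because the report refers to a DFM); `clean_corpus` is a stand-in name for the processed character vector.

```r
# Hypothetical sketch of the n-gram counting step; `clean_corpus` is assumed
# to be the processed character vector of texts.
library(quanteda)

toks <- tokens(clean_corpus, remove_punct = TRUE)
nfeat(dfm(toks))                          # number of unique unigrams
nfeat(dfm(tokens_ngrams(toks, n = 2)))    # number of unique bigrams
topfeatures(dfm(toks), 1000)              # top 1,000 unigrams by frequency
```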
I would like the sample to capture 95% of the vocabulary of the full corpus and still include a high percentage of its bigrams, while remaining lightweight enough to be useful on a mobile device.
The corpus is sampled as follows: 10% of the blogs (~89,930 texts), 5% of the news (~38,630 texts), and 2% of the Twitter feeds (47,203 texts). These percentages were chosen because they seemed low enough to keep the sample lightweight; the total size of this character vector is 43.9 MB. In trying to think logically about this, I concluded that blogs will be the best source for a predictive text app because they are most indicative of how we text. The news will contain a lot of jargon not readily captured by the blogs and will be a more structured and grammatically correct source. The Twitter feeds take on the style of the person writing them, which will likely make them harder to use for a predictive text app, but they will capture a lot of language not found in the blogs and news sources.
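A minimal sketch of this sampling step, assuming `blogs`, `news`, and `twitter` are character vectors holding one text per element and using an arbitrary seed:

```r
# Hypothetical sketch of the sampling step; vector names and seed are assumed.
set.seed(1234)

sample_lines <- function(x, prop) x[sample(length(x), round(prop * length(x)))]

sample_corpus <- c(sample_lines(blogs,   0.10),   # ~10% of blog posts
                   sample_lines(news,    0.05),   # ~5% of news articles
                   sample_lines(twitter, 0.02))   # ~2% of tweets

format(object.size(sample_corpus), units = "MB")  # check the in-memory size
```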
However, after running a sample with these percentages, the number of unigrams in the sample corpus is 159,507, only 19.8% of all unigrams and much lower than my ideal of 95% stated above. Likewise, the number of bigrams in the sample corpus is 1,755,178, a low 15.7% of all bigrams. I will experiment with these percentages and look at how large my sample can be while still running smoothly on a mobile phone. Mobile phones of our era have multiple GB of RAM to work with, seemingly an average of 4 GB, but many other applications may be running in the background, taking up multiple GB at a time. Therefore the application will need to be lightweight enough to run while other applications are running.
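For clarity, the coverage percentages quoted above come straight from the counts:

```r
# Coverage of the sample relative to the full corpus
159507 / 804969       # unigrams: ~0.198 (19.8%)
1755178 / 11162827    # bigrams:  ~0.157 (15.7%)
```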
Side note: stemming or other cleaning tactics would reduce the corpus vocabulary and increase these percentages, but I am still not sure whether stemming is the right thing to do.
Below are histograms of the most frequent unigrams, bigrams, and trigrams in the corpus. All three sets make sense, which suggests that the processing steps I performed work well, at least at this level.
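As a reference, such a frequency plot can be drawn roughly as follows (quanteda assumed; `sample_corpus` is the sampled vector from the sketch above, and the bigram and trigram plots are analogous via `tokens_ngrams()`):

```r
# Hypothetical sketch of the unigram frequency plot (top 20 shown).
library(quanteda)

top_uni <- topfeatures(dfm(tokens(sample_corpus, remove_punct = TRUE)), 20)
barplot(top_uni, las = 2, cex.names = 0.7, main = "Most frequent unigrams")
```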
My next steps and open questions are listed below.

1). Figure out whether or not I should include punctuation in my analysis.
2). Should I use interpolation of n-grams with (n-1)-grams, etc., to build a good smoothing model, as in Kneser-Ney smoothing? I have searched and cannot find any good smoothing packages for handling things like unseen n-grams. I might try to implement my own smoothing algorithm, but this may be a bit ambitious; a toy sketch of simple interpolation follows this list.
3). Possibly use a bag-of-words model with unigrams, bigrams, and trigrams for this application.
4). Need to figure out whether I should employ stemming or use a vocabulary like the Fry_1000 dictionary in qdapDictionaries, which holds the top 1,000 words making up 90% of all printed text; this particular dictionary seems like it would severely limit the available words, but it would be great for processing capability. I could also include higher-order n-grams together with a limiting vocabulary like the one mentioned above, but this does not seem like a good idea.
5). The evaluation possibilities for my model include: a). Intrinsic evaluation: I will look into using ‘perplexity’ to judge the model. b). Extrinsic evaluation: judged by a human on a trial-by-trial basis by how often the model predicts the correct word.
6). Might use a list of common internet/Twitter shorthand to convert any shorthand found in the corpus back to its root words.
7). Still need to figure out what to do about foreign-language text. I would think that removing all non-alphanumeric tokens besides some punctuation during processing would have taken care of a bit of this.
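Regarding point 2), below is a toy sketch of simple linear interpolation over trigram, bigram, and unigram estimates; this is not Kneser-Ney, and the lambda weights and the named count vectors `uni`, `bi`, and `tri` are illustrative assumptions only. Kneser-Ney would additionally apply absolute discounting and replace the lower-order terms with continuation probabilities.

```r
# Toy sketch: interpolated estimate of P(w3 | w1, w2). `uni`, `bi` and `tri`
# are assumed to be named count vectors whose names are space-separated n-grams.
interp_prob <- function(w1, w2, w3, uni, bi, tri,
                        lambdas = c(0.6, 0.3, 0.1)) {    # illustrative weights
  p3 <- tri[paste(w1, w2, w3)] / bi[paste(w1, w2)]       # trigram estimate
  p2 <- bi[paste(w2, w3)] / uni[w2]                      # bigram estimate
  p1 <- uni[w3] / sum(uni)                               # unigram estimate
  probs <- c(p3, p2, p1)
  probs[is.na(probs)] <- 0                               # unseen n-grams drop out
  sum(lambdas * probs)
}

# Tiny worked example (real counts would come from the n-gram tables)
uni <- c("i" = 5, "love" = 3, "pizza" = 2)
bi  <- c("i love" = 2, "love pizza" = 1)
tri <- c("i love pizza" = 1)
interp_prob("i", "love", "pizza", uni, bi, tri)
## [1] 0.42
```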