This report explains the exploratory analysis and our goals for the prediction algorithm and app we will develop for the Data Science Capstone Project. So far, we have downloaded the data and successfully loaded it into R; partitioned each data set into a training and a testing data set; created this basic report of summary statistics about the training data sets; identified interesting findings; and outlined our plans, with some data-based feedback, for creating the prediction algorithm and Shiny app. Most of the manipulation of text objects was done using the quanteda package.
We loaded each of the data sets (en_US.blogs.txt, en_US.news.txt, en_US.twitter.txt) and partitioned each of them into a training data set containing 70% of the documents and a testing set containing the remaining 30%. This exploratory data analysis is based solely on the training sets.
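As a rough illustration, the loading and 70/30 split could look like the sketch below. This is a minimal sketch, not the exact code used for this report; the seed value, the `partition()` helper, and the use of `readLines()`/`sample()` are assumptions.

```r
# Read each corpus; skipNul avoids warnings from embedded NUL characters
blogs   <- readLines("en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

# Split one data set into 70% training / 30% testing (hypothetical helper)
partition <- function(x, p = 0.7, seed = 1234) {
  set.seed(seed)
  idx <- sample(seq_along(x), size = floor(p * length(x)))
  list(train = x[idx], test = x[-idx])
}

twitter_split <- partition(twitter)
train_twitter <- twitter_split$train
test_twitter  <- twitter_split$test
```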
The analysis of each training set consists of: 1) counting the number of documents in each file; 2) counting the number of words (features) in each file; 3) identifying the words (features) with the greatest usage, i.e., the number of times each word appears in the data set; 4) presenting a histogram of the top 500 words and their frequency of use in each data set; and 5) presenting histograms of the 500 most frequent 2-grams.
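A sketch of how these summaries can be produced with quanteda is shown below. It is illustrative rather than the exact code behind the output that follows; `train_twitter` is assumed to be the 70% training partition from the previous step.

```r
library(quanteda)

# Build a corpus and tokenize (punctuation kept at this stage)
corp <- corpus(train_twitter)
toks <- tokens(corp)

# 1) Number of documents and 2) number of distinct features
dfm1 <- dfm(toks)
ndoc(dfm1)    # documents in the training set
nfeat(dfm1)   # distinct features (words)

# 3) Most frequently used features
topfeatures(dfm1, 50)

# 4)-5) Frequencies of the top 500 words and 2-grams (used for the histograms)
top_words  <- topfeatures(dfm1, 500)
dfm2       <- dfm(tokens_ngrams(toks, n = 2))
top_2grams <- topfeatures(dfm2, 500)
```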
## [1] "Total number of twitter messages: 1652104"
## [1] "Total number of different features(words) used on twitter: 370796"
## [1] "Top 50 Features used on twitter"
## . ! the to , i a you and ?
## 1752837 895250 654782 551318 521115 505275 426276 382928 306046 297649
## : for in is of it my " on that
## 279324 269652 264337 251743 251375 205887 203865 194942 193847 164636
## me ) be at with your have this so are
## 141207 139604 131319 130656 121236 119888 118101 114697 113949 111171
## just we i'm but not like all was - out
## 105866 93350 90560 89410 86755 85162 84666 82339 81466 80223
## up get what do if love & good ( will
## 79399 78533 78283 74596 74486 74266 72736 70418 68868 66208
## [1] "Total number of blogs: 629502"
## [1] "Total number of different features(words) used on blogs: 364125"
## [1] "Top 50 Features used on blogs"
## . the , and to a of i \200 in
## 1450100 1294915 1241895 764244 746255 627344 611617 538141 519822 414571
## that is it \231 for you with was on my
## 321331 301941 280412 271451 254309 207227 200520 193914 191178 189387
## this â as have " be but are ! )
## 180573 163373 156198 153203 147920 145964 143509 135447 133463 130158
## we not s at ( so from he or all
## 129643 121402 120891 120079 119433 115533 103335 101274 100936 100779
## they me : ? one by about will his an
## 97019 96978 95768 94775 87133 85608 80353 78911 76727 76384
## [1] "Total number of news articles: 707169"
## [1] "Total number of different features(words) used on news: 331973"
## [1] "Top 50 Features used on news"
## . the , to and a of " in for
## 1390797 1379866 1378105 631095 619268 612755 540154 524623 471981 245411
## that is on \200 with said was he it at
## 242650 198442 186517 182266 177997 175615 160178 159256 152153 148777
## as his i from be â but have are by
## 131272 110345 106782 106725 106505 105521 105426 100234 97247 92737
## : an this has not they who will ) (
## 92543 85995 85193 84557 78652 78387 76492 75945 75737 75266
## $ \231 or we you - about more their had
## 72853 72226 69400 67270 66694 65397 62845 61678 60443 58418
As a general rule, we found that punctuation marks are heavily used in all three data sets: they are the features with the largest number of occurrences in the texts analyzed. We believe punctuation will not add much value to the prediction algorithm, so we will proceed with the analysis excluding punctuation characters, starting with the 2-grams presented above. We also need to remove profanity from our data sets. Finally, we believe stemming the words would reduce prediction accuracy, so we will not use that option.
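In quanteda this cleanup can be expressed roughly as follows. This is a sketch under assumptions: `corp` is the corpus built earlier, and `profanity_words` stands for whatever profanity list we end up adopting (it is not defined here).

```r
library(quanteda)

# Drop punctuation (and similar non-word tokens) at tokenization time
toks_clean <- tokens(corp,
                     remove_punct   = TRUE,
                     remove_symbols = TRUE,
                     remove_numbers = TRUE)

# Remove profanity; profanity_words is a character vector of banned terms
toks_clean <- tokens_remove(toks_clean, pattern = profanity_words)

# Note: we deliberately do NOT call tokens_wordstem(), since stemming
# would reduce prediction accuracy for our use case
dfm_clean <- dfm(toks_clean)
```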
We found a number of issues in each data set. The Twitter data set requires deeper cleaning, as it contains abbreviations, misspelled words, and uncommon characters. The blogs and news data sets require cleaning of special characters such as trademark signs, superscripts, and foreign-language symbols.
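One possible way to strip these problematic characters before tokenization is to transliterate or drop non-ASCII bytes, as in the sketch below. This assumes an iconv-based cleanup and a hypothetical `train_blogs` training partition; it is not necessarily the final approach we will adopt.

```r
# Convert to ASCII, dropping characters that cannot be represented
# (trademark signs, curly quotes, foreign-language symbols, etc.)
clean_text <- iconv(train_blogs, from = "UTF-8", to = "ASCII", sub = "")

# Collapse any extra whitespace left behind by removed characters
clean_text <- gsub("\\s+", " ", clean_text)
```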
We found that 2-grams can provide important information for predicting combinations of words, and we will keep exploring ways to include them in our algorithm.
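As a first idea of how 2-grams could feed the prediction algorithm, one could tabulate 2-gram frequencies and look up the most frequent continuation of a given word. The sketch below is a very simplified frequency-based lookup, not the final algorithm; `toks_clean` and `predict_next()` are assumptions carried over from the earlier sketches.

```r
library(quanteda)

# 2-gram frequency table from the cleaned tokens
bigram_dfm  <- dfm(tokens_ngrams(toks_clean, n = 2, concatenator = " "))
bigram_freq <- topfeatures(bigram_dfm, n = nfeat(bigram_dfm))

# Predict the most likely next word given the previous one
predict_next <- function(word, freq = bigram_freq) {
  hits <- freq[startsWith(names(freq), paste0(word, " "))]
  if (length(hits) == 0) return(NA_character_)
  sub("^\\S+ ", "", names(hits)[which.max(hits)])
}

predict_next("good")   # e.g. might return "morning"
```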
Benoit, Kenneth, Kohei Watanabe, Haiyan Wang, Paul Nulty, Adam Obeng, Stefan Müller, and Akitaka Matsuo. (2018) “quanteda: An R package for the quantitative analysis of textual data”. Journal of Open Source Software. 3(30), 774. https://doi.org/10.21105/joss.00774.