Summary

This report describes our exploratory analysis and our goals for the app and algorithm we must develop for the Data Science Capstone Project. So far, we have downloaded the data and loaded it into R; partitioned each data set into a training and a testing set; produced this basic report of summary statistics for the training sets; and identified some interesting findings. We also outline our plans, informed by the data, for building the prediction algorithm and Shiny app. Most of the manipulation of text objects was done using the quanteda package.

Exploratory Data Analysis

We loaded each of the data sets (en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt) and partitioned each into a training set containing 70% of the documents and a testing set containing the rest. This exploratory data analysis is based solely on the training sets.
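A 70/30 partition of this kind can be sketched in base R as follows. This is a minimal illustration, not the code used for the report: the function name `split_documents`, the seed, and the toy documents are assumptions, and the real input would be the lines of one of the en_US files (for example, read with `readLines()`).

```r
# Hypothetical helper: split a character vector of documents into a
# training and a testing set. The 70% share comes from the report; the
# function name and the seed are assumptions made for this sketch.
split_documents <- function(docs, train_share = 0.7, seed = 1234) {
  set.seed(seed)                                   # reproducible split
  n_train <- round(train_share * length(docs))
  train_idx <- sample(seq_along(docs), size = n_train)
  test_idx <- setdiff(seq_along(docs), train_idx)  # everything not sampled
  list(train = docs[train_idx], test = docs[test_idx])
}

# Toy stand-in for one of the data sets: ten short documents.
docs <- paste("document", 1:10)
parts <- split_documents(docs)

length(parts$train)  # 7 documents for training
length(parts$test)   # 3 documents for testing
```

Fixing the seed means the same documents land in the training set on every run, so the summary statistics can be reproduced.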

The analysis of each training set consists of: 1) counting the number of documents in each file; 2) counting the number of distinct words (features) in each file; 3) identifying the most frequently used words (features), that is, those that appear the greatest number of times in the data set; 4) presenting a histogram of the top 500 words and their frequency of use in each data set; and 5) presenting histograms of the 500 most used 2-grams.
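The counting steps listed above can be illustrated with a small base R sketch. The report itself relied on the quanteda package; the simplified tokenizer below is an assumption standing in for it and will not reproduce quanteda's tokenization rules exactly. The toy documents are also illustrative.

```r
# Toy stand-in for a training set (illustrative data only).
docs <- c("I love this!", "I love R.", "Do you love R?")

# Simplified tokenizer (an assumption, not quanteda's rules): lower-case
# the text, then extract runs of letters, digits, and apostrophes, or
# single punctuation characters.
tokenize <- function(x) {
  x <- tolower(x)
  regmatches(x, gregexpr("[[:alnum:]']+|[[:punct:]]", x))
}

toks <- tokenize(docs)

n_docs <- length(docs)                                 # step 1: documents
freq <- sort(table(unlist(toks)), decreasing = TRUE)   # count per feature
n_features <- length(freq)                             # step 2: distinct features

head(freq, 5)                                          # step 3: top features
```

Because punctuation characters are kept as tokens here, they are counted as features, which matches the way punctuation shows up in the frequency listings of this report.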

## [1] "Total number of twitter messages: 1652104"
## [1] "Total number of different features(words) used on twitter: 370796"

## [1] "Top 50 Features used on twitter"
##       .       !     the      to       ,       i       a     you     and       ? 
## 1752837  895250  654782  551318  521115  505275  426276  382928  306046  297649 
##       :     for      in      is      of      it      my       "      on    that 
##  279324  269652  264337  251743  251375  205887  203865  194942  193847  164636 
##      me       )      be      at    with    your    have    this      so     are 
##  141207  139604  131319  130656  121236  119888  118101  114697  113949  111171 
##    just      we     i'm     but     not    like     all     was       -     out 
##  105866   93350   90560   89410   86755   85162   84666   82339   81466   80223 
##      up     get    what      do      if    love       &    good       (    will 
##   79399   78533   78283   74596   74486   74266   72736   70418   68868   66208

## [1] "Total number of blogs: 629502"
## [1] "Total number of different features(words) used on blogs: 364125"

## [1] "Top 50 Features used on blogs"
##       .     the       ,     and      to       a      of       i    \200      in 
## 1450100 1294915 1241895  764244  746255  627344  611617  538141  519822  414571 
##    that      is      it    \231     for     you    with     was      on      my 
##  321331  301941  280412  271451  254309  207227  200520  193914  191178  189387 
##    this       â      as    have       "      be     but     are       !       ) 
##  180573  163373  156198  153203  147920  145964  143509  135447  133463  130158 
##      we     not       s      at       (      so    from      he      or     all 
##  129643  121402  120891  120079  119433  115533  103335  101274  100936  100779 
##    they      me       :       ?     one      by   about    will     his      an 
##   97019   96978   95768   94775   87133   85608   80353   78911   76727   76384

## [1] "Total number of news articles: 707169"
## [1] "Total number of different features(words) used on news: 331973"

## [1] "Top 50 Features used on news"
##       .     the       ,      to     and       a      of       "      in     for 
## 1390797 1379866 1378105  631095  619268  612755  540154  524623  471981  245411 
##    that      is      on    \200    with    said     was      he      it      at 
##  242650  198442  186517  182266  177997  175615  160178  159256  152153  148777 
##      as     his       i    from      be       â     but    have     are      by 
##  131272  110345  106782  106725  106505  105521  105426  100234   97247   92737 
##       :      an    this     has     not    they     who    will       )       ( 
##   92543   85995   85193   84557   78652   78387   76492   75945   75737   75266 
##       $    \231      or      we     you       -   about    more   their     had 
##   72853   72226   69400   67270   66694   65397   62845   61678   60443   58418

Results and Discussion

As a general rule, we found that punctuation marks are heavily used in all three data sets: they are among the most frequent features in each of the texts analyzed. We believe punctuation will add little value to the prediction algorithm, so we will continue the analysis without punctuation characters, starting with the 2-grams presented above. We also need to remove profanity from our data sets. Finally, we believe stemming the words would reduce prediction accuracy, so we will not use it.
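The two filtering steps described here, dropping punctuation and removing profanity, could look roughly like this in base R. The token vector and the one-word `banned` list are placeholders chosen for illustration; a real filter would load a curated profanity list, and the report's own pipeline may differ.

```r
# Tokens as produced by some tokenizer (illustrative values).
toks <- c("well", ",", "that", "darn", "game", "was", "good", "!")

# Placeholder list (an assumption): a real filter would use a curated
# profanity word list instead of this single mild example.
banned <- c("darn")

# A token counts as punctuation if it consists only of punctuation
# characters, so contractions such as "i'm" are kept.
is_punct <- grepl("^[[:punct:]]+$", toks)

clean <- toks[!is_punct & !(toks %in% banned)]
clean
```

One design point to keep in mind: deleting a token from the middle of a sentence makes its neighbours adjacent, which creates 2-grams that never occurred in the original text. Marking removed positions instead of dropping them is one way to avoid that.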

We found a number of issues in each data set. The twitter data set requires deeper cleaning, as it contains abbreviations, misspelled words, and uncommon characters. The blogs and news data sets require the removal of special characters such as trademark symbols (TM), superscripts, and foreign-language symbols.
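One blunt way to strip special characters is to keep only ASCII code points, as in the sketch below. The helper name `strip_non_ascii` and the sample strings are assumptions. This approach also drops legitimate accented letters, so it may be too aggressive for some uses.

```r
# Hypothetical helper: drop every character outside the ASCII range.
strip_non_ascii <- function(x) {
  vapply(x, function(s) {
    cp <- utf8ToInt(s)           # Unicode code points of the string
    intToUtf8(cp[cp < 128L])     # rebuild the string from ASCII points only
  }, character(1), USE.NAMES = FALSE)
}

# Illustrative inputs: a trademark sign, an accented letter, a superscript.
x <- c("Brand\u2122 name", "caf\u00e9 x\u00b2", "plain text")
strip_non_ascii(x)
```

Working on code points rather than raw bytes removes a multi-byte character as a whole, so no partial byte sequences are left behind in the text.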

We found that 2-grams can provide important information for predicting combinations of words, and we will keep exploring ways to include them in our algorithm.
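To make the idea concrete, the sketch below builds 2-gram counts from tokenized documents and uses them to suggest the most frequent follower of a given word. It is only one simple way 2-grams might feed a predictor, not the algorithm planned for the app; the function names `make_bigrams` and `predict_next` and the toy documents are assumptions.

```r
# Build 2-grams by pairing each token with its successor.
make_bigrams <- function(toks) {
  if (length(toks) < 2) return(character(0))
  paste(head(toks, -1), tail(toks, -1))
}

# Toy tokenized documents (illustrative data only).
docs <- list(c("i", "love", "you"),
             c("i", "love", "r"),
             c("i", "love", "you", "too"))

# Building 2-grams per document avoids pairs that span two documents.
bigrams <- unlist(lapply(docs, make_bigrams))
bigram_freq <- sort(table(bigrams), decreasing = TRUE)

# Hypothetical predictor: return the word that most often follows `word`,
# or NA when the word was never seen at the start of a 2-gram.
predict_next <- function(word, bigram_freq) {
  prefix <- paste0(word, " ")
  hits <- bigram_freq[startsWith(names(bigram_freq), prefix)]
  if (length(hits) == 0) return(NA_character_)
  substring(names(hits)[1], nchar(prefix) + 1)   # strip the prefix
}

predict_next("love", bigram_freq)
```

In this toy data "love you" occurs twice and "love r" once, so the suggestion for "love" is "you". A word that was never seen gets `NA`, which a fuller algorithm would need to handle, for example by falling back to overall word frequencies.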

Quanteda Package

Benoit, Kenneth, Kohei Watanabe, Haiyan Wang, Paul Nulty, Adam Obeng, Stefan Müller, and Akitaka Matsuo. (2018) “quanteda: An R package for the quantitative analysis of textual data”. Journal of Open Source Software. 3(30), 774. https://doi.org/10.21105/joss.00774.