Tidyverse vs fivethirtyeight Recipes Sergio

The fivethirtyeight R Package

The objective of this homework is to generate a “recipe” for the Tidyverse using a FiveThirtyEight (538) dataset and to collaborate using GitHub. I would like to take a slightly different approach: explaining a package that makes 538 datasets faster and easier to work with, and that is already a tidyverse recipe in itself, while comparing the two approaches side by side.

FiveThirtyEight.com is a data-driven journalism website founded by Nate Silver and owned by Disney/ESPN that reports on politics, economics, sports, and other current events. The data used in many of its articles is available in its GitHub repository at https://github.com/fivethirtyeight/data. The fivethirtyeight R package goes one step further by making this data and its corresponding documentation easily accessible. The homepage for the fivethirtyeight R package can be found at https://fivethirtyeight-r.netlify.com/.

Basic Usage

I’ll be using a couple of datasets from 538, identified in each section. First, if you haven’t already, install the fivethirtyeight package in your R environment (for example, RStudio). The first dataset is bechdel, the Bechdel test dataset, which represents “How women are represented in movies.”
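The install step can be made repeatable with a small helper. This is only a sketch: ensure_installed is a hypothetical name introduced here, and it assumes a CRAN mirror is reachable when the package is missing.

```r
# Hypothetical helper: install a package from CRAN only if it is not already available.
ensure_installed <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # assumes a CRAN mirror is configured and reachable
  }
  # Return (invisibly) whether the package can now be loaded.
  invisible(requireNamespace(pkg, quietly = TRUE))
}
```

Calling ensure_installed("fivethirtyeight") once before the library() call below would then be enough.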

library(fivethirtyeight)

## Warning: package 'fivethirtyeight' was built under R version 3.5.1

#Showing dataset bechdel
head(bechdel)

##   year      imdb            title            test clean_test binary
## 1 2013 tt1711425        21 & Over          notalk     notalk   FAIL
## 2 2012 tt1343727         Dredd 3D     ok-disagree         ok   PASS
## 3 2013 tt2024544 12 Years a Slave notalk-disagree     notalk   FAIL
## 4 2013 tt1272878           2 Guns          notalk     notalk   FAIL
## 5 2013 tt0453562               42             men        men   FAIL
## 6 2013 tt1335975         47 Ronin             men        men   FAIL
##      budget domgross  intgross     code budget_2013 domgross_2013
## 1  13000000 25682380  42195766 2013FAIL    13000000      25682380
## 2  45000000 13414714  40868994 2012PASS    45658735      13611086
## 3  20000000 53107035 158607035 2013FAIL    20000000      53107035
## 4  61000000 75612460 132493015 2013FAIL    61000000      75612460
## 5  40000000 95020213  95020213 2013FAIL    40000000      95020213
## 6 225000000 38362475 145803842 2013FAIL   225000000      38362475
##   intgross_2013 period_code decade_code
## 1      42195766           1           1
## 2      41467257           1           1
## 3     158607035           1           1
## 4     132493015           1           1
## 5      95020213           1           1
## 6     145803842           1           1

?bechdel

## starting httpd help server ... done

There are around 107 datasets ready to be used. The package vignette gives an overview:

vignette("fivethirtyeight", package="fivethirtyeight")
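To see which datasets an installed package ships, base R's data() function can be queried directly. The sketch below wraps it in a hypothetical helper, list_datasets, and demonstrates it on the built-in datasets package so that it runs even where fivethirtyeight is not installed.

```r
# Hypothetical helper: return the names and titles of the datasets bundled with a package.
list_datasets <- function(pkg) {
  res <- data(package = pkg)$results  # character matrix with columns "Item" and "Title"
  data.frame(
    name  = res[, "Item"],
    title = res[, "Title"],
    stringsAsFactors = FALSE
  )
}

# Demonstrated on base R's own "datasets" package.
base_sets <- list_datasets("datasets")
head(base_sets)
nrow(base_sets)
```

With the package installed, nrow(list_datasets("fivethirtyeight")) should give the current count, which may differ from the figure quoted above depending on the package version.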

Naming conventions:

One of the first examples is the preprocessing of variable names. Consider the dataset from the article “41% of Fliers Think You’re Rude If You Recline Your Seat”. I will read it from the FiveThirtyEight GitHub data repository in order to work through the Tidyverse steps in parallel with the 538 package.

library(tidyverse)

## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr

## Conflicts with tidy packages ----------------------------------------------

## filter(): dplyr, stats
## lag():    dplyr, stats

url<-"https://raw.githubusercontent.com/fivethirtyeight/data/master/flying-etiquette-survey/flying-etiquette.csv"
flying_raw <- read_csv(url)

## Parsed with column specification:
## cols(
##   .default = col_character(),
##   RespondentID = col_double()
## )

## See spec(...) for full column specifications.

colnames(flying_raw)[1:5]

## [1] "RespondentID"                               
## [2] "How often do you travel by plane?"          
## [3] "Do you ever recline your seat when you fly?"
## [4] "How tall are you?"                          
## [5] "Do you have any children under 18?"

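flying_raw2 <- flying_raw
# Continuing the tidyverse recipe: rename the first five columns to shorter, more manageable names.
# The package data frame orders its columns differently, so these names follow the raw column order.
colnames(flying_raw2)[1:5] <- c("respondent_id", "frequency", "recline_frequency", "height", "children_under_18")
# So this adds several extra steps, and the remaining columns still need the same treatment.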

We contrast this to the corresponding flying data frame in the fivethirtyeight package:

library(fivethirtyeight)
colnames(flying)[1:5]

## [1] "respondent_id"     "gender"            "age"              
## [4] "height"            "children_under_18"

One advantage of the package is that this tidying is already built in, which gives a clean starting point with very little effort.
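Without the package, long survey-question headers can be cleaned programmatically. The following is a minimal base R sketch, assuming a simple rule (lower case, non-alphanumeric runs replaced by underscores). The helper to_snake_case and the toy survey data frame are illustrative only; the package's own names were chosen by its authors and are shorter than anything this rule produces.

```r
# Hypothetical helper: convert free-text column headers to snake_case.
to_snake_case <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z0-9]+", "_", x)  # runs of non-alphanumerics become one underscore
  gsub("^_+|_+$", "", x)           # trim leading and trailing underscores
}

# Toy data frame with headers in the style of the raw survey file (values are made up).
survey <- data.frame(
  "RespondentID" = 1:2,
  "How tall are you?" = c("5'10\"", "6'1\""),
  "Do you have any children under 18?" = c("No", "Yes"),
  check.names = FALSE
)

names(survey) <- to_snake_case(names(survey))
names(survey)
# [1] "respondentid"  "how_tall_are_you"  "do_you_have_any_children_under_18"
```

The result no longer needs backticks, but it is still wordier than hand-picked names such as height or children_under_18.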

For example, consider the following two ggplot() commands, which generate a barplot of the relationship between two categorical variables of interest. Using the raw data requires backticks to access the variables, whereas the package data does not, since its names have been cleaned: white space removed, changed to lower case, etc.

# Using raw data:
ggplot(flying_raw, 
       aes(x = `Do you have any children under 18?`, 
           fill = `In general, is itrude to bring a baby on a plane?`)) +
  geom_bar(position = "fill") +
  labs(x = "Children under 18?", y = "Proportion", fill = "Is it rude?")

## Warning: Removed 189 rows containing non-finite values (stat_count).

# Using fivethirtyeight package data:
ggplot(flying, aes(x = children_under_18, fill = baby)) +
  geom_bar(position = "fill") +
  labs(x = "Children under 18?", y = "Proportion", fill = "Is it rude?")

## Warning: Removed 189 rows containing non-finite values (stat_count).

Date management

As an example of the importance of preprocessing dates, consider the data behind the FiveThirtyEight article “Some People Are Too Superstitious To Have A Baby On Friday The 13th”. We load this data, keep only the rows corresponding to 1999 births, and save the result in a data frame US_births_1999_raw. The raw data is stored in a format that makes it difficult to create a time series plot or, for that matter, to do any date-based manipulation.

url2 <- "https://raw.githubusercontent.com/fivethirtyeight/data/master/births/US_births_1994-2003_CDC_NCHS.csv"
library(tidyverse)
# The file covers 1994-2003, so the object is named accordingly.
US_births_1994_2003_raw <- read_csv(url2)

## Parsed with column specification:
## cols(
##   year = col_integer(),
##   month = col_integer(),
##   date_of_month = col_integer(),
##   day_of_week = col_integer(),
##   births = col_integer()
## )

# Keep only the 1999 rows.
US_births_1999_raw <- US_births_1994_2003_raw[US_births_1994_2003_raw$year == 1999, ]
head(US_births_1999_raw)

## # A tibble: 6 × 5
##    year month date_of_month day_of_week births
##   <int> <int>         <int>       <int>  <int>
## 1  1999     1             1           5   8163
## 2  1999     1             2           6   7637
## 3  1999     1             3           7   7416
## 4  1999     1             4           1  10396
## 5  1999     1             5           2  12004
## 6  1999     1             6           3  11718
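Turning these integer columns into something usable takes a few manual steps. A minimal base R sketch follows, using the six rows printed above as a stand-in for the full file. The weekday coding (1 = Monday) is an assumption inferred from the fact that 1 January 1999, a Friday, is coded 5, and the names raw_head, day_labels and day_label are illustrative.

```r
# Stand-in for the first six raw rows shown above.
raw_head <- data.frame(
  year          = 1999L,
  month         = 1L,
  date_of_month = 1:6,
  day_of_week   = c(5L, 6L, 7L, 1L, 2L, 3L),
  births        = c(8163L, 7637L, 7416L, 10396L, 12004L, 11718L)
)

# Build a proper Date column from the three integer columns.
raw_head$date <- as.Date(
  sprintf("%d-%02d-%02d", raw_head$year, raw_head$month, raw_head$date_of_month)
)

# Map the integer weekday codes to labels (assumption: 1 = Monday ... 7 = Sunday).
day_labels <- c("Mon", "Tues", "Wed", "Thurs", "Fri", "Sat", "Sun")
raw_head$day_label <- factor(
  day_labels[raw_head$day_of_week],
  levels = day_labels, ordered = TRUE
)

raw_head[, c("date", "day_label", "births")]
```

The level ordering chosen here (Monday first) is a guess and may not match the ordered factor used by the package.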

When using the pre-processed US_births_1994_2003 data frame from the fivethirtyeight package, we see that there is a date variable stored as a proper Date, so it can be placed directly on a numeric time axis. Furthermore, the day of the week is given as informative text rather than as values between 1 and 7.

library(fivethirtyeight)
US_births_1999 <- US_births_1994_2003[US_births_1994_2003$year == 1999, ]
head(US_births_1999)

## # A tibble: 6 × 6
##    year month date_of_month       date day_of_week births
##   <int> <int>         <int>     <date>       <ord>  <int>
## 1  1999     1             1 1999-01-01         Fri   8163
## 2  1999     1             2 1999-01-02         Sat   7637
## 3  1999     1             3 1999-01-03         Sun   7416
## 4  1999     1             4 1999-01-04         Mon  10396
## 5  1999     1             5 1999-01-05        Tues  12004
## 6  1999     1             6 1999-01-06         Wed  11718

plot(US_births_1999$date, US_births_1999$births, type = "l",
     xlab = "Date", ylab = "# of births")

Check the anomalous spike in the number of births that occurred a few weeks before October 1st, 1999:

head(US_births_1999[which.max(US_births_1999$births), ])

## # A tibble: 1 × 6
##    year month date_of_month       date day_of_week births
##   <int> <int>         <int>     <date>       <ord>  <int>
## 1  1999     9             9 1999-09-09       Thurs  14540

Tidy Data Format

One of our first experiences was data cleaning and transformation, starting from Wickham’s description of a dataset or data frame as being in tidy format if it satisfies the following criteria:

- Each variable must have its own column.
- Each observation must have its own row.
- Each type of observational unit forms a table.

For example, say we want to create a barplot comparing consumption of beer, spirits, and wine between the United States and France using the drinks dataset (Chalabi 2014). The data is saved in “wide” format, with one column per drink type, so the drink type cannot be mapped to an aesthetic in ggplot() as it stands.

library(tidyverse)
library(fivethirtyeight)
drinks %>% 
  filter(country %in% c("USA", "France"))

## # A tibble: 2 × 5
##   country beer_servings spirit_servings wine_servings
##     <chr>         <int>           <int>         <int>
## 1  France           127             151           370
## 2     USA           249             158            84
## # ... with 1 more variables: total_litres_of_pure_alcohol <dbl>

However, the help file for this dataset, accessible by typing ?drinks in the console, provides the gather() code necessary to convert this data into “tidy” format:

drinks_tidy_US_FR <- drinks %>%
  filter(country %in% c("USA", "France")) %>% 
  gather(type, servings, -c(country, total_litres_of_pure_alcohol))
drinks_tidy_US_FR

## # A tibble: 6 × 4
##   country total_litres_of_pure_alcohol            type servings
##     <chr>                        <dbl>           <chr>    <int>
## 1  France                         11.8   beer_servings      127
## 2     USA                          8.7   beer_servings      249
## 3  France                         11.8 spirit_servings      151
## 4     USA                          8.7 spirit_servings      158
## 5  France                         11.8   wine_servings      370
## 6     USA                          8.7   wine_servings       84
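In tidyr 1.0.0 and later, gather() is superseded by pivot_longer(). The sketch below repeats the reshaping with pivot_longer() and then builds the barplot the example was aiming for. It assumes tidyr >= 1.0.0 and ggplot2 are installed, and it rebuilds the two rows shown above by hand (drinks_us_fr and drinks_long are illustrative names) so that it runs without the fivethirtyeight package.

```r
library(tidyr)
library(ggplot2)

# The two rows printed earlier, rebuilt by hand for a self-contained example.
drinks_us_fr <- data.frame(
  country = c("France", "USA"),
  beer_servings = c(127L, 249L),
  spirit_servings = c(151L, 158L),
  wine_servings = c(370L, 84L),
  total_litres_of_pure_alcohol = c(11.8, 8.7)
)

# Wide to long: one row per country and drink type.
drinks_long <- pivot_longer(
  drinks_us_fr,
  cols = c(beer_servings, spirit_servings, wine_servings),
  names_to = "type",
  values_to = "servings"
)

# With the drink type in its own column, it can be mapped to the x axis.
p <- ggplot(drinks_long, aes(x = type, y = servings, fill = country)) +
  geom_col(position = "dodge") +
  labs(x = "Type of drink", y = "Servings", fill = "Country")
p
```

pivot_longer() returns the rows grouped by country rather than by drink type, so the row order differs from the gather() output above, but the contents are the same.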

Conclusions

The 538 package saves you much of the tidying work straight from the library, with over a hundred datasets available. Is this enough to stop using the tidyverse? The straight answer is no. It streamlines certain tasks, but the “swiss army knife” of tools in the tidyverse still offers really powerful means for data analysis.

Sources

Albert Y. Kim, Chester Ismay, et al. The fivethirtyeight R Package. March 2, 2018.