The objective of this homework is to create a “recipe” for the Tidyverse using a FiveThirtyEight (538) dataset and to collaborate using GitHub. I would like to take a slightly different approach: explaining a package that makes working with 538 datasets faster and more efficient, and that is already a tidyverse recipe in itself, while comparing the two approaches along the way.
FiveThirtyEight.com is a data-driven journalism website founded by Nate Silver and owned by Disney/ESPN that reports on politics, economics, sports, and other current events. The data behind many of its articles is accessible on its GitHub repository page, https://github.com/fivethirtyeight/data. The fivethirtyeight R package goes one step further by making this data and its corresponding documentation easily accessible from within R. The homepage for the fivethirtyeight R package can be found at https://fivethirtyeight-r.netlify.com/.
I’ll be using a couple of datasets from 538, identified in each section. First, if you haven’t already, install the fivethirtyeight package in your R environment. The first dataset is the Bechdel test dataset, which covers “How women are represented in movies.”
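A minimal sketch of the installation step, assuming the package is installed from CRAN with the default repository settings. The `requireNamespace()` guard is only a convenience to avoid reinstalling on every run:

```r
# Install the fivethirtyeight package only if it is not already available.
if (!requireNamespace("fivethirtyeight", quietly = TRUE)) {
  install.packages("fivethirtyeight")
}
```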
library(fivethirtyeight)
## Warning: package 'fivethirtyeight' was built under R version 3.5.1
#Showing dataset bechdel
head(bechdel)
## year imdb title test clean_test binary
## 1 2013 tt1711425 21 & Over notalk notalk FAIL
## 2 2012 tt1343727 Dredd 3D ok-disagree ok PASS
## 3 2013 tt2024544 12 Years a Slave notalk-disagree notalk FAIL
## 4 2013 tt1272878 2 Guns notalk notalk FAIL
## 5 2013 tt0453562 42 men men FAIL
## 6 2013 tt1335975 47 Ronin men men FAIL
## budget domgross intgross code budget_2013 domgross_2013
## 1 13000000 25682380 42195766 2013FAIL 13000000 25682380
## 2 45000000 13414714 40868994 2012PASS 45658735 13611086
## 3 20000000 53107035 158607035 2013FAIL 20000000 53107035
## 4 61000000 75612460 132493015 2013FAIL 61000000 75612460
## 5 40000000 95020213 95020213 2013FAIL 40000000 95020213
## 6 225000000 38362475 145803842 2013FAIL 225000000 38362475
## intgross_2013 period_code decade_code
## 1 42195766 1 1
## 2 41467257 1 1
## 3 158607035 1 1
## 4 132493015 1 1
## 5 95020213 1 1
## 6 145803842 1 1
?bechdel
## starting httpd help server ... done
There are around 107 datasets ready to use. The package vignette gives an overview:
vignette("fivethirtyeight", package="fivethirtyeight")
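To check the dataset count against the installed version of the package, the base R function `data()` can list what a package ships. This is a small sketch that assumes fivethirtyeight is installed; the count will vary by package version:

```r
# List the names of all datasets bundled with the installed package.
datasets_538 <- data(package = "fivethirtyeight")$results[, "Item"]

length(datasets_538)  # number of datasets in the installed version
head(datasets_538)    # first few dataset names
```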
One of the first examples is preprocessing variable names. Consider the dataset from the article “41% of Fliers Think You’re Rude If You Recline Your Seat”. I will read the raw data from the FiveThirtyEight GitHub data repository so that the tidyverse steps can be shown in parallel with the 538 package.
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
url<-"https://raw.githubusercontent.com/fivethirtyeight/data/master/flying-etiquette-survey/flying-etiquette.csv"
flying_raw <- read_csv(url)
## Parsed with column specification:
## cols(
## .default = col_character(),
## RespondentID = col_double()
## )
## See spec(...) for full column specifications.
colnames(flying_raw)[1:5]
## [1] "RespondentID"
## [2] "How often do you travel by plane?"
## [3] "Do you ever recline your seat when you fly?"
## [4] "How tall are you?"
## [5] "Do you have any children under 18?"
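flying_raw2 <- flying_raw
# Continuing the tidyverse recipe: replace the long question text with shorter, more manageable names.
# Only the first five columns are renamed; the short names are our own choice and follow the raw column order printed above.
colnames(flying_raw2)[1:5] <- c("respondent_id", "travel_frequency", "recline_frequency", "height", "children_under_18")
# Repeating this for every column adds several extra steps.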
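As a sketch of a more tidyverse-style alternative, `dplyr::rename()` pairs each new name with the column it replaces, so the result does not depend on column order. The small tibble below is a made-up stand-in for the survey data (its values are illustrative, not taken from the real file), and the short names are our own choice:

```r
library(dplyr)

# Hypothetical stand-in with three of the raw survey column names.
survey_raw <- tibble(
  RespondentID = c(1, 2),
  `How tall are you?` = c("6'3\"", "5'8\""),
  `Do you have any children under 18?` = c("Yes", "No")
)

# rename() takes new_name = old_name pairs; backticks quote names with spaces.
survey_clean <- survey_raw %>%
  rename(
    respondent_id = RespondentID,
    height = `How tall are you?`,
    children_under_18 = `Do you have any children under 18?`
  )

colnames(survey_clean)
```

Because every pair is explicit, a column that moves or goes missing produces an error instead of silently receiving the wrong name.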
We contrast this to the corresponding flying data frame in the fivethirtyeight package:
library(fivethirtyeight)
colnames(flying)[1:5]
## [1] "respondent_id" "gender" "age"
## [4] "height" "children_under_18"
One of the advantages of this package is that the tidying is already embedded in it, which allows a clean start without much effort.
For example, consider the following two ggplot() commands, each generating a barplot of the relationship between two categorical variables of interest. Using the raw data requires backticks to refer to the variables, because their names contain spaces and punctuation. Using the package data does not, since the names have been cleaned: white space removed, text changed to lower case, and so on.
# Using raw data:
ggplot(flying_raw,
aes(x = `Do you have any children under 18?`,
fill = `In general, is itrude to bring a baby on a plane?`)) +
geom_bar(position = "fill") +
labs(x = "Children under 18?", y = "Proportion", fill = "Is it rude?")
## Warning: Removed 189 rows containing non-finite values (stat_count).
# Using fivethirtyeight package data:
ggplot(flying, aes(x = children_under_18, fill = baby)) +
geom_bar(position = "fill") +
labs(x = "Children under 18?", y = "Proportion", fill = "Is it rude?")
## Warning: Removed 189 rows containing non-finite values (stat_count).
As an example of the importance of preprocessing dates, consider the data corresponding to the FiveThirtyEight article “Some People Are Too Superstitious To Have A Baby On Friday The 13th”. We load this data, keep only the rows corresponding to 1999 births, and save the result in a data frame US_births_1999_raw. The raw data is stored in a format that makes it difficult to create a time series plot, or to do any date-based manipulation for that matter.
url2<- "https://raw.githubusercontent.com/fivethirtyeight/data/master/births/US_births_1994-2003_CDC_NCHS.csv"
library(tidyverse)
US_births_1999_2003_raw <- read_csv(url2)
## Parsed with column specification:
## cols(
## year = col_integer(),
## month = col_integer(),
## date_of_month = col_integer(),
## day_of_week = col_integer(),
## births = col_integer()
## )
US_births_1999_raw <- US_births_1999_2003_raw[US_births_1999_2003_raw$year == 1999, ]
head(US_births_1999_raw)
## # A tibble: 6 × 5
## year month date_of_month day_of_week births
## <int> <int> <int> <int> <int>
## 1 1999 1 1 5 8163
## 2 1999 1 2 6 7637
## 3 1999 1 3 7 7416
## 4 1999 1 4 1 10396
## 5 1999 1 5 2 12004
## 6 1999 1 6 3 11718
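With the raw file, the date and weekday label have to be built by hand. The following is a minimal sketch of that step using the six rows printed above. It assumes that day_of_week is coded 1 = Monday through 7 = Sunday, which is inferred from the printed rows rather than from documentation, and the weekday labels are our own choice:

```r
library(dplyr)

# The six rows shown in the head() output above.
births_raw <- tibble(
  year = rep(1999L, 6),
  month = rep(1L, 6),
  date_of_month = 1:6,
  day_of_week = c(5L, 6L, 7L, 1L, 2L, 3L),
  births = c(8163L, 7637L, 7416L, 10396L, 12004L, 11718L)
)

births_clean <- births_raw %>%
  mutate(
    # Combine the three integer columns into a single Date column.
    date = as.Date(sprintf("%d-%02d-%02d", year, month, date_of_month)),
    # Replace the 1-7 codes with ordered text labels (assumes 1 = Monday).
    day_of_week = factor(
      day_of_week,
      levels = 1:7,
      labels = c("Mon", "Tues", "Wed", "Thurs", "Fri", "Sat", "Sun"),
      ordered = TRUE
    )
  )

births_clean
```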
When using the pre-processed US_births_1994_2003 data frame from the fivethirtyeight package, we observe that there is a variable date of class Date, which can be placed directly on a plot axis and used in date arithmetic. Furthermore, the day of the week is indicated with more informative text labels rather than values between 1 and 7.
library(fivethirtyeight)
US_births_1999 <- US_births_1994_2003[US_births_1994_2003$year == 1999, ]
head(US_births_1999)
## # A tibble: 6 × 6
## year month date_of_month date day_of_week births
## <int> <int> <int> <date> <ord> <int>
## 1 1999 1 1 1999-01-01 Fri 8163
## 2 1999 1 2 1999-01-02 Sat 7637
## 3 1999 1 3 1999-01-03 Sun 7416
## 4 1999 1 4 1999-01-04 Mon 10396
## 5 1999 1 5 1999-01-05 Tues 12004
## 6 1999 1 6 1999-01-06 Wed 11718
plot(US_births_1999$date, US_births_1999$births, type = "l",
xlab = "Date", ylab = "# of births")
Check the anomalous spike in the number of births that occurred in September 1999, roughly three weeks before October 1st:
head(US_births_1999[which.max(US_births_1999$births), ])
## # A tibble: 1 × 6
## year month date_of_month date day_of_week births
## <int> <int> <int> <date> <ord> <int>
## 1 1999 9 9 1999-09-09 Thurs 14540
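The same lookup can be written in dplyr style. This is a small sketch using `slice_max()`, which requires dplyr 1.0.0 or later. It runs on the six January 1999 rows printed earlier rather than the full year, so the row it returns is the busiest of those six days, not the September spike:

```r
library(dplyr)

# Six January 1999 rows, as printed by head(US_births_1999) earlier.
births_sample <- tibble(
  date = as.Date("1999-01-01") + 0:5,
  births = c(8163L, 7637L, 7416L, 10396L, 12004L, 11718L)
)

# Keep the single row with the largest number of births.
busiest_day <- births_sample %>%
  slice_max(births, n = 1)

busiest_day
```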
One of our first experiences in this course was data cleaning and transformation, starting with Wickham's definition: a dataset or data frame is in tidy format if it satisfies the following criteria:
- Each variable must have its own column.
- Each observation must have its own row.
- Each type of observational unit forms a table.
For example, say we want to create a barplot comparing consumption of beer, spirits, and wine between the United States and France using the drinks dataset (Chalabi 2014). The data is saved in “wide” format, with one column per drink type, so it cannot be passed directly to ggplot() to map drink type to an aesthetic.
library(tidyverse)
library(fivethirtyeight)
drinks %>%
filter(country %in% c("USA", "France"))
## # A tibble: 2 × 5
## country beer_servings spirit_servings wine_servings
## <chr> <int> <int> <int>
## 1 France 127 151 370
## 2 USA 249 158 84
## # ... with 1 more variables: total_litres_of_pure_alcohol <dbl>
However, the help file for this dataset, accessible by typing ?drinks in the console, provides the gather() code necessary to convert this data into “tidy” format:
drinks_tidy_US_FR <- drinks %>%
filter(country %in% c("USA", "France")) %>%
gather(type, servings, -c(country, total_litres_of_pure_alcohol))
drinks_tidy_US_FR
## # A tibble: 6 × 4
## country total_litres_of_pure_alcohol type servings
## <chr> <dbl> <chr> <int>
## 1 France 11.8 beer_servings 127
## 2 USA 8.7 beer_servings 249
## 3 France 11.8 spirit_servings 151
## 4 USA 8.7 spirit_servings 158
## 5 France 11.8 wine_servings 370
## 6 USA 8.7 wine_servings 84
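In current versions of tidyr (1.0.0 and later), `gather()` is superseded by `pivot_longer()`. The sketch below shows the equivalent reshaping plus the barplot the example is aiming for. To stay self-contained it rebuilds the two rows printed above instead of loading the package, and the plot labels are our own choice:

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# The France and USA rows from the drinks output shown above.
drinks_us_fr <- tibble(
  country = c("France", "USA"),
  beer_servings = c(127L, 249L),
  spirit_servings = c(151L, 158L),
  wine_servings = c(370L, 84L),
  total_litres_of_pure_alcohol = c(11.8, 8.7)
)

# Wide to long: one row per country and drink type.
drinks_long <- drinks_us_fr %>%
  pivot_longer(
    cols = ends_with("_servings"),
    names_to = "type",
    values_to = "servings"
  )

# Side-by-side bars comparing the two countries for each drink type.
drinks_plot <- ggplot(drinks_long, aes(x = type, y = servings, fill = country)) +
  geom_col(position = "dodge") +
  labs(x = "Alcohol type", y = "Servings", fill = "Country")

drinks_plot
```

Note that `pivot_longer()` orders the result by original row (all France rows, then all USA rows), whereas `gather()` orders it by gathered column. The contents are the same.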
The 538 package lets you skip much of the tidying process by providing cleaned data directly from the library, with over a hundred datasets available. Is this enough to stop using the tidyverse? The straight answer is no. The package streamlines certain tasks, but the tidyverse remains the “Swiss Army knife” of tools and still offers powerful means for data analysis.
Albert Y. Kim, Chester Ismay, et al. The fivethirtyeight R Package. March 2, 2018.