DATA 607 - TIDYVERSE-PART 1

Usage

I’ll be using a couple of datasets from FiveThirtyEight, identified in each section. First, if you haven’t already, install the fivethirtyeight package in your R environment. The first dataset is the airline safety dataset behind the article “Should Travelers Avoid Flying Airlines That Have Had Crashes in the Past?”

library(tidyverse)        # read_csv(), glimpse(), gather(), spread(), str_sub(), ggplot()
library(fivethirtyeight)

Airline Data

These data are courtesy of fivethirtyeight and relate to their article titled “Should Travelers Avoid Flying Airlines That Have Had Crashes in the Past?”

Loading the Data

The data are stored on fivethirtyeight’s GitHub, so we will load them directly from that site:

# Load from the GitHub URL; col_types gives one letter per column (c = character, d = double, i = integer)
airline <- read_csv(url("https://raw.githubusercontent.com/fivethirtyeight/data/master/airline-safety/airline-safety.csv"), col_types = "cdiiiiii")
# Look at the data
glimpse(airline)

## Observations: 56
## Variables: 8
## $ airline                <chr> "Aer Lingus", "Aeroflot*", "Aerolineas ...
## $ avail_seat_km_per_week <dbl> 320906734, 1197672318, 385803648, 59687...
## $ incidents_85_99        <int> 2, 76, 6, 3, 2, 14, 2, 3, 5, 7, 3, 21, ...
## $ fatal_accidents_85_99  <int> 0, 14, 0, 1, 0, 4, 1, 0, 0, 2, 1, 5, 0,...
## $ fatalities_85_99       <int> 0, 128, 0, 64, 0, 79, 329, 0, 0, 50, 1,...
## $ incidents_00_14        <int> 0, 6, 1, 5, 2, 6, 4, 5, 5, 4, 7, 17, 1,...
## $ fatal_accidents_00_14  <int> 0, 1, 0, 0, 0, 2, 1, 1, 1, 0, 0, 3, 0, ...
## $ fatalities_00_14       <int> 0, 88, 0, 0, 0, 337, 158, 7, 88, 0, 0, ...

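We have 8 variables in the raw data. Looking at them, however, we see that the data are not in a tidy format. Specifically, each of the last six column names combines a variable (incidents, fatal accidents, or fatalities) with a year range (85_99 or 00_14). The year range is part of what identifies an observation, so it belongs in its own column rather than in the column names.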

Tidy Data

We can correct this quite easily using the tidyr package included as part of the tidyverse. Specifically, we want to use the gather function which, as its name implies, gathers columns into rows.

air <- airline %>% gather(key = "measure", value = "val", -airline, -avail_seat_km_per_week)
glimpse(air)

## Observations: 336
## Variables: 4
## $ airline                <chr> "Aer Lingus", "Aeroflot*", "Aerolineas ...
## $ avail_seat_km_per_week <dbl> 320906734, 1197672318, 385803648, 59687...
## $ measure                <chr> "incidents_85_99", "incidents_85_99", "...
## $ val                    <int> 2, 76, 6, 3, 2, 14, 2, 3, 5, 7, 3, 21, ...

The above statement takes the columns we select (here, all except airline and avail_seat_km_per_week) and transforms them so that each column’s name is placed in a new “key” column named measure and its value is placed in a new column named val.
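In tidyr 1.0.0 and later, gather() is superseded by pivot_longer(). As a minimal sketch of the same reshaping, assuming a recent tidyr and using a small made-up table (the toy object and its values are hypothetical, not the airline data):

```r
library(tidyr)
library(tibble)

# Toy table shaped like the airline data; values are for illustration only
toy <- tibble(
  airline = c("A", "B"),
  avail_seat_km_per_week = c(100, 200),
  incidents_85_99 = c(2L, 5L),
  incidents_00_14 = c(0L, 1L)
)

# Move every column except the two identifiers into measure/val pairs
toy_long <- pivot_longer(
  toy,
  cols = -c(airline, avail_seat_km_per_week),
  names_to = "measure",
  values_to = "val"
)
toy_long
```

The result has the same columns as the gather() call, although pivot_longer() orders the rows by original row rather than by gathered column.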

However, we’re not exactly where we want to be: incidents, fatal accidents, and fatalities are all variables and should be in their own columns.

# First split years away from the variables
air$year <- str_sub(air$measure,-5,-1)
# Keep the leading letters/underscores; the lookahead drops the underscore before the year digits
air$measure <- str_extract(air$measure, "[a-z_]+(?![0-9])")
# Now "spread" those variables into their own columns
airTidy <- spread(air, measure, val)
glimpse(airTidy)

## Observations: 112
## Variables: 6
## $ airline                <chr> "Aer Lingus", "Aer Lingus", "Aeroflot*"...
## $ avail_seat_km_per_week <dbl> 320906734, 320906734, 1197672318, 11976...
## $ year                   <chr> "00_14", "85_99", "00_14", "85_99", "00...
## $ fatal_accidents        <int> 0, 0, 1, 14, 0, 0, 0, 1, 0, 0, 2, 4, 1,...
## $ fatalities             <int> 0, 0, 88, 128, 0, 0, 0, 64, 0, 0, 337, ...
## $ incidents              <int> 0, 2, 6, 76, 1, 6, 5, 3, 2, 2, 6, 14, 4...

Now our dataset appears to be tidy. Each row is an observation (an airline in a given year range) and each column is a variable (available seat kilometers per week, fatal accidents, fatalities, and incidents).
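The gather, split, and spread steps can also be collapsed into a single call. The following is a sketch, not the method used above: it assumes tidyr 1.0.0 or later and uses pivot_longer() with the special ".value" name on a made-up table (toy and its values are hypothetical). The regular expression in names_pattern is my own choice for splitting names such as fatal_accidents_85_99.

```r
library(tidyr)
library(tibble)

# Toy table shaped like the airline data; values are for illustration only
toy <- tibble(
  airline = c("A", "B"),
  avail_seat_km_per_week = c(100, 200),
  incidents_85_99 = c(2L, 5L),
  fatal_accidents_85_99 = c(0L, 1L),
  incidents_00_14 = c(0L, 1L),
  fatal_accidents_00_14 = c(0L, 0L)
)

# First capture group -> output column name (".value"),
# second capture group -> value of the new year column
toy_tidy <- pivot_longer(
  toy,
  cols = -c(airline, avail_seat_km_per_week),
  names_to = c(".value", "year"),
  names_pattern = "(.+)_(\\d{2}_\\d{2})$"
)
toy_tidy
```

Each airline ends up with one row per year range and one column per measure, which is the same shape as airTidy.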

Fliers Data

These data are courtesy of fivethirtyeight and relate to their article titled “41 Percent of Fliers Say It’s Rude To Recline Your Airplane Seat”.

Load Data

url<-"https://raw.githubusercontent.com/fivethirtyeight/data/master/flying-etiquette-survey/flying-etiquette.csv"
flying_raw <- read_csv(url)

## Parsed with column specification:
## cols(
##   .default = col_character(),
##   RespondentID = col_double()
## )

## See spec(...) for full column specifications.

colnames(flying_raw)[1:5]

## [1] "RespondentID"                               
## [2] "How often do you travel by plane?"          
## [3] "Do you ever recline your seat when you fly?"
## [4] "How tall are you?"                          
## [5] "Do you have any children under 18?"

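flying_raw2 <- flying_raw
# The tidyverse recipe: replace the long survey questions with shorter, more manageable names.
# Only the first five columns are renamed. The names describe raw columns 1-5, which are
# ordered differently from the fivethirtyeight package's flying dataset.
colnames(flying_raw2)[1:5] <- c("respondent_id", "frequency", "recline_frequency", "height", "children_under_18")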

New Column Names

colnames(flying)[1:5]

## [1] "respondent_id"     "gender"            "age"              
## [4] "height"            "children_under_18"
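Renaming by position is easy to get wrong when the column order differs between sources. As a hedged alternative, the sketch below renames by name with dplyr’s rename(), using a made-up three-column tibble (toy_raw and its values are hypothetical; only the question text is taken from the raw column names shown earlier):

```r
library(dplyr)
library(tibble)

# Toy stand-in for the raw survey; values are for illustration only
toy_raw <- tibble(
  RespondentID = c(1, 2),
  `How tall are you?` = c("5 ft 8 in", "6 ft 0 in"),
  `Do you have any children under 18?` = c("No", "Yes")
)

# rename(new_name = old_name); backticks are needed for names with spaces
toy_clean <- rename(
  toy_raw,
  respondent_id = RespondentID,
  height = `How tall are you?`,
  children_under_18 = `Do you have any children under 18?`
)
names(toy_clean)
```

Because each new name is tied to an explicit old name, a column that moves position cannot be mislabeled silently.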

Graphs

For example, consider the following two ggplot() commands, which generate a bar plot of the relationship between two categorical variables of interest. Using the raw data requires backticks to refer to the variables, because their names are full survey questions containing spaces and punctuation. Using the fivethirtyeight package data does not, since the names have been cleaned: white space removed, converted to lower case, and so on.

# Using raw data:
ggplot(flying_raw, 
       aes(x = `Do you have any children under 18?`, 
           fill = `In general, is itrude to bring a baby on a plane?`)) +
  geom_bar(position = "fill") +
  labs(x = "Children under 18?", y = "Proportion", fill = "Is it rude?")

Children Under 18 Graphs

# Using fivethirtyeight package data:
ggplot(flying, aes(x = children_under_18, fill = baby)) +
  geom_bar(position = "fill") +
  labs(x = "Children under 18?", y = "Proportion", fill = "Is it rude?")
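With position = "fill", each bar is rescaled so that its segments show within-group proportions. To see the numbers behind such a plot, the proportions can be computed directly. This is a sketch on a made-up table (toy and its values are hypothetical, not survey results), assuming dplyr is available:

```r
library(dplyr)
library(tibble)

# Toy responses; values are for illustration only
toy <- tibble(
  children_under_18 = c("Yes", "Yes", "No", "No", "No", "No"),
  baby = c("No", "Somewhat", "No", "Very", "Very", "Somewhat")
)

# Count each combination, then divide by the total within each x-axis group
props <- toy %>%
  count(children_under_18, baby) %>%
  group_by(children_under_18) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()
props
```

Within each value of children_under_18 the prop column sums to 1, which matches the full-height bars that geom_bar(position = "fill") draws.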