DataTransformation

Introduction

In this workshop we will do some of the exercises from Chapter 5 of R4DS.

Exercises from 5.2.4

Use a separate code block for each exercise.

For example: 1. Find all flights that had an arrival delay of two or more hours.

flights %>%
  filter(arr_delay >= 120) %>%
  ggplot() +
  geom_histogram(mapping = aes(x = arr_delay), binwidth = 5)

Find all United Airlines flights to “IAH”. (Note: this is not exactly one of the book exercises.)

flights %>%
  filter(dest == "IAH")

## # A tibble: 7,198 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      623            627        -4      933            932
##  4  2013     1     1      728            732        -4     1041           1038
##  5  2013     1     1      739            739         0     1104           1038
##  6  2013     1     1      908            908         0     1228           1219
##  7  2013     1     1     1028           1026         2     1350           1339
##  8  2013     1     1     1044           1045        -1     1352           1351
##  9  2013     1     1     1114            900       134     1447           1222
## 10  2013     1     1     1205           1200         5     1503           1505
## # … with 7,188 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
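
The filter above restricts only the destination, so the result includes every carrier that flies to IAH. A minimal sketch of adding the carrier condition is shown here, assuming United is coded as "UA" in the carrier column. `toy_flights` is a small made-up tibble standing in for flights, not real data.

```r
library(dplyr)

# Hypothetical stand-in for `flights`, just large enough to show the effect.
toy_flights <- tibble(
  carrier = c("UA", "AA", "UA", "DL"),
  dest    = c("IAH", "IAH", "ORD", "IAH")
)

# Comma-separated conditions inside filter() are combined with AND.
ua_to_iah <- toy_flights %>%
  filter(carrier == "UA", dest == "IAH")

ua_to_iah
```

The same two conditions can be applied to flights directly.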

Find all flights that arrived more than two hours late, but didn’t leave late.

flights %>%
  filter(arr_delay > 120, dep_delay == 0)

## # A tibble: 3 x 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013    10     7     1350           1350         0     1736           1526
## 2  2013     5    23     1810           1810         0     2208           2000
## 3  2013     7     1      905            905         0     1443           1223
## # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
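
Note that dep_delay == 0 keeps only flights that left exactly on schedule. If “didn’t leave late” is meant to include early departures as well, the condition would be dep_delay <= 0. The sketch below compares the two on a made-up tibble (`toy_flights` is hypothetical, with delays in minutes).

```r
library(dplyr)

# Hypothetical stand-in for `flights`; delays are in minutes.
toy_flights <- tibble(
  dep_delay = c(-5, 0, 10, NA),
  arr_delay = c(130, 125, 200, 150)
)

# dep_delay == 0 keeps only flights that left exactly on schedule.
on_schedule <- toy_flights %>% filter(arr_delay > 120, dep_delay == 0)

# dep_delay <= 0 also keeps flights that left early.
# Rows where dep_delay is NA are dropped by filter() in both cases.
not_late <- toy_flights %>% filter(arr_delay > 120, dep_delay <= 0)

nrow(on_schedule)  # 1
nrow(not_late)     # 2
```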

Find all flights that departed between midnight and 6am (inclusive).

flights %>%
  filter(dep_time >= 0000, dep_delay <= 0600)

## # A tibble: 328,481 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # … with 328,471 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
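
The second condition above tests dep_delay (minutes of delay) rather than dep_time, which is why nearly every flight is returned. A sketch of a filter on dep_time alone follows, assuming times are stored as HHMM integers and midnight is stored as 2400 rather than 0. `toy_flights` is a hypothetical stand-in for flights.

```r
library(dplyr)

# Hypothetical departure times in HHMM format, including a missing value.
toy_flights <- tibble(dep_time = c(517L, 600L, 601L, 2359L, 2400L, NA))

# Keep departures at or before 6:00, plus departures at exactly midnight.
early_morning <- toy_flights %>%
  filter(dep_time <= 600 | dep_time == 2400)

early_morning$dep_time
```

Because midnight is assumed to be 2400, a plain dep_time <= 600 would miss those flights, which is the reason for the second clause.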

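Why is NA ^ 0 not missing? Why is NA | TRUE not missing? Why is FALSE & NA not missing? Can you figure out the general rule? (NA * 0 is a tricky counterexample!) How does this differ from normal practice in Mathematics?

The general rule is that an operation on NA is not missing whenever the result would be the same for every value the NA could stand for. NA ^ 0 returns 1 because any number raised to the power 0 is 1. TRUE | NA returns TRUE and FALSE & NA returns FALSE because the known operand already decides the outcome, while FALSE | NA and TRUE & NA return NA because the outcome depends on the unknown value. NA * 0 is the counterexample: in ordinary arithmetic anything times 0 is 0, but in R the unknown value could be Inf, and Inf * 0 is NaN, so the result stays NA.
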
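
The expressions in this question can be checked directly at the console. This sketch uses only base R.

```r
# Results that do not depend on the unknown value are not missing.
NA^0          # 1: any number to the power 0 is 1
NA | TRUE     # TRUE: (anything OR TRUE) is TRUE
FALSE & NA    # FALSE: (FALSE AND anything) is FALSE

# Results that do depend on the unknown value stay missing.
NA & TRUE     # NA
NA | FALSE    # NA

# NA * 0 stays missing because the product is not 0 for every possible value.
NA * 0        # NA
Inf * 0       # NaN
```
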
Find the 5 most delayed flights. (Not exactly the same as the book.)

arrange(flights, desc(arr_delay))

## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     9      641            900      1301     1242           1530
##  2  2013     6    15     1432           1935      1137     1607           2120
##  3  2013     1    10     1121           1635      1126     1239           1810
##  4  2013     9    20     1139           1845      1014     1457           2210
##  5  2013     7    22      845           1600      1005     1044           1815
##  6  2013     4    10     1100           1900       960     1342           2211
##  7  2013     3    17     2321            810       911      135           1020
##  8  2013     7    22     2257            759       898      121           1026
##  9  2013    12     5      756           1700       896     1058           2020
## 10  2013     5     3     1133           2055       878     1250           2215
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
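
The call above sorts every row. To keep only the five most delayed flights, the sorted result can be truncated. A minimal sketch on made-up data (`toy_flights` is hypothetical):

```r
library(dplyr)

# Hypothetical stand-in for `flights` with one missing delay.
toy_flights <- tibble(
  flight    = 1:7,
  arr_delay = c(12, 300, NA, 45, 1272, 90, 7)
)

# Sort from most to least delayed (NAs go last), then keep the first 5 rows.
top5 <- toy_flights %>%
  arrange(desc(arr_delay)) %>%
  head(5)

top5
```

In dplyr 1.0.0 and later, slice_max(arr_delay, n = 5) is an alternative, although by default it keeps ties and so may return more than five rows.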

Using a helper function, extract all columns concerning time.

flights %>% select(contains("time"))

## # A tibble: 336,776 x 6
##    dep_time sched_dep_time arr_time sched_arr_time air_time time_hour          
##       <int>          <int>    <int>          <int>    <dbl> <dttm>             
##  1      517            515      830            819      227 2013-01-01 05:00:00
##  2      533            529      850            830      227 2013-01-01 05:00:00
##  3      542            540      923            850      160 2013-01-01 05:00:00
##  4      544            545     1004           1022      183 2013-01-01 05:00:00
##  5      554            600      812            837      116 2013-01-01 06:00:00
##  6      554            558      740            728      150 2013-01-01 05:00:00
##  7      555            600      913            854      158 2013-01-01 06:00:00
##  8      557            600      709            723       53 2013-01-01 06:00:00
##  9      557            600      838            846      140 2013-01-01 06:00:00
## 10      558            600      753            745      138 2013-01-01 06:00:00
## # … with 336,766 more rows

Currently dep_time and sched_dep_time are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight. (Hint: you may need to write a function and use modular arithmetic, as I mentioned last week.)

t <- 1357
t%/%100

## [1] 13

t%%100

## [1] 57

time2min <- function(t) {t %/% 100 * 60 + t %% 100}
time2min(1357)

## [1] 837
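
One edge case worth considering: if midnight is stored as 2400, time2min() maps it to 1440 rather than 0. A possible variant wraps the result with a second modulus. The name `time2min_wrapped` is hypothetical and chosen only for this sketch.

```r
# Variant of time2min() that wraps midnight (2400) around to 0 minutes.
# `time2min_wrapped` is a hypothetical name, not part of any package.
time2min_wrapped <- function(t) {
  (t %/% 100 * 60 + t %% 100) %% 1440
}

# The arithmetic is vectorised, so a whole column can be converted at once.
time2min_wrapped(c(1357, 5, 2359, 2400))
```

Because the function is vectorised, it can be used inside mutate(), for example mutate(flights, dep_min = time2min_wrapped(dep_time)), where dep_min is an illustrative column name.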

Compare air_time with arr_time - dep_time. What do you expect to see? What do you see? What do you need to do to fix it?

We converted arr_time and dep_time from HHMM clock times to minutes since midnight. air_time is already recorded in minutes, so it is carried over unchanged as “airtime2”. Even so, “new” (converted arr_time - converted dep_time) does not equal “airtime2”, so there must be extra time somewhere, for example while the plane is sitting on the runway. In addition, arr_time and dep_time are local times, so flights that cross time zones or pass midnight also throw the difference off.

mutate(flights, new = time2min(arr_time) - time2min(dep_time), airtime2 = air_time)

## # A tibble: 336,776 x 21
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # … with 336,766 more rows, and 13 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>,
## #   new <dbl>, airtime2 <dbl>

Explain the use of the pipe function %>%. How does the design of the tidyverse facilitate the use of pipes?

The pipe function allows you to link together multiple functions like filter + select + arrange. Instead of nesting functions inside each other, which can get complicated, you can stack the functions and make the code easier to understand. When you chain functions with the pipe, R carries out the steps in order from top to bottom, so the pipe operator reads like “then” (e.g. filter, then arrange). The tidyverse facilitates this because its functions take the data as their first argument and return a data frame, so the output of one step can be passed straight into the next.

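
To illustrate the difference between nesting and piping, here is a small sketch on made-up data (`toy_flights` is hypothetical). Both versions perform the same two steps.

```r
library(dplyr)

# Hypothetical stand-in for `flights`.
toy_flights <- tibble(
  dest      = c("IAH", "IAH", "ORD", "IAH"),
  arr_delay = c(5, 130, 200, 150)
)

# Nested calls are read from the inside out.
nested <- arrange(filter(toy_flights, dest == "IAH"), desc(arr_delay))

# Piped calls are read top to bottom: filter, then arrange.
piped <- toy_flights %>%
  filter(dest == "IAH") %>%
  arrange(desc(arr_delay))

# Compare the two results.
all.equal(nested, piped)
```
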
Brainstorm at least 5 different ways to assess the typical delay characteristics of a group of flights. Consider the following scenarios:

A flight is 15 minutes early 50% of the time, and 15 minutes late 50% of the time.
A flight is always 10 minutes late.
A flight is 30 minutes early 50% of the time, and 30 minutes late 50% of the time.
99% of the time a flight is on time. 1% of the time it’s 2 hours late.

First you have to define “early” and “late”. In the flights data, arr_delay is measured in minutes and negative values represent early arrivals, so for this question “early” means that arr_delay is negative, “late” means that arr_delay is positive, and “on time” means that arr_delay is equal to 0.

5 strategies:

1) Filter using arr_delay == some value. If the third scenario is true, then filtering for flights that are 30 min early (arr_delay == -30) and for flights that are 30 min late (arr_delay == 30) should give the same number of flights.
2) Use count + filter. We can make the first strategy shorter by just counting the number of flights that are 30 min early. If the third scenario is true, the number of 30 min early flights should be 50% of all of the flights. So if we count and find that the number of 30 min early flights is 50% of all the flights, we will know that the scenario is true without considering or calculating the number of 30 min late flights.
3) Use the summarize function. If the first scenario is true, then the sum (and therefore the mean) of arr_delay must be 0, because 50% of arr_delay = -15 [15 min early] and 50% of arr_delay = +15 [15 min late], and -15 + 15 = 0.
4) Use group_by. If you group by flight, you can summarize each flight separately and use that to find out whether a flight meets the definition of any of the 4 scenarios.
5) Use n() to find the number of records. For the fourth scenario to be true, you would need to find a flight where 1% of the time arr_delay = 120 (2 hrs late). So if you define a variable to equal the number of times arr_delay = 120 and you find that that number is not 1% of all the flights, then you know the fourth scenario isn’t true.
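
As a rough illustration of how summary statistics separate the four scenarios, the sketch below builds 100 made-up delays per scenario and summarizes each group. It follows the convention of the flights data, where a positive arr_delay means a late arrival. The tibble `scenarios`, the labels A to D, and the summary column names are all hypothetical.

```r
library(dplyr)

# Hypothetical delays (minutes) for 100 flights per scenario; positive = late.
scenarios <- bind_rows(
  tibble(scenario = "A", arr_delay = rep(c(-15, 15), 50)),   # 15 early / 15 late
  tibble(scenario = "B", arr_delay = rep(10, 100)),          # always 10 late
  tibble(scenario = "C", arr_delay = rep(c(-30, 30), 50)),   # 30 early / 30 late
  tibble(scenario = "D", arr_delay = c(rep(0, 99), 120))     # 1% two hours late
)

# Five ways to describe the "typical" delay of each group.
delay_summary <- scenarios %>%
  group_by(scenario) %>%
  summarise(
    mean_delay   = mean(arr_delay),
    median_delay = median(arr_delay),
    sd_delay     = sd(arr_delay),
    prop_late    = mean(arr_delay > 0),
    worst_delay  = max(arr_delay)
  )

delay_summary
```

The mean alone cannot tell scenario A from scenario C (both are 0), while the standard deviation can. Scenario D has a small mean but a large worst case. This is why it helps to look at more than one measure.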

Which is more important: arrival delay or departure delay?

Whether “arrival delay” is more important than “departure delay” or vice versa depends on which stakeholder(s) you are considering and what their needs are. Arrival delay is probably more important to most passengers: a departure delay can even help someone who is running late catch the flight, whereas an arrival delay can make them miss a connecting flight or their ride from the airport. What is important to the passenger isn’t so much arrival delay vs. departure delay as whatever causes the least inconvenience. Since the two delays feed off each other (one can cause the other), airline and airport employees likely think that they are equally important because they want to keep things on schedule.

Tara Bhat

7/20/2021
