DataTransformation

Introduction

In this workshop we will do some of the exercises from Chapter 5 of R4DS.

Exercises from 5.2.4

Use a separate code block for each exercise.

?flights

## starting httpd help server ... done

For example: 1. Find all flights that had an arrival delay of two or more hours.

flights %>%
  filter(arr_delay >= 120) # note delays are in minutes

## # A tibble: 10,200 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      811            630       101     1047            830
##  2  2013     1     1      848           1835       853     1001           1950
##  3  2013     1     1      957            733       144     1056            853
##  4  2013     1     1     1114            900       134     1447           1222
##  5  2013     1     1     1505           1310       115     1638           1431
##  6  2013     1     1     1525           1340       105     1831           1626
##  7  2013     1     1     1549           1445        64     1912           1656
##  8  2013     1     1     1558           1359       119     1718           1515
##  9  2013     1     1     1732           1630        62     2028           1825
## 10  2013     1     1     1803           1620       103     2008           1750
## # ... with 10,190 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Find all United Airlines flights to “IAH”. (Note: this is not exactly one of the book exercises.)

filter(flights, dest == "IAH") # filters on destination only; add carrier == "UA" to restrict to United Airlines

## # A tibble: 7,198 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      623            627        -4      933            932
##  4  2013     1     1      728            732        -4     1041           1038
##  5  2013     1     1      739            739         0     1104           1038
##  6  2013     1     1      908            908         0     1228           1219
##  7  2013     1     1     1028           1026         2     1350           1339
##  8  2013     1     1     1044           1045        -1     1352           1351
##  9  2013     1     1     1114            900       134     1447           1222
## 10  2013     1     1     1205           1200         5     1503           1505
## # ... with 7,188 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
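A minimal sketch of combining a carrier condition with a destination condition, assuming dplyr is installed. The `toy` tibble and the name `ua_to_iah` are made up for illustration; only the `carrier` and `dest` column names and the "UA" carrier code come from the flights data.

```r
library(dplyr)

# Toy stand-in for flights (made-up rows, real column names).
toy <- tibble(carrier = c("UA", "AA", "UA", "DL"),
              dest    = c("IAH", "IAH", "ORD", "ATL"))

# Comma-separated conditions in filter() are combined with AND,
# so only rows that are both United ("UA") and bound for IAH remain.
ua_to_iah <- filter(toy, carrier == "UA", dest == "IAH")
print(ua_to_iah)
```

On the real data the same pattern would be `filter(flights, carrier == "UA", dest == "IAH")`.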

Find all flights that arrived more than two hours late, but didn’t leave late.

filter(flights, arr_delay > 120, dep_delay <= 0)

## # A tibble: 29 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1    27     1419           1420        -1     1754           1550
##  2  2013    10     7     1350           1350         0     1736           1526
##  3  2013    10     7     1357           1359        -2     1858           1654
##  4  2013    10    16      657            700        -3     1258           1056
##  5  2013    11     1      658            700        -2     1329           1015
##  6  2013     3    18     1844           1847        -3       39           2219
##  7  2013     4    17     1635           1640        -5     2049           1845
##  8  2013     4    18      558            600        -2     1149            850
##  9  2013     4    18      655            700        -5     1213            950
## 10  2013     5    22     1827           1830        -3     2217           2010
## # ... with 19 more rows, and 11 more variables: arr_delay <dbl>, carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Find all flights that departed between midnight and 6am (inclusive).

summary(flights$dep_time)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       1     907    1401    1349    1744    2400    8255

filter(flights, dep_time <= 600 | dep_time == 2400) # In dep_time, midnight is recorded as 2400, not 0, so 2400 must be included explicitly.

## # A tibble: 9,373 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ... with 9,363 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
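An alternative way to write the same condition uses modular arithmetic: `dep_time %% 2400` maps 2400 to 0 and leaves every other valid time unchanged, so a single comparison is enough. The sketch below checks this on a made-up vector of times in base R; the names `dep_time_toy` and `in_window` are illustrative only.

```r
# Made-up departure times in HHMM format, including the midnight value 2400.
dep_time_toy <- c(1, 559, 600, 601, 2359, 2400)

# 2400 %% 2400 is 0, so midnight passes the <= 600 test without a separate clause.
in_window <- dep_time_toy %% 2400 <= 600
print(in_window)
```

With the flights data this would read `filter(flights, dep_time %% 2400 <= 600)`.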

Why is NA ^ 0 not missing? Why is NA | TRUE not missing? Why is FALSE & NA not missing? Can you figure out the general rule? (NA * 0 is a tricky counterexample!) How does this differ from normal practice in mathematics?
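One way to approach these questions is to treat NA as "some unknown value": when every possible value would give the same answer, R returns that answer instead of NA. The base R lines below are a sketch of that reasoning, not an official answer from the book; the variable names are arbitrary.

```r
# If the result is the same whatever the unknown value is, it is not missing.
na_power <- NA ^ 0      # any number to the power 0 is 1
na_or    <- NA | TRUE   # (anything OR TRUE) is TRUE
na_and   <- FALSE & NA  # (FALSE AND anything) is FALSE

# NA * 0 stays missing: the unknown value could be Inf,
# and Inf * 0 is NaN rather than 0, so the result is not determined.
na_times <- NA * 0

print(na_power)
print(na_or)
print(na_and)
print(na_times)
```

In ordinary arithmetic on real numbers, x * 0 is 0 for every x. R differs because its numeric values also include Inf and NaN.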
Find the five most delayed flights. (Not exactly the same as the book.)

arrange(flights, desc(dep_delay))

## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     9      641            900      1301     1242           1530
##  2  2013     6    15     1432           1935      1137     1607           2120
##  3  2013     1    10     1121           1635      1126     1239           1810
##  4  2013     9    20     1139           1845      1014     1457           2210
##  5  2013     7    22      845           1600      1005     1044           1815
##  6  2013     4    10     1100           1900       960     1342           2211
##  7  2013     3    17     2321            810       911      135           1020
##  8  2013     6    27      959           1900       899     1236           2226
##  9  2013     7    22     2257            759       898      121           1026
## 10  2013    12     5      756           1700       896     1058           2020
## # ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

# The first five rows are the five flights with the longest departure delays.

Using a helper function, extract all columns concerning time.

select(flights, contains("TIME"))

## # A tibble: 336,776 x 6
##    dep_time sched_dep_time arr_time sched_arr_time air_time time_hour          
##       <int>          <int>    <int>          <int>    <dbl> <dttm>             
##  1      517            515      830            819      227 2013-01-01 05:00:00
##  2      533            529      850            830      227 2013-01-01 05:00:00
##  3      542            540      923            850      160 2013-01-01 05:00:00
##  4      544            545     1004           1022      183 2013-01-01 05:00:00
##  5      554            600      812            837      116 2013-01-01 06:00:00
##  6      554            558      740            728      150 2013-01-01 05:00:00
##  7      555            600      913            854      158 2013-01-01 06:00:00
##  8      557            600      709            723       53 2013-01-01 06:00:00
##  9      557            600      838            846      140 2013-01-01 06:00:00
## 10      558            600      753            745      138 2013-01-01 06:00:00
## # ... with 336,766 more rows

Currently dep_time and sched_dep_time are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight. (Hint: you may need to write a function and also use modular arithmetic, as I mentioned last week.)

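mutate(flights,
       # hours * 60 + minutes; the final %% 1440 maps midnight (recorded as 2400) to 0 instead of 1440
       dep_time = ((dep_time %/% 100) * 60 + (dep_time %% 100)) %% 1440,
       sched_dep_time = ((sched_dep_time %/% 100) * 60 + (sched_dep_time %% 100)) %% 1440)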

## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <dbl>          <dbl>     <dbl>    <int>          <int>
##  1  2013     1     1      317            315         2      830            819
##  2  2013     1     1      333            329         4      850            830
##  3  2013     1     1      342            340         2      923            850
##  4  2013     1     1      344            345        -1     1004           1022
##  5  2013     1     1      354            360        -6      812            837
##  6  2013     1     1      354            358        -4      740            728
##  7  2013     1     1      355            360        -5      913            854
##  8  2013     1     1      357            360        -3      709            723
##  9  2013     1     1      357            360        -3      838            846
## 10  2013     1     1      358            360        -2      753            745
## # ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
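Following the hint about writing a function, the conversion can be wrapped in a small helper so the formula is written only once. This is a sketch in base R: `hhmm_to_minutes` is a hypothetical name, not a function from dplyr or nycflights13, and it includes a `%% 1440` step so that midnight, recorded as 2400, maps to 0 rather than 1440.

```r
# Hypothetical helper: convert a clock time stored as HHMM
# (e.g. 517 for 5:17) to minutes since midnight.
hhmm_to_minutes <- function(hhmm) {
  hours   <- hhmm %/% 100          # integer division keeps the hour part
  minutes <- hhmm %% 100           # remainder keeps the minute part
  (hours * 60 + minutes) %% 1440   # 1440 minutes per day, so 2400 maps to 0
}

# Missing times stay missing.
print(hhmm_to_minutes(c(517, 600, 2359, 2400, NA)))
```

Inside a pipeline this could be used as `mutate(flights, dep_time = hhmm_to_minutes(dep_time))`.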

9. Compare air_time with arr_time - dep_time. What do you expect to see? What do you see? What do you need to do to fix it?

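# air_time is the time spent in the air, while arr_time - dep_time is the total time the aircraft takes to depart, fly, and arrive at its destination.
# arr_time and dep_time are local clock times, so the difference also reflects any time-zone change between origin and destination.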
flights %>% 
  mutate(dep_time = (dep_time %/% 100) * 60 + (dep_time %% 100),
         sched_dep_time = (sched_dep_time %/% 100) * 60 + (sched_dep_time %% 100),
         arr_time = (arr_time %/% 100) * 60 + (arr_time %% 100),
         sched_arr_time = (sched_arr_time %/% 100) * 60 + (sched_arr_time %% 100)) %>%
  transmute((arr_time - dep_time) %% (60*24) - air_time)

## # A tibble: 336,776 x 1
##    `(arr_time - dep_time)%%(60 * 24) - air_time`
##                                            <dbl>
##  1                                           -34
##  2                                           -30
##  3                                            61
##  4                                            77
##  5                                            22
##  6                                           -44
##  7                                            40
##  8                                            19
##  9                                            21
## 10                                           -23
## # ... with 336,766 more rows

Explain the use of the pipe function %>%. How does the design of the tidyverse facilitate the use of pipes?

The pipe function %>% passes the result on its left as the first argument of the function on its right. The tidyverse facilitates pipes in two ways. First, dplyr function names are verbs, so piped code reads like a sequence of sentences. Second, each dplyr function takes a data frame as its first argument and returns a data frame, so the output of one call can be piped straight into the next. This way, pipes chain multiple simple steps into more complex operations.
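The equivalence between a nested call and a piped call can be sketched on a small made-up tibble, assuming dplyr is installed. The names `toy`, `nested`, and `piped` are illustrative only.

```r
library(dplyr)

# Toy stand-in for flights (values are made up).
toy <- tibble(carrier = c("UA", "AA", "UA"), arr_delay = c(130, 5, -10))

# Nested call: has to be read from the inside out.
nested <- arrange(filter(toy, carrier == "UA"), desc(arr_delay))

# Piped call: reads top to bottom, one verb per line.
# Each step receives the previous result as its first argument.
piped <- toy %>%
  filter(carrier == "UA") %>%
  arrange(desc(arr_delay))

print(piped)
```

Both forms compute the same result; the piped version simply puts the steps in the order they happen.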

Brainstorm at least 5 different ways to assess the typical delay characteristics of a group of flights. Consider the following scenarios:

A flight is 15 minutes early 50% of the time, and 15 minutes late 50% of the time.

flights %>%
  group_by(flight) %>%
  summarize(early_15_min = sum(arr_delay <= -15, na.rm = TRUE) / n(),
            late_15_min = sum(arr_delay >= 15, na.rm = TRUE) / n(),
            n = n()) %>%
  filter(early_15_min == 0.5,
         late_15_min == 0.5)

## # A tibble: 18 x 4
##    flight early_15_min late_15_min     n
##     <int>        <dbl>       <dbl> <int>
##  1    107          0.5         0.5     2
##  2   2072          0.5         0.5     2
##  3   2366          0.5         0.5     2
##  4   2500          0.5         0.5     2
##  5   2552          0.5         0.5     2
##  6   3495          0.5         0.5     2
##  7   3518          0.5         0.5     2
##  8   3544          0.5         0.5     2
##  9   3651          0.5         0.5     2
## 10   3705          0.5         0.5     2
## 11   3916          0.5         0.5     2
## 12   3951          0.5         0.5     2
## 13   4273          0.5         0.5     2
## 14   4313          0.5         0.5     2
## 15   5297          0.5         0.5     2
## 16   5322          0.5         0.5     2
## 17   5388          0.5         0.5     2
## 18   5505          0.5         0.5     4

A flight is always 10 minutes late.

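# Note: prop.same.late == 1 means every recorded delay in a group is distinct,
# so combined with the mean filter this only identifies the scenario for flights with a single record (n = 1).
flights %>% 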
  group_by(flight) %>% 
  summarise(prop.same.late = n_distinct(arr_delay, na.rm = TRUE) / n(), 
            mean.arr.delay = mean(arr_delay, na.rm = TRUE),
            n = n()) %>%
  filter(prop.same.late == 1 & mean.arr.delay == 10)

## # A tibble: 4 x 4
##   flight prop.same.late mean.arr.delay     n
##    <int>          <dbl>          <dbl> <int>
## 1   2254              1             10     1
## 2   3656              1             10     1
## 3   3880              1             10     1
## 4   5854              1             10     1
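A more direct test of "always 10 minutes late" is to check that every recorded arrival delay in a group equals 10. The sketch below does this on made-up data, assuming dplyr is installed; `toy`, `always_10`, and `always_10_late` are illustrative names, and the result on the real flights data is not shown here.

```r
library(dplyr)

# Made-up delays: flight 1 is always 10 minutes late,
# flight 2 only averages 10 minutes late (5 and 15).
toy <- tibble(flight    = c(1, 1, 1, 2, 2),
              arr_delay = c(10, 10, 10, 5, 15))

always_10_late <- toy %>%
  group_by(flight) %>%
  # all() is TRUE only if every non-missing delay equals 10.
  # Caveat: a group whose delays are all NA would also pass (vacuously TRUE).
  summarise(always_10 = all(arr_delay == 10, na.rm = TRUE),
            n = n()) %>%
  filter(always_10)

print(always_10_late)
```

Flight 2 has a mean delay of 10 but is excluded, which is the distinction a mean-based filter alone cannot make.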

A flight is 30 minutes early 50% of the time, and 30 minutes late 50% of the time.

flights %>% 
  group_by(flight) %>% 
  summarise(early.30.prop = sum(arr_delay <= -30, na.rm = TRUE) / n(),
            late.30.prop = sum(arr_delay >= 30, na.rm = TRUE) / n(),
            n = n()) %>% 
  filter(early.30.prop == .5 & late.30.prop == .5)

## # A tibble: 3 x 4
##   flight early.30.prop late.30.prop     n
##    <int>         <dbl>        <dbl> <int>
## 1   3651           0.5          0.5     2
## 2   3916           0.5          0.5     2
## 3   3951           0.5          0.5     2

99% of the time a flight is on time. 1% of the time it’s 2 hours late.

flights %>% 
  group_by(flight) %>% 
  summarise(early.prop = sum(arr_delay <= 0, na.rm = TRUE) / n(),
            late.prop = sum(arr_delay >= 120, na.rm = TRUE) / n(),
            n = n()) %>% 
  filter(early.prop == .99 & late.prop == .01 )

## # A tibble: 0 x 4
## # ... with 4 variables: flight <int>, early.prop <dbl>, late.prop <dbl>,
## #   n <int>
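The scenarios above can also be compared through summary statistics rather than exact matches. The sketch below computes five candidate measures (mean, median, standard deviation, proportion late, and worst delay) on made-up data for two of the scenarios, assuming dplyr is installed. The data, the flight labels, and the column names are all illustrative.

```r
library(dplyr)

toy <- tibble(
  flight    = rep(c("A", "B"), each = 4),
  arr_delay = c(-15, 15, -15, 15,   # A: 15 min early or 15 min late, half the time each
                 10, 10,  10, 10)   # B: always 10 min late
)

delay_summary <- toy %>%
  group_by(flight) %>%
  summarise(
    mean_delay   = mean(arr_delay, na.rm = TRUE),      # typical delay
    median_delay = median(arr_delay, na.rm = TRUE),    # robust to extreme values
    sd_delay     = sd(arr_delay, na.rm = TRUE),        # how variable the delay is
    prop_late    = mean(arr_delay > 0, na.rm = TRUE),  # share of late arrivals
    worst_delay  = max(arr_delay, na.rm = TRUE)        # longest delay observed
  )

print(delay_summary)
```

Flight A has the lower mean delay but a much larger spread, which is why a single average does not capture the delay characteristics of a group of flights.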

Which is more important: arrival delay or departure delay?

In most cases arrival delay is more important, because a late arrival causes more disruption to passengers: connecting flights might be missed. If departure is delayed but arrival is unaffected, passenger plans are not affected. Arrival time is also less consistent: a departure delay can be planned for and accommodated, but variations in flight time can widely affect the arrival time.

Tycho Gormley

7/20/2021
