In this workshop we will do some of the exercises from Chapter 5 of R4DS.
Use a separate code block for each exercise.
For example: 1. Find all flights that had an arrival delay of two or more hours.
flights %>%
filter(arr_delay >= 120) %>%
ggplot() +
geom_histogram(mapping = aes(x = arr_delay), binwidth = 5)
flights %>%
filter(dest == "IAH")
## # A tibble: 7,198 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 623 627 -4 933 932
## 4 2013 1 1 728 732 -4 1041 1038
## 5 2013 1 1 739 739 0 1104 1038
## 6 2013 1 1 908 908 0 1228 1219
## 7 2013 1 1 1028 1026 2 1350 1339
## 8 2013 1 1 1044 1045 -1 1352 1351
## 9 2013 1 1 1114 900 134 1447 1222
## 10 2013 1 1 1205 1200 5 1503 1505
## # … with 7,188 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
flights %>%
filter(arr_delay > 120, dep_delay == 0)
## # A tibble: 3 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 10 7 1350 1350 0 1736 1526
## 2 2013 5 23 1810 1810 0 2208 2000
## 3 2013 7 1 905 905 0 1443 1223
## # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
flights %>%
filter(dep_time >= 0000, dep_delay <= 0600)
## # A tibble: 328,481 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # … with 328,471 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
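The filter above tests dep_delay in its second condition, so it keeps almost every flight. If the goal is the book's exercise "departed between midnight and 6am (inclusive)", a minimal sketch might look like the following. It assumes dplyr is installed and uses a small made-up data frame (`toy_flights` and its values are hypothetical, not the real flights data). It also assumes midnight is recorded as 2400 rather than 0, as it appears to be in nycflights13.

```r
library(dplyr)

# Hypothetical toy data; dep_time uses the same HHMM integer format as flights
toy_flights <- tibble(
  dep_time  = c(517, 600, 601, 2400, 1350, NA),
  dep_delay = c(2, 0, 1, 5, 0, NA)
)

# Keep departures at or before 0600, plus midnight (assumed to be stored as 2400)
early_departures <- toy_flights %>%
  filter(dep_time <= 600 | dep_time == 2400)

early_departures
```

Both conditions refer to dep_time, and rows with a missing dep_time are dropped because filter() only keeps rows where the condition is TRUE.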
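Why is NA ^ 0 not missing? Why is NA | TRUE not missing? Why is FALSE & NA not missing? Can you figure out the general rule? (NA * 0 is a tricky counterexample!) How does this differ from normal practice in Mathematics?
The general rule seems to be that R returns a non-missing result whenever the answer would be the same no matter what value NA stands for. NA ^ 0 returns 1 because any number raised to the power 0 is 1. TRUE | NA returns TRUE and FALSE & NA returns FALSE because the known value already decides the result, while FALSE | NA and TRUE & NA return NA because the result depends on the missing value. NA * 0 is the counterexample: it returns NA because the missing value could be infinite, and Inf * 0 is NaN rather than 0. In ordinary arithmetic over the real numbers, x * 0 = 0 for every x, so this is one place where R differs from normal mathematical practice.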
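The cases in this exercise can be checked directly at the console. The sketch below uses only base R, and the comments group the expressions by whether the result depends on the missing value.

```r
# Results that are the same whatever the missing value is: not NA
NA^0         # 1
NA | TRUE    # TRUE
FALSE & NA   # FALSE

# Results that depend on the missing value: NA
TRUE & NA
FALSE | NA

# The tricky counterexample: NA * 0 stays NA, because the missing
# value could be infinite and Inf * 0 is NaN rather than 0
NA * 0
Inf * 0
```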
Find the 5 most delayed flights. (not exactly the same as book)
arrange(flights, desc(arr_delay))
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 9 641 900 1301 1242 1530
## 2 2013 6 15 1432 1935 1137 1607 2120
## 3 2013 1 10 1121 1635 1126 1239 1810
## 4 2013 9 20 1139 1845 1014 1457 2210
## 5 2013 7 22 845 1600 1005 1044 1815
## 6 2013 4 10 1100 1900 960 1342 2211
## 7 2013 3 17 2321 810 911 135 1020
## 8 2013 7 22 2257 759 898 121 1026
## 9 2013 12 5 756 1700 896 1058 2020
## 10 2013 5 3 1133 2055 878 1250 2215
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
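The arrange() call in this exercise sorts the whole table but still returns every row. One way to keep only the top five is to follow arrange() with head(). This is sketched on a small made-up data frame (`toy_flights` and its values are hypothetical), assuming dplyr is installed.

```r
library(dplyr)

# Hypothetical toy data standing in for flights
toy_flights <- tibble(
  flight    = c(1, 2, 3, 4, 5, 6, 7),
  arr_delay = c(10, 1272, -5, NA, 1127, 300, 45)
)

# Sort from largest to smallest arrival delay, then keep the first 5 rows.
# arrange() always places NA values last, so the missing delay is not selected.
top5 <- toy_flights %>%
  arrange(desc(arr_delay)) %>%
  head(5)

top5
```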
flights %>% select(contains("time"))
## # A tibble: 336,776 x 6
## dep_time sched_dep_time arr_time sched_arr_time air_time time_hour
## <int> <int> <int> <int> <dbl> <dttm>
## 1 517 515 830 819 227 2013-01-01 05:00:00
## 2 533 529 850 830 227 2013-01-01 05:00:00
## 3 542 540 923 850 160 2013-01-01 05:00:00
## 4 544 545 1004 1022 183 2013-01-01 05:00:00
## 5 554 600 812 837 116 2013-01-01 06:00:00
## 6 554 558 740 728 150 2013-01-01 05:00:00
## 7 555 600 913 854 158 2013-01-01 06:00:00
## 8 557 600 709 723 53 2013-01-01 06:00:00
## 9 557 600 838 846 140 2013-01-01 06:00:00
## 10 558 600 753 745 138 2013-01-01 06:00:00
## # … with 336,766 more rows
t <- 1357
t %/% 100
## [1] 13
t %% 100
## [1] 57
time2min <- function(t) {t %/% 100 * 60 + t %% 100}
time2min(1357)
## [1] 837
9. Compare air_time with arr_time - dep_time. What do you expect to see? What do you see? What do you need to do to fix it?
We converted arr_time and dep_time from HHMM format to minutes since midnight. air_time is already recorded in minutes, so it is copied unchanged into airtime2. Even after the conversion, “new” (converted arr_time - converted dep_time) does not equal “airtime2”. Part of the gap is probably time the plane spends on the ground, taxiing before takeoff and after landing. In addition, dep_time and arr_time are local times, so flights that cross time zones or arrive after midnight will not match either.
mutate(flights, new = time2min(arr_time) - time2min(dep_time), airtime2 = air_time)
## # A tibble: 336,776 x 21
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # … with 336,766 more rows, and 13 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>,
## # new <dbl>, airtime2 <dbl>
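As a rough illustration of one part of the fix, the sketch below handles flights that arrive after midnight by wrapping the difference around 1440 minutes (one day). The data frame `toy_flights` and its air_time values are made up for illustration, and dplyr is assumed to be installed. Time zones are not handled here, so a gap between the two columns is still expected.

```r
library(dplyr)

# Same helper as above: HHMM integer -> minutes since midnight
time2min <- function(t) t %/% 100 * 60 + t %% 100

# Hypothetical toy data: the second flight departs at 23:21 and lands at 01:35
toy_flights <- tibble(
  dep_time = c(517, 2321),
  arr_time = c(830, 135),
  air_time = c(227, 110)
)

compared <- toy_flights %>%
  mutate(
    # Add one day and take the remainder so overnight flights stay positive
    elapsed = (time2min(arr_time) - time2min(dep_time) + 1440) %% 1440,
    # Remaining gap: time zones, taxiing, and so on
    diff = elapsed - air_time
  )

compared
```

Without the wrap-around, the second flight would get an elapsed time of -1306 minutes instead of 134.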
Explain the use of the pipe function %>%. How does the design of the tidyverse facilitate the use of pipes?
The pipe lets you link together multiple functions such as filter, select, and arrange. Instead of nesting functions inside each other, which quickly gets complicated, you stack the functions one after another, which makes the code easier to understand. R then carries out the steps in order from top to bottom, so the pipe operator reads like “then” (ex: filter, then arrange). The tidyverse makes this easy because its functions take the data as their first argument and return a data frame, so the output of one step can be passed straight into the next.
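To make the comparison concrete, here is a small sketch of the same three steps written once as nested calls and once with the pipe. The data frame `toy_flights` is made up for illustration, and dplyr is assumed to be installed.

```r
library(dplyr)

# Hypothetical toy data
toy_flights <- tibble(
  dest      = c("IAH", "IAH", "ORD", "IAH"),
  arr_delay = c(5, 130, 200, 150),
  carrier   = c("UA", "AA", "UA", "UA")
)

# Nested: has to be read from the inside out
nested <- arrange(
  select(filter(toy_flights, dest == "IAH"), carrier, arr_delay),
  desc(arr_delay)
)

# Piped: reads top to bottom as filter, then select, then arrange
piped <- toy_flights %>%
  filter(dest == "IAH") %>%
  select(carrier, arr_delay) %>%
  arrange(desc(arr_delay))

# Both versions give the same result
all.equal(nested, piped)
```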
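Brainstorm at least 5 different ways to assess the typical delay characteristics of a group of flights. Consider the following scenarios:
1) A flight is 15 minutes early 50% of the time, and 15 minutes late 50% of the time.
2) A flight is always 10 minutes late.
3) A flight is 30 minutes early 50% of the time, and 30 minutes late 50% of the time.
4) 99% of the time a flight is on time. 1% of the time it’s 2 hours late.
Which is more important: arrival delay or departure delay?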
First you have to define “early” and “late”. In the flights data a negative arr_delay means the flight arrived early, so for this question “early” means that arr_delay is a negative number, “late” means that arr_delay is a positive number, and “on time” means that arr_delay is equal to 0.
5 strategies:
1) Filter using arr_delay == some value. If the third scenario is true, then filtering for flights that are 30 min early (arr_delay == -30) and for flights that are 30 min late (arr_delay == 30) should give the same number of flights.
2) Use count + filter. We can make the first strategy shorter by just counting the number of flights that are 30 min early. If the third scenario is true, that count should be 50% of all of the flights in the group, so we can check the scenario without counting the 30 min late flights.
3) Use the summarize function. If the first scenario is true, then the sum (and so the mean) of arr_delay must be 0, because 50% of arr_delay = -15 [15 min early] and 50% of arr_delay = +15 [15 min late], and -15 + 15 = 0. A sum of 0 is necessary for the scenario but does not prove it on its own.
4) Use group_by. If you group by flight, you get a separate set of records for each flight and can check whether that flight meets the definition of any of the 4 scenarios.
5) Use n() to find the number of records. For the fourth scenario to be true, you would need to find a flight where 1% of the time arr_delay = 120 (2 hrs late). So if the number of records with arr_delay = 120 is not 1% of that flight's records, then you know the fourth scenario isn’t true for that flight.
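Several of these strategies can be combined in a single grouped summary. The sketch below is one possible way to do that, not the only one. It uses a made-up data frame (`toy_flights`, with hypothetical flights "A" and "B" built to match the first and second scenarios), assumes dplyr is installed, and follows the nycflights13 convention that a negative arr_delay means an early arrival.

```r
library(dplyr)

# Hypothetical toy data: flight A is 15 min early half the time and 15 min late
# the other half; flight B is always 10 min late
toy_flights <- tibble(
  flight    = c("A", "A", "A", "A", "B", "B", "B", "B"),
  arr_delay = c(-15, 15, -15, 15, 10, 10, 10, 10)
)

delay_summary <- toy_flights %>%
  group_by(flight) %>%
  summarise(
    n_flights    = n(),                  # number of records per flight
    mean_delay   = mean(arr_delay),      # average delay
    median_delay = median(arr_delay),    # typical delay, less affected by extremes
    prop_early   = mean(arr_delay < 0),  # share of early arrivals
    prop_late    = mean(arr_delay > 0),  # share of late arrivals
    max_delay    = max(arr_delay)        # worst case
  )

delay_summary
```

Taking the mean of a logical condition such as arr_delay < 0 gives the proportion of rows where it is TRUE, which is how the percentages in the scenarios can be checked without separate filter and count steps.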
Whether “arrival delay” is more important than “departure delay” or vice versa depends on which stakeholder(s) you are considering and what their needs are. Arrival delay is probably more important to most passengers. A departure delay can even help a passenger who is running late, because they might not miss the flight, but an arrival delay can cause them to miss their next flight or their ride from the airport. What is important to the passenger isn’t so much arrival delay vs. departure delay; it is whatever causes the least amount of inconvenience. Since the two delays feed off of each other (one can cause another), airline and airport employees likely think that they are equally important because they want to keep things on schedule.