In this workshop we will do some of the exercises from Chapter 5 of R4DS.
Use a separate code block for each exercise.
For example:
1. Find all flights that had an arrival delay of two or more hours.
flights %>%
  filter(arr_delay >= 120) %>%
  ggplot() +
  geom_histogram(aes(x = arr_delay), binwidth = 5) # note delays are in minutes

flights %>%
  filter(dest == "IAH" & carrier == "UA")
## # A tibble: 6,924 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      623            627        -4      933            932
##  4  2013     1     1      728            732        -4     1041           1038
##  5  2013     1     1      739            739         0     1104           1038
##  6  2013     1     1      908            908         0     1228           1219
##  7  2013     1     1     1028           1026         2     1350           1339
##  8  2013     1     1     1044           1045        -1     1352           1351
##  9  2013     1     1     1114            900       134     1447           1222
## 10  2013     1     1     1205           1200         5     1503           1505
## # … with 6,914 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
Find all flights that arrived more than two hours late, but didn’t leave late.
Find all flights that departed between midnight and 6am (inclusive).
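A possible trap in the midnight-to-6am exercise is how midnight is encoded. The sketch below assumes that dep_time is stored in HHMM format and that a midnight departure is recorded as 2400 rather than 0, which is worth checking with summary(flights$dep_time). The data frame `toy` is a hypothetical stand-in for flights.

```r
library(dplyr)

# Hypothetical toy data standing in for flights; dep_time is in HHMM format
toy <- data.frame(dep_time = c(2400, 1, 517, 600, 601, 1205, NA))

# Keep departures up to and including 6:00am, plus midnight stored as 2400.
# filter() drops rows where the condition evaluates to NA.
early <- toy %>%
  filter(dep_time <= 600 | dep_time == 2400)

early$dep_time
```

A condition such as between(dep_time, 0, 600) on its own would miss the 2400 row under this assumption.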
Why is NA ^ 0 not missing? Why is NA | TRUE not missing? Why is FALSE & NA not missing? Can you figure out the general rule? (NA * 0 is a tricky counterexample!) How does this differ from normal practice in mathematics?
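One way to explore this is to run the expressions at the console and compare cases where the answer does or does not depend on the unknown value. The following is a minimal base R sketch; the comments give one reading of the results rather than an official statement of the rule.

```r
# Results that do not depend on the unknown value are not missing.
NA ^ 0       # 1: any number raised to the power 0 is 1
NA | TRUE    # TRUE: "x or TRUE" is TRUE whatever x is
FALSE & NA   # FALSE: "FALSE and x" is FALSE whatever x is

# Results that do depend on the unknown value stay missing.
NA & TRUE    # NA
NA | FALSE   # NA

# NA * 0 stays missing because the unknown number could be infinite,
# and Inf * 0 is NaN rather than 0.
NA * 0       # NA
Inf * 0      # NaN
```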
Find the 5 most delayed flights. (This is not exactly the same as the exercise in the book.)
Using a helper function, extract all columns concerning time.
flights %>%
  select(ends_with("time"))
## # A tibble: 336,776 x 5
##    dep_time sched_dep_time arr_time sched_arr_time air_time
##       <int>          <int>    <int>          <int>    <dbl>
##  1      517            515      830            819      227
##  2      533            529      850            830      227
##  3      542            540      923            850      160
##  4      544            545     1004           1022      183
##  5      554            600      812            837      116
##  6      554            558      740            728      150
##  7      555            600      913            854      158
##  8      557            600      709            723       53
##  9      557            600      838            846      140
## 10      558            600      753            745      138
## # … with 336,766 more rows
9. Compare air_time with arr_time - dep_time. What do you expect to see? What do you see? What do you need to do to fix it?
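One likely reason the naive subtraction looks wrong is that dep_time and arr_time are clock times in HHMM format, not minute counts. The sketch below uses the first two rows of the output above and a hypothetical helper, hhmm_to_minutes(), which is not part of dplyr or nycflights13.

```r
# Hypothetical helper: convert an HHMM clock time (e.g. 517 = 5:17am)
# to minutes since midnight. 2400 (midnight) maps to 0.
hhmm_to_minutes <- function(hhmm) {
  ((hhmm %/% 100) * 60 + hhmm %% 100) %% 1440
}

# First two rows of the select(ends_with("time")) output
dep_time <- c(517, 533)
arr_time <- c(830, 850)
air_time <- c(227, 227)

# Subtracting HHMM values directly does not give a duration in minutes
naive <- arr_time - dep_time                                        # 313 317

# Converting to minutes since midnight first gives elapsed clock time
elapsed <- hhmm_to_minutes(arr_time) - hhmm_to_minutes(dep_time)    # 193 197
```

Even after conversion the elapsed time (193 minutes) does not match air_time (227 minutes). Plausible causes to investigate are that arrival times are recorded in the destination's local time zone, that gate-to-gate time includes taxiing, and that overnight flights wrap past midnight.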
Explain the use of the pipe function %>%. How does the design of the tidyverse facilitate the use of pipes?
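As a starting point for the discussion, the sketch below writes the same two-step operation as nested calls and as a pipeline. It relies on x %>% f(y) being evaluated as f(x, y), and on dplyr verbs taking the data frame as their first argument and returning a data frame. The data frame `toy` is a hypothetical stand-in for flights.

```r
library(dplyr)

# Hypothetical toy data standing in for flights
toy <- data.frame(
  carrier   = c("UA", "UA", "AA"),
  arr_delay = c(10, 130, 200)
)

# Nested calls: read from the inside out
nested <- arrange(filter(toy, arr_delay >= 120), desc(arr_delay))

# Piped calls: read top to bottom, each step feeding the next
piped <- toy %>%
  filter(arr_delay >= 120) %>%
  arrange(desc(arr_delay))

# The two forms should give the same result
all.equal(nested, piped)
```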
Brainstorm at least 5 different ways to assess the typical delay characteristics of a group of flights. Consider the following scenarios: