Introduction

In this workshop we will do some of the exercises from Chapter 5 of R4DS.

Exercises from 5.2.4

Use a separate code block for each exercise.

For example: 1. Find all flights that had an arrival delay of two or more hours.

library(nycflights13) # provides the flights data
library(dplyr)        # provides filter() and %>%
flights %>%
  filter(arr_delay >= 120) # note delays are in minutes
## # A tibble: 10,200 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      811            630       101     1047            830
##  2  2013     1     1      848           1835       853     1001           1950
##  3  2013     1     1      957            733       144     1056            853
##  4  2013     1     1     1114            900       134     1447           1222
##  5  2013     1     1     1505           1310       115     1638           1431
##  6  2013     1     1     1525           1340       105     1831           1626
##  7  2013     1     1     1549           1445        64     1912           1656
##  8  2013     1     1     1558           1359       119     1718           1515
##  9  2013     1     1     1732           1630        62     2028           1825
## 10  2013     1     1     1803           1620       103     2008           1750
## # … with 10,190 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
  1. Find all United Airlines flights to “IAH”. (Note: this is not exactly one of the book exercises.)

  2. Find all flights that arrived more than two hours late, but didn’t leave late.

  3. Find all flights that departed between midnight and 6am (inclusive).

  4. Why is NA ^ 0 not missing? Why is NA | TRUE not missing? Why is FALSE & NA not missing? Can you figure out the general rule? (NA * 0 is a tricky counterexample!) How does this differ from normal practice in mathematics?

  5. Find the 5 most delayed flights. (Not exactly the same as the book.)

  6. Using a helper function, extract all columns concerning time.

  7. Currently dep_time and sched_dep_time are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight. (Hint: you may need to write a function and use modular arithmetic, as I mentioned last week.)

  8. Compare air_time with arr_time - dep_time. What do you expect to see? What do you see? What do you need to do to fix it?
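
As a hint for the minutes-since-midnight exercise, here is a minimal sketch of the modular arithmetic involved. The helper name hhmm_to_minutes is made up for illustration and does not come from the book. It assumes times are stored as HHMM numbers (so 1530 means 3:30pm) and that midnight is recorded as 2400.

```r
# Hypothetical helper (our own name, not from R4DS): convert a clock time
# stored as HHMM, e.g. 1530 for 3:30pm, to minutes since midnight.
hhmm_to_minutes <- function(hhmm) {
  hours   <- hhmm %/% 100  # integer division drops the last two digits
  minutes <- hhmm %% 100   # the remainder keeps the last two digits
  # Assumption: midnight is stored as 2400, which would give 1440.
  # Taking the result modulo 1440 wraps it back to 0.
  (hours * 60 + minutes) %% 1440
}

hhmm_to_minutes(c(5, 630, 1530, 2400, NA))
## [1]   5 390 930   0  NA
```

Inside a pipeline, a function like this could be applied with mutate(), for example mutate(dep_minutes = hhmm_to_minutes(dep_time)).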

  1. Explain the use of the pipe operator %>%. How does the design of the tidyverse facilitate the use of pipes?

  2. Brainstorm at least 5 different ways to assess the typical delay characteristics of a group of flights. Consider the following scenarios:

  • A flight is 15 minutes early 50% of the time, and 15 minutes late 50% of the time.
  • A flight is always 10 minutes late.
  • A flight is 30 minutes early 50% of the time, and 30 minutes late 50% of the time.
  • 99% of the time a flight is on time. 1% of the time it’s 2 hours late.
     Which is more important: arrival delay or departure delay?
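
To get started on the brainstorming exercise, one option is to simulate each scenario and compare a few summaries side by side. The sketch below uses base R only. The vectors and the summarise_delay() helper are invented for illustration (100 hypothetical flights per scenario, delays in minutes, negative meaning early) and are not taken from the flights data.

```r
# Simulated delays in minutes (negative = early), 100 flights per scenario.
# These numbers are made up to match the scenarios, not real data.
scenarios <- list(
  early_late_15 = c(rep(-15, 50), rep(15, 50)),
  always_10     = rep(10, 100),
  early_late_30 = c(rep(-30, 50), rep(30, 50)),
  rare_2h       = c(rep(0, 99), 120)
)

# Hypothetical helper: a few candidate measures of "typical delay".
summarise_delay <- function(delay) {
  c(
    mean      = mean(delay),
    median    = median(delay),
    sd        = sd(delay),
    prop_late = mean(delay > 0),  # share of flights that arrive late
    max       = max(delay)
  )
}

# One row per scenario, one column per summary.
round(t(sapply(scenarios, summarise_delay)), 2)
```

The first and third scenarios share a mean of 0 but differ in spread, which suggests that a single average is unlikely to be enough.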