5.2.4

1. Find all flights that

  1. Had an arrival delay of two or more hours

    flights %>% filter(arr_delay >= 120)
  2. Flew to Houston (IAH or HOU)

    flights %>% filter(dest %in% c("IAH", "HOU"))
  3. Were operated by United, American, or Delta

    flights %>% left_join(airlines) %>%
      filter(str_detect(name, "United|American|Delta"))
  4. Departed in summer (July, August, and September)

    flights %>% filter(between(month, 7, 9))
  5. Arrived more than two hours late, but didn’t leave late

    flights %>% filter(arr_delay > 120, dep_delay <= 0)
  6. Were delayed by at least an hour, but made up over 30 minutes in flight

    flights %>% filter(dep_delay >= 60, arr_delay < dep_delay - 30)
  7. Departed between midnight and 6am (inclusive)

    flights %>% mutate(dep_hour = dep_time %/% 100) %>%
      filter(dep_time == 2400 | dep_hour %in% c(0:5) | dep_time == 600)

2. Another useful dplyr filtering helper is between(). What does it do? Can you use it to simplify the code needed to answer the previous challenges?

  1. Departed between midnight and 6am (inclusive)

    flights %>% filter(dep_time == 2400 | dep_time %>% between(0, 600))

3. How many flights have a missing dep_time? What other variables are missing? What might these rows represent?

na_names <- flights %>% map_dfr(. %>% is.na %>% sum) %>% select_if(~ . != 0)
kable(na_names)
dep_time dep_delay arr_time arr_delay tailnum air_time
8255 8255 8713 9430 2512 9430
  • dep_timearr_timeがセットでNAのデータは、フライトが中止したと考えられる。
  • arr_timeだけNAのデータは墜落? 多すぎる。
  • arr_delayair_timeNAだけNAのデータは謎。

4. Why is NA ^ 0 not missing? Why is NA | TRUE not missing? Why is FALSE & NA not missing? Can you figure out the general rule? (NA * 0 is a tricky counterexample!)

全ての入力に対して値が1つに定まる式の場合、NAに対してもその値を返す。

?'^'

‘1 ^ y’ and ‘y ^ 0’ are ‘1’, always.

?'&'

‘NA’ is a valid logical object. Where a component of ‘x’ or ‘y’ is ‘NA’, the result will be ‘NA’ if the outcome is ambiguous. In other words ‘NA & TRUE’ evaluates to ‘NA’, but ‘NA & FALSE’ evaluates to ‘FALSE’. See the examples below.

x * 0は全てのxに対して0を返すとは限らない。

1 * 0
Inf * 0
NaN * 0

5.3.1 Exercises

1. How could you use arrange() to sort all missing values to the start? (Hint: use is.na()).

df <- tibble(x = c(5, 2, NA))
df %>% arrange(is.na(x) %>% desc)
df %>% arrange(!is.na(x))

2. Sort flights to find the most delayed flights. Find the flights that left earliest.

flights %>% arrange(desc(dep_delay))
flights %>% arrange(dep_delay)

3. Sort `flights`` to find the fastest flights.

flights %>% arrange(air_time)

4. Which flights travelled the longest? Which travelled the shortest?

flights %>% arrange(distance)
flights %>% arrange(desc(distance))

5.4.1 Exercises

1. Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.

flights %>% select(starts_with("dep_"), starts_with("arr_"))
flights %>% select(ends_with("_time"), ends_with("_delay"), -matches("^(sched_|air_)"))
flights %>% select(matches("^(arr|dep)"))

2. What happens if you include the name of a variable multiple times in a select() call?

flights %>% select(dep_time, dep_time, arr_time, dep_time) %>% head(1) %>% kable
dep_time arr_time
517 830

3. What does the one_of() function do? Why might it be helpful in conjunction with this vector?

vars <- c("year", "month", "day", "dep_delay", "arr_delay")
flights %>% select(one_of(vars))
## # A tibble: 336,776 x 5
##     year month   day dep_delay arr_delay
##    <int> <int> <int>     <dbl>     <dbl>
##  1  2013     1     1         2        11
##  2  2013     1     1         4        20
##  3  2013     1     1         2        33
##  4  2013     1     1        -1       -18
##  5  2013     1     1        -6       -25
##  6  2013     1     1        -4        12
##  7  2013     1     1        -5        19
##  8  2013     1     1        -3       -14
##  9  2013     1     1        -3        -8
## 10  2013     1     1        -2         8
## # … with 336,766 more rows
flights %>% select(vars)
## # A tibble: 336,776 x 5
##     year month   day dep_delay arr_delay
##    <int> <int> <int>     <dbl>     <dbl>
##  1  2013     1     1         2        11
##  2  2013     1     1         4        20
##  3  2013     1     1         2        33
##  4  2013     1     1        -1       -18
##  5  2013     1     1        -6       -25
##  6  2013     1     1        -4        12
##  7  2013     1     1        -5        19
##  8  2013     1     1        -3       -14
##  9  2013     1     1        -3        -8
## 10  2013     1     1        -2         8
## # … with 336,766 more rows
flights %>% mutate(vars = 1) %>% select(vars)
## # A tibble: 336,776 x 1
##     vars
##    <dbl>
##  1     1
##  2     1
##  3     1
##  4     1
##  5     1
##  6     1
##  7     1
##  8     1
##  9     1
## 10     1
## # … with 336,766 more rows
flights %>% mutate(vars = 1) %>% select(!!vars)
## # A tibble: 336,776 x 5
##     year month   day dep_delay arr_delay
##    <int> <int> <int>     <dbl>     <dbl>
##  1  2013     1     1         2        11
##  2  2013     1     1         4        20
##  3  2013     1     1         2        33
##  4  2013     1     1        -1       -18
##  5  2013     1     1        -6       -25
##  6  2013     1     1        -4        12
##  7  2013     1     1        -5        19
##  8  2013     1     1        -3       -14
##  9  2013     1     1        -3        -8
## 10  2013     1     1        -2         8
## # … with 336,766 more rows

4. Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?

select(flights, contains("TIME"))
select(flights, contains("TIME", ignore.case = FALSE))

5.5.2 Exercises

1. Currently dep_time and sched_dep_time are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight.

time2min <- . %>% {. %/% 100 * 60 + . %% 100}
flights2 <- flights %>% mutate_at(vars(ends_with("_time"), -air_time), time2min)

2. Compare air_time with arr_time - dep_time. What do you expect to see? What do you see? What do you need to do to fix it?

arr_timedep_timeHHMMのフォーマットになっており、引き算をしても分にならない。 上の問題と同じ変換をかけてから引き算をしてみる。

flights2 %>% transmute(air_time, air_time2 = arr_time - dep_time) %>%
  count(air_time == air_time2) %>% kable
air_time == air_time2 n
FALSE 327150
TRUE 196
NA 9430

ほとんどFALSEになってしまった。タイムゾーンの問題と思われる。

4. Find the 10 most delayed flights using a ranking function. How do you want to handle ties? Carefully read the documentation for min_rank().

flights %>% transmute(dep_delay, rank = min_rank(desc(dep_delay) )) %>%
  filter(rank < 10) %>% arrange(rank) %>% kable
dep_delay rank
1301 1
1137 2
1126 3
1014 4
1005 5
960 6
911 7
899 8
898 9
x <- c(5, 1, 3, 2, 2, NA)
row_number(x)
## [1]  5  1  4  2  3 NA
min_rank(x)
## [1]  5  1  4  2  2 NA
dense_rank(x)
## [1]  4  1  3  2  2 NA

5. What does 1:3 + 1:10 return? Why?

1:3 + 1:10
## Warning in 1:3 + 1:10: 長いオブジェクトの長さが短いオブジェクトの長さの倍数
## になっていません
##  [1]  2  4  6  5  7  9  8 10 12 11

“recycling rules”が適用される。

If one parameter is shorter than the other, it will be automatically extended to be the same length.

c(1,2,3,1,2,3,1,2,3,1) + 1:10
##  [1]  2  4  6  5  7  9  8 10 12 11

6.What trigonometric functions does R provide?

?sinででてくる。

5.6.7 Exercises

1. Brainstorm at least 5 different ways to assess the typical delay characteristics of a group of flights. Consider the following scenarios:

* A flight is 15 minutes early 50% of the time, and 15 minutes late 50% of the time.
* A flight is always 10 minutes late.
* A flight is 30 minutes early 50% of the time, and 30 minutes late 50% of the time.
* 99% of the time a flight is on time. 1% of the time it’s 2 hours late.

わからん。

2. Come up with another approach that will give you the same output as not_cancelled %>% count(dest) and not_cancelled %>% count(tailnum, wt = distance) (without using count()).

not_cancelled %>% group_by(dest) %>% summarise(n = n())
not_cancelled %>% group_by(tailnum) %>% summarise(n = sum(distance))

3. Our definition of cancelled flights (is.na(dep_delay) | is.na(arr_delay) ) is slightly suboptimal. Why? Which is the most important column?

flights %>% filter(!is.na(air_time))

5. Which carrier has the worst delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about flights %>% group_by(carrier, dest) %>% summarise(n()))

## 各ルート,キャリアごとに平均遅延を計算し
## 各ルートでの平均からの残差を求めた
flights %>% mutate(route = str_c(origin, dest, sep = "->")) %>%
  group_by(route, carrier) %>%
  summarise(n = n(), delay = mean(arr_delay, na.rm = TRUE)) %>%
  filter(n > 20) %>%
  mutate(route_delay = mean(delay, na.rm = TRUE),
         residual = scale(delay - route_delay)) %>%
  ungroup() %>%
  group_by(carrier) %>%
  summarise(mean_residual = mean(residual, na.rm = TRUE)) %>%
  arrange(mean_residual %>% desc) %>%
  kable()
carrier mean_residual
F9 1.4965288
YV 0.5793190
B6 0.5670779
FL 0.5303111
MQ 0.3710127
EV 0.2608272
UA 0.0472507
WN 0.0032268
9E -0.1796967
AA -0.2340007
VX -0.2368222
DL -0.5672194
AS -0.7071068
OO -0.7445727
US -0.7937979
HA NaN

6. What does the sort argument to count() do. When might you use it?

sort: if ‘TRUE’ will sort output in descending order of ‘n’

5.7.1 Exercises

1. Refer back to the lists of useful mutate and filtering functions. Describe how each operation changes when you combine it with grouping.

いろいろある

2. Which plane (tailnum) has the worst on-time record?

離陸、着陸時間のずれを見る

resid <- function(x, n) {
  sqrt(sum(x^2, na.rm = TRUE)) / n
}

flights_tail <- flights %>% filter(!is.na(tailnum)) %>% group_by(tailnum)



flights_tail %>%
  summarise(
    dep_resid = resid(dep_delay, n()),
    arr_resid = resid(arr_delay, n()),
    n = n()
    ) %>%
  mutate(
    resid = dep_resid + arr_resid,
    r = percent_rank(desc(n))) %>%
  arrange(resid %>% desc) %>%
  filter(r < 0.9) %>% head(10) %>%
  kable()
tailnum dep_resid arr_resid n resid r
N665MQ 167.66170 165.18878 6 332.85048 0.8748144
N276AT 79.30759 77.00866 6 156.31625 0.8748144
N550NW 58.61114 55.26892 7 113.88005 0.8631865
N370SW 49.36148 50.91114 6 100.27262 0.8748144
N652SW 48.81797 50.34823 6 99.16620 0.8748144
N387SW 48.07120 49.92554 5 97.99674 0.8879268
N521US 45.34809 49.44261 7 94.79069 0.8631865
N919FJ 42.26504 44.60475 6 86.86978 0.8748144
N917DN 37.19785 47.38228 5 84.58012 0.8879268
N609SW 39.91190 44.60762 5 84.51953 0.8879268

3. What time of day should you fly if you want to avoid delays as much as possible?

arr_delayでみます。

flights %>% group_by(hour) %>%
  summarise(m = mean(arr_delay, na.rm = TRUE)) %>%
  arrange(m)
## # A tibble: 20 x 2
##     hour       m
##    <dbl>   <dbl>
##  1     7  -5.30 
##  2     5  -4.80 
##  3     6  -3.38 
##  4     9  -1.45 
##  5     8  -1.11 
##  6    10   0.954
##  7    11   1.48 
##  8    12   3.49 
##  9    13   6.54 
## 10    14   9.20 
## 11    23  11.8  
## 12    15  12.3  
## 13    16  12.6  
## 14    18  14.8  
## 15    22  16.0  
## 16    17  16.0  
## 17    19  16.7  
## 18    20  16.7  
## 19    21  18.4  
## 20     1 NaN

朝出る飛行機がよい?

4. For each destination, compute the total minutes of delay. For each flight, compute the proportion of the total delay for its destination.

flights %>% group_by(dest) %>%
  summarise(
    arr = sum(arr_delay, na.rm = TRUE),
    dep = sum(dep_delay, na.rm = TRUE)
    )
## # A tibble: 105 x 3
##    dest     arr    dep
##    <chr>  <dbl>  <dbl>
##  1 ABQ     1113   3490
##  2 ACK     1281   1711
##  3 ALB     6018   9897
##  4 ANC      -20    103
##  5 ATL   190260 211391
##  6 AUS    14514  31496
##  7 AVL     2089   2154
##  8 BDL     2904   7301
##  9 BGR     2874   7011
## 10 BHM     4540   8077
## # … with 95 more rows
flights %>% group_by(dest) %>%
  mutate(
    arr_delay_total = sum(arr_delay, na.rm = TRUE),
    dep_delay_total= sum(dep_delay, na.rm = TRUE),
    arr_delay_p= arr_delay / arr_delay_total,
    dep_delay_p= dep_delay / dep_delay_total) %>%
  select(starts_with(c("arr_delay")), starts_with("dep_delay"))
## Adding missing grouping variables: `dest`
## # A tibble: 336,776 x 7
## # Groups:   dest [105]
##    dest  arr_delay arr_delay_total arr_delay_p dep_delay dep_delay_total
##    <chr>     <dbl>           <dbl>       <dbl>     <dbl>           <dbl>
##  1 IAH          11           30046   0.000366          2           77012
##  2 IAH          20           30046   0.000666          4           77012
##  3 MIA          33            3467   0.00952           2          103261
##  4 BQN         -18            7322  -0.00246          -1           11032
##  5 ATL         -25          190260  -0.000131         -6          211391
##  6 ORD          12           97352   0.000123         -4          225840
##  7 FLL          19           96153   0.000198         -5          151933
##  8 IAD         -14           74631  -0.000188         -3           91555
##  9 MCO          -8           76185  -0.000105         -3          157661
## 10 ORD           8           97352   0.0000822        -2          225840
## # … with 336,766 more rows, and 1 more variable: dep_delay_p <dbl>

6. Look at each destination. Can you find flights that are suspiciously fast? (i.e. flights that represent a potential data entry error). Compute the air time a flight relative to the shortest flight to that destination. Which flights were most delayed in the air?

flights %>% mutate(speed = distance / air_time * 60 * 1.609) %>%
  arrange(speed %>% desc) %>% select(speed, dest, distance, air_time) %>%
  arrange(desc(speed)) %>% head(10) %>% kable
speed dest distance air_time
1131.7458 ATL 762 65
1046.3690 MSP 1008 93
1042.6320 GSP 594 55
1031.5989 BNA 748 70
951.6086 PBI 1035 105
907.4760 SJU 1598 170
896.9240 SJU 1598 172
895.3395 STT 1623 175
891.7394 SJU 1598 173
891.7394 SJU 1598 173
flights %>% group_by(dest) %>%
  mutate(r = min_rank(air_time)) %>%
  filter(r < 2) %>%
  select(air_time, dest, distance)
## # A tibble: 175 x 3
## # Groups:   dest [104]
##    air_time dest  distance
##       <dbl> <chr>    <dbl>
##  1       31 BWI        169
##  2      105 PBI       1035
##  3       38 PWM        273
##  4       20 BDL        116
##  5       38 PWM        269
##  6       38 PWM        269
##  7       35 ACK        199
##  8       35 ACK        199
##  9      102 MSN        812
## 10       92 MDW        711
## # … with 165 more rows

7. Find all destinations that are flown by at least two carriers. Use that information to rank the carriers.

flights %>% group_by(dest) %>%
  mutate(n = n_distinct(carrier)) %>%
  filter(n > 1)
## # A tibble: 325,397 x 20
## # Groups:   dest [76]
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # … with 325,387 more rows, and 13 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, n <int>

8. For each plane, count the number of flights before the first delay of greater than 1 hour.

flights %>%
  filter(!is.na(tailnum)) %>%
  group_by(tailnum) %>%
  transmute(
    i = row_number(),
    j = ifelse(arr_delay > 60, i, NA)) %>%
  summarise(
    n = sum(i < min(j, na.rm = TRUE), na.rm = TRUE) ) %>% arrange(desc(n))
## # A tibble: 4,043 x 2
##    tailnum     n
##    <chr>   <int>
##  1 N705TW    160
##  2 N961UW    139
##  3 N706TW    137
##  4 N346NB    128
##  5 N713TW    128
##  6 N315NB    125
##  7 N744P     116
##  8 N945UW    104
##  9 N75436    102
## 10 N3769L     99
## # … with 4,033 more rows