5.2.4 Exercises

1. Find all flights that

  1. Had an arrival delay of two or more hours

    flights %>% filter(arr_delay >= 120)
  2. Flew to Houston (IAH or HOU)

    flights %>% filter(dest %in% c("IAH", "HOU"))
  3. Were operated by United, American, or Delta

    flights %>% left_join(airlines) %>%
      filter(str_detect(name, "United|American|Delta"))
  4. Departed in summer (July, August, and September)

    flights %>% filter(between(month, 7, 9))
  5. Arrived more than two hours late, but didn’t leave late

    flights %>% filter(arr_delay > 120, dep_delay <= 0)
  6. Were delayed by at least an hour, but made up over 30 minutes in flight

    flights %>% filter(dep_delay >= 60, arr_delay < dep_delay - 30)
  7. Departed between midnight and 6am (inclusive)

    flights %>% mutate(dep_hour = dep_time %/% 100) %>%
      filter(dep_time == 2400 | dep_hour %in% c(0:5) | dep_time == 600)

2. Another useful dplyr filtering helper is between(). What does it do? Can you use it to simplify the code needed to answer the previous challenges?

  1. Departed between midnight and 6am (inclusive)

    flights %>% filter(dep_time == 2400 | dep_time %>% between(0, 600))
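
For reference, between(x, left, right) is documented as a shortcut for x >= left & x <= right, inclusive on both ends. A minimal sketch on a made-up vector of HHMM-style times (the values are illustrative, not taken from flights):

```r
library(dplyr)

# Hypothetical HHMM-style departure times, including one missing value
dep_time <- c(1, 559, 600, 601, 2400, NA)

# between() is inclusive on both ends
in_range <- between(dep_time, 0, 600)

# The equivalent comparison written out by hand
by_hand <- dep_time >= 0 & dep_time <= 600

in_range
by_hand
```

Because midnight is stored as 2400, it falls outside this range, which is why the answer still needs the separate dep_time == 2400 condition.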

3. How many flights have a missing dep_time? What other variables are missing? What might these rows represent?

na_names <- flights %>% map_dfr(. %>% is.na %>% sum) %>% select_if(~ . != 0)
kable(na_names)
dep_time  dep_delay  arr_time  arr_delay  tailnum  air_time
    8255       8255      8713       9430     2512      9430
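  • Rows where dep_time and arr_time are both NA are presumably cancelled flights.
  • Rows where only arr_time is NA: crashes? There are far too many for that.
  • Rows where only arr_delay and air_time are NA are a mystery.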

4. Why is NA ^ 0 not missing? Why is NA | TRUE not missing? Why is FALSE & NA not missing? Can you figure out the general rule? (NA * 0 is a tricky counterexample!)

If an expression evaluates to the same value for every possible input, it returns that value for NA as well.

?'^'

‘1 ^ y’ and ‘y ^ 0’ are ‘1’, always.

?'&'

‘NA’ is a valid logical object. Where a component of ‘x’ or ‘y’ is ‘NA’, the result will be ‘NA’ if the outcome is ambiguous. In other words ‘NA & TRUE’ evaluates to ‘NA’, but ‘NA & FALSE’ evaluates to ‘FALSE’. See the examples below.

x * 0 does not return 0 for every x.

1 * 0
Inf * 0
NaN * 0
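
Running these next to the expressions from the question makes the rule concrete. This is a small base-R sketch; the values in the comments follow from the documentation quoted above and from floating-point arithmetic:

```r
# Cases where the result is the same whatever the missing value is
NA^0          # 1
NA | TRUE     # TRUE
FALSE & NA    # FALSE

# x * 0 is not 0 for every x, so NA * 0 stays missing
1 * 0         # 0
Inf * 0       # NaN
NaN * 0       # NaN
NA * 0        # NA
```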

5.3.1 Exercises

1. How could you use arrange() to sort all missing values to the start? (Hint: use is.na()).

df <- tibble(x = c(5, 2, NA))
df %>% arrange(is.na(x) %>% desc)
df %>% arrange(!is.na(x))

2. Sort flights to find the most delayed flights. Find the flights that left earliest.

flights %>% arrange(desc(dep_delay))
flights %>% arrange(dep_delay)

3. Sort flights to find the fastest flights.

flights %>% arrange(air_time)

4. Which flights travelled the longest? Which travelled the shortest?

flights %>% arrange(desc(distance))  # longest
flights %>% arrange(distance)        # shortest

5.4.1 Exercises

1. Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.

flights %>% select(starts_with("dep_"), starts_with("arr_"))
flights %>% select(ends_with("_time"), ends_with("_delay"), -matches("^(sched_|air_)"))
flights %>% select(matches("^(arr|dep)"))

2. What happens if you include the name of a variable multiple times in a select() call?

flights %>% select(dep_time, dep_time, arr_time, dep_time) %>% head(1) %>% kable
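dep_time  arr_time
     517       830

The repeated name is ignored: each variable appears only once, at the position of its first mention.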

3. What does the one_of() function do? Why might it be helpful in conjunction with this vector?

vars <- c("year", "month", "day", "dep_delay", "arr_delay")
flights %>% select(one_of(vars))
## # A tibble: 336,776 x 5
##     year month   day dep_delay arr_delay
##    <int> <int> <int>     <dbl>     <dbl>
##  1  2013     1     1         2        11
##  2  2013     1     1         4        20
##  3  2013     1     1         2        33
##  4  2013     1     1        -1       -18
##  5  2013     1     1        -6       -25
##  6  2013     1     1        -4        12
##  7  2013     1     1        -5        19
##  8  2013     1     1        -3       -14
##  9  2013     1     1        -3        -8
## 10  2013     1     1        -2         8
## # … with 336,766 more rows
flights %>% select(vars)
## # A tibble: 336,776 x 5
##     year month   day dep_delay arr_delay
##    <int> <int> <int>     <dbl>     <dbl>
##  1  2013     1     1         2        11
##  2  2013     1     1         4        20
##  3  2013     1     1         2        33
##  4  2013     1     1        -1       -18
##  5  2013     1     1        -6       -25
##  6  2013     1     1        -4        12
##  7  2013     1     1        -5        19
##  8  2013     1     1        -3       -14
##  9  2013     1     1        -3        -8
## 10  2013     1     1        -2         8
## # … with 336,766 more rows
flights %>% mutate(vars = 1) %>% select(vars)
## # A tibble: 336,776 x 1
##     vars
##    <dbl>
##  1     1
##  2     1
##  3     1
##  4     1
##  5     1
##  6     1
##  7     1
##  8     1
##  9     1
## 10     1
## # … with 336,766 more rows
flights %>% mutate(vars = 1) %>% select(!!vars)
## # A tibble: 336,776 x 5
##     year month   day dep_delay arr_delay
##    <int> <int> <int>     <dbl>     <dbl>
##  1  2013     1     1         2        11
##  2  2013     1     1         4        20
##  3  2013     1     1         2        33
##  4  2013     1     1        -1       -18
##  5  2013     1     1        -6       -25
##  6  2013     1     1        -4        12
##  7  2013     1     1        -5        19
##  8  2013     1     1        -3       -14
##  9  2013     1     1        -3        -8
## 10  2013     1     1        -2         8
## # … with 336,766 more rows
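
As a hedged aside: in current tidyselect releases one_of() is superseded by all_of() and any_of(), and a bare select(vars) is ambiguous when a column of the same name exists, as the vars = 1 example shows. A small sketch on a made-up tibble (df, its columns, and no_such_column are illustrative, not part of flights):

```r
library(dplyr)

# Toy data with a column that happens to be called `vars`
df <- tibble(year = 2013, month = 1, day = 1, vars = 99)
vars <- c("year", "month", "no_such_column")

# any_of() silently ignores names that are not present
picked_any <- df %>% select(any_of(vars))

# all_of() is strict: a missing name is an error
picked_all <- tryCatch(
  df %>% select(all_of(vars)),
  error = function(e) "error"
)

names(picked_any)
picked_all
```

Wrapping the vector in a helper makes it explicit that it holds column names, so the vars column in the data is never picked up by accident.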

4. Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?

select(flights, contains("TIME"))
select(flights, contains("TIME", ignore.case = FALSE))
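
The behaviour can be checked on a small made-up tibble (the column names below are illustrative): the select helpers ignore case by default, and ignore.case = FALSE switches to exact-case matching.

```r
library(dplyr)

df <- tibble(dep_time = 517, AIR_TIME = 227, carrier = "UA")

# Default: case-insensitive, so both time columns match
default_match <- names(select(df, contains("TIME")))

# Case-sensitive: only the upper-case column matches
strict_match <- names(select(df, contains("TIME", ignore.case = FALSE)))

default_match
strict_match
```

Since every column name in flights is lower case, the second call in the exercise should select no columns at all.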

5.5.2 Exercises

1. Currently dep_time and sched_dep_time are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight.

time2min <- . %>% {. %/% 100 * 60 + . %% 100}
flights2 <- flights %>% mutate_at(vars(ends_with("_time"), -air_time), time2min)
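
A base-R sketch of the same arithmetic, written as an ordinary function so the steps are visible. The name hhmm_to_min and the sample values are hypothetical, and the extra %% 1440 is an assumption about how midnight should be handled, since 2400 would otherwise map to 1440 rather than 0:

```r
# Convert HHMM-encoded clock times to minutes since midnight
hhmm_to_min <- function(x) {
  hours <- x %/% 100     # leading digits are the hour
  minutes <- x %% 100    # last two digits are the minutes
  (hours * 60 + minutes) %% 1440  # wrap 2400 (midnight) around to 0
}

hhmm_to_min(c(517, 600, 1259, 2400, NA))
```

The results shown in this post were produced with time2min as written, without the wrap-around.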

2. Compare air_time with arr_time - dep_time. What do you expect to see? What do you see? What do you need to do to fix it?

arr_time and dep_time are in HHMM format, so subtracting them does not yield minutes. Apply the same conversion as in the previous exercise first, then subtract.

flights2 %>% transmute(air_time, air_time2 = arr_time - dep_time) %>%
  count(air_time == air_time2) %>% kable
air_time == air_time2       n
FALSE                  327150
TRUE                      196
NA                       9430

Almost every row came out FALSE. This is probably a time zone issue.

4. Find the 10 most delayed flights using a ranking function. How do you want to handle ties? Carefully read the documentation for min_rank().

flights %>% transmute(dep_delay, rank = min_rank(desc(dep_delay))) %>%
  filter(rank <= 10) %>% arrange(rank) %>% kable
dep_delay  rank
     1301     1
     1137     2
     1126     3
     1014     4
     1005     5
      960     6
      911     7
      899     8
      898     9
      896    10

x <- c(5, 1, 3, 2, 2, NA)
row_number(x)
## [1]  5  1  4  2  3 NA
min_rank(x)
## [1]  5  1  4  2  2 NA
dense_rank(x)
## [1]  4  1  3  2  2 NA

5. What does 1:3 + 1:10 return? Why?

1:3 + 1:10
## Warning in 1:3 + 1:10: longer object length is not a multiple of shorter
## object length
##  [1]  2  4  6  5  7  9  8 10 12 11

The “recycling rules” apply.

If one parameter is shorter than the other, it will be automatically extended to be the same length.

c(1,2,3,1,2,3,1,2,3,1) + 1:10
##  [1]  2  4  6  5  7  9  8 10 12 11

6. What trigonometric functions does R provide?

They are listed on the ?sin help page.
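
A few of the functions documented under ?sin, evaluated at points whose results are well known (a small base-R sketch):

```r
sin(pi / 2)    # 1
cos(0)         # 1
tan(pi / 4)    # 1, up to floating-point error
acos(1)        # 0
atan2(1, 1)    # pi / 4
sinpi(0.5)     # sin(pi * 0.5), computed without forming pi * 0.5
```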

5.6.7 Exercises

1. Brainstorm at least 5 different ways to assess the typical delay characteristics of a group of flights. Consider the following scenarios:

* A flight is 15 minutes early 50% of the time, and 15 minutes late 50% of the time.
* A flight is always 10 minutes late.
* A flight is 30 minutes early 50% of the time, and 30 minutes late 50% of the time.
* 99% of the time a flight is on time. 1% of the time it’s 2 hours late.

No idea.
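
One possible starting point, offered only as a sketch: write each scenario as a vector of delays and compare several summaries. The vectors below are hypothetical encodings of the four scenarios (negative means early), and the choice of summaries is an assumption, not an answer from the book:

```r
# Each scenario as 100 flights' worth of delays in minutes
scenarios <- list(
  early_late_15 = rep(c(-15, 15), each = 50),
  always_10     = rep(10, 100),
  early_late_30 = rep(c(-30, 30), each = 50),
  rare_2h       = c(rep(0, 99), 120)
)

# Five ways to describe a "typical" delay
summarise_delay <- function(x) {
  c(mean      = mean(x),
    median    = median(x),
    sd        = sd(x),
    prop_late = mean(x > 0),
    max       = max(x))
}

round(sapply(scenarios, summarise_delay), 2)
```

The first and third scenarios share a mean of 0 but differ in spread, which suggests that a single average is not enough to describe a group of flights.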

2. Come up with another approach that will give you the same output as not_cancelled %>% count(dest) and not_cancelled %>% count(tailnum, wt = distance) (without using count()).

not_cancelled %>% group_by(dest) %>% summarise(n = n())
not_cancelled %>% group_by(tailnum) %>% summarise(n = sum(distance))

3. Our definition of cancelled flights (is.na(dep_delay) | is.na(arr_delay) ) is slightly suboptimal. Why? Which is the most important column?

flights %>% filter(!is.na(air_time))

5. Which carrier has the worst delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about flights %>% group_by(carrier, dest) %>% summarise(n()))

carrier_delay <- flights %>% group_by(carrier) %>%
  summarise(n = n(), delay = mean(arr_delay, na.rm = TRUE))

## Compute the mean delay for each route and carrier,
## then take the residual from each route's mean
flights %>% mutate(route = str_c(origin, dest, sep = "->")) %>%
  group_by(route, carrier) %>%
  summarise(n = n(), delay = mean(arr_delay, na.rm = TRUE)) %>%
  filter(n > 20) %>%
  mutate(route_delay = mean(delay, na.rm = TRUE),
         residual = scale(delay - route_delay)) %>%
  ungroup() %>%
  group_by(carrier) %>%
  summarise(mean_residual = mean(residual, na.rm = TRUE)) %>%
  left_join(carrier_delay) %>% arrange(mean_residual %>% desc) %>%
  kable()
## Joining, by = "carrier"
carrier  mean_residual      n       delay
F9           1.4965288    685  21.9207048
YV           0.5793190    601  15.5569853
B6           0.5670779  54635   9.4579733
FL           0.5303111   3260  20.1159055
MQ           0.3710127  26397  10.7747334
EV           0.2608272  54173  15.7964311
UA           0.0472507  58665   3.5580111
WN           0.0032268  12275   9.6491199
9E          -0.1796967  18460   7.3796692
AA          -0.2340007  32729   0.3642909
VX          -0.2368222   5162   1.7644644
DL          -0.5672194  48110   1.6443409
AS          -0.7071068    714  -9.9308886
OO          -0.7445727     32  11.9310345
US          -0.7937979  20536   2.1295951
HA                 NaN    342  -6.9152047

6. What does the sort argument to count() do? When might you use it?

sort: if ‘TRUE’ will sort output in descending order of ‘n’
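
A minimal sketch on a made-up tibble (the carrier codes are illustrative) showing that count(sort = TRUE) gives the same ordering as count() followed by arrange(desc(n)):

```r
library(dplyr)

df <- tibble(carrier = c("UA", "AA", "UA", "DL", "UA", "AA"))

# Sorting done by count() itself
sorted <- df %>% count(carrier, sort = TRUE)

# The same thing spelled out in two steps
arranged <- df %>% count(carrier) %>% arrange(desc(n))

sorted
```

This is convenient when only the most frequent groups matter, for example when looking for the busiest carriers or destinations.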