5 Data transformation

5.2.4
5.3.1 Exercises
5.4.1 Exercises
5.5.2 Exercises
5.6.7 Exercises

5.2.4

1. Find all flights that

Had an arrival delay of two or more hours
```
flights %>% filter(arr_delay >= 120)
```

Flew to Houston (IAH or HOU)

flights %>% filter(dest %in% c("IAH", "HOU"))

Were operated by United, American, or Delta

flights %>% left_join(airlines) %>%
  filter(str_detect(name, "United|American|Delta"))

Departed in summer (July, August, and September)
```
flights %>% filter(between(month, 7, 9))
```
Arrived more than two hours late, but didn’t leave late
```
flights %>% filter(arr_delay > 120, dep_delay <= 0)
```
Were delayed by at least an hour, but made up over 30 minutes in flight
```
flights %>% filter(dep_delay >= 60, arr_delay < dep_delay - 30)
```

Departed between midnight and 6am (inclusive)

flights %>% mutate(dep_hour = dep_time %/% 100) %>%
  filter(dep_time == 2400 | dep_hour %in% c(0:5) | dep_time == 600)

2. Another useful dplyr filtering helper is `between()`. What does it do? Can you use it to simplify the code needed to answer the previous challenges?

Departed between midnight and 6am (inclusive)

flights %>% filter(dep_time == 2400 | dep_time %>% between(0, 600))
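
As a rough illustration of what `between()` does, the base-R sketch below defines a hypothetical helper, `between_sketch()` (not part of dplyr), that mirrors its inclusive comparison on both ends. The departure times are made up.

```r
# Hypothetical stand-in for dplyr::between(): inclusive on both ends.
between_sketch <- function(x, left, right) {
  x >= left & x <= right
}

# Made-up HHMM departure times, including the 2400 midnight encoding.
dep_time <- c(1, 559, 600, 601, 2400, NA)

# TRUE for 1, 559, and 600; FALSE for 601 and 2400; NA stays NA.
in_range <- between_sketch(dep_time, 0, 600)

# Same condition as the filter above: midnight (2400) is added explicitly.
early <- dep_time == 2400 | in_range
```

Because both ends are inclusive, 600 is kept without a separate `dep_time == 600` test.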

3. How many flights have a missing dep_time? What other variables are missing? What might these rows represent?

na_names <- flights %>% map_dfr(. %>% is.na %>% sum) %>% select_if(~ . != 0)
kable(na_names)

dep_time	dep_delay	arr_time	arr_delay	tailnum	air_time
8255	8255	8713	9430	2512	9430

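Rows where both dep_time and arr_time are NA are presumably cancelled flights.
Rows where only arr_time is NA: crashes? There are too many for that.
Rows where only arr_delay and air_time are NA are a mystery.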

4. Why is `NA ^ 0` not missing? Why is NA | TRUE not missing? Why is FALSE & NA not missing? Can you figure out the general rule? (NA * 0 is a tricky counterexample!)

If an expression evaluates to the same single value for every possible input, it returns that value for NA as well.

?'^'

‘1 ^ y’ and ‘y ^ 0’ are ‘1’, always.

?'&'

‘NA’ is a valid logical object. Where a component of ‘x’ or ‘y’ is ‘NA’, the result will be ‘NA’ if the outcome is ambiguous. In other words ‘NA & TRUE’ evaluates to ‘NA’, but ‘NA & FALSE’ evaluates to ‘FALSE’. See the examples below.

`x * 0` does not return 0 for every x, so `NA * 0` stays NA.

1 * 0
Inf * 0
NaN * 0
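
The rule can be checked directly in base R. This is only a small sketch; the variable names are chosen here for illustration.

```r
# Results that are the same for every possible value are returned even for NA.
pow <- NA ^ 0       # 1: anything to the power 0 is 1
or  <- NA | TRUE    # TRUE: (anything OR TRUE) is TRUE
and <- FALSE & NA   # FALSE: (FALSE AND anything) is FALSE

# Multiplication by 0 is not constant: Inf * 0 is NaN rather than 0,
# so NA * 0 has to stay NA.
prod_na  <- NA * 0
prod_inf <- Inf * 0
```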

5.3.1 Exercises

1. How could you use `arrange()` to sort all missing values to the start? (Hint: use `is.na()`).

df <- tibble(x = c(5, 2, NA))
df %>% arrange(is.na(x) %>% desc)
df %>% arrange(!is.na(x))
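
Why this works can be seen with base R's `order()`, used here as a rough analogue of `arrange()` (an assumption for illustration, not how dplyr is implemented): the sort key is a logical vector with no missing values, and FALSE sorts before TRUE.

```r
x <- c(5, 2, NA)

# !is.na(x) is TRUE, TRUE, FALSE; FALSE sorts before TRUE,
# so ordering by it puts the missing value first.
na_first <- x[order(!is.na(x))]

# Adding x as a tie-breaker also sorts the non-missing values.
na_first_sorted <- x[order(!is.na(x), x)]
```

The dplyr counterpart of the second line would be `df %>% arrange(!is.na(x), x)`.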

2. Sort `flights` to find the most delayed flights. Find the flights that left earliest.

flights %>% arrange(desc(dep_delay))
flights %>% arrange(dep_delay)

3. Sort `flights` to find the fastest flights.

flights %>% arrange(air_time)

4. Which flights travelled the longest? Which travelled the shortest?

flights %>% arrange(distance)
flights %>% arrange(desc(distance))

5.4.1 Exercises

1. Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.

flights %>% select(starts_with("dep_"), starts_with("arr_"))
flights %>% select(ends_with("_time"), ends_with("_delay"), -matches("^(sched_|air_)"))
flights %>% select(matches("^(arr|dep)"))

2. What happens if you include the name of a variable multiple times in a select() call?

flights %>% select(dep_time, dep_time, arr_time, dep_time) %>% head(1) %>% kable

dep_time	arr_time
517	830

3. What does the `one_of()` function do? Why might it be helpful in conjunction with this vector?

vars <- c("year", "month", "day", "dep_delay", "arr_delay")
flights %>% select(one_of(vars))

## # A tibble: 336,776 x 5
##     year month   day dep_delay arr_delay
##    <int> <int> <int>     <dbl>     <dbl>
##  1  2013     1     1         2        11
##  2  2013     1     1         4        20
##  3  2013     1     1         2        33
##  4  2013     1     1        -1       -18
##  5  2013     1     1        -6       -25
##  6  2013     1     1        -4        12
##  7  2013     1     1        -5        19
##  8  2013     1     1        -3       -14
##  9  2013     1     1        -3        -8
## 10  2013     1     1        -2         8
## # … with 336,766 more rows

flights %>% select(vars)

## # A tibble: 336,776 x 5
##     year month   day dep_delay arr_delay
##    <int> <int> <int>     <dbl>     <dbl>
##  1  2013     1     1         2        11
##  2  2013     1     1         4        20
##  3  2013     1     1         2        33
##  4  2013     1     1        -1       -18
##  5  2013     1     1        -6       -25
##  6  2013     1     1        -4        12
##  7  2013     1     1        -5        19
##  8  2013     1     1        -3       -14
##  9  2013     1     1        -3        -8
## 10  2013     1     1        -2         8
## # … with 336,766 more rows

flights %>% mutate(vars = 1) %>% select(vars)

## # A tibble: 336,776 x 1
##     vars
##    <dbl>
##  1     1
##  2     1
##  3     1
##  4     1
##  5     1
##  6     1
##  7     1
##  8     1
##  9     1
## 10     1
## # … with 336,766 more rows

flights %>% mutate(vars = 1) %>% select(!!vars)

## # A tibble: 336,776 x 5
##     year month   day dep_delay arr_delay
##    <int> <int> <int>     <dbl>     <dbl>
##  1  2013     1     1         2        11
##  2  2013     1     1         4        20
##  3  2013     1     1         2        33
##  4  2013     1     1        -1       -18
##  5  2013     1     1        -6       -25
##  6  2013     1     1        -4        12
##  7  2013     1     1        -5        19
##  8  2013     1     1        -3       -14
##  9  2013     1     1        -3        -8
## 10  2013     1     1        -2         8
## # … with 336,766 more rows

4. Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?

select(flights, contains("TIME"))
select(flights, contains("TIME", ignore.case = FALSE))
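
The select helpers ignore case by default, so the first call matches every column whose name contains "time", while the second matches none. The base-R sketch below imitates that behaviour with `grepl()` on a hand-copied subset of the column names; it is an analogy, not the tidyselect implementation.

```r
# A few column names from flights, copied by hand.
nms <- c("dep_time", "sched_dep_time", "arr_time", "sched_arr_time",
         "air_time", "time_hour", "dep_delay")

# Case-insensitive match, analogous to the default contains("TIME").
ci <- nms[grepl("TIME", nms, ignore.case = TRUE)]

# Case-sensitive match, analogous to contains("TIME", ignore.case = FALSE).
cs <- nms[grepl("TIME", nms, ignore.case = FALSE)]
```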

5.5.2 Exercises

1. Currently `dep_time` and `sched_dep_time` are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight.

time2min <- . %>% {. %/% 100 * 60 + . %% 100}
flights2 <- flights %>% mutate_at(vars(ends_with("_time"), -air_time), time2min)
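
The conversion can be checked on a few hand-picked values. The sketch below uses a hypothetical helper, `hhmm_to_minutes()`, and adds one assumption that `time2min` above does not make: a final `%% 1440` so that midnight recorded as 2400 maps to 0 rather than 1440.

```r
# Convert an HHMM clock time to minutes since midnight.
# The %% 1440 step is an added assumption: it maps 2400 to 0.
hhmm_to_minutes <- function(hhmm) {
  (hhmm %/% 100 * 60 + hhmm %% 100) %% 1440
}

# 5:17, 6:00, 12:59, midnight written as 2400, and a missing value.
mins <- hhmm_to_minutes(c(517, 600, 1259, 2400, NA))
```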

2. Compare `air_time` with `arr_time - dep_time`. What do you expect to see? What do you see? What do you need to do to fix it?

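arr_time and dep_time are stored in HHMM format, so subtracting them directly does not give minutes. Apply the same conversion as in the previous exercise first, then subtract.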

flights2 %>% transmute(air_time, air_time2 = arr_time - dep_time) %>%
  count(air_time == air_time2) %>% kable

air_time == air_time2	n
FALSE	327150
TRUE	196
NA	9430

Almost every comparison came out FALSE. This is probably a time zone issue, since dep_time and arr_time are recorded in local time.

3. Compare `dep_time`, `sched_dep_time`, and `dep_delay`. How would you expect those three numbers to be related?

Converting dep_time and sched_dep_time to minutes and taking the difference should reproduce dep_delay. However, when sched_dep_time plus dep_delay crosses midnight, this calculation will be off.

flights3 <- flights %>% mutate(dep_time_in_minutes = time2min(dep_time),
                               sched_dep_time_in_minutes = time2min(sched_dep_time),
                               dep_delay2 = dep_time_in_minutes - sched_dep_time_in_minutes,
                               rev_dep_time_in_minutes= sched_dep_time_in_minutes + dep_delay,
                               over_night = rev_dep_time_in_minutes > 60 * 24)
flights3 %>% filter(dep_delay != dep_delay2) %>% nrow

## [1] 1207

flights3 %>% filter(over_night) %>% nrow

## [1] 1207
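
The size of the discrepancy for an overnight departure can be worked through with a made-up flight. The values below are invented for illustration, and `to_minutes()` is a hypothetical helper equivalent to `time2min`.

```r
to_minutes <- function(hhmm) hhmm %/% 100 * 60 + hhmm %% 100

# Made-up flight: scheduled at 23:00, departs 90 minutes late at 00:30.
sched_dep_time <- 2300
dep_time <- 30
dep_delay <- 90

# The naive difference ignores the date change and is off by one day.
naive <- to_minutes(dep_time) - to_minutes(sched_dep_time)

# Adding 24 * 60 minutes recovers the recorded delay.
corrected <- naive + 24 * 60
```

Here `naive` is -1350 and `corrected` is 90, which is why the mismatched rows coincide with the `over_night` rows.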

4. Find the 10 most delayed flights using a ranking function. How do you want to handle ties? Carefully read the documentation for `min_rank()`.

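flights %>% transmute(dep_delay, rank = min_rank(desc(dep_delay))) %>%
  filter(rank < 10) %>% arrange(rank) %>% kable

Note that `rank < 10` keeps only ranks 1 through 9, so the table below has nine rows; `rank <= 10` is needed to return the ten most delayed flights.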

dep_delay	rank
1301	1
1137	2
1126	3
1014	4
1005	5
960	6
911	7
899	8
898	9

x <- c(5, 1, 3, 2, 2, NA)
row_number(x)

## [1]  5  1  4  2  3 NA

min_rank(x)

## [1]  5  1  4  2  2 NA

dense_rank(x)

## [1]  4  1  3  2  2 NA
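
For comparison, the same three tie-handling behaviours can be reproduced in base R. The sketch relies on `rank()` with `ties.method`, which the dplyr documentation names as the equivalent of `row_number()` and `min_rank()`; the `match()` line for dense ranks is a workaround assumed here, since `rank()` has no dense option.

```r
x <- c(5, 1, 3, 2, 2, NA)

# Ties broken by position, like row_number().
r_first <- rank(x, ties.method = "first", na.last = "keep")

# Ties share the smallest rank and leave a gap, like min_rank().
r_min <- rank(x, ties.method = "min", na.last = "keep")

# No gaps after ties, like dense_rank().
r_dense <- match(x, sort(unique(x)))
```

With `min_rank()`, flights tied at the cutoff all share the same rank, so a filter on the rank can return more than 10 rows.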

5. What does `1:3 + 1:10` return? Why?

1:3 + 1:10

## Warning in 1:3 + 1:10: longer object length is not a multiple of shorter
## object length

##  [1]  2  4  6  5  7  9  8 10 12 11

The “recycling rules” apply.

If one parameter is shorter than the other, it will be automatically extended to be the same length.

c(1,2,3,1,2,3,1,2,3,1) + 1:10

##  [1]  2  4  6  5  7  9  8 10 12 11

6. What trigonometric functions does R provide?

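See `?sin`, which documents `cos()`, `sin()`, `tan()`, `acos()`, `asin()`, `atan()`, `atan2()`, `cospi()`, `sinpi()`, and `tanpi()`.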
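
A few of these in use, as a small sketch. The `*pi()` variants take their argument in multiples of pi, which avoids the rounding error that comes from representing pi itself.

```r
s  <- sin(pi / 2)    # sine of 90 degrees
a  <- atan2(1, 1)    # angle of the point (1, 1), in radians (pi / 4)
sp <- sinpi(1)       # sin(pi * 1), computed as exactly 0
sn <- sin(pi)        # tiny nonzero value from floating-point rounding
```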

5.6.7 Exercises

1. Brainstorm at least 5 different ways to assess the typical delay characteristics of a group of flights. Consider the following scenarios:

* A flight is 15 minutes early 50% of the time, and 15 minutes late 50% of the time.
* A flight is always 10 minutes late.
* A flight is 30 minutes early 50% of the time, and 30 minutes late 50% of the time.
* 99% of the time a flight is on time. 1% of the time it’s 2 hours late.

I don't know.
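
One possible starting point, offered as a sketch rather than a definitive answer: encode each scenario as 100 arrival delays and compare five summaries (mean, median, standard deviation, share of late flights, share delayed by an hour or more). The scenario names, the helper `summarise_delay()`, and the 60-minute threshold are all assumptions made for this illustration.

```r
# The four scenarios as vectors of 100 arrival delays (minutes).
scenarios <- list(
  early_late_15 = rep(c(-15, 15), each = 50),
  always_10     = rep(10, 100),
  early_late_30 = rep(c(-30, 30), each = 50),
  rare_2h       = c(rep(0, 99), 120)
)

# Five candidate summaries of "typical" delay.
summarise_delay <- function(d) {
  c(mean      = mean(d),
    median    = median(d),
    sd        = sd(d),
    p_late    = mean(d > 0),    # share of flights arriving late
    p_late_60 = mean(d >= 60))  # share arriving an hour or more late
}

# One row per scenario, one column per summary.
delay_summary <- t(sapply(scenarios, summarise_delay))
```

The mean alone cannot separate the two 50/50 scenarios (both average 0), while the standard deviation and the share of late flights can.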

2. Come up with another approach that will give you the same output as `not_cancelled %>% count(dest)` and `not_cancelled %>% count(tailnum, wt = distance)` (without using `count()`).

not_cancelled %>% group_by(dest) %>% summarise(n = n())
not_cancelled %>% group_by(tailnum) %>% summarise(n = sum(distance))

3. Our definition of cancelled flights `(is.na(dep_delay) | is.na(arr_delay) )` is slightly suboptimal. Why? Which is the most important column?

flights %>% filter(!is.na(air_time))

4. Look at the number of cancelled flights per day. Is there a pattern? Is the proportion of cancelled flights related to the average delay?

cancel_count <- flights %>% group_by(year, month, day) %>%
  summarise(n = n(),
            n_cancelled = sum(is.na(air_time)),
            not_cancelled = n - n_cancelled,
            p_cancelled = n_cancelled / n,
            delay = mean(arr_delay, na.rm = TRUE)
            )


cancel_count %>%
  ggplot(aes(delay, p_cancelled)) +
  geom_point(aes(size = n), alpha = 0.3) +
  geom_smooth(se = FALSE)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

cancel_count %>% arrange(desc(p_cancelled)) %>% head(2) %>% kable

year	month	day	n	n_cancelled	not_cancelled	p_cancelled	delay
2013	2	9	684	393	291	0.5745614	6.639175
2013	2	8	930	475	455	0.5107527	24.228571

What happened on February 8 and 9, 2013? Apparently a blizzard hit. https://www.weather.gov/okx/storm02092013

5. Which carrier has the worst delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about `flights %>% group_by(carrier, dest) %>% summarise(n())`)

carrier_delay <- flights %>% group_by(carrier) %>%
  summarise(n = n(), delay = mean(arr_delay, na.rm = TRUE))

## Compute the mean delay for each route and carrier,
## then take the residual from each route's mean
flights %>% mutate(route = str_c(origin, dest, sep = "->")) %>%
  group_by(route, carrier) %>%
  summarise(n = n(), delay = mean(arr_delay, na.rm = TRUE)) %>%
  filter(n > 20) %>%
  mutate(route_delay = mean(delay, na.rm = TRUE),
         residual = scale(delay - route_delay)) %>%
  ungroup() %>%
  group_by(carrier) %>%
  summarise(mean_residual = mean(residual, na.rm = TRUE)) %>%
  left_join(carrier_delay) %>% arrange(mean_residual %>% desc) %>%
  kable()

## Joining, by = "carrier"

carrier	mean_residual	n	delay
F9	1.4965288	685	21.9207048
YV	0.5793190	601	15.5569853
B6	0.5670779	54635	9.4579733
FL	0.5303111	3260	20.1159055
MQ	0.3710127	26397	10.7747334
EV	0.2608272	54173	15.7964311
UA	0.0472507	58665	3.5580111
WN	0.0032268	12275	9.6491199
9E	-0.1796967	18460	7.3796692
AA	-0.2340007	32729	0.3642909
VX	-0.2368222	5162	1.7644644
DL	-0.5672194	48110	1.6443409
AS	-0.7071068	714	-9.9308886
OO	-0.7445727	32	11.9310345
US	-0.7937979	20536	2.1295951
HA	NaN	342	-6.9152047

6. What does the `sort` argument to `count()` do? When might you use it?

sort: if ‘TRUE’ will sort output in descending order of ‘n’
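
A base-R analogue, shown only to illustrate the effect (the destination codes are made up): sorting the counts in descending order puts the most frequent groups at the top, which is handy when only the largest groups matter.

```r
# Made-up destination codes.
dest <- c("ORD", "ATL", "ORD", "LAX", "ORD", "ATL")

# Count per destination, then sort so the largest count comes first,
# similar in spirit to count(dest, sort = TRUE).
counts <- table(dest)
sorted_counts <- sort(counts, decreasing = TRUE)
```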

5 Data transformation

2019-05-22
