5 Data transformation

5.2.4
5.3.1 Exercises
5.4.1 Exercises
5.5.2 Exercises
5.6.7 Exercises
5.7.1 Exercises

5.2.4

1. Find all flights that

Had an arrival delay of two or more hours
```
flights %>% filter(arr_delay >= 120)
```

Flew to Houston (IAH or HOU)

flights %>% filter(dest %in% c("IAH", "HOU"))

Were operated by United, American, or Delta

flights %>% left_join(airlines) %>%
  filter(str_detect(name, "United|American|Delta"))

Departed in summer (July, August, and September)
```
flights %>% filter(between(month, 7, 9))
```
Arrived more than two hours late, but didn’t leave late
```
flights %>% filter(arr_delay > 120, dep_delay <= 0)
```
Were delayed by at least an hour, but made up over 30 minutes in flight
```
flights %>% filter(dep_delay >= 60, arr_delay < dep_delay - 30)
```

Departed between midnight and 6am (inclusive)

flights %>% mutate(dep_hour = dep_time %/% 100) %>%
  filter(dep_time == 2400 | dep_hour %in% c(0:5) | dep_time == 600)

2. Another useful dplyr filtering helper is `between()`. What does it do? Can you use it to simplify the code needed to answer the previous challenges?

Departed between midnight and 6am (inclusive)

flights %>% filter(dep_time == 2400 | dep_time %>% between(0, 600))

3. How many flights have a missing dep_time? What other variables are missing? What might these rows represent?

na_names <- flights %>% map_dfr(. %>% is.na %>% sum) %>% select_if(~ . != 0)
kable(na_names)

dep_time	dep_delay	arr_time	arr_delay	tailnum	air_time
8255	8255	8713	9430	2512	9430

dep_timeとarr_timeがセットでNAのデータは、フライトが中止したと考えられる。
arr_timeだけNAのデータは墜落? 多すぎる。
arr_delayとair_timeがNAだけNAのデータは謎。

4. Why is `NA ^ 0` not missing? Why is NA | TRUE not missing? Why is FALSE & NA not missing? Can you figure out the general rule? (NA * 0 is a tricky counterexample!)

全ての入力に対して値が1つに定まる式の場合、NAに対してもその値を返す。

?'^'

‘1 ^ y’ and ‘y ^ 0’ are ‘1’, always.

?'&'

‘NA’ is a valid logical object. Where a component of ‘x’ or ‘y’ is ‘NA’, the result will be ‘NA’ if the outcome is ambiguous. In other words ‘NA & TRUE’ evaluates to ‘NA’, but ‘NA & FALSE’ evaluates to ‘FALSE’. See the examples below.

x * 0は全てのxに対して0を返すとは限らない。

1 * 0
Inf * 0
NaN * 0

5.3.1 Exercises

1. How could you use `arrange()` to sort all missing values to the start? (Hint: use `is.na()`).

df <- tibble(x = c(5, 2, NA))
df %>% arrange(is.na(x) %>% desc)
df %>% arrange(!is.na(x))

2. Sort flights to find the most delayed `flights`. Find the flights that left earliest.

flights %>% arrange(desc(dep_delay))
flights %>% arrange(dep_delay)

3. Sort `flights`` to find the fastest flights.

flights %>% arrange(air_time)

4. Which flights travelled the longest? Which travelled the shortest?

flights %>% arrange(distance)
flights %>% arrange(desc(distance))

5.4.1 Exercises

1. Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.

flights %>% select(starts_with("dep_"), starts_with("arr_"))
flights %>% select(ends_with("_time"), ends_with("_delay"), -matches("^(sched_|air_)"))
flights %>% select(matches("^(arr|dep)"))

2. What happens if you include the name of a variable multiple times in a select() call?

flights %>% select(dep_time, dep_time, arr_time, dep_time) %>% head(1) %>% kable

dep_time	arr_time
517	830

3. What does the `one_of()` function do? Why might it be helpful in conjunction with this vector?

vars <- c("year", "month", "day", "dep_delay", "arr_delay")
flights %>% select(one_of(vars))

## # A tibble: 336,776 x 5
##     year month   day dep_delay arr_delay
##    <int> <int> <int>     <dbl>     <dbl>
##  1  2013     1     1         2        11
##  2  2013     1     1         4        20
##  3  2013     1     1         2        33
##  4  2013     1     1        -1       -18
##  5  2013     1     1        -6       -25
##  6  2013     1     1        -4        12
##  7  2013     1     1        -5        19
##  8  2013     1     1        -3       -14
##  9  2013     1     1        -3        -8
## 10  2013     1     1        -2         8
## # … with 336,766 more rows

flights %>% select(vars)

## # A tibble: 336,776 x 5
##     year month   day dep_delay arr_delay
##    <int> <int> <int>     <dbl>     <dbl>
##  1  2013     1     1         2        11
##  2  2013     1     1         4        20
##  3  2013     1     1         2        33
##  4  2013     1     1        -1       -18
##  5  2013     1     1        -6       -25
##  6  2013     1     1        -4        12
##  7  2013     1     1        -5        19
##  8  2013     1     1        -3       -14
##  9  2013     1     1        -3        -8
## 10  2013     1     1        -2         8
## # … with 336,766 more rows

flights %>% mutate(vars = 1) %>% select(vars)

## # A tibble: 336,776 x 1
##     vars
##    <dbl>
##  1     1
##  2     1
##  3     1
##  4     1
##  5     1
##  6     1
##  7     1
##  8     1
##  9     1
## 10     1
## # … with 336,766 more rows

flights %>% mutate(vars = 1) %>% select(!!vars)

## # A tibble: 336,776 x 5
##     year month   day dep_delay arr_delay
##    <int> <int> <int>     <dbl>     <dbl>
##  1  2013     1     1         2        11
##  2  2013     1     1         4        20
##  3  2013     1     1         2        33
##  4  2013     1     1        -1       -18
##  5  2013     1     1        -6       -25
##  6  2013     1     1        -4        12
##  7  2013     1     1        -5        19
##  8  2013     1     1        -3       -14
##  9  2013     1     1        -3        -8
## 10  2013     1     1        -2         8
## # … with 336,766 more rows

4. Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?

select(flights, contains("TIME"))
select(flights, contains("TIME", ignore.case = FALSE))

5.5.2 Exercises

1. Currently `dep_time` and `sched_dep_time` are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight.

time2min <- . %>% {. %/% 100 * 60 + . %% 100}
flights2 <- flights %>% mutate_at(vars(ends_with("_time"), -air_time), time2min)

2. Compare `air_time` with `arr_time - dep_time`. What do you expect to see? What do you see? What do you need to do to fix it?

arr_timeとdep_timeはHHMMのフォーマットになっており、引き算をしても分にならない。上の問題と同じ変換をかけてから引き算をしてみる。

flights2 %>% transmute(air_time, air_time2 = arr_time - dep_time) %>%
  count(air_time == air_time2) %>% kable

air_time == air_time2	n
FALSE	327150
TRUE	196
NA	9430

ほとんどFALSEになってしまった。タイムゾーンの問題と思われる。

3. Compare `dep_time`, `sched_dep_time`, and `dep_delay`. How would you expect those three numbers to be related?

dep_timeとsched_dep_timeを分に直して差をとれば、dep_delayと同じものがでてくる。しかしsched_dep_timeにdep_delayを足したものが日付を跨ぐ場合、この計算方法だとずれが出るはずである。

flights3 <- flights %>% mutate(dep_time_in_minutes = time2min(dep_time),
                               sched_dep_time_in_minutes = time2min(sched_dep_time),
                               dep_delay2 = dep_time_in_minutes - sched_dep_time_in_minutes,
                               rev_dep_time_in_minutes= sched_dep_time_in_minutes + dep_delay,
                               over_night = rev_dep_time_in_minutes > 60 * 24)
flights3 %>% filter(dep_delay != dep_delay2) %>% nrow

## [1] 1207

flights3 %>% filter(over_night) %>% nrow

## [1] 1207

4. Find the 10 most delayed flights using a ranking function. How do you want to handle ties? Carefully read the documentation for `min_rank()`.

flights %>% transmute(dep_delay, rank = min_rank(desc(dep_delay) )) %>%
  filter(rank < 10) %>% arrange(rank) %>% kable

dep_delay	rank
1301	1
1137	2
1126	3
1014	4
1005	5
960	6
911	7
899	8
898	9

x <- c(5, 1, 3, 2, 2, NA)
row_number(x)

## [1]  5  1  4  2  3 NA

min_rank(x)

## [1]  5  1  4  2  2 NA

dense_rank(x)

## [1]  4  1  3  2  2 NA

5. What does `1:3 + 1:10` return? Why?

1:3 + 1:10

## Warning in 1:3 + 1:10: 長いオブジェクトの長さが短いオブジェクトの長さの倍数
## になっていません

##  [1]  2  4  6  5  7  9  8 10 12 11

“recycling rules”が適用される。

If one parameter is shorter than the other, it will be automatically extended to be the same length.

c(1,2,3,1,2,3,1,2,3,1) + 1:10

##  [1]  2  4  6  5  7  9  8 10 12 11

6.What trigonometric functions does R provide?

?sinででてくる。

5.6.7 Exercises

1. Brainstorm at least 5 different ways to assess the typical delay characteristics of a group of flights. Consider the following scenarios:

* A flight is 15 minutes early 50% of the time, and 15 minutes late 50% of the time.
* A flight is always 10 minutes late.
* A flight is 30 minutes early 50% of the time, and 30 minutes late 50% of the time.
* 99% of the time a flight is on time. 1% of the time it’s 2 hours late.

わからん。

2. Come up with another approach that will give you the same output as `not_cancelled %>% count(dest)` and `not_cancelled %>% count(tailnum, wt = distance)` (without using `count()`).

not_cancelled %>% group_by(dest) %>% summarise(n = n())
not_cancelled %>% group_by(tailnum) %>% summarise(n = sum(distance))

3. Our definition of cancelled flights `(is.na(dep_delay) | is.na(arr_delay) )` is slightly suboptimal. Why? Which is the most important column?

flights %>% filter(!is.na(air_time))

4. Look at the number of cancelled flights per day. Is there a pattern? Is the proportion of cancelled flights related to the average delay?

cancel_count <- flights %>% group_by(year, month, day) %>%
  summarise(n = n(),
            n_cancelled = sum(is.na(air_time)),
            not_cancelled = n - n_cancelled,
            p_cancelled = n_cancelled / n,
            delay = mean(arr_delay, na.rm = TRUE)
            )


cancel_count %>%
  ggplot(aes(delay, p_cancelled)) +
  geom_point(aes(size = n), alpha = 0.3) +
  geom_smooth(se = FALSE)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

cancel_count %>% arrange(desc(p_cancelled)) %>% head(2) %>% kable

year	month	day	n	n_cancelled	not_cancelled	p_cancelled	delay
2013	2	9	684	393	291	0.5745614	6.639175
2013	2	8	930	475	455	0.5107527	24.228571

2013年2月8,9日に何があったのだろうか。ブリザードが来てたようです。 https://www.weather.gov/okx/storm02092013

5. Which carrier has the worst delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about `flights %>% group_by(carrier, dest) %>% summarise(n())`)

## 各ルート,キャリアごとに平均遅延を計算し
## 各ルートでの平均からの残差を求めた
flights %>% mutate(route = str_c(origin, dest, sep = "->")) %>%
  group_by(route, carrier) %>%
  summarise(n = n(), delay = mean(arr_delay, na.rm = TRUE)) %>%
  filter(n > 20) %>%
  mutate(route_delay = mean(delay, na.rm = TRUE),
         residual = scale(delay - route_delay)) %>%
  ungroup() %>%
  group_by(carrier) %>%
  summarise(mean_residual = mean(residual, na.rm = TRUE)) %>%
  arrange(mean_residual %>% desc) %>%
  kable()

carrier	mean_residual
F9	1.4965288
YV	0.5793190
B6	0.5670779
FL	0.5303111
MQ	0.3710127
EV	0.2608272
UA	0.0472507
WN	0.0032268
9E	-0.1796967
AA	-0.2340007
VX	-0.2368222
DL	-0.5672194
AS	-0.7071068
OO	-0.7445727
US	-0.7937979
HA	NaN

6. What does the `sort` argument to `count()` do. When might you use it?

sort: if ‘TRUE’ will sort output in descending order of ‘n’

5.7.1 Exercises

1. Refer back to the lists of useful mutate and filtering functions. Describe how each operation changes when you combine it with grouping.

いろいろある

2. Which plane (`tailnum`) has the worst on-time record?

離陸、着陸時間のずれを見る

resid <- function(x, n) {
  sqrt(sum(x^2, na.rm = TRUE)) / n
}

flights_tail <- flights %>% filter(!is.na(tailnum)) %>% group_by(tailnum)



flights_tail %>%
  summarise(
    dep_resid = resid(dep_delay, n()),
    arr_resid = resid(arr_delay, n()),
    n = n()
    ) %>%
  mutate(
    resid = dep_resid + arr_resid,
    r = percent_rank(desc(n))) %>%
  arrange(resid %>% desc) %>%
  filter(r < 0.9) %>% head(10) %>%
  kable()

tailnum	dep_resid	arr_resid	n	resid	r
N665MQ	167.66170	165.18878	6	332.85048	0.8748144
N276AT	79.30759	77.00866	6	156.31625	0.8748144
N550NW	58.61114	55.26892	7	113.88005	0.8631865
N370SW	49.36148	50.91114	6	100.27262	0.8748144
N652SW	48.81797	50.34823	6	99.16620	0.8748144
N387SW	48.07120	49.92554	5	97.99674	0.8879268
N521US	45.34809	49.44261	7	94.79069	0.8631865
N919FJ	42.26504	44.60475	6	86.86978	0.8748144
N917DN	37.19785	47.38228	5	84.58012	0.8879268
N609SW	39.91190	44.60762	5	84.51953	0.8879268

3. What time of day should you fly if you want to avoid delays as much as possible?

arr_delayでみます。

flights %>% group_by(hour) %>%
  summarise(m = mean(arr_delay, na.rm = TRUE)) %>%
  arrange(m)

## # A tibble: 20 x 2
##     hour       m
##    <dbl>   <dbl>
##  1     7  -5.30 
##  2     5  -4.80 
##  3     6  -3.38 
##  4     9  -1.45 
##  5     8  -1.11 
##  6    10   0.954
##  7    11   1.48 
##  8    12   3.49 
##  9    13   6.54 
## 10    14   9.20 
## 11    23  11.8  
## 12    15  12.3  
## 13    16  12.6  
## 14    18  14.8  
## 15    22  16.0  
## 16    17  16.0  
## 17    19  16.7  
## 18    20  16.7  
## 19    21  18.4  
## 20     1 NaN

朝出る飛行機がよい?

4. For each destination, compute the total minutes of delay. For each flight, compute the proportion of the total delay for its destination.

flights %>% group_by(dest) %>%
  summarise(
    arr = sum(arr_delay, na.rm = TRUE),
    dep = sum(dep_delay, na.rm = TRUE)
    )

## # A tibble: 105 x 3
##    dest     arr    dep
##    <chr>  <dbl>  <dbl>
##  1 ABQ     1113   3490
##  2 ACK     1281   1711
##  3 ALB     6018   9897
##  4 ANC      -20    103
##  5 ATL   190260 211391
##  6 AUS    14514  31496
##  7 AVL     2089   2154
##  8 BDL     2904   7301
##  9 BGR     2874   7011
## 10 BHM     4540   8077
## # … with 95 more rows

flights %>% group_by(dest) %>%
  mutate(
    arr_delay_total = sum(arr_delay, na.rm = TRUE),
    dep_delay_total= sum(dep_delay, na.rm = TRUE),
    arr_delay_p= arr_delay / arr_delay_total,
    dep_delay_p= dep_delay / dep_delay_total) %>%
  select(starts_with(c("arr_delay")), starts_with("dep_delay"))

## Adding missing grouping variables: `dest`

## # A tibble: 336,776 x 7
## # Groups:   dest [105]
##    dest  arr_delay arr_delay_total arr_delay_p dep_delay dep_delay_total
##    <chr>     <dbl>           <dbl>       <dbl>     <dbl>           <dbl>
##  1 IAH          11           30046   0.000366          2           77012
##  2 IAH          20           30046   0.000666          4           77012
##  3 MIA          33            3467   0.00952           2          103261
##  4 BQN         -18            7322  -0.00246          -1           11032
##  5 ATL         -25          190260  -0.000131         -6          211391
##  6 ORD          12           97352   0.000123         -4          225840
##  7 FLL          19           96153   0.000198         -5          151933
##  8 IAD         -14           74631  -0.000188         -3           91555
##  9 MCO          -8           76185  -0.000105         -3          157661
## 10 ORD           8           97352   0.0000822        -2          225840
## # … with 336,766 more rows, and 1 more variable: dep_delay_p <dbl>

5. Delays are typically temporally correlated: even once the problem that caused the initial delay has been resolved, later flights are delayed to allow earlier flights to leave. Using lag(), explore how the delay of a flight is related to the delay of the immediately preceding flight.

一日ごとに、lagとの差で平均をだした。

flights %>% group_by(year, month, day) %>%
  mutate(lag_dif = dep_delay - lag(dep_delay)) %>%
  summarise(
    mean_lag_dif= mean(lag_dif, na.rm = TRUE),
    mean_delay= mean(dep_delay, na.rm = TRUE)) %>%
  ungroup() %>% mutate(d = row_number()) %>%
  gather(key = type, value = minutes, mean_lag_dif, mean_delay) %>%
  ggplot(aes(x = d)) +
  geom_point(aes(y = minutes, color = type))

6. Look at each destination. Can you find flights that are suspiciously fast? (i.e. flights that represent a potential data entry error). Compute the air time a flight relative to the shortest flight to that destination. Which flights were most delayed in the air?

flights %>% mutate(speed = distance / air_time * 60 * 1.609) %>%
  arrange(speed %>% desc) %>% select(speed, dest, distance, air_time) %>%
  arrange(desc(speed)) %>% head(10) %>% kable

speed	dest	distance	air_time
1131.7458	ATL	762	65
1046.3690	MSP	1008	93
1042.6320	GSP	594	55
1031.5989	BNA	748	70
951.6086	PBI	1035	105
907.4760	SJU	1598	170
896.9240	SJU	1598	172
895.3395	STT	1623	175
891.7394	SJU	1598	173
891.7394	SJU	1598	173

flights %>% group_by(dest) %>%
  mutate(r = min_rank(air_time)) %>%
  filter(r < 2) %>%
  select(air_time, dest, distance)

## # A tibble: 175 x 3
## # Groups:   dest [104]
##    air_time dest  distance
##       <dbl> <chr>    <dbl>
##  1       31 BWI        169
##  2      105 PBI       1035
##  3       38 PWM        273
##  4       20 BDL        116
##  5       38 PWM        269
##  6       38 PWM        269
##  7       35 ACK        199
##  8       35 ACK        199
##  9      102 MSN        812
## 10       92 MDW        711
## # … with 165 more rows

7. Find all destinations that are flown by at least two carriers. Use that information to rank the carriers.

flights %>% group_by(dest) %>%
  mutate(n = n_distinct(carrier)) %>%
  filter(n > 1)

## # A tibble: 325,397 x 20
## # Groups:   dest [76]
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # … with 325,387 more rows, and 13 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, n <int>

8. For each plane, count the number of flights before the first delay of greater than 1 hour.

flights %>%
  filter(!is.na(tailnum)) %>%
  group_by(tailnum) %>%
  transmute(
    i = row_number(),
    j = ifelse(arr_delay > 60, i, NA)) %>%
  summarise(
    n = sum(i < min(j, na.rm = TRUE), na.rm = TRUE) ) %>% arrange(desc(n))

## # A tibble: 4,043 x 2
##    tailnum     n
##    <chr>   <int>
##  1 N705TW    160
##  2 N961UW    139
##  3 N706TW    137
##  4 N346NB    128
##  5 N713TW    128
##  6 N315NB    125
##  7 N744P     116
##  8 N945UW    104
##  9 N75436    102
## 10 N3769L     99
## # … with 4,033 more rows