between(). What does it do? Can you use it to simplify the code needed to answer the previous challenges?NA ^ 0 not missing? Why is NA | TRUE not missing? Why is FALSE & NA not missing? Can you figure out the general rule? (NA * 0 is a tricky counterexample!)one_of() function do? Why might it be helpful in conjunction with this vector?dep_time and sched_dep_time are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight.air_time with arr_time - dep_time. What do you expect to see? What do you see? What do you need to do to fix it?dep_time, sched_dep_time, and dep_delay. How would you expect those three numbers to be related?min_rank().1:3 + 1:10 return? Why?not_cancelled %>% count(dest) and not_cancelled %>% count(tailnum, wt = distance) (without using count()).(is.na(dep_delay) | is.na(arr_delay) ) is slightly suboptimal. Why? Which is the most important column?flights %>% group_by(carrier, dest) %>% summarise(n()))sort argument to count() do. When might you use it?tailnum) has the worst on-time record?Had an arrival delay of two or more hours
flights %>% filter(arr_delay >= 120)Flew to Houston (IAH or HOU)
flights %>% filter(dest %in% c("IAH", "HOU"))Were operated by United, American, or Delta
flights %>% left_join(airlines) %>%
filter(str_detect(name, "United|American|Delta"))Departed in summer (July, August, and September)
flights %>% filter(between(month, 7, 9))Arrived more than two hours late, but didn’t leave late
flights %>% filter(arr_delay > 120, dep_delay <= 0)Were delayed by at least an hour, but made up over 30 minutes in flight
flights %>% filter(dep_delay >= 60, arr_delay < dep_delay - 30)Departed between midnight and 6am (inclusive)
flights %>% mutate(dep_hour = dep_time %/% 100) %>%
filter(dep_time == 2400 | dep_hour %in% c(0:5) | dep_time == 600)between(). What does it do? Can you use it to simplify the code needed to answer the previous challenges?Departed between midnight and 6am (inclusive)
flights %>% filter(dep_time == 2400 | dep_time %>% between(0, 600))na_names <- flights %>% map_dfr(. %>% is.na %>% sum) %>% select_if(~ . != 0)
kable(na_names)
| dep_time | dep_delay | arr_time | arr_delay | tailnum | air_time |
|---|---|---|---|---|---|
| 8255 | 8255 | 8713 | 9430 | 2512 | 9430 |
dep_timeとarr_timeがセットでNAのデータは、フライトが中止したと考えられる。arr_timeだけNAのデータは墜落? 多すぎる。arr_delayとair_timeがNAだけNAのデータは謎。NA ^ 0 not missing? Why is NA | TRUE not missing? Why is FALSE & NA not missing? Can you figure out the general rule? (NA * 0 is a tricky counterexample!)全ての入力に対して値が1つに定まる式の場合、NAに対してもその値を返す。
?'^'
‘1 ^ y’ and ‘y ^ 0’ are ‘1’, always.
?'&'
‘NA’ is a valid logical object. Where a component of ‘x’ or ‘y’ is ‘NA’, the result will be ‘NA’ if the outcome is ambiguous. In other words ‘NA & TRUE’ evaluates to ‘NA’, but ‘NA & FALSE’ evaluates to ‘FALSE’. See the examples below.
x * 0は全てのxに対して0を返すとは限らない。
1 * 0
Inf * 0
NaN * 0
arrange() to sort all missing values to the start? (Hint: use is.na()).df <- tibble(x = c(5, 2, NA))
df %>% arrange(is.na(x) %>% desc)
df %>% arrange(!is.na(x))
flights. Find the flights that left earliest.flights %>% arrange(desc(dep_delay))
flights %>% arrange(dep_delay)
flights %>% arrange(air_time)
flights %>% arrange(distance)
flights %>% arrange(desc(distance))
flights %>% select(starts_with("dep_"), starts_with("arr_"))
flights %>% select(ends_with("_time"), ends_with("_delay"), -matches("^(sched_|air_)"))
flights %>% select(matches("^(arr|dep)"))
flights %>% select(dep_time, dep_time, arr_time, dep_time) %>% head(1) %>% kable
| dep_time | arr_time |
|---|---|
| 517 | 830 |
one_of() function do? Why might it be helpful in conjunction with this vector?vars <- c("year", "month", "day", "dep_delay", "arr_delay")
flights %>% select(one_of(vars))
## # A tibble: 336,776 x 5
## year month day dep_delay arr_delay
## <int> <int> <int> <dbl> <dbl>
## 1 2013 1 1 2 11
## 2 2013 1 1 4 20
## 3 2013 1 1 2 33
## 4 2013 1 1 -1 -18
## 5 2013 1 1 -6 -25
## 6 2013 1 1 -4 12
## 7 2013 1 1 -5 19
## 8 2013 1 1 -3 -14
## 9 2013 1 1 -3 -8
## 10 2013 1 1 -2 8
## # … with 336,766 more rows
flights %>% select(vars)
## # A tibble: 336,776 x 5
## year month day dep_delay arr_delay
## <int> <int> <int> <dbl> <dbl>
## 1 2013 1 1 2 11
## 2 2013 1 1 4 20
## 3 2013 1 1 2 33
## 4 2013 1 1 -1 -18
## 5 2013 1 1 -6 -25
## 6 2013 1 1 -4 12
## 7 2013 1 1 -5 19
## 8 2013 1 1 -3 -14
## 9 2013 1 1 -3 -8
## 10 2013 1 1 -2 8
## # … with 336,766 more rows
flights %>% mutate(vars = 1) %>% select(vars)
## # A tibble: 336,776 x 1
## vars
## <dbl>
## 1 1
## 2 1
## 3 1
## 4 1
## 5 1
## 6 1
## 7 1
## 8 1
## 9 1
## 10 1
## # … with 336,766 more rows
flights %>% mutate(vars = 1) %>% select(!!vars)
## # A tibble: 336,776 x 5
## year month day dep_delay arr_delay
## <int> <int> <int> <dbl> <dbl>
## 1 2013 1 1 2 11
## 2 2013 1 1 4 20
## 3 2013 1 1 2 33
## 4 2013 1 1 -1 -18
## 5 2013 1 1 -6 -25
## 6 2013 1 1 -4 12
## 7 2013 1 1 -5 19
## 8 2013 1 1 -3 -14
## 9 2013 1 1 -3 -8
## 10 2013 1 1 -2 8
## # … with 336,766 more rows
select(flights, contains("TIME"))
select(flights, contains("TIME", ignore.case = FALSE))
dep_time and sched_dep_time are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight.time2min <- . %>% {. %/% 100 * 60 + . %% 100}
flights2 <- flights %>% mutate_at(vars(ends_with("_time"), -air_time), time2min)
air_time with arr_time - dep_time. What do you expect to see? What do you see? What do you need to do to fix it?arr_timeとdep_timeはHHMMのフォーマットになっており、引き算をしても分にならない。 上の問題と同じ変換をかけてから引き算をしてみる。
flights2 %>% transmute(air_time, air_time2 = arr_time - dep_time) %>%
count(air_time == air_time2) %>% kable
| air_time == air_time2 | n |
|---|---|
| FALSE | 327150 |
| TRUE | 196 |
| NA | 9430 |
ほとんどFALSEになってしまった。タイムゾーンの問題と思われる。
min_rank().flights %>% transmute(dep_delay, rank = min_rank(desc(dep_delay) )) %>%
filter(rank < 10) %>% arrange(rank) %>% kable
| dep_delay | rank |
|---|---|
| 1301 | 1 |
| 1137 | 2 |
| 1126 | 3 |
| 1014 | 4 |
| 1005 | 5 |
| 960 | 6 |
| 911 | 7 |
| 899 | 8 |
| 898 | 9 |
x <- c(5, 1, 3, 2, 2, NA)
row_number(x)
## [1] 5 1 4 2 3 NA
min_rank(x)
## [1] 5 1 4 2 2 NA
dense_rank(x)
## [1] 4 1 3 2 2 NA
1:3 + 1:10 return? Why?1:3 + 1:10
## Warning in 1:3 + 1:10: 長いオブジェクトの長さが短いオブジェクトの長さの倍数
## になっていません
## [1] 2 4 6 5 7 9 8 10 12 11
“recycling rules”が適用される。
If one parameter is shorter than the other, it will be automatically extended to be the same length.
c(1,2,3,1,2,3,1,2,3,1) + 1:10
## [1] 2 4 6 5 7 9 8 10 12 11
?sinででてくる。
* A flight is 15 minutes early 50% of the time, and 15 minutes late 50% of the time.
* A flight is always 10 minutes late.
* A flight is 30 minutes early 50% of the time, and 30 minutes late 50% of the time.
* 99% of the time a flight is on time. 1% of the time it’s 2 hours late.
わからん。
not_cancelled %>% count(dest) and not_cancelled %>% count(tailnum, wt = distance) (without using count()).not_cancelled %>% group_by(dest) %>% summarise(n = n())
not_cancelled %>% group_by(tailnum) %>% summarise(n = sum(distance))
(is.na(dep_delay) | is.na(arr_delay) ) is slightly suboptimal. Why? Which is the most important column?flights %>% filter(!is.na(air_time))
flights %>% group_by(carrier, dest) %>% summarise(n()))## 各ルート,キャリアごとに平均遅延を計算し
## 各ルートでの平均からの残差を求めた
flights %>% mutate(route = str_c(origin, dest, sep = "->")) %>%
group_by(route, carrier) %>%
summarise(n = n(), delay = mean(arr_delay, na.rm = TRUE)) %>%
filter(n > 20) %>%
mutate(route_delay = mean(delay, na.rm = TRUE),
residual = scale(delay - route_delay)) %>%
ungroup() %>%
group_by(carrier) %>%
summarise(mean_residual = mean(residual, na.rm = TRUE)) %>%
arrange(mean_residual %>% desc) %>%
kable()
| carrier | mean_residual |
|---|---|
| F9 | 1.4965288 |
| YV | 0.5793190 |
| B6 | 0.5670779 |
| FL | 0.5303111 |
| MQ | 0.3710127 |
| EV | 0.2608272 |
| UA | 0.0472507 |
| WN | 0.0032268 |
| 9E | -0.1796967 |
| AA | -0.2340007 |
| VX | -0.2368222 |
| DL | -0.5672194 |
| AS | -0.7071068 |
| OO | -0.7445727 |
| US | -0.7937979 |
| HA | NaN |
sort argument to count() do. When might you use it?sort: if ‘TRUE’ will sort output in descending order of ‘n’
いろいろある
tailnum) has the worst on-time record?離陸、着陸時間のずれを見る
resid <- function(x, n) {
sqrt(sum(x^2, na.rm = TRUE)) / n
}
flights_tail <- flights %>% filter(!is.na(tailnum)) %>% group_by(tailnum)
flights_tail %>%
summarise(
dep_resid = resid(dep_delay, n()),
arr_resid = resid(arr_delay, n()),
n = n()
) %>%
mutate(
resid = dep_resid + arr_resid,
r = percent_rank(desc(n))) %>%
arrange(resid %>% desc) %>%
filter(r < 0.9) %>% head(10) %>%
kable()
| tailnum | dep_resid | arr_resid | n | resid | r |
|---|---|---|---|---|---|
| N665MQ | 167.66170 | 165.18878 | 6 | 332.85048 | 0.8748144 |
| N276AT | 79.30759 | 77.00866 | 6 | 156.31625 | 0.8748144 |
| N550NW | 58.61114 | 55.26892 | 7 | 113.88005 | 0.8631865 |
| N370SW | 49.36148 | 50.91114 | 6 | 100.27262 | 0.8748144 |
| N652SW | 48.81797 | 50.34823 | 6 | 99.16620 | 0.8748144 |
| N387SW | 48.07120 | 49.92554 | 5 | 97.99674 | 0.8879268 |
| N521US | 45.34809 | 49.44261 | 7 | 94.79069 | 0.8631865 |
| N919FJ | 42.26504 | 44.60475 | 6 | 86.86978 | 0.8748144 |
| N917DN | 37.19785 | 47.38228 | 5 | 84.58012 | 0.8879268 |
| N609SW | 39.91190 | 44.60762 | 5 | 84.51953 | 0.8879268 |
arr_delayでみます。
flights %>% group_by(hour) %>%
summarise(m = mean(arr_delay, na.rm = TRUE)) %>%
arrange(m)
## # A tibble: 20 x 2
## hour m
## <dbl> <dbl>
## 1 7 -5.30
## 2 5 -4.80
## 3 6 -3.38
## 4 9 -1.45
## 5 8 -1.11
## 6 10 0.954
## 7 11 1.48
## 8 12 3.49
## 9 13 6.54
## 10 14 9.20
## 11 23 11.8
## 12 15 12.3
## 13 16 12.6
## 14 18 14.8
## 15 22 16.0
## 16 17 16.0
## 17 19 16.7
## 18 20 16.7
## 19 21 18.4
## 20 1 NaN
朝出る飛行機がよい?
flights %>% group_by(dest) %>%
summarise(
arr = sum(arr_delay, na.rm = TRUE),
dep = sum(dep_delay, na.rm = TRUE)
)
## # A tibble: 105 x 3
## dest arr dep
## <chr> <dbl> <dbl>
## 1 ABQ 1113 3490
## 2 ACK 1281 1711
## 3 ALB 6018 9897
## 4 ANC -20 103
## 5 ATL 190260 211391
## 6 AUS 14514 31496
## 7 AVL 2089 2154
## 8 BDL 2904 7301
## 9 BGR 2874 7011
## 10 BHM 4540 8077
## # … with 95 more rows
flights %>% group_by(dest) %>%
mutate(
arr_delay_total = sum(arr_delay, na.rm = TRUE),
dep_delay_total= sum(dep_delay, na.rm = TRUE),
arr_delay_p= arr_delay / arr_delay_total,
dep_delay_p= dep_delay / dep_delay_total) %>%
select(starts_with(c("arr_delay")), starts_with("dep_delay"))
## Adding missing grouping variables: `dest`
## # A tibble: 336,776 x 7
## # Groups: dest [105]
## dest arr_delay arr_delay_total arr_delay_p dep_delay dep_delay_total
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 IAH 11 30046 0.000366 2 77012
## 2 IAH 20 30046 0.000666 4 77012
## 3 MIA 33 3467 0.00952 2 103261
## 4 BQN -18 7322 -0.00246 -1 11032
## 5 ATL -25 190260 -0.000131 -6 211391
## 6 ORD 12 97352 0.000123 -4 225840
## 7 FLL 19 96153 0.000198 -5 151933
## 8 IAD -14 74631 -0.000188 -3 91555
## 9 MCO -8 76185 -0.000105 -3 157661
## 10 ORD 8 97352 0.0000822 -2 225840
## # … with 336,766 more rows, and 1 more variable: dep_delay_p <dbl>
flights %>% mutate(speed = distance / air_time * 60 * 1.609) %>%
arrange(speed %>% desc) %>% select(speed, dest, distance, air_time) %>%
arrange(desc(speed)) %>% head(10) %>% kable
| speed | dest | distance | air_time |
|---|---|---|---|
| 1131.7458 | ATL | 762 | 65 |
| 1046.3690 | MSP | 1008 | 93 |
| 1042.6320 | GSP | 594 | 55 |
| 1031.5989 | BNA | 748 | 70 |
| 951.6086 | PBI | 1035 | 105 |
| 907.4760 | SJU | 1598 | 170 |
| 896.9240 | SJU | 1598 | 172 |
| 895.3395 | STT | 1623 | 175 |
| 891.7394 | SJU | 1598 | 173 |
| 891.7394 | SJU | 1598 | 173 |
flights %>% group_by(dest) %>%
mutate(r = min_rank(air_time)) %>%
filter(r < 2) %>%
select(air_time, dest, distance)
## # A tibble: 175 x 3
## # Groups: dest [104]
## air_time dest distance
## <dbl> <chr> <dbl>
## 1 31 BWI 169
## 2 105 PBI 1035
## 3 38 PWM 273
## 4 20 BDL 116
## 5 38 PWM 269
## 6 38 PWM 269
## 7 35 ACK 199
## 8 35 ACK 199
## 9 102 MSN 812
## 10 92 MDW 711
## # … with 165 more rows
flights %>% group_by(dest) %>%
mutate(n = n_distinct(carrier)) %>%
filter(n > 1)
## # A tibble: 325,397 x 20
## # Groups: dest [76]
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## 7 2013 1 1 555 600 -5 913
## 8 2013 1 1 557 600 -3 709
## 9 2013 1 1 557 600 -3 838
## 10 2013 1 1 558 600 -2 753
## # … with 325,387 more rows, and 13 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, n <int>
flights %>%
filter(!is.na(tailnum)) %>%
group_by(tailnum) %>%
transmute(
i = row_number(),
j = ifelse(arr_delay > 60, i, NA)) %>%
summarise(
n = sum(i < min(j, na.rm = TRUE), na.rm = TRUE) ) %>% arrange(desc(n))
## # A tibble: 4,043 x 2
## tailnum n
## <chr> <int>
## 1 N705TW 160
## 2 N961UW 139
## 3 N706TW 137
## 4 N346NB 128
## 5 N713TW 128
## 6 N315NB 125
## 7 N744P 116
## 8 N945UW 104
## 9 N75436 102
## 10 N3769L 99
## # … with 4,033 more rows