between(). What does it do? Can you use it to simplify the code needed to answer the previous challenges?
NA ^ 0 not missing? Why is NA | TRUE not missing? Why is FALSE & NA not missing? Can you figure out the general rule? (NA * 0 is a tricky counterexample!)
one_of() function do? Why might it be helpful in conjunction with this vector?
dep_time and sched_dep_time are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight.
air_time with arr_time - dep_time. What do you expect to see? What do you see? What do you need to do to fix it?
dep_time, sched_dep_time, and dep_delay. How would you expect those three numbers to be related?
min_rank().
1:3 + 1:10 return? Why?
not_cancelled %>% count(dest) and not_cancelled %>% count(tailnum, wt = distance) (without using count()).
(is.na(dep_delay) | is.na(arr_delay) ) is slightly suboptimal. Why? Which is the most important column?
flights %>% group_by(carrier, dest) %>% summarise(n()))
sort argument to count() do. When might you use it?
Had an arrival delay of two or more hours
flights %>% filter(arr_delay >= 120)
Flew to Houston (IAH or HOU)
flights %>% filter(dest %in% c("IAH", "HOU"))
Were operated by United, American, or Delta
flights %>% left_join(airlines) %>%
filter(str_detect(name, "United|American|Delta"))
Departed in summer (July, August, and September)
flights %>% filter(between(month, 7, 9))
Arrived more than two hours late, but didn’t leave late
flights %>% filter(arr_delay > 120, dep_delay <= 0)
Were delayed by at least an hour, but made up over 30 minutes in flight
flights %>% filter(dep_delay >= 60, arr_delay < dep_delay - 30)
Departed between midnight and 6am (inclusive)
flights %>% mutate(dep_hour = dep_time %/% 100) %>%
filter(dep_time == 2400 | dep_hour %in% c(0:5) | dep_time == 600)
between(). What does it do? Can you use it to simplify the code needed to answer the previous challenges?
Departed between midnight and 6am (inclusive)
flights %>% filter(dep_time == 2400 | dep_time %>% between(0, 600))
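As a rough check of what between() does, the sketch below compares it with the explicit comparison on a small made-up vector of HHMM times. The values are illustrative and not taken from flights, and the snippet assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical HHMM departure times, including a missing value
dep_time <- c(1, 559, 600, 601, 2400, NA)

# between(x, left, right) is shorthand for x >= left & x <= right
via_between <- between(dep_time, 0, 600)
via_compare <- dep_time >= 0 & dep_time <= 600

# Both bounds are inclusive, and NA stays NA in either form
identical(via_between, via_compare)
```

Because 2400 falls outside the 0 to 600 range, midnight still needs its own condition, as in the filter above.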
na_names <- flights %>% map_dfr(. %>% is.na %>% sum) %>% select_if(~ . != 0)
kable(na_names)
dep_time | dep_delay | arr_time | arr_delay | tailnum | air_time |
---|---|---|---|---|---|
8255 | 8255 | 8713 | 9430 | 2512 | 9430 |
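Rows where dep_time and arr_time are both NA are presumably cancelled flights.
Rows where only arr_time is NA: crashes? There are far too many for that.
Rows where only arr_delay and air_time are NA are a mystery.
NA ^ 0 not missing? Why is NA | TRUE not missing? Why is FALSE & NA not missing? Can you figure out the general rule? (NA * 0 is a tricky counterexample!)
When an expression evaluates to the same single value for every possible input, it returns that value for NA as well.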
?'^'
‘1 ^ y’ and ‘y ^ 0’ are ‘1’, always.
?'&'
‘NA’ is a valid logical object. Where a component of ‘x’ or ‘y’ is ‘NA’, the result will be ‘NA’ if the outcome is ambiguous. In other words ‘NA & TRUE’ evaluates to ‘NA’, but ‘NA & FALSE’ evaluates to ‘FALSE’. See the examples below.
x * 0 does not necessarily return 0 for every x.
1 * 0
Inf * 0
NaN * 0
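The rule can be checked directly in base R. This is only a small sketch of the cases discussed here; the comments give the values R prints.

```r
# Results that do not depend on the unknown value are not missing
NA ^ 0        # 1: anything to the power 0 is 1
NA | TRUE     # TRUE: "or" with TRUE is always TRUE
FALSE & NA    # FALSE: "and" with FALSE is always FALSE

# x * 0 is not 0 for every numeric x, so NA * 0 stays missing
c(1, Inf, NaN) * 0   # 0 NaN NaN
NA * 0               # NA
```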
arrange() to sort all missing values to the start? (Hint: use is.na()).
df <- tibble(x = c(5, 2, NA))
df %>% arrange(is.na(x) %>% desc)
df %>% arrange(!is.na(x))
flights. Find the flights that left earliest.
flights %>% arrange(desc(dep_delay))
flights %>% arrange(dep_delay)
flights %>% arrange(air_time)
flights %>% arrange(distance)
flights %>% arrange(desc(distance))
flights %>% select(starts_with("dep_"), starts_with("arr_"))
flights %>% select(ends_with("_time"), ends_with("_delay"), -matches("^(sched_|air_)"))
flights %>% select(matches("^(arr|dep)"))
flights %>% select(dep_time, dep_time, arr_time, dep_time) %>% head(1) %>% kable
dep_time | arr_time |
---|---|
517 | 830 |
one_of() function do? Why might it be helpful in conjunction with this vector?
vars <- c("year", "month", "day", "dep_delay", "arr_delay")
flights %>% select(one_of(vars))
## # A tibble: 336,776 x 5
## year month day dep_delay arr_delay
## <int> <int> <int> <dbl> <dbl>
## 1 2013 1 1 2 11
## 2 2013 1 1 4 20
## 3 2013 1 1 2 33
## 4 2013 1 1 -1 -18
## 5 2013 1 1 -6 -25
## 6 2013 1 1 -4 12
## 7 2013 1 1 -5 19
## 8 2013 1 1 -3 -14
## 9 2013 1 1 -3 -8
## 10 2013 1 1 -2 8
## # … with 336,766 more rows
flights %>% select(vars)
## # A tibble: 336,776 x 5
## year month day dep_delay arr_delay
## <int> <int> <int> <dbl> <dbl>
## 1 2013 1 1 2 11
## 2 2013 1 1 4 20
## 3 2013 1 1 2 33
## 4 2013 1 1 -1 -18
## 5 2013 1 1 -6 -25
## 6 2013 1 1 -4 12
## 7 2013 1 1 -5 19
## 8 2013 1 1 -3 -14
## 9 2013 1 1 -3 -8
## 10 2013 1 1 -2 8
## # … with 336,766 more rows
flights %>% mutate(vars = 1) %>% select(vars)
## # A tibble: 336,776 x 1
## vars
## <dbl>
## 1 1
## 2 1
## 3 1
## 4 1
## 5 1
## 6 1
## 7 1
## 8 1
## 9 1
## 10 1
## # … with 336,766 more rows
flights %>% mutate(vars = 1) %>% select(!!vars)
## # A tibble: 336,776 x 5
## year month day dep_delay arr_delay
## <int> <int> <int> <dbl> <dbl>
## 1 2013 1 1 2 11
## 2 2013 1 1 4 20
## 3 2013 1 1 2 33
## 4 2013 1 1 -1 -18
## 5 2013 1 1 -6 -25
## 6 2013 1 1 -4 12
## 7 2013 1 1 -5 19
## 8 2013 1 1 -3 -14
## 9 2013 1 1 -3 -8
## 10 2013 1 1 -2 8
## # … with 336,766 more rows
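In more recent tidyselect releases (1.0.0 and later, as far as I know), all_of() and any_of() are another way to make it explicit that vars is an external character vector and not a column. The sketch below uses a tiny made-up tibble instead of flights, so the values are illustrative only.

```r
library(dplyr)

# Small stand-in for flights (hypothetical values); note the column named vars
df <- tibble(year = 2013, month = 1, dep_delay = 2, vars = 1)
vars <- c("year", "month", "dep_delay")

# all_of() always treats vars as the external character vector,
# even though df also has a column called vars
picked <- df %>% select(all_of(vars))
names(picked)

# any_of() is the lenient variant: names that do not exist are skipped
lenient <- df %>% select(any_of(c(vars, "arr_delay")))
names(lenient)
```

all_of() raises an error when a requested column is missing, so any_of() is the closer match to the forgiving behaviour of one_of().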
select(flights, contains("TIME"))
select(flights, contains("TIME", ignore.case = FALSE))
dep_time and sched_dep_time are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight.
time2min <- . %>% {. %/% 100 * 60 + . %% 100}
flights2 <- flights %>% mutate_at(vars(ends_with("_time"), -air_time), time2min)
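One detail the conversion leaves open is midnight, which the data records as 2400 and which therefore becomes 1440 minutes. A possible variant, written as a plain function with the hypothetical name hhmm_to_min, wraps that case back to 0. Whether 0 or 1440 is the better value depends on the analysis, so treat this as a sketch.

```r
# Convert HHMM clock times to minutes since midnight.
# The extra %% 1440 maps 2400 (midnight) to 0 rather than 1440.
hhmm_to_min <- function(x) (x %/% 100 * 60 + x %% 100) %% 1440

hhmm_to_min(c(517, 830, 2400, NA))
# 317 510 0 NA
```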
air_time with arr_time - dep_time. What do you expect to see? What do you see? What do you need to do to fix it?
arr_time and dep_time are in HHMM format, so subtracting them does not give minutes. Apply the same conversion as in the previous exercise, then subtract.
flights2 %>% transmute(air_time, air_time2 = arr_time - dep_time) %>%
count(air_time == air_time2) %>% kable
air_time == air_time2 | n |
---|---|
FALSE | 327150 |
TRUE | 196 |
NA | 9430 |
Almost every row came out FALSE. This is presumably a time zone issue.
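One part of the mismatch that can be examined without any time zone data is flights that cross midnight, where the raw difference goes negative. The numbers below describe a made-up flight, not a row from flights, and the modulo step only repairs the midnight wrap. It does nothing about origin and destination clocks being in different time zones.

```r
# Hypothetical flight: departs 22:30, arrives 01:15 (minutes since midnight)
dep_min <- 22 * 60 + 30   # 1350
arr_min <- 1 * 60 + 15    # 75

# Negative because the flight crosses midnight
raw_diff <- arr_min - dep_min      # -1275

# Elapsed minutes, assuming both clocks are in the same time zone
elapsed <- raw_diff %% 1440        # 165
```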
min_rank().
flights %>% transmute(dep_delay, rank = min_rank(desc(dep_delay))) %>%
filter(rank < 10) %>% arrange(rank) %>% kable
dep_delay | rank |
---|---|
1301 | 1 |
1137 | 2 |
1126 | 3 |
1014 | 4 |
1005 | 5 |
960 | 6 |
911 | 7 |
899 | 8 |
898 | 9 |
x <- c(5, 1, 3, 2, 2, NA)
row_number(x)
## [1] 5 1 4 2 3 NA
min_rank(x)
## [1] 5 1 4 2 2 NA
dense_rank(x)
## [1] 4 1 3 2 2 NA
1:3 + 1:10 return? Why?
1:3 + 1:10
## Warning in 1:3 + 1:10: longer object length is not a multiple of shorter
## object length
## [1] 2 4 6 5 7 9 8 10 12 11
The “recycling rules” apply.
If one parameter is shorter than the other, it will be automatically extended to be the same length.
c(1,2,3,1,2,3,1,2,3,1) + 1:10
## [1] 2 4 6 5 7 9 8 10 12 11
They are listed under ?sin.
* A flight is 15 minutes early 50% of the time, and 15 minutes late 50% of the time.
* A flight is always 10 minutes late.
* A flight is 30 minutes early 50% of the time, and 30 minutes late 50% of the time.
* 99% of the time a flight is on time. 1% of the time it’s 2 hours late.
I don't know.
not_cancelled %>% count(dest) and not_cancelled %>% count(tailnum, wt = distance) (without using count()).
not_cancelled %>% group_by(dest) %>% summarise(n = n())
not_cancelled %>% group_by(tailnum) %>% summarise(n = sum(distance))
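To check that the weighted count and the group_by() plus summarise() form agree, the sketch below runs both on a tiny made-up table. The tibble toy and its values are hypothetical stand-ins for not_cancelled.

```r
library(dplyr)

# Hypothetical stand-in for not_cancelled
toy <- tibble(
  tailnum  = c("N1", "N1", "N2"),
  dest     = c("IAH", "HOU", "IAH"),
  distance = c(1400, 1411, 1400)
)

# count() with wt sums the weight column within each group
by_count   <- toy %>% count(tailnum, wt = distance)
by_summary <- toy %>% group_by(tailnum) %>% summarise(n = sum(distance))

# Same groups and same totals in both results
identical(by_count$tailnum, by_summary$tailnum)
all.equal(by_count$n, by_summary$n)
```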
(is.na(dep_delay) | is.na(arr_delay) ) is slightly suboptimal. Why? Which is the most important column?
flights %>% filter(!is.na(air_time))
flights %>% group_by(carrier, dest) %>% summarise(n()))
carrier_delay <- flights %>% group_by(carrier) %>%
summarise(n = n(), delay = mean(arr_delay, na.rm = TRUE))
## Compute the mean delay for each route and carrier,
## then take the residual from each route's mean
flights %>% mutate(route = str_c(origin, dest, sep = "->")) %>%
group_by(route, carrier) %>%
summarise(n = n(), delay = mean(arr_delay, na.rm = TRUE)) %>%
filter(n > 20) %>%
mutate(route_delay = mean(delay, na.rm = TRUE),
residual = scale(delay - route_delay)) %>%
ungroup() %>%
group_by(carrier) %>%
summarise(mean_residual = mean(residual, na.rm = TRUE)) %>%
left_join(carrier_delay) %>% arrange(mean_residual %>% desc) %>%
kable()
## Joining, by = "carrier"
carrier | mean_residual | n | delay |
---|---|---|---|
F9 | 1.4965288 | 685 | 21.9207048 |
YV | 0.5793190 | 601 | 15.5569853 |
B6 | 0.5670779 | 54635 | 9.4579733 |
FL | 0.5303111 | 3260 | 20.1159055 |
MQ | 0.3710127 | 26397 | 10.7747334 |
EV | 0.2608272 | 54173 | 15.7964311 |
UA | 0.0472507 | 58665 | 3.5580111 |
WN | 0.0032268 | 12275 | 9.6491199 |
9E | -0.1796967 | 18460 | 7.3796692 |
AA | -0.2340007 | 32729 | 0.3642909 |
VX | -0.2368222 | 5162 | 1.7644644 |
DL | -0.5672194 | 48110 | 1.6443409 |
AS | -0.7071068 | 714 | -9.9308886 |
OO | -0.7445727 | 32 | 11.9310345 |
US | -0.7937979 | 20536 | 2.1295951 |
HA | NaN | 342 | -6.9152047 |
sort argument to count() do. When might you use it?
sort: if ‘TRUE’ will sort output in descending order of ‘n’
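A small illustration of the difference, using a made-up destination column in place of flights. It is handy when the most frequent groups are the ones of interest, since they end up at the top without a separate arrange() call.

```r
library(dplyr)

# Hypothetical destinations
toy <- tibble(dest = c("IAH", "HOU", "IAH", "ORD", "IAH", "ORD"))

# Default: rows ordered by the grouping column
unsorted <- toy %>% count(dest)

# sort = TRUE: rows ordered by n, largest first
sorted <- toy %>% count(dest, sort = TRUE)
sorted$dest   # "IAH" "ORD" "HOU"
```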