library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(nycflights13)
Find all flights that
1. Had an arrival delay of two or more hours
flights %>% filter(arr_delay >= 120) %>% head()
2. Flew to Houston (`IAH` or `HOU`)
flights %>% filter(dest == 'IAH' | dest == 'HOU') %>% head()
3. Were operated by United, American, or Delta
flights %>% filter(carrier %in% c('UA', 'AA', 'DL')) %>% head()
4. Departed in summer (July, August, and September)
flights %>% filter(month %in% c(7,8,9)) %>% head()
5. Arrived more than two hours late, but didn't leave late
flights %>% filter(arr_delay > 120 & dep_delay <= 0) %>% head()
6. Were delayed by at least an hour, but made up over 30 minutes in flight
flights %>% filter(dep_delay >= 60 & (dep_delay - arr_delay) > 30) %>% head()
7. Departed between midnight and 6am (inclusive)
flights %>% filter(dep_time <= 600 | dep_time == 2400) %>% head()
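Another useful dplyr filtering helper is between().
What does it do? Can you use it to simplify the code needed to answer
the previous challenges?
between(x, left, right) is a shortcut for
x >= left & x <= right. It can replace range checks such as the summer-months filter.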
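To check that equivalence on concrete data, here is a minimal sketch. It uses a small hypothetical tibble named `toy` (not part of nycflights13) so that it runs without the full dataset.

```r
library(dplyr)

# Hypothetical toy data standing in for flights; only `month` matters here.
toy <- tibble(month = c(6, 7, 8, 9, 10))

# between(x, left, right) is equivalent to x >= left & x <= right.
with_between <- toy %>% filter(between(month, 7, 9))
with_compare <- toy %>% filter(month >= 7 & month <= 9)

# Both approaches keep the same rows (months 7, 8, and 9).
identical(with_between, with_compare)
```

The same pattern should apply to `flights`, for example `filter(between(month, 7, 9))` for the summer departures.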
How many flights have a missing dep_time? What other
variables are missing? What might these rows represent?
flights %>% filter(is.na(dep_time)) %>% head()
dep_delay, arr_time, arr_delay, and air_time are also missing. These were probably cancelled
flights.
Why is NA ^ 0 not missing? Why is
NA | TRUE not missing? Why is FALSE & NA
not missing? Can you figure out the general rule? (NA * 0
is a tricky counterexample!)
For any number n, n ^ 0 is 1. For any value
x, x | TRUE is TRUE and
FALSE & x is FALSE. The general rule is: if the value
of an expression is the same regardless of the missing value,
then the result is not missing even when that value is NA.
NA * 0 looks like it should follow this rule, but the missing value could be infinite, and Inf * 0
is NaN rather than 0, so NA * 0 stays NA.
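The rule can be checked directly at the console. This is a small base R sketch; the comments state what each expression evaluates to.

```r
# Results that do not depend on the unknown value are not missing.
NA ^ 0        # 1: anything to the power 0 is 1
NA | TRUE     # TRUE: `x | TRUE` is TRUE for any x
FALSE & NA    # FALSE: `FALSE & x` is FALSE for any x

# NA * 0 stays NA because the unknown value could be infinite,
# and Inf * 0 is NaN rather than 0.
NA * 0        # NA
Inf * 0       # NaN
```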
How could you use arrange() to sort all missing values
to the start? (Hint: use is.na()).
flights %>% arrange(desc(is.na(dep_time))) %>% head()
Sort flights to find the most delayed flights. Find the
flights that left earliest.
flights %>% arrange(desc(dep_delay)) %>% head()
flights %>% arrange(dep_delay) %>% head()
Sort flights to find the fastest (highest speed)
flights.
flights %>% arrange(desc(distance / air_time)) %>% head()
Which flights travelled the farthest? Which travelled the shortest?
flights %>% arrange(desc(distance)) %>% head()
flights %>% arrange(distance) %>% head()
Brainstorm as many ways as possible to select dep_time,
dep_delay, arr_time, and
arr_delay from flights.
flights %>% select(dep_time, dep_delay, arr_time, arr_delay) %>% head()
flights %>% select(starts_with("dep") | starts_with("arr")) %>% head()
select(flights, 4, 6, 7, 9) %>% head()
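What happens if you include the name of a variable multiple times in a
select() call?
flights %>% select(year, year, year) %>% head()
The column is included only once; select() does not duplicate columns.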
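What does the any_of() function do? Why might it be
helpful in conjunction with this vector?
vars <- c("year", "month", "day", "dep_delay", "arr_delay")
any_of() selects every column whose name appears in a character vector and silently skips names that are not in the data. That makes it useful with a vector like vars, which may list columns that a given table lacks.
flights %>% select(any_of(vars)) %>% head()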
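The difference only shows up when some of the names are absent, which never happens with `flights`. As a sketch, the hypothetical tibble `toy` below lacks two of the columns in `vars`, and `all_of()` is included for contrast.

```r
library(dplyr)

# Hypothetical toy data with only some of the requested columns.
toy <- tibble(year = 2013, month = 1, dep_delay = 5)

# "day" and "arr_delay" are not columns of `toy`.
vars <- c("year", "month", "day", "dep_delay", "arr_delay")

# any_of() keeps the columns that exist and skips the rest.
kept <- names(select(toy, any_of(vars)))
kept  # "year" "month" "dep_delay"

# all_of() is strict: it raises an error for the missing names.
strict <- tryCatch(select(toy, all_of(vars)), error = function(e) "error")
strict  # "error"
```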
select(flights, contains("TIME")) %>% head()
contains() ignores case by default, so "TIME" matches every column whose name contains "time". You can change this by setting
ignore.case = FALSE
Currently dep_time and sched_dep_time are
convenient to look at, but hard to compute with because they’re not
really continuous numbers. Convert them to a more convenient
representation of number of minutes since midnight.
flights %>% mutate(dep_time_mins = ((dep_time %% 100) + (dep_time %/% 100 * 60)) %% 1440, sched_dep_time_mins = ((sched_dep_time %% 100) + (sched_dep_time %/% 100 * 60)) %% 1440) %>% head()
Compare air_time with arr_time - dep_time.
What do you expect to see? What do you see? What do you need to do to
fix it?
flights %>% mutate(arr_time_mins = ((arr_time %% 100) + (arr_time %/% 100 * 60)) %% 1440, dep_time_mins = ((dep_time %% 100) + (dep_time %/% 100 * 60)) %% 1440) %>% mutate(air_time_diff = arr_time_mins - dep_time_mins) %>% select(air_time, air_time_diff) %>% head()
We would expect arr_time - dep_time to equal air_time, the time from when the plane left the ground to when it landed. However, most values differ, because arr_time and dep_time are local clock times: time-zone changes and flights that land after midnight shift the difference. To fix it, both times would need to be converted to a common time zone and date.
Compare dep_time, sched_dep_time, and
dep_delay. How would you expect those three numbers to be
related?
Converting to minutes by hand is tedious, so let's write a function:
tomins <- function(x){
  ((x %% 100) + (x %/% 100 * 60)) %% 1440
}
flights %>% select(dep_time, sched_dep_time, dep_delay) %>% mutate(dep_time = tomins(dep_time), sched_dep_time = tomins(sched_dep_time)) %>% mutate(calculated_delay = dep_time - sched_dep_time) %>% head()
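We can see that dep_delay is equal to the difference
between the actual and scheduled departure times, except for flights delayed past midnight, where the difference wraps around to a large negative number.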
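The midnight case is easy to miss in the first few rows, so here is a small base R sketch. The flight scheduled at 23:59 and leaving at 00:05 is a hypothetical example, not a row taken from `flights`.

```r
# Convert HHMM clock times to minutes since midnight (2400 maps to 0).
tomins <- function(x) {
  ((x %% 100) + (x %/% 100 * 60)) %% 1440
}

tomins(c(517, 2359, 2400))  # 317 1439 0

# Hypothetical flight scheduled at 23:59 that departs at 00:05 the next day.
sched <- tomins(2359)
actual <- tomins(5)

actual - sched               # -1434, not the true 6-minute delay
(actual - sched) %% 1440     # 6 once the day rollover is accounted for
```

The modulo correction assumes the flight did not leave early: an early departure of 5 minutes would come out as 1435. A more careful comparison would need full dates and times.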
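Find the 10 most delayed flights using a ranking function. How do you want to handle ties? Carefully read the documentation for
min_rank().
flights %>% mutate(dep_delay_min_rank = min_rank(desc(dep_delay))) %>% filter(dep_delay_min_rank <= 10) %>% arrange(dep_delay_min_rank) %>% select(month, day, carrier, flight, dep_delay, dep_delay_min_rank) %>% head(10)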
What does 1:3 + 1:10 return? Why?
(1:3 + 1:10)
## Warning in 1:3 + 1:10: longer object length is not a multiple of shorter object
## length
## [1] 2 4 6 5 7 9 8 10 12 11
This expression adds the two vectors element by element. Because the lengths differ, R recycles the shorter vector (1, 2, 3, 1, 2, 3, ...) to the length of the longer one, and it warns because 10 is not a multiple of 3.
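The recycling can be made explicit with `rep_len()`. This base R sketch reproduces the output above by hand.

```r
# Recycle 1:3 by hand to the length of the longer vector.
short_recycled <- rep_len(1:3, length.out = 10)
short_recycled          # 1 2 3 1 2 3 1 2 3 1

# Adding the recycled vector reproduces the result of 1:3 + 1:10,
# without the warning about mismatched lengths.
manual <- short_recycled + 1:10
manual                  # 2 4 6 5 7 9 8 10 12 11

identical(manual, suppressWarnings(1:3 + 1:10))  # TRUE
```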
?Trig
cos(x)
sin(x)
tan(x)
acos(x)
asin(x)
atan(x)
atan2(y, x)
cospi(x)
sinpi(x)
tanpi(x)
Brainstorm at least 5 different ways to assess the typical delay characteristics of a group of flights. Consider the following scenarios:
A flight is 15 minutes early 50% of the time, and 15 minutes late 50% of the time.
A flight is always 10 minutes late.
A flight is 30 minutes early 50% of the time, and 30 minutes late 50% of the time.
99% of the time a flight is on time. 1% of the time it’s 2 hours late.
Which is more important: arrival delay or departure delay?
not_cancelled <- flights %>% filter(!is.na(air_time))
I am confused, will come back later.
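One possible starting point, offered as a sketch and not a full answer, is to write the four scenarios down as vectors and compare a few summaries. The sample size of 100 flights per scenario and the helper name `summarise_delay` are assumptions made for illustration.

```r
# Delays in minutes for the four scenarios; negative means early.
# Each vector is a hypothetical sample of 100 flights.
scenarios <- list(
  early_late_15 = rep(c(-15, 15), each = 50),
  always_10     = rep(10, 100),
  early_late_30 = rep(c(-30, 30), each = 50),
  rare_2_hours  = c(rep(0, 99), 120)
)

# Five candidate summaries of "typical delay" for one vector.
summarise_delay <- function(x) {
  c(
    mean      = mean(x),                   # average delay
    median    = median(x),                 # robust to rare extremes
    sd        = sd(x),                     # variability
    prop_late = mean(x > 0),               # share of flights that are late
    p99       = unname(quantile(x, 0.99))  # tail behaviour
  )
}

round(sapply(scenarios, summarise_delay), 2)
```

The two early/late scenarios share a mean and median of 0 and differ only in spread, while the rare two-hour delay has a median of 0 and a small positive mean. No single summary separates all four cases.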
Come up with another approach that will give you the same output as
not_cancelled %>% count(dest) and
not_cancelled %>% count(tailnum, wt = distance) (without
using count()).
not_cancelled %>% count(dest) %>% head()
not_cancelled %>% count(tailnum, wt = distance) %>% head()
not_cancelled %>% group_by(dest) %>% summarise(n = length(dest)) %>% head()
not_cancelled %>% group_by(tailnum) %>% summarise(n = sum(distance)) %>% head()
Our definition of cancelled flights (is.na(dep_delay) | is.na(arr_delay) ) is slightly
suboptimal. Why? Which is the most important column?
dep_delay is more important: if a flight never
takes off, it is truly cancelled, whereas a flight that took off but never reached
its intended destination was not cancelled, just diverted.
df <- flights %>% mutate(cancelled = is.na(air_time)) %>% group_by(year, month, day) %>% summarise(canceled_per = mean(cancelled), avg_delay = mean(dep_delay, na.rm = TRUE))
## `summarise()` has grouped output by 'year', 'month'. You can override using the
## `.groups` argument.
ggplot(df) +
geom_point(aes(avg_delay, canceled_per))
We can see a positive correlation between avg_delay and the proportion of cancelled flights.
Which carrier has the worst delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about
flights %>% group_by(carrier, dest) %>% summarise(n()))
flights %>% group_by(carrier) %>% summarise(total_delay = mean(arr_delay, na.rm=TRUE)+mean(dep_delay, na.rm=TRUE)) %>% arrange(desc(total_delay))
Let's try to account for airports. The plan is to compare a carrier's average delay on a route to the overall average delay for that route. This will be a bit complex; I'm going to try my best.
flights %>% group_by(origin, dest, carrier) %>% summarise(avg_delay_per_carrier = mean(arr_delay, na.rm=TRUE)) %>% head()
## `summarise()` has grouped output by 'origin', 'dest'. You can override using
## the `.groups` argument.
flights %>% group_by(origin, dest) %>% summarise(avg_delay_per_route = mean(arr_delay, na.rm=TRUE)) %>% head()
## `summarise()` has grouped output by 'origin'. You can override using the
## `.groups` argument.
OK, so we have our data, great. Now how do we compare the carrier delay to the route delay and end up with a data frame we can graph?
We probably need to use a join; I will come back to this later.
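One possible way to finish the comparison is a `left_join()` of the two summaries on the route. This is a sketch under stated assumptions: the tibble `toy` and the column `delay_vs_route` are hypothetical, chosen so the arithmetic is easy to follow, and this is not the notebook's actual result.

```r
library(dplyr)

# Hypothetical toy data standing in for flights.
toy <- tibble(
  origin    = c("JFK", "JFK", "JFK", "JFK", "EWR", "EWR"),
  dest      = c("MIA", "MIA", "MIA", "MIA", "ORD", "ORD"),
  carrier   = c("AA",  "AA",  "DL",  "DL",  "UA",  "AA"),
  arr_delay = c(10,    20,    30,    NA,    5,     15)
)

# Average delay per carrier on each route.
by_carrier <- toy %>%
  group_by(origin, dest, carrier) %>%
  summarise(avg_delay_per_carrier = mean(arr_delay, na.rm = TRUE),
            .groups = "drop")

# Average delay per route, across all carriers.
by_route <- toy %>%
  group_by(origin, dest) %>%
  summarise(avg_delay_per_route = mean(arr_delay, na.rm = TRUE),
            .groups = "drop")

# Join the two tables on the route and compare each carrier to its route.
compared <- by_carrier %>%
  left_join(by_route, by = c("origin", "dest")) %>%
  mutate(delay_vs_route = avg_delay_per_carrier - avg_delay_per_route)

compared
```

Averaging `delay_vs_route` by carrier would then give one number per carrier that could be plotted, though routes flown by a single carrier always contribute 0 and may be worth filtering out first.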
What does the sort argument to count() do?
When might you use it?
With sort = TRUE, count() shows the largest groups at the top, which saves you a
%>% arrange(desc(n))
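When combined with group_by(), summary, ranking, offset, and cumulative functions such as mean(), min_rank(), lag(), and cumsum() are computed within each group rather than over the whole data frame. Row-wise arithmetic and logical operators are unaffected.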
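A small sketch of how grouping changes a window function, using a hypothetical tibble `toy` in place of `flights`:

```r
library(dplyr)

# Hypothetical toy data: two carriers with three delays each.
toy <- tibble(
  carrier   = c("AA", "AA", "AA", "UA", "UA", "UA"),
  dep_delay = c(5, 20, 10, 60, 30, 90)
)

# Ungrouped: ranks and means are computed over all six rows.
ungrouped <- toy %>%
  mutate(rank = min_rank(desc(dep_delay)),
         centered = dep_delay - mean(dep_delay))

# Grouped: the same expressions restart within each carrier.
grouped <- toy %>%
  group_by(carrier) %>%
  mutate(rank = min_rank(desc(dep_delay)),
         centered = dep_delay - mean(dep_delay)) %>%
  ungroup()

ungrouped$rank  # 6 4 5 2 3 1
grouped$rank    # 3 1 2 2 3 1
```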
Which plane (tailnum) has the worst on-time
record?
flights %>% group_by(tailnum) %>% summarise(avg_delay = mean(arr_delay, na.rm=TRUE)) %>% arrange(desc(avg_delay)) %>% head()
N844MH, you suck.
flights %>% mutate(sched_dep_time_mins = ((sched_dep_time %% 100) + (sched_dep_time %/% 100 * 60)) %% 1440) %>% group_by(sched_dep_time_mins) %>% summarise(avg_delay = mean(dep_delay, na.rm=TRUE)) %>%
ggplot()+
geom_point(aes(sched_dep_time_mins, avg_delay))
## Warning: Removed 1 rows containing missing values (geom_point).
We can see delays quite strongly increase the later into the day you go.
flights %>%
group_by(dest) %>%
mutate(dest_total_delay=mean(arr_delay+dep_delay, na.rm=TRUE), n = n()) %>%
ungroup() %>%
mutate(prop_delay=(arr_delay+dep_delay)/dest_total_delay) %>%
filter(n > 10000) %>%
select(flight, dest, dep_delay, arr_delay, dest_total_delay, prop_delay) -> qtmp
head(qtmp)
qqtmp <- qtmp %>% filter(dest=='MIA')
ggplot(data=qqtmp) + geom_boxplot(aes(x=dest, y=log(pmax(1,prop_delay))))
Using lag(), explore
how the delay of a flight is related to the delay of the immediately
preceding flight.
flights %>% group_by(origin) %>% mutate(delay_lag=lag(dep_delay)) %>% head()
flights %>% group_by(origin, dest) %>% mutate(shortest_time=min(air_time, na.rm=TRUE)) %>% ungroup() %>% mutate(time_prop = air_time/shortest_time) %>% arrange(desc(time_prop))
flights %>% group_by(dest) %>% mutate(carrier_count=n_distinct(carrier)) %>% filter(carrier_count > 1) %>% group_by(carrier) %>% summarize(dest_count = n_distinct(dest)) %>% arrange(desc(dest_count))
flights %>% filter(!is.na(dep_delay)) %>% group_by(tailnum) %>% mutate(hour_delays=cumsum(dep_delay > 60)) %>% summarise(n = sum(hour_delays < 1)) %>% arrange(desc(n))