suppressPackageStartupMessages(library("tidyverse"))
suppressPackageStartupMessages(library("lubridate"))
suppressPackageStartupMessages(library("nycflights13"))
The solutions below use the following code from the chapter:
make_datetime_100 <- function(year, month, day, time) {
  make_datetime(year, month, day, time %/% 100, time %% 100)
}
flights_dt <- flights %>%
  filter(!is.na(dep_time), !is.na(arr_time)) %>%
  mutate(
    dep_time = make_datetime_100(year, month, day, dep_time),
    arr_time = make_datetime_100(year, month, day, arr_time),
    sched_dep_time = make_datetime_100(year, month, day, sched_dep_time),
    sched_arr_time = make_datetime_100(year, month, day, sched_arr_time)
  ) %>%
  select(origin, dest, ends_with("delay"), ends_with("time"))
sched_dep <- flights_dt %>%
  mutate(minute = minute(sched_dep_time)) %>%
  group_by(minute) %>%
  summarise(
    avg_delay = mean(arr_delay, na.rm = TRUE),
    n = n()
  )
In the previous code, the difference between rounded and un-rounded dates provides the within-period time.
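As a small illustration of this idea, the sketch below uses a made-up timestamp (not taken from the flights data). floor_date() rounds it down to the start of the day, and subtracting the rounded value from the original leaves the time elapsed within that day.

```r
library(lubridate)

# Hypothetical timestamp, chosen only for illustration
x <- ymd_hms("2013-03-15 14:37:00", tz = "UTC")

# Round down to the start of the day
day_start <- floor_date(x, unit = "day")

# Un-rounded minus rounded = time elapsed within the day, in minutes
within_day_mins <- as.numeric(difftime(x, day_start, units = "mins"))
within_day_mins
# 877, i.e. 14 * 60 + 37 minutes after midnight
```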
1. How does the distribution of flight times within a day change over the course of the year?
Let’s try plotting this by month:
flights_dt %>%
  filter(!is.na(dep_time)) %>%
  mutate(dep_hour = update(dep_time, yday = 1)) %>%
  mutate(month = factor(month(dep_time))) %>%
  ggplot(aes(dep_hour, color = month)) +
  geom_freqpoly(binwidth = 60 * 60)
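The update(dep_time, yday = 1) call in the code above moves every departure to 1 January while keeping its clock time, so all flights can share a single day-long axis. A minimal sketch with an invented date-time:

```r
library(lubridate)

# Hypothetical departure, chosen only for illustration
dep <- ymd_hms("2013-07-04 09:30:00")

# Set the day of the year to 1; the hour and minute are left untouched
same_day <- update(dep, yday = 1)
same_day
# 2013-01-01 09:30:00 UTC
```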

This will look better if everything is normalized within groups. The reason that February is lower is that there are fewer days and thus fewer flights.
flights_dt %>%
  filter(!is.na(dep_time)) %>%
  mutate(dep_hour = update(dep_time, yday = 1)) %>%
  mutate(month = factor(month(dep_time))) %>%
  ggplot(aes(dep_hour, color = month)) +
  geom_freqpoly(aes(y = ..density..), binwidth = 60 * 60)

At least to me, there doesn’t appear to be much difference in the within-day distribution over the year, but I may be thinking about it incorrectly.
2. Compare dep_time, sched_dep_time and dep_delay. Are they consistent? Explain your findings.
If they are consistent, then dep_time = sched_dep_time + dep_delay.
flights_dt %>%
  mutate(dep_time_ = sched_dep_time + dep_delay * 60) %>%
  filter(dep_time_ != dep_time) %>%
  select(dep_time_, dep_time, sched_dep_time, dep_delay)
There are discrepancies, and they look like mistakes in the dates. These are flights whose actual departure falls on the day after the scheduled departure. We forgot to account for this when creating the date-times with the make_datetime_100() function. The code would have had to check whether the constructed departure time is earlier than the scheduled departure time plus the departure delay (in minutes) and, if so, move it to the next day. Alternatively, simply adding the departure delay to the scheduled departure time is a more robust way to construct the departure time, because it automatically accounts for crossing into the next day.
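To make the midnight problem concrete, here is a minimal sketch with an invented flight (scheduled for 23:50 and delayed by 25 minutes). The values are assumptions for illustration, not rows from flights.

```r
library(lubridate)

# Hypothetical flight: scheduled 23:50 on 1 January 2013, delayed 25 minutes
sched_dep <- make_datetime(2013, 1, 1, 23, 50)
dep_delay <- 25

# Naive construction: combine the scheduled date with the clock time 00:15,
# which is what make_datetime_100() effectively does
naive_dep <- make_datetime(2013, 1, 1, 0, 15)

# Robust construction: add the delay to the scheduled departure
robust_dep <- sched_dep + minutes(dep_delay)

naive_dep   # 2013-01-01 00:15:00 UTC, a day too early
robust_dep  # 2013-01-02 00:15:00 UTC
```

The two results differ by exactly one day, which is the kind of discrepancy the filter above picks up.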
3. Compare air_time with the duration between the departure and arrival. Explain your findings.
flights_dt %>%
  mutate(
    # Request minutes explicitly; plain subtraction lets R choose the units
    flight_duration = as.numeric(difftime(arr_time, dep_time, units = "mins")),
    air_time_mins = air_time,
    diff = flight_duration - air_time_mins
  ) %>%
  select(origin, dest, flight_duration, air_time_mins, diff)
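One thing to watch when subtracting date-times: the result is a difftime whose units R picks automatically, so converting it with as.numeric() can yield seconds, minutes, hours, or days depending on the data. The sketch below uses two made-up times to show this behaviour and how to request minutes explicitly, so that the value is comparable with air_time.

```r
# Hypothetical departure and arrival, chosen only for illustration
dep <- as.POSIXct("2013-01-01 08:00:00", tz = "UTC")
arr <- as.POSIXct("2013-01-01 10:30:00", tz = "UTC")

# Plain subtraction: R chooses the units (hours for this pair)
auto_diff <- arr - dep
units(auto_diff)       # "hours"
as.numeric(auto_diff)  # 2.5

# Explicit units give a predictable number of minutes
duration_mins <- as.numeric(difftime(arr, dep, units = "mins"))
duration_mins          # 150
```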
4. How does the average delay time change over the course of a day? Should you use dep_time or sched_dep_time? Why?
Use sched_dep_time because that is the relevant metric for someone scheduling a flight. Also, using dep_time will always bias delays to later in the day since delays will push flights later.
flights_dt %>%
  mutate(sched_dep_hour = hour(sched_dep_time)) %>%
  group_by(sched_dep_hour) %>%
  summarise(dep_delay = mean(dep_delay)) %>%
  ggplot(aes(y = dep_delay, x = sched_dep_hour)) +
  geom_point() +
  geom_smooth()

5. On what day of the week should you leave if you want to minimize the chance of a delay?
Saturday has the lowest average departure delay time and the lowest average arrival delay time.
flights_dt %>%
  mutate(dow = wday(sched_dep_time)) %>%
  group_by(dow) %>%
  summarise(
    dep_delay = mean(dep_delay),
    arr_delay = mean(arr_delay, na.rm = TRUE)
  ) %>%
  print(n = Inf)
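The dow column in the table above is numeric. With lubridate's default week start (Sunday, which holds unless the lubridate.week.start option has been changed), 1 is Sunday and 7 is Saturday. A quick check with a known date:

```r
library(lubridate)

# 5 January 2013 was a Saturday
d <- ymd("2013-01-05")

# week_start = 7 (Sunday) is lubridate's default, written out here for clarity
wday(d, week_start = 7)  # 7

# With a Monday week start the same day is numbered 6
wday(d, week_start = 1)  # 6

# label = TRUE returns the day name instead (text depends on the locale)
wday(d, label = TRUE)
```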
flights_dt %>%
  mutate(wday = wday(dep_time, label = TRUE)) %>%
  group_by(wday) %>%
  summarize(ave_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  ggplot(aes(x = wday, y = ave_dep_delay)) +
  geom_bar(stat = "identity")

flights_dt %>%
  mutate(wday = wday(dep_time, label = TRUE)) %>%
  group_by(wday) %>%
  summarize(ave_arr_delay = mean(arr_delay, na.rm = TRUE)) %>%
  ggplot(aes(x = wday, y = ave_arr_delay)) +
  geom_bar(stat = "identity")

6. What makes the distribution of diamonds$carat and flights$sched_dep_time similar?
ggplot(diamonds, aes(x = carat)) +
  geom_density()

In both carat and sched_dep_time, abnormally many values fall on nice “human” numbers. In sched_dep_time, these are minutes 00 and 30. In carat, they are values such as 0, 1/3, 1/2, and 2/3.
ggplot(diamonds, aes(x = carat %% 1 * 100)) +
  geom_histogram(binwidth = 1)

In scheduled departure times it is 00 and 30 minutes, and minutes ending in 0 and 5.
ggplot(flights_dt, aes(x = minute(sched_dep_time))) +
  geom_histogram(binwidth = 1)

7. Confirm my hypothesis that the early departures of flights in minutes 20-30 and 50-60 are caused by scheduled flights that leave early. Hint: create a binary variable that tells you whether or not a flight was delayed.
First, I create a binary variable early that is TRUE if a flight leaves early and FALSE if it does not. Then, I group flights by the minute of departure. This shows that the proportion of flights that are early departures is highest between minutes 20–30 and 50–60.
flights_dt %>%
  mutate(
    minute = minute(dep_time),
    early = dep_delay < 0
  ) %>%
  group_by(minute) %>%
  summarise(
    early = mean(early, na.rm = TRUE),
    n = n()
  ) %>%
  ggplot(aes(minute, early)) +
  geom_line()
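The summarise() step in the code above works because the mean of a logical vector is the proportion of TRUE values. A minimal sketch with invented delays (not taken from the flights data):

```r
# Hypothetical departure delays in minutes; negative means an early departure
dep_delay <- c(-5, 10, -2, 0, -1)

# Logical flag: TRUE when the flight left early
early <- dep_delay < 0

# Mean of a logical vector = share of TRUE values (3 of 5 here)
prop_early <- mean(early)
prop_early
# 0.6
```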

