suppressPackageStartupMessages(library("tidyverse"))
suppressPackageStartupMessages(library("modelr"))
suppressPackageStartupMessages(library("lubridate"))
suppressPackageStartupMessages(library("broom"))
suppressPackageStartupMessages(library("nycflights13"))
suppressPackageStartupMessages(library("splines"))
This code is copied from the book and needed for the exercises.
daily <- flights %>%
mutate(date = make_date(year, month, day)) %>%
group_by(date) %>%
summarise(n = n())
daily
daily <- daily %>%
mutate(wday = wday(date, label = TRUE))
term <- function(date) {
cut(date,
breaks = ymd(20130101, 20130605, 20130825, 20140101),
labels = c("spring", "summer", "fall")
)
}
daily <- daily %>%
mutate(term = term(date))
mod <- lm(n ~ wday, data = daily)
daily <- daily %>%
add_residuals(mod)
mod1 <- lm(n ~ wday, data = daily)
mod2 <- lm(n ~ wday * term, data = daily)
These are the Sundays before Monday holidays Martin Luther King Jr. Day, Memorial Day, and Labor Day. For other years, use the dates of the holidays for those years—the third Monday of January for Martin Luther King Jr. Day, the last Monday of May for Memorial Day, and the first Monday in September for Labor Day.
The top three days correspond to the Saturday after Thanksgiving (November 30th), the Sunday after Thanksgiving (December 1st), and the Saturday after Christmas (December 28th).
top_n(daily, 3, resid)
We could generalize these to other years using the dates of those holidays on those years.
wday
variable into terms, but only for Saturdays, i.e., it should have Thurs, Fri, but Sat-summer, Sat-spring, Sat-fall How does this model compare with the model with every combination of wday and term?I’ll use the function case_when()
to do this, though there are other ways which it could be solved.
daily <- daily %>%
mutate(
wday2 =
case_when(
wday == "Sat" & term == "summer" ~ "Sat-summer",
wday == "Sat" & term == "fall" ~ "Sat-fall",
wday == "Sat" & term == "spring" ~ "Sat-spring",
TRUE ~ as.character(wday)
)
)
mod3 <- lm(n ~ wday2, data = daily)
daily %>%
gather_residuals(sat_term = mod3, all_interact = mod2) %>%
ggplot(aes(date, resid, colour = model)) +
geom_line(alpha = 0.75)
I think the overlapping plot is hard to understand. If we are interested in the differences, it is better to plot the differences directly. In this code, I use spread_residuals()
to add one column per model, rather than gather_residuals()
which creates a new row for each model.
daily %>%
spread_residuals(sat_term = mod3, all_interact = mod2) %>%
mutate(resid_diff = sat_term - all_interact) %>%
ggplot(aes(date, resid_diff)) +
geom_line(alpha = 0.75)
The model with terms ×
Saturday has higher residuals in the fall and lower residuals in the spring than the model with all interactions.
Comparing models, mod3
has a lower \(R^2\) and regression standard error, \(\hat\sigma\) , despite using fewer variables. More importantly for prediction purposes, this model has a higher AIC, which is an estimate of the out of sample error.
glance(mod3) %>% select(r.squared, sigma, AIC, df)
glance(mod2) %>% select(r.squared, sigma, AIC, df)
wday
variable that combines the day of week, term (for Saturdays), and public holidays. What do the residuals of that model look like?The question is unclear how to handle public holidays. There are several questions to consider.
First, what are the public holidays? I include all federal holidays in the United States in 2013. Other holidays to consider would be Easter and Good Friday which is US stock market holiday and widely celebrated religious holiday, Mothers Day, Fathers Day, and Patriots’ Day, which is a holiday in several states, and other state holidays.
holidays_2013 <-
tribble(
~holiday, ~date,
"New Year's Day", 20130101,
"Martin Luther King Jr. Day", 20130121,
"Washington's Birthday", 20130218,
"Memorial Day", 20130527,
"Independence Day", 20130704,
"Labor Day", 20130902,
"Columbus Day", 20131028,
"Veteran's Day", 20131111,
"Thanksgiving", 20131128,
"Christmas", 20131225
) %>%
mutate(date = lubridate::ymd(date))
The model could include a single dummy variable which indicates a day was a public holiday. Alternatively, I could include a dummy variable for each public holiday. I would expect that Veteran’s Day and Washington’s Birthday have a different effect on travel than Thanksgiving, Christmas, and New Year’s Day.
Another question is whether and how I should handle the days before and after holidays. Travel could be lighter on the day of the holiday, but heavier the day before or after.
daily <- daily %>%
mutate(
wday3 =
case_when(
date %in% (holidays_2013$date - 1L) ~ "day before holiday",
date %in% (holidays_2013$date + 1L) ~ "day after holiday",
date %in% holidays_2013$date ~ "holiday",
.$wday == "Sat" & .$term == "summer" ~ "Sat-summer",
.$wday == "Sat" & .$term == "fall" ~ "Sat-fall",
.$wday == "Sat" & .$term == "spring" ~ "Sat-spring",
TRUE ~ as.character(.$wday)
)
)
mod4 <- lm(n ~ wday3, data = daily)
daily %>%
spread_residuals(resid_sat_terms = mod3, resid_holidays = mod4) %>%
mutate(resid_diff = resid_holidays - resid_sat_terms) %>%
ggplot(aes(date, resid_diff)) +
geom_line(alpha = 0.75)
n ~ wday * month
)? Why is this not very helpful?daily <- mutate(daily, month = factor(lubridate::month(date)))
mod6 <- lm(n ~ wday * month, data = daily)
print(summary(mod6))
Call:
lm(formula = n ~ wday * month, data = daily)
Residuals:
Min 1Q Median 3Q Max
-269.2 -5.0 1.5 8.8 113.2
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 867.4000 7.5983 114.157 < 2e-16 ***
wday.L -64.0744 20.8737 -3.070 0.002353 **
wday.Q -165.6001 20.1555 -8.216 7.77e-15 ***
wday.C -68.2591 20.3115 -3.361 0.000885 ***
wday^4 -92.0814 20.4991 -4.492 1.03e-05 ***
wday^5 9.7925 19.7334 0.496 0.620111
wday^6 -20.4376 18.9922 -1.076 0.282802
month2 23.7071 10.9946 2.156 0.031912 *
month3 67.8857 10.7456 6.318 1.04e-09 ***
month4 74.5929 10.8292 6.888 3.70e-11 ***
month5 56.2786 10.7456 5.237 3.20e-07 ***
month6 80.3071 10.8292 7.416 1.43e-12 ***
month7 77.1143 10.7456 7.176 6.39e-12 ***
month8 81.6357 10.7456 7.597 4.52e-13 ***
month9 51.3714 10.8292 4.744 3.34e-06 ***
month10 60.1357 10.7456 5.596 5.20e-08 ***
month11 46.9143 10.8292 4.332 2.06e-05 ***
month12 38.7786 10.7456 3.609 0.000364 ***
wday.L:month2 -3.7230 29.6267 -0.126 0.900089
wday.Q:month2 -3.8188 29.1251 -0.131 0.895776
wday.C:month2 0.4899 29.2333 0.017 0.986641
wday^4:month2 4.5690 29.3639 0.156 0.876460
wday^5:month2 -4.2552 28.8346 -0.148 0.882784
wday^6:month2 12.0570 28.3325 0.426 0.670760
wday.L:month3 -14.5705 28.4302 -0.513 0.608703
wday.Q:month3 15.4389 28.2073 0.547 0.584581
wday.C:month3 8.2262 28.4672 0.289 0.772817
wday^4:month3 22.7202 28.7015 0.792 0.429261
wday^5:month3 -15.3298 28.5042 -0.538 0.591135
wday^6:month3 11.3727 28.2682 0.402 0.687759
wday.L:month4 -16.6682 29.3590 -0.568 0.570667
wday.Q:month4 10.7254 28.9620 0.370 0.711418
wday.C:month4 -0.2449 28.7249 -0.009 0.993202
wday^4:month4 23.2883 28.8711 0.807 0.420561
wday^5:month4 -17.8720 28.0764 -0.637 0.524935
wday^6:month4 5.3524 27.8883 0.192 0.847940
wday.L:month5 3.6663 29.3590 0.125 0.900711
wday.Q:month5 -20.6652 28.6699 -0.721 0.471632
wday.C:month5 4.6336 28.7249 0.161 0.871965
wday^4:month5 5.9994 28.5109 0.210 0.833490
wday^5:month5 -16.9119 28.0764 -0.602 0.547425
wday^6:month5 12.7643 27.1935 0.469 0.639158
wday.L:month6 -4.5261 28.6515 -0.158 0.874593
wday.Q:month6 23.8130 28.2073 0.844 0.399267
wday.C:month6 13.7580 28.7249 0.479 0.632342
wday^4:month6 24.1183 29.1875 0.826 0.409322
wday^5:month6 -17.6484 28.7981 -0.613 0.540483
wday^6:month6 10.5256 28.3291 0.372 0.710510
wday.L:month7 -28.7914 29.3590 -0.981 0.327601
wday.Q:month7 49.5846 28.6699 1.730 0.084818 .
wday.C:month7 54.5011 28.7249 1.897 0.058807 .
wday^4:month7 50.8474 28.5109 1.783 0.075594 .
wday^5:month7 -33.6983 28.0764 -1.200 0.231058
wday^6:month7 -13.8943 27.1935 -0.511 0.609793
wday.L:month8 -20.4479 28.8711 -0.708 0.479378
wday.Q:month8 6.7648 28.5042 0.237 0.812578
wday.C:month8 6.0012 28.4672 0.211 0.833186
wday^4:month8 19.0738 28.7814 0.663 0.508058
wday^5:month8 -19.3123 28.0576 -0.688 0.491827
wday^6:month8 9.5074 27.8866 0.341 0.733410
wday.L:month9 -30.3411 28.9257 -1.049 0.295110
wday.Q:month9 -42.0342 28.6699 -1.466 0.143726
wday.C:month9 -20.7186 28.7249 -0.721 0.471338
wday^4:month9 -20.3752 28.7914 -0.708 0.479728
wday^5:month9 -18.2376 28.5226 -0.639 0.523079
wday^6:month9 11.7263 28.2699 0.415 0.678606
wday.L:month10 -61.0507 29.5199 -2.068 0.039544 *
wday.Q:month10 -26.2352 28.5042 -0.920 0.358153
wday.C:month10 -32.4353 28.7249 -1.129 0.259788
wday^4:month10 -12.2122 28.9901 -0.421 0.673890
wday^5:month10 -27.6864 27.9072 -0.992 0.322008
wday^6:month10 0.1234 26.8590 0.005 0.996339
wday.L:month11 -54.9466 28.9257 -1.900 0.058512 .
wday.Q:month11 16.0117 28.6699 0.558 0.576957
wday.C:month11 54.9502 28.7249 1.913 0.056766 .
wday^4:month11 47.2857 28.7914 1.642 0.101635
wday^5:month11 -44.7401 28.5226 -1.569 0.117871
wday^6:month11 -20.6876 28.2699 -0.732 0.464907
wday.L:month12 -9.5058 28.8711 -0.329 0.742212
wday.Q:month12 75.2088 28.5042 2.639 0.008791 **
wday.C:month12 -25.0256 28.4672 -0.879 0.380097
wday^4:month12 -23.7798 28.7814 -0.826 0.409380
wday^5:month12 20.4470 28.0576 0.729 0.466761
wday^6:month12 9.5864 27.8866 0.344 0.731282
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 42.05 on 281 degrees of freedom
Multiple R-squared: 0.8355, Adjusted R-squared: 0.7869
F-statistic: 17.2 on 83 and 281 DF, p-value: < 2.2e-16
If we fit a day of week effect that varies by month, there will be 12 * 7 = 84
parameters in the model. Since each month has only four to five weeks, each of these day of week \(\times\) month effects is the average of only four or five observations. These estimates have large standard errors and likely not generalize well beyond the sample data, since they are estimated from only a few observations.
n ~ wday + ns(date, 5)
to look like? Knowing what you know about the data, why would you expect it to be not particularly effective?Previous models fit in the chapter and exercises show that the effects of days of the week vary across different times of the year. The model wday + ns(date, 5)
does not interact the day of week effect (wday
) with the time of year effects (ns(date, 5)
).
I estimate a model which does not interact the day of week effects (mod7
) with the spline to that which does (mod8
). I need to load the splines package to use the ns()
function.
mod7 <- lm(n ~ wday + ns(date, 5), data = daily)
mod8 <- lm(n ~ wday * ns(date, 5), data = daily)
The residuals of the model that does not interact day of week with time of year (mod7
) are larger than those of the model that does (mod8
). The model mod7
underestimates weekends during the summer and overestimates weekends during the autumn.
daily %>%
gather_residuals(mod7, mod8) %>%
ggplot(aes(x = date, y = resid, color = model)) +
geom_line(alpha = 0.75)
Comparing the average distances of flights by day of week, Sunday flights are the second longest. Saturday flights are the longest on average. Saturday may have the longest flights on average because there are fewer regularly scheduled short business/commuter flights on the weekends but that is speculation.
flights %>%
mutate(
date = make_date(year, month, day),
wday = wday(date, label = TRUE)
) %>%
ggplot(aes(y = distance, x = wday)) +
geom_boxplot() +
labs(x = "Day of Week", y = "Average Distance")
Hide outliers.
flights %>%
mutate(
date = make_date(year, month, day),
wday = wday(date, label = TRUE)
) %>%
ggplot(aes(y = distance, x = wday)) +
geom_boxplot(outlier.shape = NA) +
labs(x = "Day of Week", y = "Average Distance")
Try pointrange with mean and standard error of the mean (sd / sqrt(n)).
flights %>%
mutate(
date = make_date(year, month, day),
wday = wday(date, label = TRUE)
) %>%
ggplot(aes(y = distance, x = wday)) +
stat_summary() +
labs(x = "Day of Week", y = "Average Distance")
Try pointrange with mean and standard error of the mean (sd / sqrt(n)).
flights %>%
mutate(
date = make_date(year, month, day),
wday = wday(date, label = TRUE)
) %>%
ggplot(aes(y = distance, x = wday)) +
geom_violin() +
labs(x = "Day of Week", y = "Average Distance")
flights %>%
mutate(
date = make_date(year, month, day),
wday = wday(date, label = TRUE)
) %>%
filter(
distance < 3000,
hour >= 5, hour <= 21
) %>%
ggplot(aes(x = hour, color = wday, y = ..density..)) +
geom_freqpoly(binwidth = 1)
flights %>%
mutate(
date = make_date(year, month, day),
wday = wday(date, label = TRUE)
) %>%
filter(
distance < 3000,
hour >= 5, hour <= 21
) %>%
group_by(wday, hour) %>%
summarise(distance = mean(distance)) %>%
ggplot(aes(x = hour, color = wday, y = distance)) +
geom_line()
flights %>%
mutate(
date = make_date(year, month, day),
wday = wday(date, label = TRUE)
) %>%
filter(
distance < 3000,
hour >= 5, hour <= 21
) %>%
group_by(wday, hour) %>%
summarise(distance = sum(distance)) %>%
group_by(wday) %>%
mutate(prop_distance = distance / sum(distance)) %>%
ungroup() %>%
ggplot(aes(x = hour, color = wday, y = prop_distance)) +
geom_line()
See the lecture Factors for the function fct_relevel()
. Use fct_relevel()
to put all levels in-front of the first level (“Sunday”).
monday_first <- function(x) {
fct_relevel(x, levels(x)[-1])
}
Now Monday is the first day of the week.
daily <- daily %>%
mutate(wday = wday(date, label = TRUE))
ggplot(daily, aes(monday_first(wday), n)) +
geom_boxplot() +
labs(x = "Day of Week", y = "Number of flights")