library(tidyverse)
library(modelr)
library(nycflights13)
library(lubridate)
diamonds data set
If we wanted to, we could continue to build up our model, moving the effects we’ve observed into the model to make them explicit. For example, we could include color, cut, and clarity in the model so that we also make explicit the effect of these three categorical variables on price:
diamonds2 <- diamonds %>%
  filter(carat <= 3) %>%
  mutate(lprice = log2(price), lcarat = log2(carat))
mod_diamond <- lm(lprice ~ lcarat + color + cut + clarity, data = diamonds2)
This model now includes four predictors, so it’s getting harder to visualise. Fortunately, they’re currently all independent which means that we can plot them individually in four plots.
grid <- diamonds2 %>%
data_grid(cut, .model = mod_diamond) %>%
add_predictions(mod_diamond)
grid
## # A tibble: 5 × 5
## cut lcarat color clarity pred
## <ord> <dbl> <chr> <chr> <dbl>
## 1 Fair -0.515 G VS2 11.2
## 2 Good -0.515 G VS2 11.3
## 3 Very Good -0.515 G VS2 11.4
## 4 Premium -0.515 G VS2 11.4
## 5 Ideal -0.515 G VS2 11.4
Here, since we build the data_grid for the cut variable alone, the .model argument fills in the other predictors that the model needs with “typical” values (the median for continuous variables and a representative level, such as the mode, for categorical variables). This way, we compare the predictions across cut qualities while holding all other variables constant.
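A minimal base R sketch of what holding the other predictors at typical values means. The data frame `toy`, its columns, and the helper `most_common()` are made up for illustration and are not part of the diamonds analysis; the sketch builds the grid by hand instead of using data_grid.

```r
# Toy data, invented for illustration only
set.seed(1)
toy <- data.frame(
  size  = runif(200, 1, 5),
  grade = factor(sample(c("low", "mid", "high"), 200, replace = TRUE),
                 levels = c("low", "mid", "high")),
  shape = factor(sample(c("round", "oval"), 200, replace = TRUE,
                        prob = c(0.7, 0.3)))
)
toy$y <- 2 * toy$size + as.integer(toy$grade) + rnorm(200, sd = 0.1)

mod_toy <- lm(y ~ size + grade + shape, data = toy)

# Hypothetical helper: the most frequent level of a factor
most_common <- function(x) names(which.max(table(x)))

# Vary grade only; pin size at its median and shape at its most common level
grid_toy <- data.frame(
  grade = factor(levels(toy$grade), levels = levels(toy$grade)),
  size  = median(toy$size),
  shape = factor(most_common(toy$shape), levels = levels(toy$shape))
)
grid_toy$pred <- predict(mod_toy, newdata = grid_toy)
grid_toy
```

Because size and shape are identical in every row of the grid, any difference between the predictions is due to grade alone.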
ggplot(grid, aes(cut, pred)) +
geom_point()
So we see that the model correctly predicts a higher price for better cut quality. Similarly, we can create the same plot for the other variables.
grid <- diamonds2 %>%
data_grid(color, .model = mod_diamond) %>%
add_predictions(mod_diamond)
ggplot(grid, aes(color, pred)) +
geom_point()
grid <- diamonds2 %>%
data_grid(clarity, .model = mod_diamond) %>%
add_predictions(mod_diamond)
ggplot(grid, aes(clarity, pred)) +
geom_point()
We see that the model correctly predicts a higher price for better color (color improves from “J” to “D”) and for better clarity (from “I1” to “IF”). Here cut, color, and clarity are treated as ordinal variables, and the model works pretty well using polynomial contrasts, which can capture non-linear patterns between levels.
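To make the polynomial contrasts concrete, here is a small base R sketch. The data frame `toy` and its values are invented for illustration; only `contr.poly()` and the default handling of ordered factors in `lm()` are standard R behaviour.

```r
# Ordered factors are coded with orthogonal polynomial contrasts by default
contr.poly(5)   # columns .L, .Q, .C, ^4 for a 5-level ordered factor

# Toy data, invented for illustration: a non-linear pattern across 5 levels
lev <- c("Fair", "Good", "Very Good", "Premium", "Ideal")
toy <- data.frame(
  cut = factor(rep(lev, each = 4), levels = lev, ordered = TRUE),
  y   = rep(c(1, 4, 5, 5.5, 5.6), each = 4) + rep(c(-0.1, 0, 0, 0.1), times = 5)
)

mod_ord   <- lm(y ~ cut, data = toy)                           # polynomial contrasts
mod_unord <- lm(y ~ factor(cut, ordered = FALSE), data = toy)  # treatment contrasts

names(coef(mod_ord))
# Both codings describe the same set of level means, so the fits agree
all.equal(unname(fitted(mod_ord)), unname(fitted(mod_unord)))
```

The coefficient names (.L, .Q, .C, ^4) are the linear, quadratic, cubic, and fourth-order trends across the levels. With all of them included, the model can reproduce any pattern of level means, linear or not.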
grid <- diamonds2 %>%
data_grid(lcarat, .model = mod_diamond) %>%
add_predictions(mod_diamond)
ggplot(grid, aes(lcarat, pred)) +
geom_point()
As expected, the relationship between lprice and lcarat is a linear one. As usual, we need to check the residual plot to make sure that the model assumptions are valid.
diamonds2 <- diamonds2 %>%
add_predictions(mod_diamond) %>%
add_residuals(mod_diamond, "lresid2")
ggplot(diamonds2, aes(pred, lresid2)) +
geom_bin_2d(bins = 50) +
geom_smooth(method = "lm", color = "red")
The plot above is very similar to the one created by the plot function.
plot(mod_diamond, which = 1)
We see that the residuals seem to be independent of the predicted values, which is what we are looking for. We can also check the normal Q-Q plot of the residuals.
qqnorm(diamonds2$lresid2)
which is similar to
plot(mod_diamond, which = 2)
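As a rough illustration of what a normal Q-Q plot compares, the sketch below uses simulated values rather than the diamonds residuals (the vectors `r_norm` and `r_heavy` are assumptions made up for this example). It extracts the plotted coordinates without drawing anything and checks how closely they follow a straight line.

```r
set.seed(123)
# Simulated "residuals" (illustrative only): normal vs heavy-tailed
r_norm  <- rnorm(1000)
r_heavy <- rt(1000, df = 2)

# qqnorm() returns the theoretical (x) and sample (y) quantiles it would plot
qq_norm  <- qqnorm(r_norm,  plot.it = FALSE)
qq_heavy <- qqnorm(r_heavy, plot.it = FALSE)

# The closer this correlation is to 1, the straighter the Q-Q plot
cor(qq_norm$x,  qq_norm$y)
cor(qq_heavy$x, qq_heavy$y)

# To draw the plot with a reference line instead:
# qqnorm(r_heavy); qqline(r_heavy, col = "red")
```

Heavy tails show up as points bending away from the reference line at both ends, which is the kind of deviation to look for among the largest residuals.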
As we see, the majority of the residuals follow a normal distribution closely. However, for the data points with the largest residuals (potential outliers), the residuals deviate from a normal distribution. We may want to investigate those samples more closely.
diamonds2 %>%
filter(abs(lresid2) > 1) %>%
add_predictions(mod_diamond) %>%
mutate(pred = round(2 ^ pred)) %>%
select(price, pred, carat:table, x:z) %>%
arrange(price)
## # A tibble: 18 × 11
## price pred carat cut color clarity depth table x y z
## <int> <dbl> <dbl> <ord> <ord> <ord> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1013 265 0.25 Fair F SI2 54.4 64 4.3 4.23 2.32
## 2 1186 285 0.25 Premium G SI2 59 60 5.33 5.28 3.12
## 3 1186 285 0.25 Premium G SI2 58.8 60 5.33 5.28 3.12
## 4 1262 2622 1.03 Fair E I1 78.2 54 5.72 5.59 4.42
## 5 1415 640 0.35 Fair G VS2 65.9 54 5.57 5.53 3.66
## 6 1415 640 0.35 Fair G VS2 65.9 54 5.57 5.53 3.66
## 7 1715 577 0.32 Fair F VS2 59.6 60 4.42 4.34 2.61
## 8 1776 413 0.29 Fair F SI1 55.8 60 4.48 4.41 2.48
## 9 2160 312 0.34 Fair F I1 55.8 62 4.72 4.6 2.6
## 10 2366 775 0.3 Very Good D VVS2 60.6 58 4.33 4.35 2.63
## 11 3360 1374 0.51 Premium F SI1 62.7 62 5.09 4.96 3.15
## 12 3807 1539 0.61 Good F SI2 62.5 65 5.36 5.29 3.33
## 13 3920 1705 0.51 Fair F VVS2 65.4 60 4.98 4.9 3.23
## 14 4368 1705 0.51 Fair F VVS2 60.7 66 5.21 5.11 3.13
## 15 6512 18145 3 Very Good H I1 63.1 55 9.23 9.1 5.77
## 16 8044 16148 3 Fair H I1 67.1 57 8.93 8.84 5.97
## 17 10011 4042 1.01 Fair D SI2 64.6 58 6.25 6.2 4.02
## 18 10470 23552 2.46 Premium E SI2 59.7 59 8.82 8.76 5.25
Here we keep the samples whose absolute residual is greater than one (which means the predicted price is more than double or less than half of the actual price) and print all the information about those diamonds, including the predicted price. To summarize what we see here:
By checking all other information, we don’t observe any particular reason to explain the discrepancies. So there are two possibilities:
In practice, this is usually what we expect: our models work reasonably well for the majority of samples, but there are exceptions.
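The factor-of-two reading of a log2 residual can be checked with a few lines of arithmetic. The two price/pred pairs are copied from the first and fourth rows of the table above; because pred is rounded there, the residuals computed here only approximate the model's own values.

```r
# Two rows copied from the table above: actual price and (rounded) predicted price
price <- c(1013, 1262)
pred  <- c(265, 2622)

# A residual on the log2 scale is the log2 of the ratio actual / predicted
lresid <- log2(price) - log2(pred)
ratio  <- 2 ^ lresid   # the same as price / pred

round(lresid, 2)
round(ratio, 2)
```

The first diamond sells for more than twice its predicted price (residual above 1), and the second for less than half (residual below -1).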
Does our model, mod_diamond, do a good job of predicting diamond prices? Would you trust it to tell you how much to spend if you were buying a diamond?
flights data set
Let’s work through a similar process for a dataset that seems even simpler at first glance: the number of flights that leave NYC per day. This is a really small dataset (only 365 rows and 2 columns), and we’re not going to end up with a fully realised model, but as you’ll see, the steps along the way will help us better understand the data. Let’s get started by counting the number of flights per day and visualising it with ggplot2.
daily <- flights %>%
mutate(date = make_date(year, month, day)) %>%
group_by(date) %>%
summarise(n = n())
daily
## # A tibble: 365 × 2
## date n
## <date> <int>
## 1 2013-01-01 842
## 2 2013-01-02 943
## 3 2013-01-03 914
## 4 2013-01-04 915
## 5 2013-01-05 720
## 6 2013-01-06 832
## 7 2013-01-07 933
## 8 2013-01-08 899
## 9 2013-01-09 902
## 10 2013-01-10 932
## # … with 355 more rows
ggplot(daily, aes(date, n)) +
geom_line() +
ylim(0, 1200)
Here we observe a weekly periodic pattern: the number of flights is significantly lower on Saturdays. Such a pattern is commonly seen in data recorded over time (which is called time series data).
To confirm the pattern, we will look at the distribution of the number of flights by day of the week. We have done this before when studying the date-time data type.
daily <- daily %>%
mutate(wday = wday(date, label = TRUE))
glimpse(daily)
## Rows: 365
## Columns: 3
## $ date <date> 2013-01-01, 2013-01-02, 2013-01-03, 2013-01-04, 2013-01-05, 2013…
## $ n <int> 842, 943, 914, 915, 720, 832, 933, 899, 902, 932, 930, 690, 828, …
## $ wday <ord> Tue, Wed, Thu, Fri, Sat, Sun, Mon, Tue, Wed, Thu, Fri, Sat, Sun, …
Note that wday is an ordinal variable by default.
ggplot(daily, aes(wday, n)) +
geom_boxplot()
Note that we see some potential outliers here (most of them are on the lower end) and we will explain them later.
Since the number of flights depends on day-of-week, let’s create a model between them.
mod <- lm(n ~ wday, data = daily)
grid <- daily %>%
data_grid(wday) %>%
add_predictions(mod, "n")
ggplot(daily, aes(wday, n)) +
geom_boxplot() +
geom_point(data = grid, colour = "red", size = 4)
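With a single categorical predictor, a linear model simply estimates one mean per level. A minimal base R sketch on made-up counts (the data frame `toy` is an assumption for illustration, not the flights data):

```r
# Toy counts, invented for illustration (not the flights data)
toy <- data.frame(
  wday = factor(rep(c("Mon", "Sat", "Sun"), each = 3),
                levels = c("Mon", "Sat", "Sun")),
  n    = c(930, 940, 920, 700, 690, 710, 830, 820, 840)
)

mod_toy <- lm(n ~ wday, data = toy)

# One prediction per level ...
new_days <- data.frame(wday = factor(levels(toy$wday), levels = levels(toy$wday)))
pred <- unname(predict(mod_toy, newdata = new_days))

# ... which matches the plain per-level averages
means <- as.vector(tapply(toy$n, toy$wday, mean))

cbind(pred = pred, mean = means)
```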
So the model predicts the average number of flights for each weekday (shown as the red points). Next, let’s compute and visualise the residuals:
daily <- daily %>%
add_residuals(mod)
daily %>%
ggplot(aes(date, resid)) +
geom_ref_line(h = 0) +
geom_line()
The residuals still show strong patterns, indicating that our model needs further refinement. There are two strong patterns here:
To verify our observation, let’s do some EDA work.
daily2 <- daily %>%
filter(resid < -100) %>%
add_predictions(mod) %>%
arrange(n) %>%
print()
## # A tibble: 11 × 5
## date n wday resid pred
## <date> <int> <ord> <dbl> <dbl>
## 1 2013-11-28 634 Thu -332. 966.
## 2 2013-11-29 661 Fri -306. 967.
## 3 2013-09-01 718 Sun -173. 891.
## 4 2013-12-25 719 Wed -244. 963.
## 5 2013-05-26 729 Sun -162. 891.
## 6 2013-07-04 737 Thu -229. 966.
## 7 2013-12-24 761 Tue -190. 951.
## 8 2013-12-31 776 Tue -175. 951.
## 9 2013-01-20 786 Sun -105. 891.
## 10 2013-07-05 822 Fri -145. 967.
## 11 2013-01-01 842 Tue -109. 951.
There are 11 days on which the actual number of flights is more than 100 below the predicted value, and they are all related to holidays:
Therefore, to improve our model, we need a new categorical variable that marks holidays (and the days before and after them) so that the model can account for holiday effects.
holidays <- c("20130101", "20130121", "20130527", "20130704", "20130902", "20131128", "20131225", "20140101")
holidays <- ymd(holidays)
daily <- daily %>%
mutate(holiday_flag = case_when(
date %in% holidays ~ "holiday",
(date + days(1)) %in% holidays ~ "before holiday",
(date - days(1)) %in% holidays ~ "after holiday",
.default = "regular"
)
)
print(daily, n = 30)
## # A tibble: 365 × 5
## date n wday resid holiday_flag
## <date> <int> <ord> <dbl> <chr>
## 1 2013-01-01 842 Tue -109. holiday
## 2 2013-01-02 943 Wed -19.7 after holiday
## 3 2013-01-03 914 Thu -51.8 regular
## 4 2013-01-04 915 Fri -52.5 regular
## 5 2013-01-05 720 Sat -24.6 regular
## 6 2013-01-06 832 Sun -59.5 regular
## 7 2013-01-07 933 Mon -41.8 regular
## 8 2013-01-08 899 Tue -52.4 regular
## 9 2013-01-09 902 Wed -60.7 regular
## 10 2013-01-10 932 Thu -33.8 regular
## 11 2013-01-11 930 Fri -37.5 regular
## 12 2013-01-12 690 Sat -54.6 regular
## 13 2013-01-13 828 Sun -63.5 regular
## 14 2013-01-14 928 Mon -46.8 regular
## 15 2013-01-15 894 Tue -57.4 regular
## 16 2013-01-16 901 Wed -61.7 regular
## 17 2013-01-17 927 Thu -38.8 regular
## 18 2013-01-18 924 Fri -43.5 regular
## 19 2013-01-19 674 Sat -70.6 regular
## 20 2013-01-20 786 Sun -105. before holiday
## 21 2013-01-21 912 Mon -62.8 holiday
## 22 2013-01-22 890 Tue -61.4 after holiday
## 23 2013-01-23 897 Wed -65.7 regular
## 24 2013-01-24 925 Thu -40.8 regular
## 25 2013-01-25 922 Fri -45.5 regular
## 26 2013-01-26 680 Sat -64.6 regular
## 27 2013-01-27 823 Sun -68.5 regular
## 28 2013-01-28 923 Mon -51.8 regular
## 29 2013-01-29 890 Tue -61.4 regular
## 30 2013-01-30 900 Wed -62.7 regular
## # … with 335 more rows
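The labelling logic can be cross-checked with a base R sketch that does not rely on lubridate or case_when. The function name `flag_holiday()` is hypothetical; the holiday dates are the ones listed in the code above.

```r
# Same holiday dates as in the code above, parsed with base R
holidays <- as.Date(c("2013-01-01", "2013-01-21", "2013-05-27", "2013-07-04",
                      "2013-09-02", "2013-11-28", "2013-12-25", "2014-01-01"))

# Hypothetical helper that mirrors the case_when() priorities
flag_holiday <- function(date) {
  # Later assignments overwrite earlier ones, so apply the lowest priority first:
  # "holiday" beats "before holiday", which beats "after holiday"
  flag <- rep("regular", length(date))
  flag[(date - 1) %in% holidays] <- "after holiday"
  flag[(date + 1) %in% holidays] <- "before holiday"
  flag[date %in% holidays]       <- "holiday"
  flag
}

dates <- seq(as.Date("2013-01-19"), as.Date("2013-01-23"), by = "day")
data.frame(date = dates, holiday_flag = flag_holiday(dates))
```

For the days around Martin Luther King Jr. Day this reproduces the labels in rows 19 to 23 of the printed table.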
So we see that we have correctly added holiday labels to each row of our data. Let’s update our model now and see whether this improves its performance on holidays. We also include interaction terms to improve performance (why?).
mod2 <- lm(n ~ wday * holiday_flag, data = daily)
daily <- daily %>%
add_residuals(mod2, "resid2")
daily %>%
ggplot(aes(date, resid2)) +
geom_ref_line(h = 0) +
geom_line() +
ylim(-300, 100)
Compared with the first model, we have successfully removed the big residuals that occurred on holidays! Note that the ylim function is used to put the plot on the same scale as that of the previous model for easier comparison.
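One way to think about the interaction terms: the `*` operator in a formula expands to the main effects plus their interaction, which lets the holiday effect differ by weekday. A small base R sketch on four made-up observations (the data frame `toy` and its numbers are invented for illustration):

```r
# `*` in a formula is shorthand for main effects plus their interaction
f_star <- n ~ wday * holiday_flag
f_long <- n ~ wday + holiday_flag + wday:holiday_flag
attr(terms(f_star), "term.labels")
attr(terms(f_long), "term.labels")

# Toy data, invented for illustration: the holiday drop is large on Thursday
# and small on Saturday, so the two effects are not simply additive
toy <- data.frame(
  wday         = factor(c("Thu", "Thu", "Sat", "Sat")),
  holiday_flag = factor(c("regular", "holiday", "regular", "holiday")),
  n            = c(960, 640, 740, 700)
)

mod_add <- lm(n ~ wday + holiday_flag, data = toy)  # same holiday effect every day
mod_int <- lm(n ~ wday * holiday_flag, data = toy)  # holiday effect varies by day

max(abs(resid(mod_add)))  # clearly non-zero: the additive model misses the pattern
max(abs(resid(mod_int)))  # effectively zero: each cell gets its own mean
```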
Now let’s handle the pattern in the residuals that is related to month. Clearly, we have fewer flights in winter and more flights in summer. Again, let’s visualise this effect first:
daily <- daily %>%
mutate(month = month(date, label = TRUE))
daily %>%
group_by(month) %>%
summarise(mean_monthly_resid = mean(resid2)) %>%
ggplot() + geom_point(aes(month, mean_monthly_resid))
The graph above confirms our observation: fewer flights in January and February, and more flights from spring until the end of the summer vacation.
Now let’s incorporate month into our model. Note that we should not add interaction terms for month: a full interaction between wday, holiday_flag, and month would need up to 7 × 4 × 12 = 336 coefficients, almost as many as the 365 observations we have, so the model would badly overfit!
mod3 <- lm(n ~ wday * holiday_flag + month, data = daily)
summary(mod3)
##
## Call:
## lm(formula = n ~ wday * holiday_flag + month, data = daily)
##
## Residuals:
## Min 1Q Median 3Q Max
## -83.786 -9.048 0.315 9.722 110.070
##
## Coefficients: (10 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 754.417 86.181 8.754 < 2e-16 ***
## wday.L -389.369 489.946 -0.795 0.427338
## wday.Q -631.683 240.877 -2.622 0.009128 **
## wday.C -158.783 242.481 -0.655 0.513026
## wday^4 -80.663 3.168 -25.462 < 2e-16 ***
## wday^5 -6.617 3.188 -2.076 0.038673 *
## wday^6 -10.590 3.204 -3.306 0.001050 **
## holiday_flagbefore holiday 399.645 353.210 1.131 0.258666
## holiday_flagholiday 163.377 123.711 1.321 0.187523
## holiday_flagregular 174.272 86.190 2.022 0.043972 *
## month.L 37.329 4.159 8.975 < 2e-16 ***
## month.Q -60.313 4.124 -14.626 < 2e-16 ***
## month.C 24.993 4.128 6.055 3.76e-09 ***
## month^4 6.054 4.093 1.479 0.140014
## month^5 1.797 4.106 0.438 0.661958
## month^6 -4.188 4.177 -1.003 0.316677
## month^7 -25.481 4.091 -6.229 1.40e-09 ***
## month^8 15.275 4.040 3.781 0.000185 ***
## month^9 9.440 4.073 2.318 0.021073 *
## month^10 -2.927 4.033 -0.726 0.468416
## month^11 1.856 4.074 0.456 0.648982
## wday.L:holiday_flagbefore holiday 1472.415 435.836 3.378 0.000815 ***
## wday.Q:holiday_flagbefore holiday 958.143 843.356 1.136 0.256721
## wday.C:holiday_flagbefore holiday NA NA NA NA
## wday^4:holiday_flagbefore holiday NA NA NA NA
## wday^5:holiday_flagbefore holiday NA NA NA NA
## wday^6:holiday_flagbefore holiday NA NA NA NA
## wday.L:holiday_flagholiday 1086.011 692.144 1.569 0.117577
## wday.Q:holiday_flagholiday 1011.290 344.317 2.937 0.003542 **
## wday.C:holiday_flagholiday 754.756 340.287 2.218 0.027223 *
## wday^4:holiday_flagholiday NA NA NA NA
## wday^5:holiday_flagholiday NA NA NA NA
## wday^6:holiday_flagholiday NA NA NA NA
## wday.L:holiday_flagregular 305.315 489.939 0.623 0.533595
## wday.Q:holiday_flagregular 471.143 240.882 1.956 0.051304 .
## wday.C:holiday_flagregular 87.181 242.503 0.360 0.719444
## wday^4:holiday_flagregular NA NA NA NA
## wday^5:holiday_flagregular NA NA NA NA
## wday^6:holiday_flagregular NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.21 on 336 degrees of freedom
## Multiple R-squared: 0.9451, Adjusted R-squared: 0.9406
## F-statistic: 206.7 on 28 and 336 DF, p-value: < 2.2e-16
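The coefficient counts in the summary above can be reproduced by counting design-matrix columns. This base R sketch builds every combination of factor levels once (the grid `cells` is purely a counting device, not real data) and assumes the level names shown in the earlier output.

```r
wday_lev    <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
holiday_lev <- c("after holiday", "before holiday", "holiday", "regular")
month_lev   <- month.abb

# Every combination of levels once, just to count design-matrix columns
cells <- expand.grid(
  wday         = factor(wday_lev, levels = wday_lev),
  holiday_flag = factor(holiday_lev, levels = holiday_lev),
  month        = factor(month_lev, levels = month_lev)
)

ncol(model.matrix(~ wday * holiday_flag + month, data = cells))  # 7 * 4 + 11 = 39
ncol(model.matrix(~ wday * holiday_flag * month, data = cells))  # 7 * 4 * 12 = 336
```

The first count matches the summary: 29 estimated coefficients (28 model degrees of freedom plus the intercept) and 10 reported as not defined. A plausible reading, not verified here, is that the undefined ones correspond to weekday and holiday combinations that never occur within a single year, so the data contain no information about them.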
daily <- daily %>%
add_residuals(mod3, "resid3")
daily %>%
ggplot(aes(date, resid3)) +
geom_ref_line(h = 0) +
geom_line() +
ylim(-200, 200)
Now the residuals look much more random with less obvious patterns. Let’s look at the residual plots:
plot(mod3, which = 1)
plot(mod3, which = 2)
Now the model is already a pretty good one. Again we see some deviation from the normal distribution for potential outliers. Let’s inspect them more closely.
daily %>%
  filter(abs(resid3) > 50) %>%
  add_predictions(mod3) %>%
  select(-resid, -resid2) %>%
  arrange(desc(resid3))
## # A tibble: 19 × 7
## date n wday holiday_flag month resid3 pred
## <date> <int> <ord> <chr> <ord> <dbl> <dbl>
## 1 2013-11-30 857 Sat regular Nov 110. 747.
## 2 2013-12-01 987 Sun regular Dec 89.2 898.
## 3 2013-01-20 786 Sun before holiday Jan 77.2 709.
## 4 2013-12-28 814 Sat regular Dec 71.4 743.
## 5 2013-12-21 811 Sat regular Dec 68.4 743.
## 6 2013-07-05 822 Fri after holiday Jul 64.8 757.
## 7 2013-12-14 692 Sat regular Dec -50.6 743.
## 8 2013-09-07 688 Sat regular Sep -51.1 739.
## 9 2013-12-07 691 Sat regular Dec -51.6 743.
## 10 2013-09-14 686 Sat regular Sep -53.1 739.
## 11 2013-10-05 687 Sat regular Oct -56.0 743.
## 12 2013-10-31 922 Thu regular Oct -56.6 979.
## 13 2013-09-28 682 Sat regular Sep -57.1 739.
## 14 2013-11-02 689 Sat regular Nov -57.9 747.
## 15 2013-10-26 685 Sat regular Oct -58.0 743.
## 16 2013-10-19 684 Sat regular Oct -59.0 743.
## 17 2013-11-29 661 Fri after holiday Nov -64.8 726.
## 18 2013-10-12 676 Sat regular Oct -67.0 743.
## 19 2013-08-31 680 Sat regular Aug -83.8 764.
As we see, most of the big residuals now occur on regular Saturdays. For the Saturdays after Thanksgiving and before/after Christmas, the model under-predicts the number of flights, while for the Saturdays in late August, September, October, and early November, it over-predicts the number of flights. Since we don’t want our model to be overly complicated, let’s introduce another binary variable, Season_Sat, which stands for “seasonal Saturdays”.
daily <- daily %>%
  mutate(Season_Sat = ifelse(
    wday == "Sat" & between(date, ymd(20130825), ymd(20131105)), "Yes", "No"
  ))
daily %>%
  filter(date >= ymd(20130815)) %>%
  print(n = 30)
## # A tibble: 139 × 9
## date n wday resid holiday_flag resid2 month resid3 Season…¹
## <date> <int> <ord> <dbl> <chr> <dbl> <ord> <dbl> <chr>
## 1 2013-08-15 1000 Thu 34.2 regular 22.2 Aug 0.616 No
## 2 2013-08-16 998 Fri 30.5 regular 21.5 Aug 1.24 No
## 3 2013-08-17 780 Sat 35.4 regular 35.4 Aug 16.2 No
## 4 2013-08-18 914 Sun 22.5 regular 13.5 Aug -5.00 No
## 5 2013-08-19 996 Mon 21.2 regular 18.0 Aug -0.0641 No
## 6 2013-08-20 986 Tue 34.6 regular 24.0 Aug 6.09 No
## 7 2013-08-21 990 Wed 27.3 regular 23.3 Aug 3.14 No
## 8 2013-08-22 990 Thu 24.2 regular 12.2 Aug -9.38 No
## 9 2013-08-23 989 Fri 21.5 regular 12.5 Aug -7.76 No
## 10 2013-08-24 774 Sat 29.4 regular 29.4 Aug 10.2 No
## 11 2013-08-25 903 Sun 11.5 regular 2.51 Aug -16.0 No
## 12 2013-08-26 982 Mon 7.19 regular 4.02 Aug -14.1 No
## 13 2013-08-27 965 Tue 13.6 regular 2.96 Aug -14.9 No
## 14 2013-08-28 973 Wed 10.3 regular 6.31 Aug -13.9 No
## 15 2013-08-29 979 Thu 13.2 regular 1.20 Aug -20.4 No
## 16 2013-08-30 965 Fri -2.46 regular -11.5 Aug -31.8 No
## 17 2013-08-31 680 Sat -64.6 regular -64.6 Aug -83.8 Yes
## 18 2013-09-01 718 Sun -173. before holiday -26.3 Sep -41.2 No
## 19 2013-09-02 929 Mon -45.8 holiday 6.00 Sep -8.91 No
## 20 2013-09-03 956 Tue 4.64 after holiday 13.7 Sep -1.24 No
## 21 2013-09-04 948 Wed -14.7 regular -18.7 Sep -14.1 No
## 22 2013-09-05 969 Thu 3.25 regular -8.80 Sep -5.66 No
## 23 2013-09-06 967 Fri -0.462 regular -9.50 Sep -5.04 No
## 24 2013-09-07 688 Sat -56.6 regular -56.6 Sep -51.1 Yes
## 25 2013-09-08 908 Sun 16.5 regular 7.51 Sep 13.7 No
## 26 2013-09-09 991 Mon 16.2 regular 13.0 Sep 19.7 No
## 27 2013-09-10 961 Tue 9.64 regular -1.04 Sep 5.81 No
## 28 2013-09-11 947 Wed -15.7 regular -19.7 Sep -15.1 No
## 29 2013-09-12 992 Thu 26.2 regular 14.2 Sep 17.3 No
## 30 2013-09-13 996 Fri 28.5 regular 19.5 Sep 24.0 No
## # … with 109 more rows, and abbreviated variable name ¹Season_Sat
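The same flag can be sketched in base R as a sanity check. The function name `season_sat()` is hypothetical; the date window is the one used in the code above, and the weekday is taken from the ISO weekday number so that the result does not depend on the locale.

```r
# Hypothetical helper reproducing the Season_Sat rule in base R
season_sat <- function(date) {
  # "%u" gives the ISO weekday number as text (Monday = "1", ..., Saturday = "6")
  is_sat    <- format(date, "%u") == "6"
  in_window <- date >= as.Date("2013-08-25") & date <= as.Date("2013-11-05")
  ifelse(is_sat & in_window, "Yes", "No")
}

dates <- as.Date(c("2013-08-24", "2013-08-31", "2013-09-07",
                   "2013-09-08", "2013-11-30"))
data.frame(date = dates, Season_Sat = season_sat(dates))
```

Only the Saturdays inside the window (2013-08-31 and 2013-09-07 here) are flagged "Yes", in line with rows 10, 17, and 24 of the printed table.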
The Season_Sat labels are correctly added to our data. Now let’s add this variable to our model and redo all the analysis steps:
mod4 <- lm(n ~ wday * holiday_flag + month + Season_Sat, data = daily)
summary(mod4)
##
## Call:
## lm(formula = n ~ wday * holiday_flag + month + Season_Sat, data = daily)
##
## Residuals:
## Min 1Q Median 3Q Max
## -65.890 -7.178 0.685 7.500 94.251
##
## Coefficients: (10 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 751.8269 72.7008 10.341 < 2e-16 ***
## wday.L -360.7971 413.3129 -0.873 0.383321
## wday.Q -638.8126 203.1983 -3.144 0.001817 **
## wday.C -148.3470 204.5525 -0.725 0.468819
## wday^4 -76.9858 2.6908 -28.611 < 2e-16 ***
## wday^5 -4.9801 2.6926 -1.850 0.065261 .
## wday^6 -10.0722 2.7029 -3.726 0.000228 ***
## holiday_flagbefore holiday 405.7317 297.9592 1.362 0.174207
## holiday_flagholiday 167.1205 104.3599 1.601 0.110234
## holiday_flagregular 179.0877 72.7084 2.463 0.014277 *
## month.L 43.8188 3.5522 12.336 < 2e-16 ***
## month.Q -62.0289 3.4816 -17.816 < 2e-16 ***
## month.C 17.0668 3.5473 4.811 2.27e-06 ***
## month^4 0.2092 3.4882 0.060 0.952223
## month^5 2.1533 3.4642 0.622 0.534641
## month^6 1.0161 3.5512 0.286 0.774967
## month^7 -21.0211 3.4718 -6.055 3.77e-09 ***
## month^8 15.8691 3.4083 4.656 4.65e-06 ***
## month^9 8.3144 3.4375 2.419 0.016108 *
## month^10 -4.2762 3.4037 -1.256 0.209866
## month^11 0.8737 3.4381 0.254 0.799553
## Season_SatYes -81.7454 6.9798 -11.712 < 2e-16 ***
## wday.L:holiday_flagbefore holiday 1455.3865 367.6630 3.958 9.21e-05 ***
## wday.Q:holiday_flagbefore holiday 974.1545 711.4347 1.369 0.171829
## wday.C:holiday_flagbefore holiday NA NA NA NA
## wday^4:holiday_flagbefore holiday NA NA NA NA
## wday^5:holiday_flagbefore holiday NA NA NA NA
## wday^6:holiday_flagbefore holiday NA NA NA NA
## wday.L:holiday_flagholiday 1072.1030 583.8764 1.836 0.067217 .
## wday.Q:holiday_flagholiday 1019.0230 290.4574 3.508 0.000512 ***
## wday.C:holiday_flagholiday 754.3983 287.0571 2.628 0.008983 **
## wday^4:holiday_flagholiday NA NA NA NA
## wday^5:holiday_flagholiday NA NA NA NA
## wday^6:holiday_flagholiday NA NA NA NA
## wday.L:holiday_flagregular 285.6327 413.3034 0.691 0.489983
## wday.Q:holiday_flagregular 487.0970 203.2064 2.397 0.017075 *
## wday.C:holiday_flagregular 83.2412 204.5693 0.407 0.684334
## wday^4:holiday_flagregular NA NA NA NA
## wday^5:holiday_flagregular NA NA NA NA
## wday^6:holiday_flagregular NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.74 on 335 degrees of freedom
## Multiple R-squared: 0.9611, Adjusted R-squared: 0.9577
## F-statistic: 285.2 on 29 and 335 DF, p-value: < 2.2e-16
daily <- daily %>%
add_residuals(mod4, "resid4")
daily %>%
ggplot(aes(date, resid4)) +
geom_ref_line(h = 0) +
geom_line() +
ylim(-200, 200)
plot(mod4, which = 1)
plot(mod4, which = 2)
daily %>%
  filter(abs(resid4) > 50) %>%
  add_predictions(mod4) %>%
  select(-resid, -resid2, -resid3) %>%
  arrange(desc(resid4))
## # A tibble: 10 × 8
## date n wday holiday_flag month Season_Sat resid4 pred
## <date> <int> <ord> <chr> <ord> <chr> <dbl> <dbl>
## 1 2013-11-30 857 Sat regular Nov No 94.3 763.
## 2 2013-12-01 987 Sun regular Dec No 91.5 896.
## 3 2013-01-20 786 Sun before holiday Jan No 80.9 705.
## 4 2013-07-05 822 Fri after holiday Jul No 65.9 756.
## 5 2013-12-28 814 Sat regular Dec No 57.9 756.
## 6 2013-12-21 811 Sat regular Dec No 54.9 756.
## 7 2013-12-14 692 Sat regular Dec No -64.1 756.
## 8 2013-10-31 922 Thu regular Oct No -65.0 987.
## 9 2013-12-07 691 Sat regular Dec No -65.1 756.
## 10 2013-11-29 661 Fri after holiday Nov No -65.9 727.
We see that the remaining big residuals fall on the weekend after Thanksgiving, on Halloween, around Martin Luther King Jr. Day, on weekends around Christmas, and so on. One may polish the model further by introducing new variables in the same way until one is happy with it.
As a last, but perhaps most important, comment: a model is not necessarily better just because its residuals are smaller. We can easily construct a model with zero residuals!
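A toy version of the idea, using pure noise so that there is nothing real to learn (the data frame `toy` is invented for illustration): giving every observation its own factor level drives the residuals to zero, yet the model explains nothing.

```r
set.seed(42)
# Pure noise: 20 observations, each with its own identifier
toy <- data.frame(id = factor(1:20), y = rnorm(20))

mod_id <- lm(y ~ id, data = toy)   # one coefficient per observation

max(abs(resid(mod_id)))   # effectively zero
length(coef(mod_id))      # 20 coefficients for 20 observations
df.residual(mod_id)       # 0: nothing is left over to estimate the noise
```

Such a model memorises the data it was fitted to and has no way to predict an observation whose identifier it has never seen.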
daily2 <- flights %>%
mutate(date = make_date(year, month, day)) %>%
group_by(date) %>%
summarise(n = n(), date = as.character(date))
mod_date <- lm(n ~ date, data = daily2)
summary(mod_date)
##
## Call:
## lm(formula = n ~ date, data = daily2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.820e-08 0.000e+00 0.000e+00 0.000e+00 2.425e-07
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.420e+02 3.052e-11 2.759e+13 <2e-16 ***
## date2013-01-02 1.010e+02 4.199e-11 2.405e+12 <2e-16 ***
## date2013-01-03 7.200e+01 4.230e-11 1.702e+12 <2e-16 ***
## date2013-01-04 7.300e+01 4.229e-11 1.726e+12 <2e-16 ***
## date2013-01-05 -1.220e+02 4.495e-11 -2.714e+12 <2e-16 ***
## date2013-01-06 -1.000e+01 4.329e-11 -2.310e+11 <2e-16 ***
## date2013-01-07 9.100e+01 4.209e-11 2.162e+12 <2e-16 ***
## date2013-01-08 5.700e+01 4.247e-11 1.342e+12 <2e-16 ***
## date2013-01-09 6.000e+01 4.244e-11 1.414e+12 <2e-16 ***
## date2013-01-10 9.000e+01 4.210e-11 2.138e+12 <2e-16 ***
## date2013-01-11 8.800e+01 4.213e-11 2.089e+12 <2e-16 ***
## date2013-01-12 -1.520e+02 4.547e-11 -3.343e+12 <2e-16 ***
## date2013-01-13 -1.400e+01 4.334e-11 -3.230e+11 <2e-16 ***
## date2013-01-14 8.600e+01 4.215e-11 2.040e+12 <2e-16 ***
## date2013-01-15 5.200e+01 4.253e-11 1.223e+12 <2e-16 ***
## date2013-01-16 5.900e+01 4.245e-11 1.390e+12 <2e-16 ***
## date2013-01-17 8.500e+01 4.216e-11 2.016e+12 <2e-16 ***
## date2013-01-18 8.200e+01 4.219e-11 1.944e+12 <2e-16 ***
## date2013-01-19 -1.680e+02 4.577e-11 -3.671e+12 <2e-16 ***
## date2013-01-20 -5.600e+01 4.392e-11 -1.275e+12 <2e-16 ***
## date2013-01-21 7.000e+01 4.232e-11 1.654e+12 <2e-16 ***
## date2013-01-22 4.800e+01 4.257e-11 1.127e+12 <2e-16 ***
## date2013-01-23 5.500e+01 4.249e-11 1.294e+12 <2e-16 ***
## date2013-01-24 8.300e+01 4.218e-11 1.968e+12 <2e-16 ***
## date2013-01-25 8.000e+01 4.221e-11 1.895e+12 <2e-16 ***
## date2013-01-26 -1.620e+02 4.566e-11 -3.548e+12 <2e-16 ***
## date2013-01-27 -1.900e+01 4.341e-11 -4.377e+11 <2e-16 ***
## date2013-01-28 8.100e+01 4.220e-11 1.919e+12 <2e-16 ***
## date2013-01-29 4.800e+01 4.257e-11 1.127e+12 <2e-16 ***
## date2013-01-30 5.800e+01 4.246e-11 1.366e+12 <2e-16 ***
## date2013-01-31 8.600e+01 4.215e-11 2.040e+12 <2e-16 ***
## date2013-02-01 8.400e+01 4.217e-11 1.992e+12 <2e-16 ***
## date2013-02-02 -1.600e+02 4.562e-11 -3.507e+12 <2e-16 ***
## date2013-02-03 -2.800e+01 4.353e-11 -6.433e+11 <2e-16 ***
## date2013-02-04 9.000e+01 4.210e-11 2.138e+12 <2e-16 ***
## date2013-02-05 5.400e+01 4.250e-11 1.270e+12 <2e-16 ***
## date2013-02-06 5.900e+01 4.245e-11 1.390e+12 <2e-16 ***
## date2013-02-07 9.000e+01 4.210e-11 2.138e+12 <2e-16 ***
## date2013-02-08 8.800e+01 4.213e-11 2.089e+12 <2e-16 ***
## date2013-02-09 -1.580e+02 4.558e-11 -3.466e+12 <2e-16 ***
## date2013-02-10 -1.300e+01 4.333e-11 -3.000e+11 <2e-16 ***
## date2013-02-11 8.700e+01 4.214e-11 2.065e+12 <2e-16 ***
## date2013-02-12 5.100e+01 4.254e-11 1.199e+12 <2e-16 ***
## date2013-02-13 7.600e+01 4.226e-11 1.799e+12 <2e-16 ***
## date2013-02-14 1.140e+02 4.185e-11 2.724e+12 <2e-16 ***
## date2013-02-15 1.120e+02 4.187e-11 2.675e+12 <2e-16 ***
## date2013-02-16 -1.040e+02 4.465e-11 -2.329e+12 <2e-16 ***
## date2013-02-17 6.000e+00 4.308e-11 1.393e+11 <2e-16 ***
## date2013-02-18 1.060e+02 4.194e-11 2.528e+12 <2e-16 ***
## date2013-02-19 1.010e+02 4.199e-11 2.405e+12 <2e-16 ***
## date2013-02-20 1.070e+02 4.193e-11 2.552e+12 <2e-16 ***
## date2013-02-21 1.190e+02 4.180e-11 2.847e+12 <2e-16 ***
## date2013-02-22 1.150e+02 4.184e-11 2.748e+12 <2e-16 ***
## date2013-02-23 -9.900e+01 4.457e-11 -2.221e+12 <2e-16 ***
## date2013-02-24 3.800e+01 4.269e-11 8.901e+11 <2e-16 ***
## date2013-02-25 1.190e+02 4.180e-11 2.847e+12 <2e-16 ***
## date2013-02-26 9.600e+01 4.204e-11 2.284e+12 <2e-16 ***
## date2013-02-27 1.030e+02 4.197e-11 2.454e+12 <2e-16 ***
## date2013-02-28 1.220e+02 4.177e-11 2.921e+12 <2e-16 ***
## date2013-03-01 1.160e+02 4.183e-11 2.773e+12 <2e-16 ***
## date2013-03-02 -7.700e+01 4.423e-11 -1.741e+12 <2e-16 ***
## date2013-03-03 7.100e+01 4.231e-11 1.678e+12 <2e-16 ***
## date2013-03-04 1.350e+02 4.164e-11 3.242e+12 <2e-16 ***
## date2013-03-05 1.230e+02 4.176e-11 2.945e+12 <2e-16 ***
## date2013-03-06 1.300e+02 4.169e-11 3.118e+12 <2e-16 ***
## date2013-03-07 1.380e+02 4.161e-11 3.316e+12 <2e-16 ***
## date2013-03-08 1.370e+02 4.162e-11 3.292e+12 <2e-16 ***
## date2013-03-09 -7.700e+01 4.423e-11 -1.741e+12 <2e-16 ***
## date2013-03-10 6.600e+01 4.237e-11 1.558e+12 <2e-16 ***
## date2013-03-11 1.380e+02 4.161e-11 3.316e+12 <2e-16 ***
## date2013-03-12 1.240e+02 4.175e-11 2.970e+12 <2e-16 ***
## date2013-03-13 1.320e+02 4.167e-11 3.168e+12 <2e-16 ***
## date2013-03-14 1.400e+02 4.159e-11 3.366e+12 <2e-16 ***
## date2013-03-15 1.370e+02 4.162e-11 3.292e+12 <2e-16 ***
## date2013-03-16 -7.500e+01 4.420e-11 -1.697e+12 <2e-16 ***
## date2013-03-17 6.500e+01 4.238e-11 1.534e+12 <2e-16 ***
## date2013-03-18 1.390e+02 4.160e-11 3.341e+12 <2e-16 ***
## date2013-03-19 1.250e+02 4.174e-11 2.995e+12 <2e-16 ***
## date2013-03-20 1.280e+02 4.171e-11 3.069e+12 <2e-16 ***
## date2013-03-21 1.380e+02 4.161e-11 3.316e+12 <2e-16 ***
## date2013-03-22 1.350e+02 4.164e-11 3.242e+12 <2e-16 ***
## date2013-03-23 -7.500e+01 4.420e-11 -1.697e+12 <2e-16 ***
## date2013-03-24 6.300e+01 4.240e-11 1.486e+12 <2e-16 ***
## date2013-03-25 1.360e+02 4.163e-11 3.267e+12 <2e-16 ***
## date2013-03-26 1.310e+02 4.168e-11 3.143e+12 <2e-16 ***
## date2013-03-27 1.350e+02 4.164e-11 3.242e+12 <2e-16 ***
## date2013-03-28 1.400e+02 4.159e-11 3.366e+12 <2e-16 ***
## date2013-03-29 1.320e+02 4.167e-11 3.168e+12 <2e-16 ***
## date2013-03-30 -7.300e+01 4.417e-11 -1.653e+12 <2e-16 ***
## date2013-03-31 5.500e+01 4.249e-11 1.294e+12 <2e-16 ***
## date2013-04-01 1.280e+02 4.171e-11 3.069e+12 <2e-16 ***
## date2013-04-02 1.410e+02 4.158e-11 3.391e+12 <2e-16 ***
## date2013-04-03 1.500e+02 4.150e-11 3.615e+12 <2e-16 ***
## date2013-04-04 1.430e+02 4.156e-11 3.441e+12 <2e-16 ***
## date2013-04-05 1.390e+02 4.160e-11 3.341e+12 <2e-16 ***
## date2013-04-06 -7.200e+01 4.416e-11 -1.631e+12 <2e-16 ***
## date2013-04-07 6.900e+01 4.233e-11 1.630e+12 <2e-16 ***
## date2013-04-08 1.390e+02 4.160e-11 3.341e+12 <2e-16 ***
## date2013-04-09 1.330e+02 4.166e-11 3.192e+12 <2e-16 ***
## date2013-04-10 1.470e+02 4.152e-11 3.540e+12 <2e-16 ***
## date2013-04-11 1.500e+02 4.150e-11 3.615e+12 <2e-16 ***
## date2013-04-12 1.470e+02 4.152e-11 3.540e+12 <2e-16 ***
## date2013-04-13 -7.200e+01 4.416e-11 -1.631e+12 <2e-16 ***
## date2013-04-14 7.500e+01 4.227e-11 1.774e+12 <2e-16 ***
## date2013-04-15 1.530e+02 4.147e-11 3.690e+12 <2e-16 ***
## date2013-04-16 1.320e+02 4.167e-11 3.168e+12 <2e-16 ***
## date2013-04-17 1.460e+02 4.153e-11 3.515e+12 <2e-16 ***
## date2013-04-18 1.500e+02 4.150e-11 3.615e+12 <2e-16 ***
## date2013-04-19 1.460e+02 4.153e-11 3.515e+12 <2e-16 ***
## date2013-04-20 -7.600e+01 4.422e-11 -1.719e+12 <2e-16 ***
## date2013-04-21 7.700e+01 4.225e-11 1.823e+12 <2e-16 ***
## date2013-04-22 1.430e+02 4.156e-11 3.441e+12 <2e-16 ***
## date2013-04-23 1.230e+02 4.176e-11 2.945e+12 <2e-16 ***
## date2013-04-24 1.340e+02 4.165e-11 3.217e+12 <2e-16 ***
## date2013-04-25 1.410e+02 4.158e-11 3.391e+12 <2e-16 ***
## date2013-04-26 1.390e+02 4.160e-11 3.341e+12 <2e-16 ***
## date2013-04-27 -8.500e+01 4.435e-11 -1.916e+12 <2e-16 ***
## date2013-04-28 7.100e+01 4.231e-11 1.678e+12 <2e-16 ***
## date2013-04-29 1.410e+02 4.158e-11 3.391e+12 <2e-16 ***
## date2013-04-30 1.180e+02 4.181e-11 2.822e+12 <2e-16 ***
## date2013-05-01 1.220e+02 4.177e-11 2.921e+12 <2e-16 ***
## date2013-05-02 1.410e+02 4.158e-11 3.391e+12 <2e-16 ***
## date2013-05-03 1.360e+02 4.163e-11 3.267e+12 <2e-16 ***
## date2013-05-04 -9.700e+01 4.454e-11 -2.178e+12 <2e-16 ***
## date2013-05-05 7.000e+01 4.232e-11 1.654e+12 <2e-16 ***
## date2013-05-06 1.380e+02 4.161e-11 3.316e+12 <2e-16 ***
## date2013-05-07 1.130e+02 4.186e-11 2.699e+12 <2e-16 ***
## date2013-05-08 1.230e+02 4.176e-11 2.945e+12 <2e-16 ***
## date2013-05-09 1.390e+02 4.160e-11 3.341e+12 <2e-16 ***
## date2013-05-10 1.360e+02 4.163e-11 3.267e+12 <2e-16 ***
## date2013-05-11 -1.040e+02 4.465e-11 -2.329e+12 <2e-16 ***
## date2013-05-12 5.400e+01 4.250e-11 1.270e+12 <2e-16 ***
## date2013-05-13 1.370e+02 4.162e-11 3.292e+12 <2e-16 ***
## date2013-05-14 1.130e+02 4.186e-11 2.699e+12 <2e-16 ***
## date2013-05-15 1.250e+02 4.174e-11 2.995e+12 <2e-16 ***
## date2013-05-16 1.400e+02 4.159e-11 3.366e+12 <2e-16 ***
## date2013-05-17 1.380e+02 4.161e-11 3.316e+12 <2e-16 ***
## date2013-05-18 -9.300e+01 4.448e-11 -2.091e+12 <2e-16 ***
## date2013-05-19 6.900e+01 4.233e-11 1.630e+12 <2e-16 ***
## date2013-05-20 1.410e+02 4.158e-11 3.391e+12 <2e-16 ***
## date2013-05-21 1.200e+02 4.179e-11 2.871e+12 <2e-16 ***
## date2013-05-22 1.300e+02 4.169e-11 3.118e+12 <2e-16 ***
## date2013-05-23 1.460e+02 4.153e-11 3.515e+12 <2e-16 ***
## date2013-05-24 1.360e+02 4.163e-11 3.267e+12 <2e-16 ***
## date2013-05-25 -1.140e+02 4.482e-11 -2.544e+12 <2e-16 ***
## date2013-05-26 -1.130e+02 4.480e-11 -2.522e+12 <2e-16 ***
## date2013-05-27 8.600e+01 4.215e-11 2.040e+12 <2e-16 ***
## date2013-05-28 1.390e+02 4.160e-11 3.341e+12 <2e-16 ***
## date2013-05-29 1.320e+02 4.167e-11 3.168e+12 <2e-16 ***
## date2013-05-30 1.470e+02 4.152e-11 3.540e+12 <2e-16 ***
## date2013-05-31 1.440e+02 4.155e-11 3.465e+12 <2e-16 ***
## date2013-06-01 -8.800e+01 4.440e-11 -1.982e+12 <2e-16 ***
## date2013-06-02 6.900e+01 4.233e-11 1.630e+12 <2e-16 ***
## date2013-06-03 1.400e+02 4.159e-11 3.366e+12 <2e-16 ***
## date2013-06-04 1.180e+02 4.181e-11 2.822e+12 <2e-16 ***
## date2013-06-05 1.280e+02 4.171e-11 3.069e+12 <2e-16 ***
## date2013-06-06 1.340e+02 4.165e-11 3.217e+12 <2e-16 ***
## date2013-06-07 1.330e+02 4.166e-11 3.192e+12 <2e-16 ***
## date2013-06-08 -6.300e+01 4.402e-11 -1.431e+12 <2e-16 ***
## date2013-06-09 6.600e+01 4.237e-11 1.558e+12 <2e-16 ***
## date2013-06-10 1.450e+02 4.154e-11 3.490e+12 <2e-16 ***
## date2013-06-11 1.380e+02 4.161e-11 3.316e+12 <2e-16 ***
## date2013-06-12 1.410e+02 4.158e-11 3.391e+12 <2e-16 ***
## date2013-06-13 1.470e+02 4.152e-11 3.540e+12 <2e-16 ***
## date2013-06-14 1.470e+02 4.152e-11 3.540e+12 <2e-16 ***
## date2013-06-15 -4.100e+01 4.371e-11 -9.380e+11 <2e-16 ***
## date2013-06-16 7.600e+01 4.226e-11 1.799e+12 <2e-16 ***
## date2013-06-17 1.480e+02 4.151e-11 3.565e+12 <2e-16 ***
## date2013-06-18 1.400e+02 4.159e-11 3.366e+12 <2e-16 ***
## date2013-06-19 1.430e+02 4.156e-11 3.441e+12 <2e-16 ***
## date2013-06-20 1.530e+02 4.147e-11 3.690e+12 <2e-16 ***
## date2013-06-21 1.510e+02 4.149e-11 3.640e+12 <2e-16 ***
## date2013-06-22 -3.000e+01 4.356e-11 -6.888e+11 <2e-16 ***
## date2013-06-23 8.100e+01 4.220e-11 1.919e+12 <2e-16 ***
## date2013-06-24 1.520e+02 4.148e-11 3.665e+12 <2e-16 ***
## date2013-06-25 1.510e+02 4.149e-11 3.640e+12 <2e-16 ***
## date2013-06-26 1.530e+02 4.147e-11 3.690e+12 <2e-16 ***
## date2013-06-27 1.530e+02 4.147e-11 3.690e+12 <2e-16 ***
## date2013-06-28 1.520e+02 4.148e-11 3.665e+12 <2e-16 ***
## date2013-06-29 -3.000e+01 4.356e-11 -6.888e+11 <2e-16 ***
## date2013-06-30 7.600e+01 4.226e-11 1.799e+12 <2e-16 ***
## date2013-07-01 1.240e+02 4.175e-11 2.970e+12 <2e-16 ***
## date2013-07-02 1.030e+02 4.197e-11 2.454e+12 <2e-16 ***
## date2013-07-03 1.410e+02 4.158e-11 3.391e+12 <2e-16 ***
## date2013-07-04 -1.050e+02 4.467e-11 -2.351e+12 <2e-16 ***
## date2013-07-05 -2.000e+01 4.342e-11 -4.606e+11 <2e-16 ***
## date2013-07-06 -3.700e+01 4.365e-11 -8.476e+11 <2e-16 ***
## date2013-07-07 9.200e+01 4.208e-11 2.186e+12 <2e-16 ***
## date2013-07-08 1.620e+02 4.138e-11 3.915e+12 <2e-16 ***
## date2013-07-09 1.590e+02 4.141e-11 3.840e+12 <2e-16 ***
## date2013-07-10 1.620e+02 4.138e-11 3.915e+12 <2e-16 ***
## date2013-07-11 1.640e+02 4.136e-11 3.965e+12 <2e-16 ***
## date2013-07-12 1.600e+02 4.140e-11 3.865e+12 <2e-16 ***
## date2013-07-13 -3.100e+01 4.357e-11 -7.115e+11 <2e-16 ***
## date2013-07-14 8.900e+01 4.212e-11 2.113e+12 <2e-16 ***
## date2013-07-15 1.570e+02 4.143e-11 3.790e+12 <2e-16 ***
## date2013-07-16 1.540e+02 4.146e-11 3.715e+12 <2e-16 ***
## date2013-07-17 1.590e+02 4.141e-11 3.840e+12 <2e-16 ***
## date2013-07-18 1.610e+02 4.139e-11 3.890e+12 <2e-16 ***
## date2013-07-19 1.570e+02 4.143e-11 3.790e+12 <2e-16 ***
## date2013-07-20 -3.200e+01 4.358e-11 -7.342e+11 <2e-16 ***
## date2013-07-21 8.700e+01 4.214e-11 2.065e+12 <2e-16 ***
## date2013-07-22 1.580e+02 4.142e-11 3.815e+12 <2e-16 ***
## date2013-07-23 1.550e+02 4.145e-11 3.740e+12 <2e-16 ***
## date2013-07-24 1.580e+02 4.142e-11 3.815e+12 <2e-16 ***
## date2013-07-25 1.610e+02 4.139e-11 3.890e+12 <2e-16 ***
## date2013-07-26 1.570e+02 4.143e-11 3.790e+12 <2e-16 ***
## date2013-07-27 -3.100e+01 4.357e-11 -7.115e+11 <2e-16 ***
## date2013-07-28 8.800e+01 4.213e-11 2.089e+12 <2e-16 ***
## date2013-07-29 1.570e+02 4.143e-11 3.790e+12 <2e-16 ***
## date2013-07-30 1.550e+02 4.145e-11 3.740e+12 <2e-16 ***
## date2013-07-31 1.590e+02 4.141e-11 3.840e+12 <2e-16 ***
## date2013-08-01 1.580e+02 4.142e-11 3.815e+12 <2e-16 ***
## date2013-08-02 1.570e+02 4.143e-11 3.790e+12 <2e-16 ***
## date2013-08-03 -3.300e+01 4.360e-11 -7.569e+11 <2e-16 ***
## date2013-08-04 8.700e+01 4.214e-11 2.065e+12 <2e-16 ***
## date2013-08-05 1.580e+02 4.142e-11 3.815e+12 <2e-16 ***
## date2013-08-06 1.540e+02 4.146e-11 3.715e+12 <2e-16 ***
## date2013-08-07 1.590e+02 4.141e-11 3.840e+12 <2e-16 ***
## date2013-08-08 1.590e+02 4.141e-11 3.840e+12 <2e-16 ***
## date2013-08-09 1.570e+02 4.143e-11 3.790e+12 <2e-16 ***
## date2013-08-10 -3.500e+01 4.362e-11 -8.023e+11 <2e-16 ***
## date2013-08-11 8.700e+01 4.214e-11 2.065e+12 <2e-16 ***
## date2013-08-12 1.590e+02 4.141e-11 3.840e+12 <2e-16 ***
## date2013-08-13 1.530e+02 4.147e-11 3.690e+12 <2e-16 ***
## date2013-08-14 1.550e+02 4.145e-11 3.740e+12 <2e-16 ***
## date2013-08-15 1.580e+02 4.142e-11 3.815e+12 <2e-16 ***
## date2013-08-16 1.560e+02 4.144e-11 3.765e+12 <2e-16 ***
## date2013-08-17 -6.200e+01 4.401e-11 -1.409e+12 <2e-16 ***
## date2013-08-18 7.200e+01 4.230e-11 1.702e+12 <2e-16 ***
## date2013-08-19 1.540e+02 4.146e-11 3.715e+12 <2e-16 ***
## date2013-08-20 1.440e+02 4.155e-11 3.465e+12 <2e-16 ***
## date2013-08-21 1.480e+02 4.151e-11 3.565e+12 <2e-16 ***
## date2013-08-22 1.480e+02 4.151e-11 3.565e+12 <2e-16 ***
## date2013-08-23 1.470e+02 4.152e-11 3.540e+12 <2e-16 ***
## date2013-08-24 -6.800e+01 4.410e-11 -1.542e+12 <2e-16 ***
## date2013-08-25 6.100e+01 4.242e-11 1.438e+12 <2e-16 ***
## date2013-08-26 1.400e+02 4.159e-11 3.366e+12 <2e-16 ***
## date2013-08-27 1.230e+02 4.176e-11 2.945e+12 <2e-16 ***
## date2013-08-28 1.310e+02 4.168e-11 3.143e+12 <2e-16 ***
## date2013-08-29 1.370e+02 4.162e-11 3.292e+12 <2e-16 ***
## date2013-08-30 1.230e+02 4.176e-11 2.945e+12 <2e-16 ***
## date2013-08-31 -1.620e+02 4.566e-11 -3.548e+12 <2e-16 ***
## date2013-09-01 -1.240e+02 4.498e-11 -2.757e+12 <2e-16 ***
## date2013-09-02 8.700e+01 4.214e-11 2.065e+12 <2e-16 ***
## date2013-09-03 1.140e+02 4.185e-11 2.724e+12 <2e-16 ***
## date2013-09-04 1.060e+02 4.194e-11 2.528e+12 <2e-16 ***
## date2013-09-05 1.270e+02 4.172e-11 3.044e+12 <2e-16 ***
## date2013-09-06 1.250e+02 4.174e-11 2.995e+12 <2e-16 ***
## date2013-09-07 -1.540e+02 4.551e-11 -3.384e+12 <2e-16 ***
## date2013-09-08 6.600e+01 4.237e-11 1.558e+12 <2e-16 ***
## date2013-09-09 1.490e+02 4.151e-11 3.590e+12 <2e-16 ***
## date2013-09-10 1.190e+02 4.180e-11 2.847e+12 <2e-16 ***
## date2013-09-11 1.050e+02 4.195e-11 2.503e+12 <2e-16 ***
## date2013-09-12 1.500e+02 4.150e-11 3.615e+12 <2e-16 ***
## date2013-09-13 1.540e+02 4.146e-11 3.715e+12 <2e-16 ***
## date2013-09-14 -1.560e+02 4.555e-11 -3.425e+12 <2e-16 ***
## date2013-09-15 5.800e+01 4.246e-11 1.366e+12 <2e-16 ***
## date2013-09-16 1.500e+02 4.150e-11 3.615e+12 <2e-16 ***
## date2013-09-17 1.190e+02 4.180e-11 2.847e+12 <2e-16 ***
## date2013-09-18 1.300e+02 4.169e-11 3.118e+12 <2e-16 ***
## date2013-09-19 1.500e+02 4.150e-11 3.615e+12 <2e-16 ***
## date2013-09-20 1.520e+02 4.148e-11 3.665e+12 <2e-16 ***
## date2013-09-21 -1.490e+02 4.542e-11 -3.281e+12 <2e-16 ***
## date2013-09-22 6.200e+01 4.241e-11 1.462e+12 <2e-16 ***
## date2013-09-23 1.510e+02 4.149e-11 3.640e+12 <2e-16 ***
## date2013-09-24 1.180e+02 4.181e-11 2.822e+12 <2e-16 ***
## date2013-09-25 1.340e+02 4.165e-11 3.217e+12 <2e-16 ***
## date2013-09-26 1.540e+02 4.146e-11 3.715e+12 <2e-16 ***
## date2013-09-27 1.540e+02 4.146e-11 3.715e+12 <2e-16 ***
## date2013-09-28 -1.600e+02 4.562e-11 -3.507e+12 <2e-16 ***
## date2013-09-29 7.200e+01 4.230e-11 1.702e+12 <2e-16 ***
## date2013-09-30 1.510e+02 4.149e-11 3.640e+12 <2e-16 ***
## date2013-10-01 1.230e+02 4.176e-11 2.945e+12 <2e-16 ***
## date2013-10-02 1.330e+02 4.166e-11 3.192e+12 <2e-16 ***
## date2013-10-03 1.530e+02 4.147e-11 3.690e+12 <2e-16 ***
## date2013-10-04 1.530e+02 4.147e-11 3.690e+12 <2e-16 ***
## date2013-10-05 -1.550e+02 4.553e-11 -3.404e+12 <2e-16 ***
## date2013-10-06 7.500e+01 4.227e-11 1.774e+12 <2e-16 ***
## date2013-10-07 1.520e+02 4.148e-11 3.665e+12 <2e-16 ***
## date2013-10-08 1.220e+02 4.177e-11 2.921e+12 <2e-16 ***
## date2013-10-09 1.320e+02 4.167e-11 3.168e+12 <2e-16 ***
## date2013-10-10 1.520e+02 4.148e-11 3.665e+12 <2e-16 ***
## date2013-10-11 1.490e+02 4.151e-11 3.590e+12 <2e-16 ***
## date2013-10-12 -1.660e+02 4.573e-11 -3.630e+12 <2e-16 ***
## date2013-10-13 6.000e+01 4.244e-11 1.414e+12 <2e-16 ***
## date2013-10-14 1.450e+02 4.154e-11 3.490e+12 <2e-16 ***
## date2013-10-15 1.210e+02 4.178e-11 2.896e+12 <2e-16 ***
## date2013-10-16 1.320e+02 4.167e-11 3.168e+12 <2e-16 ***
## date2013-10-17 1.530e+02 4.147e-11 3.690e+12 <2e-16 ***
## date2013-10-18 1.510e+02 4.149e-11 3.640e+12 <2e-16 ***
## date2013-10-19 -1.580e+02 4.558e-11 -3.466e+12 <2e-16 ***
## date2013-10-20 7.300e+01 4.229e-11 1.726e+12 <2e-16 ***
## date2013-10-21 1.490e+02 4.151e-11 3.590e+12 <2e-16 ***
## date2013-10-22 1.220e+02 4.177e-11 2.921e+12 <2e-16 ***
## date2013-10-23 1.330e+02 4.166e-11 3.192e+12 <2e-16 ***
## date2013-10-24 1.500e+02 4.150e-11 3.615e+12 <2e-16 ***
## date2013-10-25 1.470e+02 4.152e-11 3.540e+12 <2e-16 ***
## date2013-10-26 -1.570e+02 4.557e-11 -3.446e+12 <2e-16 ***
## date2013-10-27 6.800e+01 4.235e-11 1.606e+12 <2e-16 ***
## date2013-10-28 1.410e+02 4.158e-11 3.391e+12 <2e-16 ***
## date2013-10-29 1.230e+02 4.176e-11 2.945e+12 <2e-16 ***
## date2013-10-30 1.310e+02 4.168e-11 3.143e+12 <2e-16 ***
## date2013-10-31 8.000e+01 4.221e-11 1.895e+12 <2e-16 ***
## date2013-11-01 1.440e+02 4.155e-11 3.465e+12 <2e-16 ***
## date2013-11-02 -1.530e+02 4.549e-11 -3.363e+12 <2e-16 ***
## date2013-11-03 6.000e+01 4.244e-11 1.414e+12 <2e-16 ***
## date2013-11-04 1.360e+02 4.163e-11 3.267e+12 <2e-16 ***
## date2013-11-05 1.250e+02 4.174e-11 2.995e+12 <2e-16 ***
## date2013-11-06 1.310e+02 4.168e-11 3.143e+12 <2e-16 ***
## date2013-11-07 1.490e+02 4.151e-11 3.590e+12 <2e-16 ***
## date2013-11-08 1.440e+02 4.155e-11 3.465e+12 <2e-16 ***
## date2013-11-09 -1.270e+02 4.503e-11 -2.820e+12 <2e-16 ***
## date2013-11-10 5.300e+01 4.252e-11 1.247e+12 <2e-16 ***
## date2013-11-11 1.410e+02 4.158e-11 3.391e+12 <2e-16 ***
## date2013-11-12 1.310e+02 4.168e-11 3.143e+12 <2e-16 ***
## date2013-11-13 1.340e+02 4.165e-11 3.217e+12 <2e-16 ***
## date2013-11-14 1.460e+02 4.153e-11 3.515e+12 <2e-16 ***
## date2013-11-15 1.430e+02 4.156e-11 3.441e+12 <2e-16 ***
## date2013-11-16 -1.280e+02 4.505e-11 -2.841e+12 <2e-16 ***
## date2013-11-17 5.400e+01 4.250e-11 1.270e+12 <2e-16 ***
## date2013-11-18 1.430e+02 4.156e-11 3.441e+12 <2e-16 ***
## date2013-11-19 1.310e+02 4.168e-11 3.143e+12 <2e-16 ***
## date2013-11-20 1.350e+02 4.164e-11 3.242e+12 <2e-16 ***
## date2013-11-21 1.580e+02 4.142e-11 3.815e+12 <2e-16 ***
## date2013-11-22 1.570e+02 4.143e-11 3.790e+12 <2e-16 ***
## date2013-11-23 -9.800e+01 4.456e-11 -2.199e+12 <2e-16 ***
## date2013-11-24 5.400e+01 4.250e-11 1.270e+12 <2e-16 ***
## date2013-11-25 1.000e+02 4.200e-11 2.381e+12 <2e-16 ***
## date2013-11-26 1.470e+02 4.152e-11 3.540e+12 <2e-16 ***
## date2013-11-27 1.720e+02 4.129e-11 4.166e+12 <2e-16 ***
## date2013-11-28 -2.080e+02 4.656e-11 -4.467e+12 <2e-16 ***
## date2013-11-29 -1.810e+02 4.602e-11 -3.933e+12 <2e-16 ***
## date2013-11-30 1.500e+01 4.297e-11 3.491e+11 <2e-16 ***
## date2013-12-01 1.450e+02 4.154e-11 3.490e+12 <2e-16 ***
## date2013-12-02 1.620e+02 4.138e-11 3.915e+12 <2e-16 ***
## date2013-12-03 1.310e+02 4.168e-11 3.143e+12 <2e-16 ***
## date2013-12-04 1.160e+02 4.183e-11 2.773e+12 <2e-16 ***
## date2013-12-05 1.270e+02 4.172e-11 3.044e+12 <2e-16 ***
## date2013-12-06 1.280e+02 4.171e-11 3.069e+12 <2e-16 ***
## date2013-12-07 -1.510e+02 4.546e-11 -3.322e+12 <2e-16 ***
## date2013-12-08 3.300e+01 4.275e-11 7.719e+11 <2e-16 ***
## date2013-12-09 1.200e+02 4.179e-11 2.871e+12 <2e-16 ***
## date2013-12-10 1.010e+02 4.199e-11 2.405e+12 <2e-16 ***
## date2013-12-11 1.120e+02 4.187e-11 2.675e+12 <2e-16 ***
## date2013-12-12 1.260e+02 4.173e-11 3.019e+12 <2e-16 ***
## date2013-12-13 1.280e+02 4.171e-11 3.069e+12 <2e-16 ***
## date2013-12-14 -1.500e+02 4.544e-11 -3.301e+12 <2e-16 ***
## date2013-12-15 3.800e+01 4.269e-11 8.901e+11 <2e-16 ***
## date2013-12-16 1.220e+02 4.177e-11 2.921e+12 <2e-16 ***
## date2013-12-17 1.070e+02 4.193e-11 2.552e+12 <2e-16 ***
## date2013-12-18 1.140e+02 4.185e-11 2.724e+12 <2e-16 ***
## date2013-12-19 1.320e+02 4.167e-11 3.168e+12 <2e-16 ***
## date2013-12-20 1.380e+02 4.161e-11 3.316e+12 <2e-16 ***
## date2013-12-21 -3.100e+01 4.357e-11 -7.115e+11 <2e-16 ***
## date2013-12-22 5.300e+01 4.252e-11 1.247e+12 <2e-16 ***
## date2013-12-23 1.430e+02 4.156e-11 3.441e+12 <2e-16 ***
## date2013-12-24 -8.100e+01 4.429e-11 -1.829e+12 <2e-16 ***
## date2013-12-25 -1.230e+02 4.497e-11 -2.735e+12 <2e-16 ***
## date2013-12-26 9.400e+01 4.206e-11 2.235e+12 <2e-16 ***
## date2013-12-27 1.210e+02 4.178e-11 2.896e+12 <2e-16 ***
## date2013-12-28 -2.800e+01 4.353e-11 -6.433e+11 <2e-16 ***
## date2013-12-29 4.600e+01 4.260e-11 1.080e+12 <2e-16 ***
## date2013-12-30 1.260e+02 4.173e-11 3.019e+12 <2e-16 ***
## date2013-12-31 -6.600e+01 4.407e-11 -1.498e+12 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.856e-10 on 336411 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 8.259e+24 on 364 and 336411 DF, p-value: < 2.2e-16
# Residuals vs. fitted values for the per-date model
plot(mod_date, which = 1)
Above we introduce a new categorical variable, date, which is simply the date stored as a character string. Doing so gives the model one coefficient per date: the intercept plus the 364 date terms listed above, 365 parameters in total. It is no wonder that the residuals are essentially zero, because the model has a separate parameter for every day and can therefore reproduce each day's value exactly.
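The parameter count can be checked directly on a small example. The sketch below is only an illustration: it uses simulated daily counts with one row per day (the names daily and mod_toy are hypothetical, and this is not the flights data used above), and it relies on base R alone.

```r
# Toy daily counts: 14 days, one row per day (simulated, not the flights data)
set.seed(1)
daily <- data.frame(
  date = as.character(seq(as.Date("2013-01-01"), by = "day", length.out = 14)),
  n = rpois(14, lambda = 900)
)

# A character predictor is treated as a factor, so every date gets its own
# level: intercept + 13 dummy coefficients = 14 parameters
mod_toy <- lm(n ~ date, data = daily)

length(coef(mod_toy))      # number of estimated parameters
nrow(daily)                # number of observations
df.residual(mod_toy)       # residual degrees of freedom left over
max(abs(resid(mod_toy)))   # largest absolute residual
```

With as many parameters as rows, no degrees of freedom are left to estimate the noise, which is why the fit looks perfect without telling us anything about how well the model generalises.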
When a model is too flexible, it fits the noise as well as the signal, so it is not useful for predicting future or unseen samples. In this case, if we used the model to predict the number of flights in 2014, the predictions would be quite far off, because the model simply reuses the 2013 value for the same date and ignores every other pattern, such as the day of the week.
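To make the cost of this flexibility concrete, here is a minimal sketch on simulated data. Everything in it is an assumption made for illustration: the helper make_year(), the weekday means, and the noise level are hypothetical and are not estimated from the flights data. It compares a "one parameter per calendar date" model with a much smaller day-of-week model, both fitted on one simulated year and evaluated on the next.

```r
set.seed(42)

# Simulate one year of daily counts with a weekly pattern plus noise.
# The weekday means below are made-up values, chosen only for illustration.
make_year <- function(year) {
  date <- seq(as.Date(paste0(year, "-01-01")),
              as.Date(paste0(year, "-12-31")), by = "day")
  # "%u" gives the ISO weekday number: 1 = Monday, ..., 7 = Sunday
  wday <- factor(format(date, "%u"), levels = as.character(1:7))
  base <- c(970, 950, 960, 965, 965, 745, 890)[as.integer(wday)]
  data.frame(
    date = date,
    md   = format(date, "%m-%d"),   # calendar date without the year
    wday = wday,
    n    = base + rnorm(length(date), sd = 20)
  )
}

train <- make_year(2013)
test  <- make_year(2014)

# Overly flexible model: one parameter per calendar date
mod_md <- lm(n ~ md, data = train)
# Simpler model: one parameter per day of the week
mod_wday <- lm(n ~ wday, data = train)

# Root-mean-squared prediction error of a model on a data set
rmse <- function(model, data) {
  sqrt(mean((data$n - predict(model, newdata = data))^2))
}

rmse(mod_md, train)    # error of the per-date model on the training year
rmse(mod_md, test)     # error of the per-date model on the following year
rmse(mod_wday, train)  # error of the day-of-week model on the training year
rmse(mod_wday, test)   # error of the day-of-week model on the following year
```

In this simulation the per-date model reproduces the training year exactly but carries each date's value over to a different weekday in the following year, whereas the day-of-week model keeps a similar error on both years.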
With these case studies, we learn some basic principles of data modeling: