Load Libraries

library(tidyverse)
library(modelr)
library(nycflights13)
library(lubridate)


1. A more complicated model for diamonds data set


If we wanted to, we could continue to build up our model, moving the effects we’ve observed into the model to make them explicit. For example, we could include color, cut, and clarity in the model so that the effect of these three categorical variables on price is also made explicit:

diamonds2 <- diamonds %>%
  filter (carat <= 3) %>%
  mutate (lprice = log2(price), lcarat = log2(carat))

mod_diamond <- lm(lprice ~ lcarat + color + cut + clarity, data = diamonds2)

This model now includes four predictors, so it’s getting harder to visualise. Fortunately, they’re currently all independent which means that we can plot them individually in four plots.

grid <- diamonds2 %>% 
  data_grid(cut, .model = mod_diamond) %>% 
  add_predictions(mod_diamond)
grid
## # A tibble: 5 × 5
##   cut       lcarat color clarity  pred
##   <ord>      <dbl> <chr> <chr>   <dbl>
## 1 Fair      -0.515 G     VS2      11.2
## 2 Good      -0.515 G     VS2      11.3
## 3 Very Good -0.515 G     VS2      11.4
## 4 Premium   -0.515 G     VS2      11.4
## 5 Ideal     -0.515 G     VS2      11.4

Here we create the data grid for cut alone. By passing the model through the .model argument, data_grid() fills in the other predictors with “typical” values (the median for continuous variables and a representative level for categorical ones). This way we compare the predictions for different cut qualities while holding all other variables constant.
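As a rough check of what “typical” means for the continuous predictor, the sketch below rebuilds the model and confirms that every row of the grid holds lcarat at its median. The name grid_cut is introduced here only for illustration.

```r
library(ggplot2)  # provides the diamonds data
library(dplyr)
library(modelr)

diamonds2 <- diamonds %>%
  filter(carat <= 3) %>%
  mutate(lprice = log2(price), lcarat = log2(carat))

mod_diamond <- lm(lprice ~ lcarat + color + cut + clarity, data = diamonds2)

# Grid over cut only; the other predictors are filled in from the model
grid_cut <- diamonds2 %>%
  data_grid(cut, .model = mod_diamond)

# The continuous predictor takes a single value across the grid: its median
median_lcarat <- median(diamonds2$lcarat)
all.equal(unique(grid_cut$lcarat), median_lcarat)
```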

ggplot(grid, aes(cut, pred)) + 
  geom_point()

So we see that the model correctly predicts higher price with better cut quality. Similarly, we can create the plot for other variables as well.

grid <- diamonds2 %>% 
  data_grid(color, .model = mod_diamond) %>% 
  add_predictions(mod_diamond)

ggplot(grid, aes(color, pred)) + 
  geom_point()

grid <- diamonds2 %>% 
  data_grid(clarity, .model = mod_diamond) %>% 
  add_predictions(mod_diamond)

ggplot(grid, aes(clarity, pred)) + 
  geom_point()

We see that the model correctly predicts a higher price for better color (“D” is the best grade and “J” the worst, so the predictions fall from “D” to “J”) and for better clarity (rising from “I1” to “IF”). Here cut, color, and clarity are ordered factors, so lm() encodes them with polynomial contrasts, which can capture non-linear patterns across levels.
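To see where coefficient names such as cut.L or cut.Q come from, the short sketch below prints the polynomial contrast matrix that R uses by default for an ordered factor with five levels, the number of levels of cut. The variable name poly5 is only illustrative.

```r
library(ggplot2)  # provides the diamonds data

# cut is stored as an ordered factor, so lm() uses polynomial contrasts for it
is.ordered(diamonds$cut)

# Contrast matrix for five ordered levels: linear, quadratic, cubic, quartic
poly5 <- contr.poly(5)
colnames(poly5)
round(poly5, 3)
```

Each column is one term of an orthogonal polynomial in the level index, which is how the model can bend across levels instead of forcing equal steps.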

grid <- diamonds2 %>% 
  data_grid(lcarat, .model = mod_diamond) %>% 
  add_predictions(mod_diamond)

ggplot(grid, aes(lcarat, pred)) + 
  geom_point()

As expected, the relationship between lprice and lcarat is linear. As usual, we need to check the residual plot to make sure that the model assumptions are valid.

diamonds2 <- diamonds2 %>% 
  add_predictions(mod_diamond) %>%
  add_residuals(mod_diamond, "lresid2")

ggplot(diamonds2, aes(pred, lresid2)) +
  geom_bin_2d(bins = 50) +
  geom_smooth(method = "lm", color = "red")

The plot above is very similar to that created by the plot function.

plot(mod_diamond, which = 1)

We see that the residuals show no clear trend against the predicted values, which is what we want. We can also check the normal Q-Q plot of the residuals.

qqnorm(diamonds2$lresid2)

which is similar to

plot(mod_diamond, which = 2)


Check Potential Outliers

As we see, the majority of the residuals follow a normal distribution nicely. However, for the data points with the largest residuals (potential outliers), the residuals deviate from a normal distribution. We should investigate those samples more closely.

diamonds2 %>% 
  filter(abs(lresid2) > 1) %>% 
  add_predictions(mod_diamond) %>% 
  mutate(pred = round(2 ^ pred)) %>% 
  select(price, pred, carat:table, x:z) %>% 
  arrange(price)
## # A tibble: 18 × 11
##    price  pred carat cut       color clarity depth table     x     y     z
##    <int> <dbl> <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  1013   265  0.25 Fair      F     SI2      54.4    64  4.3   4.23  2.32
##  2  1186   285  0.25 Premium   G     SI2      59      60  5.33  5.28  3.12
##  3  1186   285  0.25 Premium   G     SI2      58.8    60  5.33  5.28  3.12
##  4  1262  2622  1.03 Fair      E     I1       78.2    54  5.72  5.59  4.42
##  5  1415   640  0.35 Fair      G     VS2      65.9    54  5.57  5.53  3.66
##  6  1415   640  0.35 Fair      G     VS2      65.9    54  5.57  5.53  3.66
##  7  1715   577  0.32 Fair      F     VS2      59.6    60  4.42  4.34  2.61
##  8  1776   413  0.29 Fair      F     SI1      55.8    60  4.48  4.41  2.48
##  9  2160   312  0.34 Fair      F     I1       55.8    62  4.72  4.6   2.6 
## 10  2366   775  0.3  Very Good D     VVS2     60.6    58  4.33  4.35  2.63
## 11  3360  1374  0.51 Premium   F     SI1      62.7    62  5.09  4.96  3.15
## 12  3807  1539  0.61 Good      F     SI2      62.5    65  5.36  5.29  3.33
## 13  3920  1705  0.51 Fair      F     VVS2     65.4    60  4.98  4.9   3.23
## 14  4368  1705  0.51 Fair      F     VVS2     60.7    66  5.21  5.11  3.13
## 15  6512 18145  3    Very Good H     I1       63.1    55  9.23  9.1   5.77
## 16  8044 16148  3    Fair      H     I1       67.1    57  8.93  8.84  5.97
## 17 10011  4042  1.01 Fair      D     SI2      64.6    58  6.25  6.2   4.02
## 18 10470 23552  2.46 Premium   E     SI2      59.7    59  8.82  8.76  5.25

Here we keep samples where the absolute residual is greater than one (which means the predicted price is more than double or less than half of the actual price) and print out all information of those diamonds including the predicted price. To summarize what we see here:

  • Sometimes the actual price is much higher than the predicted value for some small diamonds.
  • Sometimes the actual price is much lower than the predicted value for some big diamonds.

By checking all other information, we don’t observe any particular reason to explain the discrepancies. So there are two possibilities:

  • The data of those diamonds are not correct.
  • There are other factors resulting in unusually high or low price which are not included in the data set.

In practice, this is normally what we expect: our models work reasonably well for the majority of samples, but there are exceptions.
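To make the residual threshold used in the filter above concrete: because the model works on log2(price), a residual of +1 means the actual price is twice the predicted price, and a residual of -1 means it is half. The sketch below recomputes this for two rows of the table, using the rounded predictions printed there, so the values are approximate.

```r
# Two rows taken from the outlier table above (price and rounded prediction)
price <- c(1013, 1262)
pred  <- c(265, 2622)

# Residual on the log2 scale
lresid <- log2(price) - log2(pred)
lresid

# Back-transforming the residual gives the ratio of actual to predicted price
2 ^ lresid
price / pred
```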


Lab Exercise: Does the final model, mod_diamond, do a good job of predicting diamond prices? Would you trust it to tell you how much to spend if you were buying a diamond?

Homework Exercise: Is there any interaction


2. Modeling the number of daily flights in the flights data set


Let’s work through a similar process for a dataset that seems even simpler at first glance: the number of flights that leave NYC per day. This is a really small dataset — only 365 rows and 2 columns — and we’re not going to end up with a fully realised model, but as you’ll see, the steps along the way will help us better understand the data. Let’s get started by counting the number of flights per day and visualising it with ggplot2.

daily <- flights %>% 
  mutate(date = make_date(year, month, day)) %>% 
  group_by(date) %>% 
  summarise(n = n())
daily
## # A tibble: 365 × 2
##    date           n
##    <date>     <int>
##  1 2013-01-01   842
##  2 2013-01-02   943
##  3 2013-01-03   914
##  4 2013-01-04   915
##  5 2013-01-05   720
##  6 2013-01-06   832
##  7 2013-01-07   933
##  8 2013-01-08   899
##  9 2013-01-09   902
## 10 2013-01-10   932
## # … with 355 more rows
ggplot(daily, aes(date, n)) + 
  geom_line() +
  ylim(0, 1200)

Here we observe a weekly periodic pattern: the number of flights is noticeably lower on Saturdays. Such patterns are common in data recorded over time (called time series data).

To confirm the pattern, we will look at the distribution of the number of flights by day of the week. We did this when studying the date-time data type.

daily <- daily %>% 
  mutate(wday = wday(date, label = TRUE))
glimpse(daily)
## Rows: 365
## Columns: 3
## $ date <date> 2013-01-01, 2013-01-02, 2013-01-03, 2013-01-04, 2013-01-05, 2013…
## $ n    <int> 842, 943, 914, 915, 720, 832, 933, 899, 902, 932, 930, 690, 828, …
## $ wday <ord> Tue, Wed, Thu, Fri, Sat, Sun, Mon, Tue, Wed, Thu, Fri, Sat, Sun, …

Note that, with label = TRUE, wday() returns an ordered factor.

ggplot(daily, aes(wday, n)) + 
  geom_boxplot()

Note that we see some potential outliers here (most of them are on the lower end) and we will explain them later.

Since the number of flights depends on day-of-week, let’s create a model between them.

mod <- lm(n ~ wday, data = daily)

grid <- daily %>% 
  data_grid(wday) %>% 
  add_predictions(mod, "n")

ggplot(daily, aes(wday, n)) + 
  geom_boxplot() +
  geom_point(data = grid, colour = "red", size = 4)

So the model predicts the average number of flights for each weekday (shown as the red points). Next, let’s compute and visualise the residuals:

daily <- daily %>% 
  add_residuals(mod)
daily %>% 
  ggplot(aes(date, resid)) + 
  geom_ref_line(h = 0) + 
  geom_line()
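As a check on the weekday model fitted above: with a single categorical predictor, the least-squares predictions are simply the group means. The sketch below rebuilds the daily counts and compares the two. The names means and preds are introduced only for this illustration.

```r
library(dplyr)
library(lubridate)
library(nycflights13)

# Daily flight counts with a day-of-week label
daily <- flights %>%
  mutate(date = make_date(year, month, day)) %>%
  count(date) %>%
  mutate(wday = wday(date, label = TRUE))

mod <- lm(n ~ wday, data = daily)

# Mean number of flights per weekday, computed directly
means <- daily %>%
  group_by(wday) %>%
  summarise(mean_n = mean(n))

# Model predictions for the same seven weekdays
preds <- unname(predict(mod, newdata = means))
all.equal(preds, means$mean_n)
```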


Add holiday effects

The residuals still show strong patterns, indicating that the model needs further refinement. There are two strong patterns here:

  • Occasionally the number of flights is far lower than predicted because of holidays, such as Independence Day (July 4th) and Memorial Day (May 27th, 2013).
  • The residuals seem to depend on the month: there are fewer flights than average in January and February, and more than average in June, July, and August.

To verify our observation, let’s do some EDA work.

daily2 <- daily %>%
  filter(resid < -100) %>%
  add_predictions(mod) %>%
  arrange(n) %>%
  print()
## # A tibble: 11 × 5
##    date           n wday  resid  pred
##    <date>     <int> <ord> <dbl> <dbl>
##  1 2013-11-28   634 Thu   -332.  966.
##  2 2013-11-29   661 Fri   -306.  967.
##  3 2013-09-01   718 Sun   -173.  891.
##  4 2013-12-25   719 Wed   -244.  963.
##  5 2013-05-26   729 Sun   -162.  891.
##  6 2013-07-04   737 Thu   -229.  966.
##  7 2013-12-24   761 Tue   -190.  951.
##  8 2013-12-31   776 Tue   -175.  951.
##  9 2013-01-20   786 Sun   -105.  891.
## 10 2013-07-05   822 Fri   -145.  967.
## 11 2013-01-01   842 Tue   -109.  951.

There are 11 days on which the actual number of flights falls more than 100 below the predicted value, and they are all related to holidays:

  • Dec 31 and Jan 1 were related to New Year’s Day
  • Jan 20 was the Sunday before Martin Luther King Jr. Day
  • May 26 was the Sunday before Memorial Day
  • July 4 and July 5 were related to Independence Day
  • Sep 1 was the Sunday before Labor Day
  • Nov 28 and Nov 29 were related to Thanksgiving Day
  • Dec 24 and Dec 25 were related to Christmas Day

Therefore to improve our model, we need a new categorical variable to mark out holidays (and the days before or after) for our model to account for holiday effects.

holidays <- c("20130101", "20130121", "20130527", "20130704", "20130902", "20131128", "20131225", "20140101")
holidays <- ymd(holidays)

daily <- daily %>%
  mutate(holiday_flag = case_when(
            date %in% holidays ~ "holiday",
            (date + days(1)) %in% holidays ~ "before holiday",
            (date - days(1)) %in% holidays ~ "after holiday",
            .default = "regular"
          )
        ) 

print(daily, n = 30)
## # A tibble: 365 × 5
##    date           n wday   resid holiday_flag  
##    <date>     <int> <ord>  <dbl> <chr>         
##  1 2013-01-01   842 Tue   -109.  holiday       
##  2 2013-01-02   943 Wed    -19.7 after holiday 
##  3 2013-01-03   914 Thu    -51.8 regular       
##  4 2013-01-04   915 Fri    -52.5 regular       
##  5 2013-01-05   720 Sat    -24.6 regular       
##  6 2013-01-06   832 Sun    -59.5 regular       
##  7 2013-01-07   933 Mon    -41.8 regular       
##  8 2013-01-08   899 Tue    -52.4 regular       
##  9 2013-01-09   902 Wed    -60.7 regular       
## 10 2013-01-10   932 Thu    -33.8 regular       
## 11 2013-01-11   930 Fri    -37.5 regular       
## 12 2013-01-12   690 Sat    -54.6 regular       
## 13 2013-01-13   828 Sun    -63.5 regular       
## 14 2013-01-14   928 Mon    -46.8 regular       
## 15 2013-01-15   894 Tue    -57.4 regular       
## 16 2013-01-16   901 Wed    -61.7 regular       
## 17 2013-01-17   927 Thu    -38.8 regular       
## 18 2013-01-18   924 Fri    -43.5 regular       
## 19 2013-01-19   674 Sat    -70.6 regular       
## 20 2013-01-20   786 Sun   -105.  before holiday
## 21 2013-01-21   912 Mon    -62.8 holiday       
## 22 2013-01-22   890 Tue    -61.4 after holiday 
## 23 2013-01-23   897 Wed    -65.7 regular       
## 24 2013-01-24   925 Thu    -40.8 regular       
## 25 2013-01-25   922 Fri    -45.5 regular       
## 26 2013-01-26   680 Sat    -64.6 regular       
## 27 2013-01-27   823 Sun    -68.5 regular       
## 28 2013-01-28   923 Mon    -51.8 regular       
## 29 2013-01-29   890 Tue    -61.4 regular       
## 30 2013-01-30   900 Wed    -62.7 regular       
## # … with 335 more rows
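The flagging logic can also be checked on a bare 2013 calendar, without the flight counts. The sketch below assumes the same holiday list as above and uses TRUE ~ "regular" as the fallback, which behaves like .default here but also works in older dplyr versions. The name calendar is hypothetical. It also shows why 2014-01-01 is in the list: it is what marks Dec 31, 2013 as “before holiday”.

```r
library(dplyr)
library(lubridate)

holidays <- ymd(c("20130101", "20130121", "20130527", "20130704",
                  "20130902", "20131128", "20131225", "20140101"))

# One row per day of 2013, flagged with the same rules as in the text
calendar <- tibble(date = seq(ymd("2013-01-01"), ymd("2013-12-31"), by = "day")) %>%
  mutate(holiday_flag = case_when(
    date %in% holidays ~ "holiday",
    (date + days(1)) %in% holidays ~ "before holiday",
    (date - days(1)) %in% holidays ~ "after holiday",
    TRUE ~ "regular"
  ))

# How many days fall into each category
count(calendar, holiday_flag)

# Dec 31 is flagged only because 2014-01-01 is in the holiday list
filter(calendar, date == ymd("2013-12-31"))
```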

So we see that the holiday labels have been added correctly to each row of our data. Let’s update our model now and see whether this improves its performance on holidays. We also include interaction terms to improve performance (why?).

mod2 <- lm(n ~ wday * holiday_flag, data = daily)

daily <- daily %>% 
  add_residuals(mod2, "resid2")
daily %>% 
  ggplot(aes(date, resid2)) + 
  geom_ref_line(h = 0) + 
  geom_line() +
  ylim(-300, 100)

Compared with the first model, we have successfully removed the big residuals that occurred on holidays! Note that ylim() is used to keep the y-axis on roughly the same scale as in the plot for the previous model, which makes the two easier to compare.


Add month effect

Now let’s handle the pattern in the residuals that is related to the month. Clearly, we have fewer flights in winter and more flights in summer. Again, let’s visualise this effect first:

daily <- daily %>%
  mutate(month = month(date, label = TRUE))

daily %>%
  group_by(month) %>%
  summarise(mean_monthly_resid = mean(resid2)) %>%
  ggplot() + geom_point(aes(month, mean_monthly_resid))

The graph above confirms our observation: fewer flights in January and February, and more flights from spring until the end of the summer vacation.

Now let’s incorporate month into our model. Note that one should not interact month with the other predictors: a full interaction of wday, holiday_flag, and month would have 7 × 4 × 12 = 336 coefficients, nearly as many as the sample size (which is 365)!
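A quick way to count the coefficients each formula would ask for is to build the design matrix on a dummy grid of factor levels. This is only a sketch: the data frame cells is a placeholder with 7, 4, and 12 levels standing in for wday, holiday_flag, and month, not the real data.

```r
# Dummy grid with one row per combination of factor levels
cells <- expand.grid(
  wday         = factor(1:7),
  holiday_flag = factor(1:4),
  month        = factor(1:12)
)

# Columns of the design matrix = coefficients the formula requests
n_full     <- ncol(model.matrix(~ wday * holiday_flag * month, data = cells))
n_additive <- ncol(model.matrix(~ wday * holiday_flag + month, data = cells))

n_full      # full three-way interaction
n_additive  # month entered additively
```

On the real data some of these coefficients cannot be estimated, because not every combination of weekday and holiday flag occurs in 2013.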

mod3 <- lm(n ~ wday * holiday_flag + month, data = daily)
summary(mod3)
## 
## Call:
## lm(formula = n ~ wday * holiday_flag + month, data = daily)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -83.786  -9.048   0.315   9.722 110.070 
## 
## Coefficients: (10 not defined because of singularities)
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        754.417     86.181   8.754  < 2e-16 ***
## wday.L                            -389.369    489.946  -0.795 0.427338    
## wday.Q                            -631.683    240.877  -2.622 0.009128 ** 
## wday.C                            -158.783    242.481  -0.655 0.513026    
## wday^4                             -80.663      3.168 -25.462  < 2e-16 ***
## wday^5                              -6.617      3.188  -2.076 0.038673 *  
## wday^6                             -10.590      3.204  -3.306 0.001050 ** 
## holiday_flagbefore holiday         399.645    353.210   1.131 0.258666    
## holiday_flagholiday                163.377    123.711   1.321 0.187523    
## holiday_flagregular                174.272     86.190   2.022 0.043972 *  
## month.L                             37.329      4.159   8.975  < 2e-16 ***
## month.Q                            -60.313      4.124 -14.626  < 2e-16 ***
## month.C                             24.993      4.128   6.055 3.76e-09 ***
## month^4                              6.054      4.093   1.479 0.140014    
## month^5                              1.797      4.106   0.438 0.661958    
## month^6                             -4.188      4.177  -1.003 0.316677    
## month^7                            -25.481      4.091  -6.229 1.40e-09 ***
## month^8                             15.275      4.040   3.781 0.000185 ***
## month^9                              9.440      4.073   2.318 0.021073 *  
## month^10                            -2.927      4.033  -0.726 0.468416    
## month^11                             1.856      4.074   0.456 0.648982    
## wday.L:holiday_flagbefore holiday 1472.415    435.836   3.378 0.000815 ***
## wday.Q:holiday_flagbefore holiday  958.143    843.356   1.136 0.256721    
## wday.C:holiday_flagbefore holiday       NA         NA      NA       NA    
## wday^4:holiday_flagbefore holiday       NA         NA      NA       NA    
## wday^5:holiday_flagbefore holiday       NA         NA      NA       NA    
## wday^6:holiday_flagbefore holiday       NA         NA      NA       NA    
## wday.L:holiday_flagholiday        1086.011    692.144   1.569 0.117577    
## wday.Q:holiday_flagholiday        1011.290    344.317   2.937 0.003542 ** 
## wday.C:holiday_flagholiday         754.756    340.287   2.218 0.027223 *  
## wday^4:holiday_flagholiday              NA         NA      NA       NA    
## wday^5:holiday_flagholiday              NA         NA      NA       NA    
## wday^6:holiday_flagholiday              NA         NA      NA       NA    
## wday.L:holiday_flagregular         305.315    489.939   0.623 0.533595    
## wday.Q:holiday_flagregular         471.143    240.882   1.956 0.051304 .  
## wday.C:holiday_flagregular          87.181    242.503   0.360 0.719444    
## wday^4:holiday_flagregular              NA         NA      NA       NA    
## wday^5:holiday_flagregular              NA         NA      NA       NA    
## wday^6:holiday_flagregular              NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.21 on 336 degrees of freedom
## Multiple R-squared:  0.9451, Adjusted R-squared:  0.9406 
## F-statistic: 206.7 on 28 and 336 DF,  p-value: < 2.2e-16
daily <- daily %>% 
  add_residuals(mod3, "resid3")
daily %>% 
  ggplot(aes(date, resid3)) + 
  geom_ref_line(h = 0) + 
  geom_line() +
  ylim(-200, 200)

Now the residuals look much more random with less obvious patterns. Let’s look at the residual plots:

plot(mod3, which = 1)

plot(mod3, which = 2)

Now the model is already a pretty good one. Again we see some deviation from the normal distribution for potential outliers. Let’s inspect them more closely.

daily %>%
  filter (abs(resid3) > 50) %>%
  add_predictions(mod3) %>% 
  select(-resid, -resid2) %>%
  arrange(desc(resid3))
## # A tibble: 19 × 7
##    date           n wday  holiday_flag   month resid3  pred
##    <date>     <int> <ord> <chr>          <ord>  <dbl> <dbl>
##  1 2013-11-30   857 Sat   regular        Nov    110.   747.
##  2 2013-12-01   987 Sun   regular        Dec     89.2  898.
##  3 2013-01-20   786 Sun   before holiday Jan     77.2  709.
##  4 2013-12-28   814 Sat   regular        Dec     71.4  743.
##  5 2013-12-21   811 Sat   regular        Dec     68.4  743.
##  6 2013-07-05   822 Fri   after holiday  Jul     64.8  757.
##  7 2013-12-14   692 Sat   regular        Dec    -50.6  743.
##  8 2013-09-07   688 Sat   regular        Sep    -51.1  739.
##  9 2013-12-07   691 Sat   regular        Dec    -51.6  743.
## 10 2013-09-14   686 Sat   regular        Sep    -53.1  739.
## 11 2013-10-05   687 Sat   regular        Oct    -56.0  743.
## 12 2013-10-31   922 Thu   regular        Oct    -56.6  979.
## 13 2013-09-28   682 Sat   regular        Sep    -57.1  739.
## 14 2013-11-02   689 Sat   regular        Nov    -57.9  747.
## 15 2013-10-26   685 Sat   regular        Oct    -58.0  743.
## 16 2013-10-19   684 Sat   regular        Oct    -59.0  743.
## 17 2013-11-29   661 Fri   after holiday  Nov    -64.8  726.
## 18 2013-10-12   676 Sat   regular        Oct    -67.0  743.
## 19 2013-08-31   680 Sat   regular        Aug    -83.8  764.

As we see, most of the big residuals now occur on regular Saturdays. For the Saturdays after Thanksgiving and before/after Christmas, the model under-predicts the number of flights, while for the other Saturdays in late August, September, October, and early November it over-predicts. Since we don’t want our model to be overly complicated, let’s introduce one more binary variable, Season_Sat, which stands for “seasonal Saturdays”.

daily <- daily %>%
  mutate(Season_Sat = ifelse(wday == "Sat" & between(date, ymd(20130825), ymd(20131105)), "Yes", "No")) 

daily %>%
  filter(date >= ymd(20130815)) %>%
  print(n=30)
## # A tibble: 139 × 9
##    date           n wday     resid holiday_flag   resid2 month   resid3 Season…¹
##    <date>     <int> <ord>    <dbl> <chr>           <dbl> <ord>    <dbl> <chr>   
##  1 2013-08-15  1000 Thu     34.2   regular         22.2  Aug     0.616  No      
##  2 2013-08-16   998 Fri     30.5   regular         21.5  Aug     1.24   No      
##  3 2013-08-17   780 Sat     35.4   regular         35.4  Aug    16.2    No      
##  4 2013-08-18   914 Sun     22.5   regular         13.5  Aug    -5.00   No      
##  5 2013-08-19   996 Mon     21.2   regular         18.0  Aug    -0.0641 No      
##  6 2013-08-20   986 Tue     34.6   regular         24.0  Aug     6.09   No      
##  7 2013-08-21   990 Wed     27.3   regular         23.3  Aug     3.14   No      
##  8 2013-08-22   990 Thu     24.2   regular         12.2  Aug    -9.38   No      
##  9 2013-08-23   989 Fri     21.5   regular         12.5  Aug    -7.76   No      
## 10 2013-08-24   774 Sat     29.4   regular         29.4  Aug    10.2    No      
## 11 2013-08-25   903 Sun     11.5   regular          2.51 Aug   -16.0    No      
## 12 2013-08-26   982 Mon      7.19  regular          4.02 Aug   -14.1    No      
## 13 2013-08-27   965 Tue     13.6   regular          2.96 Aug   -14.9    No      
## 14 2013-08-28   973 Wed     10.3   regular          6.31 Aug   -13.9    No      
## 15 2013-08-29   979 Thu     13.2   regular          1.20 Aug   -20.4    No      
## 16 2013-08-30   965 Fri     -2.46  regular        -11.5  Aug   -31.8    No      
## 17 2013-08-31   680 Sat    -64.6   regular        -64.6  Aug   -83.8    Yes     
## 18 2013-09-01   718 Sun   -173.    before holiday -26.3  Sep   -41.2    No      
## 19 2013-09-02   929 Mon    -45.8   holiday          6.00 Sep    -8.91   No      
## 20 2013-09-03   956 Tue      4.64  after holiday   13.7  Sep    -1.24   No      
## 21 2013-09-04   948 Wed    -14.7   regular        -18.7  Sep   -14.1    No      
## 22 2013-09-05   969 Thu      3.25  regular         -8.80 Sep    -5.66   No      
## 23 2013-09-06   967 Fri     -0.462 regular         -9.50 Sep    -5.04   No      
## 24 2013-09-07   688 Sat    -56.6   regular        -56.6  Sep   -51.1    Yes     
## 25 2013-09-08   908 Sun     16.5   regular          7.51 Sep    13.7    No      
## 26 2013-09-09   991 Mon     16.2   regular         13.0  Sep    19.7    No      
## 27 2013-09-10   961 Tue      9.64  regular         -1.04 Sep     5.81   No      
## 28 2013-09-11   947 Wed    -15.7   regular        -19.7  Sep   -15.1    No      
## 29 2013-09-12   992 Thu     26.2   regular         14.2  Sep    17.3    No      
## 30 2013-09-13   996 Fri     28.5   regular         19.5  Sep    24.0    No      
## # … with 109 more rows, and abbreviated variable name ¹​Season_Sat
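The date window behind Season_Sat can be checked on its own. The sketch below lists the Saturdays that fall between 2013-08-25 and 2013-11-05, assuming lubridate's default week start (Sunday = 1, so Saturday = 7). The names dates and season_sat are illustrative.

```r
library(lubridate)

# Every day of 2013
dates <- seq(ymd("2013-01-01"), ymd("2013-12-31"), by = "day")

# Saturdays inside the window used for Season_Sat
season_sat <- wday(dates) == 7 &
  dates >= ymd("2013-08-25") &
  dates <= ymd("2013-11-05")

dates[season_sat]
sum(season_sat)
```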

The Season_Sat labels have been added correctly to our data. Now let’s add this variable to our model and redo all the analysis steps:

mod4 <- lm(n ~ wday * holiday_flag + month + Season_Sat, data = daily)
summary(mod4)
## 
## Call:
## lm(formula = n ~ wday * holiday_flag + month + Season_Sat, data = daily)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -65.890  -7.178   0.685   7.500  94.251 
## 
## Coefficients: (10 not defined because of singularities)
##                                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        751.8269    72.7008  10.341  < 2e-16 ***
## wday.L                            -360.7971   413.3129  -0.873 0.383321    
## wday.Q                            -638.8126   203.1983  -3.144 0.001817 ** 
## wday.C                            -148.3470   204.5525  -0.725 0.468819    
## wday^4                             -76.9858     2.6908 -28.611  < 2e-16 ***
## wday^5                              -4.9801     2.6926  -1.850 0.065261 .  
## wday^6                             -10.0722     2.7029  -3.726 0.000228 ***
## holiday_flagbefore holiday         405.7317   297.9592   1.362 0.174207    
## holiday_flagholiday                167.1205   104.3599   1.601 0.110234    
## holiday_flagregular                179.0877    72.7084   2.463 0.014277 *  
## month.L                             43.8188     3.5522  12.336  < 2e-16 ***
## month.Q                            -62.0289     3.4816 -17.816  < 2e-16 ***
## month.C                             17.0668     3.5473   4.811 2.27e-06 ***
## month^4                              0.2092     3.4882   0.060 0.952223    
## month^5                              2.1533     3.4642   0.622 0.534641    
## month^6                              1.0161     3.5512   0.286 0.774967    
## month^7                            -21.0211     3.4718  -6.055 3.77e-09 ***
## month^8                             15.8691     3.4083   4.656 4.65e-06 ***
## month^9                              8.3144     3.4375   2.419 0.016108 *  
## month^10                            -4.2762     3.4037  -1.256 0.209866    
## month^11                             0.8737     3.4381   0.254 0.799553    
## Season_SatYes                      -81.7454     6.9798 -11.712  < 2e-16 ***
## wday.L:holiday_flagbefore holiday 1455.3865   367.6630   3.958 9.21e-05 ***
## wday.Q:holiday_flagbefore holiday  974.1545   711.4347   1.369 0.171829    
## wday.C:holiday_flagbefore holiday        NA         NA      NA       NA    
## wday^4:holiday_flagbefore holiday        NA         NA      NA       NA    
## wday^5:holiday_flagbefore holiday        NA         NA      NA       NA    
## wday^6:holiday_flagbefore holiday        NA         NA      NA       NA    
## wday.L:holiday_flagholiday        1072.1030   583.8764   1.836 0.067217 .  
## wday.Q:holiday_flagholiday        1019.0230   290.4574   3.508 0.000512 ***
## wday.C:holiday_flagholiday         754.3983   287.0571   2.628 0.008983 ** 
## wday^4:holiday_flagholiday               NA         NA      NA       NA    
## wday^5:holiday_flagholiday               NA         NA      NA       NA    
## wday^6:holiday_flagholiday               NA         NA      NA       NA    
## wday.L:holiday_flagregular         285.6327   413.3034   0.691 0.489983    
## wday.Q:holiday_flagregular         487.0970   203.2064   2.397 0.017075 *  
## wday.C:holiday_flagregular          83.2412   204.5693   0.407 0.684334    
## wday^4:holiday_flagregular               NA         NA      NA       NA    
## wday^5:holiday_flagregular               NA         NA      NA       NA    
## wday^6:holiday_flagregular               NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.74 on 335 degrees of freedom
## Multiple R-squared:  0.9611, Adjusted R-squared:  0.9577 
## F-statistic: 285.2 on 29 and 335 DF,  p-value: < 2.2e-16
daily <- daily %>% 
  add_residuals(mod4, "resid4")
daily %>% 
  ggplot(aes(date, resid4)) + 
  geom_ref_line(h = 0) + 
  geom_line() +
  ylim(-200, 200)

plot(mod4, which = 1)

plot(mod4, which = 2)

daily %>%
  filter (abs(resid4) > 50) %>%
  add_predictions(mod4) %>% 
  select(-resid, -resid2, -resid3) %>%
  arrange(desc(resid4))
## # A tibble: 10 × 8
##    date           n wday  holiday_flag   month Season_Sat resid4  pred
##    <date>     <int> <ord> <chr>          <ord> <chr>       <dbl> <dbl>
##  1 2013-11-30   857 Sat   regular        Nov   No           94.3  763.
##  2 2013-12-01   987 Sun   regular        Dec   No           91.5  896.
##  3 2013-01-20   786 Sun   before holiday Jan   No           80.9  705.
##  4 2013-07-05   822 Fri   after holiday  Jul   No           65.9  756.
##  5 2013-12-28   814 Sat   regular        Dec   No           57.9  756.
##  6 2013-12-21   811 Sat   regular        Dec   No           54.9  756.
##  7 2013-12-14   692 Sat   regular        Dec   No          -64.1  756.
##  8 2013-10-31   922 Thu   regular        Oct   No          -65.0  987.
##  9 2013-12-07   691 Sat   regular        Dec   No          -65.1  756.
## 10 2013-11-29   661 Fri   after holiday  Nov   No          -65.9  727.

We see that the big residuals now fall on the weekend after Thanksgiving, on Halloween, around Martin Luther King Jr. Day, on weekends around Christmas, and so on. One may polish the model further by introducing new variables in a similar way until one is happy with the model.


Overfitting

One last comment, and perhaps one of the most important: a model is not necessarily better because its residuals are smaller. We can easily construct a model with zero residuals!
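The same idea on a tiny synthetic data set, before applying it to the flights data: give every observation its own factor level and the fit becomes perfect, even though the response is pure noise. The data frame toy and its columns are made up for this sketch.

```r
set.seed(1)

# Ten noisy observations, each with its own unique label
toy <- data.frame(id = factor(1:10), y = rnorm(10))

# One coefficient per observation: the model simply memorises the data
mod_toy <- lm(y ~ id, data = toy)

max(abs(resid(mod_toy)))  # residuals vanish up to rounding error
df.residual(mod_toy)      # no degrees of freedom left to estimate noise
```

Such a model says nothing about a label it has not seen, which is the sense in which zero residuals do not make a model good.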

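daily2 <- flights %>% 
  mutate(date = make_date(year, month, day)) %>% 
  group_by(date) %>% 
  # Converting date to character makes lm() below treat it as a categorical
  # predictor, with one coefficient per day
  summarise(n = n(), date = as.character(date))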

mod_date <- lm(n ~ date, data = daily2)
summary(mod_date)
## 
## Call:
## lm(formula = n ~ date, data = daily2)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -8.820e-08  0.000e+00  0.000e+00  0.000e+00  2.425e-07 
## 
## Coefficients:
##                  Estimate Std. Error    t value Pr(>|t|)    
## (Intercept)     8.420e+02  3.052e-11  2.759e+13   <2e-16 ***
## date2013-01-02  1.010e+02  4.199e-11  2.405e+12   <2e-16 ***
## date2013-01-03  7.200e+01  4.230e-11  1.702e+12   <2e-16 ***
## date2013-01-04  7.300e+01  4.229e-11  1.726e+12   <2e-16 ***
## date2013-01-05 -1.220e+02  4.495e-11 -2.714e+12   <2e-16 ***
## date2013-01-06 -1.000e+01  4.329e-11 -2.310e+11   <2e-16 ***
## date2013-01-07  9.100e+01  4.209e-11  2.162e+12   <2e-16 ***
## date2013-01-08  5.700e+01  4.247e-11  1.342e+12   <2e-16 ***
## date2013-01-09  6.000e+01  4.244e-11  1.414e+12   <2e-16 ***
## date2013-01-10  9.000e+01  4.210e-11  2.138e+12   <2e-16 ***
## date2013-01-11  8.800e+01  4.213e-11  2.089e+12   <2e-16 ***
## date2013-01-12 -1.520e+02  4.547e-11 -3.343e+12   <2e-16 ***
## date2013-01-13 -1.400e+01  4.334e-11 -3.230e+11   <2e-16 ***
## date2013-01-14  8.600e+01  4.215e-11  2.040e+12   <2e-16 ***
## date2013-01-15  5.200e+01  4.253e-11  1.223e+12   <2e-16 ***
## date2013-01-16  5.900e+01  4.245e-11  1.390e+12   <2e-16 ***
## date2013-01-17  8.500e+01  4.216e-11  2.016e+12   <2e-16 ***
## date2013-01-18  8.200e+01  4.219e-11  1.944e+12   <2e-16 ***
## date2013-01-19 -1.680e+02  4.577e-11 -3.671e+12   <2e-16 ***
## date2013-01-20 -5.600e+01  4.392e-11 -1.275e+12   <2e-16 ***
## date2013-01-21  7.000e+01  4.232e-11  1.654e+12   <2e-16 ***
## date2013-01-22  4.800e+01  4.257e-11  1.127e+12   <2e-16 ***
## date2013-01-23  5.500e+01  4.249e-11  1.294e+12   <2e-16 ***
## date2013-01-24  8.300e+01  4.218e-11  1.968e+12   <2e-16 ***
## date2013-01-25  8.000e+01  4.221e-11  1.895e+12   <2e-16 ***
## date2013-01-26 -1.620e+02  4.566e-11 -3.548e+12   <2e-16 ***
## date2013-01-27 -1.900e+01  4.341e-11 -4.377e+11   <2e-16 ***
## date2013-01-28  8.100e+01  4.220e-11  1.919e+12   <2e-16 ***
## date2013-01-29  4.800e+01  4.257e-11  1.127e+12   <2e-16 ***
## date2013-01-30  5.800e+01  4.246e-11  1.366e+12   <2e-16 ***
## date2013-01-31  8.600e+01  4.215e-11  2.040e+12   <2e-16 ***
## date2013-02-01  8.400e+01  4.217e-11  1.992e+12   <2e-16 ***
## date2013-02-02 -1.600e+02  4.562e-11 -3.507e+12   <2e-16 ***
## date2013-02-03 -2.800e+01  4.353e-11 -6.433e+11   <2e-16 ***
## date2013-02-04  9.000e+01  4.210e-11  2.138e+12   <2e-16 ***
## date2013-02-05  5.400e+01  4.250e-11  1.270e+12   <2e-16 ***
## date2013-02-06  5.900e+01  4.245e-11  1.390e+12   <2e-16 ***
## date2013-02-07  9.000e+01  4.210e-11  2.138e+12   <2e-16 ***
## date2013-02-08  8.800e+01  4.213e-11  2.089e+12   <2e-16 ***
## date2013-02-09 -1.580e+02  4.558e-11 -3.466e+12   <2e-16 ***
## date2013-02-10 -1.300e+01  4.333e-11 -3.000e+11   <2e-16 ***
## date2013-02-11  8.700e+01  4.214e-11  2.065e+12   <2e-16 ***
## date2013-02-12  5.100e+01  4.254e-11  1.199e+12   <2e-16 ***
## date2013-02-13  7.600e+01  4.226e-11  1.799e+12   <2e-16 ***
## date2013-02-14  1.140e+02  4.185e-11  2.724e+12   <2e-16 ***
## date2013-02-15  1.120e+02  4.187e-11  2.675e+12   <2e-16 ***
## date2013-02-16 -1.040e+02  4.465e-11 -2.329e+12   <2e-16 ***
## date2013-02-17  6.000e+00  4.308e-11  1.393e+11   <2e-16 ***
## date2013-02-18  1.060e+02  4.194e-11  2.528e+12   <2e-16 ***
## date2013-02-19  1.010e+02  4.199e-11  2.405e+12   <2e-16 ***
## date2013-02-20  1.070e+02  4.193e-11  2.552e+12   <2e-16 ***
## date2013-02-21  1.190e+02  4.180e-11  2.847e+12   <2e-16 ***
## date2013-02-22  1.150e+02  4.184e-11  2.748e+12   <2e-16 ***
## date2013-02-23 -9.900e+01  4.457e-11 -2.221e+12   <2e-16 ***
## date2013-02-24  3.800e+01  4.269e-11  8.901e+11   <2e-16 ***
## date2013-02-25  1.190e+02  4.180e-11  2.847e+12   <2e-16 ***
## date2013-02-26  9.600e+01  4.204e-11  2.284e+12   <2e-16 ***
## date2013-02-27  1.030e+02  4.197e-11  2.454e+12   <2e-16 ***
## date2013-02-28  1.220e+02  4.177e-11  2.921e+12   <2e-16 ***
## date2013-03-01  1.160e+02  4.183e-11  2.773e+12   <2e-16 ***
## date2013-03-02 -7.700e+01  4.423e-11 -1.741e+12   <2e-16 ***
## date2013-03-03  7.100e+01  4.231e-11  1.678e+12   <2e-16 ***
## date2013-03-04  1.350e+02  4.164e-11  3.242e+12   <2e-16 ***
## date2013-03-05  1.230e+02  4.176e-11  2.945e+12   <2e-16 ***
## date2013-03-06  1.300e+02  4.169e-11  3.118e+12   <2e-16 ***
## date2013-03-07  1.380e+02  4.161e-11  3.316e+12   <2e-16 ***
## date2013-03-08  1.370e+02  4.162e-11  3.292e+12   <2e-16 ***
## date2013-03-09 -7.700e+01  4.423e-11 -1.741e+12   <2e-16 ***
## date2013-03-10  6.600e+01  4.237e-11  1.558e+12   <2e-16 ***
## date2013-03-11  1.380e+02  4.161e-11  3.316e+12   <2e-16 ***
## date2013-03-12  1.240e+02  4.175e-11  2.970e+12   <2e-16 ***
## date2013-03-13  1.320e+02  4.167e-11  3.168e+12   <2e-16 ***
## date2013-03-14  1.400e+02  4.159e-11  3.366e+12   <2e-16 ***
## date2013-03-15  1.370e+02  4.162e-11  3.292e+12   <2e-16 ***
## date2013-03-16 -7.500e+01  4.420e-11 -1.697e+12   <2e-16 ***
## date2013-03-17  6.500e+01  4.238e-11  1.534e+12   <2e-16 ***
## date2013-03-18  1.390e+02  4.160e-11  3.341e+12   <2e-16 ***
## date2013-03-19  1.250e+02  4.174e-11  2.995e+12   <2e-16 ***
## date2013-03-20  1.280e+02  4.171e-11  3.069e+12   <2e-16 ***
## date2013-03-21  1.380e+02  4.161e-11  3.316e+12   <2e-16 ***
## date2013-03-22  1.350e+02  4.164e-11  3.242e+12   <2e-16 ***
## date2013-03-23 -7.500e+01  4.420e-11 -1.697e+12   <2e-16 ***
## date2013-03-24  6.300e+01  4.240e-11  1.486e+12   <2e-16 ***
## date2013-03-25  1.360e+02  4.163e-11  3.267e+12   <2e-16 ***
## date2013-03-26  1.310e+02  4.168e-11  3.143e+12   <2e-16 ***
## date2013-03-27  1.350e+02  4.164e-11  3.242e+12   <2e-16 ***
## date2013-03-28  1.400e+02  4.159e-11  3.366e+12   <2e-16 ***
## date2013-03-29  1.320e+02  4.167e-11  3.168e+12   <2e-16 ***
## date2013-03-30 -7.300e+01  4.417e-11 -1.653e+12   <2e-16 ***
## date2013-03-31  5.500e+01  4.249e-11  1.294e+12   <2e-16 ***
## date2013-04-01  1.280e+02  4.171e-11  3.069e+12   <2e-16 ***
## date2013-04-02  1.410e+02  4.158e-11  3.391e+12   <2e-16 ***
## date2013-04-03  1.500e+02  4.150e-11  3.615e+12   <2e-16 ***
## date2013-04-04  1.430e+02  4.156e-11  3.441e+12   <2e-16 ***
## date2013-04-05  1.390e+02  4.160e-11  3.341e+12   <2e-16 ***
## date2013-04-06 -7.200e+01  4.416e-11 -1.631e+12   <2e-16 ***
## date2013-04-07  6.900e+01  4.233e-11  1.630e+12   <2e-16 ***
## date2013-04-08  1.390e+02  4.160e-11  3.341e+12   <2e-16 ***
## date2013-04-09  1.330e+02  4.166e-11  3.192e+12   <2e-16 ***
## date2013-04-10  1.470e+02  4.152e-11  3.540e+12   <2e-16 ***
## date2013-04-11  1.500e+02  4.150e-11  3.615e+12   <2e-16 ***
## date2013-04-12  1.470e+02  4.152e-11  3.540e+12   <2e-16 ***
## date2013-04-13 -7.200e+01  4.416e-11 -1.631e+12   <2e-16 ***
## date2013-04-14  7.500e+01  4.227e-11  1.774e+12   <2e-16 ***
## date2013-04-15  1.530e+02  4.147e-11  3.690e+12   <2e-16 ***
## date2013-04-16  1.320e+02  4.167e-11  3.168e+12   <2e-16 ***
## date2013-04-17  1.460e+02  4.153e-11  3.515e+12   <2e-16 ***
## date2013-04-18  1.500e+02  4.150e-11  3.615e+12   <2e-16 ***
## date2013-04-19  1.460e+02  4.153e-11  3.515e+12   <2e-16 ***
## date2013-04-20 -7.600e+01  4.422e-11 -1.719e+12   <2e-16 ***
## date2013-04-21  7.700e+01  4.225e-11  1.823e+12   <2e-16 ***
## date2013-04-22  1.430e+02  4.156e-11  3.441e+12   <2e-16 ***
## date2013-04-23  1.230e+02  4.176e-11  2.945e+12   <2e-16 ***
## date2013-04-24  1.340e+02  4.165e-11  3.217e+12   <2e-16 ***
## date2013-04-25  1.410e+02  4.158e-11  3.391e+12   <2e-16 ***
## date2013-04-26  1.390e+02  4.160e-11  3.341e+12   <2e-16 ***
## date2013-04-27 -8.500e+01  4.435e-11 -1.916e+12   <2e-16 ***
## date2013-04-28  7.100e+01  4.231e-11  1.678e+12   <2e-16 ***
## date2013-04-29  1.410e+02  4.158e-11  3.391e+12   <2e-16 ***
## date2013-04-30  1.180e+02  4.181e-11  2.822e+12   <2e-16 ***
## date2013-05-01  1.220e+02  4.177e-11  2.921e+12   <2e-16 ***
## date2013-05-02  1.410e+02  4.158e-11  3.391e+12   <2e-16 ***
## date2013-05-03  1.360e+02  4.163e-11  3.267e+12   <2e-16 ***
## date2013-05-04 -9.700e+01  4.454e-11 -2.178e+12   <2e-16 ***
## date2013-05-05  7.000e+01  4.232e-11  1.654e+12   <2e-16 ***
## date2013-05-06  1.380e+02  4.161e-11  3.316e+12   <2e-16 ***
## date2013-05-07  1.130e+02  4.186e-11  2.699e+12   <2e-16 ***
## date2013-05-08  1.230e+02  4.176e-11  2.945e+12   <2e-16 ***
## date2013-05-09  1.390e+02  4.160e-11  3.341e+12   <2e-16 ***
## date2013-05-10  1.360e+02  4.163e-11  3.267e+12   <2e-16 ***
## date2013-05-11 -1.040e+02  4.465e-11 -2.329e+12   <2e-16 ***
## date2013-05-12  5.400e+01  4.250e-11  1.270e+12   <2e-16 ***
## date2013-05-13  1.370e+02  4.162e-11  3.292e+12   <2e-16 ***
## date2013-05-14  1.130e+02  4.186e-11  2.699e+12   <2e-16 ***
## date2013-05-15  1.250e+02  4.174e-11  2.995e+12   <2e-16 ***
## date2013-05-16  1.400e+02  4.159e-11  3.366e+12   <2e-16 ***
## date2013-05-17  1.380e+02  4.161e-11  3.316e+12   <2e-16 ***
## date2013-05-18 -9.300e+01  4.448e-11 -2.091e+12   <2e-16 ***
## date2013-05-19  6.900e+01  4.233e-11  1.630e+12   <2e-16 ***
## date2013-05-20  1.410e+02  4.158e-11  3.391e+12   <2e-16 ***
## date2013-05-21  1.200e+02  4.179e-11  2.871e+12   <2e-16 ***
## date2013-05-22  1.300e+02  4.169e-11  3.118e+12   <2e-16 ***
## date2013-05-23  1.460e+02  4.153e-11  3.515e+12   <2e-16 ***
## date2013-05-24  1.360e+02  4.163e-11  3.267e+12   <2e-16 ***
## date2013-05-25 -1.140e+02  4.482e-11 -2.544e+12   <2e-16 ***
## date2013-05-26 -1.130e+02  4.480e-11 -2.522e+12   <2e-16 ***
## date2013-05-27  8.600e+01  4.215e-11  2.040e+12   <2e-16 ***
## date2013-05-28  1.390e+02  4.160e-11  3.341e+12   <2e-16 ***
## date2013-05-29  1.320e+02  4.167e-11  3.168e+12   <2e-16 ***
## date2013-05-30  1.470e+02  4.152e-11  3.540e+12   <2e-16 ***
## date2013-05-31  1.440e+02  4.155e-11  3.465e+12   <2e-16 ***
## date2013-06-01 -8.800e+01  4.440e-11 -1.982e+12   <2e-16 ***
## date2013-06-02  6.900e+01  4.233e-11  1.630e+12   <2e-16 ***
## date2013-06-03  1.400e+02  4.159e-11  3.366e+12   <2e-16 ***
## date2013-06-04  1.180e+02  4.181e-11  2.822e+12   <2e-16 ***
## date2013-06-05  1.280e+02  4.171e-11  3.069e+12   <2e-16 ***
## date2013-06-06  1.340e+02  4.165e-11  3.217e+12   <2e-16 ***
## date2013-06-07  1.330e+02  4.166e-11  3.192e+12   <2e-16 ***
## date2013-06-08 -6.300e+01  4.402e-11 -1.431e+12   <2e-16 ***
## date2013-06-09  6.600e+01  4.237e-11  1.558e+12   <2e-16 ***
## date2013-06-10  1.450e+02  4.154e-11  3.490e+12   <2e-16 ***
## date2013-06-11  1.380e+02  4.161e-11  3.316e+12   <2e-16 ***
## date2013-06-12  1.410e+02  4.158e-11  3.391e+12   <2e-16 ***
## date2013-06-13  1.470e+02  4.152e-11  3.540e+12   <2e-16 ***
## date2013-06-14  1.470e+02  4.152e-11  3.540e+12   <2e-16 ***
## date2013-06-15 -4.100e+01  4.371e-11 -9.380e+11   <2e-16 ***
## date2013-06-16  7.600e+01  4.226e-11  1.799e+12   <2e-16 ***
## date2013-06-17  1.480e+02  4.151e-11  3.565e+12   <2e-16 ***
## date2013-06-18  1.400e+02  4.159e-11  3.366e+12   <2e-16 ***
## date2013-06-19  1.430e+02  4.156e-11  3.441e+12   <2e-16 ***
## date2013-06-20  1.530e+02  4.147e-11  3.690e+12   <2e-16 ***
## date2013-06-21  1.510e+02  4.149e-11  3.640e+12   <2e-16 ***
## date2013-06-22 -3.000e+01  4.356e-11 -6.888e+11   <2e-16 ***
## date2013-06-23  8.100e+01  4.220e-11  1.919e+12   <2e-16 ***
## date2013-06-24  1.520e+02  4.148e-11  3.665e+12   <2e-16 ***
## date2013-06-25  1.510e+02  4.149e-11  3.640e+12   <2e-16 ***
## date2013-06-26  1.530e+02  4.147e-11  3.690e+12   <2e-16 ***
## date2013-06-27  1.530e+02  4.147e-11  3.690e+12   <2e-16 ***
## date2013-06-28  1.520e+02  4.148e-11  3.665e+12   <2e-16 ***
## date2013-06-29 -3.000e+01  4.356e-11 -6.888e+11   <2e-16 ***
## date2013-06-30  7.600e+01  4.226e-11  1.799e+12   <2e-16 ***
## date2013-07-01  1.240e+02  4.175e-11  2.970e+12   <2e-16 ***
## date2013-07-02  1.030e+02  4.197e-11  2.454e+12   <2e-16 ***
## date2013-07-03  1.410e+02  4.158e-11  3.391e+12   <2e-16 ***
## date2013-07-04 -1.050e+02  4.467e-11 -2.351e+12   <2e-16 ***
## date2013-07-05 -2.000e+01  4.342e-11 -4.606e+11   <2e-16 ***
## date2013-07-06 -3.700e+01  4.365e-11 -8.476e+11   <2e-16 ***
## date2013-07-07  9.200e+01  4.208e-11  2.186e+12   <2e-16 ***
## date2013-07-08  1.620e+02  4.138e-11  3.915e+12   <2e-16 ***
## date2013-07-09  1.590e+02  4.141e-11  3.840e+12   <2e-16 ***
## date2013-07-10  1.620e+02  4.138e-11  3.915e+12   <2e-16 ***
## date2013-07-11  1.640e+02  4.136e-11  3.965e+12   <2e-16 ***
## date2013-07-12  1.600e+02  4.140e-11  3.865e+12   <2e-16 ***
## date2013-07-13 -3.100e+01  4.357e-11 -7.115e+11   <2e-16 ***
## date2013-07-14  8.900e+01  4.212e-11  2.113e+12   <2e-16 ***
## date2013-07-15  1.570e+02  4.143e-11  3.790e+12   <2e-16 ***
## date2013-07-16  1.540e+02  4.146e-11  3.715e+12   <2e-16 ***
## date2013-07-17  1.590e+02  4.141e-11  3.840e+12   <2e-16 ***
## date2013-07-18  1.610e+02  4.139e-11  3.890e+12   <2e-16 ***
## date2013-07-19  1.570e+02  4.143e-11  3.790e+12   <2e-16 ***
## date2013-07-20 -3.200e+01  4.358e-11 -7.342e+11   <2e-16 ***
## date2013-07-21  8.700e+01  4.214e-11  2.065e+12   <2e-16 ***
## date2013-07-22  1.580e+02  4.142e-11  3.815e+12   <2e-16 ***
## date2013-07-23  1.550e+02  4.145e-11  3.740e+12   <2e-16 ***
## date2013-07-24  1.580e+02  4.142e-11  3.815e+12   <2e-16 ***
## date2013-07-25  1.610e+02  4.139e-11  3.890e+12   <2e-16 ***
## date2013-07-26  1.570e+02  4.143e-11  3.790e+12   <2e-16 ***
## date2013-07-27 -3.100e+01  4.357e-11 -7.115e+11   <2e-16 ***
## date2013-07-28  8.800e+01  4.213e-11  2.089e+12   <2e-16 ***
## date2013-07-29  1.570e+02  4.143e-11  3.790e+12   <2e-16 ***
## date2013-07-30  1.550e+02  4.145e-11  3.740e+12   <2e-16 ***
## date2013-07-31  1.590e+02  4.141e-11  3.840e+12   <2e-16 ***
## date2013-08-01  1.580e+02  4.142e-11  3.815e+12   <2e-16 ***
## date2013-08-02  1.570e+02  4.143e-11  3.790e+12   <2e-16 ***
## date2013-08-03 -3.300e+01  4.360e-11 -7.569e+11   <2e-16 ***
## date2013-08-04  8.700e+01  4.214e-11  2.065e+12   <2e-16 ***
## date2013-08-05  1.580e+02  4.142e-11  3.815e+12   <2e-16 ***
## date2013-08-06  1.540e+02  4.146e-11  3.715e+12   <2e-16 ***
## date2013-08-07  1.590e+02  4.141e-11  3.840e+12   <2e-16 ***
## date2013-08-08  1.590e+02  4.141e-11  3.840e+12   <2e-16 ***
## date2013-08-09  1.570e+02  4.143e-11  3.790e+12   <2e-16 ***
## date2013-08-10 -3.500e+01  4.362e-11 -8.023e+11   <2e-16 ***
## date2013-08-11  8.700e+01  4.214e-11  2.065e+12   <2e-16 ***
## date2013-08-12  1.590e+02  4.141e-11  3.840e+12   <2e-16 ***
## date2013-08-13  1.530e+02  4.147e-11  3.690e+12   <2e-16 ***
## date2013-08-14  1.550e+02  4.145e-11  3.740e+12   <2e-16 ***
## date2013-08-15  1.580e+02  4.142e-11  3.815e+12   <2e-16 ***
## date2013-08-16  1.560e+02  4.144e-11  3.765e+12   <2e-16 ***
## date2013-08-17 -6.200e+01  4.401e-11 -1.409e+12   <2e-16 ***
## date2013-08-18  7.200e+01  4.230e-11  1.702e+12   <2e-16 ***
## date2013-08-19  1.540e+02  4.146e-11  3.715e+12   <2e-16 ***
## date2013-08-20  1.440e+02  4.155e-11  3.465e+12   <2e-16 ***
## date2013-08-21  1.480e+02  4.151e-11  3.565e+12   <2e-16 ***
## date2013-08-22  1.480e+02  4.151e-11  3.565e+12   <2e-16 ***
## date2013-08-23  1.470e+02  4.152e-11  3.540e+12   <2e-16 ***
## date2013-08-24 -6.800e+01  4.410e-11 -1.542e+12   <2e-16 ***
## date2013-08-25  6.100e+01  4.242e-11  1.438e+12   <2e-16 ***
## date2013-08-26  1.400e+02  4.159e-11  3.366e+12   <2e-16 ***
## date2013-08-27  1.230e+02  4.176e-11  2.945e+12   <2e-16 ***
## date2013-08-28  1.310e+02  4.168e-11  3.143e+12   <2e-16 ***
## date2013-08-29  1.370e+02  4.162e-11  3.292e+12   <2e-16 ***
## date2013-08-30  1.230e+02  4.176e-11  2.945e+12   <2e-16 ***
## date2013-08-31 -1.620e+02  4.566e-11 -3.548e+12   <2e-16 ***
## date2013-09-01 -1.240e+02  4.498e-11 -2.757e+12   <2e-16 ***
## date2013-09-02  8.700e+01  4.214e-11  2.065e+12   <2e-16 ***
## date2013-09-03  1.140e+02  4.185e-11  2.724e+12   <2e-16 ***
## date2013-09-04  1.060e+02  4.194e-11  2.528e+12   <2e-16 ***
## date2013-09-05  1.270e+02  4.172e-11  3.044e+12   <2e-16 ***
## date2013-09-06  1.250e+02  4.174e-11  2.995e+12   <2e-16 ***
## date2013-09-07 -1.540e+02  4.551e-11 -3.384e+12   <2e-16 ***
## date2013-09-08  6.600e+01  4.237e-11  1.558e+12   <2e-16 ***
## date2013-09-09  1.490e+02  4.151e-11  3.590e+12   <2e-16 ***
## date2013-09-10  1.190e+02  4.180e-11  2.847e+12   <2e-16 ***
## date2013-09-11  1.050e+02  4.195e-11  2.503e+12   <2e-16 ***
## date2013-09-12  1.500e+02  4.150e-11  3.615e+12   <2e-16 ***
## date2013-09-13  1.540e+02  4.146e-11  3.715e+12   <2e-16 ***
## date2013-09-14 -1.560e+02  4.555e-11 -3.425e+12   <2e-16 ***
## date2013-09-15  5.800e+01  4.246e-11  1.366e+12   <2e-16 ***
## date2013-09-16  1.500e+02  4.150e-11  3.615e+12   <2e-16 ***
## date2013-09-17  1.190e+02  4.180e-11  2.847e+12   <2e-16 ***
## date2013-09-18  1.300e+02  4.169e-11  3.118e+12   <2e-16 ***
## date2013-09-19  1.500e+02  4.150e-11  3.615e+12   <2e-16 ***
## date2013-09-20  1.520e+02  4.148e-11  3.665e+12   <2e-16 ***
## date2013-09-21 -1.490e+02  4.542e-11 -3.281e+12   <2e-16 ***
## date2013-09-22  6.200e+01  4.241e-11  1.462e+12   <2e-16 ***
## date2013-09-23  1.510e+02  4.149e-11  3.640e+12   <2e-16 ***
## date2013-09-24  1.180e+02  4.181e-11  2.822e+12   <2e-16 ***
## date2013-09-25  1.340e+02  4.165e-11  3.217e+12   <2e-16 ***
## date2013-09-26  1.540e+02  4.146e-11  3.715e+12   <2e-16 ***
## date2013-09-27  1.540e+02  4.146e-11  3.715e+12   <2e-16 ***
## date2013-09-28 -1.600e+02  4.562e-11 -3.507e+12   <2e-16 ***
## date2013-09-29  7.200e+01  4.230e-11  1.702e+12   <2e-16 ***
## date2013-09-30  1.510e+02  4.149e-11  3.640e+12   <2e-16 ***
## date2013-10-01  1.230e+02  4.176e-11  2.945e+12   <2e-16 ***
## date2013-10-02  1.330e+02  4.166e-11  3.192e+12   <2e-16 ***
## date2013-10-03  1.530e+02  4.147e-11  3.690e+12   <2e-16 ***
## date2013-10-04  1.530e+02  4.147e-11  3.690e+12   <2e-16 ***
## date2013-10-05 -1.550e+02  4.553e-11 -3.404e+12   <2e-16 ***
## date2013-10-06  7.500e+01  4.227e-11  1.774e+12   <2e-16 ***
## date2013-10-07  1.520e+02  4.148e-11  3.665e+12   <2e-16 ***
## date2013-10-08  1.220e+02  4.177e-11  2.921e+12   <2e-16 ***
## date2013-10-09  1.320e+02  4.167e-11  3.168e+12   <2e-16 ***
## date2013-10-10  1.520e+02  4.148e-11  3.665e+12   <2e-16 ***
## date2013-10-11  1.490e+02  4.151e-11  3.590e+12   <2e-16 ***
## date2013-10-12 -1.660e+02  4.573e-11 -3.630e+12   <2e-16 ***
## date2013-10-13  6.000e+01  4.244e-11  1.414e+12   <2e-16 ***
## date2013-10-14  1.450e+02  4.154e-11  3.490e+12   <2e-16 ***
## date2013-10-15  1.210e+02  4.178e-11  2.896e+12   <2e-16 ***
## date2013-10-16  1.320e+02  4.167e-11  3.168e+12   <2e-16 ***
## date2013-10-17  1.530e+02  4.147e-11  3.690e+12   <2e-16 ***
## date2013-10-18  1.510e+02  4.149e-11  3.640e+12   <2e-16 ***
## date2013-10-19 -1.580e+02  4.558e-11 -3.466e+12   <2e-16 ***
## date2013-10-20  7.300e+01  4.229e-11  1.726e+12   <2e-16 ***
## date2013-10-21  1.490e+02  4.151e-11  3.590e+12   <2e-16 ***
## date2013-10-22  1.220e+02  4.177e-11  2.921e+12   <2e-16 ***
## date2013-10-23  1.330e+02  4.166e-11  3.192e+12   <2e-16 ***
## date2013-10-24  1.500e+02  4.150e-11  3.615e+12   <2e-16 ***
## date2013-10-25  1.470e+02  4.152e-11  3.540e+12   <2e-16 ***
## date2013-10-26 -1.570e+02  4.557e-11 -3.446e+12   <2e-16 ***
## date2013-10-27  6.800e+01  4.235e-11  1.606e+12   <2e-16 ***
## date2013-10-28  1.410e+02  4.158e-11  3.391e+12   <2e-16 ***
## date2013-10-29  1.230e+02  4.176e-11  2.945e+12   <2e-16 ***
## date2013-10-30  1.310e+02  4.168e-11  3.143e+12   <2e-16 ***
## date2013-10-31  8.000e+01  4.221e-11  1.895e+12   <2e-16 ***
## date2013-11-01  1.440e+02  4.155e-11  3.465e+12   <2e-16 ***
## date2013-11-02 -1.530e+02  4.549e-11 -3.363e+12   <2e-16 ***
## date2013-11-03  6.000e+01  4.244e-11  1.414e+12   <2e-16 ***
## date2013-11-04  1.360e+02  4.163e-11  3.267e+12   <2e-16 ***
## date2013-11-05  1.250e+02  4.174e-11  2.995e+12   <2e-16 ***
## date2013-11-06  1.310e+02  4.168e-11  3.143e+12   <2e-16 ***
## date2013-11-07  1.490e+02  4.151e-11  3.590e+12   <2e-16 ***
## date2013-11-08  1.440e+02  4.155e-11  3.465e+12   <2e-16 ***
## date2013-11-09 -1.270e+02  4.503e-11 -2.820e+12   <2e-16 ***
## date2013-11-10  5.300e+01  4.252e-11  1.247e+12   <2e-16 ***
## date2013-11-11  1.410e+02  4.158e-11  3.391e+12   <2e-16 ***
## date2013-11-12  1.310e+02  4.168e-11  3.143e+12   <2e-16 ***
## date2013-11-13  1.340e+02  4.165e-11  3.217e+12   <2e-16 ***
## date2013-11-14  1.460e+02  4.153e-11  3.515e+12   <2e-16 ***
## date2013-11-15  1.430e+02  4.156e-11  3.441e+12   <2e-16 ***
## date2013-11-16 -1.280e+02  4.505e-11 -2.841e+12   <2e-16 ***
## date2013-11-17  5.400e+01  4.250e-11  1.270e+12   <2e-16 ***
## date2013-11-18  1.430e+02  4.156e-11  3.441e+12   <2e-16 ***
## date2013-11-19  1.310e+02  4.168e-11  3.143e+12   <2e-16 ***
## date2013-11-20  1.350e+02  4.164e-11  3.242e+12   <2e-16 ***
## date2013-11-21  1.580e+02  4.142e-11  3.815e+12   <2e-16 ***
## date2013-11-22  1.570e+02  4.143e-11  3.790e+12   <2e-16 ***
## date2013-11-23 -9.800e+01  4.456e-11 -2.199e+12   <2e-16 ***
## date2013-11-24  5.400e+01  4.250e-11  1.270e+12   <2e-16 ***
## date2013-11-25  1.000e+02  4.200e-11  2.381e+12   <2e-16 ***
## date2013-11-26  1.470e+02  4.152e-11  3.540e+12   <2e-16 ***
## date2013-11-27  1.720e+02  4.129e-11  4.166e+12   <2e-16 ***
## date2013-11-28 -2.080e+02  4.656e-11 -4.467e+12   <2e-16 ***
## date2013-11-29 -1.810e+02  4.602e-11 -3.933e+12   <2e-16 ***
## date2013-11-30  1.500e+01  4.297e-11  3.491e+11   <2e-16 ***
## date2013-12-01  1.450e+02  4.154e-11  3.490e+12   <2e-16 ***
## date2013-12-02  1.620e+02  4.138e-11  3.915e+12   <2e-16 ***
## date2013-12-03  1.310e+02  4.168e-11  3.143e+12   <2e-16 ***
## date2013-12-04  1.160e+02  4.183e-11  2.773e+12   <2e-16 ***
## date2013-12-05  1.270e+02  4.172e-11  3.044e+12   <2e-16 ***
## date2013-12-06  1.280e+02  4.171e-11  3.069e+12   <2e-16 ***
## date2013-12-07 -1.510e+02  4.546e-11 -3.322e+12   <2e-16 ***
## date2013-12-08  3.300e+01  4.275e-11  7.719e+11   <2e-16 ***
## date2013-12-09  1.200e+02  4.179e-11  2.871e+12   <2e-16 ***
## date2013-12-10  1.010e+02  4.199e-11  2.405e+12   <2e-16 ***
## date2013-12-11  1.120e+02  4.187e-11  2.675e+12   <2e-16 ***
## date2013-12-12  1.260e+02  4.173e-11  3.019e+12   <2e-16 ***
## date2013-12-13  1.280e+02  4.171e-11  3.069e+12   <2e-16 ***
## date2013-12-14 -1.500e+02  4.544e-11 -3.301e+12   <2e-16 ***
## date2013-12-15  3.800e+01  4.269e-11  8.901e+11   <2e-16 ***
## date2013-12-16  1.220e+02  4.177e-11  2.921e+12   <2e-16 ***
## date2013-12-17  1.070e+02  4.193e-11  2.552e+12   <2e-16 ***
## date2013-12-18  1.140e+02  4.185e-11  2.724e+12   <2e-16 ***
## date2013-12-19  1.320e+02  4.167e-11  3.168e+12   <2e-16 ***
## date2013-12-20  1.380e+02  4.161e-11  3.316e+12   <2e-16 ***
## date2013-12-21 -3.100e+01  4.357e-11 -7.115e+11   <2e-16 ***
## date2013-12-22  5.300e+01  4.252e-11  1.247e+12   <2e-16 ***
## date2013-12-23  1.430e+02  4.156e-11  3.441e+12   <2e-16 ***
## date2013-12-24 -8.100e+01  4.429e-11 -1.829e+12   <2e-16 ***
## date2013-12-25 -1.230e+02  4.497e-11 -2.735e+12   <2e-16 ***
## date2013-12-26  9.400e+01  4.206e-11  2.235e+12   <2e-16 ***
## date2013-12-27  1.210e+02  4.178e-11  2.896e+12   <2e-16 ***
## date2013-12-28 -2.800e+01  4.353e-11 -6.433e+11   <2e-16 ***
## date2013-12-29  4.600e+01  4.260e-11  1.080e+12   <2e-16 ***
## date2013-12-30  1.260e+02  4.173e-11  3.019e+12   <2e-16 ***
## date2013-12-31 -6.600e+01  4.407e-11 -1.498e+12   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.856e-10 on 336411 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 8.259e+24 on 364 and 336411 DF,  p-value: < 2.2e-16
plot(mod_date, which = 1)

Above, we introduced a new categorical variable, date, which is simply the date stored as a character string. This gives the model one coefficient for every distinct date (an intercept plus 364 date terms, 365 in total), so it is no wonder that the residuals are essentially zero: the model has as many parameters as there are distinct dates, and it can reproduce each day's value exactly.

When a model is too flexible, it fits the noise as well as the signal, so it is not useful for predicting future or unseen samples. In this case, if we used the model to predict the number of flights in 2014, it would be quite far off, since it would simply reuse the 2013 number for the same date while ignoring any other pattern.
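The overfitting described above can be illustrated with a small sketch. It does not use the flights data: the daily counts are simulated, and the names make_year, mmdd, mod_wday, mod_mmdd and rmse are hypothetical ones introduced only for this example. It also keys the flexible model on month and day (mmdd) rather than the full date, an assumption made so that the model can be applied to a second year at all.

```r
# Simulated (not real) daily counts: a weekly pattern plus random noise.
set.seed(1)

make_year <- function(year) {
  date <- seq(as.Date(paste0(year, "-01-01")),
              as.Date(paste0(year, "-12-31")), by = "day")
  # "%u" gives the ISO weekday as "1" (Monday) to "7" (Sunday)
  wday <- factor(format(date, "%u"), levels = as.character(1:7))
  weekend_drop <- ifelse(wday %in% c("6", "7"), -200, 0)
  data.frame(
    date = date,
    wday = wday,
    mmdd = format(date, "%m-%d"),  # calendar day without the year
    n    = 950 + weekend_drop + rnorm(length(date), sd = 30)
  )
}

train <- make_year(2013)
test  <- make_year(2014)

# A simple model (7 coefficients) and an overly flexible one
# (one coefficient per calendar day, i.e. as many as there are rows).
mod_wday <- lm(n ~ wday, data = train)
mod_mmdd <- lm(n ~ mmdd, data = train)

# Root-mean-squared prediction error of a model on a data set
rmse <- function(model, data) {
  sqrt(mean((data$n - predict(model, newdata = data))^2))
}

rmse(mod_mmdd, train)  # essentially zero: the training data is memorised
rmse(mod_wday, train)  # roughly the size of the simulated noise
rmse(mod_mmdd, test)   # much larger on the new year
rmse(mod_wday, test)   # stays close to its training error
```

Under these assumptions, the per-day model has no error on the year it was fitted on, but it does worse than the simple weekday model on the following year, because each calendar day falls on a different weekday and the memorised noise does not repeat.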


3. Brief summary

With these case studies, we learn some basic principles of data modeling: