In Assessment Task 2, our linear model yielded little insight into our questions because the assumptions of linear regression were heavily violated. Furthermore, our dataset was large and we did not aggregate it, which made the analysis difficult and patterns hard to identify. Even though we focused on a single state, California, the delays were recorded in minutes, which made it hard to see the big picture. Therefore, for this assessment, I will aggregate the data to monthly level to gain relevant insights and extend our model from linear regression to multilevel models.
The dataset we obtained is nested, and multilevel models are an extension of regression models designed for nested data. Linear regression assumes that each observation is independent. This assumption was violated in the previous assessment because flights on the same day were correlated; aggregating the data by month removes that problem. However, the observations are still not independent, because flight delays are recorded repeatedly at the same airports over a two-year period. Each airport has its own infrastructure and layout, an idiosyncratic factor that affects all observations from that airport. Extending linear regression to a multilevel model by adding a random effect for airport accounts for this dependence and should give a more accurate model for future use.
Original research questions: What are the main causes of airline delays (i.e. which variables influence departure delays)? How does American Airlines' on-time performance compare with that of our two main competitors?
New research question: Which airline should I pick, given the year, month, and origin airport, to minimize expected departure delays?
For the following analysis, the first original question is dropped because of the structure of the data. The data has many levels, and each level has its own reasons for airline delays, for example the staff at a particular airport. I decided to shift the focus of the research questions to comparing airlines, since the airline is a choice consumers can make when buying their tickets. From my personal experience, there is little flexibility in choosing the departure airport, year, and month, but there are usually flights operated by different airlines on the same route to choose from. Therefore, the main research question and analysis focus on differences in on-time performance between airlines.
In our last assessment, we found out that there are correlations among different states.
I will first create a multilevel model for California, then generalize it to all the airports across different states.
knitr::include_graphics("Multilevel model diagram.jpg")
Figure 1: Multilevel model diagram
Airport is the random effect in the model, representing the inter-dependence of observations from the same airport. Within the same state, California (CA), mean departure delays differ between airports (see Figure 2). Some airports, such as KSBA and KOAK, have the lowest delays; in fact, their means indicate that flights depart early. We will model these individual differences by assuming a different random intercept for each airport.
knitr::include_graphics("Figure 2.jpg")
Figure 2: Departure delays based on airport
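In notation, a random-intercept structure of this kind can be written as follows. This is a sketch of the standard formulation rather than the fitted model; the symbols \(\beta_0\), \(u_j\), \(\sigma_u^2\), and \(\sigma^2\) are notation introduced here for illustration, with \(i\) indexing observations and \(j\) indexing airports.

\(delay_{ij} = \beta_0 + u_j + \varepsilon_{ij}, \quad u_j \sim N(0, \sigma_u^2), \quad \varepsilon_{ij} \sim N(0, \sigma^2)\)

Each airport \(j\) shifts the overall mean \(\beta_0\) by its own amount \(u_j\), so observations from the same airport share that shift.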
Fixed effects variables include:
- Airline
- Year
- Month
- Age of the aircraft
To examine these variables and the assumptions of linear regression, I first fitted a linear regression with all of these fixed effects as a benchmark. The multiple R-squared is 0.4118 and the adjusted R-squared is 0.3983; the overall model is significant (p-value < 2.2e-16).
I will not compare multiple R-squared against models with fewer variables, as multiple R-squared can only drop when variables are removed; adjusted R-squared is used for comparisons instead.
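The behaviour of the two statistics can be illustrated with a small simulation. This is only a sketch on made-up data: the variables `x1`, `x2`, and `y` are hypothetical and unrelated to the flight dataset, and `x2` is pure noise by construction.

```r
# Simulated data: y depends on x1 only; x2 is pure noise (all names hypothetical).
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 + 1.5 * x1 + rnorm(n)

small <- lm(y ~ x1)
large <- lm(y ~ x1 + x2)

# Multiple R-squared cannot decrease when a predictor is added ...
r2 <- c(small = summary(small)$r.squared,
        large = summary(large)$r.squared)
# ... whereas adjusted R-squared penalises the extra parameter and may fall.
adj <- c(small = summary(small)$adj.r.squared,
         large = summary(large)$adj.r.squared)

print(round(rbind(r2, adj), 4))
```

Because the larger model always has a multiple R-squared at least as high as the smaller one, only the adjusted version is informative when models of different size are compared.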
Linear regression model formula:
\(delay = month + year + age + airline + \varepsilon\)
To test whether aircraft age is useful in explaining delays, I compared a linear model without the aircraft age variable against the benchmark model. The model without aircraft age has a lower adjusted R-squared of 0.3845. An ANOVA comparison shows that the two models are significantly different from each other.
However, if we include origin_airport_code, the linear regression model formula becomes:
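A nested-model comparison of this kind can be sketched in base R with `anova()`. The data below are simulated, with hypothetical columns `airline`, `age`, and `delay` and an age effect built in on purpose, so the output illustrates the mechanics only and says nothing about the real flights.

```r
# Simulated data with a built-in age effect (all values hypothetical).
set.seed(2)
n <- 300
sim <- data.frame(
  airline = factor(sample(c("A", "B", "C"), n, replace = TRUE)),
  age     = runif(n, 1, 25)
)
sim$delay <- 3 + 2 * (sim$airline == "B") + 0.4 * sim$age + rnorm(n, sd = 3)

full    <- lm(delay ~ airline + age, data = sim)  # benchmark model
reduced <- lm(delay ~ airline, data = sim)        # same model without age

# F-test of the reduced model against the full model.
cmp <- anova(reduced, full)
print(cmp)
p_value <- cmp[["Pr(>F)"]][2]
```

A small `p_value` indicates that dropping the variable makes the fit significantly worse; it does not by itself say how large the practical contribution is, which is why adjusted R-squared is reported alongside it.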
\(delay = month + year + age + airline + origin airport + \varepsilon\)
The adjusted R-squared is higher, 0.4956, meaning that origin airport explains a substantial share of the variation in departure delays. I then repeated the test on aircraft age: the adjusted R-squared of the model without aircraft age is 0.4951, showing that the contribution of aircraft age is minimal once airport is included. I suspected a strong association between airport and aircraft age and ran a chi-square test of independence to check it. The p-value is 0.3973 (> 0.05), so the test does not reject independence and gives no evidence of such an association; R also warns that the chi-square approximation may be incorrect here, which is expected when one of the variables, such as average age, is continuous. The decision to drop aircraft age therefore rests on its negligible effect on adjusted R-squared and its non-significant coefficient (p = 0.195) in the model that includes origin_airport_code.
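When one variable is a factor and the other is continuous, an association can alternatively be checked with a one-way ANOVA or a Kruskal-Wallis test. The sketch below is not the analysis used in this report; it runs on simulated data in which the hypothetical `airport` groups are given different mean ages on purpose.

```r
# Simulated data: four hypothetical airports with different mean aircraft ages.
set.seed(3)
airport <- factor(rep(c("P1", "P2", "P3", "P4"), each = 40))
shift   <- c(P1 = 0, P2 = 2, P3 = 4, P4 = 6)
ave_age <- 10 + unname(shift[as.character(airport)]) +
  rnorm(length(airport), sd = 2)

# One-way ANOVA: does mean age differ between airports?
aov_fit <- aov(ave_age ~ airport)
aov_p   <- summary(aov_fit)[[1]][["Pr(>F)"]][1]

# Rank-based alternative that does not assume normal residuals.
kw <- kruskal.test(ave_age ~ airport)

print(c(anova_p = aov_p, kruskal_p = kw$p.value))
```

Both tests treat the continuous variable as continuous instead of as a set of categories, which avoids the sparse contingency table behind the chi-square warning.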
knitr::include_graphics("old diagnostic plot.jpg")
Figure 3: Old diagnostic plot (From AT2B)
knitr::include_graphics("new diagnostic plot .jpg")
Figure 4: New diagnostic plot
The new aggregated data satisfies most assumptions of the linear model except normality of the residuals. The normal Q-Q plot in Figure 4 shows heavy tails, but the residuals vs leverage plot shows no influential observations. Compared with the old diagnostic plots (Figure 3), the new diagnostic plots (Figure 4) show a huge improvement.
Building on the linear regression results above, the random intercept multilevel model is:
\(delay = month + year + airline + (1|origin airport) + \varepsilon\)
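A model of this form could be fitted with `lmer()` from the lme4 package roughly as follows. This is a minimal sketch on simulated data, not the call used for the report: the column names mirror those in the appendix, while the airport labels, effect sizes, and noise levels are invented for illustration.

```r
library(lme4)

# Simulated balanced panel: 12 hypothetical airports x 3 airlines x 2 years x 12 months.
set.seed(4)
airports <- paste0("AP", 1:12)
sim <- expand.grid(
  origin_airport_code = airports,
  airline = c("American", "Delta", "United"),
  year    = c("2019", "2020"),
  month   = as.character(1:12),
  stringsAsFactors = TRUE
)

# Each airport gets its own intercept shift (the random effect being modelled).
airport_shift <- setNames(rnorm(length(airports), sd = 2), airports)
sim$avg_delay <- 5 +
  unname(airport_shift[as.character(sim$origin_airport_code)]) -
  2 * (sim$airline == "United") -
  5 * (sim$year == "2020") +
  rnorm(nrow(sim), sd = 3)

# Fixed effects for month, year, airline; random intercept for airport.
ri_fit <- lmer(avg_delay ~ month + year + airline + (1 | origin_airport_code),
               data = sim)

print(VarCorr(ri_fit))                    # airport and residual variance components
print(fixef(ri_fit)["airlineUnited"])     # airline effect relative to the reference level
```

The `(1 | origin_airport_code)` term is what turns the linear regression into a random intercept model: airports are no longer separate dummy coefficients but draws from a common distribution.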
knitr::include_graphics("lmer1.jpg")
The variance attributed to airports, i.e. idiosyncratic differences between airports, is around 3.169. The residual variance, i.e. variation that the model cannot explain, is 9.583.
From the above result, several variables are statistically significant, especially year and United Airlines. Departure delays are largest in 2019, while delays in 2020 and 2021 are much lower. According to Figure 5, the mean delay in 2021 is higher than in 2020; however, the coefficient for 2021 is smaller than the coefficient for 2020. There might be interactions among the fixed effects.
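One way to read these two numbers together is the intraclass correlation, the share of the remaining variation that sits between airports. The calculation below is a sketch that assumes both reported figures are variances (not standard deviations) taken from the model output.

```r
# Variance components as reported above (assumed to be variances).
airport_var  <- 3.169
residual_var <- 9.583

# Intraclass correlation: between-airport variance over total variance.
icc <- airport_var / (airport_var + residual_var)
print(round(icc, 3))   # about 0.25
```

Under that assumption, roughly a quarter of the variation left after the fixed effects is attributable to differences between airports.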
knitr::include_graphics("Figure 5.jpg")
Figure 5. Box plot of delay vs year
Year and month should also be treated as random effects, because each data point from an airport belongs to a particular year and month, e.g. March 2020. To account for this dependence between observations, year and month are also modelled as random effects.
Updated random intercept multilevel model:
\(delay = airline + (1|month) + (1|year) + (1|origin airport ) + \varepsilon\)
knitr::include_graphics("lmer3.jpg")
After adding month and year as random effects, origin_airport_code variation increased slightly. Furthermore, year accounts for the most variation.
knitr::include_graphics("coef lmer3.jpg")
The intercept varies among the random effects but slopes are fixed. It would be interesting to investigate whether there are interactions between airlines and the random variables.
Using interaction terms in linear regression to test airline & origin_airport_code:
Formula 1:
\(delay = airline + origin airport + \varepsilon\)
Formula 2:
\(delay = airline * origin airport + \varepsilon\)
The adjusted R-squared of the first model is 0.08489; that of the second model, which contains the interaction effect, is 0.1342. The ANOVA test also shows a significant difference between the two models.
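The additive-versus-interaction comparison can be sketched in base R as below. The data are simulated with an interaction built in for one airline at one airport; the factor levels and effect sizes are hypothetical.

```r
# Simulated data: airline B is delayed more, but only at airport P3 (hypothetical).
set.seed(5)
sim <- expand.grid(
  airline = c("A", "B"),
  airport = c("P1", "P2", "P3"),
  rep     = 1:50,
  stringsAsFactors = TRUE
)
sim$delay <- 2 +
  1.5 * (sim$airline == "B") +
  3 * (sim$airline == "B" & sim$airport == "P3") +
  rnorm(nrow(sim), sd = 2)

additive <- lm(delay ~ airline + airport, data = sim)  # main effects only
with_int <- lm(delay ~ airline * airport, data = sim)  # main effects + interaction

# F-test: do the interaction terms improve the fit?
cmp <- anova(additive, with_int)
print(cmp)
int_p <- cmp[["Pr(>F)"]][2]
```

In R formulas, `airline * airport` expands to `airline + airport + airline:airport`, so the two models differ only in the interaction terms and the F-test isolates their contribution.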
Hence, the interaction effect is included in the random slope multilevel model. Furthermore, the fixed-effect estimates would ideally be uncorrelated, but the model output shows some correlation (0.426) between the airline estimates. This is a correlation between coefficient estimates rather than between observations, so it indicates that the airline effects are not estimated independently of one another.
For airline and the other random-effect variables, there are no interaction effects.
Final random slope multilevel model:
\(delay = airline + (1|month) + (1|year) + (1+ airline|origin airport) + \varepsilon\)
knitr::include_graphics("lmer4.jpg")
Among all the multilevel models fitted, the final model has the lowest AIC, indicating the best trade-off between fit and complexity, and it has the smallest residual variance. The interaction between airports and airlines accounts for more variation than airports alone did in the previous model. After allowing the slope to vary, the correlations among the fixed-effect estimates are close to 0, meaning the airline effects are now estimated almost independently of each other.
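The comparison between a random intercept and a random slope specification can be sketched with lme4 as follows. The data are simulated so that airline effects really do differ by airport; the year term is left out to keep the simulation small, and all labels and effect sizes are hypothetical.

```r
library(lme4)

# Simulated data: 30 hypothetical airports, 3 airlines, 12 months, 2 replicates.
set.seed(6)
airports <- paste0("AP", 1:30)
airlines <- c("American", "Delta", "United")
sim <- expand.grid(
  origin_airport_code = airports,
  airline = airlines,
  month   = as.character(1:12),
  rep     = 1:2,
  stringsAsFactors = TRUE
)

# Airport-level intercept shifts and airport-specific airline effects.
intercept_shift <- setNames(rnorm(30, sd = 2), airports)
slope_shift <- matrix(rnorm(30 * 3, sd = 2), nrow = 30,
                      dimnames = list(airports, airlines))
slope_shift[, "American"] <- 0            # reference airline has no extra shift
month_shift <- setNames(rnorm(12, sd = 1), as.character(1:12))

idx <- cbind(as.character(sim$origin_airport_code), as.character(sim$airline))
sim$avg_delay <- 1 +
  unname(intercept_shift[as.character(sim$origin_airport_code)]) +
  slope_shift[idx] +
  unname(month_shift[as.character(sim$month)]) -
  2 * (sim$airline == "United") +
  rnorm(nrow(sim), sd = 3)

# Random intercept only versus random intercept plus random airline slope.
ri_fit <- lmer(avg_delay ~ airline + (1 | month) + (1 | origin_airport_code),
               data = sim)
rs_fit <- lmer(avg_delay ~ airline + (1 | month) +
                 (1 + airline | origin_airport_code), data = sim)

print(AIC(ri_fit, rs_fit))                   # lower AIC is preferred
print(head(coef(rs_fit)$origin_airport_code)) # airport-specific intercepts and airline effects
```

Both fits share the same fixed effects, so their AIC values are comparable; `coef()` then returns one row per airport, which is the kind of table used to compare airlines at a given airport.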
Research question - How is American Airlines on-time performance compared to our two main competitors?
The intercept and the Delta Air Lines coefficient are not statistically significant, as their t values are low, implying no detectable difference in on-time performance between American Airlines and Delta Air Lines. Meanwhile, United Airlines has better on-time performance, with a coefficient of -2.21. This result applies only to California.
The coefficients of the random effects are as follows:
knitr::include_graphics("coef lmer4.jpg")
Research question - which airline should I choose for my flight?
Suppose I want to take a flight from KBUR in November 2021. I should take United Airlines, as the KBUR coefficient for United Air Lines Inc. is -5.98 while that for Delta Air Lines is 2.1637.
The ranking would be: United Airlines > American Airlines > Delta Airlines, based on the mean.
Regardless of the year and month, I should always take United Airlines if I’m departing from KBUR.
Using our model for prediction, a flight departing from KBUR in November 2021 is expected to depart early with any of the three airlines.
American Airlines: Early by 2.73 minutes +/- 2 * 1.8203
CI: [-6.3706, 0.9106]
Delta Airlines: Early by 0.57 minutes +/- 2 * 0.4382
CI: [-1.4464, 0.3064]
United Airlines: Early by 8.71 minutes +/- 2 * 0.7757
CI: [-10.2614, -7.1586]
Each confidence interval (CI) is the estimate plus or minus 2 standard errors, i.e. approximately a 95% interval. Negative values mean the flight departs early.
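The interval arithmetic can be checked directly from the estimates and standard errors quoted above. The snippet is a simple sketch of the estimate plus or minus 2 standard errors rule; the vector and data frame names are introduced here for illustration.

```r
# Predicted delays (minutes, negative = early) and standard errors from the text.
est <- c(American = -2.73, Delta = -0.57, United = -8.71)
se  <- c(American = 1.8203, Delta = 0.4382, United = 0.7757)

# Approximate 95% interval: estimate +/- 2 standard errors.
intervals <- data.frame(
  estimate = est,
  lower    = est - 2 * se,
  upper    = est + 2 * se
)
print(round(intervals, 4))
```

An interval that lies entirely below zero, as United's does, means the model expects an early departure even at the upper end of the range.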
Even though American Airlines has a lower mean departure delay than Delta Airlines, their confidence intervals overlap, so customers can choose between American Airlines and Delta Airlines based on their preference.
For example, if I don’t care whether the flight departs early and only want to minimize the delay if the flight is late, then I would pick Delta Airlines.
Diagnostic plots:
knitr::include_graphics("lmer4 diagnostic.jpg")
The diagnostic results are similar to those of the linear model. All the assumptions of the multilevel model are satisfied except normality, as there are heavy tails due to outliers. The outliers are not influential points, so they will not have a big impact on the analysis. Thus, the model is still reliable for prediction and analysis.
To extend the model to other states as well, the multilevel model becomes:
\(delay = airline + (1|month) + (1|year) + (1|origin state) + (1+ airline|origin airport) + \varepsilon\)
There are interaction effects for both airline & origin_state_abr and airline & origin_airport_code; however, the model cannot include both, as that would overfit. To decide which to use, I fitted both models and found that the one capturing the airline & origin_state_abr interaction fits better and accounts for more variation in the random effects.
Research question - How is American Airlines on-time performance compared to our two main competitors?
Considering the top 50 airports in America across all states, instead of only airports in California, Delta Airlines performs the best, followed by United Airlines, with American Airlines the worst.
Even though this multilevel model is more generalizable than the California-specific one, it adds an extra level to the model and increases computational complexity.
The multilevel model is a useful tool to help customers decide which airline to take based on year, month, and departure airport. It also gives an overview of how the big 3 airlines perform at each airport, making it a stepping stone for them to investigate further the airports where they perform badly against competitors. The first original research question, ‘what are the main causes of airline delays?’, can then be answered for a specific airport.
In Assessment Task 2, the statistical models (linear and logistic regression) yielded few insights, as we were unaware that the complexity of the data structure would render our analysis useless. Data engineering, i.e. aggregating the data, is needed to simplify the data structure, and models beyond linear and logistic regression have to be considered to obtain better results. After aggregating the data, the analysis obtained through multilevel models is generalizable to all airports; it also helps customers make decisions and gives the big 3 airline companies insight into their competitiveness in terms of on-time performance.
Winter, B. (2013, August 26). Linear models and linear mixed effects models in R with linguistic. . . ArXiv.Org. https://arxiv.org/abs/1308.5499
Bates, D., Mächler, M., Bolker, B., & Walker, S. (2015). Fitting Linear Mixed-Effects Models Using lme4. Journal of Statistical Software, 67(1). https://doi.org/10.18637/jss.v067.i01
Read file
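# Packages needed by the code below. Assumption: read_feather() is taken from arrow
# (the feather package provides a function of the same name).
library(arrow)
library(here)
library(dplyr)
library(ggplot2)
raw_data <- read_feather(here('mlm_dataset_5.feather'))
Aggregate data by month - data1
data1 <- raw_data %>%
  filter(origin_state_abr == "CA") %>%
  group_by(origin_airport_code, airline, year, month) %>%
  summarise(avg_delay = mean(dep_delay), ave_age = mean(age)) %>%
  droplevels()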
## `summarise()` has grouped output by 'origin_airport_code', 'airline', 'year'. You can override using the `.groups` argument.
#%>% dplyr::summarise(avg_delay = mean(dep_delay), avg_air = mean(air_time))
Aggregate data for logistic regression - data2
data2 <- data1 %>% mutate(delay = ifelse(avg_delay > 0, 1, 0)) %>% select(-avg_delay)
chisq.test(data1$origin_airport_code, data1$ave_age)
## Warning in chisq.test(data1$origin_airport_code, data1$ave_age): Chi-squared
## approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: data1$origin_airport_code and data1$ave_age
## X-squared = 8409.1, df = 8376, p-value = 0.3973
Plot box
data1 %>% ggplot(aes(x = avg_delay, y = year)) + geom_boxplot(alpha = 0.2) + geom_point(alpha = 0.3)
airport boxplot
data1 %>% ggplot(aes(x = avg_delay, y = origin_airport_code)) + geom_boxplot(alpha = 0.2) + geom_point(alpha = 0.3)
Linear regression
lm1 <- lm(avg_delay ~ month + ave_age + year + airline + origin_airport_code ,data = data1)
plot(lm1)
summary(lm1)
##
## Call:
## lm(formula = avg_delay ~ month + ave_age + year + airline + origin_airport_code,
## data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.3436 -1.8405 -0.0812 1.6016 16.0585
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.82581 0.81182 7.176 1.87e-12 ***
## month2 0.98353 0.55037 1.787 0.074376 .
## month3 -0.02812 0.54708 -0.051 0.959022
## month4 -2.77752 0.58016 -4.787 2.07e-06 ***
## month5 -0.42761 0.60432 -0.708 0.479443
## month6 0.39870 0.60804 0.656 0.512227
## month7 -1.01172 0.59423 -1.703 0.089103 .
## month8 -1.04270 0.58860 -1.771 0.076922 .
## month9 -2.36898 0.58598 -4.043 5.88e-05 ***
## month10 -1.23109 0.57786 -2.130 0.033492 *
## month11 -1.95104 0.58293 -3.347 0.000862 ***
## month12 -0.09814 0.58085 -0.169 0.865882
## ave_age -0.05172 0.03984 -1.298 0.194575
## year2020 -5.39344 0.24783 -21.763 < 2e-16 ***
## year2021 -6.88777 0.56117 -12.274 < 2e-16 ***
## airlineDelta Air Lines Inc. 0.37603 0.36172 1.040 0.298905
## airlineUnited Air Lines Inc. -1.87326 0.38359 -4.883 1.30e-06 ***
## origin_airport_codeKFAT 0.87288 0.68353 1.277 0.202027
## origin_airport_codeKLAX 0.45120 0.57932 0.779 0.436340
## origin_airport_codeKLGB 0.86158 1.11408 0.773 0.439579
## origin_airport_codeKOAK -4.97019 0.72527 -6.853 1.61e-11 ***
## origin_airport_codeKONT -2.31271 0.60158 -3.844 0.000132 ***
## origin_airport_codeKPSP 0.34626 0.61845 0.560 0.575744
## origin_airport_codeKSAN -1.74473 0.57809 -3.018 0.002638 **
## origin_airport_codeKSBA -4.54528 0.90677 -5.013 6.84e-07 ***
## origin_airport_codeKSFO -0.21008 0.58317 -0.360 0.718785
## origin_airport_codeKSJC -2.48899 0.60051 -4.145 3.83e-05 ***
## origin_airport_codeKSMF -0.95413 0.57977 -1.646 0.100283
## origin_airport_codeKSNA -0.98079 0.57833 -1.696 0.090359 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.127 on 686 degrees of freedom
## Multiple R-squared: 0.5154, Adjusted R-squared: 0.4956
## F-statistic: 26.05 on 28 and 686 DF, p-value: < 2.2e-16
lm2 - without aircraft age
lm2 <- lm(avg_delay ~ month + year + airline + origin_airport_code, data = data1)
summary(lm2)
##
## Call:
## lm(formula = avg_delay ~ month + year + airline + origin_airport_code,
## data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.4243 -1.7643 -0.0633 1.5610 16.3044
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.15565 0.62694 8.223 9.88e-16 ***
## month2 0.98180 0.55065 1.783 0.075029 .
## month3 -0.03073 0.54735 -0.056 0.955242
## month4 -2.73509 0.57953 -4.720 2.87e-06 ***
## month5 -0.40118 0.60428 -0.664 0.506975
## month6 0.39821 0.60834 0.655 0.512961
## month7 -1.00754 0.59452 -1.695 0.090585 .
## month8 -1.02579 0.58875 -1.742 0.081899 .
## month9 -2.33225 0.58559 -3.983 7.54e-05 ***
## month10 -1.19130 0.57734 -2.063 0.039446 *
## month11 -1.92893 0.58297 -3.309 0.000986 ***
## month12 -0.04016 0.57942 -0.069 0.944763
## year2020 -5.36862 0.24721 -21.717 < 2e-16 ***
## year2021 -6.81362 0.55853 -12.199 < 2e-16 ***
## airlineDelta Air Lines Inc. 0.09867 0.29205 0.338 0.735568
## airlineUnited Air Lines Inc. -2.19533 0.29275 -7.499 1.99e-13 ***
## origin_airport_codeKFAT 0.82246 0.68276 1.205 0.228774
## origin_airport_codeKLAX 0.53348 0.57614 0.926 0.354793
## origin_airport_codeKLGB 0.72644 1.10976 0.655 0.512949
## origin_airport_codeKOAK -5.15615 0.71134 -7.249 1.14e-12 ***
## origin_airport_codeKONT -2.31752 0.60187 -3.851 0.000129 ***
## origin_airport_codeKPSP 0.29624 0.61756 0.480 0.631591
## origin_airport_codeKSAN -1.67866 0.57614 -2.914 0.003688 **
## origin_airport_codeKSBA -4.78873 0.88762 -5.395 9.44e-08 ***
## origin_airport_codeKSFO -0.09048 0.57614 -0.157 0.875250
## origin_airport_codeKSJC -2.51392 0.60051 -4.186 3.20e-05 ***
## origin_airport_codeKSMF -0.90217 0.57867 -1.559 0.119450
## origin_airport_codeKSNA -0.99267 0.57855 -1.716 0.086649 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.128 on 687 degrees of freedom
## Multiple R-squared: 0.5142, Adjusted R-squared: 0.4951
## F-statistic: 26.93 on 27 and 687 DF, p-value: < 2.2e-16
anova(lm1, lm2)
lm3 <- lm(avg_delay ~ airline , data = data1)
summary(lm3)
##
## Call:
## lm(formula = avg_delay ~ airline, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.128 -3.084 -0.329 2.722 21.043
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.466927 0.259104 1.802 0.071956 .
## airlineDelta Air Lines Inc. 0.008991 0.389229 0.023 0.981577
## airlineUnited Air Lines Inc. -1.338767 0.400177 -3.345 0.000865 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.367 on 712 degrees of freedom
## Multiple R-squared: 0.01905, Adjusted R-squared: 0.0163
## F-statistic: 6.914 on 2 and 712 DF, p-value: 0.001062
lm4
lm4 <- lm(avg_delay ~ origin_airport_code * year, data = data1)
summary(lm4)
##
## Call:
## lm(formula = avg_delay ~ origin_airport_code * year, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.3320 -1.6051 -0.2122 1.6129 17.6947
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.8238 0.6623 5.773 1.18e-08 ***
## origin_airport_codeKFAT 1.6528 0.9803 1.686 0.092251 .
## origin_airport_codeKLAX 0.7672 0.8622 0.890 0.373879
## origin_airport_codeKLGB -0.1699 2.0235 -0.084 0.933123
## origin_airport_codeKOAK -5.2060 0.9568 -5.441 7.41e-08 ***
## origin_airport_codeKONT -4.0318 0.8672 -4.649 4.01e-06 ***
## origin_airport_codeKPSP 0.0082 0.9038 0.009 0.992764
## origin_airport_codeKSAN -1.6794 0.8622 -1.948 0.051836 .
## origin_airport_codeKSBA -6.7131 1.1982 -5.603 3.07e-08 ***
## origin_airport_codeKSFO 1.2101 0.8622 1.404 0.160919
## origin_airport_codeKSJC -2.4048 0.8622 -2.789 0.005432 **
## origin_airport_codeKSMF -2.1573 0.8622 -2.502 0.012578 *
## origin_airport_codeKSNA -0.9158 0.8622 -1.062 0.288503
## year2020 -5.3569 0.9803 -5.465 6.52e-08 ***
## year2021 -9.4041 2.4336 -3.864 0.000122 ***
## origin_airport_codeKFAT:year2020 -2.2559 1.4881 -1.516 0.129987
## origin_airport_codeKLAX:year2020 -0.8463 1.2531 -0.675 0.499672
## origin_airport_codeKLGB:year2020 1.7081 2.5386 0.673 0.501283
## origin_airport_codeKOAK:year2020 2.5209 1.5604 1.616 0.106651
## origin_airport_codeKONT:year2020 4.3173 1.3158 3.281 0.001087 **
## origin_airport_codeKPSP:year2020 0.2212 1.3643 0.162 0.871232
## origin_airport_codeKSAN:year2020 -0.2118 1.2531 -0.169 0.865801
## origin_airport_codeKSBA:year2020 4.8562 1.9457 2.496 0.012802 *
## origin_airport_codeKSFO:year2020 -2.7376 1.2531 -2.185 0.029253 *
## origin_airport_codeKSJC:year2020 -0.0455 1.3055 -0.035 0.972208
## origin_airport_codeKSMF:year2020 2.1095 1.2602 1.674 0.094607 .
## origin_airport_codeKSNA:year2020 -0.4014 1.2602 -0.318 0.750210
## origin_airport_codeKFAT:year2021 2.2842 3.4537 0.661 0.508607
## origin_airport_codeKLAX:year2021 3.9503 2.8381 1.392 0.164416
## origin_airport_codeKLGB:year2021 8.7501 4.5327 1.930 0.053970 .
## origin_airport_codeKOAK:year2021 NA NA NA NA
## origin_airport_codeKONT:year2021 7.5604 2.9963 2.523 0.011853 *
## origin_airport_codeKPSP:year2021 5.8679 2.8510 2.058 0.039957 *
## origin_airport_codeKSAN:year2021 3.1919 2.8381 1.125 0.261126
## origin_airport_codeKSBA:year2021 NA NA NA NA
## origin_airport_codeKSFO:year2021 1.4290 2.8381 0.504 0.614772
## origin_airport_codeKSJC:year2021 1.6423 3.4221 0.480 0.631445
## origin_airport_codeKSMF:year2021 6.1471 2.8381 2.166 0.030667 *
## origin_airport_codeKSNA:year2021 2.7394 2.8381 0.965 0.334780
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.312 on 678 degrees of freedom
## Multiple R-squared: 0.4627, Adjusted R-squared: 0.4342
## F-statistic: 16.22 on 36 and 678 DF, p-value: < 2.2e-16
–> change No difference in Adjusted R-squared –> no interaction
anova(lm3, lm4)
There is a difference –> interaction is present
lm5 - airline & origin_airport_code
lm5 <- lm(avg_delay ~ airline + origin_airport_code , data = data1)
summary(lm5)
##
## Call:
## lm(formula = avg_delay ~ airline + origin_airport_code, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.9423 -3.1195 -0.0521 2.7534 19.8037
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.4864 0.6319 2.352 0.018945 *
## airlineDelta Air Lines Inc. 0.2285 0.3924 0.582 0.560623
## airlineUnited Air Lines Inc. -1.5441 0.3915 -3.944 8.82e-05 ***
## origin_airport_codeKFAT 1.0981 0.9185 1.195 0.232320
## origin_airport_codeKLAX 0.2607 0.7742 0.337 0.736462
## origin_airport_codeKLGB -0.3155 1.4894 -0.212 0.832284
## origin_airport_codeKOAK -3.9205 0.9542 -4.109 4.45e-05 ***
## origin_airport_codeKONT -2.0040 0.8092 -2.477 0.013498 *
## origin_airport_codeKPSP 0.4883 0.8285 0.589 0.555799
## origin_airport_codeKSAN -1.9515 0.7742 -2.521 0.011937 *
## origin_airport_codeKSBA -4.1890 1.1929 -3.511 0.000474 ***
## origin_airport_codeKSFO -0.3633 0.7742 -0.469 0.639038
## origin_airport_codeKSJC -2.1134 0.8075 -2.617 0.009058 **
## origin_airport_codeKSMF -1.1203 0.7780 -1.440 0.150339
## origin_airport_codeKSNA -1.2245 0.7778 -1.574 0.115881
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.212 on 700 degrees of freedom
## Multiple R-squared: 0.1028, Adjusted R-squared: 0.08489
## F-statistic: 5.731 on 14 and 700 DF, p-value: 1.199e-10
lm6
lm6 <- lm(avg_delay ~ airline * origin_airport_code , data = data1)
summary(lm6)
##
## Call:
## lm(formula = avg_delay ~ airline * origin_airport_code, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.3754 -3.0324 -0.1989 2.6510 18.2852
##
## Coefficients: (5 not defined because of singularities)
## Estimate Std. Error
## (Intercept) 1.2550 0.8542
## airlineDelta Air Lines Inc. 4.4728 1.5018
## airlineUnited Air Lines Inc. -4.0857 1.3887
## origin_airport_codeKFAT -0.1403 1.1727
## origin_airport_codeKLAX 1.5884 1.1727
## origin_airport_codeKLGB -4.3285 1.7899
## origin_airport_codeKOAK -2.3774 1.3887
## origin_airport_codeKONT -0.3038 1.1727
## origin_airport_codeKPSP 0.8502 1.1727
## origin_airport_codeKSAN -1.6898 1.1727
## origin_airport_codeKSBA -3.8193 1.4215
## origin_airport_codeKSFO -0.6832 1.1727
## origin_airport_codeKSJC -3.2685 1.1727
## origin_airport_codeKSMF -1.0981 1.1727
## origin_airport_codeKSNA -0.6734 1.1727
## airlineDelta Air Lines Inc.:origin_airport_codeKFAT NA NA
## airlineUnited Air Lines Inc.:origin_airport_codeKFAT 7.1958 1.9931
## airlineDelta Air Lines Inc.:origin_airport_codeKLAX -5.3153 1.8831
## airlineUnited Air Lines Inc.:origin_airport_codeKLAX 0.3235 1.7942
## airlineDelta Air Lines Inc.:origin_airport_codeKLGB NA NA
## airlineUnited Air Lines Inc.:origin_airport_codeKLGB NA NA
## airlineDelta Air Lines Inc.:origin_airport_codeKOAK -6.4742 2.0720
## airlineUnited Air Lines Inc.:origin_airport_codeKOAK NA NA
## airlineDelta Air Lines Inc.:origin_airport_codeKONT -6.3888 1.8900
## airlineUnited Air Lines Inc.:origin_airport_codeKONT -0.7023 1.9931
## airlineDelta Air Lines Inc.:origin_airport_codeKPSP -5.3249 2.0735
## airlineUnited Air Lines Inc.:origin_airport_codeKPSP 2.8559 1.8725
## airlineDelta Air Lines Inc.:origin_airport_codeKSAN -4.6870 1.8831
## airlineUnited Air Lines Inc.:origin_airport_codeKSAN 2.8934 1.7942
## airlineDelta Air Lines Inc.:origin_airport_codeKSBA NA NA
## airlineUnited Air Lines Inc.:origin_airport_codeKSBA 1.9535 2.7230
## airlineDelta Air Lines Inc.:origin_airport_codeKSFO -4.7696 1.8831
## airlineUnited Air Lines Inc.:origin_airport_codeKSFO 4.7206 1.7942
## airlineDelta Air Lines Inc.:origin_airport_codeKSJC -2.2467 1.9339
## airlineUnited Air Lines Inc.:origin_airport_codeKSJC 5.3294 1.8871
## airlineDelta Air Lines Inc.:origin_airport_codeKSMF -4.0418 1.8831
## airlineUnited Air Lines Inc.:origin_airport_codeKSMF 2.9848 1.8092
## airlineDelta Air Lines Inc.:origin_airport_codeKSNA -3.3760 1.8974
## airlineUnited Air Lines Inc.:origin_airport_codeKSNA 0.8054 1.7942
## t value Pr(>|t|)
## (Intercept) 1.469 0.142232
## airlineDelta Air Lines Inc. 2.978 0.003001 **
## airlineUnited Air Lines Inc. -2.942 0.003370 **
## origin_airport_codeKFAT -0.120 0.904794
## origin_airport_codeKLAX 1.355 0.176014
## origin_airport_codeKLGB -2.418 0.015857 *
## origin_airport_codeKOAK -1.712 0.087345 .
## origin_airport_codeKONT -0.259 0.795672
## origin_airport_codeKPSP 0.725 0.468703
## origin_airport_codeKSAN -1.441 0.150038
## origin_airport_codeKSBA -2.687 0.007389 **
## origin_airport_codeKSFO -0.583 0.560352
## origin_airport_codeKSJC -2.787 0.005464 **
## origin_airport_codeKSMF -0.936 0.349367
## origin_airport_codeKSNA -0.574 0.565989
## airlineDelta Air Lines Inc.:origin_airport_codeKFAT NA NA
## airlineUnited Air Lines Inc.:origin_airport_codeKFAT 3.610 0.000328 ***
## airlineDelta Air Lines Inc.:origin_airport_codeKLAX -2.823 0.004903 **
## airlineUnited Air Lines Inc.:origin_airport_codeKLAX 0.180 0.856993
## airlineDelta Air Lines Inc.:origin_airport_codeKLGB NA NA
## airlineUnited Air Lines Inc.:origin_airport_codeKLGB NA NA
## airlineDelta Air Lines Inc.:origin_airport_codeKOAK -3.125 0.001856 **
## airlineUnited Air Lines Inc.:origin_airport_codeKOAK NA NA
## airlineDelta Air Lines Inc.:origin_airport_codeKONT -3.380 0.000765 ***
## airlineUnited Air Lines Inc.:origin_airport_codeKONT -0.352 0.724671
## airlineDelta Air Lines Inc.:origin_airport_codeKPSP -2.568 0.010438 *
## airlineUnited Air Lines Inc.:origin_airport_codeKPSP 1.525 0.127679
## airlineDelta Air Lines Inc.:origin_airport_codeKSAN -2.489 0.013051 *
## airlineUnited Air Lines Inc.:origin_airport_codeKSAN 1.613 0.107297
## airlineDelta Air Lines Inc.:origin_airport_codeKSBA NA NA
## airlineUnited Air Lines Inc.:origin_airport_codeKSBA 0.717 0.473368
## airlineDelta Air Lines Inc.:origin_airport_codeKSFO -2.533 0.011540 *
## airlineUnited Air Lines Inc.:origin_airport_codeKSFO 2.631 0.008707 **
## airlineDelta Air Lines Inc.:origin_airport_codeKSJC -1.162 0.245744
## airlineUnited Air Lines Inc.:origin_airport_codeKSJC 2.824 0.004879 **
## airlineDelta Air Lines Inc.:origin_airport_codeKSMF -2.146 0.032199 *
## airlineUnited Air Lines Inc.:origin_airport_codeKSMF 1.650 0.099439 .
## airlineDelta Air Lines Inc.:origin_airport_codeKSNA -1.779 0.075639 .
## airlineUnited Air Lines Inc.:origin_airport_codeKSNA 0.449 0.653655
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.097 on 681 degrees of freedom
## Multiple R-squared: 0.1742, Adjusted R-squared: 0.1342
## F-statistic: 4.352 on 33 and 681 DF, p-value: 5.404e-14
In the previous assessment, we tested interaction (inter-dependence) among variables by checking the statistical significance of the interaction variables. This poses a challenge, as there are a lot of interaction variables. The adjusted R-squared for the model that takes the interaction into account is 0.4342, while that of the model that does not is 0.3945, and both models are statistically significant. The model that accounts for inter-dependence explains the departure delay better.
Test whether there is any significant inter-dependence
anova(lm5, lm6)
The p-value is statistically significant, meaning there is a difference between these 2 models. We need to take this interaction effect into account in the mixed model.
lm7 - test month & airline
lm7 <- lm(avg_delay ~ airline + month , data = data1)
summary(lm7)
##
## Call:
## lm(formula = avg_delay ~ airline + month, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.9643 -3.0615 -0.2416 2.8177 19.9071
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.94012 0.59274 1.586 0.11317
## airlineDelta Air Lines Inc. 0.05144 0.38380 0.134 0.89342
## airlineUnited Air Lines Inc. -1.29619 0.39423 -3.288 0.00106 **
## month2 0.02217 0.72839 0.030 0.97573
## month3 -1.10846 0.72302 -1.533 0.12570
## month4 -2.32934 0.79510 -2.930 0.00350 **
## month5 0.29473 0.82813 0.356 0.72203
## month6 1.21051 0.83292 1.453 0.14658
## month7 -0.34253 0.81459 -0.420 0.67426
## month8 -0.29125 0.80639 -0.361 0.71808
## month9 -1.72793 0.80241 -2.153 0.03162 *
## month10 -0.66418 0.79154 -0.839 0.40170
## month11 -1.27094 0.79872 -1.591 0.11201
## month12 0.61982 0.79512 0.780 0.43593
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.298 on 701 degrees of freedom
## Multiple R-squared: 0.06416, Adjusted R-squared: 0.0468
## F-statistic: 3.697 on 13 and 701 DF, p-value: 1.027e-05
lm8
lm8 <- lm(avg_delay ~ airline * month , data = data1)
summary(lm8)
##
## Call:
## lm(formula = avg_delay ~ airline * month, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.6398 -2.9774 -0.2722 2.6974 19.2316
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.0640 0.9194 1.157 0.248
## airlineDelta Air Lines Inc. 0.3676 1.3323 0.276 0.783
## airlineUnited Air Lines Inc. -2.0604 1.3705 -1.503 0.133
## month2 -0.4743 1.1943 -0.397 0.691
## month3 -1.3671 1.1943 -1.145 0.253
## month4 -1.9532 1.2860 -1.519 0.129
## month5 0.4917 1.3002 0.378 0.705
## month6 1.7618 1.3002 1.355 0.176
## month7 -0.5849 1.3002 -0.450 0.653
## month8 -1.3212 1.3002 -1.016 0.310
## month9 -1.8926 1.3156 -1.439 0.151
## month10 0.3802 1.3002 0.292 0.770
## month11 -1.1281 1.3002 -0.868 0.386
## month12 -0.7801 1.3002 -0.600 0.549
## airlineDelta Air Lines Inc.:month2 1.1871 1.7449 0.680 0.497
## airlineUnited Air Lines Inc.:month2 0.3615 1.7984 0.201 0.841
## airlineDelta Air Lines Inc.:month3 0.1436 1.7379 0.083 0.934
## airlineUnited Air Lines Inc.:month3 0.7150 1.7818 0.401 0.688
## airlineDelta Air Lines Inc.:month4 -1.3410 1.9017 -0.705 0.481
## airlineUnited Air Lines Inc.:month4 0.1357 1.9619 0.069 0.945
## airlineDelta Air Lines Inc.:month5 -1.9449 2.0126 -0.966 0.334
## airlineUnited Air Lines Inc.:month5 1.1537 2.0129 0.573 0.567
## airlineDelta Air Lines Inc.:month6 -2.5525 2.0126 -1.268 0.205
## airlineUnited Air Lines Inc.:month6 0.5077 2.0381 0.249 0.803
## airlineDelta Air Lines Inc.:month7 -0.4634 1.9449 -0.238 0.812
## airlineUnited Air Lines Inc.:month7 1.3727 2.0129 0.682 0.496
## airlineDelta Air Lines Inc.:month8 0.5529 1.9449 0.284 0.776
## airlineUnited Air Lines Inc.:month8 2.9530 1.9712 1.498 0.135
## airlineDelta Air Lines Inc.:month9 -0.1711 1.9219 -0.089 0.929
## airlineUnited Air Lines Inc.:month9 0.7414 1.9814 0.374 0.708
## airlineDelta Air Lines Inc.:month10 -2.3629 1.8842 -1.254 0.210
## airlineUnited Air Lines Inc.:month10 -0.9123 1.9712 -0.463 0.644
## airlineDelta Air Lines Inc.:month11 -0.2687 1.9272 -0.139 0.889
## airlineUnited Air Lines Inc.:month11 -0.1619 1.9538 -0.083 0.934
## airlineDelta Air Lines Inc.:month12 1.6353 1.8842 0.868 0.386
## airlineUnited Air Lines Inc.:month12 3.0110 1.9908 1.512 0.131
##
## Residual standard error: 4.312 on 679 degrees of freedom
## Multiple R-squared: 0.08762, Adjusted R-squared: 0.04059
## F-statistic: 1.863 on 35 and 679 DF, p-value: 0.002102
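The adjusted R-squared is even lower with the interaction terms (0.0406 for lm8 versus 0.0468 for lm7), so the airline-by-month interaction does not improve the model.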
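The comparison of lm7 and lm8 can be cross-checked with the partial F-test that anova(lm7, lm8) would report. The sketch below rebuilds that statistic from the rounded R-squared values and residual degrees of freedom printed above, so the result is approximate. It assumes lm7 is nested in lm8 and that both were fitted to the same rows; the variable names are illustrative only.

```r
# Partial F-test for lm7 (additive) vs lm8 (airline * month),
# rebuilt from the printed summaries (rounded inputs, approximate result).
r2_add <- 0.06416   # Multiple R-squared of lm7
r2_int <- 0.08762   # Multiple R-squared of lm8
df_add <- 701       # residual degrees of freedom of lm7
df_int <- 679       # residual degrees of freedom of lm8

q <- df_add - df_int   # number of extra interaction coefficients
f_stat <- ((r2_int - r2_add) / q) / ((1 - r2_int) / df_int)
p_val  <- pf(f_stat, q, df_int, lower.tail = FALSE)

cat("extra terms =", q, " F =", round(f_stat, 3), " p =", round(p_val, 3), "\n")
```

An F statistic below 1 means the extra interaction terms explain less variation per degree of freedom than the residual noise, which agrees with the lower adjusted R-squared.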
Boxplot
origin_airport_code boxplot
boxplot(avg_delay ~ origin_airport_code, data = data1)
year boxplot
boxplot(avg_delay ~ year, data = data1)
The boxplots suggest that the typical delay differs noticeably between years.
airline * origin_airport_code [Relationship between airline and origin_airport_code]
boxplot(avg_delay ~ airline * origin_airport_code, data = data1)
The groups are centred at visibly different delays, so the data should be split into subgroups before performing the analysis.
We can model these individual differences by assuming a different random intercept for each year and each airport. Why not for airlines? Because the airline is something we can pick ourselves, so it stays a fixed effect whose coefficients we want to compare. These random effects essentially give structure to the error term.
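The random-intercept structure described here can be written out as a sketch. The symbols are notation introduced for illustration and are not taken from the original analysis: $u$ is the airport intercept, $v$ is the year intercept, and $\varepsilon$ is the remaining error.

```latex
\begin{aligned}
\text{avg\_delay}_{i} &= \beta_0 + \beta_{\text{airline}[i]} + u_{\text{airport}[i]} + v_{\text{year}[i]} + \varepsilon_i,\\
u_a &\sim \mathcal{N}(0, \sigma_u^2), \qquad
v_y \sim \mathcal{N}(0, \sigma_v^2), \qquad
\varepsilon_i \sim \mathcal{N}(0, \sigma^2)
\end{aligned}
```

The airline term is a fixed coefficient, while each airport and each year shifts the intercept by its own random amount.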
Mixed model
lmer1 <- lmer(avg_delay ~ airline + year + month + (1|origin_airport_code), data = data1, REML = FALSE)
summary(lmer1)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: avg_delay ~ airline + year + month + (1 | origin_airport_code)
## Data: data1
##
## AIC BIC logLik deviance df.resid
## 3717.8 3800.1 -1840.9 3681.8 697
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -4.6368 -0.5762 -0.0324 0.5081 5.2907
##
## Random effects:
## Groups Name Variance Std.Dev.
## origin_airport_code (Intercept) 3.169 1.780
## Residual 9.583 3.096
## Number of obs: 715, groups: origin_airport_code, 13
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 3.89818 0.67153 5.805
## airlineDelta Air Lines Inc. 0.11484 0.28726 0.400
## airlineUnited Air Lines Inc. -2.16100 0.28935 -7.469
## year2020 -5.34366 0.24446 -21.859
## year2021 -6.75335 0.55234 -12.227
## month2 0.97777 0.54489 1.794
## month3 -0.03747 0.54161 -0.069
## month4 -2.74447 0.57340 -4.786
## month5 -0.40261 0.59792 -0.673
## month6 0.40168 0.60191 0.667
## month7 -1.00432 0.58823 -1.707
## month8 -1.02200 0.58253 -1.754
## month9 -2.32818 0.57940 -4.018
## month10 -1.17772 0.57121 -2.062
## month11 -1.92774 0.57682 -3.342
## month12 -0.02905 0.57336 -0.051
##
## Correlation matrix not shown by default, as p = 16 > 12.
## Use print(x, correlation=TRUE) or
## vcov(x) if you need it
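In lmer1 only the origin airport enters as a random effect; year and month are fixed effects. Year is the dominant term: the 2020 and 2021 coefficients (-5.34 and -6.75 relative to 2019) are large compared with the airport-level standard deviation of 1.78. The residual, the variability that is not accounted for by the fixed effects and the origin airport, has the largest variance (9.583 versus 3.169 for origin airport).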
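One way to summarise the variance components of lmer1 is the intraclass correlation (ICC), the share of the unexplained variance that sits at the airport level. The sketch below uses only the two variances printed in the summary; the variable names are illustrative.

```r
# Variance components printed by summary(lmer1)
var_airport  <- 3.169   # origin_airport_code (Intercept)
var_residual <- 9.583   # Residual

# Share of the remaining variance attributable to differences between airports
icc <- var_airport / (var_airport + var_residual)
cat("ICC for origin airport:", round(icc, 3), "\n")
```

An ICC of roughly a quarter indicates that observations from the same airport are noticeably correlated, which is the motivation for the random intercept.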
lmer2
lmer2 <- lmer(avg_delay ~ airline + year + month + (1|origin_airport_code), data = data1, REML = FALSE)
summary(lmer2)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: avg_delay ~ airline + year + month + (1 | origin_airport_code)
## Data: data1
##
## AIC BIC logLik deviance df.resid
## 3717.8 3800.1 -1840.9 3681.8 697
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -4.6368 -0.5762 -0.0324 0.5081 5.2907
##
## Random effects:
## Groups Name Variance Std.Dev.
## origin_airport_code (Intercept) 3.169 1.780
## Residual 9.583 3.096
## Number of obs: 715, groups: origin_airport_code, 13
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 3.89818 0.67153 5.805
## airlineDelta Air Lines Inc. 0.11484 0.28726 0.400
## airlineUnited Air Lines Inc. -2.16100 0.28935 -7.469
## year2020 -5.34366 0.24446 -21.859
## year2021 -6.75335 0.55234 -12.227
## month2 0.97777 0.54489 1.794
## month3 -0.03747 0.54161 -0.069
## month4 -2.74447 0.57340 -4.786
## month5 -0.40261 0.59792 -0.673
## month6 0.40168 0.60191 0.667
## month7 -1.00432 0.58823 -1.707
## month8 -1.02200 0.58253 -1.754
## month9 -2.32818 0.57940 -4.018
## month10 -1.17772 0.57121 -2.062
## month11 -1.92774 0.57682 -3.342
## month12 -0.02905 0.57336 -0.051
##
## Correlation matrix not shown by default, as p = 16 > 12.
## Use print(x, correlation=TRUE) or
## vcov(x) if you need it
Difference in models
anova(lmer1, lmer2)
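lmer2 was judged the better model because its random effects account for more of the variation. Note that lmer1 and lmer2, as printed above, share the same formula and the same fit statistics (AIC 3717.8), so this comparison is only informative if lmer1 was fitted with a different specification.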
Check coefficients
coef(lmer2)
## $origin_airport_code
## (Intercept) airlineDelta Air Lines Inc. airlineUnited Air Lines Inc.
## KBUR 5.0541516 0.1148361 -2.161005
## KFAT 5.8015804 0.1148361 -2.161005
## KLAX 5.5895392 0.1148361 -2.161005
## KLGB 5.3906646 0.1148361 -2.161005
## KOAK 0.3009409 0.1148361 -2.161005
## KONT 2.8607443 0.1148361 -2.161005
## KPSP 5.3434286 0.1148361 -2.161005
## KSAN 3.4599584 0.1148361 -2.161005
## KSBA 0.8852358 0.1148361 -2.161005
## KSFO 4.9888639 0.1148361 -2.161005
## KSJC 2.6730857 0.1148361 -2.161005
## KSMF 4.2078303 0.1148361 -2.161005
## KSNA 4.1203339 0.1148361 -2.161005
## year2020 year2021 month2 month3 month4 month5 month6
## KBUR -5.343658 -6.753348 0.9777669 -0.03746621 -2.744472 -0.4026089 0.4016801
## KFAT -5.343658 -6.753348 0.9777669 -0.03746621 -2.744472 -0.4026089 0.4016801
## KLAX -5.343658 -6.753348 0.9777669 -0.03746621 -2.744472 -0.4026089 0.4016801
## KLGB -5.343658 -6.753348 0.9777669 -0.03746621 -2.744472 -0.4026089 0.4016801
## KOAK -5.343658 -6.753348 0.9777669 -0.03746621 -2.744472 -0.4026089 0.4016801
## KONT -5.343658 -6.753348 0.9777669 -0.03746621 -2.744472 -0.4026089 0.4016801
## KPSP -5.343658 -6.753348 0.9777669 -0.03746621 -2.744472 -0.4026089 0.4016801
## KSAN -5.343658 -6.753348 0.9777669 -0.03746621 -2.744472 -0.4026089 0.4016801
## KSBA -5.343658 -6.753348 0.9777669 -0.03746621 -2.744472 -0.4026089 0.4016801
## KSFO -5.343658 -6.753348 0.9777669 -0.03746621 -2.744472 -0.4026089 0.4016801
## KSJC -5.343658 -6.753348 0.9777669 -0.03746621 -2.744472 -0.4026089 0.4016801
## KSMF -5.343658 -6.753348 0.9777669 -0.03746621 -2.744472 -0.4026089 0.4016801
## KSNA -5.343658 -6.753348 0.9777669 -0.03746621 -2.744472 -0.4026089 0.4016801
## month7 month8 month9 month10 month11 month12
## KBUR -1.004316 -1.021997 -2.328179 -1.177725 -1.927744 -0.02905211
## KFAT -1.004316 -1.021997 -2.328179 -1.177725 -1.927744 -0.02905211
## KLAX -1.004316 -1.021997 -2.328179 -1.177725 -1.927744 -0.02905211
## KLGB -1.004316 -1.021997 -2.328179 -1.177725 -1.927744 -0.02905211
## KOAK -1.004316 -1.021997 -2.328179 -1.177725 -1.927744 -0.02905211
## KONT -1.004316 -1.021997 -2.328179 -1.177725 -1.927744 -0.02905211
## KPSP -1.004316 -1.021997 -2.328179 -1.177725 -1.927744 -0.02905211
## KSAN -1.004316 -1.021997 -2.328179 -1.177725 -1.927744 -0.02905211
## KSBA -1.004316 -1.021997 -2.328179 -1.177725 -1.927744 -0.02905211
## KSFO -1.004316 -1.021997 -2.328179 -1.177725 -1.927744 -0.02905211
## KSJC -1.004316 -1.021997 -2.328179 -1.177725 -1.927744 -0.02905211
## KSMF -1.004316 -1.021997 -2.328179 -1.177725 -1.927744 -0.02905211
## KSNA -1.004316 -1.021997 -2.328179 -1.177725 -1.927744 -0.02905211
##
## attr(,"class")
## [1] "coef.mer"
lmer3
lmer3 <- lmer(avg_delay ~ airline + (1|year) + (1|month) + (1|origin_airport_code), data = data1, REML = FALSE)
summary(lmer3)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula:
## avg_delay ~ airline + (1 | year) + (1 | month) + (1 | origin_airport_code)
## Data: data1
##
## AIC BIC logLik deviance df.resid
## 3744.2 3776.2 -1865.1 3730.2 708
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -4.6660 -0.5959 -0.0289 0.5193 5.2814
##
## Random effects:
## Groups Name Variance Std.Dev.
## origin_airport_code (Intercept) 3.425 1.851
## month (Intercept) 1.102 1.050
## year (Intercept) 8.395 2.897
## Residual 9.766 3.125
## Number of obs: 715, groups: origin_airport_code, 13; month, 12; year, 3
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) -0.8323 1.7932 -0.464
## airlineDelta Air Lines Inc. 0.1144 0.2900 0.395
## airlineUnited Air Lines Inc. -2.1555 0.2920 -7.382
##
## Correlation of Fixed Effects:
## (Intr) aDALI.
## arlnDALInc. -0.069
## arlnUALInc. -0.060 0.426
anova(lmer2, lmer3)
coef(lmer3)
## $origin_airport_code
## (Intercept) airlineDelta Air Lines Inc. airlineUnited Air Lines Inc.
## KBUR 0.3166692 0.1144391 -2.155512
## KFAT 1.0812247 0.1144391 -2.155512
## KLAX 0.8608820 0.1144391 -2.155512
## KLGB 0.6730247 0.1144391 -2.155512
## KOAK -4.4162527 0.1144391 -2.155512
## KONT -1.8701801 0.1144391 -2.155512
## KPSP 0.6351172 0.1144391 -2.155512
## KSAN -1.2732370 0.1144391 -2.155512
## KSBA -3.8926558 0.1144391 -2.155512
## KSFO 0.2589266 0.1144391 -2.155512
## KSJC -2.0524523 0.1144391 -2.155512
## KSMF -0.5262898 0.1144391 -2.155512
## KSNA -0.6141267 0.1144391 -2.155512
##
## $month
## (Intercept) airlineDelta Air Lines Inc. airlineUnited Air Lines Inc.
## 1 -0.1575372 0.1144391 -2.155512
## 2 0.6902799 0.1144391 -2.155512
## 3 -0.2253344 0.1144391 -2.155512
## 4 -2.5335050 0.1144391 -2.155512
## 5 -0.5128050 0.1144391 -2.155512
## 6 0.1656154 0.1144391 -2.155512
## 7 -1.0240491 0.1144391 -2.155512
## 8 -1.0392074 0.1144391 -2.155512
## 9 -2.1653179 0.1144391 -2.155512
## 10 -1.1785374 0.1144391 -2.155512
## 11 -1.8225210 0.1144391 -2.155512
## 12 -0.1841734 0.1144391 -2.155512
##
## $year
## (Intercept) airlineDelta Air Lines Inc. airlineUnited Air Lines Inc.
## 2019 3.102546 0.1144391 -2.155512
## 2020 -2.208223 0.1144391 -2.155512
## 2021 -3.391096 0.1144391 -2.155512
##
## attr(,"class")
## [1] "coef.mer"
The above are random intercept models.
Random slope model: allow the airline slopes to differ between airports
lmer4 <- lmer(avg_delay ~ airline + (1|year) + (1|month) + (1 + airline|origin_airport_code), data = data1, REML = FALSE)
summary(lmer4)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: avg_delay ~ airline + (1 | year) + (1 | month) + (1 + airline |
## origin_airport_code)
## Data: data1
##
## AIC BIC logLik deviance df.resid
## 3697.6 3752.4 -1836.8 3673.6 703
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -4.960 -0.599 -0.006 0.513 5.706
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## origin_airport_code (Intercept) 3.360 1.833
## airlineDelta Air Lines Inc. 1.189 1.090 0.08
## airlineUnited Air Lines Inc. 5.764 2.401 -0.32 -0.10
## month (Intercept) 1.156 1.075
## year (Intercept) 8.700 2.950
## Residual 8.573 2.928
## Number of obs: 715, groups: origin_airport_code, 13; month, 12; year, 3
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) -0.8723 1.8203 -0.479
## airlineDelta Air Lines Inc. 0.1501 0.4382 0.343
## airlineUnited Air Lines Inc. -2.2177 0.7757 -2.859
##
## Correlation of Fixed Effects:
## (Intr) aDALI.
## arlnDALInc. -0.031
## arlnUALInc. -0.098 0.031
coefficients
coef(lmer4)
## $origin_airport_code
## (Intercept) airlineDelta Air Lines Inc. airlineUnited Air Lines Inc.
## KBUR 0.95172591 2.16037631 -5.9819803
## KFAT 0.05685458 0.04520088 1.0052288
## KLAX 1.35156875 -0.29553265 -3.3516133
## KLGB 0.31471168 0.60308929 -2.7875589
## KOAK -4.08088397 -0.67662155 -0.7476625
## KONT -0.71292282 -1.04055433 -6.1287612
## KPSP 0.71440357 -0.32542698 -2.2908863
## KSAN -1.52087309 -0.15792817 -1.2480626
## KSBA -3.88841428 0.07315146 -2.3799211
## KSFO -0.53646118 -0.21432024 0.4658495
## KSJC -2.68813920 0.76742881 -0.6863337
## KSMF -0.85989826 0.29693617 -1.3924544
## KSNA -0.44143813 0.71597734 -3.3054280
##
## $month
## (Intercept) airlineDelta Air Lines Inc. airlineUnited Air Lines Inc.
## 1 -0.1431508 0.1501366 -2.21766
## 2 0.7294980 0.1501366 -2.21766
## 3 -0.1708902 0.1501366 -2.21766
## 4 -2.5231718 0.1501366 -2.21766
## 5 -0.4888234 0.1501366 -2.21766
## 6 0.1425545 0.1501366 -2.21766
## 7 -1.1439909 0.1501366 -2.21766
## 8 -1.1056968 0.1501366 -2.21766
## 9 -2.2864294 0.1501366 -2.21766
## 10 -1.2985517 0.1501366 -2.21766
## 11 -1.9206356 0.1501366 -2.21766
## 12 -0.2581888 0.1501366 -2.21766
##
## $year
## (Intercept) airlineDelta Air Lines Inc. airlineUnited Air Lines Inc.
## 2019 3.139365 0.1501366 -2.21766
## 2020 -2.253663 0.1501366 -2.21766
## 2021 -3.502572 0.1501366 -2.21766
##
## attr(,"class")
## [1] "coef.mer"
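The airline slopes differ a lot between airports: for example, the United coefficient ranges from about -6.13 at KONT to about +1.01 at KFAT.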
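Whether the random slopes are worth their extra parameters can be checked with a likelihood ratio test between lmer3 (random intercepts only) and lmer4 (random airline slopes per airport). The sketch below rebuilds the test from the deviances and residual degrees of freedom printed in the two summaries. It assumes both models were fitted by maximum likelihood to the same data, as the REML = FALSE calls suggest. Because variances are tested on the boundary of their parameter space, the chi-squared p-value is conservative.

```r
# Likelihood ratio test: lmer3 (intercepts only) vs lmer4 (random airline slopes)
dev_lmer3 <- 3730.2   # deviance of lmer3
dev_lmer4 <- 3673.6   # deviance of lmer4
dfres_lmer3 <- 708    # df.resid of lmer3
dfres_lmer4 <- 703    # df.resid of lmer4

lr_stat <- dev_lmer3 - dev_lmer4       # drop in deviance
lr_df   <- dfres_lmer3 - dfres_lmer4   # extra variance and correlation parameters
p_val   <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

cat("LR =", round(lr_stat, 1), " df =", lr_df, " p =", signif(p_val, 3), "\n")
```

The five extra parameters are the two slope variances and the three correlations in the airport block of lmer4.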
origin_airport_code <- c('KBUR', 'KBUR', 'KBUR')
airline <- c('American Airlines Inc.', 'Delta Air Lines Inc.', 'United Air Lines Inc.')
year <- c(2021, 2021, 2021)
month <- c(11, 11, 11)
newdata <- data.frame(origin_airport_code, airline, year, month)
predict
predict(lmer4, newdata)
## 1 2 3
## -2.7269021 -0.5665258 -8.7088824
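The three predictions can be reproduced by hand from the coef(lmer4) tables, which shows how the model combines its parts. Each coef() row already includes the fixed intercept, so it has to be removed from the month and year rows to avoid counting it three times. The sketch below uses the rounded values printed above, so it matches predict() only to about three decimals. It assumes predict() conditions on all random effects, which is the lme4 default; the variable names are illustrative.

```r
fixed_intercept <- -0.8723        # (Intercept) from summary(lmer4), rounded

# Rows of coef(lmer4): fixed intercept plus the group's random effect
kbur_intercept     <- 0.95172591  # origin_airport_code KBUR
month11_intercept  <- -1.9206356  # month 11
year2021_intercept <- -3.502572   # year 2021

# Airline effects at KBUR (American is the reference level)
kbur_slopes <- c(American = 0, Delta = 2.16037631, United = -5.9819803)

# Count the fixed intercept once, then add the month and year random effects
base <- kbur_intercept +
  (month11_intercept - fixed_intercept) +
  (year2021_intercept - fixed_intercept)

pred <- base + kbur_slopes
print(round(pred, 3))
```

For this airport, month, and year, United has the lowest expected departure delay and Delta the highest.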
diagnostic plot
plot(lmer4, type = c("p", "smooth"))
plot(lmer4, sqrt(abs(resid(.))) ~ fitted(.), type = c("p", "smooth"))
qqmath(lmer4, id = 0.05)
confint
#confint(lmer4)
data3
data3 <- raw_data %>% group_by(origin_state_abr, origin_airport_code, airline, year, month) %>% summarise(avg_delay = mean(dep_delay), ave_age = mean(age)) %>% droplevels()
## `summarise()` has grouped output by 'origin_state_abr', 'origin_airport_code', 'airline', 'year'. You can override using the `.groups` argument.
lmer5
lmer5 <- lmer(avg_delay ~ airline + (1|year) + (1|month) + (1|origin_airport_code) + (1+ airline|origin_state_abr), data = data3, REML = FALSE)
summary(lmer5)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula:
## avg_delay ~ airline + (1 | year) + (1 | month) + (1 | origin_airport_code) +
## (1 + airline | origin_state_abr)
## Data: data3
##
## AIC BIC logLik deviance df.resid
## 36477.6 36565.6 -18225.8 36451.6 6383
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -4.4148 -0.4867 -0.0400 0.3869 23.2122
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## origin_airport_code (Intercept) 2.86078 1.6914
## origin_state_abr (Intercept) 0.08272 0.2876
## airlineDelta Air Lines Inc. 1.18156 1.0870 -0.78
## airlineUnited Air Lines Inc. 3.02602 1.7395 -0.52 -0.13
## month (Intercept) 0.89824 0.9478
## year (Intercept) 4.77851 2.1860
## Residual 16.33502 4.0417
## Number of obs: 6396, groups:
## origin_airport_code, 131; origin_state_abr, 47; month, 12; year, 3
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) -0.4751 1.3067 -0.364
## airlineDelta Air Lines Inc. -1.3599 0.2204 -6.171
## airlineUnited Air Lines Inc. -1.1069 0.3206 -3.453
##
## Correlation of Fixed Effects:
## (Intr) aDALI.
## arlnDALInc. -0.060
## arlnUALInc. -0.037 0.071
lmer6
lmer6 <- lmer(avg_delay ~ airline + (1|year) + (1|month) + (1+ airline|origin_airport_code) + (1|origin_state_abr), data = data3, REML = FALSE)
summary(lmer6)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: avg_delay ~ airline + (1 | year) + (1 | month) + (1 + airline |
## origin_airport_code) + (1 | origin_state_abr)
## Data: data3
##
## AIC BIC logLik deviance df.resid
## 36114.3 36202.2 -18044.1 36088.3 6383
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -9.9789 -0.4804 -0.0348 0.3874 24.3303
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## origin_airport_code (Intercept) 2.8957 1.7017
## airlineDelta Air Lines Inc. 3.3313 1.8252 -0.48
## airlineUnited Air Lines Inc. 19.9879 4.4708 0.23 -0.35
## origin_state_abr (Intercept) 0.1511 0.3887
## month (Intercept) 0.9080 0.9529
## year (Intercept) 4.8660 2.2059
## Residual 14.8082 3.8481
## Number of obs: 6396, groups:
## origin_airport_code, 131; origin_state_abr, 47; month, 12; year, 3
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) -0.4894 1.3204 -0.371
## airlineDelta Air Lines Inc. -1.3862 0.2148 -6.453
## airlineUnited Air Lines Inc. -0.7066 0.4823 -1.465
##
## Correlation of Fixed Effects:
## (Intr) aDALI.
## arlnDALInc. -0.087
## arlnUALInc. 0.014 -0.176
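lmer5 and lmer6 have the same number of parameters (both report 6383 residual degrees of freedom), so they are not nested and a likelihood ratio test does not apply. A simple way to compare them is the AIC printed in each summary, as sketched below with illustrative variable names.

```r
# AIC values printed by summary(lmer5) and summary(lmer6)
aic <- c(lmer5 = 36477.6, lmer6 = 36114.3)

delta_aic <- aic - min(aic)        # difference from the best model
best <- names(which.min(aic))      # model with the lowest AIC

print(round(delta_aic, 1))
cat("Lower AIC:", best, "\n")
```

On this criterion, placing the random airline slopes at the airport level (lmer6) fits better than placing them at the state level (lmer5).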
coef lmer6
coef(lmer6)
## $origin_airport_code
## (Intercept) airlineDelta Air Lines Inc. airlineUnited Air Lines Inc.
## KABQ -0.124806656 -2.22576054 -0.18294408
## KAGS -0.864615426 -1.87810165 -0.41020862
## KALB -1.681679541 -1.86563651 1.09178478
## KAMA -0.326643016 -1.47023513 -0.60783678
## KATL 0.206139388 0.56599529 1.39145887
## KAUS -1.708292081 -0.26278792 -0.74205165
## KAVL -0.827251516 -1.82911246 -0.43972769
## KAVP -1.683928367 -2.95233270 0.23708328
## KBDL -2.093709591 -0.78992001 -1.91872524
## KBHM -1.372073150 -0.62232213 -3.03947897
## KBIL -1.188598292 -2.30288739 -0.15424845
## KBIS -1.173216197 -2.28271936 -0.16640095
## KBNA -0.390058781 -2.51927344 1.77800016
## KBOI -0.597068400 -2.16708904 -4.00032676
## KBOS -0.938576576 0.58409063 -0.51667518
## KBTR 0.117346444 -3.17001896 5.08264234
## KBTV 1.256775175 2.14476077 -5.10320486
## KBUF -1.713689287 -0.85466412 -3.30082765
## KBUR 0.728568273 1.88463592 -5.46677180
## KBWI -0.884927077 -1.43339912 -2.13870668
## KBZN -0.029193418 -2.66934079 0.22513936
## KCAE 1.994006413 -4.38981838 2.11616053
## KCHS -1.104696702 -0.49090690 -2.84765990
## KCID 1.039835256 -1.52788244 3.30243538
## KCLE -1.990922065 -0.53181557 -1.18092287
## KCLT 1.446133770 -2.59632296 -2.56746371
## KCMH -0.182680520 -2.62363217 -1.21783332
## KCOS -0.285146319 -1.88469397 0.43295896
## KCRP -2.018270967 0.04612231 -7.85117621
## KCRW -1.157773629 -2.26247205 -0.17860123
## KCVG 1.337658376 -1.89356005 -2.33440990
## KDAL -0.219326160 -1.03203938 -0.92001445
## KDAY -0.943275252 -1.98123541 -0.34806403
## KDCA -1.722547977 -0.81352218 -0.81394012
## KDFW 3.109294613 -3.20063058 -4.40102044
## KDSM -0.933006367 -1.18104941 -0.74355323
## KDTW -1.030875041 0.59307390 0.81843707
## KEGE 1.906223556 -2.46028142 -2.49377785
## KELP -0.838116236 0.28052182 -4.76115318
## KEUG -1.252974825 -0.67083155 -4.27485229
## KEWR -0.269374802 -0.36074558 2.59586475
## KEYW 1.428871061 -2.37467280 0.45593704
## KFAR -1.750488284 -0.52935195 -1.62910276
## KFAT 0.124470034 -1.91524573 1.71691002
## KFAY -0.518434653 -1.42421142 -0.68370608
## KFLL -0.070580008 -0.78029267 -2.43404541
## KFNT -1.615548501 -2.86267737 0.18306029
## KFSD 0.907856990 1.13634888 -3.48863078
## KGEG -3.476503971 0.41552428 -3.11303840
## KGPT -0.544744114 -1.45870672 -0.66292049
## KGRB -3.087709654 -2.19560479 -4.96594965
## KGRR 0.464521343 -3.19790725 -3.80035842
## KGSO -1.228859029 -1.46955485 -0.79976825
## KGSP 1.142197029 -4.63962265 2.64948984
## KGTF -1.186736786 -2.30044670 -0.15571911
## KHOU -1.015647066 -2.07612473 -0.29088723
## KHPN 0.400822630 -0.21893970 -1.40995824
## KIAD -0.121110039 -1.90582423 -0.27205711
## KIAH 1.018786504 0.23099401 -0.58201644
## KICT -0.896486142 -0.77412657 -3.46820487
## KILM -1.335925462 -0.07312441 -1.88987907
## KIND -0.764415335 -1.54843411 -0.38774472
## KJAC 0.703043938 -2.28627580 1.23971298
## KJAN -1.054443943 -2.12699273 -0.26023606
## KJAX -0.861157419 -2.09101810 -1.43634111
## KJFK -0.166810456 0.84313361 1.18941110
## KLAS 0.102163709 -1.90125788 -1.57914633
## KLAX 1.496803097 -0.76723925 -3.30120005
## KLBB -0.696537601 -1.27917572 -0.83235348
## KLEX -0.480480316 -1.37444811 -0.71369161
## KLFT -1.279137278 -2.42159635 -0.08271881
## KLGA -0.619968404 0.70943939 -0.16964069
## KLGB 0.436733491 -0.17185567 -1.43832934
## KLIT 1.440846015 -4.14866715 1.81447378
## KMAF -1.094642796 -0.80582971 -3.66392696
## KMCI -1.696340272 -1.00949521 -2.76239957
## KMCO -0.966324848 -0.58738172 -0.11903815
## KMDT -1.042673909 -1.57112684 3.49870020
## KMDW -1.063309369 -2.13861652 -0.25323200
## KMEM -1.270724205 -1.64503075 -4.13034778
## KMFE -0.149791852 -1.03808504 -5.56492399
## KMFR -0.421569680 -1.44971843 -0.38959279
## KMHT -3.034707651 -0.17189027 -2.17479356
## KMIA 1.669997519 -1.13780017 -1.82295338
## KMKE -1.515360332 -1.90452870 4.46842586
## KMOB -1.095697825 -2.18108220 -0.22764375
## KMSN 2.135142579 -5.19321797 -0.87997703
## KMSO 1.725380223 -3.25207539 1.18950777
## KMSP 0.008099133 -0.47335876 -1.08378611
## KMSY -0.560291225 -0.67440715 -1.44235750
## KMYR -2.368485009 -1.71777572 7.75586071
## KOAK -3.084939639 -1.24046193 -1.36866161
## KOMA -2.070391852 -1.12366140 -0.33799929
## KONT -0.392227216 -1.49204089 -6.01769697
## KORD 2.419264516 -2.52482717 -1.37393217
## KPBI -1.035169378 -0.54686528 -0.77968343
## KPDX 0.416476836 -1.97205533 -2.68458005
## KPHF -1.858225961 -3.18086063 0.37478575
## KPHL -0.404577177 -1.59351733 -1.40309762
## KPHX 0.372098463 -0.60891647 -2.24307151
## KPIT -2.122852646 -0.66507869 0.84640340
## KPNS 0.414422554 -2.90599915 -1.64620657
## KPSP 0.877956692 -1.13285942 -2.04749524
## KPVD -1.669366332 -1.51844768 -2.59524958
## KPWM -1.507813049 -0.25835001 -3.17123826
## KRDU 0.038232676 -1.42573223 -1.32719764
## KRIC -0.995625319 -1.38222652 -3.44970968
## KRNO -0.866761691 -0.83901437 -2.48480857
## KROA -0.282789957 -1.11524908 -0.86987536
## KROC -0.813258606 -3.24826923 1.67510862
## KRSW -0.909453668 -2.13616666 -2.18808181
## KSAN -1.202785145 -0.52707248 -1.39687575
## KSAT -1.424264296 -0.06287290 -2.98270279
## KSAV -0.577039207 -1.11436300 8.61142810
## KSBA -3.125825794 0.05212237 -3.04706892
## KSBN 2.061274282 1.90614283 -2.59542078
## KSDF 0.007914770 -2.29744675 -0.45152310
## KSEA -0.069818835 0.01901700 -2.26914253
## KSFO -0.248506105 -0.67253348 0.43314227
## KSJC -2.368715899 0.62648806 -0.60551023
## KSLC 0.807836979 -1.14948891 0.59552196
## KSMF -0.627544226 -0.04407589 -1.44655929
## KSNA -0.300133868 0.43198059 -3.32572048
## KSRQ -1.372006409 -0.37063331 0.08766097
## KSTL -1.110269337 -1.13637403 4.60826019
## KSYR -2.883140061 -1.05132379 1.17441377
## KTLH -0.834065260 -1.83804621 -0.43434454
## KTPA -0.863439377 -0.52518987 -0.83061246
## KTUL -1.266170508 -1.53738853 -4.09700713
## KTVC 6.180013927 -7.73159321 30.69689371
## KTYS 0.829790432 -4.62654564 2.05011221
##
## $origin_state_abr
## (Intercept) airlineDelta Air Lines Inc. airlineUnited Air Lines Inc.
## AL -0.5886419 -1.38616 -0.7066342
## AR -0.4425415 -1.38616 -0.7066342
## AZ -0.4051077 -1.38616 -0.7066342
## CA -0.1667563 -1.38616 -0.7066342
## CO -0.3563375 -1.38616 -0.7066342
## CT -0.5794514 -1.38616 -0.7066342
## FL -0.3489484 -1.38616 -0.7066342
## GA -0.4446634 -1.38616 -0.7066342
## IA -0.4206968 -1.38616 -0.7066342
## ID -0.5134320 -1.38616 -0.7066342
## IL -0.3845164 -1.38616 -0.7066342
## IN -0.2399631 -1.38616 -0.7066342
## KS -0.4943391 -1.38616 -0.7066342
## KY -0.3683839 -1.38616 -0.7066342
## LA -0.5785233 -1.38616 -0.7066342
## MA -0.4631256 -1.38616 -0.7066342
## MD -0.5151774 -1.38616 -0.7066342
## ME -0.5216482 -1.38616 -0.7066342
## MI -0.3617128 -1.38616 -0.7066342
## MN -0.4281670 -1.38616 -0.7066342
## MO -0.6020677 -1.38616 -0.7066342
## MS -0.5563684 -1.38616 -0.7066342
## MT -0.5541916 -1.38616 -0.7066342
## NC -0.4627414 -1.38616 -0.7066342
## ND -0.6227537 -1.38616 -0.7066342
## NE -0.5904767 -1.38616 -0.7066342
## NH -0.6252932 -1.38616 -0.7066342
## NJ -0.4506183 -1.38616 -0.7066342
## NM -0.4898935 -1.38616 -0.7066342
## NV -0.4689670 -1.38616 -0.7066342
## NY -0.6589458 -1.38616 -0.7066342
## OH -0.6293680 -1.38616 -0.7066342
## OK -0.5406497 -1.38616 -0.7066342
## OR -0.4635166 -1.38616 -0.7066342
## PA -0.7616960 -1.38616 -0.7066342
## RI -0.5704103 -1.38616 -0.7066342
## SC -0.5673327 -1.38616 -0.7066342
## SD -0.3154291 -1.38616 -0.7066342
## TN -0.5840006 -1.38616 -0.7066342
## TX -0.2356007 -1.38616 -0.7066342
## UT -0.3962856 -1.38616 -0.7066342
## VA -0.7024263 -1.38616 -0.7066342
## VT -0.2592775 -1.38616 -0.7066342
## WA -0.5642329 -1.38616 -0.7066342
## WI -0.7084898 -1.38616 -0.7066342
## WV -0.5615489 -1.38616 -0.7066342
## WY -0.4377071 -1.38616 -0.7066342
##
## $month
## (Intercept) airlineDelta Air Lines Inc. airlineUnited Air Lines Inc.
## 1 0.5530361 -1.38616 -0.7066342
## 2 0.9382499 -1.38616 -0.7066342
## 3 -0.5505058 -1.38616 -0.7066342
## 4 -1.7948625 -1.38616 -0.7066342
## 5 -0.9119892 -1.38616 -0.7066342
## 6 0.6979843 -1.38616 -0.7066342
## 7 -0.2347804 -1.38616 -0.7066342
## 8 -0.6804378 -1.38616 -0.7066342
## 9 -1.6794191 -1.38616 -0.7066342
## 10 -0.8607703 -1.38616 -0.7066342
## 11 -1.5606916 -1.38616 -0.7066342
## 12 0.2112274 -1.38616 -0.7066342
##
## $year
## (Intercept) airlineDelta Air Lines Inc. airlineUnited Air Lines Inc.
## 2019 2.593033 -1.38616 -0.7066342
## 2020 -2.028811 -1.38616 -0.7066342
## 2021 -2.032461 -1.38616 -0.7066342
##
## attr(,"class")
## [1] "coef.mer"
lm model to test airline & origin_state_abr
lmtest1 <- lm(avg_delay ~ airline + origin_state_abr, data = data3)
lmtest2 <- lm(avg_delay ~ airline * origin_state_abr, data = data3)
anova(lmtest1, lmtest2)