Introduction

A flight delay occurs when an airline flight departs or arrives later than its scheduled time. Flight delays have always been an issue in the airline industry, and they affect not only passengers but also airlines, which are responsible for compensation.

Since 2003, the Bureau of Transportation Statistics (BTS) has tracked the causes of flight delays; in 2008 alone, it tracked 7,009,728 flights. This naturally raises the question of what causes these delays. We are specifically interested in which factors were associated with delays in 2011 and 2012 and whether arrival delays differ across the levels of these factors. Airlines can use this analysis to help prevent future delays, and passengers can use it to avoid flights that are likely to be delayed.

Exploratory Data Analysis and Data Cleaning

BTS provided daily flight data for 2011 and 2012. The dataset includes six variables: the day of the week of a flight, the arrival delay in minutes, whether there was a delay due to a late aircraft, whether there was a delay due to the carrier (the airline company), whether there was a delay due to the National Air System, and the total time in the air in minutes. To gain insight into flight delays, we analyze the relationship between the response variable, ArrDelay, and five predictor variables: DayOfWeek, LateAircraft, Carrier, NAS, and AirTime.

DayOfWeek: day of the week (1 = Sun, 2 = Mon, 3 = Tues, … 7 = Sat)
ArrDelay: arrival delay in minutes (negative means an early arrival)
LateAircraft: whether there was a delay due to a late aircraft (1 = Yes, 0 = No)
Carrier: whether there was a delay due to the carrier/company (1 = Yes, 0 = No)
NAS: whether there was a delay due to the National Air System (1 = Yes, 0 = No)
AirTime: total time in the air in minutes (1 = 0 - 99 mins; 2 = 100 - 199 mins; 3 = 200 - 299 mins; 4 = 300+ mins)

The EDA was performed on a transformed response, log(ArrDelay); the Modeling section explains why this transformation was chosen. For the binary variables, Yes is coded as 1 and No as 0. For DayOfWeek, 1 denotes Sunday, 2 Monday, 3 Tuesday, 4 Wednesday, 5 Thursday, 6 Friday, and 7 Saturday. For AirTime, the time in the air in minutes was grouped into four categories: 1: 0 mins ≤ AirTime < 100 mins; 2: 100 mins ≤ AirTime < 200 mins; 3: 200 mins ≤ AirTime < 300 mins; 4: AirTime ≥ 300 mins. AirTime was split into four groups because its values started at 9 minutes and extended beyond 300 minutes, so bins of width 100 cover the range with few categories. Finally, all categorical variables were defined as factors.
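The AirTime grouping described above can be sketched as a small function. This is only an illustration in Python (the report's own output comes from R); the function name `air_time_label` is hypothetical and not part of the original analysis.

```python
def air_time_label(minutes: float) -> int:
    """Map total air time in minutes to the report's category codes 1-4."""
    if minutes < 0:
        raise ValueError("air time cannot be negative")
    if minutes < 100:
        return 1  # 0 <= AirTime < 100
    if minutes < 200:
        return 2  # 100 <= AirTime < 200
    if minutes < 300:
        return 3  # 200 <= AirTime < 300
    return 4      # AirTime >= 300


# Boundary values fall into the upper bin because each bin is closed on the left.
labels = [air_time_label(m) for m in (9, 99, 100, 125, 299, 300, 412)]
print(labels)  # [1, 1, 2, 2, 3, 4, 4]
```

An AirTime of 125 minutes, the value used later in the Prediction section, falls into category 2.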

First, we perform EDA on the response variable, log(ArrDelay).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.708   3.135   3.611   3.746   4.248   6.996

From the univariate plots and summaries, log(ArrDelay) is unimodal and strongly right-skewed, with no apparent outliers. Its mean is 3.746 and its median is 3.611, and it ranges from 2.708 to 6.996; these values are on the log scale (log minutes), not minutes.

For DayOfWeek, 16% of flights flew on Sunday, 14% on Monday, 13% on Tuesday, 15% on Wednesday, 17% on Thursday, 11% on Friday, and 14% on Saturday. For LateAircraft, 51% of flights were not delayed by a late aircraft and 49% were. For Carrier, 57% of flights were not delayed by the carrier/company and 43% were. For NAS, 41% of flights were not delayed by the National Air System and 58% were. Lastly, for AirTime, 55% of flights spent 0 to 99 minutes in the air, 34% spent 100 to 199 minutes, 8% spent 200 to 299 minutes, and 2% spent 300 or more minutes. Next, we examine log(ArrDelay) against each factor separately.
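For orientation, the log-scale summary can be mapped back to minutes. The sketch below assumes natural logarithms (the default of R's `log()`), which the report does not state explicitly. Quantiles transform directly because `exp` is monotonic, but exponentiating the mean of the logs gives a geometric mean, not the arithmetic mean of the delays.

```python
import math

# Five-number summary and mean of log(ArrDelay), copied from the output above.
log_summary = {
    "min": 2.708,
    "q1": 3.135,
    "median": 3.611,
    "mean": 3.746,
    "q3": 4.248,
    "max": 6.996,
}

# Back-transform each value to minutes (assumes natural log was used).
minutes = {name: math.exp(value) for name, value in log_summary.items()}

for name, value in minutes.items():
    print(f"{name:>6}: {value:8.1f} min")
```

Under this assumption the minimum maps to roughly 15 minutes and the maximum to well over 1,000 minutes.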

Looking at the boxplots of log(ArrDelay) by DayOfWeek and the table of means and standard deviations, log(ArrDelay) does not differ much across days of the week.

Looking at the boxplots of log(ArrDelay) by LateAircraft and the table of means and standard deviations, flights that were delayed due to a late aircraft had a somewhat higher log(arrival delay) than flights that were not. Thus, we suspect that there may be a relationship between log(ArrDelay) and LateAircraft.

Looking at the boxplots of log(ArrDelay) by Carrier and the table of means and standard deviations, there is little evidence of a relationship between log(ArrDelay) and Carrier: flights that were delayed due to the carrier/company have a similar mean log(ArrDelay) to flights that were not.

Looking at the boxplots of log(ArrDelay) by NAS and the table of means and standard deviations, there is little evidence of a relationship between log(ArrDelay) and NAS: flights that were delayed due to the NAS have a similar mean log(ArrDelay) to flights that were not.

Looking at the boxplots of log(ArrDelay) by AirTimeLabels and the table of means and standard deviations, we suspect a decreasing relationship: the longer a flight was in the air, the lower its log(ArrDelay). Variability is highest for flights with AirTimeLabels = 1.

##          0   1 margin
## 1      3.6 3.8    3.7
## 2      3.7 3.9    3.8
## 3      3.6 3.8    3.7
## 4      3.6 3.9    3.7
## 5      3.6 3.9    3.7
## 6      3.6 3.9    3.7
## 7      3.8 3.8    3.8
## margin 3.6 3.9    3.7

The interaction plot and the table of means and standard deviations suggest that an interaction between DayOfWeek and LateAircraft probably exists.

##          0   1 margin
## 1      3.7 3.8    3.7
## 2      3.8 3.7    3.8
## 3      3.7 3.7    3.7
## 4      3.8 3.7    3.7
## 5      3.8 3.7    3.7
## 6      3.7 3.7    3.7
## 7      3.8 3.9    3.8
## margin 3.8 3.7    3.7

The interaction plot and the table of means and standard deviations suggest that an interaction between DayOfWeek and Carrier probably exists.

##          0   1 margin
## 1      3.8 3.7    3.7
## 2      3.7 3.8    3.8
## 3      3.8 3.6    3.7
## 4      3.9 3.6    3.7
## 5      3.7 3.8    3.7
## 6      3.7 3.7    3.7
## 7      3.9 3.8    3.8
## margin 3.8 3.7    3.7

The interaction plot and the table of means and standard deviations suggest that an interaction between DayOfWeek and NAS probably exists.

##          4   3   2   1 margin
## 1      3.5 3.6 3.8 3.7    3.7
## 2      3.1 3.8 3.9 3.7    3.8
## 3      4.0 3.7 3.6 3.7    3.7
## 4      3.0 3.6 3.7 3.8    3.7
## 5      3.5 3.9 3.9 3.7    3.7
## 6      3.6 3.4 3.6 3.8    3.7
## 7      3.5 3.7 3.8 3.9    3.8
## margin 3.5 3.7 3.8 3.8    3.7

The interaction plot and the table of means and standard deviations suggest that an interaction between DayOfWeek and AirTime probably exists for certain days.

##          0   1 margin
## 0      3.6 3.7    3.6
## 1      3.9 3.8    3.9
## margin 3.8 3.7    3.7

The interaction plot and the table of means and standard deviations indicate that an interaction between LateAircraft and Carrier exists.
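One way to read an interaction off a 2x2 table of means is the difference of differences. The sketch below uses the rounded cell means from the table above and assumes that rows are LateAircraft and columns are Carrier; the table is unlabeled, and this reading is inferred from the margins matching the other tables. Because the means are rounded to one decimal, the result is only a rough indication, not a test.

```python
# Cell means of log(ArrDelay), keyed by (LateAircraft, Carrier).
# Assumption: rows of the printed table are LateAircraft, columns are Carrier.
cell_means = {
    (0, 0): 3.6, (0, 1): 3.7,
    (1, 0): 3.9, (1, 1): 3.8,
}


def interaction_contrast(means):
    """Difference of differences for a 2x2 table of cell means."""
    carrier_effect_when_late = means[(1, 1)] - means[(1, 0)]
    carrier_effect_when_not_late = means[(0, 1)] - means[(0, 0)]
    return carrier_effect_when_late - carrier_effect_when_not_late


contrast = interaction_contrast(cell_means)
print(round(contrast, 1))  # -0.2
```

A contrast of zero would correspond to parallel lines in the interaction plot. Here the carrier effect is about -0.1 when a late aircraft is involved and about +0.1 when it is not, so the lines are not parallel.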

##          0   1 margin
## 0      3.7 3.6    3.6
## 1      3.8 3.9    3.9
## margin 3.8 3.7    3.7

The interaction plot and the table of means and standard deviations indicate that an interaction between LateAircraft and NAS exists.

##          4   3   2   1 margin
## 0      3.3 3.6 3.6 3.7    3.6
## 1      3.8 3.7 3.9 3.8    3.9
## margin 3.5 3.7 3.8 3.8    3.7

The interaction plot and the table of means and standard deviations suggest that an interaction between LateAircraft and AirTimeLabels probably exists.

##          0   1 margin
## 0      3.8 3.7    3.8
## 1      3.7 3.7    3.7
## margin 3.8 3.7    3.7

The interaction plot and the table of means and standard deviations indicate that there is no interaction between Carrier and NAS, as the lines are parallel to each other.

##          4   3   2   1 margin
## 0      3.4 3.7 3.8 3.8    3.8
## 1      3.5 3.7 3.8 3.7    3.7
## margin 3.5 3.7 3.8 3.8    3.7

The interaction plot and the table of means and standard deviations indicate that there is no interaction between Carrier and AirTimeLabels, as the lines are almost identical.

##          4   3   2   1 margin
## 0      3.9 3.6 3.9 3.8    3.8
## 1      3.3 3.7 3.7 3.8    3.7
## margin 3.5 3.7 3.8 3.8    3.7

The interaction plot and the table of means and standard deviations indicate that an interaction between NAS and AirTimeLabels exists.

Modeling

Now that we have performed EDA, we fit a multi-way ANOVA model with an interaction term for each pair of explanatory variables, with two exceptions: the interaction terms Carrier_FACTOR:NAS_FACTOR and Carrier_FACTOR:AirTimeLabels_FACTORS are dropped because the interaction plots above show no interaction for these two pairs. We considered three candidate models:

-Model with no transformations and interaction terms: Judging from the residual plots and the QQ plot, the residuals of this model deviated from the normality and homoscedasticity assumptions. Apart from these violations, the model was acceptable overall: the residuals appeared independent (spread without a pattern) and were centered around zero on average. The spread was slightly uneven, with residuals more dispersed above zero than below.

-Model with square root transformation and interaction terms: The residuals of this model also deviated from the normality and homoscedasticity assumptions according to the residual plots and the QQ plot. Apart from these violations, the other error assumptions were met: the residuals appeared independent (spread without a pattern) and were centered around zero on average. The spread was again slightly uneven, with residuals more dispersed above zero, but less severely than in the previous model.

-Model with log transformation and interaction terms: This transformation greatly improved the residual diagnostics: the normality and homoscedasticity assumptions were met, including the constant standard deviation assumption that the previous two models violated.

For these reasons, we decided to model log(ArrDelay) as a multi-way ANOVA with predictors DayOfWeek (1 = Sunday, 2 = Monday, 3 = Tuesday … 7 = Saturday), LateAircraft (1 = Yes, 0 = No), Carrier (1 = Yes, 0 = No), NAS (1 = Yes, 0 = No), and AirTimeLabels (1 = 0 - 99 mins; 2 = 100 - 199 mins; 3 = 200 - 299 mins; 4 = 300+ mins).
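As a sanity check on the size of this model, the degrees of freedom of each term can be counted from the number of factor levels: a main effect uses (levels - 1) degrees of freedom, and a two-way interaction uses the product of its two factors' values. The sketch below is illustrative Python, not the report's R code; the residual degrees of freedom (1026) are taken from the R output in this report.

```python
from math import prod

# Number of levels of each factor, as described in the report.
levels = {"DayOfWeek": 7, "LateAircraft": 2, "Carrier": 2, "NAS": 2, "AirTimeLabels": 4}

# Terms in the fitted model: all main effects and all pairwise interactions
# except Carrier:NAS and Carrier:AirTimeLabels, which were dropped.
terms = [
    ("DayOfWeek",), ("LateAircraft",), ("Carrier",), ("NAS",), ("AirTimeLabels",),
    ("DayOfWeek", "LateAircraft"), ("DayOfWeek", "Carrier"), ("DayOfWeek", "NAS"),
    ("DayOfWeek", "AirTimeLabels"), ("LateAircraft", "Carrier"), ("LateAircraft", "NAS"),
    ("LateAircraft", "AirTimeLabels"), ("NAS", "AirTimeLabels"),
]


def term_df(term):
    """Degrees of freedom of a term: product of (levels - 1) over its factors."""
    return prod(levels[factor] - 1 for factor in term)


model_df = sum(term_df(term) for term in terms)
residual_df = 1026                    # from the R output
n_obs = model_df + residual_df + 1    # +1 for the intercept

print(model_df)  # 56
print(n_obs)     # 1083
```

The total of 56 agrees with the model degrees of freedom reported in the F-statistic line of the R summary, and implies 1083 flights in the fitted data.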

##                                             Df Sum Sq Mean Sq F value
## DayOfWeek_FACTOR                             6    1.6   0.268   0.499
## LateAircraft_FACTOR                          1   13.1  13.142  24.485
## Carrier_FACTOR                               1    0.0   0.001   0.002
## NAS_FACTOR                                   1    0.1   0.097   0.181
## AirTimeLabels_FACTORS                        3    2.3   0.767   1.430
## DayOfWeek_FACTOR:LateAircraft_FACTOR         6    3.9   0.658   1.225
## DayOfWeek_FACTOR:Carrier_FACTOR              6    2.4   0.405   0.755
## DayOfWeek_FACTOR:NAS_FACTOR                  6    4.1   0.691   1.288
## DayOfWeek_FACTOR:AirTimeLabels_FACTORS      18    9.4   0.522   0.973
## LateAircraft_FACTOR:Carrier_FACTOR           1    3.2   3.169   5.904
## LateAircraft_FACTOR:NAS_FACTOR               1    1.8   1.789   3.333
## LateAircraft_FACTOR:AirTimeLabels_FACTORS    3    1.1   0.351   0.654
## NAS_FACTOR:AirTimeLabels_FACTORS             3    2.7   0.901   1.680
## Residuals                                 1026  550.7   0.537        
##                                             Pr(>F)    
## DayOfWeek_FACTOR                            0.8096    
## LateAircraft_FACTOR                       8.75e-07 ***
## Carrier_FACTOR                              0.9658    
## NAS_FACTOR                                  0.6709    
## AirTimeLabels_FACTORS                       0.2325    
## DayOfWeek_FACTOR:LateAircraft_FACTOR        0.2905    
## DayOfWeek_FACTOR:Carrier_FACTOR             0.6057    
## DayOfWeek_FACTOR:NAS_FACTOR                 0.2600    
## DayOfWeek_FACTOR:AirTimeLabels_FACTORS      0.4889    
## LateAircraft_FACTOR:Carrier_FACTOR          0.0153 *  
## LateAircraft_FACTOR:NAS_FACTOR              0.0682 .  
## LateAircraft_FACTOR:AirTimeLabels_FACTORS   0.5805    
## NAS_FACTOR:AirTimeLabels_FACTORS            0.1697    
## Residuals                                             
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
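Each F value in the table is the term's mean square divided by the residual mean square. The sketch below recomputes three of them from the rounded numbers printed above, so small discrepancies in the last digit are expected. Note also that R's ANOVA tables use sequential sums of squares, so these F-tests depend on the order of the terms and need not agree with the per-coefficient t-tests in the regression summary.

```python
# Values copied from the ANOVA table above (rounded as printed).
residual_ss, residual_df = 550.7, 1026
mean_squares = {
    "LateAircraft": 13.142,
    "LateAircraft:Carrier": 3.169,
    "LateAircraft:NAS": 1.789,
}

# Residual mean square: the estimate of the error variance.
residual_ms = residual_ss / residual_df

# F statistic for each term: term mean square over residual mean square.
f_values = {term: ms / residual_ms for term, ms in mean_squares.items()}

for term, f_value in f_values.items():
    print(f"{term:<22} F = {f_value:.2f}")
```

The recomputed values agree with the table (24.485, 5.904, and 3.333) to within rounding.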

From the residual plots, the independence assumption is met because the residuals are spread without a pattern. The zero-mean assumption is met because the residuals are centered around zero. The constant standard deviation assumption is met because the residuals have a roughly constant spread above and below zero. Lastly, the points on the QQ plot lie relatively close to the line, so we can assume that the residuals are approximately normally distributed.

From the ANOVA table, the interaction between LateAircraft and Carrier is significant (F = 5.904, p-value = 0.0153), and there is a significant LateAircraft effect (F = 24.485, p-value = 8.75e-07). To interpret the interaction, we go back to the table of means. Among flights delayed due to a late aircraft, mean log(ArrDelay) is lower when the flight was also delayed due to the carrier/company (3.8 versus 3.9). Among flights not delayed due to a late aircraft, mean log(ArrDelay) is higher when the flight was delayed due to the carrier/company (3.7 versus 3.6). Hence, the interaction between LateAircraft and Carrier is negative: a late-aircraft delay raises log(ArrDelay) by less when a carrier delay is also present. For the LateAircraft main effect, flights delayed due to a late aircraft had a somewhat higher log(arrival delay) than flights that were not (3.9 versus 3.6 on average), so a late aircraft has a positive effect on log(ArrDelay).

Prediction

Now that we have built a model that satisfies the assumptions, we want to predict the ArrDelay of a flight on a Sunday that was delayed due to a late aircraft and a National Air System alert (but not due to the carrier) with an AirTime of 125 minutes.

DayOfWeek = 1 (Sunday), LateAircraft = 1, NAS = 1, Carrier = 0, AirTimeLabels = 2 (AirTime of 125 minutes)

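Sunday (DayOfWeek = 1) and Carrier = 0 are the reference levels of their factors, so they contribute no terms beyond the intercept. The predicted log(ArrDelay) is: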

log(ArrDelay) = Intercept + 1 * Beta LateAircraft_FACTOR1 + 1 * Beta NAS_FACTOR1 
+ 1 * Beta AirTimeLabels_FACTORS2 + 1 * Beta LateAircraft_FACTOR1:NAS_FACTOR1 
+ 1 * Beta LateAircraft_FACTOR1:AirTimeLabels_FACTORS2 + 1 * Beta NAS_FACTOR1:AirTimeLabels_FACTORS2 
  
log(ArrDelay) = 3.609953 + 0.329698 -0.450673 +  0.079586 + 0.189300 -0.125200 + 0.288695  
              = 3.921359
  
  
## [1] 3.921359

Hence, the predicted log(ArrDelay) of a flight on a Sunday that was delayed due to a late aircraft and a National Air System alert (but not due to the carrier) with an AirTime of 125 minutes is 3.921359. Because the response variable was log-transformed, this value is on the log scale (log minutes) rather than in minutes, and we report it as it is.
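The prediction above can be reproduced by summing the relevant coefficients from the regression summary. The sketch below also exponentiates the result to give a rough figure in minutes; that back-transformation is an addition for illustration, assumes natural logarithms, and estimates something closer to a typical (median-like) delay than a mean delay.

```python
import math

# Coefficients that apply to this flight, copied from the regression summary.
# Sunday and Carrier = 0 are reference levels, so they add nothing.
coefficients = {
    "(Intercept)": 3.609953,
    "LateAircraft_FACTOR1": 0.329698,
    "NAS_FACTOR1": -0.450673,
    "AirTimeLabels_FACTORS2": 0.079586,
    "LateAircraft_FACTOR1:NAS_FACTOR1": 0.189300,
    "LateAircraft_FACTOR1:AirTimeLabels_FACTORS2": -0.125200,
    "NAS_FACTOR1:AirTimeLabels_FACTORS2": 0.288695,
}

# Predicted value on the log scale: every indicator equals 1 for this flight.
log_delay = sum(coefficients.values())

# Back-transform to minutes (assumes natural log; not done in the report).
delay_minutes = math.exp(log_delay)

print(f"{log_delay:.6f}")  # 3.921359
print(f"{delay_minutes:.1f} minutes")
```

Under these assumptions the predicted delay is on the order of 50 minutes.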

## 
## Call:
## lm(formula = log.ArrDelay ~ DayOfWeek_FACTOR + LateAircraft_FACTOR + 
##     Carrier_FACTOR + NAS_FACTOR + AirTimeLabels_FACTORS + DayOfWeek_FACTOR:LateAircraft_FACTOR + 
##     DayOfWeek_FACTOR:Carrier_FACTOR + DayOfWeek_FACTOR:NAS_FACTOR + 
##     DayOfWeek_FACTOR:AirTimeLabels_FACTORS + LateAircraft_FACTOR:Carrier_FACTOR + 
##     LateAircraft_FACTOR:NAS_FACTOR + LateAircraft_FACTOR:AirTimeLabels_FACTORS + 
##     NAS_FACTOR:AirTimeLabels_FACTORS, data = airlines)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.4300 -0.5584 -0.1129  0.4516  2.9331 
## 
## Coefficients:
##                                              Estimate Std. Error t value
## (Intercept)                                  3.609953   0.599964   6.017
## DayOfWeek_FACTOR2                           -0.137534   0.693330  -0.198
## DayOfWeek_FACTOR3                            0.126866   0.918043   0.138
## DayOfWeek_FACTOR4                           -0.170207   0.554735  -0.307
## DayOfWeek_FACTOR5                            0.401451   0.588869   0.682
## DayOfWeek_FACTOR6                            0.097427   0.522038   0.187
## DayOfWeek_FACTOR7                            0.281356   0.568629   0.495
## LateAircraft_FACTOR1                         0.329698   0.449571   0.733
## Carrier_FACTOR1                              0.132435   0.131348   1.008
## NAS_FACTOR1                                 -0.450673   0.444699  -1.013
## AirTimeLabels_FACTORS3                      -0.273195   0.636895  -0.429
## AirTimeLabels_FACTORS2                       0.079586   0.589860   0.135
## AirTimeLabels_FACTORS1                      -0.026031   0.585168  -0.044
## DayOfWeek_FACTOR2:LateAircraft_FACTOR1      -0.010445   0.178379  -0.059
## DayOfWeek_FACTOR3:LateAircraft_FACTOR1       0.079053   0.179511   0.440
## DayOfWeek_FACTOR4:LateAircraft_FACTOR1       0.088697   0.169088   0.525
## DayOfWeek_FACTOR5:LateAircraft_FACTOR1       0.016457   0.165233   0.100
## DayOfWeek_FACTOR6:LateAircraft_FACTOR1       0.148114   0.189801   0.780
## DayOfWeek_FACTOR7:LateAircraft_FACTOR1      -0.233126   0.173476  -1.344
## DayOfWeek_FACTOR2:Carrier_FACTOR1           -0.118156   0.175565  -0.673
## DayOfWeek_FACTOR3:Carrier_FACTOR1           -0.069931   0.182497  -0.383
## DayOfWeek_FACTOR4:Carrier_FACTOR1           -0.247620   0.172552  -1.435
## DayOfWeek_FACTOR5:Carrier_FACTOR1           -0.157814   0.163210  -0.967
## DayOfWeek_FACTOR6:Carrier_FACTOR1           -0.117369   0.188024  -0.624
## DayOfWeek_FACTOR7:Carrier_FACTOR1            0.084735   0.180075   0.471
## DayOfWeek_FACTOR2:NAS_FACTOR1                0.095459   0.181097   0.527
## DayOfWeek_FACTOR3:NAS_FACTOR1               -0.116020   0.181592  -0.639
## DayOfWeek_FACTOR4:NAS_FACTOR1               -0.351476   0.178003  -1.975
## DayOfWeek_FACTOR5:NAS_FACTOR1               -0.032699   0.164785  -0.198
## DayOfWeek_FACTOR6:NAS_FACTOR1                0.076224   0.191915   0.397
## DayOfWeek_FACTOR7:NAS_FACTOR1               -0.096339   0.179125  -0.538
## DayOfWeek_FACTOR2:AirTimeLabels_FACTORS3     0.369661   0.737428   0.501
## DayOfWeek_FACTOR3:AirTimeLabels_FACTORS3     0.091260   0.933489   0.098
## DayOfWeek_FACTOR4:AirTimeLabels_FACTORS3     0.555620   0.594050   0.935
## DayOfWeek_FACTOR5:AirTimeLabels_FACTORS3     0.054907   0.633370   0.087
## DayOfWeek_FACTOR6:AirTimeLabels_FACTORS3    -0.293175   0.561053  -0.523
## DayOfWeek_FACTOR7:AirTimeLabels_FACTORS3    -0.089882   0.606821  -0.148
## DayOfWeek_FACTOR2:AirTimeLabels_FACTORS2     0.253260   0.682924   0.371
## DayOfWeek_FACTOR3:AirTimeLabels_FACTORS2    -0.197705   0.895194  -0.221
## DayOfWeek_FACTOR4:AirTimeLabels_FACTORS2     0.400227   0.536572   0.746
## DayOfWeek_FACTOR5:AirTimeLabels_FACTORS2    -0.180238   0.574679  -0.314
## DayOfWeek_FACTOR6:AirTimeLabels_FACTORS2    -0.238894   0.498047  -0.480
## DayOfWeek_FACTOR7:AirTimeLabels_FACTORS2    -0.059124   0.553857  -0.107
## DayOfWeek_FACTOR2:AirTimeLabels_FACTORS1     0.100674   0.679502   0.148
## DayOfWeek_FACTOR3:AirTimeLabels_FACTORS1    -0.087079   0.891384  -0.098
## DayOfWeek_FACTOR4:AirTimeLabels_FACTORS1     0.560183   0.529901   1.057
## DayOfWeek_FACTOR5:AirTimeLabels_FACTORS1    -0.402067   0.568404  -0.707
## DayOfWeek_FACTOR6:AirTimeLabels_FACTORS1    -0.007404   0.491788  -0.015
## DayOfWeek_FACTOR7:AirTimeLabels_FACTORS1    -0.019772   0.548404  -0.036
## LateAircraft_FACTOR1:Carrier_FACTOR1        -0.165022   0.101105  -1.632
## LateAircraft_FACTOR1:NAS_FACTOR1             0.189300   0.103423   1.830
## LateAircraft_FACTOR1:AirTimeLabels_FACTORS3 -0.247806   0.459549  -0.539
## LateAircraft_FACTOR1:AirTimeLabels_FACTORS2 -0.125200   0.438043  -0.286
## LateAircraft_FACTOR1:AirTimeLabels_FACTORS1 -0.189758   0.435569  -0.436
## NAS_FACTOR1:AirTimeLabels_FACTORS3           0.609553   0.459983   1.325
## NAS_FACTOR1:AirTimeLabels_FACTORS2           0.288695   0.436788   0.661
## NAS_FACTOR1:AirTimeLabels_FACTORS1           0.457238   0.433374   1.055
##                                             Pr(>|t|)    
## (Intercept)                                 2.47e-09 ***
## DayOfWeek_FACTOR2                             0.8428    
## DayOfWeek_FACTOR3                             0.8901    
## DayOfWeek_FACTOR4                             0.7590    
## DayOfWeek_FACTOR5                             0.4956    
## DayOfWeek_FACTOR6                             0.8520    
## DayOfWeek_FACTOR7                             0.6208    
## LateAircraft_FACTOR1                          0.4635    
## Carrier_FACTOR1                               0.3136    
## NAS_FACTOR1                                   0.3111    
## AirTimeLabels_FACTORS3                        0.6681    
## AirTimeLabels_FACTORS2                        0.8927    
## AirTimeLabels_FACTORS1                        0.9645    
## DayOfWeek_FACTOR2:LateAircraft_FACTOR1        0.9533    
## DayOfWeek_FACTOR3:LateAircraft_FACTOR1        0.6598    
## DayOfWeek_FACTOR4:LateAircraft_FACTOR1        0.6000    
## DayOfWeek_FACTOR5:LateAircraft_FACTOR1        0.9207    
## DayOfWeek_FACTOR6:LateAircraft_FACTOR1        0.4354    
## DayOfWeek_FACTOR7:LateAircraft_FACTOR1        0.1793    
## DayOfWeek_FACTOR2:Carrier_FACTOR1             0.5011    
## DayOfWeek_FACTOR3:Carrier_FACTOR1             0.7017    
## DayOfWeek_FACTOR4:Carrier_FACTOR1             0.1516    
## DayOfWeek_FACTOR5:Carrier_FACTOR1             0.3338    
## DayOfWeek_FACTOR6:Carrier_FACTOR1             0.5326    
## DayOfWeek_FACTOR7:Carrier_FACTOR1             0.6381    
## DayOfWeek_FACTOR2:NAS_FACTOR1                 0.5982    
## DayOfWeek_FACTOR3:NAS_FACTOR1                 0.5230    
## DayOfWeek_FACTOR4:NAS_FACTOR1                 0.0486 *  
## DayOfWeek_FACTOR5:NAS_FACTOR1                 0.8427    
## DayOfWeek_FACTOR6:NAS_FACTOR1                 0.6913    
## DayOfWeek_FACTOR7:NAS_FACTOR1                 0.5908    
## DayOfWeek_FACTOR2:AirTimeLabels_FACTORS3      0.6163    
## DayOfWeek_FACTOR3:AirTimeLabels_FACTORS3      0.9221    
## DayOfWeek_FACTOR4:AirTimeLabels_FACTORS3      0.3498    
## DayOfWeek_FACTOR5:AirTimeLabels_FACTORS3      0.9309    
## DayOfWeek_FACTOR6:AirTimeLabels_FACTORS3      0.6014    
## DayOfWeek_FACTOR7:AirTimeLabels_FACTORS3      0.8823    
## DayOfWeek_FACTOR2:AirTimeLabels_FACTORS2      0.7108    
## DayOfWeek_FACTOR3:AirTimeLabels_FACTORS2      0.8253    
## DayOfWeek_FACTOR4:AirTimeLabels_FACTORS2      0.4559    
## DayOfWeek_FACTOR5:AirTimeLabels_FACTORS2      0.7539    
## DayOfWeek_FACTOR6:AirTimeLabels_FACTORS2      0.6316    
## DayOfWeek_FACTOR7:AirTimeLabels_FACTORS2      0.9150    
## DayOfWeek_FACTOR2:AirTimeLabels_FACTORS1      0.8822    
## DayOfWeek_FACTOR3:AirTimeLabels_FACTORS1      0.9222    
## DayOfWeek_FACTOR4:AirTimeLabels_FACTORS1      0.2907    
## DayOfWeek_FACTOR5:AirTimeLabels_FACTORS1      0.4795    
## DayOfWeek_FACTOR6:AirTimeLabels_FACTORS1      0.9880    
## DayOfWeek_FACTOR7:AirTimeLabels_FACTORS1      0.9712    
## LateAircraft_FACTOR1:Carrier_FACTOR1          0.1029    
## LateAircraft_FACTOR1:NAS_FACTOR1              0.0675 .  
## LateAircraft_FACTOR1:AirTimeLabels_FACTORS3   0.5898    
## LateAircraft_FACTOR1:AirTimeLabels_FACTORS2   0.7751    
## LateAircraft_FACTOR1:AirTimeLabels_FACTORS1   0.6632    
## NAS_FACTOR1:AirTimeLabels_FACTORS3            0.1854    
## NAS_FACTOR1:AirTimeLabels_FACTORS2            0.5088    
## NAS_FACTOR1:AirTimeLabels_FACTORS1            0.2916    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7326 on 1026 degrees of freedom
## Multiple R-squared:  0.07676,    Adjusted R-squared:  0.02637 
## F-statistic: 1.523 on 56 and 1026 DF,  p-value: 0.008958
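The last three lines of the summary are linked by standard formulas, which makes them easy to cross-check. The sketch below recomputes the adjusted R-squared and the overall F-statistic from the reported multiple R-squared and degrees of freedom; it is an illustrative check in Python, not part of the original R analysis.

```python
# Values copied from the regression summary above.
r_squared = 0.07676
model_df, residual_df = 56, 1026
n_obs = model_df + residual_df + 1  # observations, including the intercept

# Adjusted R-squared penalizes the fit for the number of estimated terms.
adjusted_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / residual_df

# Overall F-statistic: explained variance per model df over
# unexplained variance per residual df.
f_statistic = (r_squared / model_df) / ((1 - r_squared) / residual_df)

print(f"adjusted R-squared = {adjusted_r_squared:.5f}")
print(f"F-statistic        = {f_statistic:.3f}")
```

Both values agree with the summary (0.02637 and 1.523) to within rounding. The small R-squared indicates that the model explains under 8% of the variation in log(ArrDelay).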

Discussion

We examined whether arrival delay differs by the day of the week, by whether the flight was delayed due to a late aircraft, by whether it was delayed due to the carrier/company, by whether it was delayed due to the National Air System, and by total time in the air. Of the main effects, only LateAircraft was significant: flights that were delayed due to a late aircraft had higher arrival delays than flights that were not. We also found a significant interaction between LateAircraft and Carrier. Among flights delayed due to a late aircraft, those also delayed due to the carrier/company had a lower mean log(ArrDelay); among flights not delayed due to a late aircraft, those delayed due to the carrier/company had a higher mean log(ArrDelay).

It is important to know which factors affect arrival delay time, which of them are significant, and which interact with one another, as more and more people fly in and out of their cities and flight delays are becoming a nuisance for both passengers and airlines.

The final model has some limitations: the independence assumption is not perfectly met, as the residuals lie somewhat more toward the right side of the residual plot, and log(ArrDelay) was still not perfectly normal in the EDA. In future work, we would like to explore which other factors affect arrival delay time, for example fuel amount or total number of passengers. Additional variables would provide another perspective on arrival delay times.