Dataset Selection

Setting up the data

flight=read.table("flight_data.txt", header=T, sep="\t")
dim(flight)
## [1] 80789    16
str(flight)
## 'data.frame':    80789 obs. of  16 variables:
##  $ year     : int  2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
##  $ month    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ day      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ dep_time : int  517 533 542 544 554 554 555 557 557 558 ...
##  $ dep_delay: int  2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
##  $ arr_time : int  830 850 923 1004 812 740 913 709 838 753 ...
##  $ arr_delay: int  11 20 33 -18 -25 12 19 -14 -8 8 ...
##  $ carrier  : Factor w/ 16 levels "9E","AA","AS",..: 12 12 2 4 5 12 4 6 4 2 ...
##  $ tailnum  : Factor w/ 3576 levels "","D942DN","N0EGMQ",..: 179 493 2201 2886 2424 1046 1682 2959 2022 1077 ...
##  $ flight   : int  1545 1714 1141 725 461 1696 507 5708 79 301 ...
##  $ origin   : Factor w/ 3 levels "EWR","JFK","LGA": 1 3 2 2 3 1 1 3 2 3 ...
##  $ dest     : Factor w/ 96 levels "ALB","ATL","AUS",..: 41 41 53 10 2 63 33 40 49 63 ...
##  $ air_time : int  227 227 160 183 116 150 158 53 140 138 ...
##  $ distance : int  1400 1416 1089 1576 762 719 1065 229 944 733 ...
##  $ hour     : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ minute   : int  17 33 42 44 54 54 55 57 57 58 ...
jfk<-flight[flight$origin=="JFK" & flight$month=="1",]
dim(jfk)
## [1] 9161   16
jfk<-na.omit(jfk)
dim(jfk)
## [1] 9031   16
summary(jfk)
##       year          month        day           dep_time     
##  Min.   :2013   Min.   :1   Min.   : 1.00   Min.   :   1.0  
##  1st Qu.:2013   1st Qu.:1   1st Qu.: 8.00   1st Qu.: 911.5  
##  Median :2013   Median :1   Median :16.00   Median :1507.0  
##  Mean   :2013   Mean   :1   Mean   :15.74   Mean   :1391.8  
##  3rd Qu.:2013   3rd Qu.:1   3rd Qu.:24.00   3rd Qu.:1809.0  
##  Max.   :2013   Max.   :1   Max.   :31.00   Max.   :2359.0  
##                                                             
##    dep_delay           arr_time      arr_delay           carrier    
##  Min.   : -17.000   Min.   :   1   Min.   : -70.000   B6     :3321  
##  1st Qu.:  -5.000   1st Qu.:1110   1st Qu.: -18.000   DL     :1517  
##  Median :  -2.000   Median :1641   Median :  -7.000   9E     :1338  
##  Mean   :   8.558   Mean   :1546   Mean   :   1.368   AA     :1230  
##  3rd Qu.:   5.000   3rd Qu.:2019   3rd Qu.:   9.000   MQ     : 570  
##  Max.   :1301.000   Max.   :2400   Max.   :1272.000   UA     : 377  
##                                                       (Other): 678  
##     tailnum         flight     origin          dest         air_time    
##  N249JB :  43   Min.   :   1   EWR:   0   LAX    : 934   Min.   : 24.0  
##  N334JB :  43   1st Qu.: 155   JFK:9031   SFO    : 667   1st Qu.: 72.0  
##  N239JB :  39   Median : 711   LGA:   0   BOS    : 475   Median :154.0  
##  N281JB :  39   Mean   :1377              MCO    : 456   Mean   :181.2  
##  N283JB :  39   3rd Qu.:2041              FLL    : 437   3rd Qu.:313.0  
##  N266JB :  36   Max.   :5716              SJU    : 410   Max.   :660.0  
##  (Other):8792                             (Other):5652                  
##     distance         hour          minute     
##  Min.   :  94   Min.   : 0.0   Min.   : 0.00  
##  1st Qu.: 425   1st Qu.: 9.0   1st Qu.:16.00  
##  Median :1041   Median :15.0   Median :32.00  
##  Mean   :1241   Mean   :13.6   Mean   :31.99  
##  3rd Qu.:2248   3rd Qu.:18.0   3rd Qu.:50.00  
##  Max.   :4983   Max.   :23.0   Max.   :59.00  
## 

Selection of independent and dependent variables

I am interested in arrival delay as the response/dependent variable. The explanatory/independent variables I want to look at are departure delay and air time because I anticipate that these are predictors that may help explain the variation in arrival delay of flights.

Independent variables:

Dependent variable: arrival delay of flights that originated in JFK in January 2013 in minutes (“arr_delay”)

colnames(jfk)
##  [1] "year"      "month"     "day"       "dep_time"  "dep_delay"
##  [6] "arr_time"  "arr_delay" "carrier"   "tailnum"   "flight"   
## [11] "origin"    "dest"      "air_time"  "distance"  "hour"     
## [16] "minute"
data.jfk=jfk[,c("arr_delay", "dep_delay", "air_time")]

Linear model and null hypothesis

The multiple linear regression model is testing whether the variation in arrival delay of flights originating from JFK in January 2013 can be explained by the variation in the departure delay of flights leaving JFK and the air travel time, where all the variables are in minutes.

The null hypothesis is that there is no association between the independent variables (departure delay and air time) and the dependent variable (arrival delay).

A quick look at some plots and correlations

plot(data.jfk)

cor(data.jfk)
##             arr_delay   dep_delay    air_time
## arr_delay  1.00000000  0.91002261 -0.07826183
## dep_delay  0.91002261  1.00000000 -0.05687568
## air_time  -0.07826183 -0.05687568  1.00000000

Some observations:

data.jfk=data.jfk[data.jfk$dep_delay<400 & data.jfk$air_time<600,]
dim(data.jfk)
## [1] 8998    3

Step-wise linear regression model

Here we add independent variables to the model based on the strength of their correlations with the dependent variable, from highest to lowest. Based on pair-wise correlations from before, departure delay had a higher correlation with arrival delay followed by air time. So we build two models in this order - the first model using departure delay alone, then a second model by adding air time.

Arrival delay and departure delay

attach(data.jfk)
h.lm.one<-lm(arr_delay~dep_delay)
summary(h.lm.one)
## 
## Call:
## lm(formula = arr_delay ~ dep_delay)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -61.934  -9.853  -0.671   9.283 123.230 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -7.251721   0.179448  -40.41   <2e-16 ***
## dep_delay    1.015483   0.005533  183.52   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.46 on 8996 degrees of freedom
## Multiple R-squared:  0.7892, Adjusted R-squared:  0.7892 
## F-statistic: 3.368e+04 on 1 and 8996 DF,  p-value: < 2.2e-16

Departure delay is a significant predictor of arrival delay (P<2.2e-16), and variation in departure delay can explain about 79% of the variation in arrival delay. The slope of the regression equation (1.01) indicates that for every one minute delay in departure of flights leaving JFK in January 2013, there is a corresponding roughly 1 minute arrival delay.

Arrival delay and departure delay + air time

h.lm.two<-lm(arr_delay~dep_delay+air_time)
summary(h.lm.two)
## 
## Call:
## lm(formula = arr_delay ~ dep_delay + air_time)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -60.972 -10.035  -0.703   9.272 124.779 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.967961   0.329820 -18.095  < 2e-16 ***
## dep_delay    1.013368   0.005546 182.722  < 2e-16 ***
## air_time    -0.007050   0.001521  -4.637 3.59e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.44 on 8995 degrees of freedom
## Multiple R-squared:  0.7897, Adjusted R-squared:  0.7896 
## F-statistic: 1.689e+04 on 2 and 8995 DF,  p-value: < 2.2e-16

Even though air time is a significant predictor of arrival delay (P<2.61e-09), adding this to the model did not improve the R-squared (0.7892 in the previous model and 0.7897 now). In other words, air time did not add much information to the model after departure delay was taken into consideration.

However, I am inclined to keep air time in the model just because it is significant. While this could partly be due to the sample size being rather large, I will not know this until I do a power analysis.

Linear model interpretation

Scattergrams

Now we plot the regression models and look at the relationship between dependent and independent variables visually.

IV departure delay and DV arrival delay.

plot(dep_delay, arr_delay, cex=1,pch=16,col="red", xlab="Departure delay of flights leaving JFK in Jan 2013 (minutes)", ylab="Arrival delay of flights from JFK in Jan 2013 (minutes)", main="Arrival delay vs. Departure delay")
lm.one<-lm(arr_delay~dep_delay)
abline(lm.one$coef, lwd=2)

IV air time and DV arrival delay.

plot(air_time, arr_delay, cex=1,pch=16,col="red", xlab="Air time of flights from JFK in Jan 2013 (minutes)", ylab="Arrival delay of flights from JFK in Jan 2013 (minutes)", main="Arrival delay vs. Departure delay")
lm.two<-lm(arr_delay~air_time)
abline(lm.two$coef, lwd=2)

Plots of residuals and standardized residuals

Next we plot the residuals and standardized residuals to see if their scatter has any pattern that may suggest a problem with the model, or if they are randomly distributed.

lm.all<-lm(arr_delay~dep_delay+air_time)
resid<-resid(lm.all)
st.resid<-rstandard(lm.all)
par(mfrow=c(1,2))
plot(resid, pch=16, col="red", main="Residuals")
abline(0,0)
plot(st.resid, pch=16, col="blue", main="Standardized Residuals")
abline(0,0)

Overall, it looks like the residuals are fairly randomly distributed around 0, but there seems to be a wider scatter of residuals in the positive range of values compared to the negative range.

We need to look at some diagnostic plots of residuals to check whether linear model assumptions are satisfied or violated in this situation.

Diagnostic plots to check linear model assumptions

Assumption 1. The distribution of residuals is normal at every value of the response variable.

For this we will check the normality of residuals via histogram, boxplot and QQ plot.

par(mfrow=c(1,1))
hist(st.resid, breaks = 50, main="Distribution of std. residuals")

boxplot(st.resid, main="Boxplot of std. residuals")

qqnorm(st.resid, ylab="Std. residuals", xlab="Normal scores", main="QQplot of std. residuals")
qqline(st.resid, col="red")

The histogram indicates a fairly normal distribution of residuals with some values extending to the far right. The boxplot depicts the spread of data and can be used to detect obvious outliers. Our plot seems to indicate that the residuals are fairly spread out evenly about the median, and consistent with the histogram, indicates some potential extreme values in the positive range. Finally, the QQ plot serves to illustrate the deviation from the normal distribution, and also suggests some deviation from the ‘theoritical normal’ in the upper range of the distribution.

Assumption 2. Homoscedasticity of the variance of the residuals

For this we will plot the std. residuals against the fitted values of the linear regression model, and also perform White’s test of heteroscedasticity.

par(mfrow=c(1,1))
plot(fitted(lm.all),st.resid, xlab="fitted values", ylab="std. residuals")
abline(c(0,0),lwd=2, col="red")
library("vars")

library("het.test")
dataset1 <- data.frame(x=dep_delay,y=arr_delay)
model1<-VAR(dataset1)
whites.htest(model1)
## 
## White's Test for Heteroskedasticity:
## ==================================== 
## 
##  No Cross Terms
## 
##  H0: Homoskedasticity
##  H1: Heteroskedasticity
## 
##  Test Statistic:
##  240.5556 
## 
##  Degrees of Freedom:
##  12 
## 
##  P-value:
##  0.0000
dataset2 <- data.frame(x=air_time,y=arr_delay)
model2<-VAR(dataset2)
whites.htest(model2)
## 
## White's Test for Heteroskedasticity:
## ==================================== 
## 
##  No Cross Terms
## 
##  H0: Homoskedasticity
##  H1: Heteroskedasticity
## 
##  Test Statistic:
##  130.7758 
## 
##  Degrees of Freedom:
##  12 
## 
##  P-value:
##  0.0000

Both the plot of the fitted values against the std. residuals and the results of White’s test showing a significant p-value point to a heteroscedastic relationship. This means that the variance of the residuals for every set of values of the predictor is not equal. In other words, the model is insufficient, and the independent variables are not fully explaining the response variable, as there may be other variables involved that we have not accounted for. These could likely relate to indicators such as weather at origin and/or destination or even engine/make of the particular aircrafts involved, for example.

Overall interpretation

Are assumptions satisfied?

While in general assumption 1 is satisfied, where the distribution of residuals appears to be normal, it may not be the case for every value of the response variable. There are occurrences of extreme values of residuals as noted in the histogram, boxplot and qqplots.

We also found a violation of assumption 2 using White’s test where we found heteroscedasticity of variance of the residuals.

Assumption 4 states that at every value of the outcome the expected (mean) value of the residuals is zero. This would mean a perfectly linear relationship between the DV and IV with an R-squared of 1, which is not true in our case as evidenced by the model summary and also partly by the residuals which are not perfectly randomly distributed around 0.

Assumption 5 states that the expected correlation between residuals, for any two cases, is 0. In other words, every case is independent of other cases. Interestingly for our data it may not be far-fetched to imagine that flights departing on a certain day may be more correlated with one another in terms of their departure delay than flights departing on other days, due to factors like local weather. As another example, arrival delays could be dependent on how busy particular airports are, so we could have clustering of data by airports. We cannot visualise these based on residual plots alone; one way to visualise such violations of this assumption may be to do a clustering dendrogram using departure and arrival delay information overlayed on departure days and destination airports.

Assumption 7 states that no predictors are a perfect linear function of other predictors. Based on the correlation matrix from above, we saw that the two IVs are poorly correlated with one another. It is unlikely that departure delay and air time will be related in a perfect linear function.

Assumptions 3 (error term is additive), 6 (all predictor variables are uncorrelated with the error term) and 8 (mean of error term is 0) are hard to test/confirm.

4 issues of regression