I focused on the ‘flights’ dataset within this collection, which looks at arrival and departure delays of flights originating in JFK, LGA and EWR in the year 2013.
After downloading the zipfile, the flight data was compiled according to the R code in the file “flights.R”, with the slight modification that I narrowed the data down to the first 3 months of 2013 only.
flight=read.table("flight_data.txt", header=T, sep="\t")
dim(flight)
## [1] 80789 16
str(flight)
## 'data.frame': 80789 obs. of 16 variables:
## $ year : int 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
## $ month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ day : int 1 1 1 1 1 1 1 1 1 1 ...
## $ dep_time : int 517 533 542 544 554 554 555 557 557 558 ...
## $ dep_delay: int 2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
## $ arr_time : int 830 850 923 1004 812 740 913 709 838 753 ...
## $ arr_delay: int 11 20 33 -18 -25 12 19 -14 -8 8 ...
## $ carrier : Factor w/ 16 levels "9E","AA","AS",..: 12 12 2 4 5 12 4 6 4 2 ...
## $ tailnum : Factor w/ 3576 levels "","D942DN","N0EGMQ",..: 179 493 2201 2886 2424 1046 1682 2959 2022 1077 ...
## $ flight : int 1545 1714 1141 725 461 1696 507 5708 79 301 ...
## $ origin : Factor w/ 3 levels "EWR","JFK","LGA": 1 3 2 2 3 1 1 3 2 3 ...
## $ dest : Factor w/ 96 levels "ALB","ATL","AUS",..: 41 41 53 10 2 63 33 40 49 63 ...
## $ air_time : int 227 227 160 183 116 150 158 53 140 138 ...
## $ distance : int 1400 1416 1089 1576 762 719 1065 229 944 733 ...
## $ hour : int 5 5 5 5 5 5 5 5 5 5 ...
## $ minute : int 17 33 42 44 54 54 55 57 57 58 ...
Because the dataset was still very large, I decided to narrow down to include flights originating in JFK only and in the month of January only.
Finally, missing or NA values were removed.
jfk<-flight[flight$origin=="JFK" & flight$month=="1",]
dim(jfk)
## [1] 9161 16
jfk<-na.omit(jfk)
dim(jfk)
## [1] 9031 16
summary(jfk)
## year month day dep_time
## Min. :2013 Min. :1 Min. : 1.00 Min. : 1.0
## 1st Qu.:2013 1st Qu.:1 1st Qu.: 8.00 1st Qu.: 911.5
## Median :2013 Median :1 Median :16.00 Median :1507.0
## Mean :2013 Mean :1 Mean :15.74 Mean :1391.8
## 3rd Qu.:2013 3rd Qu.:1 3rd Qu.:24.00 3rd Qu.:1809.0
## Max. :2013 Max. :1 Max. :31.00 Max. :2359.0
##
## dep_delay arr_time arr_delay carrier
## Min. : -17.000 Min. : 1 Min. : -70.000 B6 :3321
## 1st Qu.: -5.000 1st Qu.:1110 1st Qu.: -18.000 DL :1517
## Median : -2.000 Median :1641 Median : -7.000 9E :1338
## Mean : 8.558 Mean :1546 Mean : 1.368 AA :1230
## 3rd Qu.: 5.000 3rd Qu.:2019 3rd Qu.: 9.000 MQ : 570
## Max. :1301.000 Max. :2400 Max. :1272.000 UA : 377
## (Other): 678
## tailnum flight origin dest air_time
## N249JB : 43 Min. : 1 EWR: 0 LAX : 934 Min. : 24.0
## N334JB : 43 1st Qu.: 155 JFK:9031 SFO : 667 1st Qu.: 72.0
## N239JB : 39 Median : 711 LGA: 0 BOS : 475 Median :154.0
## N281JB : 39 Mean :1377 MCO : 456 Mean :181.2
## N283JB : 39 3rd Qu.:2041 FLL : 437 3rd Qu.:313.0
## N266JB : 36 Max. :5716 SJU : 410 Max. :660.0
## (Other):8792 (Other):5652
## distance hour minute
## Min. : 94 Min. : 0.0 Min. : 0.00
## 1st Qu.: 425 1st Qu.: 9.0 1st Qu.:16.00
## Median :1041 Median :15.0 Median :32.00
## Mean :1241 Mean :13.6 Mean :31.99
## 3rd Qu.:2248 3rd Qu.:18.0 3rd Qu.:50.00
## Max. :4983 Max. :23.0 Max. :59.00
##
I am interested in arrival delay as the response/dependent variable. The explanatory/independent variables I want to look at are departure delay and air time because I anticipate that these are predictors that may help explain the variation in arrival delay of flights.
Independent variables:
departure delay of flights departing from JFK in January 2013 in minutes (“dep_delay”)
air time of flights originating in JFK January 2013 in minutes (“air_time”)
Dependent variable: arrival delay of flights that originated in JFK in January 2013 in minutes (“arr_delay”)
colnames(jfk)
## [1] "year" "month" "day" "dep_time" "dep_delay"
## [6] "arr_time" "arr_delay" "carrier" "tailnum" "flight"
## [11] "origin" "dest" "air_time" "distance" "hour"
## [16] "minute"
data.jfk=jfk[,c("arr_delay", "dep_delay", "air_time")]
The multiple linear regression model is testing whether the variation in arrival delay of flights originating from JFK in January 2013 can be explained by the variation in the departure delay of flights leaving JFK and the air travel time, where all the variables are in minutes.
The null hypothesis is that there is no association between the independent variables (departure delay and air time) and the dependent variable (arrival delay).
plot(data.jfk)
cor(data.jfk)
## arr_delay dep_delay air_time
## arr_delay 1.00000000 0.91002261 -0.07826183
## dep_delay 0.91002261 1.00000000 -0.05687568
## air_time -0.07826183 -0.05687568 1.00000000
Some observations:
As one would expect, there seems to be a strong positive correlation between departure delay and arrival delay.
Weak negative correlation between air time and arrival delay.
Weak negative correlation between the two independent variables.
I would like to remove some of the extreme values in the dep_delay data, to include only those flights delayed by less than 400 minutes. Also removing some extreme points in the air time data of those flights with air time more than 600 minutes.
data.jfk=data.jfk[data.jfk$dep_delay<400 & data.jfk$air_time<600,]
dim(data.jfk)
## [1] 8998 3
Here we add independent variables to the model based on the strength of their correlations with the dependent variable, from highest to lowest. Based on pair-wise correlations from before, departure delay had a higher correlation with arrival delay followed by air time. So we build two models in this order - the first model using departure delay alone, then a second model by adding air time.
Arrival delay and departure delay
attach(data.jfk)
h.lm.one<-lm(arr_delay~dep_delay)
summary(h.lm.one)
##
## Call:
## lm(formula = arr_delay ~ dep_delay)
##
## Residuals:
## Min 1Q Median 3Q Max
## -61.934 -9.853 -0.671 9.283 123.230
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.251721 0.179448 -40.41 <2e-16 ***
## dep_delay 1.015483 0.005533 183.52 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.46 on 8996 degrees of freedom
## Multiple R-squared: 0.7892, Adjusted R-squared: 0.7892
## F-statistic: 3.368e+04 on 1 and 8996 DF, p-value: < 2.2e-16
Departure delay is a significant predictor of arrival delay (P<2.2e-16), and variation in departure delay can explain about 79% of the variation in arrival delay. The slope of the regression equation (1.01) indicates that for every one minute delay in departure of flights leaving JFK in January 2013, there is a corresponding roughly 1 minute arrival delay.
Arrival delay and departure delay + air time
h.lm.two<-lm(arr_delay~dep_delay+air_time)
summary(h.lm.two)
##
## Call:
## lm(formula = arr_delay ~ dep_delay + air_time)
##
## Residuals:
## Min 1Q Median 3Q Max
## -60.972 -10.035 -0.703 9.272 124.779
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.967961 0.329820 -18.095 < 2e-16 ***
## dep_delay 1.013368 0.005546 182.722 < 2e-16 ***
## air_time -0.007050 0.001521 -4.637 3.59e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.44 on 8995 degrees of freedom
## Multiple R-squared: 0.7897, Adjusted R-squared: 0.7896
## F-statistic: 1.689e+04 on 2 and 8995 DF, p-value: < 2.2e-16
Even though air time is a significant predictor of arrival delay (P<2.61e-09), adding this to the model did not improve the R-squared (0.7892 in the previous model and 0.7897 now). In other words, air time did not add much information to the model after departure delay was taken into consideration.
However, I am inclined to keep air time in the model just because it is significant. While this could partly be due to the sample size being rather large, I will not know this until I do a power analysis.
Linear model interpretation
Given that both departure delay and air time were significant in the model, we can reject the null hypothesis and conclude that there is a significant association between departure delay and air time and the response variable, arrival delay of flights originating in JFK in January 2013.
When air time is fixed, for every 1 minute delay in departure, there is about a 1 minute delay in arrival.
When departure delay is fixed, for every 1 minute increase in air time, there is a 0.007 minute earlier arrival. Basically, air time has very little influence on arrival delay.
Interpreting the intercept in this scenario is not very meaningful because when there is 0 air time, there is no question of arrival delay. However, in the simple linear regression, where only departure delay was used to model arrival delay, the intercept of -7.25 indicated that in the absense of a departure delay, flights landed about 7 minutes early.
Now we plot the regression models and look at the relationship between dependent and independent variables visually.
IV departure delay and DV arrival delay.
plot(dep_delay, arr_delay, cex=1,pch=16,col="red", xlab="Departure delay of flights leaving JFK in Jan 2013 (minutes)", ylab="Arrival delay of flights from JFK in Jan 2013 (minutes)", main="Arrival delay vs. Departure delay")
lm.one<-lm(arr_delay~dep_delay)
abline(lm.one$coef, lwd=2)
IV air time and DV arrival delay.
plot(air_time, arr_delay, cex=1,pch=16,col="red", xlab="Air time of flights from JFK in Jan 2013 (minutes)", ylab="Arrival delay of flights from JFK in Jan 2013 (minutes)", main="Arrival delay vs. Departure delay")
lm.two<-lm(arr_delay~air_time)
abline(lm.two$coef, lwd=2)
Next we plot the residuals and standardized residuals to see if their scatter has any pattern that may suggest a problem with the model, or if they are randomly distributed.
lm.all<-lm(arr_delay~dep_delay+air_time)
resid<-resid(lm.all)
st.resid<-rstandard(lm.all)
par(mfrow=c(1,2))
plot(resid, pch=16, col="red", main="Residuals")
abline(0,0)
plot(st.resid, pch=16, col="blue", main="Standardized Residuals")
abline(0,0)
Overall, it looks like the residuals are fairly randomly distributed around 0, but there seems to be a wider scatter of residuals in the positive range of values compared to the negative range.
We need to look at some diagnostic plots of residuals to check whether linear model assumptions are satisfied or violated in this situation.
Assumption 1. The distribution of residuals is normal at every value of the response variable.
For this we will check the normality of residuals via histogram, boxplot and QQ plot.
par(mfrow=c(1,1))
hist(st.resid, breaks = 50, main="Distribution of std. residuals")
boxplot(st.resid, main="Boxplot of std. residuals")
qqnorm(st.resid, ylab="Std. residuals", xlab="Normal scores", main="QQplot of std. residuals")
qqline(st.resid, col="red")
The histogram indicates a fairly normal distribution of residuals with some values extending to the far right. The boxplot depicts the spread of data and can be used to detect obvious outliers. Our plot seems to indicate that the residuals are fairly spread out evenly about the median, and consistent with the histogram, indicates some potential extreme values in the positive range. Finally, the QQ plot serves to illustrate the deviation from the normal distribution, and also suggests some deviation from the ‘theoritical normal’ in the upper range of the distribution.
Assumption 2. Homoscedasticity of the variance of the residuals
For this we will plot the std. residuals against the fitted values of the linear regression model, and also perform White’s test of heteroscedasticity.
par(mfrow=c(1,1))
plot(fitted(lm.all),st.resid, xlab="fitted values", ylab="std. residuals")
abline(c(0,0),lwd=2, col="red")
library("vars")
library("het.test")
dataset1 <- data.frame(x=dep_delay,y=arr_delay)
model1<-VAR(dataset1)
whites.htest(model1)
##
## White's Test for Heteroskedasticity:
## ====================================
##
## No Cross Terms
##
## H0: Homoskedasticity
## H1: Heteroskedasticity
##
## Test Statistic:
## 240.5556
##
## Degrees of Freedom:
## 12
##
## P-value:
## 0.0000
dataset2 <- data.frame(x=air_time,y=arr_delay)
model2<-VAR(dataset2)
whites.htest(model2)
##
## White's Test for Heteroskedasticity:
## ====================================
##
## No Cross Terms
##
## H0: Homoskedasticity
## H1: Heteroskedasticity
##
## Test Statistic:
## 130.7758
##
## Degrees of Freedom:
## 12
##
## P-value:
## 0.0000
Both the plot of the fitted values against the std. residuals and the results of White’s test showing a significant p-value point to a heteroscedastic relationship. This means that the variance of the residuals for every set of values of the predictor is not equal. In other words, the model is insufficient, and the independent variables are not fully explaining the response variable, as there may be other variables involved that we have not accounted for. These could likely relate to indicators such as weather at origin and/or destination or even engine/make of the particular aircrafts involved, for example.
Are assumptions satisfied?
While in general assumption 1 is satisfied, where the distribution of residuals appears to be normal, it may not be the case for every value of the response variable. There are occurrences of extreme values of residuals as noted in the histogram, boxplot and qqplots.
We also found a violation of assumption 2 using White’s test where we found heteroscedasticity of variance of the residuals.
Assumption 4 states that at every value of the outcome the expected (mean) value of the residuals is zero. This would mean a perfectly linear relationship between the DV and IV with an R-squared of 1, which is not true in our case as evidenced by the model summary and also partly by the residuals which are not perfectly randomly distributed around 0.
Assumption 5 states that the expected correlation between residuals, for any two cases, is 0. In other words, every case is independent of other cases. Interestingly for our data it may not be far-fetched to imagine that flights departing on a certain day may be more correlated with one another in terms of their departure delay than flights departing on other days, due to factors like local weather. As another example, arrival delays could be dependent on how busy particular airports are, so we could have clustering of data by airports. We cannot visualise these based on residual plots alone; one way to visualise such violations of this assumption may be to do a clustering dendrogram using departure and arrival delay information overlayed on departure days and destination airports.
Assumption 7 states that no predictors are a perfect linear function of other predictors. Based on the correlation matrix from above, we saw that the two IVs are poorly correlated with one another. It is unlikely that departure delay and air time will be related in a perfect linear function.
Assumptions 3 (error term is additive), 6 (all predictor variables are uncorrelated with the error term) and 8 (mean of error term is 0) are hard to test/confirm.
4 issues of regression
Causality: Hard to test this. In general, it may be likely that very large delays in departure may be causal for delays in arrival.
Sample size: About 9000, which is quite large. Maybe too large? Need to do a power analysis to confirm.
Collinearity: We saw that the two IVs are weakly correlated.