Dataset Selection

I selected the NYC airline flights data from the Hadley Wickam datasets: https://github.com/hadley/nycflights13

Setting up the data

I focused on the ‘flights’ dataset within this collection, which looks at arrival and departure delays of flights originating in JFK, LGA and EWR in the year 2013.
After downloading the zipfile, the flight data was compiled according to the R code in the file “flights.R”, with the slight modification that I narrowed the data down to the first 3 months of 2013 only.

flight=read.table("flight_data.txt", header=T, sep="\t")
dim(flight)

## [1] 80789    16

str(flight)

## 'data.frame':    80789 obs. of  16 variables:
##  $ year     : int  2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
##  $ month    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ day      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ dep_time : int  517 533 542 544 554 554 555 557 557 558 ...
##  $ dep_delay: int  2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
##  $ arr_time : int  830 850 923 1004 812 740 913 709 838 753 ...
##  $ arr_delay: int  11 20 33 -18 -25 12 19 -14 -8 8 ...
##  $ carrier  : Factor w/ 16 levels "9E","AA","AS",..: 12 12 2 4 5 12 4 6 4 2 ...
##  $ tailnum  : Factor w/ 3576 levels "","D942DN","N0EGMQ",..: 179 493 2201 2886 2424 1046 1682 2959 2022 1077 ...
##  $ flight   : int  1545 1714 1141 725 461 1696 507 5708 79 301 ...
##  $ origin   : Factor w/ 3 levels "EWR","JFK","LGA": 1 3 2 2 3 1 1 3 2 3 ...
##  $ dest     : Factor w/ 96 levels "ALB","ATL","AUS",..: 41 41 53 10 2 63 33 40 49 63 ...
##  $ air_time : int  227 227 160 183 116 150 158 53 140 138 ...
##  $ distance : int  1400 1416 1089 1576 762 719 1065 229 944 733 ...
##  $ hour     : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ minute   : int  17 33 42 44 54 54 55 57 57 58 ...

Because the dataset was still very large, I decided to narrow down to include flights originating in JFK only and in the month of January only.
Finally, missing or NA values were removed.

jfk<-flight[flight$origin=="JFK" & flight$month=="1",]
dim(jfk)

## [1] 9161   16

jfk<-na.omit(jfk)
dim(jfk)

## [1] 9031   16

summary(jfk)

##       year          month        day           dep_time     
##  Min.   :2013   Min.   :1   Min.   : 1.00   Min.   :   1.0  
##  1st Qu.:2013   1st Qu.:1   1st Qu.: 8.00   1st Qu.: 911.5  
##  Median :2013   Median :1   Median :16.00   Median :1507.0  
##  Mean   :2013   Mean   :1   Mean   :15.74   Mean   :1391.8  
##  3rd Qu.:2013   3rd Qu.:1   3rd Qu.:24.00   3rd Qu.:1809.0  
##  Max.   :2013   Max.   :1   Max.   :31.00   Max.   :2359.0  
##                                                             
##    dep_delay           arr_time      arr_delay           carrier    
##  Min.   : -17.000   Min.   :   1   Min.   : -70.000   B6     :3321  
##  1st Qu.:  -5.000   1st Qu.:1110   1st Qu.: -18.000   DL     :1517  
##  Median :  -2.000   Median :1641   Median :  -7.000   9E     :1338  
##  Mean   :   8.558   Mean   :1546   Mean   :   1.368   AA     :1230  
##  3rd Qu.:   5.000   3rd Qu.:2019   3rd Qu.:   9.000   MQ     : 570  
##  Max.   :1301.000   Max.   :2400   Max.   :1272.000   UA     : 377  
##                                                       (Other): 678  
##     tailnum         flight     origin          dest         air_time    
##  N249JB :  43   Min.   :   1   EWR:   0   LAX    : 934   Min.   : 24.0  
##  N334JB :  43   1st Qu.: 155   JFK:9031   SFO    : 667   1st Qu.: 72.0  
##  N239JB :  39   Median : 711   LGA:   0   BOS    : 475   Median :154.0  
##  N281JB :  39   Mean   :1377              MCO    : 456   Mean   :181.2  
##  N283JB :  39   3rd Qu.:2041              FLL    : 437   3rd Qu.:313.0  
##  N266JB :  36   Max.   :5716              SJU    : 410   Max.   :660.0  
##  (Other):8792                             (Other):5652                  
##     distance         hour          minute     
##  Min.   :  94   Min.   : 0.0   Min.   : 0.00  
##  1st Qu.: 425   1st Qu.: 9.0   1st Qu.:16.00  
##  Median :1041   Median :15.0   Median :32.00  
##  Mean   :1241   Mean   :13.6   Mean   :31.99  
##  3rd Qu.:2248   3rd Qu.:18.0   3rd Qu.:50.00  
##  Max.   :4983   Max.   :23.0   Max.   :59.00  
##

Selection of independent and dependent variables

I am interested in arrival delay as the response/dependent variable. The explanatory/independent variables I want to look at are departure delay and air time because I anticipate that these are predictors that may help explain the variation in arrival delay of flights.

Independent variables:

departure delay of flights departing from JFK in January 2013 in minutes (“dep_delay”)
air time of flights originating in JFK January 2013 in minutes (“air_time”)

Dependent variable: arrival delay of flights that originated in JFK in January 2013 in minutes (“arr_delay”)

colnames(jfk)

##  [1] "year"      "month"     "day"       "dep_time"  "dep_delay"
##  [6] "arr_time"  "arr_delay" "carrier"   "tailnum"   "flight"   
## [11] "origin"    "dest"      "air_time"  "distance"  "hour"     
## [16] "minute"

data.jfk=jfk[,c("arr_delay", "dep_delay", "air_time")]

Linear model and null hypothesis

The multiple linear regression model is testing whether the variation in arrival delay of flights originating from JFK in January 2013 can be explained by the variation in the departure delay of flights leaving JFK and the air travel time, where all the variables are in minutes.

The null hypothesis is that there is no association between the independent variables (departure delay and air time) and the dependent variable (arrival delay).

A quick look at some plots and correlations

plot(data.jfk)

cor(data.jfk)

##             arr_delay   dep_delay    air_time
## arr_delay  1.00000000  0.91002261 -0.07826183
## dep_delay  0.91002261  1.00000000 -0.05687568
## air_time  -0.07826183 -0.05687568  1.00000000

Some observations:

As one would expect, there seems to be a strong positive correlation between departure delay and arrival delay.
Weak negative correlation between air time and arrival delay.
Weak negative correlation between the two independent variables.
I would like to remove some of the extreme values in the dep_delay data, to include only those flights delayed by less than 400 minutes. Also removing some extreme points in the air time data of those flights with air time more than 600 minutes.

data.jfk=data.jfk[data.jfk$dep_delay<400 & data.jfk$air_time<600,]
dim(data.jfk)

## [1] 8998    3

Step-wise linear regression model

Here we add independent variables to the model based on the strength of their correlations with the dependent variable, from highest to lowest. Based on pair-wise correlations from before, departure delay had a higher correlation with arrival delay followed by air time. So we build two models in this order - the first model using departure delay alone, then a second model by adding air time.

Arrival delay and departure delay

attach(data.jfk)
h.lm.one<-lm(arr_delay~dep_delay)
summary(h.lm.one)

## 
## Call:
## lm(formula = arr_delay ~ dep_delay)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -61.934  -9.853  -0.671   9.283 123.230 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -7.251721   0.179448  -40.41   <2e-16 ***
## dep_delay    1.015483   0.005533  183.52   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.46 on 8996 degrees of freedom
## Multiple R-squared:  0.7892, Adjusted R-squared:  0.7892 
## F-statistic: 3.368e+04 on 1 and 8996 DF,  p-value: < 2.2e-16

Departure delay is a significant predictor of arrival delay (P<2.2e-16), and variation in departure delay can explain about 79% of the variation in arrival delay. The slope of the regression equation (1.01) indicates that for every one minute delay in departure of flights leaving JFK in January 2013, there is a corresponding roughly 1 minute arrival delay.

Arrival delay and departure delay + air time

h.lm.two<-lm(arr_delay~dep_delay+air_time)
summary(h.lm.two)

## 
## Call:
## lm(formula = arr_delay ~ dep_delay + air_time)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -60.972 -10.035  -0.703   9.272 124.779 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.967961   0.329820 -18.095  < 2e-16 ***
## dep_delay    1.013368   0.005546 182.722  < 2e-16 ***
## air_time    -0.007050   0.001521  -4.637 3.59e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.44 on 8995 degrees of freedom
## Multiple R-squared:  0.7897, Adjusted R-squared:  0.7896 
## F-statistic: 1.689e+04 on 2 and 8995 DF,  p-value: < 2.2e-16

Even though air time is a significant predictor of arrival delay (P<2.61e-09), adding this to the model did not improve the R-squared (0.7892 in the previous model and 0.7897 now). In other words, air time did not add much information to the model after departure delay was taken into consideration.

However, I am inclined to keep air time in the model just because it is significant. While this could partly be due to the sample size being rather large, I will not know this until I do a power analysis.

Linear model interpretation

Given that both departure delay and air time were significant in the model, we can reject the null hypothesis and conclude that there is a significant association between departure delay and air time and the response variable, arrival delay of flights originating in JFK in January 2013.
When air time is fixed, for every 1 minute delay in departure, there is about a 1 minute delay in arrival.
When departure delay is fixed, for every 1 minute increase in air time, there is a 0.007 minute earlier arrival. Basically, air time has very little influence on arrival delay.
Interpreting the intercept in this scenario is not very meaningful because when there is 0 air time, there is no question of arrival delay. However, in the simple linear regression, where only departure delay was used to model arrival delay, the intercept of -7.25 indicated that in the absense of a departure delay, flights landed about 7 minutes early.

Scattergrams

Now we plot the regression models and look at the relationship between dependent and independent variables visually.

IV departure delay and DV arrival delay.

plot(dep_delay, arr_delay, cex=1,pch=16,col="red", xlab="Departure delay of flights leaving JFK in Jan 2013 (minutes)", ylab="Arrival delay of flights from JFK in Jan 2013 (minutes)", main="Arrival delay vs. Departure delay")
lm.one<-lm(arr_delay~dep_delay)
abline(lm.one$coef, lwd=2)

IV air time and DV arrival delay.

plot(air_time, arr_delay, cex=1,pch=16,col="red", xlab="Air time of flights from JFK in Jan 2013 (minutes)", ylab="Arrival delay of flights from JFK in Jan 2013 (minutes)", main="Arrival delay vs. Departure delay")
lm.two<-lm(arr_delay~air_time)
abline(lm.two$coef, lwd=2)

Plots of residuals and standardized residuals

Next we plot the residuals and standardized residuals to see if their scatter has any pattern that may suggest a problem with the model, or if they are randomly distributed.

lm.all<-lm(arr_delay~dep_delay+air_time)
resid<-resid(lm.all)
st.resid<-rstandard(lm.all)
par(mfrow=c(1,2))
plot(resid, pch=16, col="red", main="Residuals")
abline(0,0)
plot(st.resid, pch=16, col="blue", main="Standardized Residuals")
abline(0,0)

Overall, it looks like the residuals are fairly randomly distributed around 0, but there seems to be a wider scatter of residuals in the positive range of values compared to the negative range.

We need to look at some diagnostic plots of residuals to check whether linear model assumptions are satisfied or violated in this situation.

Diagnostic plots to check linear model assumptions

Assumption 1. The distribution of residuals is normal at every value of the response variable.

For this we will check the normality of residuals via histogram, boxplot and QQ plot.

par(mfrow=c(1,1))
hist(st.resid, breaks = 50, main="Distribution of std. residuals")

boxplot(st.resid, main="Boxplot of std. residuals")

qqnorm(st.resid, ylab="Std. residuals", xlab="Normal scores", main="QQplot of std. residuals")
qqline(st.resid, col="red")

The histogram indicates a fairly normal distribution of residuals with some values extending to the far right. The boxplot depicts the spread of data and can be used to detect obvious outliers. Our plot seems to indicate that the residuals are fairly spread out evenly about the median, and consistent with the histogram, indicates some potential extreme values in the positive range. Finally, the QQ plot serves to illustrate the deviation from the normal distribution, and also suggests some deviation from the ‘theoritical normal’ in the upper range of the distribution.

Assumption 2. Homoscedasticity of the variance of the residuals

For this we will plot the std. residuals against the fitted values of the linear regression model, and also perform White’s test of heteroscedasticity.

par(mfrow=c(1,1))
plot(fitted(lm.all),st.resid, xlab="fitted values", ylab="std. residuals")
abline(c(0,0),lwd=2, col="red")
library("vars")

library("het.test")
dataset1 <- data.frame(x=dep_delay,y=arr_delay)
model1<-VAR(dataset1)
whites.htest(model1)

## 
## White's Test for Heteroskedasticity:
## ==================================== 
## 
##  No Cross Terms
## 
##  H0: Homoskedasticity
##  H1: Heteroskedasticity
## 
##  Test Statistic:
##  240.5556 
## 
##  Degrees of Freedom:
##  12 
## 
##  P-value:
##  0.0000

dataset2 <- data.frame(x=air_time,y=arr_delay)
model2<-VAR(dataset2)
whites.htest(model2)

## 
## White's Test for Heteroskedasticity:
## ==================================== 
## 
##  No Cross Terms
## 
##  H0: Homoskedasticity
##  H1: Heteroskedasticity
## 
##  Test Statistic:
##  130.7758 
## 
##  Degrees of Freedom:
##  12 
## 
##  P-value:
##  0.0000

Both the plot of the fitted values against the std. residuals and the results of White’s test showing a significant p-value point to a heteroscedastic relationship. This means that the variance of the residuals for every set of values of the predictor is not equal. In other words, the model is insufficient, and the independent variables are not fully explaining the response variable, as there may be other variables involved that we have not accounted for. These could likely relate to indicators such as weather at origin and/or destination or even engine/make of the particular aircrafts involved, for example.

Overall interpretation

Are assumptions satisfied?

While in general assumption 1 is satisfied, where the distribution of residuals appears to be normal, it may not be the case for every value of the response variable. There are occurrences of extreme values of residuals as noted in the histogram, boxplot and qqplots.

We also found a violation of assumption 2 using White’s test where we found heteroscedasticity of variance of the residuals.

Assumption 4 states that at every value of the outcome the expected (mean) value of the residuals is zero. This would mean a perfectly linear relationship between the DV and IV with an R-squared of 1, which is not true in our case as evidenced by the model summary and also partly by the residuals which are not perfectly randomly distributed around 0.

Assumption 5 states that the expected correlation between residuals, for any two cases, is 0. In other words, every case is independent of other cases. Interestingly for our data it may not be far-fetched to imagine that flights departing on a certain day may be more correlated with one another in terms of their departure delay than flights departing on other days, due to factors like local weather. As another example, arrival delays could be dependent on how busy particular airports are, so we could have clustering of data by airports. We cannot visualise these based on residual plots alone; one way to visualise such violations of this assumption may be to do a clustering dendrogram using departure and arrival delay information overlayed on departure days and destination airports.

Assumption 7 states that no predictors are a perfect linear function of other predictors. Based on the correlation matrix from above, we saw that the two IVs are poorly correlated with one another. It is unlikely that departure delay and air time will be related in a perfect linear function.

Assumptions 3 (error term is additive), 6 (all predictor variables are uncorrelated with the error term) and 8 (mean of error term is 0) are hard to test/confirm.

4 issues of regression

Causality: Hard to test this. In general, it may be likely that very large delays in departure may be causal for delays in arrival.
Sample size: About 9000, which is quite large. Maybe too large? Need to do a power analysis to confirm.
Collinearity: We saw that the two IVs are weakly correlated.

Applied Linear Regression Project 3

Kalpana Gopalakrishnan

Monday, April 20th, 2015