Applied Regression Analysis: Project 3

Applied Regression Analysis: Project 3 Assumptions and Issues

Ali Svoboda

RPI

4/20/15

1. Dataset Selection

The data used in this analysis come from the nycflights13 package created by Hadley Wickham. We will examine data from flights that departed NYC in 2013. The set includes flight information (flight number, carrier, origin and destination airports, date and time of departure), as well as the flight length and whether the flight was delayed.

The code below reads in the dataset and displays the first several observations. Some observations are incomplete, with fields missing, so those observations are removed from the set to avoid distorting the calculations:

library(nycflights13)
xx<-flights
xx<-na.omit(xx)

head(xx)

##   year month day dep_time dep_delay arr_time arr_delay carrier tailnum
## 1 2013     1   1      517         2      830        11      UA  N14228
## 2 2013     1   1      533         4      850        20      UA  N24211
## 3 2013     1   1      542         2      923        33      AA  N619AA
## 4 2013     1   1      544        -1     1004       -18      B6  N804JB
## 5 2013     1   1      554        -6      812       -25      DL  N668DN
## 6 2013     1   1      554        -4      740        12      UA  N39463
##   flight origin dest air_time distance hour minute
## 1   1545    EWR  IAH      227     1400    5     17
## 2   1714    LGA  IAH      227     1416    5     33
## 3   1141    JFK  MIA      160     1089    5     42
## 4    725    JFK  BQN      183     1576    5     44
## 5    461    LGA  ATL      116      762    5     54
## 6   1696    EWR  ORD      150      719    5     54

Description of Dataset

The dataset contains 327346 complete observations of 16 variables, a mix of numeric and categorical fields. For this project, we will focus on 3 of the numeric variables:

arrival delay (arr_delay): delay in minutes (negative values represent an early arrival)
mileage of flight (distance): how far the flight flew, in miles
departure time (dep_time): time of day the flight departed, recorded as a clock value (for example, 517 is 5:17)

For this project, the dependent/response variable will be the arrival delay and the independent variables will be distance and dep_time. Distance is included to test whether longer flights tend to have different delays. Departure time is included to see whether flying at a certain time of day can predict if a flight is on time.

The overall goal is to see if the arrival delay can be at all explained by the departure time and distance of a flight.

Since we only need these 3 variables, we will remove all others from the dataset:

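# Keep dep_time, arr_delay and distance (columns 4, 7 and 14 in the output above)
x<-xx[c("dep_time","arr_delay","distance")]

We will also select a subset of this data since it is an extremely large dataset. We will do this by taking a random sample of the observations without replacement (see Section 4, Issue 2 for an explanation of the sample size). The sample keeps the three variable names from the full dataset: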

size=196
vector=nrow(x)
set.seed(11)
indicies<- sample(vector, size, replace= FALSE)
data<-x[indicies,]

head(data)

##        dep_time arr_delay distance
## 92553       648         8     2565
## 170         900         8     1065
## 171740      632        -1      187
## 4649       1242       174     2454
## 21440      1458        50      209
## 321859      509       -39     1400

Below is a summary of the data:

summary(data)

##     dep_time      arr_delay        distance   
##  Min.   : 509   Min.   :-56.0   Min.   :  94  
##  1st Qu.: 842   1st Qu.:-19.0   1st Qu.: 543  
##  Median :1310   Median : -5.0   Median :1020  
##  Mean   :1324   Mean   :  9.8   Mean   :1154  
##  3rd Qu.:1767   3rd Qu.: 13.8   3rd Qu.:1598  
##  Max.   :2353   Max.   :360.0   Max.   :2586

Initial Plots

Below we create a boxplot of the delay time to examine the distribution and outliers:

boxplot(data$arr_delay, main="Boxplot of Arrival Delays")

[Figure: boxplot of arrival delays]

As you can see, there are some outliers. This can be attributed to the nature of the dataset. Of the many flights leaving NYC, most arrive close to on time, as shown by the median of -5 minutes in the summary above. However, a substantial number of flights are delayed, some by significant lengths, and these appear as the outliers.

Since there are so many outliers, we will not remove them: doing so would leave almost no delays to explain. It is important to keep the outliers in mind when we interpret the model and results.
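To put a number on how many points count as outliers, the 1.5 x IQR rule that boxplot() uses can be applied directly. The sketch below is illustrative only: the `delays` vector is a small made-up stand-in, not the sampled data, and would be replaced by data$arr_delay in the analysis.

```r
# Hypothetical stand-in for the sampled arr_delay column (not the real data)
delays <- c(-56, -19, -12, -5, -5, 0, 8, 13, 50, 174, 360)

# boxplot.stats() flags points more than 1.5 * IQR beyond the hinges,
# which is the same rule boxplot() uses to draw individual outlier points
out <- boxplot.stats(delays)$out
print(out)

# Share of observations flagged as outliers
out_share <- length(out) / length(delays)
print(out_share)
```

Reporting this share alongside the boxplot makes it easier to judge how much of the delay signal would be lost if the outliers were dropped.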

2. Building the Regression Models

Null Hypothesis

The null hypothesis is that the variation in the dependent variable, arrival delay, cannot be explained by anything other than random chance. In other words, for the model described below, the variation in distance traveled and departure time cannot explain the variation in arrival delay.

Examine the scattergram for relationships between arrival delay and the independent variables:

plot(data)

[Figure: scatterplot matrix of dep_time, arr_delay and distance]

One relationship to check in this plot is the one between departure time and distance traveled. Since these are the independent variables, we do not want them to be correlated, because correlation between IV’s can lead to suppression. Since the points are spread across the panel for dep_time and distance, suppression should not be an issue in this analysis.

Building the Model

The model will be built using a forward stepwise approach. We add the most correlated independent variable first, then continue to add the other variables as long as they contribute to explaining the variation in the dependent variable (judged by the R^2 value).

To determine the order, we build a correlation matrix:

cor(data)

##           dep_time arr_delay  distance
## dep_time  1.000000   0.35183  0.006639
## arr_delay 0.351831   1.00000 -0.078293
## distance  0.006639  -0.07829  1.000000

As predicted above, the IV’s have correlation values close to zero. Departure time seems to be correlated with arrival delay, while distance and arrival delay are only slightly correlated. Therefore, departure time is the first variable included in the model.
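The correlation matrix gives the size of each correlation but not its significance. As a rough sketch, cor.test() from base R adds a t-test for a single pair. The `toy` data frame below is simulated and purely hypothetical; only the last two lines use numbers from the matrix above (r = 0.351831, n = 196).

```r
# Simulated stand-in data (hypothetical, not the flights sample)
set.seed(2)
n <- 196
toy <- data.frame(dep_time = runif(n, 500, 2359))
toy$arr_delay <- -40 + 0.04 * toy$dep_time + rnorm(n, sd = 50)

# Pearson correlation with a t-test of H0: correlation = 0
ct <- cor.test(toy$dep_time, toy$arr_delay)
print(ct$estimate)
print(ct$p.value)

# The same t statistic computed from the reported correlation:
# t = r * sqrt(n - 2) / sqrt(1 - r^2)
r_doc <- 0.351831
t_doc <- r_doc * sqrt(196 - 2) / sqrt(1 - r_doc^2)
print(t_doc)
```

With one predictor, this t statistic is the same quantity as the slope's t value in a simple regression of arrival delay on departure time.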

Model 1: Arrival Delay by Departure time

model1<-lm(data$arr_delay~data$dep_time)
summary(model1)

## 
## Call:
## lm(formula = data$arr_delay ~ data$dep_time)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -88.4  -27.1   -8.6   13.7  316.5 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -39.20350   10.03168   -3.91  0.00013 ***
## data$dep_time   0.03703    0.00707    5.24  4.3e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 50.4 on 194 degrees of freedom
## Multiple R-squared:  0.124,  Adjusted R-squared:  0.119 
## F-statistic: 27.4 on 1 and 194 DF,  p-value: 4.26e-07

Interpretation

In this model, we reject the null hypothesis for departure time. According to the model, departure time does contribute to the variation in the arrival delay. At 0.1238, the R^2 value is on the lower side, indicating the model explains only about 12% of the variation in delay.
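As a quick consistency check, in a regression with a single predictor the R^2 value equals the squared correlation between the predictor and the response. The sketch below verifies this with the correlation reported earlier (0.351831) and then on a simulated, hypothetical `toy` data frame; it also shows the `data =` form of the lm() call, which is an alternative to writing data$ inside the formula.

```r
# Squared correlation from the correlation matrix
r_doc <- 0.351831
r2_doc <- r_doc^2
print(round(r2_doc, 4))

# Same identity on simulated stand-in data (hypothetical, not the flights sample)
set.seed(3)
n <- 196
toy <- data.frame(dep_time = runif(n, 500, 2359))
toy$arr_delay <- -40 + 0.04 * toy$dep_time + rnorm(n, sd = 50)

# Passing data = keeps the coefficient names short and lets predict() accept new data
fit <- lm(arr_delay ~ dep_time, data = toy)
r2_fit <- summary(fit)$r.squared
r2_cor <- cor(toy$dep_time, toy$arr_delay)^2
print(c(r2_fit, r2_cor))
```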

Scatterplot With Regression Line for Model 1

plot(data$dep_time,data$arr_delay, main="Arrival Delay vs Departure Time (with regression line)", xlab= "Departure Time", ylab="Arrival Delay", col="blue", pch=18)
abline(model1$coef, lwd=2, col="dark blue")

[Figure: arrival delay vs departure time, with regression line]

Again, because most flights arrive close to on time while a few are very late, it is hard to fit the data with a regression line. For many points, the regression line is not a great fit, which explains why the R^2 value is low. On the other hand, the line does suggest an increase in arrival delay as departure time increases. In other words, as the day progresses, flights may tend to have greater arrival delays. This is a plausible outcome, as many flights use the same planes throughout the day, so the delay of an earlier flight may affect many of the flights that follow.

Model 2: Arrival Delay by Departure time and Distance

Now, the second independent variable will be added to the model to see if, together, departure time and distance traveled can better predict the arrival delay:

model2<-lm(data$arr_delay~data$dep_time+data$distance)
summary(model2)

## 
## Call:
## lm(formula = data$arr_delay ~ data$dep_time + data$distance)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -82.6  -27.2   -8.1   11.5  322.2 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -32.60517   11.42731   -2.85   0.0048 ** 
## data$dep_time   0.03709    0.00707    5.25    4e-07 ***
## data$distance  -0.00578    0.00481   -1.20   0.2312    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 50.4 on 193 degrees of freedom
## Multiple R-squared:  0.13,   Adjusted R-squared:  0.121 
## F-statistic: 14.5 on 2 and 193 DF,  p-value: 1.41e-06

In this model, we again reject the null hypothesis for departure time as a predictor variable but fail to reject it for distance traveled. Based on these results, distance is not a significant predictor of arrival delay and adds little to the explanation of its variation. This follows the results from the correlation matrix, as the correlation between distance traveled and arrival delay was close to zero. The R^2 value did change to 0.1303, but this is an increase of less than one percentage point over the previous model. R^2 can never decrease when a predictor is added, so a small increase is not evidence that distance helps; the adjusted R^2, which accounts for the lost degree of freedom, barely moved (0.119 to 0.121).

This is still a small amount of variation for a model to explain, suggesting there may be several other factors that affect a flight's arrival delay (e.g. weather, airline, airplane, etc.).
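The adjusted R^2 values in the two summaries can be reproduced from the reported R^2 values with the usual formula 1 - (1 - R^2)(n - 1)/(n - p - 1). The sketch below does that arithmetic and then, on a simulated and purely hypothetical `toy` data frame, shows one common way to compare the nested models: a partial F-test with anova(). The helper name `adj_r2` is introduced here for illustration only.

```r
# Adjusted R^2 from R^2, sample size n, and number of predictors p
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)
adj1 <- adj_r2(0.1238, 196, 1)  # Model 1: dep_time only
adj2 <- adj_r2(0.1303, 196, 2)  # Model 2: dep_time + distance
print(c(adj1, adj2))

# Simulated stand-in data (hypothetical, not the flights sample)
set.seed(4)
n <- 196
toy <- data.frame(dep_time = runif(n, 500, 2359),
                  distance = runif(n, 100, 2600))
toy$arr_delay <- -40 + 0.04 * toy$dep_time + rnorm(n, sd = 50)

fit1 <- lm(arr_delay ~ dep_time, data = toy)
fit2 <- lm(arr_delay ~ dep_time + distance, data = toy)

# Partial F-test: does adding distance reduce the residual sum of squares
# by more than chance would?
cmp <- anova(fit1, fit2)
print(cmp)

# With one added predictor, the F statistic equals that predictor's squared t value
f_added <- cmp$F[2]
t_added <- summary(fit2)$coefficients["distance", "t value"]
```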

Scatterplot With Regression Plane for Model 2

A 3-Dimensional Scatterplot will be required since there are two independent variables. To plot this, we must first install an additional package:

#install.packages("scatterplot3d")
library(scatterplot3d)

## Warning: package 'scatterplot3d' was built under R version 3.1.3

Create 3D plot with regression plane:

best<-scatterplot3d(data$dep_time, data$distance, data$arr_delay, pch=18, main="Arrival Delay vs Departure Time and Distance (with Regression)", xlab="Departure Time", ylab="Distance (miles)", zlab="Arrival Delay (minutes)")
best$plane3d(model2, col="blueviolet")

[Figure: 3D scatterplot of arrival delay vs departure time and distance, with regression plane]

Although it is hard to see on an angle, it appears that the regression plane only hits a small portion of the points, suggesting a poor fit. This is logical as distance was not found as a significant predictor and because the model only explained a small percentage of the variation in arrival delay.

3. Plots: Testing the Assumptions

A series of plots were created to check the various assumptions of regression models. Model 1 will be used, since adding distance in Model 2 did not significantly improve the fit.

The 8 assumptions that will be checked are as follows:

The distribution of residuals is normal (at each value of the outcome).
The variance of the residuals for every set of values for the predictor is equal (violation is called heteroscedasticity)
The error term is additive (no interactions)
At every value of the outcome the expected (mean) value of the residuals is zero (No non-linear relationships)
The expected correlation between residuals, for any two cases, is 0 (The independence assumption (lack of autocorrelation))
All predictors are uncorrelated with the error term
No predictors are a perfect linear function of other predictors (no perfect multicollinearity)
The mean of the error term is zero

Assumption 1: Normality

First, we will check that the residuals meet the normality assumption:

qqnorm(residuals(model1))
qqline(residuals(model1))

[Figure: normal Q-Q plot of Model 1 residuals]

A majority of the residuals fall along the normal reference line, but the residuals may not be normally distributed, as they tail off at the upper end. The few that tail off could be due to the outliers that were left in the dataset for analysis.

Examine histogram for skewness and kurtosis:
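A formal test can back up the visual reading of the Q-Q plot. The sketch below uses shapiro.test() from base R on the residuals of a model fitted to simulated data; the `toy` data frame, with deliberately right-skewed errors, is hypothetical and only mimics the shape seen here. In the analysis itself the call would be shapiro.test(residuals(model1)).

```r
# Simulated stand-in data with right-skewed errors (hypothetical)
set.seed(5)
n <- 196
toy <- data.frame(dep_time = runif(n, 500, 2359))
# Centered exponential noise: many small negative errors, a few large positive ones
toy$arr_delay <- -40 + 0.04 * toy$dep_time + (rexp(n, rate = 1 / 40) - 40)

fit <- lm(arr_delay ~ dep_time, data = toy)

# Shapiro-Wilk test, H0: the residuals come from a normal distribution
sw <- shapiro.test(residuals(fit))
print(sw)
```

A small p-value rejects normality of the residuals, which would agree with an upper tail that pulls away from the reference line.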

hist(data$arr_delay, breaks=15, density=20, xlim=c(-75,250))

[Figure: histogram of arrival delays]

The data is right skewed, again suggesting the data is not normally distributed. Most observations are close to on time (near 0 arrival delay), while the majority of the remaining flights range from 50 minutes early to 50 minutes late. The skew occurs because of the flights that have longer delays (50+ minutes). Revisiting the boxplot, these are likely the outliers that are shown and the reason the data is not normal. For example, imagine ending the histogram after a 50-minute arrival delay. The data would be much closer to being normally distributed, although it may still have some kurtosis, as the histogram has a bit of a peak.

Boxplot to illustrate outliers that were left in the dataset:

boxplot(data$arr_delay, main="Boxplot of Arrival Delays")

[Figure: boxplot of arrival delays]

The median line is not quite in the center of the box, again identifying skewness.

Assumption 2: Equal Variance of Residuals

White’s Test for Heteroskedasticity is used to determine if the residuals have equal variance.
Null Hypothesis: Homoskedasticity (all residuals have equal variance)
Alternative Hypothesis: Heteroskedasticity

Before White’s test is run, a package must first be installed:

#install.packages("het.test")
library(het.test)

## Warning: package 'het.test' was built under R version 3.1.3

## Loading required package: vars

## Warning: package 'vars' was built under R version 3.1.3

## Loading required package: MASS
## Loading required package: strucchange

## Warning: package 'strucchange' was built under R version 3.1.3

## Loading required package: zoo

## Warning: package 'zoo' was built under R version 3.1.3

## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## Loading required package: sandwich

## Warning: package 'sandwich' was built under R version 3.1.3

## Loading required package: urca

## Warning: package 'urca' was built under R version 3.1.3

## Loading required package: lmtest

## Warning: package 'lmtest' was built under R version 3.1.3

Run White’s test:

departure<-data.frame(x=data$dep_time,y=data$arr_delay)
whitemodel<-VAR(departure,p=1)
whites.htest(whitemodel)

## 
## White's Test for Heteroskedasticity:
## ==================================== 
## 
##  No Cross Terms
## 
##  H0: Homoskedasticity
##  H1: Heteroskedasticity
## 
##  Test Statistic:
##  16.1427 
## 
##  Degrees of Freedom:
##  12 
## 
##  P-value:
##  0.1848

With a p-value of .1848, we fail to reject the null hypothesis. The test gives no evidence of heteroskedasticity, which is consistent with the assumption of equal variance of residuals.

Assumption 3: The Error Term is Additive

The effects of departure time and distance traveled are assumed to be additive rather than multiplicative (no interaction). This is plausible because both long and short flights can leave at any point in a day.
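One caveat: the code above passes a VAR(1) object to the test, so it appears to assess a vector autoregression fitted to the two columns rather than the residuals of Model 1 directly. As a hedged alternative, a White-style test can be run on a regression's own residuals with base R alone: regress the squared residuals on the predictor and its square, then compare n times the auxiliary R^2 with a chi-square distribution. The `toy` data frame below is simulated and hypothetical.

```r
# Simulated stand-in data (hypothetical, not the flights sample)
set.seed(6)
n <- 196
toy <- data.frame(dep_time = runif(n, 500, 2359))
toy$arr_delay <- -40 + 0.04 * toy$dep_time + rnorm(n, sd = 50)

fit <- lm(arr_delay ~ dep_time, data = toy)

# Auxiliary regression: squared residuals on the predictor and its square
toy$e2 <- residuals(fit)^2
aux <- lm(e2 ~ dep_time + I(dep_time^2), data = toy)

# LM statistic n * R^2 is asymptotically chi-square under homoskedasticity,
# with df equal to the number of auxiliary regressors (2 here)
lm_stat <- n * summary(aux)$r.squared
p_value <- pchisq(lm_stat, df = 2, lower.tail = FALSE)
print(c(statistic = lm_stat, p.value = p_value))
```

A large p-value here would support equal variance for the regression residuals themselves.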
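The additivity argument can also be checked empirically by fitting a model with an interaction term and testing whether it improves on the additive model. This is a sketch on simulated, hypothetical data (`toy`), generated without any interaction; it is not a result from the flights sample.

```r
# Simulated stand-in data with purely additive effects (hypothetical)
set.seed(7)
n <- 196
toy <- data.frame(dep_time = runif(n, 500, 2359),
                  distance = runif(n, 100, 2600))
toy$arr_delay <- -40 + 0.04 * toy$dep_time - 0.005 * toy$distance + rnorm(n, sd = 50)

# dep_time * distance expands to both main effects plus dep_time:distance
fit_add <- lm(arr_delay ~ dep_time + distance, data = toy)
fit_int <- lm(arr_delay ~ dep_time * distance, data = toy)

# F-test for the interaction term; a large p-value is consistent with additivity
int_test <- anova(fit_add, fit_int)
print(int_test)
```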

Assumption 4: Mean of Residuals is Zero

To check the fit of the model, we will examine the model residuals:

model.resid<-model1$residuals
plot(model.resid, main="Residuals",pch=20)
abline(0,0, lwd=2, col="blueviolet")

[Figure: Model 1 residuals by observation index]

The residuals show the difference between the actual and fitted values of the model. They are spread out across the plot, with slightly more points falling below zero, while the points above zero lie further from it. The mean may still be zero, but the fit of the model may not be the best.

Assumption 5: Expected Correlation Between Residuals is Zero

We can again examine the residuals plot above to test this assumption. This assumption ensures the cases are independent of each other. Since there is no trend in the plot above, we assume this assumption is fulfilled.
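A numeric companion to the visual check is the Durbin-Watson statistic, which is close to 2 when successive residuals are uncorrelated and always lies between 0 and 4. The sketch below computes it by hand in base R on simulated, hypothetical data (`toy`). Because the rows here are a random sample in arbitrary order, the statistic only checks for patterns in row order, not for real-world dependence among flights.

```r
# Simulated stand-in data (hypothetical, not the flights sample)
set.seed(8)
n <- 196
toy <- data.frame(dep_time = runif(n, 500, 2359))
toy$arr_delay <- -40 + 0.04 * toy$dep_time + rnorm(n, sd = 50)

fit <- lm(arr_delay ~ dep_time, data = toy)
e <- residuals(fit)

# Durbin-Watson: sum of squared successive differences over sum of squared residuals
dw <- sum(diff(e)^2) / sum(e^2)
print(dw)
```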

Assumption 6: Predictors Uncorrelated with Error Term

This assumption checks that no relevant predictors have been left out of the model. Since only two independent variables were considered and the dataset did have other variables, this assumption may not be fulfilled. There may even be other variables not included in the dataset that could cause the variation, as mentioned in the model building section above (weather, plane, etc.).

Assumption 7: No Perfect Multicollinearity

The predictors are most likely not predictors of one another. Departure time most likely is not a linear function of distance traveled as flights can leave at the same time but go to different places. In addition, the two IV’s had a correlation value close to zero. Besides, the model of best fit only included one predictor variable, so this assumption does not pose a threat to the model.

Assumption 8: Mean of the Error Term is Zero

As discussed in Assumption 4, we can look at the fitted vs actual values to infer the mean of the error term. The error is the difference between the actual and fitted, so the closer the points on the residuals plot are to zero, the closer the error is to zero. The error term may not be exactly zero, but based on the plot it is likely pretty close.
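One point worth keeping in mind: for a least-squares fit that includes an intercept, the sample mean of the residuals is zero by construction, so the residual plot cannot really confirm or refute this assumption about the unobserved error term. The sketch below illustrates this on simulated, hypothetical data (`toy`) with deliberately skewed errors.

```r
# Simulated stand-in data with skewed errors (hypothetical)
set.seed(9)
n <- 196
toy <- data.frame(dep_time = runif(n, 500, 2359))
toy$arr_delay <- -40 + 0.04 * toy$dep_time + (rexp(n, rate = 1 / 40) - 40)

fit <- lm(arr_delay ~ dep_time, data = toy)

# With an intercept in the model, OLS forces the residuals to sum to zero,
# so their mean is zero up to floating-point error
resid_mean <- mean(residuals(fit))
print(resid_mean)
```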

4. Issues

Four issues that may occur when running regression analysis are discussed below.

Issue 1: Causality

The independent variables in this analysis, distance and departure time, are probabilistic and proximal causes of arrival delay. They are probabilistic and not deterministic because neither of the two can determine the arrival delay of a flight every time. A shorter flight does not necessarily equate to a shorter delay, nor does a longer flight. At the same time, a flight that leaves in the morning won’t always have a shorter or longer delay than a flight that leaves in the evening. They are probable causes because sometimes they may lead to a certain arrival delay (at least in this case, departure time may, while distance traveled tested as insignificant).

They are proximal and not ultimate, as I do not believe distance traveled and departure time are the main cause of a flight delay. It is much more likely that another factor, such as weather, plane maintenance, or a security or other issue at the airport/runway, is the ultimate cause of the delay. It was simply the goal of this experiment to see if distance and departure time explained some of the variance in arrival delay.

Issue 2: Sample Size

Since flights is a very large dataset, a program called G*Power was utilized to find an appropriate sample size to test. We do not want to use all 300,000+ observations because, with that many observations, even a negligible effect would test as statistically significant; we want significant results to reflect an effect of meaningful size. For the calculation, the following parameters were chosen:

effect size = .05
alpha = .05
Power (1-beta) = .8
Tested Predictors = 2
Total # of Predictors = 16

A sample size of 196 was calculated, and this size was used to select a random sample from the dataset (entered in the sampling code in Section 1 above).
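The reported sample size can be sanity-checked in base R with the noncentral F distribution. This is a sketch under stated assumptions, not a reproduction of the G*Power session: it assumes the effect size is Cohen's f^2, that the noncentrality parameter is f^2 times N, and that the denominator degrees of freedom are N minus the total number of predictors minus 1.

```r
# Parameters listed above
f2     <- 0.05  # effect size, assumed to be Cohen's f^2
alpha  <- 0.05
n      <- 196   # reported sample size
tested <- 2     # numerator df: number of tested predictors
total  <- 16    # total number of predictors

# Assumed conventions: denominator df and noncentrality parameter
df2    <- n - total - 1
lambda <- f2 * n

# Power = P(F > critical value) under the noncentral F distribution
f_crit <- qf(1 - alpha, tested, df2)
power  <- pf(f_crit, tested, df2, ncp = lambda, lower.tail = FALSE)
print(power)
```

If these assumptions match the settings used, the printed power should sit near the 0.8 target.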

Issue 3: Collinearity

Collinearity becomes a problem when the independent variables in a model are correlated. Although the best model only used one variable, we will check this assumption in case it is affecting the second model that was created. We checked for collinearity in Section 2 above when we examined the correlation matrix. Distance and departure time had a correlation value of about 0.0066, so we can consider this assumption fulfilled.
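With exactly two predictors, the variance inflation factor (VIF) follows directly from their correlation as 1 / (1 - r^2). The sketch below applies this to the reported correlation (0.006639) and shows the equivalent regression-based calculation on simulated, hypothetical data (`toy`).

```r
# VIF from the reported correlation between dep_time and distance
r_iv <- 0.006639
vif_doc <- 1 / (1 - r_iv^2)
print(vif_doc)

# Equivalent calculation on simulated stand-in data (hypothetical):
# regress one predictor on the other and use that R^2
set.seed(10)
n <- 196
toy <- data.frame(dep_time = runif(n, 500, 2359),
                  distance = runif(n, 100, 2600))
r2_aux <- summary(lm(distance ~ dep_time, data = toy))$r.squared
vif_toy <- 1 / (1 - r2_aux)
print(vif_toy)
```

A VIF this close to 1 means the standard error of either coefficient is essentially unaffected by the presence of the other predictor.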

We can also examine the scatter plot between the IV’s:

plot(data$distance, data$dep_time, xlab="Distance", ylab="Departure Time (in Military Time)")

[Figure: departure time vs distance]

There is a good spread with no identifiable trends. There are vertical columns (different distances) that have more gaps than others, but this can be attributed to the random sample including fewer observations for these distances.

Issue 4: Measurement Error

There is not likely to be much measurement error in this dataset. Air traffic is highly regulated and therefore likely to be strictly observed and recorded. Distances are set depending on the destination, departure time is recorded by the plane and air traffic control, and arrival delay is a simple calculation based on when the plane was planned to arrive.

5. Conclusions

The many outliers that are inherent to this dataset are likely the reason that not all of the assumptions were fulfilled. With this in mind, the model using departure time as a predictor variable for arrival delay still returned an R^2 value of about 12%, indicating that the time of day a flight departs in part explains the arrival delay. It would be beneficial to repeat this experiment with a different random sample or dataset. It could also be interesting to repeat it with the outliers removed to see if more assumptions are fulfilled and the results stand, although that could be problematic in this case, since it would mean removing a significant portion of the delayed flights.
