========================================================
The data used in this analysis was created by Hadley Wickham. We will examine data from flights that departed NYC in 2013. The set includes the flight information (flight number, carrier, departure and destination cities, date and time of departure), as well as information regarding the flight length and whether it was delayed.
The code below reads in the dataset and displays the first several observations. Some observations are incomplete, with fields missing, so they are removed from the set to avoid distorting the calculations:
library(nycflights13)
xx<-flights
xx<-na.omit(xx)
head(xx)
## year month day dep_time dep_delay arr_time arr_delay carrier tailnum
## 1 2013 1 1 517 2 830 11 UA N14228
## 2 2013 1 1 533 4 850 20 UA N24211
## 3 2013 1 1 542 2 923 33 AA N619AA
## 4 2013 1 1 544 -1 1004 -18 B6 N804JB
## 5 2013 1 1 554 -6 812 -25 DL N668DN
## 6 2013 1 1 554 -4 740 12 UA N39463
## flight origin dest air_time distance hour minute
## 1 1545 EWR IAH 227 1400 5 17
## 2 1714 LGA IAH 227 1416 5 33
## 3 1141 JFK MIA 160 1089 5 42
## 4 725 JFK BQN 183 1576 5 44
## 5 461 LGA ATL 116 762 5 54
## 6 1696 EWR ORD 150 719 5 54
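Before dropping incomplete rows it can help to see how many there are and which fields are missing. The following is a minimal sketch on a small made-up data frame (the values in `toy` are hypothetical and are not taken from flights); it only illustrates that na.omit() keeps exactly the rows that complete.cases() marks as complete.

```r
# Toy data frame with a few missing fields (hypothetical values)
toy <- data.frame(
  dep_time  = c(517, 533, NA, 544),
  arr_delay = c(11, 20, 33, NA),
  distance  = c(1400, 1416, 1089, 1576)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))

# Rows with no missing fields; na.omit() keeps exactly these rows
n_complete <- sum(complete.cases(toy))
toy_clean  <- na.omit(toy)

print(na_per_column)
print(n_complete)
print(nrow(toy_clean))
```

The same three calls could be pointed at the full dataset to report how many observations are lost to missing fields.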
The dataset contains 327346 observations of 16 variables, some numeric and some categorical (such as carrier and origin). However, for this project, we will focus on 3 of these variables:
arrival delay (arr_delay): delay in minutes (negative values represent an early flight)
mileage of flight (distance): how far the flight flew, in miles
departure time (dep_time): time of the day the flight departed
For this project, the dependent/response variable will be the arrival delay and the independent variables will be distance and dep_time. Distance is selected to test whether longer flights tend to have different delays. Departure time is included to see whether flying during a certain time of day can predict if a flight is on time.
The overall goal is to see if the arrival delay can be at all explained by the departure time and distance of a flight.
Since we only need these 3 variables, we will remove all others from the dataset:
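# select by name rather than by position, so the code does not depend on column order
x<-xx[c("dep_time","arr_delay","distance")]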
We will also select a subset of this data since it is an extremely large dataset. We will do this by taking a random sample of the observations (See Section 4 Issue 2 for explanation on sample size). Finally we will display the first rows of the sample with their variable names:
size=196
vector=nrow(x)
set.seed(11)
indicies<- sample(vector, size, replace= FALSE)
data<-x[indicies,]
head(data)
## dep_time arr_delay distance
## 92553 648 8 2565
## 170 900 8 1065
## 171740 632 -1 187
## 4649 1242 174 2454
## 21440 1458 50 209
## 321859 509 -39 1400
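The call to set.seed() is what makes the random sample repeatable. As a small sketch, assuming a made-up row count (`n_rows` is hypothetical and is not the size of the flights data), drawing twice with the same seed returns the same row indices, and replace = FALSE guarantees that no row is drawn twice.

```r
n_rows <- 1000   # hypothetical number of rows, for illustration only
size   <- 196    # sample size used in this analysis

set.seed(11)
first_draw <- sample(n_rows, size, replace = FALSE)

set.seed(11)
second_draw <- sample(n_rows, size, replace = FALSE)

# Same seed, same indices
print(identical(first_draw, second_draw))

# Sampling without replacement yields no repeated row index
print(anyDuplicated(first_draw))
```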
Below is a summary of the data:
summary(data)
## dep_time arr_delay distance
## Min. : 509 Min. :-56.0 Min. : 94
## 1st Qu.: 842 1st Qu.:-19.0 1st Qu.: 543
## Median :1310 Median : -5.0 Median :1020
## Mean :1324 Mean : 9.8 Mean :1154
## 3rd Qu.:1767 3rd Qu.: 13.8 3rd Qu.:1598
## Max. :2353 Max. :360.0 Max. :2586
Below we create a boxplot of the arrival delay to examine its distribution and outliers:
boxplot(data$arr_delay, main="Boxplot of Arrival Delays")
As you can see, there are some outliers. This can be attributed to the nature of the dataset. When considering so many flights leaving NYC, most will be close to on time, as you can see by the median being about 0. However, a number of flights are still delayed, some of them by a significant length of time, and these appear as the outliers.
Since there are so many outliers, we will not remove them, because if we did we would have few delays left to try to explain. It is important to keep in mind that there are outliers when we go to interpret the model and results.
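To put a number on "so many outliers", the points a boxplot draws individually can be extracted with boxplot.stats(). The sketch below uses simulated right-skewed delays (the vector `delay` is hypothetical and is not the flights sample); applied to data$arr_delay, the same two lines would give the count for this analysis. boxplot.stats() flags values more than 1.5 times the box length beyond the hinges, which is the same rule boxplot() uses by default.

```r
set.seed(1)
# Simulated delays: mostly near zero, plus a handful of long delays (hypothetical)
delay <- c(rnorm(180, mean = -5, sd = 15), rexp(16, rate = 1 / 120) + 60)

# Values drawn as individual points by boxplot()
out <- boxplot.stats(delay)$out

n_outliers     <- length(out)
share_outliers <- n_outliers / length(delay)

print(n_outliers)
print(round(share_outliers, 3))
```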
The null hypothesis is that the variation in the dependent variable, arrival delay, cannot be explained by anything other than randomization. In other words, for the model described below, the variation in distance traveled and departure time cannot explain the variation in arrival delay.
Examine the scattergram for relationships between arrival delay and the independent variables:
plot(data)
One thing we can observe from this plot is the relationship between departure time and distance traveled. Since these are the independent variables, we do not want them to be correlated, because correlation between IVs can lead to suppression. Since the points are spread across the chart for dep_time and distance, suppression should not be an issue in this analysis.
The model will be built using the stepwise method. We add the most correlated independent variable first, then continue to add the other variables as long as they contribute to explaining the variation in the dependent variable (viewed through the R^2 value).
To determine the order, we build a correlation matrix:
cor(data)
## dep_time arr_delay distance
## dep_time 1.000000 0.35183 0.006639
## arr_delay 0.351831 1.00000 -0.078293
## distance 0.006639 -0.07829 1.000000
As predicted above, the IVs have a correlation close to zero. Departure time is moderately correlated with arrival delay (0.35), while distance and arrival delay are only slightly correlated (-0.08). Therefore, departure time is the first variable included in the model.
Model 1: Arrival Delay by Departure time
model1<-lm(data$arr_delay~data$dep_time)
summary(model1)
##
## Call:
## lm(formula = data$arr_delay ~ data$dep_time)
##
## Residuals:
## Min 1Q Median 3Q Max
## -88.4 -27.1 -8.6 13.7 316.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -39.20350 10.03168 -3.91 0.00013 ***
## data$dep_time 0.03703 0.00707 5.24 4.3e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.4 on 194 degrees of freedom
## Multiple R-squared: 0.124, Adjusted R-squared: 0.119
## F-statistic: 27.4 on 1 and 194 DF, p-value: 4.26e-07
In this model, we reject the null hypothesis for departure time. According to the model, departure time does contribute to the variation in the arrival delay. At 0.1238, the R^2 value is on the lower side, indicating the model only explains about 12% of the variation in delay.
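With a single predictor, R^2 is simply the square of the correlation between the predictor and the response, so the value can be checked against the correlation matrix above. The first part of this sketch uses the correlation reported there; the second part repeats the check on simulated data (`x` and `y` are hypothetical and are not the flights sample).

```r
# Correlation between dep_time and arr_delay, copied from the matrix above
r_dep_arr <- 0.351831
r_squared <- r_dep_arr^2
print(round(r_squared, 4))   # 0.1238

# The same identity on simulated data (hypothetical values)
set.seed(3)
x <- runif(50)
y <- 2 * x + rnorm(50)
fit <- lm(y ~ x)
same <- isTRUE(all.equal(summary(fit)$r.squared, cor(x, y)^2))
print(same)                  # TRUE
```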
Scatterplot With Regression Line for Model 1
plot(data$dep_time,data$arr_delay, main="Arrival Delay vs Departure Time (with regression line)", xlab= "Departure Time", ylab="Arrival Delay", col="blue", pch=18)
abline(model1$coef, lwd=2, col="dark blue")
Again, with most flights being close to on time, it is hard to fit the data with a regression line. For many points, the regression line is not a great fit, which explains why the R^2 value is low. On the other hand, the line does suggest an increase in arrival delay as departure time increases. In other words, as the day progresses, flights may tend to have greater arrival delays. This is a plausible outcome as many flights use the same planes throughout the day, so the delay of an earlier flight may impact many flights following.
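One caveat when reading the slope: dep_time is a clock reading in HHMM form (517 means 5:17, as the hour and minute columns in the first listing show), so it is not an evenly spaced scale. Going from 12:59 to 13:00 is one minute but 41 units of dep_time. A minimal sketch of a conversion to minutes after midnight is given below; the helper name `hhmm_to_minutes` is our own and is not part of nycflights13, and the analysis above was run on the unconverted values.

```r
# Convert a clock reading in HHMM form to minutes after midnight
# (hypothetical helper, not part of any package)
hhmm_to_minutes <- function(hhmm) {
  (hhmm %/% 100) * 60 + (hhmm %% 100)
}

dep_time <- c(517, 1259, 1300, 2353)
minutes  <- hhmm_to_minutes(dep_time)
print(minutes)   # 317 779 780 1433
```

Refitting the model on the converted variable would give a slope in minutes of delay per minute of departure time, which is easier to interpret.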
Model 2: Arrival Delay by Departure time and Distance
Now, the second independent variable will be added to the model to see if together, departure time and distance traveled can better predict the arrival delay:
model2<-lm(data$arr_delay~data$dep_time+data$distance)
summary(model2)
##
## Call:
## lm(formula = data$arr_delay ~ data$dep_time + data$distance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -82.6 -27.2 -8.1 11.5 322.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -32.60517 11.42731 -2.85 0.0048 **
## data$dep_time 0.03709 0.00707 5.25 4e-07 ***
## data$distance -0.00578 0.00481 -1.20 0.2312
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.4 on 193 degrees of freedom
## Multiple R-squared: 0.13, Adjusted R-squared: 0.121
## F-statistic: 14.5 on 2 and 193 DF, p-value: 1.41e-06
In this model, we again reject the null hypothesis for departure time as a predictor variable but fail to reject it for distance traveled. Based on these results, distance is not a statistically significant predictor of arrival delay once departure time is in the model. This is consistent with the correlation matrix, where the correlation between distance traveled and arrival delay was close to zero. The R^2 value did rise to 0.1303, but this is an increase of less than one percentage point over the previous model, and R^2 never decreases when a predictor is added. The adjusted R^2, which accounts for the lost degree of freedom, barely moved (0.119 to 0.121).
This is still a small amount of variation for a model to explain, suggesting there may be several other factors that impact a flight's arrival delay (e.g. weather, airline, airplane).
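The adjusted R^2 values in the two summaries can be reproduced from the reported R^2, the sample size n = 196, and the number of predictors p, using 1 - (1 - R^2)(n - 1)/(n - p - 1). The sketch below does this arithmetic; the function name `adj_r2` is our own.

```r
# Adjusted R^2 from R^2, sample size n and number of predictors p
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

adj_model1 <- adj_r2(0.1238, n = 196, p = 1)   # Model 1: dep_time only
adj_model2 <- adj_r2(0.1303, n = 196, p = 2)   # Model 2: dep_time + distance

print(round(adj_model1, 3))   # 0.119
print(round(adj_model2, 3))   # 0.121
```

Both results match the summaries, and the gain from adding distance is about 0.002 once the extra predictor is penalized.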
Scatterplot With Regression Plane for Model 2
A 3-Dimensional Scatterplot will be required since there are two independent variables. To plot this, we must first install an additional package:
#install.packages("scatterplot3d")
library(scatterplot3d)
## Warning: package 'scatterplot3d' was built under R version 3.1.3
Create 3D plot with regression plane:
best<-scatterplot3d(data$dep_time, data$distance, data$arr_delay, pch=18, main="Arrival Delay vs Departure Time and Distance (with Regression)", xlab="Departure Time", ylab="Distance (miles)", zlab="Arrival Delay (minutes)")
best$plane3d(model2, col="blueviolet")
Although it is hard to see at this angle, it appears that the regression plane only passes near a small portion of the points, suggesting a poor fit. This is logical as distance was not found to be a significant predictor and because the model only explained a small percentage of the variation in arrival delay.
A series of plots was created to check the various assumptions of regression models. Model 1 will be used: Model 2 had a marginally higher R^2 (0.130 vs 0.124), but its additional predictor, distance, was not significant.
The 8 assumptions are checked in turn below.
First, we will check that the data meets the normality assumption:
qqnorm(residuals(model1))
qqline(residuals(model1))
A majority of the residuals fall along the normal line, but they tail off at the upper end, so the residuals may not be normally distributed. The few that tail off could be due to the outliers that were left in the dataset for analysis.
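A formal test can complement the visual reading of the Q-Q plot. The sketch below runs the Shapiro-Wilk test from base R on the residuals of a regression fitted to simulated data with deliberately right-skewed errors (`x`, `y` and `fit` are hypothetical and are not the flights sample). Calling shapiro.test(residuals(model1)) would apply the same check to Model 1; we have not run that here.

```r
set.seed(42)
n <- 196
x <- runif(n, 500, 2400)

# Right-skewed errors with mean zero (hypothetical, for illustration)
errors <- rexp(n, rate = 1 / 40) - 40
y <- -39 + 0.037 * x + errors

fit <- lm(y ~ x)

# Null hypothesis: the residuals come from a normal distribution
sw <- shapiro.test(residuals(fit))
print(sw$p.value)
```

A small p-value leads to rejecting normality, which is what a Q-Q plot with a heavy upper tail suggests visually.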
Examine histogram for skewness and kurtosis:
hist(data$arr_delay, breaks=15, density=20, xlim=c(-75,250))
The data is right skewed, again suggesting the data is not normally distributed. Most observations are close to on time (0 arrival delay) while the majority of the remaining flights range from 50 minutes early to 50 minutes late. The skew occurs because of the flights that have longer delays (50+ minutes). Revisiting the boxplot, these are likely the outliers that are shown and the reason the data is not normal. For example, imagine ending the histogram after 50 minutes of arrival delay. The data would be much closer to being normally distributed, although it may still have some kurtosis as the histogram has a bit of a peak.
Boxplot to illustrate outliers that were left in the dataset:
boxplot(data$arr_delay, main="Boxplot of Arrival Delays")
The median line is not quite in the center of the box, again indicating skewness.
White's Test for Heteroskedasticity is used to determine whether the residuals have equal variance.
Null Hypothesis: Homoskedasticity (all residuals have equal variance)
Alternative Hypothesis: Heteroskedasticity
Before White's test is run, a package must first be installed:
#install.packages("het.test")
library(het.test)
## Warning: package 'het.test' was built under R version 3.1.3
## Loading required package: vars
## Warning: package 'vars' was built under R version 3.1.3
## Loading required package: MASS
## Loading required package: strucchange
## Warning: package 'strucchange' was built under R version 3.1.3
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 3.1.3
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## Loading required package: sandwich
## Warning: package 'sandwich' was built under R version 3.1.3
## Loading required package: urca
## Warning: package 'urca' was built under R version 3.1.3
## Loading required package: lmtest
## Warning: package 'lmtest' was built under R version 3.1.3
Run White's test:
departure<-data.frame(x=data$dep_time,y=data$arr_delay)
whitemodel<-VAR(departure,p=1)
whites.htest(whitemodel)
##
## White's Test for Heteroskedasticity:
## ====================================
##
## No Cross Terms
##
## H0: Homoskedasticity
## H1: Heteroskedasticity
##
## Test Statistic:
## 16.1427
##
## Degrees of Freedom:
## 12
##
## P-value:
## 0.1848
With a p-value of .1848, we fail to reject the null hypothesis, indicating the residuals do appear to be homoskedastic and supporting the assumption of equal variance of residuals.
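Note that the test above was run on a VAR(1) fit of the two series rather than on model1 itself. As a cross-check, White's test can also be computed directly for an ordinary regression in base R: regress the squared residuals on the predictor and its square, and compare n times the R^2 of that auxiliary regression with a chi-squared distribution. The sketch below shows the calculation on simulated data with constant error variance (`x`, `y`, `fit` and `aux` are hypothetical and are not the flights sample); it is one standard form of the test and not the code used to produce the output above.

```r
set.seed(7)
n <- 196
x <- runif(n, 500, 2400)
y <- -39 + 0.037 * x + rnorm(n, sd = 50)   # constant error variance by construction

fit <- lm(y ~ x)

# Auxiliary regression: squared residuals on the predictor and its square
u2  <- residuals(fit)^2
aux <- lm(u2 ~ x + I(x^2))

# Test statistic n * R^2, chi-squared with 2 degrees of freedom under H0
lm_stat <- n * summary(aux)$r.squared
p_value <- pchisq(lm_stat, df = 2, lower.tail = FALSE)

print(lm_stat)
print(p_value)
```

Replacing `fit`, `x` and `n` with model1, data$dep_time and nrow(data) would apply the same calculation to Model 1.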
Departure time and distance traveled are additive and not multiplicative of each other. Both long and short flights can leave at any point in a day.
To check the fit of the model, we will examine the model residuals:
model.resid<-model1$residuals
plot(model.resid, main="Residuals",pch=20)
abline(0,0, lwd=2, col="blueviolet")
The residuals show the difference between the actual and fitted values of the model. They are spread out across the range, with slightly more of them falling below zero, while the values above zero lie further from it. The mean may still be zero but the fit of the model may not be the best.
We can again examine the residuals plot above to test this assumption. This assumption ensures the cases are independent of each other. Since there is no trend in the plot above, we assume this assumption is fulfilled.
The predictors are most likely not predictors of one another. Departure time most likely is not a linear function of distance traveled as flights can leave at the same time but go to different places. In addition, the two IVs had a correlation value close to zero. Besides, the model of best fit only included one predictor variable, so this assumption does not pose a threat to the model.
As discussed in Assumption 4, we can look at the fitted vs actual values to infer the mean of the error term. The error is the difference between the actual and fitted, so the closer the points on the residuals plot are to zero, the closer the error is to zero. The error term may not be exactly zero, but based on the plot it is likely pretty close.
Below four issues that may occur when running regression analysis will be discussed.
The independent variables in this analysis, distance and departure time, are probabilistic and proximal causes of arrival delay. They are probabilistic and not deterministic because neither of the two can determine the arrival delay of a flight every time. A shorter flight does not necessarily equate to a shorter delay, nor does a longer flight. At the same time, a flight that leaves in the morning won’t always have a shorter or longer delay than a flight that leaves in the evening. They are probabilistic causes because sometimes they may lead to a certain arrival delay (at least in this case, departure time may, while distance traveled tested as insignificant).
They are proximal and not ultimate as I do not believe distance traveled and departure time are the main cause of a flight delay. It is much more likely that another factor, such as weather, plane maintenance, or a security or other issue at the airport/runway, is the ultimate cause of the delay. It was simply the goal of this experiment to see if distance and departure time explained some of the variance in arrival delay.
Since flights is a very large dataset, a program called G*Power was utilized to find an appropriate sample size to test. We do not want to use all 300,000+ observations because we want to ensure that significant results are due to an effect of meaningful size actually existing, not simply to the very large number of observations in the sample. For the calculation, the following parameters were chosen:
effect size = .05
alpha = .05
Power (1-beta) = .8
Tested Predictors = 2
Total # of Predictors = 16
A sample size of 196 was calculated, and this size was used to select a random sample from the dataset (entered in the model building section above).
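The sample size can be sanity-checked in base R with the noncentral F distribution. The sketch below rests on assumptions that are not stated in the text: that the effect size is Cohen's f^2, that the noncentrality parameter is f^2 times the sample size, and that the denominator degrees of freedom are n minus the total number of predictors minus 1. Under those assumptions it computes the power achieved with n = 196, which should land near the 0.8 target if they match the G*Power setup used.

```r
f2      <- 0.05   # effect size, assumed to be Cohen's f^2
alpha   <- 0.05
n       <- 196    # sample size reported above
df_num  <- 2      # tested predictors
n_pred  <- 16     # total number of predictors

df_den <- n - n_pred - 1   # assumed denominator degrees of freedom
lambda <- f2 * n           # assumed noncentrality parameter

# Critical F value under H0, then probability of exceeding it under H1
f_crit <- qf(1 - alpha, df_num, df_den)
power  <- pf(f_crit, df_num, df_den, ncp = lambda, lower.tail = FALSE)

print(df_den)
print(lambda)
print(power)
```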
Collinearity becomes a problem when the independent variables in a model are correlated. Although the best model only used one variable, we will check this assumption in case it is affecting the second model that was created. We checked for collinearity in section ________ above when we examined the correlation matrix. Distance and departure time had a correlation value of 0.00663, so we can consider this assumption fulfilled.
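With only two predictors, the variance inflation factor (VIF) follows directly from their correlation as 1 / (1 - r^2). The sketch below applies this to the correlation between dep_time and distance reported in the correlation matrix; values close to 1 indicate essentially no inflation of the coefficient variances.

```r
# Correlation between dep_time and distance, copied from the correlation matrix
r_iv <- 0.006639

# Variance inflation factor for a model with two predictors
vif <- 1 / (1 - r_iv^2)
print(vif)
```

The result differs from 1 only in the fifth decimal place, which supports the conclusion that collinearity is not a concern for the second model.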
We can also examine the scatter plot between the IV’s:
plot(data$distance, data$dep_time, xlab="Distance", ylab="Departure Time (in Military Time)")
There is a good spread with no identifiable trends. There are vertical columns (different distances) that have more gaps than others, but this can be attributed to the random sample including fewer observations for these distances.
There is not likely to be much measurement error in this dataset. Air traffic is highly regulated and therefore likely to be strictly observed and recorded. Distances are set depending on the destination, departure time is recorded by the plane and air traffic control, and arrival delay is a simple calculation based on when the plane was planned to arrive.
The nature of the dataset gave it many outliers, and this is likely the reason not all of the assumptions were fulfilled. With this in mind, the model using departure time as a predictor variable for arrival delay still returned an R^2 value of about 12%, indicating that the time of day a flight departs in part explains the arrival delay. It would be beneficial to repeat this experiment with a different random sample or dataset. It could also be interesting to repeat it with the outliers removed to see if more assumptions are fulfilled and the results stand, although that could be problematic in this case since it would mean removing a significant portion of the delayed flights.