This dataset contains information about all flights that departed from NYC (e.g. EWR, JFK and LGA) in 2013, which amounts to 336,776 flights in total. It contains 336,776 observations of 16 variables, which include ‘year’, ‘month’, ‘day’, ‘dep_time’, ‘dep_delay’, ‘arr_time’, ‘arr_delay’, ‘carrier’, ‘tailnum’, ‘flight’, ‘origin’, ‘dest’, ‘air_time’, ‘distance’, ‘hour’, and ‘minute’. The variables that are made up of integer values include ‘year’ (which refers to the year of departure), ‘month’ (which refers to the month of departure), ‘day’ (which refers to the day of departure), ‘dep_time’ (which refers to the time of departure [in hhmm notation]), ‘arr_time’ (which refers to the time of arrival [in hhmm notation]), and ‘flight’ (which refers to the flight number). The variables that are made up of character values include ‘carrier’ (which refers to the airline carrier), ‘tailnum’ (which refers to plane tail number), ‘origin’ (which refers to the origin location), and ‘dest’ (which refers to the destination location). Lastly, the variables that are made up of numeric values include ‘dep_delay’ (which refers to departure delays expressed in minutes), ‘arr_delay’ (which refers to arrival delays expressed in minutes), ‘air_time’ (which refers to the amount of time a plane is airborne, expressed in minutes), ‘distance’ (which refers to the distance traveled by a flight, expressed in miles), ‘hour’ (which refers to the hour of departure), and ‘minute’ (which refers to the minute of departure).
[Reference: An R data package containing all out-bound flights from NYC in 2013 [+ useful metdata] is used in this analysis. It can be installed from github with devtools::install_github(“hadley/nycflights13”).]
Below, the “nycflights13” Dataset is loaded into R, and its summary statistics and its structure are display (along with the “head” and the “tail” of the dataset).
#Install and load the "nycflights13" dataset into R.
rm(list=ls())
install.packages("nycflights13", repos='http://cran.us.r-project.org')
## Installing package into 'C:/Users/howelb/Documents/R/win-library/3.1'
## (as 'lib' is unspecified)
## package 'nycflights13' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\howelb\AppData\Local\Temp\Rtmp041ftd\downloaded_packages
library("nycflights13", lib.loc = "C:/Program Files/R/R-3.1.1/library")
## Warning: package 'nycflights13' was built under R version 3.1.3
#Assign a variable, "data_raw", to the complete dataframe, "flights".
#Then, display the "head" and "tail" of the dataset, "data_raw".
data_raw<-flights
head(data_raw)
## year month day dep_time dep_delay arr_time arr_delay carrier tailnum
## 1 2013 1 1 517 2 830 11 UA N14228
## 2 2013 1 1 533 4 850 20 UA N24211
## 3 2013 1 1 542 2 923 33 AA N619AA
## 4 2013 1 1 544 -1 1004 -18 B6 N804JB
## 5 2013 1 1 554 -6 812 -25 DL N668DN
## 6 2013 1 1 554 -4 740 12 UA N39463
## flight origin dest air_time distance hour minute
## 1 1545 EWR IAH 227 1400 5 17
## 2 1714 LGA IAH 227 1416 5 33
## 3 1141 JFK MIA 160 1089 5 42
## 4 725 JFK BQN 183 1576 5 44
## 5 461 LGA ATL 116 762 5 54
## 6 1696 EWR ORD 150 719 5 54
tail(data_raw)
## year month day dep_time dep_delay arr_time arr_delay carrier
## 336771 2013 9 30 NA NA NA NA EV
## 336772 2013 9 30 NA NA NA NA 9E
## 336773 2013 9 30 NA NA NA NA 9E
## 336774 2013 9 30 NA NA NA NA MQ
## 336775 2013 9 30 NA NA NA NA MQ
## 336776 2013 9 30 NA NA NA NA MQ
## tailnum flight origin dest air_time distance hour minute
## 336771 N740EV 5274 LGA BNA NA 764 NA NA
## 336772 3393 JFK DCA NA 213 NA NA
## 336773 3525 LGA SYR NA 198 NA NA
## 336774 N535MQ 3461 LGA BNA NA 764 NA NA
## 336775 N511MQ 3572 LGA CLE NA 419 NA NA
## 336776 N839MQ 3531 LGA RDU NA 431 NA NA
#Display the summary statistics and the structure of the data
summary(data_raw)
## year month day dep_time
## Min. :2013 Min. : 1.000 Min. : 1.00 Min. : 1
## 1st Qu.:2013 1st Qu.: 4.000 1st Qu.: 8.00 1st Qu.: 907
## Median :2013 Median : 7.000 Median :16.00 Median :1401
## Mean :2013 Mean : 6.549 Mean :15.71 Mean :1349
## 3rd Qu.:2013 3rd Qu.:10.000 3rd Qu.:23.00 3rd Qu.:1744
## Max. :2013 Max. :12.000 Max. :31.00 Max. :2400
## NA's :8255
## dep_delay arr_time arr_delay carrier
## Min. : -43.00 Min. : 1 Min. : -86.000 Length:336776
## 1st Qu.: -5.00 1st Qu.:1104 1st Qu.: -17.000 Class :character
## Median : -2.00 Median :1535 Median : -5.000 Mode :character
## Mean : 12.64 Mean :1502 Mean : 6.895
## 3rd Qu.: 11.00 3rd Qu.:1940 3rd Qu.: 14.000
## Max. :1301.00 Max. :2400 Max. :1272.000
## NA's :8255 NA's :8713 NA's :9430
## tailnum flight origin dest
## Length:336776 Min. : 1 Length:336776 Length:336776
## Class :character 1st Qu.: 553 Class :character Class :character
## Mode :character Median :1496 Mode :character Mode :character
## Mean :1972
## 3rd Qu.:3465
## Max. :8500
##
## air_time distance hour minute
## Min. : 20.0 Min. : 17 Min. : 0.00 Min. : 0.00
## 1st Qu.: 82.0 1st Qu.: 502 1st Qu.: 9.00 1st Qu.:16.00
## Median :129.0 Median : 872 Median :14.00 Median :31.00
## Mean :150.7 Mean :1040 Mean :13.17 Mean :31.76
## 3rd Qu.:192.0 3rd Qu.:1389 3rd Qu.:17.00 3rd Qu.:49.00
## Max. :695.0 Max. :4983 Max. :24.00 Max. :59.00
## NA's :9430 NA's :8255 NA's :8255
str(data_raw)
## Classes 'tbl_df', 'tbl' and 'data.frame': 336776 obs. of 16 variables:
## $ year : int 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
## $ month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ day : int 1 1 1 1 1 1 1 1 1 1 ...
## $ dep_time : int 517 533 542 544 554 554 555 557 557 558 ...
## $ dep_delay: num 2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
## $ arr_time : int 830 850 923 1004 812 740 913 709 838 753 ...
## $ arr_delay: num 11 20 33 -18 -25 12 19 -14 -8 8 ...
## $ carrier : chr "UA" "UA" "AA" "B6" ...
## $ tailnum : chr "N14228" "N24211" "N619AA" "N804JB" ...
## $ flight : int 1545 1714 1141 725 461 1696 507 5708 79 301 ...
## $ origin : chr "EWR" "LGA" "JFK" "JFK" ...
## $ dest : chr "IAH" "IAH" "MIA" "BQN" ...
## $ air_time : num 227 227 160 183 116 150 158 53 140 138 ...
## $ distance : num 1400 1416 1089 1576 762 ...
## $ hour : num 5 5 5 5 5 5 5 5 5 5 ...
## $ minute : num 17 33 42 44 54 54 55 57 57 58 ...
#Create a subset of "data_raw" that contains only numeric data
nycflights <- subset(data_raw, select = c(dep_delay, arr_time, arr_delay, air_time, distance, hour, minute))
#Display the "head" and "tail" of the dataset, "nycflights".
head(nycflights)
## dep_delay arr_time arr_delay air_time distance hour minute
## 1 2 830 11 227 1400 5 17
## 2 4 850 20 227 1416 5 33
## 3 2 923 33 160 1089 5 42
## 4 -1 1004 -18 183 1576 5 44
## 5 -6 812 -25 116 762 5 54
## 6 -4 740 12 150 719 5 54
tail(nycflights)
## dep_delay arr_time arr_delay air_time distance hour minute
## 336771 NA NA NA NA 764 NA NA
## 336772 NA NA NA NA 213 NA NA
## 336773 NA NA NA NA 198 NA NA
## 336774 NA NA NA NA 764 NA NA
## 336775 NA NA NA NA 419 NA NA
## 336776 NA NA NA NA 431 NA NA
#Display the summary statistics and the structure of the data
summary(nycflights)
## dep_delay arr_time arr_delay air_time
## Min. : -43.00 Min. : 1 Min. : -86.000 Min. : 20.0
## 1st Qu.: -5.00 1st Qu.:1104 1st Qu.: -17.000 1st Qu.: 82.0
## Median : -2.00 Median :1535 Median : -5.000 Median :129.0
## Mean : 12.64 Mean :1502 Mean : 6.895 Mean :150.7
## 3rd Qu.: 11.00 3rd Qu.:1940 3rd Qu.: 14.000 3rd Qu.:192.0
## Max. :1301.00 Max. :2400 Max. :1272.000 Max. :695.0
## NA's :8255 NA's :8713 NA's :9430 NA's :9430
## distance hour minute
## Min. : 17 Min. : 0.00 Min. : 0.00
## 1st Qu.: 502 1st Qu.: 9.00 1st Qu.:16.00
## Median : 872 Median :14.00 Median :31.00
## Mean :1040 Mean :13.17 Mean :31.76
## 3rd Qu.:1389 3rd Qu.:17.00 3rd Qu.:49.00
## Max. :4983 Max. :24.00 Max. :59.00
## NA's :8255 NA's :8255
str(nycflights)
## Classes 'tbl_df', 'tbl' and 'data.frame': 336776 obs. of 7 variables:
## $ dep_delay: num 2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
## $ arr_time : int 830 850 923 1004 812 740 913 709 838 753 ...
## $ arr_delay: num 11 20 33 -18 -25 12 19 -14 -8 8 ...
## $ air_time : num 227 227 160 183 116 150 158 53 140 138 ...
## $ distance : num 1400 1416 1089 1576 762 ...
## $ hour : num 5 5 5 5 5 5 5 5 5 5 ...
## $ minute : num 17 33 42 44 54 54 55 57 57 58 ...
Upon carrying out this initial summary statistics analysis and taking a random sample of the data, a hierarchical approach is carried out in beginning to develop a multiple linear regression model. Using information obtained from the website ‘flightstats.com’ pertaining to “Airport Departures and Arrivals” [reference: http://www.flightstats.com/go/FlightStatus/flightStatusByAirport.do] and learning that weather, air traffic control directives, and congestion on the taxi-ways can affect arrival times for flights, it is intuitively determined that the dependent variable “arr_delay” and the independent variables “distance” and “air_time” will be the most suitable variables to include in this analysis that are found in this “nycflights13” dataset.
Therefore, upon carrying out this hierarchical approach for this experiment, we are now trying to determine whether or not the variation that is observed in the dependent variable (which corresponds to ‘arr_delay’ in this analysis) can be explained by the variation existent in either of the independent variables in this experiment (which correspond to ‘air_time’ and ‘distance’). Therefore, the null hypothesis that is being tested states that the distance traveled by a given flight and the amount of time that a given flight is airborne do not have a significant effect on a given flight’s arrival delay. Opposingly, the alternate hypothesis that is being tested states that the distance traveled by a given flight and the amount of time that a given flight is airborne do, in fact, have a significant effect on a given flight’s arrival delay.
In this experiment, a hierarchical multiple linear regression model is generated, which will offer some insight into determining both the amount of variance in ‘arr_delay’ that can be explained by each of the independent variables being considered in this analysis and whether any existence of suppression is likely to exist within a linear regression model comprised of this data. The independent variables include the distance traveled by a given flight (in miles) and the amount of time that a given flight is airborne (in minutes), and the dependent variable refers to a given flight’s delay in arrival (in minutes).
Originally, the “nycflights13” dataset contains 336,776 observations. However, this number of observations may serve to be too large for a statistically significant analysis, so a power analysis is performed in this experiment to determine the most appropriate sample size for our final multiple linear regression model (where our desired alpha-level equaled 0.05, our desired power-level equaled 0.95, and the considered number of predictors equaled 2).
#Generate an initial Hierarchical Multiple Linear Regression Model that uses all 336,776 observations
flights_model_0 <- lm(nycflights$arr_delay~nycflights$distance+nycflights$air_time)
#Display summary of the initial Hierarchical Multiple Linear Regression Model
summary(flights_model_0)
##
## Call:
## lm(formula = nycflights$arr_delay ~ nycflights$distance + nycflights$air_time)
##
## Residuals:
## Min 1Q Median 3Q Max
## -69.27 -22.91 -12.28 5.44 1284.47
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.4559364 0.1728867 -8.421 <2e-16 ***
## nycflights$distance -0.0876549 0.0007613 -115.145 <2e-16 ***
## nycflights$air_time 0.6652632 0.0059795 111.256 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 43.73 on 327343 degrees of freedom
## (9430 observations deleted due to missingness)
## Multiple R-squared: 0.04012, Adjusted R-squared: 0.04012
## F-statistic: 6842 on 2 and 327343 DF, p-value: < 2.2e-16
#Determine effect size for power analysis
r2 <- summary(flights_model_0)$r.squared
f2 <- r2 / (1-r2)
f2
## [1] 0.04180118
Upon determining the effect size, the software G[STAR]Power is used to determine the most appropriate sample size for this hierarchical multiple linear regression analysis. In its results, G[STAR]Power generated a sample size of 261. So, with this sample size, the dataset “nycflights” will be sampled, creating a new dataset to be used for this hierarchical multiple linear regression model, which will then be used to determine if the variation in “arr_delay” can be explained by the variation existent in both “air_time” and “distance”.
#Remove any rows that contain "NA" in "data_raw", creating "data_clean".
#Randomly take a sample of 261 observations from "data_clean", creating "nycflights_final".
data_clean<-na.omit(nycflights)
S <- 261
set.seed(39)
flight.index <- sample(1:nrow(data_clean),S,replace=FALSE)
nycflights_final <- data_clean[flight.index,]
#Generate a new Hierarchical Multiple Linear Regression Model that uses 261 observations
flights_model_final <- lm(nycflights_final$arr_delay~nycflights_final$distance+nycflights_final$air_time)
#Display summary of the final Hierarchical Multiple Linear Regression Model
summary(flights_model_final)
##
## Call:
## lm(formula = nycflights_final$arr_delay ~ nycflights_final$distance +
## nycflights_final$air_time)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.560 -21.148 -10.521 6.748 214.770
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.73010 5.21712 0.332 0.74
## nycflights_final$distance -0.09311 0.02160 -4.310 2.33e-05 ***
## nycflights_final$air_time 0.68462 0.16957 4.037 7.13e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37.28 on 258 degrees of freedom
## Multiple R-squared: 0.07497, Adjusted R-squared: 0.06779
## F-statistic: 10.45 on 2 and 258 DF, p-value: 4.309e-05
For the hierarchical multiple linear regression analysis that is performed where ‘distance’,and ‘air_time’ are all analyzed against the response variable ‘arr_delay’, p-values equal to 2.33e-05 and 7.13e-05 [respectively] for each of these dependent variables are returned, indicating that there is roughly a probability equal to 2.33e-05 and 7.13e-05 (for each of these independent variables, respectively) that the resulting model’s associated F-value (10.45) is the result of solely randomization. Therefore, based on this hierarchical multiple linear regression model’s yielded results (and its respective p-value outputted in the model summary above), we would reject the null hypothesis, leading us to believe that the variation that is observed in the mean values of flight arrival delays can be explained by the variation in the distance traveled by a given flight and the amount of time that a given flight is airborne being considered in this analysis and, as such, is likely not caused solely by randomization. (See above results for corresponding p-values and F-value.)
In further analyzing the results of the simple linear regression analysis, it’s important to note that the value of b_0 (the linear model’s y-intercept) is 1.73010,the value of b_1 (which represents both the slope of the linear model and the coefficient associated with the variable ‘distance’ in the linear model) is -0.09311, and the value of b_2 (which represents both the slope of the linear model and the coefficient associated with the variable ‘air_time’ in the linear model) is 0.68462. These values indicate the relationship between the independent variables corresponding to the distance traveled by a given flight and the amount of time that a given flight is airborne and the dependent variable corresponding to the arrival delay of a given flight, which corresponds to the idea that an increase in one unit of “distance” (in miles) results in a decrease in 0.09311 units of “flight arrival delay” and that an increase in one unit of “flight airborne time” (in minutes) results in an increase in 0.68462 units of “flight arrival delay.”
Additionally, it’s important to note the metrics that measure the correlation between the independent variables corresponding to the distance traveled by a given flight and the amount of time that a given flight is airborne and the dependent variable corresponding to the arrival delay of a given flight, which are multiple R-squared and adjusted R-squared (since we want to take into account any bias that might be associated with the number of explanatory variables being included in the model, this analysis emphasis the value of adjusted R-squared rather than the value of multiple R-squared). Since the value of adjusted R-squared is 0.06779, it can be inferred that the variation that exists in the independent variables corresponding to the distance traveled by a given flight and the amount of time that a given flight is airborne can explain approximately 6.779% of the variation existent in the dependent variable corresponding to the arrival delay of a given flight. As a result of this low adjusted R-squared value here, one should not put much faith in asserting that the independent variables being considered in this analysis (‘distance’ and ‘air_time’) explain a significant portion of the variation existent in the dependent variable being considered here (‘arr_delay’).
Before beginning to check the model against all eight of the assumptions associated with linear regression modeling, histograms, boxplots, scatterplots, and a “Quality of Fit” plot (via a fitted vs. residual values determination) are generated, which will be used for their graphical nature in our interpretations.
#Generate histograms for all of the different variables being considered in our sampled data ('arr_delay', 'distance', and 'air_time')
hist(nycflights_final$arr_delay, xlab = "Flight Arrival Delay [in minutes]", main = "Histogram of Flight Arrival Delay")
hist(nycflights_final$distance, xlab = "Flight Distance Traveled [in miles]", main = "Histogram of Flight Distance Traveled")
hist(nycflights_final$air_time, xlab = "Flight Airborne Time [in minutes]", main = "Histogram of Flight Airborne Time")
#Generate a boxplot of the data (Dependent Variable = Flight Arrival Delay)
boxplot(x = nycflights_final$arr_delay, pch=21, bg="darkviolet", main="Flight Arrival Delay", xlab = "Flight Arrival Delay [in minutes]")
#Generate a boxplot of the data (Independent Variable = Flight Distance Traveled)
boxplot(x = nycflights_final$distance, pch=21, bg="darkviolet", main="Flight Distance Traveled", xlab = "Flight Distance Traveled [in miles]")
#Generate a boxplot of the data (Independent Variable = Flight Airborne Time)
boxplot(x = nycflights_final$air_time, pch=21, bg="darkviolet", main="Flight Airborne Time", xlab = "Flight Airborne Time [in minutes]")
#Generate a scatterplot of the data: "arr_delay" vs. "distance"
plot(y = nycflights_final$arr_delay,x = nycflights_final$distance, pch=21, bg="darkviolet", main="Distance Traveled by Flight vs. Arrival Delay [in minutes]", ylab = "Arrival Delay [in minutes]", xlab = "Distance Traveled by Flight (in miles)")
#Generate a scatterplot of the data: "arr_delay" vs. "air_time"
plot(y = nycflights_final$arr_delay,x = nycflights_final$air_time, pch=21, bg="darkviolet", main="Flight Airborne Time vs. Arrival Delay [in minutes]", ylab = "Arrival Delay [in minutes]", xlab = "Flight Airborne Time (in minutes)")
#Create a "Quality of Fit Model" that plots the residuals of "flights_model_final" against its fitted model.
par(mfrow=c(1,1))
plot(fitted(flights_model_final),residuals(flights_model_final), main = "Residuals of 'flights_model_final' Against Fitted Model 'flights_model_final' [Not Standardized]")
abline(0,0, col='darkviolet', lwd=2.5)
#Create a "Quality of Fit Model" that plots the standardized residuals of "flights_model_final" against its fitted model.
par(mfrow=c(1,1))
standardized_flights_model <- rstandard(flights_model_final)
plot(fitted(flights_model_final),standardized_flights_model, main = "Standardized Residuals of 'flights_model_final' Against Fitted Model 'flights_model_final'")
abline(0,0, col='darkviolet', lwd=2.5)
In interpreting our hierarchical multiple linear regression model and the statistical significance of the results that were generated therein, it is important to test the model against the eight assumptions corresponding to linear regression.
In order to determine of the distribution of the residuals is normal, we can generate histograms and boxplots for the residuals of the model, analyze them for skewness and kurtosis. Upon observing both the boxplot and the histogram of the residuals, it appears that the model’s residuals do exhibit some skewness, as the residuals seem to be skewed severely to the right (indicating that some bias is likely existent in the model). However, upon observing the histogram of the residuals, it appears that there isn’t much kurtosis (if any) existent in the residuals.
#Generate histograms for the residuals of our model
hist(residuals(flights_model_final), xlab = "Residuals", main = "Histogram of Residuals of 'flights_model_final'")
#Generate a boxplot for the residuals of our model
boxplot(x = residuals(flights_model_final), pch=21, bg="darkviolet", main="Boxplot of Residuals of 'flights_model_final", xlab = "Residuals")
We can further determine whether or not the distribution of the residuals exhibits normality by generating a Normal Quantile-Quantile (QQ) Plot for the residuals of the model and by performing a Shapiro-Wilk Test of Normality on our data. Upon doing so, it’s likely that we can readily assume our data does not exhibit normality. First of all, the constructed Normal Q-Q Plots did not seem to display a trend of data that aligned closely with the Normal Q-Q Line. Secondly, the fact that the resulting p-value of the Shapiro-Wilk Test of Normality for “nycflights_final[,“arr_delay”]” was < 0.1 would lead one to assume that the data does not exhibit normality. Therefore, it is likely evident that our model does not satisfy this assumption.
#Create a Normal Q-Q Plot for the data pertaining to Sulfur Dioxide Content in the Air.
qqnorm(residuals(flights_model_final), main = "Normal Q-Q Plot for Residuals of 'flights_model_final'")
qqline(residuals(flights_model_final))
#Perform Shapiro-Wilk Test of Normality on Flight Arrival Delays (normality is assummed if p > 0.1).
shapiro.test(nycflights_final[,"arr_delay"])
##
## Shapiro-Wilk normality test
##
## data: nycflights_final[, "arr_delay"]
## W = 0.7998, p-value < 2.2e-16
In order to determine if the variance of the residuals for every set of values for the predictor are equal, we can generate a residuals plot of the model and discern whether or not the residuals located across the dynamic range are uniformly distributed along the “y=0” axis. Upon generating this plot, it appears that while some heteroskedasticity may potentially exist (given that it appears that some variance may exist at the lower end of the dynamic range with residuals being slightly more expanded along the y-axis), the residuals may seem to be generally uniformly distributed here [exhibiting homoskedasticity].
#Generate a residuals plot for "flights_model_final"
plot(flights_model_final$residuals, pch=21, bg="darkviolet", main = "Residuals Plot for 'flights_model_final")
In further determining whether or not the variance of the residuals for every set of values for the predictor is equal, we can perform White’s Test for Heteroskedasticity for each individual independent variable (“distance” and “air_time”) against the dependent variable (“arr_delay”). In performing this test, it is important to note that the null hypothesis states that the model exhibits homoskedasticity, which (if not rejected) would lead one to infer that this assumption is satisfied. Upon generating this test for both respective independent variables (“distance” and “air_time”), p-values greater than an alpha-level of 0.05 resulted for both independent variables (0.767 and 0.829, respectively), leading us to fail to reject the null hypothesis here. Therefore, it is likely evident that our model does satisfy this assumption.
#install.packages("het.test") #[needs to be installed before use]
#install.packages("vars") #[needs to be installed before use]
library(het.test)
## Warning: package 'het.test' was built under R version 3.1.3
## Loading required package: vars
## Warning: package 'vars' was built under R version 3.1.3
## Loading required package: MASS
## Warning: package 'MASS' was built under R version 3.1.2
## Loading required package: strucchange
## Warning: package 'strucchange' was built under R version 3.1.3
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 3.1.3
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## Loading required package: sandwich
## Warning: package 'sandwich' was built under R version 3.1.3
## Loading required package: urca
## Warning: package 'urca' was built under R version 3.1.3
## Loading required package: lmtest
## Warning: package 'lmtest' was built under R version 3.1.3
library(vars)
whitetest_distance <- VAR(data.frame(y=nycflights_final$arr_delay, x=nycflights_final$distance))
whites.htest(whitetest_distance)
##
## White's Test for Heteroskedasticity:
## ====================================
##
## No Cross Terms
##
## H0: Homoskedasticity
## H1: Heteroskedasticity
##
## Test Statistic:
## 8.2288
##
## Degrees of Freedom:
## 12
##
## P-value:
## 0.7670
whitetest_air_time <- VAR(data.frame(y=nycflights_final$arr_delay, x=nycflights_final$air_time))
whites.htest(whitetest_air_time)
##
## White's Test for Heteroskedasticity:
## ====================================
##
## No Cross Terms
##
## H0: Homoskedasticity
## H1: Heteroskedasticity
##
## Test Statistic:
## 7.4155
##
## Degrees of Freedom:
## 12
##
## P-value:
## 0.8290
In determining whether or not the error term is additive, we can qualitatively discern that an argument exists for the error term that considers both flight distance traveled and flight airborne time being multiplicative (as opposed to additive, given that these two independent variables may not be completely independent of each other in an intuitive sense). Additionally, a standardized residual plot can be generated for this model (against a fitted version of the model) as a means for determining whether or not the model satisfies this assumption. Upon generating this plot, it appears that the residuals located across the dynamic range are not uniformly distributed along the “y=0” axis, further indicating that a non-linear kind of effect likely exists within the data that the model is comprised of. Therefore, it is likely evident that our model does not satisfy this assumption.
#Create a "Quality of Fit Model" that plots the standardized residuals of "flights_model_final" against its fitted model.
par(mfrow=c(1,1))
standardized_flights_model <- rstandard(flights_model_final)
plot(fitted(flights_model_final),standardized_flights_model, main = "Standardized Residuals of 'flights_model_final' Against Fitted Model 'flights_model_final'")
abline(0,0, col='darkviolet', lwd=2.5)
In order to determine whether or not the expected (mean) value of the residuals is zero at every value of the outcome, we can observe the results corresponding to the generated “Quality of Fit Model” that plots the standardized residuals of “flights_model_final” against its fitted model [found in Assumption #3 above]. Upon doing so, it appears that the residuals located across the dynamic range are not uniformly distributed along the “y=0” axis, further indicating that a non-linear kind of effect likely exists within the data that the model is comprised of and, most especially, that the expected (mean) value of the residuals is not zero at every value of the outcome. Therefore, it is likely evident that our model does not satisfy this assumption.
In order to determine whether or not the expected correlation between residuals, for any two cases, is zero, we can generate a non-standardized “Quality of Fit Model” that plots the residuals of “flights_model_final” against its fitted model. In doing so, we can determine if the data exhibits independence, which is necessary for a model comprised of that data to be statistically significant (since it needs to be generated using “i.i.d.” [independent and identically distributed] data). Upon generating this plot, it appears that the residuals located across the dynamic range are not uniformly distributed along the “y=0” axis, especially given that the plot indicates the existence of some kind of “biased” pattern that typically exhibits non-independence and/or auto-correlation. Therefore, it is likely evident that our model does not satisfy this assumption.
#Create a "Quality of Fit Model" that plots the residuals of "flights_model_final" against its fitted model.
par(mfrow=c(1,1))
plot(fitted(flights_model_final),residuals(flights_model_final), main = "Residuals of 'flights_model_final' Against Fitted Model 'flights_model_final' [Not Standardized]")
abline(0,0, col='darkviolet', lwd=2.5)
In order to determine whether or not the predictors are a perfect linear function of other predictors (i.e., no perfect multicollinearity), a correlation matrix and a plot that graphs the independent variables against each other can be generated. Upon doing so, it appears that “distance” and “air_time” are strongly correlated with each other (at a value of 0.9890678, which is very close to 1.00), which does seem to exhibit the existence of collinearity between these independent variables. Therefore, it is likely evident that our model does not satisfy this assumption.
#Generate a correlation matrix for "distance" and "air_time"
cor(nycflights_final[c("distance","air_time")])
## distance air_time
## distance 1.0000000 0.9890678
## air_time 0.9890678 1.0000000
#Generate a plot that graphs "distance" and "air_time" against each other
plot(nycflights_final$distance~nycflights_final$air_time, main = "'air_time' vs. 'distance'")
In order to determine if the mean of the error term is zero, we can observe the results corresponding to the generated “Quality of Fit Model” that plots the standardized residuals of “flights_model_final” against its fitted model [found in Assumption #3 above]. Upon doing so, it appears that the residuals located across the dynamic range are not uniformly distributed along the “y=0” axis, further indicating that a non-linear kind of effect likely exists within the data that the model is comprised of and, most especially, that the mean of the error term is likely not zero. Therefore, it is likely evident that our model does not satisfy this assumption.
In interpreting our hierarchical multiple linear regression model and the statistical significance of the results that were generated therein, it is important to check the model against the four main issues surrounding linear regression.
As far as causality is concerned here, we can intuitively discern that “flight distance traveled” and “flight airborne time” probabilistically and proximally cause delays in flight arrival. These independent variables are considered to be proximal and probabilistic causes (and not ultimate or determinate causes) because they do not “perfectly” or “directly” cause delays in flight arrival times in every flight situation (especially considering the fact that other factors are more likely to “perfectly/directly” cause flight arrival delays, such as the existence of inclement weather).
As far as sample sizes are concerned here, our original, raw dataset is massive (336,776 observations). If this dataset were used to generate our hierarchical multiple linear regression model, the significance of our results would have been misinterpreted as a result of the bias that exists when using massive datasets. To alleviate this concern, the effect size of our proposed model was determined, and then an appropriate sample size was determined using the software G[STAR]Power. Upon determining the effect size to be 0.04180118, the use of G[STAR]Power resulted in a generated sample size of 261. So, with this sample size, the dataset “nycflights” was sampled, and a new dataset “nycflights_final” was created. (See the section “Power Analysis for Multiple Linear Regression Modeling” within Chapter #2 for more information.)
As far as collinearity is concerned here, our work in determining whether or not Assumption #7 (i.e., “no predictors are a perfect linear function of other predictors (no perfect multicollinearity)”) was satisfied by our model brought us to the seemingly clear conclusion that “distance” and “air_time” are strongly correlated with each other (at a value of 0.9890678, which is very close to 1.00), which does seem to exhibit the existence of collinearity between these independent variables. Therefore, collinearity is evidently a huge issue for our modeling approach, given how much “distance” and “air_time” seem to affect each other in a multiplicative sort of way.
In this study, the existence of measurement error is not an immediate concern. Since airborne flights are tracked continuously through every individual plane’s journey and are electronically monitored (with a heavy importance placed upon data collection) by both the government and the airports associated with given departure and arrival locations, it is not very likely (though, technically possible) that measurement error would play a role in affecting accurate data collection.