Assignment #03: Assumptions and Issues [Final]

Assumptions and Issues Project - Analysis of Analysis of Arrival Delay Times for Flights Departing from NYC

Brendan Howell

Renselaer Polytechnic Institute

04/20/15 - Version 1.0

1. Data

Dataset of Out-bound Flights from NYC in 2013 (‘nycflights13’)

Description: This package contains information about all flights that departed from NYC (e.g. EWR, JFK, and LGA) in 2013: 336,776 flights in total.

This dataset contains information about all flights that departed from NYC (e.g. EWR, JFK and LGA) in 2013, which amounts to 336,776 flights in total. It contains 336,776 observations of 16 variables, which include ‘year’, ‘month’, ‘day’, ‘dep_time’, ‘dep_delay’, ‘arr_time’, ‘arr_delay’, ‘carrier’, ‘tailnum’, ‘flight’, ‘origin’, ‘dest’, ‘air_time’, ‘distance’, ‘hour’, and ‘minute’. The variables that are made up of integer values include ‘year’ (which refers to the year of departure), ‘month’ (which refers to the month of departure), ‘day’ (which refers to the day of departure), ‘dep_time’ (which refers to the time of departure [in hhmm notation]), ‘arr_time’ (which refers to the time of arrival [in hhmm notation]), and ‘flight’ (which refers to the flight number). The variables that are made up of character values include ‘carrier’ (which refers to the airline carrier), ‘tailnum’ (which refers to plane tail number), ‘origin’ (which refers to the origin location), and ‘dest’ (which refers to the destination location). Lastly, the variables that are made up of numeric values include ‘dep_delay’ (which refers to departure delays expressed in minutes), ‘arr_delay’ (which refers to arrival delays expressed in minutes), ‘air_time’ (which refers to the amount of time a plane is airborne, expressed in minutes), ‘distance’ (which refers to the distance traveled by a flight, expressed in miles), ‘hour’ (which refers to the hour of departure), and ‘minute’ (which refers to the minute of departure).

[Reference: An R data package containing all out-bound flights from NYC in 2013 [+ useful metdata] is used in this analysis. It can be installed from github with devtools::install_github(“hadley/nycflights13”).]

Data Organization

Below, the “nycflights13” Dataset is loaded into R, and its summary statistics and its structure are display (along with the “head” and the “tail” of the dataset).

#Install and load the "nycflights13" dataset into R.
rm(list=ls())
install.packages("nycflights13", repos='http://cran.us.r-project.org')
## Installing package into 'C:/Users/howelb/Documents/R/win-library/3.1'
## (as 'lib' is unspecified)
## package 'nycflights13' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\howelb\AppData\Local\Temp\Rtmp041ftd\downloaded_packages
library("nycflights13", lib.loc = "C:/Program Files/R/R-3.1.1/library")
## Warning: package 'nycflights13' was built under R version 3.1.3
#Assign a variable, "data_raw", to the complete dataframe, "flights".
#Then, display the "head" and "tail" of the dataset, "data_raw".
data_raw<-flights
head(data_raw)
##   year month day dep_time dep_delay arr_time arr_delay carrier tailnum
## 1 2013     1   1      517         2      830        11      UA  N14228
## 2 2013     1   1      533         4      850        20      UA  N24211
## 3 2013     1   1      542         2      923        33      AA  N619AA
## 4 2013     1   1      544        -1     1004       -18      B6  N804JB
## 5 2013     1   1      554        -6      812       -25      DL  N668DN
## 6 2013     1   1      554        -4      740        12      UA  N39463
##   flight origin dest air_time distance hour minute
## 1   1545    EWR  IAH      227     1400    5     17
## 2   1714    LGA  IAH      227     1416    5     33
## 3   1141    JFK  MIA      160     1089    5     42
## 4    725    JFK  BQN      183     1576    5     44
## 5    461    LGA  ATL      116      762    5     54
## 6   1696    EWR  ORD      150      719    5     54
tail(data_raw)
##        year month day dep_time dep_delay arr_time arr_delay carrier
## 336771 2013     9  30       NA        NA       NA        NA      EV
## 336772 2013     9  30       NA        NA       NA        NA      9E
## 336773 2013     9  30       NA        NA       NA        NA      9E
## 336774 2013     9  30       NA        NA       NA        NA      MQ
## 336775 2013     9  30       NA        NA       NA        NA      MQ
## 336776 2013     9  30       NA        NA       NA        NA      MQ
##        tailnum flight origin dest air_time distance hour minute
## 336771  N740EV   5274    LGA  BNA       NA      764   NA     NA
## 336772           3393    JFK  DCA       NA      213   NA     NA
## 336773           3525    LGA  SYR       NA      198   NA     NA
## 336774  N535MQ   3461    LGA  BNA       NA      764   NA     NA
## 336775  N511MQ   3572    LGA  CLE       NA      419   NA     NA
## 336776  N839MQ   3531    LGA  RDU       NA      431   NA     NA
#Display the summary statistics and the structure of the data
summary(data_raw)
##       year          month             day           dep_time   
##  Min.   :2013   Min.   : 1.000   Min.   : 1.00   Min.   :   1  
##  1st Qu.:2013   1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.: 907  
##  Median :2013   Median : 7.000   Median :16.00   Median :1401  
##  Mean   :2013   Mean   : 6.549   Mean   :15.71   Mean   :1349  
##  3rd Qu.:2013   3rd Qu.:10.000   3rd Qu.:23.00   3rd Qu.:1744  
##  Max.   :2013   Max.   :12.000   Max.   :31.00   Max.   :2400  
##                                                  NA's   :8255  
##    dep_delay          arr_time      arr_delay          carrier         
##  Min.   : -43.00   Min.   :   1   Min.   : -86.000   Length:336776     
##  1st Qu.:  -5.00   1st Qu.:1104   1st Qu.: -17.000   Class :character  
##  Median :  -2.00   Median :1535   Median :  -5.000   Mode  :character  
##  Mean   :  12.64   Mean   :1502   Mean   :   6.895                     
##  3rd Qu.:  11.00   3rd Qu.:1940   3rd Qu.:  14.000                     
##  Max.   :1301.00   Max.   :2400   Max.   :1272.000                     
##  NA's   :8255      NA's   :8713   NA's   :9430                         
##    tailnum              flight        origin              dest          
##  Length:336776      Min.   :   1   Length:336776      Length:336776     
##  Class :character   1st Qu.: 553   Class :character   Class :character  
##  Mode  :character   Median :1496   Mode  :character   Mode  :character  
##                     Mean   :1972                                        
##                     3rd Qu.:3465                                        
##                     Max.   :8500                                        
##                                                                         
##     air_time        distance         hour           minute     
##  Min.   : 20.0   Min.   :  17   Min.   : 0.00   Min.   : 0.00  
##  1st Qu.: 82.0   1st Qu.: 502   1st Qu.: 9.00   1st Qu.:16.00  
##  Median :129.0   Median : 872   Median :14.00   Median :31.00  
##  Mean   :150.7   Mean   :1040   Mean   :13.17   Mean   :31.76  
##  3rd Qu.:192.0   3rd Qu.:1389   3rd Qu.:17.00   3rd Qu.:49.00  
##  Max.   :695.0   Max.   :4983   Max.   :24.00   Max.   :59.00  
##  NA's   :9430                   NA's   :8255    NA's   :8255
str(data_raw)
## Classes 'tbl_df', 'tbl' and 'data.frame':    336776 obs. of  16 variables:
##  $ year     : int  2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
##  $ month    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ day      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ dep_time : int  517 533 542 544 554 554 555 557 557 558 ...
##  $ dep_delay: num  2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
##  $ arr_time : int  830 850 923 1004 812 740 913 709 838 753 ...
##  $ arr_delay: num  11 20 33 -18 -25 12 19 -14 -8 8 ...
##  $ carrier  : chr  "UA" "UA" "AA" "B6" ...
##  $ tailnum  : chr  "N14228" "N24211" "N619AA" "N804JB" ...
##  $ flight   : int  1545 1714 1141 725 461 1696 507 5708 79 301 ...
##  $ origin   : chr  "EWR" "LGA" "JFK" "JFK" ...
##  $ dest     : chr  "IAH" "IAH" "MIA" "BQN" ...
##  $ air_time : num  227 227 160 183 116 150 158 53 140 138 ...
##  $ distance : num  1400 1416 1089 1576 762 ...
##  $ hour     : num  5 5 5 5 5 5 5 5 5 5 ...
##  $ minute   : num  17 33 42 44 54 54 55 57 57 58 ...
#Create a subset of "data_raw" that contains only numeric data
nycflights <- subset(data_raw, select = c(dep_delay, arr_time, arr_delay, air_time, distance, hour, minute))
#Display the "head" and "tail" of the dataset, "nycflights".
head(nycflights)
##   dep_delay arr_time arr_delay air_time distance hour minute
## 1         2      830        11      227     1400    5     17
## 2         4      850        20      227     1416    5     33
## 3         2      923        33      160     1089    5     42
## 4        -1     1004       -18      183     1576    5     44
## 5        -6      812       -25      116      762    5     54
## 6        -4      740        12      150      719    5     54
tail(nycflights)
##        dep_delay arr_time arr_delay air_time distance hour minute
## 336771        NA       NA        NA       NA      764   NA     NA
## 336772        NA       NA        NA       NA      213   NA     NA
## 336773        NA       NA        NA       NA      198   NA     NA
## 336774        NA       NA        NA       NA      764   NA     NA
## 336775        NA       NA        NA       NA      419   NA     NA
## 336776        NA       NA        NA       NA      431   NA     NA
#Display the summary statistics and the structure of the data
summary(nycflights)
##    dep_delay          arr_time      arr_delay           air_time    
##  Min.   : -43.00   Min.   :   1   Min.   : -86.000   Min.   : 20.0  
##  1st Qu.:  -5.00   1st Qu.:1104   1st Qu.: -17.000   1st Qu.: 82.0  
##  Median :  -2.00   Median :1535   Median :  -5.000   Median :129.0  
##  Mean   :  12.64   Mean   :1502   Mean   :   6.895   Mean   :150.7  
##  3rd Qu.:  11.00   3rd Qu.:1940   3rd Qu.:  14.000   3rd Qu.:192.0  
##  Max.   :1301.00   Max.   :2400   Max.   :1272.000   Max.   :695.0  
##  NA's   :8255      NA's   :8713   NA's   :9430       NA's   :9430   
##     distance         hour           minute     
##  Min.   :  17   Min.   : 0.00   Min.   : 0.00  
##  1st Qu.: 502   1st Qu.: 9.00   1st Qu.:16.00  
##  Median : 872   Median :14.00   Median :31.00  
##  Mean   :1040   Mean   :13.17   Mean   :31.76  
##  3rd Qu.:1389   3rd Qu.:17.00   3rd Qu.:49.00  
##  Max.   :4983   Max.   :24.00   Max.   :59.00  
##                 NA's   :8255    NA's   :8255
str(nycflights)
## Classes 'tbl_df', 'tbl' and 'data.frame':    336776 obs. of  7 variables:
##  $ dep_delay: num  2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
##  $ arr_time : int  830 850 923 1004 812 740 913 709 838 753 ...
##  $ arr_delay: num  11 20 33 -18 -25 12 19 -14 -8 8 ...
##  $ air_time : num  227 227 160 183 116 150 158 53 140 138 ...
##  $ distance : num  1400 1416 1089 1576 762 ...
##  $ hour     : num  5 5 5 5 5 5 5 5 5 5 ...
##  $ minute   : num  17 33 42 44 54 54 55 57 57 58 ...

Data Selection for Hierarchical Multiple Linear Regression Model

Upon carrying out this initial summary statistics analysis and taking a random sample of the data, a hierarchical approach is carried out in beginning to develop a multiple linear regression model. Using information obtained from the website ‘flightstats.com’ pertaining to “Airport Departures and Arrivals” [reference: http://www.flightstats.com/go/FlightStatus/flightStatusByAirport.do] and learning that weather, air traffic control directives, and congestion on the taxi-ways can affect arrival times for flights, it is intuitively determined that the dependent variable “arr_delay” and the independent variables “distance” and “air_time” will be the most suitable variables to include in this analysis that are found in this “nycflights13” dataset.

Description of the null hypothesis (H_0) and the alternate hypothesis (H_1)

Therefore, upon carrying out this hierarchical approach for this experiment, we are now trying to determine whether or not the variation that is observed in the dependent variable (which corresponds to ‘arr_delay’ in this analysis) can be explained by the variation existent in either of the independent variables in this experiment (which correspond to ‘air_time’ and ‘distance’). Therefore, the null hypothesis that is being tested states that the distance traveled by a given flight and the amount of time that a given flight is airborne do not have a significant effect on a given flight’s arrival delay. Opposingly, the alternate hypothesis that is being tested states that the distance traveled by a given flight and the amount of time that a given flight is airborne do, in fact, have a significant effect on a given flight’s arrival delay.

2. The Linear Model (A Hierarchical Multiple Linear Regression Model)

Description of independent variables and dependent variable

In this experiment, a hierarchical multiple linear regression model is generated, which will offer some insight into determining both the amount of variance in ‘arr_delay’ that can be explained by each of the independent variables being considered in this analysis and whether any existence of suppression is likely to exist within a linear regression model comprised of this data. The independent variables include the distance traveled by a given flight (in miles) and the amount of time that a given flight is airborne (in minutes), and the dependent variable refers to a given flight’s delay in arrival (in minutes).

Power Analysis for Multiple Linear Regression Modeling

Originally, the “nycflights13” dataset contains 336,776 observations. However, this number of observations may serve to be too large for a statistically significant analysis, so a power analysis is performed in this experiment to determine the most appropriate sample size for our final multiple linear regression model (where our desired alpha-level equaled 0.05, our desired power-level equaled 0.95, and the considered number of predictors equaled 2).

#Generate an initial Hierarchical Multiple Linear Regression Model that uses all 336,776 observations
flights_model_0 <- lm(nycflights$arr_delay~nycflights$distance+nycflights$air_time)
#Display summary of the initial Hierarchical Multiple Linear Regression Model
summary(flights_model_0)
## 
## Call:
## lm(formula = nycflights$arr_delay ~ nycflights$distance + nycflights$air_time)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
##  -69.27  -22.91  -12.28    5.44 1284.47 
## 
## Coefficients:
##                       Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)         -1.4559364  0.1728867   -8.421   <2e-16 ***
## nycflights$distance -0.0876549  0.0007613 -115.145   <2e-16 ***
## nycflights$air_time  0.6652632  0.0059795  111.256   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 43.73 on 327343 degrees of freedom
##   (9430 observations deleted due to missingness)
## Multiple R-squared:  0.04012,    Adjusted R-squared:  0.04012 
## F-statistic:  6842 on 2 and 327343 DF,  p-value: < 2.2e-16
#Determine effect size for power analysis
r2 <- summary(flights_model_0)$r.squared
f2 <- r2 / (1-r2)
f2
## [1] 0.04180118

Upon determining the effect size, the software G[STAR]Power is used to determine the most appropriate sample size for this hierarchical multiple linear regression analysis. In its results, G[STAR]Power generated a sample size of 261. So, with this sample size, the dataset “nycflights” will be sampled, creating a new dataset to be used for this hierarchical multiple linear regression model, which will then be used to determine if the variation in “arr_delay” can be explained by the variation existent in both “air_time” and “distance”.

#Remove any rows that contain "NA" in "data_raw", creating "data_clean".
#Randomly take a sample of 261 observations from "data_clean", creating "nycflights_final".
data_clean<-na.omit(nycflights)
S <- 261
set.seed(39)
flight.index <- sample(1:nrow(data_clean),S,replace=FALSE)
nycflights_final <- data_clean[flight.index,]
#Generate a new Hierarchical Multiple Linear Regression Model that uses 261 observations
flights_model_final <- lm(nycflights_final$arr_delay~nycflights_final$distance+nycflights_final$air_time)
#Display summary of the final Hierarchical Multiple Linear Regression Model
summary(flights_model_final)
## 
## Call:
## lm(formula = nycflights_final$arr_delay ~ nycflights_final$distance + 
##     nycflights_final$air_time)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.560 -21.148 -10.521   6.748 214.770 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                1.73010    5.21712   0.332     0.74    
## nycflights_final$distance -0.09311    0.02160  -4.310 2.33e-05 ***
## nycflights_final$air_time  0.68462    0.16957   4.037 7.13e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 37.28 on 258 degrees of freedom
## Multiple R-squared:  0.07497,    Adjusted R-squared:  0.06779 
## F-statistic: 10.45 on 2 and 258 DF,  p-value: 4.309e-05

For the hierarchical multiple linear regression analysis that is performed where ‘distance’,and ‘air_time’ are all analyzed against the response variable ‘arr_delay’, p-values equal to 2.33e-05 and 7.13e-05 [respectively] for each of these dependent variables are returned, indicating that there is roughly a probability equal to 2.33e-05 and 7.13e-05 (for each of these independent variables, respectively) that the resulting model’s associated F-value (10.45) is the result of solely randomization. Therefore, based on this hierarchical multiple linear regression model’s yielded results (and its respective p-value outputted in the model summary above), we would reject the null hypothesis, leading us to believe that the variation that is observed in the mean values of flight arrival delays can be explained by the variation in the distance traveled by a given flight and the amount of time that a given flight is airborne being considered in this analysis and, as such, is likely not caused solely by randomization. (See above results for corresponding p-values and F-value.)

In further analyzing the results of the simple linear regression analysis, it’s important to note that the value of b_0 (the linear model’s y-intercept) is 1.73010,the value of b_1 (which represents both the slope of the linear model and the coefficient associated with the variable ‘distance’ in the linear model) is -0.09311, and the value of b_2 (which represents both the slope of the linear model and the coefficient associated with the variable ‘air_time’ in the linear model) is 0.68462. These values indicate the relationship between the independent variables corresponding to the distance traveled by a given flight and the amount of time that a given flight is airborne and the dependent variable corresponding to the arrival delay of a given flight, which corresponds to the idea that an increase in one unit of “distance” (in miles) results in a decrease in 0.09311 units of “flight arrival delay” and that an increase in one unit of “flight airborne time” (in minutes) results in an increase in 0.68462 units of “flight arrival delay.”

Additionally, it’s important to note the metrics that measure the correlation between the independent variables corresponding to the distance traveled by a given flight and the amount of time that a given flight is airborne and the dependent variable corresponding to the arrival delay of a given flight, which are multiple R-squared and adjusted R-squared (since we want to take into account any bias that might be associated with the number of explanatory variables being included in the model, this analysis emphasis the value of adjusted R-squared rather than the value of multiple R-squared). Since the value of adjusted R-squared is 0.06779, it can be inferred that the variation that exists in the independent variables corresponding to the distance traveled by a given flight and the amount of time that a given flight is airborne can explain approximately 6.779% of the variation existent in the dependent variable corresponding to the arrival delay of a given flight. As a result of this low adjusted R-squared value here, one should not put much faith in asserting that the independent variables being considered in this analysis (‘distance’ and ‘air_time’) explain a significant portion of the variation existent in the dependent variable being considered here (‘arr_delay’).

3. Diagnostic Plots

Before beginning to check the model against all eight of the assumptions associated with linear regression modeling, histograms, boxplots, scatterplots, and a “Quality of Fit” plot (via a fitted vs. residual values determination) are generated, which will be used for their graphical nature in our interpretations.

#Generate histograms for all of the different variables being considered in our sampled data ('arr_delay', 'distance', and 'air_time')
hist(nycflights_final$arr_delay, xlab = "Flight Arrival Delay [in minutes]", main = "Histogram of Flight Arrival Delay")

hist(nycflights_final$distance, xlab = "Flight Distance Traveled [in miles]", main = "Histogram of Flight Distance Traveled")

hist(nycflights_final$air_time, xlab = "Flight Airborne Time [in minutes]", main = "Histogram of Flight Airborne Time")

#Generate a boxplot of the data (Dependent Variable = Flight Arrival Delay)
boxplot(x = nycflights_final$arr_delay, pch=21, bg="darkviolet", main="Flight Arrival Delay", xlab = "Flight Arrival Delay [in minutes]")

#Generate a boxplot of the data (Independent Variable = Flight Distance Traveled)
boxplot(x = nycflights_final$distance, pch=21, bg="darkviolet", main="Flight Distance Traveled", xlab = "Flight Distance Traveled [in miles]")

#Generate a boxplot of the data (Independent Variable = Flight Airborne Time)
boxplot(x = nycflights_final$air_time, pch=21, bg="darkviolet", main="Flight Airborne Time", xlab = "Flight Airborne Time [in minutes]")

#Generate a scatterplot of the data: "arr_delay" vs. "distance"
plot(y = nycflights_final$arr_delay,x = nycflights_final$distance, pch=21, bg="darkviolet", main="Distance Traveled by Flight vs. Arrival Delay [in minutes]", ylab = "Arrival Delay [in minutes]", xlab = "Distance Traveled by Flight (in miles)")

#Generate a scatterplot of the data: "arr_delay" vs. "air_time"
plot(y = nycflights_final$arr_delay,x = nycflights_final$air_time, pch=21, bg="darkviolet", main="Flight Airborne Time vs. Arrival Delay [in minutes]", ylab = "Arrival Delay [in minutes]", xlab = "Flight Airborne Time (in minutes)")

#Create a "Quality of Fit Model" that plots the residuals of "flights_model_final" against its fitted model.
par(mfrow=c(1,1))
plot(fitted(flights_model_final),residuals(flights_model_final), main = "Residuals of 'flights_model_final' Against Fitted Model 'flights_model_final' [Not Standardized]")
abline(0,0, col='darkviolet', lwd=2.5)

#Create a "Quality of Fit Model" that plots the standardized residuals of "flights_model_final" against its fitted model.
par(mfrow=c(1,1))
standardized_flights_model <- rstandard(flights_model_final)
plot(fitted(flights_model_final),standardized_flights_model, main = "Standardized Residuals of 'flights_model_final' Against Fitted Model 'flights_model_final'")
abline(0,0, col='darkviolet', lwd=2.5)

4. Interpretation via Assumptions

In interpreting our hierarchical multiple linear regression model and the statistical significance of the results that were generated therein, it is important to test the model against the eight assumptions corresponding to linear regression.

1. The distribution of residuals is normal (at each value of the outcome).

In order to determine of the distribution of the residuals is normal, we can generate histograms and boxplots for the residuals of the model, analyze them for skewness and kurtosis. Upon observing both the boxplot and the histogram of the residuals, it appears that the model’s residuals do exhibit some skewness, as the residuals seem to be skewed severely to the right (indicating that some bias is likely existent in the model). However, upon observing the histogram of the residuals, it appears that there isn’t much kurtosis (if any) existent in the residuals.

#Generate histograms for the residuals of our model
hist(residuals(flights_model_final), xlab = "Residuals", main = "Histogram of Residuals of 'flights_model_final'")

#Generate a boxplot for the residuals of our model
boxplot(x = residuals(flights_model_final), pch=21, bg="darkviolet", main="Boxplot of Residuals of 'flights_model_final", xlab = "Residuals")

We can further determine whether or not the distribution of the residuals exhibits normality by generating a Normal Quantile-Quantile (QQ) Plot for the residuals of the model and by performing a Shapiro-Wilk Test of Normality on our data. Upon doing so, it’s likely that we can readily assume our data does not exhibit normality. First of all, the constructed Normal Q-Q Plots did not seem to display a trend of data that aligned closely with the Normal Q-Q Line. Secondly, the fact that the resulting p-value of the Shapiro-Wilk Test of Normality for “nycflights_final[,“arr_delay”]” was < 0.1 would lead one to assume that the data does not exhibit normality. Therefore, it is likely evident that our model does not satisfy this assumption.

#Create a Normal Q-Q Plot for the data pertaining to Sulfur Dioxide Content in the Air.
qqnorm(residuals(flights_model_final), main = "Normal Q-Q Plot for Residuals of 'flights_model_final'")
qqline(residuals(flights_model_final))

#Perform Shapiro-Wilk Test of Normality on Flight Arrival Delays (normality is assummed if p > 0.1).
shapiro.test(nycflights_final[,"arr_delay"])
## 
##  Shapiro-Wilk normality test
## 
## data:  nycflights_final[, "arr_delay"]
## W = 0.7998, p-value < 2.2e-16

2. The variance of the residuals for every set of values for the predictor is equal.

In order to determine if the variance of the residuals for every set of values for the predictor are equal, we can generate a residuals plot of the model and discern whether or not the residuals located across the dynamic range are uniformly distributed along the “y=0” axis. Upon generating this plot, it appears that while some heteroskedasticity may potentially exist (given that it appears that some variance may exist at the lower end of the dynamic range with residuals being slightly more expanded along the y-axis), the residuals may seem to be generally uniformly distributed here [exhibiting homoskedasticity].

#Generate a residuals plot for "flights_model_final"
plot(flights_model_final$residuals, pch=21, bg="darkviolet", main = "Residuals Plot for 'flights_model_final")

In further determining whether or not the variance of the residuals for every set of values for the predictor is equal, we can perform White’s Test for Heteroskedasticity for each individual independent variable (“distance” and “air_time”) against the dependent variable (“arr_delay”). In performing this test, it is important to note that the null hypothesis states that the model exhibits homoskedasticity, which (if not rejected) would lead one to infer that this assumption is satisfied. Upon generating this test for both respective independent variables (“distance” and “air_time”), p-values greater than an alpha-level of 0.05 resulted for both independent variables (0.767 and 0.829, respectively), leading us to fail to reject the null hypothesis here. Therefore, it is likely evident that our model does satisfy this assumption.

#install.packages("het.test") #[needs to be installed before use]
#install.packages("vars") #[needs to be installed before use]
library(het.test)
## Warning: package 'het.test' was built under R version 3.1.3
## Loading required package: vars
## Warning: package 'vars' was built under R version 3.1.3
## Loading required package: MASS
## Warning: package 'MASS' was built under R version 3.1.2
## Loading required package: strucchange
## Warning: package 'strucchange' was built under R version 3.1.3
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 3.1.3
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## Loading required package: sandwich
## Warning: package 'sandwich' was built under R version 3.1.3
## Loading required package: urca
## Warning: package 'urca' was built under R version 3.1.3
## Loading required package: lmtest
## Warning: package 'lmtest' was built under R version 3.1.3
library(vars)
whitetest_distance <- VAR(data.frame(y=nycflights_final$arr_delay, x=nycflights_final$distance))
whites.htest(whitetest_distance)
## 
## White's Test for Heteroskedasticity:
## ==================================== 
## 
##  No Cross Terms
## 
##  H0: Homoskedasticity
##  H1: Heteroskedasticity
## 
##  Test Statistic:
##  8.2288 
## 
##  Degrees of Freedom:
##  12 
## 
##  P-value:
##  0.7670
whitetest_air_time <- VAR(data.frame(y=nycflights_final$arr_delay, x=nycflights_final$air_time))
whites.htest(whitetest_air_time)
## 
## White's Test for Heteroskedasticity:
## ==================================== 
## 
##  No Cross Terms
## 
##  H0: Homoskedasticity
##  H1: Heteroskedasticity
## 
##  Test Statistic:
##  7.4155 
## 
##  Degrees of Freedom:
##  12 
## 
##  P-value:
##  0.8290

3. The error term is additive.

In determining whether or not the error term is additive, we can qualitatively discern that an argument exists for the error term that considers both flight distance traveled and flight airborne time being multiplicative (as opposed to additive, given that these two independent variables may not be completely independent of each other in an intuitive sense). Additionally, a standardized residual plot can be generated for this model (against a fitted version of the model) as a means for determining whether or not the model satisfies this assumption. Upon generating this plot, it appears that the residuals located across the dynamic range are not uniformly distributed along the “y=0” axis, further indicating that a non-linear kind of effect likely exists within the data that the model is comprised of. Therefore, it is likely evident that our model does not satisfy this assumption.

#Create a "Quality of Fit Model" that plots the standardized residuals of "flights_model_final" against its fitted model.
par(mfrow=c(1,1))
standardized_flights_model <- rstandard(flights_model_final)
plot(fitted(flights_model_final),standardized_flights_model, main = "Standardized Residuals of 'flights_model_final' Against Fitted Model 'flights_model_final'")
abline(0,0, col='darkviolet', lwd=2.5)

4. At every value of the outcome the expected (mean) value of the residuals is zero.

In order to determine whether or not the expected (mean) value of the residuals is zero at every value of the outcome, we can observe the results corresponding to the generated “Quality of Fit Model” that plots the standardized residuals of “flights_model_final” against its fitted model [found in Assumption #3 above]. Upon doing so, it appears that the residuals located across the dynamic range are not uniformly distributed along the “y=0” axis, further indicating that a non-linear kind of effect likely exists within the data that the model is comprised of and, most especially, that the expected (mean) value of the residuals is not zero at every value of the outcome. Therefore, it is likely evident that our model does not satisfy this assumption.

5. The expected correlation between residuals, for any two cases, is zero.

In order to determine whether or not the expected correlation between residuals, for any two cases, is zero, we can generate a non-standardized “Quality of Fit Model” that plots the residuals of “flights_model_final” against its fitted model. In doing so, we can determine if the data exhibits independence, which is necessary for a model comprised of that data to be statistically significant (since it needs to be generated using “i.i.d.” [independent and identically distributed] data). Upon generating this plot, it appears that the residuals located across the dynamic range are not uniformly distributed along the “y=0” axis, especially given that the plot indicates the existence of some kind of “biased” pattern that typically exhibits non-independence and/or auto-correlation. Therefore, it is likely evident that our model does not satisfy this assumption.

#Create a "Quality of Fit Model" that plots the residuals of "flights_model_final" against its fitted model.
par(mfrow=c(1,1))
plot(fitted(flights_model_final),residuals(flights_model_final), main = "Residuals of 'flights_model_final' Against Fitted Model 'flights_model_final' [Not Standardized]")
abline(0,0, col='darkviolet', lwd=2.5)

6. All predictors are uncorrelated with the error term.

In order to determine whether or not all of the predictors are uncorrelated with the error term, we can qualitatively discern (like we did in Assumption #3) that an argument exists for the error term that considers both flight distance traveled and flight airborne time being multiplicative (as opposed to additive, given that these two independent variables may not be completely independent of each other in an intuitive sense). Therefore, although this assumption is difficult to exactly discern whether or not it has been satisfied, one can argue using intuition that these predictors may not necessarily be uncorrelated with the error term.

7. No predictors are a perfect linear function of other predictors (no perfect multicollinearity).

In order to determine whether or not the predictors are a perfect linear function of other predictors (i.e., no perfect multicollinearity), a correlation matrix and a plot that graphs the independent variables against each other can be generated. Upon doing so, it appears that “distance” and “air_time” are strongly correlated with each other (at a value of 0.9890678, which is very close to 1.00), which does seem to exhibit the existence of collinearity between these independent variables. Therefore, it is likely evident that our model does not satisfy this assumption.

#Generate a correlation matrix for "distance" and "air_time"
cor(nycflights_final[c("distance","air_time")])
##           distance  air_time
## distance 1.0000000 0.9890678
## air_time 0.9890678 1.0000000
#Generate a plot that graphs "distance" and "air_time" against each other
plot(nycflights_final$distance~nycflights_final$air_time, main = "'air_time' vs. 'distance'")

8. The mean of the error term is zero.

In order to determine if the mean of the error term is zero, we can observe the results corresponding to the generated “Quality of Fit Model” that plots the standardized residuals of “flights_model_final” against its fitted model [found in Assumption #3 above]. Upon doing so, it appears that the residuals located across the dynamic range are not uniformly distributed along the “y=0” axis, further indicating that a non-linear kind of effect likely exists within the data that the model is comprised of and, most especially, that the mean of the error term is likely not zero. Therefore, it is likely evident that our model does not satisfy this assumption.

5. Interpretation via Issues

In interpreting our hierarchical multiple linear regression model and the statistical significance of the results that were generated therein, it is important to check the model against the four main issues surrounding linear regression.

1. Causality

As far as causality is concerned here, we can intuitively discern that “flight distance traveled” and “flight airborne time” probabilistically and proximally cause delays in flight arrival. These independent variables are considered to be proximal and probabilistic causes (and not ultimate or determinate causes) because they do not “perfectly” or “directly” cause delays in flight arrival times in every flight situation (especially considering the fact that other factors are more likely to “perfectly/directly” cause flight arrival delays, such as the existence of inclement weather).

2. Sample Sizes

As far as sample sizes are concerned here, our original, raw dataset is massive (336,776 observations). If this dataset were used to generate our hierarchical multiple linear regression model, the significance of our results would have been misinterpreted as a result of the bias that exists when using massive datasets. To alleviate this concern, the effect size of our proposed model was determined, and then an appropriate sample size was determined using the software G[STAR]Power. Upon determining the effect size to be 0.04180118, the use of G[STAR]Power resulted in a generated sample size of 261. So, with this sample size, the dataset “nycflights” was sampled, and a new dataset “nycflights_final” was created. (See the section “Power Analysis for Multiple Linear Regression Modeling” within Chapter #2 for more information.)

3. Collinearity

As far as collinearity is concerned here, our work in determining whether or not Assumption #7 (i.e., “no predictors are a perfect linear function of other predictors (no perfect multicollinearity)”) was satisfied by our model brought us to the seemingly clear conclusion that “distance” and “air_time” are strongly correlated with each other (at a value of 0.9890678, which is very close to 1.00), which does seem to exhibit the existence of collinearity between these independent variables. Therefore, collinearity is evidently a huge issue for our modeling approach, given how much “distance” and “air_time” seem to affect each other in a multiplicative sort of way.

4. Measurement Error

In this study, the existence of measurement error is not an immediate concern. Since airborne flights are tracked continuously through every individual plane’s journey and are electronically monitored (with a heavy importance placed upon data collection) by both the government and the airports associated with given departure and arrival locations, it is not very likely (though, technically possible) that measurement error would play a role in affecting accurate data collection.