Assignment #03: Assumptions and Issues [Outline]

Assumptions and Issues Project - Analysis of Arrival Delay Times for Flights Departing from NYC

Brendan Howell

Rensselaer Polytechnic Institute

04/13/15 - Version 1.0

1. Data

Dataset of Out-bound Flights from NYC in 2013 (‘nycflights13’)

Description: This package contains information about all flights that departed from NYC (i.e., EWR, JFK, and LGA) in 2013: 336,776 flights in total.

The dataset comprises 336,776 observations of 16 variables. The integer variables are ‘year’, ‘month’, and ‘day’ (the date of departure), ‘dep_time’ and ‘arr_time’ (the times of departure and arrival, in hhmm notation), and ‘flight’ (the flight number). The character variables are ‘carrier’ (the airline carrier), ‘tailnum’ (the plane’s tail number), ‘origin’ (the origin airport), and ‘dest’ (the destination airport). The numeric variables are ‘dep_delay’ and ‘arr_delay’ (the departure and arrival delays, in minutes), ‘air_time’ (the amount of time the plane is airborne, in minutes), ‘distance’ (the distance traveled by the flight, in miles), and ‘hour’ and ‘minute’ (the hour and minute of departure).

[Reference: An R data package containing all out-bound flights from NYC in 2013 (plus useful metadata) is used in this analysis. It can be installed from GitHub with devtools::install_github("hadley/nycflights13").]

Data Organization

Below, the “nycflights13” dataset is loaded into R, and its summary statistics and structure are displayed (along with the “head” and the “tail” of the dataset).

#Install and load the "nycflights13" dataset into R.
rm(list=ls())
install.packages("nycflights13", repos='http://cran.us.r-project.org')

## package 'nycflights13' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\howelb\AppData\Local\Temp\Rtmpclo9Dl\downloaded_packages

library("nycflights13")

## Warning: package 'nycflights13' was built under R version 3.1.3

#Assign a variable, "data_raw", to the complete dataframe, "flights".
#Then, display the "head" and "tail" of the dataset, "data_raw".
data_raw<-flights
head(data_raw)

##   year month day dep_time dep_delay arr_time arr_delay carrier tailnum
## 1 2013     1   1      517         2      830        11      UA  N14228
## 2 2013     1   1      533         4      850        20      UA  N24211
## 3 2013     1   1      542         2      923        33      AA  N619AA
## 4 2013     1   1      544        -1     1004       -18      B6  N804JB
## 5 2013     1   1      554        -6      812       -25      DL  N668DN
## 6 2013     1   1      554        -4      740        12      UA  N39463
##   flight origin dest air_time distance hour minute
## 1   1545    EWR  IAH      227     1400    5     17
## 2   1714    LGA  IAH      227     1416    5     33
## 3   1141    JFK  MIA      160     1089    5     42
## 4    725    JFK  BQN      183     1576    5     44
## 5    461    LGA  ATL      116      762    5     54
## 6   1696    EWR  ORD      150      719    5     54

tail(data_raw)

##        year month day dep_time dep_delay arr_time arr_delay carrier
## 336771 2013     9  30       NA        NA       NA        NA      EV
## 336772 2013     9  30       NA        NA       NA        NA      9E
## 336773 2013     9  30       NA        NA       NA        NA      9E
## 336774 2013     9  30       NA        NA       NA        NA      MQ
## 336775 2013     9  30       NA        NA       NA        NA      MQ
## 336776 2013     9  30       NA        NA       NA        NA      MQ
##        tailnum flight origin dest air_time distance hour minute
## 336771  N740EV   5274    LGA  BNA       NA      764   NA     NA
## 336772           3393    JFK  DCA       NA      213   NA     NA
## 336773           3525    LGA  SYR       NA      198   NA     NA
## 336774  N535MQ   3461    LGA  BNA       NA      764   NA     NA
## 336775  N511MQ   3572    LGA  CLE       NA      419   NA     NA
## 336776  N839MQ   3531    LGA  RDU       NA      431   NA     NA

#Display the summary statistics and the structure of the data
summary(data_raw)

##       year          month             day           dep_time   
##  Min.   :2013   Min.   : 1.000   Min.   : 1.00   Min.   :   1  
##  1st Qu.:2013   1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.: 907  
##  Median :2013   Median : 7.000   Median :16.00   Median :1401  
##  Mean   :2013   Mean   : 6.549   Mean   :15.71   Mean   :1349  
##  3rd Qu.:2013   3rd Qu.:10.000   3rd Qu.:23.00   3rd Qu.:1744  
##  Max.   :2013   Max.   :12.000   Max.   :31.00   Max.   :2400  
##                                                  NA's   :8255  
##    dep_delay          arr_time      arr_delay          carrier         
##  Min.   : -43.00   Min.   :   1   Min.   : -86.000   Length:336776     
##  1st Qu.:  -5.00   1st Qu.:1104   1st Qu.: -17.000   Class :character  
##  Median :  -2.00   Median :1535   Median :  -5.000   Mode  :character  
##  Mean   :  12.64   Mean   :1502   Mean   :   6.895                     
##  3rd Qu.:  11.00   3rd Qu.:1940   3rd Qu.:  14.000                     
##  Max.   :1301.00   Max.   :2400   Max.   :1272.000                     
##  NA's   :8255      NA's   :8713   NA's   :9430                         
##    tailnum              flight        origin              dest          
##  Length:336776      Min.   :   1   Length:336776      Length:336776     
##  Class :character   1st Qu.: 553   Class :character   Class :character  
##  Mode  :character   Median :1496   Mode  :character   Mode  :character  
##                     Mean   :1972                                        
##                     3rd Qu.:3465                                        
##                     Max.   :8500                                        
##                                                                         
##     air_time        distance         hour           minute     
##  Min.   : 20.0   Min.   :  17   Min.   : 0.00   Min.   : 0.00  
##  1st Qu.: 82.0   1st Qu.: 502   1st Qu.: 9.00   1st Qu.:16.00  
##  Median :129.0   Median : 872   Median :14.00   Median :31.00  
##  Mean   :150.7   Mean   :1040   Mean   :13.17   Mean   :31.76  
##  3rd Qu.:192.0   3rd Qu.:1389   3rd Qu.:17.00   3rd Qu.:49.00  
##  Max.   :695.0   Max.   :4983   Max.   :24.00   Max.   :59.00  
##  NA's   :9430                   NA's   :8255    NA's   :8255

str(data_raw)

## Classes 'tbl_df', 'tbl' and 'data.frame':    336776 obs. of  16 variables:
##  $ year     : int  2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
##  $ month    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ day      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ dep_time : int  517 533 542 544 554 554 555 557 557 558 ...
##  $ dep_delay: num  2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
##  $ arr_time : int  830 850 923 1004 812 740 913 709 838 753 ...
##  $ arr_delay: num  11 20 33 -18 -25 12 19 -14 -8 8 ...
##  $ carrier  : chr  "UA" "UA" "AA" "B6" ...
##  $ tailnum  : chr  "N14228" "N24211" "N619AA" "N804JB" ...
##  $ flight   : int  1545 1714 1141 725 461 1696 507 5708 79 301 ...
##  $ origin   : chr  "EWR" "LGA" "JFK" "JFK" ...
##  $ dest     : chr  "IAH" "IAH" "MIA" "BQN" ...
##  $ air_time : num  227 227 160 183 116 150 158 53 140 138 ...
##  $ distance : num  1400 1416 1089 1576 762 ...
##  $ hour     : num  5 5 5 5 5 5 5 5 5 5 ...
##  $ minute   : num  17 33 42 44 54 54 55 57 57 58 ...

#Create a subset of "data_raw" that keeps only the delay, time, and distance columns considered for modeling
nycflights <- subset(data_raw, select = c(dep_delay, arr_time, arr_delay, air_time, distance, hour, minute))
#Display the "head" and "tail" of the dataset, "nycflights".
head(nycflights)

##   dep_delay arr_time arr_delay air_time distance hour minute
## 1         2      830        11      227     1400    5     17
## 2         4      850        20      227     1416    5     33
## 3         2      923        33      160     1089    5     42
## 4        -1     1004       -18      183     1576    5     44
## 5        -6      812       -25      116      762    5     54
## 6        -4      740        12      150      719    5     54

tail(nycflights)

##        dep_delay arr_time arr_delay air_time distance hour minute
## 336771        NA       NA        NA       NA      764   NA     NA
## 336772        NA       NA        NA       NA      213   NA     NA
## 336773        NA       NA        NA       NA      198   NA     NA
## 336774        NA       NA        NA       NA      764   NA     NA
## 336775        NA       NA        NA       NA      419   NA     NA
## 336776        NA       NA        NA       NA      431   NA     NA

#Display the summary statistics and the structure of the data
summary(nycflights)

##    dep_delay          arr_time      arr_delay           air_time    
##  Min.   : -43.00   Min.   :   1   Min.   : -86.000   Min.   : 20.0  
##  1st Qu.:  -5.00   1st Qu.:1104   1st Qu.: -17.000   1st Qu.: 82.0  
##  Median :  -2.00   Median :1535   Median :  -5.000   Median :129.0  
##  Mean   :  12.64   Mean   :1502   Mean   :   6.895   Mean   :150.7  
##  3rd Qu.:  11.00   3rd Qu.:1940   3rd Qu.:  14.000   3rd Qu.:192.0  
##  Max.   :1301.00   Max.   :2400   Max.   :1272.000   Max.   :695.0  
##  NA's   :8255      NA's   :8713   NA's   :9430       NA's   :9430   
##     distance         hour           minute     
##  Min.   :  17   Min.   : 0.00   Min.   : 0.00  
##  1st Qu.: 502   1st Qu.: 9.00   1st Qu.:16.00  
##  Median : 872   Median :14.00   Median :31.00  
##  Mean   :1040   Mean   :13.17   Mean   :31.76  
##  3rd Qu.:1389   3rd Qu.:17.00   3rd Qu.:49.00  
##  Max.   :4983   Max.   :24.00   Max.   :59.00  
##                 NA's   :8255    NA's   :8255

str(nycflights)

## Classes 'tbl_df', 'tbl' and 'data.frame':    336776 obs. of  7 variables:
##  $ dep_delay: num  2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
##  $ arr_time : int  830 850 923 1004 812 740 913 709 838 753 ...
##  $ arr_delay: num  11 20 33 -18 -25 12 19 -14 -8 8 ...
##  $ air_time : num  227 227 160 183 116 150 158 53 140 138 ...
##  $ distance : num  1400 1416 1089 1576 762 ...
##  $ hour     : num  5 5 5 5 5 5 5 5 5 5 ...
##  $ minute   : num  17 33 42 44 54 54 55 57 57 58 ...

Data Selection for Hierarchical Multiple Linear Regression Model

Following this initial review of the summary statistics, and with a random sample of the data to be drawn later in the analysis, a hierarchical approach is taken to develop a multiple linear regression model. According to the “Airport Departures and Arrivals” information on the website ‘flightstats.com’ [reference: http://www.flightstats.com/go/FlightStatus/flightStatusByAirport.do], weather, air traffic control directives, and congestion on the taxiways can all affect arrival times for flights. On that basis, it is judged intuitively that, of the variables available in the “nycflights13” dataset, “arr_delay” is the most suitable dependent variable and “distance” and “air_time” are the most suitable independent variables for this analysis.

Description of the null hypothesis (H_0) and the alternate hypothesis (H_1)

Given this hierarchical approach, the aim is to determine whether the variation observed in the dependent variable (‘arr_delay’) can be explained by the variation in either of the independent variables (‘air_time’ and ‘distance’). The null hypothesis states that the distance traveled by a given flight and the amount of time that the flight is airborne do not have a significant effect on the flight’s arrival delay. The alternate hypothesis states that the distance traveled and the amount of time airborne do have a significant effect on the flight’s arrival delay.
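
Stated formally, and assuming the hypotheses are assessed with the overall F test of a linear model, they can be sketched as follows. The symbols \beta_0, \beta_1, \beta_2, and \varepsilon_i are notation introduced here for illustration and do not appear in the original write-up.

```latex
\begin{aligned}
\text{arr\_delay}_i &= \beta_0 + \beta_1\,\text{distance}_i + \beta_2\,\text{air\_time}_i + \varepsilon_i \\
H_0 &: \beta_1 = \beta_2 = 0 \\
H_1 &: \beta_j \neq 0 \ \text{for at least one } j \in \{1, 2\}
\end{aligned}
```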

2. The Linear Model (A Hierarchical Multiple Linear Regression Model)

Description of independent variables and dependent variable

In this experiment, a hierarchical multiple linear regression model is generated to estimate how much of the variance in ‘arr_delay’ can be explained by each of the independent variables and to assess whether suppression is likely to be present in a linear regression model built from this data. The independent variables are the distance traveled by a given flight (in miles) and the amount of time that the flight is airborne (in minutes), and the dependent variable is the flight’s delay in arrival (in minutes).
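
As a rough illustration of what entering predictors hierarchically can look like, the sketch below fits two nested models and compares them. It runs on synthetic stand-in data rather than the real flights, and the entry order (distance first, then air_time) and all object names are assumptions made for the example.

```r
set.seed(1)  # arbitrary seed so the toy example is repeatable
n <- 500

# Synthetic stand-in data (NOT the nycflights13 values):
# air_time is built to track distance closely, as it does for real flights.
toy <- data.frame(distance = runif(n, 100, 2500))
toy$air_time  <- toy$distance / 8 + rnorm(n, sd = 10)
toy$arr_delay <- 0.6 * toy$air_time - 0.08 * toy$distance + rnorm(n, sd = 40)

# Step 1 enters distance alone; step 2 adds air_time.
step1 <- lm(arr_delay ~ distance, data = toy)
step2 <- lm(arr_delay ~ distance + air_time, data = toy)

# Change in R-squared: variance explained by air_time beyond distance.
r2_step1  <- summary(step1)$r.squared
r2_step2  <- summary(step2)$r.squared
r2_change <- r2_step2 - r2_step1

# F test for the R-squared change between the nested models.
step_test <- anova(step1, step2)
print(step_test)

# Zero-order correlations, a starting point when looking for suppression
# (for example, highly correlated predictors with opposite-signed coefficients).
print(cor(toy))
```

Comparing the R-squared of each step, rather than reading a single combined fit, is what lets each variable’s share of explained variance be reported separately.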

Power Analysis for Multiple Linear Regression Modeling

The “nycflights13” dataset contains 336,776 observations. With a sample this large, even very small effects will test as statistically significant, so a power analysis is performed in this experiment to determine a more appropriate sample size for the final multiple linear regression model.

#Generate an initial Hierarchical Multiple Linear Regression Model that uses all 336,776 observations
flights_model_0 <- lm(nycflights$arr_delay~nycflights$distance+nycflights$air_time)
#Display summary of the initial Hierarchical Multiple Linear Regression Model
summary(flights_model_0)

## 
## Call:
## lm(formula = nycflights$arr_delay ~ nycflights$distance + nycflights$air_time)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
##  -69.27  -22.91  -12.28    5.44 1284.47 
## 
## Coefficients:
##                       Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)         -1.4559364  0.1728867   -8.421   <2e-16 ***
## nycflights$distance -0.0876549  0.0007613 -115.145   <2e-16 ***
## nycflights$air_time  0.6652632  0.0059795  111.256   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 43.73 on 327343 degrees of freedom
##   (9430 observations deleted due to missingness)
## Multiple R-squared:  0.04012,    Adjusted R-squared:  0.04012 
## F-statistic:  6842 on 2 and 327343 DF,  p-value: < 2.2e-16

#Determine effect size for power analysis
r2 <- summary(flights_model_0)$r.squared
f2 <- r2 / (1-r2)
f2

## [1] 0.04180118

Upon determining the effect size, the software G*Power is used to determine an appropriate sample size for this hierarchical multiple linear regression analysis. G*Power returned a required sample size of 261. A random sample of 261 observations is therefore drawn from the “nycflights” dataset, creating a new dataset for the final hierarchical multiple linear regression model, which is then used to determine whether the variation in “arr_delay” can be explained by the variation in “air_time” and “distance”.
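
The write-up does not state the significance level or the target power entered into G*Power, so the sketch below treats alpha = 0.05 and the target power values as assumptions. It uses base R’s noncentral F distribution to approximate this kind of calculation for the overall F test in multiple regression, with noncentrality parameter lambda = f2 * N. The helper names regression_power and required_n are hypothetical.

```r
# Power of the overall F test in multiple regression for effect size f2,
# total sample size n, and u predictors (denominator df v = n - u - 1).
regression_power <- function(n, f2, u, alpha = 0.05) {
  v <- n - u - 1
  f_crit <- qf(1 - alpha, df1 = u, df2 = v)
  pf(f_crit, df1 = u, df2 = v, ncp = f2 * n, lower.tail = FALSE)
}

# Smallest n that reaches the target power (simple linear search).
required_n <- function(f2, u, alpha = 0.05, power = 0.80, n_max = 100000) {
  for (n in (u + 2):n_max) {
    if (regression_power(n, f2, u, alpha) >= power) return(n)
  }
  NA_integer_
}

f2_reported <- 0.04180118  # effect size reported above
u <- 2                     # two predictors: distance and air_time

# Power achieved with the sample size of 261 (alpha = 0.05 is an assumption).
power_at_261 <- regression_power(261, f2_reported, u)
print(power_at_261)

# Required sample sizes under two assumed target powers.
n_for_80 <- required_n(f2_reported, u, power = 0.80)
n_for_95 <- required_n(f2_reported, u, power = 0.95)
print(c(n_for_80, n_for_95))
```

Checking the power implied by N = 261 against the settings actually used in G*Power is one way to confirm that the reported sample size was entered and read correctly.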

#Remove any rows that contain "NA" in "nycflights", creating "data_clean".
#Randomly take a sample of 261 observations from "data_clean", creating "nycflights_final".
data_clean<-na.omit(nycflights)
flight.index = sample(1:nrow(data_clean),261,replace=FALSE)
nycflights_final<-data_clean[flight.index,]
#Generate a new Hierarchical Multiple Linear Regression Model that uses 261 observations
flights_model_final <- lm(nycflights_final$arr_delay~nycflights_final$distance+nycflights_final$air_time)
#Display summary of the final Hierarchical Multiple Linear Regression Model
summary(flights_model_final)

## 
## Call:
## lm(formula = nycflights_final$arr_delay ~ nycflights_final$distance + 
##     nycflights_final$air_time)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -52.431 -23.133 -12.816   6.092 268.650 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)   
## (Intercept)                4.48232    6.24864   0.717  0.47382   
## nycflights_final$distance -0.08195    0.02656  -3.086  0.00225 **
## nycflights_final$air_time  0.60041    0.21322   2.816  0.00524 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 42.1 on 258 degrees of freedom
## Multiple R-squared:  0.0509, Adjusted R-squared:  0.04354 
## F-statistic: 6.918 on 2 and 258 DF,  p-value: 0.001184
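
Because the sample above is drawn without a fixed seed, each run selects different rows and produces slightly different estimates. The sketch below shows one way to make such a draw repeatable. It uses synthetic stand-in data, and the seed values and the helper name draw_sample are arbitrary choices made for illustration.

```r
# Toy stand-in for the cleaned flights table (synthetic values, illustration only).
set.seed(2013)  # arbitrary seed used only to build the toy data
toy_clean <- data.frame(distance = round(runif(5000, 100, 2500)))
toy_clean$air_time  <- round(toy_clean$distance / 8 + rnorm(5000, sd = 10))
toy_clean$arr_delay <- round(rnorm(5000, mean = 7, sd = 40))

# Draw a simple random sample of rows; fixing the seed makes the draw repeatable.
draw_sample <- function(df, size, seed) {
  set.seed(seed)
  df[sample(nrow(df), size, replace = FALSE), ]
}

sample_a <- draw_sample(toy_clean, 261, seed = 42)
sample_b <- draw_sample(toy_clean, 261, seed = 42)
print(identical(sample_a, sample_b))  # same seed, same rows

# Passing the data frame via `data =` keeps the coefficient names short.
toy_model <- lm(arr_delay ~ distance + air_time, data = sample_a)
print(names(coef(toy_model)))
```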

NOTE/ASIDE: Interpretation and plots will be completed very soon. I just want to be sure that the work I’ve done so far is correct (i.e., have I performed the Power Analysis correctly?). After observing the results of my model, it appears that my independent variables explain very little of the variation in my dependent variable (a Multiple R-squared of about 0.05), even though both coefficients are statistically significant. If this is the case, that’s okay, but I’d hate to work through the assumptions and interpretations when my modeling approach is flawed to begin with. If you look at the two plots below, it appears that some outliers may exist. However, I’m unsure whether these data points should still be included. Am I on the right track so far?

#Generate a scatterplot of the data: "arr_delay" vs. "distance"
plot(y = nycflights_final$arr_delay, x = nycflights_final$distance,
     pch = 21, bg = "darkviolet",
     main = "Distance Traveled by Flight vs. Arrival Delay [in minutes]",
     ylab = "Arrival Delay [in minutes]",
     xlab = "Distance Traveled by Flight (in miles)")

#Generate a scatterplot of the data: "arr_delay" vs. "air_time"
plot(y = nycflights_final$arr_delay, x = nycflights_final$air_time,
     pch = 21, bg = "darkviolet",
     main = "Flight Airborne Time vs. Arrival Delay [in minutes]",
     ylab = "Arrival Delay [in minutes]",
     xlab = "Flight Airborne Time (in minutes)")
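
On the question of possible outliers, one common starting point is to flag observations with large standardized residuals or a large Cook’s distance before deciding whether to keep them. The sketch below does this on synthetic stand-in data with one deliberately extreme delay. The cutoffs (an absolute standardized residual above 3 and Cook’s distance above 4/n) are rules of thumb assumed here, not criteria taken from the analysis.

```r
set.seed(7)  # arbitrary seed so the toy example is repeatable
n <- 261

# Synthetic stand-in data (NOT the sampled flights).
toy <- data.frame(distance = runif(n, 100, 2500))
toy$air_time  <- toy$distance / 8 + rnorm(n, sd = 10)
toy$arr_delay <- rnorm(n, mean = 5, sd = 30)
toy$arr_delay[1] <- 270  # one deliberately extreme arrival delay

fit <- lm(arr_delay ~ distance + air_time, data = toy)

# Cook's distance measures how much each observation shifts the fitted model.
cooks <- cooks.distance(fit)
influential <- which(cooks > 4 / n)  # rule-of-thumb cutoff (assumption)

# Standardized residuals flag points far from the fitted surface.
std_resid <- rstandard(fit)
large_resid <- which(abs(std_resid) > 3)  # rule-of-thumb cutoff (assumption)

print(influential)
print(large_resid)

# Standard base-R diagnostic plots for a fitted linear model.
par(mfrow = c(2, 2))
plot(fit)
par(mfrow = c(1, 1))
```

A flagged point is a prompt to inspect the record, not an automatic reason to drop it; long delays are genuine observations in flight data.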