This dataset contains information about all flights that departed from NYC (i.e., EWR, JFK, and LGA) in 2013: 336,776 observations of 16 variables. The variables are ‘year’, ‘month’, ‘day’, ‘dep_time’, ‘dep_delay’, ‘arr_time’, ‘arr_delay’, ‘carrier’, ‘tailnum’, ‘flight’, ‘origin’, ‘dest’, ‘air_time’, ‘distance’, ‘hour’, and ‘minute’. The integer variables are ‘year’ (year of departure), ‘month’ (month of departure), ‘day’ (day of departure), ‘dep_time’ (time of departure [in hhmm notation]), ‘arr_time’ (time of arrival [in hhmm notation]), and ‘flight’ (flight number). The character variables are ‘carrier’ (airline carrier), ‘tailnum’ (plane tail number), ‘origin’ (origin airport), and ‘dest’ (destination airport). Lastly, the numeric variables are ‘dep_delay’ (departure delay in minutes; negative values indicate an early departure), ‘arr_delay’ (arrival delay in minutes; negative values indicate an early arrival), ‘air_time’ (amount of time a plane is airborne, in minutes), ‘distance’ (distance traveled by a flight, in miles), ‘hour’ (hour of departure), and ‘minute’ (minute of departure).
[Reference: An R data package containing all outbound flights from NYC in 2013 [+ useful metadata] is used in this analysis. It can be installed from GitHub with devtools::install_github("hadley/nycflights13").]
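The hhmm notation used by ‘dep_time’ and ‘arr_time’ can be split into hour and minute parts with integer arithmetic. The following is a minimal sketch using the first six departure times shown in this document; the names `dep_hour`, `dep_minute`, and `dep_minutes_since_midnight` are illustrative and are not columns of the dataset.

```r
# Departure times copied from the first six rows of the dataset (hhmm notation)
dep_time <- c(517L, 533L, 542L, 544L, 554L, 554L)

# Integer division by 100 gives the hour; the remainder gives the minute
dep_hour   <- dep_time %/% 100L
dep_minute <- dep_time %% 100L

# Minutes since midnight are easier to subtract than raw hhmm values
# (illustrative helper, not part of the original analysis)
dep_minutes_since_midnight <- dep_hour * 60L + dep_minute

data.frame(dep_time, dep_hour, dep_minute, dep_minutes_since_midnight)
```

The resulting hour and minute values agree with the ‘hour’ and ‘minute’ columns printed for those same rows.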
Below, the “nycflights13” dataset is loaded into R, and its summary statistics and structure are displayed (along with the “head” and the “tail” of the dataset).
#Install and load the "nycflights13" dataset into R.
rm(list=ls())
install.packages("nycflights13", repos='http://cran.us.r-project.org')
## package 'nycflights13' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\howelb\AppData\Local\Temp\Rtmpclo9Dl\downloaded_packages
library("nycflights13")
## Warning: package 'nycflights13' was built under R version 3.1.3
#Assign a variable, "data_raw", to the complete dataframe, "flights".
#Then, display the "head" and "tail" of the dataset, "data_raw".
data_raw<-flights
head(data_raw)
## year month day dep_time dep_delay arr_time arr_delay carrier tailnum
## 1 2013 1 1 517 2 830 11 UA N14228
## 2 2013 1 1 533 4 850 20 UA N24211
## 3 2013 1 1 542 2 923 33 AA N619AA
## 4 2013 1 1 544 -1 1004 -18 B6 N804JB
## 5 2013 1 1 554 -6 812 -25 DL N668DN
## 6 2013 1 1 554 -4 740 12 UA N39463
## flight origin dest air_time distance hour minute
## 1 1545 EWR IAH 227 1400 5 17
## 2 1714 LGA IAH 227 1416 5 33
## 3 1141 JFK MIA 160 1089 5 42
## 4 725 JFK BQN 183 1576 5 44
## 5 461 LGA ATL 116 762 5 54
## 6 1696 EWR ORD 150 719 5 54
tail(data_raw)
##        year month day dep_time dep_delay arr_time arr_delay carrier
## 336771 2013     9  30       NA        NA       NA        NA      EV
## 336772 2013     9  30       NA        NA       NA        NA      9E
## 336773 2013     9  30       NA        NA       NA        NA      9E
## 336774 2013     9  30       NA        NA       NA        NA      MQ
## 336775 2013     9  30       NA        NA       NA        NA      MQ
## 336776 2013     9  30       NA        NA       NA        NA      MQ
##        tailnum flight origin dest air_time distance hour minute
## 336771  N740EV   5274    LGA  BNA       NA      764   NA     NA
## 336772           3393    JFK  DCA       NA      213   NA     NA
## 336773           3525    LGA  SYR       NA      198   NA     NA
## 336774  N535MQ   3461    LGA  BNA       NA      764   NA     NA
## 336775  N511MQ   3572    LGA  CLE       NA      419   NA     NA
## 336776  N839MQ   3531    LGA  RDU       NA      431   NA     NA
#Display the summary statistics and the structure of the data
summary(data_raw)
##      year           month             day          dep_time
##  Min.   :2013   Min.   : 1.000   Min.   : 1.00   Min.   :   1
##  1st Qu.:2013   1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.: 907
##  Median :2013   Median : 7.000   Median :16.00   Median :1401
##  Mean   :2013   Mean   : 6.549   Mean   :15.71   Mean   :1349
##  3rd Qu.:2013   3rd Qu.:10.000   3rd Qu.:23.00   3rd Qu.:1744
##  Max.   :2013   Max.   :12.000   Max.   :31.00   Max.   :2400
##                                                  NA's   :8255
##     dep_delay        arr_time        arr_delay           carrier
##  Min.   : -43.00   Min.   :   1   Min.   : -86.000   Length:336776
##  1st Qu.:  -5.00   1st Qu.:1104   1st Qu.: -17.000   Class :character
##  Median :  -2.00   Median :1535   Median :  -5.000   Mode  :character
##  Mean   :  12.64   Mean   :1502   Mean   :   6.895
##  3rd Qu.:  11.00   3rd Qu.:1940   3rd Qu.:  14.000
##  Max.   :1301.00   Max.   :2400   Max.   :1272.000
##  NA's   :8255      NA's   :8713   NA's   :9430
##      tailnum           flight           origin              dest
##  Length:336776      Min.   :   1   Length:336776      Length:336776
##  Class :character   1st Qu.: 553   Class :character   Class :character
##  Mode  :character   Median :1496   Mode  :character   Mode  :character
##                     Mean   :1972
##                     3rd Qu.:3465
##                     Max.   :8500
##
##    air_time        distance         hour           minute
##  Min.   : 20.0   Min.   :  17   Min.   : 0.00   Min.   : 0.00
##  1st Qu.: 82.0   1st Qu.: 502   1st Qu.: 9.00   1st Qu.:16.00
##  Median :129.0   Median : 872   Median :14.00   Median :31.00
##  Mean   :150.7   Mean   :1040   Mean   :13.17   Mean   :31.76
##  3rd Qu.:192.0   3rd Qu.:1389   3rd Qu.:17.00   3rd Qu.:49.00
##  Max.   :695.0   Max.   :4983   Max.   :24.00   Max.   :59.00
##  NA's   :9430                   NA's   :8255    NA's   :8255
str(data_raw)
## Classes 'tbl_df', 'tbl' and 'data.frame': 336776 obs. of 16 variables:
## $ year : int 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
## $ month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ day : int 1 1 1 1 1 1 1 1 1 1 ...
## $ dep_time : int 517 533 542 544 554 554 555 557 557 558 ...
## $ dep_delay: num 2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
## $ arr_time : int 830 850 923 1004 812 740 913 709 838 753 ...
## $ arr_delay: num 11 20 33 -18 -25 12 19 -14 -8 8 ...
## $ carrier : chr "UA" "UA" "AA" "B6" ...
## $ tailnum : chr "N14228" "N24211" "N619AA" "N804JB" ...
## $ flight : int 1545 1714 1141 725 461 1696 507 5708 79 301 ...
## $ origin : chr "EWR" "LGA" "JFK" "JFK" ...
## $ dest : chr "IAH" "IAH" "MIA" "BQN" ...
## $ air_time : num 227 227 160 183 116 150 158 53 140 138 ...
## $ distance : num 1400 1416 1089 1576 762 ...
## $ hour : num 5 5 5 5 5 5 5 5 5 5 ...
## $ minute : num 17 33 42 44 54 54 55 57 57 58 ...
#Create a subset of "data_raw" that contains only numeric data
nycflights <- subset(data_raw, select = c(dep_delay, arr_time, arr_delay, air_time, distance, hour, minute))
#Display the "head" and "tail" of the dataset, "nycflights".
head(nycflights)
## dep_delay arr_time arr_delay air_time distance hour minute
## 1 2 830 11 227 1400 5 17
## 2 4 850 20 227 1416 5 33
## 3 2 923 33 160 1089 5 42
## 4 -1 1004 -18 183 1576 5 44
## 5 -6 812 -25 116 762 5 54
## 6 -4 740 12 150 719 5 54
tail(nycflights)
## dep_delay arr_time arr_delay air_time distance hour minute
## 336771 NA NA NA NA 764 NA NA
## 336772 NA NA NA NA 213 NA NA
## 336773 NA NA NA NA 198 NA NA
## 336774 NA NA NA NA 764 NA NA
## 336775 NA NA NA NA 419 NA NA
## 336776 NA NA NA NA 431 NA NA
#Display the summary statistics and the structure of the data
summary(nycflights)
##     dep_delay        arr_time        arr_delay         air_time
##  Min.   : -43.00   Min.   :   1   Min.   : -86.000   Min.   : 20.0
##  1st Qu.:  -5.00   1st Qu.:1104   1st Qu.: -17.000   1st Qu.: 82.0
##  Median :  -2.00   Median :1535   Median :  -5.000   Median :129.0
##  Mean   :  12.64   Mean   :1502   Mean   :   6.895   Mean   :150.7
##  3rd Qu.:  11.00   3rd Qu.:1940   3rd Qu.:  14.000   3rd Qu.:192.0
##  Max.   :1301.00   Max.   :2400   Max.   :1272.000   Max.   :695.0
##  NA's   :8255      NA's   :8713   NA's   :9430       NA's   :9430
##    distance         hour           minute
##  Min.   :  17   Min.   : 0.00   Min.   : 0.00
##  1st Qu.: 502   1st Qu.: 9.00   1st Qu.:16.00
##  Median : 872   Median :14.00   Median :31.00
##  Mean   :1040   Mean   :13.17   Mean   :31.76
##  3rd Qu.:1389   3rd Qu.:17.00   3rd Qu.:49.00
##  Max.   :4983   Max.   :24.00   Max.   :59.00
##                 NA's   :8255    NA's   :8255
str(nycflights)
## Classes 'tbl_df', 'tbl' and 'data.frame': 336776 obs. of 7 variables:
## $ dep_delay: num 2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
## $ arr_time : int 830 850 923 1004 812 740 913 709 838 753 ...
## $ arr_delay: num 11 20 33 -18 -25 12 19 -14 -8 8 ...
## $ air_time : num 227 227 160 183 116 150 158 53 140 138 ...
## $ distance : num 1400 1416 1089 1576 762 ...
## $ hour : num 5 5 5 5 5 5 5 5 5 5 ...
## $ minute : num 17 33 42 44 54 54 55 57 57 58 ...
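The subset above was chosen by naming columns by hand. As a hedged alternative, columns can also be selected by storage type. The sketch below uses a small toy data frame (`toy_flights`, an illustrative name built from values printed in this document) rather than the real data. It shows that `is.numeric()` also matches integer columns such as ‘year’, whereas `is.double()` keeps only the ‘num’ columns.

```r
# Toy data frame mimicking a few 'flights' columns (values from the head() output)
toy_flights <- data.frame(
  year      = c(2013L, 2013L, 2013L),
  dep_delay = c(2, 4, 2),
  arr_delay = c(11, 20, 33),
  carrier   = c("UA", "UA", "AA"),
  distance  = c(1400, 1416, 1089),
  stringsAsFactors = FALSE
)

# is.numeric() is TRUE for both 'int' and 'num' columns
numeric_cols <- vapply(toy_flights, is.numeric, logical(1))
names(toy_flights)[numeric_cols]

# is.double() keeps only the double-precision ('num') columns
double_cols <- vapply(toy_flights, is.double, logical(1))
toy_numeric <- toy_flights[, double_cols, drop = FALSE]
str(toy_numeric)
```

Neither rule reproduces the hand-picked subset exactly, since that subset keeps the integer column ‘arr_time’ while dropping the other integer columns.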
Following this initial summary statistics analysis, and working from a random sample of the data, a hierarchical approach is used to begin developing a multiple linear regression model. Information from the website ‘flightstats.com’ pertaining to “Airport Departures and Arrivals” [reference: http://www.flightstats.com/go/FlightStatus/flightStatusByAirport.do] indicates that weather, air traffic control directives, and congestion on the taxiways can affect arrival times for flights. On that basis, it is intuitively determined that, of the variables found in this “nycflights13” dataset, the dependent variable “arr_delay” and the independent variables “distance” and “air_time” are the most suitable to include in this analysis.
With this hierarchical approach, the aim is to determine whether the variation observed in the dependent variable (‘arr_delay’) can be explained by the variation in either of the independent variables (‘air_time’ and ‘distance’). The null hypothesis states that the distance traveled by a given flight and the amount of time that the flight is airborne do not have a significant effect on the flight’s arrival delay. Conversely, the alternative hypothesis states that the distance traveled by a given flight and the amount of time that the flight is airborne do have a significant effect on the flight’s arrival delay.
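One way to write these hypotheses formally is shown below, assuming they are meant as the overall F-test of the two-predictor model. The coefficient symbols are notation introduced here for illustration and do not appear in the original analysis.

```latex
\begin{align*}
\text{arr\_delay}_i &= \beta_0 + \beta_1\,\text{distance}_i + \beta_2\,\text{air\_time}_i + \varepsilon_i \\
H_0 &: \beta_1 = \beta_2 = 0 \\
H_1 &: \beta_j \neq 0 \ \text{for at least one } j \in \{1, 2\}
\end{align*}
```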
In this experiment, a hierarchical multiple linear regression model is generated. It offers insight into both the amount of variance in ‘arr_delay’ that can be explained by each of the independent variables and whether suppression is likely to be present in a linear regression model built from this data. The independent variables are the distance traveled by a given flight (in miles) and the amount of time that the flight is airborne (in minutes); the dependent variable is the flight’s arrival delay (in minutes).
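A hierarchical regression is usually carried out by entering predictors in steps and comparing the change in R-squared between the nested models. The sketch below illustrates that idea on simulated data, not on the flights data. The data-generating values, the seed, the entry order (‘distance’ first), and all object names are assumptions made for illustration.

```r
set.seed(1)  # arbitrary seed, only so the simulation is repeatable

# Simulated stand-ins for the real columns (NOT the nycflights13 data)
n         <- 500
distance  <- runif(n, 100, 2500)                    # miles
air_time  <- 20 + distance / 8 + rnorm(n, sd = 10)  # minutes, closely tied to distance
arr_delay <- 0.6 * air_time - 0.08 * distance + rnorm(n, sd = 40)
sim <- data.frame(arr_delay, distance, air_time)

# Step 1: enter 'distance' alone; Step 2: add 'air_time'
step1 <- lm(arr_delay ~ distance, data = sim)
step2 <- lm(arr_delay ~ distance + air_time, data = sim)

r2_step1 <- summary(step1)$r.squared
r2_step2 <- summary(step2)$r.squared
delta_r2 <- r2_step2 - r2_step1   # R-squared gained by adding 'air_time'

# F-test for whether the R-squared change is significant
step_test <- anova(step1, step2)

c(r2_step1 = r2_step1, r2_step2 = r2_step2, delta_r2 = delta_r2)
step_test

# Zero-order correlations, for comparison with the coefficient signs
cor(sim)
coef(step2)
```

Comparing the sign and size of each coefficient in the full model against the corresponding zero-order correlation is one common first check for suppression; it is a heuristic rather than a formal test.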
Originally, the “nycflights13” dataset contains 336,776 observations. However, with a sample this large, even negligible effects reach statistical significance, so a power analysis is performed in this experiment to determine an appropriate sample size for our final multiple linear regression model.
#Generate an initial Hierarchical Multiple Linear Regression Model that uses all 336,776 observations
flights_model_0 <- lm(nycflights$arr_delay~nycflights$distance+nycflights$air_time)
#Display summary of the initial Hierarchical Multiple Linear Regression Model
summary(flights_model_0)
##
## Call:
## lm(formula = nycflights$arr_delay ~ nycflights$distance + nycflights$air_time)
##
## Residuals:
## Min 1Q Median 3Q Max
## -69.27 -22.91 -12.28 5.44 1284.47
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.4559364 0.1728867 -8.421 <2e-16 ***
## nycflights$distance -0.0876549 0.0007613 -115.145 <2e-16 ***
## nycflights$air_time 0.6652632 0.0059795 111.256 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 43.73 on 327343 degrees of freedom
## (9430 observations deleted due to missingness)
## Multiple R-squared: 0.04012, Adjusted R-squared: 0.04012
## F-statistic: 6842 on 2 and 327343 DF, p-value: < 2.2e-16
#Determine effect size for power analysis
r2 <- summary(flights_model_0)$r.squared
f2 <- r2 / (1-r2)
f2
## [1] 0.04180118
Once the effect size is determined, the software G*Power is used to find an appropriate sample size for this hierarchical multiple linear regression analysis. G*Power returned a sample size of 261. The dataset “nycflights” is therefore sampled down to 261 observations, creating a new dataset for the final hierarchical multiple linear regression model, which is then used to determine whether the variation in “arr_delay” can be explained by the variation in “air_time” and “distance”.
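The sample-size calculation can also be sketched directly in R using the noncentral F distribution. This is a hedged illustration, not a reproduction of the G*Power run: the significance level (0.05) and target power (0.80) are assumptions, since the settings used in G*Power are not reported here, and the helper names `regression_power` and `required_n` are hypothetical. The result therefore need not equal 261.

```r
# Power of the overall F-test in multiple regression for a total sample size n.
# Noncentrality is taken as lambda = f2 * n (Cohen's convention; an assumption here).
regression_power <- function(n, f2, n_predictors, alpha = 0.05) {
  df1    <- n_predictors
  df2    <- n - n_predictors - 1
  f_crit <- qf(1 - alpha, df1, df2)
  pf(f_crit, df1, df2, ncp = f2 * n, lower.tail = FALSE)
}

# Smallest total sample size whose power reaches the target
required_n <- function(f2, n_predictors, alpha = 0.05, power = 0.80, max_n = 1e6) {
  n <- n_predictors + 2  # smallest n leaving one residual degree of freedom
  while (regression_power(n, f2, n_predictors, alpha) < power) {
    n <- n + 1
    if (n > max_n) stop("target power not reached below max_n")
  }
  n
}

f2 <- 0.04180118  # effect size reported above
n_needed <- required_n(f2, n_predictors = 2, alpha = 0.05, power = 0.80)  # assumed settings
n_needed
regression_power(n_needed, f2, n_predictors = 2)
```

Raising the `power` argument increases the required sample size, so the chosen target should be stated alongside the result.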
#Remove any rows that contain "NA" in "nycflights", creating "data_clean".
#Randomly take a sample of 261 observations from "data_clean", creating "nycflights_final".
data_clean<-na.omit(nycflights)
flight.index = sample(1:nrow(data_clean),261,replace=FALSE)
nycflights_final<-data_clean[flight.index,]
#Generate a new Hierarchical Multiple Linear Regression Model that uses 261 observations
flights_model_final <- lm(nycflights_final$arr_delay~nycflights_final$distance+nycflights_final$air_time)
#Display summary of the final Hierarchical Multiple Linear Regression Model
summary(flights_model_final)
##
## Call:
## lm(formula = nycflights_final$arr_delay ~ nycflights_final$distance +
## nycflights_final$air_time)
##
## Residuals:
## Min 1Q Median 3Q Max
## -52.431 -23.133 -12.816 6.092 268.650
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.48232 6.24864 0.717 0.47382
## nycflights_final$distance -0.08195 0.02656 -3.086 0.00225 **
## nycflights_final$air_time 0.60041 0.21322 2.816 0.00524 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 42.1 on 258 degrees of freedom
## Multiple R-squared: 0.0509, Adjusted R-squared: 0.04354
## F-statistic: 6.918 on 2 and 258 DF, p-value: 0.001184
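Because the sampling step above is not seeded, each run draws a different set of 261 flights and the coefficients will vary from run to run. The sketch below shows one way to make the draw repeatable and to fit the model with a formula plus `data =`. It uses simulated stand-in data; the seed value and the names `data_clean_sim`, `sample_sim`, and `model_sim` are assumptions for illustration.

```r
set.seed(2013)  # arbitrary seed (an assumption); any fixed value makes the draw repeatable

# Simulated stand-in for 'data_clean' (NOT the real flights data)
data_clean_sim <- data.frame(
  arr_delay = rnorm(5000, mean = 7, sd = 45),
  distance  = runif(5000, 17, 4983),
  air_time  = runif(5000, 20, 695)
)

# Draw 261 row indices without replacement
sample_index <- sample(seq_len(nrow(data_clean_sim)), size = 261, replace = FALSE)
sample_sim   <- data_clean_sim[sample_index, ]

# Formula + data= keeps coefficient names short ('distance', not 'sample_sim$distance')
model_sim <- lm(arr_delay ~ distance + air_time, data = sample_sim)
names(coef(model_sim))

# With the data= form, predict() can be applied to new rows by column name
predict(model_sim, newdata = data.frame(distance = 1000, air_time = 150))
```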
NOTE/ASIDE: Interpretation and plots will be completed very soon. I just want to be sure that the work I’ve done so far is correct (i.e., have I performed the power analysis correctly?). After observing the results of my model, it appears that my independent variables explain very little of the variation in my dependent variable (R-squared is roughly 0.05), even though both coefficients are statistically significant. If this is the case, that’s okay, but I’d hate to check the assumptions surrounding (and write up the interpretations pertaining to) my work when my modeling approach is flawed to begin with. If you look at the two plots below, it appears that some outliers may exist. However, I’m unsure whether these data points should still be included. Am I on the right track so far?
#Generate a scatterplot of the data: "arr_delay" vs. "distance"
plot(y = nycflights_final$arr_delay,x = nycflights_final$distance, pch=21, bg="darkviolet", main="Distance Traveled by Flight vs. Arrival Delay [in minutes]", ylab = "Arrival Delay [in minutes]", xlab = "Distance Traveled by Flight (in miles)")
#Generate a scatterplot of the data: "arr_delay" vs. "air_time"
plot(y = nycflights_final$arr_delay,x = nycflights_final$air_time, pch=21, bg="darkviolet", main="Flight Airborne Time vs. Arrival Delay [in minutes]", ylab = "Arrival Delay [in minutes]", xlab = "Flight Airborne Time (in minutes)")
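On the question of possible outliers, one hedged way to go beyond eyeballing the scatterplots is to flag observations by standardized residual and Cook's distance, then check how much the coefficients move when those rows are set aside. The sketch below runs on simulated data with one planted extreme delay, not on the real sample. The cutoffs (|standardized residual| > 3, Cook's distance > 4/n) are rules of thumb, and every object name is illustrative.

```r
set.seed(42)  # arbitrary seed so the simulated example is repeatable

# Simulated stand-in for the 261-flight sample (NOT the real data)
n   <- 261
sim <- data.frame(distance = runif(n, 100, 2500))
sim$air_time  <- 20 + sim$distance / 8 + rnorm(n, sd = 10)
sim$arr_delay <- 0.6 * sim$air_time - 0.08 * sim$distance + rnorm(n, sd = 30)
sim$arr_delay[1] <- 270  # plant one extreme delay for illustration

fit <- lm(arr_delay ~ distance + air_time, data = sim)

# Rule-of-thumb flags (conventions, not hard cutoffs)
std_resid <- rstandard(fit)
cooks_d   <- cooks.distance(fit)
flagged   <- which(abs(std_resid) > 3 | cooks_d > 4 / n)

sim[flagged, ]

# Refit without the flagged rows to see how much the coefficients move
keep        <- setdiff(seq_len(n), flagged)
fit_trimmed <- lm(arr_delay ~ distance + air_time, data = sim[keep, ])
rbind(all_rows = coef(fit), trimmed = coef(fit_trimmed))
```

A flagged point is only a candidate for closer inspection. Very long delays are real events, so whether to exclude them depends on the question being asked, and reporting the model both with and without them is a common compromise.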