The flights dataset from the nycflights13 package contains basic information about every flight that departed from the NYC airports in 2013.
library("nycflights13")
x <- flights
attach(x)
head(x)
## year month day dep_time dep_delay arr_time arr_delay carrier tailnum
## 1 2013 1 1 517 2 830 11 UA N14228
## 2 2013 1 1 533 4 850 20 UA N24211
## 3 2013 1 1 542 2 923 33 AA N619AA
## 4 2013 1 1 544 -1 1004 -18 B6 N804JB
## 5 2013 1 1 554 -6 812 -25 DL N668DN
## 6 2013 1 1 554 -4 740 12 UA N39463
## flight origin dest air_time distance hour minute
## 1 1545 EWR IAH 227 1400 5 17
## 2 1714 LGA IAH 227 1416 5 33
## 3 1141 JFK MIA 160 1089 5 42
## 4 725 JFK BQN 183 1576 5 44
## 5 461 LGA ATL 116 762 5 54
## 6 1696 EWR ORD 150 719 5 54
The null hypothesis is that Newark (EWR) and JFK have the same mean departure delay; the alternative hypothesis is that the difference in mean departure delay is non-zero.
The factor we are interested in for this analysis is origin, the airport of departure. It has three levels (EWR, JFK and LGA), of which we look at two: JFK and EWR.
summary(as.factor(origin))
## EWR JFK LGA
## 120835 111279 104662
The response variable is the departure delay, in minutes.
summary(dep_delay)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -43 -5 -2 13 11 1300 8255
Note that negative delay signifies an early departure.
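The summary above also reports 8255 missing values for dep_delay. The following is a minimal sketch of how missing values behave in base R, using a small made-up vector (toy_delay is hypothetical and not taken from the flights data). mean() returns NA unless na.rm = TRUE is set, whereas t.test() drops missing values from each sample on its own, so any comparison is based only on flights with a recorded delay.

```r
# Toy delays in minutes (illustrative values, not from the flights data).
# NA stands for a flight with no recorded departure delay.
toy_delay <- c(2, 4, NA, -1, -6, NA, 12)

n_missing <- sum(is.na(toy_delay))            # count of missing values
mean_with_na <- mean(toy_delay)               # NA, because NAs propagate
mean_no_na <- mean(toy_delay, na.rm = TRUE)   # mean of the recorded values only
n_used <- sum(!is.na(toy_delay))              # observations a t-test would keep

print(c(n_missing = n_missing, n_used = n_used, mean_no_na = mean_no_na))
```

On the real data, sum(is.na(dep_delay)) should return the same count that summary() prints in the NA's column.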
The dataset contains every flight that departed from the NYC airports in 2013, thus we treat the ‘experiment’ as fully complete and randomized. Every flight out of each airport can be viewed as a repeated measure in the experiment. We use an independent two-sample t-test to compare the departure delays out of JFK and EWR. R's t.test() defaults to the Welch variant, which does not assume equal variances in the two groups.
boxplot(dep_delay~origin, outline=FALSE)
Judging from the boxplot, which shows medians and quartiles with outliers hidden, there does not appear to be a large difference in typical departure delay across airports of origin.
EWR <- subset(x, origin =='EWR')
JFK <- subset(x, origin =='JFK')
t.test(EWR$dep_delay, JFK$dep_delay)
##
## Welch Two Sample t-test
##
## data: EWR$dep_delay and JFK$dep_delay
## t = 17.76, df = 226958, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 2.665 3.326
## sample estimates:
## mean of x mean of y
## 15.11 12.11
The t-test concludes that there is a statistically significant difference in means between EWR and JFK. Thus, we reject the null hypothesis and conclude that randomization alone cannot account for the variation in departure delay times.
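The output above is labeled "Welch Two Sample t-test". As a rough sketch of what that computation involves, the statistic, the degrees of freedom and the confidence interval can be reproduced by hand from the group means, variances and sizes. The function name welch_by_hand and the toy vectors a and b are assumptions made for illustration; they are not part of the original analysis or of any package.

```r
# Welch two-sample t statistic, Welch-Satterthwaite degrees of freedom,
# and confidence interval for the difference in means, computed by hand.
welch_by_hand <- function(a, b, conf = 0.95) {
  a <- a[!is.na(a)]                      # drop missing values, as t.test() does
  b <- b[!is.na(b)]
  va <- var(a) / length(a)               # squared standard error of mean(a)
  vb <- var(b) / length(b)               # squared standard error of mean(b)
  se <- sqrt(va + vb)                    # standard error of the difference
  diff <- mean(a) - mean(b)
  df <- (va + vb)^2 / (va^2 / (length(a) - 1) + vb^2 / (length(b) - 1))
  half <- qt(1 - (1 - conf) / 2, df) * se
  list(t = diff / se, df = df, ci = c(diff - half, diff + half))
}

# Toy vectors for illustration only (not taken from the flights data)
a <- c(2, 4, 2, -1, -6, 30, NA, 12)
b <- c(-4, 0, 5, -3, 1, 8, -2)

res <- welch_by_hand(a, b)
ref <- t.test(a, b)                      # built-in Welch test for comparison
print(all.equal(res$t, unname(ref$statistic)))
```

Applied to EWR$dep_delay and JFK$dep_delay, the same function should reproduce the t, df and confidence interval shown in the output above, up to rounding.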
summary(EWR$dep_delay)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -25 -4 -1 15 15 1130 3239
summary(JFK$dep_delay)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -43.0 -5.0 -1.0 12.1 10.0 1300.0 1863
Flights out of EWR have a mean departure delay of 15.11 minutes, whereas flights out of JFK have a mean departure delay of 12.11 minutes. From the t-test above, we are 95% confident that the mean departure delay at EWR is between 2.67 and 3.33 minutes longer than the mean departure delay at JFK.
The response variable, departure delay, violates the assumption of normality. However, the t-test is regarded as robust to departures from normality, particularly with samples as large as these, so we do not necessarily disregard the model.
qqnorm(dep_delay)
qqline(dep_delay)
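The robustness argument rests on the sampling distribution of the mean being close to normal even when individual delays are strongly skewed. The following is a small simulation sketch of that idea. It uses simulated right-skewed values (a shifted exponential chosen purely for illustration), not the flights data, and the helper skew is a hypothetical name defined here.

```r
set.seed(1)

# Simulated right-skewed "delays" in minutes (an assumption for illustration,
# NOT the flights data): exponential with mean 15, shifted 5 minutes early.
delays <- rexp(10000, rate = 1 / 15) - 5

# Sample skewness: roughly 0 for a symmetric distribution
skew <- function(v) mean((v - mean(v))^3) / sd(v)^3

# Means of 2000 resamples of size 500 drawn from the skewed values
sample_means <- replicate(2000, mean(sample(delays, 500, replace = TRUE)))

skew_raw <- skew(delays)           # strongly positive: long right tail
skew_means <- skew(sample_means)   # much closer to 0

print(c(skew_raw = skew_raw, skew_means = skew_means))
```

The group sizes in the actual comparison are far larger than 500, so the same reasoning applies at least as strongly there. Extreme outliers can still inflate the variance estimates.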