For this recipe, we will examine the flights dataset from the nycflights13 package.
Read in data set:
install.packages("nycflights13", repos = "https://cloud.r-project.org")  # set a CRAN mirror explicitly
library("nycflights13")
x<-flights
Create a subset of the data with only five of the destination cities and four of the airlines:
xx<-subset(x,x$dest=="ORD" | x$dest=="BOS" | x$dest=="DEN" | x$dest=="LAX" | x$dest=="ATL")
xxx<-subset(xx,xx$carrier=="AA" | xx$carrier=="DL" | xx$carrier=="UA" | xx$carrier=="WN")
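The two filters above can also be written with `%in%`, which avoids repeating the column name for every level. The sketch below runs on a small made-up data frame (`toy`, not the real flights data) so that it is self-contained; the names `keep_dest` and `keep_carrier` are illustrative only.

```r
# Toy stand-in for the flights data (values are made up for illustration)
toy <- data.frame(
  dest    = c("ORD", "MIA", "BOS", "LAX", "ATL", "DEN", "SFO"),
  carrier = c("AA",  "AA",  "B6",  "UA",  "DL",  "WN",  "UA"),
  stringsAsFactors = FALSE
)

keep_dest    <- c("ORD", "BOS", "DEN", "LAX", "ATL")
keep_carrier <- c("AA", "DL", "UA", "WN")

# One subset() call replaces the chained == / | comparisons
toy_sub <- subset(toy, dest %in% keep_dest & carrier %in% keep_carrier)
print(toy_sub)
```

Applied to the real data, the same pattern would be `subset(x, dest %in% keep_dest & carrier %in% keep_carrier)`.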
Some observations in the dataset contain missing values (NA). We omit these rows before further analysis:
data<-na.omit(xxx)
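Note that `na.omit()` drops every row that has an NA in any column, not only in the response. A minimal sketch on made-up values (the data frame `toy` is hypothetical) shows how one might count the rows that are lost:

```r
# Toy data frame with missing values (made-up numbers)
toy <- data.frame(
  arr_delay = c(5, NA, -3, 12),
  air_time  = c(120, 115, NA, 140)
)

complete <- complete.cases(toy)   # TRUE where a row has no NA in any column
cleaned  <- na.omit(toy)          # keeps only the complete rows

cat("rows dropped:", sum(!complete), "\n")
cat("rows kept:   ", nrow(cleaned), "\n")
```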
Month (12 levels): month each flight took off in.
Hour (24 levels): hour of the day the plane departed (military time).
Carrier (4 levels: AA, DL, UA, WN): airline name.
Origin city (3 levels: EWR, JFK, LGA): city each flight departed from.
Destination city (dest) (5 levels: ORD, BOS, LAX, DEN, ATL): city each flight landed in.
Set up variables as factors:
data$month=as.factor(data$month)
data$hour=as.factor(data$hour)
data$carrier=as.factor(data$carrier)
data$origin=as.factor(data$origin)
data$dest=as.factor(data$dest)
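The five conversions above follow one pattern, so they could also be done in a single step with `lapply()`. This is a sketch on a made-up data frame; `toy` and `factor_cols` are illustrative names, not part of the original analysis.

```r
# Toy data frame (made-up values) with columns to be treated as factors
toy <- data.frame(
  month   = c(1, 2, 12),
  hour    = c(5, 13, 24),
  carrier = c("AA", "DL", "UA"),
  stringsAsFactors = FALSE
)

factor_cols <- c("month", "hour", "carrier")

# Convert all listed columns to factors at once
toy[factor_cols] <- lapply(toy[factor_cols], as.factor)
str(toy)
```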
Departure and arrival time, arrival and departure delay time, air time, and distance are all continuous variables.
Arrival delay time (arr_delay), measured in minutes, will be the response variable for this experiment.
With the NAs omitted (851 rows dropped) and carriers and destinations subsetted, the dataset contains 46,843 flight observations of 16 variables.
Structure, summary, and more on the dataset:
str(data)
## Classes 'tbl_df', 'tbl' and 'data.frame': 46843 obs. of 16 variables:
## $ year : int 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
## $ month : Factor w/ 12 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ day : int 1 1 1 1 1 1 1 1 1 1 ...
## $ dep_time : int 554 554 558 558 606 615 628 629 646 656 ...
## $ dep_delay: num -6 -4 -2 -2 -4 0 -2 -1 1 -4 ...
## $ arr_time : int 812 740 753 924 837 833 1016 824 910 854 ...
## $ arr_delay: num -25 12 8 7 -8 -9 29 14 -6 4 ...
## $ carrier : Factor w/ 4 levels "AA","DL","UA",..: 2 3 1 3 2 2 3 1 3 1 ...
## $ tailnum : chr "N668DN" "N39463" "N3ALAA" "N29129" ...
## $ flight : int 461 1696 301 194 1743 575 1665 303 883 305 ...
## $ origin : Factor w/ 3 levels "EWR","JFK","LGA": 3 1 3 2 2 1 1 3 3 3 ...
## $ dest : Factor w/ 5 levels "ATL","BOS","DEN",..: 1 5 5 4 1 1 4 5 3 5 ...
## $ air_time : num 116 150 138 345 128 120 366 140 243 143 ...
## $ distance : num 762 719 733 2475 760 ...
## $ hour : Factor w/ 24 levels "0","1","2","3",..: 5 5 5 5 6 6 6 6 6 6 ...
## $ minute : num 54 54 58 58 6 15 28 29 46 56 ...
## - attr(*, "na.action")=Class 'omit' Named int [1:851] 243 244 364 365 792 836 1153 1274 1324 1355 ...
## .. ..- attr(*, "names")= chr [1:851] "1783" "1785" "2691" "2692" ...
summary(data)
## year month day dep_time
## Min. :2013 8 : 4335 Min. : 1.0 Min. : 1
## 1st Qu.:2013 10 : 4326 1st Qu.: 8.0 1st Qu.: 858
## Median :2013 7 : 4189 Median :16.0 Median :1346
## Mean :2013 9 : 4097 Mean :15.7 Mean :1330
## 3rd Qu.:2013 6 : 4073 3rd Qu.:23.0 3rd Qu.:1730
## Max. :2013 5 : 4047 Max. :31.0 Max. :2400
## (Other):21776
## dep_delay arr_time arr_delay carrier
## Min. :-20 Min. : 1 Min. :-75.0 AA:10820
## 1st Qu.: -4 1st Qu.:1059 1st Qu.:-19.0 DL:14936
## Median : -1 Median :1523 Median : -7.0 UA:19650
## Mean : 11 Mean :1498 Mean : 3.7 WN: 1437
## 3rd Qu.: 8 3rd Qu.:1930 3rd Qu.: 11.0
## Max. :898 Max. :2400 Max. :895.0
##
## tailnum flight origin dest air_time
## Length:46843 Min. : 1 EWR:17151 ATL:10612 Min. : 26
## Class :character 1st Qu.: 337 JFK:12683 BOS: 5686 1st Qu.:107
## Mode :character Median : 708 LGA:17009 DEN: 6152 Median :120
## Mean : 886 LAX:11803 Mean :173
## 3rd Qu.:1377 ORD:12590 3rd Qu.:289
## Max. :4454 Max. :440
##
## distance hour minute
## Min. : 187 8 : 4106 Min. : 0.0
## 1st Qu.: 733 6 : 3772 1st Qu.:16.0
## Median : 762 15 : 3709 Median :34.0
## Mean :1225 18 : 3593 Mean :33.1
## 3rd Qu.:2454 17 : 3320 3rd Qu.:53.0
## Max. :2475 7 : 3188 Max. :59.0
## (Other):25155
head(data)
## year month day dep_time dep_delay arr_time arr_delay carrier tailnum
## 5 2013 1 1 554 -6 812 -25 DL N668DN
## 6 2013 1 1 554 -4 740 12 UA N39463
## 10 2013 1 1 558 -2 753 8 AA N3ALAA
## 13 2013 1 1 558 -2 924 7 UA N29129
## 24 2013 1 1 606 -4 837 -8 DL N3739P
## 30 2013 1 1 615 0 833 -9 DL N326NB
## flight origin dest air_time distance hour minute
## 5 461 LGA ATL 116 762 5 54
## 6 1696 EWR ORD 150 719 5 54
## 10 301 LGA ORD 138 733 5 58
## 13 194 JFK LAX 345 2475 5 58
## 24 1743 JFK ATL 128 760 6 6
## 30 575 EWR ATL 120 746 6 15
?flights
## starting httpd help server ... done
A 3-factor analysis of variance will be performed with blocks on destination and origin city. Airline carrier, hour of departure, and month of departure will be tested to see if they can explain any of the variation in the arrival delay time.
This design was chosen to see if the time of day or year has an effect on a flight being delayed (i.e. to find the best time of day, time of year, and airline to fly with). The goal is to find this without the effects of where you are traveling to or from, which is why origin and destination city are used as blocks.
The dataset flights contains on-time data for all flights departing from the 3 NYC airports (EWR, JFK, LGA) in 2013.
Since flights are not repeated, there are no replicates or repeated measures.
Blocking was used for the origin and destination cities.
Mean of arrival delay time by month, hour, and carrier:
tapply(data$arr_delay, data$month, mean)
## 1 2 3 4 5 6 7 8 9
## 1.9636 -1.0136 -1.0614 8.4818 -0.3425 13.1846 11.3414 3.3970 -4.9468
## 10 11 12
## -1.7191 0.6379 13.7107
tapply(data$arr_delay, data$hour, mean)
## 0 1 2 3 5 6 7 8
## 208.8718 302.7143 324.1429 410.0000 -12.1404 -8.6405 -8.8353 -6.3916
## 9 10 11 12 13 14 15 16
## -2.9847 -3.5328 -5.2358 -0.3394 2.0703 2.0656 2.1677 4.1738
## 17 18 19 20 21 22 23 24
## 8.0684 13.5296 15.7587 19.6720 30.1347 83.4458 151.5364 185.5000
tapply(data$arr_delay, data$carrier, mean)
## AA DL UA WN
## -0.675 4.907 4.656 11.536
tapply(data$arr_delay, data$origin, mean)
## EWR JFK LGA
## 6.0166 0.5186 3.7798
tapply(data$arr_delay, data$dest, mean)
## ATL BOS DEN LAX ORD
## 7.4509 2.6581 6.7731 0.1246 2.9180
The highest mean arrival delays occur in June and December. Flights that depart just after midnight (hours 0 to 3) and late at night (hours 22 to 24) have much higher mean arrival delays, while flights departing between hours 5 and 12 arrive early on average. Carrier AA has the best mean arrival delay (negative, indicating that flights arrive early on average) while WN has the worst.
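A group mean is easier to judge when the number of observations behind it is known, since a mean based on a handful of flights can be extreme. The sketch below, on made-up numbers, prints the count next to each mean; `toy` and `by_hour` are hypothetical names.

```r
# Toy data (made-up values): few flights at hours 0 and 23, more at hour 6
toy <- data.frame(
  hour      = factor(c(0, 0, 6, 6, 6, 6, 23)),
  arr_delay = c(200, 220, -10, -5, 0, 3, 150)
)

# Count and mean of arrival delay per hour, side by side
by_hour <- data.frame(
  n    = as.vector(tapply(toy$arr_delay, toy$hour, length)),
  mean = as.vector(tapply(toy$arr_delay, toy$hour, mean)),
  row.names = levels(toy$hour)
)
print(by_hour)
```

On the real data the equivalent call would be `tapply(data$arr_delay, data$hour, length)`.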
Histogram of arrival delay times:
hist(data$arr_delay, breaks=100, xlim= c(0,300))
Boxplots:
boxplot(data$arr_delay~data$month, xlab="month of departure", ylab="Arrival Delay")
boxplot(data$arr_delay~data$hour, xlab="hour of departure", ylab="Arrival Delay")
boxplot(data$arr_delay~data$carrier, xlab="airline carrier", ylab="Arrival Delay")
boxplot(data$arr_delay~data$origin, xlab="origin city", ylab="Arrival Delay")
boxplot(data$arr_delay~data$dest, xlab="destination city", ylab="Arrival Delay")
This dataset is difficult to analyze through boxplots due to the large number of outliers.
However, when looking at the boxplot of arrival delay by the hour the flight departed, it is evident that hours early in the day (0, 1, 2, 3) and late in the day (22, 23, 24) may have a significant impact on arrival delay.
Due to the large size of the dataset, each factor under study will be tested first in its own model with blocking for origin and destination. A model including all 3 factors and 2 blocks was too large to run in RStudio without crashing.
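For reference, a main-effects model with all three factors and both blocks would have the form sketched below. It is fitted here to simulated toy data only (random numbers, not flight records), and it uses the `data =` argument so the formula stays readable. Whether this form runs on the full dataset without memory problems has not been checked here.

```r
set.seed(1)  # reproducible toy data
n <- 200
toy <- data.frame(
  carrier = factor(sample(c("AA", "DL", "UA", "WN"), n, replace = TRUE)),
  hour    = factor(sample(c(6, 12, 18), n, replace = TRUE)),
  month   = factor(sample(1:3, n, replace = TRUE)),
  origin  = factor(sample(c("EWR", "JFK", "LGA"), n, replace = TRUE)),
  dest    = factor(sample(c("ATL", "BOS", "ORD"), n, replace = TRUE))
)
toy$arr_delay <- rnorm(n, mean = 5, sd = 40)  # simulated response

# Three factors under study plus two blocking variables, main effects only
full_model <- aov(arr_delay ~ carrier + hour + month + origin + dest, data = toy)
tab <- summary(full_model)[[1]]
print(tab)
```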
model1=aov(data$arr_delay~data$month+data$origin+data$dest)
summary(model1)
## Df Sum Sq Mean Sq F value Pr(>F)
## data$month 11 1781656 161969 87.9 <2e-16 ***
## data$origin 2 210823 105411 57.2 <2e-16 ***
## data$dest 4 296818 74204 40.3 <2e-16 ***
## Residuals 46825 86278901 1843
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The null hypothesis of the ANOVA model is that the mean arrival delay is the same at every level of the factor, so that the variation in arrival delay time cannot be explained by anything other than randomization. Since the p-value for month is very low, we reject the null hypothesis: it is likely that month of departure can in fact explain some of the variation in arrival delay time.
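The F value and p-value in the table can be reproduced by hand from the sums of squares and degrees of freedom, which makes the test statistic less of a black box. The numbers below are copied from the model1 summary above; the variable names are illustrative.

```r
# Values copied from the model1 summary table above
ss_month <- 1781656;  df_month <- 11
ss_resid <- 86278901; df_resid <- 46825

ms_month <- ss_month / df_month      # mean square for month
ms_resid <- ss_resid / df_resid      # residual mean square
f_month  <- ms_month / ms_resid      # F statistic for month

# Upper-tail probability of the F distribution under the null hypothesis
p_month <- pf(f_month, df_month, df_resid, lower.tail = FALSE)

cat("F =", round(f_month, 1), "\n")
cat("p below 2e-16:", p_month < 2e-16, "\n")
```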
model2=aov(data$arr_delay~data$hour+data$origin+data$dest)
summary(model2)
## Df Sum Sq Mean Sq F value Pr(>F)
## data$hour 23 14396809 625948 402 <2e-16 ***
## data$origin 2 508348 254174 163 <2e-16 ***
## data$dest 4 781471 195368 125 <2e-16 ***
## Residuals 46813 72881569 1557
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Again, it is likely that the hour a plane departs can explain the variation in its arrival delay.
model3=aov(data$arr_delay~data$carrier+data$origin+data$dest)
summary(model3)
## Df Sum Sq Mean Sq F value Pr(>F)
## data$carrier 3 335042 111681 59.5 < 2e-16 ***
## data$origin 2 110421 55210 29.4 1.7e-13 ***
## data$dest 4 203549 50887 27.1 < 2e-16 ***
## Residuals 46833 87919186 1877
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
It is also likely that the carrier of a plane can explain the variation in its arrival delay.
model4=aov(data$arr_delay~data$hour*data$month+data$origin+data$dest)
summary(model4)
## Df Sum Sq Mean Sq F value Pr(>F)
## data$hour 23 14396809 625948 415.92 <2e-16 ***
## data$month 11 1279820 116347 77.31 <2e-16 ***
## data$origin 2 497767 248884 165.38 <2e-16 ***
## data$dest 4 746583 186646 124.02 <2e-16 ***
## data$hour:data$month 216 1537242 7117 4.73 <2e-16 ***
## Residuals 46586 70109976 1505
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This model adds the interaction between the first two factors, hour and month of departure. The interaction effect also has a low p-value, indicating that the effect of departure hour on arrival delay likely differs from month to month.
Between these four models, it is likely that month of departure, hour of departure, and airline carrier can all explain the variation in arrival delay times.
Tukey's HSD test is used next to compare the levels of each factor pairwise. When the confidence interval for a pair of levels includes zero (the plotted line crosses zero) or the adjusted p-value is greater than .05, we fail to reject the null hypothesis that there is no difference in the means of that pair. When the interval does not include zero or the adjusted p-value is small, that indicates there is likely a difference in means between the two factor levels.
tukey1<-TukeyHSD(aov(data$arr_delay~data$hour))
tukey1
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = data$arr_delay ~ data$hour)
##
## $`data$hour`
## diff lwr upr p adj
## 1-0 9.384e+01 48.73775 138.947 0.0000
## 2-0 1.153e+02 55.84483 174.697 0.0000
## 3-0 2.011e+02 54.51325 347.743 0.0002
## 5-0 -2.210e+02 -244.66663 -197.358 0.0000
## 6-0 -2.175e+02 -240.81370 -194.211 0.0000
## 7-0 -2.177e+02 -241.03034 -194.384 0.0000
## 8-0 -2.153e+02 -238.55511 -191.972 0.0000
## 9-0 -2.119e+02 -235.23486 -188.478 0.0000
## 10-0 -2.124e+02 -235.79138 -189.018 0.0000
## 11-0 -2.141e+02 -237.44118 -190.774 0.0000
## 12-0 -2.092e+02 -232.58818 -185.834 0.0000
## 13-0 -2.068e+02 -230.18113 -183.422 0.0000
## 14-0 -2.068e+02 -230.18355 -183.429 0.0000
## 15-0 -2.067e+02 -230.00751 -183.401 0.0000
## 16-0 -2.047e+02 -228.02200 -181.374 0.0000
## 17-0 -2.008e+02 -224.12104 -177.486 0.0000
## 18-0 -1.953e+02 -218.64949 -172.035 0.0000
## 19-0 -1.931e+02 -216.46910 -169.757 0.0000
## 20-0 -1.892e+02 -212.59996 -165.800 0.0000
## 21-0 -1.787e+02 -202.23559 -155.239 0.0000
## 22-0 -1.254e+02 -149.67268 -101.179 0.0000
## 23-0 -5.734e+01 -83.33916 -31.332 0.0000
## 24-0 -2.337e+01 -99.37860 52.635 1.0000
## 2-1 2.143e+01 -45.58722 88.444 1.0000
## 3-1 1.073e+02 -42.56616 257.138 0.5937
## 5-1 -3.149e+02 -353.83127 -275.878 0.0000
## 6-1 -3.114e+02 -350.11812 -272.591 0.0000
## 7-1 -3.115e+02 -350.32606 -272.773 0.0000
## 8-1 -3.091e+02 -347.86340 -270.348 0.0000
## 9-1 -3.057e+02 -344.50864 -266.889 0.0000
## 10-1 -3.062e+02 -345.06180 -267.432 0.0000
## 11-1 -3.080e+02 -346.73276 -269.167 0.0000
## 12-1 -3.031e+02 -341.86250 -264.245 0.0000
## 13-1 -3.006e+02 -339.45439 -261.834 0.0000
## 14-1 -3.006e+02 -339.45773 -261.840 0.0000
## 15-1 -3.005e+02 -339.31113 -261.782 0.0000
## 16-1 -2.985e+02 -337.31740 -259.764 0.0000
## 17-1 -2.946e+02 -333.41899 -255.873 0.0000
## 18-1 -2.892e+02 -327.95154 -250.418 0.0000
## 19-1 -2.870e+02 -325.75176 -248.159 0.0000
## 20-1 -2.830e+02 -321.86506 -244.220 0.0000
## 21-1 -2.726e+02 -311.46171 -233.697 0.0000
## 22-1 -2.193e+02 -258.60731 -179.930 0.0000
## 23-1 -1.512e+02 -191.62335 -110.732 0.0000
## 24-1 -1.172e+02 -199.29154 -35.137 0.0001
## 3-2 8.586e+01 -68.90921 240.623 0.9421
## 5-2 -3.363e+02 -391.20333 -281.363 0.0000
## 6-2 -3.328e+02 -387.55228 -278.014 0.0000
## 7-2 -3.330e+02 -387.75639 -278.200 0.0000
## 8-2 -3.305e+02 -385.29927 -275.770 0.0000
## 9-2 -3.271e+02 -381.92928 -272.326 0.0000
## 10-2 -3.277e+02 -382.48095 -272.870 0.0000
## 11-2 -3.294e+02 -384.16126 -274.596 0.0000
## 12-2 -3.245e+02 -379.28338 -269.681 0.0000
## 13-2 -3.221e+02 -376.87481 -267.270 0.0000
## 14-2 -3.221e+02 -376.87855 -267.276 0.0000
## 15-2 -3.220e+02 -376.74494 -267.205 0.0000
## 16-2 -3.200e+02 -374.74759 -265.190 0.0000
## 17-2 -3.161e+02 -370.85031 -261.299 0.0000
## 18-2 -3.106e+02 -365.38466 -255.842 0.0000
## 19-2 -3.084e+02 -363.17633 -253.592 0.0000
## 20-2 -3.045e+02 -359.28187 -249.660 0.0000
## 21-2 -2.940e+02 -348.86122 -239.155 0.0000
## 22-2 -2.407e+02 -295.87479 -185.519 0.0000
## 23-2 -1.726e+02 -228.57854 -116.634 0.0000
## 24-2 -1.386e+02 -229.38267 -47.903 0.0000
## 5-3 -4.221e+02 -566.98752 -277.293 0.0000
## 6-3 -4.186e+02 -563.43036 -273.851 0.0000
## 7-3 -4.188e+02 -563.62869 -274.042 0.0000
## 8-3 -4.164e+02 -561.17991 -271.603 0.0000
## 9-3 -4.130e+02 -557.78698 -268.182 0.0000
## 10-3 -4.135e+02 -558.33641 -268.729 0.0000
## 11-3 -4.152e+02 -560.03081 -270.441 0.0000
## 12-3 -4.103e+02 -555.14144 -265.537 0.0000
## 13-3 -4.079e+02 -552.73217 -263.127 0.0000
## 14-3 -4.079e+02 -552.73653 -263.132 0.0000
## 15-3 -4.078e+02 -552.62248 -263.042 0.0000
## 16-3 -4.058e+02 -550.61968 -261.033 0.0000
## 17-3 -4.019e+02 -546.72409 -257.139 0.0000
## 18-3 -3.965e+02 -541.26117 -251.680 0.0000
## 19-3 -3.942e+02 -539.03995 -249.443 0.0000
## 20-3 -3.903e+02 -535.13378 -245.522 0.0000
## 21-3 -3.799e+02 -524.68701 -235.044 0.0000
## 22-3 -3.266e+02 -471.49920 -181.609 0.0000
## 23-3 -2.585e+02 -403.71282 -113.214 0.0000
## 24-3 -2.245e+02 -386.35852 -62.641 0.0001
## 6-5 3.500e+00 -1.76199 8.762 0.7363
## 7-5 3.305e+00 -2.05265 8.663 0.8481
## 8-5 5.749e+00 0.53002 10.968 0.0132
## 9-5 9.156e+00 3.56289 14.749 0.0000
## 10-5 8.608e+00 2.97956 14.236 0.0000
## 11-5 6.905e+00 1.50184 12.308 0.0008
## 12-5 1.180e+01 6.21389 17.388 0.0000
## 13-5 1.421e+01 8.61253 19.809 0.0000
## 14-5 1.421e+01 8.61745 19.795 0.0000
## 15-5 1.431e+01 9.03726 19.579 0.0000
## 16-5 1.631e+01 10.95301 21.676 0.0000
## 17-5 2.021e+01 14.87549 25.542 0.0000
## 18-5 2.567e+01 20.38192 30.958 0.0000
## 19-5 2.790e+01 22.40040 33.398 0.0000
## 20-5 3.181e+01 26.12908 37.496 0.0000
## 21-5 4.228e+01 36.19963 48.351 0.0000
## 22-5 9.559e+01 87.06367 104.109 0.0000
## 23-5 1.637e+02 150.99106 176.363 0.0000
## 24-5 1.976e+02 125.10240 270.178 0.0000
## 7-6 -1.948e-01 -3.67770 3.288 1.0000
## 8-6 2.249e+00 -1.01619 5.514 0.6721
## 9-6 5.656e+00 1.82113 9.490 0.0000
## 10-6 5.108e+00 1.22181 8.994 0.0004
## 11-6 3.405e+00 -0.14709 6.957 0.0810
## 12-6 8.301e+00 4.47474 12.127 0.0000
## 13-6 1.071e+01 6.86830 14.553 0.0000
## 14-6 1.071e+01 6.87764 14.535 0.0000
## 15-6 1.081e+01 7.46051 14.156 0.0000
## 16-6 1.281e+01 9.32608 16.303 0.0000
## 17-6 1.671e+01 13.26372 20.154 0.0000
## 18-6 2.217e+01 18.79532 25.545 0.0000
## 19-6 2.440e+01 20.70314 28.095 0.0000
## 20-6 2.831e+01 24.34698 32.278 0.0000
## 21-6 3.878e+01 34.26558 43.285 0.0000
## 22-6 9.209e+01 84.59905 99.574 0.0000
## 23-6 1.602e+02 148.16216 172.192 0.0000
## 24-6 1.941e+02 121.71681 266.564 0.0000
## 8-7 2.444e+00 -0.97369 5.861 0.5963
## 9-7 5.851e+00 1.88545 9.816 0.0000
## 10-7 5.303e+00 1.28779 9.317 0.0004
## 11-7 3.600e+00 -0.09278 7.292 0.0673
## 12-7 8.496e+00 4.53878 12.453 0.0000
## 13-7 1.091e+01 6.93287 14.878 0.0000
## 14-7 1.090e+01 6.94175 14.860 0.0000
## 15-7 1.100e+01 7.50660 14.499 0.0000
## 16-7 1.301e+01 9.37792 16.640 0.0000
## 17-7 1.690e+01 13.31384 20.494 0.0000
## 18-7 2.236e+01 18.84255 25.887 0.0000
## 19-7 2.459e+01 20.76273 28.425 0.0000
## 20-7 2.851e+01 24.41546 32.599 0.0000
## 21-7 3.897e+01 34.34892 43.591 0.0000
## 22-7 9.228e+01 84.72619 99.836 0.0000
## 23-7 1.604e+02 148.31469 172.429 0.0000
## 24-7 1.943e+02 121.90459 266.766 0.0000
## 9-8 3.407e+00 -0.36836 7.182 0.1467
## 10-8 2.859e+00 -0.96848 6.686 0.5026
## 11-8 1.156e+00 -2.33177 4.643 1.0000
## 12-8 6.052e+00 2.28537 9.819 0.0000
## 13-8 8.462e+00 4.67869 12.245 0.0000
## 14-8 8.457e+00 4.68825 12.226 0.0000
## 15-8 8.559e+00 5.27983 11.839 0.0000
## 16-8 1.057e+01 7.14259 13.988 0.0000
## 17-8 1.446e+01 11.08106 17.839 0.0000
## 18-8 1.992e+01 16.61407 23.228 0.0000
## 19-8 2.215e+01 18.51591 25.785 0.0000
## 20-8 2.606e+01 22.15550 29.972 0.0000
## 21-8 3.653e+01 32.06709 40.986 0.0000
## 22-8 8.984e+01 82.38040 97.294 0.0000
## 23-8 1.579e+02 145.93210 169.924 0.0000
## 24-8 1.919e+02 119.47104 264.312 0.0000
## 10-9 -5.481e-01 -4.87151 3.775 1.0000
## 11-9 -2.251e+00 -6.27688 1.775 0.9373
## 12-9 2.645e+00 -1.62469 6.915 0.8428
## 13-9 5.055e+00 0.77055 9.339 0.0041
## 14-9 5.050e+00 0.77843 9.322 0.0040
## 15-9 5.152e+00 1.30548 8.999 0.0003
## 16-9 7.159e+00 3.18869 11.128 0.0000
## 17-9 1.105e+01 7.12105 14.985 0.0000
## 18-9 1.651e+01 12.64379 20.385 0.0000
## 19-9 1.874e+01 14.58977 22.897 0.0000
## 20-9 2.266e+01 18.26158 27.052 0.0000
## 21-9 3.312e+01 28.22774 38.011 0.0000
## 22-9 8.643e+01 78.70712 94.154 0.0000
## 23-9 1.545e+02 142.35782 166.684 0.0000
## 24-9 1.885e+02 116.03623 260.933 0.0000
## 11-10 -1.703e+00 -5.77767 2.372 0.9983
## 12-10 3.193e+00 -1.12272 7.509 0.5233
## 13-10 5.603e+00 1.27267 9.933 0.0006
## 14-10 5.598e+00 1.28043 9.916 0.0006
## 15-10 5.700e+00 1.80245 9.599 0.0000
## 16-10 7.707e+00 3.68721 11.726 0.0000
## 17-10 1.160e+01 7.61910 15.583 0.0000
## 18-10 1.706e+01 13.14106 20.984 0.0000
## 19-10 1.929e+01 15.09047 23.493 0.0000
## 20-10 2.320e+01 18.76485 27.645 0.0000
## 21-10 3.367e+01 28.73551 38.599 0.0000
## 22-10 8.698e+01 79.22960 94.728 0.0000
## 23-10 1.551e+02 142.88963 167.249 0.0000
## 24-10 1.890e+02 116.58156 261.484 0.0000
## 12-11 4.896e+00 0.87842 8.914 0.0022
## 13-11 7.306e+00 3.27275 11.339 0.0000
## 14-11 7.301e+00 3.28143 11.321 0.0000
## 15-11 7.403e+00 3.83838 10.969 0.0000
## 16-11 9.410e+00 5.71220 13.107 0.0000
## 17-11 1.330e+01 9.64738 16.961 0.0000
## 18-11 1.877e+01 15.17483 22.356 0.0000
## 19-11 2.099e+01 17.10040 24.889 0.0000
## 20-11 2.491e+01 20.75708 29.058 0.0000
## 21-11 3.537e+01 30.69719 40.044 0.0000
## 22-11 8.868e+01 81.09461 96.268 0.0000
## 23-11 1.568e+02 144.69505 168.849 0.0000
## 24-11 1.907e+02 118.30169 263.170 0.0000
## 13-12 2.410e+00 -1.86736 6.687 0.9324
## 14-12 2.405e+00 -1.85945 6.669 0.9318
## 15-12 2.507e+00 -1.33158 6.346 0.7669
## 16-12 4.513e+00 0.55136 8.475 0.0077
## 17-12 8.408e+00 4.48380 12.332 0.0000
## 18-12 1.387e+01 10.00667 17.731 0.0000
## 19-12 1.610e+01 11.95209 20.244 0.0000
## 20-12 2.001e+01 15.62348 24.399 0.0000
## 21-12 3.047e+01 25.58890 35.359 0.0000
## 22-12 8.379e+01 76.06591 91.504 0.0000
## 23-12 1.519e+02 139.71511 164.037 0.0000
## 24-12 1.858e+02 113.39134 258.287 0.0000
## 14-13 -4.732e-03 -4.28364 4.274 1.0000
## 15-13 9.740e-02 -3.75737 3.952 1.0000
## 16-13 2.104e+00 -1.87392 6.081 0.9650
## 17-13 5.998e+00 2.05836 9.938 0.0000
## 18-13 1.146e+01 7.58098 15.338 0.0000
## 19-13 1.369e+01 9.52750 17.849 0.0000
## 20-13 1.760e+01 13.19970 22.004 0.0000
## 21-13 2.806e+01 23.16656 32.962 0.0000
## 22-13 8.138e+01 73.64819 89.103 0.0000
## 23-13 1.495e+02 137.30032 161.632 0.0000
## 24-13 1.834e+02 110.98079 255.879 0.0000
## 15-14 1.021e-01 -3.73860 3.943 1.0000
## 16-14 2.108e+00 -1.85559 6.072 0.9628
## 17-14 6.003e+00 2.07682 9.929 0.0000
## 18-14 1.146e+01 7.59966 15.328 0.0000
## 19-14 1.369e+01 9.54522 17.841 0.0000
## 20-14 1.761e+01 13.21672 21.996 0.0000
## 21-14 2.807e+01 23.18232 32.956 0.0000
## 22-14 8.138e+01 73.65991 89.101 0.0000
## 23-14 1.495e+02 137.30949 161.632 0.0000
## 24-14 1.834e+02 110.98627 255.883 0.0000
## 16-15 2.006e+00 -1.49563 5.508 0.9206
## 17-15 5.901e+00 2.44184 9.360 0.0000
## 18-15 1.136e+01 7.97315 14.751 0.0000
## 19-15 1.359e+01 9.88218 17.300 0.0000
## 20-15 1.750e+01 13.52689 21.482 0.0000
## 21-15 2.797e+01 23.44692 32.487 0.0000
## 22-15 8.128e+01 73.78454 88.772 0.0000
## 23-15 1.494e+02 137.35003 161.387 0.0000
## 24-15 1.833e+02 110.90795 255.757 0.0000
## 17-16 3.895e+00 0.29951 7.490 0.0170
## 18-16 9.356e+00 5.82812 12.884 0.0000
## 19-16 1.158e+01 7.74872 15.421 0.0000
## 20-16 1.550e+01 11.40177 19.595 0.0000
## 21-16 2.596e+01 21.33574 30.586 0.0000
## 22-16 7.927e+01 71.71458 86.829 0.0000
## 23-16 1.474e+02 135.30401 159.421 0.0000
## 24-16 1.813e+02 108.89520 253.757 0.0000
## 18-17 5.461e+00 1.97616 8.946 0.0000
## 19-17 7.690e+00 3.89330 11.487 0.0000
## 20-17 1.160e+01 7.54384 15.663 0.0000
## 21-17 2.207e+01 17.47359 26.659 0.0000
## 22-17 7.538e+01 67.83981 82.915 0.0000
## 23-17 1.435e+02 131.42184 155.514 0.0000
## 24-17 1.774e+02 105.00270 249.861 0.0000
## 19-18 2.229e+00 -1.50427 5.962 0.8844
## 20-18 6.142e+00 2.14208 10.143 0.0000
## 21-18 1.661e+01 12.06484 21.145 0.0000
## 22-18 6.992e+01 62.41043 77.422 0.0000
## 23-18 1.380e+02 125.98050 150.033 0.0000
## 24-18 1.720e+02 99.54475 244.396 0.0000
## 20-19 3.913e+00 -0.36148 8.188 0.1287
## 21-19 1.438e+01 9.59217 19.160 0.0000
## 22-19 6.769e+01 60.03155 75.343 0.0000
## 23-19 1.358e+02 123.65737 147.898 0.0000
## 24-19 1.697e+02 97.30001 242.183 0.0000
## 21-20 1.046e+01 5.46775 15.458 0.0000
## 22-20 6.377e+01 55.98458 71.563 0.0000
## 23-20 1.319e+02 119.65920 144.070 0.0000
## 24-20 1.658e+02 93.37245 238.284 0.0000
## 22-21 5.331e+01 45.23130 61.391 0.0000
## 23-21 1.214e+02 109.00904 133.794 0.0000
## 24-21 1.554e+02 82.87795 227.853 0.0000
## 23-22 6.809e+01 54.33198 81.849 0.0000
## 24-22 1.021e+02 29.32088 174.788 0.0001
## 24-23 3.396e+01 -39.37424 107.301 0.9930
This output is long since there are 24 levels of hour. Relatively few pairs show no significant difference in means, such as 24-0 (which makes sense, since hour 0 and hour 24 both denote midnight). Other pairs where there is likely no difference in means include 2-1, 3-1, 3-2, 6-5, 7-5, and 7-6, all of which logically make sense since they are consecutive or nearby hours.
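Rather than reading all 276 rows, the pairs whose interval contains zero can be pulled out programmatically. The sketch below uses a made-up three-group example; `toy` and `not_different` are hypothetical names. With the formula style used in this document, the result component would be named `data$hour` (accessed as ``tukey1$`data$hour` ``) rather than `g`.

```r
# Made-up toy data: groups "a" and "b" are similar, "c" is clearly higher
toy <- data.frame(
  g = factor(rep(c("a", "b", "c"), each = 5)),
  y = c(1.0, 2.0, 3.0, 4.0, 5.0,
        1.5, 2.5, 3.5, 4.5, 5.5,
        21, 22, 23, 24, 25)
)

tk  <- TukeyHSD(aov(y ~ g, data = toy))
tab <- as.data.frame(tk$g)   # one row per pair: diff, lwr, upr, adjusted p

# Pairs whose 95% interval contains zero (no detectable difference in means)
not_different <- rownames(tab)[tab$lwr < 0 & tab$upr > 0]
print(not_different)
```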
tukey2<-TukeyHSD(aov(data$arr_delay~data$month))
tukey2
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = data$arr_delay ~ data$month)
##
## $`data$month`
## diff lwr upr p adj
## 2-1 -2.97712 -6.4577 0.5034 0.1817
## 3-1 -3.02501 -6.3683 0.3182 0.1215
## 4-1 6.51823 3.2098 9.8266 0.0000
## 5-1 -2.30604 -5.5639 0.9519 0.4670
## 6-1 11.22107 7.9680 14.4742 0.0000
## 7-1 9.37781 6.1454 12.6102 0.0000
## 8-1 1.43344 -1.7742 4.6411 0.9510
## 9-1 -6.91035 -10.1591 -3.6616 0.0000
## 10-1 -3.68270 -6.8918 -0.4736 0.0097
## 11-1 -1.32564 -4.6010 1.9497 0.9764
## 12-1 11.74712 8.4504 15.0438 0.0000
## 3-2 -0.04789 -3.4895 3.3937 1.0000
## 4-2 9.49536 6.0876 12.9031 0.0000
## 5-2 0.67109 -2.6876 4.0298 1.0000
## 6-2 14.19819 10.8441 17.5523 0.0000
## 7-2 12.35493 9.0210 15.6889 0.0000
## 8-2 4.41056 1.1005 7.7206 0.0008
## 9-2 -3.93323 -7.2831 -0.5834 0.0069
## 10-2 -0.70558 -4.0170 2.6059 0.9999
## 11-2 1.65149 -1.7241 5.0271 0.9100
## 12-2 14.72425 11.3279 18.1206 0.0000
## 4-3 9.54324 6.2758 12.8106 0.0000
## 5-3 0.71897 -2.4973 3.9352 0.9999
## 6-3 14.24608 11.0347 17.4575 0.0000
## 7-3 12.40282 9.2124 15.5932 0.0000
## 8-3 4.45845 1.2931 7.6238 0.0003
## 9-3 -3.88534 -7.0923 -0.6784 0.0043
## 10-3 -0.65769 -3.8245 2.5091 0.9999
## 11-3 1.69938 -1.5345 4.9333 0.8607
## 12-3 14.77213 11.5166 18.0277 0.0000
## 5-4 -8.82427 -12.0043 -5.6442 0.0000
## 6-4 4.70284 1.5277 7.8780 0.0001
## 7-4 2.85958 -0.2943 6.0134 0.1195
## 8-4 -5.08479 -8.2133 -1.9563 0.0000
## 9-4 -13.42858 -16.5992 -10.2580 0.0000
## 10-4 -10.20093 -13.3310 -7.0709 0.0000
## 11-4 -7.84387 -11.0417 -4.6460 0.0000
## 12-4 5.22889 2.0091 8.4486 0.0000
## 6-5 13.52711 10.4047 16.6496 0.0000
## 7-5 11.68385 8.5830 14.7847 0.0000
## 8-5 3.73948 0.6644 6.8145 0.0041
## 9-5 -4.60431 -7.7222 -1.4864 0.0001
## 10-5 -1.37666 -4.4533 1.6999 0.9506
## 11-5 0.98040 -2.1652 4.1260 0.9973
## 12-5 14.05316 10.8853 17.2210 0.0000
## 7-6 -1.84326 -4.9391 1.2525 0.7301
## 8-6 -9.78763 -12.8576 -6.7176 0.0000
## 9-6 -18.13142 -21.2443 -15.0185 0.0000
## 10-6 -14.90377 -17.9753 -11.8322 0.0000
## 11-6 -12.54670 -15.6873 -9.4061 0.0000
## 12-6 0.52605 -2.6369 3.6890 1.0000
## 8-7 -7.94437 -10.9924 -4.8964 0.0000
## 9-7 -16.28816 -19.3794 -13.1970 0.0000
## 10-7 -13.06051 -16.1101 -10.0110 0.0000
## 11-7 -10.70344 -13.8226 -7.5843 0.0000
## 12-7 2.36931 -0.7722 5.5109 0.3624
## 9-8 -8.34379 -11.4091 -5.2784 0.0000
## 10-8 -5.11614 -8.1395 -2.0928 0.0000
## 11-8 -2.75907 -5.8526 0.3344 0.1355
## 12-8 10.31368 7.1975 13.4298 0.0000
## 10-9 3.22765 0.1608 6.2945 0.0289
## 11-9 5.58472 2.4486 8.7208 0.0000
## 12-9 18.65747 15.4991 21.8159 0.0000
## 11-10 2.35707 -0.7380 5.4521 0.3468
## 12-10 15.42982 12.3122 18.5475 0.0000
## 12-11 13.07276 9.8870 16.2585 0.0000
The pattern for months that have differences in means is not as evident, although just over half the pairings seem to have differences in mean arrival delay time. Further analysis could look into whether the differences are related to weather patterns or to seasons with increased travel.
tukey3<-TukeyHSD(aov(data$arr_delay~data$carrier))
tukey3
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = data$arr_delay ~ data$carrier)
##
## $`data$carrier`
## diff lwr upr p adj
## DL-AA 5.5818 4.174 6.9894 0.0000
## UA-AA 5.3313 3.996 6.6661 0.0000
## WN-AA 12.2108 9.080 15.3414 0.0000
## UA-DL -0.2505 -1.461 0.9599 0.9514
## WN-DL 6.6290 3.549 9.7087 0.0000
## WN-UA 6.8795 3.832 9.9266 0.0000
plot(tukey3)
For this Tukey test, we include the plot since there are fewer pair comparisons. The only pair where we fail to reject the null is UA and DL. All other airline carrier pairs have a significant difference in means.
We will check model 4 since it includes 2 of the 3 factors under study.
Visually inspect normality of data:
qqnorm(residuals(model4))
qqline(residuals(model4))
The residuals appear to deviate from a normal distribution.
Test normality with the Shapiro-Wilk test. (Note: shapiro.test() accepts at most 5000 observations. Because of this, a model identical to model 4 is fitted to a smaller set of data, obtained by taking a random sample of 5000 rows from the data originally used):
small <- data[sample(1:nrow(data), 5000, replace=FALSE),]
modelsmall=aov(small$arr_delay~small$hour*small$month+small$origin+small$dest)
summary(modelsmall)
## Df Sum Sq Mean Sq F value Pr(>F)
## small$hour 21 1646310 78396 52.12 < 2e-16 ***
## small$month 11 101436 9221 6.13 4.5e-10 ***
## small$origin 2 34442 17221 11.45 1.1e-05 ***
## small$dest 4 86536 21634 14.38 1.1e-11 ***
## small$hour:small$month 194 636103 3279 2.18 < 2e-16 ***
## Residuals 4767 7170087 1504
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
shapiro.test(residuals(modelsmall))
##
## Shapiro-Wilk normality test
##
## data: residuals(modelsmall)
## W = 0.7592, p-value < 2.2e-16
Null hypothesis: The data came from a normally distributed population. We reject the null. We cannot assume the data is normal. This will be addressed in the contingencies section below.
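The sample-size limit and the random draw can be illustrated on their own. In this sketch, `x` is simulated right-skewed data standing in for the residuals (an assumption for illustration, not the real residuals), and the seed value is arbitrary; setting a seed makes the 5000-row sample repeatable between runs.

```r
set.seed(42)                      # arbitrary seed, so the draw is repeatable
x <- rexp(6000)                   # made-up, right-skewed stand-in for residuals

# shapiro.test() accepts between 3 and 5000 observations, so this call fails
too_big <- tryCatch(shapiro.test(x), error = function(e) conditionMessage(e))
print(too_big)

# Draw 5000 values without replacement and test the subsample instead
idx <- sample(seq_along(x), 5000, replace = FALSE)
res <- shapiro.test(x[idx])
print(res$p.value < 0.05)
```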
plot(fitted(model4),residuals(model4))
The model may not be a good fit since the residuals are clustered and not distributed across the dynamic range.
We create interaction plots to view the interactions between each pair of factors.
interaction.plot(data$hour, data$month, data$arr_delay)
interaction.plot(data$hour, data$carrier, data$arr_delay)
interaction.plot(data$carrier, data$month, data$arr_delay)
There is interaction among all the factors, evident by the different slopes and intersecting lines.
Since the data did not fulfill the normality assumption of the ANOVA model, a Kruskal-Wallis rank sum test (a nonparametric one-way analysis of variance on ranks) is performed for each factor:
kruskal.test(data$arr_delay~data$month)
##
## Kruskal-Wallis rank sum test
##
## data: data$arr_delay by data$month
## Kruskal-Wallis chi-squared = 1561, df = 11, p-value < 2.2e-16
kruskal.test(data$arr_delay~data$hour)
##
## Kruskal-Wallis rank sum test
##
## data: data$arr_delay by data$hour
## Kruskal-Wallis chi-squared = 3047, df = 23, p-value < 2.2e-16
kruskal.test(data$arr_delay~data$carrier)
##
## Kruskal-Wallis rank sum test
##
## data: data$arr_delay by data$carrier
## Kruskal-Wallis chi-squared = 583.8, df = 3, p-value < 2.2e-16
The null hypothesis of the Kruskal-Wallis test is that the mean ranks of the samples from the populations are expected to be the same (this is not the same as saying the populations have identical means). Since each test results in a low p-value, we reject this null hypothesis. It is likely that month, hour, and carrier can each explain some of the variation in arrival delay times.
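To make "mean ranks" concrete, the Kruskal-Wallis statistic can be computed by hand from the pooled ranks and compared with `kruskal.test()`. The sketch uses made-up values with no ties, where the textbook formula applies directly; the real arrival delays contain many ties, for which `kruskal.test()` applies an additional correction that is not shown here.

```r
# Made-up toy data with no tied values
toy <- data.frame(
  g = factor(rep(c("a", "b", "c"), times = c(4, 4, 4))),
  y = c(1.1, 2.3, 3.8, 4.0,
        2.9, 5.2, 6.1, 7.7,
        8.4, 9.0, 10.5, 11.2)
)

r   <- rank(toy$y)                  # ranks over the pooled sample
n   <- length(r)
R_i <- tapply(r, toy$g, sum)        # rank sum per group
n_i <- tapply(r, toy$g, length)     # group sizes

# H = 12 / (N (N + 1)) * sum(R_i^2 / n_i) - 3 (N + 1), valid without ties
H_manual  <- 12 / (n * (n + 1)) * sum(R_i^2 / n_i) - 3 * (n + 1)
H_builtin <- unname(kruskal.test(y ~ g, data = toy)$statistic)

cat("manual:  ", H_manual, "\n")
cat("built-in:", H_builtin, "\n")
```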
None used.
Data is from the nycflights13 package.
All included above.