Recipe for Blocked Designs with multiple explanatory and nuisance factors

Ali Svoboda

RPI

10/23/14 V.1

1. Setting

System under test

For this recipe, we will examine the flights dataset from the nycflights13 package.

Read in data set:

# Install once; naming a CRAN mirror avoids the "trying to use CRAN without setting a mirror" error
install.packages("nycflights13", repos="https://cloud.r-project.org")
# Load the package from the default library path
library("nycflights13")
x<-flights

Create subset of data with only some of the destination cities and airlines:

xx<-subset(x, dest %in% c("ORD","BOS","DEN","LAX","ATL"))
xxx<-subset(xx, carrier %in% c("AA","DL","UA","WN"))

Some observations in the dataset contain missing values (NA). We omit these rows before further analysis:

data<-na.omit(xxx)
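Before dropping rows, it can help to see which columns carry the missing values and how many rows are lost. The following is a minimal sketch on a made-up toy data frame (the name toy and its values are illustrative assumptions, not part of the flights data):

```r
# Toy data frame standing in for the flights subset (values are made up)
toy <- data.frame(
  arr_delay = c(5, NA, -3, 12),
  air_time  = c(116, 150, NA, 345),
  carrier   = c("AA", "DL", "UA", "WN")
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Rows that survive na.omit(), and how many rows were dropped
clean <- na.omit(toy)
n_dropped <- nrow(toy) - nrow(clean)
print(n_dropped)
```

The same two calls could be applied to xxx to check how much data na.omit removes.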

Factors and Levels

Month (12 levels): month in which each flight took off

Hour (24 levels): hour of the day the plane departed (24-hour time)

Carrier (4 levels: AA, DL, UA, WN): airline name

Origin city (3 levels: EWR, JFK, LGA): airport each flight departed from

Destination city (dest) (5 levels: ORD, BOS, LAX, DEN, ATL): airport each flight landed at

Set up variables as factors:

data$month=as.factor(data$month)
data$hour=as.factor(data$hour)
data$carrier=as.factor(data$carrier)
data$origin=as.factor(data$origin)
data$dest=as.factor(data$dest)
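The five conversions above can also be written as one step, followed by a check of the number of levels. This is a sketch on a small made-up data frame; the names toy and factor_cols are assumptions for illustration only:

```r
# Toy data frame with the same kinds of columns (values are made up)
toy <- data.frame(
  month   = c(1, 1, 2),
  hour    = c(5, 6, 6),
  carrier = c("AA", "DL", "AA"),
  stringsAsFactors = FALSE
)

# Convert several columns to factors in one step
factor_cols <- c("month", "hour", "carrier")
toy[factor_cols] <- lapply(toy[factor_cols], as.factor)

# Confirm the number of levels of each factor
level_counts <- sapply(toy[factor_cols], nlevels)
print(level_counts)
```

Checking nlevels after subsetting is a quick way to confirm that each factor has the number of levels the design expects.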

Continuous Variables

Departure and arrival time, departure and arrival delay time, air time, and distance are all continuous variables.

Response Variables

Arrival delay time (arr_delay), measured in minutes, will be the response variable for this experiment.

The Data: How is it organized and what does it look like?

With the NAs omitted and carriers and destinations subsetted, the dataset contains 46,843 flight observations of 16 variables.

Structure, summary, and more on the dataset:

str(data)
## Classes 'tbl_df', 'tbl' and 'data.frame':    46843 obs. of  16 variables:
##  $ year     : int  2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
##  $ month    : Factor w/ 12 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ day      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ dep_time : int  554 554 558 558 606 615 628 629 646 656 ...
##  $ dep_delay: num  -6 -4 -2 -2 -4 0 -2 -1 1 -4 ...
##  $ arr_time : int  812 740 753 924 837 833 1016 824 910 854 ...
##  $ arr_delay: num  -25 12 8 7 -8 -9 29 14 -6 4 ...
##  $ carrier  : Factor w/ 4 levels "AA","DL","UA",..: 2 3 1 3 2 2 3 1 3 1 ...
##  $ tailnum  : chr  "N668DN" "N39463" "N3ALAA" "N29129" ...
##  $ flight   : int  461 1696 301 194 1743 575 1665 303 883 305 ...
##  $ origin   : Factor w/ 3 levels "EWR","JFK","LGA": 3 1 3 2 2 1 1 3 3 3 ...
##  $ dest     : Factor w/ 5 levels "ATL","BOS","DEN",..: 1 5 5 4 1 1 4 5 3 5 ...
##  $ air_time : num  116 150 138 345 128 120 366 140 243 143 ...
##  $ distance : num  762 719 733 2475 760 ...
##  $ hour     : Factor w/ 24 levels "0","1","2","3",..: 5 5 5 5 6 6 6 6 6 6 ...
##  $ minute   : num  54 54 58 58 6 15 28 29 46 56 ...
##  - attr(*, "na.action")=Class 'omit'  Named int [1:851] 243 244 364 365 792 836 1153 1274 1324 1355 ...
##   .. ..- attr(*, "names")= chr [1:851] "1783" "1785" "2691" "2692" ...
summary(data)
##       year          month            day          dep_time   
##  Min.   :2013   8      : 4335   Min.   : 1.0   Min.   :   1  
##  1st Qu.:2013   10     : 4326   1st Qu.: 8.0   1st Qu.: 858  
##  Median :2013   7      : 4189   Median :16.0   Median :1346  
##  Mean   :2013   9      : 4097   Mean   :15.7   Mean   :1330  
##  3rd Qu.:2013   6      : 4073   3rd Qu.:23.0   3rd Qu.:1730  
##  Max.   :2013   5      : 4047   Max.   :31.0   Max.   :2400  
##                 (Other):21776                                
##    dep_delay      arr_time      arr_delay     carrier   
##  Min.   :-20   Min.   :   1   Min.   :-75.0   AA:10820  
##  1st Qu.: -4   1st Qu.:1059   1st Qu.:-19.0   DL:14936  
##  Median : -1   Median :1523   Median : -7.0   UA:19650  
##  Mean   : 11   Mean   :1498   Mean   :  3.7   WN: 1437  
##  3rd Qu.:  8   3rd Qu.:1930   3rd Qu.: 11.0             
##  Max.   :898   Max.   :2400   Max.   :895.0             
##                                                         
##    tailnum              flight     origin       dest          air_time  
##  Length:46843       Min.   :   1   EWR:17151   ATL:10612   Min.   : 26  
##  Class :character   1st Qu.: 337   JFK:12683   BOS: 5686   1st Qu.:107  
##  Mode  :character   Median : 708   LGA:17009   DEN: 6152   Median :120  
##                     Mean   : 886               LAX:11803   Mean   :173  
##                     3rd Qu.:1377               ORD:12590   3rd Qu.:289  
##                     Max.   :4454                           Max.   :440  
##                                                                         
##     distance         hour           minute    
##  Min.   : 187   8      : 4106   Min.   : 0.0  
##  1st Qu.: 733   6      : 3772   1st Qu.:16.0  
##  Median : 762   15     : 3709   Median :34.0  
##  Mean   :1225   18     : 3593   Mean   :33.1  
##  3rd Qu.:2454   17     : 3320   3rd Qu.:53.0  
##  Max.   :2475   7      : 3188   Max.   :59.0  
##                 (Other):25155
head(data)
##    year month day dep_time dep_delay arr_time arr_delay carrier tailnum
## 5  2013     1   1      554        -6      812       -25      DL  N668DN
## 6  2013     1   1      554        -4      740        12      UA  N39463
## 10 2013     1   1      558        -2      753         8      AA  N3ALAA
## 13 2013     1   1      558        -2      924         7      UA  N29129
## 24 2013     1   1      606        -4      837        -8      DL  N3739P
## 30 2013     1   1      615         0      833        -9      DL  N326NB
##    flight origin dest air_time distance hour minute
## 5     461    LGA  ATL      116      762    5     54
## 6    1696    EWR  ORD      150      719    5     54
## 10    301    LGA  ORD      138      733    5     58
## 13    194    JFK  LAX      345     2475    5     58
## 24   1743    JFK  ATL      128      760    6      6
## 30    575    EWR  ATL      120      746    6     15
?flights
## starting httpd help server ... done

2. (Experimental) Design

How will the experiment be organized and conducted to test the hypothesis?

A 3-factor analysis of variance will be performed with blocks on destination and origin city. Airline carrier, hour of departure, and month of departure will be tested to see if they can explain any of the variation in the arrival delay time.

What is the Rationale for this design?

This design was chosen to see if the time of day, the time of year, or the airline has an effect on a flight being delayed (i.e. find the best time of day, time of year, and airline to fly with). The goal is to find this without the effects of where you are traveling to or from, which is why origin and destination city are used as blocks.

Randomize: What is the Randomization Scheme?

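The flights dataset is a collection of observational records of flights departing from 3 NYC airports in 2013. Since the data were recorded rather than generated by a designed experiment, no randomization scheme was applied.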

Replicate: Are there replicates and/or repeated measures?

Since flights are not repeated, there are no replicates or repeated measures.

Block: Did you use blocking in the design?

Blocking was used for the origin and destination cities.

3. (Statistical) Analysis

(Exploratory Data Analysis) Graphics and descriptive summary

Mean of arrival delay time by month, hour, and carrier:

tapply(data$arr_delay, data$month, mean)
##       1       2       3       4       5       6       7       8       9 
##  1.9636 -1.0136 -1.0614  8.4818 -0.3425 13.1846 11.3414  3.3970 -4.9468 
##      10      11      12 
## -1.7191  0.6379 13.7107
tapply(data$arr_delay, data$hour, mean)
##        0        1        2        3        5        6        7        8 
## 208.8718 302.7143 324.1429 410.0000 -12.1404  -8.6405  -8.8353  -6.3916 
##        9       10       11       12       13       14       15       16 
##  -2.9847  -3.5328  -5.2358  -0.3394   2.0703   2.0656   2.1677   4.1738 
##       17       18       19       20       21       22       23       24 
##   8.0684  13.5296  15.7587  19.6720  30.1347  83.4458 151.5364 185.5000
tapply(data$arr_delay, data$carrier, mean)
##     AA     DL     UA     WN 
## -0.675  4.907  4.656 11.536
tapply(data$arr_delay, data$origin, mean)
##    EWR    JFK    LGA 
## 6.0166 0.5186 3.7798
tapply(data$arr_delay, data$dest, mean)
##    ATL    BOS    DEN    LAX    ORD 
## 7.4509 2.6581 6.7731 0.1246 2.9180

The highest mean arrival delays occur in June and December. Flights that depart just after midnight (hours 0 to 3) and late at night (hours 22 to 24) seem to have much higher arrival delays. Carrier AA has the lowest mean arrival delay (negative, indicating that flights arrive early on average) while WN has the highest.
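A group mean can be driven by only a few flights when a level is rare, so it may be worth tabulating the number of observations next to each mean. The sketch below uses a made-up toy data frame (toy and by_hour are illustrative names, and the values are not from the flights data):

```r
# Toy data: two late-night flights and six morning flights (values are made up)
toy <- data.frame(
  hour      = factor(c(0, 0, 8, 8, 8, 8, 8, 8)),
  arr_delay = c(200, 220, -5, 0, 3, -8, 10, -6)
)

# Count and mean of arrival delay for each hour, side by side
by_hour <- data.frame(
  n          = as.vector(table(toy$hour)),
  mean_delay = as.vector(tapply(toy$arr_delay, toy$hour, mean)),
  row.names  = levels(toy$hour)
)
print(by_hour)
```

Applied to data$hour, a table like this would show whether the very large means at hours 0 to 3 rest on many flights or only a few.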

Histogram of Arrival Delays:

hist(data$arr_delay, breaks=100, xlim= c(0,300))

Boxplots:

boxplot(data$arr_delay~data$month, xlab="month of departure", ylab="Arrival Delay")

boxplot(data$arr_delay~data$hour, xlab="hour of departure", ylab="Arrival Delay")

boxplot(data$arr_delay~data$carrier, xlab="airline carrier", ylab="Arrival Delay")

boxplot(data$arr_delay~data$origin, xlab="origin city", ylab="Arrival Delay")

boxplot(data$arr_delay~data$dest, xlab="destination city", ylab="Arrival Delay")

This dataset is difficult to analyze through boxplots due to the large number of outliers.
However, when looking at the boxplot of arrival delay by the hour the flight departed, it is evident that hours early in the day (0, 1, 2, 3) and late in the day (22, 23, 24) may have a significant impact on arrival delay.

Testing

ANOVA Models

Due to the large size of the dataset, each factor under study will be tested first in its own model with blocking for origin and destination. A model including all 3 factors and 2 blocks was too large to run in RStudio without crashing.

Model 1: Anova model for effect of month with blocking on origin and destination city:

model1=aov(data$arr_delay~data$month+data$origin+data$dest)
summary(model1)
##                Df   Sum Sq Mean Sq F value Pr(>F)    
## data$month     11  1781656  161969    87.9 <2e-16 ***
## data$origin     2   210823  105411    57.2 <2e-16 ***
## data$dest       4   296818   74204    40.3 <2e-16 ***
## Residuals   46825 86278901    1843                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The null hypothesis of the anova model is that mean arrival delay is the same in every month, i.e. that the variation in arrival delay time cannot be explained by anything other than random variation. Since the p-value for month is very low (< 2e-16), we reject the null hypothesis: month of departure likely explains some of the variation in arrival delay time.
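The F statistic and p-value can also be pulled out of the ANOVA table programmatically rather than read off the printout. The following sketch uses the npk dataset that ships with R (a small blocked experiment) as a stand-in for the flights model; fit, tab, and p_N are illustrative names:

```r
# npk (datasets package): pea yield by nitrogen treatment N within 6 blocks
fit <- aov(yield ~ N + block, data = npk)

# summary() returns a list; its first element is the ANOVA table
tab <- summary(fit)[[1]]
print(tab)

# Row names in the table are padded with spaces, so trim them before indexing
rownames(tab) <- trimws(rownames(tab))
p_N <- tab["N", "Pr(>F)"]
print(p_N)
```

Passing data = to aov, as done here, also keeps the term names short (N instead of npk$N) in the printed table.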

Model 2: Anova model for effect of hour with blocking on origin and destination:

model2=aov(data$arr_delay~data$hour+data$origin+data$dest)
summary(model2)
##                Df   Sum Sq Mean Sq F value Pr(>F)    
## data$hour      23 14396809  625948     402 <2e-16 ***
## data$origin     2   508348  254174     163 <2e-16 ***
## data$dest       4   781471  195368     125 <2e-16 ***
## Residuals   46813 72881569    1557                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Again, it is likely that the hour a plane departs can explain the variation in its arrival delay.

Model 3: Anova model for effect of carrier with blocking on origin and destination:

model3=aov(data$arr_delay~data$carrier+data$origin+data$dest)
summary(model3)
##                 Df   Sum Sq Mean Sq F value  Pr(>F)    
## data$carrier     3   335042  111681    59.5 < 2e-16 ***
## data$origin      2   110421   55210    29.4 1.7e-13 ***
## data$dest        4   203549   50887    27.1 < 2e-16 ***
## Residuals    46833 87919186    1877                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

It is also likely that the carrier of a plane can explain the variation in its arrival delay.

Model 4: Anova model for hour and month with blocking on origin and destination:

model4=aov(data$arr_delay~data$hour*data$month+data$origin+data$dest)
summary(model4)
##                         Df   Sum Sq Mean Sq F value Pr(>F)    
## data$hour               23 14396809  625948  415.92 <2e-16 ***
## data$month              11  1279820  116347   77.31 <2e-16 ***
## data$origin              2   497767  248884  165.38 <2e-16 ***
## data$dest                4   746583  186646  124.02 <2e-16 ***
## data$hour:data$month   216  1537242    7117    4.73 <2e-16 ***
## Residuals            46586 70109976    1505                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This model adds the interaction between the first two factors, hour and month of departure. The interaction effect also has a low p-value, indicating that the effect of departure hour on arrival delay likely differs from month to month.
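In an R formula, a*b is shorthand for both main effects plus their interaction, a + b + a:b. A small sketch with the built-in warpbreaks dataset (used here only as a stand-in for the flights data) shows the expansion:

```r
# wool * tension expands to wool + tension + wool:tension
f <- breaks ~ wool * tension
term_labels <- attr(terms(f), "term.labels")
print(term_labels)

# Fit the two-factor model with interaction on the built-in data
fit <- aov(f, data = warpbreaks)
print(summary(fit))
```

This is why the table for model 4 lists data$hour, data$month, and data$hour:data$month as separate rows.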

Between these four models, it is likely that month of departure, hour of departure, and airline carrier can all explain the variation in arrival delay times.

Tukey Tests

For each pair of factor levels, the null hypothesis is that the two means do not differ. When the confidence interval for a pair includes zero (its plotted line crosses zero) or the adjusted p-value is greater than .05, we fail to reject that null hypothesis. When the interval excludes zero or the adjusted p-value is small, there is likely a difference in means between the two factor levels.
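With many pairs, it can be easier to filter the Tukey table than to scan it by eye. The sketch below uses the built-in warpbreaks dataset as a stand-in; tk, tab, differ, and no_diff are illustrative names:

```r
# One-way model on built-in data, then Tukey's HSD for the tension factor
fit <- aov(breaks ~ tension, data = warpbreaks)
tk  <- TukeyHSD(fit, "tension")

# The result for each factor is a matrix with columns diff, lwr, upr, p adj
tab <- tk$tension

# Pairs whose means likely differ, and pairs where we fail to reject the null
differ  <- tab[tab[, "p adj"] <  0.05, , drop = FALSE]
no_diff <- tab[tab[, "p adj"] >= 0.05, , drop = FALSE]
print(differ)
print(no_diff)
```

For the pairs kept in differ, the interval from lwr to upr does not contain zero, which matches the rule stated above.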

Tukey Test for differences in hour:

tukey1<-TukeyHSD(aov(data$arr_delay~data$hour))
tukey1
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = data$arr_delay ~ data$hour)
## 
## $`data$hour`
##             diff        lwr      upr  p adj
## 1-0    9.384e+01   48.73775  138.947 0.0000
## 2-0    1.153e+02   55.84483  174.697 0.0000
## 3-0    2.011e+02   54.51325  347.743 0.0002
## 5-0   -2.210e+02 -244.66663 -197.358 0.0000
## 6-0   -2.175e+02 -240.81370 -194.211 0.0000
## 7-0   -2.177e+02 -241.03034 -194.384 0.0000
## 8-0   -2.153e+02 -238.55511 -191.972 0.0000
## 9-0   -2.119e+02 -235.23486 -188.478 0.0000
## 10-0  -2.124e+02 -235.79138 -189.018 0.0000
## 11-0  -2.141e+02 -237.44118 -190.774 0.0000
## 12-0  -2.092e+02 -232.58818 -185.834 0.0000
## 13-0  -2.068e+02 -230.18113 -183.422 0.0000
## 14-0  -2.068e+02 -230.18355 -183.429 0.0000
## 15-0  -2.067e+02 -230.00751 -183.401 0.0000
## 16-0  -2.047e+02 -228.02200 -181.374 0.0000
## 17-0  -2.008e+02 -224.12104 -177.486 0.0000
## 18-0  -1.953e+02 -218.64949 -172.035 0.0000
## 19-0  -1.931e+02 -216.46910 -169.757 0.0000
## 20-0  -1.892e+02 -212.59996 -165.800 0.0000
## 21-0  -1.787e+02 -202.23559 -155.239 0.0000
## 22-0  -1.254e+02 -149.67268 -101.179 0.0000
## 23-0  -5.734e+01  -83.33916  -31.332 0.0000
## 24-0  -2.337e+01  -99.37860   52.635 1.0000
## 2-1    2.143e+01  -45.58722   88.444 1.0000
## 3-1    1.073e+02  -42.56616  257.138 0.5937
## 5-1   -3.149e+02 -353.83127 -275.878 0.0000
## 6-1   -3.114e+02 -350.11812 -272.591 0.0000
## 7-1   -3.115e+02 -350.32606 -272.773 0.0000
## 8-1   -3.091e+02 -347.86340 -270.348 0.0000
## 9-1   -3.057e+02 -344.50864 -266.889 0.0000
## 10-1  -3.062e+02 -345.06180 -267.432 0.0000
## 11-1  -3.080e+02 -346.73276 -269.167 0.0000
## 12-1  -3.031e+02 -341.86250 -264.245 0.0000
## 13-1  -3.006e+02 -339.45439 -261.834 0.0000
## 14-1  -3.006e+02 -339.45773 -261.840 0.0000
## 15-1  -3.005e+02 -339.31113 -261.782 0.0000
## 16-1  -2.985e+02 -337.31740 -259.764 0.0000
## 17-1  -2.946e+02 -333.41899 -255.873 0.0000
## 18-1  -2.892e+02 -327.95154 -250.418 0.0000
## 19-1  -2.870e+02 -325.75176 -248.159 0.0000
## 20-1  -2.830e+02 -321.86506 -244.220 0.0000
## 21-1  -2.726e+02 -311.46171 -233.697 0.0000
## 22-1  -2.193e+02 -258.60731 -179.930 0.0000
## 23-1  -1.512e+02 -191.62335 -110.732 0.0000
## 24-1  -1.172e+02 -199.29154  -35.137 0.0001
## 3-2    8.586e+01  -68.90921  240.623 0.9421
## 5-2   -3.363e+02 -391.20333 -281.363 0.0000
## 6-2   -3.328e+02 -387.55228 -278.014 0.0000
## 7-2   -3.330e+02 -387.75639 -278.200 0.0000
## 8-2   -3.305e+02 -385.29927 -275.770 0.0000
## 9-2   -3.271e+02 -381.92928 -272.326 0.0000
## 10-2  -3.277e+02 -382.48095 -272.870 0.0000
## 11-2  -3.294e+02 -384.16126 -274.596 0.0000
## 12-2  -3.245e+02 -379.28338 -269.681 0.0000
## 13-2  -3.221e+02 -376.87481 -267.270 0.0000
## 14-2  -3.221e+02 -376.87855 -267.276 0.0000
## 15-2  -3.220e+02 -376.74494 -267.205 0.0000
## 16-2  -3.200e+02 -374.74759 -265.190 0.0000
## 17-2  -3.161e+02 -370.85031 -261.299 0.0000
## 18-2  -3.106e+02 -365.38466 -255.842 0.0000
## 19-2  -3.084e+02 -363.17633 -253.592 0.0000
## 20-2  -3.045e+02 -359.28187 -249.660 0.0000
## 21-2  -2.940e+02 -348.86122 -239.155 0.0000
## 22-2  -2.407e+02 -295.87479 -185.519 0.0000
## 23-2  -1.726e+02 -228.57854 -116.634 0.0000
## 24-2  -1.386e+02 -229.38267  -47.903 0.0000
## 5-3   -4.221e+02 -566.98752 -277.293 0.0000
## 6-3   -4.186e+02 -563.43036 -273.851 0.0000
## 7-3   -4.188e+02 -563.62869 -274.042 0.0000
## 8-3   -4.164e+02 -561.17991 -271.603 0.0000
## 9-3   -4.130e+02 -557.78698 -268.182 0.0000
## 10-3  -4.135e+02 -558.33641 -268.729 0.0000
## 11-3  -4.152e+02 -560.03081 -270.441 0.0000
## 12-3  -4.103e+02 -555.14144 -265.537 0.0000
## 13-3  -4.079e+02 -552.73217 -263.127 0.0000
## 14-3  -4.079e+02 -552.73653 -263.132 0.0000
## 15-3  -4.078e+02 -552.62248 -263.042 0.0000
## 16-3  -4.058e+02 -550.61968 -261.033 0.0000
## 17-3  -4.019e+02 -546.72409 -257.139 0.0000
## 18-3  -3.965e+02 -541.26117 -251.680 0.0000
## 19-3  -3.942e+02 -539.03995 -249.443 0.0000
## 20-3  -3.903e+02 -535.13378 -245.522 0.0000
## 21-3  -3.799e+02 -524.68701 -235.044 0.0000
## 22-3  -3.266e+02 -471.49920 -181.609 0.0000
## 23-3  -2.585e+02 -403.71282 -113.214 0.0000
## 24-3  -2.245e+02 -386.35852  -62.641 0.0001
## 6-5    3.500e+00   -1.76199    8.762 0.7363
## 7-5    3.305e+00   -2.05265    8.663 0.8481
## 8-5    5.749e+00    0.53002   10.968 0.0132
## 9-5    9.156e+00    3.56289   14.749 0.0000
## 10-5   8.608e+00    2.97956   14.236 0.0000
## 11-5   6.905e+00    1.50184   12.308 0.0008
## 12-5   1.180e+01    6.21389   17.388 0.0000
## 13-5   1.421e+01    8.61253   19.809 0.0000
## 14-5   1.421e+01    8.61745   19.795 0.0000
## 15-5   1.431e+01    9.03726   19.579 0.0000
## 16-5   1.631e+01   10.95301   21.676 0.0000
## 17-5   2.021e+01   14.87549   25.542 0.0000
## 18-5   2.567e+01   20.38192   30.958 0.0000
## 19-5   2.790e+01   22.40040   33.398 0.0000
## 20-5   3.181e+01   26.12908   37.496 0.0000
## 21-5   4.228e+01   36.19963   48.351 0.0000
## 22-5   9.559e+01   87.06367  104.109 0.0000
## 23-5   1.637e+02  150.99106  176.363 0.0000
## 24-5   1.976e+02  125.10240  270.178 0.0000
## 7-6   -1.948e-01   -3.67770    3.288 1.0000
## 8-6    2.249e+00   -1.01619    5.514 0.6721
## 9-6    5.656e+00    1.82113    9.490 0.0000
## 10-6   5.108e+00    1.22181    8.994 0.0004
## 11-6   3.405e+00   -0.14709    6.957 0.0810
## 12-6   8.301e+00    4.47474   12.127 0.0000
## 13-6   1.071e+01    6.86830   14.553 0.0000
## 14-6   1.071e+01    6.87764   14.535 0.0000
## 15-6   1.081e+01    7.46051   14.156 0.0000
## 16-6   1.281e+01    9.32608   16.303 0.0000
## 17-6   1.671e+01   13.26372   20.154 0.0000
## 18-6   2.217e+01   18.79532   25.545 0.0000
## 19-6   2.440e+01   20.70314   28.095 0.0000
## 20-6   2.831e+01   24.34698   32.278 0.0000
## 21-6   3.878e+01   34.26558   43.285 0.0000
## 22-6   9.209e+01   84.59905   99.574 0.0000
## 23-6   1.602e+02  148.16216  172.192 0.0000
## 24-6   1.941e+02  121.71681  266.564 0.0000
## 8-7    2.444e+00   -0.97369    5.861 0.5963
## 9-7    5.851e+00    1.88545    9.816 0.0000
## 10-7   5.303e+00    1.28779    9.317 0.0004
## 11-7   3.600e+00   -0.09278    7.292 0.0673
## 12-7   8.496e+00    4.53878   12.453 0.0000
## 13-7   1.091e+01    6.93287   14.878 0.0000
## 14-7   1.090e+01    6.94175   14.860 0.0000
## 15-7   1.100e+01    7.50660   14.499 0.0000
## 16-7   1.301e+01    9.37792   16.640 0.0000
## 17-7   1.690e+01   13.31384   20.494 0.0000
## 18-7   2.236e+01   18.84255   25.887 0.0000
## 19-7   2.459e+01   20.76273   28.425 0.0000
## 20-7   2.851e+01   24.41546   32.599 0.0000
## 21-7   3.897e+01   34.34892   43.591 0.0000
## 22-7   9.228e+01   84.72619   99.836 0.0000
## 23-7   1.604e+02  148.31469  172.429 0.0000
## 24-7   1.943e+02  121.90459  266.766 0.0000
## 9-8    3.407e+00   -0.36836    7.182 0.1467
## 10-8   2.859e+00   -0.96848    6.686 0.5026
## 11-8   1.156e+00   -2.33177    4.643 1.0000
## 12-8   6.052e+00    2.28537    9.819 0.0000
## 13-8   8.462e+00    4.67869   12.245 0.0000
## 14-8   8.457e+00    4.68825   12.226 0.0000
## 15-8   8.559e+00    5.27983   11.839 0.0000
## 16-8   1.057e+01    7.14259   13.988 0.0000
## 17-8   1.446e+01   11.08106   17.839 0.0000
## 18-8   1.992e+01   16.61407   23.228 0.0000
## 19-8   2.215e+01   18.51591   25.785 0.0000
## 20-8   2.606e+01   22.15550   29.972 0.0000
## 21-8   3.653e+01   32.06709   40.986 0.0000
## 22-8   8.984e+01   82.38040   97.294 0.0000
## 23-8   1.579e+02  145.93210  169.924 0.0000
## 24-8   1.919e+02  119.47104  264.312 0.0000
## 10-9  -5.481e-01   -4.87151    3.775 1.0000
## 11-9  -2.251e+00   -6.27688    1.775 0.9373
## 12-9   2.645e+00   -1.62469    6.915 0.8428
## 13-9   5.055e+00    0.77055    9.339 0.0041
## 14-9   5.050e+00    0.77843    9.322 0.0040
## 15-9   5.152e+00    1.30548    8.999 0.0003
## 16-9   7.159e+00    3.18869   11.128 0.0000
## 17-9   1.105e+01    7.12105   14.985 0.0000
## 18-9   1.651e+01   12.64379   20.385 0.0000
## 19-9   1.874e+01   14.58977   22.897 0.0000
## 20-9   2.266e+01   18.26158   27.052 0.0000
## 21-9   3.312e+01   28.22774   38.011 0.0000
## 22-9   8.643e+01   78.70712   94.154 0.0000
## 23-9   1.545e+02  142.35782  166.684 0.0000
## 24-9   1.885e+02  116.03623  260.933 0.0000
## 11-10 -1.703e+00   -5.77767    2.372 0.9983
## 12-10  3.193e+00   -1.12272    7.509 0.5233
## 13-10  5.603e+00    1.27267    9.933 0.0006
## 14-10  5.598e+00    1.28043    9.916 0.0006
## 15-10  5.700e+00    1.80245    9.599 0.0000
## 16-10  7.707e+00    3.68721   11.726 0.0000
## 17-10  1.160e+01    7.61910   15.583 0.0000
## 18-10  1.706e+01   13.14106   20.984 0.0000
## 19-10  1.929e+01   15.09047   23.493 0.0000
## 20-10  2.320e+01   18.76485   27.645 0.0000
## 21-10  3.367e+01   28.73551   38.599 0.0000
## 22-10  8.698e+01   79.22960   94.728 0.0000
## 23-10  1.551e+02  142.88963  167.249 0.0000
## 24-10  1.890e+02  116.58156  261.484 0.0000
## 12-11  4.896e+00    0.87842    8.914 0.0022
## 13-11  7.306e+00    3.27275   11.339 0.0000
## 14-11  7.301e+00    3.28143   11.321 0.0000
## 15-11  7.403e+00    3.83838   10.969 0.0000
## 16-11  9.410e+00    5.71220   13.107 0.0000
## 17-11  1.330e+01    9.64738   16.961 0.0000
## 18-11  1.877e+01   15.17483   22.356 0.0000
## 19-11  2.099e+01   17.10040   24.889 0.0000
## 20-11  2.491e+01   20.75708   29.058 0.0000
## 21-11  3.537e+01   30.69719   40.044 0.0000
## 22-11  8.868e+01   81.09461   96.268 0.0000
## 23-11  1.568e+02  144.69505  168.849 0.0000
## 24-11  1.907e+02  118.30169  263.170 0.0000
## 13-12  2.410e+00   -1.86736    6.687 0.9324
## 14-12  2.405e+00   -1.85945    6.669 0.9318
## 15-12  2.507e+00   -1.33158    6.346 0.7669
## 16-12  4.513e+00    0.55136    8.475 0.0077
## 17-12  8.408e+00    4.48380   12.332 0.0000
## 18-12  1.387e+01   10.00667   17.731 0.0000
## 19-12  1.610e+01   11.95209   20.244 0.0000
## 20-12  2.001e+01   15.62348   24.399 0.0000
## 21-12  3.047e+01   25.58890   35.359 0.0000
## 22-12  8.379e+01   76.06591   91.504 0.0000
## 23-12  1.519e+02  139.71511  164.037 0.0000
## 24-12  1.858e+02  113.39134  258.287 0.0000
## 14-13 -4.732e-03   -4.28364    4.274 1.0000
## 15-13  9.740e-02   -3.75737    3.952 1.0000
## 16-13  2.104e+00   -1.87392    6.081 0.9650
## 17-13  5.998e+00    2.05836    9.938 0.0000
## 18-13  1.146e+01    7.58098   15.338 0.0000
## 19-13  1.369e+01    9.52750   17.849 0.0000
## 20-13  1.760e+01   13.19970   22.004 0.0000
## 21-13  2.806e+01   23.16656   32.962 0.0000
## 22-13  8.138e+01   73.64819   89.103 0.0000
## 23-13  1.495e+02  137.30032  161.632 0.0000
## 24-13  1.834e+02  110.98079  255.879 0.0000
## 15-14  1.021e-01   -3.73860    3.943 1.0000
## 16-14  2.108e+00   -1.85559    6.072 0.9628
## 17-14  6.003e+00    2.07682    9.929 0.0000
## 18-14  1.146e+01    7.59966   15.328 0.0000
## 19-14  1.369e+01    9.54522   17.841 0.0000
## 20-14  1.761e+01   13.21672   21.996 0.0000
## 21-14  2.807e+01   23.18232   32.956 0.0000
## 22-14  8.138e+01   73.65991   89.101 0.0000
## 23-14  1.495e+02  137.30949  161.632 0.0000
## 24-14  1.834e+02  110.98627  255.883 0.0000
## 16-15  2.006e+00   -1.49563    5.508 0.9206
## 17-15  5.901e+00    2.44184    9.360 0.0000
## 18-15  1.136e+01    7.97315   14.751 0.0000
## 19-15  1.359e+01    9.88218   17.300 0.0000
## 20-15  1.750e+01   13.52689   21.482 0.0000
## 21-15  2.797e+01   23.44692   32.487 0.0000
## 22-15  8.128e+01   73.78454   88.772 0.0000
## 23-15  1.494e+02  137.35003  161.387 0.0000
## 24-15  1.833e+02  110.90795  255.757 0.0000
## 17-16  3.895e+00    0.29951    7.490 0.0170
## 18-16  9.356e+00    5.82812   12.884 0.0000
## 19-16  1.158e+01    7.74872   15.421 0.0000
## 20-16  1.550e+01   11.40177   19.595 0.0000
## 21-16  2.596e+01   21.33574   30.586 0.0000
## 22-16  7.927e+01   71.71458   86.829 0.0000
## 23-16  1.474e+02  135.30401  159.421 0.0000
## 24-16  1.813e+02  108.89520  253.757 0.0000
## 18-17  5.461e+00    1.97616    8.946 0.0000
## 19-17  7.690e+00    3.89330   11.487 0.0000
## 20-17  1.160e+01    7.54384   15.663 0.0000
## 21-17  2.207e+01   17.47359   26.659 0.0000
## 22-17  7.538e+01   67.83981   82.915 0.0000
## 23-17  1.435e+02  131.42184  155.514 0.0000
## 24-17  1.774e+02  105.00270  249.861 0.0000
## 19-18  2.229e+00   -1.50427    5.962 0.8844
## 20-18  6.142e+00    2.14208   10.143 0.0000
## 21-18  1.661e+01   12.06484   21.145 0.0000
## 22-18  6.992e+01   62.41043   77.422 0.0000
## 23-18  1.380e+02  125.98050  150.033 0.0000
## 24-18  1.720e+02   99.54475  244.396 0.0000
## 20-19  3.913e+00   -0.36148    8.188 0.1287
## 21-19  1.438e+01    9.59217   19.160 0.0000
## 22-19  6.769e+01   60.03155   75.343 0.0000
## 23-19  1.358e+02  123.65737  147.898 0.0000
## 24-19  1.697e+02   97.30001  242.183 0.0000
## 21-20  1.046e+01    5.46775   15.458 0.0000
## 22-20  6.377e+01   55.98458   71.563 0.0000
## 23-20  1.319e+02  119.65920  144.070 0.0000
## 24-20  1.658e+02   93.37245  238.284 0.0000
## 22-21  5.331e+01   45.23130   61.391 0.0000
## 23-21  1.214e+02  109.00904  133.794 0.0000
## 24-21  1.554e+02   82.87795  227.853 0.0000
## 23-22  6.809e+01   54.33198   81.849 0.0000
## 24-22  1.021e+02   29.32088  174.788 0.0001
## 24-23  3.396e+01  -39.37424  107.301 0.9930

This output is long since there are 24 levels of hour. A minority of pairs show no significant difference in means, such as 0 and 24 (which makes sense since these are the same time!). Other pairs where it is likely there is no difference in means include 2-1, 3-1, 3-2, 6-5, 7-5, and 7-6, which makes sense since they are consecutive or nearly consecutive hours.
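The length of each Tukey table follows from the number of levels: a factor with k levels has k(k-1)/2 pairwise comparisons. A short calculation for the three factors in this recipe (levels_per_factor and n_pairs are illustrative names):

```r
# Number of levels of each factor under study
levels_per_factor <- c(hour = 24, month = 12, carrier = 4)

# Number of pairwise comparisons: choose(k, 2) = k * (k - 1) / 2
n_pairs <- choose(levels_per_factor, 2)
print(n_pairs)
```

This gives 276 comparisons for hour, 66 for month, and 6 for carrier.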

Tukey Test for differences in month:

tukey2<-TukeyHSD(aov(data$arr_delay~data$month))
tukey2
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = data$arr_delay ~ data$month)
## 
## $`data$month`
##            diff      lwr      upr  p adj
## 2-1    -2.97712  -6.4577   0.5034 0.1817
## 3-1    -3.02501  -6.3683   0.3182 0.1215
## 4-1     6.51823   3.2098   9.8266 0.0000
## 5-1    -2.30604  -5.5639   0.9519 0.4670
## 6-1    11.22107   7.9680  14.4742 0.0000
## 7-1     9.37781   6.1454  12.6102 0.0000
## 8-1     1.43344  -1.7742   4.6411 0.9510
## 9-1    -6.91035 -10.1591  -3.6616 0.0000
## 10-1   -3.68270  -6.8918  -0.4736 0.0097
## 11-1   -1.32564  -4.6010   1.9497 0.9764
## 12-1   11.74712   8.4504  15.0438 0.0000
## 3-2    -0.04789  -3.4895   3.3937 1.0000
## 4-2     9.49536   6.0876  12.9031 0.0000
## 5-2     0.67109  -2.6876   4.0298 1.0000
## 6-2    14.19819  10.8441  17.5523 0.0000
## 7-2    12.35493   9.0210  15.6889 0.0000
## 8-2     4.41056   1.1005   7.7206 0.0008
## 9-2    -3.93323  -7.2831  -0.5834 0.0069
## 10-2   -0.70558  -4.0170   2.6059 0.9999
## 11-2    1.65149  -1.7241   5.0271 0.9100
## 12-2   14.72425  11.3279  18.1206 0.0000
## 4-3     9.54324   6.2758  12.8106 0.0000
## 5-3     0.71897  -2.4973   3.9352 0.9999
## 6-3    14.24608  11.0347  17.4575 0.0000
## 7-3    12.40282   9.2124  15.5932 0.0000
## 8-3     4.45845   1.2931   7.6238 0.0003
## 9-3    -3.88534  -7.0923  -0.6784 0.0043
## 10-3   -0.65769  -3.8245   2.5091 0.9999
## 11-3    1.69938  -1.5345   4.9333 0.8607
## 12-3   14.77213  11.5166  18.0277 0.0000
## 5-4    -8.82427 -12.0043  -5.6442 0.0000
## 6-4     4.70284   1.5277   7.8780 0.0001
## 7-4     2.85958  -0.2943   6.0134 0.1195
## 8-4    -5.08479  -8.2133  -1.9563 0.0000
## 9-4   -13.42858 -16.5992 -10.2580 0.0000
## 10-4  -10.20093 -13.3310  -7.0709 0.0000
## 11-4   -7.84387 -11.0417  -4.6460 0.0000
## 12-4    5.22889   2.0091   8.4486 0.0000
## 6-5    13.52711  10.4047  16.6496 0.0000
## 7-5    11.68385   8.5830  14.7847 0.0000
## 8-5     3.73948   0.6644   6.8145 0.0041
## 9-5    -4.60431  -7.7222  -1.4864 0.0001
## 10-5   -1.37666  -4.4533   1.6999 0.9506
## 11-5    0.98040  -2.1652   4.1260 0.9973
## 12-5   14.05316  10.8853  17.2210 0.0000
## 7-6    -1.84326  -4.9391   1.2525 0.7301
## 8-6    -9.78763 -12.8576  -6.7176 0.0000
## 9-6   -18.13142 -21.2443 -15.0185 0.0000
## 10-6  -14.90377 -17.9753 -11.8322 0.0000
## 11-6  -12.54670 -15.6873  -9.4061 0.0000
## 12-6    0.52605  -2.6369   3.6890 1.0000
## 8-7    -7.94437 -10.9924  -4.8964 0.0000
## 9-7   -16.28816 -19.3794 -13.1970 0.0000
## 10-7  -13.06051 -16.1101 -10.0110 0.0000
## 11-7  -10.70344 -13.8226  -7.5843 0.0000
## 12-7    2.36931  -0.7722   5.5109 0.3624
## 9-8    -8.34379 -11.4091  -5.2784 0.0000
## 10-8   -5.11614  -8.1395  -2.0928 0.0000
## 11-8   -2.75907  -5.8526   0.3344 0.1355
## 12-8   10.31368   7.1975  13.4298 0.0000
## 10-9    3.22765   0.1608   6.2945 0.0289
## 11-9    5.58472   2.4486   8.7208 0.0000
## 12-9   18.65747  15.4991  21.8159 0.0000
## 11-10   2.35707  -0.7380   5.4521 0.3468
## 12-10  15.42982  12.3122  18.5475 0.0000
## 12-11  13.07276   9.8870  16.2585 0.0000

The pattern for months that have differences in means is not as evident, although about 70% of the pairings (46 of 66) have significant differences in mean arrival delay time. Further analysis could look into whether the differences are related to weather patterns or to seasons with increased travel.

Tukey Test for differences in carrier:

tukey3<-TukeyHSD(aov(data$arr_delay~data$carrier))
tukey3
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = data$arr_delay ~ data$carrier)
## 
## $`data$carrier`
##          diff    lwr     upr  p adj
## DL-AA  5.5818  4.174  6.9894 0.0000
## UA-AA  5.3313  3.996  6.6661 0.0000
## WN-AA 12.2108  9.080 15.3414 0.0000
## UA-DL -0.2505 -1.461  0.9599 0.9514
## WN-DL  6.6290  3.549  9.7087 0.0000
## WN-UA  6.8795  3.832  9.9266 0.0000
plot(tukey3)

For this Tukey test, we include the plot since there are fewer pair comparisons. The only pair where we fail to reject the null is UA and DL. All other airline carrier pairs have a significant difference in means.

Diagnostics/Model Adequacy Checking

We will check model 4 since it includes 2 of the 3 factors under study.

Visually inspect normality of data:

qqnorm(residuals(model4))
qqline(residuals(model4))

The residuals do not appear to be normally distributed.

Test normality with the Shapiro-Wilk test (*NOTE- R's shapiro.test accepts a sample size of at most 5000. Because of this, a model identical to model 4 but fit to a smaller set of data is created. This is done by taking a random sample of 5000 observations from the data originally used):

small <- data[sample(1:nrow(data), 5000, replace=FALSE),]
modelsmall=aov(small$arr_delay~small$hour*small$month+small$origin+small$dest)
summary(modelsmall)
##                          Df  Sum Sq Mean Sq F value  Pr(>F)    
## small$hour               21 1646310   78396   52.12 < 2e-16 ***
## small$month              11  101436    9221    6.13 4.5e-10 ***
## small$origin              2   34442   17221   11.45 1.1e-05 ***
## small$dest                4   86536   21634   14.38 1.1e-11 ***
## small$hour:small$month  194  636103    3279    2.18 < 2e-16 ***
## Residuals              4767 7170087    1504                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
shapiro.test(residuals(modelsmall))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(modelsmall)
## W = 0.7592, p-value < 2.2e-16

Null hypothesis: The residuals came from a normally distributed population. We reject the null. We cannot assume the residuals are normal. This will be addressed in the contingencies section below.
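Because sample() draws a different subsample on every run, the exact numbers in the small model will change each time the recipe is knit. Fixing a seed first makes the subsample reproducible. The sketch below uses simulated skewed values as a stand-in for the residuals; the seed value and the names res, idx, and sw are assumptions for illustration:

```r
# Fix the random seed so the same subsample is drawn on every run
set.seed(42)

# Simulated right-skewed values standing in for model residuals
res <- rexp(20000) - 1

# Draw 5000 values without replacement (the most shapiro.test accepts)
idx <- sample(seq_along(res), 5000, replace = FALSE)
sw  <- shapiro.test(res[idx])
print(sw)
```

In the recipe itself, a set.seed call placed before the line that creates small would have the same effect.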

Fitted vs Residuals Plot

plot(fitted(model4),residuals(model4))

The model may not be a good fit since the residuals are clustered and not distributed across the dynamic range.

Interaction Plot

We create an interaction plot to view the interactions between the factors.

interaction.plot(data$hour, data$month, data$arr_delay)

interaction.plot(data$hour, data$carrier, data$arr_delay)

interaction.plot(data$carrier, data$month, data$arr_delay)

There appears to be interaction among all the factors, as evidenced by the different slopes and intersecting lines.

4. Contingencies

Since the data did not fulfill the normality assumption of the anova model, a Kruskal-Wallis one-way analysis of variance by Rank Sum Test should be performed:

kruskal.test(data$arr_delay~data$month)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  data$arr_delay by data$month
## Kruskal-Wallis chi-squared = 1561, df = 11, p-value < 2.2e-16
kruskal.test(data$arr_delay~data$hour)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  data$arr_delay by data$hour
## Kruskal-Wallis chi-squared = 3047, df = 23, p-value < 2.2e-16
kruskal.test(data$arr_delay~data$carrier)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  data$arr_delay by data$carrier
## Kruskal-Wallis chi-squared = 583.8, df = 3, p-value < 2.2e-16

The null hypothesis of the Kruskal-Wallis test is that the mean ranks of the samples from the populations are expected to be the same (this is not the same as saying the populations have identical means). Since each test results in a low p-value, we reject this null hypothesis. It is likely that month, hour, and carrier can each explain some of the variation in arrival delay times.
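The pieces of a Kruskal-Wallis result can be extracted by name for reporting. The sketch below uses the built-in airquality dataset as a stand-in for the flights data; kw is an illustrative name:

```r
# Kruskal-Wallis test of ozone by month on built-in data
kw <- kruskal.test(Ozone ~ Month, data = airquality)
print(kw)

# Individual components of the result
kw$statistic   # Kruskal-Wallis chi-squared
kw$parameter   # degrees of freedom (number of groups minus 1)
kw$p.value     # p-value of the test
```

The degrees of freedom equal the number of groups minus one, which matches the df of 11, 23, and 3 reported above for month, hour, and carrier.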

5. References to the Literature

None used.

6. Appendices

Link to raw data

Data is from the nycflights13 package

Complete R Code

All included above.