Recipes for the Design of Experiments

as of August 28, 2014, superceding the version of August 24. Always use the most recent version.

Analysis of Arrival Delay Times for Flights Departing from NYC (Two-Factor, Multi-Level Experiment)

Brendan Howell

Renselaer Polytechnic Institute

10/02/14 - Version 1.0

1. Setting

Dataset of Out-bound Flights from NYC in 2013 (‘nycflights13’)

Description: This package contains information about all flights that departed from NYC (e.g. EWR, JFK, and LGA) in 2013: 336,776 flights in total.

In this study, a two-factor, multi-level experiment will be performed to see if either the “origin location” of a given flight route or the “destination location” of a given flight route (or, both via interaction) has a statistically significant effect on the “delay in arrival time” that is observed upon a flight’s arrival to its designated destination. In the dataset, the factor ‘origin’ refers to the origin location of a given flight route and the factor ‘dest’ refers to the destination location of a given flight route [both of which are represented by their respective 3-letter airport codes]. Additionally, this analysis’ response variable is referred to in the dataset as ‘arr_delay’, which denotes the delay in arrival time [in minutes] that is observed upon a flight’s arrival to its designated destination. (Negative times represent early departure and arrival times in the dataset.)

#Install and load the "nycflights13" dataset into R.
install.packages("nycflights13", repos='http://cran.us.r-project.org')

## package 'nycflights13' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\howelb\AppData\Local\Temp\Rtmpqi4IEy\downloaded_packages

library("nycflights13", lib.loc = "C:/Program Files/R/R-3.1.1/library")
#Assign a variable, "data_raw", to the complete dataframe, "flights".
#Then, display the "head" and "tail" of the dataset, "data_raw".
data_raw<-flights
head(data_raw)

##   year month day dep_time dep_delay arr_time arr_delay carrier tailnum
## 1 2013     1   1      517         2      830        11      UA  N14228
## 2 2013     1   1      533         4      850        20      UA  N24211
## 3 2013     1   1      542         2      923        33      AA  N619AA
## 4 2013     1   1      544        -1     1004       -18      B6  N804JB
## 5 2013     1   1      554        -6      812       -25      DL  N668DN
## 6 2013     1   1      554        -4      740        12      UA  N39463
##   flight origin dest air_time distance hour minute
## 1   1545    EWR  IAH      227     1400    5     17
## 2   1714    LGA  IAH      227     1416    5     33
## 3   1141    JFK  MIA      160     1089    5     42
## 4    725    JFK  BQN      183     1576    5     44
## 5    461    LGA  ATL      116      762    5     54
## 6   1696    EWR  ORD      150      719    5     54

tail(data_raw)

##        year month day dep_time dep_delay arr_time arr_delay carrier
## 336771 2013     9  30       NA        NA       NA        NA      EV
## 336772 2013     9  30       NA        NA       NA        NA      9E
## 336773 2013     9  30       NA        NA       NA        NA      9E
## 336774 2013     9  30       NA        NA       NA        NA      MQ
## 336775 2013     9  30       NA        NA       NA        NA      MQ
## 336776 2013     9  30       NA        NA       NA        NA      MQ
##        tailnum flight origin dest air_time distance hour minute
## 336771  N740EV   5274    LGA  BNA       NA      764   NA     NA
## 336772           3393    JFK  DCA       NA      213   NA     NA
## 336773           3525    LGA  SYR       NA      198   NA     NA
## 336774  N535MQ   3461    LGA  BNA       NA      764   NA     NA
## 336775  N511MQ   3572    LGA  CLE       NA      419   NA     NA
## 336776  N839MQ   3531    LGA  RDU       NA      431   NA     NA

Factors and Levels

This analysis considers two different factors (with each having multiple levels), which include ‘origin’ and ‘dest’. In the original dataset “data_raw”, the factor ‘origin’ has 3 unique levels and the factor ‘dest’ has 105 unique levels. These factors were selected intuitively, since this analysis aims to determine whether or not the origin and destination locations of a given flight route have a significant effect on the resulting arrival delay time that a given flight experiences.

#Display the summary statistics of "data_raw".
summary(data_raw)

##       year          month            day          dep_time   
##  Min.   :2013   Min.   : 1.00   Min.   : 1.0   Min.   :   1  
##  1st Qu.:2013   1st Qu.: 4.00   1st Qu.: 8.0   1st Qu.: 907  
##  Median :2013   Median : 7.00   Median :16.0   Median :1401  
##  Mean   :2013   Mean   : 6.55   Mean   :15.7   Mean   :1349  
##  3rd Qu.:2013   3rd Qu.:10.00   3rd Qu.:23.0   3rd Qu.:1744  
##  Max.   :2013   Max.   :12.00   Max.   :31.0   Max.   :2400  
##                                                NA's   :8255  
##    dep_delay       arr_time      arr_delay      carrier         
##  Min.   : -43   Min.   :   1   Min.   : -86   Length:336776     
##  1st Qu.:  -5   1st Qu.:1104   1st Qu.: -17   Class :character  
##  Median :  -2   Median :1535   Median :  -5   Mode  :character  
##  Mean   :  13   Mean   :1502   Mean   :   7                     
##  3rd Qu.:  11   3rd Qu.:1940   3rd Qu.:  14                     
##  Max.   :1301   Max.   :2400   Max.   :1272                     
##  NA's   :8255   NA's   :8713   NA's   :9430                     
##    tailnum              flight        origin              dest          
##  Length:336776      Min.   :   1   Length:336776      Length:336776     
##  Class :character   1st Qu.: 553   Class :character   Class :character  
##  Mode  :character   Median :1496   Mode  :character   Mode  :character  
##                     Mean   :1972                                        
##                     3rd Qu.:3465                                        
##                     Max.   :8500                                        
##                                                                         
##     air_time       distance         hour          minute    
##  Min.   : 20    Min.   :  17   Min.   : 0     Min.   : 0    
##  1st Qu.: 82    1st Qu.: 502   1st Qu.: 9     1st Qu.:16    
##  Median :129    Median : 872   Median :14     Median :31    
##  Mean   :151    Mean   :1040   Mean   :13     Mean   :32    
##  3rd Qu.:192    3rd Qu.:1389   3rd Qu.:17     3rd Qu.:49    
##  Max.   :695    Max.   :4983   Max.   :24     Max.   :59    
##  NA's   :9430                  NA's   :8255   NA's   :8255

#Display the names found in "data_raw".
names(data_raw)

##  [1] "year"      "month"     "day"       "dep_time"  "dep_delay"
##  [6] "arr_time"  "arr_delay" "carrier"   "tailnum"   "flight"   
## [11] "origin"    "dest"      "air_time"  "distance"  "hour"     
## [16] "minute"

#Display the structure of "data_raw".
str(data_raw)

## Classes 'tbl_df', 'tbl' and 'data.frame':    336776 obs. of  16 variables:
##  $ year     : int  2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
##  $ month    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ day      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ dep_time : int  517 533 542 544 554 554 555 557 557 558 ...
##  $ dep_delay: num  2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
##  $ arr_time : int  830 850 923 1004 812 740 913 709 838 753 ...
##  $ arr_delay: num  11 20 33 -18 -25 12 19 -14 -8 8 ...
##  $ carrier  : chr  "UA" "UA" "AA" "B6" ...
##  $ tailnum  : chr  "N14228" "N24211" "N619AA" "N804JB" ...
##  $ flight   : int  1545 1714 1141 725 461 1696 507 5708 79 301 ...
##  $ origin   : chr  "EWR" "LGA" "JFK" "JFK" ...
##  $ dest     : chr  "IAH" "IAH" "MIA" "BQN" ...
##  $ air_time : num  227 227 160 183 116 150 158 53 140 138 ...
##  $ distance : num  1400 1416 1089 1576 762 ...
##  $ hour     : num  5 5 5 5 5 5 5 5 5 5 ...
##  $ minute   : num  17 33 42 44 54 54 55 57 57 58 ...

#Categorize 'origin' as a factor and display its resulting levels.
data_raw$origin = as.factor(data_raw$origin)
levels(data_raw$origin)

## [1] "EWR" "JFK" "LGA"

#Categorize 'dest' as a factor and display its resulting levels.
data_raw$dest = as.factor(data_raw$dest)
levels(data_raw$dest)

##   [1] "ABQ" "ACK" "ALB" "ANC" "ATL" "AUS" "AVL" "BDL" "BGR" "BHM" "BNA"
##  [12] "BOS" "BQN" "BTV" "BUF" "BUR" "BWI" "BZN" "CAE" "CAK" "CHO" "CHS"
##  [23] "CLE" "CLT" "CMH" "CRW" "CVG" "DAY" "DCA" "DEN" "DFW" "DSM" "DTW"
##  [34] "EGE" "EYW" "FLL" "GRR" "GSO" "GSP" "HDN" "HNL" "HOU" "IAD" "IAH"
##  [45] "ILM" "IND" "JAC" "JAX" "LAS" "LAX" "LEX" "LGA" "LGB" "MCI" "MCO"
##  [56] "MDW" "MEM" "MHT" "MIA" "MKE" "MSN" "MSP" "MSY" "MTJ" "MVY" "MYR"
##  [67] "OAK" "OKC" "OMA" "ORD" "ORF" "PBI" "PDX" "PHL" "PHX" "PIT" "PSE"
##  [78] "PSP" "PVD" "PWM" "RDU" "RIC" "ROC" "RSW" "SAN" "SAT" "SAV" "SBN"
##  [89] "SDF" "SEA" "SFO" "SJC" "SJU" "SLC" "SMF" "SNA" "SRQ" "STL" "STT"
## [100] "SYR" "TPA" "TUL" "TVC" "TYS" "XNA"

Continuous variables (if any)

In this dataset, there are a few variables that can be considered to be continuous variables; these variables are the ones which are categorized as being numeric variables. By this standard, the continuous variables that exist in this dataset include ‘dep_delay’ (which refers to the delay in flight departure [in minutes]), ‘arr_delay’(which refers to the delay in flight arrival [in minutes]), ‘air_time’ (which refers to amount of time, in minutes, that a flight is airborne), and ‘distance’ (which refers to the distance traveled by a flight [in miles]). The two other variables in the dataset that are categorized as being numeric are ‘hour’ and ‘minute’, which refer to the time of departure of a given flight. Despite this, they cannot accurately be considered to be continuous variables in this dataset, since they are really expressed as integer values in the dataset. The two main factors that our analysis considers, ‘origin’ and ‘dest’, are treated as categorical variables here.

Response variables

This analysis will consider one response variable, ‘arr_delay’, which denotes the delay in arrival time [in minutes] that is observed upon a flight’s arrival to its designated destination. (Negative times represent early departure and arrival times in the dataset.)

The Data: How is it organized and what does it look like?

This dataset contains information about all flights that departed from NYC (e.g. EWR, JFK and LGA) in 2013, which amounts to 336,776 flights in total. It contains 336,776 observations of 16 variables, which include ‘year’, ‘month’, ‘day’, ‘dep_time’, ‘dep_delay’, ‘arr_time’, ‘arr_delay’, ‘carrier’, ‘tailnum’, ‘flight’, ‘origin’, ‘dest’, ‘air_time’, ‘distance’, ‘hour’, and ‘minute’. The variables that are made up of integer values include ‘year’ (which refers to the year of departure), ‘month’ (which refers to the month of departure), ‘day’ (which refers to the day of departure), ‘dep_time’ (which refers to the time of departure [in hhmm notation]), ‘arr_time’ (which refers to the time of arrival [in hhmm notation]), and ‘flight’ (which refers to the flight number). The variables that are made up of character values include ‘carrier’ (which refers to the airline carrier), ‘tailnum’ (which refers to plane tail number), ‘origin’ (which refers to the origin location), and ‘dest’ (which refers to the destination location). Lastly, the variables that are made up of numeric values include ‘dep_delay’ (which refers to departure delays expressed in minutes), ‘arr_delay’ (which refers to arrival delays expressed in minutes), ‘air_time’ (which refers to the amount of time a plane is airborne, expressed in minutes), ‘distance’ (which refers to the distance traveled by a flight, expressed in miles), ‘hour’ (which refers to the hour of departure), and ‘minute’ (which refers to the minute of departure).

Randomization

According to the Research and Innovative Technology Administration Bureau of Transportation Statistics, this flight dataset includes information about every flight that departed from New York City in 2013 [1]. Therefore, due to the size and amount of data collected here, we can make the assumption that this dataset exhibits randomization without any bias.

2. (Experimental) Design

How will the experiment be organized and conducted to test the hypothesis?

In this experiment, we are trying to determine whether or not the variation that is observed in the response variable (which corresponds to ‘arr_delay’ in this analysis) can be explained by the variation existent in the two different treatments of the experiment (which corresponds to ‘origin’ and ‘dest’). Therefore, the null hypothesis that is being tested states that the origin and the destination locations of a given flight route do not have a significant effect on the delay in arrival time that is observed. In carrying out this analysis, we perform an analysis of variance (ANOVA) for the delay in arrival time (‘arr_delay’) to see if there is a significant difference in the means for this response variable when considering both the origin (‘origin’) and destination (‘dest’) locations of the flights contained in this dataset.

What is the rationale for this design?

The rationale for this design lies primarily in the fact that we’re trying to determine if the origin and destination locations of a given flight have any effect on the overall arrival time (represented as any delay in arrival time) that is achieved for a given flight. So, since this delay in arrival time is a useful and relevant metric to consider when determining why a given flight experiences any delay in their arrival , this design of a two-factor, multi-level experiment (as it corresponds to an analysis of variance) was crafted to see if the origin and destination locations of a given flight route have any effect on the delays experienced in a flight’s arrival time. Therefore, by performing this analysis, we can hope to receive some insight regarding the different factors that come into play which can cause delays in arrival times for flights.

Randomize: What is the Randomization Scheme?

While our original assumption claimed that the entire vehicle dataset exhibits randomization, our analysis needed to ensure that we developed a completely randomized design. In meeting this objective, a new dataset is created (“nycflights”) that randomly selects 5000 observations from “data_raw”. In creating this new dataset, we’re ensuring that our analysis considers a large sample of an even larger population and randomizes the order that the runs of the data are placed in the dataset (since they were originally listed chronologically by departure date). After creating this new dataset, we can now assume that our randomization scheme represents a completely randomized design.

#Remove any rows that contain "NA" in "data_raw", creating "data_clean".
#Randomly take a sample of 5000 observations from "data_clean", creating "nycflights".
data_clean<-na.omit(data_raw)
flight.index = sample(1:nrow(data_clean),5000,replace=FALSE)
nycflights<-data_clean[flight.index,]

Replicate: Are there replicates and/or repeated measures?

In this experiment, there are no replicates or repeated measures present.

Block: Did you use blocking in the design?

By designing this experiment in such a way that allows for multiple levels to be considered (so as to reach the most accurate conclusions possible), blocking was not considered in this two-factor, multi-level design.

3. (Statistical) Analysis

(Exploratory Data Analysis) Graphics and Descriptive Summary

In beginning to display this data graphically, summary statistics were gathered for the newly created dataset, “nycflights”. Additionally, histograms and boxplots were created to represent the different observations of arrival time delays existent within this dataset of randomly selected/sampled observations.

#Display the summary statistics of "nycflights".
summary(nycflights)

##       year          month            day          dep_time   
##  Min.   :2013   Min.   : 1.00   Min.   : 1.0   Min.   :   1  
##  1st Qu.:2013   1st Qu.: 4.00   1st Qu.: 8.0   1st Qu.: 858  
##  Median :2013   Median : 7.00   Median :16.0   Median :1357  
##  Mean   :2013   Mean   : 6.54   Mean   :15.7   Mean   :1341  
##  3rd Qu.:2013   3rd Qu.:10.00   3rd Qu.:23.0   3rd Qu.:1746  
##  Max.   :2013   Max.   :12.00   Max.   :31.0   Max.   :2358  
##                                                              
##    dep_delay        arr_time      arr_delay       carrier         
##  Min.   :-22.0   Min.   :   1   Min.   :-61.0   Length:5000       
##  1st Qu.: -5.0   1st Qu.:1052   1st Qu.:-16.0   Class :character  
##  Median : -1.0   Median :1530   Median : -4.0   Mode  :character  
##  Mean   : 12.5   Mean   :1493   Mean   :  6.9                     
##  3rd Qu.: 12.0   3rd Qu.:1941   3rd Qu.: 14.0                     
##  Max.   :388.0   Max.   :2400   Max.   :474.0                     
##                                                                   
##    tailnum              flight     origin          dest         air_time  
##  Length:5000        Min.   :   1   EWR:1795   ORD    : 257   Min.   : 21  
##  Class :character   1st Qu.: 512   JFK:1695   BOS    : 241   1st Qu.: 80  
##  Mode  :character   Median :1455   LGA:1510   ATL    : 238   Median :130  
##                     Mean   :1930              LAX    : 237   Mean   :150  
##                     3rd Qu.:3407              CLT    : 216   3rd Qu.:194  
##                     Max.   :6181              MCO    : 197   Max.   :654  
##                                               (Other):3614                
##     distance         hour          minute    
##  Min.   :  94   Min.   : 0.0   Min.   : 0.0  
##  1st Qu.: 502   1st Qu.: 8.0   1st Qu.:16.0  
##  Median : 872   Median :13.0   Median :32.0  
##  Mean   :1044   Mean   :13.1   Mean   :31.8  
##  3rd Qu.:1389   3rd Qu.:17.0   3rd Qu.:49.0  
##  Max.   :4983   Max.   :23.0   Max.   :59.0  
##

#Display the names found in "nycflights".
names(nycflights)

##  [1] "year"      "month"     "day"       "dep_time"  "dep_delay"
##  [6] "arr_time"  "arr_delay" "carrier"   "tailnum"   "flight"   
## [11] "origin"    "dest"      "air_time"  "distance"  "hour"     
## [16] "minute"

#Display the structure of "nycflights".
str(nycflights)

## Classes 'tbl_df', 'tbl' and 'data.frame':    5000 obs. of  16 variables:
##  $ year     : int  2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
##  $ month    : int  4 3 1 6 7 10 9 10 10 2 ...
##  $ day      : int  25 25 11 25 28 4 19 15 13 19 ...
##  $ dep_time : int  1940 809 1956 1150 2320 1751 1345 1650 1828 2056 ...
##  $ dep_delay: num  10 -1 -4 -5 115 -9 0 5 28 51 ...
##  $ arr_time : int  2049 912 2242 1311 152 2058 1623 1910 2144 2302 ...
##  $ arr_delay: num  -1 -13 -30 1 113 -6 -39 -6 40 32 ...
##  $ carrier  : chr  "EV" "AA" "B6" "MQ" ...
##  $ tailnum  : chr  "N826AS" "N3FSAA" "N657JB" "N0EGMQ" ...
##  $ flight   : int  5714 1838 801 3461 1611 359 476 2042 359 4033 ...
##  $ origin   : Factor w/ 3 levels "EWR","JFK","LGA": 2 2 2 3 1 2 3 1 2 3 ...
##  $ dest     : Factor w/ 105 levels "ABQ","ACK","ALB",..: 43 12 36 11 44 16 59 5 16 104 ...
##  $ air_time : num  51 33 140 105 190 344 139 106 350 107 ...
##  $ distance : num  228 187 1069 764 1400 ...
##  $ hour     : num  19 8 19 11 23 17 13 16 18 20 ...
##  $ minute   : num  40 9 56 50 20 51 45 50 28 56 ...
##  - attr(*, "na.action")=Class 'omit'  Named int [1:9430] 472 478 616 644 726 734 755 839 840 841 ...
##   .. ..- attr(*, "names")= chr [1:9430] "472" "478" "616" "644" ...

#Display the head and tail of "nycflights".
head(nycflights)

##        year month day dep_time dep_delay arr_time arr_delay carrier
## 188651 2013     4  25     1940        10     2049        -1      EV
## 158692 2013     3  25      809        -1      912       -13      AA
## 9663   2013     1  11     1956        -4     2242       -30      B6
## 245110 2013     6  25     1150        -5     1311         1      MQ
## 276798 2013     7  28     2320       115      152       113      UA
## 30692  2013    10   4     1751        -9     2058        -6      B6
##        tailnum flight origin dest air_time distance hour minute
## 188651  N826AS   5714    JFK  IAD       51      228   19     40
## 158692  N3FSAA   1838    JFK  BOS       33      187    8      9
## 9663    N657JB    801    JFK  FLL      140     1069   19     56
## 245110  N0EGMQ   3461    LGA  BNA      105      764   11     50
## 276798  N36447   1611    EWR  IAH      190     1400   23     20
## 30692   N564JB    359    JFK  BUR      344     2465   17     51

tail(nycflights)

##        year month day dep_time dep_delay arr_time arr_delay carrier
## 11126  2013     1  13     1906        56     2208        38      AA
## 141000 2013     3   6      820         5     1013         3      MQ
## 135535 2013     2  28      924        -5     1127       -53      UA
## 271705 2013     7  23     1629        79     1841        67      EV
## 18052  2013     1  21     1839        -6     2144       -37      DL
## 293508 2013     8  15      756        -5     1031        -2      UA
##        tailnum flight origin dest air_time distance hour minute
## 11126   N3ECAA   1611    LGA  MIA      144     1096   19      6
## 141000  N711MQ   4490    LGA  CMH       76      479    8     20
## 135535  N504UA    926    EWR  LAS      278     2227    9     24
## 271705  N25134   4377    EWR  JAX      115      820   16     29
## 18052   N329NW   2190    JFK  MIA      166     1089   18     39
## 293508  N419UA    706    EWR  LAS      297     2227    7     56

#Display the levels of 'origin' and 'dest' within "nycflights".
levels(nycflights$origin)

## [1] "EWR" "JFK" "LGA"

levels(nycflights$dest)

##   [1] "ABQ" "ACK" "ALB" "ANC" "ATL" "AUS" "AVL" "BDL" "BGR" "BHM" "BNA"
##  [12] "BOS" "BQN" "BTV" "BUF" "BUR" "BWI" "BZN" "CAE" "CAK" "CHO" "CHS"
##  [23] "CLE" "CLT" "CMH" "CRW" "CVG" "DAY" "DCA" "DEN" "DFW" "DSM" "DTW"
##  [34] "EGE" "EYW" "FLL" "GRR" "GSO" "GSP" "HDN" "HNL" "HOU" "IAD" "IAH"
##  [45] "ILM" "IND" "JAC" "JAX" "LAS" "LAX" "LEX" "LGA" "LGB" "MCI" "MCO"
##  [56] "MDW" "MEM" "MHT" "MIA" "MKE" "MSN" "MSP" "MSY" "MTJ" "MVY" "MYR"
##  [67] "OAK" "OKC" "OMA" "ORD" "ORF" "PBI" "PDX" "PHL" "PHX" "PIT" "PSE"
##  [78] "PSP" "PVD" "PWM" "RDU" "RIC" "ROC" "RSW" "SAN" "SAT" "SAV" "SBN"
##  [89] "SDF" "SEA" "SFO" "SJC" "SJU" "SLC" "SMF" "SNA" "SRQ" "STL" "STT"
## [100] "SYR" "TPA" "TUL" "TVC" "TYS" "XNA"

par(mfrow=c(1,1))
#Create a histogram of Arrival Time Delays ('arr_delay') across all 2013 flights.
hist(nycflights$arr_delay, main = "Arrival Time Delays [in minutes]")

plot of chunk unnamed-chunk-5

par(mfrow=c(1,1))
#Create a boxplot of Arrival Time Delays ('arr_delay') at each destination airport ('dest') [ylim = c(min|'arr_delay',max|'arr_delay')].
boxplot(nycflights$arr_delay~nycflights$dest, main = "Arrival Time Delays [in minutes]", ylim = c(min(nycflights$arr_delay),max(nycflights$arr_delay)), ylab = "Minutes")

plot of chunk unnamed-chunk-6

par(mfrow=c(1,1))
#Create a boxplot of Arrival Time Delays ('arr_delay') at each destination airport ('dest') [ylim = c(~1st Quarter|'arr_delay',~3rd Quarter|'arr_delay')].(Gives better indication of median differences among destination airport locations.)
boxplot(nycflights$arr_delay~nycflights$dest, main = "Arrival Time Delays [in minutes]", ylim = c(-20,20), ylab = "Minutes")

plot of chunk unnamed-chunk-7

Testing

In order to determine if the variation that is observed in the response variable (which corresponds to the delay in arrival times in this analysis) can be explained by the variation existent in the treatments of the experiment (which correspond to both the origin and destination locations of the flight routes), an analysis of variance (ANOVA) is performed as a means for analyzing the differences in arrival time delays for each of the different corresponding origin and destination airports (for each given flight route) contained within this dataset. For each of the three ANOVA models that are designed in this experiment, the null hypothesis that is being tested (which we will either reject or fail to reject by the end of our analysis) states that the origin and the destination locations of a given flight route do not have a significant effect on the delay in arrival time that is observed, implying that the differences in mean values of the delays in arrival time for each of the origin and destination airports were solely the result of randomization in this experiment. In other words, if we reject the null hypothesis, we would infer that the differences in mean values of the delays in arrival time for each of the corresponding origin and destination airports in this dataset is caused by something other than randomization, leading us to believe that the variation that is observed in the mean values of the delays in arrival time can be explained by the variation existent in the differentorigin and destination airport locations being considered in this analysis. Alternately, if we fail to reject the null hypothesis, we would infer that the variation that is observed in the mean values of the delays in arrival time cannot be explained by the variation existent in the different origin and destination airports being considered and, as such, is likely caused by randomization.

#Perform an analysis of variance (ANOVA) for the different mean values observed for the delays in arrival time, given the factor 'origin'.
model_origin <- aov(arr_delay~origin,nycflights)
anova(model_origin)

## Analysis of Variance Table
## 
## Response: arr_delay
##             Df  Sum Sq Mean Sq F value  Pr(>F)    
## origin       2   26297   13149    7.05 0.00088 ***
## Residuals 4997 9324609    1866                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

#Perform an analysis of variance (ANOVA) for the different mean values observed for the delays in arrival time, given the factor 'dest'.
model_dest <- aov(arr_delay~dest,nycflights)
anova(model_dest)

## Analysis of Variance Table
## 
## Response: arr_delay
##             Df  Sum Sq Mean Sq F value  Pr(>F)    
## dest        93  342304    3681       2 5.1e-08 ***
## Residuals 4906 9008602    1836                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

#Perform an analysis of variance (ANOVA) for the different mean values observed for the delays in arrival time, given the interaction of 'origin' and 'dest'.
model_interaction <- aov(arr_delay~origin*dest,nycflights)
anova(model_interaction)

## Analysis of Variance Table
## 
## Response: arr_delay
##               Df  Sum Sq Mean Sq F value  Pr(>F)    
## origin         2   26297   13149    7.26 0.00071 ***
## dest          93  341451    3672    2.03   3e-08 ***
## origin:dest  105  291988    2781    1.54 0.00042 ***
## Residuals   4799 8691170    1811                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

par(mfrow=c(1,1))
#Create an interaction plot that plots the mean values of the delays in arrival time('arr_delay') against the interaction of both 'origin' and 'dest'.
interaction.plot(nycflights$origin,nycflights$dest,nycflights$arr_delay)

plot of chunk unnamed-chunk-11

For the analysis of variance (ANOVA) that is performed where ‘origin’ is analyzed against the response variable ‘arr_delay’, a p-value < 0.05 is returned, indicating that there is roughly a probability of < 0.05 that the resulting associated F-value is the result of solely randomization. Therefore, based on this result, we would reject the null hypothesis, leading us to believe that the variation that is observed in the mean values of the delays in arrival time can be explained by the variation existent in the different origin airport locations being considered in this analysis and, as such, is likely not caused solely by randomization. (See above results for p-value and F-value.)

For the analysis of variance (ANOVA) that is performed where ‘dest’ is analyzed against the response variable ‘arr_delay’, a p-value < 0.05 is returned, indicating that there is roughly a probability of < 0.05 that the resulting associated F-value is the result of solely randomization. Therefore, based on this result, we would reject the null hypothesis, leading us to believe that the variation that is observed in the mean values of the delays in arrival time can be explained by the variation existent in the different destination airport locations being considered in this analysis and, as such, is likely not caused solely by randomization. (See above results for p-value and F-value.)

For the analysis of variance (ANOVA) that is performed where the interation of both ‘origin’ and ‘dest’ is analyzed against the response variable ‘arr_delay’, a p-value < 0.05 is returned, indicating that there is roughly a probability of < 0.05 that the resulting associated F-value is the result of solely randomization. Additionally, upon generating an interaction plot that plots the mean values of the delays in arrival time (‘arr_delay’) against the interaction of both ‘origin’ and ‘dest’, the plot suggests that the interaction of these two factors does have a significant effect on the response variable. Therefore, based on this result, we would reject the null hypothesis, leading us to believe that the variation that is observed in the mean values of the delays in arrival time can be explained by the variation existent in the interation of the different origin and destination airport locations being considered in this analysis and, as such, is likely not caused solely by randomization. (Note: The statistical significance of the effect that ‘origin’ and ‘dest’ each individually have on the response variable ‘arr_delay’ did not change upon developing an interaction model [see Analysis of Variance Table for “model_interaction”].)

Estimation (of Parameters)

In estimating the different parameters of the experiment, I performed summary statistics on relevant data in the dataset pertaining to the delays in arrival time for the flights noted in “nycflights”, which includes both the average arrival time delays (in minutes) and the standard deviation for those arrival time delays.

Tables of Parameter Values

#Display summary statistics of nycflights$arr_delay.
summary(nycflights$arr_delay)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -61.0   -16.0    -4.0     6.9    14.0   474.0

#Display standard deviation of nycflights$arr_delay.
sd(nycflights$arr_delay, na.rm = FALSE)

## [1] 43.25

Diagnostics/Model Adequacy Checking

In verifying the results of this experiment, it’s important to ensure that the dataset itself meets all of the assumptions that correlate with the design approach that was carried out. In this way, we want to make sure that our dataset exhibits normality. Until we know that our dataset does, in fact, exhibit normality, we cannot yet say with confidence that our results are significant and representative of a properly carried-out modeling approach. In verifying our dataset for normality, we can both create a Normal Quantile-Quantile (QQ) Plot of our data and perform a Shapiro-Wilk Test of Normality on our data.

#Create a Normal Q-Q Plot for Delay in Arrival Time Data.
qqnorm(nycflights[,"arr_delay"])
qqline(nycflights[,"arr_delay"])

plot of chunk unnamed-chunk-13

#Create a Normal Q-Q Plot of the residuals for "model_origin".
qqnorm(residuals(model_origin))
qqline(residuals(model_origin))

plot of chunk unnamed-chunk-13

#Create a Normal Q-Q Plot of the residuals for "model_dest".
qqnorm(residuals(model_dest))
qqline(residuals(model_dest))

plot of chunk unnamed-chunk-13

#Create a Normal Q-Q Plot of the residuals for "model_interaction".
qqnorm(residuals(model_interaction))
qqline(residuals(model_interaction))

plot of chunk unnamed-chunk-13

Shapiro-Wilk Test of Normality Results

#Perform Shapiro-Wilk Test of Normality on Delay in Arrival Time Data within "nycflights" dataset (normality is assummed if p > 0.1).
shapiro.test(nycflights[,"arr_delay"])

## 
##  Shapiro-Wilk normality test
## 
## data:  nycflights[, "arr_delay"]
## W = 0.7207, p-value < 2.2e-16

Upon both constructing Normal Q-Q Plots and performing Shapiro-Wilk Tests of Normality on the data in this analysis, we cannot readily assume that our data exhibits normality (since all of the resulting p-values of the Shapiro-Wilk Tests of Normality were < 0.1 and since all of the constructed Normal Q-Q Plots did not particularly display a trend of data that aligned closely with the Normal Q-Q Line). However, research has shown that analyses of variance aren’t very sensitive to even moderate deviations from normality, as per the simulation studies that have been performed using many differenty non-normal distributions (Glass et al. 1972, Harwell et al. 1992, Lix et al. 1996) [2], [3], [4]. So, we are likely still safe to carry out our analysis of variance in this case. Erring on the side of caution, we can perform the nonparametric Kruskal-Wallis rank sum test to back up our original results, which will help us to decide whether the population distributions are identical without necessarily exhibiting a normal distribution.

Kruskal-Wallis Rank Sum Test Results

#Perform Kruskal-Wallis Rank Sum Test on Delay in Arrival Time Data within "nycflights" dataset for both 'origin' and 'dest' (identical populations is assummed if p > 0.05).
kruskal.test(nycflights[,"arr_delay"],nycflights$origin)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  nycflights[, "arr_delay"] and nycflights$origin
## Kruskal-Wallis chi-squared = 21.67, df = 2, p-value = 1.974e-05

kruskal.test(nycflights[,"arr_delay"],nycflights$dest)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  nycflights[, "arr_delay"] and nycflights$dest
## Kruskal-Wallis chi-squared = 220.2, df = 93, p-value = 2.569e-12

Since the p-values for both of the resulting Kruskal-Wallist rank sum tests that consider the factors ‘origin’ and ‘dest’ against the response variable ‘arr_delay’ are less than 0.05, we can assume that the mean values of the delays in arrival time both at each of the different origin airport locations and at each of the different destination airport locations (considered separately) are comparatively nonidentical populations. Therefore, this result suggests that we would reject the null hypothesis of our main experiment, leading us to believe that the variation that is observed in the mean values of the delays in arrival time can actually be explained by the variation existent in both the different destination airport locations and the different origin airport locations being considered in this analysis.

In further backing up the confidence that we have with our results, we can generate a “quality of fit” model that plots residual error against each of the fitted models that were developed in our original analysis of variance (ANOVA).

#Create a "Quality of Fit Model" that plots the residuals of "model_origin" against its fitted model.
plot(fitted(model_origin),residuals(model_origin))

plot of chunk unnamed-chunk-16

#Create a "Quality of Fit Model" that plots the residuals of "model_dest" against its fitted model.
plot(fitted(model_dest),residuals(model_dest))

plot of chunk unnamed-chunk-16

#Create a "Quality of Fit Model" that plots the residuals of "model_interaction" against its fitted model.
plot(fitted(model_interaction),residuals(model_interaction))

plot of chunk unnamed-chunk-16

Because each of the resulting plots appears to be scatted and clumped around zero, each of the three ANOVA models developed suggests good fit. Thus, we can confindently rely on both the modeling approach that we carried out and the dataset that we analyzed in justifying the significance of our results.

4. Contingencies to the Experimental Design/Analysis if Modeling Assumptions Failed

If our modeling assumptions had failed in this experiment, we would have certainly needed to modify our experimental design approach. For instance, if we could not assume that our dataset exhibits normality, transformations such as the “Box-Cox Power Transformation” could have been performed on the data to make it approximately normal. Additionally, if we were forced to make no assumptions about the probability distribution of our data, we could have used nonparametric statistics to describe our dataset and to analyze it further as a means for achieving some sort of significant result from our data. For iunstance, performing a “Friedman Two-Way Analysis of Variance Test” to determine the variance of our data without knowing if the data itself exhibits normality or not could be a potentially useful way to work around an inability to assume that the dataset is normally distributed.

5. References to the literature

[1] Online Data Source: http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236

[2] Glass, G.V., P.D. Peckham, and J.R. Sanders. 1972. Consequences of failure to meet assumptions underlying fixed effects analyses of variance and covariance. Rev. Educ. Res. 42: 237-288.

[3] Harwell, M.R., E.N. Rubinstein, W.S. Hayes, and C.C. Olds. 1992. Summarizing Monte Carlo results in methodological research: the one- and two-factor fixed effects ANOVA cases. J. Educ. Stat. 17: 315-339.

[4] Lix, L.M., J.C. Keselman, and H.J. Keselman. 1996. Consequences of assumption violations revisited: A quantitative review of alternatives to the one-way analysis of variance F test. Rev. Educ. Res. 66: 579-619.

6. Appendices

A summary of, or pointer to, the raw data

complete and documented R code

An R data package containing all out-bound flights from NYC in 2013 [+ useful metdata] is used in this analysis. It can be installed from github with devtools::install_github(“hadley/nycflights13”).

Recipe 2: Two or More Factors

Recipes for the Design of Experiments

Analysis of Arrival Delay Times for Flights Departing from NYC (Two-Factor, Multi-Level Experiment)

Brendan Howell

Renselaer Polytechnic Institute

10/02/14 - Version 1.0

1. Setting

Dataset of Out-bound Flights from NYC in 2013 (‘nycflights13’)

Description: This package contains information about all flights that departed from NYC (e.g. EWR, JFK, and LGA) in 2013: 336,776 flights in total.

Factors and Levels

Continuous variables (if any)

Response variables

The Data: How is it organized and what does it look like?

Randomization

2. (Experimental) Design

How will the experiment be organized and conducted to test the hypothesis?

What is the rationale for this design?

Randomize: What is the Randomization Scheme?

Replicate: Are there replicates and/or repeated measures?

Block: Did you use blocking in the design?

3. (Statistical) Analysis

(Exploratory Data Analysis) Graphics and Descriptive Summary

Testing

Estimation (of Parameters)

Tables of Parameter Values

Diagnostics/Model Adequacy Checking

Shapiro-Wilk Test of Normality Results

Kruskal-Wallis Rank Sum Test Results

4. Contingencies to the Experimental Design/Analysis if Modeling Assumptions Failed

5. References to the literature

6. Appendices

A summary of, or pointer to, the raw data

complete and documented R code