This is an R Markdown document. Markdown is a simple formatting syntax for authoring web pages (click the MD toolbar button for help on Markdown).
When you click the Knit HTML button a web page will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
as of August 28, 2014, superceding the version of August 24. Always use the most recent version.
In this study, a two-factor, multi-level experiment will be performed to see if either the “origin location” of a given flight route or the “destination location” of a given flight route (or, both via interaction) has a statistically significant effect on the “delay in arrival time” that is observed upon a flight’s arrival to its designated destination. In the dataset, the factor ‘origin’ refers to the origin location of a given flight route and the factor ‘dest’ refers to the destination location of a given flight route [both of which are represented by their respective 3-letter airport codes]. Additionally, this analysis’ response variable is referred to in the dataset as ‘arr_delay’, which denotes the delay in arrival time [in minutes] that is observed upon a flight’s arrival to its designated destination. (Negative times represent early departure and arrival times in the dataset.)
#Install and load the "nycflights13" dataset into R.
install.packages("nycflights13", repos='http://cran.us.r-project.org')
## package 'nycflights13' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\howelb\AppData\Local\Temp\Rtmpqi4IEy\downloaded_packages
library("nycflights13", lib.loc = "C:/Program Files/R/R-3.1.1/library")
#Assign a variable, "data_raw", to the complete dataframe, "flights".
#Then, display the "head" and "tail" of the dataset, "data_raw".
data_raw<-flights
head(data_raw)
## year month day dep_time dep_delay arr_time arr_delay carrier tailnum
## 1 2013 1 1 517 2 830 11 UA N14228
## 2 2013 1 1 533 4 850 20 UA N24211
## 3 2013 1 1 542 2 923 33 AA N619AA
## 4 2013 1 1 544 -1 1004 -18 B6 N804JB
## 5 2013 1 1 554 -6 812 -25 DL N668DN
## 6 2013 1 1 554 -4 740 12 UA N39463
## flight origin dest air_time distance hour minute
## 1 1545 EWR IAH 227 1400 5 17
## 2 1714 LGA IAH 227 1416 5 33
## 3 1141 JFK MIA 160 1089 5 42
## 4 725 JFK BQN 183 1576 5 44
## 5 461 LGA ATL 116 762 5 54
## 6 1696 EWR ORD 150 719 5 54
tail(data_raw)
## year month day dep_time dep_delay arr_time arr_delay carrier
## 336771 2013 9 30 NA NA NA NA EV
## 336772 2013 9 30 NA NA NA NA 9E
## 336773 2013 9 30 NA NA NA NA 9E
## 336774 2013 9 30 NA NA NA NA MQ
## 336775 2013 9 30 NA NA NA NA MQ
## 336776 2013 9 30 NA NA NA NA MQ
## tailnum flight origin dest air_time distance hour minute
## 336771 N740EV 5274 LGA BNA NA 764 NA NA
## 336772 3393 JFK DCA NA 213 NA NA
## 336773 3525 LGA SYR NA 198 NA NA
## 336774 N535MQ 3461 LGA BNA NA 764 NA NA
## 336775 N511MQ 3572 LGA CLE NA 419 NA NA
## 336776 N839MQ 3531 LGA RDU NA 431 NA NA
This analysis considers two different factors (with each having multiple levels), which include ‘origin’ and ‘dest’. In the original dataset “data_raw”, the factor ‘origin’ has 3 unique levels and the factor ‘dest’ has 105 unique levels. These factors were selected intuitively, since this analysis aims to determine whether or not the origin and destination locations of a given flight route have a significant effect on the resulting arrival delay time that a given flight experiences.
#Display the summary statistics of "data_raw".
summary(data_raw)
## year month day dep_time
## Min. :2013 Min. : 1.00 Min. : 1.0 Min. : 1
## 1st Qu.:2013 1st Qu.: 4.00 1st Qu.: 8.0 1st Qu.: 907
## Median :2013 Median : 7.00 Median :16.0 Median :1401
## Mean :2013 Mean : 6.55 Mean :15.7 Mean :1349
## 3rd Qu.:2013 3rd Qu.:10.00 3rd Qu.:23.0 3rd Qu.:1744
## Max. :2013 Max. :12.00 Max. :31.0 Max. :2400
## NA's :8255
## dep_delay arr_time arr_delay carrier
## Min. : -43 Min. : 1 Min. : -86 Length:336776
## 1st Qu.: -5 1st Qu.:1104 1st Qu.: -17 Class :character
## Median : -2 Median :1535 Median : -5 Mode :character
## Mean : 13 Mean :1502 Mean : 7
## 3rd Qu.: 11 3rd Qu.:1940 3rd Qu.: 14
## Max. :1301 Max. :2400 Max. :1272
## NA's :8255 NA's :8713 NA's :9430
## tailnum flight origin dest
## Length:336776 Min. : 1 Length:336776 Length:336776
## Class :character 1st Qu.: 553 Class :character Class :character
## Mode :character Median :1496 Mode :character Mode :character
## Mean :1972
## 3rd Qu.:3465
## Max. :8500
##
## air_time distance hour minute
## Min. : 20 Min. : 17 Min. : 0 Min. : 0
## 1st Qu.: 82 1st Qu.: 502 1st Qu.: 9 1st Qu.:16
## Median :129 Median : 872 Median :14 Median :31
## Mean :151 Mean :1040 Mean :13 Mean :32
## 3rd Qu.:192 3rd Qu.:1389 3rd Qu.:17 3rd Qu.:49
## Max. :695 Max. :4983 Max. :24 Max. :59
## NA's :9430 NA's :8255 NA's :8255
#Display the names found in "data_raw".
names(data_raw)
## [1] "year" "month" "day" "dep_time" "dep_delay"
## [6] "arr_time" "arr_delay" "carrier" "tailnum" "flight"
## [11] "origin" "dest" "air_time" "distance" "hour"
## [16] "minute"
#Display the structure of "data_raw".
str(data_raw)
## Classes 'tbl_df', 'tbl' and 'data.frame': 336776 obs. of 16 variables:
## $ year : int 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
## $ month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ day : int 1 1 1 1 1 1 1 1 1 1 ...
## $ dep_time : int 517 533 542 544 554 554 555 557 557 558 ...
## $ dep_delay: num 2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
## $ arr_time : int 830 850 923 1004 812 740 913 709 838 753 ...
## $ arr_delay: num 11 20 33 -18 -25 12 19 -14 -8 8 ...
## $ carrier : chr "UA" "UA" "AA" "B6" ...
## $ tailnum : chr "N14228" "N24211" "N619AA" "N804JB" ...
## $ flight : int 1545 1714 1141 725 461 1696 507 5708 79 301 ...
## $ origin : chr "EWR" "LGA" "JFK" "JFK" ...
## $ dest : chr "IAH" "IAH" "MIA" "BQN" ...
## $ air_time : num 227 227 160 183 116 150 158 53 140 138 ...
## $ distance : num 1400 1416 1089 1576 762 ...
## $ hour : num 5 5 5 5 5 5 5 5 5 5 ...
## $ minute : num 17 33 42 44 54 54 55 57 57 58 ...
#Categorize 'origin' as a factor and display its resulting levels.
data_raw$origin = as.factor(data_raw$origin)
levels(data_raw$origin)
## [1] "EWR" "JFK" "LGA"
#Categorize 'dest' as a factor and display its resulting levels.
data_raw$dest = as.factor(data_raw$dest)
levels(data_raw$dest)
## [1] "ABQ" "ACK" "ALB" "ANC" "ATL" "AUS" "AVL" "BDL" "BGR" "BHM" "BNA"
## [12] "BOS" "BQN" "BTV" "BUF" "BUR" "BWI" "BZN" "CAE" "CAK" "CHO" "CHS"
## [23] "CLE" "CLT" "CMH" "CRW" "CVG" "DAY" "DCA" "DEN" "DFW" "DSM" "DTW"
## [34] "EGE" "EYW" "FLL" "GRR" "GSO" "GSP" "HDN" "HNL" "HOU" "IAD" "IAH"
## [45] "ILM" "IND" "JAC" "JAX" "LAS" "LAX" "LEX" "LGA" "LGB" "MCI" "MCO"
## [56] "MDW" "MEM" "MHT" "MIA" "MKE" "MSN" "MSP" "MSY" "MTJ" "MVY" "MYR"
## [67] "OAK" "OKC" "OMA" "ORD" "ORF" "PBI" "PDX" "PHL" "PHX" "PIT" "PSE"
## [78] "PSP" "PVD" "PWM" "RDU" "RIC" "ROC" "RSW" "SAN" "SAT" "SAV" "SBN"
## [89] "SDF" "SEA" "SFO" "SJC" "SJU" "SLC" "SMF" "SNA" "SRQ" "STL" "STT"
## [100] "SYR" "TPA" "TUL" "TVC" "TYS" "XNA"
In this dataset, there are a few variables that can be considered to be continuous variables; these variables are the ones which are categorized as being numeric variables. By this standard, the continuous variables that exist in this dataset include ‘dep_delay’ (which refers to the delay in flight departure [in minutes]), ‘arr_delay’(which refers to the delay in flight arrival [in minutes]), ‘air_time’ (which refers to amount of time, in minutes, that a flight is airborne), and ‘distance’ (which refers to the distance traveled by a flight [in miles]). The two other variables in the dataset that are categorized as being numeric are ‘hour’ and ‘minute’, which refer to the time of departure of a given flight. Despite this, they cannot accurately be considered to be continuous variables in this dataset, since they are really expressed as integer values in the dataset. The two main factors that our analysis considers, ‘origin’ and ‘dest’, are treated as categorical variables here.
This analysis will consider one response variable, ‘arr_delay’, which denotes the delay in arrival time [in minutes] that is observed upon a flight’s arrival to its designated destination. (Negative times represent early departure and arrival times in the dataset.)
This dataset contains information about all flights that departed from NYC (e.g. EWR, JFK and LGA) in 2013, which amounts to 336,776 flights in total. It contains 336,776 observations of 16 variables, which include ‘year’, ‘month’, ‘day’, ‘dep_time’, ‘dep_delay’, ‘arr_time’, ‘arr_delay’, ‘carrier’, ‘tailnum’, ‘flight’, ‘origin’, ‘dest’, ‘air_time’, ‘distance’, ‘hour’, and ‘minute’. The variables that are made up of integer values include ‘year’ (which refers to the year of departure), ‘month’ (which refers to the month of departure), ‘day’ (which refers to the day of departure), ‘dep_time’ (which refers to the time of departure [in hhmm notation]), ‘arr_time’ (which refers to the time of arrival [in hhmm notation]), and ‘flight’ (which refers to the flight number). The variables that are made up of character values include ‘carrier’ (which refers to the airline carrier), ‘tailnum’ (which refers to plane tail number), ‘origin’ (which refers to the origin location), and ‘dest’ (which refers to the destination location). Lastly, the variables that are made up of numeric values include ‘dep_delay’ (which refers to departure delays expressed in minutes), ‘arr_delay’ (which refers to arrival delays expressed in minutes), ‘air_time’ (which refers to the amount of time a plane is airborne, expressed in minutes), ‘distance’ (which refers to the distance traveled by a flight, expressed in miles), ‘hour’ (which refers to the hour of departure), and ‘minute’ (which refers to the minute of departure).
According to the Research and Innovative Technology Administration Bureau of Transportation Statistics, this flight dataset includes information about every flight that departed from New York City in 2013 [1]. Therefore, due to the size and amount of data collected here, we can make the assumption that this dataset exhibits randomization without any bias.
In this experiment, we are trying to determine whether or not the variation that is observed in the response variable (which corresponds to ‘arr_delay’ in this analysis) can be explained by the variation existent in the two different treatments of the experiment (which corresponds to ‘origin’ and ‘dest’). Therefore, the null hypothesis that is being tested states that the origin and the destination locations of a given flight route do not have a significant effect on the delay in arrival time that is observed. In carrying out this analysis, we perform an analysis of variance (ANOVA) for the delay in arrival time (‘arr_delay’) to see if there is a significant difference in the means for this response variable when considering both the origin (‘origin’) and destination (‘dest’) locations of the flights contained in this dataset.
The rationale for this design lies primarily in the fact that we’re trying to determine if the origin and destination locations of a given flight have any effect on the overall arrival time (represented as any delay in arrival time) that is achieved for a given flight. So, since this delay in arrival time is a useful and relevant metric to consider when determining why a given flight experiences any delay in their arrival , this design of a two-factor, multi-level experiment (as it corresponds to an analysis of variance) was crafted to see if the origin and destination locations of a given flight route have any effect on the delays experienced in a flight’s arrival time. Therefore, by performing this analysis, we can hope to receive some insight regarding the different factors that come into play which can cause delays in arrival times for flights.
While our original assumption claimed that the entire vehicle dataset exhibits randomization, our analysis needed to ensure that we developed a completely randomized design. In meeting this objective, a new dataset is created (“nycflights”) that randomly selects 5000 observations from “data_raw”. In creating this new dataset, we’re ensuring that our analysis considers a large sample of an even larger population and randomizes the order that the runs of the data are placed in the dataset (since they were originally listed chronologically by departure date). After creating this new dataset, we can now assume that our randomization scheme represents a completely randomized design.
#Remove any rows that contain "NA" in "data_raw", creating "data_clean".
#Randomly take a sample of 5000 observations from "data_clean", creating "nycflights".
data_clean<-na.omit(data_raw)
flight.index = sample(1:nrow(data_clean),5000,replace=FALSE)
nycflights<-data_clean[flight.index,]
In this experiment, there are no replicates or repeated measures present.
By designing this experiment in such a way that allows for multiple levels to be considered (so as to reach the most accurate conclusions possible), blocking was not considered in this two-factor, multi-level design.
In beginning to display this data graphically, summary statistics were gathered for the newly created dataset, “nycflights”. Additionally, histograms and boxplots were created to represent the different observations of arrival time delays existent within this dataset of randomly selected/sampled observations.
#Display the summary statistics of "nycflights".
summary(nycflights)
## year month day dep_time
## Min. :2013 Min. : 1.00 Min. : 1.0 Min. : 1
## 1st Qu.:2013 1st Qu.: 4.00 1st Qu.: 8.0 1st Qu.: 858
## Median :2013 Median : 7.00 Median :16.0 Median :1357
## Mean :2013 Mean : 6.54 Mean :15.7 Mean :1341
## 3rd Qu.:2013 3rd Qu.:10.00 3rd Qu.:23.0 3rd Qu.:1746
## Max. :2013 Max. :12.00 Max. :31.0 Max. :2358
##
## dep_delay arr_time arr_delay carrier
## Min. :-22.0 Min. : 1 Min. :-61.0 Length:5000
## 1st Qu.: -5.0 1st Qu.:1052 1st Qu.:-16.0 Class :character
## Median : -1.0 Median :1530 Median : -4.0 Mode :character
## Mean : 12.5 Mean :1493 Mean : 6.9
## 3rd Qu.: 12.0 3rd Qu.:1941 3rd Qu.: 14.0
## Max. :388.0 Max. :2400 Max. :474.0
##
## tailnum flight origin dest air_time
## Length:5000 Min. : 1 EWR:1795 ORD : 257 Min. : 21
## Class :character 1st Qu.: 512 JFK:1695 BOS : 241 1st Qu.: 80
## Mode :character Median :1455 LGA:1510 ATL : 238 Median :130
## Mean :1930 LAX : 237 Mean :150
## 3rd Qu.:3407 CLT : 216 3rd Qu.:194
## Max. :6181 MCO : 197 Max. :654
## (Other):3614
## distance hour minute
## Min. : 94 Min. : 0.0 Min. : 0.0
## 1st Qu.: 502 1st Qu.: 8.0 1st Qu.:16.0
## Median : 872 Median :13.0 Median :32.0
## Mean :1044 Mean :13.1 Mean :31.8
## 3rd Qu.:1389 3rd Qu.:17.0 3rd Qu.:49.0
## Max. :4983 Max. :23.0 Max. :59.0
##
#Display the names found in "nycflights".
names(nycflights)
## [1] "year" "month" "day" "dep_time" "dep_delay"
## [6] "arr_time" "arr_delay" "carrier" "tailnum" "flight"
## [11] "origin" "dest" "air_time" "distance" "hour"
## [16] "minute"
#Display the structure of "nycflights".
str(nycflights)
## Classes 'tbl_df', 'tbl' and 'data.frame': 5000 obs. of 16 variables:
## $ year : int 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
## $ month : int 4 3 1 6 7 10 9 10 10 2 ...
## $ day : int 25 25 11 25 28 4 19 15 13 19 ...
## $ dep_time : int 1940 809 1956 1150 2320 1751 1345 1650 1828 2056 ...
## $ dep_delay: num 10 -1 -4 -5 115 -9 0 5 28 51 ...
## $ arr_time : int 2049 912 2242 1311 152 2058 1623 1910 2144 2302 ...
## $ arr_delay: num -1 -13 -30 1 113 -6 -39 -6 40 32 ...
## $ carrier : chr "EV" "AA" "B6" "MQ" ...
## $ tailnum : chr "N826AS" "N3FSAA" "N657JB" "N0EGMQ" ...
## $ flight : int 5714 1838 801 3461 1611 359 476 2042 359 4033 ...
## $ origin : Factor w/ 3 levels "EWR","JFK","LGA": 2 2 2 3 1 2 3 1 2 3 ...
## $ dest : Factor w/ 105 levels "ABQ","ACK","ALB",..: 43 12 36 11 44 16 59 5 16 104 ...
## $ air_time : num 51 33 140 105 190 344 139 106 350 107 ...
## $ distance : num 228 187 1069 764 1400 ...
## $ hour : num 19 8 19 11 23 17 13 16 18 20 ...
## $ minute : num 40 9 56 50 20 51 45 50 28 56 ...
## - attr(*, "na.action")=Class 'omit' Named int [1:9430] 472 478 616 644 726 734 755 839 840 841 ...
## .. ..- attr(*, "names")= chr [1:9430] "472" "478" "616" "644" ...
#Display the head and tail of "nycflights".
head(nycflights)
## year month day dep_time dep_delay arr_time arr_delay carrier
## 188651 2013 4 25 1940 10 2049 -1 EV
## 158692 2013 3 25 809 -1 912 -13 AA
## 9663 2013 1 11 1956 -4 2242 -30 B6
## 245110 2013 6 25 1150 -5 1311 1 MQ
## 276798 2013 7 28 2320 115 152 113 UA
## 30692 2013 10 4 1751 -9 2058 -6 B6
## tailnum flight origin dest air_time distance hour minute
## 188651 N826AS 5714 JFK IAD 51 228 19 40
## 158692 N3FSAA 1838 JFK BOS 33 187 8 9
## 9663 N657JB 801 JFK FLL 140 1069 19 56
## 245110 N0EGMQ 3461 LGA BNA 105 764 11 50
## 276798 N36447 1611 EWR IAH 190 1400 23 20
## 30692 N564JB 359 JFK BUR 344 2465 17 51
tail(nycflights)
## year month day dep_time dep_delay arr_time arr_delay carrier
## 11126 2013 1 13 1906 56 2208 38 AA
## 141000 2013 3 6 820 5 1013 3 MQ
## 135535 2013 2 28 924 -5 1127 -53 UA
## 271705 2013 7 23 1629 79 1841 67 EV
## 18052 2013 1 21 1839 -6 2144 -37 DL
## 293508 2013 8 15 756 -5 1031 -2 UA
## tailnum flight origin dest air_time distance hour minute
## 11126 N3ECAA 1611 LGA MIA 144 1096 19 6
## 141000 N711MQ 4490 LGA CMH 76 479 8 20
## 135535 N504UA 926 EWR LAS 278 2227 9 24
## 271705 N25134 4377 EWR JAX 115 820 16 29
## 18052 N329NW 2190 JFK MIA 166 1089 18 39
## 293508 N419UA 706 EWR LAS 297 2227 7 56
#Display the levels of 'origin' and 'dest' within "nycflights".
levels(nycflights$origin)
## [1] "EWR" "JFK" "LGA"
levels(nycflights$dest)
## [1] "ABQ" "ACK" "ALB" "ANC" "ATL" "AUS" "AVL" "BDL" "BGR" "BHM" "BNA"
## [12] "BOS" "BQN" "BTV" "BUF" "BUR" "BWI" "BZN" "CAE" "CAK" "CHO" "CHS"
## [23] "CLE" "CLT" "CMH" "CRW" "CVG" "DAY" "DCA" "DEN" "DFW" "DSM" "DTW"
## [34] "EGE" "EYW" "FLL" "GRR" "GSO" "GSP" "HDN" "HNL" "HOU" "IAD" "IAH"
## [45] "ILM" "IND" "JAC" "JAX" "LAS" "LAX" "LEX" "LGA" "LGB" "MCI" "MCO"
## [56] "MDW" "MEM" "MHT" "MIA" "MKE" "MSN" "MSP" "MSY" "MTJ" "MVY" "MYR"
## [67] "OAK" "OKC" "OMA" "ORD" "ORF" "PBI" "PDX" "PHL" "PHX" "PIT" "PSE"
## [78] "PSP" "PVD" "PWM" "RDU" "RIC" "ROC" "RSW" "SAN" "SAT" "SAV" "SBN"
## [89] "SDF" "SEA" "SFO" "SJC" "SJU" "SLC" "SMF" "SNA" "SRQ" "STL" "STT"
## [100] "SYR" "TPA" "TUL" "TVC" "TYS" "XNA"
par(mfrow=c(1,1))
#Create a histogram of Arrival Time Delays ('arr_delay') across all 2013 flights.
hist(nycflights$arr_delay, main = "Arrival Time Delays [in minutes]")
par(mfrow=c(1,1))
#Create a boxplot of Arrival Time Delays ('arr_delay') at each destination airport ('dest') [ylim = c(min|'arr_delay',max|'arr_delay')].
boxplot(nycflights$arr_delay~nycflights$dest, main = "Arrival Time Delays [in minutes]", ylim = c(min(nycflights$arr_delay),max(nycflights$arr_delay)), ylab = "Minutes")
par(mfrow=c(1,1))
#Create a boxplot of Arrival Time Delays ('arr_delay') at each destination airport ('dest') [ylim = c(~1st Quarter|'arr_delay',~3rd Quarter|'arr_delay')].(Gives better indication of median differences among destination airport locations.)
boxplot(nycflights$arr_delay~nycflights$dest, main = "Arrival Time Delays [in minutes]", ylim = c(-20,20), ylab = "Minutes")
In order to determine if the variation that is observed in the response variable (which corresponds to the delay in arrival times in this analysis) can be explained by the variation existent in the treatments of the experiment (which correspond to both the origin and destination locations of the flight routes), an analysis of variance (ANOVA) is performed as a means for analyzing the differences in arrival time delays for each of the different corresponding origin and destination airports (for each given flight route) contained within this dataset. For each of the three ANOVA models that are designed in this experiment, the null hypothesis that is being tested (which we will either reject or fail to reject by the end of our analysis) states that the origin and the destination locations of a given flight route do not have a significant effect on the delay in arrival time that is observed, implying that the differences in mean values of the delays in arrival time for each of the origin and destination airports were solely the result of randomization in this experiment. In other words, if we reject the null hypothesis, we would infer that the differences in mean values of the delays in arrival time for each of the corresponding origin and destination airports in this dataset is caused by something other than randomization, leading us to believe that the variation that is observed in the mean values of the delays in arrival time can be explained by the variation existent in the differentorigin and destination airport locations being considered in this analysis. Alternately, if we fail to reject the null hypothesis, we would infer that the variation that is observed in the mean values of the delays in arrival time cannot be explained by the variation existent in the different origin and destination airports being considered and, as such, is likely caused by randomization.
#Perform an analysis of variance (ANOVA) for the different mean values observed for the delays in arrival time, given the factor 'origin'.
model_origin <- aov(arr_delay~origin,nycflights)
anova(model_origin)
## Analysis of Variance Table
##
## Response: arr_delay
## Df Sum Sq Mean Sq F value Pr(>F)
## origin 2 26297 13149 7.05 0.00088 ***
## Residuals 4997 9324609 1866
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Perform an analysis of variance (ANOVA) for the different mean values observed for the delays in arrival time, given the factor 'dest'.
model_dest <- aov(arr_delay~dest,nycflights)
anova(model_dest)
## Analysis of Variance Table
##
## Response: arr_delay
## Df Sum Sq Mean Sq F value Pr(>F)
## dest 93 342304 3681 2 5.1e-08 ***
## Residuals 4906 9008602 1836
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Perform an analysis of variance (ANOVA) for the different mean values observed for the delays in arrival time, given the interaction of 'origin' and 'dest'.
model_interaction <- aov(arr_delay~origin*dest,nycflights)
anova(model_interaction)
## Analysis of Variance Table
##
## Response: arr_delay
## Df Sum Sq Mean Sq F value Pr(>F)
## origin 2 26297 13149 7.26 0.00071 ***
## dest 93 341451 3672 2.03 3e-08 ***
## origin:dest 105 291988 2781 1.54 0.00042 ***
## Residuals 4799 8691170 1811
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
par(mfrow=c(1,1))
#Create an interaction plot that plots the mean values of the delays in arrival time('arr_delay') against the interaction of both 'origin' and 'dest'.
interaction.plot(nycflights$origin,nycflights$dest,nycflights$arr_delay)
For the analysis of variance (ANOVA) that is performed where ‘origin’ is analyzed against the response variable ‘arr_delay’, a p-value < 0.05 is returned, indicating that there is roughly a probability of < 0.05 that the resulting associated F-value is the result of solely randomization. Therefore, based on this result, we would reject the null hypothesis, leading us to believe that the variation that is observed in the mean values of the delays in arrival time can be explained by the variation existent in the different origin airport locations being considered in this analysis and, as such, is likely not caused solely by randomization. (See above results for p-value and F-value.)
For the analysis of variance (ANOVA) that is performed where ‘dest’ is analyzed against the response variable ‘arr_delay’, a p-value < 0.05 is returned, indicating that there is roughly a probability of < 0.05 that the resulting associated F-value is the result of solely randomization. Therefore, based on this result, we would reject the null hypothesis, leading us to believe that the variation that is observed in the mean values of the delays in arrival time can be explained by the variation existent in the different destination airport locations being considered in this analysis and, as such, is likely not caused solely by randomization. (See above results for p-value and F-value.)
For the analysis of variance (ANOVA) that is performed where the interation of both ‘origin’ and ‘dest’ is analyzed against the response variable ‘arr_delay’, a p-value < 0.05 is returned, indicating that there is roughly a probability of < 0.05 that the resulting associated F-value is the result of solely randomization. Additionally, upon generating an interaction plot that plots the mean values of the delays in arrival time (‘arr_delay’) against the interaction of both ‘origin’ and ‘dest’, the plot suggests that the interaction of these two factors does have a significant effect on the response variable. Therefore, based on this result, we would reject the null hypothesis, leading us to believe that the variation that is observed in the mean values of the delays in arrival time can be explained by the variation existent in the interation of the different origin and destination airport locations being considered in this analysis and, as such, is likely not caused solely by randomization. (Note: The statistical significance of the effect that ‘origin’ and ‘dest’ each individually have on the response variable ‘arr_delay’ did not change upon developing an interaction model [see Analysis of Variance Table for “model_interaction”].)
In estimating the different parameters of the experiment, I performed summary statistics on relevant data in the dataset pertaining to the delays in arrival time for the flights noted in “nycflights”, which includes both the average arrival time delays (in minutes) and the standard deviation for those arrival time delays.
#Display summary statistics of nycflights$arr_delay.
summary(nycflights$arr_delay)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -61.0 -16.0 -4.0 6.9 14.0 474.0
#Display standard deviation of nycflights$arr_delay.
sd(nycflights$arr_delay, na.rm = FALSE)
## [1] 43.25
In verifying the results of this experiment, it’s important to ensure that the dataset itself meets all of the assumptions that correlate with the design approach that was carried out. In this way, we want to make sure that our dataset exhibits normality. Until we know that our dataset does, in fact, exhibit normality, we cannot yet say with confidence that our results are significant and representative of a properly carried-out modeling approach. In verifying our dataset for normality, we can both create a Normal Quantile-Quantile (QQ) Plot of our data and perform a Shapiro-Wilk Test of Normality on our data.
#Create a Normal Q-Q Plot for Delay in Arrival Time Data.
qqnorm(nycflights[,"arr_delay"])
qqline(nycflights[,"arr_delay"])
#Create a Normal Q-Q Plot of the residuals for "model_origin".
qqnorm(residuals(model_origin))
qqline(residuals(model_origin))
#Create a Normal Q-Q Plot of the residuals for "model_dest".
qqnorm(residuals(model_dest))
qqline(residuals(model_dest))
#Create a Normal Q-Q Plot of the residuals for "model_interaction".
qqnorm(residuals(model_interaction))
qqline(residuals(model_interaction))
#Perform Shapiro-Wilk Test of Normality on Delay in Arrival Time Data within "nycflights" dataset (normality is assummed if p > 0.1).
shapiro.test(nycflights[,"arr_delay"])
##
## Shapiro-Wilk normality test
##
## data: nycflights[, "arr_delay"]
## W = 0.7207, p-value < 2.2e-16
Upon both constructing Normal Q-Q Plots and performing Shapiro-Wilk Tests of Normality on the data in this analysis, we cannot readily assume that our data exhibits normality (since all of the resulting p-values of the Shapiro-Wilk Tests of Normality were < 0.1 and since all of the constructed Normal Q-Q Plots did not particularly display a trend of data that aligned closely with the Normal Q-Q Line). However, research has shown that analyses of variance aren’t very sensitive to even moderate deviations from normality, as per the simulation studies that have been performed using many differenty non-normal distributions (Glass et al. 1972, Harwell et al. 1992, Lix et al. 1996) [2], [3], [4]. So, we are likely still safe to carry out our analysis of variance in this case. Erring on the side of caution, we can perform the nonparametric Kruskal-Wallis rank sum test to back up our original results, which will help us to decide whether the population distributions are identical without necessarily exhibiting a normal distribution.
#Perform Kruskal-Wallis Rank Sum Test on Delay in Arrival Time Data within "nycflights" dataset for both 'origin' and 'dest' (identical populations is assummed if p > 0.05).
kruskal.test(nycflights[,"arr_delay"],nycflights$origin)
##
## Kruskal-Wallis rank sum test
##
## data: nycflights[, "arr_delay"] and nycflights$origin
## Kruskal-Wallis chi-squared = 21.67, df = 2, p-value = 1.974e-05
kruskal.test(nycflights[,"arr_delay"],nycflights$dest)
##
## Kruskal-Wallis rank sum test
##
## data: nycflights[, "arr_delay"] and nycflights$dest
## Kruskal-Wallis chi-squared = 220.2, df = 93, p-value = 2.569e-12
Since the p-values for both of the resulting Kruskal-Wallist rank sum tests that consider the factors ‘origin’ and ‘dest’ against the response variable ‘arr_delay’ are less than 0.05, we can assume that the mean values of the delays in arrival time both at each of the different origin airport locations and at each of the different destination airport locations (considered separately) are comparatively nonidentical populations. Therefore, this result suggests that we would reject the null hypothesis of our main experiment, leading us to believe that the variation that is observed in the mean values of the delays in arrival time can actually be explained by the variation existent in both the different destination airport locations and the different origin airport locations being considered in this analysis.
In further backing up the confidence that we have with our results, we can generate a “quality of fit” model that plots residual error against each of the fitted models that were developed in our original analysis of variance (ANOVA).
#Create a "Quality of Fit Model" that plots the residuals of "model_origin" against its fitted model.
plot(fitted(model_origin),residuals(model_origin))
#Create a "Quality of Fit Model" that plots the residuals of "model_dest" against its fitted model.
plot(fitted(model_dest),residuals(model_dest))
#Create a "Quality of Fit Model" that plots the residuals of "model_interaction" against its fitted model.
plot(fitted(model_interaction),residuals(model_interaction))
Because each of the resulting plots appears to be scatted and clumped around zero, each of the three ANOVA models developed suggests good fit. Thus, we can confindently rely on both the modeling approach that we carried out and the dataset that we analyzed in justifying the significance of our results.
If our modeling assumptions had failed in this experiment, we would have certainly needed to modify our experimental design approach. For instance, if we could not assume that our dataset exhibits normality, transformations such as the “Box-Cox Power Transformation” could have been performed on the data to make it approximately normal. Additionally, if we were forced to make no assumptions about the probability distribution of our data, we could have used nonparametric statistics to describe our dataset and to analyze it further as a means for achieving some sort of significant result from our data. For iunstance, performing a “Friedman Two-Way Analysis of Variance Test” to determine the variance of our data without knowing if the data itself exhibits normality or not could be a potentially useful way to work around an inability to assume that the dataset is normally distributed.
[1] Online Data Source: http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236
[2] Glass, G.V., P.D. Peckham, and J.R. Sanders. 1972. Consequences of failure to meet assumptions underlying fixed effects analyses of variance and covariance. Rev. Educ. Res. 42: 237-288.
[3] Harwell, M.R., E.N. Rubinstein, W.S. Hayes, and C.C. Olds. 1992. Summarizing Monte Carlo results in methodological research: the one- and two-factor fixed effects ANOVA cases. J. Educ. Stat. 17: 315-339.
[4] Lix, L.M., J.C. Keselman, and H.J. Keselman. 1996. Consequences of assumption violations revisited: A quantitative review of alternatives to the one-way analysis of variance F test. Rev. Educ. Res. 66: 579-619.
An R data package containing all out-bound flights from NYC in 2013 [+ useful metdata] is used in this analysis. It can be installed from github with devtools::install_github(“hadley/nycflights13”).