Recipe 1: Example of Descriptive Statistics

Recipes for the Design of Experiments: Recipe Outline

Flight Delays

Jane Braun

RPI

September 14th Version 1.0

1. Setting

System under test

Choose one of the large datasets listed on the Realtime Board (e.g., babynames or nasaweather).
Make sure you have > 1000 data points. What is the problem that you were given?

load("C:/Users/braunj6/Documents/Fall 2014/Design of Experiments/flights.rda")

x<-flights
head(x)

##   year month day dep_time dep_delay arr_time arr_delay carrier tailnum
## 1 2013     1   1      517         2      830        11      UA  N14228
## 2 2013     1   1      533         4      850        20      UA  N24211
## 3 2013     1   1      542         2      923        33      AA  N619AA
## 4 2013     1   1      544        -1     1004       -18      B6  N804JB
## 5 2013     1   1      554        -6      812       -25      DL  N668DN
## 6 2013     1   1      554        -4      740        12      UA  N39463
##   flight origin dest air_time distance hour minute
## 1   1545    EWR  IAH      227     1400    5     17
## 2   1714    LGA  IAH      227     1416    5     33
## 3   1141    JFK  MIA      160     1089    5     42
## 4    725    JFK  BQN      183     1576    5     44
## 5    461    LGA  ATL      116      762    5     54
## 6   1696    EWR  ORD      150      719    5     54

Factors and Levels

In this dataset, the factors of interest were the airline carrier, the origin location, and the destination location. Airline carrier had 16 levels, the origin location had 3 levels, and the destination location had 105 levels.
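The level counts above can be checked directly in R. The following is a minimal sketch, assuming a small stand-in data frame (here called `toy`, a hypothetical name) built from the six rows shown by head(x); on the full data the same call would be applied to `x`.

```r
# Stand-in for the flights data: only the first six rows printed by head(x)
toy <- data.frame(
  carrier = c("UA", "UA", "AA", "B6", "DL", "UA"),
  origin  = c("EWR", "LGA", "JFK", "JFK", "LGA", "EWR"),
  dest    = c("IAH", "IAH", "MIA", "BQN", "ATL", "ORD"),
  stringsAsFactors = FALSE
)

# Number of distinct levels in each candidate factor
n_levels <- sapply(toy[c("carrier", "origin", "dest")],
                   function(col) length(unique(col)))
print(n_levels)
```

On these six rows the counts are small; the 16, 3, and 105 reported above refer to the full dataset.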

tail(x)

##        year month day dep_time dep_delay arr_time arr_delay carrier
## 336771 2013     9  30       NA        NA       NA        NA      EV
## 336772 2013     9  30       NA        NA       NA        NA      9E
## 336773 2013     9  30       NA        NA       NA        NA      9E
## 336774 2013     9  30       NA        NA       NA        NA      MQ
## 336775 2013     9  30       NA        NA       NA        NA      MQ
## 336776 2013     9  30       NA        NA       NA        NA      MQ
##        tailnum flight origin dest air_time distance hour minute
## 336771  N740EV   5274    LGA  BNA       NA      764   NA     NA
## 336772           3393    JFK  DCA       NA      213   NA     NA
## 336773           3525    LGA  SYR       NA      198   NA     NA
## 336774  N535MQ   3461    LGA  BNA       NA      764   NA     NA
## 336775  N511MQ   3572    LGA  CLE       NA      419   NA     NA
## 336776  N839MQ   3531    LGA  RDU       NA      431   NA     NA

summary(x)

##       year          month            day          dep_time   
##  Min.   :2013   Min.   : 1.00   Min.   : 1.0   Min.   :   1  
##  1st Qu.:2013   1st Qu.: 4.00   1st Qu.: 8.0   1st Qu.: 907  
##  Median :2013   Median : 7.00   Median :16.0   Median :1401  
##  Mean   :2013   Mean   : 6.55   Mean   :15.7   Mean   :1349  
##  3rd Qu.:2013   3rd Qu.:10.00   3rd Qu.:23.0   3rd Qu.:1744  
##  Max.   :2013   Max.   :12.00   Max.   :31.0   Max.   :2400  
##                                                NA's   :8255  
##    dep_delay       arr_time      arr_delay      carrier         
##  Min.   : -43   Min.   :   1   Min.   : -86   Length:336776     
##  1st Qu.:  -5   1st Qu.:1104   1st Qu.: -17   Class :character  
##  Median :  -2   Median :1535   Median :  -5   Mode  :character  
##  Mean   :  13   Mean   :1502   Mean   :   7                     
##  3rd Qu.:  11   3rd Qu.:1940   3rd Qu.:  14                     
##  Max.   :1301   Max.   :2400   Max.   :1272                     
##  NA's   :8255   NA's   :8713   NA's   :9430                     
##    tailnum              flight        origin              dest          
##  Length:336776      Min.   :   1   Length:336776      Length:336776     
##  Class :character   1st Qu.: 553   Class :character   Class :character  
##  Mode  :character   Median :1496   Mode  :character   Mode  :character  
##                     Mean   :1972                                        
##                     3rd Qu.:3465                                        
##                     Max.   :8500                                        
##                                                                         
##     air_time       distance         hour          minute    
##  Min.   : 20    Min.   :  17   Min.   : 0     Min.   : 0    
##  1st Qu.: 82    1st Qu.: 502   1st Qu.: 9     1st Qu.:16    
##  Median :129    Median : 872   Median :14     Median :31    
##  Mean   :151    Mean   :1040   Mean   :13     Mean   :32    
##  3rd Qu.:192    3rd Qu.:1389   3rd Qu.:17     3rd Qu.:49    
##  Max.   :695    Max.   :4983   Max.   :24     Max.   :59    
##  NA's   :9430                  NA's   :8255   NA's   :8255

str(x)

## Classes 'tbl_df', 'tbl' and 'data.frame':    336776 obs. of  16 variables:
##  $ year     : int  2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
##  $ month    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ day      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ dep_time : int  517 533 542 544 554 554 555 557 557 558 ...
##  $ dep_delay: num  2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
##  $ arr_time : int  830 850 923 1004 812 740 913 709 838 753 ...
##  $ arr_delay: num  11 20 33 -18 -25 12 19 -14 -8 8 ...
##  $ carrier  : chr  "UA" "UA" "AA" "B6" ...
##  $ tailnum  : chr  "N14228" "N24211" "N619AA" "N804JB" ...
##  $ flight   : int  1545 1714 1141 725 461 1696 507 5708 79 301 ...
##  $ origin   : chr  "EWR" "LGA" "JFK" "JFK" ...
##  $ dest     : chr  "IAH" "IAH" "MIA" "BQN" ...
##  $ air_time : num  227 227 160 183 116 150 158 53 140 138 ...
##  $ distance : num  1400 1416 1089 1576 762 ...
##  $ hour     : num  5 5 5 5 5 5 5 5 5 5 ...
##  $ minute   : num  17 33 42 44 54 54 55 57 57 58 ...

Continuous variables

Departure time (dep_time), departure delay (dep_delay), arrival time (arr_time), and arrival delay (arr_delay).

Response variables

We are looking at the cause of airport and flight delays, so the response variables are Departure Delay and Arrival Delay.

The Data: How is it organized and what does it look like?

The data is organized into 16 variables (columns), and each row is a different flight. Each flight has a date and a time, along with key information about its route, the route distance, the flight time, and its promptness (departure and arrival delays).

Randomization

The dataset is made up of observations for each flight leaving from the NYC area. Therefore, it was not a randomized experiment.

2. (Experimental) Design

How will the experiment be organized and conducted to test the hypothesis?

The experiment will use two-sample t-tests on subsets of the data. It will examine how airline carrier and origin location affect flight delays.

What is the rationale for this design?

A t-test is used to determine whether two groups of data are significantly different from each other. In this case, a two-sample t-test is used to compare the means of two groups. R's t.test() defaults to the Welch version, which does not assume equal variances in the two groups.
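To make the comparison of means concrete, here is a minimal sketch of what the Welch two-sample t-test computes. The vectors `a` and `b` are made-up delay values (assumptions for illustration, not taken from the flights data); the hand-computed statistic and degrees of freedom are compared with what t.test() reports.

```r
# Hypothetical delay samples in minutes (not from the flights data)
a <- c(-10, -5, 0, 3, 12, -8, 20)
b <- c(15, 30, 5, 42, 18, 25)

# Per-group variance of the mean
va <- var(a) / length(a)
vb <- var(b) / length(b)

# Welch t statistic: difference in means divided by its standard error
t_manual <- (mean(a) - mean(b)) / sqrt(va + vb)

# Welch-Satterthwaite approximation to the degrees of freedom
df_manual <- (va + vb)^2 / (va^2 / (length(a) - 1) + vb^2 / (length(b) - 1))

# t.test() uses var.equal = FALSE by default, i.e. the Welch test
res <- t.test(a, b)
print(c(t = t_manual, df = df_manual))
print(res)
```

The non-integer degrees of freedom in the test output later in this recipe come from this approximation.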

Randomize: What is the Randomization Scheme?

Because the dataset was a set of observations, there was no randomization.

Replicate: Are there replicates and/or repeated measures?

There were no replicates or repeated measures.

Block: Did you use blocking in the design?

Yes, blocking was done with origin location and airline carrier.

3. (Statistical) Analysis

(Exploratory Data Analysis) Graphics and descriptive summary

#Frequency of flights by each carrier
carrier.freq <- table(x$carrier)
barplot(carrier.freq)

Figure: bar plot of flight frequency by carrier

#Top 5 Carriers
# Sort in decreasing order so that the first five entries are the most frequent carriers
# (base graphics only, so ggplot2 and data() are not needed)
with(x, barplot(sort(table(carrier), decreasing = TRUE)[1:5], main = "Top 5 Carriers"))

Figure: bar plot titled "Top 5 Carriers"

#Frequency of flights by month
barplot(table(as.factor(x$month)), xlab = "Month", ylab = "Frequency", main = "Frequency of Flights by Month")

Figure: bar plot titled "Frequency of Flights by Month"

#Frequency of Origin city
origin.freq <- table(x$origin)
barplot(origin.freq)

Figure: bar plot of flight frequency by origin

#Frequency of Top 5 Destination cities
# Sort in decreasing order, and pass main to barplot() rather than to with()
with(x, barplot(sort(table(dest), decreasing = TRUE)[1:5], main = "Top 5 Destination Cities"))

Figure: bar plot titled "Top 5 Destination Cities"

# Histogram of Departure Delays
hist(x$dep_delay)

Figure: histogram of departure delays

# Histogram of Arrival Delays
hist(x$arr_delay)

Figure: histogram of arrival delays

#Plotting arrival delay against distance traveled to look for a relationship
plot(x$distance,x$arr_delay)

Figure: scatter plot of arrival delay against distance

#Arrival Delay and Departure Delay by Airline Carrier
par(mfrow=c(2,1))
boxplot(x$arr_delay ~ x$carrier, outline=FALSE, main = "Arrival Delays by Carrier")
boxplot(x$dep_delay ~ x$carrier, outline=FALSE, main = "Departure Delays by Carrier")

Figure: box plots titled "Arrival Delays by Carrier" and "Departure Delays by Carrier"

#Arrival Delay and Departure Delay by Origin Location
par(mfrow=c(2,1))
boxplot(x$arr_delay ~ x$origin, outline=FALSE, main = "Arrival Delays by Origin")
boxplot(x$dep_delay ~ x$origin, outline=FALSE, main = "Departure Delays by Origin")

Figure: box plots titled "Arrival Delays by Origin" and "Departure Delays by Origin"

par(mfrow=c(2,1))
#Arrival Delay and Departure Delay by Month
boxplot(x$dep_delay ~ x$month, outline=FALSE, main = "Departure Delays by Month", names = c("Jan","Feb", "Mar", "Apr", "May","June", "Jul","Aug","Sept","Oct","Nov","Dec"))
boxplot(x$arr_delay ~ x$month, outline=FALSE, main = "Arrival Delays by Month", names = c("Jan","Feb", "Mar", "Apr", "May","June", "Jul","Aug","Sept","Oct","Nov","Dec"))

Figure: box plots titled "Departure Delays by Month" and "Arrival Delays by Month"

Testing

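The focus of this recipe was on single-factor t-tests, so each factor being analyzed had to have exactly 2 levels. Because most of the factors in the sample data have more than 2 levels, the factors of interest had to be subset to only 2 levels. The intent was to select the 2 highest-frequency levels. Note, however, that the Welch degrees of freedom reported below for the carrier comparison (about 1,100) suggest that AS and F9 are small groups rather than the two busiest carriers.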
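One way to pick the two most frequent levels programmatically is sketched below. The data frame `toy` and its values are hypothetical stand-ins for the flights data; the key point is that sort() must be called with decreasing = TRUE, since its default ascending order would return the least frequent levels first.

```r
# Hypothetical stand-in for the flights data (values are made up)
toy <- data.frame(
  carrier   = c("UA", "UA", "UA", "B6", "B6", "AS", "UA", "B6", "F9"),
  arr_delay = c(11, 20, 12, -18, 5, -10, 3, 7, 22),
  stringsAsFactors = FALSE
)

# Frequency table sorted from most to least frequent
freq <- sort(table(toy$carrier), decreasing = TRUE)

# Names of the two most frequent levels
top2 <- names(freq)[1:2]

# Keep only the rows belonging to those two levels
two_level <- subset(toy, carrier %in% top2)

# Two-sample (Welch) t-test via the formula interface
res <- t.test(arr_delay ~ carrier, data = two_level)
print(top2)
print(res)
```

With the real data, `toy` would be replaced by `x` and the same steps repeated for origin.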

#2 Sample T-Tests

#Comparing 2 different carriers and difference in arrival delays
# H0: There is no difference in arrival delays between the two carriers
# Ha: The difference in means is not = 0
AS_carrier <- subset(x, carrier =='AS')
F9_carrier <- subset(x, carrier =='F9')
t.test(AS_carrier$arr_delay, F9_carrier$arr_delay)

## 
##  Welch Two Sample t-test
## 
## data:  AS_carrier$arr_delay and F9_carrier$arr_delay
## t = -11.66, df = 1095, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -37.21 -26.49
## sample estimates:
## mean of x mean of y 
##    -9.931    21.921

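Because the p-value is less than an alpha of 0.05, you can reject the null hypothesis that there is no difference in mean arrival delay between the two carriers. On average, AS flights arrived about 32 minutes earlier relative to schedule than F9 flights (-9.9 vs. 21.9 minutes).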

#Comparing 2 different carriers and difference in departure delays
# H0: There is no difference in departure delays between the two carriers
# Ha: The difference in means is not = 0
AS_carrier <- subset(x, carrier =='AS')
F9_carrier <- subset(x, carrier =='F9')
t.test(AS_carrier$dep_delay, F9_carrier$dep_delay)

## 
##  Welch Two Sample t-test
## 
## data:  AS_carrier$dep_delay and F9_carrier$dep_delay
## t = -5.707, df = 1034, p-value = 1.5e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -19.366  -9.456
## sample estimates:
## mean of x mean of y 
##     5.805    20.216

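Because the p-value is less than an alpha of 0.05, you can reject the null hypothesis that there is no difference in mean departure delay between the two carriers. On average, AS flights departed with about 14 minutes less delay than F9 flights (5.8 vs. 20.2 minutes).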

#Comparing 2 different origin locations and difference in arrival delays
# H0: There is no difference in arrival delays between the two origin locations
# Ha: The difference in means is not = 0
EWR_origin <- subset(x, origin =='EWR')
JFK_origin <- subset(x, origin =='JFK')
t.test(EWR_origin$arr_delay, JFK_origin$arr_delay)

## 
##  Welch Two Sample t-test
## 
## data:  EWR_origin$arr_delay and JFK_origin$arr_delay
## t = 18.83, df = 225780, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  3.185 3.926
## sample estimates:
## mean of x mean of y 
##     9.107     5.551

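Because the p-value is less than an alpha of 0.05, you can reject the null hypothesis that there is no difference in mean arrival delay between the two origin locations. On average, flights from EWR arrived about 3.6 minutes later than flights from JFK (9.1 vs. 5.6 minutes).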

#Comparing 2 different origin locations and difference in departure delays
# H0: There is no difference in departure delays between the two origin locations
# Ha: The difference in means is not = 0
EWR_origin <- subset(x, origin =='EWR')
JFK_origin <- subset(x, origin =='JFK')
t.test(EWR_origin$dep_delay, JFK_origin$dep_delay)

## 
##  Welch Two Sample t-test
## 
## data:  EWR_origin$dep_delay and JFK_origin$dep_delay
## t = 17.76, df = 226958, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  2.665 3.326
## sample estimates:
## mean of x mean of y 
##     15.11     12.11

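Because the p-value is less than an alpha of 0.05, you can reject the null hypothesis that there is no difference in mean departure delay between the two origin locations. On average, flights from EWR departed about 3 minutes later than flights from JFK (15.1 vs. 12.1 minutes).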

Diagnostics/Model Adequacy Checking

par(mfrow=c(1,1))
# qqnorm() and qqline() are part of R's built-in stats package, so no extra library is needed

qqnorm(x$arr_delay)
qqline(x$arr_delay)

Figure: normal Q-Q plot of arrival delays

# The data does not follow the Q-Q line. Therefore, the data does not appear to be normal.

par(mfrow=c(1,1))
qqnorm(x$dep_delay)
qqline(x$dep_delay)

Figure: normal Q-Q plot of departure delays

# The data does not follow the Q-Q line. Therefore, the data does not appear to be normal.

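# Shapiro-Wilk test of normality. H0: the data come from a normal distribution, so a small p-value (e.g., < 0.05) is evidence against normality.
# shapiro.test() accepts at most 5000 values, so a random sample of 4000 rows is drawn first
sample <- x[sample(1:nrow(x), 4000, replace=FALSE),]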
shapiro.test(sample$arr_delay)

## 
##  Shapiro-Wilk normality test
## 
## data:  sample$arr_delay
## W = 0.7212, p-value < 2.2e-16

shapiro.test(sample$dep_delay)

## 
##  Shapiro-Wilk normality test
## 
## data:  sample$dep_delay
## W = 0.5343, p-value < 2.2e-16

# Consistent with the Q-Q plots above, the Shapiro-Wilk test gives evidence to reject the null hypothesis that the data come from a normal population. This supports the conclusion that the data are not normal.
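The sampling step can be made reproducible and robust to missing values. The sketch below is an illustration under stated assumptions: `delay` is a simulated, right-skewed stand-in for a delay column (not the real data), and the seed value is arbitrary.

```r
set.seed(42)  # arbitrary seed so the same sample is drawn on every run

# Simulated right-skewed delays with some missing values (hypothetical data)
delay <- c(rexp(6000, rate = 1 / 15) - 5, rep(NA, 200))

# Drop missing values before sampling so the sample size is exactly 4000
ok <- delay[!is.na(delay)]
samp <- sample(ok, 4000)  # shapiro.test() accepts between 3 and 5000 values

sw <- shapiro.test(samp)
print(sw)
```

Using a name such as `samp` rather than `sample` also avoids masking R's sample() function.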

4. Contingencies

Unfortunately, the data was not normal. Therefore, a possible contingency would be to run a nonparametric test to examine the data.

An example of this would be the Mann-Whitney-Wilcoxon test, which examines whether two populations have the same distribution without assuming that they are normally distributed.

This is done with the R function wilcox.test(x, y, …), where x and y are the samples from the two populations.
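A minimal sketch of the call follows, using simulated right-skewed samples as stand-ins for two groups of delays. The vectors `g1` and `g2`, the seed, and the 8-minute shift are all assumptions for illustration.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Two simulated right-skewed samples (hypothetical data, in minutes)
g1 <- rexp(200, rate = 1 / 10)      # mean of about 10
g2 <- rexp(200, rate = 1 / 10) + 8  # same shape, shifted up by 8

# Mann-Whitney-Wilcoxon rank sum test; no normality assumption is needed
res <- wilcox.test(g1, g2)
print(res)
```

With the flights data, the two arguments would be the delay columns of the two subsets, for example AS_carrier$arr_delay and F9_carrier$arr_delay.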

5. References to the literature

N/A

6. Appendices

A summary of, or pointer to, the raw data

The data can be found on GitHub: https://github.com/hadley/nycflights13

Complete and documented R code

See code above for complete R code