Choose one of the large datasets listed on the Realtime Board (e.g., babynames or nasaweather)
Make sure you have > 1000 data points
What is the problem that you were given?
load("C:/Users/braunj6/Documents/Fall 2014/Design of Experiments/flights.rda")
x<-flights
head(x)
## year month day dep_time dep_delay arr_time arr_delay carrier tailnum
## 1 2013 1 1 517 2 830 11 UA N14228
## 2 2013 1 1 533 4 850 20 UA N24211
## 3 2013 1 1 542 2 923 33 AA N619AA
## 4 2013 1 1 544 -1 1004 -18 B6 N804JB
## 5 2013 1 1 554 -6 812 -25 DL N668DN
## 6 2013 1 1 554 -4 740 12 UA N39463
## flight origin dest air_time distance hour minute
## 1 1545 EWR IAH 227 1400 5 17
## 2 1714 LGA IAH 227 1416 5 33
## 3 1141 JFK MIA 160 1089 5 42
## 4 725 JFK BQN 183 1576 5 44
## 5 461 LGA ATL 116 762 5 54
## 6 1696 EWR ORD 150 719 5 54
In this dataset, the factors of interest were the airline carrier, the origin location, and the destination location. Airline carrier had 16 levels, origin location had 3 levels, and destination location had 105 levels.
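As a minimal sketch of how level counts like these can be checked, the snippet below counts the distinct values in each column of a small made-up data frame. The name `toy` and its values are illustrative assumptions, not the flights data; the same `sapply` call could be pointed at `x[c("carrier", "origin", "dest")]`.

```r
# Hypothetical stand-in for the flights data; values are illustrative only
toy <- data.frame(
  carrier = c("UA", "UA", "AA", "B6", "DL", "UA"),
  origin  = c("EWR", "LGA", "JFK", "JFK", "LGA", "EWR"),
  dest    = c("IAH", "IAH", "MIA", "BQN", "ATL", "ORD"),
  stringsAsFactors = FALSE
)

# Number of distinct levels in each candidate factor
n_levels <- sapply(toy, function(col) length(unique(col)))
print(n_levels)
```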
tail(x)
## year month day dep_time dep_delay arr_time arr_delay carrier
## 336771 2013 9 30 NA NA NA NA EV
## 336772 2013 9 30 NA NA NA NA 9E
## 336773 2013 9 30 NA NA NA NA 9E
## 336774 2013 9 30 NA NA NA NA MQ
## 336775 2013 9 30 NA NA NA NA MQ
## 336776 2013 9 30 NA NA NA NA MQ
## tailnum flight origin dest air_time distance hour minute
## 336771 N740EV 5274 LGA BNA NA 764 NA NA
## 336772 3393 JFK DCA NA 213 NA NA
## 336773 3525 LGA SYR NA 198 NA NA
## 336774 N535MQ 3461 LGA BNA NA 764 NA NA
## 336775 N511MQ 3572 LGA CLE NA 419 NA NA
## 336776 N839MQ 3531 LGA RDU NA 431 NA NA
summary(x)
## year month day dep_time
## Min. :2013 Min. : 1.00 Min. : 1.0 Min. : 1
## 1st Qu.:2013 1st Qu.: 4.00 1st Qu.: 8.0 1st Qu.: 907
## Median :2013 Median : 7.00 Median :16.0 Median :1401
## Mean :2013 Mean : 6.55 Mean :15.7 Mean :1349
## 3rd Qu.:2013 3rd Qu.:10.00 3rd Qu.:23.0 3rd Qu.:1744
## Max. :2013 Max. :12.00 Max. :31.0 Max. :2400
## NA's :8255
## dep_delay arr_time arr_delay carrier
## Min. : -43 Min. : 1 Min. : -86 Length:336776
## 1st Qu.: -5 1st Qu.:1104 1st Qu.: -17 Class :character
## Median : -2 Median :1535 Median : -5 Mode :character
## Mean : 13 Mean :1502 Mean : 7
## 3rd Qu.: 11 3rd Qu.:1940 3rd Qu.: 14
## Max. :1301 Max. :2400 Max. :1272
## NA's :8255 NA's :8713 NA's :9430
## tailnum flight origin dest
## Length:336776 Min. : 1 Length:336776 Length:336776
## Class :character 1st Qu.: 553 Class :character Class :character
## Mode :character Median :1496 Mode :character Mode :character
## Mean :1972
## 3rd Qu.:3465
## Max. :8500
##
## air_time distance hour minute
## Min. : 20 Min. : 17 Min. : 0 Min. : 0
## 1st Qu.: 82 1st Qu.: 502 1st Qu.: 9 1st Qu.:16
## Median :129 Median : 872 Median :14 Median :31
## Mean :151 Mean :1040 Mean :13 Mean :32
## 3rd Qu.:192 3rd Qu.:1389 3rd Qu.:17 3rd Qu.:49
## Max. :695 Max. :4983 Max. :24 Max. :59
## NA's :9430 NA's :8255 NA's :8255
str(x)
## Classes 'tbl_df', 'tbl' and 'data.frame': 336776 obs. of 16 variables:
## $ year : int 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
## $ month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ day : int 1 1 1 1 1 1 1 1 1 1 ...
## $ dep_time : int 517 533 542 544 554 554 555 557 557 558 ...
## $ dep_delay: num 2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
## $ arr_time : int 830 850 923 1004 812 740 913 709 838 753 ...
## $ arr_delay: num 11 20 33 -18 -25 12 19 -14 -8 8 ...
## $ carrier : chr "UA" "UA" "AA" "B6" ...
## $ tailnum : chr "N14228" "N24211" "N619AA" "N804JB" ...
## $ flight : int 1545 1714 1141 725 461 1696 507 5708 79 301 ...
## $ origin : chr "EWR" "LGA" "JFK" "JFK" ...
## $ dest : chr "IAH" "IAH" "MIA" "BQN" ...
## $ air_time : num 227 227 160 183 116 150 158 53 140 138 ...
## $ distance : num 1400 1416 1089 1576 762 ...
## $ hour : num 5 5 5 5 5 5 5 5 5 5 ...
## $ minute : num 17 33 42 44 54 54 55 57 57 58 ...
Departure time (dep_time), departure delay (dep_delay), arrival time (arr_time), arrival delay (arr_delay)
We are looking at the cause of airport and flight delays, so the response variables are Departure Delay and Arrival Delay.
The data is organized into 16 variables. Each row is a different flight. It has a time and a date, along with key information about its route, the route distance, the flight time, and its delays.
The dataset is made up of observations for each flight leaving from the NYC area. Therefore, it was not a randomized experiment.
The experiment will use a 2 sample t-test by subsetting the data. It will examine how flight carriers and origin city affect flight delays.
A t-test is used to determine whether the means of two groups are significantly different from each other. In this case, a two-sample t-test is used to compare the mean delay between two subsets of the data.
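To make the comparison of means concrete, here is a small sketch that computes the Welch two-sample statistic by hand and checks it against `t.test()`. The two groups are simulated (the means, standard deviations, and sample sizes are assumptions chosen for illustration), so the numbers say nothing about the flights data.

```r
set.seed(1)
# Simulated delays (minutes) for two hypothetical groups; not the flights data
a <- rnorm(200, mean = 5, sd = 30)
b <- rnorm(150, mean = 15, sd = 40)

# Per-group variance of the sample mean
va <- var(a) / length(a)
vb <- var(b) / length(b)

# Welch statistic: difference in means over its estimated standard error
t_manual <- (mean(a) - mean(b)) / sqrt(va + vb)

# Welch-Satterthwaite approximation to the degrees of freedom
df_manual <- (va + vb)^2 / (va^2 / (length(a) - 1) + vb^2 / (length(b) - 1))

# Two-sided p-value from the t distribution
p_manual <- 2 * pt(-abs(t_manual), df = df_manual)

# t.test() uses var.equal = FALSE by default, i.e. the Welch test
fit <- t.test(a, b)
print(c(manual = t_manual, t.test = unname(fit$statistic)))
```

The non-integer degrees of freedom explain why the output of `t.test()` reports values such as `df = 1095` rather than a simple `n1 + n2 - 2`.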
Because the dataset was a set of observations, there was no randomization.
There were no replicates or repeated measures.
Yes, blocking was done with Origin Location and Airline Carrier
#Frequency of flights by each carrier
carrier.freq <- table(x$carrier)
barplot(carrier.freq)
#Top 5 Carriers
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.0.3
# sort() is ascending by default, so sort in decreasing order to get the 5 most frequent carriers
with(x, barplot(sort(table(carrier), decreasing = TRUE)[1:5], main = "Top 5 Carriers"))
#Frequency of flights by month
barplot(table(as.factor(x$month)), xlab = "Month", ylab = "Frequency", main = "Frequency of Flights by Month")
#Frequency of Origin city
origin.freq <- table(x$origin)
barplot(origin.freq)
#Frequency of Top 5 Destination cities
# main belongs inside barplot(); sort in decreasing order to get the 5 most frequent destinations
with(x, barplot(sort(table(dest), decreasing = TRUE)[1:5], main = "Top 5 Destination Cities"))
# Histogram of Departure Delays
hist(x$dep_delay)
# Histogram of Arrival Delays
hist(x$arr_delay)
#Plotting distance traveled by the arrival delay to try to distinguish relationship
plot(x$distance,x$arr_delay)
#Arrival Delay and Departure Delay by Airline Carrier
par(mfrow=c(2,1))
boxplot(x$arr_delay ~ x$carrier, outline=FALSE, main = "Arrival Delays by Carrier")
boxplot(x$dep_delay ~ x$carrier, outline=FALSE, main = "Departure Delays by Carrier")
#Arrival Delay and Departure Delay by Origin Location
par(mfrow=c(2,1))
boxplot(x$arr_delay ~ x$origin, outline=FALSE, main = "Arrival Delays by Origin")
boxplot(x$dep_delay ~ x$origin, outline=FALSE, main = "Departure Delays by Origin")
par(mfrow=c(2,1))
#Arrival Delay and Departure Delay by Month
boxplot(x$dep_delay ~ x$month, outline=FALSE, main = "Departure Delays by Month", names = c("Jan","Feb", "Mar", "Apr", "May","June", "Jul","Aug","Sept","Oct","Nov","Dec"))
boxplot(x$arr_delay ~ x$month, outline=FALSE, main = "Arrival Delays by Month", names = c("Jan","Feb", "Mar", "Apr", "May","June", "Jul","Aug","Sept","Oct","Nov","Dec"))
The focus of this recipe was on single-factor t-tests. Therefore, each factor being analyzed had to have exactly 2 levels. Because most of the factors in the sample data had more than 2 levels, the factors of interest had to be subsetted to only 2 levels. The intent was to select the 2 highest-frequency levels of each factor; the tests below compare carriers AS and F9 and origins EWR and JFK.
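One way to pick the two most frequent levels of a factor and subset to them is sketched below. The data frame `toy`, its counts, and the placeholder `delay` column are made up for illustration and are not taken from the flights data.

```r
# Hypothetical carrier column; the counts are made up for illustration
carrier <- rep(c("UA", "B6", "EV", "AS"), times = c(50, 40, 30, 5))
delay   <- seq_along(carrier) %% 7   # placeholder response values
toy <- data.frame(carrier, delay, stringsAsFactors = FALSE)

# sort() is ascending by default, so ask for decreasing order explicitly
freq <- sort(table(toy$carrier), decreasing = TRUE)
top2 <- names(freq)[1:2]

# Keep only rows belonging to the two most frequent levels
two_level <- subset(toy, carrier %in% top2)
print(table(two_level$carrier))
```

Note that `sort(table(...))[1:2]` without `decreasing = TRUE` would return the two least frequent levels instead.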
#2 Sample T-Tests
#Comparing 2 different carriers and difference in arrival delays
# H0: There is no difference in arrival delays between the two carriers
# Ha: The difference in means is not = 0
AS_carrier <- subset(x, carrier =='AS')
F9_carrier <- subset(x, carrier =='F9')
t.test(AS_carrier$arr_delay, F9_carrier$arr_delay)
##
## Welch Two Sample t-test
##
## data: AS_carrier$arr_delay and F9_carrier$arr_delay
## t = -11.66, df = 1095, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -37.21 -26.49
## sample estimates:
## mean of x mean of y
## -9.931 21.921
Because the p-value (< 2.2e-16) is below an alpha of 0.05, we reject the null hypothesis of no difference in mean arrival delay between the two carriers. The sample means are -9.9 minutes for AS and 21.9 minutes for F9.
#Comparing 2 different carriers and difference in departure delays
# H0: There is no difference in departure delays between the two carriers
# Ha: The difference in means is not = 0
AS_carrier <- subset(x, carrier =='AS')
F9_carrier <- subset(x, carrier =='F9')
t.test(AS_carrier$dep_delay, F9_carrier$dep_delay)
##
## Welch Two Sample t-test
##
## data: AS_carrier$dep_delay and F9_carrier$dep_delay
## t = -5.707, df = 1034, p-value = 1.5e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -19.366 -9.456
## sample estimates:
## mean of x mean of y
## 5.805 20.216
Because the p-value (1.5e-08) is below an alpha of 0.05, we reject the null hypothesis of no difference in mean departure delay between the two carriers. The sample means are 5.8 minutes for AS and 20.2 minutes for F9.
#Comparing 2 different origin locations and difference in arrival delays
# H0: There is no difference in arrival delays between the two origin locations
# Ha: The difference in means is not = 0
EWR_origin <- subset(x, origin =='EWR')
JFK_origin <- subset(x, origin =='JFK')
t.test(EWR_origin$arr_delay, JFK_origin$arr_delay)
##
## Welch Two Sample t-test
##
## data: EWR_origin$arr_delay and JFK_origin$arr_delay
## t = 18.83, df = 225780, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 3.185 3.926
## sample estimates:
## mean of x mean of y
## 9.107 5.551
Because the p-value (< 2.2e-16) is below an alpha of 0.05, we reject the null hypothesis of no difference in mean arrival delay between the two origins. The sample means are 9.1 minutes for EWR and 5.6 minutes for JFK.
#Comparing 2 different origin locations and difference in departure delays
# H0: There is no difference in departure delays between the two origin locations
# Ha: The difference in means is not = 0
EWR_origin <- subset(x, origin =='EWR')
JFK_origin <- subset(x, origin =='JFK')
t.test(EWR_origin$dep_delay, JFK_origin$dep_delay)
##
## Welch Two Sample t-test
##
## data: EWR_origin$dep_delay and JFK_origin$dep_delay
## t = 17.76, df = 226958, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 2.665 3.326
## sample estimates:
## mean of x mean of y
## 15.11 12.11
Because the p-value (< 2.2e-16) is below an alpha of 0.05, we reject the null hypothesis of no difference in mean departure delay between the two origins. The sample means are 15.1 minutes for EWR and 12.1 minutes for JFK.
par(mfrow=c(1,1))
# qqnorm() and qqline() are part of the base stats package, so no extra library is needed
qqnorm(x$arr_delay)
qqline(x$arr_delay)
# The data does not follow the Q-Q line. Therefore, the data does not appear to be normal.
par(mfrow=c(1,1))
qqnorm(x$dep_delay)
qqline(x$dep_delay)
# The data does not follow the Q-Q line. Therefore, the data does not appear to be normal.
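# Shapiro-Wilk test of normality. The null hypothesis is that the data are normal, so a small p-value (e.g., p < 0.1) is evidence against normality
# shapiro.test() accepts at most 5000 observations, so a random sample of 4000 rows is tested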
sample <- x[sample(1:nrow(x), 4000, replace=FALSE),]
shapiro.test(sample$arr_delay)
##
## Shapiro-Wilk normality test
##
## data: sample$arr_delay
## W = 0.7212, p-value < 2.2e-16
shapiro.test(sample$dep_delay)
##
## Shapiro-Wilk normality test
##
## data: sample$dep_delay
## W = 0.5343, p-value < 2.2e-16
# Consistent with the Q-Q plots above, the Shapiro-Wilk test gives evidence to reject the null hypothesis that the data come from a normal population. Instead, it supports the alternative that the data are not normal.
Unfortunately, the data was not normal. Therefore, a possible contingency would be to run a nonparametric test to examine the data.
An example of this would be the Mann-Whitney-Wilcoxon test, which examines whether two populations have the same distribution without assuming that they are normally distributed.
This is done with the R function wilcox.test(x, y, …), where x and y are the samples from the two populations.
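A short sketch of such a call is shown below on simulated, right-skewed data. The two groups, their distribution, and the shift of 5 are assumptions chosen only to illustrate the syntax; they are not results from the flights data.

```r
set.seed(42)
# Simulated right-skewed "delays" for two hypothetical groups (not the flights data)
g1 <- rexp(300, rate = 1 / 10)        # exponential with mean 10
g2 <- rexp(300, rate = 1 / 10) + 5    # same shape, shifted up by 5

# Rank-based two-sample test; conf.int = TRUE also estimates the location shift
res <- wilcox.test(g1, g2, conf.int = TRUE)
print(res$p.value)
print(res$estimate)
```

With the flights data, the analogous call would pass two delay vectors, for example `wilcox.test(EWR_origin$arr_delay, JFK_origin$arr_delay)`.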
N/A
The data can be found at GitHub https://github.com/hadley/nycflights13
See code above for complete R code