In the Flight Delays Case Study in Section 1.1,
The data contain flight delays for two airlines, American Airlines and United Airlines. Conduct a two-sided permutation test to see if the difference in mean delay times between the two carriers is statistically significant.
The flight delays occurred in May and June of 2009. Conduct a two-sided permutation test to see if the difference in mean delay times between the two months is statistically significant.
library(dplyr)  # provides %>%, group_by(), summarise(), and filter()
FD <- read.csv("http://www1.appstate.edu/~arnholta/Data/FlightDelays.csv")
FD%>%
group_by(Carrier) %>%
summarise(n = n())
# A tibble: 2 x 2
Carrier n
<fctr> <int>
1 AA 2906
2 UA 1123
FD%>%
group_by(Carrier) %>%
summarise(Mean = mean(Delay)) %>%
summarise(obs_diff = Mean [2] - Mean[1])
# A tibble: 1 x 1
obs_diff
<dbl>
1 5.885696
obs_stat <- 5.885696
sim <- 10^4-1
md <- numeric(sim)
for(i in 1:sim)
{
index <-sample(4029,1123, replace = FALSE)
md[i] <- mean(FD$Delay[index]) - mean(FD$Delay[-index])
}
pvalue <- 2*((sum(md >= obs_stat)+1) / (sim + 1))
pvalue
[1] 2e-04
We reject the null hypothesis of equal mean delays: the data give strong evidence that the mean delay differs between the two carriers.
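The loop above can be wrapped in a reusable function. What follows is a minimal sketch rather than the case study's own code: `perm_mean_diff` and the simulated data are hypothetical names and values introduced for illustration. The two-sided p-value doubles the smaller tail, so it behaves sensibly whether the observed difference is positive or negative.

```r
# Hypothetical helper: two-sided permutation test for a difference in means.
perm_mean_diff <- function(x, g, B = 9999) {
  g <- as.factor(g)
  stopifnot(nlevels(g) == 2, length(x) == length(g))
  n2 <- sum(g == levels(g)[2])
  # Observed statistic: mean of the second level minus mean of the first
  obs <- mean(x[g == levels(g)[2]]) - mean(x[g == levels(g)[1]])
  perm <- numeric(B)
  for (i in seq_len(B)) {
    idx <- sample(length(x), n2, replace = FALSE)  # relabel n2 values as group 2
    perm[i] <- mean(x[idx]) - mean(x[-idx])
  }
  p_upper <- (sum(perm >= obs) + 1) / (B + 1)
  p_lower <- (sum(perm <= obs) + 1) / (B + 1)
  list(obs = obs, p_value = min(1, 2 * min(p_upper, p_lower)))
}

# Simulated data (assumption): group "B" is shifted upward by one unit
set.seed(1)
x <- c(rnorm(60, mean = 0), rnorm(40, mean = 1))
g <- rep(c("A", "B"), times = c(60, 40))
res <- perm_mean_diff(x, g, B = 999)
res$obs
res$p_value
```

With the flight data, a call such as `perm_mean_diff(FD$Delay, FD$Carrier)` would play the same role as the loop, assuming `Carrier` has exactly two levels.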
FD%>%
filter(Month %in% c("June", "May")) %>%
group_by(Month) %>%
summarise(n = n())
# A tibble: 2 x 2
Month n
<fctr> <int>
1 June 2030
2 May 1999
FD%>%
filter(Month %in% c("June", "May")) %>%
group_by(Month) %>%
summarise(Mean = mean(Delay)) %>%
summarise(obs_diff = Mean [2] - Mean[1])
# A tibble: 1 x 1
obs_diff
<dbl>
1 -5.663341
obs_stat_2 <- -5.663341
test <- FD%>%
filter(Month %in% c("June", "May"))
md_2 <- numeric(sim)
for(i in 1:sim)
{
index_1 <-sample(4029,1999, replace = FALSE)
md_2[i] <- mean(test$Delay[index_1]) - mean(test$Delay[-index_1])
}
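# obs_stat_2 is negative, so the extreme tail is the lower one; double the smaller tail
p_lower <- (sum(md_2 <= obs_stat_2)+1) / (sim + 1)
p_upper <- (sum(md_2 >= obs_stat_2)+1) / (sim + 1)
pvalue <- 2*min(p_lower, p_upper)
pvalue
[1] 2e-04
We reject the null hypothesis: no permuted difference was as low as the observed -5.66 minutes, so there is strong evidence that the mean delay differs between May and June.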
In the Flight Delays Case Study in Section 1.1, the data contain flight delays for two airlines, American Airlines and United Airlines.
Compute the proportion of times that each carrier’s flights were delayed more than 20 minutes. Conduct a two-sided test to see if the difference in these proportions is statistically significant.
Compute the variance in the flight delay lengths for each carrier. Conduct a test to see if the variance for United Airlines is greater than that of American Airlines.
# a. Your code here
FD%>%
filter(Delay > 20) %>%
group_by(Carrier) %>%
summarise(n = n())
# A tibble: 2 x 2
Carrier n
<fctr> <int>
1 AA 492
2 UA 239
# Proportion delayed more than 20 minutes: 492/2906 for AA and 239/1123 for UA
obs_stat_2 <- 239/1123 - 492/2906  # UA minus AA, about 0.0435
md_3 <- numeric(sim)
for(i in 1:sim)
{
index_2 <-sample(4029,1123, replace = FALSE)  # relabel 1123 of the 4029 flights as UA
md_3[i] <- mean(FD$Delay[index_2] >20) - mean(FD$Delay[-index_2]>20)
}
pvalue <- 2*((sum(md_3 >= obs_stat_2)+1) / (sim + 1)) #two-sided test
pvalue
The p-value varies slightly between runs because the permutations are random; a value below 0.05 means we reject the null hypothesis that the two carriers have the same proportion of flights delayed more than 20 minutes.
# b. Your code here
FD%>%
group_by(Carrier) %>%
summarise(Var = var(Delay)) %>%
summarise(obs_diff = Var[2] - Var[1])
# A tibble: 1 x 1
obs_diff
<dbl>
1 431.0677
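Part b still needs a null distribution for the difference in variances. The following is a minimal sketch of a one-sided permutation test, under the alternative that the second group has the larger variance. `perm_var_greater` and the simulated samples are hypothetical and introduced only for illustration; they are not taken from the flight data.

```r
# Hypothetical helper: one-sided permutation test, H_A: var(x2) > var(x1).
perm_var_greater <- function(x1, x2, B = 9999) {
  pooled <- c(x1, x2)
  n2 <- length(x2)
  obs <- var(x2) - var(x1)
  perm <- replicate(B, {
    idx <- sample(length(pooled), n2)      # relabel n2 values as group 2
    var(pooled[idx]) - var(pooled[-idx])
  })
  # One-sided: count permuted differences at least as large as the observed one
  list(obs = obs, p_value = (sum(perm >= obs) + 1) / (B + 1))
}

# Simulated data (assumption): the second sample is more variable
set.seed(2)
aa <- rnorm(200, sd = 1)
ua <- rnorm(100, sd = 3)
out <- perm_var_greater(aa, ua, B = 999)
out$obs
out$p_value
```

For the exercise, the two arguments would be the AA delays and the UA delays, for example `FD$Delay[FD$Carrier == "AA"]` and `FD$Delay[FD$Carrier == "UA"]`.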
for loop.
# Your code here
FD%>%
group_by(Carrier) %>%
summarise(n = n())
# A tibble: 2 x 2
Carrier n
<fctr> <int>
1 AA 2906
2 UA 1123
# Observed statistics for United Airlines, the second carrier in the ordering AA, UA
obs_stat_mean <- mean(FD$Delay[FD$Carrier == "UA"])  # mean UA delay, about 15.98
obs_stat_sum <- sum(FD$Delay[FD$Carrier == "UA"])  # total UA delay, 17949
FD%>%
group_by(Carrier) %>%
summarise(Mean = mean(Delay)) %>%
summarise(obs_diff = Mean [2] - Mean[1])
# A tibble: 1 x 1
obs_diff
<dbl>
1 5.885696
obs_stat_diff <- 5.885696
sim <- 10^4-1
md1 <- numeric(sim)
md2 <- numeric(sim)
md3 <- numeric(sim)
for(i in 1:sim)
{
index <-sample(4029,1123, replace = FALSE)
md1[i] <- mean(FD$Delay[index])  # mean of the relabeled UA group
md2[i] <- sum(FD$Delay[index])  # sum of the relabeled UA group
md3[i] <- mean(FD$Delay[index]) - mean(FD$Delay[-index])
}
pvalue_mean <- 2*((sum(md1 >= obs_stat_mean)+1) / (sim + 1))
pvalue_mean
[1] 4e-04
pvalue_sum <- 2*((sum(md2 >= obs_stat_sum)+1) / (sim + 1))
pvalue_sum
[1] 4e-04
pvalue_diff <- 2*((sum(md3 >= obs_stat_diff)+1) / (sim + 1))
pvalue_diff
[1] 4e-04
All three tests give the same p-value. The group sizes and the total delay are fixed under permutation, so the UA sum and the difference in means are increasing functions of the UA mean, and the three statistics rank the permutations identically. We reject the null hypothesis of equal mean delays.
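When the same permutations are reused, the mean of one group, the sum of that group, and the difference in group means give the same p-value. The group sizes and the overall total do not change under relabeling, so each statistic is an increasing function of the others. A small sketch on simulated delays (an assumption, not the flight data) checks this:

```r
set.seed(3)
# Simulated delays (assumption): the last 40 values play the role of UA
delay <- c(rexp(80, rate = 1 / 10), rexp(40, rate = 1 / 15))
n_ua <- 40
ua <- (length(delay) - n_ua + 1):length(delay)

obs_mean <- mean(delay[ua])
obs_sum  <- sum(delay[ua])
obs_diff <- mean(delay[ua]) - mean(delay[-ua])

B <- 999
perm_mean <- perm_sum <- perm_diff <- numeric(B)
for (i in seq_len(B)) {
  idx <- sample(length(delay), n_ua)   # one relabeling shared by all three statistics
  perm_mean[i] <- mean(delay[idx])
  perm_sum[i]  <- sum(delay[idx])
  perm_diff[i] <- mean(delay[idx]) - mean(delay[-idx])
}

# One-sided p-values for each statistic
p_mean <- (sum(perm_mean >= obs_mean) + 1) / (B + 1)
p_sum  <- (sum(perm_sum  >= obs_sum)  + 1) / (B + 1)
p_diff <- (sum(perm_diff >= obs_diff) + 1) / (B + 1)
c(mean = p_mean, sum = p_sum, diff = p_diff)
```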
In the Flight Delays Case Study in Section 1.1,
Find the 25% trimmed mean of the delay times for United Airlines and American Airlines.
Conduct a two-sided test to see if the difference in trimmed means is statistically significant.
# Your code here
FD%>%
group_by(Carrier) %>%
summarise(Mean = mean(Delay, trim=0.25)) %>%
summarise(obs_diff = Mean [2] - Mean[1])
# A tibble: 1 x 1
obs_diff
<dbl>
1 1.774414
obs_stat_diff <- 1.774414
md_4 <- numeric(sim)
for(i in 1:sim)
{
index_2 <-sample(4029,1123, replace = FALSE)
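md_4[i] <- mean(FD$Delay[index_2], trim=0.25) - mean(FD$Delay[-index_2], trim=0.25)
}
pvalue <- 2*((sum(md_4 >= obs_stat_diff)+1) / (sim + 1))
pvalue
The p-value varies slightly between runs because the permutations are random; a value below 0.05 means we reject the null hypothesis that the two carriers have the same 25% trimmed mean delay.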
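A 25% trimmed mean drops the smallest 25% and the largest 25% of the values and averages what remains. The short sketch below uses made-up numbers (not flight delays) to show that `mean(x, trim = 0.25)` matches that manual calculation and is far less sensitive to one extreme value than the ordinary mean.

```r
x <- c(2, 4, 5, 7, 8, 9, 11, 250)    # made-up values with one extreme observation
n <- length(x)
k <- floor(n * 0.25)                 # number of values dropped from each end
manual  <- mean(sort(x)[(k + 1):(n - k)])
builtin <- mean(x, trim = 0.25)
c(manual = manual, builtin = builtin, plain = mean(x))
```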
In the Flight Delays Case Study in Section 1.1,
Compute the proportion of times the flights in May and in June were delayed more than 20 min, and conduct a two-sided test of whether the difference between months is statistically significant.
Compute the variance of the flight delay times in May and June and then conduct a two-sided test of whether the ratio of variances is statistically significantly different from 1.
# a. Your code here
FD%>%
filter(Delay > 20) %>%
filter(Month %in% c("June", "May")) %>%
group_by(Month) %>%
summarise(n = n())
# A tibble: 2 x 2
Month n
<fctr> <int>
1 June 398
2 May 333
# Proportion delayed more than 20 minutes: 333/1999 in May and 398/2030 in June
obs_stat_2 <- 333/1999 - 398/2030  # May minus June, about -0.0295
md_3 <- numeric(sim)
MM <- subset(FD, select = Delay, subset = Month == "May", drop = TRUE)
MJ <- subset(FD, select = Delay, subset = Month == "June", drop = TRUE)
test <- c(MM, MJ)
for(i in 1:sim)
{
index_2 <-sample(4029,1999, replace = FALSE)  # relabel 1999 of the 4029 flights as May
md_3[i] <- mean(test[index_2] >20) - mean(test[-index_2]>20) #null distribution
}
# obs_stat_2 is negative, so double the smaller of the two tail probabilities
pvalue <- 2*min((sum(md_3 <= obs_stat_2)+1) / (sim + 1), (sum(md_3 >= obs_stat_2)+1) / (sim + 1)) #two-sided test
pvalue
The p-value varies slightly between runs because the permutations are random; a value below 0.05 means we reject the null hypothesis that May and June have the same proportion of flights delayed more than 20 minutes.
# b. Your code here
FD%>%
filter(Month %in% c("June", "May")) %>%
group_by(Month) %>%
summarise(n = n())
# A tibble: 2 x 2
Month n
<fctr> <int>
1 June 2030
2 May 1999
FD%>%
filter(Month %in% c("June", "May")) %>%
group_by(Month) %>%
summarise(Var = var(Delay)) %>%
summarise(obs_diff = Var[2] / Var[1])
# A tibble: 1 x 1
obs_diff
<dbl>
1 0.6646681
obs_stat_2 <- 0.6646681
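test <- FD%>%
filter(Month %in% c("June", "May"))
md_2 <- numeric(sim)
for(i in 1:sim)
{
index_1 <-sample(4029,1999, replace = FALSE)
md_2[i] <- var(test$Delay[index_1]) / var(test$Delay[-index_1])
}
# The observed ratio is below 1, so double the smaller of the two tail probabilities
p_lower <- (sum(md_2 <= obs_stat_2)+1) / (sim + 1)
p_upper <- (sum(md_2 >= obs_stat_2)+1) / (sim + 1)
pvalue <- 2*min(p_lower, p_upper)
pvalue
[1] 0.0356
Only 177 of the 9999 permuted ratios fell below the observed 0.665, so at the 5% level we reject the null hypothesis that the ratio of variances is 1.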
Researchers at the University of Nebraska conducted a study to investigate sex differences in dieting trends among a group of Midwestern college students (Davy et al. (2006)). Students were recruited from an introductory nutrition course during one term. Below are data from one question asked of 286 participants.
Write down the appropriate hypotheses to test whether there is a relationship between gender and diet, and then carry out the test.
Can the results be generalized to a population? Explain.
        LowFatDiet
Gender  Yes   No
Women    35  146
Men       8   97
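# Your code here
# Enter the observed counts as a 2 x 2 matrix (filled column by column)
DT <- matrix(c(35, 8, 146, 97), nrow = 2,
dimnames = list(Gender = c("Women", "Men"), LowFatDiet = c("Yes", "No")))
ODT <- as.table(DT)
ODTDF <- as.data.frame(ODT)
DDF <- as_tibble(vcdExtra::expand.dft(ODTDF))  # one row per participant
T1 <- xtabs(~Gender + LowFatDiet, data = DDF)
chisq.test(T1, correct = FALSE)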
Pearson's Chi-squared test
data: T1
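X-squared = 7.1427, df = 1, p-value = 0.007527
Ho: Gender and following a low-fat diet are independent. Ha: Gender and following a low-fat diet are not independent.
With a p-value of 0.0075 we reject the null hypothesis; there is evidence of a relationship between gender and low-fat dieting.
The participants were recruited from a single introductory nutrition course rather than drawn at random, so the results should not be generalized beyond students like those in the study.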
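The statistic reported by `chisq.test` can be reproduced directly from the expected counts, which makes the test less of a black box. The sketch below re-enters the counts from the table above; the object names (`obs`, `expected`, `x2`) are illustrative choices, not part of the original solution.

```r
# Observed counts from the LowFatDiet table (filled column by column)
obs <- matrix(c(35, 8, 146, 97), nrow = 2,
              dimnames = list(Gender = c("Women", "Men"),
                              LowFatDiet = c("Yes", "No")))

# Expected count under independence: row total * column total / grand total
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Pearson chi-square statistic, degrees of freedom, and upper-tail p-value
x2 <- sum((obs - expected)^2 / expected)
df <- (nrow(obs) - 1) * (ncol(obs) - 1)
p_value <- pchisq(x2, df = df, lower.tail = FALSE)
round(c(X2 = x2, df = df, p = p_value), 4)
```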
A national polling company conducted a survey in 2001 asking a randomly selected group of Americans of 18 years of age or older whether they supported limited use of marijuana for medicinal purposes. Here is a summary of the data:
Write down the appropriate hypothesis to test whether there is a relationship between age and support for medicinal marijuana and carry out the test.
                   Support
Age                Against  For
18-29 years old         52  172
30-49 years old        103  313
50 years or older      119  258
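# Your code here
T1 <- matrix(c(52, 103, 119, 172, 313, 258), nrow = 3,
dimnames = list(Age = c("18-29", "30-49", "50+"), Support = c("Against", "For")))  # observed counts
chisq.test(T1, correct = FALSE)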
Pearson's Chi-squared test
data: T1
X-squared = 6.6814, df = 2, p-value = 0.03541
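Ho: Age and support for medicinal marijuana are independent.
Ha: Age and support for medicinal marijuana are not independent.
Since the p-value of 0.035 is below 0.05, we reject the null hypothesis at the 5% level: there is evidence of a relationship between age and support.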
Two students went to a local supermarket and collected data on cereals; they classified the cereals by their target consumer (children versus adults) and the placement of the cereal on the shelf (bottom, middle, and top). The data are given in Cereals.
Create a table to summarize the relationship between age of target consumer and shelf location.
Conduct a chi-square test using R’s chisq.test command.
R returns a warning message. Compute the expected counts for each cell to see why.
Conduct a permutation test for independence.
Cereals <- read.csv("http://www1.appstate.edu/~arnholta/Data/Cereals.csv")
# Your code here
T2 <- xtabs(~Age + Shelf, data = Cereals)
T2
Shelf
Age bottom middle top
adult 2 1 14
children 7 18 1
chisq.test(T2, correct = FALSE)
Pearson's Chi-squared test
data: T2
X-squared = 28.625, df = 2, p-value = 6.083e-07
Cereals %>%
group_by(Age) %>%
summarise(n = n())
# A tibble: 2 x 2
Age n
<fctr> <int>
1 adult 17
2 children 26
obs_stat_diff <- chisq.test(T2)$statistic
result <- numeric(sim)
for (i in 1:sim) {
T3 <- xtabs(~sample(Age) + Shelf, data = Cereals)
result[i] <- chisq.test(T3)$statistic
}
pvalue <- (sum(result >= obs_stat_diff) + 1)/(sim + 1)
pvalue
[1] 1e-04
We reject the null hypothesis of independence: shelf placement is associated with the age of the target consumer.
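The exercise also asks for the expected counts that explain the warning from `chisq.test`. A minimal sketch, re-entering the counts from the table T2 printed above (the name `cereal_tab` is an illustrative choice):

```r
# Observed counts from the Age by Shelf table (filled column by column)
cereal_tab <- matrix(c(2, 7, 1, 18, 14, 1), nrow = 2,
                     dimnames = list(Age = c("adult", "children"),
                                     Shelf = c("bottom", "middle", "top")))

# Expected counts under independence
expected <- outer(rowSums(cereal_tab), colSums(cereal_tab)) / sum(cereal_tab)
round(expected, 2)

# Number of cells with an expected count below 5
sum(expected < 5)
```

The adult/bottom cell has an expected count of 17 * 9 / 43, which is below 5, so the chi-square approximation is questionable and the permutation test is the safer choice.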
From GSS 2002 Case Study in Section 1.6,
Create a table to summarize the relationship between gender and the person’s choice for president in the 2000 election.
Test to see if a person’s choice for president in the 2000 election is independent of gender (use chisq.test in R).
Repeat the test but use the permutation test for independence. Does your conclusion change? (Be sure to remove observations with missing values)
GSS2002 <- read.csv("http://www1.appstate.edu/~arnholta/Data/GSS2002.csv")
# Your code here
T4 <- xtabs(~Gender + Pres00, data = GSS2002)
T4
Pres00
Gender Bush Didnt vote Gore Nader Other
Female 459 5 492 26 3
Male 426 5 289 31 13
chisq.test(T4, correct = FALSE)
Pearson's Chi-squared test
data: T4
X-squared = 33.29, df = 4, p-value = 1.042e-06
We reject the null hypothesis of independence: choice for president in the 2000 election is associated with gender.
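The permutation version of this test can be sketched as follows. This is a hedged illustration, not the original solution: `chisq_stat`, `perm_independence`, and the simulated `gender` and `vote` vectors are hypothetical. Observations with a missing value in either variable are dropped before any shuffling, as the exercise requires.

```r
# Pearson chi-square statistic computed from a two-way table
chisq_stat <- function(tab) {
  expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
  sum((tab - expected)^2 / expected)
}

# Hypothetical helper: permutation test of independence for two categorical vectors
perm_independence <- function(x, y, B = 9999) {
  keep <- !is.na(x) & !is.na(y)   # remove observations with missing values first
  x <- factor(x[keep])            # re-factoring also drops unused levels
  y <- factor(y[keep])
  obs <- chisq_stat(table(x, y))
  perm <- replicate(B, chisq_stat(table(sample(x), y)))
  list(obs = obs, p_value = (sum(perm >= obs) + 1) / (B + 1))
}

# Simulated data (assumption): vote depends on gender, and 20 votes are missing
set.seed(4)
gender <- sample(c("Female", "Male"), 300, replace = TRUE)
vote <- ifelse(gender == "Female",
               sample(c("Bush", "Gore"), 300, replace = TRUE, prob = c(0.4, 0.6)),
               sample(c("Bush", "Gore"), 300, replace = TRUE, prob = c(0.6, 0.4)))
vote[sample(300, 20)] <- NA
res <- perm_independence(gender, vote, B = 999)
res$obs
res$p_value
```

For the exercise, the call would be along the lines of `perm_independence(GSS2002$Gender, GSS2002$Pres00)`.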
From GSS 2002 Case Study in Section 1.6,
Create a table to summarize the relationship between gender and the person’s general level of happiness (Happy).
Conduct a permutation test to see if gender and level of happiness are independent (Be sure to remove the observations with missing values).
# Your code here
T4 <- xtabs(~Gender + Happy, data = GSS2002)
T4
Happy
Gender Not too happy Pretty happy Very happy
Female 109 406 205
Male 61 378 210
obs_stat_diff2 <- chisq.test(T4)$statistic
result2 <- numeric(sim)
for (i in 1:sim) {
T6 <- xtabs(~sample(Happy) + Gender, data = GSS2002)
result2[i] <- chisq.test(T6)$statistic
}
pvalue <- (sum(result2 >= obs_stat_diff2) + 1)/(sim + 1)
pvalue
[1] 0.0039
We reject the null hypothesis of independence: level of happiness is associated with gender.
From GSS 2002 Case Study in Section 1.6,
Create a table to summarize the relationship between support for gun laws (GunLaw) and views on government spending on the military (SpendMilitary).
Conduct a permutation test to see if support for gun laws and views on government spending on the military are independent (Be sure to remove observations with missing values).
T4 <- xtabs(~GunLaw + SpendMilitary, data = GSS2002)
T4
SpendMilitary
GunLaw About right Too little Too much
Favor 168 101 72
Oppose 34 33 19
obs_stat_diff3 <- chisq.test(T4)$statistic
result3 <- numeric(sim)
for (i in 1:sim) {
T7 <- xtabs(~sample(GunLaw) + SpendMilitary, data = GSS2002)
result3[i] <- chisq.test(T7)$statistic
}
pvalue <- (sum(result3 >= obs_stat_diff3) + 1)/(sim + 1)
pvalue
[1] 0.2129
We fail to reject the null hypothesis: the data do not provide evidence of a relationship between support for gun laws and views on military spending.
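The table above contains only 427 observations, so many respondents have a missing value in at least one of the two variables. Shuffling one column of the full data frame moves its missing values onto different rows, so the permuted tables need not contain the same number of observations as the observed table. The sketch below uses a tiny made-up data frame `d` (hypothetical, not GSS2002) to show the effect and the remedy of keeping only complete cases before permuting.

```r
# Made-up data with one missing value in each column
d <- data.frame(a = c("x", "y", NA, "x", "y", "x"),
                b = c("u", NA, "v", "u", "v", "v"))

set.seed(5)
# Shuffling the full column: the table total changes from one permutation to the next
totals_raw <- replicate(200, sum(table(sample(d$a), d$b)))
table(totals_raw)

# Keeping complete cases first: every permuted table has the same total
cc <- d[complete.cases(d), ]
totals_cc <- replicate(200, sum(table(sample(cc$a), cc$b)))
table(totals_cc)
```

Applied to the exercise, this suggests building a complete-case data frame first and then using it in both `xtabs` calls.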