In the Flight Delays Case Study in Section 1.1, the data contain flight delays for two airlines, American Airlines and United Airlines.
Conduct a two-sided permutation test to see if the difference in mean delay times between the two carriers is statistically significant.
The flight delays occurred in May and June of 2009. Conduct a two-sided permutation test to see if the difference in mean delay times between the two months is statistically significant.
library(dplyr) # provides %>%, filter(), group_by(), and summarize() used throughout
FD <- read.csv("http://www1.appstate.edu/~arnholta/Data/FlightDelays.csv")
FD %>%
group_by(Carrier) %>%
summarize(n = n())
# A tibble: 2 x 2
Carrier n
<fctr> <int>
1 AA 2906
2 UA 1123
output <- FD %>%
group_by(Carrier) %>%
summarize(m = mean(Delay)) %>%
summarize(obs = m[2] - m[1])
sims <- 10^4-1
md <- numeric(sims)
for(i in 1:sims) {
index <- sample(4029, 1123, replace = FALSE)
md[i] <- mean(FD$Delay[index]) - mean(FD$Delay[-index])
}
pvalue <- ((sum(md >= output$obs) + 1) / (sims + 1)) * 2
pvalue
[1] 2e-04
Since the p-value is below 0.05, the difference in mean delay times between the two carriers is statistically significant. On average, United Airlines delays are longer than American Airlines delays.
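As a quick check on the direction of the difference, the two carrier means can be printed with base R's tapply (a sketch, assuming FD has been read in as above):
tapply(FD$Delay, FD$Carrier, mean) # mean delay in minutes for AA and UA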
FD %>%
filter(Month == "June" | Month == "May") %>%
group_by(Month) %>%
summarize(n = n())
# A tibble: 2 x 2
Month n
<fctr> <int>
1 June 2030
2 May 1999
output <- FD %>%
filter(Month == "June" | Month == "May") %>%
group_by(Month) %>%
summarize(Mean = mean(Delay)) %>%
summarize(obs_diff = Mean[2] - Mean[1])
months <- FD %>%
filter(Month == "June" | Month == "May")
md <- numeric(sims)
for(i in 1:sims) {
index <- sample(4029, 1999, replace = FALSE)
md[i] <- mean(months$Delay[index]) - mean(months$Delay[-index])
}
pvalue <- ((sum(md <= output$obs_diff) + 1) / (sims + 1)) * 2
pvalue
[1] 2e-04
Since our p-value is below 0.05, the difference in mean delay times between the two months is statistically significant. This suggests that, on average, delays in June are longer than delays in May.
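The same kind of check works for the months (a sketch, assuming FD is loaded):
tapply(FD$Delay, FD$Month, mean) # mean delay in minutes for June and May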
In the Flight Delays Case Study in Section 1.1, the data contain flight delays for two airlines, American Airlines and United Airlines.
Compute the proportion of times that each carrier's flights were delayed more than 20 minutes. Conduct a two-sided test to see if the difference in these proportions is statistically significant.
Compute the variance in the flight delay lengths for each carrier. Conduct a test to see if the variance for United Airlines is greater than that of American Airlines.
FD %>%
filter(Delay > 20) %>%
group_by(Carrier) %>%
summarize(n = n())
# A tibble: 2 x 2
Carrier n
<fctr> <int>
1 AA 492
2 UA 239
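The table above gives counts of delayed flights; the exercise asks for the proportion of each carrier's flights delayed more than 20 minutes, which can be displayed with a one-line sketch (assuming FD is loaded):
tapply(FD$Delay > 20, FD$Carrier, mean) # proportion of flights delayed more than 20 minutes, by carrier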
output <- FD %>%
group_by(Carrier) %>%
summarize(prop = mean(Delay > 20)) %>%
summarize(obs_diff = prop[2] - prop[1])
md <- numeric(sims)
for(i in 1:sims) {
index <- sample(4029, 1123, replace = FALSE)
md[i] <- mean(FD$Delay[index] > 20) - mean(FD$Delay[-index] > 20)
}
pvalue <- ((sum(md >= output$obs_diff) + 1) / (sims + 1)) * 2
pvalue
[1] 4e-04
Given that the p-value is below 0.05, the difference in the proportions of delayed flights is statistically significant. This suggests that United Airlines has a higher proportion of delays over 20 minutes than American Airlines.
output <- FD %>%
group_by(Carrier) %>%
summarize(vr = var(Delay)) %>%
summarize(diff = vr[2] - vr[1])
sims <- 10^4 - 1
md <- numeric(sims)
for(i in 1:sims){
index <- sample(4029, 1123, replace = FALSE)
md[i] <- var(FD$Delay[index]) - var(FD$Delay[-index])
}
pvalue <- ((sum(md >= output$diff) + 1) / (sims + 1)) * 2
pvalue
[1] 2e-04
The p-value is below 0.05, so the difference in the variances of the delay lengths is statistically significant. This suggests that the variance of the United Airlines delays is greater than that of the American Airlines delays.
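As a check on the direction of this difference, the two sample variances can be printed directly (a sketch, assuming FD is loaded):
tapply(FD$Delay, FD$Carrier, var) # sample variance of delay times for AA and UA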
In the Flight Delays Case Study in Section 1.1, repeat the permutation test for the difference in mean delay times between the two carriers using several test statistics, computing all of them within a single for loop.
FD %>%
group_by(Carrier) %>%
summarize(n = n())
# A tibble: 2 x 2
Carrier n
<fctr> <int>
1 AA 2906
2 UA 1123
output <- FD %>%
group_by(Carrier) %>%
summarize(m = mean(Delay)) %>%
summarize(obs = m[2] + m[1])
output_sum <- FD %>%
group_by(Carrier) %>%
summarize(s = sum(Delay)) %>%
summarize(obs = s[2])
output_diff <- FD %>%
group_by(Carrier) %>%
summarize(m = mean(Delay)) %>%
summarize(obs_diff = m[2] - m[1])
sims <- 10^4-1
md1 <- numeric(sims)
md2 <- numeric(sims)
md3 <- numeric(sims)
for(i in 1:sims) {
index <- sample(4029, 1123, replace = FALSE)
md1[i] <- mean(FD$Delay[index]) + mean(FD$Delay[-index])
md2[i] <- sum(FD$Delay[index])
md3[i] <- mean(FD$Delay[index]) - mean(FD$Delay[-index])
}
pvalue <- ((sum(md1 >= output$obs) + 1) / (sims + 1)) * 2
pvalue
[1] 4e-04
pvalue <- ((sum(md2 >= output_sum$obs) + 1) / (sims + 1)) * 2
pvalue
[1] 4e-04
pvalue <- ((sum(md3 >= output_diff$obs_diff) + 1) / (sims + 1)) * 2
pvalue
[1] 4e-04
All three test statistics give a p-value below 0.05, so the difference in delay times between the two carriers is statistically significant no matter which statistic is used.
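The agreement of the three p-values is not a coincidence. Under the permutation scheme the group sizes \(n_1 = 1123\) and \(n_2 = 2906\) and the overall total \(T = S_1 + S_2\) are fixed, where \(S_1\) denotes the sum of the delays assigned to the United-sized group. Each statistic is then an increasing function of \(S_1\): the group sum is \(S_1\) itself, the sum of the two group means is \(S_1(1/n_1 - 1/n_2) + T/n_2\), and the difference in means is \(S_1(1/n_1 + 1/n_2) - T/n_2\). Because increasing functions order the resampled values identically, all three statistics produce the same p-value when computed from the same set of permutations.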
In the Flight Delays Case Study in Section 1.1,
Find the 25% trimmed mean of the delay times for United Airlines and American Airlines.
Conduct a two-sided test to see if the difference in trimmed means is statistically significant.
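The exercise first asks for the 25% trimmed means themselves, while the code below only computes their difference; a one-line sketch (assuming FD is loaded) that displays them:
tapply(FD$Delay, FD$Carrier, mean, trim = 0.25) # 25% trimmed mean delay by carrier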
output <- FD %>%
group_by(Carrier) %>%
summarize(m = mean(Delay, trim = 0.25)) %>%
summarize(obs = m[2] - m[1])
#Two sided test
md <- numeric(sims)
for(i in 1:sims) {
index <- sample(4029,1123, replace = FALSE)
md[i] <- mean(FD$Delay[index], trim = 0.25) - mean(FD$Delay[-index], trim = 0.25)
}
pvalue <- ((sum(md <= output$obs) + 1) / (sims + 1)) * 2
pvalue
[1] 2e-04
The p-value is below 0.05, so the difference in trimmed means is statistically significant. With 25% trimming, the mean delay time for United Airlines is shorter than that for American Airlines.
In the Flight Delays Case Study in Section 1.1,
Compute the proportion of times the flights in May and in June were delayed more than 20 min, and conduct a two-sided test of whether the difference between months is statistically significant.
Compute the variance of the flight delay times in May and June and then conduct a two-sided test of whether the ratio of variances is statistically significantly different from 1.
output <- FD %>%
group_by(Month) %>%
summarize(prop = mean(Delay > 20)) %>%
summarize(obs_diff = prop[2] - prop[1])
md <- numeric(sims)
may <- subset(FD, select = Delay, subset = Month == "May", drop = TRUE)
june <- subset(FD, select = Delay, subset = Month == "June", drop = TRUE)
months <- c(may, june)
for(i in 1:sims) {
index <- sample(4029, 1999, replace = FALSE)
md[i] <- mean(months[index] > 20) - mean(months[-index] > 20)
}
pvalue <- ((sum(md <= output$obs_diff) + 1) / (sims + 1)) * 2
pvalue
[1] 2e-04
The p-value for the difference between May and June in the proportion of flights delayed more than 20 minutes is less than 0.05, so the difference is statistically significant.
FD %>%
filter(Delay > 20) %>% filter(Month == "June" | Month == "May") %>%
group_by(Month) %>%
summarize(n = n())
# A tibble: 2 x 2
Month n
<fctr> <int>
1 June 398
2 May 333
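The counts above can be converted into the proportions the first part of the exercise asks for; a quick sketch (assuming FD is loaded):
tapply(FD$Delay > 20, FD$Month, mean) # proportion of flights delayed more than 20 minutes, by month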
output <- FD %>%
group_by(Month) %>%
summarize(vr = var(Delay)) %>%
summarize(ratio = vr[2] / vr[1])
md <- numeric(sims)
may <- subset(FD, select = Delay, subset = Month == "May", drop = TRUE)
june <- subset(FD, select = Delay, subset = Month == "June", drop = TRUE)
months <- c(may, june)
for(i in 1:sims) {
index <- sample(4029, 1999, replace = FALSE)
md[i] <- var(months[index]) / var(months[-index])
}
pvalue <- ((sum(md <= output$ratio) + 1) / (sims + 1)) * 2
pvalue
[1] 2e-04
The p-value is less than 0.05, so the ratio of the May and June delay variances is statistically significantly different from 1.
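The exercise also asks for the two variances themselves, not just their ratio; a short sketch (assuming FD is loaded) that prints them:
tapply(FD$Delay, FD$Month, var) # sample variance of the delay times for June and May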
Researchers at the University of Nebraska conducted a study to investigate sex differences in dieting trends among a group of Midwestern college students (Davy et al. (2006)). Students were recruited from an introductory nutrition course during one term. Below are data from one question asked of 286 participants.
Write down the appropriate hypothesis to test to see if there is a relationship between gender and diet and then carry out the test.
Can the results be generalized to a population? Explain.
Null Hypothesis (\(H_0\)): Following a low-fat diet is independent of gender among Midwestern college students.
Alternative Hypothesis (\(H_A\)): Following a low-fat diet is not independent of gender among Midwestern college students.
LowFatDiet
Gender Yes No
Women 35 146
Men 8 97
# build the observed table of counts from the data shown above
DT <- matrix(c(35, 146, 8, 97), nrow = 2, byrow = TRUE,
  dimnames = list(Gender = c("Women", "Men"), LowFatDiet = c("Yes", "No")))
ODT <- as.table(DT)
ODTDF <- as.data.frame(ODT)
DDF <- as.tbl(vcdExtra::expand.dft(ODTDF))
Test <- xtabs(~Gender + LowFatDiet, data = DDF)
chisq.test(Test, correct = FALSE)
Pearson's Chi-squared test
data: Test
X-squared = 7.1427, df = 1, p-value = 0.007527
The p-value is less than 0.05, so we reject the null hypothesis. We find evidence to support the alternative hypothesis that following a low-fat diet is not independent of gender among Midwestern college students.
Write down the appropriate hypothesis to test whether there is a relationship between age and support for medicinal marijuana and carry out the test.
Support
Age Against For
18-29 years old 52 172
30-49 years old 103 313
50 years or older 119 258
\(H_0\): There is no association between age and support for medicinal marijuana. \(H_A\): There is an association between age and support for medicinal marijuana.
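The object T1 used in the call below is not constructed anywhere in the code shown; one minimal way to build it from the counts in the table above (the name T1 is chosen only to match the call that follows) is:
T1 <- matrix(c(52, 172, 103, 313, 119, 258), nrow = 3, byrow = TRUE,
  dimnames = list(Age = c("18-29 years old", "30-49 years old", "50 years or older"),
                  Support = c("Against", "For")))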
chisq.test(T1, correct = FALSE)
Pearson's Chi-squared test
data: T1
X-squared = 6.6814, df = 2, p-value = 0.03541
Since the p-value of 0.03541 is less than 0.05, we reject \(H_0\) in favor of \(H_A\).
Using the Cereals data, create a table to summarize the relationship between age of target consumer and shelf location.
Conduct a chi-square test using R’s chisq.test command.
R returns a warning message. Compute the expected counts for each cell to see why.
Conduct a permutation test for independence.
Cereals <- read.csv("http://www1.appstate.edu/~arnholta/Data/Cereals.csv")
#Part A
T2 <- xtabs(~Age + Shelf, data = Cereals)
T2
Shelf
Age bottom middle top
adult 2 1 14
children 7 18 1
#part b
chisq.test(T2, correct = FALSE)
Pearson's Chi-squared test
data: T2
X-squared = 28.625, df = 2, p-value = 6.083e-07
#part C
chisq.test(T2)$expected
The expected count for the adult/bottom cell is about 3.6 (17 * 9 / 43), which is below 5; this small expected count is what triggers the warning that the chi-squared approximation may be incorrect.
#part D
obs_stat_diff <- chisq.test(T2)$statistic
loopityloop <- 10^4-1
result <- numeric(loopityloop)
for (i in 1:loopityloop) {
T3 <- xtabs(~sample(Age) + Shelf, data = Cereals)
result[i] <- chisq.test(T3)$statistic
}
pvalue <- (sum(result >= obs_stat_diff) + 1)/(loopityloop + 1)
pvalue
[1] 1e-04
Since the p-value of 0.0001 is less than 0.05, we reject the null hypothesis and conclude that there is a relationship between the shelf placement of cereals in stores and the age group of the targeted consumers.
Create a table to summarize the relationship between gender and the person’s choice for president in the 2000 election.
Test to see if a person’s choice for president in the 2000 election is independent of gender (use chisq.test in R).
Repeat the test but use the permutation test for independence. Does your conclusion change? (Be sure to remove observations with missing values)
GSS2002 <- read.csv("http://www1.appstate.edu/~arnholta/Data/GSS2002.csv")
T4 <- xtabs(~Gender + Pres00, data = GSS2002)
T4
Pres00
Gender Bush Didnt vote Gore Nader Other
Female 459 5 492 26 3
Male 426 5 289 31 13
#part b
chisq.test(T4, correct = FALSE)
Pearson's Chi-squared test
data: T4
X-squared = 33.29, df = 4, p-value = 1.042e-06
With a p-value of 1.042e-06, we can safely reject the null hypothesis of independence. There is a relationship between a person's gender and the person's choice for president in the 2000 election.
#part c
obs_stat_diff1 <- chisq.test(T4)$statistic
anothernameforresult <- numeric(loopityloop)
# remove observations with missing values so the permutation uses the same rows as the observed table
GSS_pres <- subset(GSS2002, !is.na(Pres00) & !is.na(Gender))
for (i in 1:loopityloop) {
T5 <- xtabs(~sample(Pres00) + Gender, data = GSS_pres)
anothernameforresult[i] <- chisq.test(T5)$statistic
}
pvalue <- (sum(anothernameforresult >= obs_stat_diff1) + 1)/(loopityloop + 1)
pvalue
[1] 1e-04
Given the small p-value of 1e-04, our conclusion does not change from part (b). Since the p-value is less than 0.05, we still reject the null hypothesis and conclude that there is a relationship between a person's gender and the person's choice for president.
Create a table to summarize the relationship between gender and the person's general level of happiness (Happy).
Conduct a permutation test to see if gender and level of happiness are independent (Be sure to remove the observations with missing values).
#part a
T4 <- xtabs(~Gender + Happy, data = GSS2002)
T4
Happy
Gender Not too happy Pretty happy Very happy
Female 109 406 205
Male 61 378 210
#part b
obs_stat_diff2 <- chisq.test(T4)$statistic
anewresult <- numeric(loopityloop)
# remove observations with missing values before permuting
GSS_happy <- subset(GSS2002, !is.na(Happy) & !is.na(Gender))
for (i in 1:loopityloop) {
T6 <- xtabs(~sample(Happy) + Gender, data = GSS_happy)
anewresult[i] <- chisq.test(T6)$statistic
}
pvalue <- (sum(anewresult >= obs_stat_diff2) + 1)/(loopityloop + 1)
pvalue
[1] 0.0048
With the p-value below 0.05, we reject the null hypothesis and conclude that there is a relationship between a person's gender and the person's level of happiness.
Create a table to summarize the relationship between support for gun laws (GunLaw) and views on government spending on the military (SpendMilitary).
Conduct a permutation test to see if support for gun laws and views on government spending on the military are independent (Be sure to remove observations with missing values).
#part a
T5 <- xtabs(~GunLaw + SpendMilitary, data = GSS2002)
T5
SpendMilitary
GunLaw About right Too little Too much
Favor 168 101 72
Oppose 34 33 19
#part b
obs_stat_diff3 <- chisq.test(T5)$statistic
result3 <- numeric(sims)
# remove observations with missing values before permuting
GSS_gun <- subset(GSS2002, !is.na(GunLaw) & !is.na(SpendMilitary))
for (i in 1:sims) {
T7 <- xtabs(~sample(GunLaw) + SpendMilitary, data = GSS_gun)
result3[i] <- chisq.test(T7)$statistic
}
pvalue <- (sum(result3 >= obs_stat_diff3) + 1)/(sims + 1)
pvalue
Because the p-value is about 0.2, which is greater than 0.05, we do not reject the null hypothesis and conclude that support for gun laws and views on government spending on the military are independent.