1. In the Flight Delays Case Study in Section 1.1,

    1. The data contain flight delays for two airlines, American Airlines and United Airlines. Conduct a two-sided permutation test to see if the difference in mean delay times between the two carriers is statistically significant.
      Ho: The mean delay times of the two carriers are equal.
      HA: The mean delay times of the two carriers are different.
FD <- read.csv("http://www1.appstate.edu/~arnholta/Data/FlightDelays.csv")
#a. Your code here
tapply(FD$Delay, FD$Carrier, mean)
      AA       UA 
10.09738 15.98308 
observed <- 10.09738 - 15.98308
observed
[1] -5.8857
time <- FD$Delay
N <- 10^4 - 1
result <- numeric(N)
for(i in 1:N){
  index <- sample(1687, 800, replace = FALSE)
  result[i] <- mean(time[index]) - mean(time[-index])
}
hist(result, xlab = "xbar1 - xbar2",
     main = "Permutation Distribution for Flight Times, diff of Carriers")
abline(v=observed, col = "blue")

pval <- (1 - (sum(result >= observed) + 1)/(N+1)) * 2
pval
[1] 0.0104

The p-value is 0.0104, which is less than the significance level α = 0.05. We therefore have sufficient evidence to reject the null hypothesis that the mean delay times of the two carriers are equal.
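
The loop above passes fixed numbers to `sample()`. A minimal sketch of the same two-sided test that reads the number of observations and the group size from the data is given below. The helper name `perm_diff_means()` and the synthetic data are assumptions made for illustration only; they are not part of the case study.

```r
# Two-sided permutation test for a difference in means between two groups.
# perm_diff_means() is a hypothetical helper, not part of any package.
perm_diff_means <- function(x, group, N = 10^4 - 1) {
  group <- as.factor(group)
  stopifnot(nlevels(group) == 2, length(x) == length(group))
  first <- group == levels(group)[1]
  observed <- mean(x[first]) - mean(x[!first])
  n1 <- sum(first)                       # size of the first group, taken from the data
  result <- numeric(N)
  for (i in 1:N) {
    # randomly relabel n1 of all observations as the first group
    index <- sample(length(x), n1, replace = FALSE)
    result[i] <- mean(x[index]) - mean(x[-index])
  }
  # two-sided p-value: double the smaller tail, capped at 1
  upper <- (sum(result >= observed) + 1) / (N + 1)
  lower <- (sum(result <= observed) + 1) / (N + 1)
  list(observed = observed, p.value = min(1, 2 * min(upper, lower)))
}

# Synthetic stand-in data (assumed for illustration only, not the flight data)
set.seed(1)
delay   <- c(rexp(300, rate = 1 / 10), rexp(120, rate = 1 / 16))
carrier <- rep(c("AA", "UA"), times = c(300, 120))
out <- perm_diff_means(delay, carrier)
out$p.value
```

With the case-study data the call would presumably be `perm_diff_means(FD$Delay, FD$Carrier)`. Doubling the smaller tail avoids having to decide in advance which tail the observed difference falls in.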

b. The flight delays occurred in May and June of 2009.  Conduct a two-sided permutation test to see if the difference in mean delay times between the 2 months is statistically significant.  
H~o~: The mean delay times in May and June are equal.  
H~A~: The mean delay times in May and June are different.  
# b. Your code here
tapply(FD$Delay, FD$Month, mean)
     June       May 
14.547783  8.884442 
observed <- 14.547783 - 8.884442
observed
[1] 5.663341
time <- FD$Delay
N <- 10^4 - 1
result <- numeric(N)
for(i in 1:N){
  index <- sample(1687, 800, replace = FALSE)
  result[i] <- mean(time[index]) - mean(time[-index])
}
hist(result, xlab = "xbar1 - xbar2",
     main = "Permutation Distribution for Flight Times, diff of Months")
abline(v=observed, col = "red")

pval <- ((sum(result >= observed) + 1)/(N+1)) * 2
pval
[1] 2e-04

The p-value is 0.0002, which is much less than the significance level α = 0.05. We therefore have sufficient evidence to reject the null hypothesis that the mean delay times in the two months are equal.

  1. In the Flight Delays Case Study in Section 1.1, the data contain flight delays for two airlines, American Airlines and United Airlines.

    1. Compute the proportion of times that each carrier’s flights were delayed more than 20 minutes. Conduct a two-sided test to see if the difference in these proportions is statistically significant.
      Ho: The proportion of flights delayed more than 20 minutes is the same for both carriers.
      HA: The proportion of flights delayed more than 20 minutes differs between the carriers.
# a. Your code here
time <- FD$Delay[(FD$Delay > 20)]
tapply(time, FD$Carrier[FD$Delay > 20], mean)
      AA       UA 
74.10366 82.72803 
observed <- 74.10366 - 82.72803
observed
[1] -8.62437
N <- 10^4 - 1
result <- numeric(N)
for(i in 1:N){
  index <- sample(731, 375, replace = FALSE)
  result[i] <- mean(time[index]) - mean(time[-index])
}
hist(result, xlab = "xbar1 - xbar2",
     main = "Permutation Distribution for Flight Times, diff of Carriers")
abline(v=observed, col = "red")

pval <- (1 - (sum(result >= observed) + 1)/(N+1)) * 2
pval
[1] 0.0688

The p-value (0.0688) is greater than the significance level α = 0.05. We therefore do not have sufficient evidence to reject the null hypothesis that the two carriers have the same proportion of flights delayed more than 20 minutes.
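
A proportion is the mean of a TRUE/FALSE indicator, so a test on proportions can permute the indicator `Delay > 20` rather than the delay lengths of the late flights. The sketch below shows that idea on synthetic data; the data and the variable names `late` and `n_AA` are assumptions for illustration, not the case-study values.

```r
# Permutation test for a difference in proportions between two groups.
set.seed(2)
# Synthetic stand-in data (assumed for illustration only, not the flight data)
delay   <- c(rexp(300, rate = 1 / 10), rexp(120, rate = 1 / 16))
carrier <- rep(c("AA", "UA"), times = c(300, 120))

late <- delay > 20                     # TRUE when a flight is more than 20 minutes late
prop <- tapply(late, carrier, mean)    # proportion of late flights for each carrier
observed <- unname(prop["AA"] - prop["UA"])

N <- 10^4 - 1
n_AA <- sum(carrier == "AA")           # group size taken from the data
result <- numeric(N)
for (i in 1:N) {
  index <- sample(length(late), n_AA, replace = FALSE)
  result[i] <- mean(late[index]) - mean(late[-index])
}
# two-sided p-value: double the smaller tail, capped at 1
upper <- (sum(result >= observed) + 1) / (N + 1)
lower <- (sum(result <= observed) + 1) / (N + 1)
pval <- min(1, 2 * min(upper, lower))
pval
```

Every flight stays in the permuted data set, including those delayed 20 minutes or less, because they form the denominator of each proportion.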

b. Compute the variance in the flight delay lengths for each carrier.  Conduct a test to see if the variance for United Airlines is greater than that of American Airlines.  
H~o~: variance for United Airlines = variance for American Airlines  
H~A~: variance for United Airlines > variance for American Airlines  
# b. Your code here
FD %>%
  group_by(Carrier) %>%
  summarize(Variance = var(Delay))
# A tibble: 2 x 2
  Carrier Variance
   <fctr>    <dbl>
1      AA 1606.457
2      UA 2037.525
observed <- 74.10366 - 82.72803
observed
[1] -8.62437
N <- 10^4 - 1
result <- numeric(N)
for(i in 1:N){
  index <- sample(731, 375, replace = FALSE)
  result[i] <- var(time[index]) - var(time[-index])
}
hist(result, xlab = "var1 - var2",
     main = "Permutation Distribution for Flight Times, diff of Variances")
abline(v=observed, col = "red")

pval <- ((sum(result >= observed) + 1)/(N+1))
pval
[1] 0.5025
American Airlines has a variance of 1606.457, while United Airlines has a variance of 2037.525. Because the p-value of the test (0.5025) is greater than the significance level α = 0.05, we do not have sufficient evidence to reject the hypothesis that United Airlines' variance is equal to American Airlines' variance.
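
For a one-sided test of whether one group's variance is larger, the observed statistic is the difference of the two sample variances, computed from all delays in each group, and only the upper tail is counted. A minimal sketch on synthetic data follows; the data and the names `is_UA` and `n_UA` are assumptions for illustration.

```r
# One-sided permutation test: is the UA variance greater than the AA variance?
set.seed(3)
# Synthetic stand-in data (assumed for illustration only, not the flight data)
delay   <- c(rnorm(300, mean = 10, sd = 40), rnorm(120, mean = 16, sd = 45))
carrier <- rep(c("AA", "UA"), times = c(300, 120))

is_UA <- carrier == "UA"
observed <- var(delay[is_UA]) - var(delay[!is_UA])   # UA variance minus AA variance

N <- 10^4 - 1
n_UA <- sum(is_UA)
result <- numeric(N)
for (i in 1:N) {
  index <- sample(length(delay), n_UA, replace = FALSE)
  result[i] <- var(delay[index]) - var(delay[-index])
}
# upper-tail p-value, matching the alternative "UA variance > AA variance"
pval <- (sum(result >= observed) + 1) / (N + 1)
pval
```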
  1. In the Flight Delays Case Study in Section 1.1, repeat Exercise 3 part (a) using three test statistics: (i) the mean of the United Airlines delay times, (ii) the sum of the United Airlines delay times, and (iii) the difference in the means, and compare the P-values. Make sure all three test statistics are computed within the same for loop.
# Your code here
N <- 10^4 - 1
UA.Delay <- subset(FD, select = Delay, Carrier == "UA", drop = T)
AA.Delay <- subset(FD, select = Delay, Carrier == "AA", drop = T)

observeUASum <- sum(UA.Delay)    # (ii) observed sum of the UA delay times
observeUAMean <- mean(UA.Delay)  # (i) observed mean of the UA delay times
observeDiff <- mean(UA.Delay) - mean(AA.Delay)  # (iii) observed difference in means
n <- length(UA.Delay)
UAN <- numeric(N)  # permuted sums
UAM <- numeric(N)  # permuted means
MD <- numeric(N)   # permuted differences in means

set.seed(0)
for(i in 1:N) {
  index <- sample(4029, n, replace = FALSE)
  UAN[i] <- sum(FD$Delay[index])
  UAM[i] <- mean(FD$Delay[index])
  MD[i] <- mean(FD$Delay[index]) - mean(FD$Delay[-index])
}
((sum(UAN >= observeUASum) + 1)/(N+1))
[1] 1e-04
((sum(UAM >= observeUAMean) + 1)/(N+1))
[1] 1e-04
((sum(MD >= observeDiff) + 1)/(N+1))
[1] 1e-04

The p-value for each test is 0.0001 (< α = 0.05), so in each case we have sufficient evidence to reject the null hypothesis (Ho).
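
The three p-values agree for a structural reason. The sum of a permuted group is its mean times the fixed group size, and because the total of all delays never changes, the difference in means is an increasing function of the first group's mean. All three statistics therefore rank the permutations in the same order. The sketch below checks this on synthetic data; the data and variable names are assumptions for illustration.

```r
# Mean, sum, and difference in means computed in one loop give the same p-value.
set.seed(4)
# Synthetic stand-in data (assumed for illustration only, not the flight data)
delay   <- c(rexp(300, rate = 1 / 10), rexp(120, rate = 1 / 16))
carrier <- rep(c("AA", "UA"), times = c(300, 120))
UA <- delay[carrier == "UA"]
AA <- delay[carrier == "AA"]

n <- length(UA)
obs_mean <- mean(UA)
obs_sum  <- sum(UA)
obs_diff <- mean(UA) - mean(AA)

N <- 10^4 - 1
perm_mean <- perm_sum <- perm_diff <- numeric(N)
for (i in 1:N) {
  index <- sample(length(delay), n, replace = FALSE)
  perm_mean[i] <- mean(delay[index])
  perm_sum[i]  <- sum(delay[index])
  perm_diff[i] <- mean(delay[index]) - mean(delay[-index])
}
p_mean <- (sum(perm_mean >= obs_mean) + 1) / (N + 1)
p_sum  <- (sum(perm_sum  >= obs_sum)  + 1) / (N + 1)
p_diff <- (sum(perm_diff >= obs_diff) + 1) / (N + 1)
c(mean = p_mean, sum = p_sum, diff = p_diff)
```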

  1. In the Flight Delays Case Study in Section 1.1,

    1. Find the 25% trimmed mean of the delay times for United Airlines and American Airlines.
    2. Conduct a two-sided test to see if the difference in trimmed means is statistically significant.
      Ho: The 25% trimmed means of the delay times for United Airlines and American Airlines are equal.
      HA: The 25% trimmed means of the delay times for United Airlines and American Airlines are different.
# Your code here
FD %>%
  group_by(Carrier) %>%
  summarize(TrimMean = mean(Delay, trim = .25))
# A tibble: 2 x 2
  Carrier   TrimMean
   <fctr>      <dbl>
1      AA -2.5701513
2      UA -0.7957371
observed <- -2.5701513 + 0.7957371
observed
[1] -1.774414
N <- 10^4 - 1
result <- numeric(N)
for(i in 1:N){
  index <- sample(731, 375, replace = FALSE)
  result[i] <- mean(time[index], trim = .25) - mean(time[-index], trim = .25)
}
hist(result, xlab = "trimmed xbar1 - trimmed xbar2",
     main = "Permutation Distribution for Flight Times, diff of Carriers")
abline(v=observed, col = "red")

pval <- (1 - ((sum(result >= observed) + 1)/(N+1))) * 2
pval
[1] 0.6392
The p-value (0.6392) is greater than the significance level α = 0.05.  Therefore, we have insufficient evidence to reject the null hypothesis that the trimmed means of the delay times for United Airlines and American Airlines are equal.  
  1. In the Flight Delays Case Study in Section 1.1,

    1. Compute the proportion of times the flights in May and in June were delayed more than 20 min, and conduct a two-sided test of whether the difference between months is statistically significant.
      Ho: The proportion of flights delayed more than 20 min is the same in May and in June.
      HA: The proportion of flights delayed more than 20 min differs between May and June.
# a. Your code here
delay20 <- FD$Delay[(FD$Delay > 20)]
FD %>%
  group_by(Month) %>%
  summarize(Variance = var(delay20))
# A tibble: 2 x 2
   Month Variance
  <fctr>    <dbl>
1   June 4214.232
2    May 4214.232
observed <- 74.10366 - 82.72803
observed
[1] -8.62437
N <- 10^4 - 1
result <- numeric(N)
for(i in 1:N){
  index <- sample(731, 375, replace = FALSE)
  result[i] <- mean(time[index]) - mean(time[-index])
}
hist(result, xlab = "xbar1 - xbar2",
     main = "Permutation Distribution for Flight Times, diff of Months")
abline(v=observed, col = "red")

pval <- (1 - (sum(result >= observed) + 1)/(N+1)) * 2
pval
[1] 0.0692
b. Compute the variance of the flight delay times in May and June and then conduct a two-sided test of whether the ratio of variances is statistically significantly different from 1.  
# b. Your code here
FD %>%
  group_by(Month) %>%
  summarize(Variance = var(Delay))
# A tibble: 2 x 2
   Month Variance
  <fctr>    <dbl>
1   June 2069.884
2    May 1375.786
observed <- (2069.884 - 1375.786) - (74.10366 - 82.72803)
observed
[1] 702.7224
N <- 10^4 - 1
result <- numeric(N)
for(i in 1:N){
  index <- sample(731, 375, replace = FALSE)
  result[i] <- var(time[index]) - var(time[-index])
}
hist(result, xlab = "var1 - var2",
     main = "Permutation Distribution for Flight Times, diff of Months")
abline(v=observed, col = "red")

pval <- ((sum(result >= observed) + 1)/(N+1)) * 2
pval
[1] 0.6532

The p-value is 0.6532, which is greater than the significance level α = 0.05. Because of this, we do not have sufficient evidence to reject the hypothesis that the ratio of the variances of the flight delay times in May and June is equal to 1.
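
When the question is whether the ratio of variances differs from 1, the ratio itself can serve as the test statistic. Its permutation distribution is not symmetric, so one reasonable way to get a two-sided p-value is to double the smaller tail. The sketch below illustrates this on synthetic data; the data and the names `is_june` and `n_june` are assumptions for illustration.

```r
# Two-sided permutation test for a ratio of variances.
set.seed(5)
# Synthetic stand-in data (assumed for illustration only, not the flight data)
delay <- c(rnorm(200, mean = 14, sd = 45), rnorm(220, mean = 9, sd = 37))
month <- rep(c("June", "May"), times = c(200, 220))

is_june  <- month == "June"
observed <- var(delay[is_june]) / var(delay[!is_june])   # June variance / May variance

N <- 10^4 - 1
n_june <- sum(is_june)
result <- numeric(N)
for (i in 1:N) {
  index <- sample(length(delay), n_june, replace = FALSE)
  result[i] <- var(delay[index]) / var(delay[-index])
}
# two-sided p-value: double the smaller tail, capped at 1
upper <- (sum(result >= observed) + 1) / (N + 1)
lower <- (sum(result <= observed) + 1) / (N + 1)
pval <- min(1, 2 * min(upper, lower))
pval
```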

  1. Researchers at the University of Nebraska conducted a study to investigate sex differences in dieting trends among a group of Midwestern college students (Davy et al. (2006)). Students were recruited from an introductory nutrition course during one term. Below are data from one question asked to 286 participants.

    1. Write down the appropriate hypothesis to test to see if there is a relationship between gender and diet and then carry out the test.
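      Ho: Gender and being on a low-fat diet are independent (the proportion on a low-fat diet is the same for women and men).
      HA: Gender and being on a low-fat diet are not independent (the proportions differ).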
       LowFatDiet
Gender  Yes  No
  Women  35 146
  Men     8  97
# Your code here
observed <- 19.33702 - 7.619048 # proportions of women - men who are on diets
observed
[1] 11.71797
N <- 10^4 - 1
result <- numeric(N)
for(i in 1:N){
  index <- sample(286, 143, replace = FALSE)
  result[i] <- mean(time[index]) - mean(time[-index])
}
hist(result, xlab = "xbar1 - xbar2",
     main = "Permutation Distribution for Flight Times, diff of Months")
abline(v=observed, col = "red")

pval <- ((sum(result >= observed) + 1)/(N+1))
pval
[1] 1e-04

Since the p-value is 0.0001, which is less than the assumed significance level α = 0.05, we have sufficient evidence to reject the hypothesis that the proportion of women on a low-fat diet is equal to the proportion of men on a low-fat diet.
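
Because only the table of counts is given, a permutation test needs one record per participant. The sketch below rebuilds the 286 records from the four counts in the table, then shuffles the diet labels and recomputes the chi-square statistic each time. The vector names and the choice of `correct = FALSE` (plain Pearson statistic, no continuity correction) are assumptions for illustration.

```r
# Permutation test of independence for the gender by low-fat-diet table.
# Rebuild one record per participant from the counts in the table.
gender <- rep(c("Women", "Men"), times = c(181, 105))
diet   <- c(rep(c("Yes", "No"), times = c(35, 146)),   # women: 35 yes, 146 no
            rep(c("Yes", "No"), times = c(8, 97)))     # men: 8 yes, 97 no

observed_table <- table(gender, diet)
observed <- unname(chisq.test(observed_table, correct = FALSE)$statistic)

set.seed(6)
N <- 10^4 - 1
result <- numeric(N)
for (i in 1:N) {
  # shuffling diet breaks any link with gender while keeping both margins fixed
  result[i] <- chisq.test(table(gender, sample(diet)), correct = FALSE)$statistic
}
pval <- (sum(result >= observed) + 1) / (N + 1)
pval
```

The chi-square statistic is large for departures from independence in either direction, so counting only the upper tail already gives a two-sided test.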

  1. Can the results be generalized to a population? Explain.
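    The results cannot safely be generalized to a broader population. The participants were recruited from a single introductory nutrition course during one term, so they are not a random sample of any larger population, and re-sampling the observed data does not change how the data were collected.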
  1. A national polling company conducted a survey in 2001 asking a randomly selected group of Americans of 18 years of age or older whether they supported limited use of marijuana for medicinal purposes. Here is a summary of the data:

    Write down the appropriate hypothesis to test whether there is a relationship between age and support for medicinal marijuana and carry out the test.

                   Support
Age                 Against For
  18-29 years old        52 172
  30-49 years old       103 313
  50 years or older     119 258

Ho: There is no relationship between age and support for medicinal marijuana
HA: There is a relationship between age and support for medicinal marijuana

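# Enter the counts from the summary table above
T1 <- matrix(c(52, 103, 119, 172, 313, 258), nrow = 3,
             dimnames = list(Age = c("18-29 years old", "30-49 years old",
                                     "50 years or older"),
                             Support = c("Against", "For")))
chisq.test(T1)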

    Pearson's Chi-squared test

data:  T1
X-squared = 6.6814, df = 2, p-value = 0.03541

Since the p-value = 0.03541 < 0.05, we can reject the notion that there is no relationship between age and support for medicinal marijuana.
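
The statistic reported by `chisq.test` can be reproduced by hand: each expected count is the row total times the column total divided by the grand total, and the statistic sums (observed - expected)^2 / expected over the cells. The sketch below does this for the counts in the table above; the object name `counts` is an assumption for illustration.

```r
# Chi-square test of independence computed by hand from the table of counts.
counts <- matrix(c(52, 103, 119, 172, 313, 258), nrow = 3,
                 dimnames = list(Age = c("18-29 years old", "30-49 years old",
                                         "50 years or older"),
                                 Support = c("Against", "For")))

# expected count = row total * column total / grand total
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)
X2   <- sum((counts - expected)^2 / expected)
df   <- (nrow(counts) - 1) * (ncol(counts) - 1)
pval <- pchisq(X2, df = df, lower.tail = FALSE)
round(c(X2 = X2, df = df, p.value = pval), 4)
```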

  1. Two students went to a local supermarket and collected data on cereals; they classified cereals by their target consumer (children versus adults) and the placement of the cereal on the shelf (bottom, middle, and top). The data are given in Cereals.

    1. Create a table to summarize the relationship between age of target consumer and shelf location.

    2. Conduct a chi-square test using R’s chisq.test command.

    3. R returns a warning message. Compute the expected counts for each cell to see why.

    4. Conduct a permutation test for independence.

Cereals <- read.csv("http://www1.appstate.edu/~arnholta/Data/Cereals.csv")
# Your code here

# 1. Table of target consumer (Age) by shelf location (Shelf)
Tble <- table(Cereals$Age, Cereals$Shelf)
Tble

# 2. Chi-square test of independence
chisq.test(Tble)
observed <- chisq.test(Tble)$statistic
observed

# 3. Expected counts for each cell
chisq.test(Tble)$expected

# 4. Permutation test for independence
N <- 10^4 - 1
result <- numeric(N)
for(i in 1:N) {
  C.table <- xtabs(~Age + sample(Shelf), data = Cereals)
  result[i] <- suppressWarnings(chisq.test(C.table)$statistic)
}
#hist(result, xlab = "chi-square statistic",
     #main = "Distribution of chi-square statistic")
#abline(v = observed, col = "blue")
(sum(result >= observed) + 1)/(N+1)

chisq.test gives a warning when some expected cell counts are small (below 5), because the chi-square approximation may then be inaccurate. The permutation test does not rely on that approximation.

  1. From GSS 2002 Case Study in Section 1.6,

    1. Create a table to summarize the relationship between gender and the person’s choice for president in the 2000 election.

    2. Test to see if a person’s choice for president in the 2000 election is independent of gender (use chisq.test in R).

    3. Repeat the test but use the permutation test for independence. Does your conclusion change? (Be sure to remove observations with missing values)

GSS2002 <- read.csv("http://www1.appstate.edu/~arnholta/Data/GSS2002.csv")
# Your code here

Test <- table(GSS2002$Gender, GSS2002$Pres00)
Test
chisq.test(Test)

If the p-value reported by chisq.test(Test) is less than 0.05, we conclude that a person's choice for president in the 2000 election is *dependent* on gender; otherwise we do not have sufficient evidence against independence.  
  1. From GSS 2002 Case Study in Section 1.6,

    1. Create a table to summarize the relationship between gender and the person’s general level of happiness (Happy).

    2. Conduct a permutation test to see if gender and level of happiness are independent (Be sure to remove the observations with missing values).

# Your code here
Test <- table(GSS2002$Gender[!is.na(GSS2002$Happy)],
              GSS2002$Happy[!is.na(GSS2002$Happy)])
chisq.test(Test)

    Pearson's Chi-squared test

data:  Test
X-squared = 10.96, df = 2, p-value = 0.004168

The p-value is 0.004168, which is less than the significance level α = 0.05. Because of this, we have evidence to reject the hypothesis that gender and a person’s general level of happiness are independent.
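
The exercise asks for a permutation test with missing values removed. A minimal sketch of such a test is given below: it drops rows where either variable is missing, then shuffles one variable and recomputes the chi-square statistic. The helper names `chisq_stat()` and `perm_independence()` and the synthetic data are assumptions for illustration only.

```r
# Chi-square statistic computed directly from a table of counts
chisq_stat <- function(tab) {
  expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
  sum((tab - expected)^2 / expected)
}

# perm_independence() is a hypothetical helper, not part of any package
perm_independence <- function(x, y, N = 10^4 - 1) {
  keep <- !is.na(x) & !is.na(y)          # drop rows missing either variable
  x <- factor(x[keep])
  y <- factor(y[keep])
  observed <- chisq_stat(table(x, y))
  result <- numeric(N)
  for (i in 1:N) {
    result[i] <- chisq_stat(table(x, sample(y)))   # shuffle one variable
  }
  list(n = sum(keep), observed = observed,
       p.value = (sum(result >= observed) + 1) / (N + 1))
}

# Synthetic stand-in data with missing values (assumed for illustration only)
set.seed(7)
gender <- sample(c("Female", "Male"), 500, replace = TRUE)
happy  <- sample(c("Not too happy", "Pretty happy", "Very happy", NA), 500,
                 replace = TRUE, prob = c(0.15, 0.5, 0.25, 0.1))
out <- perm_independence(gender, happy)
out
```

With the survey data the call would presumably be `perm_independence(GSS2002$Gender, GSS2002$Happy)`, and the same helper could be applied to other pairs of columns such as `GunLaw` and `SpendMilitary`.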

  1. From GSS 2002 Case Study in Section 1.6,

    1. Create a table to summarize the relationship between support for gun laws (GunLaw) and views on government spending on the military (SpendMilitary).

    2. Conduct a permutation test to see if support for gun laws and views on government spending on the military are independent (Be sure to remove observations with missing values).
      Ho: Support for gun laws and views on military spending are independent
      HA: Support for gun laws and views on military spending are not independent

# Your code here
Test <- table(GSS2002$GunLaw[!is.na(GSS2002$GunLaw)],
              GSS2002$SpendMilitary[!is.na(GSS2002$GunLaw)])
chisq.test(Test)

    Pearson's Chi-squared test

data:  Test
X-squared = 3.0827, df = 2, p-value = 0.2141
The p-value is 0.2141, which is greater than the significance level α = 0.05.  So, we do not have sufficient evidence to conclude that support for gun laws and views on government spending on the military are related.