ST 411/511 Homework 1

Question 1

The Centers for Disease Control (CDC) identified 201,269 adult patients who were hospitalized for a COVID-like illness between January 1 and September 2, 2021. They determined that 7348 of these 201,269 patients had either been infected with COVID-19 90 to 179 days prior to hospitalization, or had been vaccinated against COVID-19 90 to 179 days prior to hospitalization but had no previous documented infection. Their analysis of these 7348 patients found a higher rate of COVID-19 infection for the unvaccinated patients who had previously had COVID-19 than for the vaccinated patients with no previous documented infection. Based on this study, they issued the following recommendation: “All eligible persons should be vaccinated against COVID-19 as soon as possible, including unvaccinated persons previously infected with SARS-CoV-2.” Assume what they mean by “eligible person” is someone who is within the recommended ages to receive the vaccine and has no medical reason to avoid it.

Was the CDC justified in making inference to “all eligible persons?” Briefly explain.

No. The 7348 patients in the analysis were not a random sample of all eligible persons: they were adults hospitalized with a COVID-like illness who had either been infected or been vaccinated 90 to 179 days earlier. Without random sampling, the results can be generalized at most to patients like those in the study, and a large sample size does not change this. The study is also observational, since patients were not randomly assigned to vaccination or prior infection, so the higher infection rate among the unvaccinated, previously infected patients may be partly due to confounding factors.

Question 2

This exercise will use the traffic fatality data of Exercise 23 in Chapter 2. Please read about the data there. The data are contained in an R data frame called ex0223 which resides in the Sleuth3 R package.

# Load the Sleuth3 package, then the data
library(Sleuth3)
data("ex0223")

# Inspect the data
head(ex0223)

Part (a)

State the name of the response variable.

The response variable is PctChange (percentage change in traffic fatalities).

Part (b)

Produce separate histograms for the two groups. Include your R code and the two plots.

# Histogram for each group
hist(ex0223$PctChange[ex0223$SpeedLimit == "Inc"], main="Histogram for Inc Group", xlab="PctChange", col="black")

hist(ex0223$PctChange[ex0223$SpeedLimit == "Ret"], main="Histogram for Ret Group", xlab="PctChange", col="white")
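Separate histograms are easier to compare when they share the same bins and x-axis range. The following is a minimal sketch of one way to do that. The helper name shared_breaks, the bin count, and the toy vectors are assumptions made for illustration; they are not part of the assignment or of the ex0223 data.

```r
# Hypothetical helper: equally spaced breaks that cover both groups.
shared_breaks <- function(x, y, n_bins = 10) {
  rng <- range(c(x, y))
  seq(rng[1], rng[2], length.out = n_bins + 1)
}

# Toy values for illustration only (not the ex0223 data).
inc <- c(-5, 2, 8, 14, 21, 30)
ret <- c(-20, -11, -3, 4, 9)

brks <- shared_breaks(inc, ret, n_bins = 5)

# Stack the two histograms and give them identical breaks and x-axis limits.
old_par <- par(mfrow = c(2, 1))
hist(inc, breaks = brks, xlim = range(brks), main = "Group 1 (toy data)", xlab = "Value")
hist(ret, breaks = brks, xlim = range(brks), main = "Group 2 (toy data)", xlab = "Value")
par(old_par)
```

With the homework data, the two PctChange subsets from the histogram calls above could be passed in place of inc and ret.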

Part (c)

Produce side-by-side boxplots of the percent change. Include your R code and the plot.

# Side-by-side boxplots
boxplot(PctChange ~ SpeedLimit, data = ex0223, main="Boxplots of Percent Change", xlab="Speed Limit Group", ylab="Percent Change", col=c("white", "white"))

Part (d)

Describe where the observed value of the difference in sample means falls on this histogram, and comment on the plausibility of the null hypothesis.

# Calculate sample means
mean_inc <- mean(ex0223$PctChange[ex0223$SpeedLimit == "Inc"])
mean_ret <- mean(ex0223$PctChange[ex0223$SpeedLimit == "Ret"])

# Observed difference in means
observed_diff <- mean_inc - mean_ret

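# Permutation test: shuffle the responses, then treat the first 32 values
# as the "Inc" group and the remaining 19 as the "Ret" group
set.seed(123)
perm_diff <- replicate(10000, {
  shuffled <- sample(ex0223$PctChange)
  mean(shuffled[1:32]) - mean(shuffled[33:51])
})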

# Histogram of permutation differences
hist(perm_diff, main="Permutation Test: Difference in Means", xlab="Difference in Means", col="white")
abline(v=observed_diff, col="red", lwd=2)

The observed difference in sample means falls near the center of the histogram, not in either tail. Differences at least this far from zero arise often when the responses are shuffled at random, so the observed difference is consistent with the null hypothesis of no difference in mean percent change between the two groups.
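The position of the observed difference on the histogram can also be summarized as a two-sided permutation p-value: the proportion of shuffles whose difference in means is at least as far from zero as the observed one. The sketch below shows one way to compute it. The function name perm_p_value and the toy data are assumptions for illustration, not part of the assignment.

```r
# Hypothetical helper: two-sided permutation p-value for a difference in means.
perm_p_value <- function(values, groups, n_perm = 10000) {
  groups <- as.factor(groups)
  stopifnot(nlevels(groups) == 2, length(values) == length(groups))
  lv <- levels(groups)
  # Difference in means: first factor level minus second factor level.
  diff_means <- function(g) mean(values[g == lv[1]]) - mean(values[g == lv[2]])
  observed <- diff_means(groups)
  # Shuffle the group labels; group sizes stay fixed.
  perm <- replicate(n_perm, diff_means(sample(groups)))
  # Proportion of shuffles at least as extreme as the observed difference.
  list(observed = observed, p_value = mean(abs(perm) >= abs(observed)))
}

# Toy values for illustration only (not the ex0223 data).
set.seed(1)
toy_values <- c(1, 3, 5, 7, 2, 4, 6, 8)
toy_groups <- rep(c("A", "B"), each = 4)
res <- perm_p_value(toy_values, toy_groups, n_perm = 2000)
res$observed
res$p_value
```

For the homework data, the call would be perm_p_value(ex0223$PctChange, ex0223$SpeedLimit). A large p-value corresponds to an observed difference near the center of the histogram, and a small one to a difference in the tails.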

Part (e)

Perform a two-sided t-test using R’s t.test( ) function. Turn in your R code, but not the output. Instead, write a brief “Statistical Conclusion” answering the research question “what is the evidence that the mean percent change differs between states that increased their speed limits and states that retained the 55-mile-per-hour maximum?”

# Two-sided t-test
t_test_result <- t.test(PctChange ~ SpeedLimit, data = ex0223)
t_test_result$p.value

## [1] 0.8719724

Statistical Conclusion: These data provide no evidence that the mean percent change differs between states that increased their speed limits and states that retained the 55-mile-per-hour maximum (two-sided p-value = 0.87 from a Welch two-sample t-test).

Part (f)

Find the confidence interval in your t.test( ) output, and use it to write a statistical conclusion answering the research question, “how large is the difference in mean percent change between states that increased their speed limit vs. those that did not increase their speed limit?”

# Confidence interval
conf_int <- t_test_result$conf.int
conf_int

## [1] -4.531402  5.319955
## attr(,"conf.level")
## [1] 0.95

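Statistical Conclusion: The mean percent change in states that increased their speed limit is estimated to be between 4.53 percentage points lower and 5.32 percentage points higher than in states that did not increase their speed limit (95% confidence interval for the Inc minus Ret difference: -4.53 to 5.32).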
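By default, t.test( ) does not assume equal variances, so the interval it reports is a Welch interval: the difference in sample means plus or minus a t quantile times the unpooled standard error. The sketch below rebuilds that interval by hand and checks it against t.test( ). The function name welch_ci and the toy data are assumptions for illustration only.

```r
# Hypothetical helper: Welch confidence interval for mean(x) - mean(y).
welch_ci <- function(x, y, conf_level = 0.95) {
  vx <- var(x) / length(x)  # squared standard error of mean(x)
  vy <- var(y) / length(y)  # squared standard error of mean(y)
  se <- sqrt(vx + vy)       # unpooled standard error of the difference
  # Welch-Satterthwaite approximate degrees of freedom
  df <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))
  est <- mean(x) - mean(y)
  half_width <- qt(1 - (1 - conf_level) / 2, df) * se
  c(lower = est - half_width, upper = est + half_width)
}

# Toy values for illustration only (not the ex0223 data).
x <- c(12.1, 9.8, 15.3, 7.4, 11.0, 13.6)
y <- c(8.2, 10.5, 6.9, 9.1, 12.4)

ci <- welch_ci(x, y)
ref <- t.test(x, y)$conf.int

# TRUE when the hand calculation agrees with t.test()
all.equal(unname(ci), as.numeric(ref))
```

If the course expects the pooled (equal-variance) interval instead, t.test( ) accepts var.equal = TRUE, and the standard error and degrees of freedom above would change accordingly.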

Part (g)

Can we make any inference about causation from these data? Answer yes or no, and briefly explain.

No. This is an observational study, not a randomized experiment: states chose whether to increase their speed limits rather than being randomly assigned to a group. Confounding factors may therefore explain any observed relationship, so we cannot infer causation from these data.