Permutation Analysis.
(a) Find the survival rates for lower limb
amputations in each of the post-antiseptic and pre-antiseptic periods
and report \(\hat{\Delta}\), the
post-rate minus the pre-rate.
Solution:
data <- read.table("lister.dat", header = FALSE, col.names = c("Subject_ID", "Year", "Antiseptic", "Limb", "Outcome"))
data <- subset(data, Limb == 1)
pre_rate <- mean(data$Outcome[data$Antiseptic == 0])
post_rate <- mean(data$Outcome[data$Antiseptic == 1])
delta_observed <- post_rate - pre_rate
delta_observed
## [1] 0.4166667
The answer is \(\boxed{\frac{5}{12}}\).
(b) Retain the survival values (0 or 1) but randomly permute the values of the pre versus post variable, keeping the total number of pre and post observations exactly the same as before. Do this \(B = 9,999\) times and make a histogram of the resulting \(\hat{\Delta}^*\) values. Mark that histogram with the original \(\hat{\Delta}\).
The one-sided \(p\)-value is given by: \[ \frac{1 + \sum_{b=1}^B \mathbf{1}(\hat{\Delta}^*_b \geq \hat{\Delta})}{1 + B} \]
The two-sided \(p\)-value is given by: \[ \frac{1 + \sum_{b=1}^B \mathbf{1}(|\hat{\Delta}^*_b| \geq |\hat{\Delta}|)}{1 + B} \]
The advantage of adding one to the numerator and denominator is that it keeps out impossible \(p\)-values like 0 and 1. You do not do that in a case where you can enumerate all the permutations.
Question: How many permutations would that be in
this case?
[Hint: In R, the sample function is
useful.]
Solution:
data <- read.table("lister.dat", header = FALSE, col.names = c("Subject_ID", "Year", "Antiseptic", "Limb", "Outcome"))
data <- subset(data, Limb == 1)
pre_rate <- mean(data$Outcome[data$Antiseptic == 0])
post_rate <- mean(data$Outcome[data$Antiseptic == 1])
delta_observed <- post_rate - pre_rate
set.seed(123)
B <- 9999
delta_star <- numeric(B)
for (b in 1:B) {
permuted_antiseptic <- sample(data$Antiseptic)
pre_rate_perm <- mean(data$Outcome[permuted_antiseptic == 0])
post_rate_perm <- mean(data$Outcome[permuted_antiseptic == 1])
delta_star[b] <- post_rate_perm - pre_rate_perm
}
delta_star_df <- data.frame(delta_star = delta_star)
ggplot(delta_star_df, aes(x = delta_star)) +
geom_histogram(binwidth = 0.01, fill = "blue", alpha = 0.7) +
geom_vline(aes(xintercept = delta_observed), color = "red", linetype = "dashed") +
labs(title = "Histogram of Permuted Δ Values",
x = "Delta",
y = "Frequency") +
annotate("text", x = delta_observed, y = max(table(cut(delta_star, breaks = 50))),
label = paste0("Observed Δ = ", round(delta_observed, 3)),
color = "red", hjust = 1)
p_value_one_sided <- (1 + sum(delta_star >= delta_observed)) / (1 + B)
p_value_two_sided <- (1 + sum(abs(delta_star) >= abs(delta_observed))) / (1 + B)
p_value_one_sided
## [1] 0.0373
p_value_two_sided
## [1] 0.0712
The one-sided p-value and two-sided p-value are shown above. In this problem, in order to enumerate all permutations, there will be \(\binom{24}{12}\) iterations, which is equal to \(\boxed{2,704,156}\).
(c) What \(p\)-value would you get if you just did a
regular \(2 \times 2\) chi-squared test
for independence?
Solution:
contingency_table <- table(data$Antiseptic, data$Outcome)
chisq.test(contingency_table)
## Warning in chisq.test(contingency_table): Chi-squared approximation may be
## incorrect
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: contingency_table
## X-squared = 3.2269, df = 1, p-value = 0.07244
The \(2 \times 2\) chi-squared test gives a p-value of \(\boxed{0.07244}\)
(d) This data is part of an important story and a medical success, but it is flawed from the point of view of A/B tests. What is the problem? Solution: The main issue is that this is an observational study rather than a randomized controlled experiment, which makes it flawed from the perspective of A/B testing. Due to the sequential nature of the data collection, the pre-antiseptic and post-antiseptic periods are not randomized; they are separated in time. As a result, there are several uncontrolled variables that could influence the outcome. For instance, the higher survival rate might not be solely attributed to the antiseptic but could also be due to other advancements in medical technology or changes in healthcare practices over time. Without randomization, it is impossible to isolate the effect of the antiseptic, making causal conclusions unreliable.
Picking the sample size.
Solution: \[
\text{Var}(\hat{p}_2 - \hat{p}_1) = \text{Var}\left(\frac{\sum
X_{i2}}{n} - \frac{\sum X_{i1}}{n}\right)
= \text{Var}\left(\frac{\sum X_{i2}}{n}\right) +
\text{Var}\left(\frac{\sum X_{i1}}{n}\right)
\]
\[ = \frac{n p_2 (1 - p_2)}{n^2} + \frac{n p_1 (1 - p_1)}{n^2} = \frac{p_2 (1 - p_2)}{n} + \frac{p_1 (1 - p_1)}{n}. \]
\[ \text{Var}(\hat{p}_2 - \hat{p}_1) \leq \frac{2(0.02)(1 - 0.02)}{n} \]
Therefore, if \(n\) is sufficently large,
\[ \text{Var}(\hat{p}_2 - \hat{p}_1) \leq \frac{2(0.02)(1 - 0.02)}{n} \leq \left(\frac{0.0001}{2}\right)^2. \]
\[ n \geq \frac{2(0.02)(1 - 0.02)}{\left(\frac{0.0001}{2}\right)^2}. \]
\[ \boxed{n \geq 15,680,000}. \]
simulate_biased_coin <- function(n, eta, reps) {
results <- numeric(reps)
for (rep in 1:reps) {
W <- numeric(n)
W[1] <- sample(c(0, 1), size = 1)
for (i in 2:n) {
treated <- sum(W[1:(i - 1)])
control <- (i - 1) - treated
if (treated > control) {
prob <- 0.5 - eta
} else if (treated < control) {
prob <- 0.5 + eta
} else {
prob <- 0.5
}
W[i] <- rbinom(1, size = 1, prob = prob)
}
results[rep] <- sum(W)
}
return(results)
}
n <- 30
reps <- 1000
etas <- c(0, 1/6, 2/5)
variances <- numeric(length(etas))
for (i in seq_along(etas)) {
T_values <- simulate_biased_coin(n = n, eta = etas[i], reps = reps)
variances[i] <- var(T_values)
}
results <- data.frame(Eta = etas, Variance = variances)
print(results)
## Eta Variance
## 1 0.0000000 7.3317708
## 2 0.1666667 1.1349099
## 3 0.4000000 0.1200841
The variance for \(\eta = 0\), \(\eta = \frac{1}{6}\), and \(\eta = \frac{2}{5}\) are shown above. For \(\eta = 0\), the true variance is given by \(np(1-p) = \frac{30}{4} = 7.5\), which is close to the empirical variance of 7.33.
(b)
n <- 30
reps <- 1000
etas <- c(0, 1/6, 2/5)
simulation_results <- data.frame()
for (eta in etas) {
T_values <- simulate_biased_coin(n = n, eta = eta, reps = reps)
V <- 1 / T_values + 1 / (n - T_values)
T_star <- n / 2
V_star <- 1 / T_star + 1 / (n - T_star)
V_ratio <- V / V_star
simulation_results <- rbind(simulation_results,
data.frame(Eta = eta, T = T_values, V = V, V_ratio = V_ratio))
}
ggplot(simulation_results, aes(x = V_ratio, fill = as.factor(Eta))) +
geom_density(alpha = 0.5) +
xlim(0.97, 1.03) +
labs(title = "Distribution of V/V* for Different Eta Values",
x = "V/V*",
y = "Density",
fill = "Eta") +
theme_minimal()
## Warning: Removed 373 rows containing non-finite outside the scale range
## (`stat_density()`).
Above we see distribution of V/V* for different \(\eta\) values. We see that for high value of \(\eta\), V/V* is highly concentrated around 1, indicating that there is often a perfectly even split between treatment and control groups. When \(\eta = 0\), then V/V* is no longer spiked at 1 but is multimodal with spies at lower values, meaning there is a significance chance of uneven split.
(c) We repeat a) and b) for \(n = 100\).
n <- 100
reps <- 1000
etas <- c(0, 1/6, 2/5)
variances <- numeric(length(etas))
for (i in seq_along(etas)) {
T_values <- simulate_biased_coin(n = n, eta = etas[i], reps = reps)
variances[i] <- var(T_values)
}
results <- data.frame(Eta = etas, Variance = variances)
print(results)
## Eta Variance
## 1 0.0000000 24.6661772
## 2 0.1666667 1.1111752
## 3 0.4000000 0.1181181
Variance decreases rapidly as \(eta\) increases. For \(\eta = 0\), the empirical variance is 24.66, which is close to the true variance of \(np(1-p) = \frac{100}{4} = 25\). The variance for \(\eta = \frac{1}{6}\) and \(\eta = \frac{2}{5}\) for \(n = 100\) is less than the corresponding variance for \(n = 30\).
n <- 100
reps <- 1000
etas <- c(0, 1/6, 2/5)
simulation_results <- data.frame()
for (eta in etas) {
T_values <- simulate_biased_coin(n = n, eta = eta, reps = reps)
V <- 1 / T_values + 1 / (n - T_values)
T_star <- n / 2
V_star <- 1 / T_star + 1 / (n - T_star)
V_ratio <- V / V_star
simulation_results <- rbind(simulation_results,
data.frame(Eta = eta, T = T_values, V = V, V_ratio = V_ratio))
}
ggplot(simulation_results, aes(x = V_ratio, fill = as.factor(Eta))) +
geom_density(alpha = 0.5) +
xlim(0.99, 1.01) +
labs(title = "Distribution of V/V* for Different Eta Values",
x = "V/V*",
y = "Density",
fill = "Eta") +
theme_minimal()
## Warning: Removed 354 rows containing non-finite outside the scale range
## (`stat_density()`).
When \(n = 100\), the ratios are in general much closer to 1 compared to when \(n = 30\) due to law of large numbers. For nonzero \(eta\), the ratio is within (0.995, 1.005) with very high probability. For \(eta = 0\), the distribution is multimodal with peaks at 1, 1.003 and 1.007, but generally for all values of \(eta\) the distribution is highly concentrated around 1.