1. Permutation Analysis.

    (a) Find the survival rates for lower limb amputations in each of the post-antiseptic and pre-antiseptic periods and report \(\hat{\Delta}\), the post-rate minus the pre-rate.
    Solution:

    data <- read.table("lister.dat", header = FALSE, col.names = c("Subject_ID", "Year", "Antiseptic", "Limb", "Outcome"))
    data <- subset(data, Limb == 1)
    
    pre_rate <- mean(data$Outcome[data$Antiseptic == 0])
    post_rate <- mean(data$Outcome[data$Antiseptic == 1])
    
    delta_observed <- post_rate - pre_rate
    
    delta_observed
    ## [1] 0.4166667

    The answer is \(\boxed{\frac{5}{12}}\).

    (b) Retain the survival values (0 or 1) but randomly permute the values of the pre versus post variable, keeping the total number of pre and post observations exactly the same as before. Do this \(B = 9,999\) times and make a histogram of the resulting \(\hat{\Delta}^*\) values. Mark that histogram with the original \(\hat{\Delta}\).

    The one-sided \(p\)-value is given by: \[ \frac{1 + \sum_{b=1}^B \mathbf{1}(\hat{\Delta}^*_b \geq \hat{\Delta})}{1 + B} \]

    The two-sided \(p\)-value is given by: \[ \frac{1 + \sum_{b=1}^B \mathbf{1}(|\hat{\Delta}^*_b| \geq |\hat{\Delta}|)}{1 + B} \]

    The advantage of adding one to the numerator and denominator is that it keeps out impossible \(p\)-values like 0 and 1. You do not do that in a case where you can enumerate all the permutations.

    Question: How many permutations would that be in this case?
    [Hint: In R, the sample function is useful.]
    Solution:

    data <- read.table("lister.dat", header = FALSE, col.names = c("Subject_ID", "Year", "Antiseptic", "Limb", "Outcome"))
    
    data <- subset(data, Limb == 1)
    
    pre_rate <- mean(data$Outcome[data$Antiseptic == 0])
    post_rate <- mean(data$Outcome[data$Antiseptic == 1])
    delta_observed <- post_rate - pre_rate
    
    set.seed(123) 
    B <- 9999 
    delta_star <- numeric(B)
    
    for (b in 1:B) {
      permuted_antiseptic <- sample(data$Antiseptic)  
      pre_rate_perm <- mean(data$Outcome[permuted_antiseptic == 0])
      post_rate_perm <- mean(data$Outcome[permuted_antiseptic == 1])
      delta_star[b] <- post_rate_perm - pre_rate_perm
    }
    
    
    delta_star_df <- data.frame(delta_star = delta_star)
    
    ggplot(delta_star_df, aes(x = delta_star)) +
      geom_histogram(binwidth = 0.01, fill = "blue", alpha = 0.7) +
      geom_vline(aes(xintercept = delta_observed), color = "red", linetype = "dashed") +
      labs(title = "Histogram of Permuted Δ Values",
       x = "Delta",
       y = "Frequency") +
      annotate("text", x = delta_observed, y = max(table(cut(delta_star, breaks = 50))), 
           label = paste0("Observed Δ = ", round(delta_observed, 3)), 
           color = "red", hjust = 1)

    p_value_one_sided <- (1 + sum(delta_star >= delta_observed)) / (1 + B)
    p_value_two_sided <- (1 + sum(abs(delta_star) >= abs(delta_observed))) / (1 + B)    
    
    p_value_one_sided 
    ## [1] 0.0373
    p_value_two_sided
    ## [1] 0.0712

    The one-sided p-value and two-sided p-value are shown above. In this problem, in order to enumerate all permutations, there will be \(\binom{24}{12}\) iterations, which is equal to \(\boxed{2,704,156}\).

    (c) What \(p\)-value would you get if you just did a regular \(2 \times 2\) chi-squared test for independence?
    Solution:

    contingency_table <- table(data$Antiseptic, data$Outcome)
    chisq.test(contingency_table)
    ## Warning in chisq.test(contingency_table): Chi-squared approximation may be
    ## incorrect
    ## 
    ##  Pearson's Chi-squared test with Yates' continuity correction
    ## 
    ## data:  contingency_table
    ## X-squared = 3.2269, df = 1, p-value = 0.07244

    The \(2 \times 2\) chi-squared test gives a p-value of \(\boxed{0.07244}\)

    (d) This data is part of an important story and a medical success, but it is flawed from the point of view of A/B tests. What is the problem? Solution: The main issue is that this is an observational study rather than a randomized controlled experiment, which makes it flawed from the perspective of A/B testing. Due to the sequential nature of the data collection, the pre-antiseptic and post-antiseptic periods are not randomized; they are separated in time. As a result, there are several uncontrolled variables that could influence the outcome. For instance, the higher survival rate might not be solely attributed to the antiseptic but could also be due to other advancements in medical technology or changes in healthcare practices over time. Without randomization, it is impossible to isolate the effect of the antiseptic, making causal conclusions unreliable.

  2. Picking the sample size.
    Solution: \[ \text{Var}(\hat{p}_2 - \hat{p}_1) = \text{Var}\left(\frac{\sum X_{i2}}{n} - \frac{\sum X_{i1}}{n}\right) = \text{Var}\left(\frac{\sum X_{i2}}{n}\right) + \text{Var}\left(\frac{\sum X_{i1}}{n}\right) \]

\[ = \frac{n p_2 (1 - p_2)}{n^2} + \frac{n p_1 (1 - p_1)}{n^2} = \frac{p_2 (1 - p_2)}{n} + \frac{p_1 (1 - p_1)}{n}. \]

\[ \text{Var}(\hat{p}_2 - \hat{p}_1) \leq \frac{2(0.02)(1 - 0.02)}{n} \]

Therefore, if \(n\) is sufficently large,

\[ \text{Var}(\hat{p}_2 - \hat{p}_1) \leq \frac{2(0.02)(1 - 0.02)}{n} \leq \left(\frac{0.0001}{2}\right)^2. \]

\[ n \geq \frac{2(0.02)(1 - 0.02)}{\left(\frac{0.0001}{2}\right)^2}. \]

\[ \boxed{n \geq 15,680,000}. \]

  1. (a)
simulate_biased_coin <- function(n, eta, reps) {
  results <- numeric(reps)
  
  for (rep in 1:reps) {
    W <- numeric(n)
    W[1] <- sample(c(0, 1), size = 1)
    
    for (i in 2:n) {
      treated <- sum(W[1:(i - 1)])
      control <- (i - 1) - treated
      
      if (treated > control) {
        prob <- 0.5 - eta
      } else if (treated < control) {
        prob <- 0.5 + eta
      } else {
        prob <- 0.5
      }
      
      W[i] <- rbinom(1, size = 1, prob = prob)
    }
    
    results[rep] <- sum(W)
  }
  
  return(results)
}

n <- 30     
reps <- 1000  
etas <- c(0, 1/6, 2/5)  

variances <- numeric(length(etas))

for (i in seq_along(etas)) {
  T_values <- simulate_biased_coin(n = n, eta = etas[i], reps = reps)
  variances[i] <- var(T_values)
}

results <- data.frame(Eta = etas, Variance = variances)

print(results)
##         Eta  Variance
## 1 0.0000000 7.3317708
## 2 0.1666667 1.1349099
## 3 0.4000000 0.1200841

The variance for \(\eta = 0\), \(\eta = \frac{1}{6}\), and \(\eta = \frac{2}{5}\) are shown above. For \(\eta = 0\), the true variance is given by \(np(1-p) = \frac{30}{4} = 7.5\), which is close to the empirical variance of 7.33.

(b)

n <- 30      
reps <- 1000 
etas <- c(0, 1/6, 2/5)  

simulation_results <- data.frame()


for (eta in etas) {
  T_values <- simulate_biased_coin(n = n, eta = eta, reps = reps)
  V <- 1 / T_values + 1 / (n - T_values)
  
  T_star <- n / 2
  V_star <- 1 / T_star + 1 / (n - T_star)
  V_ratio <- V / V_star
  
  simulation_results <- rbind(simulation_results, 
                              data.frame(Eta = eta, T = T_values, V = V, V_ratio = V_ratio))
}

ggplot(simulation_results, aes(x = V_ratio, fill = as.factor(Eta))) +
  geom_density(alpha = 0.5) +
  xlim(0.97, 1.03) +  
  labs(title = "Distribution of V/V* for Different Eta Values",
       x = "V/V*",
       y = "Density",
       fill = "Eta") +
  theme_minimal()
## Warning: Removed 373 rows containing non-finite outside the scale range
## (`stat_density()`).

Above we see distribution of V/V* for different \(\eta\) values. We see that for high value of \(\eta\), V/V* is highly concentrated around 1, indicating that there is often a perfectly even split between treatment and control groups. When \(\eta = 0\), then V/V* is no longer spiked at 1 but is multimodal with spies at lower values, meaning there is a significance chance of uneven split.

(c) We repeat a) and b) for \(n = 100\).

n <- 100     
reps <- 1000  
etas <- c(0, 1/6, 2/5)  

variances <- numeric(length(etas))

for (i in seq_along(etas)) {
  T_values <- simulate_biased_coin(n = n, eta = etas[i], reps = reps)
  variances[i] <- var(T_values)
}

results <- data.frame(Eta = etas, Variance = variances)

print(results)
##         Eta   Variance
## 1 0.0000000 24.6661772
## 2 0.1666667  1.1111752
## 3 0.4000000  0.1181181

Variance decreases rapidly as \(eta\) increases. For \(\eta = 0\), the empirical variance is 24.66, which is close to the true variance of \(np(1-p) = \frac{100}{4} = 25\). The variance for \(\eta = \frac{1}{6}\) and \(\eta = \frac{2}{5}\) for \(n = 100\) is less than the corresponding variance for \(n = 30\).

n <- 100     
reps <- 1000 
etas <- c(0, 1/6, 2/5)  

simulation_results <- data.frame()


for (eta in etas) {
  T_values <- simulate_biased_coin(n = n, eta = eta, reps = reps)
  V <- 1 / T_values + 1 / (n - T_values)
  
  T_star <- n / 2
  V_star <- 1 / T_star + 1 / (n - T_star)
  V_ratio <- V / V_star
  
  simulation_results <- rbind(simulation_results, 
                              data.frame(Eta = eta, T = T_values, V = V, V_ratio = V_ratio))
}

ggplot(simulation_results, aes(x = V_ratio, fill = as.factor(Eta))) +
  geom_density(alpha = 0.5) +
  xlim(0.99, 1.01) +  
  labs(title = "Distribution of V/V* for Different Eta Values",
       x = "V/V*",
       y = "Density",
       fill = "Eta") +
  theme_minimal()
## Warning: Removed 354 rows containing non-finite outside the scale range
## (`stat_density()`).

When \(n = 100\), the ratios are in general much closer to 1 compared to when \(n = 30\) due to law of large numbers. For nonzero \(eta\), the ratio is within (0.995, 1.005) with very high probability. For \(eta = 0\), the distribution is multimodal with peaks at 1, 1.003 and 1.007, but generally for all values of \(eta\) the distribution is highly concentrated around 1.