1. Find a real-world (not simulated) data set with at least 1,000 samples and two continuous features X and Y. Assess the normality of X and Y.

library(readxl)  # read_excel()
library(dplyr)   # %>%, select(), filter()

setwd("D:/Mysoftware/DefaultWD-R/dataset/")
data_county_Death <- read_excel("ALL_Counties.xlsx") %>%
  select(Population, Premature_Death) %>%
  filter(!is.na(Population), !is.na(Premature_Death))

X <- data_county_Death$Premature_Death
Y <- data_county_Death$Population


qqnorm(X, main = "Q-Q Plot of Premature_Death")
qqline(X, col = "red", lwd = 2)

qqnorm(Y, main = "Q-Q Plot of Population")
qqline(Y, col = "red", lwd = 2)

A: From the plots, we can see that Premature_Death roughly follows a normal distribution, while Population does not.
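A Q-Q plot is a visual judgment, so a numeric summary of shape can back it up. The sketch below computes moment-based skewness and excess kurtosis in base R; `shape_stats` and the toy vectors are hypothetical names and values introduced here for illustration, not part of the original analysis.

```r
# Hypothetical helper: moment-based sample skewness and excess kurtosis (base R only)
shape_stats <- function(v) {
  centered <- v - mean(v)
  z <- centered / sqrt(mean(centered^2))  # standardize with the 1/n standard deviation
  c(skewness = mean(z^3), excess_kurtosis = mean(z^4) - 3)
}

# Illustration on made-up values (not the county data)
symmetric_toy <- c(-2, -1, 0, 1, 2)  # symmetric around its mean
skewed_toy <- c(1, 1, 2, 3, 20)      # one large value creates a long right tail
shape_stats(symmetric_toy)
shape_stats(skewed_toy)
```

Applied as `shape_stats(X)` and `shape_stats(Y)`, values near zero would be consistent with normality, while a large positive skewness would indicate a long right tail.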

2. Compute the theoretical 95% confidence interval estimate for the means of X, Y and their correlation.

mean_X <- mean(X)
sd_X <- sd(X)
n_X <- length(X)
mean_Y <- mean(Y)
sd_Y <- sd(Y)
n_Y <- length(Y)

# Two-sided 95% critical values from the t distribution
t_crit_X <- qt(0.975, df = n_X - 1)
t_crit_Y <- qt(0.975, df = n_Y - 1)

# t-based interval: mean +/- t_crit * standard error of the mean
ci_X <- c(mean_X - t_crit_X * (sd_X / sqrt(n_X)), mean_X + t_crit_X * (sd_X / sqrt(n_X)))
ci_Y <- c(mean_Y - t_crit_Y * (sd_Y / sqrt(n_Y)), mean_Y + t_crit_Y * (sd_Y / sqrt(n_Y)))


cat("95% CI for mean of Premature Death:", ci_X, "\n")
## 95% CI for mean of Premature Death: 9828.865 10080.7
cat("95% CI for mean of Population:", ci_Y, "\n")
## 95% CI for mean of Population: 102629.3 127856.2
r_xy <- cor(X, Y)
cat("correlation.:", r_xy, "\n")
## correlation.: -0.1852206
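
Step 2 also asks for a confidence interval for the correlation, while the code above reports only the point estimate. One common theoretical interval uses Fisher's z-transformation; the following is a minimal sketch of that approach, not necessarily the method the assignment intends. `fisher_ci` and the toy vectors are hypothetical names and values introduced here for illustration.

```r
# Hypothetical helper: CI for a Pearson correlation via Fisher's z-transformation
fisher_ci <- function(x, y, level = 0.95) {
  r <- cor(x, y)
  n <- length(x)
  z <- atanh(r)                         # Fisher z-transform of r
  se <- 1 / sqrt(n - 3)                 # approximate standard error of z
  z_crit <- qnorm(1 - (1 - level) / 2)  # two-sided normal critical value
  tanh(c(z - z_crit * se, z + z_crit * se))  # back-transform to the r scale
}

# Illustration on made-up values (not the county data)
x_toy <- c(1, 2, 3, 4, 5, 6, 7, 8)
y_toy <- c(2.1, 1.9, 3.5, 3.0, 5.2, 4.8, 6.9, 7.1)
fisher_ci(x_toy, y_toy)

# Cross-check: cor.test() builds its interval from the same transformation
cor.test(x_toy, y_toy)$conf.int
```

Applied to the county data, `fisher_ci(X, Y)` would give the theoretical interval for the correlation printed above, and `cor.test(X, Y)$conf.int` should agree with it.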

3. Use the bootstrap method to obtain 95% CI estimates for the means of X and Y and for their correlation, with B = 500, 1000, 5000.

library(boot)  # boot() and boot.ci()

# Statistic for boot(): mean of the resampled vector
boot_mean <- function(data, indices) {
  mean(data[indices])
}

# Statistic for boot(): correlation of the resampled (X, Y) row pairs
boot_corr <- function(data, indices) {
  sample_data <- data[indices, ]
  cor(sample_data[, 1], sample_data[, 2])
}

B_values <- c(500, 1000, 5000)

for (B in B_values) {
  cat("\nBootstrap Iterations:", B, "\n")
  
  boot_X <- boot(X, statistic = boot_mean, R = B)
  boot_Y <- boot(Y, statistic = boot_mean, R = B)
  
  ci_X_boot <- boot.ci(boot_X, type = "perc")$percent[4:5]
  ci_Y_boot <- boot.ci(boot_Y, type = "perc")$percent[4:5]

  # Resample whole rows so each (X, Y) pair stays together
  data_matrix <- cbind(X, Y)
  boot_r <- boot(data = data_matrix, statistic = boot_corr, R = B)
  ci_r_boot <- boot.ci(boot_r, type = "perc")$percent[4:5]

  cat("95% CI for mean of Premature Death (Bootstrap B =", B, "):", ci_X_boot, "\n")
  cat("95% CI for mean of Population (Bootstrap B =", B, "):", ci_Y_boot, "\n")
  cat("95% CI for correlation between X and Y (Bootstrap B =", B, "):", ci_r_boot, "\n")
}
## 
## Bootstrap Iterations: 500 
## 95% CI for mean of Premature Death (Bootstrap B = 500 ): 9830.434 10079.88 
## 95% CI for mean of Population (Bootstrap B = 500 ): 102654.2 128307.6 
## 95% CI for correlation between X and Y (Bootstrap B = 500 ): -0.2269447 -0.1526086 
## 
## Bootstrap Iterations: 1000 
## 95% CI for mean of Premature Death (Bootstrap B = 1000 ): 9827.936 10080.07 
## 95% CI for mean of Population (Bootstrap B = 1000 ): 103315.2 129397.7 
## 95% CI for correlation between X and Y (Bootstrap B = 1000 ): -0.2287243 -0.1574099 
## 
## Bootstrap Iterations: 5000 
## 95% CI for mean of Premature Death (Bootstrap B = 5000 ): 9832.445 10081.91 
## 95% CI for mean of Population (Bootstrap B = 5000 ): 103040.4 128808.8 
## 95% CI for correlation between X and Y (Bootstrap B = 5000 ): -0.2291178 -0.1572198

4. Comment on your results by comparing the estimates from steps 2 and 3. What is the effect of B? Do you think the normality of X and Y also affects the accuracy of the bootstrap method?

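A: Comparing the outputs of steps 2 and 3, the bootstrap CIs for the mean of X (Premature_Death) almost coincide with the t-based CI (9828.9 to 10080.7), and they barely change as B increases. For Y (Population), the bootstrap CIs are slightly wider than the t-based CI (widths of about 25,700, 26,100 and 25,800 for B = 500, 1000 and 5000, versus about 25,200) and sit slightly higher, but the width does not grow steadily with B. The differences between B values reflect Monte Carlo resampling noise: a larger B makes the percentile endpoints more stable from run to run, but it does not make the interval systematically narrower or wider. The correlation CIs are likewise similar across the three values of B. Since X is closer to normal than Y, and the bootstrap and t-based intervals agree better for X than for Y, I think the normality of X and Y does affect the bootstrap results, although with a sample of this size the effect is small.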
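
The comparison for Population can be checked with a few lines of arithmetic. The endpoints below are copied from the printed output of steps 2 and 3; the object names `ci_pop`, `widths_pop` and `mids_pop` are introduced here for illustration only.

```r
# Interval endpoints for the mean of Population, copied from the output above
ci_pop <- rbind(
  t_interval = c(102629.3, 127856.2),
  boot_500   = c(102654.2, 128307.6),
  boot_1000  = c(103315.2, 129397.7),
  boot_5000  = c(103040.4, 128808.8)
)

widths_pop <- ci_pop[, 2] - ci_pop[, 1]  # interval width
mids_pop <- rowMeans(ci_pop)             # interval midpoint

print(round(widths_pop, 1))
print(round(mids_pop, 1))
```

The widths come out as 25226.9, 25653.4, 26082.5 and 25768.4, so they move up and then down rather than increasing with B, while all three bootstrap midpoints lie above the t-interval midpoint.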

5. If your sample is much smaller (n < 100), what do you expect for the bootstrap method’s accuracy?

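A: With a much smaller sample, I expect the bootstrap to be less accurate. The bootstrap resamples from the observed data, so with n < 100 the empirical distribution is only a rough approximation of the population, especially in the tails, and the resulting CIs tend to be too narrow and to cover the true value less often than 95%. This is a limitation of the sample, not of B: B can still be set as large as we like, but more resamples cannot add information that the original sample lacks. The problem should be worse for a skewed variable such as Population than for a roughly normal one such as Premature_Death.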
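
One way to probe this expectation is a small coverage experiment. The sketch below is purely illustrative and uses simulated data, not the county data: it assumes a right-skewed lognormal population, and `perc_ci_mean`, `coverage`, the seed and all settings are hypothetical choices made here. It uses a hand-written percentile bootstrap in base R rather than the `boot` package.

```r
set.seed(1)  # arbitrary seed so the sketch is repeatable

# Percentile bootstrap CI for the mean, written by hand in base R
perc_ci_mean <- function(v, B = 500, level = 0.95) {
  boot_means <- replicate(B, mean(sample(v, replace = TRUE)))
  alpha <- 1 - level
  unname(quantile(boot_means, c(alpha / 2, 1 - alpha / 2)))
}

# Share of simulated samples whose interval contains the true mean.
# Assumed population: lognormal(meanlog = 0, sdlog = 1), whose mean is exp(0.5).
coverage <- function(n, n_sims = 200, B = 300) {
  true_mean <- exp(0.5)
  hits <- replicate(n_sims, {
    ci <- perc_ci_mean(rlnorm(n), B = B)
    ci[1] <= true_mean && true_mean <= ci[2]
  })
  mean(hits)
}

coverage(n = 30)    # small sample
coverage(n = 1000)  # sample size comparable to the assignment's minimum
```

Under these assumptions one would expect the n = 30 coverage to fall short of the nominal 95% by more than the n = 1000 coverage does, but the exact figures depend on the seed, the number of simulations and the assumed population.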