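library(readxl)  # read_excel()
library(dplyr)   # %>%, select(), filter()
library(boot)    # boot(), boot.ci()
setwd("D:/Mysoftware/DefaultWD-R/dataset/")
data_county_Death <- read_excel("ALL_Counties.xlsx") %>%
  select(Population, Premature_Death) %>%
  filter(!is.na(Population), !is.na(Premature_Death))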
X <- data_county_Death$Premature_Death
Y <- data_county_Death$Population
qqnorm(X, main = "Q-Q Plot of Premature_Death")
qqline(X, col = "red", lwd = 2)
qqnorm(Y, main = "Q-Q Plot of Population")
qqline(Y, col = "red", lwd = 2)
A: From the plots, we can see that Premature_Death roughly follows a
normal distribution, while Population does not.
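A numeric summary can complement the visual reading of the Q-Q plots. The sketch below uses simulated stand-ins rather than the county data (the vectors `sym` and `skw` and the helper `skewness` are hypothetical names introduced only for illustration) and computes the sample skewness, which is near 0 for a symmetric sample and clearly positive for a right-skewed one.

```r
# Simulated stand-ins (hypothetical; NOT the county data)
set.seed(1)
sym <- rnorm(3000, mean = 10000, sd = 3000)     # roughly symmetric
skw <- rlnorm(3000, meanlog = 10, sdlog = 1.5)  # strongly right-skewed

# Sample skewness: third central moment scaled by sd^3
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

cat("skewness (symmetric sample):   ", skewness(sym), "\n")
cat("skewness (right-skewed sample):", skewness(skw), "\n")
```

Applied to the real `X` and `Y`, a value near 0 for `X` and a large positive value for `Y` would agree with the conclusion drawn from the plots.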
mean_X <- mean(X)
sd_X <- sd(X)
n_X <- length(X)
mean_Y <- mean(Y)
sd_Y <- sd(Y)
n_Y <- length(Y)
t_crit_X <- qt(0.975, df = n_X - 1)
t_crit_Y <- qt(0.975, df = n_Y - 1)
ci_X <- c(mean_X - t_crit_X * (sd_X / sqrt(n_X)), mean_X + t_crit_X * (sd_X / sqrt(n_X)))
ci_Y <- c(mean_Y - t_crit_Y * (sd_Y / sqrt(n_Y)), mean_Y + t_crit_Y * (sd_Y / sqrt(n_Y)))
cat("95% CI for mean of Premature Death:", ci_X, "\n")
## 95% CI for mean of Premature Death: 9828.865 10080.7
cat("95% CI for mean of Population:", ci_Y, "\n")
## 95% CI for mean of Population: 102629.3 127856.2
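The manual interval above is the usual one-sample t interval, mean ± t(0.975, n − 1) · sd / sqrt(n). As a sanity check, the sketch below compares that formula with the interval reported by `t.test()` from base R. It runs on a simulated sample `x` (a hypothetical stand-in, not the county data), so the numbers it prints are unrelated to the results above.

```r
set.seed(42)
x <- rnorm(200, mean = 50, sd = 10)  # hypothetical sample

# Manual 95% t interval, same formula as used for ci_X and ci_Y
n <- length(x)
se <- sd(x) / sqrt(n)
t_crit <- qt(0.975, df = n - 1)
ci_manual <- c(mean(x) - t_crit * se, mean(x) + t_crit * se)

# Interval computed by t.test(); as.numeric() drops the conf.level attribute
ci_ttest <- as.numeric(t.test(x, conf.level = 0.95)$conf.int)

print(ci_manual)
print(ci_ttest)
print(all.equal(ci_manual, ci_ttest))
```

On the real data, `t.test(X)$conf.int` and `t.test(Y)$conf.int` should agree with `ci_X` and `ci_Y` in the same way.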
r_xy <- cor(X, Y)
cat("Correlation:", r_xy, "\n")
## Correlation: -0.1852206
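# Statistic for boot(): mean of the resampled vector
boot_mean <- function(data, indices) {
  sample_data <- data[indices]
  return(mean(sample_data))
}
# Statistic for boot(): correlation of the resampled rows (columns 1 and 2)
boot_corr <- function(data, indices) {
  sample_data <- data[indices, ]
  return(cor(sample_data[, 1], sample_data[, 2]))
}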
B_values <- c(500, 1000, 5000)
results <- list()
for (B in B_values) {
  cat("\nBootstrap Iterations:", B, "\n")
  # Bootstrap the two means separately
  boot_X <- boot(X, statistic = boot_mean, R = B)
  boot_Y <- boot(Y, statistic = boot_mean, R = B)
  # Elements 4:5 of $percent are the lower and upper percentile limits
  ci_X_boot <- boot.ci(boot_X, type = "perc")$percent[4:5]
  ci_Y_boot <- boot.ci(boot_Y, type = "perc")$percent[4:5]
  # Bootstrap the correlation by resampling (X, Y) pairs row-wise
  data_matrix <- cbind(X, Y)
  boot_r <- boot(data = data_matrix, statistic = boot_corr, R = B)
  ci_r_boot <- boot.ci(boot_r, type = "perc")$percent[4:5]
  cat("95% CI for mean of Premature Death (Bootstrap B =", B, "):", ci_X_boot, "\n")
  cat("95% CI for mean of Population (Bootstrap B =", B, "):", ci_Y_boot, "\n")
  cat("95% CI for correlation between X and Y (Bootstrap B =", B, "):", ci_r_boot, "\n")
}
##
## Bootstrap Iterations: 500
## 95% CI for mean of Premature Death (Bootstrap B = 500 ): 9830.434 10079.88
## 95% CI for mean of Population (Bootstrap B = 500 ): 102654.2 128307.6
## 95% CI for correlation between X and Y (Bootstrap B = 500 ): -0.2269447 -0.1526086
##
## Bootstrap Iterations: 1000
## 95% CI for mean of Premature Death (Bootstrap B = 1000 ): 9827.936 10080.07
## 95% CI for mean of Population (Bootstrap B = 1000 ): 103315.2 129397.7
## 95% CI for correlation between X and Y (Bootstrap B = 1000 ): -0.2287243 -0.1574099
##
## Bootstrap Iterations: 5000
## 95% CI for mean of Premature Death (Bootstrap B = 5000 ): 9832.445 10081.91
## 95% CI for mean of Population (Bootstrap B = 5000 ): 103040.4 128808.8
## 95% CI for correlation between X and Y (Bootstrap B = 5000 ): -0.2291178 -0.1572198
A: When B is smaller, fewer bootstrap resamples are drawn, so the resulting interval is less accurate.
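To make explicit what the percentile bootstrap does, here is a minimal hand-written sketch that does not use the `boot` package. It runs on simulated data (`x` and `y` are hypothetical stand-ins, not the county data) and takes plain `quantile()` cut-offs of the resampled statistics. `boot.ci(type = "perc")` picks its limits with a slightly different interpolation rule, so the two approaches should agree closely but not necessarily to every digit.

```r
set.seed(123)
n <- 500
x <- rnorm(n, mean = 10000, sd = 3000)      # hypothetical X
y <- 50000 - 2 * x + rnorm(n, sd = 20000)   # hypothetical Y, negatively related to x

B <- 1000
boot_means <- numeric(B)
boot_cors <- numeric(B)
for (b in seq_len(B)) {
  # Draw n row indices with replacement
  idx <- sample.int(n, size = n, replace = TRUE)
  boot_means[b] <- mean(x[idx])
  # Use the same indices for x and y so that pairs stay together
  boot_cors[b] <- cor(x[idx], y[idx])
}

# Percentile interval: 2.5% and 97.5% quantiles of the resampled statistics
ci_mean <- unname(quantile(boot_means, c(0.025, 0.975)))
ci_cor <- unname(quantile(boot_cors, c(0.025, 0.975)))
print(ci_mean)
print(ci_cor)
```

Resampling row indices rather than `x` and `y` independently is what preserves the dependence between the two variables; this is the role of `data[indices, ]` in `boot_corr`.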
4. Comment on your results by comparing estimates from steps 2 and 3. What is the effect of B? Do you think the normality of X and Y also affects the accuracy of the bootstrapping method?
A: From the outputs of steps 2 and 3, the CI for the mean of X is roughly the same under both methods and changes little as B increases. For Y, the bootstrap intervals are slightly wider than the t-based interval (widths of about 25,700 to 26,100 versus about 25,200) and shift a little from one value of B to another, with no clear trend in B. Since X is closer to normal than Y, I think the normality of X and Y also affects the accuracy of the bootstrap method.
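One way to look at the effect of B in isolation is to repeat the bootstrap several times for each B and see how much an interval endpoint moves between repeats. The sketch below does this on a simulated right-skewed sample (`y` is a hypothetical stand-in, not the Population column; `perc_upper` and the choice of 20 repeats are illustrative assumptions). The spread of the upper endpoint would be expected to shrink as B grows, since it reflects Monte Carlo error only.

```r
set.seed(2024)
y <- rlnorm(300, meanlog = 10, sdlog = 1.5)  # hypothetical right-skewed sample

# Upper limit of a 95% percentile bootstrap interval for the mean
perc_upper <- function(v, B) {
  means <- replicate(B, mean(sample(v, replace = TRUE)))
  unname(quantile(means, 0.975))
}

B_values <- c(500, 1000, 5000)
# For each B, repeat the whole bootstrap 20 times and record the endpoint spread
sds <- sapply(B_values, function(B) sd(replicate(20, perc_upper(y, B))))
names(sds) <- B_values

for (B in B_values) {
  cat("B =", B, " sd of upper endpoint over 20 repeats:", sds[[as.character(B)]], "\n")
}
```

Note that a larger B only reduces this resampling noise. It does not remove error that comes from the original sample itself, such as the skewness of Y.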