set.seed(1)
n_values <- c(5, 30, 100)
sd_values <- c(10, 50, 90)
status <- c("Known", "Unknown")
result <- data.frame()
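for (n in n_values) {
  # `sd` is the true sigma here; sd(x) below still calls the sd() function
  for (sd in sd_values) {
    for (s in status) {
      # generate data
      x <- rnorm(n, mean = 500, sd = sd)
      mean_x <- mean(x)
      if (s == "Known") {
        # Z distribution: sigma is known, so use the true sd
        z <- qnorm(0.975)
        se <- sd / sqrt(n)
      } else {
        # t distribution: sigma is estimated by the sample sd
        t_val <- qt(0.975, df = n - 1)
        se <- sd(x) / sqrt(n)
        z <- t_val
      }
      # 95% interval: mean +/- critical value * standard error
      lower <- mean_x - z * se
      upper <- mean_x + z * se
      width <- upper - lower
      result <- rbind(result,
                      data.frame(n = n, sd = sd, status = s, width = width))
    }
  }
}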
result
##      n sd  status      width
## 1    5 10   Known  17.530451
## 2    5 10 Unknown  16.609347
## 3    5 50   Known  87.652254
## 4    5 50 Unknown  57.714740
## 5    5 90   Known 157.774057
## 6    5 90 Unknown 157.496139
## 7   30 10   Known   7.156777
## 8   30 10 Unknown   7.171987
## 9   30 50   Known  35.783883
## 10  30 50 Unknown  34.464104
## 11  30 90   Known  64.410989
## 12  30 90 Unknown  72.913842
## 13 100 10   Known   3.919928
## 14 100 10 Unknown   3.857058
## 15 100 50   Known  19.599640
## 16 100 50 Unknown  19.293416
## 17 100 90   Known  35.279352
## 18 100 90 Unknown  38.461708
# Bar colors: blue when sigma is known, red when it is estimated
colors <- ifelse(result$status == "Known", "blue", "red")
# `result` has no `label` column, so build the bar labels from the sd values
bp <- barplot(result$width,
              names.arg = paste0("sd=", result$sd),
              las = 2,
              cex.names = 0.8,
              col = colors,
              ylab = "CI Width",
              main = "CI Width (Blue=Known, Red=Unknown)")
legend("topright",
       legend = c("Known", "Unknown"),
       fill = c("blue", "red"))
# Label each block of six bars with its sample size, below the bar labels
group_centers <- c(mean(bp[1:6]), mean(bp[7:12]), mean(bp[13:18]))
mtext(c("n = 5", "n = 30", "n = 100"),
      side = 1, line = 4, at = group_centers)
The width of a confidence interval is affected by three things in the test we just did.

The first is n, the sample size. The graph shows that larger samples make the interval narrower. This is because the standard error is the standard deviation divided by sqrt(n), so the sample mean becomes a more precise estimate of the population mean as n grows. With a known standard deviation of 10, for example, the width drops from 17.53 at n = 5 to 3.92 at n = 100.

The second factor is the standard deviation. A bigger standard deviation makes the interval wider, because more variability in the data makes the sample mean less precise, and the interval has to widen to keep the same 95% confidence of covering the population mean. When the standard deviation is known, the width is exactly proportional to it: 17.53, 87.65 and 157.77 for sd = 10, 50 and 90 at n = 5.

The third factor is related to the second one but looks at it from a different angle. Instead of the value of the standard deviation, we focus on whether it is known or has to be estimated from the sample. Here the result is less clear-cut. At n = 5 the interval with unknown standard deviation is narrower than the known one for all three sd values. At n = 30 and n = 100 the picture is mixed: the unknown interval is wider for sd = 90 at both sample sizes, but slightly narrower in three of the other four cases. The reason is that each row comes from a single random sample, and the sample standard deviation can fall below or above the true value. At n = 5 and sd = 50, for instance, the width of 57.71 corresponds to a sample standard deviation of only about 23. What does not depend on the sample is the critical value: the t value is always larger than the z value (about 2.78 versus 1.96 at n = 5), and the gap shrinks as n grows. So, averaged over many samples, the interval with unknown standard deviation is wider, especially for small n. Without knowing the standard deviation we have an extra source of uncertainty, and the t distribution widens the interval to keep the same confidence level.
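
The single-sample widths above are noisy for the "Unknown" case, so one way to check the third factor is to average the t-interval width over many repeated samples and compare it with the fixed z-interval width. The following is a minimal sketch, not part of the original analysis: the helper names `mean_t_width` and `z_width`, the choice of 2000 repetitions, and the seed are all assumptions made for illustration.

```r
set.seed(1)

# Hypothetical helper: average width of the 95% t-interval
# over `reps` independent samples of size n drawn from N(mu, sigma^2).
mean_t_width <- function(n, sigma, reps = 2000, mu = 500) {
  t_crit <- qt(0.975, df = n - 1)
  widths <- replicate(reps, {
    x <- rnorm(n, mean = mu, sd = sigma)
    2 * t_crit * sd(x) / sqrt(n)
  })
  mean(widths)
}

# Hypothetical helper: width of the 95% z-interval.
# It uses the true sigma, so it does not depend on the sample.
z_width <- function(n, sigma) 2 * qnorm(0.975) * sigma / sqrt(n)

# Same n and sd values as in the experiment above
grid <- expand.grid(n = c(5, 30, 100), sigma = c(10, 50, 90))
grid$known <- mapply(z_width, grid$n, grid$sigma)
grid$unknown_avg <- mapply(mean_t_width, grid$n, grid$sigma)

# Ratio above 1 means the unknown-sd interval is wider on average
grid$ratio <- grid$unknown_avg / grid$known
grid
```

If the reasoning above holds, the `ratio` column should sit above 1 in every row and move closer to 1 as n increases. The exact values depend on the seed and on the number of repetitions.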