In this report,the author will perform a simulation to calculate the coverage probability of the 95% confidence interval of the median.
Coverage probability is an important operating characteristic of methods for constructing interval estimates, particularly confidence intervals. Definition: For the purposes of this deliverable, define the 95% confidence interval of the median to be the middle 95% of sampling distribution of the median. Similarly, the 95% confidence interval of the mean, standard deviation, etc. is the middle 95% of the respective sampling distribution. Definition: For the purposes of this deliverable, define the coverage probability as the long run proportion of intervals that capture the population parameter of interest. Conceptualy, one can calculate the coverage probability with the following steps 1. generate a sample of size N from a known distribution 2. construct a confidence interval 3. determine if the confidence captures the population parameter 4. Repeat steps (1) - (3) many times. Estimate the coverage probability as the proportion of samples for which the confidence interval captured the population parameter. The figure below shows the 95% confidence interval calculated for a handful of samples. Intervals in blue capture the population parameter of interest; intervals in red do not.
The simulation contains 6 steps: ## Step1: Generate a single sample from a standard normal distribution of size N = 201. Explain to the reader how you use MLE to estimate the distribution. ## Step2: Show the reader how you approximate the sampling distribution of the median, conditional on the estimate of the distribution in the previous step. ## Step3: Describe how you calculate a 95% confidence interval from the approximated sampling distribution. ##Step4: Explain the concept of coverage probability. Explain your code for calculating the coverage probability. ##Step5: Perform the simulation and report the results. ##Step6: Describe how you might change the simulation to learn more about the operating characteristics of your chosen method for constructing the 95% confidence interval.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(stats4)
N=201
sample_N=rnorm(201)
# uss MLE to estimate F_x
mle_mean=mean(sample_N)
mle_sd=sqrt((N-1)/N )* var(sample_N)
hist(sample_N,freq=FALSE)
curve(dnorm(x,mean=mle_mean, sd=mle_sd), add=TRUE)
curve(dnorm(x, mean = mle_mean, sd = sqrt(mle_sd)), add = TRUE)
# Step 2
R <- 1000 # times of simulation
med_samp <- rep(NA, R) # empty list to store new data
for (i in 1:R) {
sim_sample = sample(sample_N, N, replace =TRUE)
med_samp[i]=median(sim_sample)
}
confidence_95=quantile(med_samp,prob=c(0.025,0.975))
confidence_95
## 2.5% 97.5%
## -0.1879115 0.2497114
Coverage probability is an important operating characteristic of methods for constructing interval estimates, particularly confidence intervals. It is an interval surrounding the unknown parameter depends on the unknown parameter value and is different from confidence level.
out <- data.frame(lower_bound = rep(NA, R),
upper_bound = rep(NA, R))
for (j in 1:R){
sample <- rnorm(201)
mle_mean <- mean(sample)
mle_sd <- sqrt((201-1)/201 * var(sample))
for (i in 1:R){
med_samp[i] <- rnorm(201, mean = mle_mean, sd = mle_sd) %>% median
}
bound <- quantile(med_samp, probs = c(0.025, 0.975))
out$lower_bound[j] <- bound[1]
out$upper_bound[j] <- bound[2]
}
for (i in 1:nrow(out)){
if (out$lower_bound[i] < 0 && out$upper_bound[i] > 0){
out$cover[i] <- 1
}
else {
out$cover[i] <- 0
} }
#confidence intervals capture population median.
mean(out$cover)
## [1] 0.977
#bootstrap
R <- 1000 # times of simulation
med_samp <- rep(NA, R) # empty list to store new data
for (i in 1:R) {
sim_sample = sample(sample_N, N, replace =TRUE)
med_samp[i]=median(sim_sample)
}
out <- data.frame(lower_bound = rep(NA, R),
upper_bound = rep(NA, R))
for (j in 1:R){
for (i in 1:R){
sample <- rnorm(201)
mle_mean <- mean(sample)
mle_sd <- sqrt((201-1)/201 * var(sample))
med_samp[i] <- rnorm(201, mean = mle_mean, sd = mle_sd) %>% median
}
bound <- quantile(med_samp, probs = c(0.025, 0.975))
out$lower_bound[j] <- bound[1]
out$upper_bound[j] <- bound[2]
}
for (i in 1:nrow(out)){
if (out$lower_bound[i] < 0 && out$upper_bound[i] > 0){
out$cover[i] <- 1
}
else {
out$cover[i] <- 0
}
}
mean(out$cover)
## [1] 1
Using common method could get a good result, but when we use the bootstrap instead, the result would get more accurate. The bootstrap is a nonparametric technique for estimating standard errors and approximate confidence intervals. Using bootstrap, the result of confidence intervals capture 100% population median.