## [1] 0.5047766
## [1] 0.2492536
## [1] 0.4555882
# Sampling distribution of the sample mean for n = 10 (SRS without replacement)
nsim <- 10000
result1 <- double(nsim)
for (i in 1:nsim){ result1[i] <- mean(sample(pop1, 10, replace = FALSE)) }
hist(result1, breaks=50, main="Histogram of sample means (n=10)")
# Same simulation for n = 100
nsim <- 10000
result2 <- double(nsim)
for (i in 1:nsim){ result2[i] <- mean(sample(pop1, 100, replace = FALSE)) }
hist(result2, breaks=50, main="Histogram of sample means (n=100)")
# Density-scaled histogram with a fitted normal curve overlaid (n = 100)
hist(result2, breaks=50, prob=TRUE, main="Histogram of sample means (n=100)")
x <- seq(-4, 4, 0.01)
curve(dnorm(x, mean=mean(result2), sd=sd(result2)), add=TRUE, col="red", lwd=2)
# Density-scaled histogram with a fitted normal curve overlaid (n = 10)
hist(result1, breaks=50, prob=TRUE, main="Histogram of sample means (n=10)")
x <- seq(-4, 4, 0.01)
curve(dnorm(x, mean=mean(result1), sd=sd(result1)), add=TRUE, col="red", lwd=2)
## [1] 0.504719
## [1] 0.002423022
The population mean is 0.5048.
The theoretical variance of \(\bar{y}\) is \(n^{-1} (1-n/N) S^2\), which is equal to 0.0024676.
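The agreement between the theoretical variance and the simulated one can be checked numerically. The following is a minimal sketch, assuming a hypothetical population `pop_demo` drawn from an exponential distribution (it is not the `pop1` used above, so the numbers will differ); the names `pop_demo`, `theo_var`, and `sim_var` are illustrative only.

```r
set.seed(1)

# Hypothetical finite population (an assumption; not the document's pop1)
N <- 10000
pop_demo <- rexp(N)

n <- 100

# Theoretical variance of the sample mean under SRS without replacement:
# (1/n) * (1 - n/N) * S^2, where S^2 is the population variance
theo_var <- (1 / n) * (1 - n / N) * var(pop_demo)

# Monte Carlo variance of the sample mean over repeated samples
sim_means <- replicate(2000, mean(sample(pop_demo, n, replace = FALSE)))
sim_var <- var(sim_means)

c(theoretical = theo_var, simulated = sim_var)
```

The two values should be close, with the gap shrinking as the number of replications grows.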
We may use the normal approximation \[ \frac{ \bar{y} - \bar{Y}_U }{ \sqrt{ V( \bar{y} )} } \sim N(0,1) \] for sufficiently large sample sizes.
In practice, we use the variance estimator \(\hat{V} ( \bar{y})\) instead of \(V( \bar{y} )\) to get
\[\begin{equation} \frac{ \bar{y} - \bar{Y}_U }{ \sqrt{ \hat{V}( \bar{y} )} } \sim N(0,1) \label{eq1} \end{equation}\]
The quality of the approximation depends on \(n\) and on the population distribution of \(Y\).
What counts as “n is large enough for CLT” is less clear for finite populations.
We may use the normal approximation (i.e. CLT) to construct confidence intervals of \(\bar{Y}_U\).
For example, the \(100\times(1-\alpha)\)% C.I. for \(\bar{Y}_U\) is \[ CI_{1-\alpha} = \left(\bar{y} - z_{\alpha/2}\sqrt{\hat{V} ( \bar{y})}, \bar{y} + z_{\alpha/2} \sqrt{\hat{V} ( \bar{y})} \right) \] where \[ \hat{V} ( \bar{y}) = \frac{1}{n} \left( 1- \frac{n}{N}\right) s^2 \] and \(z_\alpha\) is the upper \(\alpha\) quantile of the \(N(0,1)\) distribution, satisfying \(P( Z \le z_\alpha)= 1- \alpha\). (Here, \(Z\sim N(0,1)\).)
n <- 100
sam1 <- sample(pop1, n, replace = FALSE)
m <- mean(sam1)
v <- (1/n)*(1-n/length(pop1))*var(sam1)
lci <- m - 1.96*sqrt(v)
uci <- m + 1.96*sqrt(v)
c(lci, uci)
## [1] 0.3884439 0.5919833
cifunction <- function(data, psize, conf.level = 0.95){
  # Upper (1 - conf.level)/2 quantile of N(0,1)
  z <- qnorm((1-conf.level)/2, lower.tail=FALSE)
  m <- mean(data); n <- length(data); v <- (1/n)*(1-n/psize)*var(data)
  c(m-z*sqrt(v), m+z*sqrt(v))
}
If the confidence interval contains the population mean, cover[i] is equal to one. Otherwise, it is zero.
## [1] 0.94
Thus, the estimated probability that the 95% CI covers the population mean is 0.94 in this simulation.
If the CLT holds, then \[ P \left\{\bar{Y}_U \in CI_{1-\alpha} \right\} \cong 1-\alpha \]
cover2 <- double(nsim)
for (i in 1:nsim) {
  ci <- cifunction(sample(pop1, 10, replace = FALSE), length(pop1))
  cover2[i] <- sum((mean(pop1) > ci[1]) & (mean(pop1) < ci[2]))
}
mean(cover2)
## [1] 0.8706
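The effect of sample size on coverage can be seen by running the same experiment for several values of \(n\). The following is a rough sketch under stated assumptions: the population `pop_demo` is a hypothetical skewed population generated here (not `pop1`), and `ci_srs` and `coverage_rate` are illustrative helper names that mirror the logic of `cifunction` and the loop above.

```r
set.seed(2024)

# Hypothetical skewed finite population (an assumption; not the document's pop1)
N <- 10000
pop_demo <- rexp(N)
pop_mean <- mean(pop_demo)

# Normal-approximation CI for the mean under SRS without replacement
ci_srs <- function(data, psize, conf.level = 0.95) {
  z <- qnorm((1 - conf.level) / 2, lower.tail = FALSE)
  n <- length(data)
  v <- (1 / n) * (1 - n / psize) * var(data)
  c(mean(data) - z * sqrt(v), mean(data) + z * sqrt(v))
}

# Fraction of simulated intervals that contain the population mean
coverage_rate <- function(n, nsim = 2000) {
  hits <- replicate(nsim, {
    ci <- ci_srs(sample(pop_demo, n, replace = FALSE), N)
    pop_mean > ci[1] && pop_mean < ci[2]
  })
  mean(hits)
}

cover_rates <- sapply(c(10, 100), coverage_rate)
names(cover_rates) <- c("n=10", "n=100")
cover_rates
```

For a skewed population such as this one, the coverage for the smaller sample size would be expected to fall further below the nominal 95%, which is the pattern the simulation above shows for `pop1`.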
Two parameters
Absolute expression (half-width of CI): estimate within \(e\) of true population parameter \[ P\left\{ \left| \hat{\theta} - \theta \right| \le e \right\} = 1- \alpha \]
Relative expression: estimate within \(100 e \%\) of \(\theta\) \[ P\left\{\frac{ \left| \hat{\theta} - \theta \right|}{\theta} \le e \right\} = 1- \alpha \]
The most common choice is to set \(e\) equal to the half-width of the CI: \[ e = z_{\alpha/2} \sqrt{\hat{V} ( \hat{\theta} )} \]
Example: sample mean under SRSWOR \[ e = z_{\alpha/2} \sqrt{ \frac{S^2}{n} \left( 1- \frac{n}{N} \right)} \Rightarrow n= \frac{z_{\alpha/2}^2 S^2 }{ e^2 + z_{\alpha/2}^2 S^2/N} \]
Note
A sample survey of retail pharmacies is to be conducted in Iowa with \(N=2,000\) pharmacies. The purpose of the survey is to estimate the retail price of 20 tablets of a commonly used vasodilator drug. An estimate is needed that is within 10% of the true value of the average retail price in Iowa.
A similar survey performed two years ago shows an average price of $ 7.00 for the 20 tablets with a standard deviation of $ 1.40.
If SRS is to be used, what is the minimum sample size to achieve this accuracy (with 95% confidence level)?
We wish to achieve \[ P\left\{\frac{ \left| \hat{\theta} - \theta \right|}{\theta} \le 0.1 \right\} = 0.95 \]
Using the CLT, it is known that \[ P \left\{\left| \hat{\theta} - \theta \right| \le 1.96 \sqrt{ \left( \frac{1}{n} - \frac{1}{N} \right) S^2 } \right\} = 0.95 \]
Thus, we have only to solve \[ 0.1 \cdot \theta \cong 1.96 \sqrt{ \left( \frac{1}{n} - \frac{1}{N} \right) S^2 } \] for \(n\).
Using the previous survey, we have \(\theta=7\), \(S = 1.4\). Also, \(N=2,000\). Thus, \[ 0.1 \times 7 \cong 1.96 \sqrt{ \left( \frac{1}{n} - \frac{1}{2000} \right) 1.4^2 } \]
After some simple algebra, \[ \frac{1}{n} =\frac{1}{2000} + \left( \frac{ 0.1 \times 7}{ 1.96 \times 1.4} \right)^2 \] which leads to \(n \cong 15.25\).
Thus, the minimum sample size is \(n=16\).
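The calculation can be reproduced in R. The following is a minimal sketch, assuming a hypothetical helper `srs_sample_size` (not part of any package) that implements the closed-form expression \(n = z_{\alpha/2}^2 S^2 / (e^2 + z_{\alpha/2}^2 S^2/N)\) for the sample mean under SRSWOR, with the absolute margin taken as \(e = 0.1 \times 7\).

```r
# Hypothetical helper: required n for margin of error e under SRS without replacement
srs_sample_size <- function(e, S, N, conf.level = 0.95) {
  z <- qnorm((1 - conf.level) / 2, lower.tail = FALSE)
  z^2 * S^2 / (e^2 + z^2 * S^2 / N)
}

# Pharmacy example: relative margin 10% of a mean of 7, S = 1.4, N = 2000
n_raw <- srs_sample_size(e = 0.1 * 7, S = 1.4, N = 2000)
n_raw            # about 15.25
ceiling(n_raw)   # round up to the next whole unit: 16
```

Rounding up rather than to the nearest integer keeps the margin of error at or below the target.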