Stat 421: Week 4

Jae-kwang Kim

2/4/2020

Review

In Week 3, we studied
- Simple random sampling (SRS) design
- Estimation under SRS (Population mean, population total, population proportions)
- Variance estimation under SRS
In Week 4, we will study
- Understanding the sampling distribution using simulation
- Large-sample inference under SRS
- Sample size determination

Simulation

SRS Simulation:

Step 1: Generate a finite population of size N from a model (Bernoulli, Uniform, Normal)
Step 2: Draw SRS of size n repeatedly
Step 3: Compute the mean and variance of the sample means across the simulated samples.
Step 4: Compute the coverage of the interval estimator.

Step 1: Generate a finite population

Suppose that we use a finite population of size $N=10,000$ generated from an exponential distribution with rate 2.

pop1 <- rexp(10000,2)

The population mean and the population variance of the data “pop1” are

mean(pop1); var(pop1)

## [1] 0.5047766

## [1] 0.2492536
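
As a reference point, `rexp(10000, 2)` draws from an exponential distribution with rate $\lambda = 2$, so the model mean and variance are
\[ E(Y) = \frac{1}{\lambda} = 0.5, \qquad Var(Y) = \frac{1}{\lambda^2} = 0.25 \]
The finite-population values above (0.5048 and 0.2493) are close to these but not equal, because “pop1” is a single realization of $N=10,000$ draws from the model.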

You can also plot a histogram of the population:

hist(pop1)

Step 2: Draw an SRS of size $n$ from the finite population

Draw an SRS of size $n=10$ from the “pop1” data.

n <- 10 
sam1 <- sample(pop1, n, replace = F)
mean(sam1)

## [1] 0.4555882

You can repeat this sampling independently $10,000$ times.

nsim <-10000
result1 <-double(nsim) 
for (i in 1:nsim){  result1[i] <- mean(sample(pop1, 10, replace = F))}
hist(result1, breaks=50, main="Histogram of sample means (n=10)")
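
The loop above can also be written with `replicate()`. The following is a minimal sketch of that alternative, not the code used for the histogram; the seed and the names `pop_demo` and `means_demo` are assumptions introduced only for illustration.

```r
# Illustrative alternative to the for loop (seed and object names are assumptions)
set.seed(421)
pop_demo <- rexp(10000, 2)   # finite population of size N = 10,000
nsim <- 10000

# replicate() evaluates the expression nsim times and collects the results
means_demo <- replicate(nsim, mean(sample(pop_demo, 10, replace = FALSE)))

length(means_demo)           # one sample mean per simulated sample
```

Fixing the seed makes the simulated values reproducible from run to run, which the unseeded code above is not.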

Let’s increase the sample size to $n=100$

nsim <-10000
result2 <-double(nsim) 
for (i in 1:nsim){  result2[i] <- mean(sample(pop1, 100, replace = F))}
hist(result2, breaks=50, main="Histogram of sample means (n=100)")

Let’s compare the histogram with the density of a normal distribution

hist(result2, breaks= 50, prob=TRUE,  main="Histogram of sample means (n=100)")
# curve() evaluates the expression in x over the current plot range, so no x grid is needed
curve(dnorm(x, mean=mean(result2), sd=sd(result2)), add=TRUE, col="red", lwd=2)

The normal approximation is not very good for samples with $n=10$

hist(result1, breaks= 50, prob=TRUE,  main="Histogram of sample means (n=10)")
# curve() evaluates the expression in x over the current plot range, so no x grid is needed
curve(dnorm(x, mean=mean(result1), sd=sd(result1)), add=TRUE, col="red", lwd=2)

Check

The mean and variance of $\bar{y}$ (with $n=100$) from simulation samples are as follows:

mean(result2); var(result2)

## [1] 0.504719

## [1] 0.002423022

The population mean is 0.5048.
The theoretical variance of $\bar{y}$ is $n^{-1} (1-n/N) S^2$, which is equal to 0.0024676.
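
The theoretical value can be reproduced directly from the formula. This is a small sketch that plugs in the population variance printed earlier; the name `v_theory` is an assumption.

```r
N  <- 10000       # population size
n  <- 100         # sample size
S2 <- 0.2492536   # population variance, var(pop1), as printed above

# V(ybar) = (1/n) * (1 - n/N) * S^2
v_theory <- (1 / n) * (1 - n / N) * S2
round(v_theory, 7)   # 0.0024676
```

The simulated variance 0.002423 differs from this value only through simulation error.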

Large-sample inference under SRS

Central Limit theorem under SRS

We may use the normal approximation \[ \frac{ \bar{y} - \bar{Y}_U }{ \sqrt{ V( \bar{y} )} } \sim N(0,1) \] (approximately) for sufficiently large sample sizes.
In practice, we use the variance estimator $\hat{V} ( \bar{y})$ instead of $V( \bar{y} )$ to get

\[ \frac{ \bar{y} - \bar{Y}_U }{ \sqrt{ \hat{V}( \bar{y} )} } \sim N(0,1) \]

When is the CLT justified?

Quality of the approximation depends on $n$ and the population distribution of $Y$
“$n$ is large enough for the CLT” is less clear for finite populations
- What does $n\rightarrow \infty$ mean for a finite population?
- The n = 30 rule from other statistics classes does NOT apply

Rules of thumb

If the distribution of Y is close to normal, n = 50 is usually enough
A larger n is needed if the distribution of Y deviates from normal, e.g., is skewed
Y categorical: if p is the proportion with the characteristic of interest, require $np \ge 5$ and $n(1-p) \ge 5$
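
The rule of thumb for a categorical Y is easy to check in code. The helper below is hypothetical (the name `clt_ok_for_proportion` is not from the course material) and only encodes the two inequalities above.

```r
# Hypothetical helper: TRUE when both n*p >= 5 and n*(1 - p) >= 5 hold
clt_ok_for_proportion <- function(n, p) {
  (n * p >= 5) && (n * (1 - p) >= 5)
}

clt_ok_for_proportion(100, 0.02)   # FALSE: n*p = 2
clt_ok_for_proportion(100, 0.10)   # TRUE:  n*p = 10 and n*(1 - p) = 90
```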

Interval estimation of $\bar{Y}_U$ under SRS

We may use the normal approximation (i.e., the CLT) to construct confidence intervals for $\bar{Y}_U$.
For example, the $100\times(1-\alpha)$% C.I. for $\bar{Y}_U$ is \[ CI_{1-\alpha} = \left(\bar{y} - z_{\alpha/2}\sqrt{\hat{V} ( \bar{y})}, \bar{y} + z_{\alpha/2} \sqrt{\hat{V} ( \bar{y})} \right) \] where \[ \hat{V} ( \bar{y}) = \frac{1}{n} \left( 1- \frac{n}{N}\right) s^2 \] and $z_\alpha$ is the upper $\alpha$ quantile of the $N(0,1)$ distribution, satisfying $P( Z \le z_\alpha)= 1- \alpha$ for $Z\sim N(0,1)$.
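
In R, the quantile $z_{\alpha/2}$ can be obtained from `qnorm()`. A short sketch for $\alpha = 0.05$ (the variable names are assumptions):

```r
alpha <- 0.05

# Upper alpha/2 quantile: P(Z <= z) = 1 - alpha/2
z <- qnorm(1 - alpha / 2)
round(z, 2)   # 1.96

# Equivalent call using the upper tail directly
z_upper <- qnorm(alpha / 2, lower.tail = FALSE)
```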

Interpreting CIs in general

More generally (for any design), a $100 \left( 1- \alpha \right) \%$ CI has the interpretation: there is a $100 \left( 1- \alpha \right) \%$ chance of selecting a sample for which the CI will include the true population parameter.
Note
- The upper and lower limits of the CI are random variables, calculated from the sample data
- The true parameter value is either included or not included in a single CI
- Coverage probability ($=1-\alpha$) of a CI has a relative frequency interpretation across samples

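Computing a CI from an SRS

n <- 100
sam1 <- sample(pop1, n, replace = FALSE)
m <- mean(sam1)                                 # sample mean
v <- (1/n) * (1 - n/length(pop1)) * var(sam1)   # estimated variance of the sample mean
lci <- m - 1.96 * sqrt(v)                       # lower 95% limit
uci <- m + 1.96 * sqrt(v)                       # upper 95% limit
c(lci, uci)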

## [1] 0.3884439 0.5919833

The 95% CI for the population mean is (0.3884, 0.5920), which covers the population mean of 0.5048.

Use simulation to compute the coverage probability of 95% CI.

R function for computing CI for the population mean:

cifunction <- function(data, psize, conf.level = 0.95){
  z <- qnorm((1 - conf.level)/2, lower.tail = FALSE)   # upper alpha/2 normal quantile
  m <- mean(data); n <- length(data)
  v <- (1/n) * (1 - n/psize) * var(data)               # variance estimate with the factor (1 - n/N)
  c(m - z*sqrt(v), m + z*sqrt(v))
}

Now, we are interested in checking whether the 95% CI covers the true value with 0.95 probability.

nsim <- 10000
cover <- double(nsim)
for (i in 1:nsim) {
  ci <- cifunction(sample(pop1, 100, replace = FALSE), length(pop1))
  cover[i] <- as.numeric((mean(pop1) > ci[1]) & (mean(pop1) < ci[2]))
}

If the 95% CI computed from the $i$-th simulation sample covers the population mean, then cover[i] is equal to one. Otherwise, it is zero.

mean(cover)

## [1] 0.94

Thus, the estimated probability that the 95% CI covers the population mean is 0.94 in this simulation.
If the CLT holds, then \[ P \left\{\bar{Y}_U \in CI_{1-\alpha} \right\} \cong 1-\alpha \]
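
To judge how far an estimated coverage can drift from the nominal level through simulation noise alone, one can compute a Monte Carlo standard error. The sketch below assumes the simulated samples are independent and treats each “covered / not covered” outcome as a Bernoulli trial; the variable names are assumptions.

```r
nsim     <- 10000
nominal  <- 0.95
observed <- 0.94   # mean(cover) reported above

# Standard error of a proportion estimated from nsim independent trials
mc_se <- sqrt(nominal * (1 - nominal) / nsim)
round(mc_se, 4)                 # 0.0022

# Gap between nominal and observed coverage, in Monte Carlo standard errors
(nominal - observed) / mc_se
```

Under these assumptions the observed 0.94 lies more than four standard errors below 0.95, which suggests the shortfall is not pure simulation noise but reflects the skewness of the population at $n=100$.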

Coverage probability of 95% CI for SRS of size $n=10$

cover2 <- double(nsim)
for (i in 1:nsim) {
  ci <- cifunction(sample(pop1, 10, replace = FALSE), length(pop1))
  cover2[i] <- as.numeric((mean(pop1) > ci[1]) & (mean(pop1) < ci[2]))
}
mean(cover2)

## [1] 0.8706

Thus, the estimated coverage probability under $n=10$ is 0.8706 in this simulation.
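
The two coverage checks can be folded into one function and run over several sample sizes. This is a self-contained sketch under stated assumptions: the seed, the reduced `nsim = 2000`, the chosen sample sizes, and the names `pop_demo`, `ci_demo`, and `coverage_demo` are all introduced here for illustration, so the numbers will not match the output above exactly.

```r
set.seed(421)                  # assumed seed, for reproducibility only
pop_demo <- rexp(10000, 2)     # population generated the same way as pop1

# 95% CI for the population mean under SRS without replacement
ci_demo <- function(data, psize, conf.level = 0.95) {
  z <- qnorm((1 - conf.level) / 2, lower.tail = FALSE)
  n <- length(data)
  v <- (1 / n) * (1 - n / psize) * var(data)
  mean(data) + c(-1, 1) * z * sqrt(v)
}

# Fraction of nsim simulated CIs that contain the population mean
coverage_demo <- function(n, nsim = 2000) {
  target <- mean(pop_demo)
  hits <- replicate(nsim, {
    ci <- ci_demo(sample(pop_demo, n, replace = FALSE), length(pop_demo))
    ci[1] < target && target < ci[2]
  })
  mean(hits)
}

sizes <- c(10, 30, 100, 400)
cov_by_n <- sapply(sizes, coverage_demo)
names(cov_by_n) <- sizes
cov_by_n
```

For a skewed population like this one, the estimated coverage should move toward 0.95 as $n$ grows.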

Sample Size Determination

Determining a sample size, $n$

What should my sample size be?
Often depends on resources
- Time, funding, staff, …
When resources permit, can calculate a value for $n$ given a confidence level and a statement of desired precision

Determining sample size - a general approach

Specify tolerable error (level of precision, level of confidence)
Identify appropriate equation relating tolerable error ($e$, $\alpha$) to sample size (n)
Estimate unknown parameters in equation
Solve for n
Evaluate (and return to first step)
- Can you afford the sample size?
- What expectations can be altered?

Specify tolerable error

Two parameters
- $e$: margin of error or half-width of CI
- $\alpha$: $100(1-\alpha)\%$ is the confidence level
Absolute expression (half-width of CI): estimate within $e$ of true population parameter \[ P\left\{ \left| \hat{\theta} - \theta \right| \le e \right\} = 1- \alpha \]
Relative expression: estimate within $100 e \%$ of $\theta$ \[ P\left\{\frac{ \left| \hat{\theta} - \theta \right|}{\theta} \le e \right\} = 1- \alpha \]

Equation linking $e$, $\alpha$, and $n$

Most common equation is half-width of CI \[ e = z_{\alpha/2} \sqrt{\hat{V} ( \hat{\theta} )} \]
Example: sample mean under SRSWOR \[ e = z_{\alpha/2} \sqrt{ \frac{S^2}{n} \left( 1- \frac{n}{N} \right)} \Rightarrow n= \frac{z_{\alpha/2}^2 S^2 }{ e^2 + z_{\alpha/2}^2 S^2/N} \]
Note
- Need to know $S^2$ in advance
- For sufficiently large $N$, $n \cong z_{\alpha/2}^2 S^2/e^2$
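
The sample size formula for the mean under SRSWOR can be wrapped in a small function. The helper `srs_sample_size` is hypothetical (not part of the course code), and the example values passed to it are made up for illustration.

```r
# Hypothetical helper implementing n = z^2 S^2 / (e^2 + z^2 S^2 / N)
srs_sample_size <- function(e, S, N = Inf, conf.level = 0.95) {
  z  <- qnorm((1 - conf.level) / 2, lower.tail = FALSE)
  n0 <- (z * S / e)^2    # large-N approximation z^2 S^2 / e^2
  n0 / (1 + n0 / N)      # algebraically equal to the formula above
}

# Illustrative (made-up) inputs: margin of error 0.5, S = 4, N = 5000
n_needed <- srs_sample_size(e = 0.5, S = 4, N = 5000)
ceiling(n_needed)        # round up to the next whole unit
```

With `N = Inf` the function returns the large-$N$ approximation, so the effect of the finite population can be seen by comparing the two calls.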

Estimate unknowns

Estimate of population variance, $S^2$
- Pilot study
- Previous study: careful about comparability
For $p$, use \[S^2 \cong p(1-p)\]
- If nothing is known about $p$, use $p=0.5$, which maximizes $p(1-p)$
- In this case, $n \cong e^{-2}$ for $\alpha=0.05$ and sufficiently large $N$.
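
The approximation $n \cong e^{-2}$ follows from the large-$N$ formula with $S^2 \cong p(1-p) = 0.25$ and $z_{0.025} = 1.96$:
\[ n \cong \frac{z_{0.025}^2 \, p(1-p)}{e^2} = \frac{1.96^2 \times 0.25}{e^2} = \frac{0.9604}{e^2} \approx \frac{1}{e^2} \]
As an illustration with an assumed margin of error $e = 0.03$, this gives $n \cong 0.9604/0.0009 \approx 1067$.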

Class Example

A sample survey of retail pharmacies is to be conducted in Iowa with $N=2,000$ pharmacies. The purpose of the survey is to estimate the retail price of 20 tablets of a commonly used vasodilator drug. An estimate is needed that is within 10% of the true value of the average retail price in Iowa.
A similar survey performed two years ago shows an average price of \$7.00 for the 20 tablets with a standard deviation of \$1.40.
If SRS is to be used, what is the minimum sample size to achieve this accuracy (with 95% confidence level)?

Solution

We wish to achieve \[ P\left\{\frac{ \left| \hat{\theta} - \theta \right|}{\theta} \le 0.1 \right\} = 0.95 \]
Using the CLT, it is known that \[ P \left\{\left| \hat{\theta} - \theta \right| \le 1.96 \sqrt{ \left( \frac{1}{n} - \frac{1}{N} \right) S^2 } \right\} \cong 0.95 \]
Thus, we have only to solve \[ 0.1 \cdot \theta \cong 1.96 \sqrt{ \left( \frac{1}{n} - \frac{1}{N} \right) S^2 } \]

Using the previous survey, we have $\theta=7$, $S = 1.4$. Also, $N=2,000$. Thus, \[ 0.1 \times 7 \cong 1.96 \sqrt{ \left( \frac{1}{n} - \frac{1}{2000} \right) 1.4^2 } \]
After some simple algebra, \[ \frac{1}{n} =\frac{1}{2000} + \left( \frac{ 0.1 \times 7}{ 1.96 \times 1.4} \right)^2 \] which leads to $n \cong 15.25$.
Thus, the minimum sample size is $n=16$.
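
The arithmetic can be checked in R. This sketch uses only the numbers given in the example; the variable names are assumptions.

```r
N     <- 2000   # number of pharmacies
theta <- 7      # average price from the previous survey
S     <- 1.4    # standard deviation from the previous survey

e <- 0.1 * theta                      # 10% relative error as an absolute error: 0.7
inv_n <- 1 / N + (e / (1.96 * S))^2   # 1/n from the equation above
n_exact <- 1 / inv_n

round(n_exact, 2)   # 15.25
ceiling(n_exact)    # 16
```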

What is the minimum sample size for 5% accuracy?
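
One way to work this out, assuming the same prior values ($\theta=7$, $S=1.4$, $N=2,000$) and the 95% confidence level, is to replace 0.1 by 0.05 in the equation for $1/n$:
\[ \frac{1}{n} = \frac{1}{2000} + \left( \frac{ 0.05 \times 7}{ 1.96 \times 1.4} \right)^2 \cong 0.0005 + 0.01627 = 0.01677 \]
which leads to $n \cong 59.6$, so the minimum sample size would be $n=60$. Halving the tolerable error roughly quadruples the required sample size.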