Stat 421: Week 4

Jae-kwang Kim

2/4/2020

Review

  • In Week 3, we studied
    • Simple random sampling (SRS) design
    • Estimation under SRS (Population mean, population total, population proportions)
    • Variance estimation under SRS
  • In Week 4, we will study
    • Understanding the sampling distribution using simulation
    • Large-sample inference under SRS
    • Sample size determination

Simulation

SRS Simulation:

  • Step 1: Generate a finite population of size N from a model (Bernoulli, Uniform, Normal)
  • Step 2: Draw an SRS of size n repeatedly
  • Step 3: Compute the mean and variance of the sample means from the simulation.
  • Step 4: Compute the coverage of the interval estimator.

Step 1: Generate a finite population

  • Suppose that we use a finite population of size \(N=10,000\) generated from an exponential distribution.
  • The population mean and the population variance of the data “pop1” are
## [1] 0.5047766
## [1] 0.2492536

  • You can also plot a histogram of the population:
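A minimal Python sketch of Step 1, using only the standard library. The rate 2 is an assumption (it gives a model mean of 0.5 and variance of 0.25, close to the values printed above), and the seed is arbitrary, so the numbers will not match the output above exactly. A crude text histogram stands in for the plot.

```python
import random
import statistics

random.seed(421)  # arbitrary seed, only for reproducibility

N = 10_000
RATE = 2.0  # assumed rate: the model mean is 1 / RATE = 0.5

# Step 1: finite population of N independent Exponential(RATE) draws
pop1 = [random.expovariate(RATE) for _ in range(N)]

pop_mean = statistics.fmean(pop1)    # population mean
pop_var = statistics.variance(pop1)  # S^2, with divisor N - 1
print(pop_mean, pop_var)

# Crude text histogram: bins of width 0.25, one '#' per 50 units
BIN = 0.25
counts = {}
for y in pop1:
    b = int(y // BIN)
    counts[b] = counts.get(b, 0) + 1
for b in sorted(counts):
    print(f"{b * BIN:5.2f} {'#' * (counts[b] // 50)}")
```

The histogram should show the strong right skew of the population, which is why small samples are poorly approximated by a normal distribution.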

Step 2: Draw an SRS of size \(n\) from the finite population

  • Draw an SRS of size \(n=10\) from “pop1” data.
## [1] 0.4555882

You can repeat this sampling independently \(10,000\) times.

Let’s increase the sample size to \(n=100\)

Let’s compare the histogram with the density of normal distribution

Normal approximation is not very good for samples with \(n=10\)
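One way to quantify this without a plot is the skewness of the simulated sample means, which is 0 for a normal distribution. The sketch below is in Python and assumes an Exponential(rate 2) population; the helper names `skewness` and `simulate_means`, the seed, and the 5,000 replications are illustrative choices.

```python
import random

random.seed(421)  # arbitrary seed for reproducibility

N = 10_000
pop1 = [random.expovariate(2.0) for _ in range(N)]  # assumed rate 2


def skewness(x):
    """Moment-based skewness m3 / m2^(3/2)."""
    m = sum(x) / len(x)
    m2 = sum((v - m) ** 2 for v in x) / len(x)
    m3 = sum((v - m) ** 3 for v in x) / len(x)
    return m3 / m2 ** 1.5


def simulate_means(n, reps=5_000):
    """Sample means from `reps` independent SRS draws of size n."""
    # random.sample draws without replacement, i.e. an SRS
    return [sum(random.sample(pop1, n)) / n for _ in range(reps)]


skew_10 = skewness(simulate_means(10))
skew_100 = skewness(simulate_means(100))
print(skew_10, skew_100)
```

The skewness for \(n=10\) should be clearly larger than for \(n=100\), in line with the histograms.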

Check

  • The mean and variance of \(\bar{y}\) (with \(n=100\)) from simulation samples are as follows:
## [1] 0.504719
## [1] 0.002423022
  • The population mean is 0.5048.

  • The theoretical variance of \(\bar{y}\) is \(n^{-1} (1-n/N) S^2\), which is equal to 0.0024676.
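The check above can be sketched in Python as follows. The Exponential(rate 2) population and the seed are assumptions, so the values will differ slightly from the output shown above.

```python
import random
import statistics

random.seed(421)  # arbitrary seed for reproducibility

N, n, reps = 10_000, 100, 10_000
pop1 = [random.expovariate(2.0) for _ in range(N)]  # assumed rate 2

# Steps 2 and 3: repeat the SRS and keep each sample mean
ybars = [sum(random.sample(pop1, n)) / n for _ in range(reps)]

sim_mean = statistics.fmean(ybars)    # Monte Carlo mean of ybar
sim_var = statistics.variance(ybars)  # Monte Carlo variance of ybar

# Theoretical variance: (1/n) * (1 - n/N) * S^2
S2 = statistics.variance(pop1)
theory_var = (1 / n) * (1 - n / N) * S2

print(sim_mean, statistics.fmean(pop1))
print(sim_var, theory_var)
```

The two printed pairs should agree closely: \(\bar{y}\) is unbiased for the population mean, and its simulated variance matches the formula with the finite population correction.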

Large-sample inference under SRS

Central Limit theorem under SRS

  • May use the normal approximation \[ \frac{ \bar{y} - \bar{Y}_U }{ \sqrt{ V( \bar{y} )} } \sim N(0,1) \] for sufficiently large sample sizes.

  • In practice, we use the variance estimator \(\hat{V} ( \bar{y})\) instead of \(V( \bar{y} )\) to get

\[ \frac{ \bar{y} - \bar{Y}_U }{ \sqrt{ \hat{V}( \bar{y} )} } \sim N(0,1) \]

When is the CLT justified?

  • Quality of approximation depends on \(n\) and the population distribution of \(Y\)

  • “\(n\) is large enough for CLT” is less clear for finite populations

    • What is the meaning of \(n\rightarrow \infty\) for finite population?
    • n = 30 rule in other stat classes does NOT apply

Rules of thumb

  • If the distribution of Y is close to normal, n = 50 is usually enough
  • Need larger n if distribution of Y deviates from normal, e.g., skewed
  • Y categorical: if p is proportion with characteristic of interest, \(np \ge\) 5 and \(n(1-p) \ge\) 5

Interval estimation of \(\bar{Y}_U\) under SRS

  • We may use the normal approximation (i.e. CLT) to construct confidence intervals of \(\bar{Y}_U\).

  • For example, the \(100\times(1-\alpha)\)% C.I. for \(\bar{Y}_U\) is \[ CI_{1-\alpha} = \left(\bar{y} - z_{\alpha/2}\sqrt{\hat{V} ( \bar{y})}, \ \bar{y} + z_{\alpha/2} \sqrt{\hat{V} ( \bar{y})} \right) \] where \[ \hat{V}( \bar{y}) = \frac{1}{n} \left( 1- \frac{n}{N}\right) s^2 \] and \(z_\alpha\) is the upper \(\alpha\) quantile of the \(N(0,1)\) distribution, satisfying \(P( Z \le z_\alpha)= 1- \alpha\) for \(Z\sim N(0,1)\).
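The interval above can be computed with a short helper. This is a Python sketch, not the course's R function; the name `srs_ci` and the toy data are hypothetical.

```python
import math
import statistics
from statistics import NormalDist


def srs_ci(sample, N, alpha=0.05):
    """Normal-approximation CI for the population mean under SRS."""
    n = len(sample)
    ybar = statistics.fmean(sample)
    s2 = statistics.variance(sample)      # sample variance s^2
    v_hat = (1 / n) * (1 - n / N) * s2    # estimated variance of ybar
    z = NormalDist().inv_cdf(1 - alpha / 2)  # upper alpha/2 quantile
    half = z * math.sqrt(v_hat)
    return ybar - half, ybar + half


# Toy data, for illustration only
lower, upper = srs_ci([1, 2, 3, 4, 5], N=100)
print(lower, upper)
```

The interval is symmetric about \(\bar{y}\), and its half-width shrinks as \(n\) grows or as the sampling fraction \(n/N\) approaches one.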

Interpreting CIs in general

  • More generally (for any design), a \(100 \left( 1- \alpha \right) \%\) CI has the interpretation: There is a \(100 \left( 1- \alpha \right) \%\) chance of selecting a sample for which the CI will include the true population parameter
  • Note
    • The upper and lower limits of the CI are random variables, calculated from the sample data
    • The true parameter value is either included or not included in a single CI
    • Coverage probability (\(=1-\alpha\)) of a CI has a relative frequency interpretation across samples

Computing CI from a SRS sample

## [1] 0.3884439 0.5919833
  • The 95% CI for the population mean is (0.3884, 0.5920). It is centered at the sample mean 0.4902 and covers the population mean 0.5048.

Use simulation to compute the coverage probability of 95% CI.

  • R function for computing CI for the population mean:
  • Now, we are interested in checking whether the 95% CI covers the true value with 0.95 probability.

  • If the 95% CI computed from the \(i\)-th simulation sample covers the population mean, then cover[i] is equal to one. Otherwise, it is zero.
## [1] 0.94
  • Thus, the estimated probability that the 95% CI covers the population mean is 0.94 in this simulation.

  • If the CLT holds, then \[ P \left\{\bar{Y}_U \in CI_{1-\alpha} \right\} \cong 1-\alpha \]

Coverage probability of 95% CI for SRS of size \(n=10\)

## [1] 0.8706
  • Thus, the estimated coverage probability under \(n=10\) is 0.8706 in this simulation.
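The coverage experiment can be sketched in Python as below. The Exponential(rate 2) population, the seed, and the 5,000 replications are assumptions, so the estimates will differ somewhat from the values reported above. The list `cover` plays the role of cover[i] in the text.

```python
import math
import random
from statistics import NormalDist

random.seed(421)  # arbitrary seed for reproducibility

N = 10_000
pop1 = [random.expovariate(2.0) for _ in range(N)]  # assumed rate 2
pop_mean = sum(pop1) / N
z = NormalDist().inv_cdf(0.975)  # about 1.96


def coverage(n, reps=5_000):
    """Fraction of simulated SRS samples whose 95% CI covers pop_mean."""
    cover = []
    for _ in range(reps):
        sample = random.sample(pop1, n)  # SRS without replacement
        ybar = sum(sample) / n
        s2 = sum((y - ybar) ** 2 for y in sample) / (n - 1)
        half = z * math.sqrt((1 / n) * (1 - n / N) * s2)
        cover.append(1 if abs(ybar - pop_mean) <= half else 0)
    return sum(cover) / reps


cov_100 = coverage(100)
cov_10 = coverage(10)
print(cov_100, cov_10)
```

For a skewed population like this one, the estimated coverage at \(n=10\) should fall noticeably below the nominal 0.95, while \(n=100\) should come much closer.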

Sample Size Determination

Determining a sample size, \(n\)

  • What should my sample size be?
  • Often depends on resources
    • Time, funding, staff, …
  • When resources permit, can calculate a value for \(n\) given a confidence level and a statement of desired precision

Determining sample size - a general approach

  1. Specify tolerable error (level of precision, level of confidence)
  2. Identify appropriate equation relating tolerable error (\(e\), \(\alpha\)) to sample size (n)
  3. Estimate unknown parameters in equation
  4. Solve for n
  5. Evaluate (and return to first step)
    • Can you afford the sample size?
    • What expectations can be altered?

Specify tolerable error

  • Two parameters

    • \(e\): margin of error or half-width of CI
    • \(\alpha\): \(100(1-\alpha)\%\) is the confidence level
  • Absolute expression (half-width of CI): estimate within \(e\) of true population parameter \[ P\left\{ \left| \hat{\theta} - \theta \right| \le e \right\} = 1- \alpha \]

  • Relative expression: estimate within \(100 e \%\) of \(\theta\) \[ P\left\{\frac{ \left| \hat{\theta} - \theta \right|}{\theta} \le e \right\} = 1- \alpha \]

Equation linking \(e\), \(\alpha\), and \(n\)

  • Most common equation is half-width of CI \[ e = z_{\alpha/2} \sqrt{\hat{V} ( \hat{\theta} )} \]

  • Example: sample mean under SRSWOR \[ e = z_{\alpha/2} \sqrt{ \frac{S^2}{n} \left( 1- \frac{n}{N} \right)} \Rightarrow n= \frac{z_{\alpha/2}^2 S^2 }{ e^2 + z_{\alpha/2}^2 S^2/N} \]

  • Note

    • Need to know \(S^2\) in advance
    • For sufficiently large \(N\), \(n \cong z_{\alpha/2}^2 S^2/e^2\)
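The formula above translates directly into a small function. This is a Python sketch; the name `srs_sample_size` and the illustrative inputs (\(S^2 = 0.25\), \(e = 0.05\), \(N = 10{,}000\)) are assumptions, not values from the lecture.

```python
import math
from statistics import NormalDist


def srs_sample_size(e, S2, N=None, alpha=0.05):
    """Sample size giving half-width e for the mean under SRSWOR.

    With N=None the large-N approximation z^2 * S^2 / e^2 is returned.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    n0 = z ** 2 * S2 / e ** 2  # ignores the finite population correction
    if N is None:
        return n0
    # Algebraically equal to z^2 S^2 / (e^2 + z^2 S^2 / N)
    return n0 / (1 + n0 / N)


n_large_N = srs_sample_size(e=0.05, S2=0.25)
n_finite = srs_sample_size(e=0.05, S2=0.25, N=10_000)
print(math.ceil(n_large_N), math.ceil(n_finite))
```

The finite population correction always lowers the required \(n\); the two values coincide as \(N\) grows.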

Estimate unknowns

  • Estimate of population variance, \(S^2\)
    • Pilot study
    • Previous study: careful about comparability
  • For \(p\), use \[S^2 \cong p(1-p)\]
    • If know nothing about \(p\), use \(p=0.5\)
    • In this case, \(n \cong e^{-2}\) for \(\alpha=0.05\).
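A quick numerical check of the \(n \cong e^{-2}\) shortcut, as a Python sketch. The function name `n_for_proportion` and the margin \(e = 0.03\) are illustrative, and the finite population correction is ignored.

```python
from statistics import NormalDist

z = NormalDist().inv_cdf(0.975)  # alpha = 0.05


def n_for_proportion(e, p=0.5):
    """Large-N sample size for a proportion, using S^2 ~ p(1 - p)."""
    return z ** 2 * p * (1 - p) / e ** 2


e = 0.03
n_worst = n_for_proportion(e)  # p = 0.5 maximizes p(1 - p)
n_shortcut = 1 / e ** 2        # the e^{-2} rule

print(n_worst, n_shortcut)
print(n_for_proportion(e, p=0.2))  # any other p needs a smaller n
```

Because \(z_{0.025}^2 \times 0.25 \approx 0.96\), the shortcut \(e^{-2}\) slightly overstates the worst-case sample size, which makes it a safe planning value.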

Class Example

  • A sample survey of retail pharmacies is to be conducted in Iowa with \(N=2,000\) pharmacies. The purpose of the survey is to estimate the retail price of 20 tablets of a commonly used vasodilator drug. An estimate is needed that is within 10% of the true value of the average retail price in Iowa.

  • A similar survey performed two years ago shows an average price of $ 7.00 for the 20 tablets with a standard deviation of $ 1.40.

  • If SRS is to be used, what is the minimum sample size to achieve this accuracy (with 95% confidence level)?

Solution

  • We wish to achieve \[ P\left\{\frac{ \left| \hat{\theta} - \theta \right|}{\theta} \le 0.1 \right\} = 0.95 \]

  • Using the CLT, it is known that \[ P \left\{\left| \hat{\theta} - \theta \right| \le 1.96 \sqrt{ \left( \frac{1}{n} - \frac{1}{N} \right) S^2 } \right\} = 0.95 \]

  • Thus, we have only to solve \[ 0.1 \cdot \theta \cong 1.96 \sqrt{ \left( \frac{1}{n} - \frac{1}{N} \right) S^2 } \]

  • Using the previous survey, we have \(\theta=7\), \(S = 1.4\). Also, \(N=2,000\). Thus, \[ 0.1 \times 7 \cong 1.96 \sqrt{ \left( \frac{1}{n} - \frac{1}{2000} \right) 1.4^2 } \]

  • After some simple algebra, \[ \frac{1}{n} =\frac{1}{2000} + \left( \frac{ 0.1 \times 7}{ 1.96 \times 1.4} \right)^2 \] which leads to \(n \cong 15.25\).

  • Thus, the minimum sample size is \(n=16\).
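The arithmetic in this solution can be checked with a few lines of Python. The function name `min_sample_size` is hypothetical; the inputs are the values given in the example, with \(z = 1.96\) as in the text.

```python
import math


def min_sample_size(rel_e, theta, S, N, z=1.96):
    """Solve rel_e * theta = z * sqrt((1/n - 1/N) * S^2) for n."""
    e = rel_e * theta                      # absolute margin of error
    inv_n = 1 / N + (e / (z * S)) ** 2     # 1/n from the solution above
    n_exact = 1 / inv_n
    return n_exact, math.ceil(n_exact)     # round up to meet the target


n_exact, n_min = min_sample_size(rel_e=0.10, theta=7.0, S=1.4, N=2000)
print(round(n_exact, 2), n_min)
```

Rounding up rather than to the nearest integer ensures that the margin of error does not exceed the target.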

What is the minimum sample size for 5% accuracy?