Stat 421: Week 3

Jae-kwang Kim

1/28/2020

Review

  • In Week 2, we studied
    • Selection bias and Measurement bias
    • Survey Errors
    • Basic concepts of probability sampling
  • In Week 3, we will study
    • Simple random sampling (SRS) design
    • Estimation under SRS (Population mean, population total, population proportions)
    • Variance estimation under SRS

Simple random sampling (SRS) design

Notation

  • \(\mathcal{U} = \left\{ 1, \cdots, N \right\}\): index set of finite population of size \(N\)

  • \(\mathcal{S}\): index set of sample. \((\mathcal{S} \subset \mathcal{U})\)

  • \(P\left( \mathcal{S} \right)\): probability of selecting sample \(\mathcal{S}\).

  • \(\pi_i = P( i \in \mathcal{S} )\): (the first-order) inclusion probability of unit \(i\). Note that \[ \pi_i = \sum_{\mathcal{S}; i \in \mathcal{S} } P( \mathcal{S}). \]

Basic Concepts

  • Sampling design (or sampling mechanism): Enumeration of possible \(\left( \mathcal{S}, P \left( \mathcal{S} \right) \right)\)

  • Probability sampling design: sampling design with \(\pi_i>0\) for all \(i \in \mathcal{U}\).

  • Definition: Simple random sampling (of size \(n\)) design \[ \iff P( \mathcal{S}) = \left\{ \begin{array}{ll} {N \choose n}^{-1} & \mbox{ if } |\mathcal{S} | = n \\ 0 & \mbox{ otherwise} \end{array} \right. \] where \[ {N \choose n} = \frac{ N!}{ n! (N-n)!} . \]
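
  • For example, with \(N=4\) and \(n=2\), there are \({4 \choose 2} = \frac{4!}{2! \, 2!} = 6\) possible samples, and SRS assigns probability \(1/6\) to each of them.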

Remark

  • Define the following sampling indicator function \[ I_i = \left\{ \begin{array}{ll} 1 & \mbox{ if } i \in \mathcal{S} \\ 0 & \mbox{ otherwise} \end{array} \right. \]

  • By definition of \(I_i\), we have \[ \sum_{i=1}^N I_i = n \] where \(n\) is the (realized) sample size.

  • The first-order inclusion probability can be written as \[ \pi_i = Pr( I_i=1) = E( I_i) . \]

Result

  1. The sum of \(\pi_i\) over the finite population is equal to the sample size. That is, \[ \sum_{i=1}^N \pi_i = n.\]

  2. Under SRS, the \(\pi_i\) are all equal. That is, \(\pi_i= n/N\) for all \(i \in \mathcal{U}\), where \(n=|\mathcal{S}|\) and \(N=|\mathcal{U}|\).

Proof
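
  • For Result 1, write \(\pi_i = E( I_i )\) and interchange summation and expectation: \[ \sum_{i=1}^N \pi_i = \sum_{i=1}^N E( I_i ) = E\left( \sum_{i=1}^N I_i \right) = E( n ) = n . \]

  • For Result 2, there are \({N-1 \choose n-1}\) samples of size \(n\) that contain unit \(i\), and under SRS each of them has probability \({N \choose n}^{-1}\). Thus \[ \pi_i = \frac{{N-1 \choose n-1}}{{N \choose n}} = \frac{n}{N} . \]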

Question

  • We have studied that SRS implies \(\pi_i=n/N\) for all \(i\). Does the converse also hold? That is, does \(\pi_i = n/N\) for all \(i\) imply that the sampling design is SRS?

A counterexample with \(N=4\)

  Case   Sample ID   Selection Prob.
   1       1,2             0
   2       1,3            1/4
   3       1,4            1/4
   4       2,3            1/4
   5       2,4            1/4
   6       3,4             0

  • Under this design \(\pi_i = 1/4 + 1/4 = 1/2 = n/N\) for every unit \(i\), yet the design is not SRS (the samples \(\{1,2\}\) and \(\{3,4\}\) can never be selected). So the answer to the question is no, as the enumeration below confirms.
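
  • A minimal Python sketch that enumerates this design and verifies \(\pi_i = 1/2\) for every unit:

```python
from fractions import Fraction

# The non-SRS design on U = {1,2,3,4} with n = 2:
# samples {1,2} and {3,4} are never selected.
design = {
    (1, 2): Fraction(0),
    (1, 3): Fraction(1, 4),
    (1, 4): Fraction(1, 4),
    (2, 3): Fraction(1, 4),
    (2, 4): Fraction(1, 4),
    (3, 4): Fraction(0),
}

# First-order inclusion probability: pi_i = sum of P(S) over samples containing i
for i in range(1, 5):
    pi_i = sum(p for s, p in design.items() if i in s)
    print(f"pi_{i} = {pi_i}")  # prints 1/2 for every unit
```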

Selection procedure for SRS without replacement

  • Draw-by-draw Procedure
    • Select one element from universe of size \(N\) with probability \(1/N\)
    • DO NOT return the selected element to the universe (=population)
    • Select 2nd element from remaining units in universe with probability \(1/(N -1)\)
    • DO NOT return the selected element to the universe
    • Repeat until \(n\) elements have been selected
  • Selection probabilities change with each draw

\[ 1/N, 1/(N -1), 1/(N -2), \cdots , 1/(N - n +1) \]

  • From the draw-by-draw procedure, what are the following probabilities?
    • P( Unit \(i\) is selected at the first draw ) \(= \frac{1}{N}\)
    • P( Unit \(i\) is selected at the second draw ) \(= \frac{N-1}{N} \cdot \frac{1}{N-1} = \frac{1}{N}\)
    • P( Unit \(i\) is selected at the third draw ) \(= \frac{N-1}{N} \cdot \frac{N-2}{N-1} \cdot \frac{1}{N-2} = \frac{1}{N}\)
  • Thus, the probability that unit \(i\) is selected is \[ P( i \in \mathcal{S} )= \sum_{k=1}^n P( \mbox{unit $i$ is selected at the $k$-th draw}) = \frac{n}{N} . \]

SRS with replacement

  • This is another version of SRS.
  • Selection procedure for SRS with replacement (both procedures are sketched in Python after this list):
    • Select one element with probability \(1/N\) from the \(N\) elements.
    • Return the selected element to the universe (=population).
    • Repeat this \(n\) times.
  • SRS with replacement is like drawing \(n\) independent samples of size 1:
    • A sampling unit can be drawn twice, producing duplicate units.
    • Unappealing for finite population sampling: a duplicate unit carries no additional information.
    • Useful in theoretical developments for large populations.
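
  • A minimal Python sketch of both selection procedures (the function names are illustrative):

```python
import random

def srs_without_replacement(N, n, rng=random):
    """Draw-by-draw SRS: every remaining unit is equally likely at each draw."""
    universe = list(range(1, N + 1))
    sample = []
    for _ in range(n):
        pick = rng.choice(universe)  # probability 1/|universe| for each remaining unit
        universe.remove(pick)        # do NOT return the selected element
        sample.append(pick)
    return sample

def srs_with_replacement(N, n, rng=random):
    """n independent draws, each with probability 1/N; duplicates are possible."""
    return [rng.choice(range(1, N + 1)) for _ in range(n)]

# Monte Carlo check that draw-by-draw selection gives pi_i = n/N:
N, n, reps = 10, 4, 100_000
counts = [0] * (N + 1)
for _ in range(reps):
    for i in srs_without_replacement(N, n):
        counts[i] += 1
print([round(c / reps, 2) for c in counts[1:]])  # each entry close to n/N = 0.4
```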

Parameter Estimation under SRS

Sample statistic

  • Once the sample is selected, we observe \(y_i\) from the sample elements.
  • A statistic \(\hat{\theta}\) can be computed from the sample observations. That is, \(\hat{\theta}\) is a known function of \(y_i\)s in the sample.
  • \(\hat{\theta}\) is random in the sense that \(\hat{\theta}\) can vary with \(\mathcal{S}\).
  • Thus, we can write \(\hat{\theta}= \hat{\theta} ( \mathcal{S})\) to emphasize its dependence on \(\mathcal{S}\).

Sampling Distribution

  • The sampling distribution of \(\hat{\theta}\) is defined as the enumeration of all possible \((\hat{\theta}(\mathcal{S}), P( \mathcal{S}))\), where \(P( \mathcal{S})\) is the selection probability of sample \(\mathcal{S}\).

  • Using the sampling distribution, we can derive \[ E( \hat{\theta}) = \sum_{\mathcal{S}} \hat{\theta} (\mathcal{S}) P( \mathcal{S}). \]

  • If \(E( \hat{\theta})=\theta\), then \(\hat{\theta}\) is unbiased for \(\theta\).

  • Also, the variance is \[ V( \hat{\theta}) = E [ \{ \hat{\theta} - E( \hat{\theta}) \}^2 ] \] where \[ E [ \{ \hat{\theta} - E( \hat{\theta}) \}^2 ] = \sum_{\mathcal{S}} \{ \hat{\theta}( \mathcal{S} ) - E( \hat{\theta}) \}^2 P( \mathcal{S} ) \]

  • Note that \[ V( \hat{\theta}) = E( \hat{\theta}^2) - \{E( \hat{\theta}) \}^2 \] where \[ E( \hat{\theta}^2) = \sum_{\mathcal{S}} \{ \hat{\theta} (\mathcal{S}) \}^2 P( \mathcal{S}). \]

  • Standard Error of \(\hat{\theta}\): \[ SE ( \hat{\theta}) = \sqrt{ V( \hat{\theta})} \]

Example: SRS

  Case   Sample ID   Statistic (Sample mean)   Selection Prob.
   1       1,2         \((y_1 + y_2)/2\)            1/6
   2       1,3         \((y_1 + y_3)/2\)            1/6
   3       1,4         \((y_1 + y_4)/2\)            1/6
   4       2,3         \((y_2 + y_3)/2\)            1/6
   5       2,4         \((y_2 + y_4)/2\)            1/6
   6       3,4         \((y_3 + y_4)/2\)            1/6

  • Expectation of the sample mean: \[{\small E( \bar{y})= \frac{1}{6} \left\{ \frac{y_1+y_2}{2} + \cdots + \frac{y_3+y_4}{2} \right\} =\frac{1}{4}\sum_{i=1}^4 y_i. } \]
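
  • The same enumeration can be done by brute force in Python (a minimal sketch with hypothetical population values):

```python
from itertools import combinations

y = [3.0, 7.0, 1.0, 5.0]  # hypothetical population values y_1, ..., y_4
N, n = len(y), 2

# All C(4,2) = 6 samples; under SRS each has selection probability 1/6.
samples = list(combinations(range(N), n))
p = 1 / len(samples)

means = [sum(y[i] for i in s) / n for s in samples]
E_ybar = sum(m * p for m in means)
V_ybar = sum((m - E_ybar) ** 2 * p for m in means)

print(E_ybar, sum(y) / N)  # E(ybar) equals the population mean
print(V_ybar)              # the sampling variance of ybar
```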

Estimation of the population mean under SRS

  • \(\bar{Y}_U = N^{-1}\sum_{i=1}^N y_i\): population mean of \(y\), parameter of interest
  • \(\bar{y}=n^{-1} \sum_{i \in \mathcal{S}} y_i\): sample mean of \(y\). An estimator of \(\bar{Y}_U\).
  • Sampling design: SRS
  • We are interested in using \(\bar{y}\) to estimate \(\bar{Y}_U\).

Theorem 1

  • Under SRS without replacement, the sample mean \(\bar{y}\) satisfies \[ E( \bar{y} ) = \bar{Y}_U \] and \[ V( \bar{y}) = \frac{1}{n} \left( 1- \frac{n}{N}\right) S^2 , \] where \[ S^2 = \frac{1}{N-1} \sum_{i=1}^N \left( y_i - \bar{Y}_U \right)^2 . \]

Proof of Theorem 1

  • First we express \[ \bar{y} = \frac{1}{n} \sum_{i=1}^N I_i y_i \] where \[ I_i =\left\{ \begin{array}{ll} 1 & \mbox{ if } i \in \mathcal{S} \\ 0 & \mbox{ otherwise} \end{array} \right. \]
  • Expectation of \(\bar{y}\): \[\begin{eqnarray*} E( \bar{y} ) &=& E\left(\frac{1}{n} \sum_{i=1}^N I_i y_i \right)= \frac{1}{n} \sum_{i=1}^N y_i E( I_i) = \frac{1}{N} \sum_{i=1}^N y_i . \end{eqnarray*}\]

  • For the variance term, note that \[ V( \bar{y}) = E\{ (\bar{y}-\bar{Y}_U )^2 \} \]

  • We can express \[\begin{eqnarray*} \bar{y} - \bar{Y}_U &=& \frac{1}{n} \sum_{i=1}^N I_i ( y_i - \bar{Y}_U ):= \frac{1}{n} \sum_{i=1}^N I_i z_i \end{eqnarray*}\] where \(z_i = y_i - \bar{Y}_U\).

  • Thus, using \(I_i^2 = I_i\), we have
    \[\begin{eqnarray*} (\bar{y} - \bar{Y}_U)^2 &=& \frac{1}{n^2} \left\{ \sum_{i=1}^N I_i z_i^2 + \sum_{i \neq j }I_i I_j z_i z_j \right\} \\ E\{ (\bar{y} - \bar{Y}_U)^2 \} &=& \frac{1}{n^2} \left\{ \sum_{i=1}^N E(I_i) z_i^2 + \sum_{i \neq j }E( I_i I_j) z_i z_j \right\} \end{eqnarray*}\]

  • We have already seen that \(E(I_i)=n/N\).
  • For the \(E(I_iI_j)\) term with \(i \neq j\), we use \[ E( I_i I_j ) = \frac{{N-2 \choose n-2}}{N \choose n} = \frac{n(n-1)}{N(N-1)} \]
  • Also, since \(\sum_{i=1}^N z_i=0\), squaring gives \(0 = ( \sum_{i=1}^N z_i )^2 = \sum_{i=1}^N z_i^2 + \sum_{i \neq j} z_i z_j\), and thus \[ \sum_{i \neq j} z_i z_j = - \sum_{i=1}^N z_i^2 \]
  • Now, \(\sum_{i=1}^N z_i^2 = \sum_{i=1}^N (y_i - \bar{Y}_U)^2 = (N-1) S^2\).

  • Combining the above results, we can obtain \[\begin{eqnarray*} E\{ (\bar{y} - \bar{Y}_U)^2 \} &=& \frac{1}{n^2} \left\{ \frac{n}{N} - \frac{n(n-1)}{ N(N-1)} \right\} \sum_{i=1}^N z_i^2 \\ &=& \frac{1}{n^2} \frac{n}{N} \left( 1 - \frac{n-1}{N-1} \right) (N-1) S^2 \\ &=& \frac{1}{nN} (N-n) S^2 . \end{eqnarray*}\]
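
  • A numeric check of Theorem 1 by full enumeration (a minimal Python sketch; the population values are hypothetical):

```python
from itertools import combinations

y = [3.0, 7.0, 1.0, 5.0, 9.0, 2.0]  # hypothetical population
N, n = len(y), 3
Ybar_U = sum(y) / N
S2 = sum((yi - Ybar_U) ** 2 for yi in y) / (N - 1)

# Exact sampling variance of ybar: enumerate all C(N, n) samples (each has
# probability 1 / C(N, n) under SRS, and E(ybar) = Ybar_U by Theorem 1).
means = [sum(s) / n for s in combinations(y, n)]
V_exact = sum((m - Ybar_U) ** 2 for m in means) / len(means)

V_formula = (1 / n) * (1 - n / N) * S2  # Theorem 1
print(V_exact, V_formula)               # the two agree
```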

Remark

  • If \(n=N\), then \(V( \bar{y})=0\).

  • If \(n=1\), then \(\bar{y}\) is equal to the \(y\) value of the first selected element and
    \[ V( \bar{y})= \frac{1}{N} \sum_{i=1}^N ( y_i - \bar{Y}_U)^2. \] It is the variance of selecting one element at random from \(\{ y_1, \cdots, y_N\}\).

SRS with replacement

  • If the sample is SRS with replacement, then we can define \(a(k)\) to be the index of the element selected at the \(k\)-th draw.

  • The \(n\) sample elements, \(y_{a(1)}, \cdots, y_{a(n)}\), are independently and identically distributed with the distribution \[ Y_{a(k)} = \left\{ \begin{array}{ll} y_1 & \mbox{ with prob. } 1/N \\ y_2 & \mbox{ with prob. } 1/N \\ \vdots & \\ y_N & \mbox{ with prob. } 1/N \end{array} \right. \]

  • The sample mean is \[ \bar{y} = \frac{1}{ n} \sum_{k=1}^n Y_{a(k)} . \]

  • It can be shown that the sample mean \(\bar{y}\) satisfies \[ E( \bar{y} ) = \bar{Y}_U \] and \[ V( \bar{y}) = \frac{1}{n} \left( 1- \frac{1}{N}\right) S^2 . \]

  • Thus, the variance is smaller under without-replacement sampling, since \(1 - n/N \leq 1 - 1/N\), with equality only when \(n=1\).

Finite population correction factor

  • Finite population correction factor (FPC) \[ FPC = 1 - \frac{n}{N} \]

  • The sampling fraction is the proportion of the population sampled, \(n/N\).

  • The FPC is often very close to 1. For example, with \(n=100\) and \(N=100{,}000\), \(FPC = 1 - 0.001 = 0.999\).

  • When the sampling fraction is very small and the FPC is very close to 1, the FPC has no practical effect on the variance (or estimated variance) of the parameter estimate.

Estimation of the population total under SRS

  • \(T = \sum_{i=1}^N y_i\): population total of \(y\), parameter of interest

  • \(\hat{T}=N \bar{y}\): an estimator of \(T\).

  • Properties of \(\hat{T}\)

    • Unbiased
    • Variance \[ V\left( \hat{T} \right) = \frac{N^2}{n} \left( 1-\frac{n}{N} \right) S^2. \]
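
  • Justification: both properties follow from Theorem 1, since \(\hat{T} = N \bar{y}\) is a fixed multiple of \(\bar{y}\): \[ E( \hat{T} ) = N E( \bar{y} ) = N \bar{Y}_U = T, \quad \quad V( \hat{T} ) = N^2 V( \bar{y} ) = \frac{N^2}{n} \left( 1 - \frac{n}{N} \right) S^2 . \]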

Estimation of the population proportion under SRS

  • \(Y\) is binary, taking values 1 or 0, where \(Y=1\) means the unit has the characteristic of interest.

  • Parameter: the proportion of units with \(Y=1\), \(P = N^{-1} \sum_{i=1}^N y_i\).

  • \(\hat{P}=\bar{y}\): an estimator of \(P\).

  • Properties

    • Unbiased: \(E( \hat{P})= P\)
    • Variance \[ V\left( \hat{P} \right) = \frac{1}{n} \left( 1-\frac{n}{N} \right) \frac{N}{N-1} P (1-P). \]

Justification
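
  • Since each \(y_i \in \{0, 1\}\), we have \(y_i^2 = y_i\), and hence \[ \sum_{i=1}^N ( y_i - P )^2 = \sum_{i=1}^N y_i - N P^2 = NP - NP^2 = N P (1-P), \] so that \[ S^2 = \frac{N}{N-1} P (1-P) . \]

  • Plugging this \(S^2\) into Theorem 1 gives the stated variance. Unbiasedness is immediate because \(\hat{P} = \bar{y}\) and \(E( \bar{y} ) = \bar{Y}_U = P\).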

Variance Estimation

Introduction

  • Different concepts

    • Population variance: \[ S^2= \frac{1}{N-1} \sum_{i=1}^N (y_i - \bar{Y}_U)^2\]
    • (Sampling) variance of \(\bar{y}\): \(V( \bar{y})\)
    • Variance estimation of \(\bar{y}\)
  • We are interested in estimating \(V( \bar{y})\) under SRS. We use \(\hat{V} ( \bar{y})\) to denote an estimator of \(V( \bar{y})\).

  • Recall that, by Theorem 1,
    \[V( \bar{y}) = \frac{1}{n} \left( 1- \frac{n}{N}\right) S^2\]

Population variance

  • Population variance \[ \small S^2 \equiv \frac{1}{N-1}\sum_{i=1}^N \left( y_i - \bar{Y}_U \right)^2= \frac{1}{2N(N-1)} \sum_{i=1}^N \sum_{j=1}^N (y_i - y_j)^2 \]

  • To verify the equality, use

\[\begin{eqnarray*} {\small \sum_{i=1}^N \sum_{j=1}^N (y_i - y_j)^2 } & =&{\small \sum_{i=1}^N \sum_{j=1}^N \{ (y_i - \bar{Y}_U) - (y_j - \bar{Y}_U) \}^2 } \\ &=& {\small N \sum_{i=1}^N (y_i - \bar{Y}_U)^2 + N \sum_{j=1}^N (y_j - \bar{Y}_U)^2 - 2 \sum_{i=1}^N (y_i - \bar{Y}_U) \sum_{j=1}^N (y_j - \bar{Y}_U) } \\ &=& 2 N \sum_{i=1}^N (y_i - \bar{Y}_U)^2 , \end{eqnarray*}\] where the cross-product term vanishes because \(\sum_{i=1}^N (y_i - \bar{Y}_U) = 0\).

Estimation of population variance

  • Sample variance \[ {\small s^2 = \frac{1}{n-1} \sum_{i \in \mathcal{S}} \left( y_i - \bar{y} \right)^2 = \frac{1}{2n(n-1)} \sum_{i\in \mathcal{S}} \sum_{j \in \mathcal{S}} (y_i - y_j)^2} \]

  • Property (under SRS) \[ E\left( s^2 \right) = S^2 \]

Proof
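
  • Using the pairwise representation of \(s^2\) and the sampling indicators, \[ E( s^2 ) = \frac{1}{2n(n-1)} \sum_{i=1}^N \sum_{j=1}^N E( I_i I_j ) ( y_i - y_j )^2 = \frac{1}{2n(n-1)} \cdot \frac{n(n-1)}{N(N-1)} \sum_{i=1}^N \sum_{j=1}^N ( y_i - y_j )^2 , \] since \(E( I_i I_j ) = n(n-1)/\{ N(N-1) \}\) for \(i \neq j\) under SRS and the \(i = j\) terms are zero.

  • The right-hand side is \[ \frac{1}{2N(N-1)} \sum_{i=1}^N \sum_{j=1}^N ( y_i - y_j )^2 = S^2 , \] by the pairwise representation of the population variance.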

Variance estimation formula

  • For sample mean \(\bar{y}\): \[ \hat{V} ( {\bar{y}}) = \frac{1}{n} \left( 1- \frac{n}{N} \right) s^2 \]
  • For total estimator \[ \hat{V} ( \hat{T}) = \frac{N^2}{n} \left( 1- \frac{n}{N} \right) s^2 \]
  • For the proportion estimator \(\hat{P}\) \[ \hat{V} ( \hat{P}) = \frac{1}{n} \left( 1- \frac{n}{N} \right) \frac{n}{n-1} \hat{P} ( 1- \hat{P}) \]

Justification
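
  • Since \(E( s^2 ) = S^2\), each formula is an unbiased estimator of the corresponding sampling variance. For example, \[ E\{ \hat{V} ( \bar{y} ) \} = \frac{1}{n} \left( 1 - \frac{n}{N} \right) E( s^2 ) = \frac{1}{n} \left( 1 - \frac{n}{N} \right) S^2 = V( \bar{y} ) . \]

  • For the proportion, binary \(y_i\) gives \(s^2 = \frac{n}{n-1} \hat{P} ( 1 - \hat{P} )\), by the same computation used for \(S^2\) above.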

Summary

  • Mean Estimation
    • Point Estimator: \(\bar{y}\)
    • Variance estimator: \[{\small \hat{V} ( \bar{y}) = \frac{1}{n} \left( 1- \frac{n}{N} \right) s^2 }\]
  • Total Estimation
    • Point Estimator: \(\hat{T} = N \bar{y}\)
    • Variance estimator: \[{\small \hat{V} ( \hat{T}) = \frac{N^2}{n} \left( 1- \frac{n}{N} \right) s^2 } \]

  • Proportion estimation
    • Point estimator: \(\hat{P} = \bar{y}\)
    • Variance estimator \[{\small \hat{V} ( \hat{P}) = \frac{1}{n} \left( 1- \frac{n}{N} \right) \frac{n}{n-1} \hat{P} (1 -\hat{P} ) } \]
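
  • The three estimators fit in a few lines of code. A minimal Python sketch under SRS without replacement (the function and variable names are illustrative):

```python
import math

def srs_estimates(sample, N):
    """Point estimates and standard errors under SRS without replacement.

    sample: the observed y-values; N: the population size.
    For binary y, ybar doubles as the proportion estimator P-hat, and
    v_ybar equals (1/n)(1 - n/N)(n/(n-1)) * P-hat * (1 - P-hat).
    """
    n = len(sample)
    ybar = sum(sample) / n
    s2 = sum((yi - ybar) ** 2 for yi in sample) / (n - 1)
    fpc = 1 - n / N                # finite population correction
    v_ybar = fpc * s2 / n          # V-hat of the mean
    v_total = N ** 2 * v_ybar      # V-hat of T-hat = N * ybar
    return (ybar, math.sqrt(v_ybar)), (N * ybar, math.sqrt(v_total))

# Example with hypothetical data: n = 5 observations, population size N = 100
(mean_est, se_mean), (tot_est, se_tot) = srs_estimates([2.0, 4.0, 3.0, 5.0, 1.0], N=100)
print(mean_est, se_mean, tot_est, se_tot)
```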