Stat 421: Week 3

Jae-kwang Kim

1/28/2020

Review

  • In Week 2, we studied
    • Selection bias and Measurement bias
    • Survey Errors
    • Basic concepts of probability sampling
  • In Week 3, we will study
    • Simple random sampling (SRS) design
    • Estimation under SRS (Population mean, population total, population proportions)
    • Variance estimation under SRS

Simple random sampling (SRS) design

Notation

  • \(\mathcal{U} = \left\{ 1, \cdots, N \right\}\): index set of finite population of size \(N\)

  • \(\mathcal{S}\): index set of sample. \((\mathcal{S} \subset \mathcal{U})\)

  • \(P\left( \mathcal{S} \right)\): probability of selecting sample \(\mathcal{S}\).

  • \(\pi_i = P( i \in \mathcal{S} )\): (the first-order) inclusion probability of unit \(i\). Note that \[ \pi_i = \sum_{\mathcal{S}; i \in \mathcal{S} } P( \mathcal{S}). \]

Basic Concepts

  • Sampling design (or sampling mechanism): Enumeration of possible \(\left( \mathcal{S}, P \left( \mathcal{S} \right) \right)\)

  • Probability sampling design: sampling design with \(\pi_i>0\) for all \(i \in \mathcal{U}\).

  • Definition: Simple random sampling (of size \(n\)) design \[ \iff P( \mathcal{S}) = \left\{ \begin{array}{ll} {N \choose n}^{-1} & \mbox{ if } |\mathcal{S} | = n \\ 0 & \mbox{ otherwise} \end{array} \right. \] where \[ {N \choose n} = \frac{ N!}{ n! (N-n)!} . \]
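
  • For example, with \(N=4\) and \(n=2\), there are \({4 \choose 2} = \frac{4!}{2! \, 2!} = 6\) possible samples, and SRS assigns probability \(1/6\) to each of them.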

Remark

  • Define the following sampling indicator function \[ I_i = \left\{ \begin{array}{ll} 1 & \mbox{ if } i \in \mathcal{S} \\ 0 & \mbox{ otherwise} \end{array} \right. \]

  • By definition of \(I_i\), we have \[ \sum_{i=1}^N I_i = n \] where \(n\) is the (realized) sample size.

  • The first-order inclusion probability can be written as \[ \pi_i = Pr( I_i=1) = E( I_i) . \]

Result

  1. The sum of \(\pi_i\) over the finite population is equal to the sample size. That is, \[ \sum_{i=1}^N \pi_i = n.\]

  2. Under SRS, the \(\pi_i\) are all equal. That is, \(\pi_i= n/N\) for all \(i \in \mathcal{U}\), where \(n=|\mathcal{S}|\) and \(N=|\mathcal{U}|\).

Proof
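
  • For Result 1, write \(\pi_i = E( I_i )\) and interchange summation and expectation: \[ \sum_{i=1}^N \pi_i = \sum_{i=1}^N E( I_i ) = E\left( \sum_{i=1}^N I_i \right) = E( n ) = n . \]

  • For Result 2, there are \({N-1 \choose n-1}\) samples of size \(n\) that contain unit \(i\), and under SRS each of them has probability \({N \choose n}^{-1}\). Thus \[ \pi_i = \frac{{N-1 \choose n-1}}{{N \choose n}} = \frac{n}{N} . \]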

Question

  • We have studied that SRS implies \(\pi_i=n/N\) for all \(i\). Does the converse also hold? That is, does \(\pi_i = n/N\) for all \(i\) imply that the sampling design is SRS?

A counterexample with \(N=4\)

  Case   Sample ID   Selection Prob.
   1       1,2             0
   2       1,3            1/4
   3       1,4            1/4
   4       2,3            1/4
   5       2,4            1/4
   6       3,4             0

  • Under this design \(\pi_i = 1/4 + 1/4 = 1/2 = n/N\) for every unit \(i\), yet the design is not SRS (the samples \(\{1,2\}\) and \(\{3,4\}\) can never be selected). So the answer to the question is no, as the enumeration below confirms.
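
  • A minimal Python sketch that enumerates this design and verifies \(\pi_i = 1/2\) for every unit:

```python
from fractions import Fraction

# The non-SRS design on U = {1,2,3,4} with n = 2:
# samples {1,2} and {3,4} are never selected.
design = {
    (1, 2): Fraction(0),
    (1, 3): Fraction(1, 4),
    (1, 4): Fraction(1, 4),
    (2, 3): Fraction(1, 4),
    (2, 4): Fraction(1, 4),
    (3, 4): Fraction(0),
}

# First-order inclusion probability: pi_i = sum of P(S) over samples containing i
for i in range(1, 5):
    pi_i = sum(p for s, p in design.items() if i in s)
    print(f"pi_{i} = {pi_i}")  # prints 1/2 for every unit
```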

Selection procedure for SRS without replacement

  • Draw-by-draw Procedure
    • Select one element from universe of size \(N\) with probability \(1/N\)
    • DO NOT return the selected element to the universe (=population)
    • Select 2nd element from remaining units in universe with probability \(1/(N -1)\)
    • DO NOT return the selected element to the universe
    • Repeat until \(n\) elements have been selected
  • Selection probabilities change with each draw

\[ 1/N, 1/(N -1), 1/(N -2), \cdots , 1/(N - n +1) \]

  • From the draw-by-draw procedure, what are the following probabilities?
    • P( Unit \(i\) is selected at the first draw ) \(= \frac{1}{N}\)
    • P( Unit \(i\) is selected at the second draw ) \(= \frac{N-1}{N} \cdot \frac{1}{N-1} = \frac{1}{N}\)
    • P( Unit \(i\) is selected at the third draw ) \(= \frac{N-1}{N} \cdot \frac{N-2}{N-1} \cdot \frac{1}{N-2} = \frac{1}{N}\)
  • Thus, the probability that unit \(i\) is selected is \[ P( i \in \mathcal{S} )= \sum_{k=1}^n P( \mbox{unit $i$ is selected at the $k$-th draw}) = \frac{n}{N} . \]

SRS with replacement

  • This is another version of SRS.
  • Selection procedure for SRS with replacement (both procedures are sketched in Python after this list):
    • Select one element with probability \(1/N\) from the \(N\) elements.
    • Return the selected element to the universe (=population).
    • Repeat this \(n\) times.
  • SRS with replacement is like drawing \(n\) independent samples of size 1:
    • A sampling unit can be drawn twice, producing duplicate units.
    • Unappealing for finite population sampling: a duplicate unit carries no additional information.
    • Useful in theoretical developments for large populations.
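
  • A minimal Python sketch of both selection procedures (the function names are illustrative):

```python
import random

def srs_without_replacement(N, n, rng=random):
    """Draw-by-draw SRS: every remaining unit is equally likely at each draw."""
    universe = list(range(1, N + 1))
    sample = []
    for _ in range(n):
        pick = rng.choice(universe)  # probability 1/|universe| for each remaining unit
        universe.remove(pick)        # do NOT return the selected element
        sample.append(pick)
    return sample

def srs_with_replacement(N, n, rng=random):
    """n independent draws, each with probability 1/N; duplicates are possible."""
    return [rng.choice(range(1, N + 1)) for _ in range(n)]

# Monte Carlo check that draw-by-draw selection gives pi_i = n/N:
N, n, reps = 10, 4, 100_000
counts = [0] * (N + 1)
for _ in range(reps):
    for i in srs_without_replacement(N, n):
        counts[i] += 1
print([round(c / reps, 2) for c in counts[1:]])  # each entry close to n/N = 0.4
```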

Parameter Estimation under SRS

Sample statistic

  • Once the sample is selected, we observe \(y_i\) from the sample elements.
  • A statistic \(\hat{\theta}\) can be computed from the sample observations. That is, \(\hat{\theta}\) is a known function of \(y_i\)s in the sample.
  • \(\hat{\theta}\) is random in the sense that \(\hat{\theta}\) can vary with \(\mathcal{S}\).
  • Thus, we can write \(\hat{\theta}= \hat{\theta} ( \mathcal{S})\) to emphasize its dependence on \(\mathcal{S}\).

Sampling Distribution

  • The sampling distribution of \(\hat{\theta}\) is defined as the enumeration of all possible \((\hat{\theta}(\mathcal{S}), P( \mathcal{S}))\), where \(P( \mathcal{S})\) is the selection probability of sample \(\mathcal{S}\).

  • Using the sampling distribution, we can derive \[ E( \hat{\theta}) = \sum_{\mathcal{S}} \hat{\theta} (\mathcal{S}) P( \mathcal{S}). \]

  • If \(E( \hat{\theta})=\theta\), then \(\hat{\theta}\) is unbiased for \(\theta\).

  • Also, the variance is \[ V( \hat{\theta}) = E [ \{ \hat{\theta} - E( \hat{\theta}) \}^2 ] \] where \[ E [ \{ \hat{\theta} - E( \hat{\theta}) \}^2 ] = \sum_{\mathcal{S}} \{ \hat{\theta}( \mathcal{S} ) - E( \hat{\theta}) \}^2 P( \mathcal{S} ) \]

  • Note that \[ V( \hat{\theta}) = E( \hat{\theta}^2) - \{E( \hat{\theta}) \}^2 \] where \[ E( \hat{\theta}^2) = \sum_{\mathcal{S}} \{ \hat{\theta} (\mathcal{S}) \}^2 P( \mathcal{S}). \]

  • Standard Error of \(\hat{\theta}\): \[ SE ( \hat{\theta}) = \sqrt{ V( \hat{\theta})} \]

Example: SRS

  Case   Sample ID   Statistic (Sample mean)   Selection Prob.
   1       1,2         \((y_1 + y_2)/2\)            1/6
   2       1,3         \((y_1 + y_3)/2\)            1/6
   3       1,4         \((y_1 + y_4)/2\)            1/6
   4       2,3         \((y_2 + y_3)/2\)            1/6
   5       2,4         \((y_2 + y_4)/2\)            1/6
   6       3,4         \((y_3 + y_4)/2\)            1/6

  • Expectation of the sample mean: \[{\small E( \bar{y})= \frac{1}{6} \left\{ \frac{y_1+y_2}{2} + \cdots + \frac{y_3+y_4}{2} \right\} =\frac{1}{4}\sum_{i=1}^4 y_i. } \]
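
  • The same enumeration can be done by brute force in Python (a minimal sketch with hypothetical population values):

```python
from itertools import combinations

y = [3.0, 7.0, 1.0, 5.0]  # hypothetical population values y_1, ..., y_4
N, n = len(y), 2

# All C(4,2) = 6 samples; under SRS each has selection probability 1/6.
samples = list(combinations(range(N), n))
p = 1 / len(samples)

means = [sum(y[i] for i in s) / n for s in samples]
E_ybar = sum(m * p for m in means)
V_ybar = sum((m - E_ybar) ** 2 * p for m in means)

print(E_ybar, sum(y) / N)  # E(ybar) equals the population mean
print(V_ybar)              # the sampling variance of ybar
```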

Estimation of the population mean under SRS

  • \(\bar{Y}_U = N^{-1}\sum_{i=1}^N y_i\): population mean of \(y\), parameter of interest
  • \(\bar{y}=n^{-1} \sum_{i \in \mathcal{S}} y_i\): sample mean of \(y\). An estimator of \(\bar{Y}_U\).
  • Sampling design: SRS
  • We are interested in using \(\bar{y}\) to estimate \(\bar{Y}_U\).

Theorem 1

  • Under SRS without replacement, the sample mean \(\bar{y}\) satisfies \[ E( \bar{y} ) = \bar{Y}_U \] and \[ V( \bar{y}) = \frac{1}{n} \left( 1- \frac{n}{N}\right) S^2 , \] where \[ S^2 = \frac{1}{N-1} \sum_{i=1}^N \left( y_i - \bar{Y}_U \right)^2 . \]

Proof of Theorem 1

  • First we express \[ \bar{y} = \frac{1}{n} \sum_{i=1}^N I_i y_i \] where \[ I_i =\left\{ \begin{array}{ll} 1 & \mbox{ if } i \in \mathcal{S} \\ 0 & \mbox{ otherwise} \end{array} \right. \]
  • Expectation of \(\bar{y}\): \[\begin{eqnarray*} E( \bar{y} ) &=& E\left(\frac{1}{n} \sum_{i=1}^N I_i y_i \right)= \frac{1}{n} \sum_{i=1}^N y_i E( I_i) = \frac{1}{N} \sum_{i=1}^N y_i . \end{eqnarray*}\]

  • For the variance term, note that \[ V( \bar{y}) = E\{ (\bar{y}-\bar{Y}_U )^2 \} \]

  • We can express \[\begin{eqnarray*} \bar{y} - \bar{Y}_U &=& \frac{1}{n} \sum_{i=1}^N I_i ( y_i - \bar{Y}_U ):= \frac{1}{n} \sum_{i=1}^N I_i z_i \end{eqnarray*}\] where \(z_i = y_i - \bar{Y}_U\).

  • Thus, using \(I_i^2 = I_i\), we have
    \[\begin{eqnarray*} (\bar{y} - \bar{Y}_U)^2 &=& \frac{1}{n^2} \left\{ \sum_{i=1}^N I_i z_i^2 + \sum_{i \neq j }I_i I_j z_i z_j \right\} \\ E\{ (\bar{y} - \bar{Y}_U)^2 \} &=& \frac{1}{n^2} \left\{ \sum_{i=1}^N E(I_i) z_i^2 + \sum_{i \neq j }E( I_i I_j) z_i z_j \right\} \end{eqnarray*}\]

  • We have already seen that \(E(I_i)=n/N\).
  • For the \(E(I_iI_j)\) term with \(i \neq j\), we use \[ E( I_i I_j ) = \frac{{N-2 \choose n-2}}{N \choose n} = \frac{n(n-1)}{N(N-1)} \]
  • Also, since \(\sum_{i=1}^N z_i=0\), squaring gives \(0 = ( \sum_{i=1}^N z_i )^2 = \sum_{i=1}^N z_i^2 + \sum_{i \neq j} z_i z_j\), and thus \[ \sum_{i \neq j} z_i z_j = - \sum_{i=1}^N z_i^2 \]
  • Now, \(\sum_{i=1}^N z_i^2 = \sum_{i=1}^N (y_i - \bar{Y}_U)^2 = (N-1) S^2\).

  • Combining the above results, we can obtain \[\begin{eqnarray*} E\{ (\bar{y} - \bar{Y}_U)^2 \} &=& \frac{1}{n^2} \left\{ \frac{n}{N} - \frac{n(n-1)}{ N(N-1)} \right\} \sum_{i=1}^N z_i^2 \\ &=& \frac{1}{n^2} \frac{n}{N} \left( 1 - \frac{n-1}{N-1} \right) (N-1) S^2 \\ &=& \frac{1}{nN} (N-n) S^2 . \end{eqnarray*}\]
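
  • A numeric check of Theorem 1 by full enumeration (a minimal Python sketch; the population values are hypothetical):

```python
from itertools import combinations

y = [3.0, 7.0, 1.0, 5.0, 9.0, 2.0]  # hypothetical population
N, n = len(y), 3
Ybar_U = sum(y) / N
S2 = sum((yi - Ybar_U) ** 2 for yi in y) / (N - 1)

# Exact sampling variance of ybar: enumerate all C(N, n) samples (each has
# probability 1 / C(N, n) under SRS, and E(ybar) = Ybar_U by Theorem 1).
means = [sum(s) / n for s in combinations(y, n)]
V_exact = sum((m - Ybar_U) ** 2 for m in means) / len(means)

V_formula = (1 / n) * (1 - n / N) * S2  # Theorem 1
print(V_exact, V_formula)               # the two agree
```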

Remark

  • If \(n=N\), then \(V( \bar{y})=0\).

  • If \(n=1\), then \(\bar{y}\) is equal to the \(y\) value of the first selected element and
    \[ V( \bar{y})= \frac{1}{N} \sum_{i=1}^N ( y_i - \bar{Y}_U)^2. \] It is the variance of selecting one element at random from \(\{ y_1, \cdots, y_N\}\).

SRS with replacement

  • If the sample is SRS with replacement, then we can define \(a(k)\) to be the index of the element selected at the \(k\)-th draw.

  • The \(n\) sample elements, \(y_{a(1)}, \cdots, y_{a(n)}\), are independently and identically distributed with the distribution \[ Y_{a(k)} = \left\{ \begin{array}{ll} y_1 & \mbox{ with prob. } 1/N \\ y_2 & \mbox{ with prob. } 1/N \\ \vdots & \\ y_N & \mbox{ with prob. } 1/N \end{array} \right. \]

  • The sample mean is \[ \bar{y} = \frac{1}{ n} \sum_{k=1}^n Y_{a(k)} . \]

  • It can be shown that the sample mean \(\bar{y}\) satisfies \[ E( \bar{y} ) = \bar{Y}_U \] and \[ V( \bar{y}) = \frac{1}{n} \left( 1- \frac{1}{N}\right) S^2 . \]

  • Thus, the variance is smaller under without-replacement sampling, since \(1 - n/N \leq 1 - 1/N\), with equality only when \(n=1\).

Finite population correction factor

  • Finite population correction factor (FPC) \[ FPC = 1 - \frac{n}{N} \]

  • The sampling fraction is the proportion of the population sampled, \(n/N\).

  • The FPC is often very close to 1. For example, with \(n=100\) and \(N=100{,}000\), \(FPC = 1 - 0.001 = 0.999\).

  • When the sampling fraction is very small and the FPC is very close to 1, the FPC has no practical effect on the variance (or estimated variance) of the parameter estimate.

Estimation of the population total under SRS

  • \(T = \sum_{i=1}^N y_i\): population total of \(y\), parameter of interest

  • \(\hat{T}=N \bar{y}\): an estimator of \(T\).

  • Properties of \(\hat{T}\)

    • Unbiased
    • Variance \[ V\left( \hat{T} \right) = \frac{N^2}{n} \left( 1-\frac{n}{N} \right) S^2. \]
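
  • Justification: both properties follow from Theorem 1, since \(\hat{T} = N \bar{y}\) is a fixed multiple of \(\bar{y}\): \[ E( \hat{T} ) = N E( \bar{y} ) = N \bar{Y}_U = T, \quad \quad V( \hat{T} ) = N^2 V( \bar{y} ) = \frac{N^2}{n} \left( 1 - \frac{n}{N} \right) S^2 . \]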

Estimation of the population proportion under SRS

  • \(Y\) is binary, taking values 1 or 0, where \(Y=1\) means the unit has the characteristic of interest.

  • Parameter: the proportion of units with \(Y=1\), \(P = N^{-1} \sum_{i=1}^N y_i\).

  • \(\hat{P}=\bar{y}\): an estimator of \(P\).

  • Properties

    • Unbiased: \(E( \hat{P})= P\)
    • Variance \[ V\left( \hat{P} \right) = \frac{1}{n} \left( 1-\frac{n}{N} \right) \frac{N}{N-1} P (1-P). \]

Justification
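
  • Since each \(y_i \in \{0, 1\}\), we have \(y_i^2 = y_i\), and hence \[ \sum_{i=1}^N ( y_i - P )^2 = \sum_{i=1}^N y_i - N P^2 = NP - NP^2 = N P (1-P), \] so that \[ S^2 = \frac{N}{N-1} P (1-P) . \]

  • Plugging this \(S^2\) into Theorem 1 gives the stated variance. Unbiasedness is immediate because \(\hat{P} = \bar{y}\) and \(E( \bar{y} ) = \bar{Y}_U = P\).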

Variance Estimation

Introduction

  • Different concepts

    • Population variance: \[ S^2= \frac{1}{N-1} \sum_{i=1}^N (y_i - \bar{Y}_U)^2\]
    • (Sampling) variance of \(\bar{y}\): \(V( \bar{y})\)
    • Variance estimation of \(\bar{y}\)
  • We are interested in estimating \(V( \bar{y})\) under SRS. We use \(\hat{V} ( \bar{y})\) to denote an estimator of \(V( \bar{y})\).

  • Recall that, by Theorem 1,
    \[V( \bar{y}) = \frac{1}{n} \left( 1- \frac{n}{N}\right) S^2\]

Population variance

  • Population variance \[ \small S^2 \equiv \frac{1}{N-1}\sum_{i=1}^N \left( y_i - \bar{Y}_U \right)^2= \frac{1}{2N(N-1)} \sum_{i=1}^N \sum_{j=1}^N (y_i - y_j)^2 \]

  • To verify the equality, use

\[\begin{eqnarray*} {\small \sum_{i=1}^N \sum_{j=1}^N (y_i - y_j)^2 } & =&{\small \sum_{i=1}^N \sum_{j=1}^N \{ (y_i - \bar{Y}_U) - (y_j - \bar{Y}_U) \}^2 } \\ &=& {\small N \sum_{i=1}^N (y_i - \bar{Y}_U)^2 + N \sum_{j=1}^N (y_j - \bar{Y}_U)^2 - 2 \sum_{i=1}^N (y_i - \bar{Y}_U) \sum_{j=1}^N (y_j - \bar{Y}_U) } \\ &=& 2 N \sum_{i=1}^N (y_i - \bar{Y}_U)^2 , \end{eqnarray*}\] where the cross-product term vanishes because \(\sum_{i=1}^N (y_i - \bar{Y}_U) = 0\).

Estimation of population variance

  • Sample variance \[ {\small s^2 = \frac{1}{n-1} \sum_{i \in \mathcal{S}} \left( y_i - \bar{y} \right)^2 = \frac{1}{2n(n-1)} \sum_{i\in \mathcal{S}} \sum_{j \in \mathcal{S}} (y_i - y_j)^2} \]

  • Property (under SRS) \[ E\left( s^2 \right) = S^2 \]

Proof
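
  • Using the pairwise representation of \(s^2\) and the sampling indicators, \[ E( s^2 ) = \frac{1}{2n(n-1)} \sum_{i=1}^N \sum_{j=1}^N E( I_i I_j ) ( y_i - y_j )^2 = \frac{1}{2n(n-1)} \cdot \frac{n(n-1)}{N(N-1)} \sum_{i=1}^N \sum_{j=1}^N ( y_i - y_j )^2 , \] since \(E( I_i I_j ) = n(n-1)/\{ N(N-1) \}\) for \(i \neq j\) under SRS and the \(i = j\) terms are zero.

  • The right-hand side is \[ \frac{1}{2N(N-1)} \sum_{i=1}^N \sum_{j=1}^N ( y_i - y_j )^2 = S^2 , \] by the pairwise representation of the population variance.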

Variance estimation formula

  • For sample mean \(\bar{y}\): \[ \hat{V} ( {\bar{y}}) = \frac{1}{n} \left( 1- \frac{n}{N} \right) s^2 \]
  • For total estimator \[ \hat{V} ( \hat{T}) = \frac{N^2}{n} \left( 1- \frac{n}{N} \right) s^2 \]
  • For the proportion estimator \(\hat{P}\) \[ \hat{V} ( \hat{P}) = \frac{1}{n} \left( 1- \frac{n}{N} \right) \frac{n}{n-1} \hat{P} ( 1- \hat{P}) \]

Justification
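
  • Since \(E( s^2 ) = S^2\), each formula is an unbiased estimator of the corresponding sampling variance. For example, \[ E\{ \hat{V} ( \bar{y} ) \} = \frac{1}{n} \left( 1 - \frac{n}{N} \right) E( s^2 ) = \frac{1}{n} \left( 1 - \frac{n}{N} \right) S^2 = V( \bar{y} ) . \]

  • For the proportion, binary \(y_i\) gives \(s^2 = \frac{n}{n-1} \hat{P} ( 1 - \hat{P} )\), by the same computation used for \(S^2\) above.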

Summary

  • Mean Estimation
    • Point Estimator: \(\bar{y}\)
    • Variance estimator: \[{\small \hat{V} ( \bar{y}) = \frac{1}{n} \left( 1- \frac{n}{N} \right) s^2 }\]
  • Total Estimation
    • Point Estimator: \(\hat{T} = N \bar{y}\)
    • Variance estimator: \[{\small \hat{V} ( \hat{T}) = \frac{N^2}{n} \left( 1- \frac{n}{N} \right) s^2 } \]

  • Proportion estimation
    • Point estimator: \(\hat{P} = \bar{y}\)
    • Variance estimator \[{\small \hat{V} ( \hat{P}) = \frac{1}{n} \left( 1- \frac{n}{N} \right) \frac{n}{n-1} \hat{P} (1 -\hat{P} ) } \]
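
  • The three estimators fit in a few lines of code. A minimal Python sketch under SRS without replacement (the function and variable names are illustrative):

```python
import math

def srs_estimates(sample, N):
    """Point estimates and standard errors under SRS without replacement.

    sample: the observed y-values; N: the population size.
    For binary y, ybar doubles as the proportion estimator P-hat, and
    v_ybar equals (1/n)(1 - n/N)(n/(n-1)) * P-hat * (1 - P-hat).
    """
    n = len(sample)
    ybar = sum(sample) / n
    s2 = sum((yi - ybar) ** 2 for yi in sample) / (n - 1)
    fpc = 1 - n / N                # finite population correction
    v_ybar = fpc * s2 / n          # V-hat of the mean
    v_total = N ** 2 * v_ybar      # V-hat of T-hat = N * ybar
    return (ybar, math.sqrt(v_ybar)), (N * ybar, math.sqrt(v_total))

# Example with hypothetical data: n = 5 observations, population size N = 100
(mean_est, se_mean), (tot_est, se_tot) = srs_estimates([2.0, 4.0, 3.0, 5.0, 1.0], N=100)
print(mean_est, se_mean, tot_est, se_tot)
```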