In this lesson students will learn to:

- Use sampling weights to recreate a sample from a frequency table
- Compute weighted estimates of the mean, variance, and standard deviation
- Construct confidence intervals for a mean and a proportion from a simple random sample
Motivating example from “Sampling: Design and Analysis” by Sharon Lohr.
A university has 807 faculty members. For each faculty member, the number of refereed publications was recorded. This number is not directly available in the database, so the investigator must examine each record separately. A frequency table for the number of refereed publications is given below for an SRS of 50 faculty members.
### PUBS
pubs<-0:10
fac<-c(28, 4, 3, 4, 4, 2, 1, 0, 2, 1, 1)
## LET'S CHECK THAT n=50
sum(fac)
## [1] 50
### DATA FRAME
pubDF<-data.frame(pubs, fac)
pubDF
## pubs fac
## 1 0 28
## 2 1 4
## 3 2 3
## 4 3 4
## 5 4 4
## 6 5 2
## 7 6 1
## 8 7 0
## 9 8 2
## 10 9 1
## 11 10 1
## LOAD TIDYVERSE
library(tidyverse)
## USE GGPLOT WITH FREQUENCIES AS BAR HEIGHTS
### PLOT THE DATA
ggplot(data=pubDF, aes(x=pubs, y=fac))+
geom_col()+
ggtitle("Sample Distribution of Publications")+
xlab("Refereed Publications")+
ylab("Number of Faculty")
How would you describe the shape of this distribution?
YOUR ANSWER HERE
We can try two different approaches to show that the weighted estimator is equivalent to the usual unweighted estimator.
Weights indicate the number of units a given respondent represents. Thus, we can use these weights to “recreate” the sample using the frequency table.
## TAKE ONE
## WEIGHTS CAN BE USED TO RE-CREATE!
fullSamp<-c()
for(i in 1:length(fac)){
fullSamp<-c(fullSamp,
rep(pubs[i], fac[i])) # REPLICATE
}
fullSamp
## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [26] 0 0 0 1 1 1 1 2 2 2 3 3 3 3 4 4 4 4 5 5 6 8 8 9 10
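Since `rep()` accepts a vector for its `times` argument, the loop above can also be collapsed into a single base-R call:

## TAKE ONE, VECTORIZED
fullSamp2<-rep(pubs, times=fac) # REPLICATE EACH VALUE BY ITS FREQUENCY
identical(fullSamp, fullSamp2)
## [1] TRUE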
Consider the following common sample statistics:
## SAMPLE MEAN
mean(fullSamp)
## [1] 1.78
## SAMPLE VAR
var(fullSamp)
## [1] 7.19551
## SAMPLE SD
sd(fullSamp)
## [1] 2.682445
sqrt(var(fullSamp)) # CHECKS OUT!
## [1] 2.682445
Let’s create a column of inclusion probabilities.
## SRS INCLUSION PROBABILITY
fullSamp_DF<-data.frame(publications=fullSamp)%>%
mutate(inc_i=50/807)
head(fullSamp_DF)
## publications inc_i
## 1 0 0.06195787
## 2 0 0.06195787
## 3 0 0.06195787
## 4 0 0.06195787
## 5 0 0.06195787
## 6 0 0.06195787
The sampling weight is defined as the inverse of the inclusion probability.
## SAMPLING WEIGHT
fullSamp_DF<-fullSamp_DF%>%
mutate(wgt_i=1/inc_i)
head(fullSamp_DF)
## publications inc_i wgt_i
## 1 0 0.06195787 16.14
## 2 0 0.06195787 16.14
## 3 0 0.06195787 16.14
## 4 0 0.06195787 16.14
## 5 0 0.06195787 16.14
## 6 0 0.06195787 16.14
Typically, a weight indicates the number of units in the population (not the sample) that a respondent represents.
\[\bar{y}=\frac{\sum_{i=1}^n w_iy_i}{\sum_{i=1}^n w_i}\]
### WEIGHTED OBS
fullSamp_DF<-fullSamp_DF%>%
mutate(wgt_obs=wgt_i*publications)
head(fullSamp_DF)
## publications inc_i wgt_i wgt_obs
## 1 0 0.06195787 16.14 0
## 2 0 0.06195787 16.14 0
## 3 0 0.06195787 16.14 0
## 4 0 0.06195787 16.14 0
## 5 0 0.06195787 16.14 0
## 6 0 0.06195787 16.14 0
### Y BAR WEIGHTED
### NUMERATOR
num<-sum(fullSamp_DF$wgt_obs)
num
## [1] 1436.46
### DENOMINATOR
den<-sum(fullSamp_DF$wgt_i)
den
## [1] 807
### Y BAR
num/den
## [1] 1.78
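Base R's `weighted.mean()` computes this same ratio in one call, which gives a quick check:

### CHECK WITH weighted.mean()
weighted.mean(fullSamp_DF$publications, w=fullSamp_DF$wgt_i)
## [1] 1.78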
Note that this is the same as the sample mean without the weights. This is why simple random samples are known as self-weighting samples.
For the second approach, we work directly from the frequency table. First, observe that the sample size is given by the sum of the frequencies in the table.
In this case, since the table summarizes the publication counts, the sum of the weights gives the size of the sample rather than the size of the population; however, the principles are the same.
## TAKE TWO: USE THE FREQUENCIES AS WEIGHTS
## NUMERATOR
sum(pubs*fac)
## [1] 89
## DENOMINATOR
n<-sum(fac)
n
## [1] 50
## SAMPLE MEAN
y_bar<-sum(pubs*fac)/n
y_bar
## [1] 1.78
Recall: The equation for the sample variance is given by
\[s^2=\frac{1}{n-1}\sum_{i=1}^n (y_i-\bar{y})^2\]
The sample standard deviation is given by
\[s=\sqrt{\frac{1}{n-1}\sum_{i=1}^n (y_i-\bar{y})^2}\]
Using the frequency table of the sample as weights, we can reconstruct the sample variance with
\[ s^2=\frac{\sum_{i=1}^nw_i(y_i-\bar{y})^2}{(\sum_{i=1}^nw_i)-1}\]
## SAMPLE VARIANCE
sampVar<-(1/49)*sum(fac*(pubs-y_bar)^2)
sampVar
## [1] 7.19551
## SAMPLE STANDARD DEV
sampSD<-sqrt(sampVar)
sampSD
## [1] 2.682445
Observe that these are equivalent to using the recreated sample!
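We can confirm the equivalence directly:

## CONFIRM EQUIVALENCE WITH THE RECREATED SAMPLE
all.equal(sampVar, var(fullSamp))
## [1] TRUE
all.equal(sampSD, sd(fullSamp))
## [1] TRUE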
Let’s assume that the Central Limit Theorem applies, so that it is safe to construct a confidence interval for the mean.
The confidence interval has the form: \[\text{Point Estimate} \pm \text{Critical Value} \times \text{Standard Error}\]
The point estimate for a confidence interval for the mean is the weighted average, \(\bar{y}\).
The critical value creates a buffer such that \((1-\alpha)\times 100\%\) of the area in the asymptotic sampling distribution is contained around the mean; it is the \(\left(1-\frac{\alpha}{2}\right)\) quantile of that distribution. A common significance level is \(\alpha=0.05\), which results in \(95\%\) confidence.
If the population standard deviation is known, we use critical values from the Standard Normal (Z) distribution.
## Z CRITICAL VALUE
zCritVal<-qnorm(0.975)
zCritVal
## [1] 1.959964
In reality, we usually have to estimate the population standard deviation with the sample standard deviation. To convey less certainty in the mean, we use a t-distribution, which is more diffuse (it has heavier tails).
## WHAT ABOUT T DISTR?
tCritVal<-qt(0.975, df=n-1)
tCritVal
## [1] 2.009575
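Note that as the degrees of freedom grow, the t critical value shrinks toward the Z critical value:

## T APPROACHES Z AS df GROWS
qt(0.975, df=c(10, 50, 1000)) # DECREASES TOWARD qnorm(0.975), ABOUT 1.96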
The standard error for the mean takes the form
\[SE(\bar{y})=\sqrt{\left(1-\frac{n}{N}\right)\frac{s^2}{n}}\]
where \(\left(1-\frac{n}{N}\right)\) is known as the finite population correction and \(\frac{n}{N}\) is the sampling fraction. As the sample becomes a larger fraction of the population, the variability decreases.
### Finite Population Correction
n<-50
N<-807
## SAMPLING FRACTION
sampFrac<-n/N
sampFrac
## [1] 0.06195787
## FPC
thisFPC<-1-sampFrac
thisFPC
## [1] 0.9380421
## STANDARD ERROR
standErr<-sqrt(thisFPC*(sampVar/n))
standErr
## [1] 0.3674151
## Z CI
y_bar+c(-1,1)*zCritVal*standErr
## [1] 1.05988 2.50012
## NEW T CI
y_bar+c(-1,1)*tCritVal*standErr
## [1] 1.041652 2.518348
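For reuse, the steps above can be wrapped in a small helper function. This is just a sketch; the function name `srsMeanCI` is our own, not from the text:

## HELPER FUNCTION (SKETCH, NOT FROM THE TEXT)
srsMeanCI<-function(y_bar, s2, n, N, alpha=0.05){
  se<-sqrt((1-n/N)*(s2/n)) # FPC-ADJUSTED STANDARD ERROR
  crit<-qt(1-alpha/2, df=n-1) # T CRITICAL VALUE
  y_bar+c(-1,1)*crit*se # LOWER AND UPPER BOUNDS
}
srsMeanCI(y_bar, sampVar, n=50, N=807)
## [1] 1.041652 2.518348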
In statistics the third standardized moment is used to quantify skewness. Sugden et al. (2000) used it to recommend a minimum sample size in the presence of skewness:
\[n_{min}=28+25\left(\frac{\sum_{i=1}^n(y_i-\bar{y})^3}{ns^3}\right)^2\]
using the values \(\bar{y}\) and \(s\) obtained from the sample.
For this sample we find:
### SKEWNESS
### EQN by Sugden et al (2000)
### SAMPLE THIRD CENTRAL MOMENT
thirdMom<-(1/50)*sum(fac*(pubs-y_bar)^3)
thirdMom
## [1] 28.9247
### MIN SAMP SIZE
28+25*(thirdMom/sampSD^3)^2
## [1] 84.14267
Oh no! What does this mean?!
The recommended minimum sample size of about 84 exceeds our sample size of \(n=50\), so the Central Limit Theorem will NOT kick in and, therefore, we should not trust these confidence intervals.
We can follow a similar recipe to construct confidence intervals for proportions.
Now estimate the proportion of faculty members with no publications and give a \(95\%\) confidence interval.
## CI for PROP
## BINARY
noPubs<-pubs==0
noPubs
## [1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## P HAT
## SAMPLE MEAN
p_hat<-sum(noPubs*fac)/n
p_hat
## [1] 0.56
## SE PROP
standErr_Prop<-sqrt(thisFPC*(p_hat*(1-p_hat)/n))
standErr_Prop
## [1] 0.06799023
## TOGETHER
## Z CI
p_hat+c(-1,1)*zCritVal*standErr_Prop
## [1] 0.4267416 0.6932584
Although the normal approximation is questionable for this example, for the sake of completeness we will also include how to estimate the population total.
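For an SRS, the population total is estimated by scaling the sample mean by \(N\): \(\hat{t}=N\bar{y}\), with \(SE(\hat{t})=N\times SE(\bar{y})\). A minimal sketch using the quantities above (keeping the CLT caveat in mind):

## ESTIMATE THE POPULATION TOTAL: t_hat = N * y_bar
t_hat<-N*y_bar
t_hat
## [1] 1436.46
## SE OF THE TOTAL: N * SE(y_bar)
standErr_Tot<-N*standErr
## Z CI FOR THE TOTAL
t_hat+c(-1,1)*zCritVal*standErr_Tot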