In this lesson students will learn to:

- Use sampling weights to recreate a sample from a frequency table
- Compute weighted estimates of the mean, variance, and standard deviation
- Construct confidence intervals for a mean and a proportion from a simple random sample
Motivating example from “Sampling: Design and Analysis” by Sharon Lohr.
A university has 807 faculty members. For each faculty member, the number of refereed publications was recorded. This number is not directly available in the database, so the investigator must examine each record separately. A frequency table for the number of refereed publications is given below for an SRS of 50 faculty members.
### PUBS
pubs<-0:10
fac<-c(28, 4, 3, 4, 4, 2, 1, 0, 2, 1, 1)
## LET'S CHECK THAT n=50
sum(fac)
## [1] 50
### DATA FRAME
pubDF<-data.frame(pubs, fac)
pubDF
## pubs fac
## 1 0 28
## 2 1 4
## 3 2 3
## 4 3 4
## 5 4 4
## 6 5 2
## 7 6 1
## 8 7 0
## 9 8 2
## 10 9 1
## 11 10 1
## LOAD TIDYVERSE
library(tidyverse)
## USE GGPLOT WITH FREQUENCIES AS BAR HEIGHTS
### PLOT THE DATA
ggplot(data=pubDF, aes(x=pubs, y=fac))+
geom_col()+
ggtitle("Sample Distribution of Publications")+
xlab("Refereed Publications")+
ylab("Number of Faculty")
How would you describe the shape of this distribution?
YOUR ANSWER HERE
We can try two different approaches to show that the weighted estimator is equivalent to the usual unweighted estimator.
Weights indicate the number of units a given respondent represents. Thus, we can use these weights to “recreate” the sample using the frequency table.
## TAKE ONE
## WEIGHTS CAN BE USED TO RE-CREATE!
fullSamp<-c()
for(i in 1:length(fac)){
fullSamp<-c(fullSamp,
rep(pubs[i], fac[i])) # REPLICATE
}
fullSamp
## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [26] 0 0 0 1 1 1 1 2 2 2 3 3 3 3 4 4 4 4 5 5 6 8 8 9 10
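Since `rep()` accepts a vector for its `times` argument, the loop above can also be collapsed into a single base-R call:

## TAKE ONE, VECTORIZED
fullSamp2<-rep(pubs, times=fac) # REPLICATE EACH VALUE BY ITS FREQUENCY
identical(fullSamp, fullSamp2)
## [1] TRUE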
Consider the following common sample statistics:
## SAMPLE MEAN
mean(fullSamp)
## [1] 1.78
## SAMPLE VAR
var(fullSamp)
## [1] 7.19551
## SAMPLE SD
sd(fullSamp)
## [1] 2.682445
sqrt(var(fullSamp)) # CHECKS OUT!
## [1] 2.682445
Let’s create a column of inclusion probabilities.
## SRS INCLUSION PROBABILITY
fullSamp_DF<-data.frame(publications=fullSamp)%>%
mutate(inc_i=50/807)
head(fullSamp_DF)
## publications inc_i
## 1 0 0.06195787
## 2 0 0.06195787
## 3 0 0.06195787
## 4 0 0.06195787
## 5 0 0.06195787
## 6 0 0.06195787
The sampling weight is defined as the inverse of the inclusion probability.
## SAMPLING WEIGHT
fullSamp_DF<-fullSamp_DF%>%
mutate(wgt_i=1/inc_i)
head(fullSamp_DF)
## publications inc_i wgt_i
## 1 0 0.06195787 16.14
## 2 0 0.06195787 16.14
## 3 0 0.06195787 16.14
## 4 0 0.06195787 16.14
## 5 0 0.06195787 16.14
## 6 0 0.06195787 16.14
Typically, a weight indicates the number of units in the population (not the sample) that a respondent represents.
\[\bar{y}=\frac{\sum_{i=1}^n w_iy_i}{\sum_{i=1}^n w_i}\]
### WEIGHTED OBS
fullSamp_DF<-fullSamp_DF%>%
mutate(wgt_obs=wgt_i*publications)
head(fullSamp_DF)
## publications inc_i wgt_i wgt_obs
## 1 0 0.06195787 16.14 0
## 2 0 0.06195787 16.14 0
## 3 0 0.06195787 16.14 0
## 4 0 0.06195787 16.14 0
## 5 0 0.06195787 16.14 0
## 6 0 0.06195787 16.14 0
### Y BAR WEIGHTED
### NUMERATOR
num<-sum(fullSamp_DF$wgt_obs)
num
## [1] 1436.46
### DENOMINATOR
den<-sum(fullSamp_DF$wgt_i)
den
## [1] 807
### Y BAR
num/den
## [1] 1.78
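Base R's `weighted.mean()` computes this same ratio in one call, which gives a quick check:

### CHECK WITH weighted.mean()
weighted.mean(fullSamp_DF$publications, w=fullSamp_DF$wgt_i)
## [1] 1.78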
Note that this is the same as the sample mean without the weights. This is why simple random samples are known as self-weighting samples.
For the second approach, we work directly from the frequency table. First, observe that the sample size is given by the sum of the frequencies in the table.
In this case, since the table summarizes the publication counts, the sum of the weights gives the size of the sample rather than the size of the population; however, the principles are the same.
## TAKE TWO: USE THE FREQUENCIES AS WEIGHTS
## NUMERATOR
sum(pubs*fac)
## [1] 89
## DENOMINATOR
n<-sum(fac)
n
## [1] 50
## SAMPLE MEAN
y_bar<-sum(pubs*fac)/n
y_bar
## [1] 1.78
Recall: The equation for the sample variance is given by
\[s^2=\frac{1}{n-1}\sum_{i=1}^n (y_i-\bar{y})^2\]
The sample standard deviation is given by
\[s=\sqrt{\frac{1}{n-1}\sum_{i=1}^n (y_i-\bar{y})^2}\]
Using the frequency table of the sample as weights, we can reconstruct the sample variance with
\[ s^2=\frac{\sum_{i=1}^nw_i(y_i-\bar{y})^2}{(\sum_{i=1}^nw_i)-1}\]
## SAMPLE VARIANCE
sampVar<-(1/49)*sum(fac*(pubs-y_bar)^2)
sampVar
## [1] 7.19551
## SAMPLE STANDARD DEV
sampSD<-sqrt(sampVar)
sampSD
## [1] 2.682445
Observe that these are equivalent to using the recreated sample!
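We can confirm the equivalence directly:

## CONFIRM EQUIVALENCE WITH THE RECREATED SAMPLE
all.equal(sampVar, var(fullSamp))
## [1] TRUE
all.equal(sampSD, sd(fullSamp))
## [1] TRUE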
Let’s assume that the Central Limit Theorem applies, so that it is safe to construct a confidence interval for the mean.
The confidence interval has the form: \[\text{Point Estimate} \pm \text{Critical Value} \times \text{Standard Error}\]
The point estimate for a confidence interval for the mean is the weighted average, \(\bar{y}\).
The critical value creates a buffer such that \((1-\alpha)\times 100\%\) of the area in the asymptotic sampling distribution is contained around the mean; it is the \(\left(1-\frac{\alpha}{2}\right)\) quantile of that distribution. A common significance level is \(\alpha=0.05\), which results in \(95\%\) confidence.
If the population standard deviation is known, we use critical values from the Standard Normal (Z) distribution.
## Z CRITICAL VALUE
zCritVal<-qnorm(0.975)
zCritVal
## [1] 1.959964
In reality, we usually have to estimate the population standard deviation with the sample standard deviation. To convey less certainty in the mean, we use a t-distribution, which is more diffuse (it has heavier tails).
## WHAT ABOUT T DISTR?
tCritVal<-qt(0.975, df=n-1)
tCritVal
## [1] 2.009575
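Note that as the degrees of freedom grow, the t critical value shrinks toward the Z critical value:

## T APPROACHES Z AS df GROWS
qt(0.975, df=c(10, 50, 1000)) # DECREASES TOWARD qnorm(0.975), ABOUT 1.96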
The standard error for the mean takes the form
\[SE(\bar{y})=\sqrt{\left(1-\frac{n}{N}\right)\frac{s^2}{n}}\]
where \(\left(1-\frac{n}{N}\right)\) is known as the finite population correction and \(\frac{n}{N}\) is the sampling fraction. As the sample becomes a larger fraction of the population, the variability decreases.
### Finite Population Correction
n<-50
N<-807
## SAMPLING FRACTION
sampFrac<-n/N
sampFrac
## [1] 0.06195787
## FPC
thisFPC<-1-sampFrac
thisFPC
## [1] 0.9380421
## STANDARD ERROR
standErr<-sqrt(thisFPC*(sampVar/n))
standErr
## [1] 0.3674151
## Z CI
y_bar+c(-1,1)*zCritVal*standErr
## [1] 1.05988 2.50012
## NEW T CI
y_bar+c(-1,1)*tCritVal*standErr
## [1] 1.041652 2.518348
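For reuse, the steps above can be wrapped in a small helper function. This is just a sketch; the function name `srsMeanCI` is our own, not from the text:

## HELPER FUNCTION (SKETCH, NOT FROM THE TEXT)
srsMeanCI<-function(y_bar, s2, n, N, alpha=0.05){
  se<-sqrt((1-n/N)*(s2/n)) # FPC-ADJUSTED STANDARD ERROR
  crit<-qt(1-alpha/2, df=n-1) # T CRITICAL VALUE
  y_bar+c(-1,1)*crit*se # LOWER AND UPPER BOUNDS
}
srsMeanCI(y_bar, sampVar, n=50, N=807)
## [1] 1.041652 2.518348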
In statistics the third standardized moment is used to quantify skewness. Sugden et al. (2000) used it to recommend a minimum sample size in the presence of skewness:
\[n_{min}=28+25\left(\frac{\sum_{i=1}^n(y_i-\bar{y})^3}{ns^3}\right)^2\]
using the values \(\bar{y}\) and \(s\) obtained from the sample.
For this sample we find:
### SKEWNESS
### EQN by Sugden et al (2000)
### SAMPLE THIRD CENTRAL MOMENT
thirdMom<-(1/50)*sum(fac*(pubs-y_bar)^3)
thirdMom
## [1] 28.9247
### MIN SAMP SIZE
28+25*(thirdMom/sampSD^3)^2
## [1] 84.14267
Oh no! What does this mean?!
The recommended minimum sample size of about 84 exceeds our sample size of \(n=50\), so the Central Limit Theorem will NOT kick in and, therefore, we should not trust these confidence intervals.
We can follow a similar recipe to construct confidence intervals for proportions.
Now estimate the proportion of faculty members with no publications and give a \(95\%\) confidence interval.
## CI for PROP
## BINARY
noPubs<-pubs==0
noPubs
## [1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## P HAT
## SAMPLE MEAN
p_hat<-sum(noPubs*fac)/n
p_hat
## [1] 0.56
## SE PROP
standErr_Prop<-sqrt(thisFPC*(p_hat*(1-p_hat)/n))
standErr_Prop
## [1] 0.06799023
## TOGETHER
## Z CI
p_hat+c(-1,1)*zCritVal*standErr_Prop
## [1] 0.4267416 0.6932584
Although the normal approximation is questionable for this example, for the sake of completeness we will also include how to estimate the population total.
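For an SRS, the population total is estimated by scaling the sample mean by \(N\): \(\hat{t}=N\bar{y}\), with \(SE(\hat{t})=N\times SE(\bar{y})\). A minimal sketch using the quantities above (keeping the CLT caveat in mind):

## ESTIMATE THE POPULATION TOTAL: t_hat = N * y_bar
t_hat<-N*y_bar
t_hat
## [1] 1436.46
## SE OF THE TOTAL: N * SE(y_bar)
standErr_Tot<-N*standErr
## Z CI FOR THE TOTAL
t_hat+c(-1,1)*zCritVal*standErr_Tot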