DATA429/599: Introduction to Probability Sampling

Learning Objectives

In this lesson students will learn to:

Implement fundamental probability sampling methods (simple random sample, stratified sample, cluster sample, and systematic sample)

Based on notes from “Sampling: Design and Analysis” by Sharon Lohr.

I. Basic Sampling Methods

In probability samples each unit in the population has a known population of selection and a random mechanism is used to select units.

Suppose that we have a population with unique IDs 1 through 100 and we want to select 20.

A. Simple Random Sample

In a simple random sample every unit has an equal chance of selection.

### SRS
## RANDOMLY CHOOSE 20 IDS
srs<-sample(1:100, size=20)

srs

##  [1]  69  65  50  98  79  38  20  10 100  29  47   2  23  39  42  95  71  82  18
## [20]  21

## NOTE THIS IS DONE WITHOUT REPLACEMENT

B. Stratified Random Sample

Stratified sampling is a useful approach when the population can be divided into distinct subgroups, where individuals within each subgroup are more alike, and there are notable differences between the subgroups. In stratified sampling observations are randomly chosen from each strata, which ensures that each strata is represented.

Example: Suppose that the population is divided into five strata (1 through 20, 21 through 40, 41 through 60, 61 through 80, and 81 through 100). We want to select 4 from each of the five strata to have a total sample of size 20.

### STRATIFIED
strat<-c(sample(1:20, size=4), 
         sample(21:40, size=4), 
         sample(41:60, size=4), 
         sample(61:80, size=4),
         sample(81:100, size=4))

strat

##  [1] 19 15 11 17 26 21 25 30 54 51 46 56 65 71 73 69 98 83 96 99

C. Cluster Sample

Cluster sampling is a suitable method when the population is divided into natural groups or clusters, where units within each cluster may differ, but the clusters themselves are more similar to each other. Random clusters are then selected and then a census is taken with the selected clusters.

Motivation: Suppose there are 20 clusters within the population, such that every 5 IDs are grouped together into a cluster. We want to select 4 clusters and then observe all 5 IDs within the associated cluster.

### CLUSTER IDs
cluster<-sample(1:20, size=4)
cluster

## [1]  1 19  6 11

### CLUSTER Start Value
clusterRange<-(cluster-1)*5+1
clusterRange

## [1]  1 91 26 51

### CLUSTER Census
clusterUn<-c(clusterRange[1]+0:4, 
             clusterRange[2]+0:4,
             clusterRange[3]+0:4,
             clusterRange[4]+0:4)
clusterUn

##  [1]  1  2  3  4  5 91 92 93 94 95 26 27 28 29 30 51 52 53 54 55

D. Systematic Sample

Systematic sampling is simple yet efficient, particularly if the data are ordered. In systematic sampling a random starting point is chosen and then every \(k^{th}\) ID is selected.

### SYSTEM
## RANDOM START
systStart<-sample(1:5, size=1)
systStart

## [1] 3

## EVERY 5th OBS
sysSamp<-seq(systStart, 100, by=5)
sysSamp

##  [1]  3  8 13 18 23 28 33 38 43 48 53 58 63 68 73 78 83 88 93 98

length(sysSamp)

## [1] 20

Let’s Bring It All Together!

library(tidyverse)

### PUT TOGETHER
sampleDF<-data.frame(srs, strat, clusterUn, sysSamp)

head(sampleDF)

##   srs strat clusterUn sysSamp
## 1  69    19         1       3
## 2  65    15         2       8
## 3  50    11         3      13
## 4  98    17         4      18
## 5  79    26         5      23
## 6  38    21        91      28

### TIDY TO PLOT
tidySamp<-sampleDF%>%
  pivot_longer(cols=srs:sysSamp, 
               names_to = "SampType", values_to = "sample")

## RELEVEL
tidySamp$SampType<-factor(tidySamp$SampType, 
                          levels=c("srs", "strat", "clusterUn", "sysSamp"))

### LABELS FOR FACETS
samp_names <- c(
  `srs` = "Simple Random",
  `strat` = "Stratified",
  `clusterUn` = "Cluster",
  `sysSamp` = "Systematic"
)

### GGPLOT
ggplot(tidySamp, aes(x=sample, y=1, color=SampType))+
  geom_point()+
  facet_grid(SampType~., labeller = as_labeller(samp_names))+
  theme_bw()+
  theme(axis.title.y=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks.y=element_blank(),
        legend.position="none")

II. Framework for Probability Sampling

Suppose we have a finite population, or universe, of size \(N\) and we would like to select of sample of size \(n\).

A. Subsets/Samples

The number of subsets of size \(n\) from the population of size \(N\) can be enumerated by \[{N\choose n}=\frac{N!}{n!(N-n!)}\]

### HOW MANY COMBINATIONS? 
### POP SIZE 4
### SAMP SIZE 2
allCombos<-combn(1:4, 2)
allCombos

##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    1    1    1    2    2    3
## [2,]    2    3    4    3    4    4

## DIM
dim(allCombos)

## [1] 2 6

Your Turn!

Suppose you have a population of size \(N=10\), how many unique samples of size \(n=3\) can be drawn (order doesn’t matter).

## INSERT CODE HERE ##

B. Probability Models

Each possible sample from the population has a known probability of selection.

Recall: Probability Axioms - Non-negativity: Probability for each subset is greater than or equal to 0 - Normalization: The sum of the probabilities for each sample is equal to 1

Cosider the following two sampling plans (distributions):

MODEL 1: MODEL 2:

### MODEL 1
probs1<-rep(1/6, 6)

### MODEL 2
probs2<-c(1/3, 1/6, 0, 0, 0, 1/2)

We can randomly select a sample using the first sampling plan.

### A SINGLE SELECTION FROM DISTR 1
obsProb1<-sample(1:6, size=1, prob=probs1)

### WHICH SUBSET DID WE CHOOSE?
allCombos[,obsProb1]

## [1] 1 3

Now let’s simulate that several times to see the distribution of how often different IDs are chosen.

### SIMULATE REPEATED DRAWS FROM THE POPULATION
### USE SAMPLING PLAN 1
simProb1<-sample(1:6, size=1000, prob=probs1, replace = TRUE)

### WHICH OBS?
simObs1<-as.numeric(allCombos[,simProb1])


## PROB OF SELECTION
table(simObs1)/1000

## simObs1
##     1     2     3     4 
## 0.507 0.507 0.507 0.479

## PLOT
as.data.frame(simObs1)%>%
  ggplot()+
  geom_histogram(aes(x=simObs1))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Your Turn!

Simulate using the second sampling plan. What do you observe?

### SIMULATE REPEATED DRAWS FROM THE POPULATION
### USE SAMPLING PLAN 2
simProb2<-sample(1:6, size=1000, prob=probs2, replace = TRUE)


### WHICH OBS
simObs2<-as.numeric(allCombos[,simProb2])

## PROB OF SELECTION
table(simObs2)/1000

## simObs2
##     1     2     3     4 
## 0.503 0.354 0.646 0.497

## PLOT
as.data.frame(simObs2)%>%
  ggplot()+
  geom_histogram(aes(x=simObs2))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Since each sample has a known probability of selection, we can calculate the probability of each unit being included in a sample. This probability is known as the inclusion probability , \(\pi_i\), which is the sum of the probabilities of all possible samples that contain unit \(i\).

For example, in the second sampling plan, \[\pi_1=\frac{1}{3}+\frac{1}{6}=\frac{1}{2}\]

Your Turn!

Calculate the (theoretic) probabilities of inclusion for 2, 3, and 4.

## YOUR CODE HERE ##

III. Sampling Distributions

Consider the following population with IDs and values (\(y\)).

### POPULATION
popData<-data.frame(IDs=1:8, 
                    y=c(1, 2, 4, 4, 7, 7, 7, 8))

popData

##   IDs y
## 1   1 1
## 2   2 2
## 3   3 4
## 4   4 4
## 5   5 7
## 6   6 7
## 7   7 7
## 8   8 8

Knowledge Check!

How many ways are there to select a subset of size \(n=4\) from this population of size \(N=8\)?

## YOUR CODE ##

What are the subsets of size \(n=4\)?

### SUBSETS OF IDS
### POP SIZE 8
### SAMPLE SIZE 4
combos_N8n4<-combn(1:8, 4)
combos_N8n4

##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
## [1,]    1    1    1    1    1    1    1    1    1     1     1     1     1     1
## [2,]    2    2    2    2    2    2    2    2    2     2     2     2     2     2
## [3,]    3    3    3    3    3    4    4    4    4     5     5     5     6     6
## [4,]    4    5    6    7    8    5    6    7    8     6     7     8     7     8
##      [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25] [,26]
## [1,]     1     1     1     1     1     1     1     1     1     1     1     1
## [2,]     2     3     3     3     3     3     3     3     3     3     3     4
## [3,]     7     4     4     4     4     5     5     5     6     6     7     5
## [4,]     8     5     6     7     8     6     7     8     7     8     8     6
##      [,27] [,28] [,29] [,30] [,31] [,32] [,33] [,34] [,35] [,36] [,37] [,38]
## [1,]     1     1     1     1     1     1     1     1     1     2     2     2
## [2,]     4     4     4     4     4     5     5     5     6     3     3     3
## [3,]     5     5     6     6     7     6     6     7     7     4     4     4
## [4,]     7     8     7     8     8     7     8     8     8     5     6     7
##      [,39] [,40] [,41] [,42] [,43] [,44] [,45] [,46] [,47] [,48] [,49] [,50]
## [1,]     2     2     2     2     2     2     2     2     2     2     2     2
## [2,]     3     3     3     3     3     3     3     4     4     4     4     4
## [3,]     4     5     5     5     6     6     7     5     5     5     6     6
## [4,]     8     6     7     8     7     8     8     6     7     8     7     8
##      [,51] [,52] [,53] [,54] [,55] [,56] [,57] [,58] [,59] [,60] [,61] [,62]
## [1,]     2     2     2     2     2     3     3     3     3     3     3     3
## [2,]     4     5     5     5     6     4     4     4     4     4     4     5
## [3,]     7     6     6     7     7     5     5     5     6     6     7     6
## [4,]     8     7     8     8     8     6     7     8     7     8     8     7
##      [,63] [,64] [,65] [,66] [,67] [,68] [,69] [,70]
## [1,]     3     3     3     4     4     4     4     5
## [2,]     5     5     6     5     5     5     6     6
## [3,]     6     7     7     6     6     7     7     7
## [4,]     8     8     8     7     8     8     8     8

## DIM
dim(combos_N8n4)

## [1]  4 70

The first subset contains the IDs 1, 2, 3, and 4. These correspond to the following values:

### MAP IDs to VALUES
popData$y[combos_N8n4[,1]]

## [1] 1 2 4 4

vals<-popData$y[combos_N8n4[,1]]

The sample mean \(\bar{y}\) is one of the most common statistics.

### SAMPLE MEAN
mean(vals)

## [1] 2.75

Suppose that we really want to calculate the total and not the mean. We can estimate the total using the estimator \[\hat{t}=N\bar{y}\]

### ESTIMATE THE TOTAL
8*mean(vals)

## [1] 22

We can repeat this for every sample to yield the sampling distribution. This will show us how variable the statistic is.

### ALL TOTALS
t_hat<-c()
for(i in 1:70){ 
  these_vals<-popData$y[combos_N8n4[,i]]
  this_ybar<-mean(these_vals)
  this_that<-8*this_ybar
  t_hat<-c(t_hat, this_that)
}


## TABLE OF TOTALS
table(t_hat)

## t_hat
## 22 28 30 32 34 36 38 40 42 44 46 48 50 52 58 
##  1  6  2  3  7  4  6 12  6  4  7  3  2  6  1

## DISTR OF TOTALS
distr<-table(t_hat)/70
distr

## t_hat
##         22         28         30         32         34         36         38 
## 0.01428571 0.08571429 0.02857143 0.04285714 0.10000000 0.05714286 0.08571429 
##         40         42         44         46         48         50         52 
## 0.17142857 0.08571429 0.05714286 0.10000000 0.04285714 0.02857143 0.08571429 
##         58 
## 0.01428571

A. Expected Value

The expected value is the mean of the sampling distribution of a statistic. This can be thought of as a weighted average, using the probability.

\[E[\hat{t}]=\sum_k k Pr(\hat{t}=k)\]

### EXPECTED VALUE
tots<-as.numeric(names(table(t_hat)))
tots

##  [1] 22 28 30 32 34 36 38 40 42 44 46 48 50 52 58

probs<-as.numeric(table(t_hat)/70)
probs

##  [1] 0.01428571 0.08571429 0.02857143 0.04285714 0.10000000 0.05714286
##  [7] 0.08571429 0.17142857 0.08571429 0.05714286 0.10000000 0.04285714
## [13] 0.02857143 0.08571429 0.01428571

### SUMMANDS
tots*probs

##  [1] 0.3142857 2.4000000 0.8571429 1.3714286 3.4000000 2.0571429 3.2571429
##  [8] 6.8571429 3.6000000 2.5142857 4.6000000 2.0571429 1.4285714 4.4571429
## [15] 0.8285714

### EXPECTED VALUE
sum(tots*probs)

## [1] 40

How far off are we?

B. Bias

This bias is the mathematical difference between the expected value of the estimator and the true value.

\[Bias[\hat{t}]=E[\hat{t}]-t\]

In reality, we don’t usually know the true value.

In this case, we know the true value.

### TRUE TOTAL
sum(popData$y)

## [1] 40

Since the difference is zero, we would say that this estimator is unbiased.

C. Variance

Variance describes how spread out the estimators are.

\[V[\hat{t}]=E[(\hat{t}-E[\hat{t}])^2]=\sum_k P(\hat{t}=k)(k-E[\hat{t}])^2\]

### VARIANCE
### SUMMANDS
(tots-40)^2*probs

##  [1]  4.6285714 12.3428571  2.8571429  2.7428571  3.6000000  0.9142857
##  [7]  0.3428571  0.0000000  0.3428571  0.9142857  3.6000000  2.7428571
## [13]  2.8571429 12.3428571  4.6285714

### EXPECTED VALUE
sum((tots-40)^2*probs)

## [1] 54.85714

D. Mean Squared Error

\[MSE[\hat{t}]=V[\hat{t}]+[Bias{\hat{t}}]^2\]

DATA429/599: Introduction to Probability Sampling

HKS

Learning Objectives

I. Basic Sampling Methods

A. Simple Random Sample

B. Stratified Random Sample

C. Cluster Sample

D. Systematic Sample

Let’s Bring It All Together!

II. Framework for Probability Sampling

A. Subsets/Samples

Your Turn!

B. Probability Models

Your Turn!

Your Turn!

III. Sampling Distributions

Knowledge Check!

A. Expected Value

B. Bias

C. Variance

D. Mean Squared Error