Learning Objectives

This Rmd is meant to accompany a hands-on sampling activity to help students practice:

  • Drawing random subsets from a finite population
  • Estimating means and totals
  • Computing sampling distributions
  • Calculating expected value, variance, bias, and mean squared error

I. Framework for Probability Sampling

Suppose we have a finite population, or universe, of size \(N\) and we would like to select a sample of size \(n\).

A. Subsets/Samples

The number of subsets of size \(n\) that can be drawn from a population of size \(N\) is \[{N\choose n}=\frac{N!}{n!(N-n)!}\]

### HOW MANY COMBINATIONS? 
### POP SIZE 4
### SAMP SIZE 2
allCombos<-combn(1:4, 2)
allCombos
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    1    1    1    2    2    3
## [2,]    2    3    4    3    4    4
## DIM
dim(allCombos)
## [1] 2 6
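
As a quick sanity check, base R's choose() computes this count directly, matching the number of columns returned by combn():

### SANITY CHECK WITH choose()
choose(4, 2)
## [1] 6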

B. Probability Models

Each possible sample from the population has a known probability of selection.

Recall the probability axioms:

  • Non-negativity: The probability of each subset is greater than or equal to 0
  • Normalization: The probabilities across all possible samples sum to 1

Consider the following two probability models:

# Probabilities correspond to observing the following sets (in order) 
## S_1 = {1,2}, S_2 = {1,3}, S_3 = {1,4}, 
## S_4 = {2,3}, S_5 = {2,4}, S_6 = {3,4}

### MODEL 1
probs1<-rep(1/6, 6)
probs1
## [1] 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667
### MODEL 2
probs2<-c(1/3, 1/6, 0, 0, 0, 1/2)
probs2
## [1] 0.3333333 0.1666667 0.0000000 0.0000000 0.0000000 0.5000000
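
We can check that both models satisfy the axioms:

### CHECK THE AXIOMS
### NON-NEGATIVITY
all(probs1>=0)
## [1] TRUE
all(probs2>=0)
## [1] TRUE
### NORMALIZATION
sum(probs1)
## [1] 1
sum(probs2)
## [1] 1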

We can randomly select a sample using the first sampling plan.

### A SINGLE SELECTION FROM DISTR 1
obsProb1<-sample(1:6, size=1, prob=probs1)

### WHICH SUBSET DID WE CHOOSE?
allCombos[,obsProb1]
## [1] 2 4
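
The same code draws a sample under the second plan; note that subsets with probability 0 under Model 2 (\(S_3\), \(S_4\), and \(S_5\)) can never be selected:

### A SINGLE SELECTION FROM DISTR 2
obsProb2<-sample(1:6, size=1, prob=probs2)

### WHICH SUBSET DID WE CHOOSE?
allCombos[,obsProb2]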

II. Sampling Distributions

Let’s turn our sets above into a data frame so that we can play with them some more!

## MAKE A DATA FRAME
sets4C2<-data.frame(set=1:6, 
                    firstObs = allCombos[1,], 
                    secondObs = allCombos[2,], 
                    prob=rep(1/6, 6))

sets4C2
##   set firstObs secondObs      prob
## 1   1        1         2 0.1666667
## 2   2        1         3 0.1666667
## 3   3        1         4 0.1666667
## 4   4        2         3 0.1666667
## 5   5        2         4 0.1666667
## 6   6        3         4 0.1666667

The sample mean \(\bar{y}\) is one of the most common statistics.

### WE'RE GOING TO NEED TIDYVERSE FOR THIS
library(tidyverse)

### SAMPLE MEAN
sets4C2<-sets4C2%>%
  mutate(y_bar=(firstObs+secondObs)/2)

sets4C2
##   set firstObs secondObs      prob y_bar
## 1   1        1         2 0.1666667   1.5
## 2   2        1         3 0.1666667   2.0
## 3   3        1         4 0.1666667   2.5
## 4   4        2         3 0.1666667   2.5
## 5   5        2         4 0.1666667   3.0
## 6   6        3         4 0.1666667   3.5

Suppose that we want to estimate the population total rather than the mean. We can use the estimator \[\hat{t}=N\bar{y}\]

### ESTIMATE THE TOTAL
sets4C2<-sets4C2%>%
  mutate(t_hat=y_bar*4)

sets4C2
##   set firstObs secondObs      prob y_bar t_hat
## 1   1        1         2 0.1666667   1.5     6
## 2   2        1         3 0.1666667   2.0     8
## 3   3        1         4 0.1666667   2.5    10
## 4   4        2         3 0.1666667   2.5    10
## 5   5        2         4 0.1666667   3.0    12
## 6   6        3         4 0.1666667   3.5    14

We sum the probabilities of the unique values to yield the sampling distribution of \(\hat{t}\):

sampD_tHat<-sets4C2%>%
  group_by(t_hat)%>%
  summarise(tHatProb=sum(prob))

sampD_tHat
## # A tibble: 5 × 2
##   t_hat tHatProb
##   <dbl>    <dbl>
## 1     6    0.167
## 2     8    0.167
## 3    10    0.333
## 4    12    0.167
## 5    14    0.167
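
As a quick check, these probabilities still sum to 1:

### NORMALIZATION CHECK
sum(sampD_tHat$tHatProb)
## [1] 1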

Let’s visualize that!

ggplot(data=sampD_tHat, aes(x=t_hat, y=tHatProb))+
  geom_col()

A. Expected Value

First, let’s motivate this with the long-run average by simulating the estimated total many times.

nsim<-1000
tHats<-c()
for(i in 1:nsim){
  ## Pick one subset at random
  simProb1<-sample(1:6, size=1, prob=probs1)
  ## Which subset is that?
  simObs1<-allCombos[,simProb1]
  ## Calculate the estimator 
  this_tHat<-mean(simObs1)*4
  ## Store for later
  tHats<-c(tHats, this_tHat)
}

hist(tHats)
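
The long-run average of the simulated totals gives an empirical estimate of the expected value:

### LONG-RUN AVERAGE
mean(tHats)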

Think about it!

What do you think the expected value would be?

Theory-Based Definition

The expected value is the mean of the sampling distribution of a statistic. This can be thought of as a weighted average, using the probability.

\[E[\hat{t}]=\sum_k k\,P(\hat{t}=k)\]

### EXPECTED VALUE
### STEP 1: CALCULATE THE SUMMANDS
expectedVal<-sampD_tHat%>%
  mutate(summands=t_hat*tHatProb)

expectedVal
## # A tibble: 5 × 3
##   t_hat tHatProb summands
##   <dbl>    <dbl>    <dbl>
## 1     6    0.167     1   
## 2     8    0.167     1.33
## 3    10    0.333     3.33
## 4    12    0.167     2   
## 5    14    0.167     2.33
### STEP 2: ADD THEM TOGETHER
sum(expectedVal$summands)
## [1] 10
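
The same calculation can be written as a single weighted sum:

### EXPECTED VALUE IN ONE STEP
sum(sampD_tHat$t_hat*sampD_tHat$tHatProb)
## [1] 10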

How far off are we?

B. Bias

The bias is the difference between the expected value of the estimator and the true value.

\[Bias[\hat{t}]=E[\hat{t}]-t\]

In practice, we don’t usually know the true value. In this small example, however, we do.

### TRUE TOTAL
sum(1:4)
## [1] 10
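
We can compute the bias directly; the result is 0 (possibly up to floating-point rounding):

### BIAS = EXPECTED VALUE - TRUE TOTAL
sum(expectedVal$summands)-sum(1:4)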

Since the difference is zero, we would say that this estimator is unbiased.

C. Variance

Variance describes how spread out the estimates are around their expected value.

\[V[\hat{t}]=E[(\hat{t}-E[\hat{t}])^2]=\sum_k P(\hat{t}=k)(k-E[\hat{t}])^2\]

### VARIANCE
### STEP 1: Differences
var_tHat<-expectedVal%>%
  mutate(v_diffs=t_hat-10)

var_tHat
## # A tibble: 5 × 4
##   t_hat tHatProb summands v_diffs
##   <dbl>    <dbl>    <dbl>   <dbl>
## 1     6    0.167     1         -4
## 2     8    0.167     1.33      -2
## 3    10    0.333     3.33       0
## 4    12    0.167     2          2
## 5    14    0.167     2.33       4
### STEP 2: SQUARE DIFFERENCES
var_tHat<-expectedVal%>%
  mutate(v_diffs=t_hat-10)%>%
  mutate(v_diffSqr=v_diffs^2)

var_tHat
## # A tibble: 5 × 5
##   t_hat tHatProb summands v_diffs v_diffSqr
##   <dbl>    <dbl>    <dbl>   <dbl>     <dbl>
## 1     6    0.167     1         -4        16
## 2     8    0.167     1.33      -2         4
## 3    10    0.333     3.33       0         0
## 4    12    0.167     2          2         4
## 5    14    0.167     2.33       4        16
### STEP 3: WEIGHT 
var_tHat<-expectedVal%>%
  mutate(v_diffs=t_hat-10)%>%
  mutate(v_diffSqr=v_diffs^2)%>%
  mutate(v_summands=v_diffSqr*tHatProb)

var_tHat
## # A tibble: 5 × 6
##   t_hat tHatProb summands v_diffs v_diffSqr v_summands
##   <dbl>    <dbl>    <dbl>   <dbl>     <dbl>      <dbl>
## 1     6    0.167     1         -4        16      2.67 
## 2     8    0.167     1.33      -2         4      0.667
## 3    10    0.333     3.33       0         0      0    
## 4    12    0.167     2          2         4      0.667
## 5    14    0.167     2.33       4        16      2.67
### STEP 4: ADD 
sum(var_tHat$v_summands)
## [1] 6.666667
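
The whole calculation can also be written as one weighted sum:

### VARIANCE IN ONE STEP
sum(sampD_tHat$tHatProb*(sampD_tHat$t_hat-10)^2)
## [1] 6.666667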

D. Mean Squared Error

\[MSE[\hat{t}]=V[\hat{t}]+[Bias({\hat{t}})]^2\]

In this case, since the bias is 0, the MSE is equal to the variance:

### MSE = VAR + BIAS^2
sum(var_tHat$v_summands)+0^2
## [1] 6.666667

III. Inclusion Probability

To motivate this, let’s simulate many draws and see how often each unit is chosen.

### SIMULATE REPEATED DRAWS FROM THE POPULATION
### USE SAMPLING PLAN 1
simProb1<-sample(1:6, size=1000, prob=probs1, replace = TRUE)

### WHICH OBS?
simObs1<-as.numeric(allCombos[,simProb1])


## PROB OF SELECTION
table(simObs1)/1000
## simObs1
##     1     2     3     4 
## 0.508 0.478 0.518 0.496
## PLOT
as.data.frame(simObs1)%>%
  ggplot()+
  geom_histogram(aes(x=simObs1))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Inclusion Probability Definition

Since each sample has a known probability of selection, we can calculate the probability of each unit being included in a sample. This probability is known as the inclusion probability, \(\pi_i\), which is the sum of the probabilities of all possible samples that contain unit \(i\).

It can be shown through combinatorics that the inclusion probability under this simple random sampling design is \(\pi_i=1/2\) for all \(i\): each unit \(i\) appears in \({3\choose 1}=3\) of the \({4\choose 2}=6\) equally likely samples, so \(\pi_i=3/6=1/2\).
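
One way to confirm this exactly, using the objects defined above, is to sum, for each unit, the probabilities of the samples that contain it:

### EXACT INCLUSION PROBABILITIES UNDER MODEL 1
sapply(1:4, function(i) sum(probs1[colSums(allCombos==i)>0]))
## [1] 0.5 0.5 0.5 0.5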