Learning Objectives:

In this lesson students will learn to:

  • The woes of nonresponse!
  • How to use auxiliary data to re-weight survey responses in the face of nonresponse

The Data

These data were put together from NBA and WNBA records for the 2022-2023 and 2022 seasons, respecitively.

For this activity we are going to pretend that we are doing a survey on professional basketball players.

Download the data here:

## DOWNLOAD DATA
bball<-read.csv("https://raw.githubusercontent.com/kitadasmalley/Teaching/refs/heads/main/ProjectData/bBall2022/bballNR.csv")

str(bball)
## 'data.frame':    635 obs. of  36 variables:
##  $ Player  : chr  "Stephen Curry" "John Wall" "Russell Westbrook" "LeBron James" ...
##  $ Salary  : num  48070014 47345760 47080179 44474988 44119845 ...
##  $ SimpPos : chr  "G" "G" "G" "F" ...
##  $ Position: chr  "PG" "PG" "PG" "PF" ...
##  $ Age     : int  34 32 34 38 34 29 31 32 28 32 ...
##  $ TeamAbb : chr  "GSW" "LAC" "LAL/LAC" "LAL" ...
##  $ GP      : int  56 34 73 55 47 50 52 56 63 58 ...
##  $ GS      : int  56 3 24 54 47 50 50 56 63 58 ...
##  $ MP      : num  34.7 22.2 29.1 35.5 35.6 33.5 33.6 34.6 32.1 36.3 ...
##  $ FG      : num  10 4.1 5.9 11.1 10.3 8.9 8.6 8.2 11.2 9.6 ...
##  $ FGA     : num  20.2 9.9 13.6 22.2 18.3 17.6 16.8 17.9 20.3 20.7 ...
##  $ FGP     : num  0.493 0.408 0.436 0.5 0.56 0.506 0.512 0.457 0.553 0.463 ...
##  $ X3P     : num  4.9 1 1.2 2.2 2 1.6 2 2.8 0.7 4.2 ...
##  $ X3PA    : num  11.4 3.2 3.9 6.9 4.9 4.4 4.8 7.6 2.7 11.3 ...
##  $ X3PP    : num  0.427 0.303 0.311 0.321 0.404 0.365 0.416 0.371 0.275 0.371 ...
##  $ X2P     : num  5.1 3.1 4.7 8.9 8.3 7.3 6.6 5.4 10.5 5.4 ...
##  $ X2PA    : num  8.8 6.7 9.7 15.3 13.4 13.2 11.9 10.3 17.6 9.4 ...
##  $ X2PP    : num  0.579 0.459 0.487 0.58 0.617 0.552 0.551 0.521 0.596 0.574 ...
##  $ FT      : num  4.6 2.3 2.8 4.6 6.5 3.8 4.7 4.6 7.9 8.8 ...
##  $ FTA     : num  5 3.3 4.3 5.9 7.1 4.6 5.4 5.3 12.3 9.6 ...
##  $ FTP     : num  0.915 0.681 0.656 0.768 0.919 0.842 0.871 0.871 0.645 0.914 ...
##  $ ORB     : num  0.7 0.4 1.2 1.2 0.4 0.8 1.1 0.8 2.2 0.8 ...
##  $ DRB     : num  5.4 2.3 4.6 7.1 6.3 3.1 5.4 5.3 9.6 4 ...
##  $ TRB     : num  6.1 2.7 5.8 8.3 6.7 3.9 6.5 6.1 11.8 4.8 ...
##  $ AST     : num  6.3 5.2 7.5 6.8 5 5.4 3.9 5.1 5.7 7.3 ...
##  $ STL     : num  0.9 0.8 1 0.9 0.7 0.9 1.4 1.5 0.8 0.9 ...
##  $ BLK     : num  0.4 0.4 0.5 0.6 1.4 0.7 0.5 0.4 0.8 0.3 ...
##  $ TOV     : num  3.2 2.4 3.5 3.2 3.3 2.9 1.7 3.1 3.9 3.3 ...
##  $ PF      : num  2.1 1.7 2.2 1.6 2.1 2.1 1.6 2.8 3.1 1.9 ...
##  $ PTS     : num  29.4 11.4 15.9 28.9 29.1 23.2 23.8 23.8 31.1 32.2 ...
##  $ League  : chr  "NBA" "NBA" "NBA" "NBA" ...
##  $ ORACLE1 : int  0 1 0 1 1 1 1 1 0 1 ...
##  $ ORACLE2 : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ SRS_0   : int  0 0 0 1 0 1 0 0 0 1 ...
##  $ SRS_MCAR: int  0 0 0 1 0 1 0 0 0 1 ...
##  $ SRS_MAR : int  0 0 0 1 0 1 0 0 0 1 ...

We will also need the tidyverse to help with some wrangling:

library(tidyverse)

The Woes of Nonresponse

Consider two different oracles that know deterministically if someone is predestined to respond or not respond to a survey.

Oracle 1

### LOOK AT THE SUMMARY OF ORACLE 1's DATA
bball%>%
  group_by(ORACLE1)%>%
  summarise(n=n(), 
            avg=mean(Salary))
## # A tibble: 2 × 3
##   ORACLE1     n      avg
##     <int> <int>    <dbl>
## 1       0   127 6141368.
## 2       1   508 6234332.
## IT LOOKS LIKE THERE ISN'T A BIG DIFFERENCE BETWEEN THE GROUPS

Since you spoke with this oracle you also know the truth of the universe, the true average salary.

## TRUE VALUE
YU<-mean(bball$Salary)
YU
## [1] 6215739
## VERIFY FROM TABLE ABOVE
(127/635)*6141368+(508/635)*6234332
## [1] 6215739

Now, pretend that we take a sample of n=150 from the respondent group many many times. Doing this we can estimate the amount of bias.

### SIMULATE BIAS
bias<-c()
for(i in 1:1000){
  oracle1_Resp<-bball%>%
    filter(ORACLE1==1)
  thisSamp<-sample(1:dim(oracle1_Resp)[1], 150, replace=FALSE)
  thisYR<-mean(oracle1_Resp$Salary[thisSamp])
  thisBias<-thisYR-YU
  bias<-c(bias, thisBias)
}

bias_df1<-data.frame(run=1:1000, bias)

## APPROX
(127/635)*(6234332-6141368)
## [1] 18592.8
ggplot(bias_df1, aes(x=run, y=bias))+
  geom_point()+
  geom_hline(yintercept = 18592.8, color="red")

Oracle 2

Now, in a parallel universe, we go an talk to a different oracle.

### LOOK AT THE SUMMARY OF ORACLE 2's DATA
bball%>%
  group_by(ORACLE2)%>%
  summarise(n=n(), 
            avg=mean(Salary))
## # A tibble: 2 × 3
##   ORACLE2     n       avg
##     <int> <int>     <dbl>
## 1       0   127 22839209.
## 2       1   508  2059872.
## TRUE VALUE
YU<-mean(bball$Salary)
YU
## [1] 6215739
## IT LOOKS LIKE THERE IS BIG DIFFERENCE BETWEEN THE GROUPS

Now again, pretend that we take a sample of n=150 from the respondent group many many times. Doing this we can estimate the amount of bias.

### SIMULATE BIAS
bias2<-c()
for(i in 1:1000){
  oracle2_Resp<-bball%>%
    filter(ORACLE2==1)
  thisSamp<-sample(1:dim(oracle2_Resp)[1], 150, replace=FALSE)
  thisYR<-mean(oracle2_Resp$Salary[thisSamp])
  thisBias<-thisYR-YU
  bias2<-c(bias2, thisBias)
}

bias_df2<-data.frame(run=1:1000, bias2)

## APPROX
(127/635)*(2059872-22839209)
## [1] -4155867
ggplot(bias_df2, aes(x=run, y=bias2))+
  geom_point()+
  geom_hline(yintercept = -4155867, color="red")

Moral of the Story

We aren’t oracles and we never know the truth. If the group of nonrespondents is very different from the group of respondents are results will be unreliable and we can make the wrong conclusions!

Using Auxiliary Data

Sometimes we may know something about the population from which we took the sample. We can use this to check if our data appears to be representive of these dimensions

#### AUXILLARY DATA
### LEAGUE
LeTab<-table(bball$League)
prop.table(LeTab)
## 
##       NBA      WNBA 
## 0.7354331 0.2645669
### POSITION
PosTab<-table(bball$SimpPos)
prop.table(PosTab)
## 
##         C         F         G 
## 0.1811024 0.3984252 0.4204724

Case 1: SRS in a Perfect World

In a perfect world, we would have no nonresponse. The following simple random sample for generated previously in R.

#### SRS - NO NONRESPONSE
## DOES THIS LOOK SAMPLE LOOK REPRESENTATIVE?

ggplot(bball, aes(x=as.factor(SRS_0), fill=League))+
  geom_bar(position="fill")

Look at the distribution:

## FILTER
sampSRS_0<-bball%>%
  filter(SRS_0==1)

table(sampSRS_0$League)
## 
##  NBA WNBA 
##  109   41
prop.table(table(sampSRS_0$League))
## 
##       NBA      WNBA 
## 0.7266667 0.2733333
## ITS NOT TOO BAD... LET'S KEEP GOING

Since we feel satisfied and this comes from an SRS, where all the weights are the same. We can calculate the sample mean.

### CALCULATE THE MEAN FROM THE RESPONDENTS 

## MEAN
mean(sampSRS_0$Salary) ## ESTIMATE
## [1] 5889924

If we are willing to make these assumptions and we have no other information this is the best we can do.

Case 2: SRS with MCAR

In reality we will have missing data. First, let’s explore data that is Missing Completely At Random!

Now, when a respondent was selected but does not respond they will be represented in the data with an NA.

#### SRS_MCAR
### LOOK AT COLUMN
head(bball$SRS_MCAR, 20)
##  [1]  0  0  0  1  0  1  0  0  0  1  0 NA  0  0  0  0  0  0 NA  0

Let’s check the non-response rates across Leagues.

## DOES THIS LOOK SAMPLE LOOK REPRESENTATIVE?
## LET'S LOOK AT THE NONRESPONSE RATES
bball%>%
  filter(is.na(SRS_MCAR)==TRUE | SRS_MCAR==1)%>%
  group_by(League)%>%
  summarise(nonRespR=mean(is.na(SRS_MCAR)))
## # A tibble: 2 × 2
##   League nonRespR
##   <chr>     <dbl>
## 1 NBA       0.202
## 2 WNBA      0.195

These nonresponse rates are pretty similar.

## DOES THE SAMPLE LOOK REPRESENTATIVE
## REMOVE NR
sampSRS_MCAR<-bball%>%
  filter(is.na(SRS_MCAR)==FALSE)%>%
  filter(SRS_MCAR==1)

table(sampSRS_MCAR$League)
## 
##  NBA WNBA 
##   87   33
prop.table(table(sampSRS_MCAR$League))
## 
##   NBA  WNBA 
## 0.725 0.275

If we are willing to move forward, we can calculate the sample mean.

## SAMPLE MEAN
mean(sampSRS_MCAR$Salary) ## ESTIMATE
## [1] 5678742

Case 3: SRS with MAR

Now it’s getting real! There can be relationships between observed explanatory variables and the propensity to respond. When this happens, we refer to it as missing at random (MAR).

#### SRS_MAR
## DOES THIS LOOK SAMPLE LOOK REPRESENTATIVE?
## LET'S LOOK AT THE NONRESPONSE RATES
bball%>%
  filter(is.na(SRS_MAR)==TRUE | SRS_MAR==1)%>%
  group_by(League)%>%
  summarise(NonRespR=mean(is.na(SRS_MAR)))
## # A tibble: 2 × 2
##   League NonRespR
##   <chr>     <dbl>
## 1 NBA      0.248 
## 2 WNBA     0.0732

OH NO!!! It look’s like there is differential nonresponse!

### OH NO!
## REMOVE NR
sampSRS_MAR<-bball%>%
  filter(is.na(SRS_MAR)==FALSE)%>%
  filter(SRS_MAR==1)

table(sampSRS_MAR$League)
## 
##  NBA WNBA 
##   82   38
prop.table(table(sampSRS_MAR$League))
## 
##       NBA      WNBA 
## 0.6833333 0.3166667

It would not be wise to simple look at the sample mean here.

## ESTIMATE
mean(sampSRS_MAR$Salary)
## [1] 5780424

Weighting Adjustments for Nonresponse

A. Weighting Class Adjustments

GOAL: Redistribute weights within classes, such that response propensities vary from class to class but are homogenous within.

STEPS:

    1. Divide into weighting classes
    1. Redistribute sample weights
    1. Check weight adjustments
0. Prepare Data

Add a column for the original weights.

### WEIGHTING CLASS
### USE LEAGUE AS CLASSES 
selectedMAR<-bball%>%
  select(c("Player", "Salary", "League", "SRS_MAR"))%>%
  filter(is.na(SRS_MAR)==TRUE | SRS_MAR==1)%>%
  mutate(wi=635/150)### ORIGINAL WEIGHT 
  

head(selectedMAR)
##           Player   Salary League SRS_MAR       wi
## 1   LeBron James 44474988    NBA       1 4.233333
## 2   Bradley Beal 43279250    NBA       1 4.233333
## 3 Damian Lillard 42492492    NBA       1 4.233333
## 4   Kyrie Irving 38917057    NBA       1 4.233333
## 5     Trae Young 37096500    NBA       1 4.233333
## 6   James Harden 33000000    NBA      NA 4.233333
1. Divide into weighting classes
### DIVIDE INTO CLASSES 
weightClasses<-selectedMAR%>%
  group_by(League, SRS_MAR)%>%
  summarise(sumWgt=sum(wi))
## `summarise()` has grouped output by 'League'. You can override using the
## `.groups` argument.
weightClasses
## # A tibble: 4 × 3
## # Groups:   League [2]
##   League SRS_MAR sumWgt
##   <chr>    <int>  <dbl>
## 1 NBA          1  347. 
## 2 NBA         NA  114. 
## 3 WNBA         1  161. 
## 4 WNBA        NA   12.7
2. Redistribute Weights
### REDISTRIBUTE WEIGHTS
selectedMAR1<-selectedMAR%>%
  filter(SRS_MAR==1)%>%
  mutate(WCA_multiplier=case_when(
    League=="NBA"~ (1+114.3/347.133), 
    League=="WNBA"~ (1+12.7/160.867), 
  ))%>%
  mutate(new_wi=wi*WCA_multiplier)

head(selectedMAR1)
##           Player   Salary League SRS_MAR       wi WCA_multiplier   new_wi
## 1   LeBron James 44474988    NBA       1 4.233333       1.329269 5.627237
## 2   Bradley Beal 43279250    NBA       1 4.233333       1.329269 5.627237
## 3 Damian Lillard 42492492    NBA       1 4.233333       1.329269 5.627237
## 4   Kyrie Irving 38917057    NBA       1 4.233333       1.329269 5.627237
## 5     Trae Young 37096500    NBA       1 4.233333       1.329269 5.627237
## 6   Jamal Murray 31650600    NBA       1 4.233333       1.329269 5.627237
3. Check TOTAL
### STEP 3: CHECK
sum(selectedMAR1$new_wi)
## [1] 635.0001
## CLOSE BUT A TINY BIT OF ROUNDING ERROR
4. Weighted Estimate
### WEIGHTED ESTIMATE
numerator<-sum(selectedMAR1$new_wi*selectedMAR1$Salary)
numerator
## [1] 3899183392
denominator<-sum(selectedMAR1$new_wi)
denominator
## [1] 635.0001
### AVERAGE
numerator/denominator ### MUCH BETTER!
## [1] 6140445

B. Poststratification

If you know the count within your sample of a characteristic (or even multiple characteristics together) you can create stratification weights.

1. Distribution in the Sample
### POSTSTRATIFIED
sampSRS_MAR%>%
  group_by(League)%>%
  summarise(n=n())
## # A tibble: 2 × 2
##   League     n
##   <chr>  <int>
## 1 NBA       82
## 2 WNBA      38
### TRUTH
LeTab
## 
##  NBA WNBA 
##  467  168
2. Poststratification Weights
### POST STRAT WEIGHTS
### NBA
467/82
## [1] 5.695122
### WNBA
168/38
## [1] 4.421053
### ADD POST-STRAT WEIGHTS
#head(sampSRS_MAR)

postStrat_MAR<-sampSRS_MAR%>%
  select(c("Player", "Salary", "League", "SRS_MAR"))%>%
  mutate(wi_star=case_when(
    League=="NBA" ~ 467/82, 
    League=="WNBA" ~ 168/38
  ))

head(postStrat_MAR)
##           Player   Salary League SRS_MAR  wi_star
## 1   LeBron James 44474988    NBA       1 5.695122
## 2   Bradley Beal 43279250    NBA       1 5.695122
## 3 Damian Lillard 42492492    NBA       1 5.695122
## 4   Kyrie Irving 38917057    NBA       1 5.695122
## 5     Trae Young 37096500    NBA       1 5.695122
## 6   Jamal Murray 31650600    NBA       1 5.695122
3. Don’t forget to check your work!
## CHECK
sum(postStrat_MAR$wi_star)
## [1] 635
4. Weighted Estimate
### PS WEIGHTED ESTIMATE
numeratorPS<-sum(postStrat_MAR$wi_star*postStrat_MAR$Salary)
numeratorPS
## [1] 3945431325
denominatorPS<-sum(postStrat_MAR$wi_star)
denominatorPS
## [1] 635
### PS AVERAGE
numeratorPS/denominatorPS ### EVEN BETTER!
## [1] 6213278

COMPARE

### TRUTH
YU
## [1] 6215739
### WEIGHTING CLASS
numerator/denominator
## [1] 6140445
### POSTSTRAT
numeratorPS/denominatorPS
## [1] 6213278