In this lesson students will learn to:
These data were put together from NBA and WNBA records for the 2022-2023 and 2022 seasons, respecitively.
For this activity we are going to pretend that we are doing a survey on professional basketball players.
Download the data here:
## DOWNLOAD DATA
bball<-read.csv("https://raw.githubusercontent.com/kitadasmalley/Teaching/refs/heads/main/ProjectData/bBall2022/bballNR.csv")
str(bball)
## 'data.frame': 635 obs. of 36 variables:
## $ Player : chr "Stephen Curry" "John Wall" "Russell Westbrook" "LeBron James" ...
## $ Salary : num 48070014 47345760 47080179 44474988 44119845 ...
## $ SimpPos : chr "G" "G" "G" "F" ...
## $ Position: chr "PG" "PG" "PG" "PF" ...
## $ Age : int 34 32 34 38 34 29 31 32 28 32 ...
## $ TeamAbb : chr "GSW" "LAC" "LAL/LAC" "LAL" ...
## $ GP : int 56 34 73 55 47 50 52 56 63 58 ...
## $ GS : int 56 3 24 54 47 50 50 56 63 58 ...
## $ MP : num 34.7 22.2 29.1 35.5 35.6 33.5 33.6 34.6 32.1 36.3 ...
## $ FG : num 10 4.1 5.9 11.1 10.3 8.9 8.6 8.2 11.2 9.6 ...
## $ FGA : num 20.2 9.9 13.6 22.2 18.3 17.6 16.8 17.9 20.3 20.7 ...
## $ FGP : num 0.493 0.408 0.436 0.5 0.56 0.506 0.512 0.457 0.553 0.463 ...
## $ X3P : num 4.9 1 1.2 2.2 2 1.6 2 2.8 0.7 4.2 ...
## $ X3PA : num 11.4 3.2 3.9 6.9 4.9 4.4 4.8 7.6 2.7 11.3 ...
## $ X3PP : num 0.427 0.303 0.311 0.321 0.404 0.365 0.416 0.371 0.275 0.371 ...
## $ X2P : num 5.1 3.1 4.7 8.9 8.3 7.3 6.6 5.4 10.5 5.4 ...
## $ X2PA : num 8.8 6.7 9.7 15.3 13.4 13.2 11.9 10.3 17.6 9.4 ...
## $ X2PP : num 0.579 0.459 0.487 0.58 0.617 0.552 0.551 0.521 0.596 0.574 ...
## $ FT : num 4.6 2.3 2.8 4.6 6.5 3.8 4.7 4.6 7.9 8.8 ...
## $ FTA : num 5 3.3 4.3 5.9 7.1 4.6 5.4 5.3 12.3 9.6 ...
## $ FTP : num 0.915 0.681 0.656 0.768 0.919 0.842 0.871 0.871 0.645 0.914 ...
## $ ORB : num 0.7 0.4 1.2 1.2 0.4 0.8 1.1 0.8 2.2 0.8 ...
## $ DRB : num 5.4 2.3 4.6 7.1 6.3 3.1 5.4 5.3 9.6 4 ...
## $ TRB : num 6.1 2.7 5.8 8.3 6.7 3.9 6.5 6.1 11.8 4.8 ...
## $ AST : num 6.3 5.2 7.5 6.8 5 5.4 3.9 5.1 5.7 7.3 ...
## $ STL : num 0.9 0.8 1 0.9 0.7 0.9 1.4 1.5 0.8 0.9 ...
## $ BLK : num 0.4 0.4 0.5 0.6 1.4 0.7 0.5 0.4 0.8 0.3 ...
## $ TOV : num 3.2 2.4 3.5 3.2 3.3 2.9 1.7 3.1 3.9 3.3 ...
## $ PF : num 2.1 1.7 2.2 1.6 2.1 2.1 1.6 2.8 3.1 1.9 ...
## $ PTS : num 29.4 11.4 15.9 28.9 29.1 23.2 23.8 23.8 31.1 32.2 ...
## $ League : chr "NBA" "NBA" "NBA" "NBA" ...
## $ ORACLE1 : int 0 1 0 1 1 1 1 1 0 1 ...
## $ ORACLE2 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ SRS_0 : int 0 0 0 1 0 1 0 0 0 1 ...
## $ SRS_MCAR: int 0 0 0 1 0 1 0 0 0 1 ...
## $ SRS_MAR : int 0 0 0 1 0 1 0 0 0 1 ...
We will also need the tidyverse to help with some wrangling:
library(tidyverse)
Consider two different oracles that know deterministically if someone is predestined to respond or not respond to a survey.
### LOOK AT THE SUMMARY OF ORACLE 1's DATA
bball%>%
group_by(ORACLE1)%>%
summarise(n=n(),
avg=mean(Salary))
## # A tibble: 2 × 3
## ORACLE1 n avg
## <int> <int> <dbl>
## 1 0 127 6141368.
## 2 1 508 6234332.
## IT LOOKS LIKE THERE ISN'T A BIG DIFFERENCE BETWEEN THE GROUPS
Since you spoke with this oracle you also know the truth of the universe, the true average salary.
## TRUE VALUE
YU<-mean(bball$Salary)
YU
## [1] 6215739
## VERIFY FROM TABLE ABOVE
(127/635)*6141368+(508/635)*6234332
## [1] 6215739
Now, pretend that we take a sample of n=150 from the respondent group many many times. Doing this we can estimate the amount of bias.
### SIMULATE BIAS
bias<-c()
for(i in 1:1000){
oracle1_Resp<-bball%>%
filter(ORACLE1==1)
thisSamp<-sample(1:dim(oracle1_Resp)[1], 150, replace=FALSE)
thisYR<-mean(oracle1_Resp$Salary[thisSamp])
thisBias<-thisYR-YU
bias<-c(bias, thisBias)
}
bias_df1<-data.frame(run=1:1000, bias)
## APPROX
(127/635)*(6234332-6141368)
## [1] 18592.8
ggplot(bias_df1, aes(x=run, y=bias))+
geom_point()+
geom_hline(yintercept = 18592.8, color="red")
Now, in a parallel universe, we go an talk to a different oracle.
### LOOK AT THE SUMMARY OF ORACLE 2's DATA
bball%>%
group_by(ORACLE2)%>%
summarise(n=n(),
avg=mean(Salary))
## # A tibble: 2 × 3
## ORACLE2 n avg
## <int> <int> <dbl>
## 1 0 127 22839209.
## 2 1 508 2059872.
## TRUE VALUE
YU<-mean(bball$Salary)
YU
## [1] 6215739
## IT LOOKS LIKE THERE IS BIG DIFFERENCE BETWEEN THE GROUPS
Now again, pretend that we take a sample of n=150 from the respondent group many many times. Doing this we can estimate the amount of bias.
### SIMULATE BIAS
bias2<-c()
for(i in 1:1000){
oracle2_Resp<-bball%>%
filter(ORACLE2==1)
thisSamp<-sample(1:dim(oracle2_Resp)[1], 150, replace=FALSE)
thisYR<-mean(oracle2_Resp$Salary[thisSamp])
thisBias<-thisYR-YU
bias2<-c(bias2, thisBias)
}
bias_df2<-data.frame(run=1:1000, bias2)
## APPROX
(127/635)*(2059872-22839209)
## [1] -4155867
ggplot(bias_df2, aes(x=run, y=bias2))+
geom_point()+
geom_hline(yintercept = -4155867, color="red")
We aren’t oracles and we never know the truth. If the group of nonrespondents is very different from the group of respondents are results will be unreliable and we can make the wrong conclusions!
Sometimes we may know something about the population from which we took the sample. We can use this to check if our data appears to be representive of these dimensions
#### AUXILLARY DATA
### LEAGUE
LeTab<-table(bball$League)
prop.table(LeTab)
##
## NBA WNBA
## 0.7354331 0.2645669
### POSITION
PosTab<-table(bball$SimpPos)
prop.table(PosTab)
##
## C F G
## 0.1811024 0.3984252 0.4204724
In a perfect world, we would have no nonresponse. The following simple random sample for generated previously in R.
#### SRS - NO NONRESPONSE
## DOES THIS LOOK SAMPLE LOOK REPRESENTATIVE?
ggplot(bball, aes(x=as.factor(SRS_0), fill=League))+
geom_bar(position="fill")
Look at the distribution:
## FILTER
sampSRS_0<-bball%>%
filter(SRS_0==1)
table(sampSRS_0$League)
##
## NBA WNBA
## 109 41
prop.table(table(sampSRS_0$League))
##
## NBA WNBA
## 0.7266667 0.2733333
## ITS NOT TOO BAD... LET'S KEEP GOING
Since we feel satisfied and this comes from an SRS, where all the weights are the same. We can calculate the sample mean.
### CALCULATE THE MEAN FROM THE RESPONDENTS
## MEAN
mean(sampSRS_0$Salary) ## ESTIMATE
## [1] 5889924
If we are willing to make these assumptions and we have no other information this is the best we can do.
In reality we will have missing data. First, let’s explore data that is Missing Completely At Random!
Now, when a respondent was selected but does not respond they will be
represented in the data with an NA
.
#### SRS_MCAR
### LOOK AT COLUMN
head(bball$SRS_MCAR, 20)
## [1] 0 0 0 1 0 1 0 0 0 1 0 NA 0 0 0 0 0 0 NA 0
Let’s check the non-response rates across Leagues.
## DOES THIS LOOK SAMPLE LOOK REPRESENTATIVE?
## LET'S LOOK AT THE NONRESPONSE RATES
bball%>%
filter(is.na(SRS_MCAR)==TRUE | SRS_MCAR==1)%>%
group_by(League)%>%
summarise(nonRespR=mean(is.na(SRS_MCAR)))
## # A tibble: 2 × 2
## League nonRespR
## <chr> <dbl>
## 1 NBA 0.202
## 2 WNBA 0.195
These nonresponse rates are pretty similar.
## DOES THE SAMPLE LOOK REPRESENTATIVE
## REMOVE NR
sampSRS_MCAR<-bball%>%
filter(is.na(SRS_MCAR)==FALSE)%>%
filter(SRS_MCAR==1)
table(sampSRS_MCAR$League)
##
## NBA WNBA
## 87 33
prop.table(table(sampSRS_MCAR$League))
##
## NBA WNBA
## 0.725 0.275
If we are willing to move forward, we can calculate the sample mean.
## SAMPLE MEAN
mean(sampSRS_MCAR$Salary) ## ESTIMATE
## [1] 5678742
Now it’s getting real! There can be relationships between observed explanatory variables and the propensity to respond. When this happens, we refer to it as missing at random (MAR).
#### SRS_MAR
## DOES THIS LOOK SAMPLE LOOK REPRESENTATIVE?
## LET'S LOOK AT THE NONRESPONSE RATES
bball%>%
filter(is.na(SRS_MAR)==TRUE | SRS_MAR==1)%>%
group_by(League)%>%
summarise(NonRespR=mean(is.na(SRS_MAR)))
## # A tibble: 2 × 2
## League NonRespR
## <chr> <dbl>
## 1 NBA 0.248
## 2 WNBA 0.0732
OH NO!!! It look’s like there is differential nonresponse!
### OH NO!
## REMOVE NR
sampSRS_MAR<-bball%>%
filter(is.na(SRS_MAR)==FALSE)%>%
filter(SRS_MAR==1)
table(sampSRS_MAR$League)
##
## NBA WNBA
## 82 38
prop.table(table(sampSRS_MAR$League))
##
## NBA WNBA
## 0.6833333 0.3166667
It would not be wise to simple look at the sample mean here.
## ESTIMATE
mean(sampSRS_MAR$Salary)
## [1] 5780424
GOAL: Redistribute weights within classes, such that response propensities vary from class to class but are homogenous within.
STEPS:
Add a column for the original weights.
### WEIGHTING CLASS
### USE LEAGUE AS CLASSES
selectedMAR<-bball%>%
select(c("Player", "Salary", "League", "SRS_MAR"))%>%
filter(is.na(SRS_MAR)==TRUE | SRS_MAR==1)%>%
mutate(wi=635/150)### ORIGINAL WEIGHT
head(selectedMAR)
## Player Salary League SRS_MAR wi
## 1 LeBron James 44474988 NBA 1 4.233333
## 2 Bradley Beal 43279250 NBA 1 4.233333
## 3 Damian Lillard 42492492 NBA 1 4.233333
## 4 Kyrie Irving 38917057 NBA 1 4.233333
## 5 Trae Young 37096500 NBA 1 4.233333
## 6 James Harden 33000000 NBA NA 4.233333
### DIVIDE INTO CLASSES
weightClasses<-selectedMAR%>%
group_by(League, SRS_MAR)%>%
summarise(sumWgt=sum(wi))
## `summarise()` has grouped output by 'League'. You can override using the
## `.groups` argument.
weightClasses
## # A tibble: 4 × 3
## # Groups: League [2]
## League SRS_MAR sumWgt
## <chr> <int> <dbl>
## 1 NBA 1 347.
## 2 NBA NA 114.
## 3 WNBA 1 161.
## 4 WNBA NA 12.7
### REDISTRIBUTE WEIGHTS
selectedMAR1<-selectedMAR%>%
filter(SRS_MAR==1)%>%
mutate(WCA_multiplier=case_when(
League=="NBA"~ (1+114.3/347.133),
League=="WNBA"~ (1+12.7/160.867),
))%>%
mutate(new_wi=wi*WCA_multiplier)
head(selectedMAR1)
## Player Salary League SRS_MAR wi WCA_multiplier new_wi
## 1 LeBron James 44474988 NBA 1 4.233333 1.329269 5.627237
## 2 Bradley Beal 43279250 NBA 1 4.233333 1.329269 5.627237
## 3 Damian Lillard 42492492 NBA 1 4.233333 1.329269 5.627237
## 4 Kyrie Irving 38917057 NBA 1 4.233333 1.329269 5.627237
## 5 Trae Young 37096500 NBA 1 4.233333 1.329269 5.627237
## 6 Jamal Murray 31650600 NBA 1 4.233333 1.329269 5.627237
### STEP 3: CHECK
sum(selectedMAR1$new_wi)
## [1] 635.0001
## CLOSE BUT A TINY BIT OF ROUNDING ERROR
### WEIGHTED ESTIMATE
numerator<-sum(selectedMAR1$new_wi*selectedMAR1$Salary)
numerator
## [1] 3899183392
denominator<-sum(selectedMAR1$new_wi)
denominator
## [1] 635.0001
### AVERAGE
numerator/denominator ### MUCH BETTER!
## [1] 6140445
If you know the count within your sample of a characteristic (or even multiple characteristics together) you can create stratification weights.
### POSTSTRATIFIED
sampSRS_MAR%>%
group_by(League)%>%
summarise(n=n())
## # A tibble: 2 × 2
## League n
## <chr> <int>
## 1 NBA 82
## 2 WNBA 38
### TRUTH
LeTab
##
## NBA WNBA
## 467 168
### POST STRAT WEIGHTS
### NBA
467/82
## [1] 5.695122
### WNBA
168/38
## [1] 4.421053
### ADD POST-STRAT WEIGHTS
#head(sampSRS_MAR)
postStrat_MAR<-sampSRS_MAR%>%
select(c("Player", "Salary", "League", "SRS_MAR"))%>%
mutate(wi_star=case_when(
League=="NBA" ~ 467/82,
League=="WNBA" ~ 168/38
))
head(postStrat_MAR)
## Player Salary League SRS_MAR wi_star
## 1 LeBron James 44474988 NBA 1 5.695122
## 2 Bradley Beal 43279250 NBA 1 5.695122
## 3 Damian Lillard 42492492 NBA 1 5.695122
## 4 Kyrie Irving 38917057 NBA 1 5.695122
## 5 Trae Young 37096500 NBA 1 5.695122
## 6 Jamal Murray 31650600 NBA 1 5.695122
## CHECK
sum(postStrat_MAR$wi_star)
## [1] 635
### PS WEIGHTED ESTIMATE
numeratorPS<-sum(postStrat_MAR$wi_star*postStrat_MAR$Salary)
numeratorPS
## [1] 3945431325
denominatorPS<-sum(postStrat_MAR$wi_star)
denominatorPS
## [1] 635
### PS AVERAGE
numeratorPS/denominatorPS ### EVEN BETTER!
## [1] 6213278
### TRUTH
YU
## [1] 6215739
### WEIGHTING CLASS
numerator/denominator
## [1] 6140445
### POSTSTRAT
numeratorPS/denominatorPS
## [1] 6213278