Learning Objectives:

In this lesson students will learn to:

  • Visualize missing data
  • Implement imputation methods

The Data

These data were put together from NBA and WNBA records for the 2022-2023 and 2022 seasons, respecitively.

For this activity we are going to pretend that we are doing a survey on professional basketball players. These data represent only the raw responses to the survey. These data reflect a subset of the NBA and WNBA 2022 population and also reflects nonresponse.

Download the data here:

## DOWNLOAD RAW SURVEY DATA
bballSurv<-read.csv("https://raw.githubusercontent.com/kitadasmalley/Teaching/refs/heads/main/DATA429_599/CODE/raw_bballSurvey.csv")

str(bballSurv)
## 'data.frame':    120 obs. of  18 variables:
##  $ Player : chr  "LeBron James" "Bradley Beal" "Damian Lillard" "Kyrie Irving" ...
##  $ League : chr  "NBA" "NBA" "NBA" "NBA" ...
##  $ Age    : int  38 29 32 30 24 25 25 25 32 37 ...
##  $ wi_star: num  5.7 5.7 5.7 5.7 5.7 ...
##  $ Salary : num  44474988 43279250 42492492 38917057 37096500 ...
##  $ SimpPos: chr  "F" "G" "G" "G" ...
##  $ PTS    : num  28.9 23.2 32.2 27.1 26.2 20 20.4 25 14.7 13.9 ...
##  $ GP     : int  55 50 NA NA 73 65 75 NA 50 59 ...
##  $ GS     : int  54 50 58 60 73 NA 75 NA 50 59 ...
##  $ MP     : num  35.5 33.5 36.3 37.4 34.8 32.8 34.6 33.4 31.5 32 ...
##  $ FG     : num  11.1 8.9 9.6 9.9 8.2 7.3 8 NA 5.5 5 ...
##  $ FGA    : num  22.2 17.6 20.7 20.1 19 16 14.9 18.2 11.6 11.3 ...
##  $ FGP    : num  0.5 0.506 NA 0.494 0.429 NA 0.54 NA 0.475 0.44 ...
##  $ X3PP   : num  0.321 0.365 0.371 0.379 0.335 0.398 0.083 0.324 NA 0.375 ...
##  $ X2PP   : num  0.58 0.552 0.574 NA 0.476 0.494 0.545 0.584 NA 0.482 ...
##  $ FTP    : num  0.768 NA 0.914 0.905 0.886 0.833 0.806 0.78 0.811 NA ...
##  $ TRB    : num  8.3 3.9 4.8 5.1 NA NA 9.2 NA 4.3 4.3 ...
##  $ AST    : num  6.8 5.4 7.3 5.5 10.2 6.2 3.2 6.1 4.1 8.9 ...

1. Exploring the Data

Let’s to a quick EDA to look for patterns. How about a pairs plot?!

library(tidyverse)

### PAIRS PLOT
#install.packages("GGally")
library(GGally)

ggpairs(bballSurv, columns = c(5, 7:18))

The leagues have different styles of game play which might impact the relationships between the variables.

### ADD COLOR
ggpairs(bballSurv, columns = c(5, 7:18), ggplot2::aes(colour = League))

What happens if we wanted to model salary?

Consider the following two models.

Model 1:
#### WHAT HAPPENS WITH MISSING DATA
### SALARY AS A FUNCTION OF PTS AND LEAGUE
mod1<-lm(Salary~PTS+League, data=bballSurv)
summary(mod1)
## 
## Call:
## lm(formula = Salary ~ PTS + League, data = bballSurv)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -22252644  -3682180    435703   3124995  16951292 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -991668    1304794  -0.760     0.45    
## PTS          1177570     101875  11.559  < 2e-16 ***
## LeagueWNBA  -8598988    1622020  -5.301 1.12e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6513000 on 75 degrees of freedom
##   (42 observations deleted due to missingness)
## Multiple R-squared:  0.6966, Adjusted R-squared:  0.6885 
## F-statistic: 86.08 on 2 and 75 DF,  p-value: < 2.2e-16
Model 2:
### SALARY AS A FUNCTION OF PTS AND LEAGUE AND AST
mod2<-lm(Salary~PTS+AST+League, data=bballSurv)
summary(mod2)
## 
## Call:
## lm(formula = Salary ~ PTS + AST + League, data = bballSurv)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -16951629  -3132570   -290050   3564051  15540967 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1955961    1318845  -1.483 0.143465    
## PTS           720305     158746   4.537 2.92e-05 ***
## AST          2404291     565140   4.254 7.74e-05 ***
## LeagueWNBA  -7567365    1850653  -4.089 0.000135 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6004000 on 58 degrees of freedom
##   (58 observations deleted due to missingness)
## Multiple R-squared:  0.7847, Adjusted R-squared:  0.7736 
## F-statistic: 70.46 on 3 and 58 DF,  p-value: < 2.2e-16

These models are not actually comparable due to missing data! If any row in the data has a missing value for variables used in the model, that whole line will be removed. This is called listwise deletion.

2. Visualizing Missing Data

A. Using ‘dplyr’

In our first attempt, we might be curious about how pervasive nonresponse is.

First, let’s try to understand how complete the records are. How many nonresponses does each individual have?

### NONRESP BY PLAYER
nonR<-bballSurv%>%
  is.na()%>%
  rowSums()

### MAKE A NEW DATAFRAME
playerNR<-bballSurv%>%
  select("Player")%>%
  cbind(nonR)%>%
  mutate(respR=(14-nonR)/14)

### PLOT THE FREQ
### COMPLETE >= 80%
### PARTIAL [50, 80)
### BREAKOFF < 50%
ggplot(data=playerNR, aes(x=respR))+
  geom_histogram()+
  geom_vline(xintercept = .8, color="blue", lwd=1.5, lty=2)+
  geom_vline(xintercept = .5, color="red", lwd=1.5, lty=2)+
  theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Now, lets look at the frequency of nonresponse across the variables.

### NONRESP BY VAR
bballSurv%>%
  is.na()%>%
  colSums()
##  Player  League     Age wi_star  Salary SimpPos     PTS      GP      GS      MP 
##       0       0       0       0      21       0      22      26      21      20 
##      FG     FGA     FGP    X3PP    X2PP     FTP     TRB     AST 
##      15      32      28      25      29      37      30      21

Do any variables seem to get missed significantly more than others?

B. Visualize Missing Data

Let’s make a graphic to see where the missing data are.

### VISUALIZE MISSING
#install.packages("visdat")
library(visdat)

vis_miss(bballSurv)

#### C. Patterns in Missing Data

Sometimes there are relationships between variables such that there is a higher occurrence of multiple variables missing at the same time.

### VIM
#install.packages("VIM")
library(VIM)
## Loading required package: colorspace
## Loading required package: grid
## VIM is ready to use.
## Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues
## 
## Attaching package: 'VIM'
## The following object is masked from 'package:datasets':
## 
##     sleep
bballSurv%>%
  aggr(combined=TRUE, numbers=TRUE)
## Warning in plot.aggr(res, ...): not enough vertical space to display
## frequencies (too many combinations)

It’s hard to see with this many variables, let’s just look at a select few:

### MAYBE WE SHOULD LOOK AT LESS VARIABLES
## SALARY, PTS, GP, GS, MP, FG, AST 
bballSurv%>%
  select(c("Salary", "PTS", "GP", "MP", "AST"))%>%
  aggr(combined=TRUE, numbers=TRUE, 
       cex.numbers=.5)

3. Imputation

Imputation is the practice of using statistical methods to fill in missing data.

A. Mean Imputation

Mean imputation replaces missing values with the average.

#### MEAN IMPUTATION
## CREATE INDICATORS
## REPLACE WITH MEANS

marnaIMP<-bballSurv%>%
  mutate(PTS_imp=ifelse(is.na(PTS), TRUE, FALSE))%>% 
  mutate(AST_imp=ifelse(is.na(AST), TRUE, FALSE))%>%
  mutate(PTS=ifelse(is.na(PTS), mean(PTS, na.rm=TRUE),PTS))%>%  
  mutate(AST=ifelse(is.na(AST), mean(AST, na.rm=TRUE), AST))

### MARGIN PLOT
marnaIMP%>%
  select(PTS, AST, PTS_imp, AST_imp)%>%
  marginplot(delimiter = "imp")

## THIS IS NOT GREAT
## THERE IS NOT ENOUGH VARIABILITY 

B. HOT DECK

Hotdeck imputation takes “donor” from previous complete data records.

#### VANILLA HOT DECK
marnaIMP_HD<-hotdeck(bballSurv, variable=c("PTS", "AST"))

### MARGIN PLOT
marnaIMP_HD%>%
  select(PTS, AST, PTS_imp, AST_imp)%>%
  marginplot(delimiter = "imp")

The problem here is that this just borrows the previous value, but often due to relationships/correlations between variables, we should reorder the data to borrow a “better” value.

## PUT IN ORDER (NUM) - GP
marnaIMP_HD2<-hotdeck(bballSurv, 
                      variable="PTS", 
                      domain_var = "GP")

## MARGIN PLOT
marnaIMP_HD2%>%
  select(PTS, AST, PTS_imp)%>%
  marginplot(delimiter = "imp")

Similarly, there can be similar values within groups.

## PUT IN GROUPS (SimpPos)
marnaIMP_HD3<-hotdeck(bballSurv, 
                      variable="AST", 
                      ord_var = "SimpPos")

## MARGIN PLOT
marnaIMP_HD3%>%
  select(PTS, AST, AST_imp)%>%
  marginplot(delimiter = "imp")

C. K-Nearest Neighbors

Consider (completed) respondents similar to the one with missing data. Then average their responses.

### KNN (5)
marnaIMP_KNN<-kNN(bballSurv, k = 5, variable = c("PTS", "AST"))
head(marnaIMP_KNN)
##           Player League Age  wi_star   Salary SimpPos  PTS GP GS   MP   FG  FGA
## 1   LeBron James    NBA  38 5.695122 44474988       F 28.9 55 54 35.5 11.1 22.2
## 2   Bradley Beal    NBA  29 5.695122 43279250       G 23.2 50 50 33.5  8.9 17.6
## 3 Damian Lillard    NBA  32 5.695122 42492492       G 32.2 NA 58 36.3  9.6 20.7
## 4   Kyrie Irving    NBA  30 5.695122 38917057       G 27.1 NA 60 37.4  9.9 20.1
## 5     Trae Young    NBA  24 5.695122 37096500       G 26.2 73 73 34.8  8.2 19.0
## 6   Jamal Murray    NBA  25 5.695122 31650600       G 20.0 65 NA 32.8  7.3 16.0
##     FGP  X3PP  X2PP   FTP TRB  AST PTS_imp AST_imp
## 1 0.500 0.321 0.580 0.768 8.3  6.8   FALSE   FALSE
## 2 0.506 0.365 0.552    NA 3.9  5.4   FALSE   FALSE
## 3    NA 0.371 0.574 0.914 4.8  7.3   FALSE   FALSE
## 4 0.494 0.379    NA 0.905 5.1  5.5   FALSE   FALSE
## 5 0.429 0.335 0.476 0.886  NA 10.2   FALSE   FALSE
## 6    NA 0.398 0.494 0.833  NA  6.2   FALSE   FALSE
### MARGIN
marnaIMP_KNN%>%
  select(PTS, AST, PTS_imp, AST_imp)%>%
  marginplot(delimiter = "imp")

We could also do a weighted average so that the neighbors don’t have equal weight.

## KNN - WEIGHTED

marnaIMP_KNN2<-kNN(bballSurv, k = 5, 
                   numFun=weighted.mean, 
                   variable = c("PTS", "AST"))

### MARGINAL
marnaIMP_KNN2%>%
  select(PTS, AST, PTS_imp, AST_imp)%>%
  marginplot(delimiter = "imp")