In this lesson students will learn to:
These data were put together from NBA and WNBA records for the 2022-2023 and 2022 seasons, respecitively.
For this activity we are going to pretend that we are doing a survey on professional basketball players. These data represent only the raw responses to the survey. These data reflect a subset of the NBA and WNBA 2022 population and also reflects nonresponse.
Download the data here:
## DOWNLOAD RAW SURVEY DATA
bballSurv<-read.csv("https://raw.githubusercontent.com/kitadasmalley/Teaching/refs/heads/main/DATA429_599/CODE/raw_bballSurvey.csv")
str(bballSurv)
## 'data.frame': 120 obs. of 18 variables:
## $ Player : chr "LeBron James" "Bradley Beal" "Damian Lillard" "Kyrie Irving" ...
## $ League : chr "NBA" "NBA" "NBA" "NBA" ...
## $ Age : int 38 29 32 30 24 25 25 25 32 37 ...
## $ wi_star: num 5.7 5.7 5.7 5.7 5.7 ...
## $ Salary : num 44474988 43279250 42492492 38917057 37096500 ...
## $ SimpPos: chr "F" "G" "G" "G" ...
## $ PTS : num 28.9 23.2 32.2 27.1 26.2 20 20.4 25 14.7 13.9 ...
## $ GP : int 55 50 NA NA 73 65 75 NA 50 59 ...
## $ GS : int 54 50 58 60 73 NA 75 NA 50 59 ...
## $ MP : num 35.5 33.5 36.3 37.4 34.8 32.8 34.6 33.4 31.5 32 ...
## $ FG : num 11.1 8.9 9.6 9.9 8.2 7.3 8 NA 5.5 5 ...
## $ FGA : num 22.2 17.6 20.7 20.1 19 16 14.9 18.2 11.6 11.3 ...
## $ FGP : num 0.5 0.506 NA 0.494 0.429 NA 0.54 NA 0.475 0.44 ...
## $ X3PP : num 0.321 0.365 0.371 0.379 0.335 0.398 0.083 0.324 NA 0.375 ...
## $ X2PP : num 0.58 0.552 0.574 NA 0.476 0.494 0.545 0.584 NA 0.482 ...
## $ FTP : num 0.768 NA 0.914 0.905 0.886 0.833 0.806 0.78 0.811 NA ...
## $ TRB : num 8.3 3.9 4.8 5.1 NA NA 9.2 NA 4.3 4.3 ...
## $ AST : num 6.8 5.4 7.3 5.5 10.2 6.2 3.2 6.1 4.1 8.9 ...
Let’s to a quick EDA to look for patterns. How about a pairs plot?!
library(tidyverse)
### PAIRS PLOT
#install.packages("GGally")
library(GGally)
ggpairs(bballSurv, columns = c(5, 7:18))
The leagues have different styles of game play which might impact the relationships between the variables.
### ADD COLOR
ggpairs(bballSurv, columns = c(5, 7:18), ggplot2::aes(colour = League))
Consider the following two models.
#### WHAT HAPPENS WITH MISSING DATA
### SALARY AS A FUNCTION OF PTS AND LEAGUE
mod1<-lm(Salary~PTS+League, data=bballSurv)
summary(mod1)
##
## Call:
## lm(formula = Salary ~ PTS + League, data = bballSurv)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22252644 -3682180 435703 3124995 16951292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -991668 1304794 -0.760 0.45
## PTS 1177570 101875 11.559 < 2e-16 ***
## LeagueWNBA -8598988 1622020 -5.301 1.12e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6513000 on 75 degrees of freedom
## (42 observations deleted due to missingness)
## Multiple R-squared: 0.6966, Adjusted R-squared: 0.6885
## F-statistic: 86.08 on 2 and 75 DF, p-value: < 2.2e-16
### SALARY AS A FUNCTION OF PTS AND LEAGUE AND AST
mod2<-lm(Salary~PTS+AST+League, data=bballSurv)
summary(mod2)
##
## Call:
## lm(formula = Salary ~ PTS + AST + League, data = bballSurv)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16951629 -3132570 -290050 3564051 15540967
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1955961 1318845 -1.483 0.143465
## PTS 720305 158746 4.537 2.92e-05 ***
## AST 2404291 565140 4.254 7.74e-05 ***
## LeagueWNBA -7567365 1850653 -4.089 0.000135 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6004000 on 58 degrees of freedom
## (58 observations deleted due to missingness)
## Multiple R-squared: 0.7847, Adjusted R-squared: 0.7736
## F-statistic: 70.46 on 3 and 58 DF, p-value: < 2.2e-16
These models are not actually comparable due to missing data! If any row in the data has a missing value for variables used in the model, that whole line will be removed. This is called listwise deletion.
In our first attempt, we might be curious about how pervasive nonresponse is.
First, let’s try to understand how complete the records are. How many nonresponses does each individual have?
### NONRESP BY PLAYER
nonR<-bballSurv%>%
is.na()%>%
rowSums()
### MAKE A NEW DATAFRAME
playerNR<-bballSurv%>%
select("Player")%>%
cbind(nonR)%>%
mutate(respR=(14-nonR)/14)
### PLOT THE FREQ
### COMPLETE >= 80%
### PARTIAL [50, 80)
### BREAKOFF < 50%
ggplot(data=playerNR, aes(x=respR))+
geom_histogram()+
geom_vline(xintercept = .8, color="blue", lwd=1.5, lty=2)+
geom_vline(xintercept = .5, color="red", lwd=1.5, lty=2)+
theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Now, lets look at the frequency of nonresponse across the variables.
### NONRESP BY VAR
bballSurv%>%
is.na()%>%
colSums()
## Player League Age wi_star Salary SimpPos PTS GP GS MP
## 0 0 0 0 21 0 22 26 21 20
## FG FGA FGP X3PP X2PP FTP TRB AST
## 15 32 28 25 29 37 30 21
Do any variables seem to get missed significantly more than others?
Let’s make a graphic to see where the missing data are.
### VISUALIZE MISSING
#install.packages("visdat")
library(visdat)
vis_miss(bballSurv)
#### C. Patterns in Missing Data
Sometimes there are relationships between variables such that there is a higher occurrence of multiple variables missing at the same time.
### VIM
#install.packages("VIM")
library(VIM)
## Loading required package: colorspace
## Loading required package: grid
## VIM is ready to use.
## Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues
##
## Attaching package: 'VIM'
## The following object is masked from 'package:datasets':
##
## sleep
bballSurv%>%
aggr(combined=TRUE, numbers=TRUE)
## Warning in plot.aggr(res, ...): not enough vertical space to display
## frequencies (too many combinations)
It’s hard to see with this many variables, let’s just look at a select few:
### MAYBE WE SHOULD LOOK AT LESS VARIABLES
## SALARY, PTS, GP, GS, MP, FG, AST
bballSurv%>%
select(c("Salary", "PTS", "GP", "MP", "AST"))%>%
aggr(combined=TRUE, numbers=TRUE,
cex.numbers=.5)
Imputation is the practice of using statistical methods to fill in missing data.
Mean imputation replaces missing values with the average.
#### MEAN IMPUTATION
## CREATE INDICATORS
## REPLACE WITH MEANS
marnaIMP<-bballSurv%>%
mutate(PTS_imp=ifelse(is.na(PTS), TRUE, FALSE))%>%
mutate(AST_imp=ifelse(is.na(AST), TRUE, FALSE))%>%
mutate(PTS=ifelse(is.na(PTS), mean(PTS, na.rm=TRUE),PTS))%>%
mutate(AST=ifelse(is.na(AST), mean(AST, na.rm=TRUE), AST))
### MARGIN PLOT
marnaIMP%>%
select(PTS, AST, PTS_imp, AST_imp)%>%
marginplot(delimiter = "imp")
## THIS IS NOT GREAT
## THERE IS NOT ENOUGH VARIABILITY
Hotdeck imputation takes “donor” from previous complete data records.
#### VANILLA HOT DECK
marnaIMP_HD<-hotdeck(bballSurv, variable=c("PTS", "AST"))
### MARGIN PLOT
marnaIMP_HD%>%
select(PTS, AST, PTS_imp, AST_imp)%>%
marginplot(delimiter = "imp")
The problem here is that this just borrows the previous value, but often due to relationships/correlations between variables, we should reorder the data to borrow a “better” value.
## PUT IN ORDER (NUM) - GP
marnaIMP_HD2<-hotdeck(bballSurv,
variable="PTS",
domain_var = "GP")
## MARGIN PLOT
marnaIMP_HD2%>%
select(PTS, AST, PTS_imp)%>%
marginplot(delimiter = "imp")
Similarly, there can be similar values within groups.
## PUT IN GROUPS (SimpPos)
marnaIMP_HD3<-hotdeck(bballSurv,
variable="AST",
ord_var = "SimpPos")
## MARGIN PLOT
marnaIMP_HD3%>%
select(PTS, AST, AST_imp)%>%
marginplot(delimiter = "imp")
Consider (completed) respondents similar to the one with missing data. Then average their responses.
### KNN (5)
marnaIMP_KNN<-kNN(bballSurv, k = 5, variable = c("PTS", "AST"))
head(marnaIMP_KNN)
## Player League Age wi_star Salary SimpPos PTS GP GS MP FG FGA
## 1 LeBron James NBA 38 5.695122 44474988 F 28.9 55 54 35.5 11.1 22.2
## 2 Bradley Beal NBA 29 5.695122 43279250 G 23.2 50 50 33.5 8.9 17.6
## 3 Damian Lillard NBA 32 5.695122 42492492 G 32.2 NA 58 36.3 9.6 20.7
## 4 Kyrie Irving NBA 30 5.695122 38917057 G 27.1 NA 60 37.4 9.9 20.1
## 5 Trae Young NBA 24 5.695122 37096500 G 26.2 73 73 34.8 8.2 19.0
## 6 Jamal Murray NBA 25 5.695122 31650600 G 20.0 65 NA 32.8 7.3 16.0
## FGP X3PP X2PP FTP TRB AST PTS_imp AST_imp
## 1 0.500 0.321 0.580 0.768 8.3 6.8 FALSE FALSE
## 2 0.506 0.365 0.552 NA 3.9 5.4 FALSE FALSE
## 3 NA 0.371 0.574 0.914 4.8 7.3 FALSE FALSE
## 4 0.494 0.379 NA 0.905 5.1 5.5 FALSE FALSE
## 5 0.429 0.335 0.476 0.886 NA 10.2 FALSE FALSE
## 6 NA 0.398 0.494 0.833 NA 6.2 FALSE FALSE
### MARGIN
marnaIMP_KNN%>%
select(PTS, AST, PTS_imp, AST_imp)%>%
marginplot(delimiter = "imp")
We could also do a weighted average so that the neighbors don’t have equal weight.
## KNN - WEIGHTED
marnaIMP_KNN2<-kNN(bballSurv, k = 5,
numFun=weighted.mean,
variable = c("PTS", "AST"))
### MARGINAL
marnaIMP_KNN2%>%
select(PTS, AST, PTS_imp, AST_imp)%>%
marginplot(delimiter = "imp")