Introduction

The purpose of this data dive is to explore linear and generalized linear models using NBA player performance data. Specifically, this analysis examines how points, rebounds and assists relate to overall player performance and whether they can help explain variation in outcomes.

Linear Model

model_lm <- lm(GmSc ~ PTS + TRB + AST, data = nba)
summary(model_lm)
## 
## Call:
## lm(formula = GmSc ~ PTS + TRB + AST, data = nba)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.046 -1.670  0.039  1.692 10.701 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.829657   0.204220   8.959   <2e-16 ***
## PTS         0.705896   0.006263 112.716   <2e-16 ***
## TRB         0.383637   0.014495  26.467   <2e-16 ***
## AST         0.556968   0.019863  28.041   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.609 on 1699 degrees of freedom
## Multiple R-squared:  0.9053, Adjusted R-squared:  0.9051 
## F-statistic:  5412 on 3 and 1699 DF,  p-value: < 2.2e-16

We begin with a linear model using overall Game Score (GmSc) as the response variable, which summarizes a player’s performance as a whole. This model should help us understand how individual statistics contribute to overall performance. Points are expected to have the strongest relationship with Game Score, since scoring is a major component of the metric, and the output bears this out: PTS has the largest coefficient and by far the largest t value. The model has an R-squared value of 0.905, indicating a strong fit. However, this strong fit is somewhat expected because Game Score is partially constructed from these same variables, so much of the relationship is built in by definition. The overall model is highly significant (F-statistic p-value < 2.2e-16), indicating that the predictors jointly explain a significant share of the variation in the response. One key concern is redundancy: because Game Score is derived in part from points, rebounds, and assists, the response is not independent of the predictors, and the model may overstate how "predictive" these variables are. Multicollinearity in the strict sense, meaning correlation among the predictors themselves, is a separate question that this output does not address.
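Overlap among the predictors themselves can be checked separately from their overlap with the response. The following is a minimal sketch, not part of the original analysis: it computes variance inflation factors in base R as 1 / (1 - R^2), where R^2 comes from regressing each predictor on the others. The helper name vif_manual and the simulated data frame nba_sim are assumptions for illustration; with the real data, nba would be passed in instead.

```r
# Hypothetical helper: variance inflation factor for each predictor,
# computed as 1 / (1 - R^2) from regressing it on the remaining predictors.
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    aux <- lm(reformulate(others, response = p), data = data)
    1 / (1 - summary(aux)$r.squared)
  })
}

# Simulated stand-in for the nba data frame (NOT the real data)
set.seed(1)
n <- 500
nba_sim <- data.frame(
  PTS = rpois(n, 12),
  TRB = rpois(n, 5),
  AST = rpois(n, 3)
)

vifs <- vif_manual(nba_sim, c("PTS", "TRB", "AST"))
print(round(vifs, 2))
```

Values close to 1 indicate little overlap among the predictors. A common rule of thumb treats values above roughly 5 to 10 as a warning sign, although the cutoff is a convention and not a formal test.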

Diagnostics

#residuals vs fitted
plot(model_lm$fitted.values, model_lm$residuals,
     xlab = "Fitted Values",
     ylab = "Residuals",
     main = "Residuals vs Fitted")
abline(h = 0, col = "red")

#histogram
hist(model_lm$residuals,
     main = "Histogram of Residuals",
     xlab = "Residuals")

#q-q plot
qqnorm(model_lm$residuals)
qqline(model_lm$residuals, col = "red")

#residuals vs leverage plot (which = 5 includes Cook's distance contours)
plot(model_lm, which = 5)

Overall, the model performs reasonably well on these diagnostics. The normal Q-Q plot is close to linear, with some deviation in the tails, and the residuals vs leverage plot flags only a few points by Cook’s distance as potentially influential. Some mild patterns in the residual plots suggest possible heteroscedasticity. Additionally, since Game Score is partially derived from points, rebounds and assists, some redundancy between the response and the predictors is built into the model.
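The visual impressions above can be backed with numbers. The sketch below is an illustration under stated assumptions and not the author's procedure: it computes a studentized Breusch-Pagan statistic by hand (n times the R^2 from regressing the squared residuals on the predictors) and counts observations whose Cook's distance exceeds 4/n, which is a rule of thumb and not a strict threshold. The data frame sim is simulated and only stands in for nba.

```r
# Simulated stand-in for the nba data (NOT the real data)
set.seed(42)
n <- 300
sim <- data.frame(
  PTS = rpois(n, 12),
  TRB = rpois(n, 5),
  AST = rpois(n, 3)
)
sim$GmSc <- 1.8 + 0.7 * sim$PTS + 0.4 * sim$TRB + 0.55 * sim$AST +
  rnorm(n, sd = 2.6)

fit <- lm(GmSc ~ PTS + TRB + AST, data = sim)

# Studentized Breusch-Pagan statistic: regress the squared residuals on the
# predictors and compare n * R^2 to a chi-squared distribution with
# df equal to the number of predictors.
sim$res2 <- resid(fit)^2
aux <- lm(res2 ~ PTS + TRB + AST, data = sim)
bp_stat <- n * summary(aux)$r.squared
bp_pval <- pchisq(bp_stat, df = 3, lower.tail = FALSE)

# Cook's distance for every observation; 4 / n is a rule-of-thumb cutoff
cooks <- cooks.distance(fit)
flagged <- which(cooks > 4 / n)

print(c(statistic = bp_stat, p_value = bp_pval))
print(length(flagged))
```

A small p-value would point toward non-constant residual variance. The flagged rows are candidates for a closer look and are not automatically outliers to remove.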

Coefficient Interpretation

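The coefficient for points (PTS) represents the expected change in Game Score for each additional point scored while holding rebounds and assists constant. In the fitted model this coefficient is about 0.71, so one more point is associated with an increase of about 0.71 in Game Score. A value near 1 would match the weight that points carry in the Game Score formula; the lower estimate likely reflects components of the metric that are left out of this model, such as shot attempts. The same interpretation applies to total rebounds (about 0.38 per rebound) and assists (about 0.56 per assist), each holding the other two predictors constant.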
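As a worked example of this interpretation, the sketch below plugs a stat line into the fitted equation using the coefficients from the summary output above. The stat line of 20 points, 5 rebounds and 5 assists and the helper name predict_gmsc are hypothetical and chosen only for illustration.

```r
# Coefficients copied from the summary(model_lm) output
coefs <- c(intercept = 1.829657, PTS = 0.705896, TRB = 0.383637, AST = 0.556968)

# Hypothetical stat line (an assumption, not a row from the data)
game <- c(PTS = 20, TRB = 5, AST = 5)

# Hypothetical helper: intercept plus the sum of coefficient * statistic
predict_gmsc <- function(stats, coefs) {
  unname(coefs["intercept"] + sum(coefs[names(stats)] * stats))
}

base <- predict_gmsc(game, coefs)
one_more_point <- predict_gmsc(game + c(PTS = 1, TRB = 0, AST = 0), coefs)

print(base)                   # about 20.65
print(one_more_point - base)  # about 0.706, the PTS coefficient
```

The difference between the two predictions equals the PTS coefficient, which is what "holding rebounds and assists constant" means in practice.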

Logistic Model

#create binary variable for logistic model
nba$Playoffs <- ifelse(nba$Playoffs == "TRUE" | nba$Playoffs == TRUE, 1, 0)

model_log <- glm(Playoffs ~ PTS + TRB + AST, data = nba, family = "binomial")
## Warning: glm.fit: algorithm did not converge
summary(model_log)
## 
## Call:
## glm(formula = Playoffs ~ PTS + TRB + AST, family = "binomial", 
##     data = nba)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.657e+01  2.788e+04  -0.001    0.999
## PTS          4.231e-16  8.550e+02   0.000    1.000
## TRB         -7.830e-15  1.979e+03   0.000    1.000
## AST          8.723e-15  2.712e+03   0.000    1.000
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 0.0000e+00  on 1702  degrees of freedom
## Residual deviance: 9.8801e-09  on 1699  degrees of freedom
## AIC: 8
## 
## Number of Fisher Scoring iterations: 25

Diagnostics

fitted_vals <- fitted(model_log)
residuals_vals <- residuals(model_log, type = "deviance")

plot(fitted_vals, residuals_vals,
     xlab = "Fitted Values",
     ylab = "Deviance Residuals",
     main = "Residuals vs Fitted (Logistic)")
abline(h = 0, col = "red")

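We can also attempt a logistic regression model using playoff status as a binary outcome. However, the fit produced a non-convergence warning, and the standard errors are enormous relative to the estimates, with every p-value essentially equal to 1. This indicates that the model is highly unstable. The main issue is an extreme form of class imbalance: a null deviance of 0 together with an intercept near -26.6 suggests that the recoded Playoffs variable equals 0 for every row used in the fit, so there are effectively no playoff games for the model to learn from. Running table(nba$Playoffs) before fitting would show whether the data truly contain no playoff games or whether the recoding step mapped every value to 0. Because the fit did not converge, the residual plot above is not informative.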
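A quick guard before fitting can catch a constant outcome. The sketch below is an illustration only. It assumes, without confirmation from the data, that the raw column might use labels such as "True" and "False", in which case a comparison against "TRUE" would map every row to 0. The helper has_both_classes and the simulated vector raw are hypothetical.

```r
# Hypothetical helper: TRUE only if a 0/1 outcome contains both classes
has_both_classes <- function(y) {
  all(c(0, 1) %in% y[!is.na(y)])
}

# Simulated raw labels (NOT the real data); the label spelling is an assumption
set.seed(7)
raw <- sample(c("True", "False"), size = 200, replace = TRUE,
              prob = c(0.1, 0.9))

# Comparing against "TRUE" misses the label "True", so every row becomes 0
strict <- ifelse(raw == "TRUE", 1, 0)
# Normalising case first recovers both classes
tolerant <- ifelse(toupper(raw) == "TRUE", 1, 0)

print(table(strict))
print(table(tolerant))

# Fit only when the outcome actually varies
sim <- data.frame(Playoffs = tolerant, PTS = rpois(200, 12))
if (has_both_classes(sim$Playoffs)) {
  fit <- glm(Playoffs ~ PTS, data = sim, family = binomial)
  print(fit$null.deviance)  # positive when both classes are present
}
```

If the real data contain no playoff games at all, no recoding will help, and the logistic model would need a different data source or a different outcome variable.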

Further Questions

Would limiting or removing multicollinearity improve the linear model? Would balancing the dataset improve logistic regression results? Are there better variables to predict playoff games?