Expand Regression Model

Previously, the regression model used only points scored (PTS) to predict overall Game Score (GmSc). However, basketball performance involves much more than scoring. Players also contribute through rebounding and playmaking for teammates.

To capture these additional contributions, the following variables were added to the model:

TRB (Total Rebounds) – measures a player’s ability to regain possession of the ball both on offense and defense

AST (Assists) – measures a player’s ability to create scoring opportunities for teammates

These variables represent different aspects of player performance and may help explain variation in Game Score. One potential concern when adding multiple variables is multicollinearity, where predictors are highly correlated with each other. Since players who score more points may also accumulate more assists or rebounds, there may be some correlation between these predictors. However, they represent different roles within the game and are not expected to be perfectly correlated with each other.
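One way to quantify the multicollinearity concern is the variance inflation factor (VIF): regress each predictor on the others and compute 1 / (1 - R²). The sketch below is only an illustration. It uses a simulated data frame named `sim` (a hypothetical stand-in, not the real `nba` data) and a helper `vif_one` written for this example. A common rule of thumb treats VIF values above roughly 5 to 10 as a warning sign.

```r
# Simulated stand-in for the nba data frame (hypothetical values, not real box scores)
set.seed(1)
n   <- 200
PTS <- rpois(n, 12)
TRB <- rpois(n, 5) + round(0.1 * PTS)  # mildly related to scoring by construction
AST <- rpois(n, 3) + round(0.1 * PTS)
sim <- data.frame(PTS, TRB, AST)

# VIF for one predictor: regress it on the other predictors, then 1 / (1 - R^2)
vif_one <- function(target, data) {
  others <- setdiff(names(data), target)
  fit <- lm(reformulate(others, response = target), data = data)
  1 / (1 - summary(fit)$r.squared)
}

vifs <- sapply(names(sim), vif_one, data = sim)
print(round(vifs, 2))

# Pairwise correlations give a quick first look at the same question
print(round(cor(sim), 2))
```

With the real data, the same helper could be applied to `nba[, c("PTS", "TRB", "AST")]`, assuming those columns contain no missing values.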

Multiple Regression Model

model <- lm(GmSc ~ PTS + TRB + AST, data = nba)
summary(model)
## 
## Call:
## lm(formula = GmSc ~ PTS + TRB + AST, data = nba)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.046 -1.670  0.039  1.692 10.701 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.829657   0.204220   8.959   <2e-16 ***
## PTS         0.705896   0.006263 112.716   <2e-16 ***
## TRB         0.383637   0.014495  26.467   <2e-16 ***
## AST         0.556968   0.019863  28.041   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.609 on 1699 degrees of freedom
## Multiple R-squared:  0.9053, Adjusted R-squared:  0.9051 
## F-statistic:  5412 on 3 and 1699 DF,  p-value: < 2.2e-16

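The regression results show that points scored remains the strongest predictor of Game Score (t = 112.7), which is expected since scoring contributes most directly to the Game Score formula. Holding the other predictors constant, each additional point is associated with an increase of about 0.71 in Game Score, each rebound with about 0.38, and each assist with about 0.56. All three coefficients are positive and highly significant (p < 2e-16), suggesting that players who contribute across multiple statistical categories tend to achieve the highest performance ratings. Together the three predictors explain about 90.5% of the variation in Game Score (adjusted R-squared = 0.9051), so the added variables account for more dimensions of a player’s performance than scoring alone.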
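To make the coefficients concrete, the sketch below plugs a stat line into the fitted equation GmSc = 1.829657 + 0.705896·PTS + 0.383637·TRB + 0.556968·AST. The coefficients are copied from the summary output above. The stat line (20 points, 5 rebounds, 5 assists) is hypothetical and chosen only for illustration.

```r
# Coefficients copied from the summary(model) output
coefs <- c(intercept = 1.829657, PTS = 0.705896, TRB = 0.383637, AST = 0.556968)

# Hypothetical stat line, chosen only for illustration
stat_line <- c(PTS = 20, TRB = 5, AST = 5)

# Predicted Game Score = intercept + sum of (coefficient * value)
predicted_gmsc <- unname(coefs["intercept"] + sum(coefs[names(stat_line)] * stat_line))
print(predicted_gmsc)
```

With the fitted object available, `predict(model, newdata = data.frame(PTS = 20, TRB = 5, AST = 5))` should give the same value up to rounding of the printed coefficients.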

Residuals vs Fitted

gg_resfitted(model) +
  geom_smooth(se=FALSE)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

This plot is used to evaluate whether the relationship between the predictors and the response variable is linear and whether the variance of the residuals remains constant. The residuals appear randomly scattered around zero with no strong curved pattern, which suggests that the linearity assumption is reasonably satisfied. The spread may increase slightly at larger fitted values, where there are fewer observations, which could indicate mild heteroscedasticity, but the effect does not appear to be severe. Overall, confidence in these assumptions being satisfied is moderately high.
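A formal check can complement the visual impression of mild heteroscedasticity. The sketch below hand-computes a studentized Breusch-Pagan statistic: regress the squared residuals on the predictors and compare n·R² to a chi-squared distribution. This is an illustration on simulated data (`x`, `y`, and `fit` are hypothetical and not part of the original analysis), with the error spread made to grow with `x` by construction.

```r
# Simulated data (hypothetical): error spread increases with x by construction
set.seed(2)
n <- 300
x <- runif(n, 0, 30)
y <- 2 + 0.7 * x + rnorm(n, sd = 1 + 0.05 * x)
fit <- lm(y ~ x)

# Auxiliary regression of squared residuals on the predictor
aux <- lm(resid(fit)^2 ~ x)

# Studentized Breusch-Pagan statistic: n * R^2, chi-squared with df = number of predictors
lm_stat <- n * summary(aux)$r.squared
p_value <- pchisq(lm_stat, df = 1, lower.tail = FALSE)
print(c(statistic = lm_stat, p.value = p_value))
```

For the three-predictor model, the auxiliary regression would use PTS, TRB, and AST, with df = 3. A small p-value would point to non-constant variance.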

Residuals vs. X Values

plots <- gg_resX(model, plot.all = FALSE)

# for each variable of interest ...
plots$PTS +
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

plots$TRB +
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

plots$AST +
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

For all three predictors, the residuals appear scattered around zero without a clear pattern, which suggests that the linearity assumption is reasonable. The PTS residuals show the widest spread, which makes sense because the number of points a player scores has the strongest correlation with overall Game Score. All three plots follow similar trends, so confidence in this assumption is fairly high.

Normal Q-Q Plot

gg_qqplot(model)

This Q-Q plot checks whether the residuals follow a normal distribution. Most points lie very close to the diagonal reference line, indicating that the residuals are approximately normally distributed. A few observations at the extremes deviate slightly from the line, suggesting the presence of some outliers. These deviations are small and do not indicate a major violation of the normality assumption, so confidence in this assumption is also moderate to high.

Histogram of Residuals

gg_reshist(model)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This histogram helps assess whether the residuals from the regression model are approximately normally distributed. In this case, the distribution appears roughly symmetric around zero, with most residuals clustered near the center and frequencies tapering off toward the tails. While there may be slight deviations in the tails, the overall shape closely resembles a normal distribution, which suggests that the normality assumption is well satisfied.

Cook’s Distance

gg_cooksd(model, threshold = 'matlab')

Cook’s distance measures how much each observation influences the fitted regression model. Most observations have very small Cook’s distance values, indicating that no single one of them strongly influences the model. A few observations stand out, such as observation 857, but they should not be too detrimental, and the model should remain relatively stable.
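The flagged observations can also be listed numerically with `cooks.distance()` from base R. The sketch below runs on simulated data with one deliberately unusual point (`x`, `y`, and `fit` are hypothetical). The cutoff of three times the mean Cook’s distance is an assumption about what the 'matlab' option in `gg_cooksd` uses.

```r
# Simulated data (hypothetical) with one deliberately unusual observation
set.seed(3)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
x[n] <- 6    # high leverage
y[n] <- -10  # far from the trend
fit <- lm(y ~ x)

# Cook's distance for every observation
cd <- cooks.distance(fit)

# Assumed cutoff: three times the mean Cook's distance
threshold <- 3 * mean(cd)
flagged <- which(cd > threshold)
print(flagged)
```

On the real model, `which(cooks.distance(model) > 3 * mean(cooks.distance(model)))` would return the row indices to inspect. Refitting without those rows and comparing coefficients is one way to judge how much they matter.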

Insights and Significance

These results show that Game Score is influenced by multiple aspects of player performance. While points scored remains the most significant predictor, rebounds and assists also contribute to explaining the variation in Game Score in a meaningful way. This makes sense within the context of basketball analytics, since players who contribute in several statistical categories tend to have higher overall productivity and success. The diagnostic plots suggest that the assumptions of linear regression are largely satisfied, giving us confidence in the reliability of the model as a whole.

Further Questions

Although this model performs well, other variables such as steals, turnovers, or minutes played may also influence Game Score and could improve the model further. Future analysis could also examine whether the relationships between these variables differ between regular season and playoff games, or between players on different teams with different offenses. It would also be beneficial to explore the high-influence points further to determine whether they should be treated as outliers and how they should be handled. Exploring these factors could provide deeper insight into how different types of contributions affect overall player performance.