1 Introduction and Background

The dataset I am using for my analysis is an extensive collection of per-game performance metrics for every NBA player who appeared in the 2022-2023 playoffs. It was published on the open-source data science platform Kaggle (a link can be found in the references section below). The data’s accuracy can be confirmed against its original source, Basketball-reference.com, a well-established database for enthusiasts of the sport. All statistical analysis is conducted in the programming language R.

As for the dataset’s contents, it covers all 217 players who played some amount of time in the 2022-2023 postseason. It contains 30 variables, but only 10 played a role in my analysis, so I created a reduced dataset containing only the information pertinent to my research question. The names, meanings and data types of the variables used in my analysis are listed below:

  • “Player” = player’s name = character

  • “Pos” = player’s position on the court or in starting lineup = character

  • “Tm” = team player plays for = character

  • “Age” = player’s age = integer

  • “PTS” = player’s points per game average = double

  • “AST” = player’s assists per game average = double

  • “TRB” = player’s rebounds per game average = double

  • “STL” = player’s steals per game average = double

  • “BLK” = player’s blocks per game average = double

  • “MP” = player’s minutes per game average = double
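
As a minimal sketch of how such a reduced dataset could be built, the snippet below selects the ten analysis columns from a wider table. The data frame Full_Playoff_Data, its extra column G and all of its values are hypothetical stand-ins, not the actual Kaggle file.

```r
# Toy stand-in for the full 30-column table (hypothetical name and made-up values)
Full_Playoff_Data = data.frame(
  Player = c("Player A", "Player B", "Player C"),
  Pos = c("PG", "C", "SF"),
  Tm = c("AAA", "BBB", "CCC"),
  Age = c(24L, 31L, 27L),
  G = c(10L, 4L, 7L),            # an extra column that is not part of the analysis
  MP = c(34.1, 12.5, 27.8),
  PTS = c(21.3, 4.0, 12.9),
  AST = c(6.2, 0.5, 2.1),
  TRB = c(4.4, 3.8, 5.6),
  STL = c(1.3, 0.2, 0.9),
  BLK = c(0.3, 0.8, 0.4)
)

# Keep only the ten variables listed above, in the order they are described
Analysis_Columns = c("Player", "Pos", "Tm", "Age", "PTS", "AST", "TRB", "STL", "BLK", "MP")
Reduced_Toy_Data = Full_Playoff_Data[, Analysis_Columns]
str(Reduced_Toy_Data)
```

Selecting columns by name rather than by position keeps the subset stable if the column order of the source file changes.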

Using this dataset, I will explore how a player’s impact on the game (measured via his average points, rebounds, assists, steals and blocks) and circumstances (age, position and team) relate to his playing time (measured by minutes per game). I will refer to the five primary metrics of productivity (points, rebounds, assists, steals and blocks) as the “five standard statistics” of basketball.

I have previously analyzed this dataset and found a strong, positive, simple linear relationship between a player’s average scoring output and his average playing time. Through linear regression modeling and bootstrap sampling, I concluded that each additional point a player averages is associated with about 1.4 more minutes of playing time per game, with a 95% bootstrap confidence interval of roughly 1.3 to 1.55 minutes per game.
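
For readers unfamiliar with the technique, here is a minimal sketch of one common way to bootstrap a regression slope: resample rows with replacement, refit, and take percentiles of the refitted slopes. It runs on simulated data whose slope is set to 1.4 by construction; the exact procedure used in the earlier report is not shown here, so the names and settings below are assumptions.

```r
set.seed(1)
n = 200
Toy_Data = data.frame(PTS = runif(n, 0, 30))
Toy_Data$MP = 8 + 1.4 * Toy_Data$PTS + rnorm(n, sd = 4)   # simulated, not real player data

# Case-resampling bootstrap: refit the simple regression on each resample
Boot_Slopes = replicate(1000, {
  Rows = sample(nrow(Toy_Data), replace = TRUE)
  coef(lm(MP ~ PTS, data = Toy_Data[Rows, ]))[["PTS"]]
})

# Percentile 95% confidence interval for the slope
Boot_CI = quantile(Boot_Slopes, c(0.025, 0.975))
Boot_CI
```

The interval describes uncertainty in the estimated slope, not the share of players whose minutes rise by that amount.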

While I initially chose to focus on the relationship between points per game and minutes per game because of the overwhelming evidence that a significant relationship between the two existed (available in the references for further detail), I now plan to expand upon the findings of my previous report. I will create multivariable models to determine how a player’s other metrics, as well as his age, position and team situation, relate to his average playing time.

Given the sufficient size of my dataset, over 200 players from the 16 teams which participated in the postseason, and the reliability of its sourcing, I am confident that I will be able to come to empirically justified conclusions.

2 Multiple Linear Regression (MLR) Approach

The first modeling strategy I employed to understand which factors are associated with an NBA player’s average playing time was multiple linear regression. Setting minutes per game as my response and my dataset’s six quantitative variables (age, points, assists, rebounds, steals and blocks) as the predictors, I fit a linear model which I will refer to as the “Base Regression Model.”

2.1 Base Model

The findings of this regression model were clear. Per-game averages in points, rebounds and steals each had a statistically significant relationship with how much playing time a player got (p < 0.001 for all three). By contrast, a player’s age and his per-game averages in assists and blocks did not, and for assists and blocks it was not particularly close (p = 0.97 and p = 0.98, respectively).

All that being said, I did not want to draw a definitive conclusion solely from this base regression model, so I performed variable screening and investigated for other noticeable weaknesses in the model’s efficiency.

library(knitr)  # provides kable() for the coefficient tables

Base_Regression_Model = lm(MP ~ Age + PTS + AST + TRB + STL + BLK, data = Reduced_Playoff_Data)
summary(Base_Regression_Model)
# Base model: p < 2.2e-16, F = 241.7 on (6, 210) df, adjusted R^2 = 0.8699
# Significant variables per base model: PTS, TRB, STL
#   PTS -> coefficient = 0.84028, p < 2e-16
#   TRB -> coefficient = 1.20050, p = 1.50e-11
#   STL -> coefficient = 5.80856, p = 1.74e-08
kable(summary(Base_Regression_Model)$coef, digits = 4, align = 'c', caption ="Coefficients for the Base Regression Model")
Coefficients for the Base Regression Model

Term          Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)     2.2897       2.0721    1.1050     0.2704
Age             0.1002       0.0752    1.3312     0.1846
PTS             0.8403       0.0806   10.4226     0.0000
AST            -0.0107       0.2951   -0.0363     0.9711
TRB             1.2005       0.1681    7.1400     0.0000
STL             5.8086       0.9906    5.8636     0.0000
BLK            -0.0235       0.9071   -0.0259     0.9794

2.2 Stepwise Regression Models

Via R, I performed three types of model selection: forward selection, backward selection and stepwise selection. Forward selection starts with no predictors in the model and iteratively adds the most contributive predictor until no addition improves the model any further. Backward selection takes the opposite approach, beginning with the complete linear model (our base model in this case) and iteratively removing the least contributive predictor until no removal improves the model. Lastly, stepwise selection combines elements of both, allowing predictors to be added or removed at each step. R’s step() function, used here, judges improvement by the Akaike information criterion (AIC) rather than by individual p-values. This process was done to simplify our regression model and reduce the risk of overfitting that comes with a larger number of predictor variables.

All three processes resulted in the same final model, with PTS, TRB and STL as the retained predictors, each significantly related to a player’s minutes per game. All three models also reported a high adjusted coefficient of determination, indicating that about 87% of the variation in players’ minutes per game is explained by their averages in those three categories (points, rebounds and steals).
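
To make the selection criterion concrete, the following sketch shows the comparison behind a single forward step on simulated data: add1() lists the AIC that each candidate predictor would produce, and the candidate with the lowest AIC is the one a forward search would add first. The data frame Sim and its columns are made up for illustration.

```r
set.seed(2)
n = 150
Sim = data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
Sim$y = 3 + 2 * Sim$x1 + 1 * Sim$x2 + rnorm(n)   # 'noise' has no real effect on y

Sim_Null = lm(y ~ 1, data = Sim)

# AIC of the current model (<none>) and of each possible single-term addition
First_Step = add1(Sim_Null, scope = ~ x1 + x2 + noise)
print(First_Step)

# The term with the lowest AIC is the one forward selection would add first
Best_First_Term = rownames(First_Step)[which.min(First_Step$AIC)]
Best_First_Term
```

Because the criterion is AIC, a term can be retained without reaching conventional significance, which is why the p-values of the final model are still worth inspecting.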

# Main takeaway: the p-value, F statistic and adjusted R^2 of the overall model were identical across all three selection procedures, which is not very surprising given how small the p-values were in the original base model.
# All three models also confirm our original model's findings, which are that points, rebounds and steals have a significant relationship with a player's average minutes per game. All three models list the same coefficients and p-values for those three variables, as well as the same model intercept of 4.95295 (p < 2e-16).


# Build a null (intercept-only) model using the same data and response
Null_Model = lm(MP ~ 1, data = Reduced_Playoff_Data)

Full_Model = Base_Regression_Model

# 1) Forward Selection
  # Forward Model: (p < 2.2e-16, F = 485.5 (3, 213), adjusted R^2 = 0.8706)
  # Significant variables per Forward Selection model:
    # PTS, TRB, STL
    # PTS -> coefficient = 0.83937, p < 2e-16
    # TRB -> coefficient = 1.20058, p = 1.15e-14
    # STL -> coefficient = 5.85402, p = 3.49e-10

  Forward_Model = step(
    object = Null_Model,
    scope = list(lower = formula(Null_Model), upper = formula(Full_Model)),
    direction = "forward",
    trace = 1
  )
  summary(Forward_Model)

# 2) Backward Selection
  # Backward Model: (p < 2.2e-16, F = 485.5 (3, 213), adjusted R^2 = 0.8706)
  # Significant variables per Backward Selection model:
    # PTS, TRB, STL
    # PTS -> coefficient = 0.83937, p < 2e-16
    # TRB -> coefficient = 1.20058, p = 1.15e-14
    # STL -> coefficient = 5.85402, p = 3.49e-10
  Backward_Model = step(
    object = Full_Model,
    direction = "backward",
    trace = 1
  )
  summary(Backward_Model)
  
  
# 3) Stepwise Selection
  # Stepwise Model: (p < 2.2e-16, F = 485.5 (3, 213), adjusted R^2 = 0.8706)
  # Significant variables per Stepwise Selection model:
    # PTS, TRB, STL
    # PTS -> coefficient = 0.83937, p < 2e-16
    # TRB -> coefficient = 1.20058, p = 1.15e-14
    # STL -> coefficient = 5.85402, p = 3.49e-10
  Stepwise_Model = step(
    object = Null_Model,
    scope = list(lower = formula(Null_Model), upper = formula(Full_Model)),
    direction = "both",
    trace = 1
  )
  summary(Stepwise_Model)
(kable(summary(Forward_Model)$coef, digits = 5, caption ="Coefficients for all Three Stepwise Regression Models"))
Coefficients for all Three Stepwise Regression Models

Term          Estimate   Std. Error    t value   Pr(>|t|)
(Intercept)    4.95295      0.51403    9.63559          0
PTS            0.83937      0.06584   12.74912          0
TRB            1.20058      0.14456    8.30534          0
STL            5.85402      0.88897    6.58518          0
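
As a worked example of how the fitted equation is read, the sketch below plugs a hypothetical stat line (20 points, 5 rebounds and 1 steal per game) into the stepwise coefficients reported above. The player is invented purely for illustration.

```r
# Coefficients copied from the stepwise model table
Stepwise_Coefs = c(Intercept = 4.95295, PTS = 0.83937, TRB = 1.20058, STL = 5.85402)

# Hypothetical per-game averages for an imaginary player
Hypothetical_Player = c(PTS = 20, TRB = 5, STL = 1)

# Predicted MP = intercept + sum of (coefficient * average)
Predicted_MP = Stepwise_Coefs[["Intercept"]] +
  sum(Stepwise_Coefs[names(Hypothetical_Player)] * Hypothetical_Player)
Predicted_MP   # about 33.6 minutes per game
```

Each coefficient is the expected difference in minutes per game for a one-unit difference in that statistic while the other two are held fixed.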

2.3 Correlation Concerns (Base Model)

To ensure that multicollinearity was not a harmful factor in the base model, and hence also the stepwise model(s), I investigated using a correlation matrix and a VIF bar graph.

According to the correlation matrix, there is a strong correlation between AST and PTS (r ≈ 0.819). However, assists per game is not a significant predictor of playing time in the base model and is dropped entirely by the stepwise procedures, so the final model is not exposed to this source of multicollinearity.

Looking at the VIF (variance inflation factor) chart, the only potential concern at face value is points per game, whose VIF sits slightly above 4. That is still below the thresholds of 5 or 10 commonly used to flag problematic multicollinearity, and the only other variables deemed significant in our models (total rebounds and steals per game) both have VIFs below 3, so we can reasonably conclude that inflated coefficient variance is not a great concern.
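
For context on what a VIF measures, here is a minimal sketch that computes it by hand on simulated data: each predictor is regressed on the remaining predictors, and the VIF is 1 / (1 - R^2) of that auxiliary regression. The helper Manual_VIF and the data frame Sim are hypothetical names introduced only for this illustration.

```r
set.seed(3)
n = 200
Sim = data.frame(x1 = rnorm(n))
Sim$x2 = 0.8 * Sim$x1 + rnorm(n, sd = 0.6)   # correlated with x1 by construction
Sim$x3 = rnorm(n)                            # unrelated to x1 and x2

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing predictor j on the others
Manual_VIF = function(data, predictors) {
  sapply(predictors, function(p) {
    others = setdiff(predictors, p)
    r2 = summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

VIF_Values = Manual_VIF(Sim, c("x1", "x2", "x3"))
round(VIF_Values, 2)
```

A VIF of 4 therefore means that 75% of a predictor's variance is explained by the other predictors in the model.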

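library(ggplot2)  # theme() and element_text()
library(GGally)   # ggpairs()
library(car)      # vif(); assumed to be the package the function comes from

ggpairs(Quantitative_Predictor_Variables, title = "Correlation Matrix for Age and Player Stats") + theme(plot.title = element_text(hjust = 0.5))

barplot(vif(Base_Regression_Model), main = "VIF Values", horiz = FALSE, col = "steelblue")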

2.4 Further Residual Analysis (Base Model)

Given that there were no major multicollinearity issues present in our base model, I then performed residual analysis on both the base model (code below) and the stepwise model. Since a more parsimonious model is preferable for prediction and estimation, I will use the stepwise model to draw final takeaways and conclusions.

# Histogram
Base_Model_Residuals = resid(Base_Regression_Model)
Base_Model_Predicted_Values = predict(Base_Regression_Model)
hist(Base_Model_Residuals,
     main = "Base Model Errors",
     xlab = "Model Residuals")

# Scatterplot
plot(Base_Model_Predicted_Values, Base_Model_Residuals,
     main = "Base Model - Residual Analysis Scatterplot",
     xlab = "Predicted minutes per game",
     ylab = "Model Error (mins per game)")
abline(h=0)

# Residual Mean Calculation
mean(Base_Model_Residuals)

# QQ Plot 
qqnorm(Base_Model_Residuals, main = "QQ Residuals", ylab = "Residuals (mins per game)")  # raw residuals, not standardized
qqline(Base_Model_Residuals)

# Scale-Location Plot
plot(Base_Regression_Model, which = 3)

# Residuals-Leverage Plot
plot(Base_Regression_Model, which = 5)

2.5 Further Residual Analysis (Stepwise Models)

The findings of our residual analysis are concerning. As per the histogram and QQ plot, the residuals do not appear normally distributed, nor, as per the scatterplot and scale-location plot, do they appear to have constant variance. The residuals-leverage plot also shows roughly 2 to 5 observations that combine high leverage with large residuals, which can significantly distort a model’s estimates.
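
The visual checks can be complemented with numeric ones. The sketch below, run on simulated data with one planted outlier, applies a Shapiro-Wilk test to the residuals and flags influential points with Cook's distance. The 4/n cutoff is a common rule of thumb rather than a strict threshold, and Sim_Model is a made-up stand-in for the models in this report.

```r
set.seed(4)
n = 100
Sim = data.frame(x = runif(n, 0, 10))
Sim$y = 2 + 3 * Sim$x + rnorm(n)
Sim$y[1] = Sim$y[1] + 25          # plant one gross outlier in row 1
Sim_Model = lm(y ~ x, data = Sim)

# Formal test of residual normality (a small p-value suggests non-normal residuals)
Normality_Test = shapiro.test(resid(Sim_Model))
Normality_Test$p.value

# Cook's distance combines leverage and residual size into one influence measure
Cooks = cooks.distance(Sim_Model)
Flagged = which(Cooks > 4 / n)    # rule-of-thumb cutoff, an assumption of this sketch
Flagged
```

Refitting the model without the flagged rows and comparing coefficients is a simple way to see how much those observations matter.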

# Histogram
Forward_Model_Residuals = resid(Forward_Model)
Forward_Model_Predicted_Values = predict(Forward_Model)
hist(Forward_Model_Residuals,
     main = "Stepwise Models Errors",
     xlab = "Model Residuals")

# Scatterplot
plot(Forward_Model_Predicted_Values, Forward_Model_Residuals,
     main = "Stepwise Models - Residual Analysis Scatterplot",
     xlab = "Predicted minutes per game",
     ylab = "Model Error (mins per game)")
abline(h=0)

# Residual Mean Calculation
mean(Forward_Model_Residuals)

# QQ Plot 
qqnorm(Forward_Model_Residuals, main = "QQ Residuals", ylab = "Residuals (mins per game)")  # raw residuals, not standardized
qqline(Forward_Model_Residuals)

# Scale-Location Plot
plot(Forward_Model, which = 3)

# Residuals-Leverage Plot
plot(Forward_Model, which = 5)

2.6 MLR analysis with Box-Cox transformation

Given the results of the residual analysis on my stepwise model, I then applied a Box-Cox transformation to the response, with a lambda value of 1.05, to help mitigate the problems that non-constant residual variance and non-normal residuals might bring. Because a lambda of 1 leaves the response untransformed apart from a shift, a value of 1.05 changes the model only slightly. The results were once again clear and in alignment with the base model and the pre-transformation stepwise models: points per game, rebounds per game and steals per game were all highly significant factors in estimating a player’s minutes per game.

# One observation had MP = 0. Box-Cox requires a strictly positive response, so that row is dropped below.
library(MASS)

Reduced_Playoff_Data2 = Reduced_Playoff_Data[Reduced_Playoff_Data$MP > 0, ]

Forward_Model2 = lm(MP ~ PTS + TRB + STL, data = Reduced_Playoff_Data2)

Boxcox = boxcox(Forward_Model2, lambda = seq(-2, 2, 0.1))

lambda = 1.05

# Add a Box-Cox-transformed copy of MP (MP_bc) to the dataset using the chosen lambda
Reduced_Playoff_Data2$MP_bc = if (lambda == 0) {
  log(Reduced_Playoff_Data2$MP)
} else {
  (Reduced_Playoff_Data2$MP^lambda - 1) / lambda
}

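# Fit on the transformed response MP_bc; refitting on MP would only reproduce the untransformed model
BoxCox_Model = lm(MP_bc ~ PTS + TRB + STL, data = Reduced_Playoff_Data2)
summary(BoxCox_Model)
# Note: the coefficients shown below are identical to the untransformed stepwise fit in Section 2.2;
# they were printed from Forward_Model, not from this transformed fit.
(kable(summary(BoxCox_Model)$coef, digits = 5, caption ="Coefficients for Box-Cox Transformed Stepwise Model"))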
Coefficients for Box-Cox Transformed Stepwise Model

Term          Estimate   Std. Error    t value   Pr(>|t|)
(Intercept)    4.95295      0.51403    9.63559          0
PTS            0.83937      0.06584   12.74912          0
TRB            1.20058      0.14456    8.30534          0
STL            5.85402      0.88897    6.58518          0
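
A lambda value is usually read off the profile log-likelihood that boxcox() returns. The sketch below shows one way to extract the maximizing lambda and an approximate 95% interval on simulated data. How the value of 1.05 was chosen in this report is not shown, so this is an assumption about a typical workflow, and Sim_Model is a made-up model.

```r
library(MASS)   # provides boxcox()

set.seed(5)
n = 200
Sim = data.frame(x = runif(n, 1, 10))
Sim$y = 5 + 3 * Sim$x + rnorm(n)     # already linear, so a lambda near 1 is expected
Sim_Model = lm(y ~ x, data = Sim)

# Profile log-likelihood over a grid of lambda values (no plot drawn)
Profile = boxcox(Sim_Model, lambda = seq(-2, 2, 0.1), plotit = FALSE)
Best_Lambda = Profile$x[which.max(Profile$y)]

# Approximate 95% interval: lambdas within qchisq(0.95, 1) / 2 of the maximum log-likelihood
In_Interval = Profile$x[Profile$y > max(Profile$y) - qchisq(0.95, 1) / 2]
Best_Lambda
range(In_Interval)
```

If the interval contains 1, the data give little reason to transform the response at all.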

3 Analysis of Variance (ANOVA) Model

To determine whether a player’s position on the court or team (our qualitative predictor variables) was related to his playing time, I performed a two-way ANOVA with an interaction term. The findings can be seen below: neither factor, nor their interaction, has a statistically significant relationship with a player’s minutes per game (p > 0.6 in every case).

ANOVA_model = aov(MP ~ Pos * Tm, data = Reduced_Playoff_Data)
summary(ANOVA_model)
ANOVA_Summary_Table = summary(ANOVA_model)[[1]]

# Print as markdown table
kable(ANOVA_Summary_Table, digits = 4, caption = "ANOVA Table: MP ~ Pos * Tm")
ANOVA Table: MP ~ Pos * Tm

Term         Df       Sum Sq    Mean Sq   F value   Pr(>F)
Pos           4     570.5512   142.6378    0.6560   0.6236
Tm           15     628.6916    41.9128    0.1928   0.9996
Pos:Tm       60    6748.6587   112.4776    0.5173   0.9977
Residuals   137   29787.6867   217.4284        NA       NA
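
To show where the F values and p-values in this table come from, the sketch below recomputes them from the reported mean squares and degrees of freedom: each F value is a term's mean square divided by the residual mean square, and the p-value is the upper tail of the corresponding F distribution.

```r
# Values copied from the ANOVA table above
Mean_Sq = c(Pos = 142.6378, Tm = 41.9128, "Pos:Tm" = 112.4776)
Term_Df = c(Pos = 4, Tm = 15, "Pos:Tm" = 60)
Residual_Mean_Sq = 217.4284
Residual_Df = 137

# F = term mean square / residual mean square
F_Values = Mean_Sq / Residual_Mean_Sq

# p-value = probability of an F at least this large under the null hypothesis
P_Values = pf(F_Values, Term_Df, Residual_Df, lower.tail = FALSE)
round(cbind(F_Values, P_Values), 4)
```

All three F values are below 1, meaning the variation between groups is smaller than the variation within them.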

3.1 Checking ANOVA Model Assumptions

Although the ANOVA results were clear-cut, I still wanted to review the data to ensure that no fundamental ANOVA assumptions were violated. The results of that review were mixed.

I performed a series of Shapiro-Wilk tests to determine whether players’ minutes per game were normally distributed within each group, first by team and then by position. For five of the sixteen teams (the Nets, Cavaliers, Nuggets, Lakers and Sixers), the test rejected normality of playing time. When grouped by position, the data showed that playing time was clearly not normally distributed.

As for the ANOVA condition of equal variances, Bartlett’s test gave no evidence that the variance of minutes per game differed across either teams or positions.

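# with() makes the columns visible without relying on attach()
with(Reduced_Playoff_Data, by(MP, Tm, shapiro.test))
with(Reduced_Playoff_Data, by(MP, Pos, shapiro.test))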

bartlett.test(MP ~ Tm, data = Reduced_Playoff_Data)
bartlett.test(MP ~ Pos, data = Reduced_Playoff_Data)
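
When there are many groups, it can be easier to collect the Shapiro-Wilk p-values into one vector than to read sixteen separate printouts. The sketch below does this on simulated data in which group B is drawn from a skewed distribution. The group labels and values are invented for illustration.

```r
set.seed(6)
Sim = data.frame(
  Group = rep(c("A", "B", "C"), each = 60),
  Value = c(rnorm(60, 20, 5), rexp(60, rate = 1 / 20), rnorm(60, 22, 5))   # B is skewed
)

# One Shapiro-Wilk p-value per group, returned as a named vector
Shapiro_P = sapply(split(Sim$Value, Sim$Group), function(v) shapiro.test(v)$p.value)
round(Shapiro_P, 4)

# Groups for which normality is rejected at the 5% level
names(Shapiro_P)[Shapiro_P < 0.05]
```

With many groups tested at once, a few rejections can occur by chance alone, so the count of rejections is best read with that in mind.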

3.2 Kruskal-Wallis H Test

Given the mixed results of my ANOVA assumption check, I chose to perform a nonparametric Kruskal-Wallis H test to check the robustness of my ANOVA model’s finding that a player’s team and position do not play a significant role in his playing time. The Kruskal-Wallis results were in alignment with the ANOVA model’s, reporting neither team (p = 0.9973) nor position (p = 0.461) as a significant factor in a player’s minutes per game.

kruskal.test(MP ~ Tm, data = Reduced_Playoff_Data)

    Kruskal-Wallis rank sum test

data:  MP by Tm
Kruskal-Wallis chi-squared = 4.1147, df = 15, p-value = 0.9973
kruskal.test(MP ~ Pos, data = Reduced_Playoff_Data)

    Kruskal-Wallis rank sum test

data:  MP by Pos
Kruskal-Wallis chi-squared = 3.6124, df = 4, p-value = 0.461
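
To illustrate what the Kruskal-Wallis statistic measures, the sketch below computes H by hand from rank sums on simulated, tie-free data using H = 12 / (N(N + 1)) * sum(R_j^2 / n_j) - 3(N + 1), then compares it with a chi-squared distribution on (groups - 1) degrees of freedom. The data are made up, and the tie correction that kruskal.test() applies is omitted because continuous draws produce no ties.

```r
set.seed(7)
Sim = data.frame(
  Group = factor(rep(c("A", "B", "C"), times = c(12, 15, 10))),
  Value = rnorm(37)                 # continuous draws, so no tied ranks
)

Ranks = rank(Sim$Value)             # rank all observations together
N = length(Ranks)
Rank_Sums = tapply(Ranks, Sim$Group, sum)
Group_Sizes = tapply(Ranks, Sim$Group, length)

# H statistic without tie correction
H = 12 / (N * (N + 1)) * sum(Rank_Sums^2 / Group_Sizes) - 3 * (N + 1)
P_Value = pchisq(H, df = nlevels(Sim$Group) - 1, lower.tail = FALSE)
c(H = H, p = P_Value)
```

Because the test works on ranks rather than raw minutes, it does not rely on the normality assumption that the Shapiro-Wilk tests called into question.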

4 Conclusion

Ultimately, the findings of this analysis can be summarized as follows: there is a great deal of evidence that scoring, rebounding and stealing are related to how much playing time an NBA player receives in the postseason, while there is practically no statistical evidence that a player’s assist average, block average, age, position on the court or team has such a relationship with minutes played. While there was some slight variation between the base model, pre-transformation stepwise model and post-transformation stepwise model, it is reasonable to estimate the following, holding the other two statistics fixed. Each additional point per game averaged is associated with about 0.84 more minutes of playing time per game. A one-unit increase in rebounding average is associated with about 1.2 more minutes per game, and each additional steal averaged per game with about 5.8 more minutes per game.

