Introduction and Background
The dataset I am using for my analysis is an extensive collection of
metrics for every NBA player’s game-by-game performance across the
2022-2023 playoff season. It was published on the open-source data
science platform Kaggle (link can be found in references section below).
The original source for the data’s accuracy can be confirmed at
Basketball-reference.com, a well-established database for enthusiasts of
the sport. All statistical analysis will be conducted via the
programming language R.
As for the dataset’s contents, it covers all 217 players who
played some amount of time in the 2022-2023 postseason. It contains 30
variables, but only 10 played a role in my analysis, so I created a
reduced dataset containing only the information pertinent to my
research question. The names and data types of the variables used in
my analysis are explained below:
“Player” = player’s name = character
“Pos” = player’s position on the court or in starting lineup =
character
“Tm” = team player plays for = character
“Age” = player’s age = integer
“PTS” = player’s points per game average = double
“AST” = player’s assists per game average = double
“TRB” = player’s rebounds per game average = double
“STL” = player’s steals per game average = double
“BLK” = player’s blocks per game average = double
“MP” = player’s minutes per game average = double
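As a minimal sketch of how such a reduced dataset could be built, the snippet below keeps only the ten columns listed above. The data frame `Full_Playoff_Data`, its placeholder rows and the column `Extra_Stat` are hypothetical stand-ins, not the actual Kaggle file.

```r
# Hypothetical stand-in for the full 30-variable dataset (placeholder values only)
Full_Playoff_Data <- data.frame(
  Player = c("Player A", "Player B"),
  Pos = c("PG", "C"),
  Tm = c("AAA", "BBB"),
  Age = c(25L, 30L),
  PTS = c(12.5, 8.0), AST = c(4.1, 1.2), TRB = c(3.3, 9.0),
  STL = c(1.1, 0.4), BLK = c(0.2, 1.5), MP = c(28.4, 22.0),
  Extra_Stat = c(0.45, 0.52),  # stands in for the columns not used in the analysis
  stringsAsFactors = FALSE
)

# The ten variables described in the list above
Analysis_Columns <- c("Player", "Pos", "Tm", "Age", "PTS",
                      "AST", "TRB", "STL", "BLK", "MP")

# Keep only those columns and confirm their data types
Reduced_Playoff_Data <- Full_Playoff_Data[, Analysis_Columns]
str(Reduced_Playoff_Data)
```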
Using this dataset, I will explore how a player’s impact on the game
(measured via his average points, rebounds, assists, steals and blocks)
and circumstances (age, position and team) relate to his playing time
(measured by average minutes per game). I will refer to the five primary
metrics of productivity (points, rebounds, assists, steals and blocks)
as the “five standard statistics” of basketball.
I have previously analyzed this dataset and found a strong, positive,
simple linear relationship between a player’s average scoring output and
his average playing time. Through linear regression modeling and
bootstrap sampling, I estimated that each additional point a player
averages is associated with about 1.4 more minutes of playing time per
game. More precisely, the 95% bootstrap confidence interval for that
slope runs from about 1.3 to 1.55 minutes per game for each additional
point per game.
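For readers unfamiliar with the bootstrap, the idea can be sketched as a percentile interval for a regression slope. The data below are simulated purely for illustration (the slope of 1.4 and the noise level are made-up inputs, not the playoff data), and this is not necessarily the exact procedure used in the earlier report.

```r
# Percentile bootstrap for a simple regression slope, on simulated data
set.seed(1)
n <- 200
pts <- runif(n, 0, 30)                    # simulated points per game
mp <- 8 + 1.4 * pts + rnorm(n, sd = 4)    # made-up linear relationship plus noise
sim <- data.frame(PTS = pts, MP = mp)

# Resample rows with replacement, refit the model, and keep the PTS slope
boot_slopes <- replicate(2000, {
  idx <- sample(nrow(sim), replace = TRUE)
  coef(lm(MP ~ PTS, data = sim[idx, ]))[["PTS"]]
})

# The middle 95% of the bootstrap slopes forms the percentile interval
ci <- quantile(boot_slopes, c(0.025, 0.975))
ci
```

The interval describes uncertainty about the slope itself; it does not say that 95% of individual players gain that many minutes per extra point.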
While I initially chose to focus on the correlation between points
per game and minutes per game because of the overwhelming evidence of a
significant relationship between the two (available in the references
for further detail), I now plan to expand upon the findings of my
previous report. I will build multivariable models to determine how a
player’s other metrics, as well as his age, position and team, relate
to his average playing time.
Given the sufficient size of my dataset, over 200 players from the 16
teams which participated in the postseason, and the reliability of its
sourcing, I am confident that I will be able to come to empirically
justified conclusions.
Multiple Linear Regression (MLR) Approach
The first modeling strategy I employed to understand what factors
correlated with an NBA player’s average playing time was a multiple
linear regression approach. Setting minutes played as my response and my
dataset’s other six quantitative variables as the predictors, I ran a
linear model which we will refer to as the “Base Regression Model.”
Base Model
The findings of this regression model were clear. Per-game averages in
points, rebounds and steals each had a statistically significant
relationship with how much playing time a player got (p < 0.001 for all
three). In contrast, a player’s age and his per-game averages in blocks
and assists were not significant predictors of his playing time, and the
results were not close for assists (p = 0.97) or blocks (p = 0.98).
All that being said, I did not want to draw a definitive conclusion
solely from this base regression model, so I performed variable
selection and checked the model for other noticeable weaknesses.
library(knitr)  # provides kable()
Base_Regression_Model = lm(MP ~ Age + PTS + AST + TRB + STL + BLK, data = Reduced_Playoff_Data)
summary(Base_Regression_Model)
# Base model: p < 2.2e-16, F = 241.7 on (6, 210) df, adjusted R^2 = 0.8699
# Significant variables per base model: PTS, TRB, STL
# PTS -> coefficient = 0.84028, p < 2e-16
# TRB -> coefficient = 1.20050, p = 1.50e-11
# STL -> coefficient = 5.80856, p = 1.74e-08
kable(summary(Base_Regression_Model)$coef, digits = 4, align = 'c', caption = "Coefficients for the Base Regression Model")
Coefficients for the Base Regression Model
| Term | Estimate | Std. Error | t value | p value |
|---|---|---|---|---|
| (Intercept) | 2.2897 | 2.0721 | 1.1050 | 0.2704 |
| Age | 0.1002 | 0.0752 | 1.3312 | 0.1846 |
| PTS | 0.8403 | 0.0806 | 10.4226 | 0.0000 |
| AST | -0.0107 | 0.2951 | -0.0363 | 0.9711 |
| TRB | 1.2005 | 0.1681 | 7.1400 | 0.0000 |
| STL | 5.8086 | 0.9906 | 5.8636 | 0.0000 |
| BLK | -0.0235 | 0.9071 | -0.0259 | 0.9794 |
Stepwise Regression Models
Via R, I performed three types of variable selection: forward
selection, backward selection and stepwise selection. Forward selection
starts with no predictors in the model and iteratively adds the most
contributive predictor until no addition improves the model any further.
Backward selection takes the opposite approach, beginning with a
complete linear model (our base model in this case) and iteratively
removing the least contributive predictor until no removal improves the
model. Lastly, stepwise selection combines elements of both, allowing
predictors to be added or removed at each step. Note that R’s step()
function judges improvement by AIC (Akaike information criterion) rather
than by p-values. This process was done to simplify our regression model
and reduce the risk of overfitting that comes with a larger number of
predictor variables.
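A single forward step under an AIC criterion can be sketched as follows. The example uses R's built-in mtcars data rather than the playoff data, and the three candidate predictors are chosen only for illustration; it is meant to show the mechanics that step() repeats, not to reproduce the models in this report.

```r
# One forward-selection step by AIC, on the built-in mtcars data
null_fit <- lm(mpg ~ 1, data = mtcars)   # intercept-only starting model

# add1() reports the AIC that would result from adding each candidate term
candidates <- add1(null_fit, scope = ~ wt + hp + qsec)
print(candidates)

# Forward selection adds the term with the lowest AIC, provided that it
# beats the "<none>" row (the current model left unchanged)
best_term <- rownames(candidates)[which.min(candidates$AIC)]
best_term

# Refit with the chosen term; step() would now repeat the comparison
next_fit <- update(null_fit, as.formula(paste(". ~ . +", best_term)))
```

Backward selection works the same way in reverse, with drop1() in place of add1().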
All three processes resulted in the same final model, with PTS, TRB
and STL as the predictors of a player’s minutes per game. All three
models also reported a high adjusted coefficient of determination,
indicating that about 87% of the variation in players’ minutes per game
is explained by their averages in those three categories (points,
rebounds and steals).
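As a quick consistency check, the reported adjusted R^2 and F statistic can be tied together with the standard formulas. The inputs below (217 players, 3 predictors, adjusted R^2 of 0.8706) come from this report; the variable names are my own, and small differences from the reported F of 485.5 are due to rounding.

```r
# Recover R^2 and the overall F statistic from the reported adjusted R^2
n <- 217          # players in the dataset
p <- 3            # predictors in the final model (PTS, TRB, STL)
adj_r2 <- 0.8706  # adjusted R^2 reported for the selected model

# adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1), solved for r2
r2 <- 1 - (1 - adj_r2) * (n - p - 1) / (n - 1)

# Overall F statistic on (p, n - p - 1) degrees of freedom
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))

round(c(R2 = r2, F = f_stat), 4)
```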
# Main takeaway: the p-value, F statistic and adjusted R^2 were identical
# for all three selection procedures, because all three arrive at the
# same final model (MP ~ PTS + TRB + STL).
# All three models also confirm the base model's finding that points,
# rebounds and steals have a significant relationship with a player's
# average minutes per game. They list the same coefficients and p-values
# for those three variables, as well as the same model intercept of
# 4.95295 (p < 2e-16).
# Build a null (intercept-only) model using the same data and response
Null_Model = lm(MP ~ 1, data = Reduced_Playoff_Data)
Full_Model = Base_Regression_Model
# 1) Forward Selection
# Forward Model: (p < 2.2e-16, F = 485.5 (3, 213), adjusted R^2 = 0.8706)
# Significant variables per Forward Selection model:
# PTS, TRB, STL
# PTS -> coefficient = 0.83937, p < 2e-16
# TRB -> coefficient = 1.20058, p = 1.15e-14
# STL -> coefficient = 5.85402, p = 3.49e-10
Forward_Model = step(
object = Null_Model,
scope = list(lower = formula(Null_Model), upper = formula(Full_Model)),
direction = "forward",
trace = 1
)
summary(Forward_Model)
# 2) Backward Selection
# Backward Model: (p < 2.2e-16, F = 485.5 (3, 213), adjusted R^2 = 0.8706)
# Significant variables per Backward Selection model:
# PTS, TRB, STL
# PTS -> coefficient = 0.83937, p < 2e-16
# TRB -> coefficient = 1.20058, p = 1.15e-14
# STL -> coefficient = 5.85402, p = 3.49e-10
Backward_Model = step(
object = Full_Model,
direction = "backward",
trace = 1
)
summary(Backward_Model)
# 3) Stepwise Selection
# Stepwise Model: (p < 2.2e-16, F = 485.5 (3, 213), adjusted R^2 = 0.8706)
# Significant variables per Stepwise Selection model:
# PTS, TRB, STL
# PTS -> coefficient = 0.83937, p < 2e-16
# TRB -> coefficient = 1.20058, p = 1.15e-14
# STL -> coefficient = 5.85402, p = 3.49e-10
Stepwise_Model = step(
object = Null_Model,
scope = list(lower = formula(Null_Model), upper = formula(Full_Model)),
direction = "both",
trace = 1
)
summary(Stepwise_Model)
(kable(summary(Forward_Model)$coef, digits = 5, caption ="Coefficients for all Three Stepwise Regression Models"))
Coefficients for all Three Stepwise Regression Models
| Term | Estimate | Std. Error | t value | p value |
|---|---|---|---|---|
| (Intercept) | 4.95295 | 0.51403 | 9.63559 | 0 |
| PTS | 0.83937 | 0.06584 | 12.74912 | 0 |
| TRB | 1.20058 | 0.14456 | 8.30534 | 0 |
| STL | 5.85402 | 0.88897 | 6.58518 | 0 |
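To give a sense of the uncertainty around these estimates, approximate 95% confidence intervals can be computed from the estimates and standard errors in the table, using the 213 residual degrees of freedom reported for this model. This is only a sketch: the intervals rely on the usual linear model assumptions, and the variable names are my own.

```r
# Approximate 95% confidence intervals from the reported coefficient table
est <- c(PTS = 0.83937, TRB = 1.20058, STL = 5.85402)
se  <- c(PTS = 0.06584, TRB = 0.14456, STL = 0.88897)

# Critical t value for a two-sided 95% interval on 213 degrees of freedom
t_crit <- qt(0.975, df = 213)

ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
round(ci, 3)
```

With the fitted model object available, confint(Forward_Model) would compute the same kind of interval directly.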
Correlation Concerns (Base Model)
To ensure that multicollinearity was not a harmful factor in the base
model, and hence also the stepwise model(s), I investigated using a
correlation matrix and a VIF bar graph.
As per the correlation matrix, there is a strong correlation between
AST and PTS (r ≈ 0.819). However, assists per game were not a
significant predictor of playing time in the base model and were dropped
by all three selection procedures, so this correlation does not affect
the final stepwise model.
Looking at the VIF (variance inflation factor) chart, the only
potential concern appears to be points per game, which sits slightly
above a value of 4. That is still below the commonly used warning
thresholds of 5 or 10, and the only other variables deemed significant
in our models (total rebounds and steals per game) both have VIFs below
3, so we can reasonably conclude that inflated coefficient variance is
not a great concern.
ggpairs(Quantitative_Predictor_Variables, title = "Correlation Matrix for Age and Player Stats") + theme(plot.title = element_text(hjust = 0.5))

barplot(vif(Base_Regression_Model), main = "VIF Values", horiz = FALSE, col = "steelblue")
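For intuition about what a VIF measures, it can be computed by hand: regress each predictor on the other predictors and take 1 / (1 - R^2). The sketch below does this on R's built-in mtcars data (not the playoff data); the helper `manual_vif` is a hypothetical name of my own, not a function from any package.

```r
# Manual variance inflation factors: VIF_j = 1 / (1 - R_j^2),
# where R_j^2 comes from regressing predictor j on the other predictors
manual_vif <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    fit <- lm(reformulate(others, response = p), data = data)
    1 / (1 - summary(fit)$r.squared)
  })
}

vif_values <- manual_vif(mtcars, c("wt", "hp", "disp"))
round(vif_values, 2)
```

Under this definition, a VIF of 4 means that 75% of that predictor's variance is explained by the other predictors.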

Further Residual Analysis (Base Model)
Given that there were no major multicollinearity issues present in our
base model, I then performed residual analysis on both the base model
(seen below) and the stepwise model. Since a more parsimonious model is
preferable for prediction and estimation, I will use the stepwise model
to draw the final takeaways and conclusions.
# Histogram
Base_Model_Residuals = resid(Base_Regression_Model)
Base_Model_Predicted_Values = predict(Base_Regression_Model)
hist(Base_Model_Residuals,
main = "Base Model Errors",
xlab = "Model Residuals")

# Scatterplot
plot(Base_Model_Predicted_Values, Base_Model_Residuals,
main = "Base Model - Residual Analysis Scatterplot",
xlab = "Predicted minutes per game",
ylab = "Model Error (mins per game)")
abline(h=0)

# Residual Mean Calculation
mean(Base_Model_Residuals)
# QQ Plot
qqnorm(Base_Model_Residuals, main = "QQ Residuals", ylab = "Residuals (mins per game)")
qqline(Base_Model_Residuals)

# Scale-Location Plot
plot(Base_Regression_Model, which = 3)

# Residuals-Leverage Plot
plot(Base_Regression_Model, which = 5)

Further Residual Analysis (Stepwise Models)
The findings of our residual analysis raise concerns. The residuals do
not appear normally distributed, as per the histogram and QQ plot, nor
do they appear to have constant variance, as per the scatterplot and
scale-location plot. The residuals-leverage plot also appears to show
anywhere from 2 to 5 observations with both high leverage and large
residuals, which can substantially distort a model’s coefficient
estimates.
# Histogram
Forward_Model_Residuals = resid(Forward_Model)
Forward_Model_Predicted_Values = predict(Forward_Model)
hist(Forward_Model_Residuals,
main = "Stepwise Models Errors",
xlab = "Model Residuals")

# Scatterplot
plot(Forward_Model_Predicted_Values, Forward_Model_Residuals,
main = "Stepwise Models - Residual Analysis Scatterplot",
xlab = "Predicted minutes per game",
ylab = "Model Error (mins per game)")
abline(h=0)

# Residual Mean Calculation
mean(Forward_Model_Residuals)
# QQ Plot
qqnorm(Forward_Model_Residuals, main = "QQ Residuals", ylab = "Residuals (mins per game)")
qqline(Forward_Model_Residuals)

# Scale-Location Plot
plot(Forward_Model, which = 3)

# Residuals-Leverage Plot
plot(Forward_Model, which = 5)
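The visual checks above can be supplemented with numeric ones. As a minimal sketch on R's built-in mtcars data (not the playoff data), the snippet below runs a Shapiro-Wilk test on the residuals and flags observations whose Cook's distance exceeds 4/n, which is a common rule of thumb rather than a strict cutoff.

```r
# Numeric companions to the residual plots, on the built-in mtcars data
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Shapiro-Wilk test: a small p-value suggests non-normal residuals
normality_check <- shapiro.test(resid(fit))
normality_check$p.value

# Cook's distance combines leverage and residual size for each observation
cooks_d <- cooks.distance(fit)
threshold <- 4 / nrow(mtcars)   # rule-of-thumb cutoff (an assumption)
flagged <- names(cooks_d)[cooks_d > threshold]
flagged
```

Flagged observations are candidates for a closer look, not automatic removal.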

Analysis of Variance (ANOVA) Model
To determine whether a player’s position on the court and team (our
qualitative predictor variables) were associated with his playing time,
I ran a two-way ANOVA with an interaction term. The results can be seen
below: neither factor, nor their interaction, has a statistically
significant relationship with a player’s minutes per game (all
p > 0.6).
ANOVA_model = aov(MP ~ Pos * Tm, data = Reduced_Playoff_Data)
summary(ANOVA_model)
ANOVA_Summary_Table = summary(ANOVA_model)[[1]]
# Print as markdown table
kable(ANOVA_Summary_Table, digits = 4, caption = "ANOVA Table: MP ~ Pos * Tm")
ANOVA Table: MP ~ Pos * Tm
| Term | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| Pos | 4 | 570.5512 | 142.6378 | 0.6560 | 0.6236 |
| Tm | 15 | 628.6916 | 41.9128 | 0.1928 | 0.9996 |
| Pos:Tm | 60 | 6748.6587 | 112.4776 | 0.5173 | 0.9977 |
| Residuals | 137 | 29787.6867 | 217.4284 | NA | NA |
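The F values and p-values in this table follow directly from the mean squares: each F is the term's mean square divided by the residual mean square, and the p-value is the upper tail of the corresponding F distribution. The check below uses only numbers from the table; the variable names are my own.

```r
# Reproduce the F values and p-values from the reported mean squares
mean_sq  <- c(Pos = 142.6378, Tm = 41.9128, `Pos:Tm` = 112.4776)
df_term  <- c(Pos = 4, Tm = 15, `Pos:Tm` = 60)
ms_resid <- 217.4284   # residual mean square
df_resid <- 137        # residual degrees of freedom

# F = term mean square / residual mean square
f_value <- mean_sq / ms_resid

# p-value = probability of an F at least this large under the null hypothesis
p_value <- pf(f_value, df_term, df_resid, lower.tail = FALSE)

round(cbind(F = f_value, p = p_value), 4)
```

All three F values are below 1, meaning the variation between groups is smaller than the variation within groups.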
Checking ANOVA Model Assumptions
Although the ANOVA results were unambiguous, I still wanted to review
the data to ensure that no fundamental assumptions of ANOVA modeling
were violated. The results of that review were mixed.
I performed a series of Shapiro-Wilk tests to determine whether players’
minutes per game were normally distributed within each group, first by
team and then by position. For five of the sixteen teams (the Nets,
Cavaliers, Nuggets, Lakers and Sixers), playing time was not normally
distributed. When grouped by position, playing time also departed
clearly from normality.
As for the ANOVA condition of equal variances, Bartlett’s test gave no
evidence that the variance of minutes per game differed across either
teams or positions.
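# MP, Tm and Pos are columns of Reduced_Playoff_Data, so evaluate inside with()
with(Reduced_Playoff_Data, by(MP, Tm, shapiro.test))
with(Reduced_Playoff_Data, by(MP, Pos, shapiro.test))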
bartlett.test(MP ~ Tm, data = Reduced_Playoff_Data)
bartlett.test(MP ~ Pos, data = Reduced_Playoff_Data)
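When there are many groups, it can be easier to collect the per-group Shapiro-Wilk p-values into one vector than to read each test's printout. The sketch below shows one way to do that on R's built-in iris data (not the playoff data), alongside a Bartlett test of equal variances; the 0.05 cutoff is the conventional choice and an assumption here.

```r
# Per-group Shapiro-Wilk p-values collected into a named vector (iris data)
shapiro_p <- tapply(iris$Sepal.Length, iris$Species,
                    function(x) shapiro.test(x)$p.value)
round(shapiro_p, 4)

# Groups whose p-value falls below 0.05 show evidence against normality
non_normal_groups <- names(shapiro_p)[shapiro_p < 0.05]
non_normal_groups

# Bartlett's test: a large p-value means no evidence of unequal variances,
# which is weaker than positive evidence that the variances are equal
variance_check <- bartlett.test(Sepal.Length ~ Species, data = iris)
variance_check$p.value
```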
Kruskal-Wallis H Test
Given the mixed results of my ANOVA assumption check, I chose to
perform a nonparametric Kruskal-Wallis H test, which does not assume
normality, to check my ANOVA model’s finding that a player’s team and
position do not play a significant role in his playing time. The results
of the Kruskal-Wallis H test agreed with the ANOVA model’s, showing
neither team (p = 0.9973) nor position (p = 0.461) to be a significant
factor in a player’s minutes per game.
kruskal.test(MP ~ Tm, data = Reduced_Playoff_Data)
Kruskal-Wallis rank sum test
data: MP by Tm
Kruskal-Wallis chi-squared = 4.1147, df = 15, p-value = 0.9973
kruskal.test(MP ~ Pos, data = Reduced_Playoff_Data)
Kruskal-Wallis rank sum test
data: MP by Pos
Kruskal-Wallis chi-squared = 3.6124, df = 4, p-value = 0.461
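The p-values in this output come from comparing each test statistic to a chi-squared distribution with (number of groups - 1) degrees of freedom: 15 for the 16 teams and 4 for the 5 positions. The check below recomputes them from the reported statistics; the variable names are my own.

```r
# Recompute the Kruskal-Wallis p-values from the reported statistics
p_team <- pchisq(4.1147, df = 15, lower.tail = FALSE)  # 16 teams -> df = 15
p_pos  <- pchisq(3.6124, df = 4,  lower.tail = FALSE)  # 5 positions -> df = 4

round(c(team = p_team, position = p_pos), 4)
```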
Conclusion
Ultimately, this analysis found strong evidence that scoring,
rebounding and steals are related to how much playing time an NBA
player receives in the postseason. In contrast, there is practically no
statistical evidence that a player’s assists or blocks averages, age,
position on the court or team have such a relationship with minutes
played. While there was some slight variation between the base model,
pre-transformation stepwise model and post-transformation stepwise
model, it is reasonable to estimate the following, holding the other
predictors fixed. Each additional point per game is associated with
about 0.84 more minutes of playing time per game, each additional
rebound per game with about 1.2 more minutes, and each additional steal
per game with about 5.8 more minutes.
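To illustrate how these estimates combine, the sketch below plugs the stepwise model's reported coefficients into its fitted equation for a hypothetical player averaging 10 points, 5 rebounds and 1 steal per game. The function `predict_minutes` and the example player are my own illustrations, and given the residual concerns noted earlier, any such prediction should be read as a rough estimate.

```r
# Fitted equation of the stepwise model, using its reported coefficients
stepwise_coef <- c(Intercept = 4.95295, PTS = 0.83937,
                   TRB = 1.20058, STL = 5.85402)

predict_minutes <- function(pts, trb, stl, coefs = stepwise_coef) {
  coefs[["Intercept"]] + coefs[["PTS"]] * pts +
    coefs[["TRB"]] * trb + coefs[["STL"]] * stl
}

# Hypothetical player: 10 points, 5 rebounds, 1 steal per game
example_minutes <- predict_minutes(pts = 10, trb = 5, stl = 1)
example_minutes   # about 25.2 minutes per game

# One more point per game, other inputs fixed, adds the PTS coefficient
extra_point <- predict_minutes(11, 5, 1) - predict_minutes(10, 5, 1)
extra_point
```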
