Introduction and Background
The dataset I am using for my analysis is an extensive collection of
per-game averages for every NBA player who appeared in the 2022-2023
playoffs. It was published on the open-source data science platform
Kaggle (link can be found in references section below). The data’s
accuracy can be confirmed against its original source,
Basketball-reference.com, a well-established database for enthusiasts of
the sport. All statistical analysis will be conducted in the
programming language R.
As for the dataset’s contents, it covers all 217 players who played
some amount of time in the 2022-2023 postseason. It contains 30
variables, but only 10 played a role in my analysis, so I created a
reduced dataset containing only the information pertinent to my
research question. The names, meanings and data types of the variables
used in my analysis are listed below:
“Player” = player’s name = character
“Pos” = player’s position on the court or in starting lineup =
character
“Tm” = team player plays for = character
“Age” = player’s age = integer
“PTS” = player’s points per game average = double
“AST” = player’s assists per game average = double
“TRB” = player’s rebounds per game average = double
“STL” = player’s steals per game average = double
“BLK” = player’s blocks per game average = double
“MP” = player’s minutes per game average = double
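The reduction from 30 variables to 10 can be sketched in base R as follows. This is a minimal illustration rather than the code actually used: the data frame `Full_Playoff_Data`, its toy values and the two extra columns are assumptions made up for the example, and only the ten column names come from the list above.

```r
# Hypothetical stand-in for the full dataset (toy values only).
Full_Playoff_Data <- data.frame(
  Player = c("Player A", "Player B", "Player C"),
  Pos    = c("PG", "C", "SF"),
  Tm     = c("AAA", "BBB", "CCC"),
  Age    = c(24L, 31L, 27L),
  PTS    = c(18.2, 9.5, 12.1),
  AST    = c(6.1, 1.2, 2.4),
  TRB    = c(3.9, 8.8, 5.0),
  STL    = c(1.1, 0.4, 0.9),
  BLK    = c(0.2, 1.5, 0.6),
  MP     = c(34.0, 22.5, 28.3),
  G      = c(10L, 6L, 12L),   # assumed extra column, not used in the analysis
  FG     = c(6.5, 3.9, 4.4),  # assumed extra column, not used in the analysis
  stringsAsFactors = FALSE
)

# The ten variables used in the analysis.
Analysis_Columns <- c("Player", "Pos", "Tm", "Age", "PTS",
                      "AST", "TRB", "STL", "BLK", "MP")

# Keep only those columns.
Reduced_Example <- Full_Playoff_Data[, Analysis_Columns]

# Inspect the names and data types of the reduced dataset.
str(Reduced_Example)
```

Calling `str()` on the result is a quick way to confirm that the column types match the character, integer and double types listed above.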
Using this dataset, I will explore how a player’s impact on the game
(measured via their average points, rebounds, assists, steals and
blocks) and circumstances (age, position and team) relate to their
playing time (measured by average minutes per game). I will refer to
the five primary metrics of productivity (points, rebounds, assists,
steals and blocks) as the “five standard statistics” of basketball.
Before conducting any thorough analysis, I would assume that there is
a positive correlation between a player’s playing time and their five
standard statistics, and a negative correlation between a player’s
playing time and their age. As for position, it would be my intuition
that traditionally smaller players, such as point guards and shooting
guards, have a higher average amount of playing time than their larger
positional counterparts like power forwards and centers. As for the
team a player is on, I am unsure exactly what to expect regarding any
significant difference in playing time distribution from team to team,
but I am sure that differences do exist considering the multitude of
coaching methodologies in use.
Given the sufficient size of my dataset (over 200 players from the 16
teams that participated in the postseason) and the reliability of its
sourcing, I am confident that I will be able to reach empirically
justified conclusions.
Initial Visual Analysis
Below I created a pairwise scatter plot to concisely examine the
relationship that each of the quantitative variables relevant to my
analysis (age and the five standard basketball stats) has with a
player’s playing time.
Unsurprisingly, there was a very strong positive correlation
between a player’s average points per game and their average minutes per
game (r ≈ 0.895). Averages in assists, steals, rebounds and blocks also
each shared a positive correlation with playing time, but not to as
strong a degree (r between about 0.54 and 0.77). Given that most
basketball fans and analysts would consider scoring to be the most
important skill for any player to possess, it makes sense that a
player’s scoring average is more strongly tied to his playing time than
any other metric.
More notable in the scatter plot was the exceptionally weak
relationship between a player’s age and his playing time (r ≈ 0.115).
Conventional wisdom among many fans holds that players very early
in their careers (low age value) and at the tail end of their careers
(high age value) typically see far less playing time than
athletes in their physical prime; however, this data does not provide
sufficient evidence for that trend, at least as measured by linear
correlation.
library(GGally) # provides ggpairs()
ggpairs(Quantitative_Variables, title = "Pairwise Scatterplot of Quantitative Variables")
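One caveat when reading the age result: Pearson's r only measures linear association, so a rise-and-fall pattern like the one conventional wisdom describes can produce an r near zero even when age matters. The sketch below uses made-up numbers (not the playoff data) purely to illustrate this; the ages and the minutes formula are assumptions.

```r
# Made-up illustration: minutes peak at age 30 and fall off symmetrically.
age <- 20:40
minutes <- 36 - 0.3 * (age - 30)^2

# Linear correlation between age and minutes.
r_linear <- cor(age, minutes)

# Correlation between minutes and squared distance from the peak age.
r_curved <- cor((age - 30)^2, minutes)

print(round(c(linear = r_linear, curved = r_curved), 3))
```

In this toy case the linear correlation is essentially zero even though minutes are fully determined by age. Whether such a pattern exists in the playoff data is not something the correlation coefficient alone can settle; plotting MP against Age or adding a squared age term would be one way to check.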

Simple Linear Regression (SLR) and Residual Analysis
After using the aforementioned scatter plot to determine that points
per game had a very strong correlation with minutes per game, I
chose to explore the relationship between those two statistics further
via a simple linear regression model.
The regression model affirmed the conclusion drawn from the scatter
plot: there is a statistically significant, strong and positive linear
relationship between scoring averages and playing time averages. For
every additional point that a player averages, the model predicts
approximately 1.4 additional minutes per game. The model has an
intercept of about 7.32, meaning that a player who averages zero points
is predicted to play around 7 minutes per game before his scoring
average is taken into account.
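To make the fitted line concrete, the sketch below plugs a few scoring averages into the rounded coefficients reported above (intercept 7.32, slope 1.4). The helper name `predict_minutes` and the example scoring averages are assumptions for illustration; with the fitted model object one would normally call `predict()` instead.

```r
# Rounded coefficients as reported in the text.
intercept <- 7.32
slope <- 1.4

# Hypothetical helper: predicted minutes per game for a given scoring average.
predict_minutes <- function(points_per_game) {
  intercept + slope * points_per_game
}

# Example scoring averages (made up for illustration).
example_points <- c(0, 10, 20)
print(data.frame(PTS = example_points,
                 Predicted_MP = predict_minutes(example_points)))
```

Under these rounded coefficients, a player averaging 20 points would be predicted to play about 35.3 minutes per game (7.32 + 1.4 × 20).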
To check the reliability of my linear regression model, I then
performed residual analysis using a histogram, summary calculations and
a residual scatter plot. While the distribution of the model’s residuals
is approximately normal and the mean of the residuals is extremely close
to zero, homoscedasticity of errors is not fully met. When the
model predicted minutes per game of about 15 to 35, all the errors were
overestimates, but when the model predicted minutes per game of about 40
to 55, all the errors were underestimates. Although the residuals do not
show constant variance, I still believe there is strong utility in this
model given that the other tenets of residual analysis appear to have
been sufficiently met.
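# Fit a simple linear regression of minutes per game on points per game
Regression_Line_PTS_MP = lm(MP ~ PTS, data = Reduced_Playoff_Data)
summary(Regression_Line_PTS_MP)

# Extract residuals and predicted values for the diagnostics
Model_Residuals = resid(Regression_Line_PTS_MP)
Model_Predicted_Values = predict(Regression_Line_PTS_MP)

# Histogram: are the residuals approximately normal?
hist(Model_Residuals,
     main = "Residual Analysis Histogram",
     xlab = "Model Residuals")

# Mean of the residuals
mean(Model_Residuals)

# Residuals vs. predicted values: is the spread constant around zero?
plot(Model_Predicted_Values, Model_Residuals,
     main = "Residual Analysis Scatterplot",
     xlab = "Predicted minutes per game",
     ylab = "Model Error (mins per game)")
abline(h = 0)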
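Because a least-squares fit with an intercept forces the residual mean to zero, a more informative check is whether the residuals stay centred on zero, with similar spread, across the range of predictions. One rough way to quantify what the scatter plot shows is to summarise the residuals within bins of predicted values. The sketch below does this on simulated data, not the playoff data; the simulated curve, the bin count and all object names are assumptions.

```r
set.seed(1) # reproducible simulated data

# Simulated stand-in: minutes rise with points, but along a curve.
sim_points <- runif(200, min = 0, max = 30)
sim_minutes <- 5 + 6 * sqrt(sim_points) + rnorm(200, sd = 2)
sim_data <- data.frame(PTS = sim_points, MP = sim_minutes)

# Fit the same kind of straight-line model.
sim_fit <- lm(MP ~ PTS, data = sim_data)
sim_resid <- resid(sim_fit)
sim_fitted <- fitted(sim_fit)

# Split the predictions into four equal-width bins.
fitted_bin <- cut(sim_fitted, breaks = 4)

# Mean and standard deviation of the residuals within each bin.
bin_means <- tapply(sim_resid, fitted_bin, mean)
bin_sds <- tapply(sim_resid, fitted_bin, sd)
print(round(cbind(mean_resid = bin_means, sd_resid = bin_sds), 2))
```

Bin means that sit consistently above or below zero point to a systematic misfit of the straight line, while bin standard deviations that differ widely point to non-constant variance.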

Bootstrap Sampling for Regression Coefficients
Using my regression model of the relationship between points
per game and minutes per game, I then created bootstrap confidence
intervals for both the intercept of the function and its slope
(the points per game coefficient). These intervals were formed at
a confidence level of 95% and based on 1,000 bootstrap resamples. My
most recent run of the bootstrap algorithm produced the following
values:
Intercept ~ [6.21, 8.46]
Points Per Game ~ [1.31, 1.55]
The linear model’s coefficient estimates (7.32 and 1.4 respectively)
both fall safely within those intervals.
set.seed(123) # for reproducibility
B = 1000 # number of bootstrap resamples
n = nrow(Reduced_Playoff_Data)
vec_id = seq_len(n)

# storage
Boot.beta0 = numeric(B) # intercept
Boot.beta1 = numeric(B) # slope

for (i in seq_len(B)) {
  boot_id = sample(vec_id, n, replace = TRUE) # resample rows
  boot_reg = lm(MP ~ PTS, data = Reduced_Playoff_Data[boot_id, , drop = FALSE])
  co = coef(boot_reg)
  Boot.beta0[i] = co[1] # intercept
  Boot.beta1[i] = co[2] # slope for PTS
}

# 95% percentile confidence intervals
ci_beta0 = quantile(Boot.beta0, c(0.025, 0.975))
ci_beta1 = quantile(Boot.beta1, c(0.025, 0.975))
ci_beta0
ci_beta1

# Bootstrap distributions of the two coefficients
hist(Boot.beta0, main = "Intercept Model Coefficient",
     xlab = "Intercept Value", ylab = "Bootstrap Sample Frequency")

hist(Boot.beta1, main = "Points Per Game Model Coefficient",
     xlab = "Points Per Game Coefficient Value", ylab = "Bootstrap Sample Frequency")
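For comparison, the classical t-based intervals from `confint()` can be set beside percentile bootstrap intervals. The sketch below does so on simulated data rather than the playoff data; the simulated values and all object names are assumptions, and it is meant only to show the mechanics of the comparison.

```r
set.seed(42) # reproducible simulated data

# Simulated stand-in for points and minutes per game.
sim_data <- data.frame(PTS = runif(150, min = 0, max = 30))
sim_data$MP <- 7 + 1.4 * sim_data$PTS + rnorm(150, sd = 4)

sim_fit <- lm(MP ~ PTS, data = sim_data)

# Classical 95% intervals, which assume constant-variance normal errors.
classical_ci <- confint(sim_fit, level = 0.95)

# Percentile bootstrap intervals from resampling rows with replacement.
B <- 1000
boot_coefs <- t(replicate(B, {
  rows <- sample(nrow(sim_data), replace = TRUE)
  coef(lm(MP ~ PTS, data = sim_data[rows, , drop = FALSE]))
}))
bootstrap_ci <- t(apply(boot_coefs, 2, quantile, probs = c(0.025, 0.975)))

print(round(classical_ci, 2))
print(round(bootstrap_ci, 2))
```

When the two sets of intervals are close, the constant-variance assumption is probably doing little harm; a noticeable gap would be one reason to lean on the bootstrap version.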

Conclusions
Ultimately, we can conclude that there is a strong,
positive and linear relationship between an NBA player’s scoring average
and his average playing time. Per the simple regression model, a
player’s minutes per game average increases by about 1.4 for every
additional point he averages. The bootstrap procedure produced 95%
confidence intervals that comfortably contain the model’s coefficient
estimates. If forced to pick one set of results for describing how
playing time relates to scoring output, the bootstrap intervals may be
deemed more reliable, because resampling whole rows does not rely on
the constant-variance assumption that the regression model’s residuals
appear to violate.
