1 Introduction and Background

The dataset I am using for my analysis is an extensive collection of performance metrics for every NBA player who appeared in the 2022-2023 playoffs. It was published on the data science platform Kaggle (the link can be found in the references section below). The data’s accuracy can be confirmed against its original source, Basketball-reference.com, a well-established database for enthusiasts of the sport. All statistical analysis is conducted in the programming language R.

As for the dataset’s contents, it covers all 217 players who played some amount of time in the 2022-2023 postseason. It contains 30 variables, but only 10 play a role in my analysis, so I created a reduced dataset containing only the information pertinent to my research question. The names and data types of the variables used in my analysis are listed below:

  • “Player” = player’s name = character

  • “Pos” = player’s position on the court or in starting lineup = character

  • “Tm” = team player plays for = character

  • “Age” = player’s age = integer

  • “PTS” = player’s points per game average = double

  • “AST” = player’s assists per game average = double

  • “TRB” = player’s rebounds per game average = double

  • “STL” = player’s steals per game average = double

  • “BLK” = player’s blocks per game average = double

  • “MP” = player’s minutes per game average = double
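The reduction to the ten variables listed above can be sketched in base R as follows. This is only an illustration: `Full_Playoff_Data` is a hypothetical name for the full Kaggle table, and the three rows (including the extra columns `G` and `FG`) are made-up toy values, not records from the dataset.

```r
# Toy stand-in for the full table (assumption: made-up values, hypothetical name).
Full_Playoff_Data <- data.frame(
  Player = c("Player A", "Player B", "Player C"),
  Pos    = c("PG", "C", "SF"),
  Tm     = c("AAA", "BBB", "CCC"),
  Age    = c(27L, 33L, 24L),
  G      = c(20L, 23L, 20L),     # extra column, not used in the analysis
  MP     = c(39.9, 28.5, 35.1),
  FG     = c(9.1, 4.0, 6.7),     # extra column, not used in the analysis
  TRB    = c(5.7, 8.1, 6.0),
  AST    = c(7.1, 1.2, 3.4),
  STL    = c(1.0, 0.6, 1.1),
  BLK    = c(0.3, 1.0, 0.4),
  PTS    = c(26.1, 9.8, 18.3),
  stringsAsFactors = FALSE
)

# The ten variables used in the analysis, in the order listed in the text
analysis_vars <- c("Player", "Pos", "Tm", "Age",
                   "PTS", "AST", "TRB", "STL", "BLK", "MP")

# Keep only those columns
Reduced_Playoff_Data <- Full_Playoff_Data[, analysis_vars]

# Inspect names and storage types (character, integer, double)
str(Reduced_Playoff_Data)
```

Selecting columns by name rather than by position keeps the step robust if the column order in the source file differs from what is assumed here.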

Using this dataset, I will explore how a player’s impact on the game (measured via his average points, rebounds, assists, steals and blocks) and his circumstances (age, position and team) relate to his playing time (measured by average minutes per game). I will refer to the five primary metrics of productivity (points, rebounds, assists, steals and blocks) as the “five standard statistics” of basketball.

Before conducting any thorough analysis, I would assume that there is a positive correlation between a player’s playing time and his five standard statistics, and a negative correlation between a player’s playing time and his age. As for position, my intuition is that traditionally smaller players, such as point guards and shooting guards, have a higher average amount of playing time than their larger positional counterparts, such as power forwards and centers. When it comes to the importance of what team a player is on, I am unsure what to expect regarding any significant difference in playing time distribution from team to team, but I expect that differences do exist, considering the multitude of coaching methodologies in use.

Given the sufficient size of my dataset (over 200 players from the 16 teams that participated in the postseason) and the reliability of its sourcing, I am confident that I will be able to come to empirically justified conclusions.

2 Initial Visual Analysis

Below, I created a pairwise scatter plot in order to concisely examine the relationship that each of the quantitative variables relevant to my analysis (age and the five standard basketball statistics) has with a player’s playing time.

Unsurprisingly, there was a very strong, positive correlation between a player’s average points per game and his average minutes per game (r ≈ 0.895). Averages in assists, steals, rebounds and blocks also each shared a positive correlation with playing time, but not to as strong a degree (r between roughly 0.54 and 0.77). Given that most basketball fans and analysts would consider scoring to be the most important skill for any player to possess, it makes sense that a player’s scoring average is more strongly associated with his playing time than any other metric.

What stood out in the scatter plot was the exceptionally weak relationship between a player’s age and his playing time (r ≈ 0.115). Conventional wisdom among many fans is that players very early in their careers (low age value) and at the tail end of their careers (high age value) typically see far less playing time than athletes in their physical prime. However, this data does not provide sufficient evidence for that trend.

library(GGally)  # provides ggpairs()
ggpairs(Quantitative_Variables, title = "Pairwise Scatterplot of Quantitative Variables")
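The correlation coefficients quoted in this section can also be read off numerically with base R’s `cor()`. The sketch below is only an illustration on simulated data: the data frame `toy` and its values are assumptions made up for the example, so the numbers it prints will not match the figures reported for the playoff data.

```r
# Simulated stand-in data (assumption: not the playoff dataset)
set.seed(1)
n   <- 50
PTS <- runif(n, min = 0, max = 30)
toy <- data.frame(
  Age = sample(20:38, n, replace = TRUE),
  PTS = PTS,
  MP  = 7 + 1.4 * PTS + rnorm(n, sd = 4)
)

# Pearson correlation of every numeric column with minutes per game
cor_with_MP <- cor(toy)[, "MP"]
round(cor_with_MP, 3)

# Significance test for a single pair of variables
cor.test(toy$PTS, toy$MP)
```

With the real data, the same call on the data frame of quantitative variables would list each variable’s correlation with `MP` in one vector, which is easier to quote than values read from the plot panels.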

3 Simple Linear Regression (SLR) and Residual Analysis

After using the aforementioned scatter plot to determine that points per game had a very strong correlation with minutes per game, I chose to explore the relationship between those two statistics further via a simple linear regression model.

The regression model affirmed the conclusion drawn from the scatter plot: there is a statistically significant, strong and positive linear relationship between scoring averages and playing time averages. For every additional point that a player averages, the model predicts approximately 1.4 additional minutes per game. The model’s intercept is about 7.32, meaning that a player who averages zero points is predicted to play around 7 minutes per game; in other words, a player “starts” at roughly 7 minutes per game before his scoring average is taken into account.
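To make the interpretation of the coefficients concrete, the short sketch below evaluates the fitted line at a few scoring averages. It uses the rounded values reported in the text (7.32 and 1.4) rather than the exact model output, and the helper `predict_minutes()` is a hypothetical name introduced only for this example.

```r
# Rounded coefficients as reported in the text (assumption: exact values differ slightly)
intercept <- 7.32
slope     <- 1.4

# Hypothetical helper: predicted minutes per game for a given points per game average
predict_minutes <- function(pts) intercept + slope * pts

predict_minutes(c(0, 10, 20))
# 7.32 21.32 35.32
```

Each step of 10 points per game adds 14 predicted minutes, which is the slope of 1.4 applied ten times.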

To check the reliability of my linear regression model, I then performed residual analysis using a histogram, summary calculations and a residual scatter plot. The distribution of the model’s residuals is approximately normal and the mean of the residuals is extremely close to zero, but homoscedasticity of errors is not fully met. When the model predicted about 15 to 35 minutes per game, all the errors were overestimates, whereas when it predicted about 40 to 55 minutes per game, all the errors were underestimates. A systematic pattern of this kind may also indicate that a straight line does not fully capture the relationship. Although the residuals do not show constant variance, I still believe there is strong utility in this model, given that the other tenets of residual analysis appear to have been sufficiently met.

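# Fit the simple linear regression of minutes per game on points per game
Regression_Line_PTS_MP = lm(MP ~ PTS, data = Reduced_Playoff_Data)
summary(Regression_Line_PTS_MP)

# Extract residuals and predicted values for the diagnostic checks
Model_Residuals = resid(Regression_Line_PTS_MP)
Model_Predicted_Values = predict(Regression_Line_PTS_MP)

# Histogram to check whether the residuals are approximately normal
hist(Model_Residuals,
     main = "Residual Analysis Histogram",
     xlab = "Model Residuals")

# The mean of the residuals should be very close to zero
mean(Model_Residuals)

# Residuals against predicted values to check for constant variance
plot(Model_Predicted_Values, Model_Residuals,
     main = "Residual Analysis Scatterplot",
     xlab = "Predicted minutes per game",
     ylab = "Model Error (mins per game)")
abline(h = 0)  # reference line at zero error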
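The visual checks above can be complemented with two simple numerical checks in base R. The sketch below is an illustration on simulated data (the data frame `toy` and its noise structure are assumptions, not the playoff data): an auxiliary regression of squared residuals on fitted values, in the spirit of Breusch-Pagan-type checks for non-constant variance, and a comparison against a model with a quadratic term as a rough check for curvature.

```r
# Simulated data whose noise grows with PTS (assumption: made up for illustration)
set.seed(42)
n   <- 200
PTS <- runif(n, min = 0, max = 30)
toy <- data.frame(PTS = PTS,
                  MP  = 7 + 1.4 * PTS + rnorm(n, sd = 1 + 0.2 * PTS))

fit         <- lm(MP ~ PTS, data = toy)
res         <- resid(fit)
fitted_vals <- fitted(fit)

# Check 1: do squared residuals trend with the fitted values?
# A clearly non-zero slope suggests non-constant variance.
aux <- lm(I(res^2) ~ fitted_vals)
summary(aux)$coefficients

# Check 2: does adding a quadratic term improve the fit?
# A small p-value suggests the straight line misses some curvature.
fit_quad   <- lm(MP ~ PTS + I(PTS^2), data = toy)
curve_test <- anova(fit, fit_quad)
curve_test
```

Applied to the real model, the same two checks would help separate a variance problem from a problem with the linear form itself, since the two call for different remedies.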

4 Bootstrap Sampling for Regression Coefficients

Using my regression model of the relationship between points per game and minutes per game, I then created bootstrap confidence intervals for both the intercept of the function and its slope (the points per game coefficient). These percentile intervals were formed at a confidence level of 95% and are based on 1,000 bootstrap resamples. My most recent run of the bootstrap algorithm produced the following values:

Intercept ~ [6.21, 8.46]

Points Per Game ~ [1.31, 1.55]

Note that the coefficient estimates from the linear model (7.32 and 1.4, respectively) both fall safely within these intervals.

set.seed(123)                 # for reproducibility
B = 1000                     # number of bootstrap resamples

n = nrow(Reduced_Playoff_Data)
vec_id = seq_len(n)

# storage
Boot.beta0 = numeric(B) # intercept
Boot.beta1 = numeric(B) # slope

for (i in seq_len(B)) {
  boot_id = sample(vec_id, n, replace = TRUE) # resample rows
  boot_reg = lm(MP ~ PTS, data = Reduced_Playoff_Data[boot_id, , drop = FALSE]) 
  co = coef(boot_reg)
  Boot.beta0[i] = co[1]    # intercept
  Boot.beta1[i] = co[2]    # slope for PTS
}

# 95% percentile confidence intervals
ci_beta0 = quantile(Boot.beta0, c(0.025, 0.975))
ci_beta1 = quantile(Boot.beta1, c(0.025, 0.975))
ci_beta0
ci_beta1

hist(Boot.beta0, main = "Intercept Model Coefficient", xlab = "Intercept Value", ylab = "Bootstrap Sample Frequency")

hist(Boot.beta1, main = "Points Per Game Model Coefficient", xlab = "Points Per Game Value", ylab = "Bootstrap Sample Frequency")
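For comparison, the percentile bootstrap intervals can be set beside the classical t-based intervals that `confint()` returns for a fitted `lm` object. The sketch below runs both on simulated data (the data frame `toy` is an assumption made up for the example), so its numbers will not match the intervals reported above.

```r
# Simulated stand-in data (assumption: not the playoff dataset)
set.seed(123)
n   <- 200
PTS <- runif(n, min = 0, max = 30)
toy <- data.frame(PTS = PTS, MP = 7 + 1.4 * PTS + rnorm(n, sd = 4))

fit <- lm(MP ~ PTS, data = toy)

# Classical 95% intervals based on the t distribution
classical_ci <- confint(fit, level = 0.95)

# Percentile bootstrap: resample rows with replacement and refit B times
B <- 1000
boot_coefs <- t(replicate(B, {
  idx <- sample(n, n, replace = TRUE)
  coef(lm(MP ~ PTS, data = toy[idx, ]))
}))

# 2.5% and 97.5% quantiles of each coefficient's bootstrap distribution
boot_ci <- t(apply(boot_coefs, 2, quantile, probs = c(0.025, 0.975)))

classical_ci
boot_ci
```

When the two sets of intervals are close, the classical assumptions are probably not doing much harm; a noticeable gap would be a reason to look more closely at the residual diagnostics.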

5 Conclusions

Ultimately, I conclude that there is a strong, positive and linear relationship between an NBA player’s scoring average and his average playing time. Per the simple regression model, a player’s minutes per game average increases by about 1.4 for every additional point he averages. The bootstrap procedure produced 95% confidence intervals that comfortably contain the model’s coefficient estimates. If forced to pick one approach for quantifying the uncertainty in this relationship, the bootstrap intervals may be deemed more reliable, because resampling whole observations does not rely on the constant-variance assumption that the regression model’s residuals appear to violate.

