Introduction and
Background
The dataset I am using for my analysis is an extensive collection of
metrics for every NBA player’s game-by-game performance across the
2022-2023 playoff season. It was published on the open-source data
science platform Kaggle (link can be found in references section below).
The original source for the data’s accuracy can be confirmed at
Basketball-reference.com, a well-established database for enthusiasts of
the sport. All statistical analysis will be conducted via the
programming language R.
As for the dataset’s contents, it is composed of all 217 players who
played some amount of time in the 2022-2023 postseason. It contained 30
variables, however only 10 played a role in my analysis, so I created a
reduced dataset which only contained the information pertinent to my
research question. An explanation and breakdown of the names and data
types for the variables used in my analysis is below:
“Player” = player’s name = character
“Pos” = player’s position on the court or in starting lineup =
character
“Tm” = team player plays for = character
“Age” = player’s age = integer
“PTS” = player’s points per game average = double
“AST” = player’s assists per game average = double
“TRB” = player’s rebounds per game average = double
“STL” = player’s steals per game average = double
“BLK” = player’s blocks per game average = double
“MP” = player’s minutes per game average = double
I chose to confine my analysis to operating on these variables, and
not to include the remaining ones in the original dataset for a couple
of reasons.
First, when making linear models to predict or estimate
relationships, it is ideal to include the least amount of variables as
possible as long as no significant amount of accuracy is lost by
variable reduction. This is to prevent a phenomenon known as
“overfitting” in which the models we create work almost perfectly for
our given dataset, but not as well when applied to samples outside of
this particular dataset.
Also, I believed from a practical persepective, many of the
variables in this dataset served a redundant purpose. For example;
measurements such as total field goals attempted and made, two-point
field goals attempted and made, three-point field goals attempted and
made, as well as the respective shooting percentages, are all
representative of a player’s scoring output. I find it more logical to
simply retain the variable PTS for the analysis as that serves as a
summary of a player’s scoring output.
Previously, I have used the data at hand to perform both simple and
multiple linear regression. Via my simple linear regression model, I
found there to be a strong, positive and significant linear relationship
between a player’s average scoring output and his average playing time.
Through conducting multiple linear regression, I discovered that there
is strong evidence to suggest a positive and linear relationship between
a player’s given playing time and their per-game averages in points,
rebounds and steals, yet no such significant relationship exists between
his minutes per game and his averages in blocks or assists, nor his age,
position on the court or team he plays for. Both of my previous analyses
can be found under the References section below.
Today, I am going to expand upon my multiple linear regression model
which found a significant relationship between an NBA player’s playing
time and his points per game (ppg), rebounds per game (rpg) and steals
per game (spg). I will be using bootstrap sampling to create confidence
intervals for the coefficients of each of those three metrics. A benefit
of doing so is that bootstrap sampling is a non-parametric technique,
meaning we do not need to make as strong of assumptions about the
underlying population (in this case professional basketball players) in
order to ensure the validity of results.
Multiple Regression
Model
A summary of my final multiple regression model can be seen below.
After previously performing numerous rounds of stepwise regression,
residual analysis and a box-cox transformation, I came to the following
conclusion. Assuming all other elements of a player’s performance remain
unchanged; every increase of one point per game will lead to an extra
0.84 minutes per game, every extra rebound averaged per game will cause
an extra 1.2 minutes per game, and every additional steal a player
averages per game correlates with an extra 5.8 minutes of playing time
per game.
library(MASS)
Reduced_Playoff_Data2 = Reduced_Playoff_Data[Reduced_Playoff_Data$MP > 0, ]
Forward_Model2 = lm(MP ~ PTS + TRB + STL, data = Reduced_Playoff_Data2)
lambda = 1.05
# Modify MP variable in dataset with lamba transformation
Reduced_Playoff_Data2$MP_bc = if (lambda == 0) {
log(Reduced_Playoff_Data2$MP)
} else {
(Reduced_Playoff_Data2$MP^lambda - 1) / lambda
}
BoxCox_Model = lm(MP ~ PTS + TRB + STL, data = Reduced_Playoff_Data2)
invisible(summary(BoxCox_Model))
(kable(summary(BoxCox_Model)$coef, digits = 5, caption ="Coefficients for Final Multiple Linear Regression Model"))
Coefficients for Final Multiple Linear Regression
Model
(Intercept) |
5.01154 |
0.51694 |
9.69466 |
0 |
PTS |
0.83900 |
0.06582 |
12.74628 |
0 |
TRB |
1.19507 |
0.14462 |
8.26368 |
0 |
STL |
5.82980 |
0.88906 |
6.55724 |
0 |
Bootstrap
Regression
I employed two different bootstrap sampling methods (case and
residual) to estimate the coefficients of my regression model. With each
model, I created 95% confidence intervals for each coefficient of
interest (intercept, PTS, TRB and STL).
Bootstrap Approach 1
- Bootstrap Cases
The first method I used was called bootstrap case sampling. This
approach consists of taking an extremely high number of repeated samples
(with replacement), of the dataset at hand and then recalculating the
predictor variables’ regression coefficients each time to fit the
response variable’s value. These regression coefficients from each
sample then make up each coefficient’s bootstrap distribution.
Bootstrap_Model = lm (MP ~ PTS + TRB + STL, data = Reduced_Playoff_Data2)
##
B = 1000 # choose the number of bootstrap replicates.
##
num.p = dim(model.frame(Bootstrap_Model))[2] # returns number of parameters in the model
smpl.n = dim(model.frame(Bootstrap_Model))[1] # sample size
## zero matrix to store bootstrap coefficients
coef.mtrx = matrix(rep(0, B*num.p), ncol = num.p)
##
for (i in 1:B){
bootc.id = sample(1:smpl.n, smpl.n, replace = TRUE) # fit final model to the bootstrap sample
Bootstrap_Model.btc = lm (MP ~ PTS + TRB + STL, data = Reduced_Playoff_Data2[bootc.id,])
coef.mtrx[i,] = coef(Bootstrap_Model.btc) # extract coefs from bootstrap regression model
}
boot.hist = function(cmtrx, bt.coef.mtrx, var.id, var.nm){
## bt.coef.mtrx = matrix for storing bootstrap estimates of coefficients
## var.id = variable ID (1, 2, ..., k+1)
## var.nm = variable name on the hist title, must be the string in the double quotes
## coefficient matrix of the final model
## Bootstrap sampling distribution of the estimated coefficients
x1.1 = seq(min(bt.coef.mtrx[,var.id]), max(bt.coef.mtrx[,var.id]), length=300 )
y1.1 = dnorm(x1.1, mean(bt.coef.mtrx[,var.id]), sd(bt.coef.mtrx[,var.id]))
# height of the histogram - use it to make a nice-looking histogram.
highestbar = max(hist(bt.coef.mtrx[,var.id], plot = FALSE)$density)
ylimit = max(c(y1.1,highestbar))
hist(bt.coef.mtrx[,var.id], probability = TRUE, main = var.nm, xlab="",
col = "azure1",ylim=c(0,ylimit), border="lightseagreen")
lines(x = x1.1, y = y1.1, col = "red3")
lines(density(bt.coef.mtrx[,var.id], adjust=2), col="blue")
#legend("topright", c(""))
}
par(mfrow=c(2,2)) # histograms of bootstrap coefs
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=1, var.nm ="Slope Intercept" )
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=2, var.nm ="Points Per Game" )
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=3, var.nm ="Rebounds Per Game" )
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=4, var.nm ="Steals Per Game" )

num.p = dim(coef.mtrx)[2] # number of parameters
btc.ci = NULL
btc.wd = NULL
for (i in 1:num.p){
lci.025 = round(quantile(coef.mtrx[, i], 0.025, type = 2),8)
uci.975 = round(quantile(coef.mtrx[, i],0.975, type = 2 ),8)
btc.wd[i] = uci.975 - lci.025
btc.ci[i] = paste("[", round(lci.025,4),", ", round(uci.975,4),"]")
}
#as.data.frame(btc.ci)
Rounded_btc.wd = round(btc.wd,4)
cmtrx = summary(Bootstrap_Model)$coef
kable(as.data.frame(cbind(formatC(cmtrx,4,format="f"), btc.ci.95=btc.ci, Rounded_btc.wd)),
caption = "Bootstrap Cases Regression Coefficient Matrix, 95% CI")
Bootstrap Cases Regression Coefficient Matrix, 95% CI
(Intercept) |
5.0115 |
0.5169 |
9.6947 |
0.0000 |
[ 4.0371 , 6.0342 ] |
1.9971 |
PTS |
0.8390 |
0.0658 |
12.7463 |
0.0000 |
[ 0.6894 , 0.9928 ] |
0.3034 |
TRB |
1.1951 |
0.1446 |
8.2637 |
0.0000 |
[ 0.8464 , 1.5828 ] |
0.7364 |
STL |
5.8298 |
0.8891 |
6.5572 |
0.0000 |
[ 3.7215 , 8.0458 ] |
4.3244 |
Bootstrap Cases
Summary
After performing bootstrap case sampling (n = 1000), I created
histograms of each coefficient’s bootstrap sampling distribution. Each
histogram also contains two density curves. The red curve is
representative of estimated coefficients via the linear model that we
started today’s analysis with (the box-cox transformed model). While the
blue curve represents the bootstrap sampling distribution formed from
the cases method. We see tremendous overlap between the two curves in
all of our histograms, especially for the slope intercept and rebounds
per game. This indicates that there is not much difference between the
coefficient values originally calculated via the regression model, and
those found through the bootstrap sampling.
Next, we look at our 95% confidence intervals (btc.ci.95). We can see
that each interval captures the range of values that our original
regression coefficient plus or minus its given standard error would be
equal to. This means that there is no contradiction between our model
and its bootstrap sampling distribution.
Bootstrap Approach 2
- Bootstrap Residuals
The second bootstrap method I used is known as bootstrap residual
sampling. This approach works by fitting a regression model to our data
in an attempt to obtain predicted values and then creating new bootstrap
samples from so. These new samples consist of a new response variable by
which the model fits by adding randomly sampled residuals (with
replacement) to said response variables (PTS, TRB and STL).
hist(sort(Bootstrap_Model$residuals),n=40,
xlab="Residuals",
col = "lightblue",
border="navy",
main = "Histogram of Model Residuals")

invisible(mean(Bootstrap_Model$residuals))
Bootstrap_Model = lm (MP ~ PTS + TRB + STL, data = Reduced_Playoff_Data2)
model.resid = Bootstrap_Model$residuals
##
C=1500
num.p = dim(model.matrix(Bootstrap_Model))[2] # number of parameters
samp.n = dim(model.matrix(Bootstrap_Model))[1] # sample size
btr.mtrx = matrix(rep(0,6*B), ncol=num.p) # zero matrix to store boot coefs
for (i in 1:C){
## Bootstrap response values
bt.Bootstrap_Model = Bootstrap_Model$fitted.values +
sample(Bootstrap_Model$residuals, samp.n, replace = TRUE) # bootstrap residuals
# replace PriceUnitArea with bootstrap log price
Reduced_Playoff_Data2$bt.Bootstrap_Model = bt.Bootstrap_Model # send the boot response to the data
btr.model = lm(bt.Bootstrap_Model ~ PTS + TRB + STL, data = Reduced_Playoff_Data2)
btr.mtrx[i,]=btr.model$coefficients
}
boot.hist = function(bt.coef.mtrx, var.id, var.nm){
## bt.coef.mtrx = matrix for storing bootstrap estimates of coefficients
## var.id = variable ID (1, 2, ..., k+1)
## var.nm = variable name on the hist title, must be the string in the double quotes
## Bootstrap sampling distribution of the estimated coefficients
x1.1 <- seq(min(bt.coef.mtrx[,var.id]), max(bt.coef.mtrx[,var.id]), length=300 )
y1.1 <- dnorm(x1.1, mean(bt.coef.mtrx[,var.id]), sd(bt.coef.mtrx[,var.id]))
# height of the histogram - use it to make a nice-looking histogram.
highestbar = max(hist(bt.coef.mtrx[,var.id], plot = FALSE)$density)
ylimit <- max(c(y1.1,highestbar))
hist(bt.coef.mtrx[,var.id], probability = TRUE, main = var.nm, xlab="",
col = "azure1",ylim=c(0,ylimit), border="lightseagreen")
lines(x = x1.1, y = y1.1, col = "red3") # normal density curve
lines(density(bt.coef.mtrx[,var.id], adjust=2), col="blue") # loess curve
}
par(mfrow=c(2,2)) # histograms of bootstrap coefs
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=1, var.nm ="Slope Intercept" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=2, var.nm ="Points Per Game" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=3, var.nm ="Rebounds Per Game" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=4, var.nm ="Steals Per Game" )

num.p = dim(coef.mtrx)[2] # number of parameters
btr.ci = NULL
btr.wd = NULL
for (i in 1:num.p){
lci.025 = round(quantile(btr.mtrx[, i], 0.025, type = 2),8)
uci.975 = round(quantile(btr.mtrx[, i],0.975, type = 2 ),8)
btr.wd[i] = uci.975 - lci.025
btr.ci[i] = paste("[", round(lci.025,4),", ", round(uci.975,4),"]")
}
Rounded_btr.wd = round(btr.wd,4)
kable(as.data.frame(cbind(formatC(cmtrx,4,format="f"), btr.ci.95=btr.ci, Rounded_btr.wd)),
caption = "Bootstrap Residual Regression Coefficient Matrix, 95% CI")
Bootstrap Residual Regression Coefficient Matrix, 95%
CI
(Intercept) |
5.0115 |
0.5169 |
9.6947 |
0.0000 |
[ 4.0389 , 6.0412 ] |
2.0023 |
PTS |
0.8390 |
0.0658 |
12.7463 |
0.0000 |
[ 0.7148 , 0.975 ] |
0.2602 |
TRB |
1.1951 |
0.1446 |
8.2637 |
0.0000 |
[ 0.9104 , 1.4664 ] |
0.556 |
STL |
5.8298 |
0.8891 |
6.5572 |
0.0000 |
[ 4.1143 , 7.5571 ] |
3.4429 |
Bootstrap Residuals
Summary
We can see from our model residual histogram that our residuals are
not quite normally distributed, as there is a visible slight rightward
skew, yet the calculated mean of the residuals is extremely close to
zero (~7.4-17), so we proceed with caution.
Like in the residual bootstrap histograms produced from the bootstrap
cases method, our histograms here show a close proximity between the
traditional regression density curve (red line) and the bootstrap-based
density curve (blue line). This indicates a high degree of shared values
from the two estimations. Note that in this instance, the blue line is
representative of bootstrap residuals’ sampling distribution.
Lastly, we take a look at the 95% confidence intervals (btr.ci.95)
created for our regression coefficients. Once again, all of our
confidence intervals via the bootstrap method are within the range of
our original regression-calculated coefficients, plus or minus their
given standard error. Hence the results from performing the bootstrap
residuals method are in alignment with the takeaways from our simplified
model that we started today’s analysis with.
Conclusion
Below we can see a summary of our model, its original regression
coefficients, as well as the confidence interval metrics for its
bootstrap cases and bootstrap residual distributions. Given that all
three methods yielded results practically in alignment with each other,
we can safely say that an NBA player’s average playing time has a strong
linear relationship with his averages in points, rebounds and steals per
game, with the exact degree of correlation slightly differing via the
widths of said confidence intervals.
Rounded_Interval_Widths = cbind(Rounded_btc.wd, Rounded_btr.wd)
kable(as.data.frame(cbind(formatC(cmtrx[,-3],4,format="f"), btc.ci.95=btc.ci,btr.ci.95=btr.ci, Rounded_Interval_Widths)),
caption="Final Regression Coefficient Estimates: P-Values and Bootstrap CIs")
Final Regression Coefficient Estimates: P-Values and Bootstrap
CIs
(Intercept) |
5.0115 |
0.5169 |
0.0000 |
[ 4.0371 , 6.0342 ] |
[ 4.0389 , 6.0412 ] |
1.9971 |
2.0023 |
PTS |
0.8390 |
0.0658 |
0.0000 |
[ 0.6894 , 0.9928 ] |
[ 0.7148 , 0.975 ] |
0.3034 |
0.2602 |
TRB |
1.1951 |
0.1446 |
0.0000 |
[ 0.8464 , 1.5828 ] |
[ 0.9104 , 1.4664 ] |
0.7364 |
0.556 |
STL |
5.8298 |
0.8891 |
0.0000 |
[ 3.7215 , 8.0458 ] |
[ 4.1143 , 7.5571 ] |
4.3244 |
3.4429 |
Summary/Discussion
The findings from my analysis over the past few weeks could be
valuable to NBA players as well as those deeply vested in their on-court
performance such as agents and advertisers. If a player desires more
playing time, it is helpful to know exactly what metrics (PTS, TRB and
STL in this case) having an increase in will traditionally correlate
with more playing time across the league.
For my analysis, the most recent dataset I could find was one from
the 2022-2023 NBA season. Moving forward, if there was a way I could
perform similar exploration on data from the most recent NBA season,
that would potentially bring even more credibility to whatever
findings resulted.
---
title: "Bootstrap Sampling to Explore Relationship Between an NBA Player's Scoring, Rebounding and Stealing with Playing Time"
author: "Chris Bahm"
date: "2025-09-29"
output:
  html_document:
    toc: true
    toc_float:
      collapsed: true
      smooth_scroll: true
    toc_depth: 4
    fig_width: 6
    fig_height: 4
    fig_caption: true
    number_sections: true
    code_folding: hide
    code_download: true
    theme: lumen
    highlight: tango
  pdf_document:
    toc: true
    toc_depth: 4
    fig_caption: true
    number_sections: true
  word_document:
    toc: true
    toc_depth: 4
---

```{css, echo = FALSE}
div#TOC li {     /* table of content  */
    list-style:upper-roman;
    background-image:none;
    background-repeat:none;
    background-position:0;
}

h1.title {    /* level 1 header of title  */
  font-size: 24px;
  font-weight: bold;
  color: DarkRed;
  text-align: center;
}

h4.author { /* Header 4 - and the author and data headers use this too  */
  font-size: 18px;
  font-weight: bold;
  font-family: "Times New Roman", Times, serif;
  color: DarkRed;
  text-align: center;
}

h4.date { /* Header 4 - and the author and data headers use this too  */
  font-size: 18px;
  font-weight: bold;
  font-family: "Times New Roman", Times, serif;
  color: DarkBlue;
  text-align: center;
}

h1 { /* Header 1 - and the author and data headers use this too  */
    font-size: 20px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: center;
}

h2 { /* Header 2 - and the author and data headers use this too  */
    font-size: 18px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h3 { /* Header 3 - and the author and data headers use this too  */
    font-size: 16px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h4 { /* Header 4 - and the author and data headers use this too  */
    font-size: 14px;
  font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: left;
}

/* Add dots after numbered headers */
.header-section-number::after {
  content: ".";
}
```

```{r setup, include=FALSE}
# code chunk specifies whether the R code, warnings, and output 
# will be included in the output files.

if (!require("knitr")) {                      # use conditional statement to detect
   install.packages("knitr")                  # whether a package was installed in
   library(knitr)                             # your machine. If not, install it and
}                                             # load it to the working directory.

if (!require(tidyverse)) {library(tidyvserse)} 

if (!require(GGally)) {library(GGally)} 

if (!require(kableExtra)) {library(kableExtra)} 

if (!require(ggplot2)) {library(ggplot2)} 

if (!require(car)) {library(car)} 

if (!require(dplyr)) {library(dplyr)} 

if (!require("scales")) {
install.packages("scales")                                        
library("scales") 
}

knitr::opts_chunk$set(
	echo = TRUE,
	message = FALSE,
	warning = FALSE,
	comment = NA,
	results = TRUE
)
   
```

# Introduction and Background
The dataset I am using for my analysis is an extensive collection of metrics for every NBA player's game-by-game performance across the 2022-2023 playoff season. It was published on the open-source data science platform Kaggle (link can be found in references section below). The original source for the data's accuracy can be confirmed at Basketball-reference.com, a well-established database for enthusiasts of the sport. All statistical analysis will be conducted via the programming language R.

As for the dataset's contents, it is composed of all 217 players who played some amount of time in the 2022-2023 postseason. It contained 30 variables, however only 10 played a role in my analysis, so I created a reduced dataset which only contained the information pertinent to my research question. An explanation and breakdown of the names and data types for the variables used in my analysis is below:

-   "Player" = player's name = character

-   "Pos" = player's position on the court or in starting lineup = character

-   "Tm" = team player plays for = character

-   "Age" = player's age = integer

-   "PTS" = player's points per game average = double

-   "AST" = player's assists per game average = double

-   "TRB" = player's rebounds per game average = double

-   "STL" = player's steals per game average = double

-   "BLK" = player's blocks per game average = double

-   "MP" = player's minutes per game average = double

I chose to confine my analysis to operating on these variables, and not to include the remaining ones in the original dataset for a couple of reasons.

- First, when making linear models to predict or estimate relationships, it is ideal to include the least amount of variables as possible as long as no significant amount of accuracy is lost by variable reduction. This is to prevent a phenomenon known as "overfitting" in which the models we create work almost perfectly for our given dataset, but not as well when applied to samples outside of this particular dataset.

- Also, I believed from a practical persepective, many of the variables in this dataset served a redundant purpose. For example; measurements such as total field goals attempted and made, two-point field goals attempted and made, three-point field goals attempted and made, as well as the respective shooting percentages, are all representative of a player's scoring output. I find it more logical to simply retain the variable PTS for the analysis as that serves as a summary of a player's scoring output.

Previously, I have used the data at hand to perform both simple and multiple linear regression. Via my simple linear regression model, I found there to be a strong, positive and significant linear relationship between a player's average scoring output and his average playing time. Through conducting multiple linear regression, I discovered that there is strong evidence to suggest a positive and linear relationship between a player's given playing time and their per-game averages in points, rebounds and steals, yet no such significant relationship exists between his minutes per game and his averages in blocks or assists, nor his age, position on the court or team he plays for. Both of my previous analyses can be found under the References section below.

Today, I am going to expand upon my multiple linear regression model which found a significant relationship between an NBA player's playing time and his points per game (ppg), rebounds per game (rpg) and steals per game (spg). I will be using bootstrap sampling to create confidence intervals for the coefficients of each of those three metrics. A benefit of doing so is that bootstrap sampling is a non-parametric technique, meaning we do not need to make as strong of assumptions about the underlying population (in this case professional basketball players) in order to ensure the validity of results.

```{r Data_Read_in_and_Cleaning, include=FALSE}

Data_Url = "https://raw.githubusercontent.com/ChrisB2323/STA321/main/Playoff_Data.csv"
Full_Playoff_Data = read_delim(Data_Url, delim = ";", show_col_types = FALSE)
glimpse(Full_Playoff_Data)

Reduced_Playoff_Data = Full_Playoff_Data %>% dplyr::select(Player, Pos, Tm, Age, PTS, AST, TRB, STL, BLK, MP)
glimpse(Reduced_Playoff_Data)

Player = Reduced_Playoff_Data$Player
Pos = Reduced_Playoff_Data$Pos
Tm = Reduced_Playoff_Data$Tm
Age = Reduced_Playoff_Data$Age
PTS = Reduced_Playoff_Data$PTS
AST = Reduced_Playoff_Data$AST
TRB = Reduced_Playoff_Data$TRB
STL = Reduced_Playoff_Data$STL
BLK = Reduced_Playoff_Data$BLK
MP = Reduced_Playoff_Data$MP

Quantitative_Predictor_Variables = Reduced_Playoff_Data[c("Age", "PTS", "AST", "TRB", "STL", "BLK")]

Categorical_Predictor_Variables = Reduced_Playoff_Data[c("Pos", "Tm")]

# Quantitative Predictor: Age, PTS, AST, TRB, STL, BLK
# Categorical Predictor: Pos, Tm

All_Predictor_Variables = cbind(Quantitative_Predictor_Variables, Categorical_Predictor_Variables)
```

# Multiple Regression Model
A summary of my final multiple regression model can be seen below. After previously performing numerous rounds of stepwise regression, residual analysis and a box-cox transformation, I came to the following conclusion. Assuming all other elements of a player's performance remain unchanged; every increase of one point per game will lead to an extra 0.84 minutes per game, every extra rebound averaged per game will cause an extra 1.2 minutes per game, and every additional steal a player averages per game correlates with an extra 5.8 minutes of playing time per game.
```{r results='hide'}
library(MASS)

Reduced_Playoff_Data2 = Reduced_Playoff_Data[Reduced_Playoff_Data$MP > 0, ]

Forward_Model2 = lm(MP ~ PTS + TRB + STL, data = Reduced_Playoff_Data2)
```

```{r, include=FALSE}
Boxcox = boxcox(Forward_Model2, lambda = seq(-2, 2, 0.1))
```

```{r}
lambda = 1.05

# Modify MP variable in dataset with lamba transformation
Reduced_Playoff_Data2$MP_bc = if (lambda == 0) {
  log(Reduced_Playoff_Data2$MP)
} else {
  (Reduced_Playoff_Data2$MP^lambda - 1) / lambda
}
BoxCox_Model = lm(MP ~ PTS + TRB + STL, data = Reduced_Playoff_Data2)
invisible(summary(BoxCox_Model))

(kable(summary(BoxCox_Model)$coef, digits = 5, caption ="Coefficients for Final Multiple Linear Regression Model"))
```

# Bootstrap Regression
I employed two different bootstrap sampling methods (case and residual) to estimate the coefficients of my regression model. With each model, I created 95% confidence intervals for each coefficient of interest (intercept, PTS, TRB and STL). 

## Bootstrap Approach 1 - Bootstrap Cases
The first method I used was called bootstrap case sampling. This approach consists of taking an extremely high number of repeated samples (with replacement), of the dataset at hand and then recalculating the predictor variables' regression coefficients each time to fit the response variable's value. These regression coefficients from each sample then make up each coefficient's bootstrap distribution.

```{r Original Coding of Bootstrap Method 1, include=FALSE}
set.seed(123)                 # for reproducibility
B = 1000                     # number of bootstrap resamples

n = nrow(Reduced_Playoff_Data2)
vec_id =seq_len(n)

# storage
Boot.beta0 = numeric(B) # intercept
Boot.beta1 = numeric(B) # slope
Boot.beta2 = numeric(B) # slope
Boot.beta3 = numeric(B) # slope

for (i in seq_len(B)) {
  boot_id = sample(vec_id, n, replace = TRUE) # resample rows
  boot_reg = lm(MP ~ PTS + TRB + STL, data = Reduced_Playoff_Data2[boot_id, , drop = FALSE]) 
  co = coef(boot_reg)
  Boot.beta0[i] = co[1]    # intercept
  Boot.beta1[i] = co[2]    # slope for PTS
  Boot.beta2[i] = co[3]    # slope for TRB
  Boot.beta3[i] = co[4]    # slope for STL
}

# 95% percentile confidence intervals
ci_beta0 = quantile(Boot.beta0, c(0.025, 0.975))
ci_beta1 = quantile(Boot.beta1, c(0.025, 0.975))
ci_beta2 = quantile(Boot.beta2, c(0.025, 0.975))
ci_beta3 = quantile(Boot.beta3, c(0.025, 0.975))

ci_beta0 
ci_beta1
ci_beta2
ci_beta3

par(mfrow=c(2,2))
hist(Boot.beta0, main = "Slope Intercept Model Coefficient", xlab = "Slope Intercept Value", ylab = "Bootstrap Sample Frequency")
hist(Boot.beta1, main = "Points Per Game Model Coefficient", xlab = "Points Per Game Value", ylab = "Bootstrap Sample Frequency")
hist(Boot.beta2, main = "Rebounds Per Game Model Coefficient", xlab = "Rebounds Per Game Value", ylab = "Bootstrap Sample Frequency")
hist(Boot.beta3, main = "Steals Per Game Model Coefficient", xlab = "Steals Per Game Value", ylab = "Bootstrap Sample Frequency")
```

```{r New Coding of Bootstrap Method 1}
Bootstrap_Model = lm (MP ~ PTS + TRB + STL, data = Reduced_Playoff_Data2)
##
B = 1000       # choose the number of bootstrap replicates.
## 
num.p = dim(model.frame(Bootstrap_Model))[2]  # returns number of parameters in the model
smpl.n = dim(model.frame(Bootstrap_Model))[1] # sample size
## zero matrix to store bootstrap coefficients 
coef.mtrx = matrix(rep(0, B*num.p), ncol = num.p)       
## 
for (i in 1:B){
  bootc.id = sample(1:smpl.n, smpl.n, replace = TRUE) # fit final model to the bootstrap sample
  Bootstrap_Model.btc = lm (MP ~ PTS + TRB + STL, data = Reduced_Playoff_Data2[bootc.id,])     
  coef.mtrx[i,] = coef(Bootstrap_Model.btc)    # extract coefs from bootstrap regression model    
}

boot.hist = function(cmtrx, bt.coef.mtrx, var.id, var.nm){
  ## bt.coef.mtrx = matrix for storing bootstrap estimates of coefficients
  ## var.id = variable ID (1, 2, ..., k+1)
  ## var.nm = variable name on the hist title, must be the string in the double quotes
  ## coefficient matrix of the final model
  ## Bootstrap sampling distribution of the estimated coefficients
  x1.1 = seq(min(bt.coef.mtrx[,var.id]), max(bt.coef.mtrx[,var.id]), length=300 )
  y1.1 = dnorm(x1.1, mean(bt.coef.mtrx[,var.id]), sd(bt.coef.mtrx[,var.id]))
  # height of the histogram - use it to make a nice-looking histogram.
  highestbar = max(hist(bt.coef.mtrx[,var.id], plot = FALSE)$density) 
  ylimit = max(c(y1.1,highestbar))
  hist(bt.coef.mtrx[,var.id], probability = TRUE, main = var.nm, xlab="", 
       col = "azure1",ylim=c(0,ylimit), border="lightseagreen")
  lines(x = x1.1, y = y1.1, col = "red3")
  lines(density(bt.coef.mtrx[,var.id], adjust=2), col="blue") 
  #legend("topright", c(""))
} 
  par(mfrow=c(2,2))  # histograms of bootstrap coefs
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=1, var.nm ="Slope Intercept" )
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=2, var.nm ="Points Per Game" )
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=3, var.nm ="Rebounds Per Game" )
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=4, var.nm ="Steals Per Game" )

num.p = dim(coef.mtrx)[2]  # number of parameters
btc.ci = NULL
btc.wd = NULL
for (i in 1:num.p){
  lci.025 = round(quantile(coef.mtrx[, i], 0.025, type = 2),8)
  uci.975 = round(quantile(coef.mtrx[, i],0.975, type = 2 ),8)
  btc.wd[i] =  uci.975 - lci.025
  btc.ci[i] = paste("[", round(lci.025,4),", ", round(uci.975,4),"]")
 }
#as.data.frame(btc.ci)
Rounded_btc.wd = round(btc.wd,4)

cmtrx = summary(Bootstrap_Model)$coef

kable(as.data.frame(cbind(formatC(cmtrx,4,format="f"), btc.ci.95=btc.ci, Rounded_btc.wd)), 
      caption = "Bootstrap Cases Regression Coefficient Matrix, 95% CI")
```
### Bootstrap Cases Summary
After performing bootstrap case sampling (n = 1000), I created histograms of each coefficient's bootstrap sampling distribution. Each histogram also contains two density curves. The red curve is representative of estimated coefficients via the linear model that we started today's analysis with (the box-cox transformed model). While the blue curve represents the bootstrap sampling distribution formed from the cases method. We see tremendous overlap between the two curves in all of our histograms, especially for the slope intercept and rebounds per game. This indicates that there is not much difference between the coefficient values originally calculated via the regression model, and those found through the bootstrap sampling.

Next, we look at our 95% confidence intervals (btc.ci.95). We can see that each interval captures the range of values that our original regression coefficient plus or minus its given standard error would be equal to. This means that there is no contradiction between our model and its bootstrap sampling distribution.

## Bootstrap Approach 2 - Bootstrap Residuals
The second bootstrap method I used is known as bootstrap residual sampling. This approach works by fitting a regression model to our data in an attempt to obtain predicted values and then creating new bootstrap samples from so. These new samples consist of a new response variable by which the model fits by adding randomly sampled residuals (with replacement) to said response variables (PTS, TRB and STL).

```{r}
hist(sort(Bootstrap_Model$residuals),n=40,
     xlab="Residuals",
     col = "lightblue",
     border="navy",
     main = "Histogram of Model Residuals")

invisible(mean(Bootstrap_Model$residuals))

Bootstrap_Model = lm (MP ~ PTS + TRB + STL, data = Reduced_Playoff_Data2)
model.resid = Bootstrap_Model$residuals
##
C=1500
num.p = dim(model.matrix(Bootstrap_Model))[2]   # number of parameters
samp.n = dim(model.matrix(Bootstrap_Model))[1]  # sample size
btr.mtrx = matrix(rep(0,6*B), ncol=num.p) # zero matrix to store boot coefs
for (i in 1:C){
  ## Bootstrap response values
  bt.Bootstrap_Model = Bootstrap_Model$fitted.values + 
        sample(Bootstrap_Model$residuals, samp.n, replace = TRUE)  # bootstrap residuals
  # replace PriceUnitArea with bootstrap log price
  Reduced_Playoff_Data2$bt.Bootstrap_Model =  bt.Bootstrap_Model   #  send the boot response to the data
  btr.model = lm(bt.Bootstrap_Model ~ PTS + TRB + STL, data = Reduced_Playoff_Data2)   
  btr.mtrx[i,]=btr.model$coefficients
}

boot.hist = function(bt.coef.mtrx, var.id, var.nm){
  ## bt.coef.mtrx = matrix for storing bootstrap estimates of coefficients
  ## var.id = variable ID (1, 2, ..., k+1)
  ## var.nm = variable name on the hist title, must be the string in the double quotes
  ## Bootstrap sampling distribution of the estimated coefficients
  x1.1 <- seq(min(bt.coef.mtrx[,var.id]), max(bt.coef.mtrx[,var.id]), length=300 )
  y1.1 <- dnorm(x1.1, mean(bt.coef.mtrx[,var.id]), sd(bt.coef.mtrx[,var.id]))
  # height of the histogram - use it to make a nice-looking histogram.
  highestbar = max(hist(bt.coef.mtrx[,var.id], plot = FALSE)$density) 
  ylimit <- max(c(y1.1,highestbar))
  hist(bt.coef.mtrx[,var.id], probability = TRUE, main = var.nm, xlab="", 
       col = "azure1",ylim=c(0,ylimit), border="lightseagreen")
  lines(x = x1.1, y = y1.1, col = "red3")       # normal density curve         
  lines(density(bt.coef.mtrx[,var.id], adjust=2), col="blue")    # loess curve
} 
par(mfrow=c(2,2))  # histograms of bootstrap coefs
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=1, var.nm ="Slope Intercept" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=2, var.nm ="Points Per Game" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=3, var.nm ="Rebounds Per Game" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=4, var.nm ="Steals Per Game" )

num.p = dim(coef.mtrx)[2]  # number of parameters
btr.ci = NULL
btr.wd = NULL
for (i in 1:num.p){
  lci.025 = round(quantile(btr.mtrx[, i], 0.025, type = 2),8)
  uci.975 = round(quantile(btr.mtrx[, i],0.975, type = 2 ),8)
  btr.wd[i] = uci.975 - lci.025
  btr.ci[i] = paste("[", round(lci.025,4),", ", round(uci.975,4),"]")
}

Rounded_btr.wd = round(btr.wd,4)
kable(as.data.frame(cbind(formatC(cmtrx,4,format="f"), btr.ci.95=btr.ci, Rounded_btr.wd)), 
      caption = "Bootstrap Residual Regression Coefficient Matrix, 95% CI")

```
### Bootstrap Residuals Summary
We can see from our model residual histogram that our residuals are not quite normally distributed, as there is a visible slight rightward skew, yet the calculated mean of the residuals is extremely close to zero (~7.4^-17^), so we proceed with caution.

Like in the residual bootstrap histograms produced from the bootstrap cases method, our histograms here show a close proximity between the traditional regression density curve (red line) and the bootstrap-based density curve (blue line). This indicates a high degree of shared values from the two estimations. Note that in this instance, the blue line is representative of bootstrap residuals' sampling distribution.

Lastly, we take a look at the 95% confidence intervals (btr.ci.95) created for our regression coefficients. Once again, all of our confidence intervals via the bootstrap method are within the range of our original regression-calculated coefficients, plus or minus their given standard error. Hence the results from performing the bootstrap residuals method are in alignment with the takeaways from our simplified model that we started today's analysis with. 


# Conclusion
Below we can see a summary of our model, its original regression coefficients, as well as the confidence interval metrics for its bootstrap cases and bootstrap residual distributions. Given that all three methods yielded results practically in alignment with each other, we can safely say that an NBA player's average playing time has a strong linear relationship with his averages in points, rebounds and steals per game, with the exact degree of correlation slightly differing via the widths of said confidence intervals.

```{r}
Rounded_Interval_Widths = cbind(Rounded_btc.wd, Rounded_btr.wd)

kable(as.data.frame(cbind(formatC(cmtrx[,-3],4,format="f"), btc.ci.95=btc.ci,btr.ci.95=btr.ci, Rounded_Interval_Widths)), 
      caption="Final Regression Coefficient Estimates: P-Values and Bootstrap CIs")

```

# Summary/Discussion
The findings from my analysis over the past few weeks could be valuable to NBA players as well as those deeply vested in their on-court performance such as agents and advertisers. If a player desires more playing time, it is helpful to know exactly what metrics (PTS, TRB and STL in this case) having an increase in will traditionally correlate with more playing time across the league.

For my analysis, the most recent dataset I could find was one from the 2022-2023 NBA season. Moving forward, if there was a way I could perform similar exploration on data from the most recent NBA season, that would potentially bring even _more_ credibility to whatever findings resulted.

# References:
Original Dataset Source: <https://www.kaggle.com/datasets/vivovinco/20222023-nba-player-stats-regular>

Dataset Download Link via Github:
https://raw.githubusercontent.com/ChrisB2323/STA321/main/Playoff_Data.csv 

My Previous Analyses:

- Relationship Between All 5 Standard Statistics and Circumstance with Playing Time https://rpubs.com/Chris_Bahm/1347436

- Relationship Between Scoring Output and Playing Time: https://rpubs.com/Chris_Bahm/1344468