This analysis examines factors that influence an NBA basketball player’s WinShare totals. Using selected explanatory variables from the dataset, I explore the linear relationship between WinShares and each variable using a multiple linear regression model. Each observation in the dataset is a player who was drafted between 1989 and 2021. Independent variables include career points, rebounds, assists, field goal percentage, minutes played, plus/minus, and overall draft selection. For purposes of this analysis, players with missing data were removed.
Because of its skewed distribution, WinShares was log-transformed and stored as a separate target variable. ggpairs() was used to check for multicollinearity among the chosen independent variables. Field goal percentage, plus/minus, and overall draft pick selection did not show high correlation with other variables, and were used in the multiple linear regression model. The initial regression model using the original WinShare variable produced an Adjusted R-square of 32.63%. Although there were outliers, most of the data points in the Residual vs Fitted Values plot were clustered around the zero line with no distinct pattern. The residuals histogram was skewed to the right with its center at approximately zero. The QQ-plot displayed a relatively straight line, with its upper end positively skewed. This suggests that the conditions of linearity, near normal residuals, and constant variability are reasonably met. When the target variable was replaced with its log-transformed counterpart, the Adjusted R-Square increased to 46.56%. The Residual vs Fitted Values plot again showed a cluster of data points around the zero line but had fewer outliers. The residuals histogram showed a roughly symmetrical distribution with its center at approximately zero. The line of data points in the QQ-plot was straighter, with no discernible skewness. Overall, the revised model performed better than the initial model.
The National Basketball Association has seen a huge surge in the use of data analytics over the past 15 years. The purpose of aggregating data is to assist in making basketball decisions that will help identify players that can lead to wins and possibly a championship. Using NBA player data obtained from Kaggle, I will explore various variables to ascertain what influences a player’s WinShare. Independent variables utilized include plus/minus, points, total rebounds, assists, minutes played, overall draft pick selection, and field goal percentage.
library(tidyverse)
library(DT)
library(GGally)
library(vtable)
library(visreg)
glimpse(df)
## Rows: 1,922
## Columns: 24
## $ id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1…
## $ year <int> 1989, 1989, 1989, 1989, 1989, 1989, 1989, 19…
## $ rank <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1…
## $ overall_pick <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1…
## $ team <chr> "SAC", "LAC", "SAS", "MIA", "CHH", "CHI", "I…
## $ player <chr> "Pervis Ellison", "Danny Ferry", "Sean Ellio…
## $ college <chr> "Louisville", "Duke", "Arizona", "Michigan",…
## $ years_active <int> 11, 13, 12, 15, 11, 8, 12, 5, 12, 10, 13, 13…
## $ games <int> 474, 917, 742, 1000, 672, 438, 766, 281, 687…
## $ minutes_played <int> 11593, 18133, 24502, 34985, 15370, 7406, 174…
## $ points <int> 4494, 6439, 10544, 18336, 5680, 2819, 6925, …
## $ total_rebounds <int> 3170, 2550, 3204, 4387, 3381, 1460, 2342, 13…
## $ assists <int> 691, 1185, 1897, 2097, 639, 387, 1769, 175, …
## $ field_goal_percentage <dbl> 0.510, 0.446, 0.465, 0.456, 0.472, 0.478, 0.…
## $ X3_point_percentage <dbl> 0.050, 0.393, 0.375, 0.400, 0.135, 0.235, 0.…
## $ free_throw_percentage <dbl> 0.689, 0.840, 0.799, 0.846, 0.716, 0.707, 0.…
## $ average_minutes_played <dbl> 24.5, 19.8, 33.0, 35.0, 22.9, 16.9, 22.8, 19…
## $ points_per_game <dbl> 9.5, 7.0, 14.2, 18.3, 8.5, 6.4, 9.0, 7.4, 5.…
## $ average_total_rebounds <dbl> 6.7, 2.8, 4.3, 4.4, 5.0, 3.3, 3.1, 4.9, 3.3,…
## $ average_assists <dbl> 1.5, 1.3, 2.6, 2.1, 1.0, 0.9, 2.3, 0.6, 0.6,…
## $ win_shares <dbl> 21.8, 34.8, 55.7, 88.7, 22.5, 10.9, 24.6, 1.…
## $ win_shares_per_48_minutes <dbl> 0.090, 0.092, 0.109, 0.122, 0.070, 0.071, 0.…
## $ box_plus_minus <dbl> -0.5, -0.9, 0.2, 0.8, -2.9, -3.4, -0.8, -5.0…
## $ value_over_replacement <dbl> 4.4, 4.9, 13.5, 24.9, -3.7, -2.7, 5.3, -4.0,…
The dataset has 1,922 observations and 24 columns/variables. Each row represents a player who was drafted in the first or second round between 1989 and 2021. Players who did not play in the NBA have missing values. For purposes of this project, I will drop those players with missing data from the dataset. The target variable is WinShares, defined as the sum of Offensive Win Shares and Defensive Win Shares. This article from Basketball Reference describes how Offensive and Defensive Win Shares are respectively calculated. The independent variables used are:
- overall_pick: Overall draft selection of each player (an ordered integer from 1 to 60, entered as a numeric predictor in the models below)
- points: Total career points
- total_rebounds: Total career rebounds
- assists: Total career assists
- field_goal_percentage: Career field goal percentage
- minutes_played: Total career minutes played
- box_plus_minus: Measure of a player’s productivity on the court. Positive numbers indicate that the player helped increase their team’s lead or decrease the deficit. Negative numbers indicate that the deficit increased or the team’s lead decreased.
# summary statistics
st(df)
| Variable | N | Mean | Std. Dev. | Min | Pctl. 25 | Pctl. 75 | Max |
|---|---|---|---|---|---|---|---|
| id | 1922 | 962 | 555 | 1 | 481 | 1442 | 1922 |
| year | 1922 | 2005 | 9.5 | 1989 | 1997 | 2013 | 2021 |
| rank | 1922 | 30 | 17 | 1 | 15 | 44 | 60 |
| overall_pick | 1922 | 30 | 17 | 1 | 15 | 44 | 60 |
| years_active | 1669 | 6.3 | 4.7 | 1 | 2 | 10 | 22 |
| games | 1669 | 348 | 325 | 1 | 72 | 584 | 1541 |
| minutes_played | 1669 | 8399 | 9846 | 0 | 838 | 13246 | 52139 |
| points | 1669 | 3580 | 4826 | 0 | 265 | 5150 | 37062 |
| total_rebounds | 1669 | 1497 | 2004 | 0 | 128 | 2139 | 15091 |
| assists | 1669 | 774 | 1285 | 0 | 46 | 910 | 12091 |
| field_goal_percentage | 1665 | 0.44 | 0.084 | 0 | 0.4 | 0.47 | 1 |
| X3_point_percentage | 1545 | 0.27 | 0.13 | 0 | 0.22 | 0.36 | 1 |
| free_throw_percentage | 1633 | 0.72 | 0.12 | 0 | 0.66 | 0.8 | 1 |
| average_minutes_played | 1669 | 18 | 8.7 | 0 | 11 | 25 | 41 |
| points_per_game | 1669 | 7.3 | 5 | 0 | 3.4 | 10 | 27 |
| average_total_rebounds | 1669 | 3.2 | 2.1 | 0 | 1.7 | 4.2 | 13 |
| average_assists | 1669 | 1.6 | 1.5 | 0 | 0.5 | 2.1 | 9.5 |
| win_shares | 1669 | 18 | 28 | -1.7 | 0.4 | 24 | 250 |
| win_shares_per_48_minutes | 1668 | 0.062 | 0.094 | -1.3 | 0.03 | 0.1 | 1.4 |
| box_plus_minus | 1668 | -2.3 | 4.1 | -52 | -3.9 | -0.3 | 51 |
| value_over_replacement | 1669 | 4.4 | 11 | -8.5 | -0.4 | 4.5 | 143 |
Along with the dependent variable win_shares, I will
choose 7 independent variables to build a multiple linear regression
model:
new_df <- df %>%
select(overall_pick, win_shares, minutes_played, points, total_rebounds, assists, box_plus_minus, field_goal_percentage)
DT::datatable(head(new_df))
# summary statistics of dependent and independent variables
st(new_df)
| Variable | N | Mean | Std. Dev. | Min | Pctl. 25 | Pctl. 75 | Max |
|---|---|---|---|---|---|---|---|
| overall_pick | 1922 | 30 | 17 | 1 | 15 | 44 | 60 |
| win_shares | 1669 | 18 | 28 | -1.7 | 0.4 | 24 | 250 |
| minutes_played | 1669 | 8399 | 9846 | 0 | 838 | 13246 | 52139 |
| points | 1669 | 3580 | 4826 | 0 | 265 | 5150 | 37062 |
| total_rebounds | 1669 | 1497 | 2004 | 0 | 128 | 2139 | 15091 |
| assists | 1669 | 774 | 1285 | 0 | 46 | 910 | 12091 |
| box_plus_minus | 1668 | -2.3 | 4.1 | -52 | -3.9 | -0.3 | 51 |
| field_goal_percentage | 1665 | 0.44 | 0.084 | 0 | 0.4 | 0.47 | 1 |
Drafted players with missing data are removed from the dataset:
new_df <- new_df %>%
drop_na()
sum(is.na(new_df))
## [1] 0
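As a small illustration of what this step does, the sketch below uses a made-up four-row data frame (`toy`, not the real data) and base R's `complete.cases()`, which keeps the same rows that `drop_na()` would. In the real data, the summary table above suggests that about 1,665 rows remain after this step, which is consistent with the 1661 residual degrees of freedom reported by the first model later on.

```r
# Hypothetical toy data: two drafted players never played, so their stats are NA.
toy <- data.frame(
  overall_pick = c(1, 2, 3, 4),
  win_shares   = c(21.8, NA, 55.7, NA),
  points       = c(4494, NA, 10544, NA)
)

# Keep only rows with no missing values (base-R counterpart of tidyr::drop_na()).
toy_complete <- toy[complete.cases(toy), ]

n_dropped <- nrow(toy) - nrow(toy_complete)
print(n_dropped)                 # number of rows removed
print(sum(is.na(toy_complete)))  # remaining missing values
```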
Next, I check the correlation coefficients between win_shares
and the independent variables using ggpairs(). It appears that
box_plus_minus, field_goal_percentage, and
overall_pick are the only variables that don’t have
high correlation coefficients with other independent variables:
p <- ggpairs(new_df[,c(1:8)], lower = list(continuous = wrap("smooth", se=FALSE, alpha = 0.7, size=0.5)))
p[5,3] <- p[5,3] + theme(panel.border = element_rect(color = 'blue', fill = NA, size = 2))
p[3,5] <- p[3,5] + theme(panel.border = element_rect(color = 'blue', fill = NA, size = 2))
p
Based on these results, we will only keep
box_plus_minus, overall_pick, and
field_goal_percentage, since the other variables had
pairwise correlations greater than 0.80 with one another.
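The 0.80 screening rule can also be applied programmatically rather than read off the plot. The following is a minimal sketch on simulated data; the variables `minutes`, `points`, and `bpm` and their distributions are assumptions made for illustration, not the Kaggle data.

```r
set.seed(1)
n <- 500
minutes <- rexp(n, rate = 1 / 8000)             # simulated career minutes
points  <- 0.4 * minutes + rnorm(n, sd = 500)   # built to track minutes closely
bpm     <- rnorm(n, mean = -2, sd = 4)          # simulated independently
sim <- data.frame(minutes, points, bpm)

cors <- cor(sim)

# Flag each pair of distinct variables whose absolute correlation exceeds 0.80.
high <- which(abs(cors) > 0.80 & upper.tri(cors), arr.ind = TRUE)
flagged <- data.frame(
  var1 = rownames(cors)[high[, "row"]],
  var2 = colnames(cors)[high[, "col"]],
  r    = cors[high]
)
print(flagged)
```

Only the upper triangle is scanned so that each pair is reported once and the diagonal (always 1) is ignored.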
Below is a histogram showing the distribution of
win_shares. It appears that the distribution is skewed to
the right, with most players near zero and a long tail of high values:
new_df %>%
ggplot(aes(x=win_shares)) +
geom_histogram(bins = 50) +
labs(title="Amount of WinShares for Each Player",
x="Number of WinShares",
y="Count")
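The direction of skew can be checked numerically as well as visually. A distribution with a long tail of large values has a mean above its median and a positive skewness coefficient. The sketch below uses a simulated long-tailed variable as a stand-in (an assumption, not the real win_shares column) and a small hand-written `skewness()` helper.

```r
set.seed(42)
# Simulated stand-in for a long-tailed career total (not the real data).
x <- rlnorm(1000, meanlog = 2, sdlog = 1)

# Sample skewness: third central moment divided by the cubed standard deviation.
skewness <- function(v) {
  v <- v[!is.na(v)]
  m <- mean(v)
  mean((v - m)^3) / (mean((v - m)^2))^1.5
}

skew_raw <- skewness(x)
skew_log <- skewness(log(x))   # a log transform pulls in the long tail

print(c(mean = mean(x), median = median(x)))
print(c(skew_raw = skew_raw, skew_log = skew_log))
```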
To reduce the skew in the win_shares data, I log-transformed the
variable using log1p. This log-transformed target variable
will be used in a separate multiple linear regression model with the
independent variables. The histogram shows a more symmetric,
closer-to-normal distribution for the transformed win_shares variable:
# log transformation, log winshares +1
new_df$log_win_shares <- log1p(new_df$win_shares)
new_df %>%
ggplot(aes(x=log_win_shares)) +
geom_histogram(bins = 15) +
labs(title="Amount of Log WinShares for Each Player",
x="Log WinShares",
y="Count")
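One detail worth noting about log1p: it computes log(1 + x), which is undefined for x below -1. The summary table reports a minimum win_shares of -1.7, so some transformed values will be NaN. This is a likely explanation (an assumption, not verified against the data) for the 13 observations that the log models later report as deleted due to missingness. The values below are illustrative only.

```r
# Illustrative values spanning the range reported in the summary table.
ws <- c(-1.7, -0.5, 0, 0.4, 24, 250)

# log1p(x) = log(1 + x); R returns NaN (with a warning) when x < -1.
log_ws <- suppressWarnings(log1p(ws))
print(log_ws)

n_undefined <- sum(is.nan(log_ws))
print(n_undefined)
```

Rows with NaN in the response are dropped by lm() under the default na.action, so the log models are fitted on slightly fewer players than the initial model.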
I wanted to get a snapshot of the distribution of data for each
independent variable. box_plus_minus,
field_goal_percentage, and overall_pick
appear to show a roughly normal distribution, while the other independent
variables show a distribution skewed to the right:
new_df %>%
ggplot(aes(box_plus_minus)) +
geom_histogram(bins = 50) +
labs(title="Plus/Minus for Each Player",
x="Plus/Minus",
y="Count")
new_df %>%
ggplot(aes(field_goal_percentage)) +
geom_histogram(bins = 50) +
labs(title="Career Field Goal Percentage for Each Player",
x="Field Goal %",
y="Count")
new_df %>%
ggplot(aes(overall_pick)) +
geom_histogram(bins = 10) +
labs(title="Overall Draft Pick for Each Player",
x="Draft Pick Selection",
y="Count")
Using the original win_shares variable, I created a
multiple linear regression model:
m_initial <- lm(win_shares ~ box_plus_minus + field_goal_percentage
+ overall_pick, data = new_df)
summary(m_initial)
##
## Call:
## lm(formula = win_shares ~ box_plus_minus + field_goal_percentage +
## overall_pick, data = new_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -169.933 -12.910 -5.237 5.759 184.231
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 52.86838 4.09678 12.905 < 2e-16 ***
## box_plus_minus 3.38248 0.18220 18.565 < 2e-16 ***
## field_goal_percentage -34.14967 8.53325 -4.002 6.56e-05 ***
## overall_pick -0.45810 0.03685 -12.432 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.99 on 1661 degrees of freedom
## Multiple R-squared: 0.3275, Adjusted R-squared: 0.3263
## F-statistic: 269.6 on 3 and 1661 DF, p-value: < 2.2e-16
The initial results show that the model has an Adjusted R-square of
32.63%. All 3 variables have p-values less than 0.05, and are therefore
statistically significant predictors of win_shares.
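To make the coefficients concrete, the fitted equation can be evaluated by hand. The sketch below plugs the estimates from the output above into the regression equation for a hypothetical player (box plus/minus of 0, field goal percentage of 0.45, drafted 10th overall); the player profile is an assumption chosen only for illustration.

```r
# Coefficient estimates copied from the summary(m_initial) output above.
coefs <- c(intercept             = 52.86838,
           box_plus_minus        = 3.38248,
           field_goal_percentage = -34.14967,
           overall_pick          = -0.45810)

# Hypothetical player; the leading 1 multiplies the intercept.
player <- c(1, box_plus_minus = 0, field_goal_percentage = 0.45, overall_pick = 10)

pred_ws <- sum(coefs * player)
print(round(pred_ws, 2))   # predicted career win shares, about 32.92

# Holding the other variables fixed, one pick later changes the prediction
# by the overall_pick coefficient.
later_pick <- player
later_pick["overall_pick"] <- 11
pick_effect <- sum(coefs * later_pick) - pred_ws
print(round(pick_effect, 4))
```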
The following are residual plots to capture the linearity, near normal residuals, and constant variability of the model:
ggplot(m_initial, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
labs(title="Residual vs. Fitted Values Plot") +
xlab("Fitted values") +
ylab("Residuals")
ggplot(data = m_initial, aes(x = .resid)) +
geom_histogram(binwidth = 1.5) +
xlab("Residuals")
ggplot(data = m_initial, aes(sample = .resid)) +
stat_qq()
qqnorm(m_initial$residuals)
qqline(m_initial$residuals)
The residual vs. fitted plot shows data points clustered around the zero line with noticeable positive and negative outliers. The histogram is skewed to the right, with its center approximately at zero and a narrow distribution. The QQ-plot shows most of the data points along the line, with its upper end positively skewed. This may indicate that there are more extreme values than would be expected in a normal distribution. Overall, I think this model reasonably meets the least squares conditions of linearity, near normal residuals, and constant variability.
I will replace the win_shares variable with the
log-transformed win_shares variable to see if there are any
differences from the initial model:
m_revised <- lm(log_win_shares ~ box_plus_minus + overall_pick
+ field_goal_percentage, data = new_df)
summary(m_revised)
##
## Call:
## lm(formula = log_win_shares ~ box_plus_minus + overall_pick +
## field_goal_percentage, data = new_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.9597 -0.7986 -0.0048 0.8275 8.3809
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.907331 0.206838 14.056 <2e-16 ***
## box_plus_minus 0.196156 0.009188 21.349 <2e-16 ***
## overall_pick -0.030494 0.001868 -16.328 <2e-16 ***
## field_goal_percentage 0.557263 0.430395 1.295 0.196
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.158 on 1648 degrees of freedom
## (13 observations deleted due to missingness)
## Multiple R-squared: 0.4666, Adjusted R-squared: 0.4656
## F-statistic: 480.5 on 3 and 1648 DF, p-value: < 2.2e-16
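The t value and p-value columns in the table above can be reproduced from the estimate and its standard error. As a sketch, the following recomputes the row for field_goal_percentage using the numbers printed in the output and the two-sided t-test that summary() reports.

```r
# Values copied from the field_goal_percentage row of summary(m_revised).
estimate  <- 0.557263
std_error <- 0.430395
df_resid  <- 1648        # residual degrees of freedom from the output

t_value <- estimate / std_error
# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_value), df = df_resid)

print(round(c(t = t_value, p = p_value), 3))
```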
Using the log-transformed target variable, the Adjusted R-Square
increased to 46.56%. However, field_goal_percentage has a
p-value greater than 0.05. I removed the variable and re-ran the
model:
m_revised2 <- lm(log_win_shares ~ box_plus_minus + overall_pick, data = new_df)
summary(m_revised2)
##
## Call:
## lm(formula = log_win_shares ~ box_plus_minus + overall_pick,
## data = new_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.0281 -0.8093 0.0176 0.8215 8.4910
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.165260 0.055668 56.86 <2e-16 ***
## box_plus_minus 0.203290 0.007355 27.64 <2e-16 ***
## overall_pick -0.030425 0.001867 -16.30 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.158 on 1649 degrees of freedom
## (13 observations deleted due to missingness)
## Multiple R-squared: 0.466, Adjusted R-squared: 0.4654
## F-statistic: 719.6 on 2 and 1649 DF, p-value: < 2.2e-16
The new revised model without field_goal_percentage
decreased the Adjusted R-Square slightly to 46.54%.
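The small change is expected: Adjusted R-Square penalizes each additional predictor, so dropping a weak predictor costs very little. A minimal sketch of the standard formula, 1 - (1 - R²)(n - 1)/(n - p - 1), is shown below. The sample sizes are inferred from the residual degrees of freedom in the outputs (n = residual df + p + 1), and the R² inputs are the rounded values printed by summary(), so the results match only to about four decimals.

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p.
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

initial  <- adj_r2(0.3275, n = 1665, p = 3)  # m_initial:  1661 residual df
revised  <- adj_r2(0.4666, n = 1652, p = 3)  # m_revised:  1648 residual df
revised2 <- adj_r2(0.4660, n = 1652, p = 2)  # m_revised2: 1649 residual df

print(round(c(initial = initial, revised = revised, revised2 = revised2), 4))
```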
ggplot(m_revised2, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
labs(title="Residual vs. Fitted Values Plot") +
xlab("Fitted values") +
ylab("Residuals")
# Use the final model (m_revised2) so all diagnostics describe the same fit
ggplot(data = m_revised2, aes(x = .resid)) +
geom_histogram(binwidth = 1.5) +
xlab("Residuals")
ggplot(data = m_revised2, aes(sample = .resid)) +
stat_qq()
qqnorm(m_revised2$residuals)
qqline(m_revised2$residuals)
Below are visualizations showing each independent variable’s linear relationship with the dependent variable:
visreg(m_revised2)
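Because the response is log1p(win_shares), the slopes describe multiplicative changes in (1 + win_shares) rather than additive changes in win_shares. The sketch below converts the m_revised2 estimates into approximate percentage changes and back-transforms one prediction with expm1(). The player profile (box plus/minus of 0, drafted 10th) is hypothetical, and the back-transformed value is better read as a typical (median-like) value than as a mean.

```r
# Coefficient estimates copied from the summary(m_revised2) output above.
b0     <- 3.165260
b_bpm  <- 0.203290
b_pick <- -0.030425

# Percentage change in (1 + win_shares) per one-unit increase in each predictor.
pct_bpm  <- (exp(b_bpm) - 1) * 100
pct_pick <- (exp(b_pick) - 1) * 100
print(round(c(pct_bpm = pct_bpm, pct_pick = pct_pick), 2))

# Hypothetical player: box plus/minus of 0, drafted 10th overall.
pred_log <- b0 + b_bpm * 0 + b_pick * 10
pred_ws  <- expm1(pred_log)   # inverse of log1p
print(round(pred_ws, 2))
```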
When assessing each model, it appears that the revised model is a better fit for the data than the initial model. The Adjusted R-Square improved from 32.63% to 46.54%, although the two values are measured on different response scales and so are not strictly comparable. This means 46.54% of the variance in log-transformed WinShares can be explained by where players are selected and their plus/minus values. The Residual vs Fitted Values plot contained fewer outliers and was more clustered around the zero line. The residuals histogram had a more symmetrical shape and wider spread, with its center approximately at zero. The line of data points in the revised QQ-plot was straight and did not show discernible skewness. This model does a better job of meeting the least squares conditions of linearity, near normal residuals, and constant variability. Possible next steps to improve the model are identifying and removing additional outliers, and log-transforming the independent variables.