Abstract

This analysis examines factors that influence an NBA player’s WinShare total. Using selected explanatory variables from the dataset, I explore the linear relationship between WinShares and each variable with a multiple linear regression model. Each observation in the dataset is a player who was drafted between 1989 and 2021. The independent variables considered are the career statistics points, rebounds, assists, field goal percentage, minutes played, and plus/minus, along with overall draft selection. For purposes of this analysis, players with missing data were removed.

Because of its skewed distribution, WinShares was log-transformed and stored as a separate target variable. ggpairs() was used to check for multicollinearity among the chosen independent variables. Field goal percentage, plus/minus, and overall draft pick selection did not show high correlation with other variables and were used in the multiple linear regression model. The initial regression model using the original WinShare variable produced an Adjusted R-square of 32.63%. Although there were outliers, most of the data points in the Residual vs Fitted Values plot were clustered around the zero line with no distinct pattern. The residuals histogram was centered at approximately zero and skewed to the right. The QQ-plot displayed a relatively straight line, with its upper end positively skewed. I take this to indicate that the conditions of linearity, near normal residuals, and constant variability are approximately met. When the target variable was replaced with its log-transformed counterpart, the Adjusted R-Square increased to 46.56%. The Residual vs Fitted Values plot again showed data points clustered around zero, with fewer outliers. The residuals histogram was symmetric and approximately normal, centered at approximately zero. The points in the QQ-plot followed a straighter line with no discernible skewness. Overall, the revised model performed better than the initial model.

Introduction

The National Basketball Association has seen a huge surge in the use of data analytics over the past 15 years. The purpose of aggregating data is to assist in making basketball decisions that will help identify players that can lead to wins and possibly a championship. Using NBA player data obtained from Kaggle, I will explore various variables to ascertain what influences a player’s WinShare. Independent variables utilized include plus/minus, points, total rebounds, assists, minutes played, overall draft pick selection, and field goal percentage.

Part 2 - Data

library(tidyverse)
library(DT)
library(GGally)
library(vtable)
library(visreg)
glimpse(df)
## Rows: 1,922
## Columns: 24
## $ id                        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1…
## $ year                      <int> 1989, 1989, 1989, 1989, 1989, 1989, 1989, 19…
## $ rank                      <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1…
## $ overall_pick              <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1…
## $ team                      <chr> "SAC", "LAC", "SAS", "MIA", "CHH", "CHI", "I…
## $ player                    <chr> "Pervis Ellison", "Danny Ferry", "Sean Ellio…
## $ college                   <chr> "Louisville", "Duke", "Arizona", "Michigan",…
## $ years_active              <int> 11, 13, 12, 15, 11, 8, 12, 5, 12, 10, 13, 13…
## $ games                     <int> 474, 917, 742, 1000, 672, 438, 766, 281, 687…
## $ minutes_played            <int> 11593, 18133, 24502, 34985, 15370, 7406, 174…
## $ points                    <int> 4494, 6439, 10544, 18336, 5680, 2819, 6925, …
## $ total_rebounds            <int> 3170, 2550, 3204, 4387, 3381, 1460, 2342, 13…
## $ assists                   <int> 691, 1185, 1897, 2097, 639, 387, 1769, 175, …
## $ field_goal_percentage     <dbl> 0.510, 0.446, 0.465, 0.456, 0.472, 0.478, 0.…
## $ X3_point_percentage       <dbl> 0.050, 0.393, 0.375, 0.400, 0.135, 0.235, 0.…
## $ free_throw_percentage     <dbl> 0.689, 0.840, 0.799, 0.846, 0.716, 0.707, 0.…
## $ average_minutes_played    <dbl> 24.5, 19.8, 33.0, 35.0, 22.9, 16.9, 22.8, 19…
## $ points_per_game           <dbl> 9.5, 7.0, 14.2, 18.3, 8.5, 6.4, 9.0, 7.4, 5.…
## $ average_total_rebounds    <dbl> 6.7, 2.8, 4.3, 4.4, 5.0, 3.3, 3.1, 4.9, 3.3,…
## $ average_assists           <dbl> 1.5, 1.3, 2.6, 2.1, 1.0, 0.9, 2.3, 0.6, 0.6,…
## $ win_shares                <dbl> 21.8, 34.8, 55.7, 88.7, 22.5, 10.9, 24.6, 1.…
## $ win_shares_per_48_minutes <dbl> 0.090, 0.092, 0.109, 0.122, 0.070, 0.071, 0.…
## $ box_plus_minus            <dbl> -0.5, -0.9, 0.2, 0.8, -2.9, -3.4, -0.8, -5.0…
## $ value_over_replacement    <dbl> 4.4, 4.9, 13.5, 24.9, -3.7, -2.7, 5.3, -4.0,…

The dataset has 1,922 observations and 24 variables. Each row represents a player who was drafted in the first or second round between 1989 and 2021. Players who did not play in the NBA have missing values; for this project, I drop those players from the dataset. The target variable is WinShares, defined as the sum of Offensive Win Shares and Defensive Win Shares. This article from Basketball Reference describes how Offensive and Defensive Win Shares are respectively calculated. The independent variables used are:

overall_pick: Overall draft selection of each player (an ordinal pick number, entered as a numeric predictor in the models)
points: Total career points
total_rebounds: Total career rebounds
assists: Total career assists
field_goal_percentage: Career field goal percentage
minutes_played: Total career minutes played
box_plus_minus: Box score-based estimate of the points per 100 possessions a player contributed above a league-average player. Positive values indicate above-average productivity on the court; negative values indicate below-average productivity.

Part 3 - Exploratory Data Analysis

# summary statistics
st(df)
Summary Statistics
Variable                      N   Mean  Std. Dev.   Min  Pctl. 25  Pctl. 75    Max
id                         1922    962        555     1       481      1442   1922
year                       1922   2005        9.5  1989      1997      2013   2021
rank                       1922     30         17     1        15        44     60
overall_pick               1922     30         17     1        15        44     60
years_active               1669    6.3        4.7     1         2        10     22
games                      1669    348        325     1        72       584   1541
minutes_played             1669   8399       9846     0       838     13246  52139
points                     1669   3580       4826     0       265      5150  37062
total_rebounds             1669   1497       2004     0       128      2139  15091
assists                    1669    774       1285     0        46       910  12091
field_goal_percentage      1665   0.44      0.084     0       0.4      0.47      1
X3_point_percentage        1545   0.27       0.13     0      0.22      0.36      1
free_throw_percentage      1633   0.72       0.12     0      0.66       0.8      1
average_minutes_played     1669     18        8.7     0        11        25     41
points_per_game            1669    7.3          5     0       3.4        10     27
average_total_rebounds     1669    3.2        2.1     0       1.7       4.2     13
average_assists            1669    1.6        1.5     0       0.5       2.1    9.5
win_shares                 1669     18         28  -1.7       0.4        24    250
win_shares_per_48_minutes  1668  0.062      0.094  -1.3      0.03       0.1    1.4
box_plus_minus             1668   -2.3        4.1   -52      -3.9      -0.3     51
value_over_replacement     1669    4.4         11  -8.5      -0.4       4.5    143

Along with the dependent variable win_shares, I will choose 7 independent variables to build a multiple linear regression model:

new_df <- df %>%
  select(overall_pick, win_shares, minutes_played, points, total_rebounds, assists, box_plus_minus, field_goal_percentage)
DT::datatable(head(new_df))
# summary statistics of dependent and independent variables
st(new_df)
Summary Statistics
Variable                  N   Mean  Std. Dev.   Min  Pctl. 25  Pctl. 75    Max
overall_pick           1922     30         17     1        15        44     60
win_shares             1669     18         28  -1.7       0.4        24    250
minutes_played         1669   8399       9846     0       838     13246  52139
points                 1669   3580       4826     0       265      5150  37062
total_rebounds         1669   1497       2004     0       128      2139  15091
assists                1669    774       1285     0        46       910  12091
box_plus_minus         1668   -2.3        4.1   -52      -3.9      -0.3     51
field_goal_percentage  1665   0.44      0.084     0       0.4      0.47      1

Drafted players with missing data in any of the selected variables are removed from the dataset, leaving 1,665 players:

new_df <- new_df %>%
  drop_na()
sum(is.na(new_df))
## [1] 0

I checked the correlation coefficients between win_shares and the independent variables using ggpairs(). box_plus_minus, field_goal_percentage, and overall_pick appear to be the only variables that do not have high correlation coefficients with other independent variables:

p <- ggpairs(new_df[,c(1:8)], lower = list(continuous = wrap("smooth", se=FALSE, alpha = 0.7, size=0.5)))
p[5,3] <- p[5,3] + theme(panel.border = element_rect(color = 'blue', fill = NA, size = 2))
p[3,5] <- p[3,5] + theme(panel.border = element_rect(color = 'blue', fill = NA, size = 2))
p

Based on these results, I keep only box_plus_minus, overall_pick, and field_goal_percentage, since the other variables had pairwise correlations greater than 0.80 with one another.
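
The 0.80 screening rule can also be applied numerically rather than read off the ggpairs() grid. The sketch below uses a small synthetic data frame (toy_df and its columns are made-up stand-ins, not the Kaggle data) and lists the variable pairs whose absolute Pearson correlation exceeds the cutoff.

```r
# Synthetic stand-in data: NOT the Kaggle dataset, for illustration only.
set.seed(1)
n <- 200
minutes <- rexp(n, rate = 1 / 8000)
toy_df <- data.frame(
  minutes_played = minutes,
  points         = 0.4 * minutes + rnorm(n, sd = 500),  # built to track minutes
  box_plus_minus = rnorm(n, mean = -2, sd = 4),         # independent of minutes
  overall_pick   = sample(1:60, n, replace = TRUE)
)

cutoff  <- 0.80          # same threshold used in the text
cor_mat <- cor(toy_df)   # pairwise Pearson correlations

# Keep each pair once (upper triangle) and flag those above the cutoff.
idx <- which(abs(cor_mat) > cutoff & upper.tri(cor_mat), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r    = cor_mat[idx]
)
print(high_pairs)
```

With the real data, the same steps could be run on new_df; any variable appearing in a flagged pair would be a candidate to drop or combine.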

Below is a histogram showing the distribution of win_shares. The distribution is skewed to the right, with most players near zero and a long tail of high totals:

new_df %>% 
  ggplot(aes(x=win_shares)) +
  geom_histogram(bins = 50) +
  labs(title="Amount of WinShares for Each Player",
       x="Number of WinShares",
       y="Count")

To reduce the skew in win_shares, I log-transformed the variable using log1p. This log-transformed target variable will be used in a separate multiple linear regression model with the independent variables. The histogram of the transformed variable appears closer to a normal distribution:

# log transformation: log(win_shares + 1)
new_df$log_win_shares <- log1p(new_df$win_shares)
new_df %>% 
  ggplot(aes(x=log_win_shares)) +
  geom_histogram(bins = 15) +
  labs(title="Amount of Log WinShares for Each Player",
       x="Log WinShares",
       y="Count")
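
Because win_shares has a minimum of -1.7 in the summary table, it is worth checking how log1p() treats values at or below -1. The small sketch below uses made-up values (not rows from the dataset) to show that log1p(x) is log(1 + x), that expm1() inverts it, and that inputs below -1 come back as NaN, which lm() then treats as missing.

```r
# Illustrative values only; -1.7 mirrors the minimum in the summary table.
ws <- c(-1.7, -0.5, 0, 0.4, 24, 250)

log_ws <- suppressWarnings(log1p(ws))   # log(1 + ws); NaN where 1 + ws < 0
print(data.frame(win_shares = ws, log_win_shares = log_ws))

# Count the values the transformation cannot represent.
n_undefined <- sum(is.nan(log_ws))

# expm1() is the inverse of log1p() wherever the transform is defined.
back <- expm1(log_ws[!is.nan(log_ws)])
```

Running sum(is.nan(new_df$log_win_shares)) after the transformation would show how many players are affected in the actual data.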

I wanted to get a snapshot of the distribution of each independent variable. box_plus_minus, field_goal_percentage, and overall_pick appear to show approximately normal distributions, while the other independent variables are skewed to the right:

Box Plus/Minus

new_df %>%
  ggplot(aes(box_plus_minus)) +
  geom_histogram(bins = 50) +
  labs(title="Plus/Minus for Each Player",
       x="Plus/Minus",
       y="Count")

Field Goal Percentage

new_df %>%
  ggplot(aes(field_goal_percentage)) +
  geom_histogram(bins = 50) +
  labs(title="Career Field Goal Percentage for Each Player",
       x="Field Goal %",
       y="Count")

Overall Draft Pick Selection

new_df %>%
  ggplot(aes(overall_pick)) +
  geom_histogram(bins = 10) +
  labs(title="Overall Draft Pick for Each Player",
       x="Draft Pick Selection",
       y="Count")

WinShares and Box Plus/Minus

new_df %>%
  ggplot(aes(box_plus_minus, win_shares)) +
  geom_point() +
  labs(title="Amount of WinShares Based on Career Plus/Minus of Each Player",
       x="Plus/Minus",
       y="WinShares")

WinShares and Overall Draft Pick Selection

new_df %>%
  ggplot(aes(overall_pick, win_shares)) +
  geom_point() +
  labs(title="Amount of WinShares Based on Overall Pick of Players",
       x="Overall Draft Picks",
       y="WinShares")

WinShares and Field Goal Percentage

new_df %>%
  ggplot(aes(field_goal_percentage, win_shares)) +
  geom_point() +
  labs(title="Amount of WinShares Based on Field Goal Percentage",
       x="Field Goal Percentage",
       y="WinShares")

Part 4 - Inference

Using the original win_shares variable, I created a multiple linear regression model:

m_initial <- lm(win_shares ~ box_plus_minus + field_goal_percentage 
                + overall_pick, data = new_df)
summary(m_initial)
## 
## Call:
## lm(formula = win_shares ~ box_plus_minus + field_goal_percentage + 
##     overall_pick, data = new_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -169.933  -12.910   -5.237    5.759  184.231 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            52.86838    4.09678  12.905  < 2e-16 ***
## box_plus_minus          3.38248    0.18220  18.565  < 2e-16 ***
## field_goal_percentage -34.14967    8.53325  -4.002 6.56e-05 ***
## overall_pick           -0.45810    0.03685 -12.432  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.99 on 1661 degrees of freedom
## Multiple R-squared:  0.3275, Adjusted R-squared:  0.3263 
## F-statistic: 269.6 on 3 and 1661 DF,  p-value: < 2.2e-16

The initial results show that the model has an Adjusted R-square of 32.63%. All 3 variables have p-values less than 0.05 and are therefore statistically significant predictors of win_shares.
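
To make the coefficients concrete, the fitted equation can be evaluated by hand. The sketch below copies the estimates from the summary above and applies them to a hypothetical player (box plus/minus of 0, 45% field goal percentage, 10th overall pick); the player profile is an assumption chosen purely for illustration.

```r
# Coefficients copied from summary(m_initial) above.
b <- c(intercept = 52.86838, box_plus_minus = 3.38248,
       field_goal_percentage = -34.14967, overall_pick = -0.45810)

# Hypothetical player profile (illustrative, not a row from the data).
player <- c(box_plus_minus = 0, field_goal_percentage = 0.45, overall_pick = 10)

# Linear predictor: intercept + sum(coefficient * value).
pred_ws <- unname(b["intercept"] + sum(b[names(player)] * player))
pred_ws  # about 32.9 career win shares
```

Holding the other predictors fixed, each later draft slot lowers the predicted total by about 0.46 win shares, and each additional point of box plus/minus raises it by about 3.4.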

The following residual plots are used to assess the linearity, near normal residuals, and constant variability of the model:

Residual vs Fitted Values Plot

ggplot(m_initial, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title="Residual vs. Fitted Values Plot") +
  xlab("Fitted values") +
  ylab("Residuals")

Near Normal Residuals Histogram

ggplot(data = m_initial, aes(x = .resid)) +
  geom_histogram(binwidth = 1.5) +
  xlab("Residuals")

QQ Plot

ggplot(data = m_initial, aes(sample = .resid)) +
  stat_qq()

qqnorm(m_initial$residuals)
qqline(m_initial$residuals)

The residual vs. fitted plot shows data points clustered around zero with noticeable positive and negative outliers. The histogram is narrow, centered at approximately zero, and skewed to the right. The QQ plot shows most of the data points along the line, with its upper end positively skewed. This may indicate that there are more extreme values than would be expected in a normal distribution. Overall, I think this model approximately meets the least squares conditions of linearity, near normal residuals, and constant variability.

I will replace the win_shares variable with the log-transformed win_shares variable to see if there are any differences from the initial model:

m_revised <- lm(log_win_shares ~ box_plus_minus + overall_pick 
                + field_goal_percentage, data = new_df)
summary(m_revised)
## 
## Call:
## lm(formula = log_win_shares ~ box_plus_minus + overall_pick + 
##     field_goal_percentage, data = new_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.9597  -0.7986  -0.0048   0.8275   8.3809 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            2.907331   0.206838  14.056   <2e-16 ***
## box_plus_minus         0.196156   0.009188  21.349   <2e-16 ***
## overall_pick          -0.030494   0.001868 -16.328   <2e-16 ***
## field_goal_percentage  0.557263   0.430395   1.295    0.196    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.158 on 1648 degrees of freedom
##   (13 observations deleted due to missingness)
## Multiple R-squared:  0.4666, Adjusted R-squared:  0.4656 
## F-statistic: 480.5 on 3 and 1648 DF,  p-value: < 2.2e-16

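Using the log-transformed target variable, the Adjusted R-Square increased to 46.56%. The 13 observations deleted due to missingness are players with win_shares below -1, for which log1p() returns NaN. field_goal_percentage now has a p-value greater than 0.05, so I removed the variable and re-ran the model: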

m_revised2 <- lm(log_win_shares ~ box_plus_minus + overall_pick, data = new_df)
summary(m_revised2)
## 
## Call:
## lm(formula = log_win_shares ~ box_plus_minus + overall_pick, 
##     data = new_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.0281  -0.8093   0.0176   0.8215   8.4910 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     3.165260   0.055668   56.86   <2e-16 ***
## box_plus_minus  0.203290   0.007355   27.64   <2e-16 ***
## overall_pick   -0.030425   0.001867  -16.30   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.158 on 1649 degrees of freedom
##   (13 observations deleted due to missingness)
## Multiple R-squared:  0.466,  Adjusted R-squared:  0.4654 
## F-statistic: 719.6 on 2 and 1649 DF,  p-value: < 2.2e-16

The new revised model without field_goal_percentage decreased the Adjusted R-Square slightly to 46.54%.
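
Since the response is log(1 + win_shares), the coefficients are easier to read after exponentiating. The sketch below copies the two estimates from the summary above and converts each into an approximate percentage change in (1 + win_shares) per one-unit increase in the predictor; this is a standard reading of a log-response model, not a step taken in the original analysis.

```r
# Coefficients copied from summary(m_revised2) above.
b_bpm  <- 0.203290
b_pick <- -0.030425

# Because the response is log(1 + win_shares), exp(b) is the multiplicative
# change in (1 + win_shares) for a one-unit increase in the predictor.
pct_bpm  <- (exp(b_bpm)  - 1) * 100   # roughly +22.5% per plus/minus point
pct_pick <- (exp(b_pick) - 1) * 100   # roughly -3.0% per later draft slot
round(c(box_plus_minus = pct_bpm, overall_pick = pct_pick), 1)
```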

Residual vs Fitted Values

ggplot(m_revised2, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title="Residual vs. Fitted Values Plot") +
  xlab("Fitted values") +
  ylab("Residuals")

Near Normal Residuals Histogram

ggplot(data = m_revised2, aes(x = .resid)) +
  geom_histogram(binwidth = 1.5) +
  xlab("Residuals")

QQ Plot

ggplot(data = m_revised2, aes(sample = .resid)) +
  stat_qq()

qqnorm(m_revised2$residuals)
qqline(m_revised2$residuals)

Below are visualizations showing each independent variable’s linear relationship with the dependent variable:

visreg(m_revised2)

Part 5 - Conclusion

When assessing each model, it appears that the revised model is a better fit for the data than the initial model. The Adjusted R-Square improved from 32.63% to 46.54%. This means 46.54% of the variance in log-transformed WinShares can be explained by where players are selected and their plus/minus values. The Residual vs Fitted Values plot contained fewer outliers and was more tightly clustered around zero. The residuals histogram had a more symmetrical shape and wider spread, with its center approximately at zero. The points in the revised QQ-plot followed a straight line with no discernible skewness. This model does a better job of meeting the least squares conditions of linearity, near normal residuals, and constant variability. Possible next steps to improve the model are identifying and removing additional outliers and log-transforming the independent variables.
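
One caveat when reading the two Adjusted R-Square values side by side is that they are computed on different response scales (win_shares versus log_win_shares), and the log models are fit on 13 fewer players. A possible follow-up check is to back-transform the log-model predictions with expm1() and compare both models' errors in WinShares units. The sketch below illustrates the idea on synthetic data (x, y, m_raw, and m_log are made-up names, not objects from this analysis), so its numbers say nothing about the NBA models.

```r
# Synthetic data for illustration only (not the Kaggle dataset).
set.seed(42)
n <- 500
x <- rnorm(n)
y <- exp(1 + 0.8 * x + rnorm(n, sd = 0.5)) - 1   # right-skewed, always > -1

m_raw <- lm(y ~ x)          # model on the original scale
m_log <- lm(log1p(y) ~ x)   # model on the log1p scale

# Back-transform the log-model fits so both are measured in units of y.
pred_raw <- fitted(m_raw)
pred_log <- expm1(fitted(m_log))

# Root mean squared error on the original scale for each model.
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
rmse_vals <- c(raw_model = rmse(y, pred_raw), log_model = rmse(y, pred_log))
rmse_vals
```

Note that expm1() of a fitted log value behaves more like a median-type prediction than a mean, so a comparison of this kind is only a rough guide.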