Abstract

This analysis examines factors that influence an NBA player’s WinShare total. Using selected explanatory variables from the dataset, I explore the linear relationship between WinShares and each variable with a multiple linear regression model. Each observation in the dataset is a player who was drafted between 1989 and 2021. The independent variables considered are career points, rebounds, assists, field goal percentage, minutes played, plus/minus, and overall draft selection. For the purposes of this analysis, players with missing data were removed.

Because its distribution is right-skewed, WinShares was log-transformed and stored as a separate target variable. ggpairs() was used to check for multicollinearity among the chosen independent variables. Field goal percentage, plus/minus, and overall draft pick selection did not show high correlation with the other variables and were used in the multiple linear regression model. The initial regression model, using the original WinShares variable, produced an Adjusted R-square of 32.63%. Although there were outliers, most of the data points in the Residual vs Fitted Values plot were clustered around the zero line with no distinct pattern. The residuals histogram was centered at approximately zero and skewed to the right. The QQ-plot displayed a relatively straight line, with its upper end positively skewed. This suggests that the conditions of linearity, near normal residuals, and constant variability are reasonably met. When the target variable was replaced with its log-transformed counterpart, the Adjusted R-Square increased to 46.56%. The Residual vs Fitted Values plot displayed a cluster of data points around the zero line with fewer outliers. The residuals histogram showed a symmetrical, approximately normal distribution centered at approximately zero. The line of data points in the QQ-plot was straighter, with no discernible skewness. Overall, the revised model fit the data better than the initial model.

Introduction

The National Basketball Association has seen a huge surge in the use of data analytics over the past 15 years. The purpose of aggregating data is to assist in making basketball decisions that will help identify players that can lead to wins and possibly a championship. Using NBA player data obtained from Kaggle, I will explore various variables to ascertain what influences a player’s WinShare. Independent variables utilized include plus/minus, points, total rebounds, assists, minutes played, overall draft pick selection, and field goal percentage.

Part 2 - Data

library(tidyverse)
library(DT)
library(GGally)
library(vtable)
library(visreg)

glimpse(df)

## Rows: 1,922
## Columns: 24
## $ id                        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1…
## $ year                      <int> 1989, 1989, 1989, 1989, 1989, 1989, 1989, 19…
## $ rank                      <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1…
## $ overall_pick              <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1…
## $ team                      <chr> "SAC", "LAC", "SAS", "MIA", "CHH", "CHI", "I…
## $ player                    <chr> "Pervis Ellison", "Danny Ferry", "Sean Ellio…
## $ college                   <chr> "Louisville", "Duke", "Arizona", "Michigan",…
## $ years_active              <int> 11, 13, 12, 15, 11, 8, 12, 5, 12, 10, 13, 13…
## $ games                     <int> 474, 917, 742, 1000, 672, 438, 766, 281, 687…
## $ minutes_played            <int> 11593, 18133, 24502, 34985, 15370, 7406, 174…
## $ points                    <int> 4494, 6439, 10544, 18336, 5680, 2819, 6925, …
## $ total_rebounds            <int> 3170, 2550, 3204, 4387, 3381, 1460, 2342, 13…
## $ assists                   <int> 691, 1185, 1897, 2097, 639, 387, 1769, 175, …
## $ field_goal_percentage     <dbl> 0.510, 0.446, 0.465, 0.456, 0.472, 0.478, 0.…
## $ X3_point_percentage       <dbl> 0.050, 0.393, 0.375, 0.400, 0.135, 0.235, 0.…
## $ free_throw_percentage     <dbl> 0.689, 0.840, 0.799, 0.846, 0.716, 0.707, 0.…
## $ average_minutes_played    <dbl> 24.5, 19.8, 33.0, 35.0, 22.9, 16.9, 22.8, 19…
## $ points_per_game           <dbl> 9.5, 7.0, 14.2, 18.3, 8.5, 6.4, 9.0, 7.4, 5.…
## $ average_total_rebounds    <dbl> 6.7, 2.8, 4.3, 4.4, 5.0, 3.3, 3.1, 4.9, 3.3,…
## $ average_assists           <dbl> 1.5, 1.3, 2.6, 2.1, 1.0, 0.9, 2.3, 0.6, 0.6,…
## $ win_shares                <dbl> 21.8, 34.8, 55.7, 88.7, 22.5, 10.9, 24.6, 1.…
## $ win_shares_per_48_minutes <dbl> 0.090, 0.092, 0.109, 0.122, 0.070, 0.071, 0.…
## $ box_plus_minus            <dbl> -0.5, -0.9, 0.2, 0.8, -2.9, -3.4, -0.8, -5.0…
## $ value_over_replacement    <dbl> 4.4, 4.9, 13.5, 24.9, -3.7, -2.7, 5.3, -4.0,…

The dataset has 1,922 observations and 24 columns/variables. Each row represents a player who was drafted in the first or second round between 1989 and 2021. Players who did not play in the NBA have missing values. For the purposes of this project, I will drop players with missing data from the dataset. The target variable is WinShares, defined as the sum of Offensive Win Shares and Defensive Win Shares. The Basketball Reference article listed in the References describes how Offensive and Defensive Win Shares are each calculated. The independent variables used are:

overall_pick: Overall draft selection of each player (an ordinal value stored as an integer and entered into the models as a numeric variable)
points: Total career points
total_rebounds: Total career rebounds
assists: Total career assists
field_goal_percentage: Career field goal percentage
minutes_played: Total career minutes played
box_plus_minus: Box Plus/Minus, a box-score-based estimate of a player’s contribution per 100 possessions relative to a league-average player. Positive values indicate an above-average contribution; negative values indicate a below-average contribution.

Part 3 - Exploratory Data Analysis

# summary statistics
st(df)

Summary Statistics
Variable	N	Mean	Std. Dev.	Min	Pctl. 25	Pctl. 75	Max
id	1922	962	555	1	481	1442	1922
year	1922	2005	9.5	1989	1997	2013	2021
rank	1922	30	17	1	15	44	60
overall_pick	1922	30	17	1	15	44	60
years_active	1669	6.3	4.7	1	2	10	22
games	1669	348	325	1	72	584	1541
minutes_played	1669	8399	9846	0	838	13246	52139
points	1669	3580	4826	0	265	5150	37062
total_rebounds	1669	1497	2004	0	128	2139	15091
assists	1669	774	1285	0	46	910	12091
field_goal_percentage	1665	0.44	0.084	0	0.4	0.47	1
X3_point_percentage	1545	0.27	0.13	0	0.22	0.36	1
free_throw_percentage	1633	0.72	0.12	0	0.66	0.8	1
average_minutes_played	1669	18	8.7	0	11	25	41
points_per_game	1669	7.3	5	0	3.4	10	27
average_total_rebounds	1669	3.2	2.1	0	1.7	4.2	13
average_assists	1669	1.6	1.5	0	0.5	2.1	9.5
win_shares	1669	18	28	-1.7	0.4	24	250
win_shares_per_48_minutes	1668	0.062	0.094	-1.3	0.03	0.1	1.4
box_plus_minus	1668	-2.3	4.1	-52	-3.9	-0.3	51
value_over_replacement	1669	4.4	11	-8.5	-0.4	4.5	143

Along with the dependent variable win_shares, I will choose 7 independent variables to build a multiple linear regression model:

new_df <- df %>%
  select(overall_pick, win_shares, minutes_played, points, total_rebounds, assists, box_plus_minus, field_goal_percentage)
DT::datatable(head(new_df))

# summary statistics of dependent and independent variables
st(new_df)

Summary Statistics
Variable	N	Mean	Std. Dev.	Min	Pctl. 25	Pctl. 75	Max
overall_pick	1922	30	17	1	15	44	60
win_shares	1669	18	28	-1.7	0.4	24	250
minutes_played	1669	8399	9846	0	838	13246	52139
points	1669	3580	4826	0	265	5150	37062
total_rebounds	1669	1497	2004	0	128	2139	15091
assists	1669	774	1285	0	46	910	12091
box_plus_minus	1668	-2.3	4.1	-52	-3.9	-0.3	51
field_goal_percentage	1665	0.44	0.084	0	0.4	0.47	1

Drafted players with missing data are removed from the dataset:

new_df <- new_df %>%
  drop_na()

sum(is.na(new_df))

## [1] 0
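As a sanity check on the step above, it can help to count how many rows a complete-case filter removes. The following is a minimal sketch on a hypothetical toy data frame (the `toy` object and its values are made up for illustration and are not drawn from the Kaggle data). It uses base R's complete.cases(), which keeps the same rows as drop_na() when no columns are specified.

```r
# Toy data frame standing in for new_df (values are made up for illustration)
toy <- data.frame(
  overall_pick = c(1, 2, 3, 4, 5),
  win_shares   = c(21.8, NA, 55.7, 88.7, NA),
  points       = c(4494, NA, 10544, 18336, NA)
)

# Flag rows with no missing values in any column
keep <- complete.cases(toy)

# Rows removed versus rows kept
n_dropped <- sum(!keep)
toy_clean <- toy[keep, ]

n_dropped              # number of players removed
nrow(toy_clean)        # number of players kept
sum(is.na(toy_clean))  # remaining missing values
```

In this report the same kind of filter takes the table from 1,922 rows to 1,665, which matches the 1,661 residual degrees of freedom plus four estimated coefficients reported for the initial model.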

Next, I check the correlation coefficients between win_shares and the independent variables using ggpairs(). It appears that box_plus_minus, field_goal_percentage, and overall_pick are the only variables that do not have high correlation coefficients with the other independent variables:

p <- ggpairs(new_df[,c(1:8)], lower = list(continuous = wrap("smooth", se=FALSE, alpha = 0.7, size=0.5)))
p[5,3] <- p[5,3] + theme(panel.border = element_rect(color = 'blue', fill = NA, size = 2))
p[3,5] <- p[3,5] + theme(panel.border = element_rect(color = 'blue', fill = NA, size = 2))
p

Based on these results, I will keep only box_plus_minus, overall_pick, and field_goal_percentage, since the other variables had pairwise correlations greater than 0.80 with one another.
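The 0.80 screening rule can also be applied numerically rather than read off the ggpairs() panels. The sketch below uses synthetic stand-in variables (all names and values are assumptions for illustration, not the report's data) and lists every pair of columns whose absolute Pearson correlation exceeds the threshold.

```r
set.seed(606)

# Simulated stand-ins for career totals: minutes drives points and rebounds,
# while plus_minus is generated independently (all values are synthetic)
n <- 200
minutes    <- runif(n, 0, 40000)
points     <- 0.4 * minutes + rnorm(n, sd = 1500)
rebounds   <- 0.18 * minutes + rnorm(n, sd = 900)
plus_minus <- rnorm(n, mean = -2, sd = 4)
sim <- data.frame(minutes, points, rebounds, plus_minus)

# Pairwise Pearson correlations
cor_mat <- cor(sim)

# Keep each pair once (upper triangle) and flag |r| > 0.80
idx <- which(abs(cor_mat) > 0.80 & upper.tri(cor_mat), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r    = cor_mat[idx]
)
high_pairs
```

On the real data, the same pattern would be run on new_df in place of sim. Pairwise correlation is only a screen; it does not detect collinearity that involves three or more variables at once.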

Below is a histogram showing the distribution of win_shares. The distribution is skewed to the right, with most players near zero and a long tail of high values:

new_df %>% 
  ggplot(aes(x=win_shares)) +
  geom_histogram(bins = 50) +
  labs(title="Amount of WinShares for Each Player",
       x="Number of WinShares",
       y="Count")

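To reduce the skew in win_shares, I log-transformed the variable using log1p. This log-transformed target variable will be used in a separate multiple linear regression model with the independent variables. Because log1p() is undefined for values below -1, players whose win_shares fall below -1 receive NaN and are dropped when the model is fitted. The histogram of the transformed variable appears closer to a normal distribution: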

# log transformation: log1p(x) computes log(1 + x)
new_df$log_win_shares <- log1p(new_df$win_shares)

new_df %>% 
  ggplot(aes(x=log_win_shares)) +
  geom_histogram(bins = 15) +
  labs(title="Amount of Log WinShares for Each Player",
       x="Log WinShares",
       y="Count")
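The behavior of log1p() on negative inputs matters here because win_shares has a minimum of -1.7. The following minimal sketch, using a handful of hypothetical values rather than the actual data, shows which inputs turn into NaN and that expm1() reverses the transform for the rest.

```r
# Hypothetical WinShares values, including a negative one like the minimum of -1.7
ws <- c(-1.7, -0.5, 0, 0.4, 24, 250)

# log1p(x) computes log(1 + x); it is NaN (with a warning) when x < -1
log_ws <- suppressWarnings(log1p(ws))
log_ws

# Count values that become NaN and would be dropped by lm()
sum(is.nan(log_ws))

# expm1() inverts log1p() for the values that survive the transform
ok <- !is.nan(log_ws)
all.equal(expm1(log_ws[ok]), ws[ok])
```

A shifted transform such as log(x + c) with a larger constant would keep every player, at the cost of a less interpretable scale; that alternative is not used in this report.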

I wanted to get a snapshot of the distribution of each independent variable. box_plus_minus, field_goal_percentage, and overall_pick appear roughly symmetric, while the other independent variables are skewed to the right:

Box Plus/Minus

new_df %>%
  ggplot(aes(box_plus_minus)) +
  geom_histogram(bins = 50) +
  labs(title="Plus/Minus for Each Player",
       x="Plus/Minus",
       y="Count")

Field Goal Percentage

new_df %>%
  ggplot(aes(field_goal_percentage)) +
  geom_histogram(bins = 50) +
  labs(title="Career Field Goal Percentage for Each Player",
       x="Field Goal %",
       y="Count")

Overall Draft Pick Selection

new_df %>%
  ggplot(aes(overall_pick)) +
  geom_histogram(bins = 10) +
  labs(title="Overall Draft Pick for Each Player",
       x="Draft Pick Selection",
       y="Count")

WinShares and Box Plus/Minus

new_df %>%
  # na.rm is an argument of the geom, not an aesthetic
  ggplot(aes(box_plus_minus, win_shares)) +
  geom_point(na.rm = TRUE) +
  labs(title="Amount of WinShares Based on Career Plus/Minus of Each Player",
       x="Plus/Minus",
       y="WinShares")

WinShares and Overall Draft Pick Selection

new_df %>%
  # na.rm is an argument of the geom, not an aesthetic
  ggplot(aes(overall_pick, win_shares)) +
  geom_point(na.rm = TRUE) +
  labs(title="Amount of WinShares Based on Overall Pick of Players",
       x="Overall Draft Picks",
       y="WinShares")

WinShares and Field Goal Percentage

new_df %>%
  ggplot(aes(field_goal_percentage, win_shares)) +
  geom_point() +
  labs(title="Amount of WinShares Based on Field Goal Percentage",
       x="Field Goal Percentage",
       y="WinShares")

Part 4 - Inference

Using the original win_shares variable, I created a multiple linear regression model:

m_initial <- lm(win_shares ~ box_plus_minus + field_goal_percentage 
                + overall_pick, data = new_df)
summary(m_initial)

## 
## Call:
## lm(formula = win_shares ~ box_plus_minus + field_goal_percentage + 
##     overall_pick, data = new_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -169.933  -12.910   -5.237    5.759  184.231 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            52.86838    4.09678  12.905  < 2e-16 ***
## box_plus_minus          3.38248    0.18220  18.565  < 2e-16 ***
## field_goal_percentage -34.14967    8.53325  -4.002 6.56e-05 ***
## overall_pick           -0.45810    0.03685 -12.432  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.99 on 1661 degrees of freedom
## Multiple R-squared:  0.3275, Adjusted R-squared:  0.3263 
## F-statistic: 269.6 on 3 and 1661 DF,  p-value: < 2.2e-16

The initial model has an Adjusted R-square of 32.63%. All three predictors have p-values below 0.05 and are therefore statistically significant predictors of win_shares.

The following are residual plots to capture the linearity, near normal residuals, and constant variability of the model:

Residual vs Fitted Values Plot

ggplot(m_initial, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title="Residual vs. Fitted Values Plot") +
  xlab("Fitted values") +
  ylab("Residuals")

Near Normal Residuals Histogram

ggplot(data = m_initial, aes(x = .resid)) +
  geom_histogram(binwidth = 1.5) +
  xlab("Residuals")

QQ Plot

ggplot(data = m_initial, aes(sample = .resid)) +
  stat_qq()

qqnorm(m_initial$residuals)
qqline(m_initial$residuals)

The residuals vs. fitted plot shows data points clustered around the zero line, with noticeable positive and negative outliers. The histogram is skewed to the right, centered at approximately zero, and narrow. The QQ plot shows most of the data points along the line, with its upper end positively skewed. This may indicate that there are more extreme values than would be expected in a normal distribution. Overall, I think this model reasonably meets the least squares conditions of linearity, near normal residuals, and constant variability.
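The visual reading of the histogram and QQ plot can be supplemented with a numeric summary. The sketch below is an illustration on synthetic data, not a rerun of the report's models: it fits a right-skewed response on its original and log scales and compares the sample skewness of the residuals. The `skewness()` helper is a hand-written assumption for this example, not a function used elsewhere in the report.

```r
set.seed(606)

# Synthetic example: a right-skewed response regressed on one predictor
x <- rnorm(300)
y <- exp(0.5 * x + rnorm(300, sd = 0.8))   # skewed on the original scale
fit_raw <- lm(y ~ x)
fit_log <- lm(log(y) ~ x)

# Sample skewness: third central moment divided by the cubed standard deviation
skewness <- function(r) {
  d <- r - mean(r)
  mean(d^3) / mean(d^2)^1.5
}

skew_raw <- skewness(residuals(fit_raw))
skew_log <- skewness(residuals(fit_log))
c(raw = skew_raw, log = skew_log)
```

Values near zero indicate roughly symmetric residuals, while large positive values indicate a long right tail. Applied to this report, the same helper could be called on residuals(m_initial).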

I will replace the win_shares variable with the log-transformed win_shares variable to see whether the results differ from the initial model:

m_revised <- lm(log_win_shares ~ box_plus_minus + overall_pick 
                + field_goal_percentage, data = new_df)
summary(m_revised)

## 
## Call:
## lm(formula = log_win_shares ~ box_plus_minus + overall_pick + 
##     field_goal_percentage, data = new_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.9597  -0.7986  -0.0048   0.8275   8.3809 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            2.907331   0.206838  14.056   <2e-16 ***
## box_plus_minus         0.196156   0.009188  21.349   <2e-16 ***
## overall_pick          -0.030494   0.001868 -16.328   <2e-16 ***
## field_goal_percentage  0.557263   0.430395   1.295    0.196    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.158 on 1648 degrees of freedom
##   (13 observations deleted due to missingness)
## Multiple R-squared:  0.4666, Adjusted R-squared:  0.4656 
## F-statistic: 480.5 on 3 and 1648 DF,  p-value: < 2.2e-16

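Using the log-transformed target variable, the Adjusted R-Square increased to 46.56%. The output also reports 13 observations deleted due to missingness; these are the players whose win_shares fall below -1, for which log1p() returns NaN. In addition, field_goal_percentage has a p-value greater than 0.05, so I removed the variable and re-ran the model: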

m_revised2 <- lm(log_win_shares ~ box_plus_minus + overall_pick, data = new_df)
summary(m_revised2)

## 
## Call:
## lm(formula = log_win_shares ~ box_plus_minus + overall_pick, 
##     data = new_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.0281  -0.8093   0.0176   0.8215   8.4910 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     3.165260   0.055668   56.86   <2e-16 ***
## box_plus_minus  0.203290   0.007355   27.64   <2e-16 ***
## overall_pick   -0.030425   0.001867  -16.30   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.158 on 1649 degrees of freedom
##   (13 observations deleted due to missingness)
## Multiple R-squared:  0.466,  Adjusted R-squared:  0.4654 
## F-statistic: 719.6 on 2 and 1649 DF,  p-value: < 2.2e-16

The new revised model without field_goal_percentage decreased the Adjusted R-Square slightly to 46.54%.
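Dropping a predictor can also be checked with a partial F-test on the two nested fits. The following is a minimal sketch on synthetic data (the variables x1, x2, x3, and y are assumptions for illustration); with the report's objects the equivalent call would be anova(m_revised2, m_revised), which is valid because both models are fitted to the same 1,652 rows.

```r
set.seed(606)

# Synthetic data: y depends on x1 and x2 but not on x3
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
y  <- 3 + 0.8 * x1 - 0.5 * x2 + rnorm(n)

full    <- lm(y ~ x1 + x2 + x3)
reduced <- lm(y ~ x1 + x2)

# Partial F-test: does dropping x3 significantly worsen the fit?
cmp <- anova(reduced, full)
cmp

# With a single dropped term, this p-value equals the t-test p-value for x3
p_f <- cmp[["Pr(>F)"]][2]
p_t <- summary(full)$coefficients["x3", "Pr(>|t|)"]
all.equal(p_f, p_t, tolerance = 1e-6)

# Adjusted R-squared for both fits
c(full = summary(full)$adj.r.squared, reduced = summary(reduced)$adj.r.squared)
```

Because only one term is removed, the F-test carries the same information as the t-test already shown in the summary output; it becomes more useful when several terms are dropped at once.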

Residual vs Fitted Values

ggplot(m_revised2, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title="Residual vs. Fitted Values Plot") +
  xlab("Fitted values") +
  ylab("Residuals")

Near Normal Residuals Histogram

# use the final model (m_revised2) so all diagnostics describe the same fit
ggplot(data = m_revised2, aes(x = .resid)) +
  geom_histogram(binwidth = 1.5) +
  xlab("Residuals")

QQ Plot

# use the final model (m_revised2) so all diagnostics describe the same fit
ggplot(data = m_revised2, aes(sample = .resid)) +
  stat_qq()

qqnorm(m_revised2$residuals)
qqline(m_revised2$residuals)

Below are visualizations showing each independent variable’s linear relationship with the dependent variable:

visreg(m_revised2)
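Because the response in m_revised2 is log(1 + win_shares), its coefficients are easier to read after back-transformation. The sketch below copies the estimates from the summary(m_revised2) output shown earlier and converts them to approximate percentage changes in (1 + win_shares); the example player at the end is hypothetical.

```r
# Coefficients copied from the summary(m_revised2) output in this report
b0     <- 3.165260   # intercept
b_bpm  <- 0.203290   # box_plus_minus
b_pick <- -0.030425  # overall_pick

# Because the response is log(1 + win_shares), exp(b) - 1 gives the
# proportional change in (1 + win_shares) per one-unit increase in a predictor
pct_bpm  <- expm1(b_bpm) * 100
pct_pick <- expm1(b_pick) * 100
round(c(box_plus_minus = pct_bpm, overall_pick = pct_pick), 1)

# Hypothetical player: box_plus_minus = 0, drafted 10th overall
log_pred <- b0 + b_bpm * 0 + b_pick * 10
ws_pred  <- expm1(log_pred)   # back-transform to the WinShares scale
ws_pred
```

Under these estimates, each additional point of box_plus_minus is associated with roughly 22.5% more (1 + win_shares), and each later draft slot with about 3% less, holding the other predictor fixed. The back-transformed prediction is closer to a typical (median-like) value than to a mean, since exponentiating a fitted log value ignores the residual variance.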

Part 5 - Conclusion

When assessing each model, it appears that the revised model is a better fit for the data than the initial model. The Adjusted R-Square improved from 32.63% to 46.54%, although the two values are measured on different response scales and are not strictly comparable. The revised figure means that 46.54% of the variance in log-transformed WinShares can be explained by where players are selected and their plus/minus values. The Residual vs Fitted Values plot contained fewer outliers and was more tightly clustered around the zero line. The residuals histogram had a more symmetrical shape and a wider spread, with its center approximately at zero. The line of data points in the revised QQ-plot was straight and did not have discernible skewness. This model does a better job of meeting the least squares conditions of linearity, near normal residuals, and constant variability. Possible next steps to improve the model are identifying and removing additional outliers and log-transforming the independent variables.

References

https://www.kaggle.com/datasets/mattop/nba-draft-basketball-player-data-19892021

https://www.basketball-reference.com/about/ws.html

Data 606 Final Project

Mohamed Hassan-El Serafi

2023-05-16
