Introduction

The National Basketball Association (NBA) comprises 30 teams whose players, drawn from all over the world, compete to become NBA champions through a grueling 82-game regular season and the playoffs. The current generation of players has brought about a shift in ideas about how basketball should be played. In the past, the prevailing belief was that centers and power forwards should not shoot the ball and should instead focus on rebounding, where their height and weight give them an advantage. That belief created a stigma against big men shooting, which deterred bigger players from working on their jump shots. Watching NBA games over the past few years, I have noticed that centers and power forwards now shoot the ball consistently, regardless of their height or weight. This has created a new market for tall players who can shoot, whom coaches and general managers have come to regard as necessary for a high level of success in the NBA. The main goal of coaches and general managers is to win, because winning brings higher revenue to the organization.

The goal of this paper is to use NBA players’ statistics and biographical information from the 1996-97 to 2019-20 seasons to analyze the distribution of height and weight and to observe whether true shooting percentage has increased over the years. True shooting percentage is a measure of shooting efficiency that takes into account field goals, 3-point field goals, and free throws. Taking this variable into account gives a better understanding of how true shooting percentage has risen over the years, especially among big power forwards and centers. The improved shooting ability of taller and heavier players has changed how coaches and general managers evaluate draft prospects, because a big man who can shoot has become a major priority for teams in the new generation of basketball.
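For reference, true shooting percentage is conventionally computed as shown below. The dataset already supplies this value in the ts_pct column, so the formula is included only to make the measure concrete; PTS, FGA, and FTA stand for points, field goal attempts, and free throw attempts, and 0.44 is the conventional free throw weighting.

```latex
\mathrm{TS\%} = \frac{\mathrm{PTS}}{2\,\bigl(\mathrm{FGA} + 0.44 \cdot \mathrm{FTA}\bigr)}
```

As a worked example, a player who scores 20 points on 15 field goal attempts and 5 free throw attempts has a true shooting percentage of 20 / (2 × (15 + 2.2)) = 20 / 34.4 ≈ 0.581. The measure can exceed 1 in tiny samples: a player whose only attempt is a made 3-pointer has 3 / (2 × 1) = 1.5, which matches the maximum ts_pct in the data summary.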

Data

The NBA player statistics and biographical information from the 1996 to 2019 seasons are provided by https://www.kaggle.com/justinas/nba-players-data.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
## ✓ tibble  3.0.3     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0
## ── Conflicts ────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(stargazer)
## 
## Please cite as:
##  Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2.2. https://CRAN.R-project.org/package=stargazer
library(descr)
library(class)
library(visreg)
nba <- read.csv("all_seasons.csv")
str(nba)
## 'data.frame':    11145 obs. of  22 variables:
##  $ X                : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ player_name      : Factor w/ 2235 levels "A.C. Green","A.J. Bramlett",..: 535 648 658 662 663 669 670 679 680 685 ...
##  $ team_abbreviation: Factor w/ 36 levels "ATL","BKN","BOS",..: 6 14 33 8 17 12 15 15 1 18 ...
##  $ age              : num  36 28 39 24 34 38 25 28 29 28 ...
##  $ player_height    : num  198 216 206 203 206 ...
##  $ player_weight    : num  99.8 117.9 95.3 100.7 108.9 ...
##  $ college          : Factor w/ 316 levels " ","Alabama",..: 237 77 67 273 288 101 253 52 296 140 ...
##  $ country          : Factor w/ 76 levels "Angola","Argentina",..: 73 73 73 73 73 73 73 73 73 73 ...
##  $ draft_year       : Factor w/ 45 levels "1963","1976",..: 11 15 4 20 10 6 19 15 17 16 ...
##  $ draft_round      : Factor w/ 8 levels "1","2","3","4",..: 2 1 3 1 1 2 1 1 8 2 ...
##  $ draft_number     : Factor w/ 75 levels "1","10","11",..: 26 23 60 74 2 28 2 26 75 37 ...
##  $ gp               : int  55 15 9 64 27 52 80 77 71 82 ...
##  $ pts              : num  5.7 2.3 0.8 3.7 2.4 8.2 17.2 14.9 5.7 6.9 ...
##  $ reb              : num  16.1 1.5 1 2.3 2.4 2.7 4.1 8 1.6 1.5 ...
##  $ ast              : num  3.1 0.3 0.4 0.6 0.2 1 3.4 1.6 1.3 3 ...
##  $ net_rating       : num  16.1 12.3 -2.1 -8.7 -11.2 4.1 4.1 3.3 -0.3 -1.2 ...
##  $ oreb_pct         : num  0.186 0.078 0.105 0.06 0.109 0.034 0.035 0.095 0.036 0.018 ...
##  $ dreb_pct         : num  0.323 0.151 0.102 0.149 0.179 0.126 0.091 0.183 0.076 0.081 ...
##  $ usg_pct          : num  0.1 0.175 0.103 0.167 0.127 0.22 0.209 0.222 0.172 0.177 ...
##  $ ts_pct           : num  0.479 0.43 0.376 0.399 0.611 0.541 0.559 0.52 0.539 0.557 ...
##  $ ast_pct          : num  0.113 0.048 0.148 0.077 0.04 0.102 0.149 0.087 0.141 0.262 ...
##  $ season           : Factor w/ 24 levels "1996-97","1997-98",..: 1 1 1 1 1 1 1 1 1 1 ...
nba <- nba[ -c(1) ]
nba$season <- as.integer(nba$season)
n_distinct(nba$player_name)
## [1] 2235
summary(nba)
##          player_name    team_abbreviation      age        player_height  
##  Vince Carter  :   22   CLE    : 390      Min.   :18.00   Min.   :160.0  
##  Dirk Nowitzki :   21   TOR    : 390      1st Qu.:24.00   1st Qu.:195.6  
##  Kevin Garnett :   20   LAC    : 389      Median :27.00   Median :200.7  
##  Kobe Bryant   :   20   MIA    : 387      Mean   :27.17   Mean   :200.8  
##  Jamal Crawford:   19   DAL    : 384      3rd Qu.:30.00   3rd Qu.:208.3  
##  Jason Terry   :   19   ATL    : 383      Max.   :44.00   Max.   :231.1  
##  (Other)       :11024   (Other):8822                                     
##  player_weight              college          country         draft_year  
##  Min.   : 60.33   None          :1684   USA      :9410   Undrafted:1942  
##  1st Qu.: 90.72   Kentucky      : 360   France   : 153   1998     : 454  
##  Median : 99.79   Duke          : 331   Canada   : 140   2003     : 430  
##  Mean   :100.64   North Carolina: 318   Spain    :  79   2005     : 420  
##  3rd Qu.:109.32   UCLA          : 280   Brazil   :  78   1996     : 406  
##  Max.   :163.29   Arizona       : 257   Australia:  74   2001     : 403  
##                   (Other)       :7915   (Other)  :1211   (Other)  :7090  
##     draft_round      draft_number        gp             pts        
##  1        :6513   Undrafted:1959   Min.   : 1.00   Min.   : 0.000  
##  2        :2629   1        : 320   1st Qu.:32.00   1st Qu.: 3.500  
##  Undrafted:1959   5        : 320   Median :58.00   Median : 6.600  
##  3        :  20   4        : 311   Mean   :52.01   Mean   : 8.126  
##  4        :  12   2        : 299   3rd Qu.:74.00   3rd Qu.:11.500  
##  6        :   5   3        : 299   Max.   :85.00   Max.   :36.100  
##  (Other)  :   7   (Other)  :7637                                   
##       reb             ast           net_rating          oreb_pct      
##  Min.   : 0.00   Min.   : 0.000   Min.   :-200.000   Min.   :0.00000  
##  1st Qu.: 1.80   1st Qu.: 0.600   1st Qu.:  -6.300   1st Qu.:0.02200  
##  Median : 3.00   Median : 1.200   Median :  -1.300   Median :0.04300  
##  Mean   : 3.56   Mean   : 1.801   Mean   :  -2.154   Mean   :0.05559  
##  3rd Qu.: 4.70   3rd Qu.: 2.400   3rd Qu.:   3.200   3rd Qu.:0.08600  
##  Max.   :16.30   Max.   :11.700   Max.   : 300.000   Max.   :1.00000  
##                                                                       
##     dreb_pct         usg_pct           ts_pct          ast_pct      
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0960   1st Qu.:0.1500   1st Qu.:0.4780   1st Qu.:0.0650  
##  Median :0.1320   Median :0.1820   Median :0.5210   Median :0.1020  
##  Mean   :0.1418   Mean   :0.1856   Mean   :0.5081   Mean   :0.1311  
##  3rd Qu.:0.1820   3rd Qu.:0.2180   3rd Qu.:0.5570   3rd Qu.:0.1780  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.5000   Max.   :1.0000  
##                                                                     
##      season     
##  Min.   : 1.00  
##  1st Qu.: 7.00  
##  Median :13.00  
##  Mean   :12.88  
##  3rd Qu.:19.00  
##  Max.   :24.00  
## 

After the first column (a row index) is dropped, the data comprise 11,145 rows and 21 variables. What stands out to me is that the maximum number of games played is 85, which means that a small number of players appeared in more games than the official 82-game regular season. This can happen when a player is traded mid-season to a team that has played fewer games at that point, so his combined total exceeds 82.
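One way to inspect the seasons with more than 82 games is to subset on gp. The sketch below uses a small made-up data frame (the player names and values are hypothetical, not rows from the Kaggle file) so that it runs on its own; on the real data, the same subset expression would be applied to nba.

```r
# Hypothetical rows that mimic two columns of the dataset; not real records.
toy <- data.frame(
  player_name = c("Player A", "Player B", "Player C"),
  gp = c(82L, 85L, 60L),
  stringsAsFactors = FALSE
)

# Keep only player-seasons with more games than the 82-game schedule.
over_82 <- toy[toy$gp > 82, ]
print(over_82)
```

Listing these rows makes it easy to confirm, player by player, whether a mid-season trade explains the extra games.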

The variables “player_height”, “player_weight”, and “ts_pct” are the most important for the statistical model and analysis. I will investigate the distribution of weight and height among NBA players and check how weight and height correlate with each other and with games played, points, assists, and rebounds. Then I will compare the average true shooting percentage across seasons for players who are 208.28cm or taller and players who are shorter than 208.28cm. The 208.28cm cutoff (6 ft 10 in) is the third quartile of height in the summary above. Finally, I will model how the probability of an above-average true shooting percentage depends on weight and height. The season variable is converted to an integer in which 1 represents the 1996-97 NBA season and 24 represents the 2019-20 season.
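The season conversion relies on season being read as a factor whose levels sort chronologically, so that the level codes run from 1 to 24. A minimal sketch of that behavior, together with an alternative that parses the starting year directly, is shown below; the three season labels and the names level_codes, start_year, and season_index are chosen here for illustration.

```r
# Three example season labels; the real column has 24 levels.
season <- factor(c("1996-97", "1997-98", "2019-20"))

# as.integer() on a factor returns the level codes, not the years.
# With only three levels present, the codes are 1, 2, 3.
level_codes <- as.integer(season)

# Alternative that does not depend on which levels are present:
# take the first four characters as the starting year and offset it.
start_year <- as.integer(substr(as.character(season), 1, 4))
season_index <- start_year - 1995L
print(season_index)
```

In R 4.0.0 and later, read.csv() no longer converts strings to factors by default, so as.integer() applied to the raw season strings would return NA with a warning. Parsing the starting year avoids that dependence.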

Exploratory Analysis

stargazer(select(nba, player_height, player_weight), type = "text", median=TRUE)
## 
## ===============================================================================
## Statistic       N     Mean   St. Dev.   Min   Pctl(25) Median  Pctl(75)   Max  
## -------------------------------------------------------------------------------
## player_height 11,145 200.813  9.191   160.020 195.580  200.660 208.280  231.140
## player_weight 11,145 100.638  12.576  60.328   90.718  99.790  109.316  163.293
## -------------------------------------------------------------------------------
summary(nba$ts_pct)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.4780  0.5210  0.5081  0.5570  1.5000
ggplot(data = nba, aes(x = player_height)) + 
  geom_histogram(aes(y = ..density..), col = "Black", fill = "Yellow", alpha = 1, bins = 10) + 
  geom_density(col = 4) + 
  geom_vline(xintercept = mean(nba$player_height), col = 'red', size = 1) +
  labs(title = 'NBA Height Distribution', x = 'Height', y = 'Density')

ggplot(data = nba, aes(x = player_weight)) + 
  geom_histogram(aes(y = ..density..), col = "Black", fill = "Yellow", alpha = 1, bins = 10) + 
  geom_density(col = 4) + 
  geom_vline(xintercept = mean(nba$player_weight), col = 'red', size = 1) +
  labs(title = 'NBA Weight Distribution', x = 'Weight', y = 'Density')

Both weight and height appear approximately normally distributed in the NBA. Next I will check whether weight and height are strongly correlated.
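A histogram with 10 bins gives only a rough impression of normality. As a hedged supplement, the sketch below computes sample skewness and excess kurtosis, both of which are close to 0 for normal data. It runs on simulated heights (hypothetical values, not the Kaggle data), and the helper functions skewness and excess_kurtosis are defined here rather than taken from a package.

```r
# Simulated heights in cm; hypothetical values, not the Kaggle data.
set.seed(1)
height <- rnorm(1000, mean = 200.8, sd = 9.2)

# Sample skewness: average cubed z-score, near 0 for symmetric data.
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

# Excess kurtosis: average fourth-power z-score minus 3, near 0 for normal data.
excess_kurtosis <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^4) - 3
}

height_skew <- skewness(height)
height_kurt <- excess_kurtosis(height)
print(c(skew = height_skew, kurtosis = height_kurt))
```

On the real data the same helpers would be applied to nba$player_height and nba$player_weight. A visual check is also available through qqnorm() followed by qqline(). Note that shapiro.test() accepts at most 5000 observations, so it cannot be run on all 11,145 rows at once.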

ggplot(nba, aes(player_weight, player_height)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(title='Regression Model of the Relationship Between Height and Weight', x='Weight', y='Height')

The fitted regression line shows that weight and height are strongly correlated in the NBA, which is not surprising because taller individuals tend to weigh more. I will now plot the average height and weight of NBA players across the 24 observed seasons to look for a trend.

nba2 <- nba %>%
  group_by(season) %>%
  summarise(AveHeight = mean(player_height), AveWeight = mean(player_weight))
## `summarise()` ungrouping output (override with `.groups` argument)
coeff <- .5
ggplot(nba2, aes(x=season, group = 1)) + 
  geom_line(aes(y = AveHeight), color = "darkred") +
  geom_line( aes(y=AveWeight / coeff)) + 
  scale_y_continuous(name = "Average Height (cm)", sec.axis = sec_axis(~.*coeff, name="Average Weight (kg)")) + 
  labs(title = "Average Height and Weight Trend From 1996-2020")

What is very surprising is that from 1996 to 2020 the average height in the NBA fell by about 2 centimeters. Average weight followed the same pattern, falling by about 2 kg over the same period. Now I will plot the correlations among height, weight, games played, points scored, rebounds, and assists.

library(corrplot)
## corrplot 0.84 loaded
cor <- cor(select(nba, player_weight, player_height, gp, pts, reb, ast), use="pairwise.complete.obs")
corrplot(cor, type="upper", method="number")

The correlation coefficient between weight and height is .83, which is extremely strong and not surprising given the strong relationship between height and weight in the graphs above. Rebounds are fairly strongly correlated with both height and weight, which is also unsurprising because taller and heavier players collect most of the rebounds during a game. Weight and height are negatively correlated with assists because taller and heavier players are less often involved in playmaking. Height and weight have little linear association with the number of games played or the average number of points scored, since those correlation coefficients are roughly 0.
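The correlation plot reports point estimates only. To attach a confidence interval and a p-value to a single pair, base R provides cor.test(). The sketch below runs it on simulated height and weight values (hypothetical numbers generated to be positively related, not the Kaggle data).

```r
# Simulated, positively related height (cm) and weight (kg); not real data.
set.seed(42)
height <- rnorm(200, mean = 200, sd = 9)
weight <- 100 + 1.1 * (height - 200) + rnorm(200, sd = 7)

# Pearson correlation with a 95% confidence interval and p-value.
res <- cor.test(height, weight)
print(res$estimate)   # sample correlation coefficient
print(res$conf.int)   # 95% confidence interval
print(res$p.value)    # test of zero correlation
```

On the real data, the same call with nba$player_height and nba$pts, for example, would show whether a coefficient near 0 is also statistically indistinguishable from 0.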

nba3 <- nba %>%
  filter(player_height >= 208.28) %>%
  group_by(season) %>%
  summarise(ave_ts_pct_tall = mean(ts_pct))
## `summarise()` ungrouping output (override with `.groups` argument)
nba4 <- nba %>%
  filter(player_height < 208.28) %>%
  group_by(season) %>%
  summarise(ave_ts_pct_small = mean(ts_pct))
## `summarise()` ungrouping output (override with `.groups` argument)
nba5 <- inner_join(nba3, nba4)
## Joining, by = "season"
colors <- c("208.28cm or taller" = "green", "Smaller than 208.28cm" = "purple")

ggplot(nba5, aes(x=season)) + 
  geom_line(aes(y = ave_ts_pct_tall, color = "208.28cm or taller"), size = 1) +
  geom_line(aes(y=ave_ts_pct_small, color = "Smaller than 208.28cm"), size = 1) +
  labs(title = "Average True Shooting Percentage of Taller Players versus Smaller Players", x = "Season",  y = "Average True Shooting Percentage", color = "Legend") + scale_color_manual(values = colors)

This plot shows the average true shooting percentage for players who are 208.28cm or taller and for players who are shorter than 208.28cm. It gives insight into how much taller basketball players have improved their scoring efficiency. Starting in the 2013-14 season, the average true shooting percentage of taller players began to rise steadily; it was still rising in 2019-20, and it may continue to grow in future seasons. Interestingly, the average for players shorter than 208.28cm showed no drastic trend and stayed fairly consistent from 1996 to 2020.

Statistical Modeling

I will now use logistic regression to predict the probability that a player’s true shooting percentage is 50% or higher, which is about the NBA average. I will compare how that probability varies with height and weight in the 1996-97 season versus the 2019-20 season, controlling for age and country in all of the following regressions.
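Written out, the model fit separately to each season has the form below. The notation is introduced here for illustration: p_i is the probability that player i is labeled "good", the betas and gammas are the fitted coefficients, and the sum runs over all countries except the reference level that R drops.

```latex
\log\frac{p_i}{1 - p_i}
  = \beta_0
  + \beta_1\,\mathrm{weight}_i
  + \beta_2\,\mathrm{height}_i
  + \beta_3\,\mathrm{age}_i
  + \sum_{c} \gamma_c\,\mathbf{1}[\mathrm{country}_i = c],
\qquad
p_i = \Pr(\mathrm{TS\%}_i \ge 0.5)
```

Because glm() with a factor response treats the first level as failure, and "bad" sorts before "good", the fitted probabilities refer to the "good" outcome.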

# Label each player-season "good" if true shooting is at least 50%, otherwise "bad"
nba <- nba %>% mutate(Good_Bad_Trueshooting = ifelse(ts_pct >= .5, "good", "bad"))
nba$country <- as.factor(nba$country)
nba$Good_Bad_Trueshooting <- as.factor(nba$Good_Bad_Trueshooting)
# season is an integer index: 1 = 1996-97, 24 = 2019-20
past <- filter(nba, season == 1)
present <- filter(nba, season == 24)
# Logistic regressions for the probability of "good", fit separately per season
ts_glm <- glm(Good_Bad_Trueshooting ~ player_weight + player_height + age + country, 
                 family="binomial", 
                 data=present)
ts_glmpast <- glm(Good_Bad_Trueshooting ~ player_weight + player_height + age + country, 
                 family="binomial", 
                 data=past)
visreg(ts_glmpast, "player_weight", 
       gg = TRUE, 
       scale="response") +
  labs(y = "Probability that True Shooting Percentage is Above 50%", 
       x = "Weight (kg)",
       title = "Relationship of True Shooting and Weight in the 1996-1997 Season",
       subtitle = "controlling for height, age, and country")

visreg(ts_glm, "player_weight", 
       gg = TRUE, 
       scale="response") +
  labs(y = "Probability that True Shooting Percentage is Above 50%", 
       x = "Weight (kg)",
       title = "Relationship of True Shooting and Weight in the 2019-2020 Season",
       subtitle = "controlling for height, age, and country")

In the 1996-1997 season, the estimated probability that a player’s true shooting percentage is above 50% stays at roughly 0.6 across the range of weights. In the 2019-2020 season, the estimated probability is roughly 0.6 at the lowest weight and rises gradually with weight, reaching roughly 0.9 at the heaviest weight. Comparing the two graphs, heavier players today have a higher probability of achieving a true shooting percentage above 50%, whereas in the 1996-97 season weight had little effect on that probability.

visreg(ts_glmpast, "player_height", 
       gg = TRUE, 
       scale="response") +
  labs(y = "Probability that True Shooting Percentage is Above 50%", 
       x = "Height (cm)",
       title = "Relationship of True Shooting and Height in the 1996-1997 Season",
       subtitle = "controlling for weight, age, and country")

visreg(ts_glm, "player_height", 
       gg = TRUE, 
       scale="response") +
  labs(y = "Probability that True Shooting Percentage is Above 50%", 
       x = "Height (cm)",
       title = "Relationship of True Shooting and Height in the 2019-2020 Season",
       subtitle = "controlling for weight, age, and country")

In the 1996-1997 season, the estimated probability that a player’s true shooting percentage is above 50% is roughly 0.8 at the smallest height and decreases gradually as height increases, to roughly 0.45 at the tallest height. In the 2019-2020 season, the estimated probability stays at roughly 0.75 across the range of heights. Comparing the two graphs, players today have about the same chance of achieving a true shooting percentage above 50% no matter how tall they are, whereas in the 1996-97 season shorter players had a much better chance than taller ones. One interpretation is that taller players of that era did not shoot the ball as effectively, so greater height was associated with a lower chance of a true shooting percentage above 50%.
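The plots show fitted probabilities; the same kind of model can also be summarized numerically as an odds ratio, obtained by exponentiating a coefficient. The sketch below is not the report's model: it fits a one-predictor logistic regression to simulated data (hypothetical values in which taller players are less likely to be "good") purely to show the calculation.

```r
# Simulated data: probability of a "good" season declines with height.
set.seed(1)
n <- 500
height <- rnorm(n, mean = 200, sd = 9)
p_good <- plogis(0.4 - 0.05 * (height - 200))
good <- rbinom(n, size = 1, prob = p_good)

# Logistic regression of the binary outcome on height.
fit <- glm(good ~ height, family = binomial)

# Odds ratio for a 1 cm and a 10 cm increase in height.
or_per_cm <- unname(exp(coef(fit)["height"]))
or_per_10cm <- unname(exp(10 * coef(fit)["height"]))
print(c(per_cm = or_per_cm, per_10cm = or_per_10cm))
```

An odds ratio below 1 means the odds of a "good" season fall as height rises. Applied to ts_glmpast and ts_glm, exp(coef(...)) on the player_height and player_weight terms would give a compact numerical comparison of the two seasons alongside the plots.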

Conclusion

The weight and height analysis in this project was meant to trace how the height and weight of NBA players have changed between the 1996-97 season and today. I found that weight and height are very strongly correlated and that both have actually been decreasing slowly across NBA seasons. My logistic regression models describe how the probability of a true shooting percentage above 50% relates to height and weight. Visualizing these relationships, I found that heavier players today have a higher probability of a true shooting percentage above 50%, which was not the case in 1996-97, when weight had little effect on that probability. I also found that taller players today have about the same probability of a true shooting percentage above 50% as shorter players, whereas in 1996-97 that probability fell markedly as height increased. These results offer intriguing evidence of a shift in basketball ideology, from an old belief that taller and heavier players should not shoot the basketball to a generation in which players of any height and weight can shoot effectively. The improved shooting ability of taller and heavier players has made them prime targets for coaches and general managers building their teams.

Reference

Data for this paper courtesy of: https://www.kaggle.com/justinas/nba-players-data