The National Basketball Association (NBA) comprises 30 teams whose players, drawn from all over the world, compete to become NBA champions through a grueling 82-game regular season and the playoffs. The current generation of players has brought about a shift in ideas about how basketball should be played. In the past, it was widely believed that centers and power forwards should not shoot the ball because their height and weight advantage was best used for rebounding. This belief created a stigma against big men shooting, which deterred bigger players from working on their jump shots. In the past few years, I have noticed while watching NBA games that centers and power forwards now shoot the ball consistently, regardless of their height or weight. This has created a new market for tall players who can shoot, whom coaches and general managers regard as necessary for achieving a high level of success in the NBA. The main goal of coaches and general managers is to win, because winning brings the organization more revenue.
The goal of this paper is to use NBA players’ statistics and biographical information from the 1996 to 2019 seasons to analyze the distribution of height and weight and to observe whether true shooting percentage has increased over the years. True shooting percentage is a measure of shooting efficiency that takes into account field goals, 3-point field goals, and free throws. Using this variable gives a better understanding of how shooting efficiency has risen over the years, especially among big power forwards and centers. The improved shooting ability of taller and heavier players has changed how coaches and general managers choose draft prospects, because a big man who can shoot is now a major goal for teams building around the new generation of players.
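For reference, true shooting percentage is commonly computed as points divided by twice the sum of field goal attempts and 0.44 times free throw attempts. The sketch below is a minimal illustration of that standard formula. The function name `true_shooting` and the example stat lines are hypothetical and are not taken from the dataset, which already provides the value as `ts_pct`.

```r
# Standard true shooting formula: TS% = PTS / (2 * (FGA + 0.44 * FTA)).
# `true_shooting` is a hypothetical helper for illustration only.
true_shooting <- function(pts, fga, fta) {
  pts / (2 * (fga + 0.44 * fta))
}

# Hypothetical stat line: 20 points on 15 field goal attempts and 5 free throw attempts.
ts_example <- true_shooting(pts = 20, fga = 15, fta = 5)

# A single made 3-pointer with no other attempts gives 3 / 2 = 1.5,
# so values above 1 are possible for players with very few attempts.
ts_single_three <- true_shooting(pts = 3, fga = 1, fta = 0)
```

The second case shows that the measure is not bounded by 1, which is one possible explanation for unusually high values among players with very few shot attempts.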
The NBA player statistics and biographical information from the 1996 to 2019 seasons are provided by https://www.kaggle.com/justinas/nba-players-data.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.2.1 ✓ purrr 0.3.3
## ✓ tibble 3.0.3 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.4.0
## ── Conflicts ────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(stargazer)
##
## Please cite as:
## Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2.2. https://CRAN.R-project.org/package=stargazer
library(descr)
library(class)
library(visreg)
nba <- read.csv("all_seasons.csv")
str(nba)
## 'data.frame': 11145 obs. of 22 variables:
## $ X : int 0 1 2 3 4 5 6 7 8 9 ...
## $ player_name : Factor w/ 2235 levels "A.C. Green","A.J. Bramlett",..: 535 648 658 662 663 669 670 679 680 685 ...
## $ team_abbreviation: Factor w/ 36 levels "ATL","BKN","BOS",..: 6 14 33 8 17 12 15 15 1 18 ...
## $ age : num 36 28 39 24 34 38 25 28 29 28 ...
## $ player_height : num 198 216 206 203 206 ...
## $ player_weight : num 99.8 117.9 95.3 100.7 108.9 ...
## $ college : Factor w/ 316 levels " ","Alabama",..: 237 77 67 273 288 101 253 52 296 140 ...
## $ country : Factor w/ 76 levels "Angola","Argentina",..: 73 73 73 73 73 73 73 73 73 73 ...
## $ draft_year : Factor w/ 45 levels "1963","1976",..: 11 15 4 20 10 6 19 15 17 16 ...
## $ draft_round : Factor w/ 8 levels "1","2","3","4",..: 2 1 3 1 1 2 1 1 8 2 ...
## $ draft_number : Factor w/ 75 levels "1","10","11",..: 26 23 60 74 2 28 2 26 75 37 ...
## $ gp : int 55 15 9 64 27 52 80 77 71 82 ...
## $ pts : num 5.7 2.3 0.8 3.7 2.4 8.2 17.2 14.9 5.7 6.9 ...
## $ reb : num 16.1 1.5 1 2.3 2.4 2.7 4.1 8 1.6 1.5 ...
## $ ast : num 3.1 0.3 0.4 0.6 0.2 1 3.4 1.6 1.3 3 ...
## $ net_rating : num 16.1 12.3 -2.1 -8.7 -11.2 4.1 4.1 3.3 -0.3 -1.2 ...
## $ oreb_pct : num 0.186 0.078 0.105 0.06 0.109 0.034 0.035 0.095 0.036 0.018 ...
## $ dreb_pct : num 0.323 0.151 0.102 0.149 0.179 0.126 0.091 0.183 0.076 0.081 ...
## $ usg_pct : num 0.1 0.175 0.103 0.167 0.127 0.22 0.209 0.222 0.172 0.177 ...
## $ ts_pct : num 0.479 0.43 0.376 0.399 0.611 0.541 0.559 0.52 0.539 0.557 ...
## $ ast_pct : num 0.113 0.048 0.148 0.077 0.04 0.102 0.149 0.087 0.141 0.262 ...
## $ season : Factor w/ 24 levels "1996-97","1997-98",..: 1 1 1 1 1 1 1 1 1 1 ...
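# Drop the first column (X), which is only a row index
nba <- nba[ -c(1) ]
# Convert season to an integer index: 1 = 1996-97, ..., 24 = 2019-20
nba$season <- as.integer(factor(nba$season))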
n_distinct(nba$player_name)
## [1] 2235
summary(nba)
## player_name team_abbreviation age player_height
## Vince Carter : 22 CLE : 390 Min. :18.00 Min. :160.0
## Dirk Nowitzki : 21 TOR : 390 1st Qu.:24.00 1st Qu.:195.6
## Kevin Garnett : 20 LAC : 389 Median :27.00 Median :200.7
## Kobe Bryant : 20 MIA : 387 Mean :27.17 Mean :200.8
## Jamal Crawford: 19 DAL : 384 3rd Qu.:30.00 3rd Qu.:208.3
## Jason Terry : 19 ATL : 383 Max. :44.00 Max. :231.1
## (Other) :11024 (Other):8822
## player_weight college country draft_year
## Min. : 60.33 None :1684 USA :9410 Undrafted:1942
## 1st Qu.: 90.72 Kentucky : 360 France : 153 1998 : 454
## Median : 99.79 Duke : 331 Canada : 140 2003 : 430
## Mean :100.64 North Carolina: 318 Spain : 79 2005 : 420
## 3rd Qu.:109.32 UCLA : 280 Brazil : 78 1996 : 406
## Max. :163.29 Arizona : 257 Australia: 74 2001 : 403
## (Other) :7915 (Other) :1211 (Other) :7090
## draft_round draft_number gp pts
## 1 :6513 Undrafted:1959 Min. : 1.00 Min. : 0.000
## 2 :2629 1 : 320 1st Qu.:32.00 1st Qu.: 3.500
## Undrafted:1959 5 : 320 Median :58.00 Median : 6.600
## 3 : 20 4 : 311 Mean :52.01 Mean : 8.126
## 4 : 12 2 : 299 3rd Qu.:74.00 3rd Qu.:11.500
## 6 : 5 3 : 299 Max. :85.00 Max. :36.100
## (Other) : 7 (Other) :7637
## reb ast net_rating oreb_pct
## Min. : 0.00 Min. : 0.000 Min. :-200.000 Min. :0.00000
## 1st Qu.: 1.80 1st Qu.: 0.600 1st Qu.: -6.300 1st Qu.:0.02200
## Median : 3.00 Median : 1.200 Median : -1.300 Median :0.04300
## Mean : 3.56 Mean : 1.801 Mean : -2.154 Mean :0.05559
## 3rd Qu.: 4.70 3rd Qu.: 2.400 3rd Qu.: 3.200 3rd Qu.:0.08600
## Max. :16.30 Max. :11.700 Max. : 300.000 Max. :1.00000
##
## dreb_pct usg_pct ts_pct ast_pct
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0960 1st Qu.:0.1500 1st Qu.:0.4780 1st Qu.:0.0650
## Median :0.1320 Median :0.1820 Median :0.5210 Median :0.1020
## Mean :0.1418 Mean :0.1856 Mean :0.5081 Mean :0.1311
## 3rd Qu.:0.1820 3rd Qu.:0.2180 3rd Qu.:0.5570 3rd Qu.:0.1780
## Max. :1.0000 Max. :1.0000 Max. :1.5000 Max. :1.0000
##
## season
## Min. : 1.00
## 1st Qu.: 7.00
## Median :13.00
## Mean :12.88
## 3rd Qu.:19.00
## Max. :24.00
##
This data comprises 11,145 rows and 22 variables; after removing the first column, which is only a row index, 21 variables remain. What stands out to me is that the maximum number of games played is 85, which means that a few players have appeared in more than the official 82 games of a regular season. This can occur when a player is traded mid-season to a team that has played fewer games and so has more games remaining.
The variables “player_height”, “player_weight”, and “ts_pct” will be very important for the statistical model and analysis. I will investigate the distributions of weight and height among NBA players and observe whether weight and height are correlated with each other and with games played, points, assists, and rebounds. Then I will compare the average true shooting percentage across seasons for players who are 208.28cm or taller and players who are shorter than 208.28cm. Finally, I will model how the probability of an above-average true shooting percentage depends on weight and height. The season variable has been converted to an integer in which 1 represents the 1996-97 NBA season and 24 represents the 2019-20 season.
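The 208.28cm cutoff corresponds to 82 inches (6 ft 10 in) and matches the 75th percentile of `player_height` in this data. A minimal sketch of how such a grouping column could be built is shown here. The `toy` data frame and the `height_group` column are hypothetical and are not part of the analysis code.

```r
# 82 inches (6 ft 10 in) is 82 * 2.54 = 208.28 centimeters.
cutoff_cm <- 208.28

# Hypothetical toy data; the real analysis uses nba$player_height.
toy <- data.frame(player_height = c(190.5, 208.28, 213.36, 200.66))

# Label each row by whether it meets the cutoff.
toy$height_group <- ifelse(toy$player_height >= cutoff_cm,
                           "208.28cm or taller",
                           "Smaller than 208.28cm")
table(toy$height_group)
```

Because the cutoff sits at the third quartile, roughly a quarter of player-seasons fall in the taller group.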
stargazer(select(nba, player_height, player_weight), type = "text", median=TRUE)
##
## ===============================================================================
## Statistic N Mean St. Dev. Min Pctl(25) Median Pctl(75) Max
## -------------------------------------------------------------------------------
## player_height 11,145 200.813 9.191 160.020 195.580 200.660 208.280 231.140
## player_weight 11,145 100.638 12.576 60.328 90.718 99.790 109.316 163.293
## -------------------------------------------------------------------------------
summary(nba$ts_pct)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.4780 0.5210 0.5081 0.5570 1.5000
# Refer to columns by name inside aes() rather than with nba$...
ggplot(data=nba, aes(x = player_height)) +
  geom_histogram(aes(y =..density..), col="Black", fill="Yellow", alpha=1, bins = 10) +
  geom_density(col=4) +
  geom_vline(aes(xintercept = mean(player_height)),col='red',size=1) +
  labs(title='NBA Height Distribution', x='Height (cm)', y='Density')
ggplot(data=nba, aes(x = player_weight)) +
  geom_histogram(aes(y =..density..), col="Black", fill="Yellow", alpha=1, bins = 10) +
  geom_density(col=4) +
  geom_vline(aes(xintercept = mean(player_weight)),col='red',size=1) +
  labs(title='NBA Weight Distribution', x='Weight (kg)', y='Density')
Both height and weight appear to be approximately normally distributed in the NBA. Next I will check how strongly weight and height are correlated.
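A histogram alone is only a rough guide to normality. One quick numeric check, sketched below on the assumption that a sample skewness near zero and a mean close to the median indicate approximate symmetry, uses only base R. The simulated `heights` vector is hypothetical (drawn with roughly the mean and standard deviation reported above) and stands in for `nba$player_height`.

```r
set.seed(1)
# Hypothetical stand-in for nba$player_height.
heights <- rnorm(5000, mean = 200.8, sd = 9.2)

# Sample skewness: third central moment divided by the
# second central moment raised to the power 1.5.
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

skew_heights <- skewness(heights)
mean_minus_median <- mean(heights) - median(heights)
```

A normal quantile-quantile plot, for example `qqnorm(heights)` followed by `qqline(heights)`, gives a complementary visual check.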
ggplot(nba, aes(player_weight, player_height)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(title='Regression Model of the Relationship Between Height and Weight', x='Weight (kg)', y='Height (cm)')
The regression line shows that weight and height are strongly correlated in the NBA, which is not surprising because taller individuals tend to weigh more. I will now plot the average height and weight of NBA players across the 24 observed seasons to look for a trend.
nba2 <- nba %>%
group_by(season) %>%
summarise(AveHeight = mean(player_height), AveWeight = mean(player_weight))
## `summarise()` ungrouping output (override with `.groups` argument)
coeff <- .5
ggplot(nba2, aes(x=season, group = 1)) +
geom_line(aes(y = AveHeight), color = "darkred") +
geom_line( aes(y=AveWeight / coeff)) +
scale_y_continuous(name = "Average Height (cm)", sec.axis = sec_axis(~.*coeff, name="Average Weight (kg)")) +
labs(title = "Average Height and Weight Trend From 1996 to 2020")
What is very surprising is that from 1996 to 2020, the average height in the NBA fell by about 2 centimeters. Average weight followed the same pattern, falling by about 2 kg over the same period. Now I will plot the correlations among height, weight, games played, points scored, rebounds, and assists.
library(corrplot)
## corrplot 0.84 loaded
# Store the matrix under its own name so it does not shadow the cor() function
cor_mat <- cor(select(nba, player_weight, player_height, gp, pts, reb, ast), use="pairwise.complete.obs")
corrplot(cor_mat, type="upper", method="number")
The correlation coefficient between weight and height is .83, which is extremely strong and not surprising given the strong relationship between height and weight shown in the scatterplot above. There are fairly strong correlations between rebounds and height and between rebounds and weight, which is also unsurprising since taller and heavier players most often get the rebounds during a game. Weight and height are negatively correlated with assists, likely because taller and heavier players are less often involved in the playmaking aspects of basketball. Height and weight have little relationship with the number of games played or the average number of points scored, since those correlation coefficients are roughly 0.
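Whether a correlation near zero can be distinguished from zero can be checked with `cor.test()` from base R's stats package. The sketch below uses simulated data. The variables `height`, `rebounds`, and `pts_scored` are hypothetical stand-ins rather than columns from the dataset, and the relationships between them are built in by construction.

```r
set.seed(42)
n <- 500
height <- rnorm(n, mean = 200, sd = 9)
# Rebounds are built to depend on height; points are built to be independent of it.
rebounds <- 0.1 * (height - 200) + rnorm(n, mean = 3.5, sd = 1)
pts_scored <- rnorm(n, mean = 8, sd = 5)

test_reb <- cor.test(height, rebounds)    # Pearson correlation by default
test_pts <- cor.test(height, pts_scored)

# Each result carries the estimate, a 95% confidence interval, and a p-value.
test_reb$estimate
test_reb$conf.int
test_pts$p.value
```

With more than eleven thousand rows in the real data, even small correlations can be statistically significant, so the size of the coefficient is usually more informative than the p-value.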
nba3 <- nba %>%
filter(player_height >= 208.28) %>%
group_by(season) %>%
summarise(ave_ts_pct_tall = mean(ts_pct))
## `summarise()` ungrouping output (override with `.groups` argument)
nba4 <- nba %>%
filter(player_height < 208.28) %>%
group_by(season) %>%
summarise(ave_ts_pct_small = mean(ts_pct))
## `summarise()` ungrouping output (override with `.groups` argument)
nba5 <- inner_join(nba3, nba4)
## Joining, by = "season"
colors <- c("208.28cm or taller" = "green", "Smaller than 208.28cm" = "purple")
ggplot(nba5, aes(x=season)) +
geom_line(aes(y = ave_ts_pct_tall, color = "208.28cm or taller"), size = 1) +
geom_line(aes(y=ave_ts_pct_small, color = "Smaller than 208.28cm"), size = 1) +
labs(title = "Average True Shooting Percentage of Taller Players versus Smaller Players", x = "Season", y = "Average True Shooting Percentage", color = "Legend") + scale_color_manual(values = colors)
This plot compares the average true shooting percentage of players who are 208.28cm or taller with that of players who are shorter than 208.28cm. It gives insight into how much taller basketball players have improved their scoring efficiency. Starting in the 2013-2014 season, taller players’ average true shooting percentage began to rise steadily; it was still rising in the most recent season observed and may continue to grow in future seasons. It was interesting to observe that players shorter than 208.28cm did not show a drastic change and stayed fairly consistent from 1996 to 2020.
I will now use logistic regression to estimate the probability that a player’s true shooting percentage is above 50%, which is about the NBA average. I will observe how this probability varies with height and weight in the 1996-97 season compared to the 2019-20 season, controlling for age and country in all of the following regressions.
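A visual trend of this kind can also be summarized with a simple linear fit of the seasonal average on the season index. This is a sketch only: the `trend` data frame below is made-up illustrative data, not the values plotted above, and `lm()` is one of several reasonable ways to quantify a slope.

```r
# Hypothetical seasonal averages for illustration (not the real values).
trend <- data.frame(
  season = 1:6,
  ave_ts_pct_tall = c(0.50, 0.51, 0.52, 0.53, 0.54, 0.55)
)

# Slope = estimated change in average true shooting percentage per season.
fit <- lm(ave_ts_pct_tall ~ season, data = trend)
slope <- unname(coef(fit)["season"])
slope
```

Fitting the same model separately to the taller and shorter groups would allow the two slopes to be compared directly.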
nba <- nba %>% mutate(Good_Bad_Trueshooting = ifelse(ts_pct >= .5, "good", "bad"))
nba$country <- as.factor(nba$country)
nba$Good_Bad_Trueshooting <- as.factor(nba$Good_Bad_Trueshooting)
past <- filter(nba, season == 1)
present <- filter(nba, season == 24)
ts_glm <- glm(Good_Bad_Trueshooting ~ player_weight + player_height + age + country,
family="binomial",
data=present)
ts_glmpast <- glm(Good_Bad_Trueshooting ~ player_weight + player_height + age + country,
family="binomial",
data=past)
visreg(ts_glmpast, "player_weight",
gg = TRUE,
scale="response") +
labs(y = "Probability that True Shooting Percentage is Above 50%",
x = "Weight (kg)",
title = "Relationship of True Shooting and Weight in the 1996-1997 Season",
subtitle = "controlling for height, age, and country")
visreg(ts_glm, "player_weight",
gg = TRUE,
scale="response") +
labs(y = "Probability that True Shooting Percentage is Above 50%",
x = "Weight (kg)",
title = "Relationship of True Shooting and Weight in the 2019-2020 Season",
subtitle = "controlling for height, age, and country")
In the 1996-1997 season, the estimated probability that a player’s true shooting percentage is above 50% stays at roughly 0.6 as weight increases. In the 2019-2020 season, the estimated probability is roughly 0.6 at the lowest weight and rises gradually with weight, reaching roughly 0.9 at the heaviest weight. Comparing these two graphs, I found that heavier players today have a higher probability of achieving a true shooting percentage above 50%, whereas in the 1996-97 season weight had essentially no effect on that probability.
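The probabilities shown in plots like these come from the fitted logistic model: the linear predictor is passed through the inverse logit. The sketch below illustrates this with simulated data. The `sim` data frame, its columns, and the coefficients used to generate it are hypothetical and unrelated to the fitted models in this paper.

```r
set.seed(7)
n <- 400
sim <- data.frame(weight = rnorm(n, mean = 100, sd = 12))
# Simulate a binary outcome whose log-odds rise with weight.
sim$good <- rbinom(n, size = 1, prob = plogis(-4 + 0.045 * sim$weight))

fit <- glm(good ~ weight, family = "binomial", data = sim)

# Predicted probability at 80 kg and 130 kg.
new_players <- data.frame(weight = c(80, 130))
p_hat <- predict(fit, newdata = new_players, type = "response")

# Same result by hand: inverse logit of intercept + slope * weight.
p_manual <- plogis(coef(fit)[1] + coef(fit)[2] * new_players$weight)

# Odds ratio for a 1 kg increase in weight.
odds_ratio <- exp(coef(fit)["weight"])
```

Reporting the odds ratio alongside the plots gives a single number for how much the odds of good shooting change per kilogram, holding the other predictors fixed.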
visreg(ts_glmpast, "player_height",
gg = TRUE,
scale="response") +
labs(y = "Probability that True Shooting Percentage is Above 50%",
x = "Height (cm)",
title = "Relationship of True Shooting and Height in the 1996-1997 Season",
subtitle = "controlling for weight, age, and country")
visreg(ts_glm, "player_height",
gg = TRUE,
scale="response") +
labs(y = "Probability that True Shooting Percentage is Above 50%",
x = "Height (cm)",
title = "Relationship of True Shooting and Height in the 2019-2020 Season",
subtitle = "controlling for weight, age, and country")
In the 1996-1997 season, the estimated probability that a player’s true shooting percentage is above 50% is roughly 0.8 at the smallest height and gradually decreases as height increases, reaching roughly 0.45 at the tallest height. In the 2019-2020 season, the estimated probability stays at roughly 0.75 as height increases. Comparing these two graphs, I found that today a player has about the same chance of obtaining a true shooting percentage above 50% no matter how tall the player is, whereas in the 1996-97 season a shorter player had a much better chance than a taller one. This is consistent with the idea that taller players of that era did not shoot the ball as effectively, so greater height meant a greater chance of a low true shooting percentage.
The weight and height analysis in this project was meant to observe how the height and weight of NBA players today compare with earlier seasons going back to 1996. I found that height and weight are very strongly correlated and that both have actually decreased slowly over these seasons. My logistic regression models describe the relationship between true shooting percentage and height and weight. Visualizing these relationships, I found that heavier players today have a higher probability of a true shooting percentage above 50%, which was not the case in 1996, when weight had no effect on that probability. I also found that taller players today have about the same probability of a true shooting percentage above 50% as smaller players, whereas in 1996 that probability fell as height increased. These models provide intriguing information about the shift in basketball ideology, from an old belief that taller and heavier men should not shoot the basketball to a generation in which players of any height and weight know how to shoot successfully. The improved shooting ability of taller and heavier players has made them prime targets for coaches and general managers to acquire for their teams.
Data for this paper courtesy of: https://www.kaggle.com/justinas/nba-players-data