Author

Griffin Lessinger

Introduction

The dataset that we work with today is a Kaggle dataset about NBA players. Specifically, the entries each give an NBA player within a specific season, from 1996 to 2021. We will focus on seasons from 2005 to 2021 (inclusive), analyzing the data to answer specific questions about player performance, scoring trends, etc.

Here is a sample of some rows and columns of the dataset:

Code
library(lubridate)
library(ggplot2)
library(dplyr)
library(readxl)

nba <- read.csv("/home/user/School/STAT360/Project 3 (NBA)/all_seasons.csv")
nba <- nba[as.numeric(substr(nba$season, 1, 4)) >= 2005 & as.numeric(substr(nba$season, 1, 4)) <= 2021, ]

nba[1:5, 2:7]
         player_name team_abbreviation age player_height player_weight  college
3973      Tyronn Lue               ATL  29        182.88      80.73938 Nebraska
3974       T.J. Ford               MIL  23        182.88      74.84268    Texas
3975 Tayshaun Prince               DET  26        205.74      97.52228 Kentucky
3976  Terence Morris               ORL  27        205.74     100.24383 Maryland
3977    Theo Ratliff               POR  33        208.28     106.59412  Wyoming

It is important to understand that the same athlete may appear multiple times if they were present in multiple seasons! This is incredibly useful for tracking statistics both over time and by athlete.

1. Age and Performance: How does the performance change with players’ age?

A large indicator of professional sports fit-ness is the age of the players. Given that “old age” for am NBA athlete is around 35-40 years old, it would be interesting to understand how a player’s performance tracks as they age.

Code
avebyage.pts <- tapply(nba$pts, nba$age, mean)

plot(
  x = 18:44,
  y = avebyage.pts,
  frame.plot = FALSE,
  type = "l",
  main = "Average points/game by age",
  xlab = "Age (years)",
  ylab = "Points/game"
)

Above is a plot of the average number of points per game for each age represented in the dataset. Note (for future age discussions, as well) that there are only 3 entries in which the athlete is 43 years old or older. As such, extrapolations cannot effectively be made about this oldest age bracket, as there is very limited data.

However, we can still see a notable (unsteady) incline in average points/game in ages less than 30, and an decline in points for ages greater than 30. Maybe the incline can be attributed to experience gain while the decline can be attributed to a lessening athleticism.

Code
ggplot(data = nba, aes(x = age, y = pts)) +
  geom_density_2d_filled(bins = 7) +
  scale_fill_manual(
    values = colorRampPalette(c("grey92", "blue4"))(7),
    name = "Density"
  ) +
  labs(
    title = "Age vs. Average Points/Game",
    subtitle = "2d kernel density estimate", 
    x = "Age (years)",
    y = "Average points"
  ) +
  geom_point(
    data = nba[nba$player_name == "LeBron James", ], 
    aes(x = age, y = pts, color = "LeBron")
  ) +
  geom_point(
    data = nba[nba$player_name == "Reggie Bullock", ], 
    aes(x = age, y = pts, color = "Reggie")
  ) +
  scale_color_manual(
    name = "Players",
    values = c(LeBron = "red3", Reggie = "orange2")
  ) +
  theme_classic()

For convenience, here is a kernel density estimate of the joint distribution of age and average points/game. As a reference, the statistics of LeBron James are added (he is often going to be an outlier in these plots), as well as those of Reggie Bullock, a more “typical” player (according to a Reddit post that I found).

Code
avebyage.reb <- tapply(nba$reb, nba$age, mean)

plot(
  x = 18:44,
  y = avebyage.reb,
  frame.plot = FALSE,
  type = "l",
  main = "Average rebounds/game by age",
  xlab = "Age (years)",
  ylab = "Rebounds/game"
)

Same deal here, although the decline in rebounds seems to begin earlier and manifest as a more gradual decrease of average rebounds/game, as opposed to the sharper decline in overall points that we saw before.

Code
ggplot(data = nba, aes(x = age, y = reb)) +
  geom_density_2d_filled(bins = 7) +
  scale_fill_manual(
    values = colorRampPalette(c("grey92", "purple4"))(7),
    name = "Density"
  ) +
  labs(
    title = "Age vs. Average Rebounds/Game",
    subtitle = "2d kernel density estimate", 
    x = "Age (years)",
    y = "Average rebounds"
  ) +
  geom_point(
    data = nba[nba$player_name == "LeBron James", ], 
    aes(x = age, y = reb, color = "LeBron")
  ) +
  geom_point(
    data = nba[nba$player_name == "Reggie Bullock", ], 
    aes(x = age, y = reb, color = "Reggie")
  ) +
  scale_color_manual(
    name = "Players",
    values = c(LeBron = "red3", Reggie = "orange2")
  ) +
  theme_classic()

As LeBron ages, the number of rebounds he got on average per game stayed roughly the same. However, Reggie improved over time! This is unexpected given the earlier plot that shows gradual decrease in rebounds/game with age, but good for him.

Code
avebyage.ast <- tapply(nba$ast, nba$age, mean)

plot(
  x = 18:44,
  y = avebyage.ast,
  frame.plot = FALSE,
  type = "l",
  main = "Average assists/game by age",
  xlab = "Age (years)",
  ylab = "Assists/game"
)

Lastly, we have assists/game by age. Unlike the first two indicators of performance, this one remains largely constant (on average) for most players, at roughly 1.5-2 assists/game across most ages. The initial uptick is likely explainable by an experience increase, and there is not enough data to explain the drop at ages of 40+.

Code
ggplot(data = nba, aes(x = age, y = ast)) +
  geom_density_2d_filled(bins = 7) +
  scale_fill_manual(
    values = colorRampPalette(c("grey92", "green4"))(7),
    name = "Density"
  ) +
  labs(
    title = "Age vs. Average Assists/Game",
    subtitle = "2d kernel density estimate", 
    x = "Age (years)",
    y = "Average assists"
  ) +
  geom_point(
    data = nba[nba$player_name == "LeBron James", ], 
    aes(x = age, y = ast, color = "LeBron")
  ) +
  geom_point(
    data = nba[nba$player_name == "Reggie Bullock", ], 
    aes(x = age, y = ast, color = "Reggie")
  ) +
  scale_color_manual(
    name = "Players",
    values = c(LeBron = "red3", Reggie = "orange2")
  ) +
  theme_classic()

And once more, we see LeBron James up high with at least 6-7 assists/game on average (regardless of age), and our benchmark Reggie with a somewhat modest increase.

In conclusion, performance of a player (especially points scored and rebounds per game) typically seems to improve with age for most players until their late 20s, possibly due to an growing level of skill and experience gain. From then, performance largely declines with continuing age, until most careers end by the time the players turn 40.

Of course, some outliers (such as LeBron James) do exist and continue to remain competitive, even as they grow older.

3. Rookie Impact: What is the average performance of rookie players compared to players in their prime years?

We can define “rookie” to be a player who was first drafted within the season of focus. Then, a player in their “prime” to be a player that is at least 1 year since their drafting year, but no more than 4 (this is a bit unfair, but on average, each player is only around for about 4.5 seasons. This works as an estimate).

Code
nba$draft_year.num <- ifelse(nba$draft_year == "Undrafted", NA, as.numeric(nba$draft_year))
nba$season.num <- as.numeric(substr(nba$season, 1, 4))

nba <- nba |>
  mutate(
    status = case_when(
      season.num == draft_year.num ~ "Rookie",
      season.num >= draft_year.num + 1 &
      season.num <= draft_year.num + 3 ~ "Experienced",
    )
  )

boxplot(
  data = na.omit(nba),
  pts ~ status,
  main = "Average points/game by experience",
  xlab = "Player status",
  ylab = "Average points"
)

Of course we are going to see better performance among the experienced population of players, as defined above. There is a selection bias here in which only the players that survived their rookie season actually made it to the “experienced” pool.

That said, we can partially counter this effect by including only players who appear in at least 4 seasons. This restricts the total player pool to relatively “good” players, but the trajectories by player should still be similar:

Code
boxplot(
  data = na.omit(nba)|>
    group_by(player_name) |>
    filter(n() >= 4) |>
    ungroup(),
  pts ~ status,
  main = "Average points/game by experience",
  xlab = "Player status",
  ylab = "Average points"
)
text(
  x = 1.5,
  y = 33,
  labels = "Only players present in >=4 seasons",
  xpd = TRUE
)

The numbers do not change much, and the effect is still the same. The “experienced” players are typically outperforming the rookies by a decent margin. One could argue that this idea reasonably extends to the player pool as a whole

Code
boxplot(
  data = na.omit(nba)|>
    group_by(player_name) |>
    filter(n() >= 4) |>
    ungroup(),
  reb ~ status,
  main = "Average rebounds/game by experience",
  xlab = "Player status",
  ylab = "Average rebounds"
)
text(
  x = 1.5,
  y = 33,
  labels = "Only players present in >=4 seasons",
  xpd = TRUE
)

Code
boxplot(
  data = na.omit(nba)|>
    group_by(player_name) |>
    filter(n() >= 4) |>
    ungroup(),
  ast ~ status,
  main = "Average assists/game by experience",
  xlab = "Player status",
  ylab = "Average assists"
)
text(
  x = 1.5,
  y = 33,
  labels = "Only players present in >=4 seasons",
  xpd = TRUE
)

These metrics are actually closer when split by experience level, but still notably different. it’s also worth noting how high the variance of the data is; in all metrics looked at so far, there have been many outliers!

We can safely say that it is likely that player experience (especially rookie vs. non-rookie) does have a meaningful effect on the performance of a player. A more rigorous approach would be to use some statistical test (2-sample t test especially) to demonstrate this phenomenon, but a visual examination is sufficient for now.

As for specific rookie players who scored a disproportionately high number of points/game, we can find that too:

Code
rookies <- nba |>
  filter(status == "Rookie") |>
  arrange(desc(pts))

rookies[1:10, c("player_name", "pts", "team_abbreviation", "age", "draft_year")]
        player_name  pts team_abbreviation age draft_year
1   Zion Williamson 22.5               NOP  19       2019
2       Luka Doncic 21.2               DAL  20       2018
3  Donovan Mitchell 20.5               UTA  21       2017
4      Kevin Durant 20.3               SEA  19       2007
5      Tyreke Evans 20.1               SAC  20       2009
6   Anthony Edwards 19.3               MIN  19       2020
7        Trae Young 19.1               ATL  20       2018
8    Damian Lillard 19.0               POR  22       2012
9         O.J. Mayo 18.5               MEM  21       2008
10     Kyrie Irving 18.5               CLE  20       2011

4. Career Longevity: What is the average career length of NBA players during this period?

On the topic of rookies vs. experienced players, what is a typical career length for players who’s careers began and ended during this window? How many games will they play?

We need to consider that some players’ careers may not yet have ended by the final year of examination, 2021.

Code
overs <- nba |>
  group_by(player_name) |>
  summarise(first_season = min(season.num),
            last_season  = max(season.num),
            score = mean(pts)) |>
  filter(first_season >= 2005, last_season <= 2020)

overs$career_length <- overs$last_season - overs$first_season + 1

hist(
  x = overs$career_length,
  main = "Histogram of NBA career lengths (n = 1247)",
  xlab = "Length (seasons)",
  ylab = "Count"
)
abline(v = mean(overs$career_length), lwd = 2, col = "red3")
abline(v = median(overs$career_length), lwd = 2, col = "purple3")
legend(
  x = "topright",
  fill = c("red3", "purple3"),
  legend = c(
    paste0("Mean: ", trunc(mean(overs$career_length*10))/10),
    paste0("Median: ", median(overs$career_length))
  ),
  bty = "n"
)

We can further analyze career length for “high scorers”:

Code
overs <- overs |>
  arrange(desc(score))

hist(
  x = overs[1:200, ]$career_length,
  main = "Histogram of NBA career lengths (n = 200)",
  breaks = max(overs[1:200, ]$career_length) - min(overs[1:200, ]$career_length) + 1,
  xlim = c(min(overs[1:200, ]$career_length), max(overs[1:200, ]$career_length)),
  xlab = "Length (seasons)",
  ylab = "Count"
)
abline(v = mean(overs[1:200, ]$career_length), lwd = 2, col = "red3")
abline(v = median(overs[1:200, ]$career_length), lwd = 2, col = "purple3")
legend(
  x = "topright",
  fill = c("red3", "purple3"),
  legend = c(
    paste0("Mean: ", trunc(mean(overs[1:200, ]$career_length*10))/10),
    paste0("Median: ", median(overs[1:200, ]$career_length))
  ),
  bty = "n"
)
text(
  x = 8,
  y = 26.5,
  labels = "2005 - 2020 careers of top 200 scorers",
  xpd = TRUE
)

It seem that those who score within the top ~15% of all players tend to have careers that are almost twice as long on average, and three times the length on the median. This makes sense, given that overall performance is the most important aspect of an NBA athlete, and average points scored/game is a strong indicator of high performance.

Note that there are other performance metrics as well; a player who plays defensively may not score as highly on average, but could still have a long ands valuable NBA career.

There seems to be a clear pattern: if you want to stay in the NBA, perform highly (or otherwise score a lot!).

5. All-Star Selections: Which players were selected as All-Stars in the most seasons during this period?

Code
allstar <- read.csv("/home/user/School/STAT360/Project 3 (NBA)/all_star.csv")
nba$allstar <- ifelse(
  paste(nba$player_name, nba$season.num) %in% paste(allstar$Player, allstar$Year),
  "all-star", 
  "not"
)

The data for NBA All-star teams was gotten from somewhere on Google, then incorporated into the original dataset. During this period, there were 113 distinct players selected to be All-stars. Some players were selected as All-stars a lot, while most others were only a few times:

Code
starplayers <- nba |>
  filter(allstar == "all-star") |>
  group_by(player_name) |>
  summarise(
    count = n()
  ) |>
  arrange(desc(count))

starplayers[1:10, ]
# A tibble: 10 × 2
   player_name       count
   <chr>             <int>
 1 LeBron James         17
 2 Dwyane Wade          12
 3 Chris Paul           11
 4 Kobe Bryant          11
 5 Carmelo Anthony      10
 6 Chris Bosh           10
 7 Dirk Nowitzki        10
 8 Kevin Durant         10
 9 James Harden          9
10 Russell Westbrook     9

There are, of course, many ways to approach analyzing the All-star players vs. typical players. The easiest way (again) would just be to compare performance metrics of these All-star players compared to others:

Code
ptsbyyear <- nba |>
  group_by(season.num) |>
  filter(allstar == "not")  |>
  summarise(pts.ave = mean(pts)) |>
  pull(pts.ave)

ptsbyyear.as <- nba |>
  group_by(season.num) |>
  filter(allstar == "all-star")  |>
  summarise(pts.ave = mean(pts)) |>
  pull(pts.ave)

mydata <- data.frame(
  year <- 2005:2021,
  pts <- ptsbyyear,
  pts.as <- ptsbyyear.as
)

ggplot(data = mydata, aes(x = year)) +
  ylim(0, 25) +
  geom_line(aes(y = pts, color = "Non All-star")) +
  geom_line(aes(y = pts.as, color = "All-star")) +
  scale_color_manual(
    name = "Class",
    values = c("Non All-star" = "red4", "All-star" = "royalblue4")
  ) +
  labs(
    title = "Average points/game",
    x = "Season",
    y = "Average points"
  ) +
  theme_classic()

Generally, the metrics of the players selected as All-stars are going to be significantly higher than those not.

Code
boxplot(
  data = nba,
  pts ~ allstar,
  names = c("All-star", "Not All-star"),
  main = "Boxplots of points/game vs. class",
  xlab = "Class",
  ylab = "Average points/game"
)

Even still, the boxplot above reveals that there are many outliers that do score average points/game within that All-star range. One would figure that performance alone may not dictate All-star lineups as there are many facets of an NBA player other than how they play.