1. Abstract

This project explores NBA player statistics from the season beginning in 1996 to the 2022-2023 season to answer questions about career longevity, age and player performance, scoring trends, rookie impact, and player efficiency. We visualize our results using R programming to answer these questions.

2. Introduction

The popularity and complexity of basketball leave sports enthusiasts and data analysts alike wondering about trends among players and who the next best player for aspiring rookies to beat is. In this project, we explored average career length, performance based on age, top scorers, rookie performance and impact, and player efficiency according to net rating using R programming methods.

3. Methodology

Data Source: Kaggle NBA Players Dataset. Tools: R programming language, base packages, dplyr, tidyr, reshape2, ggplot2. Data preprocessing: cleaning and grouping data. Data visualization: creating histograms, lineplots, barplots, and boxplots using ggplot2.

4. Results

4.1 Career Longevity

4.1.1 Histogram of Player Career Lengths

This histogram shows the distribution of career lengths from all players from the 1996-97 season to the 2022-23 season. Using the interquartile range, we have highlighted where most player careers fall in green, which is between 1.5 and 7.5 years. It is rare to have a basketball career above 10 years, and it becomes increasingly rare above 15. A basketball career of 20 years and above is almost entirely unheard of. We also see that a large number of rookie players never make it to their second year, whether from injury or other factors that cause them not to make it to a second year.

4.1.2 Career Length and Position

We would have liked to explore whether there are differences between career lengths of players in different positions, however, we unfortunately were unable to as player position is not included in our data set, and acquiring this information is outside the scope of the project.

4.2 Age and Performance

4.2.1 Lineplot of Average NBA Player Performance by Age

The lineplot above shows how the average performance, which consists of points, assists, and rebounds, changes with the player’s age. We can see that the average amount of points scored peaks at age 28 and steadily decreases as players age. Rebounds and assists are more consistent, although they also tend to peak at age 28. Rebounds and assists stay pretty consistent from 25 to 35. Both start slowly decreasing between 35 and 40. Rebounds see an uptick at age 42 before plummeting and assists decrease abruptly after age 40.

4.2.2 Lineplot of Number of Players Exceeding Performance Thresholds

This line plot shows the number of players that excel in each performance metric of points, rebounds, and assists by age group. We set these thresholds to be an average metric per game of 20 points, 8 rebounds, and, 8 assists. The number of players that meet or exceed the point or rebound metrics increases as players approach 25, also reaching their peak at 25, and decreases significantly after age 28. Looking at assists these differences are much less pronounced, and the number of excelling players within the assists category stays consistent between 28 and 32 when they start tapering off.

4.4 Rookie Impact

4.4.1

The barplot above shows how the average performance of rookie players compares to players in their prime. We defined rookie players as players in their first season and players in their prime as players between the ages of 26 and 30. We can see that rookie players perform worse on average than players in their prime for all performance metrics, though the difference in assists is less.

4.4.2 Barplot of Exceptional Rookie Players

For this barplot, we aimed to find rookies who had an exceptional impact on their teams. We did this by setting the thresholds for exceptional rookie players to be the averages for points, rebounds, assists, and net rating greater than or equal to 15, 6, 5, and 3, respectively. Based on these metrics, we identified these four players to have an exceptional impact on their teams. However, a better way to identify rookies with an exceptional impact on their teams may be to look at something such as the Rookie of the Year Award, which takes much more into account when being awarded.

4.5 Player Efficiency

4.5.1 Boxplots of Player Net Rating Distributions Across Seasons

These boxplots show the distribution of player net ratings across the seasons. They have been expanded into three separate graphs, each containing 9 years of data, to improve legibility and to avoid clutter. The dots represent players whose net ratings fall outside of the expected range of the data. We will determine which of these players qualify as outliers in our next step.

4.5.2 Players Who Consistently Maintain a High Net Rating
## # A tibble: 24 × 2
##    player_name          season_range
##    <chr>                <chr>       
##  1 Andre Iguodala       2013-2022   
##  2 Bruce Bowen          1996-2006   
##  3 Chris Paul           2013-2017   
##  4 Danny Green          2010-2018   
##  5 David Robinson       1998-2002   
##  6 Dirk Nowitzki        2002-2010   
##  7 Draymond Green       2014-2016   
##  8 John Stockton        1996-2000   
##  9 Kawhi Leonard        2014-2020   
## 10 Kendrick Perkins     2003-2017   
## 11 Kevin Durant         2012-2018   
## 12 Kevin Garnett        2007-2010   
## 13 Klay Thompson        2014-2016   
## 14 Kostas Antetokounmpo 2018-2020   
## 15 LeBron James         2008-2015   
## 16 Manu Ginobili        2004-2015   
## 17 Nick Collison        2011-2013   
## 18 Randy Livingston     1998-2005   
## 19 Robert Horry         1997-2005   
## 20 Shaquille O'Neal     1997-2010   
## 21 Shavlik Randolph     2007-2013   
## 22 Stephen Curry        2014-2018   
## 23 Terry Porter         1999-2001   
## 24 Tim Duncan           2000-2015

For identifying and displaying outliers who consistently maintain a high net rating, we decided to display them in a list with their season range because no matter what type of plot we tried, the net ratings were too close together and, with 24 players that qualified as outliers, the plots became too cluttered and unreadable. We defined the outliers using a percentile base threshold of 95% to narrow down the exceptionally high net rating based on how much data we have in our dataset. However, this list would be much longer if we used the interquartile range method, where most, if not all of, the dots on our boxplots would be considered outliers.

5. Conclusion

This project aimed to explore NBA player data to answer research questions that generally considered player performance. We answered these questions using our dataset and visualized these results using a variety of methods. We also recognized the limitations of our dataset and gave suggestions for potential improvements when filtering our data to determine our results where relevant.

7. Appendices

7.1 Setup Code: Loading Libraries, Reading Data

knitr::opts_chunk$set(echo = TRUE)
library(ggplot2)
library(dplyr)
library(tidyr)
library(reshape2)

nba_data <- read.csv('all_seasons.csv')

nba_data$season <- as.numeric(sub("-.*", "", as.character(nba_data$season)))

7.2 Grouping Data, Histogram 4.1.1

career_data <- nba_data %>%
  group_by(player_name) %>%
  summarise(
    first_season = min(season),
    last_season = max(season),
    career_length = last_season - first_season + 1
  )

iqr_values <- quantile(career_data$career_length, c(0.25, 0.75), na.rm = TRUE)
iqr_lower <- iqr_values[1]
iqr_upper <- iqr_values[2]

ggplot(career_data, aes(x = career_length)) +
  geom_histogram(binwidth = 1, fill = "skyblue", color = "black", alpha = 0.7) +
  geom_rect(aes(xmin = iqr_lower, xmax = iqr_upper, ymin = 0, ymax = Inf), 
            fill = "lightgreen", alpha = 0.01) +  
  labs(
    title = "Distribution of NBA Player Career Lengths",
    x = "Career Length (Years)",
    y = "Number of Players"
  ) +
  theme_minimal()

7.3 Grouping and Reformatting Data, Lineplot 4.2.1

performance_by_age <- nba_data %>%
  group_by(age) %>%
  summarise(
    avg_pts = mean(pts, na.rm = TRUE),
    avg_reb = mean(reb, na.rm = TRUE),
    avg_ast = mean(ast, na.rm = TRUE),
    .groups = 'drop'  
  )

performance_long <- performance_by_age %>%
  pivot_longer(cols = c(avg_pts, avg_reb, avg_ast), 
               names_to = "metric", 
               values_to = "value")

performance_long <- performance_long %>%
  mutate(
    metric = recode(metric,
                    "avg_pts" = "Points",
                    "avg_reb" = "Rebounds",
                    "avg_ast" = "Assists")
  )

ggplot(performance_long, aes(x = age, y = value, color = metric)) +
  geom_line(linewidth = 1.2) +  
  geom_point(size = 3) +  
  labs(
    title = "Average NBA Player Performance by Age",
    x = "Age",
    y = "Average Performance",
    color = "Metric"
  ) +
  scale_color_manual(values = c("Points" = "lightblue", "Rebounds" = "lightgreen", "Assists" = "lightcoral")) +  
  theme_minimal() +
  theme(legend.position = "top")

7.4 Grouping and Reformatting Data, Lineplot 4.2.2

performance_by_age <- nba_data %>%
  group_by(age) %>%
  summarise(
    avg_pts = mean(pts, na.rm = TRUE),
    avg_reb = mean(reb, na.rm = TRUE),
    avg_ast = mean(ast, na.rm = TRUE),
    .groups = 'drop'  
  )

performance_long <- performance_by_age %>%
  pivot_longer(cols = c(avg_pts, avg_reb, avg_ast), 
               names_to = "metric", 
               values_to = "value")

performance_long <- performance_long %>%
  mutate(
    metric = recode(metric,
                    "avg_pts" = "Points",
                    "avg_reb" = "Rebounds",
                    "avg_ast" = "Assists")
  )

ggplot(performance_long, aes(x = age, y = value, color = metric)) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 3) +  
  labs(
    title = "Average NBA Player Performance by Age",
    x = "Age",
    y = "Average Performance",
    color = "Metric"
  ) +
  scale_color_manual(values = c("Points" = "lightblue", "Rebounds" = "lightgreen", "Assists" = "lightcoral")) +  
  theme_minimal() +
  theme(legend.position = "top")

7.5 Cleaning and Grouping Data, Lineplot 4.3.1

nba_data_clean_ppg <- nba_data %>%
  filter(!is.na(pts) & !is.na(gp)) 

top_scorer_per_season <- nba_data_clean_ppg %>%
  group_by(season) %>%
  filter(pts == max(pts, na.rm = TRUE)) %>%  
  ungroup()

ggplot(top_scorer_per_season, aes(x = season, y = pts, color = player_name)) +
  geom_line(aes(group = player_name), linewidth = 1) +  
  geom_point(size = 2) +  
  labs(
    title = "Top Points Per Game Over the Seasons",
    x = "Season",
    y = "Points Per Game (PPG)",
    color = "Player"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

7.6 Cleaning and Grouping Data, Lineplot 4.3.2

nba_data_clean_avg_ppg <- nba_data %>%
  filter(!is.na(pts) & pts > 0)

average_ppg_per_season <- nba_data_clean_avg_ppg %>%
  group_by(season) %>%
  summarise(avg_ppg = mean(pts, na.rm = TRUE)) %>%
  ungroup()

ggplot(average_ppg_per_season, aes(x = season, y = avg_ppg)) +
  geom_line(color = "lightblue") + 
  geom_point(color = "lightcoral", size = 2) + 
  labs(
    title = "Trend of Average Points Per Game Per Player Across Seasons",
    x = "Season",
    y = "Average Points Per Game Per Player"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

7.7 Cleaning, Grouping and Reformatting Data, Barplot 4.4.1

nba_data_clean_r <- nba_data %>%
  group_by(player_name) %>%
  mutate(rookie_season = min(season)) %>% 
  ungroup()

nba_data_clean_r <- nba_data_clean_r %>%
  mutate(
    player_type = case_when(
      season == rookie_season ~ "Rookie",  
      age >= 26 & age <= 30 ~ "Prime",  
      TRUE ~ "Other"
    )
  )

nba_data_filtered <- nba_data_clean_r %>%
  filter(player_type %in% c("Rookie", "Prime"))

average_performance <- nba_data_filtered %>%
  group_by(player_type) %>%
  summarise(
    avg_pts = mean(pts, na.rm = TRUE),
    avg_reb = mean(reb, na.rm = TRUE),
    avg_ast = mean(ast, na.rm = TRUE)
  )

average_performance_long <- melt(average_performance, id.vars = "player_type")

average_performance_long$variable <- recode(average_performance_long$variable,
                                            "avg_pts" = "Points",
                                            "avg_reb" = "Rebounds",
                                            "avg_ast" = "Assists"
)

ggplot(average_performance_long, aes(x = player_type, y = value, fill = variable)) +
  geom_bar(stat = "identity", position = "dodge") +  
  facet_wrap(~ variable) +  
  labs(
    title = "Comparison of Average Performance: Rookie vs Prime Players",
    x = "Player Type",
    y = "Average Value",
    fill = "Metric"
  ) +
  scale_fill_manual(values = c("Points" = "lightblue", "Rebounds" = "lightgreen", "Assists" = "lightcoral")) +  
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 0, hjust = 0.5)) 

7.8 Grouping and Reformatting Data, Barplot 4.4.2

exceptional_rookies <- nba_data_clean_r %>%
  filter(season == rookie_season) %>%
  filter(
    pts >= 15,     
    reb >= 6,           
    ast >= 5,           
    net_rating >= 3    
  )

exceptional_rookies_long <- melt(exceptional_rookies, id.vars = c("player_name", "season"), 
                                 measure.vars = c("pts", "reb", "ast", "net_rating"),
                                 variable.name = "metric", value.name = "value")

exceptional_rookies_long$metric <- recode(exceptional_rookies_long$metric,
                                          "pts" = "Points",
                                          "reb" = "Rebounds",
                                          "ast" = "Assists",
                                          "net_rating" = "Net Rating"
)

ggplot(exceptional_rookies_long, aes(x = player_name, y = value, fill = metric)) +
  geom_bar(stat = "identity", position = "dodge") +  
  coord_flip() +  
  labs(
    title = "Exceptional Rookie Players",
    x = "Player Name",
    y = "Performance Value",
    fill = "Metric"
  ) +
  scale_fill_manual(values = c("Points" = "lightblue", "Rebounds" = "lightgreen", "Assists" = "lightcoral", "Net Rating" = "lightgray")) +  
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

7.9 Filtering Data, Boxplots in 4.5.1

nba_data_clean_n <- nba_data %>%
  filter(!is.na(net_rating)) 

nba_data_period_1 <- nba_data_clean_n %>%
  filter(season >= 1996 & season <= 2004)

nba_data_period_2 <- nba_data_clean_n %>%
  filter(season >= 2005 & season <= 2013)

nba_data_period_3 <- nba_data_clean_n %>%
  filter(season >= 2014 & season <= 2022)

ggplot(nba_data_period_1, aes(x = factor(season), y = net_rating)) +
  geom_boxplot(aes(color = factor(season)), fill = "lightyellow", alpha = 0.6) +  
  labs(
    title = "Distribution of Player Net Rating (1996-2004)",
    x = "Season",
    y = "Net Rating",
    color = "Season"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

ggplot(nba_data_period_2, aes(x = factor(season), y = net_rating)) +
  geom_boxplot(aes(color = factor(season)), fill = "lightyellow", alpha = 0.6) +  
  labs(
    title = "Distribution of Player Net Rating (2005-2013)",
    x = "Season",
    y = "Net Rating",
    color = "Season"  
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

ggplot(nba_data_period_3, aes(x = factor(season), y = net_rating)) +
  geom_boxplot(aes(color = factor(season)), fill = "lightyellow", alpha = 0.6) +  
  labs(
    title = "Distribution of Player Net Rating (2014-2022)",
    x = "Season",
    y = "Net Rating",
    color = "Season"  
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

7.10 Calculating Outliers, Grouping and Filtering Data, List 4.5.2

high_net_rating_threshold <- quantile(nba_data_clean$net_rating, 0.95, na.rm = TRUE)

high_net_rating_players <- nba_data_clean_n %>%
  filter(net_rating > high_net_rating_threshold)

high_net_rating_consistent_players <- high_net_rating_players %>%
  group_by(player_name) %>%
  summarize(seasons_with_high_rating = n(),
            first_season = min(season),
            last_season = max(season)) %>%
  filter(seasons_with_high_rating >= 3) 

high_net_rating_consistent_players <- high_net_rating_consistent_players %>%
  mutate(season_range = paste(first_season, last_season, sep = "-"))

outlier_players <- high_net_rating_consistent_players %>%
  select(player_name, season_range)

print(n = 24, outlier_players)