This project explores NBA player statistics from the season beginning in 1996 to the 2022-2023 season to answer questions about career longevity, age and player performance, scoring trends, rookie impact, and player efficiency. We visualize our results using R programming to answer these questions.
The popularity and complexity of basketball leave sports enthusiasts and data analysts alike wondering about trends among players and who the next best player for aspiring rookies to beat is. In this project, we explored average career length, performance based on age, top scorers, rookie performance and impact, and player efficiency according to net rating using R programming methods.
Data Source: Kaggle NBA Players Dataset. Tools: R programming language, base packages, dplyr, tidyr, reshape2, ggplot2. Data preprocessing: cleaning and grouping data. Data visualization: creating histograms, lineplots, barplots, and boxplots using ggplot2.
This histogram shows the distribution of career lengths from all players from the 1996-97 season to the 2022-23 season. Using the interquartile range, we have highlighted where most player careers fall in green, which is between 1.5 and 7.5 years. It is rare to have a basketball career above 10 years, and it becomes increasingly rare above 15. A basketball career of 20 years and above is almost entirely unheard of. We also see that a large number of rookie players never make it to their second year, whether from injury or other factors that cause them not to make it to a second year.
We would have liked to explore whether there are differences between career lengths of players in different positions, however, we unfortunately were unable to as player position is not included in our data set, and acquiring this information is outside the scope of the project.
The lineplot above shows how the average performance, which consists of points, assists, and rebounds, changes with the player’s age. We can see that the average amount of points scored peaks at age 28 and steadily decreases as players age. Rebounds and assists are more consistent, although they also tend to peak at age 28. Rebounds and assists stay pretty consistent from 25 to 35. Both start slowly decreasing between 35 and 40. Rebounds see an uptick at age 42 before plummeting and assists decrease abruptly after age 40.
This line plot shows the number of players that excel in each performance metric of points, rebounds, and assists by age group. We set these thresholds to be an average metric per game of 20 points, 8 rebounds, and, 8 assists. The number of players that meet or exceed the point or rebound metrics increases as players approach 25, also reaching their peak at 25, and decreases significantly after age 28. Looking at assists these differences are much less pronounced, and the number of excelling players within the assists category stays consistent between 28 and 32 when they start tapering off.
The line plot above shows the players who scored the most points per game over the seasons, how their points per game compare to each other, and how they vary from season to season. There are 13 unique players; among them are some of the most famous players in the history of the NBA, such as Lebron James and Michael Jordan. Kobe Bryant and James Harden are the top scorers of top scorers in their prime, though all of these players are exceptional.
Now, we look at the trend of average points scored per game per player varying across the seasons using the entirety of our dataset instead of just the top players per season. Looking at the line plot, we can see the high and low-scoring years, but the general trend is that the average points scored per game per player has increased over time, especially within the past ten years or so.
The barplot above shows how the average performance of rookie players compares to players in their prime. We defined rookie players as players in their first season and players in their prime as players between the ages of 26 and 30. We can see that rookie players perform worse on average than players in their prime for all performance metrics, though the difference in assists is less.
For this barplot, we aimed to find rookies who had an exceptional impact on their teams. We did this by setting the thresholds for exceptional rookie players to be the averages for points, rebounds, assists, and net rating greater than or equal to 15, 6, 5, and 3, respectively. Based on these metrics, we identified these four players to have an exceptional impact on their teams. However, a better way to identify rookies with an exceptional impact on their teams may be to look at something such as the Rookie of the Year Award, which takes much more into account when being awarded.
These boxplots show the distribution of player net ratings across the seasons. They have been expanded into three separate graphs, each containing 9 years of data, to improve legibility and to avoid clutter. The dots represent players whose net ratings fall outside of the expected range of the data. We will determine which of these players qualify as outliers in our next step.
## # A tibble: 24 × 2
## player_name season_range
## <chr> <chr>
## 1 Andre Iguodala 2013-2022
## 2 Bruce Bowen 1996-2006
## 3 Chris Paul 2013-2017
## 4 Danny Green 2010-2018
## 5 David Robinson 1998-2002
## 6 Dirk Nowitzki 2002-2010
## 7 Draymond Green 2014-2016
## 8 John Stockton 1996-2000
## 9 Kawhi Leonard 2014-2020
## 10 Kendrick Perkins 2003-2017
## 11 Kevin Durant 2012-2018
## 12 Kevin Garnett 2007-2010
## 13 Klay Thompson 2014-2016
## 14 Kostas Antetokounmpo 2018-2020
## 15 LeBron James 2008-2015
## 16 Manu Ginobili 2004-2015
## 17 Nick Collison 2011-2013
## 18 Randy Livingston 1998-2005
## 19 Robert Horry 1997-2005
## 20 Shaquille O'Neal 1997-2010
## 21 Shavlik Randolph 2007-2013
## 22 Stephen Curry 2014-2018
## 23 Terry Porter 1999-2001
## 24 Tim Duncan 2000-2015
For identifying and displaying outliers who consistently maintain a high net rating, we decided to display them in a list with their season range because no matter what type of plot we tried, the net ratings were too close together and, with 24 players that qualified as outliers, the plots became too cluttered and unreadable. We defined the outliers using a percentile base threshold of 95% to narrow down the exceptionally high net rating based on how much data we have in our dataset. However, this list would be much longer if we used the interquartile range method, where most, if not all of, the dots on our boxplots would be considered outliers.
This project aimed to explore NBA player data to answer research questions that generally considered player performance. We answered these questions using our dataset and visualized these results using a variety of methods. We also recognized the limitations of our dataset and gave suggestions for potential improvements when filtering our data to determine our results where relevant.
Kaggle Dataset: [https://www.kaggle.com/datasets/justinas/nba-players-data]
knitr::opts_chunk$set(echo = TRUE)
library(ggplot2)
library(dplyr)
library(tidyr)
library(reshape2)
nba_data <- read.csv('all_seasons.csv')
nba_data$season <- as.numeric(sub("-.*", "", as.character(nba_data$season)))
career_data <- nba_data %>%
group_by(player_name) %>%
summarise(
first_season = min(season),
last_season = max(season),
career_length = last_season - first_season + 1
)
iqr_values <- quantile(career_data$career_length, c(0.25, 0.75), na.rm = TRUE)
iqr_lower <- iqr_values[1]
iqr_upper <- iqr_values[2]
ggplot(career_data, aes(x = career_length)) +
geom_histogram(binwidth = 1, fill = "skyblue", color = "black", alpha = 0.7) +
geom_rect(aes(xmin = iqr_lower, xmax = iqr_upper, ymin = 0, ymax = Inf),
fill = "lightgreen", alpha = 0.01) +
labs(
title = "Distribution of NBA Player Career Lengths",
x = "Career Length (Years)",
y = "Number of Players"
) +
theme_minimal()
performance_by_age <- nba_data %>%
group_by(age) %>%
summarise(
avg_pts = mean(pts, na.rm = TRUE),
avg_reb = mean(reb, na.rm = TRUE),
avg_ast = mean(ast, na.rm = TRUE),
.groups = 'drop'
)
performance_long <- performance_by_age %>%
pivot_longer(cols = c(avg_pts, avg_reb, avg_ast),
names_to = "metric",
values_to = "value")
performance_long <- performance_long %>%
mutate(
metric = recode(metric,
"avg_pts" = "Points",
"avg_reb" = "Rebounds",
"avg_ast" = "Assists")
)
ggplot(performance_long, aes(x = age, y = value, color = metric)) +
geom_line(linewidth = 1.2) +
geom_point(size = 3) +
labs(
title = "Average NBA Player Performance by Age",
x = "Age",
y = "Average Performance",
color = "Metric"
) +
scale_color_manual(values = c("Points" = "lightblue", "Rebounds" = "lightgreen", "Assists" = "lightcoral")) +
theme_minimal() +
theme(legend.position = "top")
performance_by_age <- nba_data %>%
group_by(age) %>%
summarise(
avg_pts = mean(pts, na.rm = TRUE),
avg_reb = mean(reb, na.rm = TRUE),
avg_ast = mean(ast, na.rm = TRUE),
.groups = 'drop'
)
performance_long <- performance_by_age %>%
pivot_longer(cols = c(avg_pts, avg_reb, avg_ast),
names_to = "metric",
values_to = "value")
performance_long <- performance_long %>%
mutate(
metric = recode(metric,
"avg_pts" = "Points",
"avg_reb" = "Rebounds",
"avg_ast" = "Assists")
)
ggplot(performance_long, aes(x = age, y = value, color = metric)) +
geom_line(linewidth = 1.2) +
geom_point(size = 3) +
labs(
title = "Average NBA Player Performance by Age",
x = "Age",
y = "Average Performance",
color = "Metric"
) +
scale_color_manual(values = c("Points" = "lightblue", "Rebounds" = "lightgreen", "Assists" = "lightcoral")) +
theme_minimal() +
theme(legend.position = "top")
nba_data_clean_ppg <- nba_data %>%
filter(!is.na(pts) & !is.na(gp))
top_scorer_per_season <- nba_data_clean_ppg %>%
group_by(season) %>%
filter(pts == max(pts, na.rm = TRUE)) %>%
ungroup()
ggplot(top_scorer_per_season, aes(x = season, y = pts, color = player_name)) +
geom_line(aes(group = player_name), linewidth = 1) +
geom_point(size = 2) +
labs(
title = "Top Points Per Game Over the Seasons",
x = "Season",
y = "Points Per Game (PPG)",
color = "Player"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
nba_data_clean_avg_ppg <- nba_data %>%
filter(!is.na(pts) & pts > 0)
average_ppg_per_season <- nba_data_clean_avg_ppg %>%
group_by(season) %>%
summarise(avg_ppg = mean(pts, na.rm = TRUE)) %>%
ungroup()
ggplot(average_ppg_per_season, aes(x = season, y = avg_ppg)) +
geom_line(color = "lightblue") +
geom_point(color = "lightcoral", size = 2) +
labs(
title = "Trend of Average Points Per Game Per Player Across Seasons",
x = "Season",
y = "Average Points Per Game Per Player"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
nba_data_clean_r <- nba_data %>%
group_by(player_name) %>%
mutate(rookie_season = min(season)) %>%
ungroup()
nba_data_clean_r <- nba_data_clean_r %>%
mutate(
player_type = case_when(
season == rookie_season ~ "Rookie",
age >= 26 & age <= 30 ~ "Prime",
TRUE ~ "Other"
)
)
nba_data_filtered <- nba_data_clean_r %>%
filter(player_type %in% c("Rookie", "Prime"))
average_performance <- nba_data_filtered %>%
group_by(player_type) %>%
summarise(
avg_pts = mean(pts, na.rm = TRUE),
avg_reb = mean(reb, na.rm = TRUE),
avg_ast = mean(ast, na.rm = TRUE)
)
average_performance_long <- melt(average_performance, id.vars = "player_type")
average_performance_long$variable <- recode(average_performance_long$variable,
"avg_pts" = "Points",
"avg_reb" = "Rebounds",
"avg_ast" = "Assists"
)
ggplot(average_performance_long, aes(x = player_type, y = value, fill = variable)) +
geom_bar(stat = "identity", position = "dodge") +
facet_wrap(~ variable) +
labs(
title = "Comparison of Average Performance: Rookie vs Prime Players",
x = "Player Type",
y = "Average Value",
fill = "Metric"
) +
scale_fill_manual(values = c("Points" = "lightblue", "Rebounds" = "lightgreen", "Assists" = "lightcoral")) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 0, hjust = 0.5))
exceptional_rookies <- nba_data_clean_r %>%
filter(season == rookie_season) %>%
filter(
pts >= 15,
reb >= 6,
ast >= 5,
net_rating >= 3
)
exceptional_rookies_long <- melt(exceptional_rookies, id.vars = c("player_name", "season"),
measure.vars = c("pts", "reb", "ast", "net_rating"),
variable.name = "metric", value.name = "value")
exceptional_rookies_long$metric <- recode(exceptional_rookies_long$metric,
"pts" = "Points",
"reb" = "Rebounds",
"ast" = "Assists",
"net_rating" = "Net Rating"
)
ggplot(exceptional_rookies_long, aes(x = player_name, y = value, fill = metric)) +
geom_bar(stat = "identity", position = "dodge") +
coord_flip() +
labs(
title = "Exceptional Rookie Players",
x = "Player Name",
y = "Performance Value",
fill = "Metric"
) +
scale_fill_manual(values = c("Points" = "lightblue", "Rebounds" = "lightgreen", "Assists" = "lightcoral", "Net Rating" = "lightgray")) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
nba_data_clean_n <- nba_data %>%
filter(!is.na(net_rating))
nba_data_period_1 <- nba_data_clean_n %>%
filter(season >= 1996 & season <= 2004)
nba_data_period_2 <- nba_data_clean_n %>%
filter(season >= 2005 & season <= 2013)
nba_data_period_3 <- nba_data_clean_n %>%
filter(season >= 2014 & season <= 2022)
ggplot(nba_data_period_1, aes(x = factor(season), y = net_rating)) +
geom_boxplot(aes(color = factor(season)), fill = "lightyellow", alpha = 0.6) +
labs(
title = "Distribution of Player Net Rating (1996-2004)",
x = "Season",
y = "Net Rating",
color = "Season"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
ggplot(nba_data_period_2, aes(x = factor(season), y = net_rating)) +
geom_boxplot(aes(color = factor(season)), fill = "lightyellow", alpha = 0.6) +
labs(
title = "Distribution of Player Net Rating (2005-2013)",
x = "Season",
y = "Net Rating",
color = "Season"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
ggplot(nba_data_period_3, aes(x = factor(season), y = net_rating)) +
geom_boxplot(aes(color = factor(season)), fill = "lightyellow", alpha = 0.6) +
labs(
title = "Distribution of Player Net Rating (2014-2022)",
x = "Season",
y = "Net Rating",
color = "Season"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
high_net_rating_threshold <- quantile(nba_data_clean$net_rating, 0.95, na.rm = TRUE)
high_net_rating_players <- nba_data_clean_n %>%
filter(net_rating > high_net_rating_threshold)
high_net_rating_consistent_players <- high_net_rating_players %>%
group_by(player_name) %>%
summarize(seasons_with_high_rating = n(),
first_season = min(season),
last_season = max(season)) %>%
filter(seasons_with_high_rating >= 3)
high_net_rating_consistent_players <- high_net_rating_consistent_players %>%
mutate(season_range = paste(first_season, last_season, sep = "-"))
outlier_players <- high_net_rating_consistent_players %>%
select(player_name, season_range)
print(n = 24, outlier_players)