In my R Deliverable, I analyzed an NBA players database. Using this database, I created multiple data visualizations that tell a story across many years of the NBA.
setwd("C:/Users/Kyle/Downloads/Data_Visualization")
filename <- "PlayerIndex_nba_stats.csv"
df <- read.csv(filename)
# -------------------------------
library(dplyr)
library(ggplot2)
library(scales)
library(RColorBrewer)
library(lubridate)
library(ggthemes)
library(ggrepel)
library(plotly)
The NBA Players Database was found on Kaggle and is a curated data set that compiles player information sourced from NBA.com. It was designed to give analysts a structured, ready-to-use collection of NBA players, past and present, for basketball analytics, visualizations, and machine learning tasks.
Looking at the structure and the information in the data set, it includes information about:
Players Biographical Information
Career Timeline
Team Information
Player Attributes
Draft Information
Player Performance
At a high level, the data is a comprehensive snapshot of NBA players, including biographical details, physical attributes, draft information, and team associations. It is cleanly structured, with each row representing a unique player and each column capturing a specific attribute such as height, weight, position, or draft year. There are some missing values, such as draft round and pick for undrafted players, but this is typical for NBA data.
There is wide variety in player backgrounds, including differences in country of origin, college attendance, position, stats, and physical measurements.
The data set spans many draft years and career lengths, allowing for observations about how player characteristics and entry paths into the league have changed over time.
Players are linked to a broad range of NBA teams, enabling comparisons of roster composition and positional distribution across franchises.
The data is generally clean, with missing values appearing only in predictable areas, making it ready for visualization and deeper analysis.
The descriptive statistics show that this data set covers nearly the full history of the NBA, with players dating back to 1946 and extending through 2024. A major pattern is the large amount of missing draft information: DRAFT_YEAR, DRAFT_ROUND, and DRAFT_NUMBER are missing for roughly 26% to 32% of the 5,025 players, which reflects undrafted signings and early-era league structure. Physical attributes are mostly consistent, though height is stored as text (for example "6-10"), which stands out as a data-quality issue. Team information is stable, with only about 5% of players tied to defunct franchises, and the distributions of points, rebounds, and assists (the median player averages 5.0 points) align with the reality that most NBA players historically have been role players rather than stars. Overall, the data set is broad and detailed, with a few unique characteristics worth noting before moving into visualizations.
head(df)
## PERSON_ID PLAYER_LAST_NAME PLAYER_FIRST_NAME PLAYER_SLUG TEAM_ID
## 1 76001 Abdelnaby Alaa alaa-abdelnaby 1610612757
## 2 76002 Abdul-Aziz Zaid zaid-abdul-aziz 1610612745
## 3 76003 Abdul-Jabbar Kareem kareem-abdul-jabbar 1610612747
## 4 51 Abdul-Rauf Mahmoud mahmoud-abdul-rauf 1610612743
## 5 1505 Abdul-Wahad Tariq tariq-abdul-wahad 1610612758
## 6 949 Abdur-Rahim Shareef shareef-abdur-rahim 1610612763
## TEAM_SLUG IS_DEFUNCT TEAM_CITY TEAM_NAME TEAM_ABBREVIATION
## 1 blazers 0 Portland Trail Blazers POR
## 2 rockets 0 Houston Rockets HOU
## 3 lakers 0 Los Angeles Lakers LAL
## 4 nuggets 0 Denver Nuggets DEN
## 5 kings 0 Sacramento Kings SAC
## 6 grizzlies 0 Memphis Grizzlies MEM
## JERSEY_NUMBER POSITION HEIGHT WEIGHT COLLEGE COUNTRY DRAFT_YEAR
## 1 30 F 6-10 240 Duke USA 1990
## 2 54 C 6-9 235 Iowa State USA 1968
## 3 33 C 7-2 225 UCLA USA 1969
## 4 1 G 6-1 162 Louisiana State USA 1990
## 5 9 F-G 6-6 235 San Jose State France 1997
## 6 3 F 6-9 245 California USA 1996
## DRAFT_ROUND DRAFT_NUMBER ROSTER_STATUS PTS REB AST STATS_TIMEFRAME
## 1 1 25 NA 5.7 3.3 0.3 Career
## 2 1 5 NA 9.0 8.0 1.2 Career
## 3 1 1 NA 24.6 11.2 3.6 Career
## 4 1 3 NA 14.6 1.9 3.5 Career
## 5 1 11 NA 7.8 3.3 1.1 Career
## 6 1 3 NA 18.1 7.5 2.5 Career
## FROM_YEAR TO_YEAR
## 1 1990 1994
## 2 1968 1977
## 3 1969 1988
## 4 1990 2000
## 5 1997 2003
## 6 1996 2007
summary(df)
## PERSON_ID PLAYER_LAST_NAME PLAYER_FIRST_NAME PLAYER_SLUG
## Min. : 2 Length:5025 Length:5025 Length:5025
## 1st Qu.: 76174 Class :character Class :character Class :character
## Median : 77722 Mode :character Mode :character Mode :character
## Mean : 383646
## 3rd Qu.: 202951
## Max. :1642530
##
## TEAM_ID TEAM_SLUG IS_DEFUNCT TEAM_CITY
## Min. :1.611e+09 Length:5025 Min. :0.00000 Length:5025
## 1st Qu.:1.611e+09 Class :character 1st Qu.:0.00000 Class :character
## Median :1.611e+09 Mode :character Median :0.00000 Mode :character
## Mean :1.611e+09 Mean :0.05294
## 3rd Qu.:1.611e+09 3rd Qu.:0.00000
## Max. :1.611e+09 Max. :1.00000
##
## TEAM_NAME TEAM_ABBREVIATION JERSEY_NUMBER POSITION
## Length:5025 Length:5025 Length:5025 Length:5025
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## HEIGHT WEIGHT COLLEGE COUNTRY
## Length:5025 Min. :133.0 Length:5025 Length:5025
## Class :character 1st Qu.:190.0 Class :character Class :character
## Mode :character Median :210.0 Mode :character Mode :character
## Mean :211.4
## 3rd Qu.:230.0
## Max. :360.0
## NA's :53
## DRAFT_YEAR DRAFT_ROUND DRAFT_NUMBER ROSTER_STATUS
## Min. :1947 Min. : 0.000 Min. : 0.00 Min. :1
## 1st Qu.:1974 1st Qu.: 1.000 1st Qu.: 12.00 1st Qu.:1
## Median :1990 Median : 2.000 Median : 26.00 Median :1
## Mean :1990 Mean : 2.077 Mean : 32.02 Mean :1
## 3rd Qu.:2009 3rd Qu.: 2.000 3rd Qu.: 43.00 3rd Qu.:1
## Max. :2024 Max. :20.000 Max. :221.00 Max. :1
## NA's :1325 NA's :1523 NA's :1591 NA's :4491
## PTS REB AST STATS_TIMEFRAME
## Min. : 0.000 Min. : 0.000 Min. : 0.000 Length:5025
## 1st Qu.: 2.800 1st Qu.: 1.300 1st Qu.: 0.500 Class :character
## Median : 5.000 Median : 2.400 Median : 1.000 Mode :character
## Mean : 6.293 Mean : 2.934 Mean : 1.432
## 3rd Qu.: 8.500 3rd Qu.: 3.900 3rd Qu.: 1.900
## Max. :32.700 Max. :22.900 Max. :11.600
## NA's :24 NA's :316 NA's :24
## FROM_YEAR TO_YEAR
## Min. :1946 Min. :1946
## 1st Qu.:1974 1st Qu.:1978
## Median :1993 Median :2000
## Mean :1991 Mean :1995
## 3rd Qu.:2012 3rd Qu.:2017
## Max. :2024 Max. :2024
##
str(df)
## 'data.frame': 5025 obs. of 26 variables:
## $ PERSON_ID : int 76001 76002 76003 51 1505 949 76005 76006 76007 203518 ...
## $ PLAYER_LAST_NAME : chr "Abdelnaby" "Abdul-Aziz" "Abdul-Jabbar" "Abdul-Rauf" ...
## $ PLAYER_FIRST_NAME: chr "Alaa" "Zaid" "Kareem" "Mahmoud" ...
## $ PLAYER_SLUG : chr "alaa-abdelnaby" "zaid-abdul-aziz" "kareem-abdul-jabbar" "mahmoud-abdul-rauf" ...
## $ TEAM_ID : int 1610612757 1610612745 1610612747 1610612743 1610612758 1610612763 1610612744 1610612755 1610610031 1610612760 ...
## $ TEAM_SLUG : chr "blazers" "rockets" "lakers" "nuggets" ...
## $ IS_DEFUNCT : int 0 0 0 0 0 0 0 0 1 0 ...
## $ TEAM_CITY : chr "Portland" "Houston" "Los Angeles" "Denver" ...
## $ TEAM_NAME : chr "Trail Blazers" "Rockets" "Lakers" "Nuggets" ...
## $ TEAM_ABBREVIATION: chr "POR" "HOU" "LAL" "DEN" ...
## $ JERSEY_NUMBER : chr "30" "54" "33" "1" ...
## $ POSITION : chr "F" "C" "C" "G" ...
## $ HEIGHT : chr "6-10" "6-9" "7-2" "6-1" ...
## $ WEIGHT : int 240 235 225 162 235 245 220 180 195 190 ...
## $ COLLEGE : chr "Duke" "Iowa State" "UCLA" "Louisiana State" ...
## $ COUNTRY : chr "USA" "USA" "USA" "USA" ...
## $ DRAFT_YEAR : num 1990 1968 1969 1990 1997 ...
## $ DRAFT_ROUND : num 1 1 1 1 1 1 3 NA NA 2 ...
## $ DRAFT_NUMBER : num 25 5 1 3 11 3 43 NA NA 32 ...
## $ ROSTER_STATUS : num NA NA NA NA NA NA NA NA NA NA ...
## $ PTS : num 5.7 9 24.6 14.6 7.8 18.1 5.6 0 9.5 5.3 ...
## $ REB : num 3.3 8 11.2 1.9 3.3 7.5 3.2 1 NA 1.4 ...
## $ AST : num 0.3 1.2 3.6 3.5 1.1 2.5 1.2 1 0.7 0.5 ...
## $ STATS_TIMEFRAME : chr "Career" "Career" "Career" "Career" ...
## $ FROM_YEAR : int 1990 1968 1969 1990 1997 1996 1976 1956 1946 2016 ...
## $ TO_YEAR : int 1994 1977 1988 2000 2003 2007 1980 1956 1947 2018 ...
colSums(is.na(df))
## PERSON_ID PLAYER_LAST_NAME PLAYER_FIRST_NAME PLAYER_SLUG
## 0 0 0 0
## TEAM_ID TEAM_SLUG IS_DEFUNCT TEAM_CITY
## 0 0 0 0
## TEAM_NAME TEAM_ABBREVIATION JERSEY_NUMBER POSITION
## 0 0 0 0
## HEIGHT WEIGHT COLLEGE COUNTRY
## 0 53 0 0
## DRAFT_YEAR DRAFT_ROUND DRAFT_NUMBER ROSTER_STATUS
## 1325 1523 1591 4491
## PTS REB AST STATS_TIMEFRAME
## 24 316 24 0
## FROM_YEAR TO_YEAR
## 0 0
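The raw missing-value counts are easier to compare as percentages of the row count. The following is a minimal sketch in base R, run on a small made-up data frame (`toy` is hypothetical and stands in for `df`):

```r
# Toy data frame standing in for df; the values are made up for illustration.
toy <- data.frame(
  DRAFT_YEAR  = c(1990, NA, 1969, NA),
  DRAFT_ROUND = c(1, NA, 1, NA),
  PTS         = c(5.7, 9.0, NA, 14.6),
  WEIGHT      = c(240, 235, 225, 162)
)

# colMeans() of a logical matrix gives the fraction of TRUE values per column,
# so multiplying by 100 yields the percentage of missing entries.
pct_missing <- round(colMeans(is.na(toy)) * 100, 1)

# Show the columns with the most missing values first
sort(pct_missing, decreasing = TRUE)
```

Applied to `df`, the same two lines would turn a count such as 1325 missing DRAFT_YEAR values out of 5025 rows into a share of about 26.4%.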
Top 10 NBA Scorers by Average
This bar chart highlights the top ten players in the data set based on their scoring average (the PTS column, in points per game). Presenting the bars horizontally makes it easy to compare players side by side and quickly identify the most prolific scorers. The chart shows a mix of historical legends and modern stars, illustrating how elite scoring spans across eras. Shai Gilgeous-Alexander appears at the top with the highest average, while players like Michael Jordan, Wilt Chamberlain, and Giannis Antetokounmpo reinforce the consistency of all-time great scorers.
# PTS holds a per-game scoring average, so it is averaged rather than summed
df$PTS <- as.numeric(df$PTS)
# Group by PERSON_ID as well as name so players who share a name are not merged
player_pts <- df %>%
  filter(!is.na(PTS)) %>%
  mutate(Player = paste(PLAYER_FIRST_NAME, PLAYER_LAST_NAME)) %>%
  group_by(PERSON_ID, Player) %>%
  summarise(AvgPTS = mean(PTS), .groups = "drop") %>%
  arrange(desc(AvgPTS))
# reorder() by AvgPTS puts the highest average at the top after coord_flip()
ggplot(player_pts[1:10, ], aes(x = reorder(Player, AvgPTS), y = AvgPTS)) +
  geom_bar(stat = "identity", colour = "black", fill = "lightblue") +
  geom_text(aes(label = round(AvgPTS, 1)),
            hjust = -0.1,
            size = 4) +
  labs(title = "Top 10 NBA Players by Scoring Average",
       x = "Player",
       y = "Points per Game") +
  theme(plot.title = element_text(hjust = 0.5)) +
  coord_flip()
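Because players are labelled by a pasted first and last name, it can be worth checking whether any name maps to more than one PERSON_ID before ranking. The following is a minimal sketch in base R on made-up rows (`toy` and its names are hypothetical, not taken from the data set):

```r
# Toy rows standing in for df; the names and IDs are made up for illustration.
toy <- data.frame(
  PERSON_ID         = c(1, 2, 3),
  PLAYER_FIRST_NAME = c("Alex", "Alex", "Sam"),
  PLAYER_LAST_NAME  = c("Smith", "Smith", "Jones")
)
toy$Player <- paste(toy$PLAYER_FIRST_NAME, toy$PLAYER_LAST_NAME)

# Count the distinct PERSON_ID values behind each display name
ids_per_name <- tapply(toy$PERSON_ID, toy$Player,
                       function(x) length(unique(x)))

# Names shared by more than one player
shared_names <- names(ids_per_name)[ids_per_name > 1]
shared_names
```

If `shared_names` comes back non-empty on the real data, any summary grouped only by name would combine different players.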
Average Points by Team over the Last 5 Years
This line chart compares the average points per game of each NBA team's players across the last five years in the data set. Players are grouped by the year they entered the league (FROM_YEAR), so each line summarizes the players who debuted in that year rather than a team's full-season scoring. By plotting all five years together, the visualization highlights both consistent trends and noticeable shifts: some teams show steady improvement, others fluctuate, and a few remain relatively stable across all years. The color-coded lines make it easy to track how each year compares, while the team abbreviations along the x-axis provide a quick reference for identifying which franchises tend to score more or less over time.
df$FROM_YEAR <- as.numeric(df$FROM_YEAR)
df$PTS <- as.numeric(df$PTS)
# Identify last 5 years in the dataset
max_year <- max(df$FROM_YEAR, na.rm = TRUE)
years_to_keep <- seq(max_year, max_year - 4, by = -1)
# Keep players whose first season (FROM_YEAR) falls in the last 5 years,
# then average PTS within each year and team
pts_team_df <- df %>%
  filter(FROM_YEAR %in% years_to_keep,
         !is.na(TEAM_ABBREVIATION)) %>%
  group_by(FROM_YEAR, TEAM_ABBREVIATION) %>%
  summarise(avg_pts = mean(PTS, na.rm = TRUE), .groups = "drop") %>%
  data.frame()
pts_team_df$FROM_YEAR <- as.factor(pts_team_df$FROM_YEAR)
# linewidth replaces size for line geoms in ggplot2 3.4.0 and later
ggplot(pts_team_df, aes(x = TEAM_ABBREVIATION, y = avg_pts, group = FROM_YEAR)) +
  geom_line(aes(color = FROM_YEAR), linewidth = 2) +
  geom_point(shape = 21, size = 4, color = "black", fill = "white") +
  labs(title = "Average Points by Team (Last 5 Years)",
       x = "Team",
       y = "Average Points") +
  theme_light() +
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.x = element_text(angle = 60, hjust = 1, size = 8),
        plot.margin = margin(20, 20, 20, 20)) +
  scale_y_continuous(labels = comma) +
  scale_color_brewer(palette = "Paired",
                     name = "Year",
                     guide = guide_legend(reverse = TRUE))
Which Decade Averaged the Most Points per Player?
This line chart shows how players' average points per game vary with the decade in which they entered the league. The trend peaks in the 1960s, when players posted the highest average, before gradually declining in later decades. The slight rise and fall across eras reflects changes in league style, pace, and player roles over time. The error bars help illustrate the variability within each decade, showing that some eras had wider differences in player scoring than others.
# Convert columns
df$FROM_YEAR <- as.numeric(df$FROM_YEAR)
df$PTS <- as.numeric(df$PTS)
# Create decade variable (for example, 1996 becomes 1990)
df <- df %>%
  mutate(decade = floor(FROM_YEAR / 10) * 10)
# Summarize average points by decade
decade_pts <- df %>%
  group_by(decade) %>%
  summarise(avg_pts = mean(PTS, na.rm = TRUE), .groups = "drop")
# Plot with labels rounded to one decimal place
ggplot(decade_pts, aes(x = decade, y = avg_pts)) +
  geom_line(linewidth = 1.2, color = "blue") +
  geom_point(size = 3, color = "blue") +
  geom_text_repel(aes(label = round(avg_pts, 1)),
                  size = 4, nudge_y = 0.5, color = "black") +
  theme_minimal() +
  labs(title = "Average Career Points by Decade Entered",
       x = "Decade",
       y = "Average Career Points")
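The text mentions error bars, and one way to draw them is with `geom_errorbar()` on a per-decade summary. The sketch below assumes the bars span the mean plus or minus one standard error; the report does not say which spread measure was used, so that choice is an assumption. It runs on a small made-up data frame (`toy`, hypothetical) rather than `df`:

```r
library(dplyr)
library(ggplot2)

# Toy data standing in for df; the values are made up for illustration.
toy <- data.frame(
  decade = c(1960, 1960, 1960, 1970, 1970, 1970),
  PTS    = c(4, 8, 12, 3, 5, 7)
)

# Mean, standard deviation, and count of non-missing PTS values per decade
decade_summary <- toy %>%
  group_by(decade) %>%
  summarise(
    avg_pts = mean(PTS, na.rm = TRUE),
    sd_pts  = sd(PTS, na.rm = TRUE),
    n       = sum(!is.na(PTS)),
    .groups = "drop"
  ) %>%
  mutate(se_pts = sd_pts / sqrt(n))  # standard error of the mean

# Error bars span the mean plus or minus one standard error (an assumption)
p <- ggplot(decade_summary, aes(x = decade, y = avg_pts)) +
  geom_line(color = "blue") +
  geom_point(size = 3, color = "blue") +
  geom_errorbar(aes(ymin = avg_pts - se_pts, ymax = avg_pts + se_pts),
                width = 2) +
  theme_minimal()
p
```

Swapping `se_pts` for `sd_pts` in the `geom_errorbar()` call would show the spread of individual players instead of the uncertainty of the decade mean.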
Do Taller Players Score More?
This scatterplot explores whether taller players tend to score more by comparing height (in inches) to career scoring average. There is a slight upward trend, and many high-scoring players fall in the mid-to-upper height range, but the plot makes it clear that height alone does not determine scoring ability. Some of the league's greatest scorers, like Michael Jordan and Allen Iverson, are not among the tallest players, while several extremely tall players cluster at lower scoring averages. The labeled points help highlight notable outliers across the height spectrum.
# Convert HEIGHT from "6-10" (feet-inches) format to total inches
df <- df %>%
  mutate(
    height_ft = as.numeric(sub("-.*", "", HEIGHT)),
    height_in = as.numeric(sub(".*-", "", HEIGHT)),
    height_total = height_ft * 12 + height_in)
# Create player name for labeling
df <- df %>%
  mutate(Player = paste(PLAYER_FIRST_NAME, PLAYER_LAST_NAME))
# Scatterplot with a linear trend line
ggplot(df, aes(x = height_total, y = PTS)) +
  geom_point(color = "lightblue", alpha = 0.6, size = 3) +
  geom_smooth(method = "lm", se = FALSE, color = "darkblue") +
  geom_text_repel(aes(label = Player),
                  size = 3,
                  max.overlaps = 20) +
  theme_minimal() +
  labs(title = "Do Taller Players Score More?",
       subtitle = "Scatterplot of Height vs Career Points per Game",
       x = "Height (inches)",
       y = "Career Points per Game")
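The "slight upward trend" can also be checked numerically with a correlation coefficient. The following is a minimal sketch in base R; `height_to_inches()` is a hypothetical helper written for this example, and `toy` is a made-up data frame standing in for `df`:

```r
# Hypothetical helper: convert "feet-inches" strings such as "6-10" to inches.
# Anything that does not split into exactly two parts becomes NA.
height_to_inches <- function(h) {
  parts <- strsplit(as.character(h), "-", fixed = TRUE)
  vapply(parts, function(p) {
    if (length(p) != 2) return(NA_real_)
    as.numeric(p[1]) * 12 + as.numeric(p[2])
  }, numeric(1))
}

# Toy data standing in for df; the values are made up for illustration.
toy <- data.frame(
  HEIGHT = c("6-10", "6-9", "7-2", "6-1", ""),
  PTS    = c(5.7, 9.0, 24.6, 14.6, 3.0)
)
toy$height_total <- height_to_inches(toy$HEIGHT)

# Pearson correlation, ignoring rows where height or points are missing
r <- cor(toy$height_total, toy$PTS, use = "complete.obs")
r
```

A value of `r` near zero on the real data would support the reading that height on its own explains little of the variation in scoring.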
Do We See an Increase in International Players?
This donut chart compares the distribution of NBA players by country in the 2000s versus the 2020s, highlighting how the league’s international presence has evolved. The inner ring shows that the NBA in the 2000s was overwhelmingly dominated by U.S.‑born players, with nearly 80% of the league coming from the United States. In the 2020s, that share decreases slightly as the outer ring reveals a broader mix of international talent. Countries like France, Canada, Australia, Nigeria, Serbia, and Spain all show meaningful representation, reflecting the NBA’s growing global reach.
# 2000s
top5_2000s <- df %>%
  filter(decade == 2000, COUNTRY != "USA") %>%
  count(COUNTRY, sort = TRUE) %>%
  slice_head(n = 5) %>%
  pull(COUNTRY)
# 2020s
top5_2020s <- df %>%
  filter(decade == 2020, COUNTRY != "USA") %>%
  count(COUNTRY, sort = TRUE) %>%
  slice_head(n = 5) %>%
  pull(COUNTRY)
country_2000s <- df %>%
  filter(decade == 2000) %>%
  mutate(COUNTRY = ifelse(COUNTRY %in% c("USA", top5_2000s), COUNTRY, "Other")) %>%
  count(COUNTRY)
country_2020s <- df %>%
  filter(decade == 2020) %>%
  mutate(COUNTRY = ifelse(COUNTRY %in% c("USA", top5_2020s), COUNTRY, "Other")) %>%
  count(COUNTRY)
fig <- plot_ly(hole = 0.7) %>%
  layout(title = "NBA Player Country Distribution (2000s vs 2020s)") %>%
  add_trace(
    data = country_2020s,
    labels = ~COUNTRY,
    values = ~n,
    type = "pie",
    textposition = "inside",
    hovertemplate = "Decade: 2020s<br>Country:%{label}<br>Percent:%{percent}<br>Count:%{value}<extra></extra>") %>%
  add_trace(
    data = country_2000s,
    labels = ~COUNTRY,
    values = ~n,
    type = "pie",
    textposition = "inside",
    hovertemplate = "Decade: 2000s<br>Country:%{label}<br>Percent:%{percent}<br>Count:%{value}<extra></extra>",
    domain = list(
      x = c(0.16, 0.84),
      y = c(0.16, 0.84)))
fig
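The shift in international representation can also be summarized as a single percentage per decade. The following is a minimal sketch in base R on made-up rows (`toy` is hypothetical and stands in for `df` with its `decade` and `COUNTRY` columns):

```r
# Toy rows standing in for df; the values are made up for illustration.
toy <- data.frame(
  decade  = c(2000, 2000, 2000, 2000, 2020, 2020, 2020, 2020),
  COUNTRY = c("USA", "USA", "USA", "France", "USA", "USA", "Canada", "Serbia")
)

# Flag players whose listed country is not the USA
toy$international <- toy$COUNTRY != "USA"

# The mean of a logical vector is the share of TRUE values,
# so this gives the percentage of international players per decade
intl_share <- round(tapply(toy$international, toy$decade, mean) * 100, 1)
intl_share
```

Run on the real data, the same calculation would give the exact shares behind the two rings of the donut chart.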
Taken together, these visualizations show how NBA player performance and league demographics have evolved over time. Scoring trends vary by era and team, but elite production consistently comes from players of many different sizes and backgrounds. Height alone doesn’t determine scoring ability, and while the league remains U.S.‑dominated, international representation has clearly expanded in recent decades. Overall, the data set highlights a league that is historically diverse, constantly changing, and shaped by a wide range of player profiles and playing styles.