NBA R Module Deliverable

Introduction

In my R Deliverable I analyzed an NBA Players Database. Using this database, I created multiple data visualizations that tell a story across many years of the NBA.

Dataset

setwd("C:/Users/Kyle/Downloads/Data_Visualization")
filename <- "PlayerIndex_nba_stats.csv"
df <- read.csv(filename)
# -------------------------------
library(dplyr)
library(ggplot2) 
library(scales) 
library(RColorBrewer)
library(lubridate)
library(ggthemes)
library(ggrepel)
library(plotly)

The NBA Players Database was found on Kaggle and is a curated data set that compiles player information sourced from NBA.com. It was designed to give analysts a structured, ready-to-use collection of NBA players, past and present, for basketball analytics, visualizations, and machine learning tasks.

The data set includes information about:

Player Biographical Information
Career Timeline
Team Information
Player Attributes
Draft Information
Player Performance

Findings

At a high level, the data is a comprehensive snapshot of NBA players, including biographical details, physical attributes, draft information, and team associations. The data is cleanly structured, with each row representing a unique player and each column capturing a specific attribute such as height, weight, position, and draft year. There are some missing values, such as draft round and pick for undrafted players, but this is typical for NBA data.

There is wide variety in player backgrounds, including differences in country of origin, college attendance, positions, stats, and physical measurements.

The data set spans many draft years and career lengths, allowing for observations about how player characteristics and entry paths into the league have changed over time.

Players are linked to a broad range of NBA teams, enabling comparisons of roster composition and positional distribution across franchises.

The data is generally clean, with missing values appearing only in predictable areas, making it ready for visualization and deeper analysis.

Tab 1

The descriptive statistics show that this data set covers nearly the full history of the NBA, with 5,025 players whose careers start as early as 1946 and run through 2024. A major pattern is the large amount of missing draft information: DRAFT_YEAR is missing for 1,325 players (about 26%) and DRAFT_NUMBER for 1,591 (about 32%), which likely reflects undrafted signings and early-era league structure. ROSTER_STATUS is missing for 4,491 players and equals 1 everywhere else, so it appears to flag only players on a current roster. Physical attributes are mostly complete (WEIGHT is missing for 53 players), though HEIGHT is stored as text such as "6-10", which stands out as a data-quality issue. Team information is stable, with only about 5% of players tied to defunct franchises (IS_DEFUNCT), and the distributions of points, rebounds, and assists (median 5.0 points per game) align with the reality that most NBA players historically have been role players rather than stars. Overall, the data set is broad and detailed, with a few unique characteristics worth noting before moving into visualizations.

head(df)

##   PERSON_ID PLAYER_LAST_NAME PLAYER_FIRST_NAME         PLAYER_SLUG    TEAM_ID
## 1     76001        Abdelnaby              Alaa      alaa-abdelnaby 1610612757
## 2     76002       Abdul-Aziz              Zaid     zaid-abdul-aziz 1610612745
## 3     76003     Abdul-Jabbar            Kareem kareem-abdul-jabbar 1610612747
## 4        51       Abdul-Rauf           Mahmoud  mahmoud-abdul-rauf 1610612743
## 5      1505      Abdul-Wahad             Tariq   tariq-abdul-wahad 1610612758
## 6       949      Abdur-Rahim           Shareef shareef-abdur-rahim 1610612763
##   TEAM_SLUG IS_DEFUNCT   TEAM_CITY     TEAM_NAME TEAM_ABBREVIATION
## 1   blazers          0    Portland Trail Blazers               POR
## 2   rockets          0     Houston       Rockets               HOU
## 3    lakers          0 Los Angeles        Lakers               LAL
## 4   nuggets          0      Denver       Nuggets               DEN
## 5     kings          0  Sacramento         Kings               SAC
## 6 grizzlies          0     Memphis     Grizzlies               MEM
##   JERSEY_NUMBER POSITION HEIGHT WEIGHT         COLLEGE COUNTRY DRAFT_YEAR
## 1            30        F   6-10    240            Duke     USA       1990
## 2            54        C    6-9    235      Iowa State     USA       1968
## 3            33        C    7-2    225            UCLA     USA       1969
## 4             1        G    6-1    162 Louisiana State     USA       1990
## 5             9      F-G    6-6    235  San Jose State  France       1997
## 6             3        F    6-9    245      California     USA       1996
##   DRAFT_ROUND DRAFT_NUMBER ROSTER_STATUS  PTS  REB AST STATS_TIMEFRAME
## 1           1           25            NA  5.7  3.3 0.3          Career
## 2           1            5            NA  9.0  8.0 1.2          Career
## 3           1            1            NA 24.6 11.2 3.6          Career
## 4           1            3            NA 14.6  1.9 3.5          Career
## 5           1           11            NA  7.8  3.3 1.1          Career
## 6           1            3            NA 18.1  7.5 2.5          Career
##   FROM_YEAR TO_YEAR
## 1      1990    1994
## 2      1968    1977
## 3      1969    1988
## 4      1990    2000
## 5      1997    2003
## 6      1996    2007

summary(df)

##    PERSON_ID       PLAYER_LAST_NAME   PLAYER_FIRST_NAME  PLAYER_SLUG       
##  Min.   :      2   Length:5025        Length:5025        Length:5025       
##  1st Qu.:  76174   Class :character   Class :character   Class :character  
##  Median :  77722   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 383646                                                           
##  3rd Qu.: 202951                                                           
##  Max.   :1642530                                                           
##                                                                            
##     TEAM_ID           TEAM_SLUG           IS_DEFUNCT       TEAM_CITY        
##  Min.   :1.611e+09   Length:5025        Min.   :0.00000   Length:5025       
##  1st Qu.:1.611e+09   Class :character   1st Qu.:0.00000   Class :character  
##  Median :1.611e+09   Mode  :character   Median :0.00000   Mode  :character  
##  Mean   :1.611e+09                      Mean   :0.05294                     
##  3rd Qu.:1.611e+09                      3rd Qu.:0.00000                     
##  Max.   :1.611e+09                      Max.   :1.00000                     
##                                                                             
##   TEAM_NAME         TEAM_ABBREVIATION  JERSEY_NUMBER        POSITION        
##  Length:5025        Length:5025        Length:5025        Length:5025       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##     HEIGHT              WEIGHT        COLLEGE            COUNTRY         
##  Length:5025        Min.   :133.0   Length:5025        Length:5025       
##  Class :character   1st Qu.:190.0   Class :character   Class :character  
##  Mode  :character   Median :210.0   Mode  :character   Mode  :character  
##                     Mean   :211.4                                        
##                     3rd Qu.:230.0                                        
##                     Max.   :360.0                                        
##                     NA's   :53                                           
##    DRAFT_YEAR    DRAFT_ROUND      DRAFT_NUMBER    ROSTER_STATUS 
##  Min.   :1947   Min.   : 0.000   Min.   :  0.00   Min.   :1     
##  1st Qu.:1974   1st Qu.: 1.000   1st Qu.: 12.00   1st Qu.:1     
##  Median :1990   Median : 2.000   Median : 26.00   Median :1     
##  Mean   :1990   Mean   : 2.077   Mean   : 32.02   Mean   :1     
##  3rd Qu.:2009   3rd Qu.: 2.000   3rd Qu.: 43.00   3rd Qu.:1     
##  Max.   :2024   Max.   :20.000   Max.   :221.00   Max.   :1     
##  NA's   :1325   NA's   :1523     NA's   :1591     NA's   :4491  
##       PTS              REB              AST         STATS_TIMEFRAME   
##  Min.   : 0.000   Min.   : 0.000   Min.   : 0.000   Length:5025       
##  1st Qu.: 2.800   1st Qu.: 1.300   1st Qu.: 0.500   Class :character  
##  Median : 5.000   Median : 2.400   Median : 1.000   Mode  :character  
##  Mean   : 6.293   Mean   : 2.934   Mean   : 1.432                     
##  3rd Qu.: 8.500   3rd Qu.: 3.900   3rd Qu.: 1.900                     
##  Max.   :32.700   Max.   :22.900   Max.   :11.600                     
##  NA's   :24       NA's   :316      NA's   :24                         
##    FROM_YEAR       TO_YEAR    
##  Min.   :1946   Min.   :1946  
##  1st Qu.:1974   1st Qu.:1978  
##  Median :1993   Median :2000  
##  Mean   :1991   Mean   :1995  
##  3rd Qu.:2012   3rd Qu.:2017  
##  Max.   :2024   Max.   :2024  
##

str(df)

## 'data.frame':    5025 obs. of  26 variables:
##  $ PERSON_ID        : int  76001 76002 76003 51 1505 949 76005 76006 76007 203518 ...
##  $ PLAYER_LAST_NAME : chr  "Abdelnaby" "Abdul-Aziz" "Abdul-Jabbar" "Abdul-Rauf" ...
##  $ PLAYER_FIRST_NAME: chr  "Alaa" "Zaid" "Kareem" "Mahmoud" ...
##  $ PLAYER_SLUG      : chr  "alaa-abdelnaby" "zaid-abdul-aziz" "kareem-abdul-jabbar" "mahmoud-abdul-rauf" ...
##  $ TEAM_ID          : int  1610612757 1610612745 1610612747 1610612743 1610612758 1610612763 1610612744 1610612755 1610610031 1610612760 ...
##  $ TEAM_SLUG        : chr  "blazers" "rockets" "lakers" "nuggets" ...
##  $ IS_DEFUNCT       : int  0 0 0 0 0 0 0 0 1 0 ...
##  $ TEAM_CITY        : chr  "Portland" "Houston" "Los Angeles" "Denver" ...
##  $ TEAM_NAME        : chr  "Trail Blazers" "Rockets" "Lakers" "Nuggets" ...
##  $ TEAM_ABBREVIATION: chr  "POR" "HOU" "LAL" "DEN" ...
##  $ JERSEY_NUMBER    : chr  "30" "54" "33" "1" ...
##  $ POSITION         : chr  "F" "C" "C" "G" ...
##  $ HEIGHT           : chr  "6-10" "6-9" "7-2" "6-1" ...
##  $ WEIGHT           : int  240 235 225 162 235 245 220 180 195 190 ...
##  $ COLLEGE          : chr  "Duke" "Iowa State" "UCLA" "Louisiana State" ...
##  $ COUNTRY          : chr  "USA" "USA" "USA" "USA" ...
##  $ DRAFT_YEAR       : num  1990 1968 1969 1990 1997 ...
##  $ DRAFT_ROUND      : num  1 1 1 1 1 1 3 NA NA 2 ...
##  $ DRAFT_NUMBER     : num  25 5 1 3 11 3 43 NA NA 32 ...
##  $ ROSTER_STATUS    : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ PTS              : num  5.7 9 24.6 14.6 7.8 18.1 5.6 0 9.5 5.3 ...
##  $ REB              : num  3.3 8 11.2 1.9 3.3 7.5 3.2 1 NA 1.4 ...
##  $ AST              : num  0.3 1.2 3.6 3.5 1.1 2.5 1.2 1 0.7 0.5 ...
##  $ STATS_TIMEFRAME  : chr  "Career" "Career" "Career" "Career" ...
##  $ FROM_YEAR        : int  1990 1968 1969 1990 1997 1996 1976 1956 1946 2016 ...
##  $ TO_YEAR          : int  1994 1977 1988 2000 2003 2007 1980 1956 1947 2018 ...

colSums(is.na(df))

##         PERSON_ID  PLAYER_LAST_NAME PLAYER_FIRST_NAME       PLAYER_SLUG 
##                 0                 0                 0                 0 
##           TEAM_ID         TEAM_SLUG        IS_DEFUNCT         TEAM_CITY 
##                 0                 0                 0                 0 
##         TEAM_NAME TEAM_ABBREVIATION     JERSEY_NUMBER          POSITION 
##                 0                 0                 0                 0 
##            HEIGHT            WEIGHT           COLLEGE           COUNTRY 
##                 0                53                 0                 0 
##        DRAFT_YEAR       DRAFT_ROUND      DRAFT_NUMBER     ROSTER_STATUS 
##              1325              1523              1591              4491 
##               PTS               REB               AST   STATS_TIMEFRAME 
##                24               316                24                 0 
##         FROM_YEAR           TO_YEAR 
##                 0                 0
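
The missing-value counts above are easier to compare as percentages. The following is a minimal sketch rather than part of the original analysis: `missing_share()` is a hypothetical helper name, and `sample_df` holds only the first ten DRAFT_ROUND, PTS, and REB values shown in the str(df) output.

```r
# Hypothetical helper: percentage of missing values per column, largest first
missing_share <- function(data) {
  sort(round(100 * colMeans(is.na(data)), 1), decreasing = TRUE)
}

# Illustrative rows only: the first ten values shown in str(df)
sample_df <- data.frame(
  DRAFT_ROUND = c(1, 1, 1, 1, 1, 1, 3, NA, NA, 2),
  PTS         = c(5.7, 9.0, 24.6, 14.6, 7.8, 18.1, 5.6, 0.0, 9.5, 5.3),
  REB         = c(3.3, 8.0, 11.2, 1.9, 3.3, 7.5, 3.2, 1.0, NA, 1.4)
)

na_pct <- missing_share(sample_df)
print(na_pct)
```

Calling `missing_share(df)` on the full data frame would give the same kind of summary for all 26 columns.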

Tab 2

Top 10 NBA Scorers by Average

This bar chart highlights the top ten players in the data set based on their scoring average (the PTS column, in points per game). Presenting the bars horizontally makes it easy to compare players side by side and quickly identify the highest scorers. The chart shows a mix of historical legends and modern stars, illustrating how elite scoring spans eras. Shai Gilgeous-Alexander appears at the top with the highest average, while players like Michael Jordan, Wilt Chamberlain, and Giannis Antetokounmpo reinforce the consistency of all-time great scorers.

df$PTS <- as.numeric(df$PTS)

# PTS is already a per-game average with one row per player, so rank the rows
# directly; summing by name would add together players who share a name
player_pts <- df %>%
  mutate(Player = paste(PLAYER_FIRST_NAME, PLAYER_LAST_NAME)) %>%
  filter(!is.na(PTS)) %>%
  arrange(desc(PTS)) %>%
  slice_head(n = 10)

# reorder() by PTS so the highest scorer lands at the top after coord_flip()
ggplot(player_pts, aes(x = reorder(Player, PTS), y = PTS)) +
  geom_col(colour = "black", fill = "lightblue") +
  geom_text(aes(label = round(PTS, 1)),
            hjust = -0.1,
            size = 4) +
  labs(title = "Top 10 NBA Players by Scoring Average",
       x = "Player",
       y = "Points per Game") +
  theme(plot.title = element_text(hjust = 0.5)) +
  coord_flip()
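
Before comparing scoring averages across players, it may be worth checking that every row measures the same thing. The head(df) output shows STATS_TIMEFRAME as "Career" for the first rows; whether every row uses that value is not confirmed here. The sketch below uses three real rows from head(df) plus one made-up row with a hypothetical "Season" value, purely to show the check.

```r
# Three rows from head(df) plus one made-up row (hypothetical "Season" value)
toy <- data.frame(
  Player          = c("Kareem Abdul-Jabbar", "Shareef Abdur-Rahim",
                      "Mahmoud Abdul-Rauf", "Hypothetical Player"),
  PTS             = c(24.6, 18.1, 14.6, 30.0),
  STATS_TIMEFRAME = c("Career", "Career", "Career", "Season")
)

# Count how many rows fall under each timeframe
print(table(toy$STATS_TIMEFRAME))

# Rank only the rows that report career figures
career_only <- toy[toy$STATS_TIMEFRAME == "Career", ]
top_scorers <- head(career_only[order(-career_only$PTS), ], 3)
print(top_scorers)
```

If `table(df$STATS_TIMEFRAME)` on the full data shows a single value, no filtering is needed and the ranking can be used as is.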

Tab 3

Average Points by Team over the Last 5 Years

This line chart compares average player scoring by team across the last five entry years in the data set. Because the code filters on FROM_YEAR, each line covers the players whose first season falls in that year, grouped by their listed team, rather than a team's full-season points per game. Plotting all five years together highlights both consistent patterns and noticeable shifts: some teams sit higher in most years, others fluctuate, and a few remain relatively stable. The color-coded lines make it easy to compare years, while the team abbreviations along the x-axis provide a quick reference for identifying each franchise.

df$FROM_YEAR <- as.numeric(df$FROM_YEAR)
df$PTS <- as.numeric(df$PTS)

# Identify last 5 years in the dataset
max_year <- max(df$FROM_YEAR, na.rm = TRUE)
years_to_keep <- seq(max_year, max_year - 4, by = -1)

# Keep players whose first season (FROM_YEAR) is in the last 5 years,
# then average PTS by year and team
pts_team_df <- df %>%
  filter(FROM_YEAR %in% years_to_keep,
         !is.na(TEAM_ABBREVIATION)) %>%
  group_by(FROM_YEAR, TEAM_ABBREVIATION) %>%
  summarise(avg_pts = mean(PTS, na.rm = TRUE), .groups = "keep") %>%
  data.frame()

pts_team_df$FROM_YEAR <- as.factor(pts_team_df$FROM_YEAR)

ggplot(pts_team_df, aes(x = TEAM_ABBREVIATION, y = avg_pts, group = FROM_YEAR)) +
  geom_line(aes(color = FROM_YEAR), linewidth = 2) +  # linewidth replaces size for lines in ggplot2 >= 3.4
  geom_point(shape = 21, size = 4, color = "black", fill = "white") +
  labs(title = "Average Points by Team (Last 5 Years)",
       x = "Team",
       y = "Average Points") +
  theme_light() +
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.x = element_text(angle = 60, hjust = 1, size = 8),
        plot.margin = margin(20, 20, 20, 20)) +
  scale_y_continuous(labels = comma) +
  scale_color_brewer(palette = "Paired",
                     name = "Year",
                     guide = guide_legend(reverse = TRUE))
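
Each point in this chart is a mean over however many players share a year and team, and that number can be small. As a rough sketch (the rows and the `min_players` threshold below are made up for illustration), the group size can be computed alongside the average so that thinly supported points can be flagged or dropped.

```r
library(dplyr)

# Made-up rows for illustration; the column names match the data set
toy <- data.frame(
  FROM_YEAR         = c(2023, 2023, 2023, 2024, 2024),
  TEAM_ABBREVIATION = c("LAL", "LAL", "DEN", "LAL", "DEN"),
  PTS               = c(4, 8, 10, 6, NA)
)

team_year <- toy %>%
  group_by(FROM_YEAR, TEAM_ABBREVIATION) %>%
  summarise(
    n_players = sum(!is.na(PTS)),         # players with a recorded PTS value
    avg_pts   = mean(PTS, na.rm = TRUE),  # NaN when n_players is 0
    .groups   = "drop"
  )

# Keep only averages backed by at least min_players values (assumed threshold)
min_players <- 2
reliable <- filter(team_year, n_players >= min_players)

print(team_year)
print(reliable)
```

The same `n_players` column could instead be mapped to point size in the plot, which keeps every team visible while showing how much data sits behind each value.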

Tab 4

Which Decade Averaged the Most Points per Player?

This line chart shows how players' average scoring (PTS, in points per game) varies with the decade in which they entered the league. The trend peaks in the 1960s, where entrants posted the highest average, before gradually declining in later decades. The rise and fall across eras may reflect changes in league style, pace, and player roles over time. Each point is labeled with its decade's average; the chart as coded does not draw error bars, so it does not show how much scoring varied within each decade.

# Convert columns
df$FROM_YEAR <- as.numeric(df$FROM_YEAR)
df$PTS <- as.numeric(df$PTS)

# Create decade variable
df <- df %>%
  mutate(decade = floor(FROM_YEAR / 10) * 10)

# Summarize average points by decade
decade_pts <- df %>%
  group_by(decade) %>%
  summarise(avg_pts = mean(PTS, na.rm = TRUE))

# Plot with labels rounded to one decimal place
ggplot(decade_pts, aes(x = decade, y = avg_pts)) +
  geom_line(linewidth = 1.2, color = "blue") +
  geom_point(size = 3, color = "blue") +
  geom_text_repel(aes(label = round(avg_pts, 1)),
    size = 4, nudge_y = 0.5, color = "black") +
  theme_minimal() +
  labs(title = "Average Points per Game by Decade Entered",
    x = "Decade",
    y = "Average Points per Game")
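
One way to show variability within each decade would be to add error bars of one standard error around the mean. The following is a sketch under stated assumptions, not the code behind the chart above: `sample_df` holds only the first ten FROM_YEAR and PTS values from the str(df) output, and the choice of standard error (rather than standard deviation or a confidence interval) is an assumption.

```r
library(dplyr)
library(ggplot2)

# Illustrative rows only: the first ten values shown in str(df)
sample_df <- data.frame(
  FROM_YEAR = c(1990, 1968, 1969, 1990, 1997, 1996, 1976, 1956, 1946, 2016),
  PTS       = c(5.7, 9.0, 24.6, 14.6, 7.8, 18.1, 5.6, 0.0, 9.5, 5.3)
)

decade_summary <- sample_df %>%
  mutate(decade = floor(FROM_YEAR / 10) * 10) %>%
  group_by(decade) %>%
  summarise(
    n       = sum(!is.na(PTS)),
    avg_pts = mean(PTS, na.rm = TRUE),
    se_pts  = sd(PTS, na.rm = TRUE) / sqrt(n),  # NA when a decade has one player
    .groups = "drop"
  )

# Error bars span one standard error either side of the mean;
# decades with an NA standard error are drawn without a bar
p <- ggplot(decade_summary, aes(x = decade, y = avg_pts)) +
  geom_line(color = "blue") +
  geom_point(size = 3, color = "blue") +
  geom_errorbar(aes(ymin = avg_pts - se_pts, ymax = avg_pts + se_pts),
                width = 2, na.rm = TRUE) +
  theme_minimal() +
  labs(x = "Decade", y = "Average Points per Game")
p
```

On the full data frame every decade has many players, so each point would get a bar and the bar lengths could be compared across eras.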

Tab 5

Do Taller Players Score More?

This scatterplot explores whether taller players tend to score more by comparing height (in inches) with career scoring average. There appears to be a slight upward trend, and many high-scoring players fall in the mid-to-upper height range, but the plot makes it clear that height alone does not determine scoring ability. Some of the league's greatest scorers, like Michael Jordan and Allen Iverson, are not among the tallest players, while several extremely tall players cluster at lower scoring averages. The labeled points help highlight notable outliers across the height spectrum.

# Convert HEIGHT from "6-10" format to inches
df <- df %>%
  mutate(
    height_ft = as.numeric(sub("-.*", "", HEIGHT)),
    height_in = as.numeric(sub(".*-", "", HEIGHT)),
    height_total = height_ft * 12 + height_in)

# Create player name for labeling
df <- df %>%
  mutate(Player = paste(PLAYER_FIRST_NAME, PLAYER_LAST_NAME))

# Scatterplot of height vs. scoring average; ggrepel drops labels that
# would overlap too many others (max.overlaps)
ggplot(df, aes(x = height_total, y = PTS)) +
  geom_point(color = "lightblue", alpha = 0.6, size = 3) +
  geom_text_repel(aes(label = Player),
    size = 3,
    max.overlaps = 20) +
  theme_minimal() +
  labs(title = "Do Taller Players Score More?",
    subtitle = "Scatterplot of Height vs Career Points per Game",
    x = "Height (inches)",
    y = "Career Points per Game")
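
The strength of the height trend can also be checked numerically. The sketch below is illustrative only: `height_to_inches()` is a hypothetical helper that wraps the same conversion used above, and `sample_df` contains just the six players shown in head(df), so the resulting numbers say nothing about the full data set.

```r
# Hypothetical helper: convert "feet-inches" strings such as "6-10" to inches
height_to_inches <- function(height) {
  feet   <- as.numeric(sub("-.*", "", height))
  inches <- as.numeric(sub(".*-", "", height))
  feet * 12 + inches
}

# Illustrative rows only: the six players shown in head(df)
sample_df <- data.frame(
  HEIGHT = c("6-10", "6-9", "7-2", "6-1", "6-6", "6-9"),
  PTS    = c(5.7, 9.0, 24.6, 14.6, 7.8, 18.1)
)
sample_df$height_total <- height_to_inches(sample_df$HEIGHT)

# Pearson correlation and a simple linear fit of PTS on height
height_cor <- cor(sample_df$height_total, sample_df$PTS, use = "complete.obs")
height_fit <- lm(PTS ~ height_total, data = sample_df)

print(height_cor)
print(coef(height_fit))
```

On the full data frame, adding `geom_smooth(method = "lm")` to the ggplot call would draw the same kind of linear fit directly on the scatterplot.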

Tab 6

Do We See an Increase in International Players?

This donut chart compares the distribution of NBA players by country for players who entered the league in the 2000s versus the 2020s, highlighting how the league's international presence has evolved. Each ring shows the USA, that decade's five most common other countries, and an "Other" group. The inner ring shows that the 2000s were overwhelmingly dominated by U.S. players, with nearly 80% coming from the United States. In the 2020s, shown in the outer ring, that share decreases slightly and a broader mix of international talent appears. Countries like France, Canada, Australia, Nigeria, Serbia, and Spain all show meaningful representation, reflecting the NBA's growing global reach.

# 2000s
top5_2000s <- df %>%
  filter(decade == 2000, COUNTRY != "USA") %>%
  count(COUNTRY, sort = TRUE) %>%
  slice_head(n = 5) %>%
  pull(COUNTRY)

# 2020s
top5_2020s <- df %>%
  filter(decade == 2020, COUNTRY != "USA") %>%
  count(COUNTRY, sort = TRUE) %>%
  slice_head(n = 5) %>%
  pull(COUNTRY)

# Collapse every country outside the USA and the decade's top five into "Other"
country_2000s <- df %>%
  filter(decade == 2000) %>%
  mutate(COUNTRY = ifelse(COUNTRY %in% c("USA", top5_2000s), COUNTRY, "Other")) %>%
  count(COUNTRY)

country_2020s <- df %>%
  filter(decade == 2020) %>%
  mutate(COUNTRY = ifelse(COUNTRY %in% c("USA", top5_2020s), COUNTRY, "Other")) %>%
  count(COUNTRY)

fig <- plot_ly(hole = 0.7) %>%
  layout(title = "NBA Player Country Distribution (2000s vs 2020s)") %>%
  add_trace(
    data = country_2020s,
    labels = ~COUNTRY,
    values = ~n,
    type = "pie",
    textposition = "inside",
    hovertemplate = "Decade: 2020s<br>Country:%{label}<br>Percent:%{percent}<br>Count:%{value}<extra></extra>") %>%
  add_trace(
    data = country_2000s,
    labels = ~COUNTRY,
    values = ~n,
    type = "pie",
    textposition = "inside",
    hovertemplate = "Decade: 2000s<br>Country:%{label}<br>Percent:%{percent}<br>Count:%{value}<extra></extra>",
    domain = list(
      x = c(0.16, 0.84),
      y = c(0.16, 0.84)))

fig
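
The shift between decades can also be stated as a single number per decade: the percentage of players whose COUNTRY is not "USA". The following is a minimal sketch with made-up rows; `intl_share_by_decade()` is a hypothetical helper name, and the percentages it prints come from the toy data, not from the real data set.

```r
# Hypothetical helper: percentage of non-USA players per entry decade
intl_share_by_decade <- function(data) {
  decade <- floor(data$FROM_YEAR / 10) * 10
  shares <- tapply(data$COUNTRY != "USA", decade, mean)
  setNames(round(100 * as.numeric(shares), 1), names(shares))
}

# Made-up rows for illustration; the column names match the data set
toy <- data.frame(
  FROM_YEAR = c(2001, 2004, 2008, 2009, 2021, 2022, 2023, 2024),
  COUNTRY   = c("USA", "USA", "USA", "France", "USA", "Canada", "USA", "Serbia")
)

intl_pct <- intl_share_by_decade(toy)
print(intl_pct)
```

Running `intl_share_by_decade(df)` on the full data would give the international share for every decade, which makes the 2000s and 2020s comparison easy to read alongside the chart.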

Conclusion

Taken together, these visualizations show how NBA player performance and league demographics have evolved over time. Scoring trends vary by era and team, but elite production consistently comes from players of many different sizes and backgrounds. Height alone doesn’t determine scoring ability, and while the league remains U.S.‑dominated, international representation has clearly expanded in recent decades. Overall, the data set highlights a league that is historically diverse, constantly changing, and shaped by a wide range of player profiles and playing styles.