Introduction

NBA 2023-2024 Season

Source: NBA Official Logo, Wikimedia Commons (public domain)

About the Dataset and Why I Chose It

Basketball is something I have loved watching and playing for a long time. I have always wanted to know my own points per game, blocks, and other numbers, and that kind of record-keeping fascinates me. The NBA is a data-rich environment where every movement, every shot, and every play is recorded and analyzed. The NBA 2023-2024 Regular Season Player Statistics dataset offers a wonderful opportunity to apply data analysis techniques to something I genuinely care about. This dataset was sourced from Basketball-Reference via Kaggle and contains per-game statistics for all players who appeared in the 2023–2024 NBA regular season.

The dataset contains 735 rows and 30 variables. The variables include:

  • Categorical variables: Player (player name), Pos (position: PG, SG, SF, PF, C, plus a few combined values such as PF-C), Tm (team abbreviation, or TOT for a traded player's season total)
  • Quantitative variables: Age, G (games played), MP (minutes per game), PTS (points per game), TRB (total rebounds per game), AST (assists per game), STL (steals per game), BLK (blocks per game), FG% (field goal percentage), 3P% (three-point percentage), FT% (free throw percentage), TOV (turnovers per game), and more.

For this project, I focus on PTS, TRB, AST, MP, FG%, and Pos as my primary variables of interest. I am interested in understanding what factors best predict a player’s scoring output, and whether scoring patterns differ significantly across playing positions.

Data cleaning notes: The original CSV file used semicolons as delimiters (likely from a European-locale export), so I converted it to standard comma-delimited format in Excel using the Text to Columns tool. Some player names also contained special characters that displayed incorrectly (an encoding issue), which I handled by reading the file with locale(encoding = "latin1") in readr's read_csv(). Some players appear multiple times in the dataset because they were traded mid-season (their stats appear once per team and once as a “TOT” total). I filtered the dataset to keep only the “TOT” rows for traded players to avoid double-counting.
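The Excel conversion is one option; R can also read a semicolon-delimited file directly. Below is a minimal base R sketch, assuming a tiny made-up sample in the same layout (the player names and rows are placeholders, not records from the dataset):

```r
# Made-up sample in the semicolon-delimited layout described above (placeholder rows)
raw_text <- paste(
  "Player;Pos;Tm;PTS",
  "Sample Player A;PF-C;TOT;7.6",
  "Sample Player B;C;MIA;19.3",
  sep = "\n"
)

# sep = ";" splits fields on semicolons while decimals stay as "."
# (read.csv2() would also expect "," as the decimal mark, which is wrong for this layout)
sample_df <- read.csv(text = raw_text, sep = ";", stringsAsFactors = FALSE)

str(sample_df)
```

With readr, the counterpart would be read_delim() with delim = ";" and the same locale(encoding = "latin1") setting, applied to the original file instead of the Excel-converted copy.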


Step 1: Load Libraries

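# Load all necessary libraries for this project
library(tidyverse)     # core packages: readr, dplyr, ggplot2, tidyr, stringr
library(plotly)        # interactive scatter plot (ggplotly)
library(GGally)        # correlation matrix (ggpairs)
library(scales)
library(RColorBrewer)
# Step 6c also needs the maps package installed, because ggplot2::map_data() depends on it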

Step 2: Load the Dataset

nba <- readr::read_csv(
  "C:/Users/zyamj/Downloads/NBA_23-24.csv",
  locale = locale(encoding = "latin1"),
  show_col_types = FALSE
)

# Preview the structure
glimpse(nba)
## Rows: 735
## Columns: 30
## $ Rk     <dbl> 1, 1, 1, 2, 3, 3, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 14, …
## $ Player <chr> "Precious Achiuwa", "Precious Achiuwa", "Precious Achiuwa", "Ba…
## $ Pos    <chr> "PF-C", "C", "PF", "C", "SG", "SG", "SG", "PF", "SG", "SG", "C"…
## $ Age    <dbl> 24, 24, 24, 26, 23, 23, 23, 23, 25, 28, 25, 24, 25, 30, 29, 31,…
## $ Tm     <chr> "TOT", "TOR", "NYK", "MIA", "TOT", "UTA", "TOR", "MEM", "MIN", …
## $ G      <dbl> 74, 25, 49, 71, 78, 51, 27, 61, 82, 75, 77, 5, 56, 79, 73, 34, …
## $ GS     <dbl> 18, 0, 18, 71, 28, 10, 18, 35, 20, 74, 77, 0, 0, 10, 73, 0, 0, …
## $ MP     <dbl> 21.9, 17.5, 24.2, 34.0, 21.0, 19.7, 23.6, 26.5, 23.4, 33.5, 31.…
## $ FG     <dbl> 3.2, 3.1, 3.2, 7.5, 2.3, 2.1, 2.7, 4.0, 2.9, 4.5, 6.7, 1.2, 2.5…
## $ FGA    <dbl> 6.3, 6.8, 6.1, 14.3, 5.6, 4.9, 6.8, 9.3, 6.6, 9.1, 10.6, 4.6, 6…
## $ `FG%`  <dbl> 0.501, 0.459, 0.525, 0.521, 0.411, 0.426, 0.391, 0.435, 0.439, …
## $ `3P`   <dbl> 0.4, 0.5, 0.3, 0.2, 0.8, 0.9, 0.6, 1.7, 1.6, 2.7, 0.0, 0.0, 1.4…
## $ `3PA`  <dbl> 1.3, 1.9, 1.0, 0.6, 2.7, 2.8, 2.6, 5.0, 4.1, 5.9, 0.1, 1.4, 3.7…
## $ `3P%`  <dbl> 0.268, 0.277, 0.260, 0.357, 0.294, 0.331, 0.217, 0.349, 0.391, …
## $ `2P`   <dbl> 2.8, 2.6, 2.9, 7.3, 1.5, 1.2, 2.1, 2.3, 1.3, 1.8, 6.7, 1.2, 1.1…
## $ `2PA`  <dbl> 5.0, 4.9, 5.1, 13.7, 2.8, 2.1, 4.3, 4.3, 2.5, 3.2, 10.6, 3.2, 2…
## $ `2P%`  <dbl> 0.562, 0.528, 0.578, 0.528, 0.523, 0.551, 0.496, 0.534, 0.517, …
## $ `eFG%` <dbl> 0.529, 0.497, 0.547, 0.529, 0.483, 0.520, 0.432, 0.528, 0.560, …
## $ FT     <dbl> 0.9, 1.0, 0.9, 4.1, 0.5, 0.3, 0.8, 0.9, 0.6, 1.7, 3.0, 0.2, 0.6…
## $ FTA    <dbl> 1.5, 1.7, 1.4, 5.5, 0.7, 0.4, 1.3, 1.4, 0.8, 2.0, 4.1, 0.4, 0.9…
## $ `FT%`  <dbl> 0.616, 0.571, 0.643, 0.755, 0.661, 0.750, 0.611, 0.621, 0.800, …
## $ ORB    <dbl> 2.6, 2.0, 2.9, 2.2, 0.9, 0.7, 1.4, 1.2, 0.4, 0.6, 3.2, 0.8, 0.4…
## $ DRB    <dbl> 4.0, 3.4, 4.3, 8.1, 1.8, 1.8, 1.9, 4.6, 1.6, 3.3, 7.4, 2.6, 1.8…
## $ TRB    <dbl> 6.6, 5.4, 7.2, 10.4, 2.8, 2.5, 3.3, 5.8, 2.0, 3.9, 10.5, 3.4, 2…
## $ AST    <dbl> 1.3, 1.8, 1.1, 3.9, 1.1, 0.9, 1.3, 2.3, 2.5, 3.0, 2.7, 1.0, 2.1…
## $ STL    <dbl> 0.6, 0.6, 0.6, 1.1, 0.6, 0.5, 0.7, 0.7, 0.8, 0.9, 0.7, 0.8, 1.1…
## $ BLK    <dbl> 0.9, 0.5, 1.1, 0.9, 0.6, 0.6, 0.6, 0.9, 0.5, 0.6, 1.1, 0.0, 0.3…
## $ TOV    <dbl> 1.1, 1.2, 1.1, 2.3, 0.8, 0.7, 1.1, 1.1, 0.9, 1.3, 1.6, 0.4, 0.7…
## $ PF     <dbl> 1.9, 1.6, 2.1, 2.2, 1.5, 1.3, 1.9, 1.5, 1.7, 2.1, 1.9, 3.6, 1.6…
## $ PTS    <dbl> 7.6, 7.7, 7.6, 19.3, 5.8, 5.4, 6.7, 10.7, 8.0, 13.5, 16.5, 2.6,…

Step 3: Data Cleaning and Exploration

3a. Check for Missing Values

# Check how many NAs exist per column
colSums(is.na(nba))
##     Rk Player    Pos    Age     Tm      G     GS     MP     FG    FGA    FG% 
##      0      0      0      0      0      0      0      0      0      0      0 
##     3P    3PA    3P%     2P    2PA    2P%   eFG%     FT    FTA    FT%    ORB 
##      0      0      0      0      0      0      0      0      0      0      0 
##    DRB    TRB    AST    STL    BLK    TOV     PF    PTS 
##      0      0      0      0      0      0      0      0

3b. Handle Duplicate Player Rows (Traded Players)

# Players traded mid-season appear multiple times (once per team + "TOT" total row)
# We keep only "TOT" rows for traded players, and single rows for non-traded players

nba_clean <- nba %>%
  group_by(Player) %>%
  filter(
    # If player has a TOT row, keep only that row
    # Otherwise keep whatever single row they have
    (n() == 1) | (Tm == "TOT")
  ) %>%
  ungroup()

cat("Original rows:", nrow(nba), "\n")
## Original rows: 735
cat("After removing duplicates:", nrow(nba_clean), "\n")
## After removing duplicates: 572
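To see what the filter rule does, it helps to run the same logic on a toy table. The sketch below is a base R equivalent (using ave() instead of group_by()), built from the four rows for Precious Achiuwa and Bam Adebayo shown in the glimpse() output above; only the Player, Tm, and PTS columns are kept for brevity.

```r
# Toy table: one traded player (three rows) and one non-traded player (one row)
toy <- data.frame(
  Player = c("Precious Achiuwa", "Precious Achiuwa", "Precious Achiuwa", "Bam Adebayo"),
  Tm     = c("TOT", "TOR", "NYK", "MIA"),
  PTS    = c(7.6, 7.7, 7.6, 19.3),
  stringsAsFactors = FALSE
)

# Number of rows each player has, aligned to the rows of `toy`
rows_per_player <- ave(seq_along(toy$Player), toy$Player, FUN = length)

# Same rule as the dplyr filter: keep single-row players or the TOT row
toy_clean <- toy[rows_per_player == 1 | toy$Tm == "TOT", ]

# Sanity check: every player should now appear exactly once
stopifnot(!any(duplicated(toy_clean$Player)))
toy_clean
```

Two caveats about this rule: it treats Player as a unique key, so two different players who share a name would be merged into one group, and a player with several rows but no “TOT” row would be dropped entirely. A duplicated() check like the one above is a cheap way to confirm the first assumption holds on the real data.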

3c. Filter to Players with Meaningful Playing Time (dplyr filter)

# Use dplyr filter to keep only players who played at least 20 games
# and averaged at least 10 minutes per game — this excludes garbage-time players
# and ensures our analysis reflects meaningful contributors
nba_filtered <- nba_clean %>%
  filter(G >= 20, MP >= 10) %>%
  # Simplify multi-position players (e.g. "PF-C" -> "PF")
  mutate(Pos = str_extract(Pos, "^[A-Z]+"))

cat("After filtering (G >= 20, MP >= 10):", nrow(nba_filtered), "players\n")
## After filtering (G >= 20, MP >= 10): 399 players
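The position clean-up relies on the regular expression "^[A-Z]+", which matches the run of capital letters at the start of the string and stops at the hyphen. A small base R sketch of the same pattern follows; "SF-PF" is a hypothetical value added for illustration, while the others appear in the output above.

```r
# Raw position labels; "SF-PF" is a hypothetical combined value
pos_raw <- c("PF-C", "C", "SG", "SF-PF")

# regexpr() finds the leading run of capital letters; regmatches() extracts it
pos_primary <- regmatches(pos_raw, regexpr("^[A-Z]+", pos_raw))

pos_primary
```

Unlike str_extract(), regmatches() drops non-matching elements instead of returning NA. That difference does not matter here because every position label starts with a capital letter. Note that the rule always keeps the first-listed position, so a "PF-C" player is counted as a power forward.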

3d. Select Key Variables

# Use dplyr select to keep only the variables relevant to our analysis
nba_analysis <- nba_filtered %>%
  select(Player, Pos, Tm, Age, G, MP, PTS, TRB, AST, STL, BLK, TOV,
         `FG%`, `3P%`, `FT%`)

glimpse(nba_analysis)
## Rows: 399
## Columns: 15
## $ Player <chr> "Precious Achiuwa", "Bam Adebayo", "Ochai Agbaji", "Santi Aldam…
## $ Pos    <chr> "PF", "C", "SG", "PF", "SG", "SG", "C", "PG", "PF", "PF", "PG",…
## $ Tm     <chr> "TOT", "MIA", "TOT", "MEM", "MIN", "PHO", "CLE", "NOP", "MIN", …
## $ Age    <dbl> 24, 26, 23, 23, 25, 28, 25, 25, 30, 29, 23, 26, 23, 25, 21, 24,…
## $ G      <dbl> 74, 71, 78, 61, 82, 75, 77, 56, 79, 73, 81, 50, 75, 55, 22, 50,…
## $ MP     <dbl> 21.9, 34.0, 21.0, 26.5, 23.4, 33.5, 31.7, 18.4, 22.6, 35.2, 22.…
## $ PTS    <dbl> 7.6, 19.3, 5.8, 10.7, 8.0, 13.5, 16.5, 7.1, 6.4, 30.4, 11.6, 14…
## $ TRB    <dbl> 6.6, 10.4, 2.8, 5.8, 2.0, 3.9, 10.5, 2.3, 3.5, 11.5, 3.8, 4.2, …
## $ AST    <dbl> 1.3, 3.9, 1.1, 2.3, 2.5, 3.0, 2.7, 2.1, 4.2, 6.5, 2.9, 2.1, 3.8…
## $ STL    <dbl> 0.6, 1.1, 0.6, 0.7, 0.8, 0.9, 0.7, 1.1, 0.9, 1.2, 0.8, 1.4, 0.8…
## $ BLK    <dbl> 0.9, 0.9, 0.6, 0.9, 0.5, 0.6, 1.1, 0.3, 0.6, 1.1, 0.5, 0.7, 0.5…
## $ TOV    <dbl> 1.1, 2.3, 0.8, 1.1, 0.9, 1.3, 1.6, 0.7, 1.2, 3.4, 1.6, 1.6, 2.1…
## $ `FG%`  <dbl> 0.501, 0.521, 0.411, 0.435, 0.439, 0.499, 0.634, 0.412, 0.460, …
## $ `3P%`  <dbl> 0.268, 0.357, 0.294, 0.349, 0.391, 0.461, 0.000, 0.377, 0.229, …
## $ `FT%`  <dbl> 0.616, 0.755, 0.661, 0.621, 0.800, 0.878, 0.742, 0.673, 0.708, …

3e. Summarize by Position

# Use dplyr group_by + summarise to get average stats per position
nba_analysis %>%
  group_by(Pos) %>%
  summarise(
    Players  = n(),
    Avg_PTS  = round(mean(PTS, na.rm = TRUE), 1),
    Avg_TRB  = round(mean(TRB, na.rm = TRUE), 1),
    Avg_AST  = round(mean(AST, na.rm = TRUE), 1),
    Avg_MP   = round(mean(MP,  na.rm = TRUE), 1)
  ) %>%
  arrange(desc(Avg_PTS))
## # A tibble: 5 × 6
##   Pos   Players Avg_PTS Avg_TRB Avg_AST Avg_MP
##   <chr>   <int>   <dbl>   <dbl>   <dbl>  <dbl>
## 1 PG         77    12.2     3       4.4   24.4
## 2 PF         78    11.1     4.7     2.1   23.2
## 3 SG         86    10.8     3       2.5   22.9
## 4 C          73    10       6.5     1.8   21.7
## 5 SF         85    10       3.7     1.9   23.2
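Because average minutes differ slightly by position, a rough points-per-minute figure can make the comparison fairer. The sketch below recomputes it from the rounded values in the table above; it is a ratio of averages rather than an average of each player's own ratio, so treat it as an approximation.

```r
# Position averages copied from the summary table above
pos_summary <- data.frame(
  Pos     = c("PG", "PF", "SG", "C", "SF"),
  Avg_PTS = c(12.2, 11.1, 10.8, 10.0, 10.0),
  Avg_MP  = c(24.4, 23.2, 22.9, 21.7, 23.2)
)

# Ratio of averages: points scored per minute played
pos_summary$PTS_per_min <- round(pos_summary$Avg_PTS / pos_summary$Avg_MP, 3)

# Sort from highest to lowest scoring rate
pos_summary[order(-pos_summary$PTS_per_min), ]
```

On these rounded averages, point guards come out highest at about 0.50 points per minute and small forwards lowest at about 0.43, with the other three positions in between.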

Step 4: Exploratory Visualization

4a. Distribution of Points Per Game

# Simple histogram to explore scoring distribution
ggplot(nba_analysis, aes(x = PTS)) +
  geom_histogram(bins = 30, fill = "#1d428a", color = "white") +
  labs(title = "Distribution of Points Per Game (2023-24 NBA Season)",
       x = "Points Per Game", y = "Number of Players") +
  theme_minimal()

4b. Boxplot of PTS by Position

# Boxplot to compare scoring across positions
ggplot(nba_analysis, aes(x = Pos, y = PTS, fill = Pos)) +
  geom_boxplot() +
  labs(title = "Points Per Game by Position",
       x = "Position", y = "Points Per Game") +
  theme_minimal() +
  theme(legend.position = "none")

4c. Correlation Plot to Guide Regression Variable Selection

# Use GGally to create a correlation matrix of key quantitative variables
# This guides our variable selection for multiple linear regression
nba_analysis %>%
  select(PTS, MP, TRB, AST, `FG%`, TOV) %>%
  na.omit() %>%
  ggpairs(
    title = "Correlation Matrix — Key NBA Statistics",
    upper = list(continuous = wrap("cor", size = 3.5)),
    lower = list(continuous = wrap("points", alpha = 0.3, size = 0.8)),
    diag  = list(continuous = wrap("densityDiag", fill = "#1d428a", alpha = 0.5))
  ) +
  theme_minimal()

Interpretation: From the correlation matrix, MP (minutes per game), FG%, and AST all show meaningful positive correlations with PTS. These three variables will form the basis of our multiple linear regression model.


Step 5: Multiple Linear Regression

Research Question: Can we predict a player’s points per game (PTS) using minutes played (MP), field goal percentage (FG%), and assists per game (AST)?

Justification for variable selection: Based on the correlation plot above, MP shows the strongest correlation with PTS (r ≈ 0.88), which makes logical sense: players who play more minutes have more opportunities to score. FG% captures shooting efficiency, and AST reflects a player’s offensive involvement and ball-handling ability, both of which are associated with higher scoring.

5a. Build the Full Regression Model

# Full multiple linear regression model: PTS ~ MP + FG% + AST
model_full <- lm(PTS ~ MP + `FG%` + AST, data = nba_analysis)
summary(model_full)
## 
## Call:
## lm(formula = PTS ~ MP + `FG%` + AST, data = nba_analysis)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.6180 -1.9274  0.2163  1.7108 14.8154 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -7.56499    1.04823  -7.217 2.74e-12 ***
## MP           0.59463    0.02685  22.142  < 2e-16 ***
## `FG%`        5.42365    2.10892   2.572   0.0105 *  
## AST          0.82160    0.10923   7.522 3.68e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.995 on 395 degrees of freedom
## Multiple R-squared:  0.7985, Adjusted R-squared:  0.797 
## F-statistic: 521.7 on 3 and 395 DF,  p-value: < 2.2e-16

5b. Regression Equation and Interpretation

# Extract coefficients for the equation
coef(model_full)
## (Intercept)          MP       `FG%`         AST 
##  -7.5649923   0.5946280   5.4236500   0.8216002

Regression Equation:

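\[\hat{PTS} = \beta_0 + \beta_1 \cdot MP + \beta_2 \cdot FG\% + \beta_3 \cdot AST\]

With the estimated coefficients from the output above:

\[\hat{PTS} = -7.565 + 0.595 \cdot MP + 5.424 \cdot FG\% + 0.822 \cdot AST\]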

Interpretation:

  • MP: For every additional minute per game a player averages, predicted points per game increase by approximately 0.59, holding FG% and AST constant.
  • FG%: FG% is recorded as a proportion (0 to 1), so the coefficient of 5.42 means that a 0.10 increase in field goal percentage (10 percentage points) is associated with about 0.54 more points per game, holding the other variables constant. More efficient shooters score somewhat more.
  • AST: Each additional assist per game is associated with about 0.82 more points per game, holding the other variables constant. Players who handle the ball enough to collect assists tend to be primary offensive threats.
  • Adjusted R²: The model explains about 80% of the variance in points per game (adjusted R² = 0.797), confirming that minutes, efficiency, and playmaking together are strong predictors of scoring.
  • P-values: All three predictors are statistically significant at the 0.05 level. MP and AST have p-values far below 0.001, while FG% is the weakest predictor (p = 0.0105).
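As a worked example, the fitted equation can be applied by hand to a single player. The sketch below uses the coefficients printed by coef(model_full) and a hypothetical player profile (30 minutes, 47% shooting, 4 assists per game); the profile is illustrative and not a row from the dataset.

```r
# Coefficients as printed by coef(model_full) above
b <- c(intercept = -7.5649923, MP = 0.5946280, FG = 5.4236500, AST = 0.8216002)

# Hypothetical player profile (illustrative values only); FG is a proportion, not a percent
player <- c(MP = 30, FG = 0.47, AST = 4)

# Intercept plus the sum of coefficient * value for each predictor
pred_pts <- unname(b["intercept"] + sum(b[names(player)] * player))

round(pred_pts, 1)
```

This profile yields a prediction of about 16.1 points per game. With the fitted model object available, predict(model_full, newdata = ...) should give the same value, provided the new data frame keeps the column name FG% (for example by building it with check.names = FALSE).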

5c. Backward Elimination Check — Remove Least Significant Predictor

# Check if any variable should be dropped using backward elimination
# Start with full model, check p-values, drop if p > 0.05
# All variables are significant, so we keep the full model

# For completeness, also test a reduced model without AST
model_reduced <- lm(PTS ~ MP + `FG%`, data = nba_analysis)
summary(model_reduced)
## 
## Call:
## lm(formula = PTS ~ MP + `FG%`, data = nba_analysis)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.8686 -2.1789  0.1402  1.8948 15.9529 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -7.82909    1.11875  -6.998 1.11e-11 ***
## MP           0.73605    0.02048  35.948  < 2e-16 ***
## `FG%`        3.48739    2.23521   1.560     0.12    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.199 on 396 degrees of freedom
## Multiple R-squared:  0.7696, Adjusted R-squared:  0.7685 
## F-statistic: 661.4 on 2 and 396 DF,  p-value: < 2.2e-16
# Compare models
cat("Full model Adj R²:    ", round(summary(model_full)$adj.r.squared, 4), "\n")
## Full model Adj R²:     0.797
cat("Reduced model Adj R²: ", round(summary(model_reduced)$adj.r.squared, 4), "\n")
## Reduced model Adj R²:  0.7685
cat("Full model wins — keep all three predictors.\n")
## Full model wins — keep all three predictors.
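The adjusted R² comparison can be backed by a partial F-test for the dropped predictor. The sketch below recomputes that test from the rounded values printed in the two summaries above, so small rounding differences are expected.

```r
# Values copied from the two model summaries above
r2_full    <- 0.7985   # Multiple R-squared, full model
r2_reduced <- 0.7696   # Multiple R-squared, model without AST
df_resid   <- 395      # residual degrees of freedom of the full model
q          <- 1        # number of predictors dropped

# Partial F statistic for the nested-model comparison
f_stat <- ((r2_full - r2_reduced) / q) / ((1 - r2_full) / df_resid)

# Upper-tail p-value from the F distribution
p_val <- pf(f_stat, q, df_resid, lower.tail = FALSE)

# For a single dropped predictor, F should match the square of its t value
c(F = f_stat, t_squared = 7.522^2, p = p_val)
```

The F statistic comes out near 56.7, close to the square of the AST t value in the full model (7.522² ≈ 56.6), so dropping AST costs a statistically significant amount of fit. With the fitted objects, anova(model_reduced, model_full) runs the same test without rounding.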

Step 6: Final Visualizations

6a. Main Visualization — Interactive Scatter Plot (Plotly)

# Interactive scatter plot: MP vs PTS colored by Position
# Mouseover shows player name, team, and stats
p <- nba_analysis %>%
  filter(!is.na(`FG%`)) %>%
  ggplot(aes(
    x = MP, y = PTS,
    color = Pos,
    size = AST,
    text = paste0(
      "<b>", Player, "</b><br>",
      "Team: ", Tm, "<br>",
      "Position: ", Pos, "<br>",
      "PPG: ", PTS, "<br>",
      "MPG: ", MP, "<br>",
      "AST: ", AST, "<br>",
      "FG%: ", `FG%`
    )
  )) +
  geom_point(alpha = 0.75) +
  geom_smooth(method = "lm", se = FALSE, aes(group = 1),
              color = "gray30", linetype = "dashed", linewidth = 0.8) +
  scale_color_manual(
    values = c(
      "PG" = "#C8102E",   # red
      "SG" = "#1d428a",   # blue
      "SF" = "#00843D",   # green
      "PF" = "#F58426",   # orange
      "C"  = "#552583"    # purple
    )
  ) +
  scale_size_continuous(range = c(1.5, 6), name = "Assists Per Game") +
  labs(
    title    = "NBA 2023-24: Minutes Per Game vs. Points Per Game by Position",
    subtitle = "Bubble size = Assists per game | Dashed line = Linear trend",
    x        = "Minutes Per Game (MP)",
    y        = "Points Per Game (PTS)",
    color    = "Position",
    caption  = "Source: Basketball-Reference via Kaggle (2023-2024 NBA Regular Season)"
  ) +
  theme_bw(base_size = 13) +
  theme(
    plot.title    = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(color = "gray40", size = 11),
    legend.position = "right"
  )

# Convert to interactive plotly with tooltip
ggplotly(p, tooltip = "text") %>%
  layout(
    hoverlabel = list(bgcolor = "white", font = list(size = 12))
  )

What this visualization shows: There is a strong positive relationship between minutes per game and points per game — players who spend more time on the court score more. Point guards (red) and shooting guards (blue) tend to cluster at higher scoring outputs for their minutes, reflecting their offensive roles. Centers (purple) tend to score efficiently but with fewer minutes. The bubble size (assists) reveals that high-assist players (mostly PGs) are also among the higher scorers, which aligns with our regression model finding that AST is a significant predictor of PTS.

Interesting pattern: Some centers score very efficiently despite fewer minutes, while some guards log heavy minutes but score less — suggesting positional role matters as much as raw playing time.


6b. Grouped Bar Chart — Average Stats by Position

# Summarize and pivot for grouped bar chart
pos_stats <- nba_analysis %>%
  group_by(Pos) %>%
  summarise(
    Points   = mean(PTS, na.rm = TRUE),
    Rebounds = mean(TRB, na.rm = TRUE),
    Assists  = mean(AST, na.rm = TRUE)
  ) %>%
  pivot_longer(cols = c(Points, Rebounds, Assists),
               names_to = "Stat", values_to = "Value")

ggplot(pos_stats, aes(x = Pos, y = Value, fill = Stat)) +
  geom_col(position = "dodge", color = "white", width = 0.7) +
  scale_fill_manual(values = c(
    "Points"   = "#C8102E",
    "Rebounds" = "#1d428a",
    "Assists"  = "#00843D"
  )) +
  labs(
    title   = "Average Points, Rebounds, and Assists by NBA Position (2023-24 Season)",
    x       = "Position",
    y       = "Average Per Game",
    fill    = "Statistic",
    caption = "Source: Basketball-Reference via Kaggle (2023-2024 NBA Regular Season)"
  ) +
  theme_classic(base_size = 13) +
  theme(
    plot.title = element_text(face = "bold", size = 13),
    legend.position = "top"
  )


6c. Map — Average Points Per Game by State (NBA Teams)

# Map showing average team scoring by state
# First, map team abbreviations to states
# Codes follow Basketball-Reference (BRK, CHO, PHO); traded players' "TOT" rows
# have no single team, so they are left out of the map
team_to_state <- c(
  ATL = "georgia",        BOS = "massachusetts", BRK = "new york",
  CHO = "north carolina", CHI = "illinois",      CLE = "ohio",
  DAL = "texas",          DEN = "colorado",      DET = "michigan",
  GSW = "california",     HOU = "texas",         IND = "indiana",
  LAC = "california",     LAL = "california",    MEM = "tennessee",
  MIA = "florida",        MIL = "wisconsin",     MIN = "minnesota",
  NOP = "louisiana",      NYK = "new york",      OKC = "oklahoma",
  ORL = "florida",        PHI = "pennsylvania",  PHO = "arizona",
  POR = "oregon",         SAC = "california",    SAS = "texas",
  TOR = "ontario",        UTA = "utah",          WAS = "district of columbia"
)

# Calculate average PTS per team
team_pts <- nba_analysis %>%
  filter(Tm %in% names(team_to_state)) %>%
  group_by(Tm) %>%
  summarise(avg_pts = mean(PTS, na.rm = TRUE)) %>%
  mutate(state = team_to_state[Tm])

# Aggregate by state (some states have multiple teams)
state_pts <- team_pts %>%
  group_by(state) %>%
  summarise(avg_pts = mean(avg_pts))

# Get US map data
us_map <- map_data("state")

# Join map with team data
map_data_joined <- us_map %>%
  left_join(state_pts, by = c("region" = "state"))

# Plot the map
ggplot(map_data_joined, aes(x = long, y = lat, group = group, fill = avg_pts)) +
  geom_polygon(color = "white", linewidth = 0.3) +
  coord_fixed(1.3) +
  scale_fill_gradientn(
    colors   = c("#d0e8ff", "#1d428a", "#C8102E"),
    na.value = "gray85",
    name     = "Avg PPG\nby Player"
  ) +
  labs(
    title   = "NBA 2023-24: Average Points Per Game by Player, Aggregated by State",
    subtitle = "Darker = higher average player scoring in that state's NBA team(s)",
    caption  = "Source: Basketball-Reference via Kaggle | *Toronto (TOR) excluded (Canada)",
    x = NULL, y = NULL
  ) +
  theme_void(base_size = 12) +
  theme(
    plot.title    = element_text(face = "bold", size = 13, hjust = 0.5),
    plot.subtitle = element_text(color = "gray40", size = 10, hjust = 0.5),
    legend.position = "right"
  )
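Because the filter with %in% and the left_join() both drop unmatched values silently, it is worth checking which team codes and regions fail to match. The sketch below uses a toy lookup and toy codes; "XYZ" is a made-up code, and us_states is a stand-in for unique(us_map$region).

```r
# Toy lookup and team codes (hypothetical subset; "XYZ" is a made-up code)
lookup     <- c(MIA = "florida", DEN = "colorado", TOR = "ontario")
team_codes <- c("MIA", "DEN", "TOT", "XYZ", "TOR")

# Codes present in the data but absent from the lookup: dropped by the %in% filter
unmatched_codes <- setdiff(team_codes, names(lookup))

# Lookup regions with no polygon in the US state map: never drawn after the join
us_states <- c("florida", "colorado")   # stand-in for unique(us_map$region)
unmatched_states <- setdiff(unname(lookup), us_states)

unmatched_codes
unmatched_states
```

On the real data, the equivalent check would be setdiff(unique(nba_analysis$Tm), names(team_to_state)). Anything other than "TOT" in that result would point to a team abbreviation that does not match the lookup table.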


Step 7: Essay — Findings and Reflections

What the Visualizations Represent

The three visualizations together tell an interesting story about NBA player performance in the 2023-2024 season. The interactive scatter plot reveals the strong linear relationship between playing time and scoring output, a relationship that is intuitive and is confirmed quantitatively by the regression model (adjusted R² ≈ 0.80). The color-coding by position shows that point guards and shooting guards dominate the high-minutes, high-scoring quadrant, which is consistent with the modern NBA’s shift toward perimeter-dominant offense.

The grouped bar chart by position reveals expected role distinctions: centers lead in rebounds, point guards lead in assists, and shooting guards and small forwards score at similar rates. Interestingly, power forwards show a balanced profile — scoring, rebounding, and facilitating — which reflects the evolution of the “stretch big” in modern basketball.

The map visualization shows geographic clustering of scoring talent, with California (Lakers, Clippers, Warriors, Kings) and Texas (Mavericks, Rockets, Spurs) standing out as states with higher average player scoring — likely due to large-market teams attracting star players who tend to have higher scoring averages.

Surprises: I was surprised by how much of the variance in scoring is explained by just three variables (MP, FG%, AST). The regression model’s adjusted R² of about 0.80 suggests that these three predictors alone capture most of the story. Playing time (MP) is by far the strongest predictor, while playmaking (AST) and shooting efficiency (FG%) add smaller but statistically significant contributions.

Things I wish I could have included: I would have liked to include a shot chart showing where players score from on the court, but that would require play-by-play coordinate data not included in this dataset. I also attempted a rolling-average time series but the per-game dataset doesn’t include game dates.

Bibliography / Sources

  • Dataset: Basketball-Reference, via Kaggle. 2023-2024 NBA Player Stats. https://www.kaggle.com/datasets/vivovinco/2023-2024-nba-player-stats
  • NBA Logo image: Wikimedia Commons, public domain. https://en.wikipedia.org/wiki/NBA
  • R packages used: tidyverse, ggplot2, dplyr, plotly, GGally, scales, RColorBrewer
  • AI assistance: Claude (Anthropic) was used to help structure the R Markdown document and suggest code patterns. All code was reviewed and adapted to match techniques from class notes.
  • Coding references: Class notes from DATA 110, Montgomery College (2026)