Source: NBA Official Logo, Wikimedia Commons (public domain)
Basketball is something I have loved watching and playing for a long time. I have always wanted to know my own numbers, such as my points per game and how many blocks I have made, and that kind of record-keeping fascinates me. The NBA is a data-rich environment where every movement, every shot, and every play is recorded and analyzed. The NBA 2023-2024 Regular Season Player Statistics dataset offers a wonderful opportunity to apply data analysis techniques to something I genuinely care about. This dataset was sourced from Basketball-Reference via Kaggle and contains per-game statistics for all players who appeared in the 2023–2024 NBA regular season.
The dataset contains 735 rows and 30 variables. The variables include: Player (player name), Pos (position: PG, SG, SF, PF, C), Tm (team abbreviation), Age, G (games played), MP (minutes per game), PTS (points per game), TRB (total rebounds per game), AST (assists per game), STL (steals per game), BLK (blocks per game), FG% (field goal percentage), 3P% (three-point percentage), FT% (free throw percentage), TOV (turnovers per game), and more.
For this project, I focus on PTS, TRB, AST, MP, FG%, and Pos as my primary variables of interest. I am interested in understanding what factors best predict a player’s scoring output, and whether scoring patterns differ significantly across playing positions.
Data cleaning notes: The original CSV file used semicolons as delimiters (likely from a European locale export), so I converted it to standard comma-delimited format in Excel using the Text to Columns tool. The file also contained special characters in some player names (an encoding issue), which I handled by passing locale(encoding = "latin1") to readr::read_csv() in R. Some players appear multiple times in the dataset because they were traded mid-season (their stats appear once per team and once as a “TOT” total). I filtered the dataset to keep only the “TOT” rows for traded players to avoid double-counting.
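As an alternative to the Excel step, the semicolon-delimited file could probably be read directly in R. The sketch below is illustrative only: it writes a tiny temporary file (the player "José Example" and his numbers are hypothetical, and the byte 0xE9 stands in for the Latin-1 encoding issue) and reads it back with readr::read_delim(), assuming the raw export uses dots for decimals.

```r
library(readr)

# Hypothetical two-row file standing in for the raw semicolon-delimited export.
# The raw byte 0xE9 is "é" in Latin-1, mimicking the special characters in names.
tmp <- tempfile(fileext = ".csv")
writeBin(
  c(charToRaw("Player;Tm;PTS\nJos"), as.raw(0xe9),
    charToRaw(" Example;TOT;10.0\nBam Adebayo;MIA;19.3\n")),
  tmp
)

# read_delim() takes the delimiter explicitly; the locale sets the input encoding
toy_raw <- read_delim(
  tmp,
  delim = ";",
  locale = locale(encoding = "latin1"),
  show_col_types = FALSE
)

print(toy_raw)
```

Reading the file this way keeps the whole cleaning pipeline in the script, so the manual conversion does not need to be repeated if the file is downloaded again.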
# Load all necessary libraries for this project
library(tidyverse)
library(readr)
library(ggplot2)
library(dplyr)
library(plotly)
library(GGally)
library(scales)
library(RColorBrewer)
nba <- readr::read_csv(
"C:/Users/zyamj/Downloads/NBA_23-24.csv",
locale = locale(encoding = "latin1"),
show_col_types = FALSE
)
# Preview the structure
glimpse(nba)
## Rows: 735
## Columns: 30
## $ Rk <dbl> 1, 1, 1, 2, 3, 3, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 14, …
## $ Player <chr> "Precious Achiuwa", "Precious Achiuwa", "Precious Achiuwa", "Ba…
## $ Pos <chr> "PF-C", "C", "PF", "C", "SG", "SG", "SG", "PF", "SG", "SG", "C"…
## $ Age <dbl> 24, 24, 24, 26, 23, 23, 23, 23, 25, 28, 25, 24, 25, 30, 29, 31,…
## $ Tm <chr> "TOT", "TOR", "NYK", "MIA", "TOT", "UTA", "TOR", "MEM", "MIN", …
## $ G <dbl> 74, 25, 49, 71, 78, 51, 27, 61, 82, 75, 77, 5, 56, 79, 73, 34, …
## $ GS <dbl> 18, 0, 18, 71, 28, 10, 18, 35, 20, 74, 77, 0, 0, 10, 73, 0, 0, …
## $ MP <dbl> 21.9, 17.5, 24.2, 34.0, 21.0, 19.7, 23.6, 26.5, 23.4, 33.5, 31.…
## $ FG <dbl> 3.2, 3.1, 3.2, 7.5, 2.3, 2.1, 2.7, 4.0, 2.9, 4.5, 6.7, 1.2, 2.5…
## $ FGA <dbl> 6.3, 6.8, 6.1, 14.3, 5.6, 4.9, 6.8, 9.3, 6.6, 9.1, 10.6, 4.6, 6…
## $ `FG%` <dbl> 0.501, 0.459, 0.525, 0.521, 0.411, 0.426, 0.391, 0.435, 0.439, …
## $ `3P` <dbl> 0.4, 0.5, 0.3, 0.2, 0.8, 0.9, 0.6, 1.7, 1.6, 2.7, 0.0, 0.0, 1.4…
## $ `3PA` <dbl> 1.3, 1.9, 1.0, 0.6, 2.7, 2.8, 2.6, 5.0, 4.1, 5.9, 0.1, 1.4, 3.7…
## $ `3P%` <dbl> 0.268, 0.277, 0.260, 0.357, 0.294, 0.331, 0.217, 0.349, 0.391, …
## $ `2P` <dbl> 2.8, 2.6, 2.9, 7.3, 1.5, 1.2, 2.1, 2.3, 1.3, 1.8, 6.7, 1.2, 1.1…
## $ `2PA` <dbl> 5.0, 4.9, 5.1, 13.7, 2.8, 2.1, 4.3, 4.3, 2.5, 3.2, 10.6, 3.2, 2…
## $ `2P%` <dbl> 0.562, 0.528, 0.578, 0.528, 0.523, 0.551, 0.496, 0.534, 0.517, …
## $ `eFG%` <dbl> 0.529, 0.497, 0.547, 0.529, 0.483, 0.520, 0.432, 0.528, 0.560, …
## $ FT <dbl> 0.9, 1.0, 0.9, 4.1, 0.5, 0.3, 0.8, 0.9, 0.6, 1.7, 3.0, 0.2, 0.6…
## $ FTA <dbl> 1.5, 1.7, 1.4, 5.5, 0.7, 0.4, 1.3, 1.4, 0.8, 2.0, 4.1, 0.4, 0.9…
## $ `FT%` <dbl> 0.616, 0.571, 0.643, 0.755, 0.661, 0.750, 0.611, 0.621, 0.800, …
## $ ORB <dbl> 2.6, 2.0, 2.9, 2.2, 0.9, 0.7, 1.4, 1.2, 0.4, 0.6, 3.2, 0.8, 0.4…
## $ DRB <dbl> 4.0, 3.4, 4.3, 8.1, 1.8, 1.8, 1.9, 4.6, 1.6, 3.3, 7.4, 2.6, 1.8…
## $ TRB <dbl> 6.6, 5.4, 7.2, 10.4, 2.8, 2.5, 3.3, 5.8, 2.0, 3.9, 10.5, 3.4, 2…
## $ AST <dbl> 1.3, 1.8, 1.1, 3.9, 1.1, 0.9, 1.3, 2.3, 2.5, 3.0, 2.7, 1.0, 2.1…
## $ STL <dbl> 0.6, 0.6, 0.6, 1.1, 0.6, 0.5, 0.7, 0.7, 0.8, 0.9, 0.7, 0.8, 1.1…
## $ BLK <dbl> 0.9, 0.5, 1.1, 0.9, 0.6, 0.6, 0.6, 0.9, 0.5, 0.6, 1.1, 0.0, 0.3…
## $ TOV <dbl> 1.1, 1.2, 1.1, 2.3, 0.8, 0.7, 1.1, 1.1, 0.9, 1.3, 1.6, 0.4, 0.7…
## $ PF <dbl> 1.9, 1.6, 2.1, 2.2, 1.5, 1.3, 1.9, 1.5, 1.7, 2.1, 1.9, 3.6, 1.6…
## $ PTS <dbl> 7.6, 7.7, 7.6, 19.3, 5.8, 5.4, 6.7, 10.7, 8.0, 13.5, 16.5, 2.6,…
## Rk Player Pos Age Tm G GS MP FG FGA FG%
## 0 0 0 0 0 0 0 0 0 0 0
## 3P 3PA 3P% 2P 2PA 2P% eFG% FT FTA FT% ORB
## 0 0 0 0 0 0 0 0 0 0 0
## DRB TRB AST STL BLK TOV PF PTS
## 0 0 0 0 0 0 0 0
# Players traded mid-season appear multiple times (once per team + "TOT" total row)
# We keep only "TOT" rows for traded players, and single rows for non-traded players
nba_clean <- nba %>%
group_by(Player) %>%
filter(
# If player has a TOT row, keep only that row
# Otherwise keep whatever single row they have
(n() == 1) | (Tm == "TOT")
) %>%
ungroup()
cat("Original rows:", nrow(nba), "\n")
## Original rows: 735
## After removing duplicates: 572
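To see what the TOT filter does, here is a minimal sketch on a toy table. The four rows copy the Player, Tm, and G values for Precious Achiuwa and Bam Adebayo shown in the glimpse output; the object names toy and toy_clean are made up for this illustration.

```r
library(dplyr)

# Toy roster: one traded player (three rows) and one non-traded player (one row)
toy <- tibble(
  Player = c("Precious Achiuwa", "Precious Achiuwa", "Precious Achiuwa", "Bam Adebayo"),
  Tm     = c("TOT", "TOR", "NYK", "MIA"),
  G      = c(74, 25, 49, 71)
)

toy_clean <- toy %>%
  group_by(Player) %>%
  # Keep the lone row for non-traded players, or the TOT row for traded ones
  filter(n() == 1 | Tm == "TOT") %>%
  ungroup()

print(toy_clean)

# Sanity check: no player should appear more than once after cleaning
stopifnot(!any(duplicated(toy_clean$Player)))
```

The same duplicated() check could be run on nba_clean to confirm that every player ends up with exactly one row.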
# Use dplyr filter to keep only players who played at least 20 games
# and averaged at least 10 minutes per game — this excludes garbage-time players
# and ensures our analysis reflects meaningful contributors
nba_filtered <- nba_clean %>%
filter(G >= 20, MP >= 10) %>%
# Simplify multi-position players (e.g. "PF-C" -> "PF")
mutate(Pos = str_extract(Pos, "^[A-Z]+"))
cat("After filtering (G >= 20, MP >= 10):", nrow(nba_filtered), "players\n")
## After filtering (G >= 20, MP >= 10): 399 players
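The position simplification relies on a regular expression that keeps only the leading run of capital letters. A small sketch of that behavior, using a few illustrative position strings (only "PF-C" is taken from the data preview; the others are assumed examples):

```r
library(stringr)

# Illustrative position labels; hyphenated ones list two positions
pos_raw <- c("PF-C", "SG", "C-PF", "SF-SG")

# "^[A-Z]+" matches capital letters from the start and stops at the hyphen
pos_primary <- str_extract(pos_raw, "^[A-Z]+")

print(pos_primary)
```

This means a multi-position player is always assigned to the first position listed, which is a simplifying choice rather than a statement about where the player spent most of his minutes.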
# Use dplyr select to keep only the variables relevant to our analysis
nba_analysis <- nba_filtered %>%
select(Player, Pos, Tm, Age, G, MP, PTS, TRB, AST, STL, BLK, TOV,
`FG%`, `3P%`, `FT%`)
glimpse(nba_analysis)
## Rows: 399
## Columns: 15
## $ Player <chr> "Precious Achiuwa", "Bam Adebayo", "Ochai Agbaji", "Santi Aldam…
## $ Pos <chr> "PF", "C", "SG", "PF", "SG", "SG", "C", "PG", "PF", "PF", "PG",…
## $ Tm <chr> "TOT", "MIA", "TOT", "MEM", "MIN", "PHO", "CLE", "NOP", "MIN", …
## $ Age <dbl> 24, 26, 23, 23, 25, 28, 25, 25, 30, 29, 23, 26, 23, 25, 21, 24,…
## $ G <dbl> 74, 71, 78, 61, 82, 75, 77, 56, 79, 73, 81, 50, 75, 55, 22, 50,…
## $ MP <dbl> 21.9, 34.0, 21.0, 26.5, 23.4, 33.5, 31.7, 18.4, 22.6, 35.2, 22.…
## $ PTS <dbl> 7.6, 19.3, 5.8, 10.7, 8.0, 13.5, 16.5, 7.1, 6.4, 30.4, 11.6, 14…
## $ TRB <dbl> 6.6, 10.4, 2.8, 5.8, 2.0, 3.9, 10.5, 2.3, 3.5, 11.5, 3.8, 4.2, …
## $ AST <dbl> 1.3, 3.9, 1.1, 2.3, 2.5, 3.0, 2.7, 2.1, 4.2, 6.5, 2.9, 2.1, 3.8…
## $ STL <dbl> 0.6, 1.1, 0.6, 0.7, 0.8, 0.9, 0.7, 1.1, 0.9, 1.2, 0.8, 1.4, 0.8…
## $ BLK <dbl> 0.9, 0.9, 0.6, 0.9, 0.5, 0.6, 1.1, 0.3, 0.6, 1.1, 0.5, 0.7, 0.5…
## $ TOV <dbl> 1.1, 2.3, 0.8, 1.1, 0.9, 1.3, 1.6, 0.7, 1.2, 3.4, 1.6, 1.6, 2.1…
## $ `FG%` <dbl> 0.501, 0.521, 0.411, 0.435, 0.439, 0.499, 0.634, 0.412, 0.460, …
## $ `3P%` <dbl> 0.268, 0.357, 0.294, 0.349, 0.391, 0.461, 0.000, 0.377, 0.229, …
## $ `FT%` <dbl> 0.616, 0.755, 0.661, 0.621, 0.800, 0.878, 0.742, 0.673, 0.708, …
# Use dplyr group_by + summarise to get average stats per position
nba_analysis %>%
group_by(Pos) %>%
summarise(
Players = n(),
Avg_PTS = round(mean(PTS, na.rm = TRUE), 1),
Avg_TRB = round(mean(TRB, na.rm = TRUE), 1),
Avg_AST = round(mean(AST, na.rm = TRUE), 1),
Avg_MP = round(mean(MP, na.rm = TRUE), 1)
) %>%
arrange(desc(Avg_PTS))
## # A tibble: 5 × 6
## Pos Players Avg_PTS Avg_TRB Avg_AST Avg_MP
## <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 PG 77 12.2 3 4.4 24.4
## 2 PF 78 11.1 4.7 2.1 23.2
## 3 SG 86 10.8 3 2.5 22.9
## 4 C 73 10 6.5 1.8 21.7
## 5 SF 85 10 3.7 1.9 23.2
# Simple histogram to explore scoring distribution
ggplot(nba_analysis, aes(x = PTS)) +
geom_histogram(bins = 30, fill = "#1d428a", color = "white") +
labs(title = "Distribution of Points Per Game (2023-24 NBA Season)",
x = "Points Per Game", y = "Number of Players") +
theme_minimal()
# Boxplot to compare scoring across positions
ggplot(nba_analysis, aes(x = Pos, y = PTS, fill = Pos)) +
geom_boxplot() +
labs(title = "Points Per Game by Position",
x = "Position", y = "Points Per Game") +
theme_minimal() +
theme(legend.position = "none")
# Use GGally to create a correlation matrix of key quantitative variables
# This guides our variable selection for multiple linear regression
nba_analysis %>%
select(PTS, MP, TRB, AST, `FG%`, TOV) %>%
na.omit() %>%
ggpairs(
title = "Correlation Matrix — Key NBA Statistics",
upper = list(continuous = wrap("cor", size = 3.5)),
lower = list(continuous = wrap("points", alpha = 0.3, size = 0.8)),
diag = list(continuous = wrap("densityDiag", fill = "#1d428a", alpha = 0.5))
) +
theme_minimal()
Interpretation: From the correlation matrix,
MP (minutes per game), FG%, and
AST all show meaningful positive correlations with
PTS. These three variables will form the basis of our
multiple linear regression model.
Research Question: Can we predict a player’s points per game (PTS) using minutes played (MP), field goal percentage (FG%), and assists per game (AST)?
Justification for variable selection: Based on the correlation plot
above, MP shows the strongest correlation with
PTS (r ≈ 0.88, the value implied by the R² of about 0.77 for the MP-based model reported below), which makes logical sense: players who
play more minutes have more opportunities to score. FG%
captures shooting efficiency, and AST reflects a player’s
offensive involvement and ball-handling ability, both of which are
associated with higher scoring.
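A quick consistency check between a correlation matrix and a regression is that, for a single predictor, the squared Pearson correlation equals the R² of the simple regression. The sketch below demonstrates this on simulated data (not the NBA dataset); the coefficients used to generate it are arbitrary assumptions.

```r
set.seed(42)  # simulated data for illustration only

# Hypothetical players: scoring driven mostly by minutes, plus noise
n   <- 200
MP  <- runif(n, min = 10, max = 36)
PTS <- -7 + 0.7 * MP + rnorm(n, sd = 3)

r  <- cor(MP, PTS)                       # Pearson correlation
r2 <- summary(lm(PTS ~ MP))$r.squared    # R-squared of the one-predictor model

cat("r =", round(r, 3), "| r^2 =", round(r^2, 3), "| R-squared =", round(r2, 3), "\n")
```

Running cor(nba_analysis$MP, nba_analysis$PTS) alongside a PTS ~ MP model would apply the same check to the real data.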
# Full multiple linear regression model: PTS ~ MP + FG% + AST
model_full <- lm(PTS ~ MP + `FG%` + AST, data = nba_analysis)
summary(model_full)
##
## Call:
## lm(formula = PTS ~ MP + `FG%` + AST, data = nba_analysis)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.6180 -1.9274 0.2163 1.7108 14.8154
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.56499 1.04823 -7.217 2.74e-12 ***
## MP 0.59463 0.02685 22.142 < 2e-16 ***
## `FG%` 5.42365 2.10892 2.572 0.0105 *
## AST 0.82160 0.10923 7.522 3.68e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.995 on 395 degrees of freedom
## Multiple R-squared: 0.7985, Adjusted R-squared: 0.797
## F-statistic: 521.7 on 3 and 395 DF, p-value: < 2.2e-16
## (Intercept) MP `FG%` AST
## -7.5649923 0.5946280 5.4236500 0.8216002
Regression Equation:
\[\hat{PTS} = \beta_0 + \beta_1 \cdot MP + \beta_2 \cdot FG\% + \beta_3 \cdot AST\]
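Substituting the coefficient estimates from the summary output (rounded to three decimals) gives the fitted equation. The second line is a worked prediction for a hypothetical player, chosen only for illustration, who averages 30 minutes, shoots 0.470 from the field, and records 5 assists per game.

\[\hat{PTS} = -7.565 + 0.595 \cdot MP + 5.424 \cdot FG\% + 0.822 \cdot AST\]
\[\hat{PTS} = -7.565 + 0.595(30) + 5.424(0.470) + 0.822(5) \approx 16.9\]

Note that FG% enters as a proportion between 0 and 1, as it is stored in the dataset, not as a number between 0 and 100.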
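Interpretation: Holding the other predictors constant, each additional minute per game is associated with about 0.59 more points per game (p < 2e-16), and each additional assist per game with about 0.82 more points (p = 3.68e-13). Because FG% is stored as a proportion, its coefficient of 5.42 means that a 0.10 increase in FG% (ten percentage points) is associated with about 0.54 more points per game (p = 0.0105). The intercept of -7.56 is the prediction for a player with zero minutes, zero FG%, and zero assists, which lies outside the data and has no practical meaning. Together the three predictors give an adjusted R² of 0.797, with a residual standard error of about 3.0 points per game.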
# Check if any variable should be dropped using backward elimination
# Start with full model, check p-values, drop if p > 0.05
# All variables are significant, so we keep the full model
# For completeness, also test a reduced model without AST
model_reduced <- lm(PTS ~ MP + `FG%`, data = nba_analysis)
summary(model_reduced)
##
## Call:
## lm(formula = PTS ~ MP + `FG%`, data = nba_analysis)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.8686 -2.1789 0.1402 1.8948 15.9529
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.82909 1.11875 -6.998 1.11e-11 ***
## MP 0.73605 0.02048 35.948 < 2e-16 ***
## `FG%` 3.48739 2.23521 1.560 0.12
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.199 on 396 degrees of freedom
## Multiple R-squared: 0.7696, Adjusted R-squared: 0.7685
## F-statistic: 661.4 on 2 and 396 DF, p-value: < 2.2e-16
## Full model Adj R²: 0.797
## Reduced model Adj R²: 0.7685
## Full model wins — keep all three predictors.
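Comparing adjusted R² values is one way to choose between nested models; a partial F-test via anova() is a common complement. The sketch below runs on simulated data (the generating coefficients and the names sim, fit_reduced, and fit_full are assumptions for illustration), so its numbers say nothing about the NBA results.

```r
set.seed(42)  # simulated data for illustration only

n   <- 200
MP  <- runif(n, 10, 36)
FG  <- runif(n, 0.35, 0.65)
AST <- rexp(n, rate = 0.4)
PTS <- -7 + 0.6 * MP + 5 * FG + 0.8 * AST + rnorm(n, sd = 3)
sim <- data.frame(PTS, MP, FG, AST)

# Nested models: the reduced model omits AST
fit_reduced <- lm(PTS ~ MP + FG, data = sim)
fit_full    <- lm(PTS ~ MP + FG + AST, data = sim)

# Partial F-test: does adding AST significantly reduce the residual sum of squares?
cmp <- anova(fit_reduced, fit_full)
print(cmp)

p_value <- cmp$`Pr(>F)`[2]
cat("p-value for adding AST:", signif(p_value, 3), "\n")
```

When the two models differ by a single term, this F-test is equivalent to the t-test on that term in the full model, so anova(model_reduced, model_full) should agree with the AST row of the full model's summary.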
# Interactive scatter plot: MP vs PTS colored by Position
# Mouseover shows player name, team, and stats
p <- nba_analysis %>%
filter(!is.na(`FG%`)) %>%
ggplot(aes(
x = MP, y = PTS,
color = Pos,
size = AST,
text = paste0(
"<b>", Player, "</b><br>",
"Team: ", Tm, "<br>",
"Position: ", Pos, "<br>",
"PPG: ", PTS, "<br>",
"MPG: ", MP, "<br>",
"AST: ", AST, "<br>",
"FG%: ", `FG%`
)
)) +
geom_point(alpha = 0.75) +
geom_smooth(method = "lm", se = FALSE, aes(group = 1),
color = "gray30", linetype = "dashed", linewidth = 0.8) +
scale_color_manual(
values = c(
"PG" = "#C8102E", # red
"SG" = "#1d428a", # blue
"SF" = "#00843D", # green
"PF" = "#F58426", # orange
"C" = "#552583" # purple
)
) +
scale_size_continuous(range = c(1.5, 6), name = "Assists Per Game") +
labs(
title = "NBA 2023-24: Minutes Per Game vs. Points Per Game by Position",
subtitle = "Bubble size = Assists per game | Dashed line = Linear trend",
x = "Minutes Per Game (MP)",
y = "Points Per Game (PTS)",
color = "Position",
caption = "Source: Basketball-Reference via Kaggle (2023-2024 NBA Regular Season)"
) +
theme_bw(base_size = 13) +
theme(
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(color = "gray40", size = 11),
legend.position = "right"
)
# Convert to interactive plotly with tooltip
ggplotly(p, tooltip = "text") %>%
layout(
hoverlabel = list(bgcolor = "white", font = list(size = 12))
)
What this visualization shows: There is a strong positive relationship between minutes per game and points per game: players who spend more time on the court score more. Point guards (red) and shooting guards (blue) tend to cluster at higher scoring outputs for their minutes, reflecting their offensive roles. Centers (purple) tend to score efficiently but with fewer minutes. The bubble size (assists) reveals that high-assist players (mostly PGs) are also among the higher scorers, which aligns with our regression model finding that AST is a significant predictor of PTS.
Interesting pattern: Some centers score very efficiently despite fewer minutes, while some guards log heavy minutes but score less — suggesting positional role matters as much as raw playing time.
# Summarize and pivot for grouped bar chart
pos_stats <- nba_analysis %>%
group_by(Pos) %>%
summarise(
Points = mean(PTS, na.rm = TRUE),
Rebounds = mean(TRB, na.rm = TRUE),
Assists = mean(AST, na.rm = TRUE)
) %>%
pivot_longer(cols = c(Points, Rebounds, Assists),
names_to = "Stat", values_to = "Value")
ggplot(pos_stats, aes(x = Pos, y = Value, fill = Stat)) +
geom_col(position = "dodge", color = "white", width = 0.7) +
scale_fill_manual(values = c(
"Points" = "#C8102E",
"Rebounds" = "#1d428a",
"Assists" = "#00843D"
)) +
labs(
title = "Average Points, Rebounds, and Assists by NBA Position (2023-24 Season)",
x = "Position",
y = "Average Per Game",
fill = "Statistic",
caption = "Source: Basketball-Reference via Kaggle (2023-2024 NBA Regular Season)"
) +
theme_classic(base_size = 13) +
theme(
plot.title = element_text(face = "bold", size = 13),
legend.position = "top"
)
# Map showing average team scoring by state
# First, map team abbreviations to states
team_to_state <- c(
# Keys follow Basketball-Reference team codes (BRK, CHO, PHO), matching the Tm column
ATL = "georgia", BOS = "massachusetts", BRK = "new york",
CHO = "north carolina", CHI = "illinois", CLE = "ohio",
DAL = "texas", DEN = "colorado", DET = "michigan",
GSW = "california", HOU = "texas", IND = "indiana",
LAC = "california", LAL = "california", MEM = "tennessee",
MIA = "florida", MIL = "wisconsin", MIN = "minnesota",
NOP = "louisiana", NYK = "new york", OKC = "oklahoma",
ORL = "florida", PHI = "pennsylvania", PHO = "arizona",
POR = "oregon", SAC = "california", SAS = "texas",
TOR = "ontario", UTA = "utah", WAS = "district of columbia"
)
# Calculate average PTS per team
# (traded players carry Tm == "TOT" after cleaning, so the filter below leaves them out)
team_pts <- nba_analysis %>%
filter(Tm %in% names(team_to_state)) %>%
group_by(Tm) %>%
summarise(avg_pts = mean(PTS, na.rm = TRUE)) %>%
mutate(state = team_to_state[Tm])
# Aggregate by state (some states have multiple teams)
state_pts <- team_pts %>%
group_by(state) %>%
summarise(avg_pts = mean(avg_pts))
# Get US map data (map_data() needs the maps package to be installed)
us_map <- map_data("state")
# Join map with team data
map_data_joined <- us_map %>%
left_join(state_pts, by = c("region" = "state"))
# Plot the map
ggplot(map_data_joined, aes(x = long, y = lat, group = group, fill = avg_pts)) +
geom_polygon(color = "white", linewidth = 0.3) +
coord_fixed(1.3) +
scale_fill_gradientn(
colors = c("#d0e8ff", "#1d428a", "#C8102E"),
na.value = "gray85",
name = "Avg PPG\nby Player"
) +
labs(
title = "NBA 2023-24: Average Points Per Game by Player, Aggregated by State",
subtitle = "Darker = higher average player scoring in that state's NBA team(s)",
caption = "Source: Basketball-Reference via Kaggle | *Toronto (TOR) excluded (Canada)",
x = NULL, y = NULL
) +
theme_void(base_size = 12) +
theme(
plot.title = element_text(face = "bold", size = 13, hjust = 0.5),
plot.subtitle = element_text(color = "gray40", size = 10, hjust = 0.5),
legend.position = "right"
)
The three visualizations together tell an interesting story about NBA player performance in the 2023-2024 season. The interactive scatter plot reveals the strong linear relationship between playing time and scoring output, a relationship that is intuitive and is quantitatively confirmed by the regression model (Adjusted R² > 0.75). The color-coding by position shows that point guards and shooting guards dominate the high-minutes, high-scoring quadrant, which reflects the modern NBA’s shift toward perimeter-dominant offense.
The grouped bar chart by position reveals expected role distinctions: centers lead in rebounds (6.5 per game), point guards lead in assists (4.4) and in scoring (12.2), and the remaining positions score at similar rates (10.0 to 11.1 points per game). Interestingly, power forwards rank second in both scoring (11.1) and rebounding (4.7), a balanced profile that reflects the evolution of the “stretch big” in modern basketball.
The map visualization shows geographic clustering of scoring talent, with California (Lakers, Clippers, Warriors, Kings) and Texas (Mavericks, Rockets, Spurs) standing out as states with higher average player scoring — likely due to large-market teams attracting star players who tend to have higher scoring averages.
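One caveat for the map: filter(Tm %in% names(team_to_state)) silently drops any team code that is missing from the lookup, so a mismatch in abbreviation conventions would remove whole teams without an error. A minimal sketch of a check for this, using a hypothetical set of codes and a deliberately mismatched lookup:

```r
# Hypothetical team codes as they might appear in a Tm column
tm_in_data <- c("MIA", "PHO", "BRK", "CHO", "TOT", "TOR")

# Hypothetical lookup keyed by a different abbreviation convention
lookup <- c(
  MIA = "florida", PHO = "arizona", BKN = "new york",
  CHA = "north carolina", TOR = "ontario"
)

# Codes present in the data but absent from the lookup would be dropped by the filter
unmatched <- setdiff(tm_in_data, names(lookup))
print(unmatched)
```

On the real data, setdiff(unique(nba_analysis$Tm), names(team_to_state)) should return only "TOT"; anything else points to a team that never reaches the map.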
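Surprises: I was surprised by how much of the variance in scoring is explained by just three variables (MP, FG%, AST). The regression model’s adjusted R² of 0.797 means that these three predictors alone account for roughly 80% of the variance in points per game. Playing time (MP) carries most of that, since the model with only MP and FG% already reaches an adjusted R² of 0.7685, while assists and shooting efficiency (FG%) add smaller but statistically significant contributions in the full model.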
Things I wish I could have included: I would have liked to include a shot chart showing where players score from on the court, but that would require play-by-play coordinate data not included in this dataset. I also attempted a rolling-average time series but the per-game dataset doesn’t include game dates.
tidyverse, ggplot2, dplyr, plotly, GGally, scales, RColorBrewer