#install.packages('qrcode')
library(qrcode)Warning: package 'qrcode' was built under R version 4.5.3
#will need to update URL once final draft is complete
qr <- qr_code('https://rpubs.com/wmostofa/1428118')
plot(qr)#install.packages('qrcode')
library(qrcode)Warning: package 'qrcode' was built under R version 4.5.3
#will need to update URL once final draft is complete
qr <- qr_code('https://rpubs.com/wmostofa/1428118')
plot(qr)Our project analyzes whether NBA players are being paid relative to their on-court performance during the 2024–2025 season. The main goal is to understand how player salaries relate to key performance metrics such as points, rebounds, and assists, and to identify players who may be overpaid or underpaid based on their production.
To do this, we used two main data sources. Salary data was collected from Basketball Reference and cleaned to ensure consistency, while performance data was obtained using the nbastatR package (also from Basketball Reference). These datasets were merged so that each player had both salary and performance statistics in a single dataset. Players with very limited playing time were removed to ensure more reliable findings.
After cleaning and merging the data, we conducted exploratory data analysis using visualizations to identify trends and relationships between salary and performance. Then we fit a multiple linear regression model that was used to quantify how performance metrics impact salary. Lastly, we applied clustering techniques to group players with similar performance profiles, providing additional insight into player roles and value.
# 1. Setup
library(tidyverse) Warning: package 'tidyverse' was built under R version 4.5.2
Warning: package 'ggplot2' was built under R version 4.5.3
Warning: package 'tidyr' was built under R version 4.5.2
Warning: package 'readr' was built under R version 4.5.2
Warning: package 'purrr' was built under R version 4.5.2
Warning: package 'dplyr' was built under R version 4.5.2
Warning: package 'stringr' was built under R version 4.5.2
Warning: package 'forcats' was built under R version 4.5.2
Warning: package 'lubridate' was built under R version 4.5.2
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.6
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.3 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.2
✔ purrr 1.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(stringi) Warning: package 'stringi' was built under R version 4.5.2
library(ggrepel) Warning: package 'ggrepel' was built under R version 4.5.2
library(patchwork) Warning: package 'patchwork' was built under R version 4.5.2
library(gridExtra) Warning: package 'gridExtra' was built under R version 4.5.2
Attaching package: 'gridExtra'
The following object is masked from 'package:dplyr':
combine
library(plotly) Warning: package 'plotly' was built under R version 4.5.2
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
library(cluster) Warning: package 'cluster' was built under R version 4.5.3
library(NbClust)Warning: package 'NbClust' was built under R version 4.5.2
# Step 1: Install remotes (simpler and more lightweight than devtools)
#install.packages("remotes")
# Step 2: Restart R Session
# In RStudio: Session -> Restart R
# Step 3: Install nbastatR using remotes instead
#remotes::install_github("abresler/nbastatR")
# Step 4: Restart R Session again
# In RStudio: Session -> Restart R
# Step 5: Load the package
library(nbastatR)Warning: replacing previous import 'curl::handle_reset' by 'httr::handle_reset'
when loading 'nbastatR'
Warning: replacing previous import 'httr::timeout' by 'memoise::timeout' when
loading 'nbastatR'
Warning: replacing previous import 'magrittr::set_names' by 'purrr::set_names'
when loading 'nbastatR'
Warning: replacing previous import 'jsonlite::flatten' by 'purrr::flatten' when
loading 'nbastatR'
Warning: replacing previous import 'curl::parse_date' by 'readr::parse_date'
when loading 'nbastatR'
Warning: replacing previous import 'purrr::flatten_lgl' by 'rlang::flatten_lgl'
when loading 'nbastatR'
Warning: replacing previous import 'purrr::splice' by 'rlang::splice' when
loading 'nbastatR'
Warning: replacing previous import 'purrr::flatten_chr' by 'rlang::flatten_chr'
when loading 'nbastatR'
Warning: replacing previous import 'purrr::flatten_raw' by 'rlang::flatten_raw'
when loading 'nbastatR'
Warning: replacing previous import 'purrr::flatten' by 'rlang::flatten' when
loading 'nbastatR'
Warning: replacing previous import 'jsonlite::unbox' by 'rlang::unbox' when
loading 'nbastatR'
Warning: replacing previous import 'purrr::flatten_dbl' by 'rlang::flatten_dbl'
when loading 'nbastatR'
Warning: replacing previous import 'purrr::invoke' by 'rlang::invoke' when
loading 'nbastatR'
Warning: replacing previous import 'purrr::flatten_int' by 'rlang::flatten_int'
when loading 'nbastatR'
Warning: replacing previous import 'readr::guess_encoding' by
'rvest::guess_encoding' when loading 'nbastatR'
Warning: replacing previous import 'magrittr::extract' by 'tidyr::extract' when
loading 'nbastatR'
Warning: replacing previous import 'rlang::as_list' by 'xml2::as_list' when
loading 'nbastatR'
library(ggplot2)
# eliminate scientific notation in salaries
options(scipen = 999)
library(tidyverse)
library(readr)
# ============================================================
# STEP 1 — Load and clean salary CSV
# ============================================================
salaries_raw <- read_csv("sportsref_download.csv", skip = 1)Rows: 530 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): Player, Tm, 2025-26, 2026-27, 2027-28, 2028-29, 2029-30, 2030-31, G...
dbl (1): Rk
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Preview
glimpse(salaries_raw)Rows: 530
Columns: 10
$ Rk <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
$ Player <chr> "Stephen Curry", "Joel Embiid", "Nikola Jokić", "Kevin Dura…
$ Tm <chr> "GSW", "PHI", "DEN", "HOU", "BOS", "WAS", "MIL", "GSW", "BO…
$ `2025-26` <chr> "$59,606,817", "$55,224,526", "$55,224,526", "$54,708,609",…
$ `2026-27` <chr> "$62,587,158", "$58,100,000", "$59,033,114", "$43,902,439",…
$ `2027-28` <chr> NA, "$62,748,000", "$62,841,702", "$46,097,561", "$62,786,6…
$ `2028-29` <chr> NA, "$67,396,000", NA, NA, "$67,116,798", NA, NA, NA, "$64,…
$ `2029-30` <chr> NA, NA, NA, NA, "$71,446,914", NA, NA, NA, NA, "$69,191,228…
$ `2030-31` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ Guaranteed <chr> "$122,193,975", "$176,072,526", "$114,257,640", "$98,611,04…
# Function to clean a salary string and return a numeric value
# Handles inputs like "$59,606,817" or "59606817" or "$59606817"
clean_salary <- function(salary) {
cleaned <- c() # empty vector to store results
for (i in salary) {
if (is.character(i)) {
# if salary is a string, strip $ signs, commas, and whitespace
i <- i |>
str_remove_all("\\$") |>
str_remove_all(",") |>
str_trim() |>
as.numeric()
}
cleaned <- c(cleaned, i)
}
return(cleaned)
}
# Apply the function inside mutate()
salary_df <- salaries_raw |>
select(Player, Tm, `2025-26`) |>
rename(salary = `2025-26`) |>
filter(!is.na(Player), !is.na(salary)) |>
mutate(salary = clean_salary(salary))
#will only have one player per row. keeps highest salary row
salary_df_clean <- salary_df |>
group_by(Player) |>
summarize(
team = Tm[which.max(salary)], # keep team associated with highest salary
salary = max(salary), # keep highest salary value
.groups = "drop"
)
# Confirm duplicates are gone
salary_df_clean |>
count(Player) |>
filter(n > 1)# A tibble: 0 × 2
# ℹ 2 variables: Player <chr>, n <int>
# ============================================================
# STEP 2 Performance data (nbastatR, 2024-25 regular season)
# ============================================================
# nbastatR labels seasons by the END year, so 2024-25 = 2025.
# bref_players_stats() pulls per season averages directly from Basketball Reference
Sys.setenv(VROOM_CONNECTION_SIZE = 500072)
bref_pg <- nbastatR::bref_players_stats(
seasons = 2025,
tables = "per_game"
)parsed http://www.basketball-reference.com/leagues/NBA_2025_per_game.html
PerGame
Warning: package 'future' was built under R version 4.5.3
colnames(bref_pg) [1] "slugSeason" "namePlayer"
[3] "groupPosition" "yearSeason"
[5] "slugPosition" "isSeasonCurrent"
[7] "slugPlayerSeason" "slugPlayerBREF"
[9] "agePlayer" "countGames"
[11] "countGamesStarted" "pctFG"
[13] "pctFG3" "pctFG2"
[15] "pctEFG" "pctFT"
[17] "isHOFPlayer" "slugTeamsBREF"
[19] "idPlayerNBA" "urlPlayerThumbnail"
[21] "urlPlayerHeadshot" "urlPlayerPhoto"
[23] "urlPlayerStats" "urlPlayerActionPhoto"
[25] "nameTeamPerGame" "minutesPerGame"
[27] "fgmPerGame" "fgaPerGame"
[29] "fg3mPerGame" "fg3aPerGame"
[31] "fg2mPerGame" "fg2aPerGame"
[33] "ftmPerGame" "ftaPerGame"
[35] "orbPerGame" "drbPerGame"
[37] "trbPerGame" "astPerGame"
[39] "stlPerGame" "blkPerGame"
[41] "tovPerGame" "pfPerGame"
[43] "ptsPerGame" "AwardsPerGame"
[45] "countTeamsPlayerSeasonPerGame" "urlPlayerBREF"
glimpse(bref_pg)Rows: 572
Columns: 46
$ slugSeason <chr> "2024-25", "2024-25", "2024-25", "2024-2…
$ namePlayer <chr> "Shai Gilgeous-Alexander", "Giannis Ante…
$ groupPosition <chr> "G", "F", "C", "G", "G", "F", "F", "G", …
$ yearSeason <dbl> 2025, 2025, 2025, 2025, 2025, 2025, 2025…
$ slugPosition <chr> "PG", "PF", "C", "PG", "SG", "PF", "PF",…
$ isSeasonCurrent <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ slugPlayerSeason <chr> "gilgesh01_2025", "antetgi01_2025", "jok…
$ slugPlayerBREF <chr> "gilgesh01", "antetgi01", "jokicni01", "…
$ agePlayer <dbl> 26, 30, 29, 25, 23, 26, 36, 24, 23, 28, …
$ countGames <dbl> 76, 67, 70, 50, 79, 72, 62, 52, 70, 65, …
$ countGamesStarted <dbl> 76, 67, 70, 50, 79, 72, 62, 52, 70, 65, …
$ pctFG <dbl> 0.519, 0.601, 0.576, 0.450, 0.447, 0.452…
$ pctFG3 <dbl> 0.375, 0.222, 0.417, 0.368, 0.395, 0.343…
$ pctFG2 <dbl> 0.571, 0.620, 0.627, 0.522, 0.501, 0.559…
$ pctEFG <dbl> 0.569, 0.607, 0.627, 0.536, 0.547, 0.537…
$ pctFT <dbl> 0.898, 0.617, 0.800, 0.782, 0.837, 0.814…
$ isHOFPlayer <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ slugTeamsBREF <chr> "0", "0", "0", "0 | 0 | 0", "0", "0", "0…
$ idPlayerNBA <dbl> 1628983, 203507, 0, 0, 1630162, 1628369,…
$ urlPlayerThumbnail <chr> "https://ak-static.cms.nba.com/wp-conten…
$ urlPlayerHeadshot <chr> "https://ak-static.cms.nba.com/wp-conten…
$ urlPlayerPhoto <chr> "https://ak-static.cms.nba.com/wp-conten…
$ urlPlayerStats <chr> "https://stats.nba.com/player/1628983", …
$ urlPlayerActionPhoto <chr> "https://stats.nba.com/media/players/700…
$ nameTeamPerGame <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ minutesPerGame <dbl> 34.2, 34.2, 36.7, 35.4, 36.3, 36.4, 36.5…
$ fgmPerGame <dbl> 11.3, 11.8, 11.2, 9.2, 9.1, 9.2, 9.5, 9.…
$ fgaPerGame <dbl> 21.8, 19.7, 19.5, 20.5, 20.4, 20.3, 18.1…
$ fg3mPerGame <dbl> 2.1, 0.2, 2.0, 3.5, 4.1, 3.5, 2.6, 3.1, …
$ fg3aPerGame <dbl> 5.7, 0.9, 4.7, 9.6, 10.3, 10.1, 6.0, 9.2…
$ fg2mPerGame <dbl> 9.2, 11.6, 9.3, 5.7, 5.1, 5.7, 7.0, 6.1,…
$ fg2aPerGame <dbl> 16.1, 18.7, 14.8, 10.9, 10.1, 10.2, 12.1…
$ ftmPerGame <dbl> 7.9, 6.5, 5.2, 6.2, 5.3, 5.0, 4.9, 4.9, …
$ ftaPerGame <dbl> 8.8, 10.6, 6.4, 7.9, 6.3, 6.1, 5.8, 5.6,…
$ orbPerGame <dbl> 0.9, 2.2, 2.9, 0.8, 0.8, 0.7, 0.4, 0.3, …
$ drbPerGame <dbl> 4.1, 9.7, 9.9, 7.4, 4.9, 8.0, 5.7, 3.1, …
$ trbPerGame <dbl> 5.0, 11.9, 12.7, 8.2, 5.7, 8.7, 6.0, 3.3…
$ astPerGame <dbl> 6.4, 6.5, 10.2, 7.7, 4.5, 6.0, 4.2, 6.1,…
$ stlPerGame <dbl> 1.7, 0.9, 1.8, 1.8, 1.2, 1.1, 0.8, 1.8, …
$ blkPerGame <dbl> 1.0, 1.2, 0.6, 0.4, 0.6, 0.5, 1.2, 0.4, …
$ tovPerGame <dbl> 2.4, 3.1, 3.3, 3.6, 3.2, 2.9, 3.1, 2.4, …
$ pfPerGame <dbl> 2.2, 2.3, 2.3, 2.5, 1.9, 2.2, 1.7, 2.2, …
$ ptsPerGame <dbl> 32.7, 30.4, 29.6, 28.2, 27.6, 26.8, 26.6…
$ AwardsPerGame <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ countTeamsPlayerSeasonPerGame <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ urlPlayerBREF <chr> "http://www.basketball-reference.com/pla…
# rename to match what our downstream plots / model expect.
# note: bref_players_stats returns groupPosition already bucketed
# as G / F / C, so we use that directly.
stats_df <- bref_pg |>
select(
player_name = namePlayer,
position = groupPosition,
GP = countGames,
MPG = minutesPerGame,
PPG = ptsPerGame,
RPG = trbPerGame,
APG = astPerGame,
SPG = stlPerGame,
BPG = blkPerGame,
TOPG = tovPerGame,
PFPG = pfPerGame,
FG_PCT = pctFG,
FG3_PCT = pctFG3,
FT_PCT = pctFT
) |>
# bref lists traded players once per stint plus a "TOT" row.
# keep the row with the most games played per player.
group_by(player_name) |>
slice_max(order_by = GP, n = 1, with_ties = FALSE) |>
ungroup() |>
filter(GP >= 20, !is.na(position)) # drop low GP and missing positions
# Verify
stats_df |>
count(position, sort = TRUE)# A tibble: 3 × 2
position n
<chr> <int>
1 G 208
2 F 162
3 C 86
# ============================================================
# STEP 3 — Use a join command to combine Salary and Player Stats
# ============================================================
# Join stats_df + salary_df
# inner_join — only keep players with both stats AND a salary
# names differ between tables so we specify both sides
nba_df <- stats_df |>
inner_join(salary_df_clean, by = c("player_name" = "Player"))
# Check for any duplicate player rows
nba_df |>
count(player_name) |>
filter(n > 1)# A tibble: 0 × 2
# ℹ 2 variables: player_name <chr>, n <int>
# Spot check to confirm if the join method worked
nba_df |>
select(player_name, position, team, salary, GP, PPG, RPG, APG) |>
arrange(desc(salary)) |>
head(10)# A tibble: 10 × 8
player_name position team salary GP PPG RPG APG
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Stephen Curry G GSW 59606817 70 24.5 4.4 6
2 Kevin Durant F HOU 54708609 62 26.6 6 4.2
3 Anthony Davis C WAS 54126450 51 24.7 11.6 3.5
4 Giannis Antetokounmpo F MIL 54126450 67 30.4 11.9 6.5
5 Jayson Tatum F BOS 54126450 72 26.8 8.7 6
6 Jimmy Butler F GSW 54126450 55 17.5 5.4 5.4
7 Devin Booker G PHO 53142264 75 25.6 4.1 7.1
8 Jaylen Brown F BOS 53142264 63 22.2 5.8 4.5
9 Karl-Anthony Towns C NYK 53142264 72 24.4 12.8 3.1
10 LeBron James F LAL 52627153 70 24.4 7.8 8.2
# ============================================================
# STEP 4 — Visualize Data
# ============================================================
# Histogram
p1_hist_salarydistribution <- nba_df |>
ggplot(aes(x = salary/1e6)) + #dividing by 1e6 to make the Xaxis more readable
geom_histogram(bins = 30, fill = "blue", color = "white") +
labs(
title = "NBA Player Salary Distribution",
caption = "Data: Basketball Reference + nbastatR",
x = "Salary (USD Millions)",
y = "Number of Players"
) +
theme_light() +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
#Graph1
p1_hist_salarydistribution #show datap_top10_salary <- nba_df |>
arrange(desc(salary)) |>
slice_head(n = 10) |>
ggplot(aes(x = salary / 1e6, y = reorder(player_name, salary), fill = position)) +
geom_bar(stat = "identity") +
labs(
title = "Top 10 Highest Paid NBA Players",
x = "Salary (Millions USD)",
y = "Player",
fill = "Position",
caption = "Data: Basketball Reference + nbastatR"
) +
theme_light() +
theme(
plot.title = element_text(hjust = 0.5)
)
#Graph2
p_top10_salary #show graphp_top10_ppg <- nba_df |>
arrange(desc(PPG)) |>
slice_head(n = 10) |>
ggplot(aes(x = PPG, y = reorder(player_name, PPG), fill = position)) +
geom_bar(stat = "identity") +
labs(
title = "Top 10 Scorers in the NBA",
x = "Points Per Game (PPG)",
y = "Player",
fill = "Position",
caption = "Data: Basketball Reference + nbastatR"
) +
theme_light() +
theme(
plot.title = element_text(hjust = 0.5)
)
#Graph3
p_top10_ppg #show graph#Boxplot
p_salary_position <- nba_df |>
ggplot(aes(x = position, y = salary / 1e6, color = position)) +
geom_boxplot() +
labs(
title = "NBA Salary Distribution by Position",
subtitle = "How does pay vary across Guards, Forwards and Centers?",
x = "Position",
y = "Salary (Millions USD)",
caption = "Data: Basketball Reference + nbastatR"
) +
theme_light() +
theme(
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5),
legend.position = "none"
)
#Graph4
p_salary_position#scatterplot
p_salary_ppg <- nba_df |>
ggplot(aes(x = PPG, y = salary / 1e6)) +
geom_point(aes(color = position), alpha = 0.6) +
geom_smooth(method = "lm", se = FALSE, color = "black") +
labs(
title = "Salary vs. Points Per Game",
x = "Points Per Game (PPG)",
y = "Salary (Millions USD)",
color = "Position",
caption = "Data: Basketball Reference + nbastatR"
) +
theme_light() +
theme(plot.title = element_text(hjust = 0.5))
#Graph5
p_salary_ppg #scatter salary vs PPG`geom_smooth()` using formula = 'y ~ x'
p_salary_rpg <- nba_df |>
ggplot(aes(x = RPG, y = salary / 1e6)) +
geom_point(aes(color = position), alpha = 0.6) +
geom_smooth(method = "lm", se = FALSE, color = "black") +
labs(
title = "Salary vs. Rebounds Per Game",
x = "Rebounds Per Game (RPG)",
y = "Salary (Millions USD)",
color = "Position",
caption = "Data: Basketball Reference + nbastatR"
) +
theme_light() +
theme(plot.title = element_text(hjust = 0.5))
#Graph6
p_salary_rpg #scatter salary vs RPG`geom_smooth()` using formula = 'y ~ x'
p_salary_apg <- nba_df |>
ggplot(aes(x = APG, y = salary / 1e6)) +
geom_point(aes(color = position), alpha = 0.6) +
geom_smooth(method = "lm", se = FALSE, color = "black") +
labs(
title = "Salary vs. Assists Per Game",
x = "Assists Per Game (APG)",
y = "Salary (Millions USD)",
color = "Position",
caption = "Data: Basketball Reference + nbastatR"
) +
theme_light() +
theme(plot.title = element_text(hjust = 0.5))
#Graph7
p_salary_apg #scatter salary vs APG`geom_smooth()` using formula = 'y ~ x'
#adding a new variable (value score = PPG + RPG + APG)
nba_df <- nba_df |>
mutate(
salary_M = salary / 1e6,
value_score = (PPG + RPG + APG) / salary_M
)
#Graph 8: top 10 most underpaid nba players
p_underpaid <- nba_df |>
arrange(desc(value_score)) |>
slice_head(n = 10) |>
ggplot(aes(x = value_score, y = reorder(player_name, value_score), fill = position)) +
geom_bar(stat = "identity") +
labs(
title = "Top 10 Most Underpaid NBA Players",
subtitle = "Highest production per million dollars earned",
x = "Value Score (PPG + RPG + APG) / Salary (Millions)",
y = "Player",
fill = "Position",
caption = "Data: Basketball Reference + nbastatR"
) +
theme_light() +
theme(
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5)
)
p_underpaid#Graph 9: top 10 most overderpaid nba players
p_overpaid <- nba_df |>
arrange(value_score) |>
slice_head(n = 10) |>
ggplot(aes(x = value_score, y = reorder(player_name, value_score), fill = position)) +
geom_bar(stat = "identity") +
labs(
title = "Top 10 Most Overpaid NBA Players",
subtitle = "Lowest production per million dollars earned",
x = "Value Score (PPG + RPG + APG) / Salary (Millions)",
y = "Player",
fill = "Position",
caption = "Data: Basketball Reference + nbastatR"
) +
theme_light() +
theme(
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5)
)
p_overpaid#Graph 10: Scatter plot of Salary vs Value score
p_salary_value <- nba_df |>
ggplot(aes(x = salary_M, y = value_score)) +
geom_point(aes(color = position), alpha = 0.6) +
geom_smooth(method = "lm", se = FALSE, color = "black") +
labs(
title = "NBA Salary vs. Value Score",
x = "Salary (Millions USD)",
y = "Value Score (PPG + RPG + APG) / Salary (Millions)",
color = "Position",
caption = "Data: Basketball Reference + nbastatR"
) +
theme_light() +
theme(
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5)
)
p_salary_value`geom_smooth()` using formula = 'y ~ x'
Graph 1 - Distribution of 2025-26 salaries. A histogram of salary / 1e6 (salary in millions of USD) across all 375 players. A classic right-skewed distribution: the bulk of the NBA is concentrated in the $1–5M range (rookie scale, minimums, small role-player deals), and the tail stretches out past $50M for supermax contracts. You see a spike near the league minimum (~$2M), a secondary hump in the mid-level exception range ($10–15M), and a sparse tail of $30M+ max contracts.
Graph 2 - Top 10 highest-paid NBA players. A horizontal bar chart of the 10 players with the highest 2025-26 salary, sorted by salary, colored by position.
Graph 3 - Top 10 scorers. A horizontal bar chart of the 10 players with the highest 2024-25 PPG, again colored by position.
Graph 4 - Salary boxplot by position. A box-and-whisker plot of salary / 1e6 across the three position buckets (C, F, G). Centers have the highest median salary and the longest upper tail; forwards are close behind; guards have the lowest median with a heavy upper tail (because a handful of superstar guards like Steph Curry sit at the very top). The box heights are similar, but the medians shift in the order C ≥ F > G, which is exactly what the regression is capturing.
Graph 5 - Salary vs. Points Per Game. A scatter plot with PPG on the x-axis and salary / 1e6 on the y-axis, colored by position. A clear upward-sloping linear trend: more points per game, more salary. The slope of the fitted line is roughly $1.3–1.4M per additional PPG, matching the regression’s PPG coefficient of $1,366,110. Wide vertical scatter at every PPG level - lots of $5M players and $25M players all putting up ~15 PPG which is the residual variation the regression cannot explain.
Graph 6 - Salary vs. Rebounds Per Game. This is the visualization for a subtle regression finding: in a simple model salary ~ RPG, rebounds look meaningful, but in the multiple regression controlling for PPG, APG, and position, RPG collapses to a non-significant coefficient. A textbook example of confounding.
Graph 7 - Salary vs. Assists Per Game. A scatter of salary vs APG, same structure as Graphs 5 and 6. Guards dominate the right side of the x-axis (high APG), but the fitted line still sits lower overall than Graph 5’s line, because the guard position dummy drags the intercept down. Interpretation: assists are more valuable per unit, and teams pay for playmaking even more than for scoring.
Graph 8 - Top 10 most underpaid players. A bar chart of the 10 players with the highest value_score = (PPG + RPG + APG / salary_M, i.e., players who produce the most counting stats per $1M of salary. This is a different definition of “underpaid” than the regression residuals: it is a simple ratio and does not control for position or anything else. As a productivity-per-dollar view, it reliably shows rookie contracts and hard-working role players, but it is not the same as being underpaid relative to expectations.
Graph 9 - Top 10 most overpaid players. Same as Graph 8 but using the lowest value scores (smallest production / salary ratio). It shows large-salary, small-production players, which are typically: injured stars on supermax deals who missed most of 2024-25 (Kawhi Leonard, Zion Williamson, Joel Embiid seasons with low GP); aging veterans on legacy max contracts whose per-game production has fallen off (Paul George); and players who were traded mid-season and ended up in a reduced role (Cam Thomas, Buddy Hield after leaving the Pacers - neither of which have bad contracts, but they fit the mold nonetheless).
Graph 10 - Scatter of Salary vs. Value Score. Every player plotted with salary on x and value score on y, colored by position, with a linear regression line. Points far above the curve for a given salary are local bargains; points far below are local overpays. This pairs naturally with the residual-based over/underpaid lists from the regression.
# ============================================================
# STEP 7 — create interactive Plots
# ============================================================
# Salary vs PPG animated by postion (G,F,C)
library(plotly)
library(ggplot2)
plot_ly(
data=nba_df,
x=~salary,
y=~PPG,
frame = ~position,
type='histogram2d'
) |>
layout(
title='NBA Salary vs Points Per Game By Position',
xaxis=list(title='Salary'),
yaxis=list(title='Points Per Game')
)# Interactive scatter salary vs value score
p_interactive_value <- nba_df |>
ggplot(aes(
x = salary_M,
y = value_score,
color = position,
text = paste0(
player_name,
"<br>Team: ", team,
"<br>Position: ", position,
"<br>Salary: $", round(salary_M, 2), "M",
"<br>PPG: ", round(PPG, 1),
"<br>RPG: ", round(RPG, 1),
"<br>APG: ", round(APG, 1),
"<br>Value: ", round(value_score, 2)
)
)) +
geom_point(alpha = 0.7, size = 2) +
labs(
title = "NBA Salary vs. Value Score",
x = "Salary (Millions USD)",
y = "Value Score (PPG + RPG + APG) / Salary (Millions)",
color = "Position",
caption = "Hover to identify players; data: Basketball Reference + nbastatR"
) +
theme_light() +
theme(plot.title = element_text(hjust = 0.5))
ggplotly(p_interactive_value, tooltip = "text")Interactive Figure 1 - Salary × PPG, faceted by position. This produces an animated 2D density / heatmap where hitting Play (or clicking the slider) cycles through the three position groups (C, F, G). For each frame, you see the joint density of salary and PPG as a colored grid: where players cluster most heavily.
Interactive Figure 2 - Scatter of Salary vs. Value Score with hover. The same content as Graph 10, but rendered via ggplotly() so every point exposes a hover tooltip containing player name, team, position, salary (in M), PPG / RPG / APG, and value score. The static Graph 10 gives the shape; this interactive version lets you identify individual players by hovering on the outlier dots to see who the bargains and luxuries are without needing a separate label layer.
interactive_nba_df <- nba_df |>
select(player_name,team,GP,FG_PCT,MPG,PPG,RPG,APG, salary)
#install.packages('DT')
library(DT)Warning: package 'DT' was built under R version 4.5.3
datatable(interactive_nba_df)# ============================================================
# Create a linear regression model
# ============================================================
#multiple linear regression using PPG,Apg,rpg, position as predictors
#salary as the dependent variable
lm.salary <- lm(salary~PPG+RPG+APG+position,data=nba_df)
summary(lm.salary)
Call:
lm(formula = salary ~ PPG + RPG + APG + position, data = nba_df)
Residuals:
Min 1Q Median 3Q Max
-24393940 -5028883 -204617 3709304 26237717
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4990852 1717599 -2.906 0.00389 **
PPG 1366110 111835 12.215 < 0.0000000000000002 ***
RPG 91703 283746 0.323 0.74674
APG 1694643 371597 4.560 0.00000696 ***
positionF 515249 1379620 0.373 0.70901
positionG -3414409 1686377 -2.025 0.04362 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7956000 on 369 degrees of freedom
Multiple R-squared: 0.6771, Adjusted R-squared: 0.6727
F-statistic: 154.7 on 5 and 369 DF, p-value: < 0.00000000000000022
#Adjusted R Squared Value is 0.6711
lm.salary
Call:
lm(formula = salary ~ PPG + RPG + APG + position, data = nba_df)
Coefficients:
(Intercept) PPG RPG APG positionF positionG
-4990852 1366110 91703 1694643 515249 -3414409
Coefficient estimates (salary ~ PPG + RPG + APG + position):
| Term | Estimate | Std. Error | t value | p value |
|---|---|---|---|---|
| (Intercept) | -4,990,852 | 1,717,599 | -2.91 | 0.0039 |
| PPG | 1,366,110 | 111,835 | 12.22 | < 2e-16 |
| RPG | 91,703 | 283,746 | 0.32 | 0.747 |
| APG | 1,694,643 | 371,597 | 4.56 | 7.0e-06 |
| positionF | 515,249 | 1,379,620 | 0.37 | 0.709 |
| positionG | -3,414,409 | 1,686,377 | -2.03 | 0.044 |
Coefficient interpretation. Assists are valued more per unit than points, with about $1.69M per assist compared to $1.37M per point. Points and assists are on very different scales though, since players average roughly four times as many points as assists, so in total scoring still explains more of the overall salary. But on the margin, an extra assist is paid more than an extra point, which lines up with how teams clearly value primary playmakers. Rebounding, however, is not a meaningful salary driver once you control for everything else: the RPG coefficient is very small and not statistically significant, suggesting teams are not directly paying for rebounds themselves. Instead, they are paying for the overall production bundle: scoring and play making, that often comes with strong rebounders. This also explains why rebounds look important in a simple scatter plot but lose their effect once you control for other variables, a good example of confounding. Position effects also highlight differences in how the market values players. After controlling for the same level of production, guards are paid about $3.4M less than centers, while forwards are paid roughly the same as centers (the difference is not significant). This likely reflects the fact that guards are more common in the league, which pushes their price down, while productive big men are more scarce and harder to replace, leading to higher salaries even when their basic stats are similar.
Addressing MLR assumptions. This analysis is cross-sectional and based on a single season, so it does not capture the fact that salaries reflect multi-year expectations, injuries, contract structures, and extensions. As a result, some players identified as “overpaid” may simply be on legacy max deals or aging contracts, while “underpaid” players are often still on rookie scale contracts or veteran minimums. The model also ignores the hard salary cap structure, where rookie scale and supermax deals create step functions that a linear model cannot capture, which helps explain some of the large residuals. On top of that, the model only includes PPG, APG, RPG, and position, so it omits important factors like efficiency, defense, usage, turnovers, and availability. This means certain player types, especially defensive anchors, may appear overpaid because their contributions are not included in the regression, think Draymond Green. The model also imposes a linearity assumption, where the marginal effect of an additional point is constant across all scoring levels, even though the true relationship is likely nonlinear at the top. Lastly, there is some multicollinearity between PPG and APG, since high-usage players tend to do both, which can inflate standard errors, but the coefficients can still be interpreted as partial effects.
# ============================================================
# Create a clustering model
# ============================================================
install.packages("cluster")Warning: package 'cluster' is in use and will not be installed
library(cluster)
install.packages("NbClust")Warning: package 'NbClust' is in use and will not be installed
library(NbClust)
#############Guards Cluster ###############################################
#will leave salary out since that is the dependent variable
Guards_data <- nba_df |>
filter(position=='G')|>
select(PPG,RPG,APG)
guard_scale <- scale(Guards_data)
# Estimate the best number of clusters from 2 to 10 using k-means criteria
guard_number_cluster_estimate <- NbClust(
guard_scale,
distance = "euclidean",
min.nc = 2,
max.nc = 10,
method = "kmeans"
)*** : The Hubert index is a graphical method of determining the number of clusters.
In the plot of Hubert index, we seek a significant knee that corresponds to a
significant increase of the value of the measure i.e the significant peak in Hubert
index second differences plot.
*** : The D index is a graphical method of determining the number of clusters.
In the plot of D index, we seek a significant knee (the significant peak in Dindex
second differences plot) that corresponds to a significant increase of the value of
the measure.
*******************************************************************
* Among all indices:
* 11 proposed 2 as the best number of clusters
* 4 proposed 3 as the best number of clusters
* 1 proposed 6 as the best number of clusters
* 1 proposed 8 as the best number of clusters
* 5 proposed 9 as the best number of clusters
* 2 proposed 10 as the best number of clusters
***** Conclusion *****
* According to the majority rule, the best number of clusters is 2
*******************************************************************
# Show the voting results for the best number of clusters
guard_number_cluster_estimate$Best.nc KL CH Hartigan CCC Scott Marriot TrCovW
Number_clusters 9.0000 2.0000 8.000 2.0000 3.0 9.0 9.0000
Value_Index 84.9226 220.1545 36.859 -1.0129 122.3 516162.6 619.5345
TraceW Friedman Rubin Cindex DB Silhouette Duda
Number_clusters 3.0000 9.0000 9.000 6.0000 2.0000 2.0000 2.0000
Value_Index 31.0073 3.0137 -0.924 0.1539 0.8626 0.4988 1.2086
PseudoT2 Beale Ratkowsky Ball PtBiserial Frey McClain
Number_clusters 2.0000 2.000 2.0000 3.0000 2.0000 3.0000 2.0000
Value_Index -17.9469 -0.291 0.5339 54.5991 0.6494 1.0971 0.3918
Dunn Hubert SDindex Dindex SDbw
Number_clusters 10.0000 0 2.00 0 10.0000
Value_Index 0.0516 0 2.65 0 0.3272
# Set seed for reproducibility
set.seed(123)
# Run PAM clustering with 5 clusters
# Note: the slides call this k-means, but this function is PAM
kmeans_guard_scalecluster <- pam(guard_scale, k = 5)
# Show medoids for the clusters
kmeans_guard_scalecluster$medoids PPG RPG APG
[1,] -0.5799024 -0.4510787 -0.7969434
[2,] -1.0457889 -1.3011721 -0.8440058
[3,] -0.2887233 0.1156503 -0.2321937
[4,] 0.4829012 0.4698559 0.4266808
[5,] 1.7640891 1.2491082 1.4620551
# Show the cluster assignment for each row
kmeans_guard_scalecluster$clustering [1] 1 1 2 3 1 3 1 1 4 4 3 5 5 4 1 1 2 2 4 4 1 2 3 4 3 4 5 2 4 2 3 3 4 4 4 1 3
[38] 1 4 2 4 1 2 5 5 3 5 5 4 5 5 1 4 5 4 4 2 4 2 2 1 2 1 1 3 3 4 1 4 1 5 1 1 4
[75] 2 5 4 2 4 5 5 2 5 3 1 2 3 2 2 1 2 4 3 1 2 4 3 5 1 5 1 4 1 2 1 1 3 3 3 4 3
[112] 2 5 3 5 2 1 3 3 2 4 2 3 3 3 3 2 1 1 3 4 3 1 2 4 2 4 2 2 2 4 1 1 4 4 2 4 5
[149] 5 4 3 1 2 3 1 5 3 3 5 2 5 3 5 4 3 5
# Plot the clustering result in two reduced dimensions
plot(kmeans_guard_scalecluster)# Add the assigned cluster to the Guard data
guard_cluster <- Guards_data %>%
mutate(cluster = kmeans_guard_scalecluster$clustering)
# Show the dataset with assigned clusters
guard_cluster# A tibble: 166 × 4
PPG RPG APG cluster
<dbl> <dbl> <dbl> <int>
1 7.4 2.4 1.5 1
2 7.6 2 2.6 1
3 5.5 1.3 1.3 2
4 12 3.9 1.8 3
5 6.5 1.9 1.8 1
6 7.1 2.9 2.5 3
7 9.7 2.2 1.1 1
8 3.4 2.7 1.2 1
9 10 3.3 5 4
10 19.3 2.7 4.8 4
# ℹ 156 more rows
# Compute the average of each variable by cluster
guard_cluster_summary <- guard_cluster %>%
group_by(cluster) %>%
summarise(across(everything(), ~ mean(.x, na.rm = TRUE)))
# Show cluster summaries
guard_cluster_summary# A tibble: 5 × 4
cluster PPG RPG APG
<int> <dbl> <dbl> <dbl>
1 1 7.25 2.41 1.51
2 2 4.37 1.30 1.44
3 3 9.57 3.17 2.76
4 4 15.3 3.78 4.37
5 5 22.9 4.93 6.65
#############Forward Cluster ###############################################
#will leave salary out since that is the dependent variable
Forward_data <- nba_df |>
filter(position=='F')|>
select(PPG,RPG,APG)
Forward_scale <- scale(Forward_data)
# Estimate the best number of clusters from 2 to 10 using k-means criteria
forward_number_cluster_estimate <- NbClust(
Forward_scale,
distance = "euclidean",
min.nc = 2,
max.nc = 10,
method = "kmeans"
)*** : The Hubert index is a graphical method of determining the number of clusters.
In the plot of Hubert index, we seek a significant knee that corresponds to a
significant increase of the value of the measure i.e the significant peak in Hubert
index second differences plot.
*** : The D index is a graphical method of determining the number of clusters.
In the plot of D index, we seek a significant knee (the significant peak in Dindex
second differences plot) that corresponds to a significant increase of the value of
the measure.
*******************************************************************
* Among all indices:
* 13 proposed 2 as the best number of clusters
* 4 proposed 3 as the best number of clusters
* 1 proposed 5 as the best number of clusters
* 3 proposed 6 as the best number of clusters
* 1 proposed 7 as the best number of clusters
* 2 proposed 10 as the best number of clusters
***** Conclusion *****
* According to the majority rule, the best number of clusters is 2
*******************************************************************
# Show the voting results for the best number of clusters
forward_number_cluster_estimate$Best.nc KL CH Hartigan CCC Scott Marriot TrCovW
Number_clusters 2.0000 2.0000 10.0000 2.0000 3.0000 6.0 6.000
Value_Index 3.9759 177.3621 27.2849 -2.7604 105.2346 126929.8 221.814
TraceW Friedman Rubin Cindex DB Silhouette Duda
Number_clusters 3.0000 6.0000 7.0000 5.000 2.0000 2.0000 2.0000
Value_Index 31.4724 3.8352 -0.5637 0.193 0.8834 0.4913 1.0321
PseudoT2 Beale Ratkowsky Ball PtBiserial Frey McClain
Number_clusters 2.0000 2.0000 2.000 3.000 2.0000 2.0000 2.0000
Value_Index -3.5406 -0.0523 0.529 50.447 0.6169 1.8034 0.3751
Dunn Hubert SDindex Dindex SDbw
Number_clusters 2.0000 0 3.0000 0 10.0000
Value_Index 0.0763 0 3.0402 0 0.1747
# Set seed for reproducibility
set.seed(123)
# Run PAM clustering with 5 clusters
# Note: the slides call this k-means, but this function is PAM
kmeans_forward_scalecluster <- pam(Forward_scale, k = 5)
# Show medoids for the clusters
kmeans_forward_scalecluster$medoids PPG RPG APG
[1,] 0.6145837 0.6821109 0.1679893
[2,] -0.1209854 -0.1022123 -0.2951544
[3,] 1.2249495 1.3683938 1.7559107
[4,] -1.2008634 -1.3277175 -1.0891151
[5,] -0.5748472 -0.5433942 -0.6921348
# Show the cluster assignment for each row
kmeans_forward_scalecluster$clustering [1] 1 2 3 1 4 2 1 1 1 2 3 2 4 2 5 1 2 2 5 4 1 3 5 3 5 2 4 4 2 4 3 2 3 3 5 2 3
[38] 5 2 5 2 4 4 1 5 1 4 2 2 3 2 5 5 5 3 2 3 4 2 2 1 4 3 1 5 1 2 4 4 4 2 4 3 2
[75] 2 1 1 2 1 5 3 5 1 2 4 5 2 1 2 1 3 2 4 5 2 5 1 1 3 5 1 5 5 1 2 4 5 1 3 3 4
[112] 2 3 5 5 3 5 2 2 5 5 5 1 3 4 5 4 5 1 2 1 1 2 1 5 4 2 2 4 2 3
# Plot the clustering result in two reduced dimensions
plot(kmeans_forward_scalecluster)# Add the assigned cluster to the Guard data
forward_cluster <- Forward_data %>%
mutate(cluster = kmeans_forward_scalecluster$clustering)
# Show the dataset with assigned clusters
forward_cluster# A tibble: 141 × 4
PPG RPG APG cluster
<dbl> <dbl> <dbl> <int>
1 14.7 4.8 3.2 1
2 12 4 1.2 2
3 14.1 8.2 3.8 3
4 18 4.5 2.6 1
5 2.5 1.3 0.3 4
6 10.1 5.1 2.3 2
7 16.1 5.3 1.9 1
8 12.3 5 3.4 1
9 13.9 8.4 2.1 1
10 8.3 5.1 1 2
# ℹ 131 more rows
# Compute the average of each variable by cluster
forward_cluster_summary <- forward_cluster %>%
group_by(cluster) %>%
summarise(across(everything(), ~ mean(.x, na.rm = TRUE)))
# Show cluster summaries
forward_cluster_summary# A tibble: 5 × 4
cluster PPG RPG APG
<int> <dbl> <dbl> <dbl>
1 1 15.3 5.70 2.41
2 2 9.82 4.07 1.83
3 3 20.9 7.11 4.85
4 4 3.31 1.55 0.517
5 5 6.55 3.37 1.09
#############Center Cluster ###############################################
#will leave salary out since that is the dependent variable
Center_data <- nba_df |>
filter(position=='C')|>
select(PPG,RPG,APG)
Center_scale <- scale(Center_data)
# Estimate the best number of clusters from 2 to 10 using k-means criteria
Center_number_cluster_estimate <- NbClust(
Center_scale,
distance = "euclidean",
min.nc = 2,
max.nc = 10,
method = "kmeans"
)*** : The Hubert index is a graphical method of determining the number of clusters.
In the plot of Hubert index, we seek a significant knee that corresponds to a
significant increase of the value of the measure i.e the significant peak in Hubert
index second differences plot.
*** : The D index is a graphical method of determining the number of clusters.
In the plot of D index, we seek a significant knee (the significant peak in Dindex
second differences plot) that corresponds to a significant increase of the value of
the measure.
*******************************************************************
* Among all indices:
* 9 proposed 2 as the best number of clusters
* 1 proposed 3 as the best number of clusters
* 6 proposed 4 as the best number of clusters
* 4 proposed 7 as the best number of clusters
* 1 proposed 8 as the best number of clusters
* 2 proposed 9 as the best number of clusters
* 1 proposed 10 as the best number of clusters
***** Conclusion *****
* According to the majority rule, the best number of clusters is 2
*******************************************************************
# Show the voting results for the best number of clusters
Center_number_cluster_estimate$Best.nc KL CH Hartigan CCC Scott Marriot TrCovW
Number_clusters 8.0000 4.0000 4.0000 4.0000 7.0000 7.00 7.0000
Value_Index 6.8335 75.8917 20.1033 -1.7679 63.7695 26208.74 74.3384
TraceW Friedman Rubin Cindex DB Silhouette Duda
Number_clusters 4.0000 7.0000 4.0000 9.0000 2.0000 2.0000 2.0000
Value_Index 13.9789 7.1011 -0.8626 0.2131 0.9248 0.4565 0.8828
PseudoT2 Beale Ratkowsky Ball PtBiserial Frey McClain
Number_clusters 2.0000 2.0000 2.0000 3.0000 2.0000 2.0000 2.0000
Value_Index 4.7794 0.2153 0.5153 26.3169 0.5825 1.2071 0.3817
Dunn Hubert SDindex Dindex SDbw
Number_clusters 9.0000 0 4.0000 0 10.0000
Value_Index 0.0861 0 3.8028 0 0.1398
# Set seed for reproducibility
set.seed(123)
# Run PAM clustering with 5 clusters
# Note: the slides call this k-means, but this function is PAM
kmeans_center_scalecluster <- pam(Center_scale, k = 5)
# Show medoids for the clusters
kmeans_center_scalecluster$medoids PPG RPG APG
[1,] -0.77630545 -0.8919938 -0.9010956
[2,] 0.04074820 -0.1309036 0.2406653
[3,] -0.03029994 0.5970957 -0.6156554
[4,] 2.65176747 1.4905494 1.9533067
[5,] 1.05318424 1.2258224 0.8115457
# Show the cluster assignment for each row
kmeans_center_scalecluster$clustering [1] 1 2 2 3 4 1 4 1 2 1 5 1 3 2 2 1 5 2 4 3 1 1 1 2 2 5 2 5 5 5 1 5 5 1 1 1 1 4
[39] 3 2 2 2 1 2 5 2 1 1 2 2 2 1 2 3 5 1 1 1 5 1 1 2 4 5 2 1 3 3
# Plot the clustering result in two reduced dimensions
plot(kmeans_center_scalecluster)# Add the assigned cluster to the Guard data
center_cluster <- Center_data %>%
mutate(cluster = kmeans_center_scalecluster$clustering)
# Show the dataset with assigned clusters
center_cluster# A tibble: 68 × 4
PPG RPG APG cluster
<dbl> <dbl> <dbl> <int>
1 5.8 4.2 0.5 1
2 9 6.2 2.1 2
3 13 6.5 2.4 2
4 7.3 7.8 0.9 3
5 24.7 11.6 3.5 4
6 1.9 2 0.4 1
7 18.1 9.6 4.3 4
8 5.1 5.6 1.1 1
9 13 5 1.8 2
10 4.4 4.2 0.5 1
# ℹ 58 more rows
# Compute the average of each variable by cluster
center_cluster_summary <- center_cluster %>%
group_by(cluster) %>%
summarise(across(everything(), ~ mean(.x, na.rm = TRUE)))
# Show cluster summaries
center_cluster_summary# A tibble: 5 × 4
cluster PPG RPG APG
<int> <dbl> <dbl> <dbl>
1 1 4.40 3.58 0.738
2 2 9.51 6.08 1.93
3 3 8.51 8.04 1.04
4 4 22.1 11.8 4.12
5 5 14.3 9.91 2.32
| Cluster | PPG | RPG | APG | Tier / Archetype | Example |
|---|---|---|---|---|---|
| 2 | 4.4 | 1.3 | 1.4 | End-of-bench / 2 way guards (g league) | Ron Harper Jr. |
| 1 | 7.3 | 2.4 | 1.5 | Low-usage rotation guard | Keon Ellis, Kris Dunn |
| 3 | 9.6 | 3.2 | 2.8 | Solid rotation guard/ 6th man | Dennis Schroder, Bones Hyland |
| 4 | 15.3 | 3.8 | 4.4 | Secondary creators / high end starters | Derrick White, Ajay Mitchell |
| 5 | 22.9 | 4.9 | 6.7 | All star ball handlers | Jalen Brunson, SGA, Donovan Mitchell |
PPG, RPG, and APG move together from cluster 2 (all = −1 SD) to cluster 5 (all = +1.5 SD or more). The one exception is cluster 3, which has a slightly elevated RPG compared to its scoring: those are bigger, slashing combo guards who get to the paint and rebound a bit more than their points suggest. Guards is the largest bucket (166 players) and the most evenly spread across clusters, which is consistent with the modern NBA being focused on many ball-handling roles.
| Cluster | PPG | RPG | APG | Tier / Archetype | Example |
|---|---|---|---|---|---|
| 4 | 3.3 | 1.6 | 0.5 | End-of-bench / deep reserve forward | Carter Bryant, Nicolas Batum |
| 5 | 6.6 | 3.4 | 1.1 | Low-usage 3&D forward | Jordan Walsh, Nikola Jovic |
| 2 | 9.8 | 4.1 | 1.8 | 3-and-D / connector wing (rotation forwards) | Tari Eason, Royce O’Neal |
| 1 | 15.3 | 5.7 | 2.4 | Scoring starters | Aaron Gordon, Jaime Jaquez Jr. |
| 3 | 20.9 | 7.1 | 4.9 | All star forward / play making wings | Jayson Tatum, Lebron James, Deni Avdija |
It is mostly a tier ladder, but the top cluster (3) stands out because its APG jumps disproportionately. That captures point-forwards / floor generals: the Luka Doncic / Giannis / Kevin Durant-adjacent wing-creator types. Cluster 1 is more like the traditional scoring small forward or power forward who averages ~15/6 without being the primary initiator (think Tobias Harris, Jaden McDaniels). Forwards also have the cleanest separation between clusters 4 and 5 because RPG differentiates them even when PPG is similar.
| Cluster | PPG | RPG | APG | Tier / Archetype | Example |
|---|---|---|---|---|---|
| 1 | 4.4 | 3.6 | 0.7 | End-of-bench big/ 3rd center | Clint Capela |
| 2 | 9.5 | 6.1 | 1.9 | Backup / Rotation bigs | Daniel Gafford |
| 3 | 8.5 | 8.0 | 1.0 | Screen-setter / rim-protector / glass-cleaner (rebound first low usage bigs) | Mitchell Robinson |
| 5 | 14.3 | 9.9 | 2.3 | All-Star caliber traditional bigs | Jalen Duren |
| 4 | 22.1 | 11.8 | 4.1 | Franchise centers | Jokic / Wembanyama |
Centers is where the k = 5 choice is most interesting, because you can see two different mid-tier archetypes:
That is a genuine difference in player roles, not just a tier difference. Clusters 4 and 5 are both star bigs; the gap between them is the gap between an All-Star and an MVP-tier big man, and only a handful of players land in cluster 4. Centers is also the smallest group (68 players), so its clusters are more sensitive to individual outliers than the guard / forward splits.
Tying the clusters back to the regression residuals: a superstar in guard cluster 5 or center cluster 4 who still has a large positive residual is overpaid even relative to elite production - i.e., on a max contract but past their prime, (think Paul George, Joel Embiid). Conversely, a player in the “low-usage rotation” clusters (guard C1, forward C5, center C2 / C3) with a large negative residual is exactly the “cheap useful role player” the team values more than the market does; typically a rookie-scale or vet-minimum guy.
Overall, the results show that NBA salaries are strongly influenced by offensive production, particularly scoring and assists. Players responsible for scoring more and creating opportunities for others tend to earn significantly higher salaries. We found that not all performance metrics are equally valued, as rebounding was not a significant predictor in the model.
Visualizations also highlight inefficiencies in the market. Some players provide high production relative to their salary and can be considered underpaid, while others are on large contracts without matching performance, making them overpaid. The negative relationship between salary and value score further supports this idea.
The clustering analysis reinforces that players fall into distinct performance tiers, ranging from low-impact role players to superstars. It suggests that teams may be paying based on perceived player roles rather than strictly on statistical efficiency.
In conclusion, while NBA salaries are partially aligned with performance, there are clear inconsistencies. Teams heavily reward scoring and playmaking, but overall value is not always reflected in the salary, leading to both overpaid and underpaid players league wide.
#install.packages("leaflet")
library(leaflet)Warning: package 'leaflet' was built under R version 4.5.3
leaflet() |>
addTiles()|>
addMarkers(
lng = -84.5184,
lat = 33.9384,
popup = 'KSU Marietta'
)#install.packages('qrcode')
library(qrcode)
#will need to update URL once final draft is complete
qr <- qr_code('https://rpubs.com/wmostofa/1424378')
plot(qr)