TylerWasif_FinalProject

#install.packages('qrcode')
library(qrcode)

Warning: package 'qrcode' was built under R version 4.5.3

#will need to update URL once final draft is complete
qr <- qr_code('https://rpubs.com/wmostofa/1428118')
plot(qr)

Introduction

Our project analyzes whether NBA players are being paid relative to their on-court performance during the 2024–2025 season. The main goal is to understand how player salaries relate to key performance metrics such as points, rebounds, and assists, and to identify players who may be overpaid or underpaid based on their production.

To do this, we used two main data sources. Salary data was collected from Basketball Reference and cleaned to ensure consistency, while performance data was obtained using the nbastatR package (also from Basketball Reference). These datasets were merged so that each player had both salary and performance statistics in a single dataset. Players with very limited playing time were removed to ensure more reliable findings.

After cleaning and merging the data, we conducted exploratory data analysis using visualizations to identify trends and relationships between salary and performance. Then we fit a multiple linear regression model that was used to quantify how performance metrics impact salary. Lastly, we applied clustering techniques to group players with similar performance profiles, providing additional insight into player roles and value.

Data Processioning

# 1. Setup

library(tidyverse)

Warning: package 'tidyverse' was built under R version 4.5.2

Warning: package 'ggplot2' was built under R version 4.5.3

Warning: package 'tidyr' was built under R version 4.5.2

Warning: package 'readr' was built under R version 4.5.2

Warning: package 'purrr' was built under R version 4.5.2

Warning: package 'dplyr' was built under R version 4.5.2

Warning: package 'stringr' was built under R version 4.5.2

Warning: package 'forcats' was built under R version 4.5.2

Warning: package 'lubridate' was built under R version 4.5.2

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.3     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(stringi)

Warning: package 'stringi' was built under R version 4.5.2

library(ggrepel)

Warning: package 'ggrepel' was built under R version 4.5.2

library(patchwork)

Warning: package 'patchwork' was built under R version 4.5.2

library(gridExtra)

Warning: package 'gridExtra' was built under R version 4.5.2


Attaching package: 'gridExtra'

The following object is masked from 'package:dplyr':

    combine

library(plotly)

Warning: package 'plotly' was built under R version 4.5.2


Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

library(cluster)

Warning: package 'cluster' was built under R version 4.5.3

library(NbClust)

Warning: package 'NbClust' was built under R version 4.5.2

# Step 1: Install remotes (simpler and more lightweight than devtools)
#install.packages("remotes")

# Step 2: Restart R Session
# In RStudio: Session -> Restart R

# Step 3: Install nbastatR using remotes instead
#remotes::install_github("abresler/nbastatR")

# Step 4: Restart R Session again
# In RStudio: Session -> Restart R

# Step 5: Load the package
library(nbastatR)

Warning: replacing previous import 'curl::handle_reset' by 'httr::handle_reset'
when loading 'nbastatR'

Warning: replacing previous import 'httr::timeout' by 'memoise::timeout' when
loading 'nbastatR'

Warning: replacing previous import 'magrittr::set_names' by 'purrr::set_names'
when loading 'nbastatR'

Warning: replacing previous import 'jsonlite::flatten' by 'purrr::flatten' when
loading 'nbastatR'

Warning: replacing previous import 'curl::parse_date' by 'readr::parse_date'
when loading 'nbastatR'

Warning: replacing previous import 'purrr::flatten_lgl' by 'rlang::flatten_lgl'
when loading 'nbastatR'

Warning: replacing previous import 'purrr::splice' by 'rlang::splice' when
loading 'nbastatR'

Warning: replacing previous import 'purrr::flatten_chr' by 'rlang::flatten_chr'
when loading 'nbastatR'

Warning: replacing previous import 'purrr::flatten_raw' by 'rlang::flatten_raw'
when loading 'nbastatR'

Warning: replacing previous import 'purrr::flatten' by 'rlang::flatten' when
loading 'nbastatR'

Warning: replacing previous import 'jsonlite::unbox' by 'rlang::unbox' when
loading 'nbastatR'

Warning: replacing previous import 'purrr::flatten_dbl' by 'rlang::flatten_dbl'
when loading 'nbastatR'

Warning: replacing previous import 'purrr::invoke' by 'rlang::invoke' when
loading 'nbastatR'

Warning: replacing previous import 'purrr::flatten_int' by 'rlang::flatten_int'
when loading 'nbastatR'

Warning: replacing previous import 'readr::guess_encoding' by
'rvest::guess_encoding' when loading 'nbastatR'

Warning: replacing previous import 'magrittr::extract' by 'tidyr::extract' when
loading 'nbastatR'

Warning: replacing previous import 'rlang::as_list' by 'xml2::as_list' when
loading 'nbastatR'

library(ggplot2)


# eliminate scientific notation in salaries
options(scipen = 999)
library(tidyverse)
library(readr)

# ============================================================
# STEP 1 — Load and clean salary CSV
# ============================================================

salaries_raw <- read_csv("sportsref_download.csv", skip = 1)

Rows: 530 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): Player, Tm, 2025-26, 2026-27, 2027-28, 2028-29, 2029-30, 2030-31, G...
dbl (1): Rk

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Preview
glimpse(salaries_raw)

Rows: 530
Columns: 10
$ Rk         <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
$ Player     <chr> "Stephen Curry", "Joel Embiid", "Nikola Jokić", "Kevin Dura…
$ Tm         <chr> "GSW", "PHI", "DEN", "HOU", "BOS", "WAS", "MIL", "GSW", "BO…
$ `2025-26`  <chr> "$59,606,817", "$55,224,526", "$55,224,526", "$54,708,609",…
$ `2026-27`  <chr> "$62,587,158", "$58,100,000", "$59,033,114", "$43,902,439",…
$ `2027-28`  <chr> NA, "$62,748,000", "$62,841,702", "$46,097,561", "$62,786,6…
$ `2028-29`  <chr> NA, "$67,396,000", NA, NA, "$67,116,798", NA, NA, NA, "$64,…
$ `2029-30`  <chr> NA, NA, NA, NA, "$71,446,914", NA, NA, NA, NA, "$69,191,228…
$ `2030-31`  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ Guaranteed <chr> "$122,193,975", "$176,072,526", "$114,257,640", "$98,611,04…

# Function to clean a salary string and return a numeric value
# Handles inputs like "$59,606,817" or "59606817" or "$59606817"

clean_salary <- function(salary) {
  
  cleaned <- c()  # empty vector to store results
  
  for (i in salary) {
    if (is.character(i)) {
      # if salary is a string, strip $ signs, commas, and whitespace
      i <- i |>
        str_remove_all("\\$") |>
        str_remove_all(",") |>
        str_trim() |>
        as.numeric()
    }
    cleaned <- c(cleaned, i)
  }
  return(cleaned)
}

# Apply the function inside mutate()

salary_df <- salaries_raw |>
  select(Player, Tm, `2025-26`) |>
  rename(salary = `2025-26`) |>
  filter(!is.na(Player), !is.na(salary)) |>
  mutate(salary = clean_salary(salary))

#will only have one player per row. keeps highest salary row
salary_df_clean <- salary_df |>
  group_by(Player) |>
  summarize(
    team   = Tm[which.max(salary)],   # keep team associated with highest salary
    salary = max(salary),             # keep highest salary value
    .groups = "drop"
  )

# Confirm duplicates are gone
salary_df_clean |>
  count(Player) |>
  filter(n > 1)

# A tibble: 0 × 2
# ℹ 2 variables: Player <chr>, n <int>

# ============================================================
# STEP 2 Performance data (nbastatR, 2024-25 regular season)
# ============================================================

# nbastatR labels seasons by the END year, so 2024-25 = 2025.
# bref_players_stats() pulls per season averages directly from Basketball Reference
Sys.setenv(VROOM_CONNECTION_SIZE = 500072)

bref_pg <- nbastatR::bref_players_stats(
  seasons = 2025,
  tables  = "per_game"
)

parsed http://www.basketball-reference.com/leagues/NBA_2025_per_game.html
PerGame

Warning: package 'future' was built under R version 4.5.3

colnames(bref_pg)

 [1] "slugSeason"                    "namePlayer"                   
 [3] "groupPosition"                 "yearSeason"                   
 [5] "slugPosition"                  "isSeasonCurrent"              
 [7] "slugPlayerSeason"              "slugPlayerBREF"               
 [9] "agePlayer"                     "countGames"                   
[11] "countGamesStarted"             "pctFG"                        
[13] "pctFG3"                        "pctFG2"                       
[15] "pctEFG"                        "pctFT"                        
[17] "isHOFPlayer"                   "slugTeamsBREF"                
[19] "idPlayerNBA"                   "urlPlayerThumbnail"           
[21] "urlPlayerHeadshot"             "urlPlayerPhoto"               
[23] "urlPlayerStats"                "urlPlayerActionPhoto"         
[25] "nameTeamPerGame"               "minutesPerGame"               
[27] "fgmPerGame"                    "fgaPerGame"                   
[29] "fg3mPerGame"                   "fg3aPerGame"                  
[31] "fg2mPerGame"                   "fg2aPerGame"                  
[33] "ftmPerGame"                    "ftaPerGame"                   
[35] "orbPerGame"                    "drbPerGame"                   
[37] "trbPerGame"                    "astPerGame"                   
[39] "stlPerGame"                    "blkPerGame"                   
[41] "tovPerGame"                    "pfPerGame"                    
[43] "ptsPerGame"                    "AwardsPerGame"                
[45] "countTeamsPlayerSeasonPerGame" "urlPlayerBREF"

glimpse(bref_pg)

Rows: 572
Columns: 46
$ slugSeason                    <chr> "2024-25", "2024-25", "2024-25", "2024-2…
$ namePlayer                    <chr> "Shai Gilgeous-Alexander", "Giannis Ante…
$ groupPosition                 <chr> "G", "F", "C", "G", "G", "F", "F", "G", …
$ yearSeason                    <dbl> 2025, 2025, 2025, 2025, 2025, 2025, 2025…
$ slugPosition                  <chr> "PG", "PF", "C", "PG", "SG", "PF", "PF",…
$ isSeasonCurrent               <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ slugPlayerSeason              <chr> "gilgesh01_2025", "antetgi01_2025", "jok…
$ slugPlayerBREF                <chr> "gilgesh01", "antetgi01", "jokicni01", "…
$ agePlayer                     <dbl> 26, 30, 29, 25, 23, 26, 36, 24, 23, 28, …
$ countGames                    <dbl> 76, 67, 70, 50, 79, 72, 62, 52, 70, 65, …
$ countGamesStarted             <dbl> 76, 67, 70, 50, 79, 72, 62, 52, 70, 65, …
$ pctFG                         <dbl> 0.519, 0.601, 0.576, 0.450, 0.447, 0.452…
$ pctFG3                        <dbl> 0.375, 0.222, 0.417, 0.368, 0.395, 0.343…
$ pctFG2                        <dbl> 0.571, 0.620, 0.627, 0.522, 0.501, 0.559…
$ pctEFG                        <dbl> 0.569, 0.607, 0.627, 0.536, 0.547, 0.537…
$ pctFT                         <dbl> 0.898, 0.617, 0.800, 0.782, 0.837, 0.814…
$ isHOFPlayer                   <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ slugTeamsBREF                 <chr> "0", "0", "0", "0 | 0 | 0", "0", "0", "0…
$ idPlayerNBA                   <dbl> 1628983, 203507, 0, 0, 1630162, 1628369,…
$ urlPlayerThumbnail            <chr> "https://ak-static.cms.nba.com/wp-conten…
$ urlPlayerHeadshot             <chr> "https://ak-static.cms.nba.com/wp-conten…
$ urlPlayerPhoto                <chr> "https://ak-static.cms.nba.com/wp-conten…
$ urlPlayerStats                <chr> "https://stats.nba.com/player/1628983", …
$ urlPlayerActionPhoto          <chr> "https://stats.nba.com/media/players/700…
$ nameTeamPerGame               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ minutesPerGame                <dbl> 34.2, 34.2, 36.7, 35.4, 36.3, 36.4, 36.5…
$ fgmPerGame                    <dbl> 11.3, 11.8, 11.2, 9.2, 9.1, 9.2, 9.5, 9.…
$ fgaPerGame                    <dbl> 21.8, 19.7, 19.5, 20.5, 20.4, 20.3, 18.1…
$ fg3mPerGame                   <dbl> 2.1, 0.2, 2.0, 3.5, 4.1, 3.5, 2.6, 3.1, …
$ fg3aPerGame                   <dbl> 5.7, 0.9, 4.7, 9.6, 10.3, 10.1, 6.0, 9.2…
$ fg2mPerGame                   <dbl> 9.2, 11.6, 9.3, 5.7, 5.1, 5.7, 7.0, 6.1,…
$ fg2aPerGame                   <dbl> 16.1, 18.7, 14.8, 10.9, 10.1, 10.2, 12.1…
$ ftmPerGame                    <dbl> 7.9, 6.5, 5.2, 6.2, 5.3, 5.0, 4.9, 4.9, …
$ ftaPerGame                    <dbl> 8.8, 10.6, 6.4, 7.9, 6.3, 6.1, 5.8, 5.6,…
$ orbPerGame                    <dbl> 0.9, 2.2, 2.9, 0.8, 0.8, 0.7, 0.4, 0.3, …
$ drbPerGame                    <dbl> 4.1, 9.7, 9.9, 7.4, 4.9, 8.0, 5.7, 3.1, …
$ trbPerGame                    <dbl> 5.0, 11.9, 12.7, 8.2, 5.7, 8.7, 6.0, 3.3…
$ astPerGame                    <dbl> 6.4, 6.5, 10.2, 7.7, 4.5, 6.0, 4.2, 6.1,…
$ stlPerGame                    <dbl> 1.7, 0.9, 1.8, 1.8, 1.2, 1.1, 0.8, 1.8, …
$ blkPerGame                    <dbl> 1.0, 1.2, 0.6, 0.4, 0.6, 0.5, 1.2, 0.4, …
$ tovPerGame                    <dbl> 2.4, 3.1, 3.3, 3.6, 3.2, 2.9, 3.1, 2.4, …
$ pfPerGame                     <dbl> 2.2, 2.3, 2.3, 2.5, 1.9, 2.2, 1.7, 2.2, …
$ ptsPerGame                    <dbl> 32.7, 30.4, 29.6, 28.2, 27.6, 26.8, 26.6…
$ AwardsPerGame                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ countTeamsPlayerSeasonPerGame <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ urlPlayerBREF                 <chr> "http://www.basketball-reference.com/pla…

# rename to match what our downstream plots / model expect.
# note: bref_players_stats returns groupPosition already bucketed
# as G / F / C, so we use that directly.
stats_df <- bref_pg |>
  select(
    player_name = namePlayer,
    position    = groupPosition,
    GP          = countGames,
    MPG         = minutesPerGame,
    PPG         = ptsPerGame,
    RPG         = trbPerGame,
    APG         = astPerGame,
    SPG         = stlPerGame,
    BPG         = blkPerGame,
    TOPG        = tovPerGame,
    PFPG        = pfPerGame,
    FG_PCT      = pctFG,
    FG3_PCT     = pctFG3,
    FT_PCT      = pctFT
  ) |>
  # bref lists traded players once per stint plus a "TOT" row.
  # keep the row with the most games played per player.
  group_by(player_name) |>
  slice_max(order_by = GP, n = 1, with_ties = FALSE) |>
  ungroup() |>
  filter(GP >= 20, !is.na(position))   # drop low GP and missing positions

# Verify
stats_df |>
  count(position, sort = TRUE)

# A tibble: 3 × 2
  position     n
  <chr>    <int>
1 G          208
2 F          162
3 C           86

# ============================================================
# STEP 3 — Use a join command to combine Salary and Player Stats
# ============================================================

# Join stats_df + salary_df
# inner_join — only keep players with both stats AND a salary
# names differ between tables so we specify both sides

nba_df <- stats_df |>
  inner_join(salary_df_clean, by = c("player_name" = "Player"))


# Check for any duplicate player rows
nba_df |>
  count(player_name) |>
  filter(n > 1)

# A tibble: 0 × 2
# ℹ 2 variables: player_name <chr>, n <int>

# Spot check to confirm if the join method worked
nba_df |>
  select(player_name, position, team, salary, GP, PPG, RPG, APG) |>
  arrange(desc(salary)) |>
  head(10)

# A tibble: 10 × 8
   player_name           position team    salary    GP   PPG   RPG   APG
   <chr>                 <chr>    <chr>    <dbl> <dbl> <dbl> <dbl> <dbl>
 1 Stephen Curry         G        GSW   59606817    70  24.5   4.4   6  
 2 Kevin Durant          F        HOU   54708609    62  26.6   6     4.2
 3 Anthony Davis         C        WAS   54126450    51  24.7  11.6   3.5
 4 Giannis Antetokounmpo F        MIL   54126450    67  30.4  11.9   6.5
 5 Jayson Tatum          F        BOS   54126450    72  26.8   8.7   6  
 6 Jimmy Butler          F        GSW   54126450    55  17.5   5.4   5.4
 7 Devin Booker          G        PHO   53142264    75  25.6   4.1   7.1
 8 Jaylen Brown          F        BOS   53142264    63  22.2   5.8   4.5
 9 Karl-Anthony Towns    C        NYK   53142264    72  24.4  12.8   3.1
10 LeBron James          F        LAL   52627153    70  24.4   7.8   8.2

Data Visualization

# ============================================================
# STEP 4 — Visualize Data
# ============================================================

# Histogram
p1_hist_salarydistribution <- nba_df |>
  ggplot(aes(x = salary/1e6)) + #dividing by 1e6 to make the Xaxis more readable
  geom_histogram(bins = 30, fill = "blue", color = "white") +
  labs(
    title    = "NBA Player Salary Distribution",
    caption  = "Data: Basketball Reference + nbastatR",
    x        = "Salary (USD Millions)",
    y        = "Number of Players"
  ) +
  theme_light() +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))
#Graph1
p1_hist_salarydistribution #show data

p_top10_salary <- nba_df |>
  arrange(desc(salary)) |>
  slice_head(n = 10) |>
  ggplot(aes(x = salary / 1e6, y = reorder(player_name, salary), fill = position)) +
  geom_bar(stat = "identity") +
  labs(
    title   = "Top 10 Highest Paid NBA Players",
    x       = "Salary (Millions USD)",
    y       = "Player",
    fill    = "Position",
    caption = "Data: Basketball Reference + nbastatR"
  ) +
  theme_light() +
  theme(
    plot.title = element_text(hjust = 0.5)
  )
#Graph2
p_top10_salary #show graph

p_top10_ppg <- nba_df |>
  arrange(desc(PPG)) |>
  slice_head(n = 10) |>
  ggplot(aes(x = PPG, y = reorder(player_name, PPG), fill = position)) +
  geom_bar(stat = "identity") +
  labs(
    title   = "Top 10 Scorers in the NBA",
    x       = "Points Per Game (PPG)",
    y       = "Player",
    fill    = "Position",
    caption = "Data: Basketball Reference + nbastatR"
  ) +
  theme_light() +
  theme(
    plot.title = element_text(hjust = 0.5)
  )
#Graph3
p_top10_ppg #show graph

#Boxplot
p_salary_position <- nba_df |>
  ggplot(aes(x = position, y = salary / 1e6, color = position)) +
  geom_boxplot() +
  labs(
    title    = "NBA Salary Distribution by Position",
    subtitle = "How does pay vary across Guards, Forwards and Centers?",
    x        = "Position",
    y        = "Salary (Millions USD)",
    caption  = "Data: Basketball Reference + nbastatR"
  ) +
  theme_light() +
  theme(
    plot.title      = element_text(hjust = 0.5),
    plot.subtitle   = element_text(hjust = 0.5),
    legend.position = "none"
  )
#Graph4
p_salary_position

#scatterplot
p_salary_ppg <- nba_df |>
  ggplot(aes(x = PPG, y = salary / 1e6)) +
  geom_point(aes(color = position), alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE, color = "black") +
  labs(
    title   = "Salary vs. Points Per Game",
    x       = "Points Per Game (PPG)",
    y       = "Salary (Millions USD)",
    color   = "Position",
    caption = "Data: Basketball Reference + nbastatR"
  ) +
  theme_light() +
  theme(plot.title = element_text(hjust = 0.5))
#Graph5
p_salary_ppg #scatter salary vs PPG

`geom_smooth()` using formula = 'y ~ x'

p_salary_rpg <- nba_df |>
  ggplot(aes(x = RPG, y = salary / 1e6)) +
  geom_point(aes(color = position), alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE, color = "black") +
  labs(
    title   = "Salary vs. Rebounds Per Game",
    x       = "Rebounds Per Game (RPG)",
    y       = "Salary (Millions USD)",
    color   = "Position",
    caption = "Data: Basketball Reference + nbastatR"
  ) +
  theme_light() +
  theme(plot.title = element_text(hjust = 0.5))

#Graph6
p_salary_rpg #scatter salary vs RPG

`geom_smooth()` using formula = 'y ~ x'

p_salary_apg <- nba_df |>
  ggplot(aes(x = APG, y = salary / 1e6)) +
  geom_point(aes(color = position), alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE, color = "black") +
  labs(
    title   = "Salary vs. Assists Per Game",
    x       = "Assists Per Game (APG)",
    y       = "Salary (Millions USD)",
    color   = "Position",
    caption = "Data: Basketball Reference + nbastatR"
  ) +
  theme_light() +
  theme(plot.title = element_text(hjust = 0.5))

#Graph7
p_salary_apg #scatter salary vs APG

`geom_smooth()` using formula = 'y ~ x'

#adding a new variable (value score = PPG + RPG + APG)
nba_df <- nba_df |>
  mutate(
    salary_M    = salary / 1e6,
    value_score = (PPG + RPG + APG) / salary_M
  )

#Graph 8: top 10 most underpaid nba players
p_underpaid <- nba_df |>
  arrange(desc(value_score)) |>
  slice_head(n = 10) |>
  ggplot(aes(x = value_score, y = reorder(player_name, value_score), fill = position)) +
  geom_bar(stat = "identity") +
  labs(
    title    = "Top 10 Most Underpaid NBA Players",
    subtitle = "Highest production per million dollars earned",
    x        = "Value Score (PPG + RPG + APG) / Salary (Millions)",
    y        = "Player",
    fill     = "Position",
    caption  = "Data: Basketball Reference + nbastatR"
  ) +
  theme_light() +
  theme(
    plot.title    = element_text(hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5)
  )

p_underpaid

#Graph 9: top 10 most overderpaid nba players
p_overpaid <- nba_df |>
  arrange(value_score) |>
  slice_head(n = 10) |>
  ggplot(aes(x = value_score, y = reorder(player_name, value_score), fill = position)) +
  geom_bar(stat = "identity") +
  labs(
    title    = "Top 10 Most Overpaid NBA Players",
    subtitle = "Lowest production per million dollars earned",
    x        = "Value Score (PPG + RPG + APG) / Salary (Millions)",
    y        = "Player",
    fill     = "Position",
    caption  = "Data: Basketball Reference + nbastatR"
  ) +
  theme_light() +
  theme(
    plot.title    = element_text(hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5)
  )

p_overpaid

#Graph 10: Scatter plot of Salary vs Value score
p_salary_value <- nba_df |>
  ggplot(aes(x = salary_M, y = value_score)) +
  geom_point(aes(color = position), alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE, color = "black") +
  labs(
    title    = "NBA Salary vs. Value Score",
    x        = "Salary (Millions USD)",
    y        = "Value Score (PPG + RPG + APG) / Salary (Millions)",
    color    = "Position",
    caption  = "Data: Basketball Reference + nbastatR"
  ) +
  theme_light() +
  theme(
    plot.title    = element_text(hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5)
  )

p_salary_value

`geom_smooth()` using formula = 'y ~ x'

Graph Explanations

Graph 1 - Distribution of 2025-26 salaries. A histogram of salary / 1e6 (salary in millions of USD) across all 375 players. A classic right-skewed distribution: the bulk of the NBA is concentrated in the $1–5M range (rookie scale, minimums, small role-player deals), and the tail stretches out past $50M for supermax contracts. You see a spike near the league minimum (~$2M), a secondary hump in the mid-level exception range ($10–15M), and a sparse tail of $30M+ max contracts.

Graph 2 - Top 10 highest-paid NBA players. A horizontal bar chart of the 10 players with the highest 2025-26 salary, sorted by salary, colored by position.

Graph 3 - Top 10 scorers. A horizontal bar chart of the 10 players with the highest 2024-25 PPG, again colored by position.

Graph 4 - Salary boxplot by position. A box-and-whisker plot of salary / 1e6 across the three position buckets (C, F, G). Centers have the highest median salary and the longest upper tail; forwards are close behind; guards have the lowest median with a heavy upper tail (because a handful of superstar guards like Steph Curry sit at the very top). The box heights are similar, but the medians shift in the order C ≥ F > G, which is exactly what the regression is capturing.

Graph 5 - Salary vs. Points Per Game. A scatter plot with PPG on the x-axis and salary / 1e6 on the y-axis, colored by position. A clear upward-sloping linear trend: more points per game, more salary. The slope of the fitted line is roughly $1.3–1.4M per additional PPG, matching the regression’s PPG coefficient of $1,366,110. Wide vertical scatter at every PPG level - lots of $5M players and $25M players all putting up ~15 PPG which is the residual variation the regression cannot explain.

Graph 6 - Salary vs. Rebounds Per Game. This is the visualization for a subtle regression finding: in a simple model salary ~ RPG, rebounds look meaningful, but in the multiple regression controlling for PPG, APG, and position, RPG collapses to a non-significant coefficient. A textbook example of confounding.

Graph 7 - Salary vs. Assists Per Game. A scatter of salary vs APG, same structure as Graphs 5 and 6. Guards dominate the right side of the x-axis (high APG), but the fitted line still sits lower overall than Graph 5’s line, because the guard position dummy drags the intercept down. Interpretation: assists are more valuable per unit, and teams pay for playmaking even more than for scoring.

Graph 8 - Top 10 most underpaid players. A bar chart of the 10 players with the highest value_score = (PPG + RPG + APG / salary_M, i.e., players who produce the most counting stats per $1M of salary. This is a different definition of “underpaid” than the regression residuals: it is a simple ratio and does not control for position or anything else. As a productivity-per-dollar view, it reliably shows rookie contracts and hard-working role players, but it is not the same as being underpaid relative to expectations.

Graph 9 - Top 10 most overpaid players. Same as Graph 8 but using the lowest value scores (smallest production / salary ratio). It shows large-salary, small-production players, which are typically: injured stars on supermax deals who missed most of 2024-25 (Kawhi Leonard, Zion Williamson, Joel Embiid seasons with low GP); aging veterans on legacy max contracts whose per-game production has fallen off (Paul George); and players who were traded mid-season and ended up in a reduced role (Cam Thomas, Buddy Hield after leaving the Pacers - neither of which have bad contracts, but they fit the mold nonetheless).

Graph 10 - Scatter of Salary vs. Value Score. Every player plotted with salary on x and value score on y, colored by position, with a linear regression line. Points far above the curve for a given salary are local bargains; points far below are local overpays. This pairs naturally with the residual-based over/underpaid lists from the regression.

Interactive Plots

# ============================================================
# STEP 7 — create interactive Plots
# ============================================================

# Salary vs PPG animated by postion (G,F,C)
library(plotly)
library(ggplot2)
plot_ly(
  data=nba_df,
  x=~salary,
  y=~PPG,
  frame = ~position,
  type='histogram2d'
) |>
  layout(
    title='NBA Salary vs Points Per Game By Position',
    xaxis=list(title='Salary'),
    yaxis=list(title='Points Per Game')
  )

# Interactive scatter salary vs value score 

p_interactive_value <- nba_df |>
  ggplot(aes(
    x     = salary_M,
    y     = value_score,
    color = position,
    text  = paste0(
      player_name,
      "<br>Team: ",     team,
      "<br>Position: ", position,
      "<br>Salary: $",  round(salary_M, 2), "M",
      "<br>PPG: ",      round(PPG, 1),
      "<br>RPG: ",      round(RPG, 1),
      "<br>APG: ",      round(APG, 1),
      "<br>Value: ",    round(value_score, 2)
    )
  )) +
  geom_point(alpha = 0.7, size = 2) +
  labs(
    title   = "NBA Salary vs. Value Score",
    x       = "Salary (Millions USD)",
    y       = "Value Score (PPG + RPG + APG) / Salary (Millions)",
    color   = "Position",
    caption = "Hover to identify players; data: Basketball Reference + nbastatR"
  ) +
  theme_light() +
  theme(plot.title = element_text(hjust = 0.5))

ggplotly(p_interactive_value, tooltip = "text")

Interactive Figure Explanations

Interactive Figure 1 - Salary × PPG, faceted by position. This produces an animated 2D density / heatmap where hitting Play (or clicking the slider) cycles through the three position groups (C, F, G). For each frame, you see the joint density of salary and PPG as a colored grid: where players cluster most heavily.

Centers: density concentrates around low-to-moderate PPG (5–15) and moderate salary, with a sparse tail of $30M+ / 20+ PPG franchise bigs.
Forwards: a more even spread across the PPG axis, with density reaching further into the mid-salary band.
Guards: the density at the upper-right corner (high PPG + max salary) is lighter than for centers, and guards have a dense blob at low-to-mid PPG with low salary. This is the visual for the guard discount that the positionG coefficient quantified.

Interactive Figure 2 - Scatter of Salary vs. Value Score with hover. The same content as Graph 10, but rendered via ggplotly() so every point exposes a hover tooltip containing player name, team, position, salary (in M), PPG / RPG / APG, and value score. The static Graph 10 gives the shape; this interactive version lets you identify individual players by hovering on the outlier dots to see who the bargains and luxuries are without needing a separate label layer.

Interactive Tables

interactive_nba_df <- nba_df |>
  select(player_name,team,GP,FG_PCT,MPG,PPG,RPG,APG, salary)

#install.packages('DT')
library(DT)

Warning: package 'DT' was built under R version 4.5.3

datatable(interactive_nba_df)

Data Analysis (Linear Regression)

# ============================================================
# Create a linear regression model
# ============================================================

#multiple linear regression using PPG,Apg,rpg, position as predictors
#salary as the dependent variable
lm.salary <- lm(salary~PPG+RPG+APG+position,data=nba_df)


summary(lm.salary)


Call:
lm(formula = salary ~ PPG + RPG + APG + position, data = nba_df)

Residuals:
      Min        1Q    Median        3Q       Max 
-24393940  -5028883   -204617   3709304  26237717 

Coefficients:
            Estimate Std. Error t value             Pr(>|t|)    
(Intercept) -4990852    1717599  -2.906              0.00389 ** 
PPG          1366110     111835  12.215 < 0.0000000000000002 ***
RPG            91703     283746   0.323              0.74674    
APG          1694643     371597   4.560           0.00000696 ***
positionF     515249    1379620   0.373              0.70901    
positionG   -3414409    1686377  -2.025              0.04362 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7956000 on 369 degrees of freedom
Multiple R-squared:  0.6771,    Adjusted R-squared:  0.6727 
F-statistic: 154.7 on 5 and 369 DF,  p-value: < 0.00000000000000022

#Adjusted R Squared Value is 0.6711
lm.salary


Call:
lm(formula = salary ~ PPG + RPG + APG + position, data = nba_df)

Coefficients:
(Intercept)          PPG          RPG          APG    positionF    positionG  
   -4990852      1366110        91703      1694643       515249     -3414409

Coefficient estimates (salary ~ PPG + RPG + APG + position):

Term	Estimate	Std. Error	t value	p value
(Intercept)	-4,990,852	1,717,599	-2.91	0.0039
PPG	1,366,110	111,835	12.22	< 2e-16
RPG	91,703	283,746	0.32	0.747
APG	1,694,643	371,597	4.56	7.0e-06
positionF	515,249	1,379,620	0.37	0.709
positionG	-3,414,409	1,686,377	-2.03	0.044

Coefficient interpretation. Assists are valued more per unit than points, with about $1.69M per assist compared to $1.37M per point. Points and assists are on very different scales though, since players average roughly four times as many points as assists, so in total scoring still explains more of the overall salary. But on the margin, an extra assist is paid more than an extra point, which lines up with how teams clearly value primary playmakers. Rebounding, however, is not a meaningful salary driver once you control for everything else: the RPG coefficient is very small and not statistically significant, suggesting teams are not directly paying for rebounds themselves. Instead, they are paying for the overall production bundle: scoring and play making, that often comes with strong rebounders. This also explains why rebounds look important in a simple scatter plot but lose their effect once you control for other variables, a good example of confounding. Position effects also highlight differences in how the market values players. After controlling for the same level of production, guards are paid about $3.4M less than centers, while forwards are paid roughly the same as centers (the difference is not significant). This likely reflects the fact that guards are more common in the league, which pushes their price down, while productive big men are more scarce and harder to replace, leading to higher salaries even when their basic stats are similar.

Addressing MLR assumptions. This analysis is cross-sectional and based on a single season, so it does not capture the fact that salaries reflect multi-year expectations, injuries, contract structures, and extensions. As a result, some players identified as “overpaid” may simply be on legacy max deals or aging contracts, while “underpaid” players are often still on rookie scale contracts or veteran minimums. The model also ignores the hard salary cap structure, where rookie scale and supermax deals create step functions that a linear model cannot capture, which helps explain some of the large residuals. On top of that, the model only includes PPG, APG, RPG, and position, so it omits important factors like efficiency, defense, usage, turnovers, and availability. This means certain player types, especially defensive anchors, may appear overpaid because their contributions are not included in the regression, think Draymond Green. The model also imposes a linearity assumption, where the marginal effect of an additional point is constant across all scoring levels, even though the true relationship is likely nonlinear at the top. Lastly, there is some multicollinearity between PPG and APG, since high-usage players tend to do both, which can inflate standard errors, but the coefficients can still be interpreted as partial effects.

Data Analysis (K means Clustering)

# ============================================================
# Create a clustering model
# ============================================================
install.packages("cluster")

Warning: package 'cluster' is in use and will not be installed

library(cluster)
install.packages("NbClust")

Warning: package 'NbClust' is in use and will not be installed

library(NbClust)

#############Guards Cluster ###############################################
#will leave salary out since that is the dependent variable
Guards_data <- nba_df |>
  filter(position=='G')|>
  select(PPG,RPG,APG)
guard_scale <- scale(Guards_data)

# Estimate the best number of clusters from 2 to 10 using k-means criteria
guard_number_cluster_estimate <- NbClust(
  guard_scale,
  distance = "euclidean",
  min.nc = 2,
  max.nc = 10,
  method = "kmeans"
)

*** : The Hubert index is a graphical method of determining the number of clusters.
                In the plot of Hubert index, we seek a significant knee that corresponds to a 
                significant increase of the value of the measure i.e the significant peak in Hubert
                index second differences plot.

*** : The D index is a graphical method of determining the number of clusters. 
                In the plot of D index, we seek a significant knee (the significant peak in Dindex
                second differences plot) that corresponds to a significant increase of the value of
                the measure. 
 
******************************************************************* 
* Among all indices:                                                
* 11 proposed 2 as the best number of clusters 
* 4 proposed 3 as the best number of clusters 
* 1 proposed 6 as the best number of clusters 
* 1 proposed 8 as the best number of clusters 
* 5 proposed 9 as the best number of clusters 
* 2 proposed 10 as the best number of clusters 

                   ***** Conclusion *****                            
 
* According to the majority rule, the best number of clusters is  2 
 
 
*******************************************************************

# Show the voting results for the best number of clusters
guard_number_cluster_estimate$Best.nc

                     KL       CH Hartigan     CCC Scott  Marriot   TrCovW
Number_clusters  9.0000   2.0000    8.000  2.0000   3.0      9.0   9.0000
Value_Index     84.9226 220.1545   36.859 -1.0129 122.3 516162.6 619.5345
                 TraceW Friedman  Rubin Cindex     DB Silhouette   Duda
Number_clusters  3.0000   9.0000  9.000 6.0000 2.0000     2.0000 2.0000
Value_Index     31.0073   3.0137 -0.924 0.1539 0.8626     0.4988 1.2086
                PseudoT2  Beale Ratkowsky    Ball PtBiserial   Frey McClain
Number_clusters   2.0000  2.000    2.0000  3.0000     2.0000 3.0000  2.0000
Value_Index     -17.9469 -0.291    0.5339 54.5991     0.6494 1.0971  0.3918
                   Dunn Hubert SDindex Dindex    SDbw
Number_clusters 10.0000      0    2.00      0 10.0000
Value_Index      0.0516      0    2.65      0  0.3272

# Set seed for reproducibility
set.seed(123)

# Run PAM clustering with 5 clusters
# Note: the slides call this k-means, but this function is PAM
kmeans_guard_scalecluster <- pam(guard_scale, k = 5)

# Show medoids for the clusters
kmeans_guard_scalecluster$medoids

            PPG        RPG        APG
[1,] -0.5799024 -0.4510787 -0.7969434
[2,] -1.0457889 -1.3011721 -0.8440058
[3,] -0.2887233  0.1156503 -0.2321937
[4,]  0.4829012  0.4698559  0.4266808
[5,]  1.7640891  1.2491082  1.4620551

# Show the cluster assignment for each row
kmeans_guard_scalecluster$clustering

  [1] 1 1 2 3 1 3 1 1 4 4 3 5 5 4 1 1 2 2 4 4 1 2 3 4 3 4 5 2 4 2 3 3 4 4 4 1 3
 [38] 1 4 2 4 1 2 5 5 3 5 5 4 5 5 1 4 5 4 4 2 4 2 2 1 2 1 1 3 3 4 1 4 1 5 1 1 4
 [75] 2 5 4 2 4 5 5 2 5 3 1 2 3 2 2 1 2 4 3 1 2 4 3 5 1 5 1 4 1 2 1 1 3 3 3 4 3
[112] 2 5 3 5 2 1 3 3 2 4 2 3 3 3 3 2 1 1 3 4 3 1 2 4 2 4 2 2 2 4 1 1 4 4 2 4 5
[149] 5 4 3 1 2 3 1 5 3 3 5 2 5 3 5 4 3 5

# Plot the clustering result in two reduced dimensions
plot(kmeans_guard_scalecluster)

# Add the assigned cluster to the Guard data
guard_cluster <- Guards_data %>%
  mutate(cluster = kmeans_guard_scalecluster$clustering)

# Show the dataset with assigned clusters
guard_cluster

# A tibble: 166 × 4
     PPG   RPG   APG cluster
   <dbl> <dbl> <dbl>   <int>
 1   7.4   2.4   1.5       1
 2   7.6   2     2.6       1
 3   5.5   1.3   1.3       2
 4  12     3.9   1.8       3
 5   6.5   1.9   1.8       1
 6   7.1   2.9   2.5       3
 7   9.7   2.2   1.1       1
 8   3.4   2.7   1.2       1
 9  10     3.3   5         4
10  19.3   2.7   4.8       4
# ℹ 156 more rows

# Compute the average of each variable by cluster
guard_cluster_summary <- guard_cluster %>%
  group_by(cluster) %>%
  summarise(across(everything(), ~ mean(.x, na.rm = TRUE)))

# Show cluster summaries
guard_cluster_summary

# A tibble: 5 × 4
  cluster   PPG   RPG   APG
    <int> <dbl> <dbl> <dbl>
1       1  7.25  2.41  1.51
2       2  4.37  1.30  1.44
3       3  9.57  3.17  2.76
4       4 15.3   3.78  4.37
5       5 22.9   4.93  6.65

#############Forward Cluster ###############################################
#will leave salary out since that is the dependent variable
Forward_data <- nba_df |>
  filter(position=='F')|>
  select(PPG,RPG,APG)
Forward_scale <- scale(Forward_data)

# Estimate the best number of clusters from 2 to 10 using k-means criteria
forward_number_cluster_estimate <- NbClust(
  Forward_scale,
  distance = "euclidean",
  min.nc = 2,
  max.nc = 10,
  method = "kmeans"
)

*** : The Hubert index is a graphical method of determining the number of clusters.
                In the plot of Hubert index, we seek a significant knee that corresponds to a 
                significant increase of the value of the measure i.e the significant peak in Hubert
                index second differences plot.

*** : The D index is a graphical method of determining the number of clusters. 
                In the plot of D index, we seek a significant knee (the significant peak in Dindex
                second differences plot) that corresponds to a significant increase of the value of
                the measure. 
 
******************************************************************* 
* Among all indices:                                                
* 13 proposed 2 as the best number of clusters 
* 4 proposed 3 as the best number of clusters 
* 1 proposed 5 as the best number of clusters 
* 3 proposed 6 as the best number of clusters 
* 1 proposed 7 as the best number of clusters 
* 2 proposed 10 as the best number of clusters 

                   ***** Conclusion *****                            
 
* According to the majority rule, the best number of clusters is  2 
 
 
*******************************************************************

# Show the voting results for the best number of clusters
forward_number_cluster_estimate$Best.nc

                    KL       CH Hartigan     CCC    Scott  Marriot  TrCovW
Number_clusters 2.0000   2.0000  10.0000  2.0000   3.0000      6.0   6.000
Value_Index     3.9759 177.3621  27.2849 -2.7604 105.2346 126929.8 221.814
                 TraceW Friedman   Rubin Cindex     DB Silhouette   Duda
Number_clusters  3.0000   6.0000  7.0000  5.000 2.0000     2.0000 2.0000
Value_Index     31.4724   3.8352 -0.5637  0.193 0.8834     0.4913 1.0321
                PseudoT2   Beale Ratkowsky   Ball PtBiserial   Frey McClain
Number_clusters   2.0000  2.0000     2.000  3.000     2.0000 2.0000  2.0000
Value_Index      -3.5406 -0.0523     0.529 50.447     0.6169 1.8034  0.3751
                  Dunn Hubert SDindex Dindex    SDbw
Number_clusters 2.0000      0  3.0000      0 10.0000
Value_Index     0.0763      0  3.0402      0  0.1747

# Set seed for reproducibility
set.seed(123)

# Run PAM clustering with 5 clusters
# Note: the slides call this k-means, but this function is PAM
kmeans_forward_scalecluster <- pam(Forward_scale, k = 5)

# Show medoids for the clusters
kmeans_forward_scalecluster$medoids

            PPG        RPG        APG
[1,]  0.6145837  0.6821109  0.1679893
[2,] -0.1209854 -0.1022123 -0.2951544
[3,]  1.2249495  1.3683938  1.7559107
[4,] -1.2008634 -1.3277175 -1.0891151
[5,] -0.5748472 -0.5433942 -0.6921348

# Show the cluster assignment for each row
kmeans_forward_scalecluster$clustering

  [1] 1 2 3 1 4 2 1 1 1 2 3 2 4 2 5 1 2 2 5 4 1 3 5 3 5 2 4 4 2 4 3 2 3 3 5 2 3
 [38] 5 2 5 2 4 4 1 5 1 4 2 2 3 2 5 5 5 3 2 3 4 2 2 1 4 3 1 5 1 2 4 4 4 2 4 3 2
 [75] 2 1 1 2 1 5 3 5 1 2 4 5 2 1 2 1 3 2 4 5 2 5 1 1 3 5 1 5 5 1 2 4 5 1 3 3 4
[112] 2 3 5 5 3 5 2 2 5 5 5 1 3 4 5 4 5 1 2 1 1 2 1 5 4 2 2 4 2 3

# Plot the clustering result in two reduced dimensions
plot(kmeans_forward_scalecluster)

# Add the assigned cluster to the Guard data
forward_cluster <- Forward_data %>%
  mutate(cluster = kmeans_forward_scalecluster$clustering)

# Show the dataset with assigned clusters
forward_cluster

# A tibble: 141 × 4
     PPG   RPG   APG cluster
   <dbl> <dbl> <dbl>   <int>
 1  14.7   4.8   3.2       1
 2  12     4     1.2       2
 3  14.1   8.2   3.8       3
 4  18     4.5   2.6       1
 5   2.5   1.3   0.3       4
 6  10.1   5.1   2.3       2
 7  16.1   5.3   1.9       1
 8  12.3   5     3.4       1
 9  13.9   8.4   2.1       1
10   8.3   5.1   1         2
# ℹ 131 more rows

# Compute the average of each variable by cluster
forward_cluster_summary <- forward_cluster %>%
  group_by(cluster) %>%
  summarise(across(everything(), ~ mean(.x, na.rm = TRUE)))

# Show cluster summaries
forward_cluster_summary

# A tibble: 5 × 4
  cluster   PPG   RPG   APG
    <int> <dbl> <dbl> <dbl>
1       1 15.3   5.70 2.41 
2       2  9.82  4.07 1.83 
3       3 20.9   7.11 4.85 
4       4  3.31  1.55 0.517
5       5  6.55  3.37 1.09

#############Center Cluster ###############################################
#will leave salary out since that is the dependent variable
Center_data <- nba_df |>
  filter(position=='C')|>
  select(PPG,RPG,APG)
Center_scale <- scale(Center_data)

# Estimate the best number of clusters from 2 to 10 using k-means criteria
Center_number_cluster_estimate <- NbClust(
  Center_scale,
  distance = "euclidean",
  min.nc = 2,
  max.nc = 10,
  method = "kmeans"
)

*** : The Hubert index is a graphical method of determining the number of clusters.
                In the plot of Hubert index, we seek a significant knee that corresponds to a 
                significant increase of the value of the measure i.e the significant peak in Hubert
                index second differences plot.

*** : The D index is a graphical method of determining the number of clusters. 
                In the plot of D index, we seek a significant knee (the significant peak in Dindex
                second differences plot) that corresponds to a significant increase of the value of
                the measure. 
 
******************************************************************* 
* Among all indices:                                                
* 9 proposed 2 as the best number of clusters 
* 1 proposed 3 as the best number of clusters 
* 6 proposed 4 as the best number of clusters 
* 4 proposed 7 as the best number of clusters 
* 1 proposed 8 as the best number of clusters 
* 2 proposed 9 as the best number of clusters 
* 1 proposed 10 as the best number of clusters 

                   ***** Conclusion *****                            
 
* According to the majority rule, the best number of clusters is  2 
 
 
*******************************************************************

# Show the voting results for the best number of clusters
Center_number_cluster_estimate$Best.nc

                    KL      CH Hartigan     CCC   Scott  Marriot  TrCovW
Number_clusters 8.0000  4.0000   4.0000  4.0000  7.0000     7.00  7.0000
Value_Index     6.8335 75.8917  20.1033 -1.7679 63.7695 26208.74 74.3384
                 TraceW Friedman   Rubin Cindex     DB Silhouette   Duda
Number_clusters  4.0000   7.0000  4.0000 9.0000 2.0000     2.0000 2.0000
Value_Index     13.9789   7.1011 -0.8626 0.2131 0.9248     0.4565 0.8828
                PseudoT2  Beale Ratkowsky    Ball PtBiserial   Frey McClain
Number_clusters   2.0000 2.0000    2.0000  3.0000     2.0000 2.0000  2.0000
Value_Index       4.7794 0.2153    0.5153 26.3169     0.5825 1.2071  0.3817
                  Dunn Hubert SDindex Dindex    SDbw
Number_clusters 9.0000      0  4.0000      0 10.0000
Value_Index     0.0861      0  3.8028      0  0.1398

# Set seed for reproducibility
set.seed(123)

# Run PAM clustering with 5 clusters
# Note: the slides call this k-means, but this function is PAM
kmeans_center_scalecluster <- pam(Center_scale, k = 5)

# Show medoids for the clusters
kmeans_center_scalecluster$medoids

             PPG        RPG        APG
[1,] -0.77630545 -0.8919938 -0.9010956
[2,]  0.04074820 -0.1309036  0.2406653
[3,] -0.03029994  0.5970957 -0.6156554
[4,]  2.65176747  1.4905494  1.9533067
[5,]  1.05318424  1.2258224  0.8115457

# Show the cluster assignment for each row
kmeans_center_scalecluster$clustering

 [1] 1 2 2 3 4 1 4 1 2 1 5 1 3 2 2 1 5 2 4 3 1 1 1 2 2 5 2 5 5 5 1 5 5 1 1 1 1 4
[39] 3 2 2 2 1 2 5 2 1 1 2 2 2 1 2 3 5 1 1 1 5 1 1 2 4 5 2 1 3 3

# Plot the clustering result in two reduced dimensions
plot(kmeans_center_scalecluster)

# Add the assigned cluster to the Guard data
center_cluster <- Center_data %>%
  mutate(cluster = kmeans_center_scalecluster$clustering)

# Show the dataset with assigned clusters
center_cluster

# A tibble: 68 × 4
     PPG   RPG   APG cluster
   <dbl> <dbl> <dbl>   <int>
 1   5.8   4.2   0.5       1
 2   9     6.2   2.1       2
 3  13     6.5   2.4       2
 4   7.3   7.8   0.9       3
 5  24.7  11.6   3.5       4
 6   1.9   2     0.4       1
 7  18.1   9.6   4.3       4
 8   5.1   5.6   1.1       1
 9  13     5     1.8       2
10   4.4   4.2   0.5       1
# ℹ 58 more rows

# Compute the average of each variable by cluster
center_cluster_summary <- center_cluster %>%
  group_by(cluster) %>%
  summarise(across(everything(), ~ mean(.x, na.rm = TRUE)))

# Show cluster summaries
center_cluster_summary

# A tibble: 5 × 4
  cluster   PPG   RPG   APG
    <int> <dbl> <dbl> <dbl>
1       1  4.40  3.58 0.738
2       2  9.51  6.08 1.93 
3       3  8.51  8.04 1.04 
4       4 22.1  11.8  4.12 
5       5 14.3   9.91 2.32

Clustering Results

Guards (n = 166)

Cluster	PPG	RPG	APG	Tier / Archetype	Example
2	4.4	1.3	1.4	End-of-bench / 2 way guards (g league)	Ron Harper Jr.
1	7.3	2.4	1.5	Low-usage rotation guard	Keon Ellis, Kris Dunn
3	9.6	3.2	2.8	Solid rotation guard/ 6th man	Dennis Schroder, Bones Hyland
4	15.3	3.8	4.4	Secondary creators / high end starters	Derrick White, Ajay Mitchell
5	22.9	4.9	6.7	All star ball handlers	Jalen Brunson, SGA, Donovan Mitchell

PPG, RPG, and APG move together from cluster 2 (all = −1 SD) to cluster 5 (all = +1.5 SD or more). The one exception is cluster 3, which has a slightly elevated RPG compared to its scoring: those are bigger, slashing combo guards who get to the paint and rebound a bit more than their points suggest. Guards is the largest bucket (166 players) and the most evenly spread across clusters, which is consistent with the modern NBA being focused on many ball-handling roles.

Forwards (n = 141)

Cluster	PPG	RPG	APG	Tier / Archetype	Example
4	3.3	1.6	0.5	End-of-bench / deep reserve forward	Carter Bryant, Nicolas Batum
5	6.6	3.4	1.1	Low-usage 3&D forward	Jordan Walsh, Nikola Jovic
2	9.8	4.1	1.8	3-and-D / connector wing (rotation forwards)	Tari Eason, Royce O’Neal
1	15.3	5.7	2.4	Scoring starters	Aaron Gordon, Jaime Jaquez Jr.
3	20.9	7.1	4.9	All star forward / play making wings	Jayson Tatum, Lebron James, Deni Avdija

It is mostly a tier ladder, but the top cluster (3) stands out because its APG jumps disproportionately. That captures point-forwards / floor generals: the Luka Doncic / Giannis / Kevin Durant-adjacent wing-creator types. Cluster 1 is more like the traditional scoring small forward or power forward who averages ~15/6 without being the primary initiator (think Tobias Harris, Jaden McDaniels). Forwards also have the cleanest separation between clusters 4 and 5 because RPG differentiates them even when PPG is similar.

Centers (n = 68)

Cluster	PPG	RPG	APG	Tier / Archetype	Example
1	4.4	3.6	0.7	End-of-bench big/ 3rd center	Clint Capela
2	9.5	6.1	1.9	Backup / Rotation bigs	Daniel Gafford
3	8.5	8.0	1.0	Screen-setter / rim-protector / glass-cleaner (rebound first low usage bigs)	Mitchell Robinson
5	14.3	9.9	2.3	All-Star caliber traditional bigs	Jalen Duren
4	22.1	11.8	4.1	Franchise centers	Jokic / Wembanyama

Centers is where the k = 5 choice is most interesting, because you can see two different mid-tier archetypes:

Cluster 2 (9.5 / 6.1 / 1.9): balanced backup big - scores a little, rebounds a little (think Jusuf Nurkic, Jakob Poeltl).
Cluster 3 (8.5 / 8.0 / 1.0): scores less than cluster 2 but rebounds more and passes less. This is the classic screen-setter / rim-protector / glass-cleaner archetype (think Steven Adams).

That is a genuine difference in player roles, not just a tier difference. Clusters 4 and 5 are both star bigs; the gap between them is the gap between an All-Star and an MVP-tier big man, and only a handful of players land in cluster 4. Centers is also the smallest group (68 players), so its clusters are more sensitive to individual outliers than the guard / forward splits.

Cross-position interpretation

Tying the clusters back to the regression residuals: a superstar in guard cluster 5 or center cluster 4 who still has a large positive residual is overpaid even relative to elite production - i.e., on a max contract but past their prime, (think Paul George, Joel Embiid). Conversely, a player in the “low-usage rotation” clusters (guard C1, forward C5, center C2 / C3) with a large negative residual is exactly the “cheap useful role player” the team values more than the market does; typically a rookie-scale or vet-minimum guy.

Conclusion

Overall, the results show that NBA salaries are strongly influenced by offensive production, particularly scoring and assists. Players responsible for scoring more and creating opportunities for others tend to earn significantly higher salaries. We found that not all performance metrics are equally valued, as rebounding was not a significant predictor in the model.

Visualizations also highlight inefficiencies in the market. Some players provide high production relative to their salary and can be considered underpaid, while others are on large contracts without matching performance, making them overpaid. The negative relationship between salary and value score further supports this idea.

The clustering analysis reinforces that players fall into distinct performance tiers, ranging from low-impact role players to superstars. It suggests that teams may be paying based on perceived player roles rather than strictly on statistical efficiency.

In conclusion, while NBA salaries are partially aligned with performance, there are clear inconsistencies. Teams heavily reward scoring and playmaking, but overall value is not always reflected in the salary, leading to both overpaid and underpaid players league wide.

GeoTag

#install.packages("leaflet")
library(leaflet)

Warning: package 'leaflet' was built under R version 4.5.3

leaflet() |>
  addTiles()|>
  addMarkers(
    lng = -84.5184,
    lat = 33.9384,
    popup = 'KSU Marietta'
  )

QR Code

#install.packages('qrcode')
library(qrcode)
#will need to update URL once final draft is complete
qr <- qr_code('https://rpubs.com/wmostofa/1424378')
plot(qr)