Week 3 Data Dive Overview

This is my Week 3 data dive, which analyzes NBA shooting statistics grouped separately by position, by team three-point reliance, and by age. For each grouping, I examine the smallest group as an outlier and develop hypotheses to test based on the results. I also review roster construction by counting team and position combinations to see whether a particular position is most prevalent on teams.

# loading the appropriate packages, reading the CSV file, and cleaning column
# names to follow R format.

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.5.2
## Warning: package 'dplyr' was built under R version 4.5.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(janitor)
## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(scales)
## Warning: package 'scales' was built under R version 4.5.2
## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor
library(ggplot2)

df <-
  read_csv("C:/Users/guyon/OneDrive/Desktop/NBA_Shooting_Stats.csv") |>
  clean_names() |>
  filter(!team %in% c("2TM", "3TM", "4TM"))  # remove combined-team rows
## Rows: 3669 Columns: 29
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (4): Player, Team, Pos, Season
## dbl (25): Rk, Age, G, GS, MP, FG%, Dist., FGA_2P, FGA_0-3, FGA_3-10, FGA_10-...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Shooting Profiles Grouped by Position

pos_summary <-
  df |>
  group_by(pos) |>
  summarise(
    n_players     = n(),
    avg_fg_pct    = mean(fg_percent, na.rm = TRUE),
    avg_fga_3p    = mean(fga_3p, na.rm = TRUE),
    avg_dist      = mean(dist, na.rm = TRUE),
    .groups = "drop"
  ) |>
  mutate(
    prob_if_random_player = n_players / sum(n_players)
  ) |>
  arrange(n_players)

# Tag lowest-probability group
min_pos_n <- pos_summary |> summarise(m = min(n_players)) |> pull(m)

pos_summary_tagged <-
  pos_summary |>
  mutate(
    lowest_prob_group = n_players == min_pos_n
  )

pos_summary_tagged
## # A tibble: 5 × 7
##   pos   n_players avg_fg_pct avg_fga_3p avg_dist prob_if_random_player
##   <chr>     <int>      <dbl>      <dbl>    <dbl>                 <dbl>
## 1 C           592      0.539      0.174     8.07                 0.182
## 2 PG          612      0.412      0.429    15.5                  0.188
## 3 PF          624      0.460      0.395    13.3                  0.192
## 4 SF          626      0.421      0.485    15.3                  0.192
## 5 SG          803      0.417      0.493    16.1                  0.247
## # ℹ 1 more variable: lowest_prob_group <lgl>
# Visualization
pos_summary |>
  ggplot(aes(x = reorder(pos, avg_fg_pct), y = avg_fg_pct)) +
  geom_col() +
  coord_flip() +
  scale_y_continuous(labels = percent_format()) +
  labs(
    title = "Average FG% by Position",
    x = "Position",
    y = "Average FG%"
  ) +
  theme_minimal()

This grouping shows how shooting efficiency and shot selection differ by position. Centers are the smallest group: the probability that a randomly selected player in the dataset is a center is 18.2%, compared with 24.7% for shooting guards. Centers have the highest average FG% (53.9%) but the lowest three-point attempt rate (0.174), and three-point shooting is a priority in today’s NBA. One hypothesis is that this low three-point volume is why teams carry fewer centers than guards, so centers make up a smaller share of total roster slots.
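
As a quick arithmetic check, the probabilities quoted above can be reproduced from the row counts alone. This is a minimal sketch that hard-codes the counts printed in the `pos_summary` output rather than reading the CSV; the vector names are only illustrative.

```r
# Row counts per position, copied from the pos_summary output
n_players <- c(C = 592, PG = 612, PF = 624, SF = 626, SG = 803)

# Probability that a randomly selected row belongs to each position
prob <- n_players / sum(n_players)

round(prob, 3)
# C 0.182, PG 0.188, PF 0.192, SF 0.192, SG 0.247

# Position with the lowest probability (the smallest group)
smallest_group <- names(which.min(prob))
smallest_group
```

The total is 3,257 rows, which is the dataset after the combined-team rows ("2TM", "3TM", "4TM") were filtered out.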

Significance: The differences in FG% across positions show how a player’s role shapes shooting behavior and efficiency. Further Question: Do positional differences in field goal percentage persist after adjusting for minutes played or shot volume?
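
One way to start on the further question is to compare the simple average FG% with a minutes-weighted average, so that low-minute players do not count as much as regulars. The sketch below uses a small made-up tibble (`toy`) instead of the real data, and it assumes the minutes column is named `mp` after `clean_names()`; both are assumptions for illustration, not results.

```r
library(dplyr)

# Toy stand-in for `df`; all values are invented for illustration only
toy <- tibble::tibble(
  pos        = c("C", "C", "SG", "SG"),
  fg_percent = c(0.60, 0.50, 0.40, 0.45),
  mp         = c(2000, 500, 1500, 1500)
)

pos_weighted <-
  toy |>
  # weighted.mean() returns NA if any weight is NA, so drop incomplete rows first
  filter(!is.na(fg_percent), !is.na(mp)) |>
  group_by(pos) |>
  summarise(
    avg_fg_pct    = mean(fg_percent),                   # every row counts equally
    wtd_fg_pct_mp = weighted.mean(fg_percent, w = mp),  # rows weighted by minutes
    .groups = "drop"
  )

pos_weighted
```

If the ordering of positions is similar in both columns on the real data, the positional gap is less likely to be an artifact of low-minute players.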

Three-Point Reliance and Efficiency Grouped by Team

team_summary <-
  df |>
  group_by(team) |>
  summarise(
    n_players   = n(),
    avg_fga_3p  = mean(fga_3p, na.rm = TRUE),
    avg_3p_pct  = mean(x3p_percent_cor_3, na.rm = TRUE),
    avg_fg_pct  = mean(fg_percent, na.rm = TRUE),
    .groups = "drop"
  ) |>
  mutate(
    prob_if_random_player = n_players / sum(n_players)
  ) |>
  arrange(n_players)

# Tag lowest-probability team(s)
min_team_n <- team_summary |> summarise(m = min(n_players)) |> pull(m)

team_summary_tagged <-
  team_summary |>
  mutate(
    lowest_prob_group = n_players == min_team_n
  )

team_summary_tagged |>
  filter(lowest_prob_group)
## # A tibble: 1 × 7
##   team  n_players avg_fga_3p avg_3p_pct avg_fg_pct prob_if_random_player
##   <chr>     <int>      <dbl>      <dbl>      <dbl>                 <dbl>
## 1 DEN          92      0.404      0.344      0.464                0.0282
## # ℹ 1 more variable: lowest_prob_group <lgl>
# Scatterplot: reliance vs efficiency
team_summary |>
  ggplot(aes(x = avg_fga_3p, y = avg_3p_pct)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Team 3PA Reliance vs Team 3P Efficiency",
    x = "Average 3-Point Attempt Rate",
    y = "Average 3P%"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

This grouping identifies which teams rely most on the three-point shot and how efficiently they shoot it. Teams like Boston and OKC have the highest average three-point attempt rates, while other teams such as Denver rely less on three-point shooting yet maintain a high FG% (46.4%), suggesting a more balanced offense. Denver is also the lowest-probability group: the probability that a randomly selected player plays for Denver is 2.82%. This suggests that offensive efficiency is not driven by player count or shooting volume alone. A hypothesis to test is that teams with fewer players have higher average FG% and lower variance in minutes played (MP).

Significance: Teams can be efficient without maximizing three-point attempts, so strategy depends on personnel fit rather than volume alone. Further Question: Do teams with high three-point attempt volume also have higher overall efficiency when weighted by attempts?
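
The relationship drawn by the trend line in the scatterplot can also be summarized numerically with a correlation and a fitted slope. The sketch below shows the mechanics on four hypothetical teams; the team codes and values are invented, so the numbers say nothing about the real league.

```r
# Hypothetical team-level summary, for illustration only
toy_team <- data.frame(
  team       = c("AAA", "BBB", "CCC", "DDD"),
  avg_fga_3p = c(0.30, 0.35, 0.40, 0.45),
  avg_3p_pct = c(0.33, 0.34, 0.36, 0.35)
)

# Pearson correlation between reliance (attempt rate) and efficiency (3P%)
r <- cor(toy_team$avg_fga_3p, toy_team$avg_3p_pct)

# Same linear model that geom_smooth(method = "lm") fits: y ~ x
fit   <- lm(avg_3p_pct ~ avg_fga_3p, data = toy_team)
slope <- coef(fit)[["avg_fga_3p"]]

r      # 0.8 for this toy data
slope  # 0.16 for this toy data
```

On the real `team_summary`, a slope near zero would support the point that higher three-point volume does not by itself imply higher three-point efficiency.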

Shooting Efficiency Grouped by Age

age_summary <-
  df |>
  group_by(age) |>
  summarise(
    n_players  = n(),
    avg_fg_pct = mean(fg_percent, na.rm = TRUE),
    avg_fga_3p = mean(fga_3p, na.rm = TRUE),
    avg_dist   = mean(dist, na.rm = TRUE),
    .groups = "drop"
  ) |>
  mutate(
    prob_if_random_player = n_players / sum(n_players)
  ) |>
  arrange(n_players)

min_age_n <-
  age_summary |>
  summarise(min_n = min(n_players)) |>
  pull(min_n)

age_summary_tagged <-
  age_summary |>
  mutate(
    lowest_prob_group = n_players == min_age_n
  )

age_summary_tagged |>
  filter(lowest_prob_group)
## # A tibble: 1 × 7
##     age n_players avg_fg_pct avg_fga_3p avg_dist prob_if_random_player
##   <dbl>     <int>      <dbl>      <dbl>    <dbl>                 <dbl>
## 1    41         1      0.452      0.129      9.4              0.000307
## # ℹ 1 more variable: lowest_prob_group <lgl>
age_summary |>
  ggplot(aes(x = age, y = avg_fg_pct)) +
  geom_line() +
  geom_point() +
  scale_y_continuous(labels = percent_format()) +
  labs(
    title = "Average FG% by Age",
    x = "Age",
    y = "Average FG%"
  ) +
  theme_minimal()

The smallest age groups are the youngest and oldest players; most players cluster in their mid to late 20s. Shooting efficiency and shot selection vary modestly across ages, with older age groups showing slightly lower and more variable average FG%. If a player is selected at random, the probability of selecting a player who is 41 years old is approximately 0.03% (1 of 3,257 rows). A hypothesis to test is that players in extreme age groups average fewer minutes played (MP) and/or appear in fewer games than “prime”-aged players.

Significance: The rarity of extreme age groups suggests that the youngest and oldest players can hold a spot in the league if they shoot the ball effectively and minimize the negative impact of inexperience (for younger players) or physical decline (for older players). Because these groups are so small, they should be treated with care when analyzing league-wide shooting trends. Further Question: Do players in extreme age groups differ significantly in minutes played or games played compared to “prime”-aged players?
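
A first pass at the further question could bucket ages and compare average minutes per bucket. In the sketch below, the cutoffs (21 and under as "young", 35 and over as "old"), the `mp` column name, and the toy values are all assumptions chosen for illustration.

```r
library(dplyr)

# Toy stand-in for `df`; ages and minutes are invented
toy_age <- tibble::tibble(
  age = c(19, 20, 26, 27, 28, 38, 41),
  mp  = c(400, 600, 2200, 2400, 2000, 900, 300)
)

age_bucket_summary <-
  toy_age |>
  mutate(
    # Hypothetical cutoffs; adjust after looking at the age distribution
    age_bucket = case_when(
      age <= 21 ~ "young",
      age >= 35 ~ "old",
      TRUE      ~ "prime"
    )
  ) |>
  group_by(age_bucket) |>
  summarise(
    n_rows = n(),
    avg_mp = mean(mp, na.rm = TRUE),
    .groups = "drop"
  )

age_bucket_summary
```

With the real data, the same summary could be followed by a formal test of the difference, but group sizes as small as one row (age 41) would limit what such a test can show.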

Identifying All Combinations of Position and Team

df_combo <-
  df |>
  filter(!is.na(team), !is.na(pos))

# 1) Unique combinations that exist in the data
team_pos_combos <-
  df_combo |>
  distinct(team, pos)

# 2) All possible combinations (full grid)
team_pos_grid <-
  expand_grid(
    team = sort(unique(df_combo$team)),
    pos  = sort(unique(df_combo$pos))
  )

# 3) Combinations that do NOT exist
missing_team_pos <-
  team_pos_grid |>
  anti_join(team_pos_combos, by = c("team", "pos"))

# If missing combos exist, show a few; otherwise explicitly say none are missing
if (nrow(missing_team_pos) == 0) {
  cat("No missing Team x Position combinations were found in this dataset.\n")
} else {
  cat("Missing Team x Position combinations (first 10 shown):\n")
  missing_team_pos |>
    slice_head(n = 10) |>
    print()
}
## No missing Team x Position combinations were found in this dataset.
# 4) Count how common each existing combination is
team_pos_counts <-
  df_combo |>
  count(team, pos, name = "n") |>
  arrange(desc(n))

# 5) Most common and least common combinations
most_common_team_pos <-
  team_pos_counts |>
  slice_max(n, n = 1, with_ties = FALSE)

least_common_team_pos <-
  team_pos_counts |>
  slice_min(n, n = 1, with_ties = FALSE)

most_common_team_pos
## # A tibble: 1 × 3
##   team  pos       n
##   <chr> <chr> <int>
## 1 DET   SG       36
least_common_team_pos
## # A tibble: 1 × 3
##   team  pos       n
##   <chr> <chr> <int>
## 1 GSW   C        13
# 6) Heatmap visual of combinations
team_pos_counts |>
  ggplot(aes(x = pos, y = team, fill = n)) +
  geom_tile() +
  labs(
    title = "Counts of Players by Team and Position",
    x = "Position",
    y = "Team",
    fill = "Count"
  ) +
  theme_minimal()

The anti-join against the full grid shows that no team and position combinations are missing from the dataset. The most common team x position combination was shooting guards (SG) on the Detroit Pistons, with a count of 36. This is probably because the Pistons are trying to stay up to date with modern NBA roster construction, which prioritizes shooting guards, who are typically the best position at shooting three-pointers. The least common combination was centers on the Golden State Warriors, with a count of 13. This is probably because the Warriors, the team that revolutionized three-point shooting efficiency, build their roster around players who can shoot three-point shots efficiently, and those players are mostly guards rather than centers.

Significance: Teams must balance roster construction and position combinations to play to their offensive strengths while keeping the flexibility to adjust their play style against different opponents. Further Question: Are teams with a more balanced position distribution more efficient offensively than teams whose rosters are heavily invested in one or two positions?
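
To make "balanced" measurable, one simple option is the share of a team's rows taken by its most common position: a lower maximum share means a more even roster. The sketch below applies that idea to a toy table shaped like `team_pos_counts`; the team codes and counts are hypothetical, and the metric itself is just one possible choice.

```r
library(dplyr)

# Toy table with the same columns as team_pos_counts (team, pos, n)
toy_counts <- tibble::tibble(
  team = c("AAA", "AAA", "AAA", "BBB", "BBB", "BBB"),
  pos  = c("C", "PG", "SG", "C", "PG", "SG"),
  n    = c(10, 10, 10, 5, 5, 20)
)

team_balance <-
  toy_counts |>
  group_by(team) |>
  summarise(
    # Share of the team's rows held by its single most common position
    max_pos_share = max(n / sum(n)),
    .groups = "drop"
  ) |>
  arrange(max_pos_share)  # most balanced teams first

team_balance
```

Joining `max_pos_share` to `team_summary` by `team` would then allow a comparison of balance against average FG%.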