week2datadive

Week 2 Data Dive Overview

This is my Week 2 data dive, which analyzes NBA shooting statistics by shot distance. I examine how the mix of shot attempts taken by players and teams affects efficiency, and whether efficiency by distance differs among positions.

# loading the appropriate packages, reading the CSV file, and cleaning column
# names to follow R format.

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.5.2

## Warning: package 'dplyr' was built under R version 4.5.2

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(janitor)

## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

library(scales)

## Warning: package 'scales' was built under R version 4.5.2

## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor

library(ggplot2)

df <-
  read_csv("C:/Users/guyon/OneDrive/Desktop/NBA_Shooting_Stats.csv") |>
  clean_names()

## Rows: 3669 Columns: 29
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (4): Player, Team, Pos, Season
## dbl (25): Rk, Age, G, GS, MP, FG%, Dist., FGA_2P, FGA_0-3, FGA_3-10, FGA_10-...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Numeric Summary of Shooting Efficiency and 3-Point Reliance

A summary table of players’ field goal percentage (FG%) and the share of their field goal attempts that are three-point shots:

numeric_summary <-
  df |>
  summarise(
    across(
      c(fg_percent, fga_3p),
      list(
        n      = ~ sum(!is.na(.)),
        mean   = ~ mean(., na.rm = TRUE),
        median = ~ median(., na.rm = TRUE),
        sd     = ~ sd(., na.rm = TRUE),
        min    = ~ min(., na.rm = TRUE),
        q1     = ~ quantile(., 0.25, na.rm = TRUE),
        q3     = ~ quantile(., 0.75, na.rm = TRUE),
        max    = ~ max(., na.rm = TRUE)
      ),
      .names = "{.col}_{.fn}"
    )
  )

numeric_summary |>
  print(width = Inf)

## # A tibble: 1 × 16
##   fg_percent_n fg_percent_mean fg_percent_median fg_percent_sd fg_percent_min
##          <int>           <dbl>             <dbl>         <dbl>          <dbl>
## 1         3633           0.447             0.446         0.119              0
##   fg_percent_q1 fg_percent_q3 fg_percent_max fga_3p_n fga_3p_mean fga_3p_median
##           <dbl>         <dbl>          <dbl>    <int>       <dbl>         <dbl>
## 1           0.4           0.5              1     3633       0.404         0.417
##   fga_3p_sd fga_3p_min fga_3p_q1 fga_3p_q3 fga_3p_max
##       <dbl>      <dbl>     <dbl>     <dbl>      <dbl>
## 1     0.225          0     0.269     0.552          1

The average FG% across the 3,633 player records with a recorded value is 44.7%, with a median of 44.6%; the near-identical mean and median suggest a roughly symmetric distribution. The standard deviation (11.9 percentage points) suggests moderate variability in shooting efficiency across players. The minimum and maximum (0% and 100%) are extreme observations that likely reflect a very small number of attempts rather than true shooting ability. The quartiles (Q1 = 40% and Q3 = 50%) show that the middle 50% of players shoot between 40% and 50%.
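One way to check whether the 0% and 100% values come from low-volume players is to look at their minutes and then recompute the range after a minutes cutoff. The sketch below uses invented toy rows, assumes `mp` holds total minutes played, and uses a hypothetical cutoff of 100 minutes; none of these values come from the dataset.

```r
library(dplyr)

# Toy rows invented for illustration; not taken from the NBA dataset.
toy <- tibble::tibble(
  player     = c("A", "B", "C", "D", "E"),
  mp         = c(6, 1900, 12, 2400, 850),   # assumed: total minutes played
  fg_percent = c(1.00, 0.47, 0.00, 0.52, 0.44)
)

# Rows sitting at the extremes of FG%
extremes <-
  toy |>
  filter(fg_percent %in% c(0, 1))

# Hypothetical cutoff; the right threshold is a judgment call.
min_minutes <- 100

# Range of FG% once low-minute rows are dropped
trimmed_range <-
  toy |>
  filter(mp >= min_minutes) |>
  summarise(
    min_fg = min(fg_percent, na.rm = TRUE),
    max_fg = max(fg_percent, na.rm = TRUE)
  )

extremes
trimmed_range
```

If the extremes disappear after the cutoff, that supports reading them as small-sample noise rather than real shooting ability.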

On average, 40.4% of players’ field goal attempts come from three-point range, indicating the importance of perimeter shooting. The median (41.7%) is slightly higher than the mean, suggesting a slight left skew in which some players take far fewer threes than the rest. Variability is higher for three-point attempt rate (SD = 22.5 percentage points) than for FG%, indicating wide differences in shot selection among players. The minimum and maximum (0% and 100%) show that some players attempted no threes while others attempted nothing else, though these extremes may again reflect small samples. The quartiles (Q1 = 26.9% and Q3 = 55.2%) show that the middle 50% of players take between about 27% and 55% of their shots from three-point range.

This suggests that while shooting efficiency is relatively consistent across players, reliance on three-point shooting varies dramatically. Differences in offensive roles and strategies may therefore shape how players score more than their overall ability to convert shots does. A further question to investigate is whether players with extreme three-point reliance exhibit higher or lower efficiency.
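The follow-up question about extreme three-point reliance could be approached by binning players on `fga_3p` and comparing mean FG% across bins. This is only a sketch: the toy values are invented, and the bin edges (0.25 and 0.75) are assumptions chosen to sit near the observed quartiles, not values from the analysis.

```r
library(dplyr)

# Toy rows invented for illustration; not taken from the NBA dataset.
toy <- tibble::tibble(
  fga_3p     = c(0.00, 0.05, 0.30, 0.42, 0.55, 0.70, 0.95, 1.00),
  fg_percent = c(0.62, 0.58, 0.47, 0.45, 0.43, 0.41, 0.38, 0.33)
)

reliance_summary <-
  toy |>
  mutate(
    # Assumed bin edges: [0, 0.25], (0.25, 0.75], (0.75, 1]
    reliance = cut(
      fga_3p,
      breaks = c(0, 0.25, 0.75, 1),
      labels = c("low", "middle", "high"),
      include.lowest = TRUE
    )
  ) |>
  group_by(reliance) |>
  summarise(
    players = n(),
    mean_fg = mean(fg_percent, na.rm = TRUE),
    .groups = "drop"
  )

reliance_summary
```

On the real data, the same pipeline would start from `df` instead of `toy`, ideally after removing very low-volume players so that the extreme bins are not dominated by small samples.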

Categorical Summary (unique values and counts)

Here is a summary of how many player records fall under each position and each team from the 2018-19 season to the 2024-25 season, excluding the pandemic-affected 2019-20 and 2020-21 seasons.

categorical_summary <-
  df |>
  filter(!team %in% c("2TM", "3TM", "4TM")) |>
  select(team, pos) |>
  pivot_longer(
    cols = everything(),
    names_to = "column",
    values_to = "value"
  ) |>
  count(column, value, sort = TRUE)

categorical_summary |>
  print(n = 50)

## # A tibble: 35 × 3
##    column value     n
##    <chr>  <chr> <int>
##  1 pos    SG      803
##  2 pos    SF      626
##  3 pos    PF      624
##  4 pos    PG      612
##  5 pos    C       592
##  6 team   PHI     128
##  7 team   WAS     125
##  8 team   MEM     124
##  9 team   DET     121
## 10 team   MIL     118
## 11 team   TOR     118
## 12 team   DAL     117
## 13 team   LAL     116
## 14 team   BRK     113
## 15 team   SAC     111
## 16 team   CLE     110
## 17 team   LAC     110
## 18 team   NYK     110
## 19 team   POR     110
## 20 team   IND     109
## 21 team   PHO     109
## 22 team   CHO     108
## 23 team   ATL     106
## 24 team   UTA     106
## 25 team   NOP     104
## 26 team   OKC     104
## 27 team   SAS     104
## 28 team   CHI     102
## 29 team   MIA     102
## 30 team   BOS     100
## 31 team   MIN      98
## 32 team   HOU      95
## 33 team   ORL      94
## 34 team   GSW      93
## 35 team   DEN      92

Note: 2TM, 3TM, and 4TM were filtered out because they label players who played for more than one team in a season, for example after a midseason trade. Left in, they would be counted as separate teams even though the NBA has only 30 teams.

This summary shows that the dataset includes players from all five standard NBA positions and all 30 teams. Team counts vary from 92 (DEN) to 128 (PHI), which may reflect different roster sizes, player movement, or a minutes threshold used in the dataset. Shooting guards appear most often (803 records), while the other four positions are closely grouped (592 to 626 records each).

The positional distribution supports analyzing shooting breakdowns by position and provides context for the later efficiency and shot-selection analysis, since positions with more observations may drive overall trends. One question to investigate further is whether weighting players by minutes played changes the observed positional balance.
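To illustrate the minutes-weighting question, the sketch below compares each position's share of records with its share of total minutes. The toy values are invented, and `mp` is assumed to be total minutes played (the cleaned name of the `MP` column).

```r
library(dplyr)

# Toy rows invented for illustration; not taken from the NBA dataset.
toy <- tibble::tibble(
  pos = c("PG", "SG", "SG", "C", "C", "PF"),
  mp  = c(2000, 300, 500, 1800, 1200, 900)   # assumed: total minutes played
)

position_balance <-
  toy |>
  group_by(pos) |>
  summarise(
    records = n(),
    minutes = sum(mp, na.rm = TRUE),
    .groups = "drop"
  ) |>
  mutate(
    record_share = records / sum(records),   # unweighted balance
    minute_share = minutes / sum(minutes)    # minutes-weighted balance
  ) |>
  arrange(desc(minute_share))

position_balance
```

A position whose `minute_share` is well below its `record_share` is made up of many low-minute records, which would matter when reading the raw counts above.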

Questions to Investigate

Here are three questions to potentially investigate with the given dataset:

- Do different positions have different shot distances and three-point attempt rates?
- Is there a relationship between a player’s age and their shot selection, shot distance, and FG%?
- Which teams rely most on three-point attempts, and how efficient are they? (question investigated below)
  - fga_3p = percentage of field goal attempts that are three-point attempts
  - x3p_percent_cor_3 = three-point field goal percentage as used in the code below (the cor_3 suffix suggests this column may cover corner threes only)
  - fg_percent = overall field goal percentage
  - The data covers the 2018-19 season to the 2024-25 season, excluding the pandemic-affected 2019-20 and 2020-21 seasons.

team_summary <-
  df |>
  filter(!team %in% c("2TM","3TM","4TM")) |>
  group_by(team) |>
  summarise(
    players    = n(),
    avg_fga_3p = mean(fga_3p, na.rm = TRUE),
    avg_3p_pct = mean(x3p_percent_cor_3, na.rm = TRUE),
    avg_fg_pct = mean(fg_percent, na.rm = TRUE),
    .groups = "drop"
  ) |>
  arrange(desc(avg_fga_3p))

team_summary |>
  print(n = 30)

## # A tibble: 30 × 5
##    team  players avg_fga_3p avg_3p_pct avg_fg_pct
##    <chr>   <int>      <dbl>      <dbl>      <dbl>
##  1 MIL       118      0.467      0.371      0.431
##  2 BOS       100      0.458      0.317      0.456
##  3 OKC       104      0.451      0.358      0.432
##  4 HOU        95      0.436      0.311      0.450
##  5 DET       121      0.435      0.324      0.421
##  6 UTA       106      0.427      0.358      0.458
##  7 MIA       102      0.426      0.360      0.435
##  8 ATL       106      0.424      0.340      0.455
##  9 ORL        94      0.420      0.363      0.441
## 10 MIN        98      0.419      0.361      0.464
## 11 NYK       110      0.413      0.35       0.422
## 12 CLE       110      0.411      0.355      0.453
## 13 DAL       117      0.409      0.356      0.458
## 14 PHI       128      0.408      0.349      0.450
## 15 DEN        92      0.404      0.344      0.464
## 16 BRK       113      0.403      0.343      0.444
## 17 MEM       124      0.399      0.311      0.428
## 18 GSW        93      0.396      0.347      0.473
## 19 LAL       116      0.396      0.310      0.442
## 20 TOR       118      0.388      0.370      0.442
## 21 SAC       111      0.383      0.320      0.441
## 22 NOP       104      0.382      0.349      0.431
## 23 PHO       109      0.380      0.358      0.441
## 24 POR       110      0.378      0.331      0.444
## 25 CHI       102      0.378      0.370      0.475
## 26 LAC       110      0.376      0.359      0.467
## 27 SAS       104      0.364      0.361      0.463
## 28 WAS       125      0.363      0.319      0.448
## 29 CHO       108      0.361      0.363      0.439
## 30 IND       109      0.348      0.362      0.459

Teams at the top of the table (MIL, BOS, OKC) show the highest average three-point attempt rates, indicating a strong reliance on the perimeter shot, while teams at the bottom (WAS, CHO, IND) rely less on the three, indicating a more balanced or interior-focused offense. Overall, the results show that high three-point volume does not guarantee efficiency: MIL has the highest attempt rate but one of the lower FG% values in the table (43.1%), and BOS pairs the second-highest rate with a 31.7% three-point percentage. Note that these are unweighted averages across player records, so low-minute players count as much as starters. This is consistent with how modern basketball is played: teams that pair three-point volume with efficiency may gain a competitive advantage, but there may be diminishing returns where shooting too many threes weakens an offense, usually depending on player personnel. A follow-up question is what three-point attempt rate maximizes a team’s overall FG%.
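As a first step toward the follow-up question, the team-level relationship between attempt rate and FG% can be summarized with a correlation. The sketch below re-enters the `avg_fga_3p` and `avg_fg_pct` columns from the printed table, so the values are rounded to three decimals; the object names `team_rates` and `r` are introduced here for illustration.

```r
# Values copied from the printed team summary (rounded to three decimals).
team_rates <- tibble::tribble(
  ~team, ~avg_fga_3p, ~avg_fg_pct,
  "MIL", 0.467, 0.431,
  "BOS", 0.458, 0.456,
  "OKC", 0.451, 0.432,
  "HOU", 0.436, 0.450,
  "DET", 0.435, 0.421,
  "UTA", 0.427, 0.458,
  "MIA", 0.426, 0.435,
  "ATL", 0.424, 0.455,
  "ORL", 0.420, 0.441,
  "MIN", 0.419, 0.464,
  "NYK", 0.413, 0.422,
  "CLE", 0.411, 0.453,
  "DAL", 0.409, 0.458,
  "PHI", 0.408, 0.450,
  "DEN", 0.404, 0.464,
  "BRK", 0.403, 0.444,
  "MEM", 0.399, 0.428,
  "GSW", 0.396, 0.473,
  "LAL", 0.396, 0.442,
  "TOR", 0.388, 0.442,
  "SAC", 0.383, 0.441,
  "NOP", 0.382, 0.431,
  "PHO", 0.380, 0.441,
  "POR", 0.378, 0.444,
  "CHI", 0.378, 0.475,
  "LAC", 0.376, 0.467,
  "SAS", 0.364, 0.463,
  "WAS", 0.363, 0.448,
  "CHO", 0.361, 0.439,
  "IND", 0.348, 0.459
)

# Pearson correlation between team three-point attempt rate and team FG%
r <- cor(team_rates$avg_fga_3p, team_rates$avg_fg_pct)
r
```

A correlation only describes a linear trend across 30 teams. Finding an "optimal" rate would need a model that allows curvature, plus weighting by attempts or minutes, so this is a starting point rather than an answer.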

Distribution of FG% by Position

This boxplot shows the distribution of FG% by position to see whether certain positions shoot the ball more efficiently than others.

df |>
  ggplot(aes(x = pos, y = fg_percent)) +
  geom_boxplot() +
  scale_y_continuous(labels = percent_format()) +
  labs(
    title = "Field Goal Percentage by Position",
    x = "Position",
    y = "FG%"
  ) +
  theme_minimal()

## Warning: Removed 36 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

The boxplot shows positional differences in FG%. Centers have the highest median FG%, making them the most efficient group, while guards (PG and SG) have a lower median FG% and greater variability. Frontcourt players (SF, PF, C) tend to convert a higher percentage of their shots than guards, which reflects the different offensive roles of each position and reinforces the idea that FG% should be analyzed within a positional context. A further question to investigate is whether these positional differences in FG% hold consistently across seasons.
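A simple way to check consistency across seasons is to compute the median FG% for each season and position. The sketch below uses invented toy rows; the `season` column exists in the dataset, but the label format shown here ("2018-19") is an assumption.

```r
library(dplyr)

# Toy rows invented for illustration; season labels are an assumed format.
toy <- tibble::tribble(
  ~season,   ~pos, ~fg_percent,
  "2018-19", "C",  0.58,
  "2018-19", "C",  0.54,
  "2018-19", "PG", 0.43,
  "2018-19", "PG", 0.41,
  "2024-25", "C",  0.60,
  "2024-25", "C",  0.56,
  "2024-25", "PG", 0.44,
  "2024-25", "PG", 0.40
)

median_by_season <-
  toy |>
  group_by(season, pos) |>
  summarise(
    median_fg = median(fg_percent, na.rm = TRUE),
    .groups = "drop"
  )

median_by_season
```

On the real data, adding `facet_wrap(~ season)` to the boxplot code would give the visual version of the same comparison.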

Comparing the FG% and Shot Distance between Point Guards and Shooting Guards

This scatterplot compares average shot distance and FG% for point guards and shooting guards, the two positions most associated with three-point shooting, shooting guards in particular.

df |>
  filter(pos %in% c("PG", "SG")) |>
  ggplot(aes(x = dist, y = fg_percent, color = pos)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  scale_y_continuous(labels = percent_format()) +
  labs(
    title = "FG% vs Shot Distance: Point Guards vs Shooting Guards",
    x = "Average Shot Distance (ft)",
    y = "FG%"
  ) +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

## Warning: Removed 12 rows containing non-finite outside the scale range
## (`stat_smooth()`).

## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_point()`).

The scatterplot shows a negative relationship between average shot distance and FG% for both point guards and shooting guards: as shot distance increases, FG% tends to decrease. This is consistent with the basic basketball principle that longer shots are harder to make. The similarity of the two trend lines suggests that the PG or SG label by itself matters less for shooting efficiency than shot selection and distance do. This is significant because evaluating guard efficiency without accounting for shot distance can lead to misleading conclusions. A further question to investigate is whether this relationship changes when players are weighted by minutes played or shot attempts.
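The weighting question can be sketched with a weighted linear model, which is the regression counterpart of the trend lines in the plot. The toy values below are invented, and `mp` is assumed to be total minutes played; `dist * pos` fits a separate slope for each position, matching the per-position lines drawn by `geom_smooth()`.

```r
# Toy rows invented for illustration; not taken from the NBA dataset.
toy <- data.frame(
  pos        = c("PG", "PG", "PG", "SG", "SG", "SG"),
  dist       = c(10, 14, 18, 11, 15, 19),            # average shot distance (ft)
  fg_percent = c(0.50, 0.45, 0.41, 0.49, 0.44, 0.40),
  mp         = c(2400, 1500, 200, 2200, 1300, 150)   # assumed: total minutes played
)

# Unweighted fit: every player record counts equally
fit_unweighted <- lm(fg_percent ~ dist * pos, data = toy)

# Weighted fit: records with more minutes have more influence
fit_weighted <- lm(fg_percent ~ dist * pos, data = toy, weights = mp)

coef(fit_unweighted)
coef(fit_weighted)
```

Comparing the `dist` coefficients of the two fits shows whether low-minute players are driving the slope. In the plot itself, mapping `weight = mp` inside `aes()` should make `geom_smooth(method = "lm")` use the same weights.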