```{r setup, message=FALSE, warning=FALSE}
library(tidyverse)
library(janitor)
library(scales)
library(ggplot2)
```
This is my week 2 data dive into NBA shooting statistics by distance. I analyze how the types of shots that players and teams attempt affect efficiency, and whether efficiency by distance differs among positions.
# loading the appropriate packages, reading the CSV file, and cleaning column
# names to follow R format.
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.5.2
## Warning: package 'dplyr' was built under R version 4.5.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(janitor)
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(scales)
## Warning: package 'scales' was built under R version 4.5.2
##
## Attaching package: 'scales'
##
## The following object is masked from 'package:purrr':
##
## discard
##
## The following object is masked from 'package:readr':
##
## col_factor
library(ggplot2)
df <-
  read_csv("C:/Users/guyon/OneDrive/Desktop/NBA_Shooting_Stats.csv") |>
  clean_names()
## Rows: 3669 Columns: 29
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): Player, Team, Pos, Season
## dbl (25): Rk, Age, G, GS, MP, FG%, Dist., FGA_2P, FGA_0-3, FGA_3-10, FGA_10-...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
A summary table of players’ field goal percentage (FG%) and the share of their field goal attempts that are three-pointers:
numeric_summary <-
  df |>
  summarise(
    across(
      c(fg_percent, fga_3p),
      list(
        n = ~ sum(!is.na(.)),
        mean = ~ mean(., na.rm = TRUE),
        median = ~ median(., na.rm = TRUE),
        sd = ~ sd(., na.rm = TRUE),
        min = ~ min(., na.rm = TRUE),
        q1 = ~ quantile(., 0.25, na.rm = TRUE),
        q3 = ~ quantile(., 0.75, na.rm = TRUE),
        max = ~ max(., na.rm = TRUE)
      ),
      .names = "{.col}_{.fn}"
    )
  )

numeric_summary |>
  print(width = Inf)
## # A tibble: 1 × 16
## fg_percent_n fg_percent_mean fg_percent_median fg_percent_sd fg_percent_min
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 3633 0.447 0.446 0.119 0
## fg_percent_q1 fg_percent_q3 fg_percent_max fga_3p_n fga_3p_mean fga_3p_median
## <dbl> <dbl> <dbl> <int> <dbl> <dbl>
## 1 0.4 0.5 1 3633 0.404 0.417
## fga_3p_sd fga_3p_min fga_3p_q1 fga_3p_q3 fga_3p_max
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.225 0 0.269 0.552 1
The average FG% across all players is 44.7% with a median of 44.6%, indicating a roughly symmetric distribution around the center. The standard deviation (11.9 percentage points) suggests moderate variability in shooting efficiency across players. The minimum and maximum (0% and 100%) are extreme observations that likely reflect very few shot attempts rather than true shooting ability. The interquartile range (Q1 = 40%, Q3 = 50%) shows that the middle 50% of players shoot between 40% and 50%.
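One way to check whether the 0% and 100% values come from low-volume players is to apply a minimum-minutes cutoff before summarising. The sketch below is illustrative only: the rows in `toy` are made up, the 100-minute cutoff is arbitrary, and it assumes the `mp` column holds total minutes played.

```r
library(tidyverse)

# Hypothetical toy rows standing in for `df`; values are made up.
toy <- tibble(
  player     = c("A", "B", "C", "D", "E"),
  mp         = c(5, 12, 900, 1500, 2100),  # assumed: total minutes played
  fg_percent = c(0, 1, 0.44, 0.47, 0.52)
)

min_minutes <- 100  # arbitrary cutoff chosen for illustration

# Keep only players at or above the cutoff, then recompute the range.
qualified_summary <- toy |>
  filter(mp >= min_minutes) |>
  summarise(
    n   = n(),
    min = min(fg_percent),
    max = max(fg_percent)
  )

qualified_summary
```

If the extremes disappear once the cutoff is applied to the real data, that would support reading them as small-sample noise.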
On average, 40.4% of players’ FGA come from three-point range, indicating the importance of perimeter shooting. The median (41.7%) is slightly higher than the mean, suggesting a small left skew where some players take significantly fewer threes. Variability is higher for three-point attempt rate (SD = 22.5 percentage points) than for FG%, indicating wide differences in shot selection among players. The minimum and maximum (0% and 100%) confirm that some players never attempt threes while others rely on them entirely. The interquartile range (Q1 = 26.9%, Q3 = 55.2%) shows that the middle 50% of players take between about 27% and 55% of their shots from three-point range.
This suggests that while shooting efficiency is comparatively stable across players, reliance on three-point shooting varies dramatically. Differences in offensive roles and strategies appear to shape how players score more than their overall ability to convert shots. A further question to investigate is whether players with extreme three-point reliance exhibit higher or lower efficiency.
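A possible starting point for that follow-up question is to bin players by three-point attempt rate and compare mean FG% across bins. This is a minimal sketch under stated assumptions: the `toy` values are invented and say nothing about the real result, and the 0.25 and 0.55 cutoffs are illustrative choices placed near the quartiles reported above.

```r
library(tidyverse)

# Hypothetical toy rows; not taken from the dataset.
toy <- tibble(
  fga_3p     = c(0.00, 0.05, 0.30, 0.40, 0.45, 0.60, 0.90, 1.00),
  fg_percent = c(0.62, 0.58, 0.47, 0.45, 0.44, 0.43, 0.40, 0.38)
)

reliance_summary <- toy |>
  mutate(
    # Bins: [0, 0.25], (0.25, 0.55], (0.55, 1]; cutoffs are illustrative.
    reliance = cut(
      fga_3p,
      breaks = c(0, 0.25, 0.55, 1),
      labels = c("low", "middle", "high"),
      include.lowest = TRUE
    )
  ) |>
  group_by(reliance) |>
  summarise(
    players = n(),
    mean_fg = mean(fg_percent),
    .groups = "drop"
  )

reliance_summary
```

On the real data, `toy` would be replaced by `df` with rows missing `fga_3p` or `fg_percent` dropped first.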
Here is a summary of how many player rows fall under each position and each team from the 2018-19 season to the 2024-25 season, excluding the pandemic-affected 2019-20 and 2020-21 seasons.
categorical_summary <-
  df |>
  filter(!team %in% c("2TM", "3TM", "4TM")) |>
  select(team, pos) |>
  pivot_longer(
    cols = everything(),
    names_to = "column",
    values_to = "value"
  ) |>
  count(column, value, sort = TRUE)

categorical_summary |>
  print(n = 50)
## # A tibble: 35 × 3
## column value n
## <chr> <chr> <int>
## 1 pos SG 803
## 2 pos SF 626
## 3 pos PF 624
## 4 pos PG 612
## 5 pos C 592
## 6 team PHI 128
## 7 team WAS 125
## 8 team MEM 124
## 9 team DET 121
## 10 team MIL 118
## 11 team TOR 118
## 12 team DAL 117
## 13 team LAL 116
## 14 team BRK 113
## 15 team SAC 111
## 16 team CLE 110
## 17 team LAC 110
## 18 team NYK 110
## 19 team POR 110
## 20 team IND 109
## 21 team PHO 109
## 22 team CHO 108
## 23 team ATL 106
## 24 team UTA 106
## 25 team NOP 104
## 26 team OKC 104
## 27 team SAS 104
## 28 team CHI 102
## 29 team MIA 102
## 30 team BOS 100
## 31 team MIN 98
## 32 team HOU 95
## 33 team ORL 94
## 34 team GSW 93
## 35 team DEN 92
Note: 2TM, 3TM, and 4TM were filtered out of the dataset because they refer to players who were traded in the middle of a season, not to actual teams. Left in, they would be counted as separate teams even though the NBA has only 30.
This summary shows that the dataset includes players from all five standard NBA positions and all 30 teams. Team counts vary, which may reflect differences in roster size, player movement, or a minutes threshold used in the dataset. Shooting guards (803) appear most frequently, while the other four positions are roughly balanced (592 to 626).
The positional distribution supports analyzing shooting breakdowns by position and provides context for later efficiency and shot-selection analysis, since positions with more observations may weigh more heavily on overall trends. One question to further investigate is whether weighting players by minutes played changes the observed positional balance.
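The minutes-weighting question could be explored by comparing each position's share of rows with its share of total minutes. The sketch below uses made-up rows and assumes `mp` is total minutes played; it is one possible approach, not part of the original analysis.

```r
library(tidyverse)

# Hypothetical toy rows; values are made up.
toy <- tibble(
  pos = c("PG", "SG", "SG", "C", "C"),
  mp  = c(2000, 300, 500, 1800, 400)  # assumed: total minutes played
)

balance <- toy |>
  group_by(pos) |>
  summarise(rows = n(), minutes = sum(mp), .groups = "drop") |>
  mutate(
    row_share    = rows / sum(rows),        # unweighted positional balance
    minute_share = minutes / sum(minutes)   # balance weighted by minutes
  )

balance
```

A position whose `minute_share` is well below its `row_share` would be one that is padded by low-minute players.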
Here are three questions to potentially investigate with the given dataset:
Do different positions have different shot distances and three-point attempt rates?
Is there a relationship between a player’s age and their shot selection/distance and FG%?
Which teams rely most on 3-point attempts, and how efficient are they? (question investigated below)
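avg_fga_3p = team average of the percentage of field goal attempts that are three-point attempts (from fga_3p)
avg_3p_pct = team average field goal percentage on three-pointers (computed from x3p_percent_cor_3, whose name suggests it may cover corner threes only)
avg_fg_pct = team average of overall field goal percentage (from fg_percent)
The data covers the 2018-19 season to the 2024-25 season, excluding the pandemic-affected 2019-20 and 2020-21 seasons.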
team_summary <-
  df |>
  filter(!team %in% c("2TM", "3TM", "4TM")) |>
  group_by(team) |>
  summarise(
    players = n(),
    avg_fga_3p = mean(fga_3p, na.rm = TRUE),
    avg_3p_pct = mean(x3p_percent_cor_3, na.rm = TRUE),
    avg_fg_pct = mean(fg_percent, na.rm = TRUE),
    .groups = "drop"
  ) |>
  arrange(desc(avg_fga_3p))

team_summary |>
  print(n = 30)
## # A tibble: 30 × 5
## team players avg_fga_3p avg_3p_pct avg_fg_pct
## <chr> <int> <dbl> <dbl> <dbl>
## 1 MIL 118 0.467 0.371 0.431
## 2 BOS 100 0.458 0.317 0.456
## 3 OKC 104 0.451 0.358 0.432
## 4 HOU 95 0.436 0.311 0.450
## 5 DET 121 0.435 0.324 0.421
## 6 UTA 106 0.427 0.358 0.458
## 7 MIA 102 0.426 0.360 0.435
## 8 ATL 106 0.424 0.340 0.455
## 9 ORL 94 0.420 0.363 0.441
## 10 MIN 98 0.419 0.361 0.464
## 11 NYK 110 0.413 0.35 0.422
## 12 CLE 110 0.411 0.355 0.453
## 13 DAL 117 0.409 0.356 0.458
## 14 PHI 128 0.408 0.349 0.450
## 15 DEN 92 0.404 0.344 0.464
## 16 BRK 113 0.403 0.343 0.444
## 17 MEM 124 0.399 0.311 0.428
## 18 GSW 93 0.396 0.347 0.473
## 19 LAL 116 0.396 0.310 0.442
## 20 TOR 118 0.388 0.370 0.442
## 21 SAC 111 0.383 0.320 0.441
## 22 NOP 104 0.382 0.349 0.431
## 23 PHO 109 0.380 0.358 0.441
## 24 POR 110 0.378 0.331 0.444
## 25 CHI 102 0.378 0.370 0.475
## 26 LAC 110 0.376 0.359 0.467
## 27 SAS 104 0.364 0.361 0.463
## 28 WAS 125 0.363 0.319 0.448
## 29 CHO 108 0.361 0.363 0.439
## 30 IND 109 0.348 0.362 0.459
Note: as in the earlier summary, 2TM, 3TM, and 4TM were filtered out because they refer to players who were traded in the middle of a season, not to actual teams.
Teams at the top of the table (MIL, BOS, OKC) show the highest average three-point attempt rates, indicating a strong reliance on the perimeter shot, while teams at the bottom (WAS, CHO, IND) rely less on the three-point shot, indicating a more balanced, interior-focused offense. Overall, the results show that high three-point volume does not guarantee efficiency: MIL has the highest attempt rate but one of the lower average FG% values (43.1%). This is consistent with how modern basketball is played: teams that balance three-point shooting with efficiency may gain a competitive advantage, but there are diminishing returns, and shooting too many threes can weaken an offense (usually depending on personnel). A follow-up question is what three-point attempt rate maximizes a team’s overall FG%.
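Because the team table averages player-level rates with equal weight, a low-minute player counts as much as a starter. A minutes-weighted average is one way to check how sensitive the ranking is to that choice. The sketch below uses invented rows and hypothetical team codes, and assumes `mp` is total minutes played.

```r
library(tidyverse)

# Hypothetical toy rows; team codes and values are made up.
toy <- tibble(
  team   = c("AAA", "AAA", "BBB", "BBB"),
  mp     = c(3000, 1000, 2000, 2000),  # assumed: total minutes played
  fga_3p = c(0.50, 0.10, 0.30, 0.40)
)

team_weighted <- toy |>
  group_by(team) |>
  summarise(
    avg_fga_3p      = mean(fga_3p),                   # equal weight per player
    weighted_fga_3p = weighted.mean(fga_3p, w = mp),  # weight by minutes
    .groups = "drop"
  )

team_weighted
```

On the real data, rows with missing `fga_3p` or `mp` would need to be dropped first, since `weighted.mean()` returns `NA` when any weight is missing.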
This boxplot shows the distribution of FG% by position to see whether certain positions shoot the ball more efficiently.
df |>
  ggplot(aes(x = pos, y = fg_percent)) +
  geom_boxplot() +
  scale_y_continuous(labels = percent_format()) +
  labs(
    title = "Field Goal Percentage by Position",
    x = "Position",
    y = "FG%"
  ) +
  theme_minimal()
## Warning: Removed 36 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
The boxplot shows a positional difference in FG%: centers have the highest median FG%, making them the most efficient group, while guards (PG and SG) have a lower median FG% and greater variability. This reflects each position’s offensive role. Frontcourt players (SF, PF, C) tend to convert a higher percentage of their shots than guards, which reinforces the idea that FG% should be analyzed within a positional context. A further question to investigate is whether these positional differences in FG% persist across multiple seasons.
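To see whether the positional gap holds season by season, the median FG% can be computed per season and position and then spread into one column per position. This is a sketch on made-up rows; the season labels shown are an assumed format for the `season` column.

```r
library(tidyverse)

# Hypothetical toy rows; season labels and values are made up.
toy <- tibble(
  season     = rep(c("2018-19", "2024-25"), each = 4),
  pos        = rep(c("C", "C", "PG", "PG"), times = 2),
  fg_percent = c(0.55, 0.59, 0.42, 0.44, 0.60, 0.62, 0.43, 0.45)
)

by_season <- toy |>
  group_by(season, pos) |>
  summarise(median_fg = median(fg_percent, na.rm = TRUE), .groups = "drop") |>
  # One row per season, one column per position, for easy comparison.
  pivot_wider(names_from = pos, values_from = median_fg)

by_season
```

On the real data, the same pipeline applied to `df` would give one row per season with a median for each of the five positions.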
This scatterplot compares average shot distance and FG% for point guards and shooting guards, the two positions best known for three-point shooting, with shooting guards in particular regarded as three-point specialists.
df |>
  filter(pos %in% c("PG", "SG")) |>
  ggplot(aes(x = dist, y = fg_percent, color = pos)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  scale_y_continuous(labels = percent_format()) +
  labs(
    title = "FG% vs Shot Distance: Point Guards vs Shooting Guards",
    x = "Average Shot Distance (ft)",
    y = "FG%"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 12 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_point()`).
The scatterplot shows a negative relationship between average shot distance and FG% for both point guards and shooting guards: as shot distance increases, FG% generally decreases. This reinforces the fundamental basketball concept that shot distance strongly influences shooting efficiency. The similarity in trend suggests that the position label alone does not drive shooting efficiency and that shot selection and distance matter more than being a PG or SG. This matters because evaluating guard efficiency without accounting for shot distance can lead to misleading conclusions. A further question to investigate is whether this relationship changes when weighting players by minutes played or shot attempts.
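The weighting question could be approached by fitting the same linear trend with and without minutes as weights and comparing the distance slopes. The sketch below is illustrative: the rows are invented, `mp` is assumed to be total minutes played, and the additive model `fg_percent ~ dist + pos` is a simplification of the per-position lines drawn in the plot.

```r
library(tidyverse)

# Hypothetical toy rows; values are made up.
toy <- tibble(
  pos        = c("PG", "PG", "PG", "SG", "SG", "SG"),
  dist       = c(10, 14, 18, 11, 15, 19),
  fg_percent = c(0.51, 0.45, 0.43, 0.48, 0.46, 0.40),
  mp         = c(2500, 1200, 300, 2200, 900, 150)  # assumed: total minutes
)

# Same model twice: every player equal, then weighted by minutes played.
fit_unweighted <- lm(fg_percent ~ dist + pos, data = toy)
fit_weighted   <- lm(fg_percent ~ dist + pos, data = toy, weights = mp)

slopes <- c(
  unweighted = coef(fit_unweighted)[["dist"]],
  weighted   = coef(fit_weighted)[["dist"]]
)

slopes
```

In the plot itself, a comparable effect should be achievable by mapping `weight = mp` inside `aes()` for `geom_smooth(method = "lm")`.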