This is my week 5 data dive, where I clarify column names that may be hard to understand for anyone who has not read the data dictionary that accompanies the basketball-reference.com shooting statistics dataset. I also check for explicit, implicit, and empty data groups. Fortunately, not much data is missing: the checks below found no null values in the team or position columns, and the only data left out is the pandemic-affected data, which was excluded on purpose because of its unusual effect on the statistics.
# loading the appropriate packages, reading the CSV file, and cleaning column
# names to follow R format.
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.5.2
## Warning: package 'dplyr' was built under R version 4.5.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(janitor)
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(scales)
## Warning: package 'scales' was built under R version 4.5.2
##
## Attaching package: 'scales'
##
## The following object is masked from 'package:purrr':
##
## discard
##
## The following object is masked from 'package:readr':
##
## col_factor
library(ggplot2)
df <-
  read_csv("C:/Users/guyon/OneDrive/Desktop/NBA_Shooting_Stats.csv") |>
  clean_names() |>
  filter(!team %in% c("2TM", "3TM", "4TM")) # remove combined-team rows
## Rows: 3669 Columns: 29
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): Player, Team, Pos, Season
## dbl (25): Rk, Age, G, GS, MP, FG%, Dist., FGA_2P, FGA_0-3, FGA_3-10, FGA_10-...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Two columns that are unclear until you read the data documentation provided by basketball-reference.com are FG_2P_Ast and FG_3P_Ast. They give the percentage of a player's made 2-point and 3-point field goals that were assisted, meaning a teammate passed the player the ball and the player then scored. These columns are part of the shooting breakdown so that anyone using the data can see how much of a player's scoring comes off passes from teammates, which indicates a high level of ball movement, and how much the player creates on their own. Without the documentation these columns are easy to misread, and that would lead to misinterpreting shooting breakdowns that matter when presenting to coaches and scouts.
One column that remains unclear is “3P%_cor_3”, which is a player’s shooting percentage on corner three-point shots. The data documentation describes where on the floor each three-point shot is taken, but it does not say exactly where “the corner” ends, since the three-point line is an arc.
# Corner 3s vs Overall 3PT%
df |>
  mutate(
    low_corner_volume = percent_3pa_cor_3 < 0.05 # 5% threshold
  ) |>
  ggplot(aes(x = fg_3p,
             y = x3p_percent_cor_3,
             color = low_corner_volume)) +
  geom_point(alpha = 0.7, size = 2) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  scale_color_manual(
    values = c("FALSE" = "gray50", "TRUE" = "red"),
    labels = c("FALSE" = "Normal Volume",
               "TRUE" = "Low Corner Volume (<5%)"),
    name = "Corner 3PA Volume"
  ) +
  labs(
    title = "Corner 3P% vs Overall 3P%",
    x = "Overall 3P%",
    y = "Corner 3P%"
  ) +
  theme_minimal()
## Warning: Removed 447 rows containing missing values or values outside the scale range
## (`geom_point()`).
There is a strong positive relationship between overall 3P% and corner
3P%, with most players close to the dashed y = x line, where the two
percentages are equal. The points in red are players who take less than
5% of their threes from the corner, and extreme corner 3P% values are
disproportionately found among them. This reinforces the need to remove
these low-volume players from the dataset before drawing any league-wide
shooting conclusions. One question to look into further is how corner
three volume correlates with offensive role (spot-up shooters vs ball
handlers vs bigs).
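As a rough sketch of that filtering step, the snippet below drops players with no corner 3P% or with less than 5% of their threes from the corner, then measures the relationship on the remaining rows. The column names match the plot above, but the `toy` tibble and its values are made up for illustration, and the 5% cutoff is the same judgment call used for the red highlight rather than an established standard.

```r
library(dplyr)
library(tibble)

# Made-up rows standing in for df; the values are illustrative only.
toy <- tibble(
  player            = c("A", "B", "C", "D", "E"),
  fg_3p             = c(0.36, 0.38, 0.33, 0.40, 0.35),
  x3p_percent_cor_3 = c(0.39, 0.41, 1.00, NA, 0.37),
  percent_3pa_cor_3 = c(0.25, 0.30, 0.02, 0.00, 0.20)
)

# Keep players who have a corner 3P% and take at least 5% of
# their three-point attempts from the corner.
regular_corner <- toy |>
  filter(!is.na(x3p_percent_cor_3), percent_3pa_cor_3 >= 0.05)

# Correlation between overall and corner 3P% on the filtered rows.
cor_filtered <- cor(regular_corner$fg_3p, regular_corner$x3p_percent_cor_3)
cor_filtered
```

Running the same `filter()` on `df` before plotting would show whether the extreme points disappear once low-volume players are excluded.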
df |>
  ggplot(aes(x = percent_3pa_cor_3, y = x3p_percent_cor_3)) +
  geom_point(alpha = 0.6) +
  labs(
    title = "Corner 3P% vs % of 3PA from Corner",
    x = "Proportion of 3PA from Corner",
    y = "Corner 3P%"
  ) +
  theme_minimal()
## Warning: Removed 447 rows containing missing values or values outside the scale range
## (`geom_point()`).
As the proportion of corner 3PA increases, corner 3P% values concentrate
around 30-45%, roughly league-average shooting efficiency, which
suggests that higher-volume corner shooters perform more stably and
predictably. This plot also reinforces the need to remove low-volume
shooters before making league-wide conclusions, since they can skew the
analysis. One question to analyze further is how well high-volume
players maintain their efficiency across many seasons.
I don’t see any significant risk in misunderstanding this variable. While it would be interesting to know exactly how the different areas of the three-point line are defined, the data comes from basketball-reference.com, a data source used by the NBA, which standardizes the shot locations, and I trust them to make clear distinctions for corner three-pointers.
df |>
  summarise(
    missing_team = sum(is.na(team)),
    missing_pos = sum(is.na(pos))
  )
## # A tibble: 1 × 2
## missing_team missing_pos
## <int> <int>
## 1 0 0
Explicit: The categorical variables “team” and “pos” contain no null values, so the data is structurally complete at the player level.
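The check above covers only the two categorical columns. The "Removed 447 rows containing missing values" warnings on the plots suggest that some numeric columns, such as corner 3P%, may still contain NA values, so a column-by-column count is a useful companion. The sketch below runs that count on a made-up `toy` tibble; only the column names come from the dataset.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Made-up rows standing in for df; the values are illustrative only.
toy <- tibble(
  team              = c("ATL", "BOS", "ATL"),
  pos               = c("C", "PG", "SF"),
  x3p_percent_cor_3 = c(0.40, NA, NA)
)

# Count NA values in every column, then list the worst columns first.
na_counts <- toy |>
  summarise(across(everything(), ~ sum(is.na(.x)))) |>
  pivot_longer(everything(), names_to = "column", values_to = "n_missing") |>
  arrange(desc(n_missing))

na_counts
```

Replacing `toy` with `df` would show which columns, if any, account for the rows dropped from the plots.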
df_combo <-
  df |>
  filter(!is.na(team), !is.na(pos))
# All possible combinations (full grid)
team_pos_grid <-
  expand_grid(
    team = sort(unique(df_combo$team)),
    pos = sort(unique(df_combo$pos))
  )
team_pos_grid
## # A tibble: 150 × 2
## team pos
## <chr> <chr>
## 1 ATL C
## 2 ATL PF
## 3 ATL PG
## 4 ATL SF
## 5 ATL SG
## 6 BOS C
## 7 BOS PF
## 8 BOS PG
## 9 BOS SF
## 10 BOS SG
## # ℹ 140 more rows
Implicit: No team x pos combination is missing from the data, meaning every team has at least one player listed at each position.
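The full grid lists every possible combination, but the comparison against the combinations that actually occur is a separate step. One way to sketch it is an `anti_join()` of the grid against the distinct observed pairs. The `toy` tibble below is made up and deliberately leaves out one combination so the result is not empty.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Made-up rows standing in for df_combo; BOS has no PG on purpose.
toy <- tibble(
  team = c("ATL", "ATL", "BOS", "BOS"),
  pos  = c("C", "PG", "C", "C")
)

# Every possible team x pos pair.
grid <- expand_grid(
  team = sort(unique(toy$team)),
  pos  = sort(unique(toy$pos))
)

# Pairs that actually occur in the data.
observed <- distinct(toy, team, pos)

# Pairs in the grid with no matching row in the data.
missing_combos <- anti_join(grid, observed, by = c("team", "pos"))
missing_combos
```

Applied to `team_pos_grid` and `df_combo`, a result with zero rows would confirm that no combination is implicitly missing. Note that this pools all seasons together, so a team could still lack a position within a single season.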
Empty Groups: All five positions and all 30 NBA teams appear in the dataset, so there are no empty groups, which matches the checks from previous assignments.
FG% is a continuous column in the shooting dataset. I would define outliers as players at the extreme ends of FG%, at or near 0 or 1, on a low number of shot attempts. These cases usually come from players who get minimal playing time and take perhaps one shot in that time: if they make it, they shoot 100% from the field, which is too good to be true with a sample size of one.
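A minimal sketch of that outlier rule follows. It assumes the cleaned column names are `fg_percent` and `mp` (from the FG% and MP columns shown when the CSV was read), uses minutes played as a stand-in for shot volume, and picks a hypothetical cutoff of 100 minutes. All of these are assumptions to adjust against the real data, and the `toy` values are made up.

```r
library(dplyr)
library(tibble)

# Made-up rows standing in for df; the values are illustrative only.
toy <- tibble(
  player     = c("A", "B", "C", "D"),
  mp         = c(2100, 12, 1500, 8),
  fg_percent = c(0.47, 1.00, 0.52, 0.00)
)

min_minutes <- 100 # hypothetical cutoff for "minimal playing time"

# Flag players with an extreme FG% (0 or 1) on very little playing time.
flagged <- toy |>
  mutate(
    low_volume_extreme = mp < min_minutes &
      (fg_percent <= 0 | fg_percent >= 1)
  )

outliers <- filter(flagged, low_volume_extreme)
outliers
```

If the dataset includes a direct count of field goal attempts, filtering on that count would match the definition above more closely than minutes played.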