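library(tidyverse)  # provides dplyr (filter, mutate, count), stringr (str_extract), and ggplot2 used below
shot_logs <- read.csv("C:/Users/13177/OneDrive/Stats for Data Science/filtered_shot_logs.csv")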
head(shot_logs)
## GAME_ID MATCHUP LOCATION W.L FINAL_MARGIN SHOT_NUMBER
## 1 21400899 MAR 04, 2015 - CHA @ BKN A W 24 1
## 2 21400899 MAR 04, 2015 - CHA @ BKN A W 24 2
## 3 21400899 MAR 04, 2015 - CHA @ BKN A W 24 3
## 4 21400899 MAR 04, 2015 - CHA @ BKN A W 24 4
## 5 21400899 MAR 04, 2015 - CHA @ BKN A W 24 5
## 6 21400899 MAR 04, 2015 - CHA @ BKN A W 24 6
## PERIOD GAME_CLOCK SHOT_CLOCK DRIBBLES TOUCH_TIME SHOT_DIST PTS_TYPE
## 1 1 1:09 10.8 2 1.9 7.7 2
## 2 1 0:14 3.4 0 0.8 28.2 3
## 3 1 0:00 NA 3 2.7 10.1 2
## 4 2 11:47 10.3 2 1.9 17.2 2
## 5 2 10:34 10.9 2 2.7 3.7 2
## 6 2 8:15 9.1 2 4.4 18.4 2
## SHOT_RESULT CLOSEST_DEFENDER CLOSEST_DEFENDER_PLAYER_ID CLOSE_DEF_DIST FGM
## 1 made Anderson, Alan 101187 1.3 1
## 2 missed Bogdanovic, Bojan 202711 6.1 0
## 3 missed Bogdanovic, Bojan 202711 0.9 0
## 4 missed Brown, Markel 203900 3.4 0
## 5 missed Young, Thaddeus 201152 1.1 0
## 6 missed Williams, Deron 101114 2.6 0
## PTS player_name player_id
## 1 2 brian roberts 203148
## 2 0 brian roberts 203148
## 3 0 brian roberts 203148
## 4 0 brian roberts 203148
## 5 0 brian roberts 203148
## 6 0 brian roberts 203148
#1# There really aren’t any columns that I don’t understand; however, some columns were not explained in the documentation. The original documentation for this dataset was poor, so I wrote my own to keep the data organized and clear.
GAME_CLOCK: This column lists the time on the game clock when each shot was taken. At first glance the 0:00 values confused me, but I was able to work out and document that these are buzzer-beater shots taken at the end of a period.
SHOT_CLOCK: This column lists the time on the shot clock when the shot was taken. The “NA” values puzzled me at first, but I concluded that they occur when fewer than 24 seconds remained on the game clock while the offense had the ball, so the shot clock was turned off until the end of the period.
FGM: This column contains binary values and was not listed in the documentation, so I was not sure what it meant. From my basketball knowledge, FGM stands for “field goal made”. The column is simply a binary version of the SHOT_RESULT column: 1 = made, 0 = missed.
#2# There are no columns that I still do not understand after reading the documentation and writing my own, so for the purpose of this data dive I will work with the three columns listed above. The documentation did not explain the “NA” values, the FGM column, or the 0:00 game clock values, but I understand them all now.
#3#
#Count instances of GAME_CLOCK being "0:00"#
buzzer_beater_count <- shot_logs |> filter(GAME_CLOCK == "0:00") |> nrow()
cat("Number of shots attempted at 0:00 on the game clock:", buzzer_beater_count, "\n")
## Number of shots attempted at 0:00 on the game clock: 480
#Convert GAME_CLOCK to seconds for visualization#
shot_logs <- shot_logs |>
  mutate(
    game_clock_seconds = as.numeric(str_extract(GAME_CLOCK, "\\d+")) * 60 +
      as.numeric(str_extract(GAME_CLOCK, "(?<=:)[0-9]+"))
  )
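As a sanity check on the conversion, here is a minimal base R sketch of the same minutes-times-60-plus-seconds arithmetic. The helper name `clock_to_seconds` is hypothetical and not part of the analysis above; it assumes every clock value has the form "M:SS".

```r
# Hypothetical helper (an assumption, not the code used above):
# split "M:SS" on the colon and compute minutes * 60 + seconds.
clock_to_seconds <- function(clock) {
  parts <- strsplit(as.character(clock), ":", fixed = TRUE)
  vapply(parts, function(p) as.numeric(p[1]) * 60 + as.numeric(p[2]), numeric(1))
}

# Values taken from the head() output above
clock_to_seconds(c("1:09", "0:14", "0:00", "11:47"))
# 69 14 0 707
```

Comparing a few hand-computed values like these against `game_clock_seconds` is a quick way to confirm the regular expressions picked up the right pieces.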
#Histogram of shots by game clock time#
ggplot(shot_logs, aes(x = game_clock_seconds)) +
  geom_histogram(bins = 30, fill = "blue", color = "black") +
  geom_vline(xintercept = 0, color = "red", linetype = "dashed", linewidth = 1) +
  labs(
    title = "Distribution of Shots by Game Clock Time",
    x = "Game Clock (Seconds Remaining)",
    y = "Shot Attempts"
  ) +
  annotate("text", x = 5, y = 300, label = "0:00", color = "red") +
  theme_minimal()
Insight:
The spike at 0:00 confirms buzzer-beater shots at the end of a period. Most shots occur earlier in the period, and fewer shots are taken near mid-period timeouts.
#Count NA values in SHOT_CLOCK#
na_shot_clock_count <- sum(is.na(shot_logs$SHOT_CLOCK))
#Check relationship between SHOT_CLOCK NA and GAME_CLOCK#
na_shot_clock_df <- shot_logs |> filter(is.na(SHOT_CLOCK)) |> select(GAME_CLOCK)
cat("Number of NA values in SHOT_CLOCK:", na_shot_clock_count, "\n")
## Number of NA values in SHOT_CLOCK: 5553
#Create GAME_CLOCK bins (grouping in 30-second intervals)#
shot_logs <- shot_logs |>
  mutate(game_clock_bin = cut(game_clock_seconds,
                              breaks = seq(0, max(game_clock_seconds, na.rm = TRUE), by = 30),
                              include.lowest = TRUE))
#Count missing SHOT_CLOCK values per GAME_CLOCK bin#
missing_shot_clock <- shot_logs |>
  mutate(shot_clock_missing = ifelse(is.na(SHOT_CLOCK), "Missing", "Available")) |>
  count(game_clock_bin, shot_clock_missing)
#Bar chart to show where SHOT_CLOCK is missing#
ggplot(missing_shot_clock, aes(x = game_clock_bin, y = n, fill = shot_clock_missing)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_manual(values = c("Missing" = "red", "Available" = "blue")) +
  labs(
    title = "Frequency of Missing SHOT_CLOCK by GAME_CLOCK Interval",
    x = "Game Clock (Seconds Remaining, Grouped in 30s)",
    y = "Number of Shots",
    fill = "SHOT_CLOCK Status"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Insight:
Red bars represent missing SHOT_CLOCK values. They increase dramatically as GAME_CLOCK decreases, which is consistent with the shot clock being turned off when GAME_CLOCK < 24s. Blue bars show where SHOT_CLOCK is available, which is the normal case when GAME_CLOCK is above 24s. The trend is very clear: instead of thousands of data points, this bar chart condenses the issue into a simple, easy-to-read visualization.
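Because the bins are 30 seconds wide, the chart cannot isolate the 24-second mark itself. A more direct check is to cross-tabulate "game clock under 24s" against "shot clock missing". The sketch below uses a small made-up data frame (`toy`, invented purely for illustration); with the real data one would swap in `shot_logs`.

```r
# Toy data invented for illustration only; these are NOT values from shot_logs.
toy <- data.frame(
  game_clock_seconds = c(5, 12, 20, 23, 40, 300, 650),
  SHOT_CLOCK         = c(NA, NA, NA, 4.1, NA, 10.3, 15.0)
)

under_24 <- toy$game_clock_seconds < 24   # TRUE when fewer than 24 s remain
missing  <- is.na(toy$SHOT_CLOCK)         # TRUE when SHOT_CLOCK is NA

# Two-way table: rows = under 24 s?, columns = SHOT_CLOCK missing?
check_tab <- table(under_24 = under_24, missing = missing)
print(check_tab)

# Share of the missing SHOT_CLOCK rows that fall under 24 s
share_under_24 <- mean(under_24[missing])
print(share_under_24)
```

If the 24-second explanation holds for the real data, this share should be close to 1; any missing values well above 24 seconds would need a different explanation.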
#Check if FGM matches SHOT_RESULT#
fgm_check <- shot_logs |>
  mutate(shot_result_binary = ifelse(SHOT_RESULT == "made", 1, 0)) |>
  summarise(match = all(FGM == shot_result_binary))
cat("Does FGM correctly encode SHOT_RESULT?:", fgm_check$match, "\n")
## Does FGM correctly encode SHOT_RESULT?: TRUE
#Bar plot to compare FGM and SHOT_RESULT#
ggplot(shot_logs, aes(x = factor(FGM), fill = SHOT_RESULT)) +
  geom_bar(position = "dodge") +
  labs(
    title = "Comparison of FGM and SHOT_RESULT",
    x = "Field Goal Made (FGM: 0 = Miss, 1 = Made)",
    y = "Shot Count",
    fill = "Shot Result"
  ) +
  theme_minimal()
Insight:
Perfect one-to-one match between FGM and SHOT_RESULT → Confirms that FGM is redundant. Using only SHOT_RESULT simplifies analysis.
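The same redundancy can also be read off a contingency table, where any disagreement would show up as a nonzero off-diagonal cell. This is a small sketch on made-up rows (`toy` is invented for illustration), not output from the real data.

```r
# Toy rows invented for illustration only.
toy <- data.frame(
  SHOT_RESULT = c("made", "missed", "missed", "made", "missed"),
  FGM         = c(1, 0, 0, 1, 0)
)

# Cross-tabulate the binary flag against the text label
xt <- table(FGM = toy$FGM, SHOT_RESULT = toy$SHOT_RESULT)
print(xt)

# Rows where the two columns disagree (FGM = 0 but "made", or FGM = 1 but "missed")
off_diagonal <- xt["0", "made"] + xt["1", "missed"]
print(off_diagonal)
```

An `off_diagonal` count of zero means the two columns carry the same information.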
#4# Checking the SHOT_DIST column for missing values, outliers, etc.
#Check for explicit missing values in SHOT_DIST#
missing_shot_dist <- sum(is.na(shot_logs$SHOT_DIST))
print(missing_shot_dist)
## [1] 0
No explicitly missing values.
Let’s check for implicitly missing rows. Approach: We check whether SHOT_DIST contains gaps that suggest missing data. For example, if every integer from 0 to 30 ft appears except a specific range (e.g., 15-16 ft), that would suggest implicit missing values.
#Check for missing shot distance values in a reasonable range (e.g., 0-35 ft)
all_shot_dist <- seq(0, 35, by = 1)
observed_shot_dist <- unique(na.omit(shot_logs$SHOT_DIST))
#Find missing expected distances#
implicit_missing <- setdiff(all_shot_dist, observed_shot_dist)
print(implicit_missing)
## numeric(0)
We can see that every whole-foot distance from 0 to 35 ft appears in the data, so there are no implicitly missing values.
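Since SHOT_DIST is recorded to one decimal place, a variation on this check is to round each distance down to the whole foot first, so that a shot from 15.4 ft counts toward the 15 ft slot. The sketch below shows the idea on made-up distances (`toy_dist` is invented for illustration).

```r
# Made-up distances for illustration; nothing falls in [4, 5) or [6, 7).
toy_dist <- c(0.4, 1.9, 2.2, 3.7, 5.1, 7.7)

expected_feet <- 0:7                        # whole-foot slots we expect to see
observed_feet <- unique(floor(toy_dist))    # slot each shot falls into

# Slots with no shots at all
gaps <- setdiff(expected_feet, observed_feet)
print(gaps)
# 4 6
```

On the real column, an empty result from this version would say that every one-foot interval contains at least one shot, which is a slightly broader statement than matching exact integer values.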
#Create bins for SHOT_DIST#
shot_logs <- shot_logs |>
  mutate(shot_dist_bin = cut(SHOT_DIST, breaks = seq(0, 35, by = 5), include.lowest = TRUE))
#Count shots per bin#
shot_dist_counts <- shot_logs |> count(shot_dist_bin)
#Check for empty bins#
print(shot_dist_counts)
## shot_dist_bin n
## 1 [0,5] 34337
## 2 (5,10] 21954
## 3 (10,15] 10662
## 4 (15,20] 18434
## 5 (20,25] 32444
## 6 (25,30] 9357
## 7 (30,35] 318
## 8 <NA> 247
Insight: If some bins had zero counts, it would mean no shots were recorded in those ranges, which could indicate player preferences or data gaps. For example, if no shots were taken between 20-25 ft, it might suggest players prefer 3-point shots (beyond 23.75 ft) to deep 2-pointers. Here every bin is populated, but 247 shots are labeled NA instead of being placed in a bin, presumably because they are beyond 35 ft. Let’s confirm this.
#Check shots that were labeled as NA in the shot distance bin#
shots_beyond_35 <- shot_logs |> filter(is.na(shot_dist_bin))
#Display summary statistics for these shots#
summary(shots_beyond_35$SHOT_DIST)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 35.10 37.05 39.50 39.63 41.80 47.20
#Count how many of them are truly beyond 35 feet#
num_shots_over_35 <- sum(shots_beyond_35$SHOT_DIST > 35, na.rm = TRUE)
print(num_shots_over_35)
## [1] 247
So we confirmed here that all 247 instances are over 35ft.
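One way to keep these long shots out of the NA group is to add an open-ended final break, so anything past 35 ft gets its own bin. This is a sketch on made-up distances (`toy_dist` is invented for illustration), not a change to the binning used above.

```r
# Made-up distances for illustration, including two beyond 35 ft.
toy_dist <- c(2.5, 24.0, 35.0, 35.1, 47.2)

# Same 5 ft breaks as before, plus Inf as a final open-ended break
bins <- cut(toy_dist, breaks = c(seq(0, 35, by = 5), Inf), include.lowest = TRUE)

# No NA group appears; the last level collects everything above 35 ft
print(table(bins, useNA = "ifany"))
```

With the extra break, the shot from exactly 35.0 ft still lands in the 30-35 ft bin, and only distances strictly above 35 ft go to the final bin.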
Let’s define outliers for the SHOT_DIST column.
#Compute IQR for SHOT_DIST#
iqr_shot_dist <- IQR(shot_logs$SHOT_DIST, na.rm = TRUE)
q1 <- quantile(shot_logs$SHOT_DIST, 0.25, na.rm = TRUE)
q3 <- quantile(shot_logs$SHOT_DIST, 0.75, na.rm = TRUE)
#Define outlier boundaries#
lower_bound <- q1 - (1.5 * iqr_shot_dist)
upper_bound <- q3 + (1.5 * iqr_shot_dist)
#Identify outliers#
outliers <- shot_logs |> filter(SHOT_DIST < lower_bound | SHOT_DIST > upper_bound)
print(nrow(outliers))
## [1] 0
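It appears that there are no outliers in the shot distance column by the 1.5 × IQR rule, which honestly surprises me. I expected a few because of full-court shots. I would consider any shot from beyond the half-court line an outlier. The line is 47 ft from the baseline, so I use 47 ft as the cutoff (if SHOT_DIST is measured from the basket rather than the baseline, the line would sit a few feet closer than that). Let’s check if there are any shots above 47 ft.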
sum(shot_logs$SHOT_DIST > 47.0)
## [1] 1
It looks like there is only 1 shot above 47 ft, and from the earlier summary we can see that its distance is 47.2 ft, barely making it an outlier.
I defined an outlier as > 47.0 ft because NBA players are capable of making very long 3-pointers at a high rate, so long shots on their own are not unusual. It is more of a personal choice of cutoff than one backed by the numbers.
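To see how the IQR rule can flag nothing even when a half-court shot is present, it helps to print the fences themselves. The sketch below uses a hypothetical helper, `iqr_fences`, on made-up distances (`toy_dist`) chosen to be spread out between close shots and 3-pointers; it is an illustration, not the real quartiles.

```r
# Hypothetical helper (an assumption, not part of the analysis above):
# returns the lower and upper 1.5 * IQR fences for a numeric vector.
iqr_fences <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  spread <- q[2] - q[1]
  c(lower = q[1] - 1.5 * spread, upper = q[2] + 1.5 * spread)
}

# Made-up distances: a cluster near the rim, a cluster near the arc, one long heave.
toy_dist <- c(2, 4, 6, 20, 22, 24, 26, 47)

fences <- iqr_fences(toy_dist)
print(fences)
# lower = -23, upper = 53

# Number of shots beyond the upper fence
sum(toy_dist > fences["upper"])
# 0
```

When shots are split between the rim and the 3-point line, the IQR is wide, so the upper fence can land beyond any realistic shot distance. Printing `lower_bound` and `upper_bound` for the real data would show whether that is what happened here.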