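I will load the dataset and preview the first few rows using head().
# tidyverse supplies dplyr, ggplot2, and stringr, which the code below relies on
library(tidyverse)
shot_logs <- read.csv("C:/Users/13177/OneDrive/Stats for Data Science/filtered_shot_logs.csv")
head(shot_logs)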
##    GAME_ID                  MATCHUP LOCATION W.L FINAL_MARGIN SHOT_NUMBER
## 1 21400899 MAR 04, 2015 - CHA @ BKN        A   W           24           1
## 2 21400899 MAR 04, 2015 - CHA @ BKN        A   W           24           2
## 3 21400899 MAR 04, 2015 - CHA @ BKN        A   W           24           3
## 4 21400899 MAR 04, 2015 - CHA @ BKN        A   W           24           4
## 5 21400899 MAR 04, 2015 - CHA @ BKN        A   W           24           5
## 6 21400899 MAR 04, 2015 - CHA @ BKN        A   W           24           6
##   PERIOD GAME_CLOCK SHOT_CLOCK DRIBBLES TOUCH_TIME SHOT_DIST PTS_TYPE
## 1      1       1:09       10.8        2        1.9       7.7        2
## 2      1       0:14        3.4        0        0.8      28.2        3
## 3      1       0:00         NA        3        2.7      10.1        2
## 4      2      11:47       10.3        2        1.9      17.2        2
## 5      2      10:34       10.9        2        2.7       3.7        2
## 6      2       8:15        9.1        2        4.4      18.4        2
##   SHOT_RESULT  CLOSEST_DEFENDER CLOSEST_DEFENDER_PLAYER_ID CLOSE_DEF_DIST FGM
## 1        made    Anderson, Alan                     101187            1.3   1
## 2      missed Bogdanovic, Bojan                     202711            6.1   0
## 3      missed Bogdanovic, Bojan                     202711            0.9   0
## 4      missed     Brown, Markel                     203900            3.4   0
## 5      missed   Young, Thaddeus                     201152            1.1   0
## 6      missed   Williams, Deron                     101114            2.6   0
##   PTS   player_name player_id
## 1   2 brian roberts    203148
## 2   0 brian roberts    203148
## 3   0 brian roberts    203148
## 4   0 brian roberts    203148
## 5   0 brian roberts    203148
## 6   0 brian roberts    203148
#Numeric Summaries
Now I will compute numeric summaries for two columns, SHOT_DIST and CLOSE_DEF_DIST. Both are measured in feet.
#Numeric summaries for SHOT_DIST and CLOSE_DEF_DIST#
summary(shot_logs$SHOT_DIST)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    0.00    4.70   13.70   13.58   22.50   47.20
summary(shot_logs$CLOSE_DEF_DIST)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   0.000   2.300   3.700   4.124   5.300  53.200
#Quantiles for SHOT_DIST and CLOSE_DEF_DIST#
quantile(shot_logs$SHOT_DIST, probs = c(0.25, 0.5, 0.75))
##  25%  50%  75%
##  4.7 13.7 22.5
quantile(shot_logs$CLOSE_DEF_DIST, probs = c(0.25, 0.5, 0.75))
##  25%  50%  75%
##  2.3  3.7  5.3
The quartiles and means tell us a few things about shot distance and closest-defender distance. The third quartile of shot distance is 22.5 ft, so the longest quarter of shots sits at or beyond roughly the corner three-point distance (22 ft), and most of those are likely 3-point attempts. On average the closest defender is about 4.1 ft from the shooter. That seems like a lot, but the average height in the NBA is around 6 ft 7 in, so once you account for a defender's wingspan when reaching to contest the shot, it's actually pretty tight defense.
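One way to check the claim about the longest quarter of shots is to compute the share of 3-point attempts among shots beyond the third quartile. The sketch below is only an illustration: `toy` is a small made-up data frame, not the real shot log, and only the column names `SHOT_DIST` and `PTS_TYPE` come from the dataset.

```r
# Made-up example rows; only the column names come from the shot log.
toy <- data.frame(
  SHOT_DIST = c(2.1, 4.7, 8.0, 13.7, 18.4, 22.5, 23.9, 24.6, 25.3, 26.0, 27.1, 3.3),
  PTS_TYPE  = c(2,   2,   2,   2,    2,    3,    3,    3,    3,    3,    3,    2)
)

# Third quartile of shot distance.
q3 <- quantile(toy$SHOT_DIST, probs = 0.75)

# Keep only the shots strictly beyond the third quartile.
long_shots <- toy[toy$SHOT_DIST > q3, ]

# Share of those shots that were 3-point attempts.
share_threes <- mean(long_shots$PTS_TYPE == 3)
print(share_threes)
```

Swapping `toy` for `shot_logs` would give the actual share for this dataset.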
#Categorical Summaries
Now I’ll do the categorical summaries.
#Frequency counts for LOCATION and SHOT_RESULT#
table(shot_logs$LOCATION)
##
##     A     H
## 63968 63785
table(shot_logs$SHOT_RESULT)
##
##   made missed
##  57804  69949
This is a very basic summary. Away teams attempted slightly more shots than home teams (63,968 vs. 63,785, a difference of 183), which is kind of interesting. We can also see that there were more missed shots than made ones (69,949 vs. 57,804). Very surface level, but still interesting.
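Raw counts are easier to compare as percentages. As a quick sketch, the counts from the two table() outputs can be turned into shares with prop.table(); the vector names `location_counts` and `result_counts` are just labels chosen for this example.

```r
# Counts copied from the table() output in this report.
location_counts <- c(A = 63968, H = 63785)
result_counts   <- c(made = 57804, missed = 69949)

# prop.table() divides each count by the total of its vector.
location_share <- prop.table(location_counts)
result_share   <- prop.table(result_counts)

# Express the shares as percentages rounded to one decimal place.
print(round(100 * location_share, 1))
print(round(100 * result_share, 1))
```

On the full data, `prop.table(table(shot_logs$SHOT_RESULT))` should give the same shares directly.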
#Exploratory Questions
1. Does shot distance affect the shot result? (Most likely yes, but I want to look into it.)
2. Is there a relationship between closest-defender distance and shot distance?
3. How does the number of dribbles affect the shot result?
#Aggregation
Looking into shot distance and shot result using an aggregation function.
#Aggregating average SHOT_DIST by SHOT_RESULT#
aggregated_data <- shot_logs |>
  group_by(SHOT_RESULT) |>
  summarize(mean_shot_dist = mean(SHOT_DIST, na.rm = TRUE))
print(aggregated_data)
## # A tibble: 2 × 2
##   SHOT_RESULT mean_shot_dist
##   <chr>                <dbl>
## 1 made                  11.7
## 2 missed                15.1
The average made shot is taken about 3.4 ft closer to the basket than the average missed shot (11.7 ft vs. 15.1 ft). This suggests that distance has an effect on the result of the shot, although a difference in means alone does not prove it. For reference, the 3-point line in the NBA is 23 ft 9 in from the basket, but 22 ft in the corners.
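A difference in means hides how the make rate changes along the court, so another way to look at the first question is to bin shots by distance and compute the make rate in each bin. This is a minimal sketch on made-up rows (`toy`), with break points I picked arbitrarily for illustration; only the column names `SHOT_DIST` and `FGM` come from the shot log.

```r
# Made-up shots for illustration; FGM is 1 for a make and 0 for a miss.
toy <- data.frame(
  SHOT_DIST = c(1.5, 3.0, 4.2, 6.8, 9.5, 12.0, 15.5, 17.0, 18.0, 20.5, 23.0, 24.5, 25.0, 26.2),
  FGM       = c(1,   1,   0,   1,   1,   0,    1,    0,    1,    0,    0,    0,    1,    0)
)

# right = FALSE makes each interval closed on the left, e.g. [0,8),
# so a shot from exactly 0 ft still lands in the first bin.
toy$dist_band <- cut(toy$SHOT_DIST, breaks = c(0, 8, 16, 22, Inf), right = FALSE)

# The mean of a 0/1 flag within each band is that band's make rate.
make_rate <- tapply(toy$FGM, toy$dist_band, mean)
print(round(make_rate, 2))
```

The same two lines applied to `shot_logs` would show whether the make rate actually falls with distance in the real data.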
#Visual Summaries
Distribution of Shot Distance:
#Histogram of SHOT_DIST#
ggplot(shot_logs, aes(x = SHOT_DIST)) +
  geom_histogram(bins = 30, fill = "blue", color = "black") +
  labs(title = "Distribution of Shot Distance in Feet", x = "Shot Distance", y = "Frequency")
The histogram is U-shaped, with two peaks rather than one. We see a high
frequency of close-range shots, a drop-off at roughly 7-13 feet, and then
the frequency starts to increase again from about 15 feet on. Keeping in
mind that the 3-point line is 22 feet from the basket in the corners, we
see a steep increase from the 20-21 foot range to the 22 foot corner-3
range, and then a massive increase in frequency at the 24-25 foot range.
Let's take that histogram a step further.
#Histogram of SHOT_DIST with color indicating made or missed shots#
ggplot(shot_logs, aes(x = SHOT_DIST, fill = SHOT_RESULT)) +
  geom_histogram(position = "identity", bins = 30, alpha = 0.7, color = "black") +
  scale_fill_manual(values = c("missed" = "red", "made" = "blue")) +
  labs(
    title = "Distribution of Shot Distance with Made and Missed Shots",
    x = "Shot Distance (ft)",
    y = "Frequency",
    fill = "Shot Result"
  ) +
  theme_minimal()
Because the two histograms are overlaid, blue shows above the red wherever
made shots outnumber missed shots in a bin; where no blue is visible,
misses outnumber makes. Shots from 23 feet and beyond look to have the
worst make percentage.
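Relationship Between Shot Distance and Closest Defender Distance: Because there are too many shots to plot all at once, I am going to filter to games involving the Pacers. Note that matching "IND" anywhere in MATCHUP keeps every row from a game that involves Indiana, which may include opponents' shots as well as the Pacers' own.
#Filter data for games involving the Pacers#
pacers_shots <- shot_logs |>
  filter(str_detect(MATCHUP, "IND"))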
#Scatterplot of SHOT_DIST vs CLOSE_DEF_DIST for Pacers#
ggplot(pacers_shots, aes(x = SHOT_DIST, y = CLOSE_DEF_DIST, color = SHOT_RESULT)) +
  geom_point(alpha = 0.5) +
  labs(
    title = "Shot Distance vs Close Defender Distance (Pacers)",
    x = "Shot Distance (ft)",
    y = "Close Defender Distance (ft)"
  ) +
  theme_minimal()
Although the visual is honestly still too crowded, we can make out a
denser cluster of “made” values in the lower left corner of the graph. As
the shot distance increases, there is a slight increase in the distance
to the closest defender, as well as an increase in “missed” values. I
think if we filtered it further to look at a specific game, we could get
better insights.
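To put a number on the "slight increase" seen in the scatterplot, a correlation coefficient is one option. The sketch below uses made-up values in a data frame called `toy`, so the printed numbers say nothing about the real data; it only shows the calls.

```r
# Made-up values for illustration; not taken from the shot log.
toy <- data.frame(
  SHOT_DIST      = c(2.0, 4.5, 7.7, 10.1, 17.2, 18.4, 23.5, 25.0),
  CLOSE_DEF_DIST = c(1.3, 2.6, 1.1, 3.4,  2.9,  4.4,  6.1,  5.2)
)

# Pearson measures linear association; Spearman works on ranks, which makes
# it less sensitive to a few very large defender distances.
r_pearson  <- cor(toy$SHOT_DIST, toy$CLOSE_DEF_DIST, use = "complete.obs")
r_spearman <- cor(toy$SHOT_DIST, toy$CLOSE_DEF_DIST,
                  method = "spearman", use = "complete.obs")

print(c(pearson = r_pearson, spearman = r_spearman))
```

Spearman may be the safer choice here, since the summary of CLOSE_DEF_DIST shows a maximum of 53.2 ft against a median of 3.7 ft.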
Let's filter to a single game involving the Pacers: Feb 08, 2015 - CHA vs. IND
#Filter data for the specific Pacers game#
specific_pacers_game <- shot_logs |>
  filter(MATCHUP == "FEB 08, 2015 - CHA vs. IND")
#Scatterplot of SHOT_DIST vs CLOSE_DEF_DIST for the specific Pacers game#
ggplot(specific_pacers_game, aes(x = SHOT_DIST, y = CLOSE_DEF_DIST, color = SHOT_RESULT)) +
  geom_point(alpha = 0.7) +
  labs(
    title = "Shot Distance vs Close Defender Distance (CHA vs. IND - Feb 8, 2015)",
    x = "Shot Distance (ft)",
    y = "Close Defender Distance (ft)"
  ) +
  theme_minimal()
Now that is much better. We can see a high density of missed shots in
the 7-9 ft range where the defender is about 2-3 ft away. Assuming these
rows are the Pacers' own attempts, one insight would be that the Pacers
should practice mid-range shots with a defender close by. Another is that
the shooters did well this game on shots where the defender was about
5 ft or more away. A further strength this game was contested close-range
shots, indicated by the higher frequency of “made” values in the bottom
left of the graph. This is a single game, so these patterns are
suggestive only.
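It is worth checking whose shots a MATCHUP filter actually selects. In the preview rows, the shooter's MATCHUP is "CHA @ BKN" with LOCATION A, which suggests the first-listed team is the shooting team; that is my assumption about the data, not something documented here. Under that assumption, the sketch below shows on three hypothetical strings how matching "IND" anywhere differs from matching it in the first-team position.

```r
# Hypothetical MATCHUP strings in the same format as the preview rows.
matchup <- c(
  "FEB 08, 2015 - CHA vs. IND",  # assumed: Charlotte is the shooting team
  "FEB 08, 2015 - IND @ CHA",    # assumed: Indiana is the shooting team
  "MAR 04, 2015 - CHA @ BKN"     # Indiana not involved
)

# Matching "IND" anywhere keeps both sides of any game involving Indiana.
any_ind <- grepl("IND", matchup, fixed = TRUE)

# Matching "- IND " keeps only rows where IND is the first-listed team.
ind_first <- grepl("- IND ", matchup, fixed = TRUE)

print(data.frame(matchup, any_ind, ind_first))
```

If the assumption holds, the "CHA vs. IND" rows would be Charlotte's attempts, and the Pacers' shots from the same game would sit under a string like "IND @ CHA".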
#Dribbles and Shot Results
Summarize Dribbles by Shot Result:
#Summarize average dribbles by shot result#
dribbles_summary <- shot_logs |>
  group_by(SHOT_RESULT) |>
  summarize(
    mean_dribbles = mean(DRIBBLES, na.rm = TRUE),
    median_dribbles = median(DRIBBLES, na.rm = TRUE),
    shot_count = n()
  )
print(dribbles_summary)
## # A tibble: 2 × 4
##   SHOT_RESULT mean_dribbles median_dribbles shot_count
##   <chr>               <dbl>           <dbl>      <int>
## 1 made                 1.89               0      57804
## 2 missed               2.14               1      69949
There are about 12,000 more missed shots than made ones, and missed shots come after slightly more dribbles on average (2.14 vs. 1.89), but the gap is small. The difference in group sizes does not by itself shift either mean, since each mean is computed within its own group. Just out of curiosity, let's check the distribution of dribbles.
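To see where group size does and does not matter, here is a small sketch using the numbers from the summary table. The group means stay as they are regardless of how many shots each group has; the counts only come into play when the two groups are pooled into one overall mean. The variable names are just labels for this example.

```r
# Group means and counts copied from the dribbles summary table.
mean_dribbles <- c(made = 1.89, missed = 2.14)
shot_count    <- c(made = 57804, missed = 69949)

# Pooling the groups: each group mean is weighted by its shot count.
overall_mean <- weighted.mean(mean_dribbles, w = shot_count)

# Gap between the two group means, which does not depend on the counts.
gap <- unname(mean_dribbles["missed"] - mean_dribbles["made"])

print(round(c(overall = overall_mean, gap = gap), 2))
```

The pooled mean sits between the two group means and a little closer to the missed-shot mean, because that group is larger.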
First let's take a quick look at the box plot for dribbles.
#Visualization: Boxplot of DRIBBLES by SHOT_RESULT#
ggplot(shot_logs, aes(x = SHOT_RESULT, y = DRIBBLES, fill = SHOT_RESULT)) +
  geom_boxplot(outlier.color = "red", outlier.shape = 16) +
  labs(
    title = "Dribbles Before Shot by Shot Result",
    x = "Shot Result",
    y = "Number of Dribbles"
  ) +
  theme_minimal()
Density:
#Density plot of dribbles for made vs. missed shots#
ggplot(shot_logs, aes(x = DRIBBLES, fill = SHOT_RESULT)) +
  geom_density(alpha = 0.5) +
  labs(
    title = "Distribution of Dribbles for Made vs. Missed Shots",
    x = "Number of Dribbles",
    y = "Density"
  ) +
  theme_minimal()
This helps put things into perspective a little bit. The density for made
shots is much larger over the 0-2 dribble spikes, and both makes and
attempts fall off as the number of dribbles increases. This suggests
that 1. the more a player dribbles before shooting, the less likely the
shot is to go in, and 2. shots after many dribbles are much less common
than shots after few. There are obviously some outliers, but generally
speaking those two patterns appear to hold.
The graph might be kind of hard to read, so let's look at it numerically:
#Create dribble ranges for all shots#
#Note: cut() intervals are open on the left by default, so DRIBBLES == 0 falls outside (0,1] and becomes NA#
shot_logs <- shot_logs |>
  mutate(dribble_range = cut(DRIBBLES, breaks = c(0, 1, 3, 5, 10, Inf)))
#Summary statistics for missed shots and total shots#
dribble_summary <- shot_logs |>
  group_by(dribble_range) |>
  summarize(
    total_shots = n(),
    missed_shots = sum(SHOT_RESULT == "missed", na.rm = TRUE),
    made_shots = sum(SHOT_RESULT == "made", na.rm = TRUE),
    percentage_missed = (missed_shots / total_shots) * 100
  )
print(dribble_summary)
## # A tibble: 6 × 5
##   dribble_range total_shots missed_shots made_shots percentage_missed
##   <fct>               <int>        <int>      <int>             <dbl>
## 1 (0,1]               19409        10596       8813              54.6
## 2 (1,3]               21649        12449       9200              57.5
## 3 (3,5]                8883         5059       3824              57.0
## 4 (5,10]               9851         5696       4155              57.8
## 5 (10,Inf]             5077         2982       2095              58.7
## 6 <NA>                62884        33167      29717              52.7
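Because cut() excludes the lower bound of its first interval, the (0,1] bin contains only shots taken after exactly one dribble, and the <NA> row is where the shots with zero dribbles (and any rows with a missing DRIBBLES value) ended up. Reading the table that way, players missed least often with no dribbles (52.7%), next with one dribble (54.6%), and most often after two or more (57.0% to 58.7%).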
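The <NA> row can be avoided by giving zero its own bin. The sketch below compares the original breaks with one possible alternative on a few made-up dribble counts; the bin labels are my own choice, not part of the original analysis.

```r
# Made-up dribble counts for illustration.
dribbles <- c(0, 0, 1, 2, 3, 4, 5, 6, 10, 11)

# Original breaks: the intervals are (0,1], (1,3], ... so 0 matches none of them.
original <- cut(dribbles, breaks = c(0, 1, 3, 5, 10, Inf))

# One possible fix: start the breaks below zero and label each bin explicitly.
fixed <- cut(
  dribbles,
  breaks = c(-Inf, 0, 1, 3, 5, 10, Inf),
  labels = c("0", "1", "2-3", "4-5", "6-10", "11+")
)

# useNA = "ifany" shows the NA count only when some values went unbinned.
print(table(original, useNA = "ifany"))
print(table(fixed, useNA = "ifany"))
```

Another option is `include.lowest = TRUE` with the original breaks, but that merges zero-dribble and one-dribble shots into a single [0,1] bin, which hides the difference between them.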
There were some insights into my data both visually and numerically that helped answer some of my questions.