I will load the dataset and preview the first few rows using head().

#Load the tidyverse (dplyr, ggplot2, stringr), which the later code relies on#
library(tidyverse)

shot_logs <- read.csv("C:/Users/13177/OneDrive/Stats for Data Science/filtered_shot_logs.csv")

head(shot_logs)
##    GAME_ID                  MATCHUP LOCATION W.L FINAL_MARGIN SHOT_NUMBER
## 1 21400899 MAR 04, 2015 - CHA @ BKN        A   W           24           1
## 2 21400899 MAR 04, 2015 - CHA @ BKN        A   W           24           2
## 3 21400899 MAR 04, 2015 - CHA @ BKN        A   W           24           3
## 4 21400899 MAR 04, 2015 - CHA @ BKN        A   W           24           4
## 5 21400899 MAR 04, 2015 - CHA @ BKN        A   W           24           5
## 6 21400899 MAR 04, 2015 - CHA @ BKN        A   W           24           6
##   PERIOD GAME_CLOCK SHOT_CLOCK DRIBBLES TOUCH_TIME SHOT_DIST PTS_TYPE
## 1      1       1:09       10.8        2        1.9       7.7        2
## 2      1       0:14        3.4        0        0.8      28.2        3
## 3      1       0:00         NA        3        2.7      10.1        2
## 4      2      11:47       10.3        2        1.9      17.2        2
## 5      2      10:34       10.9        2        2.7       3.7        2
## 6      2       8:15        9.1        2        4.4      18.4        2
##   SHOT_RESULT  CLOSEST_DEFENDER CLOSEST_DEFENDER_PLAYER_ID CLOSE_DEF_DIST FGM
## 1        made    Anderson, Alan                     101187            1.3   1
## 2      missed Bogdanovic, Bojan                     202711            6.1   0
## 3      missed Bogdanovic, Bojan                     202711            0.9   0
## 4      missed     Brown, Markel                     203900            3.4   0
## 5      missed   Young, Thaddeus                     201152            1.1   0
## 6      missed   Williams, Deron                     101114            2.6   0
##   PTS   player_name player_id
## 1   2 brian roberts    203148
## 2   0 brian roberts    203148
## 3   0 brian roberts    203148
## 4   0 brian roberts    203148
## 5   0 brian roberts    203148
## 6   0 brian roberts    203148

# Numeric Summaries

Now I will compute numeric summaries for at least two columns, SHOT_DIST and CLOSE_DEF_DIST. Both are measured in feet.

#Numeric summaries for SHOT_DIST and CLOSE_DEF_DIST#
summary(shot_logs$SHOT_DIST)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    4.70   13.70   13.58   22.50   47.20
summary(shot_logs$CLOSE_DEF_DIST)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.300   3.700   4.124   5.300  53.200
#Quantiles for SHOT_DIST and CLOSE_DEF_DIST#
quantile(shot_logs$SHOT_DIST, probs = c(0.25, 0.5, 0.75))
##  25%  50%  75% 
##  4.7 13.7 22.5
quantile(shot_logs$CLOSE_DEF_DIST, probs = c(0.25, 0.5, 0.75))
## 25% 50% 75% 
## 2.3 3.7 5.3

The quartiles and means tell us a few things about shot distance and closest defender distance. The third quartile of SHOT_DIST is 22.5 ft, which is already past the 22 ft corner 3-point line, so almost every shot above the third quartile is a 3-point attempt. On average, the closest defender is about 4.1 ft from the shooter. That seems like a lot, but the average NBA player is around 6 ft 7 in tall, and once you account for the wingspan of a defender reaching toward the shooter, that is actually pretty good defense.
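The claim about the longest shots can be checked directly against PTS_TYPE. The following is a minimal sketch in base R, assuming a small made-up data frame (`toy`) in place of shot_logs; the values are hypothetical and only the column names come from the data.

```r
# Hypothetical toy data standing in for shot_logs (values are made up).
toy <- data.frame(
  SHOT_DIST = c(2.1, 4.7, 9.0, 13.7, 18.4, 22.5, 23.9, 24.6, 25.3, 26.0, 27.1, 28.2),
  PTS_TYPE  = c(2,   2,   2,   2,    2,    2,    3,    3,    3,    3,    3,    3)
)

# Third quartile of shot distance (same default method as summary()).
q3 <- quantile(toy$SHOT_DIST, probs = 0.75, names = FALSE)

# Keep only the shots longer than the third quartile.
above_q3 <- toy[toy$SHOT_DIST > q3, ]

# Share of those shots that were 3-point attempts.
share_threes <- mean(above_q3$PTS_TYPE == 3)
print(share_threes)
```

Running the same three steps on shot_logs would show how close "almost every" is to the real share.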

# Categorical Summaries

Now I’ll do the categorical summaries.

#Frequency counts for LOCATION and SHOT_RESULT#
table(shot_logs$LOCATION)
## 
##     A     H 
## 63968 63785
table(shot_logs$SHOT_RESULT)
## 
##   made missed 
##  57804  69949

This is a very basic summary. Away teams took slightly more shots than home teams (63,968 vs. 63,785, a difference of only 183). That's kind of interesting, but very small. There were also more missed shots than made shots (69,949 vs. 57,804), so about 45% of attempts went in. Very surface level, but still interesting.
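Raw counts are easier to compare as percentages. A minimal sketch, re-entering the counts printed above by hand rather than reading the CSV (the variable names are my own):

```r
# Counts copied from the table() output above.
location_counts <- c(A = 63968, H = 63785)
result_counts   <- c(made = 57804, missed = 69949)

# prop.table() divides each count by the total; multiply by 100 for percent.
location_pct <- round(100 * prop.table(location_counts), 1)
result_pct   <- round(100 * prop.table(result_counts), 1)

print(location_pct)
print(result_pct)
```

On the full data, the same idea is `prop.table(table(shot_logs$SHOT_RESULT))`.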

# Exploratory Questions

1. Does shot distance affect the shot result? (Most likely yes, but I want to look into it.)
2. Is there a relationship between the closest defender distance and shot distance?
3. How does the number of dribbles affect the shot result?

# Aggregation

Looking into shot distance and shot result using an aggregation function.

#Aggregating average SHOT_DIST by SHOT_RESULT#
aggregated_data <- shot_logs |>
  group_by(SHOT_RESULT) |>
  summarize(mean_shot_dist = mean(SHOT_DIST, na.rm = TRUE))

print(aggregated_data)
## # A tibble: 2 × 2
##   SHOT_RESULT mean_shot_dist
##   <chr>                <dbl>
## 1 made                  11.7
## 2 missed                15.1

The average made shot is about 3.4 ft closer to the basket than the average missed shot (11.7 ft vs. 15.1 ft). This suggests that distance has an effect on the result of the shot, although a difference in means alone does not prove it. For reference, the NBA 3-point line is 23 ft 9 in from the basket, but 22 ft in the corners.

# Visual Summaries

Distribution of Shot Distance:

#Histogram of SHOT_DIST#
ggplot(shot_logs, aes(x = SHOT_DIST)) +
  geom_histogram(bins = 30, fill = "blue", color = "black") +
  labs(title = "Distribution of Shot Distance in Feet", x = "Shot Distance", y = "Frequency")

The histogram is roughly U-shaped. We see a high frequency of close-range shots, a drop-off at about 7-13 ft, and then the frequency starts to increase again from 15 ft on. Keeping in mind that the 3-point line is 22 ft from the corners, we see a steep increase from the 20-21 ft range to the 22 ft range, the corner 3 position. We then see a massive increase in frequency at the 24-25 ft range.

Let's take that histogram a step further.

#Histogram of SHOT_DIST with color indicating made or missed shots#
ggplot(shot_logs, aes(x = SHOT_DIST, fill = SHOT_RESULT)) +
  geom_histogram(position = "identity", bins = 30, alpha = 0.7, color = "black") +
  scale_fill_manual(values = c("missed" = "red", "made" = "blue")) +
  labs(
    title = "Distribution of Shot Distance with Made and Missed Shots",
    x = "Shot Distance (ft)",
    y = "Frequency",
    fill = "Shot Result"
  ) +
  theme_minimal()

Because the two histograms are overlaid, blue showing above the red in a bin means there were more made than missed shots at that distance. Where no blue shows above the red, there were more missed than made. Shots from 23 ft and beyond look to have the worst make percentage.
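Reading make percentage off overlaid bars is hard, so it can help to compute it per distance band. This is a rough sketch in base R on made-up data (`toy`); the band edges are my own choice, with 23 ft picked only to roughly separate most 3-point attempts.

```r
# Hypothetical toy data (values are made up); FGM is 1 for made, 0 for missed.
toy <- data.frame(
  SHOT_DIST = c(1.5, 3.0, 4.2, 8.0, 12.5, 15.0, 18.4, 22.3, 24.0, 25.1, 26.3, 28.2),
  FGM       = c(1,   1,   0,   1,   0,    0,    1,    0,    1,    0,    0,    0)
)

# Assign each shot to a distance band; right = FALSE makes bands like [0,5).
toy$dist_band <- cut(toy$SHOT_DIST,
                     breaks = c(0, 5, 10, 15, 20, 23, Inf),
                     right = FALSE)

# Mean of a 0/1 column is the make rate; convert to percent per band.
make_pct <- round(100 * tapply(toy$FGM, toy$dist_band, mean), 1)
print(make_pct)
```

Applied to shot_logs, the same two steps would put a number on the "23 feet and on" observation.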

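Relationship Between Shot Distance and Closest Defender Distance: Because there are too many shots to plot all at once, I am going to filter the data to games involving the Pacers. Note that this keeps every row whose MATCHUP contains "IND", so it can include shots taken against the Pacers as well as shots taken by them.

#Filter data for games involving the Pacers (MATCHUP contains "IND")#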
pacers_shots <- shot_logs |>
  filter(str_detect(MATCHUP, "IND"))

#Scatterplot of SHOT_DIST vs CLOSE_DEF_DIST for Pacers#
ggplot(pacers_shots, aes(x = SHOT_DIST, y = CLOSE_DEF_DIST, color = SHOT_RESULT)) +
  geom_point(alpha = 0.5) +
  labs(
    title = "Shot Distance vs Close Defender Distance (Pacers)",
    x = "Shot Distance (ft)",
    y = "Close Defender Distance (ft)"
  ) +
  theme_minimal()

Although the plot is honestly still too crowded, we can make out a denser cluster of “made” values in the lower left corner of the graph. As the shot distance increases, there is a slight increase in the distance to the closest defender, as well as an increase in “missed” values. I think filtering further to a specific game will give better insights.
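Matching "IND" anywhere in MATCHUP keeps both teams' shots from Pacers games. As a rough sketch, assuming the first team listed in MATCHUP is the shooting team (the preview rows look that way, but it is an assumption here), the shooting team can be pulled out with a regular expression. The toy rows below are illustrative, and the third MATCHUP string is made up.

```r
# Toy rows; the last MATCHUP string is hypothetical.
toy <- data.frame(
  MATCHUP = c("MAR 04, 2015 - CHA @ BKN",
              "FEB 08, 2015 - CHA vs. IND",
              "FEB 08, 2015 - IND @ CHA"),
  stringsAsFactors = FALSE
)

# Assumption: the three-letter code right after " - " is the shooting team.
toy$shooting_team <- sub("^.* - ([A-Z]{3}) .*$", "\\1", toy$MATCHUP)

# Rows where the Pacers are the shooting team only.
pacers_only <- toy[toy$shooting_team == "IND", ]

# Rows from any game involving the Pacers (what a plain "IND" match keeps).
any_pacers <- toy[grepl("IND", toy$MATCHUP, fixed = TRUE), ]

print(pacers_only)
print(any_pacers)
```

The difference between the two results is the opponent's shots, which matters when deciding whose shooting the plot describes.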

Let's filter it down to one Pacers game: Feb 08, 2015 - CHA vs. IND

#Filter data for the specific Pacers game#
specific_pacers_game <- shot_logs |>
  filter(MATCHUP == "FEB 08, 2015 - CHA vs. IND")

#Scatterplot of SHOT_DIST vs CLOSE_DEF_DIST for the specific Pacers game#
ggplot(specific_pacers_game, aes(x = SHOT_DIST, y = CLOSE_DEF_DIST, color = SHOT_RESULT)) +
  geom_point(alpha = 0.7) +
  labs(
    title = "Shot Distance vs Close Defender Distance (CHA vs. IND - Feb 8, 2015)",
    x = "Shot Distance (ft)",
    y = "Close Defender Distance (ft)"
  ) +
  theme_minimal()

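Now that is much better. One caveat first: in the preview rows at the top, Brian Roberts's shots are listed under "CHA @ BKN" with Brooklyn players (for example Deron Williams) as the closest defenders, so the first team in MATCHUP appears to be the shooting team. If that holds here, these are Charlotte's shots against the Pacers' defense rather than the Pacers' own shots. Read that way, there is a high density of missed shots in the 7-9 ft range where the defender is about 2-3 ft away, which suggests the Pacers defended mid-range shots well when they stayed close. Shots taken with the defender about 5 ft or more away went in more often, so leaving shooters open was costly. Contested close-range shots were also made fairly often, indicated by the higher frequency of “made” values in the bottom left of the graph.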

# Dribbles and Shot Results

Summarize Dribbles by Shot Result:

#Summarize average dribbles by shot result#
dribbles_summary <- shot_logs |>
  group_by(SHOT_RESULT) |>
  summarize(
    mean_dribbles = mean(DRIBBLES, na.rm = TRUE),
    median_dribbles = median(DRIBBLES, na.rm = TRUE),
    shot_count = n()
  )

print(dribbles_summary)
## # A tibble: 2 × 4
##   SHOT_RESULT mean_dribbles median_dribbles shot_count
##   <chr>               <dbl>           <dbl>      <int>
## 1 made                 1.89               0      57804
## 2 missed               2.14               1      69949

There are about 12,000 more missed shots than made shots, and missed shots average slightly more dribbles (2.14 vs. 1.89), but not by much. The different group sizes do not by themselves distort this comparison, because each mean is computed within its own group, and the medians (1 vs. 0) point the same way. Just out of curiosity, let's check the distribution of dribbles.

First, let's take a quick look at the box plot for dribbles.
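A quick way to see whether group size alone can move a group mean is to repeat one group's rows and recompute. This is a minimal sketch on made-up numbers (`toy`), not on the real data.

```r
# Hypothetical toy data (values are made up).
toy <- data.frame(
  SHOT_RESULT = c("made", "made", "missed", "missed"),
  DRIBBLES    = c(0, 2, 1, 5)
)

# Mean dribbles within each group.
means_before <- tapply(toy$DRIBBLES, toy$SHOT_RESULT, mean)

# Double the number of missed rows by appending a copy of them.
bigger <- rbind(toy, toy[toy$SHOT_RESULT == "missed", ])

# Recompute the per-group means on the larger data set.
means_after <- tapply(bigger$DRIBBLES, bigger$SHOT_RESULT, mean)

print(means_before)
print(means_after)
```

The per-group means come out the same both times, so having more missed rows does not on its own pull either average up or down.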

#Visualization: Boxplot of DRIBBLES by SHOT_RESULT#
ggplot(shot_logs, aes(x = SHOT_RESULT, y = DRIBBLES, fill = SHOT_RESULT)) +
  geom_boxplot(outlier.color = "red", outlier.shape = 16) +
  labs(
    title = "Dribbles Before Shot by Shot Result",
    x = "Shot Result",
    y = "Number of Dribbles"
  ) +
  theme_minimal()

Density:

#Density plot of dribbles for made vs. missed shots#
ggplot(shot_logs, aes(x = DRIBBLES, fill = SHOT_RESULT)) +
  geom_density(alpha = 0.5) +
  labs(
    title = "Distribution of Dribbles for Made vs. Missed Shots",
    x = "Number of Dribbles",
    y = "Density"
  ) +
  theme_minimal()

This helps put things into perspective a little bit. We can see a much larger share of made shots in the 0-2 dribble spikes, and both curves fall off as the number of dribbles increases. This suggests that 1. the more a player dribbles, the less likely they are to make the shot, and 2. shots after many dribbles are much less common. There are obviously some outliers, but generally speaking those two observations hold.

The graph might be kind of hard to read, so let's look at it numerically:

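#Create dribble ranges for all shots (cut() intervals are open on the left by default, so 0 dribbles falls outside (0,1] and becomes NA)#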
shot_logs <- shot_logs |>
  mutate(dribble_range = cut(DRIBBLES, breaks = c(0, 1, 3, 5, 10, Inf)))

#Summary statistics for missed shots and total shots#
dribble_summary <- shot_logs |>
  group_by(dribble_range) |>
  summarize(
    total_shots = n(),
    missed_shots = sum(SHOT_RESULT == "missed", na.rm = TRUE),
    made_shots = sum(SHOT_RESULT == "made", na.rm = TRUE),
    percentage_missed = (missed_shots / total_shots) * 100
  )

print(dribble_summary)
## # A tibble: 6 × 5
##   dribble_range total_shots missed_shots made_shots percentage_missed
##   <fct>               <int>        <int>      <int>             <dbl>
## 1 (0,1]               19409        10596       8813              54.6
## 2 (1,3]               21649        12449       9200              57.5
## 3 (3,5]                8883         5059       3824              57.0
## 4 (5,10]               9851         5696       4155              57.8
## 5 (10,Inf]             5077         2982       2095              58.7
## 6 <NA>                62884        33167      29717              52.7

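The `<NA>` row appears to be the zero-dribble shots rather than missing data: cut() builds intervals that are open on the left by default, so a value of 0 falls outside (0,1] and is labeled NA. Read that way, shots with no dribbles had the lowest miss rate (52.7%), shots after exactly one dribble were next (54.6%), and every range above one dribble sits between 57.0% and 58.7%. So players missed less after 0-1 dribbles than after 2 or more.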
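The `<NA>` row comes from how cut() treats the lowest break. A minimal sketch on a made-up vector of dribble counts shows the default behavior next to one possible fix; the extra `-Inf` break and the labels are my own choices and assume dribble counts are whole numbers.

```r
# Hypothetical dribble counts (made up for illustration).
dribbles <- c(0, 0, 1, 2, 3, 4, 7, 12)

# Default: intervals are (0,1], (1,3], ... so 0 is outside every bin.
default_bins <- cut(dribbles, breaks = c(0, 1, 3, 5, 10, Inf))

# One fix: add a bin that ends at 0 and label the bins explicitly.
fixed_bins <- cut(dribbles,
                  breaks = c(-Inf, 0, 1, 3, 5, 10, Inf),
                  labels = c("0", "1", "2-3", "4-5", "6-10", "11+"))

print(table(default_bins, useNA = "ifany"))
print(table(fixed_bins, useNA = "ifany"))
```

With the extra bin, zero-dribble shots get their own labeled row instead of showing up as NA.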

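The visual and numeric summaries together helped answer some of my questions: made shots are closer to the basket on average than missed shots, and miss rates are lowest for shots taken after zero or one dribble.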