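library(dplyr)    # group_by(), summarize(), mutate(), arrange(), count(), anti_join()
library(ggplot2)  # all plots below

shot_logs <- read.csv("C:/Users/13177/OneDrive/Stats for Data Science/filtered_shot_logs.csv")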

head(shot_logs)
##    GAME_ID                  MATCHUP LOCATION W.L FINAL_MARGIN SHOT_NUMBER
## 1 21400899 MAR 04, 2015 - CHA @ BKN        A   W           24           1
## 2 21400899 MAR 04, 2015 - CHA @ BKN        A   W           24           2
## 3 21400899 MAR 04, 2015 - CHA @ BKN        A   W           24           3
## 4 21400899 MAR 04, 2015 - CHA @ BKN        A   W           24           4
## 5 21400899 MAR 04, 2015 - CHA @ BKN        A   W           24           5
## 6 21400899 MAR 04, 2015 - CHA @ BKN        A   W           24           6
##   PERIOD GAME_CLOCK SHOT_CLOCK DRIBBLES TOUCH_TIME SHOT_DIST PTS_TYPE
## 1      1       1:09       10.8        2        1.9       7.7        2
## 2      1       0:14        3.4        0        0.8      28.2        3
## 3      1       0:00         NA        3        2.7      10.1        2
## 4      2      11:47       10.3        2        1.9      17.2        2
## 5      2      10:34       10.9        2        2.7       3.7        2
## 6      2       8:15        9.1        2        4.4      18.4        2
##   SHOT_RESULT  CLOSEST_DEFENDER CLOSEST_DEFENDER_PLAYER_ID CLOSE_DEF_DIST FGM
## 1        made    Anderson, Alan                     101187            1.3   1
## 2      missed Bogdanovic, Bojan                     202711            6.1   0
## 3      missed Bogdanovic, Bojan                     202711            0.9   0
## 4      missed     Brown, Markel                     203900            3.4   0
## 5      missed   Young, Thaddeus                     201152            1.1   0
## 6      missed   Williams, Deron                     101114            2.6   0
##   PTS   player_name player_id
## 1   2 brian roberts    203148
## 2   0 brian roberts    203148
## 3   0 brian roberts    203148
## 4   0 brian roberts    203148
## 5   0 brian roberts    203148
## 6   0 brian roberts    203148

#Group By Investigations#

#Group by player_name and summarize average shot distance#
group1 <- shot_logs |>
  group_by(player_name) |>
  summarize(
    avg_shot_dist = mean(SHOT_DIST, na.rm = TRUE),
    shot_count = n()
  )

#Add probability column#
group1 <- group1 |>
  mutate(probability = shot_count / sum(shot_count))

#View the top and bottom players by probability#
group1 <- arrange(group1, probability)
print(group1)
## # A tibble: 281 × 4
##    player_name      avg_shot_dist shot_count probability
##    <chr>                    <dbl>      <int>       <dbl>
##  1 greg smith                2.67         47    0.000368
##  2 jerome jordan             4.79         88    0.000689
##  3 joey dorsey               3.67         92    0.000720
##  4 alan crabbe              19.1          93    0.000728
##  5 mike miller              23.5          94    0.000736
##  6 joe harris               19.6         100    0.000783
##  7 tyler hansbrough          5.24        100    0.000783
##  8 hedo turkoglu            21.4         102    0.000798
##  9 aaron gordon             10.1         104    0.000814
## 10 bismack biyombo           4.03        114    0.000892
## # ℹ 271 more rows

Here we are looking at each player's average shot distance, shot count, and the probability that a shot selected at random from the whole dataset was taken by that player. Greg Smith has the lowest probability of being selected and James Harden has the highest. Some other big names with a high probability are LeBron James, Kyrie Irving, and Steph Curry. There are some obvious explanations for why players have taken more or fewer shots: skill of the player, position, playing time (which relates to skill), specialty (is this player a shooter, a defensive player, etc.), and even team (some teams play a slower offensive style than others).
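Because the table above is sorted in ascending order of probability, it shows only the least frequent shooters. Here is a minimal sketch of how both ends of the ranking could be pulled out, along with a check that the probabilities sum to 1. The `toy` tibble and its player names and counts are hypothetical and not taken from the dataset.

```r
library(dplyr)

# Hypothetical per-player shot counts, for illustration only
toy <- tibble(
  player_name = c("player a", "player b", "player c", "player d"),
  shot_count  = c(50L, 150L, 300L, 500L)
)

# Same probability definition as above: a player's share of all shots
toy <- toy |>
  mutate(probability = shot_count / sum(shot_count))

# The shares form a probability distribution, so they should sum to 1
stopifnot(isTRUE(all.equal(sum(toy$probability), 1)))

# Least and most frequent shooters
lowest  <- slice_min(toy, probability, n = 2)
highest <- slice_max(toy, probability, n = 2)

print(lowest)
print(highest)
```

Applied to `group1`, the same two calls would show the bottom and top of the ranking without re-sorting the whole table.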

#Filter for top 20 players by probability#
top_players <- group1 |> 
  slice_max(probability, n = 20)

#Visualization for top players#
ggplot(top_players, aes(x = reorder(player_name, -probability), y = probability)) +
  geom_bar(stat = "identity", fill = "blue") +
  labs(
    title = "Top 20 Players by Shot Probability",
    x = "Player Name",
    y = "Probability"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Here are the top 20 contributors to the shot count/probability.

#Group by SHOT_RESULT and summarize CLOSE_DEF_DIST#
group2 <- shot_logs |>
  group_by(SHOT_RESULT) |>
  summarize(
    avg_close_def_dist = mean(CLOSE_DEF_DIST, na.rm = TRUE),
    shot_count = n()
  ) |>
  mutate(probability = shot_count / sum(shot_count))

print(group2)
## # A tibble: 2 × 4
##   SHOT_RESULT avg_close_def_dist shot_count probability
##   <chr>                    <dbl>      <int>       <dbl>
## 1 made                      4.12      57804       0.452
## 2 missed                    4.13      69949       0.548

Surprisingly, there is little to no difference in the average closest-defender distance between made shots (4.12 ft) and missed shots (4.13 ft). Let's look at a violin plot to see how these distances are distributed.

#Violin plot for CLOSE_DEF_DIST by SHOT_RESULT#
ggplot(shot_logs, aes(x = SHOT_RESULT, y = CLOSE_DEF_DIST, fill = SHOT_RESULT)) +
  geom_violin(alpha = 0.7, trim = TRUE) +
  labs(
    title = "Density of Defender Distance by Shot Result",
    x = "Shot Result",
    y = "Close Defender Distance (ft)",
    fill = "Shot Result"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

These violin plots are very similar, but there is slightly more density among missed shots with the defender farther away, and slightly more density among made shots with a closer defender. This seems a little backwards. However, I think the made shots with a close defender are shots right near the basket: despite the nearby defender, these are very high-percentage shots. I think the denser region of missed shots with a defender farther away consists of longer shots, possibly 3-pointers, which have a much lower probability of going in. Also, players are able to create more space between themselves and the defender at or around the 3-point line than in the paint, where it tends to be more crowded.
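One way to probe this explanation would be to compare defender distance for made and missed shots within shot-distance ranges rather than across all shots at once. The sketch below shows the idea on a small made-up tibble. The rows in `toy_shots` are hypothetical, and the cut points (8 ft and 22 ft) and range labels are assumptions chosen only for illustration.

```r
library(dplyr)

# Hypothetical shots, for illustration only (not rows from shot_logs)
toy_shots <- tibble(
  SHOT_DIST      = c(2.1, 3.5, 4.0, 6.2, 15.0, 18.3, 24.5, 25.1, 26.0, 27.3),
  CLOSE_DEF_DIST = c(1.0, 2.5, 1.2, 3.0, 3.5, 4.8, 4.0, 6.5, 5.2, 7.1),
  SHOT_RESULT    = c("made", "made", "missed", "made", "missed",
                     "made", "missed", "made", "missed", "missed")
)

# Bin shots by distance, then summarize defender distance per bin and result
by_range <- toy_shots |>
  mutate(shot_range = cut(
    SHOT_DIST,
    breaks = c(0, 8, 22, Inf),   # assumed cut points, in feet
    labels = c("near basket", "mid-range", "long range"),
    include.lowest = TRUE
  )) |>
  group_by(shot_range, SHOT_RESULT) |>
  summarize(
    avg_close_def_dist = mean(CLOSE_DEF_DIST, na.rm = TRUE),
    shot_count = n(),
    .groups = "drop"
  )

print(by_range)
```

If the explanation holds, running the same pipeline on `shot_logs` should show made shots with a farther defender than missed shots inside each range, even though the overall averages are nearly equal.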

#Group by PERIOD and summarize shot-related metrics#
group3 <- shot_logs |>
  group_by(PERIOD) |>
  summarize(
    total_shots = n(),
    avg_shot_dist = mean(SHOT_DIST, na.rm = TRUE),
    avg_close_def_dist = mean(CLOSE_DEF_DIST, na.rm = TRUE)
  )

print(group3)
## # A tibble: 7 × 4
##   PERIOD total_shots avg_shot_dist avg_close_def_dist
##    <int>       <int>         <dbl>              <dbl>
## 1      1       33873          13.3               4.16
## 2      2       31567          13.3               4.11
## 3      3       32137          13.6               4.15
## 4      4       29056          14.0               4.08
## 5      5         910          14.4               3.87
## 6      6         167          15.1               3.80
## 7      7          43          16.0               4.09

This grouping shows the shots attempted in each period. Periods 5, 6, and 7 are overtime periods: 5 is the 1st overtime, 6 the 2nd, and 7 the 3rd. Excluding overtime, the fewest shots are taken in the 4th period and the most in the 1st. This could be because a team with a lead in the 4th quarter will take more time on each possession, leaving its opponent less time to come back. This is a typical game strategy in basketball. I find it interesting that average shot distance increases slightly as the periods go on (including overtimes). There is also a slight, though not steady, downward trend in average closest-defender distance, which may mean defense is played slightly tighter late in games; the overtime averages rest on far fewer shots, so they are less reliable.
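To put a number on the gap between the 1st and 4th periods, the totals printed above can be turned into shares of regulation shots. This is a rough sketch that re-enters the counts by hand. The names `period_shots`, `regulation`, and `drop_q1_to_q4` are introduced here for illustration.

```r
# Shot totals copied from the group3 output above
period_shots <- data.frame(
  PERIOD      = 1:7,
  total_shots = c(33873, 31567, 32137, 29056, 910, 167, 43)
)

# Restrict to the four regulation quarters
regulation <- subset(period_shots, PERIOD <= 4)

# Each quarter's share of regulation shots
regulation$share <- regulation$total_shots / sum(regulation$total_shots)

# Relative drop in attempts from the first quarter to the fourth
drop_q1_to_q4 <- 1 - regulation$total_shots[4] / regulation$total_shots[1]

print(regulation)
print(round(drop_q1_to_q4, 3))
```

By these counts the 4th period has roughly 14% fewer attempts than the 1st.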

#Stacked bar plot: Total shots by period and shot result#
ggplot(shot_logs, aes(x = factor(PERIOD), fill = SHOT_RESULT)) +
  geom_bar(position = "stack") +
  labs(
    title = "Shot Results by Period",
    x = "Game Period",
    y = "Total Shots",
    fill = "Shot Result"
  ) +
  theme_minimal()

Here is a stacked bar plot for shots made and missed per period.

#Line plot for average shot distance and defender distance by period#
ggplot(group3, aes(x = PERIOD)) +
  geom_line(aes(y = avg_shot_dist, color = "Avg Shot Distance"), linewidth = 1) +
  geom_line(aes(y = avg_close_def_dist, color = "Avg Defender Distance"), linewidth = 1, linetype = "dashed") +
  labs(
    title = "Average Shot Distance and Defender Distance by Period",
    x = "Game Period",
    y = "Distance (ft)",
    color = "Metric"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")

Here is a line plot to help visualize.

#Categorical Combinations#

#Unique combinations of SHOT_RESULT and LOCATION#
combinations <- shot_logs |>
  count(SHOT_RESULT, LOCATION)

print(combinations)
##   SHOT_RESULT LOCATION     n
## 1        made        A 28685
## 2        made        H 29119
## 3      missed        A 35283
## 4      missed        H 34666
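
#Identify missing combinations#
#stringsAsFactors = FALSE keeps both columns as character, matching `combinations`#
all_combinations <- expand.grid(
  SHOT_RESULT = unique(shot_logs$SHOT_RESULT),
  LOCATION = unique(shot_logs$LOCATION),
  stringsAsFactors = FALSE
)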

missing_combinations <- anti_join(all_combinations, combinations, by = c("SHOT_RESULT", "LOCATION"))
print(missing_combinations)
## [1] SHOT_RESULT LOCATION   
## <0 rows> (or 0-length row.names)

There are no missing combinations because a shot HAS to be either made or missed, and a team HAS to play either away or at home. Even if a game is played at neither team's home arena (a neutral site), one team is designated the home team and the other the away team.
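The same check can also be done with a cross-tabulation, where a combination that never occurs would show up as a zero cell. This is a minimal base-R sketch that re-enters the printed counts. `combo_counts`, `combo_table`, and `all_present` are names introduced here for illustration.

```r
# Counts copied from the combinations output above
combo_counts <- data.frame(
  SHOT_RESULT = c("made", "made", "missed", "missed"),
  LOCATION    = c("A", "H", "A", "H"),
  n           = c(28685, 29119, 35283, 34666)
)

# Cross-tabulate; a combination absent from the data would appear as a 0 cell
combo_table <- xtabs(n ~ SHOT_RESULT + LOCATION, data = combo_counts)
print(combo_table)

# TRUE if every SHOT_RESULT x LOCATION pair occurs at least once
all_present <- all(combo_table > 0)
print(all_present)
```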

#Most/least common combinations#
print(arrange(combinations, n))
##   SHOT_RESULT LOCATION     n
## 1        made        A 28685
## 2        made        H 29119
## 3      missed        H 34666
## 4      missed        A 35283

The most common combination is a missed away shot, and the least common is a made away shot. A missed away shot could be the most common because players are not used to the environment, crowd noise, or basket, which could cause them to miss more shots. The same reasoning could explain why a made away shot is the least common. The differences in counts are modest, though, so this explanation is tentative.
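Raw counts depend on how many shots were attempted at each location, so make rates are a fairer comparison. The sketch below computes them from the counts printed above. `combo_counts`, `totals`, and `make_rate` are names introduced here for illustration.

```r
# Counts copied from the combinations output above
combo_counts <- data.frame(
  SHOT_RESULT = c("made", "made", "missed", "missed"),
  LOCATION    = c("A", "H", "A", "H"),
  n           = c(28685, 29119, 35283, 34666)
)

# Total attempts at each location (made + missed)
totals <- tapply(combo_counts$n, combo_counts$LOCATION, sum)

# Made shots at each location
made <- combo_counts[combo_counts$SHOT_RESULT == "made", ]

# Make rate = made shots / total attempts, per location
make_rate <- setNames(made$n / as.vector(totals[made$LOCATION]), made$LOCATION)

print(round(make_rate, 3))
```

By these counts the make rate is about 44.8% away versus 45.7% at home, a gap of under one percentage point.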

#Bar plot for combinations#
ggplot(combinations, aes(x = LOCATION, y = n, fill = SHOT_RESULT)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(
    title = "Shot Result by Location",
    x = "Location",
    y = "Count",
    fill = "Shot Result"
  ) +
  theme_minimal()

Here is a bar chart to help visualize this.