Data Dive 4

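For this data dive I am going to work with only the shots from games involving the Pacers (matchups containing "IND"). Note that this keeps shots by both the Pacers and their opponents in those games. Restricting the data this way reduces the massive number of rows that would otherwise be in each sample and helps us reach cleaner conclusions.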

# Load dplyr, stringr, and ggplot2, which the code below relies on
library(tidyverse)

shot_logs <- read.csv("C:/Users/13177/OneDrive/Stats for Data Science/filtered_shot_logs.csv")

head(shot_logs)

##    GAME_ID                  MATCHUP LOCATION W.L FINAL_MARGIN SHOT_NUMBER
## 1 21400899 MAR 04, 2015 - CHA @ BKN        A   W           24           1
## 2 21400899 MAR 04, 2015 - CHA @ BKN        A   W           24           2
## 3 21400899 MAR 04, 2015 - CHA @ BKN        A   W           24           3
## 4 21400899 MAR 04, 2015 - CHA @ BKN        A   W           24           4
## 5 21400899 MAR 04, 2015 - CHA @ BKN        A   W           24           5
## 6 21400899 MAR 04, 2015 - CHA @ BKN        A   W           24           6
##   PERIOD GAME_CLOCK SHOT_CLOCK DRIBBLES TOUCH_TIME SHOT_DIST PTS_TYPE
## 1      1       1:09       10.8        2        1.9       7.7        2
## 2      1       0:14        3.4        0        0.8      28.2        3
## 3      1       0:00         NA        3        2.7      10.1        2
## 4      2      11:47       10.3        2        1.9      17.2        2
## 5      2      10:34       10.9        2        2.7       3.7        2
## 6      2       8:15        9.1        2        4.4      18.4        2
##   SHOT_RESULT  CLOSEST_DEFENDER CLOSEST_DEFENDER_PLAYER_ID CLOSE_DEF_DIST FGM
## 1        made    Anderson, Alan                     101187            1.3   1
## 2      missed Bogdanovic, Bojan                     202711            6.1   0
## 3      missed Bogdanovic, Bojan                     202711            0.9   0
## 4      missed     Brown, Markel                     203900            3.4   0
## 5      missed   Young, Thaddeus                     201152            1.1   0
## 6      missed   Williams, Deron                     101114            2.6   0
##   PTS   player_name player_id
## 1   2 brian roberts    203148
## 2   0 brian roberts    203148
## 3   0 brian roberts    203148
## 4   0 brian roberts    203148
## 5   0 brian roberts    203148
## 6   0 brian roberts    203148

Filter for IND matchups

#Filter data for Pacers games#
pacers_shots <- shot_logs |> filter(str_detect(MATCHUP, "IND"))

head(pacers_shots)

##    GAME_ID                    MATCHUP LOCATION W.L FINAL_MARGIN SHOT_NUMBER
## 1 21400768 FEB 08, 2015 - CHA vs. IND        H   L           -1           1
## 2 21400768 FEB 08, 2015 - CHA vs. IND        H   L           -1           2
## 3 21400768 FEB 08, 2015 - CHA vs. IND        H   L           -1           3
## 4 21400768 FEB 08, 2015 - CHA vs. IND        H   L           -1           4
## 5 21400768 FEB 08, 2015 - CHA vs. IND        H   L           -1           5
## 6 21400768 FEB 08, 2015 - CHA vs. IND        H   L           -1           6
##   PERIOD GAME_CLOCK SHOT_CLOCK DRIBBLES TOUCH_TIME SHOT_DIST PTS_TYPE
## 1      1       5:58       19.7        0        1.0      24.7        3
## 2      1       5:21       12.8        0        0.9      22.5        3
## 3      1       4:22       16.5        7        6.2       8.8        2
## 4      1       0:31       11.3        0        0.8      20.0        2
## 5      2      11:47        9.7        1        0.8      24.7        3
## 6      2       9:18        0.7        3        3.7       5.6        2
##   SHOT_RESULT    CLOSEST_DEFENDER CLOSEST_DEFENDER_PLAYER_ID CLOSE_DEF_DIST FGM
## 1        made           Miles, CJ                     101139           11.3   1
## 2        made        Hill, George                     201588            7.2   1
## 3      missed          Watson, CJ                     201228            2.5   0
## 4        made          Watson, CJ                     201228            1.9   1
## 5      missed          Watson, CJ                     201228            4.3   0
## 6        made Whittington, Shayne                     203963            2.4   1
##   PTS   player_name player_id
## 1   3 brian roberts    203148
## 2   3 brian roberts    203148
## 3   0 brian roberts    203148
## 4   2 brian roberts    203148
## 5   0 brian roberts    203148
## 6   2 brian roberts    203148
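
The rows above are shots by a Charlotte player, which suggests the `str_detect(MATCHUP, "IND")` filter keeps both teams' shots from Pacers games. Below is a minimal sketch of narrowing the data to Pacers attempts only. It assumes the team listed first in `MATCHUP` is the shooting team, which matches the rows printed in this document but has not been checked against the full data. The `matchups` vector and the name `is_pacers_offense` are illustrative, and base R `grepl()` is used so the snippet runs without extra packages.

```r
# Example MATCHUP strings copied from the output shown in this document
matchups <- c(
  "FEB 08, 2015 - CHA vs. IND",  # shooter: brian roberts (CHA)
  "FEB 08, 2015 - IND @ CHA",    # shooter: david west (IND)
  "JAN 27, 2015 - TOR @ IND",    # shooter: amir johnson (TOR)
  "DEC 15, 2014 - IND vs. LAL"   # shooter: cj miles (IND)
)

# Assumption: the first team after " - " is the team taking the shot
is_pacers_offense <- grepl(" - IND (@|vs\\.) ", matchups)

print(is_pacers_offense)
```

If the assumption holds, the same pattern could be used in the pipeline above as `filter(str_detect(MATCHUP, " - IND (@|vs\\.) "))`.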

#Generate 5 Random Samples (50% Each, With Replacement)#

set.seed(42)

#Define sample size (50% of Pacers dataset)#
sample_size <- floor(nrow(pacers_shots) * 0.5)  # floor() keeps n a whole number

#Generate five random samples with replacement#
df_1 <- pacers_shots |> slice_sample(n = sample_size, replace = TRUE)
df_2 <- pacers_shots |> slice_sample(n = sample_size, replace = TRUE)
df_3 <- pacers_shots |> slice_sample(n = sample_size, replace = TRUE)
df_4 <- pacers_shots |> slice_sample(n = sample_size, replace = TRUE)
df_5 <- pacers_shots |> slice_sample(n = sample_size, replace = TRUE)

#Check first few rows of one sample#
head(df_1)

##    GAME_ID                    MATCHUP LOCATION W.L FINAL_MARGIN SHOT_NUMBER
## 1 21400675 JAN 27, 2015 - IND vs. TOR        H   L          -13           1
## 2 21400768   FEB 08, 2015 - IND @ CHA        A   W            1          13
## 3 21400675   JAN 27, 2015 - TOR @ IND        A   W           13           5
## 4 21400516 JAN 05, 2015 - UTA vs. IND        H   L           -4          11
## 5 21400644 JAN 23, 2015 - MIA vs. IND        H   W            2           2
## 6 21400356 DEC 15, 2014 - IND vs. LAL        H   W           19           8
##   PERIOD GAME_CLOCK SHOT_CLOCK DRIBBLES TOUCH_TIME SHOT_DIST PTS_TYPE
## 1      2       9:50         NA        7        6.3       3.5        2
## 2      4       3:54       10.0        5        4.8       8.4        2
## 3      3       1:40         NA        0        0.5       3.5        2
## 4      4       3:19        8.4        0        1.2      23.3        3
## 5      1       8:44       12.9        0        0.7      20.2        3
## 6      2       7:17        6.3        3        3.2       4.9        2
##   SHOT_RESULT   CLOSEST_DEFENDER CLOSEST_DEFENDER_PLAYER_ID CLOSE_DEF_DIST FGM
## 1      missed Patterson, Patrick                     202335            0.5   0
## 2      missed       Zeller, Cody                     203469            3.0   0
## 3        made        West, David                       2561            2.6   1
## 4        made        West, David                       2561           18.4   1
## 5        made      Hill, Solomon                     203524            2.4   1
## 6      missed        Young, Nick                     201156            1.6   0
##   PTS  player_name player_id
## 1   0   luis scola      2449
## 2   0   david west      2561
## 3   2 amir johnson    101161
## 4   3   dante exum    203957
## 5   3    luol deng      2736
## 6   0     cj miles    101139
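
The five samples could also be kept in a single named list rather than five separate objects, which makes later steps easier to loop over. This is only a sketch in base R: `toy_shots` is a simulated stand-in for `pacers_shots`, not real data, and the `samples` list is a hypothetical name.

```r
set.seed(42)

# Hypothetical stand-in for pacers_shots (simulated values, not real data)
toy_shots <- data.frame(
  SHOT_DIST = round(runif(200, min = 0, max = 30), 1),
  FGM = rbinom(200, size = 1, prob = 0.45)
)

# Whole number of rows per sample (50% of the data)
sample_size <- floor(nrow(toy_shots) * 0.5)

# Draw five samples with replacement and keep them in a named list
samples <- lapply(1:5, function(i) {
  idx <- sample(nrow(toy_shots), size = sample_size, replace = TRUE)
  toy_shots[idx, ]
})
names(samples) <- paste0("df_", 1:5)

# Number of rows in each sample
print(sapply(samples, nrow))
```

Because sampling is done with replacement, the same row can appear more than once within a sample.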

#Comparing the Subsamples#

#Function to summarize key statistics for a sample#
summarize_sample <- function(df) {
  df |> summarize(
    avg_shot_dist = mean(SHOT_DIST, na.rm = TRUE),
    avg_close_def_dist = mean(CLOSE_DEF_DIST, na.rm = TRUE),
    made_percentage = mean(SHOT_RESULT == "made", na.rm = TRUE),
    total_shots = n()
  )
}

#Compute summaries for each sample#
summary_df <- bind_rows(
  summarize_sample(df_1) |> mutate(sample = "df_1"),
  summarize_sample(df_2) |> mutate(sample = "df_2"),
  summarize_sample(df_3) |> mutate(sample = "df_3"),
  summarize_sample(df_4) |> mutate(sample = "df_4"),
  summarize_sample(df_5) |> mutate(sample = "df_5")
)

#Print the summary statistics#
print(summary_df)

##   avg_shot_dist avg_close_def_dist made_percentage total_shots sample
## 1      14.02326           4.075608       0.4329395        4317   df_1
## 2      13.79646           3.979893       0.4310864        4317   df_2
## 3      13.82196           4.043966       0.4415103        4317   df_3
## 4      14.03312           4.090549       0.4334028        4317   df_4
## 5      13.83878           4.023535       0.4313180        4317   df_5

When we generated five random subsamples (each 50% of the dataset), we observed that some statistics (e.g., shot distance, defender distance, and made shot percentage) fluctuated across samples. This highlights the random variation that naturally occurs when drawing samples from a dataset. If we relied on just one sample, we might mistakenly overestimate or underestimate certain player behaviors.

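If different samples produce different results, we must be cautious about drawing strong conclusions from limited data. Here, however, the values are all very close to each other: average shot distance ranges from about 13.80 to 14.03 ft, and made percentage from about 43.1% to 44.2%. This suggests that samples of this size give stable estimates, so conclusions about these averages can reasonably be based on this data.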
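
To put a number on "very close," the spread across the five samples can be compared with the standard error expected for a proportion at this sample size. The sketch below uses only the values printed in the summary table. The formula sqrt(p(1 - p)/n) is the usual approximation for independent draws, which is an assumption here.

```r
# Summary values copied from the table printed above
avg_shot_dist   <- c(14.02326, 13.79646, 13.82196, 14.03312, 13.83878)
made_percentage <- c(0.4329395, 0.4310864, 0.4415103, 0.4334028, 0.4313180)
n <- 4317  # shots per sample

# Observed spread (max minus min) across the five samples
shot_dist_range <- diff(range(avg_shot_dist))
made_pct_range  <- diff(range(made_percentage))

# Approximate standard error of a proportion: sqrt(p * (1 - p) / n)
p_hat <- mean(made_percentage)
se_made_pct <- sqrt(p_hat * (1 - p_hat) / n)

print(round(c(shot_dist_range = shot_dist_range,
              made_pct_range  = made_pct_range,
              se_made_pct     = se_made_pct), 4))
```

The made percentage varies by roughly one percentage point across samples, while the approximate standard error is about 0.75 percentage points, so the observed differences are about the size random sampling alone would produce.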

#Anomaly Detection in the Subsamples#

#Checking extreme values in each sample (e.g., long-range shots > 30ft)#
df_1 |> filter(SHOT_DIST > 30) |> count()

##    n
## 1 24

df_2 |> filter(SHOT_DIST > 30) |> count()

##    n
## 1 25

df_3 |> filter(SHOT_DIST > 30) |> count()

##    n
## 1 15

df_4 |> filter(SHOT_DIST > 30) |> count()

##    n
## 1 16

df_5 |> filter(SHOT_DIST > 30) |> count()

##    n
## 1 15

What Seems Like an Outlier in One Sample Might Not Be in Another

By checking long-range shots (>30 ft) across the five samples, we noticed that the counts ranged from 15 to 25 out of 4,317 shots per sample. This tells us that apparent anomalies may just be a result of random selection rather than a meaningful pattern. If we only had one sample, we might incorrectly conclude that the Pacers rarely take long shots, or that they take far too many, depending on which sample we drew.

Context matters when identifying anomalies—something that looks unusual in one sample might be completely normal when looking at the full dataset. This reinforces the need to validate outliers across multiple samples before assuming they are meaningful.
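
As a rough check on whether the differences in long-shot counts are larger than chance would produce, the spread of the five counts can be compared with the spread expected for rare events. This sketch uses the counts printed above. Treating them as approximately Poisson is an assumption made for illustration, not something the data confirms.

```r
# Counts of shots longer than 30 ft, copied from the five outputs above
long_shot_counts <- c(24, 25, 15, 16, 15)
n <- 4317  # shots per sample

# Share of each sample made up of long-range shots
long_shot_share <- long_shot_counts / n

# Observed spread versus the spread expected under a Poisson assumption
mean_count  <- mean(long_shot_counts)
observed_sd <- sd(long_shot_counts)
poisson_sd  <- sqrt(mean_count)  # for a Poisson count, sd = sqrt(mean)

print(round(long_shot_share, 4))
print(round(c(mean = mean_count, observed_sd = observed_sd,
              poisson_sd = poisson_sd), 2))
```

The observed standard deviation (about 5) is close to the Poisson value (about 4.4), which is consistent with the differences between samples being ordinary sampling noise.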

#Visualizing the Differences#

# Combine all samples for visualization
df_combined <- bind_rows(
  df_1 |> mutate(sample = "df_1"),
  df_2 |> mutate(sample = "df_2"),
  df_3 |> mutate(sample = "df_3"),
  df_4 |> mutate(sample = "df_4"),
  df_5 |> mutate(sample = "df_5")
)

# Boxplot to compare shot distance across samples
ggplot(df_combined, aes(x = sample, y = SHOT_DIST, fill = sample)) +
  geom_boxplot(alpha = 0.7) +
  labs(
    title = "Shot Distance Variability Across Pacers Subsamples",
    x = "Sample",
    y = "Shot Distance (ft)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

The boxplot compares the shot distance distribution across the five samples. In principle, some samples can have a wider spread (more variability) while others are more concentrated (less variability), so shot distance can look different depending on the sample chosen. These five samples happen to be very close in shot distance variability, but with more samples, or with smaller ones, we could see larger differences.

A single dataset snapshot might not be representative of the overall distribution of shots. When interpreting player or team tendencies, it’s important to look at the overall trend rather than relying on one sample’s summary statistics.
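
The visual comparison can be backed up with numbers by computing the median, interquartile range, and standard deviation for each sample. The sketch below uses base R `aggregate()`. Here `toy_combined` is a simulated stand-in for `df_combined`, not the real data, so the printed values say nothing about the Pacers.

```r
set.seed(42)

# Hypothetical stand-in for df_combined (simulated values, not real data)
toy_combined <- data.frame(
  sample = rep(paste0("df_", 1:5), each = 100),
  SHOT_DIST = runif(500, min = 0, max = 30)
)

# Median, IQR, and standard deviation of shot distance per sample
spread_by_sample <- aggregate(
  SHOT_DIST ~ sample,
  data = toy_combined,
  FUN = function(x) c(median = median(x), iqr = IQR(x), sd = sd(x))
)

print(spread_by_sample)
```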

#Monte Carlo Simulation#

#Monte Carlo simulation: Running 100 random samples#
set.seed(42)
monte_carlo_results <- replicate(100, {
  sample_df <- pacers_shots |> slice_sample(n = sample_size, replace = TRUE)
  mean(sample_df$SHOT_DIST, na.rm = TRUE)
})

#Visualizing Monte Carlo results#
ggplot(data.frame(avg_shot_dist = monte_carlo_results), aes(x = avg_shot_dist)) +
  geom_histogram(bins = 30, fill = "blue", color = "black") +
  labs(
    title = "Monte Carlo Simulation: Shot Distance Variation for Pacers",
    x = "Average Shot Distance (ft)",
    y = "Frequency"
  ) +
  theme_minimal()

#Calculate the average shot distance for Pacers#
avg_shot_dist_pacers <- mean(pacers_shots$SHOT_DIST, na.rm = TRUE)

print(avg_shot_dist_pacers)

## [1] 13.88843

The average shot distance in the full Pacers dataset is 13.888 ft. Because it falls within the range of sample means produced by the Monte Carlo simulation, we can consider this a success.

Why is This a Success?

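The real value is inside the expected range: The full-dataset mean sits inside the spread of resampled means, so random sampling variability alone can explain the differences between samples. Because every sample is drawn from the same data, this is a consistency check rather than an independent test. There’s no strong evidence that the Pacers’ shot distance in any one sample is significantly different from what would naturally occur.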

No anomaly detected: If the real shot distance were outside this range (e.g., below 13.5 ft or above 14.3 ft), we might suspect a real pattern or data collection bias. Since it’s within the range, the samples behave as expected under random sampling.

Monte Carlo simulation validated the result: The simulation shows that even with different samples, average shot distances naturally fluctuate within roughly 13.6 to 14.2 ft. Since 13.888 ft is within this range, it reinforces that our dataset and sample selection process are sound.
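
Instead of reading the range off the histogram, the "expected range" can be computed as a percentile interval of the simulated means. The sketch below mirrors the resampling loop above in base R. `toy_shot_dist` is a simulated stand-in for `pacers_shots$SHOT_DIST`, not real data, and the 95% level is an illustrative choice.

```r
set.seed(42)

# Hypothetical stand-in for pacers_shots$SHOT_DIST (simulated, not real data)
toy_shot_dist <- runif(1000, min = 0, max = 30)
sample_size <- floor(length(toy_shot_dist) * 0.5)

# 100 resampled means, drawn with replacement as in the simulation above
mc_means <- replicate(100, {
  mean(sample(toy_shot_dist, size = sample_size, replace = TRUE))
})

# Middle 95% of the simulated means, and the full-data mean
interval  <- quantile(mc_means, probs = c(0.025, 0.975))
full_mean <- mean(toy_shot_dist)

# TRUE if the full-data mean lies inside the percentile interval
inside <- full_mean >= interval[[1]] && full_mean <= interval[[2]]

print(interval)
print(inside)
```

With the real data, the same two lines (`quantile()` on `monte_carlo_results` and a comparison with `avg_shot_dist_pacers`) would replace the visual check.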

jace vayhinger

2025-02-04