For this data dive I am going to work only with shots from games involving the Pacers (rows whose MATCHUP contains "IND"). This reduces the amount of data in each sample and should help us reach cleaner conclusions. Note that this filter keeps shots by both the Pacers and their opponents in those games.
#Load dplyr, stringr, and ggplot2, which the code below relies on#
library(tidyverse)
shot_logs <- read.csv("C:/Users/13177/OneDrive/Stats for Data Science/filtered_shot_logs.csv")
head(shot_logs)
## GAME_ID MATCHUP LOCATION W.L FINAL_MARGIN SHOT_NUMBER
## 1 21400899 MAR 04, 2015 - CHA @ BKN A W 24 1
## 2 21400899 MAR 04, 2015 - CHA @ BKN A W 24 2
## 3 21400899 MAR 04, 2015 - CHA @ BKN A W 24 3
## 4 21400899 MAR 04, 2015 - CHA @ BKN A W 24 4
## 5 21400899 MAR 04, 2015 - CHA @ BKN A W 24 5
## 6 21400899 MAR 04, 2015 - CHA @ BKN A W 24 6
## PERIOD GAME_CLOCK SHOT_CLOCK DRIBBLES TOUCH_TIME SHOT_DIST PTS_TYPE
## 1 1 1:09 10.8 2 1.9 7.7 2
## 2 1 0:14 3.4 0 0.8 28.2 3
## 3 1 0:00 NA 3 2.7 10.1 2
## 4 2 11:47 10.3 2 1.9 17.2 2
## 5 2 10:34 10.9 2 2.7 3.7 2
## 6 2 8:15 9.1 2 4.4 18.4 2
## SHOT_RESULT CLOSEST_DEFENDER CLOSEST_DEFENDER_PLAYER_ID CLOSE_DEF_DIST FGM
## 1 made Anderson, Alan 101187 1.3 1
## 2 missed Bogdanovic, Bojan 202711 6.1 0
## 3 missed Bogdanovic, Bojan 202711 0.9 0
## 4 missed Brown, Markel 203900 3.4 0
## 5 missed Young, Thaddeus 201152 1.1 0
## 6 missed Williams, Deron 101114 2.6 0
## PTS player_name player_id
## 1 2 brian roberts 203148
## 2 0 brian roberts 203148
## 3 0 brian roberts 203148
## 4 0 brian roberts 203148
## 5 0 brian roberts 203148
## 6 0 brian roberts 203148
Filter for IND matchups
#Filter data for Pacers games#
pacers_shots <- shot_logs |> filter(str_detect(MATCHUP, "IND"))
head(pacers_shots)
## GAME_ID MATCHUP LOCATION W.L FINAL_MARGIN SHOT_NUMBER
## 1 21400768 FEB 08, 2015 - CHA vs. IND H L -1 1
## 2 21400768 FEB 08, 2015 - CHA vs. IND H L -1 2
## 3 21400768 FEB 08, 2015 - CHA vs. IND H L -1 3
## 4 21400768 FEB 08, 2015 - CHA vs. IND H L -1 4
## 5 21400768 FEB 08, 2015 - CHA vs. IND H L -1 5
## 6 21400768 FEB 08, 2015 - CHA vs. IND H L -1 6
## PERIOD GAME_CLOCK SHOT_CLOCK DRIBBLES TOUCH_TIME SHOT_DIST PTS_TYPE
## 1 1 5:58 19.7 0 1.0 24.7 3
## 2 1 5:21 12.8 0 0.9 22.5 3
## 3 1 4:22 16.5 7 6.2 8.8 2
## 4 1 0:31 11.3 0 0.8 20.0 2
## 5 2 11:47 9.7 1 0.8 24.7 3
## 6 2 9:18 0.7 3 3.7 5.6 2
## SHOT_RESULT CLOSEST_DEFENDER CLOSEST_DEFENDER_PLAYER_ID CLOSE_DEF_DIST FGM
## 1 made Miles, CJ 101139 11.3 1
## 2 made Hill, George 201588 7.2 1
## 3 missed Watson, CJ 201228 2.5 0
## 4 made Watson, CJ 201228 1.9 1
## 5 missed Watson, CJ 201228 4.3 0
## 6 made Whittington, Shayne 203963 2.4 1
## PTS player_name player_id
## 1 3 brian roberts 203148
## 2 3 brian roberts 203148
## 3 0 brian roberts 203148
## 4 2 brian roberts 203148
## 5 0 brian roberts 203148
## 6 2 brian roberts 203148
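The rows printed above show Charlotte's Brian Roberts shooting in a "CHA vs. IND" game, so matching "IND" anywhere in MATCHUP keeps both teams' shots. A minimal sketch of how the two groups could be separated, assuming (from the printed rows only) that the team code right after " - " is the shooter's team. The toy_shots data frame is a made-up stand-in for shot_logs.

```r
# Toy stand-in for shot_logs; only the MATCHUP format mirrors the real file.
toy_shots <- data.frame(
  MATCHUP = c("FEB 08, 2015 - CHA vs. IND",
              "FEB 08, 2015 - IND @ CHA",
              "JAN 27, 2015 - IND vs. TOR",
              "MAR 04, 2015 - CHA @ BKN"),
  player_name = c("brian roberts", "david west", "luis scola", "brian roberts"),
  stringsAsFactors = FALSE
)

# Any game involving Indiana (what the filter above selects).
in_pacers_game <- grepl("IND", toy_shots$MATCHUP, fixed = TRUE)

# Assumption: the first team code after " - " belongs to the shooter.
pacers_shooting <- grepl(" - IND ", toy_shots$MATCHUP, fixed = TRUE)

pacers_only    <- toy_shots[pacers_shooting, ]
opponents_only <- toy_shots[in_pacers_game & !pacers_shooting, ]

print(pacers_only)
print(opponents_only)
```

Whether to keep opponents' shots depends on the question being asked; the rest of this analysis uses all shots from Pacers games.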
#Generate 5 Random Samples (50% Each, With Replacement)#
set.seed(42)
#Define sample size (50% of Pacers dataset)#
sample_size <- floor(nrow(pacers_shots) * 0.5)
#Generate five random samples with replacement#
df_1 <- pacers_shots |> slice_sample(n = sample_size, replace = TRUE)
df_2 <- pacers_shots |> slice_sample(n = sample_size, replace = TRUE)
df_3 <- pacers_shots |> slice_sample(n = sample_size, replace = TRUE)
df_4 <- pacers_shots |> slice_sample(n = sample_size, replace = TRUE)
df_5 <- pacers_shots |> slice_sample(n = sample_size, replace = TRUE)
#Check first few rows of one sample#
head(df_1)
## GAME_ID MATCHUP LOCATION W.L FINAL_MARGIN SHOT_NUMBER
## 1 21400675 JAN 27, 2015 - IND vs. TOR H L -13 1
## 2 21400768 FEB 08, 2015 - IND @ CHA A W 1 13
## 3 21400675 JAN 27, 2015 - TOR @ IND A W 13 5
## 4 21400516 JAN 05, 2015 - UTA vs. IND H L -4 11
## 5 21400644 JAN 23, 2015 - MIA vs. IND H W 2 2
## 6 21400356 DEC 15, 2014 - IND vs. LAL H W 19 8
## PERIOD GAME_CLOCK SHOT_CLOCK DRIBBLES TOUCH_TIME SHOT_DIST PTS_TYPE
## 1 2 9:50 NA 7 6.3 3.5 2
## 2 4 3:54 10.0 5 4.8 8.4 2
## 3 3 1:40 NA 0 0.5 3.5 2
## 4 4 3:19 8.4 0 1.2 23.3 3
## 5 1 8:44 12.9 0 0.7 20.2 3
## 6 2 7:17 6.3 3 3.2 4.9 2
## SHOT_RESULT CLOSEST_DEFENDER CLOSEST_DEFENDER_PLAYER_ID CLOSE_DEF_DIST FGM
## 1 missed Patterson, Patrick 202335 0.5 0
## 2 missed Zeller, Cody 203469 3.0 0
## 3 made West, David 2561 2.6 1
## 4 made West, David 2561 18.4 1
## 5 made Hill, Solomon 203524 2.4 1
## 6 missed Young, Nick 201156 1.6 0
## PTS player_name player_id
## 1 0 luis scola 2449
## 2 0 david west 2561
## 3 2 amir johnson 101161
## 4 3 dante exum 203957
## 5 3 luol deng 2736
## 6 0 cj miles 101139
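The five samples above are drawn with five near-identical lines. One possible alternative, sketched here in base R on made-up data, keeps the samples in a named list so that the number of samples is a single parameter. The names toy_shots, draw_sample, and n_samples are hypothetical and not part of the original analysis.

```r
set.seed(42)

# Toy stand-in for pacers_shots with a single numeric column.
toy_shots <- data.frame(SHOT_DIST = round(runif(200, min = 0, max = 30), 1))

# Half the rows, rounded down so the size is always a whole number.
sample_size <- floor(nrow(toy_shots) * 0.5)
n_samples <- 5

# Draw row indices with replacement, then subset the data frame.
draw_sample <- function(df, n) {
  df[sample(nrow(df), size = n, replace = TRUE), , drop = FALSE]
}

samples <- lapply(seq_len(n_samples), function(i) draw_sample(toy_shots, sample_size))
names(samples) <- paste0("df_", seq_len(n_samples))

# Each element plays the role of df_1 ... df_5 above.
print(sapply(samples, nrow))
```

With the samples in a list, later steps can loop over them with lapply or sapply instead of repeating a call once per sample.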
#Comparing the Subsamples#
#Function to summarize key statistics for a sample#
summarize_sample <- function(df) {
df |> summarize(
avg_shot_dist = mean(SHOT_DIST, na.rm = TRUE),
avg_close_def_dist = mean(CLOSE_DEF_DIST, na.rm = TRUE),
made_percentage = mean(SHOT_RESULT == "made", na.rm = TRUE),
total_shots = n()
)
}
#Compute summaries for each sample#
summary_df <- bind_rows(
summarize_sample(df_1) |> mutate(sample = "df_1"),
summarize_sample(df_2) |> mutate(sample = "df_2"),
summarize_sample(df_3) |> mutate(sample = "df_3"),
summarize_sample(df_4) |> mutate(sample = "df_4"),
summarize_sample(df_5) |> mutate(sample = "df_5")
)
#Print the summary statistics#
print(summary_df)
## avg_shot_dist avg_close_def_dist made_percentage total_shots sample
## 1 14.02326 4.075608 0.4329395 4317 df_1
## 2 13.79646 3.979893 0.4310864 4317 df_2
## 3 13.82196 4.043966 0.4415103 4317 df_3
## 4 14.03312 4.090549 0.4334028 4317 df_4
## 5 13.83878 4.023535 0.4313180 4317 df_5
When we generated five random subsamples (each half the size of the dataset, drawn with replacement), the summary statistics fluctuated slightly from sample to sample: average shot distance ranged from about 13.80 to 14.03 ft, average defender distance from about 3.98 to 4.09 ft, and made-shot percentage from about 43.1% to 44.2%. This is the random variation that naturally occurs when drawing samples from a dataset. If we relied on just one sample, we might slightly overestimate or underestimate these quantities.
If different samples produced very different results, we would need to be cautious about drawing strong conclusions from any one of them. Here the values are all very close to each other, which suggests that estimates of these averages are stable at this sample size.
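To put a number on "very close", the spread of made_percentage across the five samples can be compared with the standard error expected for a proportion at this sample size. This is a rough sketch that reuses the values printed in the summary table and treats the average of the five sample rates as the underlying rate, which is an approximation.

```r
# made_percentage values copied from the summary table above.
made_pct <- c(df_1 = 0.4329395, df_2 = 0.4310864, df_3 = 0.4415103,
              df_4 = 0.4334028, df_5 = 0.4313180)
n <- 4317  # total_shots in each sample

# Approximate the underlying make rate by the average of the five samples.
p_hat <- mean(made_pct)

# Standard error of a proportion for n independent draws (about 0.0075 here).
se_p <- sqrt(p_hat * (1 - p_hat) / n)

# Observed gap between the highest and lowest sample.
observed_range <- diff(range(made_pct))

print(c(p_hat = p_hat, se = se_p, observed_range = observed_range))
```

The gap between the highest and lowest sample is a little over one standard error, which is the size of difference that random sampling alone would be expected to produce.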
#Anomaly Detection in the Subsamples#
#Checking extreme values in each sample (e.g., long-range shots > 30ft)#
df_1 |> filter(SHOT_DIST > 30) |> count()
## n
## 1 24
df_2 |> filter(SHOT_DIST > 30) |> count()
## n
## 1 25
df_3 |> filter(SHOT_DIST > 30) |> count()
## n
## 1 15
df_4 |> filter(SHOT_DIST > 30) |> count()
## n
## 1 16
df_5 |> filter(SHOT_DIST > 30) |> count()
## n
## 1 15
What Seems Like an Outlier in One Sample Might Not Be in Another
By checking long-range shots (>30 ft) across the five samples, we found counts ranging from 15 to 25, all well under 1% of the 4,317 shots in each sample. This tells us that apparent anomalies may just be a result of random selection rather than a meaningful pattern. If we only had one sample, we might incorrectly conclude that long shots are rarer or more common in Pacers games than they really are, depending on the sample.
Context matters when identifying anomalies: something that looks unusual in one sample might be completely normal when looking at the full dataset. This reinforces the need to validate outliers across multiple samples before assuming they are meaningful.
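One way to check whether the differences in long-shot counts are larger than chance would produce is to compare them with the spread expected from a binomial count. The sketch below uses the five counts printed above and assumes each shot independently has the same small probability of being longer than 30 ft, with that probability estimated from the five samples themselves.

```r
# Counts of shots longer than 30 ft in df_1 ... df_5, from the output above.
long_counts <- c(df_1 = 24, df_2 = 25, df_3 = 15, df_4 = 16, df_5 = 15)
n <- 4317  # shots per sample

# Estimated probability that a single sampled shot is longer than 30 ft.
rate_hat <- mean(long_counts) / n

# Standard deviation of a Binomial(n, rate_hat) count.
expected_sd <- sqrt(n * rate_hat * (1 - rate_hat))

# How many expected standard deviations each sample sits from the average count.
z_scores <- (long_counts - mean(long_counts)) / expected_sd

print(round(expected_sd, 2))
print(round(z_scores, 2))
```

Every sample sits within two expected standard deviations of the average count, so under this assumption none of the five counts stands out as unusual.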
#Visualizing the Differences#
# Combine all samples for visualization
df_combined <- bind_rows(
df_1 |> mutate(sample = "df_1"),
df_2 |> mutate(sample = "df_2"),
df_3 |> mutate(sample = "df_3"),
df_4 |> mutate(sample = "df_4"),
df_5 |> mutate(sample = "df_5")
)
# Boxplot to compare shot distance across samples
ggplot(df_combined, aes(x = sample, y = SHOT_DIST, fill = sample)) +
geom_boxplot(alpha = 0.7) +
labs(
title = "Shot Distance Variability Across Pacers Subsamples",
x = "Sample",
y = "Shot Distance (ft)"
) +
theme_minimal() +
theme(legend.position = "none")
The boxplot lets us compare the shot distance distribution across the five samples. In principle, some samples could show a wider spread (more variability) while others are more concentrated (less variability), so shot distance can look different depending on the sample chosen. These 5 samples happen to be very close in shot distance variability, though with more samples taken we could see larger differences.
A single dataset snapshot might not be representative of the overall distribution of shots. When interpreting player or team tendencies, it’s important to look at the overall trend rather than relying on one sample’s summary statistics.
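The visual comparison can be backed up with numbers by computing the median, interquartile range, and standard deviation of shot distance per sample. This is a sketch on made-up data: toy_shots stands in for pacers_shots, and the combined data frame is rebuilt here only so the example runs on its own.

```r
set.seed(42)

# Toy stand-in for pacers_shots.
toy_shots <- data.frame(SHOT_DIST = round(runif(1000, min = 0, max = 30), 1))

# Five half-size samples with replacement, stacked with a sample label.
samples <- lapply(1:5, function(i) {
  idx <- sample(nrow(toy_shots), size = 500, replace = TRUE)
  data.frame(SHOT_DIST = toy_shots$SHOT_DIST[idx], sample = paste0("df_", i))
})
df_combined <- do.call(rbind, samples)

# One column per sample; rows are the statistics a boxplot summarizes visually.
spread_stats <- sapply(
  split(df_combined$SHOT_DIST, df_combined$sample),
  function(x) c(median = median(x), iqr = IQR(x), sd = sd(x))
)

print(round(spread_stats, 2))
```

If the iqr and sd rows are nearly constant across columns, the samples have similar variability, which is what similar box heights in the plot indicate.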
#Monte Carlo Simulation#
#Monte Carlo simulation: Running 100 random samples#
set.seed(42)
monte_carlo_results <- replicate(100, {
sample_df <- pacers_shots |> slice_sample(n = sample_size, replace = TRUE)
mean(sample_df$SHOT_DIST, na.rm = TRUE)
})
#Visualizing Monte Carlo results#
ggplot(data.frame(avg_shot_dist = monte_carlo_results), aes(x = avg_shot_dist)) +
geom_histogram(bins = 30, fill = "blue", color = "black") +
labs(
title = "Monte Carlo Simulation: Shot Distance Variation for Pacers",
x = "Average Shot Distance (ft)",
y = "Frequency"
) +
theme_minimal()
#Calculate the average shot distance for Pacers#
avg_shot_dist_pacers <- mean(pacers_shots$SHOT_DIST, na.rm = TRUE)
print(avg_shot_dist_pacers)
## [1] 13.88843
The average shot distance across all shots in Pacers games is 13.888 ft, which falls within the range of the simulated sample means, so we can consider this a success.
Why is This a Success?
The real value is inside the expected range: The full-data average sits inside the spread of the simulated sample means, which is what we expect when every sample is drawn from that same data. There is no sign that the sampling process is biased toward shorter or longer shots.
No anomaly detected: If the real shot distance were outside this range (e.g., below 13.5 ft or above 14.3 ft), we might suspect a problem in the sampling code or a data collection bias. Since it is within the range, the samples behave the way random sampling predicts.
Monte Carlo simulation validated the result: The simulation shows that, across different samples, the average shot distance fluctuates within roughly 13.6 to 14.2 ft. Since 13.888 ft is within this range, it reinforces that our dataset and sample selection process are sound.
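The ranges quoted above are read off the histogram. A percentile interval gives the same kind of range numerically, and the standard deviation of the simulated means can be compared with the textbook standard error. The sketch below repeats the simulation in base R on made-up data, with toy_shots as a hypothetical stand-in for pacers_shots, so the numbers it prints will not match the real analysis.

```r
set.seed(42)

# Toy stand-in for pacers_shots.
toy_shots <- data.frame(SHOT_DIST = round(runif(2000, min = 0, max = 30), 1))
sample_size <- floor(nrow(toy_shots) * 0.5)

# 100 half-size samples with replacement; keep each sample's mean.
monte_carlo_results <- replicate(100, {
  idx <- sample(nrow(toy_shots), size = sample_size, replace = TRUE)
  mean(toy_shots$SHOT_DIST[idx])
})

full_mean <- mean(toy_shots$SHOT_DIST)

# Middle 95% of the simulated sample means.
interval <- quantile(monte_carlo_results, probs = c(0.025, 0.975))
inside <- full_mean >= interval[[1]] && full_mean <= interval[[2]]

# Simulated spread versus the usual standard error of a mean.
sim_se <- sd(monte_carlo_results)
theory_se <- sd(toy_shots$SHOT_DIST) / sqrt(sample_size)

print(interval)
print(c(full_mean = full_mean, sim_se = sim_se, theory_se = theory_se))
print(inside)
```

Because the samples come from the same data as the full mean, the full mean landing inside the interval mainly confirms that the sampling code behaves as intended. The more informative output is the width of the interval, which shows how much a half-size sample's average can be expected to move.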