LJ Data Dive - Week 4

#Package Loading
library(Hmisc)

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:base':
## 
##     format.pval, units

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()    masks stats::filter()
## ✖ dplyr::lag()       masks stats::lag()
## ✖ dplyr::src()       masks Hmisc::src()
## ✖ dplyr::summarize() masks Hmisc::summarize()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)
library(gridExtra)

## 
## Attaching package: 'gridExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine

#Loading the MoneyPuck Shot Dataset
mpd = read.csv('C:\\Users\\Logan\\Downloads\\shots_2024.csv')

#adding descriptors to dataframe

# Load the data dictionary (update with your file path)
data_dict <- read.csv("C:\\Users\\Logan\\Downloads\\MoneyPuck_Shot_Data_Dictionary (1).csv")

# Iterate through the data dictionary and assign labels (from ChatGPT -- QOL Step)
for (i in 1:nrow(data_dict)) {
  column_name <- data_dict$Variable[i]
  description <- data_dict$Definition[i]
  
  if (column_name %in% colnames(mpd)) {
    label(mpd[[column_name]]) <- description
  }
}

#For this week, filter down to only games where NYR played DET (home or away).

mpd_NYR_DET = mpd |> filter((homeTeamCode == "NYR" & awayTeamCode == "DET") | (homeTeamCode == "DET" & awayTeamCode == "NYR"))

###Filtering

For this analysis, and for anomaly detection, I will be narrowing the columns down to five to avoid having to analyze all 100 columns. I have selected the following columns which are closely related to most of the others and will capture a lot of the qualitative nuance of the set.

Outcome & Spatial Metrics: Together, goal, shotDistance, and shotAngleAdjusted allow you to check if the offensive quality or conditions of shots in your sample differ from the overall game conditions.

Contextual Metrics: The shooterTimeOnIce and timeSinceFaceoff provide insight into the situational aspects of the shot (like fatigue or play setup). A sample with unusually high or low values in these fields might be capturing a non-representative game context.

#Selecting Five Columns for anomaly detection

mpd_NYR_DET <- mpd_NYR_DET %>%
  select(game_id, goal, shotDistance, shotAngleAdjusted, shooterTimeOnIce, timeSinceFaceoff)

table(mpd_NYR_DET$game_id)

## 
## 20045 20063 20228 
##    86    93    96

Random Sampling

set.seed(8765309)
sample_size = floor(0.5 * nrow(mpd_NYR_DET))

# Generate 5 random samples with replacement
aa_samp_df_1 <- mpd_NYR_DET[sample(nrow(mpd_NYR_DET), sample_size, replace = TRUE), ]
aa_samp_df_2 <- mpd_NYR_DET[sample(nrow(mpd_NYR_DET), sample_size, replace = TRUE), ]
aa_samp_df_3 <- mpd_NYR_DET[sample(nrow(mpd_NYR_DET), sample_size, replace = TRUE), ]
aa_samp_df_4 <- mpd_NYR_DET[sample(nrow(mpd_NYR_DET), sample_size, replace = TRUE), ]
aa_samp_df_5 <- mpd_NYR_DET[sample(nrow(mpd_NYR_DET), sample_size, replace = TRUE), ]

##Looking at Goals in Each

# Function to count goals in each sample
count_goals <- function(df, sample_name) {
  df %>%
    summarize(goal_count = sum(goal)) %>%
    mutate(sample = sample_name)
}

# Count goals in each sample and combine into a table
goal_counts <- bind_rows(
  count_goals(aa_samp_df_1, "Sample 1"),
  count_goals(aa_samp_df_2, "Sample 2"),
  count_goals(aa_samp_df_3, "Sample 3"),
  count_goals(aa_samp_df_4, "Sample 4"),
  count_goals(aa_samp_df_5, "Sample 5")
)

# Display as a table
print(goal_counts)

##   goal_count   sample
## 1          8 Sample 1
## 2         11 Sample 2
## 3         12 Sample 3
## 4          8 Sample 4
## 5          7 Sample 5

Each has a similar number of goals, with samples 2 and 3 having a higher amount.

Analyzing Anomalies by Distribution

# Determine the x-axis limits for shotDistance
x_limits <- range(mpd_NYR_DET$shotDistance, na.rm = TRUE)

# Create histograms for shotDistance
p1 = ggplot(aa_samp_df_1, aes(x = shotDistance)) +
  geom_histogram(fill = "blue", alpha = 0.5, bins = 18) +
  labs(title = "Sample 1", x = "Shot Distance", y = "Frequency") + xlim(x_limits)

p2 = ggplot(aa_samp_df_2, aes(x = shotDistance)) +
  geom_histogram(fill = "green", alpha = 0.5, bins = 18) +
  labs(title = "Sample 2", x = "Shot Distance", y = "Frequency") + xlim(x_limits)

p3 = ggplot(aa_samp_df_3, aes(x = shotDistance)) +
  geom_histogram(fill = "purple", alpha = 0.5, bins = 18) +
  labs(title = "Sample 3", x = "Shot Distance", y = "Frequency") + xlim(x_limits)

p4 = ggplot(aa_samp_df_4, aes(x = shotDistance)) +
  geom_histogram(fill = "orange", alpha = 0.5, bins = 18) +
  labs(title = "Sample 4", x = "Shot Distance", y = "Frequency") + xlim(x_limits)

p5 = ggplot(aa_samp_df_5, aes(x = shotDistance)) +
  geom_histogram(fill = "pink", alpha = 0.5, bins = 18) +
  labs(title = "Sample 5", x = "Shot Distance", y = "Frequency") + xlim(x_limits)

p6 = ggplot(mpd_NYR_DET, aes(x = shotDistance)) +
  geom_histogram(fill = "red", alpha = 0.5, bins = 18) +
  labs(title = "Set", x = "Shot Distance", y = "Frequency") + xlim(x_limits)

# Arrange histograms side by side
grid.arrange(p6, p1, p2, p3, p4, p5, ncol = 3)

## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_bar()`).
## Removed 2 rows containing missing values or values outside the scale range
## (`geom_bar()`).
## Removed 2 rows containing missing values or values outside the scale range
## (`geom_bar()`).
## Removed 2 rows containing missing values or values outside the scale range
## (`geom_bar()`).
## Removed 2 rows containing missing values or values outside the scale range
## (`geom_bar()`).
## Removed 2 rows containing missing values or values outside the scale range
## (`geom_bar()`).

Looking at the distributions of the main set, is characterized at two distinct peaks and right skew. One at the 0 to 10 range, a second peak at the 30 to 40 range, and a slope downward through the rest of the set. Notably, sample three appears to be an anomaly, in that its second peak never really manifests. It has a high early peak, but flattens the rest of the set. Interestingly, in all but sample 3, the first peak is shorter than the second peak, and a third peak emerges around the 50 foot range.

# Determine the x-axis limits for shotAngleAdjusted
x_limits_angle <- range(mpd_NYR_DET$shotAngleAdjusted, na.rm = TRUE)

# Create histograms for shotAngleAdjusted
p1 = ggplot(aa_samp_df_1, aes(x = shotAngleAdjusted)) +
  geom_histogram(fill = "blue", alpha = 0.5, bins = 18) +
  labs(title = "Sample 1", x = "Shot Angle Adjusted", y = "Frequency") + xlim(x_limits_angle)

p2 = ggplot(aa_samp_df_2, aes(x = shotAngleAdjusted)) +
  geom_histogram(fill = "green", alpha = 0.5, bins = 18) +
  labs(title = "Sample 2", x = "Shot Angle Adjusted", y = "Frequency") + xlim(x_limits_angle)

p3 = ggplot(aa_samp_df_3, aes(x = shotAngleAdjusted)) +
  geom_histogram(fill = "purple", alpha = 0.5, bins = 18) +
  labs(title = "Sample 3", x = "Shot Angle Adjusted", y = "Frequency") + xlim(x_limits_angle)

p4 = ggplot(aa_samp_df_4, aes(x = shotAngleAdjusted)) +
  geom_histogram(fill = "orange", alpha = 0.5, bins = 18) +
  labs(title = "Sample 4", x = "Shot Angle Adjusted", y = "Frequency") + xlim(x_limits_angle)

p5 = ggplot(aa_samp_df_5, aes(x = shotAngleAdjusted)) +
  geom_histogram(fill = "pink", alpha = 0.5, bins = 18) +
  labs(title = "Sample 5", x = "Shot Angle Adjusted", y = "Frequency") + xlim(x_limits_angle)

p6 = ggplot(mpd_NYR_DET, aes(x = shotAngleAdjusted)) +
  geom_histogram(fill = "red", alpha = 0.5, bins = 18) +
  labs(title = "Set", x = "Shot Angle Adjusted", y = "Frequency") + xlim(x_limits_angle)

# Arrange histograms side by side
grid.arrange(p6, p1, p2, p3, p4, p5, ncol = 3)

## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_bar()`).
## Removed 2 rows containing missing values or values outside the scale range
## (`geom_bar()`).
## Removed 2 rows containing missing values or values outside the scale range
## (`geom_bar()`).
## Removed 2 rows containing missing values or values outside the scale range
## (`geom_bar()`).
## Removed 2 rows containing missing values or values outside the scale range
## (`geom_bar()`).
## Removed 2 rows containing missing values or values outside the scale range
## (`geom_bar()`).

Looking at the distributions of the main set, is characterized at one peak around 25-40, and a slight right skew. Samples 3, 4, and 5 match this distribution closely, with sample five having a little more variation at the tails. Sample 1 is a little more uniform, but still has a slight peak. Sample 2 is bi-modal with a second peak at 70, a severe anomaly given the low presence of 40+ in the set.

# Determine the x-axis limits for shooterTimeOnIce
x_limits_time <- range(mpd_NYR_DET$shooterTimeOnIce, na.rm = TRUE)

# Create histograms for shooterTimeOnIce
p1 = ggplot(aa_samp_df_1, aes(x = shooterTimeOnIce)) +
  geom_histogram(fill = "blue", alpha = 0.5, bins = 18) +
  labs(title = "Sample 1", x = "Shooter Time On Ice", y = "Frequency") + xlim(x_limits_time)

p2 = ggplot(aa_samp_df_2, aes(x = shooterTimeOnIce)) +
  geom_histogram(fill = "green", alpha = 0.5, bins = 18) +
  labs(title = "Sample 2", x = "Shooter Time On Ice", y = "Frequency") + xlim(x_limits_time)

p3 = ggplot(aa_samp_df_3, aes(x = shooterTimeOnIce)) +
  geom_histogram(fill = "purple", alpha = 0.5, bins = 18) +
  labs(title = "Sample 3", x = "Shooter Time On Ice", y = "Frequency") + xlim(x_limits_time)

p4 = ggplot(aa_samp_df_4, aes(x = shooterTimeOnIce)) +
  geom_histogram(fill = "orange", alpha = 0.5, bins = 18) +
  labs(title = "Sample 4", x = "Shooter Time On Ice", y = "Frequency") + xlim(x_limits_time)

p5 = ggplot(aa_samp_df_5, aes(x = shooterTimeOnIce)) +
  geom_histogram(fill = "pink", alpha = 0.5, bins = 18) +
  labs(title = "Sample 5", x = "Shooter Time On Ice", y = "Frequency") + xlim(x_limits_time)

p6 = ggplot(mpd_NYR_DET, aes(x = shooterTimeOnIce)) +
  geom_histogram(fill = "red", alpha = 0.5, bins = 18) +
  labs(title = "Set", x = "Shooter Time On Ice", y = "Frequency") + xlim(x_limits_time)

# Arrange histograms side by side
grid.arrange(p6, p1, p2, p3, p4, p5, ncol = 3)

## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_bar()`).
## Removed 2 rows containing missing values or values outside the scale range
## (`geom_bar()`).
## Removed 2 rows containing missing values or values outside the scale range
## (`geom_bar()`).
## Removed 2 rows containing missing values or values outside the scale range
## (`geom_bar()`).
## Removed 2 rows containing missing values or values outside the scale range
## (`geom_bar()`).
## Removed 2 rows containing missing values or values outside the scale range
## (`geom_bar()`).

Looking at the distributions of the main set, it is heavily right skewed with a majority of the data falling around the 0-50 second range. Samples four and five match this distribution closely, with samples 2 and three a little more spread out but still similar. Sample 1 has a much higher concentration around the 50 mark, and is missing the lower values of the set.

# Determine the x-axis limits for timeSinceFaceoff
x_limits_faceoff <- range(mpd_NYR_DET$timeSinceFaceoff, na.rm = TRUE)

# Create histograms for timeSinceFaceoff
p1 = ggplot(aa_samp_df_1, aes(x = timeSinceFaceoff)) +
  geom_histogram(fill = "blue", alpha = 0.5, bins = 18) +
  labs(title = "Sample 1", x = "Time Since Faceoff", y = "Frequency") + xlim(x_limits_faceoff)

p2 = ggplot(aa_samp_df_2, aes(x = timeSinceFaceoff)) +
  geom_histogram(fill = "green", alpha = 0.5, bins = 18) +
  labs(title = "Sample 2", x = "Time Since Faceoff", y = "Frequency") + xlim(x_limits_faceoff)

p3 = ggplot(aa_samp_df_3, aes(x = timeSinceFaceoff)) +
  geom_histogram(fill = "purple", alpha = 0.5, bins = 18) +
  labs(title = "Sample 3", x = "Time Since Faceoff", y = "Frequency") + xlim(x_limits_faceoff)

p4 = ggplot(aa_samp_df_4, aes(x = timeSinceFaceoff)) +
  geom_histogram(fill = "orange", alpha = 0.5, bins = 18) +
  labs(title = "Sample 4", x = "Time Since Faceoff", y = "Frequency") + xlim(x_limits_faceoff)

p5 = ggplot(aa_samp_df_5, aes(x = timeSinceFaceoff)) +
  geom_histogram(fill = "pink", alpha = 0.5, bins = 18) +
  labs(title = "Sample 5", x = "Time Since Faceoff", y = "Frequency") + xlim(x_limits_faceoff)

p6 = ggplot(mpd_NYR_DET, aes(x = timeSinceFaceoff)) +
  geom_histogram(fill = "red", alpha = 0.5, bins = 18) +
  labs(title = "Set", x = "Time Since Faceoff", y = "Frequency") + xlim(x_limits_faceoff)

# Arrange histograms side by side
grid.arrange(p6, p1, p2, p3, p4, p5, ncol = 3)

## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_bar()`).
## Removed 2 rows containing missing values or values outside the scale range
## (`geom_bar()`).
## Removed 2 rows containing missing values or values outside the scale range
## (`geom_bar()`).
## Removed 2 rows containing missing values or values outside the scale range
## (`geom_bar()`).
## Removed 2 rows containing missing values or values outside the scale range
## (`geom_bar()`).
## Removed 2 rows containing missing values or values outside the scale range
## (`geom_bar()`).

Looking at the distributions of the main set, it is right skewed and heavily concentrated around zero. The other distributions all match this to varying degrees. Side note: Sample three is the prettiest sample I’ve ever seen.

Conclusion

Each of the samples could represent a given attribute in this set well. No sample is perfect, but for various different reasons each could be considered good.

It is interesting to note the relationship between original set and the sample distributions. The more narrow the distribution of the original set, the more likely the sample distribution is to follow it (which is why the Time Since FaceOff samples match so closely to the original).

Consider how this investigation affects how you might draw conclusions about the data in the future.

This investigation shows how the distribution of the original dataset can greatly impact the representatives of sample distributions. For example, the narrow distribution of timeSinceFaceoff resulted in sample distributions that closely matched the original. Moving forward, I’ll make sure to consider these distribution characteristics before drawing any conclusions, ensuring my insights are both accurate and reliable.

LJ Data Dive - Week 4

Logan Johnson

2025-02-10

Random Sampling

Analyzing Anomalies by Distribution

Conclusion

Consider how this investigation affects how you might draw conclusions about the data in the future.