#Package Loading
library(Hmisc)

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:base':
## 
##     format.pval, units

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()    masks stats::filter()
## ✖ dplyr::lag()       masks stats::lag()
## ✖ dplyr::src()       masks Hmisc::src()
## ✖ dplyr::summarize() masks Hmisc::summarize()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)
library(haven)
library(plotly)

## Warning: package 'plotly' was built under R version 4.4.3

## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:Hmisc':
## 
##     subplot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout

library(ggthemes)
library(ggrepel)
library(AmesHousing)
library(boot)
library(broom)
library(lindia)

#Loading the MoneyPuck Shot Dataset
mpd = read.csv('./shots_2024.csv')

#adding descriptors to dataframe

#Load the data dictionary (update with your file path)
data_dict <- read.csv("C:/Users/Logan/Downloads/MoneyPuck_Shot_Data_Dictionary (1) (1).csv")

#Iterate through the data dictionary and assign labels (from ChatGPT -- QOL Step)

for (i in 1:nrow(data_dict)) {
  column_name <- data_dict$Variable[i]
  description <- data_dict$Definition[i]
  
  if (column_name %in% colnames(mpd)) {
    label(mpd[[column_name]]) <- description
  }
}

#mpd <- mpd %>% 
  #filter(playerPositionThatDidEvent != "" & 
         #!is.na(playerPositionThatDidEvent) & 
         #playerPositionThatDidEvent != "G")

Regression: Shot Distance and xGoal

mpd <- mpd %>% filter(mpd$teamCode == "NYR")

# Run linear regression model
lm_model <- lm(xGoal ~ mpd$shotDistance, data = mpd)
summary(lm_model)

## 
## Call:
## lm(formula = xGoal ~ mpd$shotDistance, data = mpd)
## 
## Residuals:
## The probability the shot will be a goal. Also known as "Expected Goals" 
##      Min       1Q   Median       3Q      Max 
## -0.13978 -0.04358 -0.01749  0.01493  0.80881 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       0.1605025  0.0042153   38.08   <2e-16 ***
## mpd$shotDistance -0.0026017  0.0001092  -23.83   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09053 on 1876 degrees of freedom
## Multiple R-squared:  0.2323, Adjusted R-squared:  0.2319 
## F-statistic: 567.7 on 1 and 1876 DF,  p-value: < 2.2e-16

# Scatter plot with regression line
ggplot(mpd, aes(x = shotDistance, y = xGoal)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", color = "blue") +
  labs(title = "XGoal vs. Shot Distance", 
       x = "Shot Distance", 
       y = "Expected Goals") +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

The analysis of shot distance as a predictor for expected goals (xGoal) reveals a statistically significant relationship, with a p-value less than 2e-16 for both the intercept and shot distance coefficient. The coefficient for shot distance is -0.002353, suggesting that as the shot distance increases, the expected probability of a goal slightly decreases. The R-squared value of 0.2009 indicates that shot distance accounts for only about 20% of the variability in xGoal, pointing to a weak linear relationship. This suggests that while shot distance has some influence on expected goals, it is not a strong enough predictor to explain much of the variation in the dataset. Furthermore, the lack of a strong linear relationship across other continuous variables in the dataset implies that expected goals are likely driven by a combination of factors beyond just shot distance, indicating that the dataset may require more complex modeling techniques, such as incorporating non-linear relationships or exploring other influential factors, to better understand the predictors of xGoal.

Adding a Binary Term – Off Handedness

Next, we will add the binary term offhandedness. The binary term tests whether the relationship between shot distance and xGoal differs by shooter handedness. For example, it might be that off-hand shooters are less penalized by longer distances because they can get a better angle.

mpd$offWing <- as.character(as_factor(mpd$offWing))

# Model for offWing = 0
model_offWing0 <- lm(goal ~ shotDistance, data = subset(mpd, offWing == 0))
summary(model_offWing0)

## 
## Call:
## lm(formula = goal ~ shotDistance, data = subset(mpd, offWing == 
##     0))
## 
## Residuals:
## Set to 1 if shot was a goal. Otherwise 0 
##      Min       1Q   Median       3Q      Max 
## -0.13019 -0.08899 -0.05564 -0.01709  1.04370 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.134446   0.014231   9.447  < 2e-16 ***
## shotDistance -0.002129   0.000352  -6.049 1.98e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2325 on 1130 degrees of freedom
## Multiple R-squared:  0.03136,    Adjusted R-squared:  0.03051 
## F-statistic: 36.59 on 1 and 1130 DF,  p-value: 1.981e-09

# Model for offWing = 1
model_offWing1 <- lm(goal ~ shotDistance, data = subset(mpd, offWing == 1))
summary(model_offWing1)

## 
## Call:
## lm(formula = goal ~ shotDistance, data = subset(mpd, offWing == 
##     1))
## 
## Residuals:
## Set to 1 if shot was a goal. Otherwise 0 
##      Min       1Q   Median       3Q      Max 
## -0.11974 -0.09613 -0.07290 -0.04507  0.95783 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.1268635  0.0188367   6.735 3.28e-11 ***
## shotDistance -0.0017269  0.0005284  -3.268  0.00113 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2598 on 744 degrees of freedom
## Multiple R-squared:  0.01415,    Adjusted R-squared:  0.01283 
## F-statistic: 10.68 on 1 and 744 DF,  p-value: 0.001132

Graph for Model 1:

ggplot(mpd, aes(x = shotDistance, y = xGoal, color = as.factor(offWing))) +
  geom_point(alpha = 0.5) +  # Scatter plot
  geom_smooth(method = "lm", se = FALSE) +  # Separate regression lines
  labs(title = "Regression Lines for Shot Distance vs Goal Probability by Off-Wing Status",
       x = "Shot Distance",
       y = "Goal Probability",
       color = "Off-Wing Status") +
  scale_color_manual(values = c("blue", "orange"), labels = c("Not Off-Wing", "Off-Wing")) +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

# Broken Out Into Seperate Plots

# Plot for offWing = 0
plot_offWing0 <- ggplot(subset(mpd, offWing == 0), aes(x = shotDistance, y = xGoal)) +
  geom_point(alpha = 0.5, color = "blue") +
  geom_smooth(method = "lm", se = FALSE, color = "darkblue") +
  labs(title = "Regression for Not Off-Wing (0)",
       x = "Shot Distance",
       y = "Goal Probability") +
  theme_minimal()

# Plot for offWing = 1
plot_offWing1 <- ggplot(subset(mpd, offWing == 1), aes(x = shotDistance, y = xGoal)) +
  geom_point(alpha = 0.5, color = "red") +
  geom_smooth(method = "lm", se = FALSE, color = "darkred") +
  labs(title = "Regression for Off-Wing (1)",
       x = "Shot Distance",
       y = "Goal Probability") +
  theme_minimal()

# Display both plots
library(gridExtra)

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

grid.arrange(plot_offWing0, plot_offWing1, ncol = 2)

## `geom_smooth()` using formula = 'y ~ x'

## `geom_smooth()` using formula = 'y ~ x'

Adding the binary term reveals that not much changes in our model between shooter handedness by distance. Future exploration would probably be more effective looking at this term in relation to shot angle.

Adding Shot Angle and generating 3D plot

# Fit a linear regression model
model <- lm(xGoal ~ shotDistance + shotAngle, data = mpd)

# Display the summary of the model
summary(model)

## 
## Call:
## lm(formula = xGoal ~ shotDistance + shotAngle, data = mpd)
## 
## Residuals:
## The probability the shot will be a goal. Also known as "Expected Goals" 
##      Min       1Q   Median       3Q      Max 
## -0.14136 -0.04311 -0.01764  0.01513  0.80916 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.605e-01  4.217e-03  38.070   <2e-16 ***
## shotDistance -2.601e-03  1.092e-04 -23.818   <2e-16 ***
## shotAngle     2.056e-05  4.986e-05   0.412     0.68    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09055 on 1875 degrees of freedom
## Multiple R-squared:  0.2324, Adjusted R-squared:  0.2316 
## F-statistic: 283.8 on 2 and 1875 DF,  p-value: < 2.2e-16

The nearly identical R² values between the regression models with and without shotAngle suggest that shotDistance is the primary predictor of xGoal, while adding shotAngle does not significantly improve the model’s explanatory power. This could indicate multicollinearity between the two predictors, where they provide overlapping information, or that the model has reached saturation, with no additional predictive value from including shotAngle. As a result, the model’s performance may be driven mainly by shotDistance, and the inclusion of shotAngle may not contribute significantly to explaining the variation in xGoal. Further checks for multicollinearity could clarify this relationship.

Resid vs Val Plots

gg_resfitted(model) +
  geom_smooth(se=FALSE)

## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

This pattern suggests heteroscedasticity—early on, the model predicts well, but as fitted values increase, residuals spread out. The black dots fanning out at the end indicate that variance grows for higher predicted values, meaning the model struggles with certain types of shots. Since I’m forcing a linear fit on an exponential decay relationship, the residuals likely get larger where the curve should be flattening.

plots <- gg_resX(model, plot.all = FALSE)

# for each variable of interest ...
plots$shotDistance +
  geom_smooth(se = FALSE)

## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

plots$shotAngle +
  geom_smooth(se = FALSE)

## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

gg_reshist(model)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The residual graphs show that shotAngle is accurately represented in the model, and is not influenced by high or low values. On the other hand, high values of shot Distance have higher residuals, largely due to the fact the model goes below the x-axis. The histogram has a skewed distrubution, reflecting the skew of the original plot.

The lack of a linear relationship in the dataset is continued to be shown, as even the most linear relationship fails the residual test.

QQ Plots

gg_qqplot(model)

My Q-Q plot falls off in Q2 because the true relationship between shot distance and xGoal follows exponential decay, but I’m modeling it as linear. This creates systematic bias—overestimating xGoal for long shots and underestimating it for short ones—leading to the curvature in the Q-Q plot. While an exponential model would fit better, treating it as linear is still a reasonable approximation over small shot distance ranges due to local linearity, similar to how the Taylor Series works. Plus, a linear model is easier to interpret, and the estimates are still useful despite some deviation from normality.

Cooks Plot

gg_cooksd(model, threshold = 'matlab')

I’m seeing a ton of influential points, which likely comes from high variability in the data. Since the relationship is actually exponential decay, forcing a linear model probably introduces large residuals, making more points seem influential. Plus, shot angles and off-wing conditions vary a lot, so some subsets might behave very differently.

LJ Data Dive - Regression Diagnostics

Logan Johnson

2025-03-25