#Package Loading
library(Hmisc)
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ dplyr::src() masks Hmisc::src()
## ✖ dplyr::summarize() masks Hmisc::summarize()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(haven)
library(plotly)
## Warning: package 'plotly' was built under R version 4.4.3
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:Hmisc':
##
## subplot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
library(ggthemes)
library(ggrepel)
library(AmesHousing)
library(boot)
library(broom)
library(lindia)
#Loading the MoneyPuck Shot Dataset
mpd = read.csv('./shots_2024.csv')
#adding descriptors to dataframe
#Load the data dictionary (update with your file path)
data_dict <- read.csv("C:/Users/Logan/Downloads/MoneyPuck_Shot_Data_Dictionary (1) (1).csv")
#Iterate through the data dictionary and assign labels (from ChatGPT -- QOL Step)
for (i in 1:nrow(data_dict)) {
column_name <- data_dict$Variable[i]
description <- data_dict$Definition[i]
if (column_name %in% colnames(mpd)) {
label(mpd[[column_name]]) <- description
}
}
#mpd <- mpd %>%
#filter(playerPositionThatDidEvent != "" &
#!is.na(playerPositionThatDidEvent) &
#playerPositionThatDidEvent != "G")
mpd <- mpd %>% filter(mpd$teamCode == "NYR")
# Run linear regression model
lm_model <- lm(xGoal ~ mpd$shotDistance, data = mpd)
summary(lm_model)
##
## Call:
## lm(formula = xGoal ~ mpd$shotDistance, data = mpd)
##
## Residuals:
## The probability the shot will be a goal. Also known as "Expected Goals"
## Min 1Q Median 3Q Max
## -0.13978 -0.04358 -0.01749 0.01493 0.80881
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.1605025 0.0042153 38.08 <2e-16 ***
## mpd$shotDistance -0.0026017 0.0001092 -23.83 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09053 on 1876 degrees of freedom
## Multiple R-squared: 0.2323, Adjusted R-squared: 0.2319
## F-statistic: 567.7 on 1 and 1876 DF, p-value: < 2.2e-16
# Scatter plot with regression line
ggplot(mpd, aes(x = shotDistance, y = xGoal)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", color = "blue") +
labs(title = "XGoal vs. Shot Distance",
x = "Shot Distance",
y = "Expected Goals") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
The analysis of shot distance as a predictor for expected goals
(xGoal) reveals a statistically significant relationship,
with a p-value less than 2e-16 for both the intercept and shot distance
coefficient. The coefficient for shot distance is -0.002353, suggesting
that as the shot distance increases, the expected probability of a goal
slightly decreases. The R-squared value of 0.2009 indicates that shot
distance accounts for only about 20% of the variability in
xGoal, pointing to a weak linear relationship. This
suggests that while shot distance has some influence on expected goals,
it is not a strong enough predictor to explain much of the variation in
the dataset. Furthermore, the lack of a strong linear relationship
across other continuous variables in the dataset implies that expected
goals are likely driven by a combination of factors beyond just shot
distance, indicating that the dataset may require more complex modeling
techniques, such as incorporating non-linear relationships or exploring
other influential factors, to better understand the predictors of
xGoal.
Next, we will add the binary term offhandedness. The binary term tests whether the relationship between shot distance and xGoal differs by shooter handedness. For example, it might be that off-hand shooters are less penalized by longer distances because they can get a better angle.
mpd$offWing <- as.character(as_factor(mpd$offWing))
# Model for offWing = 0
model_offWing0 <- lm(goal ~ shotDistance, data = subset(mpd, offWing == 0))
summary(model_offWing0)
##
## Call:
## lm(formula = goal ~ shotDistance, data = subset(mpd, offWing ==
## 0))
##
## Residuals:
## Set to 1 if shot was a goal. Otherwise 0
## Min 1Q Median 3Q Max
## -0.13019 -0.08899 -0.05564 -0.01709 1.04370
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.134446 0.014231 9.447 < 2e-16 ***
## shotDistance -0.002129 0.000352 -6.049 1.98e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2325 on 1130 degrees of freedom
## Multiple R-squared: 0.03136, Adjusted R-squared: 0.03051
## F-statistic: 36.59 on 1 and 1130 DF, p-value: 1.981e-09
# Model for offWing = 1
model_offWing1 <- lm(goal ~ shotDistance, data = subset(mpd, offWing == 1))
summary(model_offWing1)
##
## Call:
## lm(formula = goal ~ shotDistance, data = subset(mpd, offWing ==
## 1))
##
## Residuals:
## Set to 1 if shot was a goal. Otherwise 0
## Min 1Q Median 3Q Max
## -0.11974 -0.09613 -0.07290 -0.04507 0.95783
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.1268635 0.0188367 6.735 3.28e-11 ***
## shotDistance -0.0017269 0.0005284 -3.268 0.00113 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2598 on 744 degrees of freedom
## Multiple R-squared: 0.01415, Adjusted R-squared: 0.01283
## F-statistic: 10.68 on 1 and 744 DF, p-value: 0.001132
ggplot(mpd, aes(x = shotDistance, y = xGoal, color = as.factor(offWing))) +
geom_point(alpha = 0.5) + # Scatter plot
geom_smooth(method = "lm", se = FALSE) + # Separate regression lines
labs(title = "Regression Lines for Shot Distance vs Goal Probability by Off-Wing Status",
x = "Shot Distance",
y = "Goal Probability",
color = "Off-Wing Status") +
scale_color_manual(values = c("blue", "orange"), labels = c("Not Off-Wing", "Off-Wing")) +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
# Broken Out Into Seperate Plots
# Plot for offWing = 0
plot_offWing0 <- ggplot(subset(mpd, offWing == 0), aes(x = shotDistance, y = xGoal)) +
geom_point(alpha = 0.5, color = "blue") +
geom_smooth(method = "lm", se = FALSE, color = "darkblue") +
labs(title = "Regression for Not Off-Wing (0)",
x = "Shot Distance",
y = "Goal Probability") +
theme_minimal()
# Plot for offWing = 1
plot_offWing1 <- ggplot(subset(mpd, offWing == 1), aes(x = shotDistance, y = xGoal)) +
geom_point(alpha = 0.5, color = "red") +
geom_smooth(method = "lm", se = FALSE, color = "darkred") +
labs(title = "Regression for Off-Wing (1)",
x = "Shot Distance",
y = "Goal Probability") +
theme_minimal()
# Display both plots
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
grid.arrange(plot_offWing0, plot_offWing1, ncol = 2)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
Adding the binary term reveals that not much changes in our model between shooter handedness by distance. Future exploration would probably be more effective looking at this term in relation to shot angle.
# Fit a linear regression model
model <- lm(xGoal ~ shotDistance + shotAngle, data = mpd)
# Display the summary of the model
summary(model)
##
## Call:
## lm(formula = xGoal ~ shotDistance + shotAngle, data = mpd)
##
## Residuals:
## The probability the shot will be a goal. Also known as "Expected Goals"
## Min 1Q Median 3Q Max
## -0.14136 -0.04311 -0.01764 0.01513 0.80916
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.605e-01 4.217e-03 38.070 <2e-16 ***
## shotDistance -2.601e-03 1.092e-04 -23.818 <2e-16 ***
## shotAngle 2.056e-05 4.986e-05 0.412 0.68
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09055 on 1875 degrees of freedom
## Multiple R-squared: 0.2324, Adjusted R-squared: 0.2316
## F-statistic: 283.8 on 2 and 1875 DF, p-value: < 2.2e-16
The nearly identical R² values between the regression models with and
without shotAngle suggest that shotDistance is
the primary predictor of xGoal, while adding
shotAngle does not significantly improve the model’s
explanatory power. This could indicate multicollinearity between the two
predictors, where they provide overlapping information, or that the
model has reached saturation, with no additional predictive value from
including shotAngle. As a result, the model’s performance
may be driven mainly by shotDistance, and the inclusion of
shotAngle may not contribute significantly to explaining
the variation in xGoal. Further checks for
multicollinearity could clarify this relationship.
gg_resfitted(model) +
geom_smooth(se=FALSE)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
This pattern suggests heteroscedasticity—early on, the model predicts well, but as fitted values increase, residuals spread out. The black dots fanning out at the end indicate that variance grows for higher predicted values, meaning the model struggles with certain types of shots. Since I’m forcing a linear fit on an exponential decay relationship, the residuals likely get larger where the curve should be flattening.
plots <- gg_resX(model, plot.all = FALSE)
# for each variable of interest ...
plots$shotDistance +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
plots$shotAngle +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
gg_reshist(model)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The residual graphs show that shotAngle is accurately represented in the model, and is not influenced by high or low values. On the other hand, high values of shot Distance have higher residuals, largely due to the fact the model goes below the x-axis. The histogram has a skewed distrubution, reflecting the skew of the original plot.
The lack of a linear relationship in the dataset is continued to be shown, as even the most linear relationship fails the residual test.
gg_qqplot(model)
My Q-Q plot falls off in Q2 because the true relationship between shot distance and xGoal follows exponential decay, but I’m modeling it as linear. This creates systematic bias—overestimating xGoal for long shots and underestimating it for short ones—leading to the curvature in the Q-Q plot. While an exponential model would fit better, treating it as linear is still a reasonable approximation over small shot distance ranges due to local linearity, similar to how the Taylor Series works. Plus, a linear model is easier to interpret, and the estimates are still useful despite some deviation from normality.
gg_cooksd(model, threshold = 'matlab')
I’m seeing a ton of influential points, which likely comes from high variability in the data. Since the relationship is actually exponential decay, forcing a linear model probably introduces large residuals, making more points seem influential. Plus, shot angles and off-wing conditions vary a lot, so some subsets might behave very differently.