LJ Data Dive - Hypothesis Testing

#Package Loading
library(Hmisc)

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:base':
## 
##     format.pval, units

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()    masks stats::filter()
## ✖ dplyr::lag()       masks stats::lag()
## ✖ dplyr::src()       masks Hmisc::src()
## ✖ dplyr::summarize() masks Hmisc::summarize()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)
library(haven)

#Loading the MoneyPuck Shot Dataset
mpd <- read.csv("C:/Users/Logan/Downloads/shots_2024_1/shots_2024.csv")

#adding descriptors to dataframe

#Load the data dictionary (update with your file path)
#data_dict <- read.csv("C:/Users/Logan/Downloads/MoneyPuck_Shot_Data_Dictionary (1) (1).csv")

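#Iterate through the data dictionary and assign labels (from ChatGPT; quality-of-life step)
#Left commented out: uncomment this block together with the read.csv() call above to use it

#for (i in seq_len(nrow(data_dict))) {
#  column_name <- data_dict$Variable[i]
#  description <- data_dict$Definition[i]
#
#  # Only label columns that actually exist in the shot data
#  if (column_name %in% colnames(mpd)) {
#    label(mpd[[column_name]]) <- description
#  }
#}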

Variable Selection

# Logistic regression of goal on shot location, speed, and time since faceoff (data frame: mpd)
model <- glm(goal ~ mpd$arenaAdjustedXCord + mpd$arenaAdjustedYCord + mpd$speedFromLastEvent + mpd$timeSinceFaceoff, 
             data = mpd, family = binomial)

# View the summary of the model
summary(model)

## 
## Call:
## glm(formula = goal ~ mpd$arenaAdjustedXCord + mpd$arenaAdjustedYCord + 
##     mpd$speedFromLastEvent + mpd$timeSinceFaceoff, family = binomial, 
##     data = mpd)
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -2.4638339  0.0369316 -66.713  < 2e-16 ***
## mpd$arenaAdjustedXCord -0.0003300  0.0003252  -1.015  0.31025    
## mpd$arenaAdjustedYCord  0.0009017  0.0009937   0.907  0.36423    
## mpd$speedFromLastEvent -0.0220580  0.0027839  -7.924 2.31e-15 ***
## mpd$timeSinceFaceoff    0.0009867  0.0003459   2.853  0.00433 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 17971  on 34998  degrees of freedom
## Residual deviance: 17884  on 34994  degrees of freedom
## AIC: 17894
## 
## Number of Fisher Scoring iterations: 5
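
As a complement to the per-coefficient z-tests, the null and residual deviances reported above allow an overall likelihood-ratio test of the four predictors taken together. The sketch below hard-codes the rounded deviances and degrees of freedom from the printed summary instead of refitting the model, so the statistic is approximate; the variable names are illustrative and not part of the original analysis.

```r
# Deviances and degrees of freedom copied (rounded) from the summary output above
null_deviance <- 17971
resid_deviance <- 17884
null_df <- 34998
resid_df <- 34994

# Likelihood-ratio statistic: the drop in deviance from adding the predictors
lr_stat <- null_deviance - resid_deviance
lr_df <- null_df - resid_df

# Under the null hypothesis that all four slopes are zero, lr_stat is
# approximately chi-squared distributed with lr_df degrees of freedom
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

cat("LR statistic:", lr_stat, "on", lr_df, "df\n")
cat("p-value:", format(p_value, digits = 3), "\n")
```

With the fitted object in memory, `anova(model, test = "Chisq")` should report sequential versions of the same comparison without copying numbers by hand.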

Based on the logistic regression model, speedFromLastEvent and timeSinceFaceoff are statistically significant predictors of whether a shot results in a goal. The negative coefficient for speedFromLastEvent (-0.0221 on the log-odds scale) indicates that higher speeds from the last event are associated with lower odds of scoring, potentially due to reduced control or accuracy at higher speeds. Conversely, the positive coefficient for timeSinceFaceoff (0.0010) suggests that the odds of scoring rise as more time passes since the faceoff, possibly due to better positioning or strategic play. The coefficients for arenaAdjustedXCord and arenaAdjustedYCord are not statistically significant (p = 0.310 and p = 0.364), so this model finds no evidence of a linear relationship between the raw x and y coordinates and the log-odds of scoring. That is not the same as location being irrelevant: a linear term in each raw coordinate may not capture effects such as distance from the net.
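
Because the coefficients are on the log-odds scale, their size is easier to judge after exponentiating. The sketch below converts the two significant estimates to odds ratios; the values are copied by hand from the summary output, and the names `est`, `odds_ratio`, and `pct_change` are illustrative.

```r
# Estimates copied from the summary output above (log-odds scale)
est <- c(
  speedFromLastEvent = -0.0220580,
  timeSinceFaceoff   =  0.0009867
)

# Exponentiating a log-odds coefficient gives the odds ratio
# associated with a one-unit increase in that predictor
odds_ratio <- exp(est)

# Percentage change in the odds per one-unit increase
pct_change <- 100 * (odds_ratio - 1)

print(round(data.frame(odds_ratio, pct_change), 4))
```

These are per-unit effects, so the total change over a realistic range of each predictor depends on the units recorded in the data dictionary.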

# Extract coefficient estimates and standard errors from the fitted model
# (summary() is called once; 'est' avoids shadowing the base function coef())
coef_table <- summary(model)$coefficients
est <- coef_table[, "Estimate"]
se <- coef_table[, "Std. Error"]

# Calculate 95% Wald confidence intervals (estimate +/- 1.96 standard errors)
ci_lower <- est - 1.96 * se
ci_upper <- est + 1.96 * se

# Combine into a data frame for easy viewing
ci <- data.frame(Coefficient = est, Lower_CI = ci_lower, Upper_CI = ci_upper)
print(ci)

##                          Coefficient      Lower_CI      Upper_CI
## (Intercept)            -2.4638338618 -2.5362197412 -2.3914479825
## mpd$arenaAdjustedXCord -0.0003300306 -0.0009675145  0.0003074533
## mpd$arenaAdjustedYCord  0.0009016557 -0.0010460807  0.0028493921
## mpd$speedFromLastEvent -0.0220579931 -0.0275143686 -0.0166016175
## mpd$timeSinceFaceoff    0.0009867410  0.0003088239  0.0016646582
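
The manual calculation above produces Wald intervals. As a cross-check, base R's `confint.default()` computes the same kind of interval directly from a fitted model. The sketch below runs on simulated data because the shot file is not available here: the data frame `sim`, the sample size, and the "true" coefficients are invented for illustration and are not MoneyPuck values.

```r
set.seed(42)

# Simulated stand-in for the shot data (NOT real MoneyPuck values)
n <- 2000
sim <- data.frame(
  speedFromLastEvent = runif(n, 0, 30),
  timeSinceFaceoff   = runif(n, 0, 120)
)
true_logit <- -2.5 - 0.02 * sim$speedFromLastEvent + 0.001 * sim$timeSinceFaceoff
sim$goal <- rbinom(n, size = 1, prob = plogis(true_logit))

# Bare column names in the formula; the data argument supplies the columns
fit <- glm(goal ~ speedFromLastEvent + timeSinceFaceoff,
           data = sim, family = binomial)

# Manual Wald interval, using the exact normal quantile instead of 1.96
coef_table <- summary(fit)$coefficients
z <- qnorm(0.975)
manual <- cbind(
  lower = coef_table[, "Estimate"] - z * coef_table[, "Std. Error"],
  upper = coef_table[, "Estimate"] + z * coef_table[, "Std. Error"]
)

# Built-in Wald interval from the same fitted model
builtin <- confint.default(fit, level = 0.95)

print(manual)
print(builtin)
```

Using 1.96 in place of `qnorm(0.975)` changes the bounds only in the later decimal places. Calling `confint()` on a glm object gives profile-likelihood intervals instead, which can differ slightly from the Wald version.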

Intercept: The estimate is -2.4638, with a 95% confidence interval from -2.5362 to -2.3914. This is the baseline log-odds of scoring a goal when all explanatory variables equal zero, which corresponds to a probability of about 0.078. That combination of zeros may not describe a realistic shot, so the intercept is best read as a reference point and not as a typical scoring rate.

mpd$arenaAdjustedXCord: The estimate is -0.00033, with a 95% confidence interval from -0.00097 to 0.00031. Because the interval includes zero (p = 0.310), the model detects no linear association between the x-coordinate and the log-odds of scoring once the other predictors are accounted for.

mpd$arenaAdjustedYCord: The estimate is 0.0009, with a 95% confidence interval from -0.00105 to 0.00285. As with the x-coordinate, the interval includes zero (p = 0.364), so there is no evidence of a linear association between the y-coordinate and the log-odds of scoring.

mpd$speedFromLastEvent: The estimate is -0.0221, with a 95% confidence interval from -0.0275 to -0.0166. The interval excludes zero, consistent with the p-value of 2.31e-15, so higher speeds from the last event are associated with lower odds of scoring: roughly 2.2% lower odds for each one-unit increase in speed.

mpd$timeSinceFaceoff: The estimate is 0.0010, with a 95% confidence interval from 0.00031 to 0.00166. The interval excludes zero (p = 0.0043), so more time since the faceoff is associated with higher odds of scoring. The per-unit effect is small, about 0.1% higher odds for each one-unit increase, so statistical significance here does not by itself establish practical importance.
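
To make the intercept reading above concrete, the log-odds estimate and its interval bounds can be mapped to the probability scale with the inverse logit. The sketch below hard-codes the values from the confidence-interval table; the names `intercept` and `baseline_prob` are illustrative. Because the inverse logit is monotone, transforming the two endpoints gives an interval for the baseline probability.

```r
# Intercept estimate and 95% bounds copied from the confidence-interval table above
intercept <- c(estimate = -2.4638339, lower = -2.5362197, upper = -2.3914480)

# plogis() is the inverse logit: exp(x) / (1 + exp(x))
baseline_prob <- plogis(intercept)

print(round(baseline_prob, 4))
```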


Logan Johnson

2025-04-07
