LJ Data Dive - GLM and Custom Diagnostics

#Package Loading
library(Hmisc)

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:base':
## 
##     format.pval, units

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()    masks stats::filter()
## ✖ dplyr::lag()       masks stats::lag()
## ✖ dplyr::src()       masks Hmisc::src()
## ✖ dplyr::summarize() masks Hmisc::summarize()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)
library(haven)
library(ggthemes)
library(ggrepel)
library(AmesHousing)
library(boot)
library(broom)
library(lindia)

#Loading the MoneyPuck Shot Dataset
mpd = read.csv("C:/Users/Logan/Downloads/shots_2024_1/shots_2024.csv")

#adding descriptors to dataframe

#Load the data dictionary (update with your file path)
#data_dict <- read.csv("C:/Users/Logan/Downloads/MoneyPuck_Shot_Data_Dictionary (1) (1).csv")

#Iterate through the data dictionary and assign labels (from ChatGPT -- QOL Step)

#for (i in 1:nrow(data_dict)) {
  #column_name <- data_dict$Variable[i]
  #description <- data_dict$Definition[i]
  
#if (column_name %in% colnames(mpd)) {
    #label(mpd[[column_name]]) <- description
  #}
#}

Variable Selection

# Assuming your data frame is named 'mpd_data'
model <- glm(goal ~ mpd$arenaAdjustedXCord + mpd$arenaAdjustedYCord + mpd$speedFromLastEvent + mpd$timeSinceFaceoff, 
             data = mpd, family = binomial)

# View the summary of the model
summary(model)

## 
## Call:
## glm(formula = goal ~ mpd$arenaAdjustedXCord + mpd$arenaAdjustedYCord + 
##     mpd$speedFromLastEvent + mpd$timeSinceFaceoff, family = binomial, 
##     data = mpd)
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -2.4638339  0.0369316 -66.713  < 2e-16 ***
## mpd$arenaAdjustedXCord -0.0003300  0.0003252  -1.015  0.31025    
## mpd$arenaAdjustedYCord  0.0009017  0.0009937   0.907  0.36423    
## mpd$speedFromLastEvent -0.0220580  0.0027839  -7.924 2.31e-15 ***
## mpd$timeSinceFaceoff    0.0009867  0.0003459   2.853  0.00433 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 17971  on 34998  degrees of freedom
## Residual deviance: 17884  on 34994  degrees of freedom
## AIC: 17894
## 
## Number of Fisher Scoring iterations: 5

Based on the logistic regression model, we observe that speedFromLastEvent and timeSinceFaceoff significantly influence the likelihood of scoring a goal. The negative coefficient for speedFromLastEvent (-0.0221) indicates that higher speeds from the last event decrease the probability of scoring, potentially due to reduced control or accuracy at higher speeds. Conversely, the positive coefficient for timeSinceFaceoff (0.0010) suggests that as more time passes since the faceoff, the chances of scoring increase, possibly due to better positioning or strategic play. The coefficients for arenaAdjustedXCord and arenaAdjustedYCord are not statistically significant, implying that the x and y coordinates do not have a strong impact on scoring likelihood in this dataset.

# Extract coefficients and standard errors
coef <- summary(model)$coefficients[, "Estimate"]
se <- summary(model)$coefficients[, "Std. Error"]

# Calculate 95% CI
ci_lower <- coef - 1.96 * se
ci_upper <- coef + 1.96 * se

# Combine into a data frame for easy viewing
ci <- data.frame(Coefficient = coef, Lower_CI = ci_lower, Upper_CI = ci_upper)
print(ci)

##                          Coefficient      Lower_CI      Upper_CI
## (Intercept)            -2.4638338618 -2.5362197412 -2.3914479825
## mpd$arenaAdjustedXCord -0.0003300306 -0.0009675145  0.0003074533
## mpd$arenaAdjustedYCord  0.0009016557 -0.0010460807  0.0028493921
## mpd$speedFromLastEvent -0.0220579931 -0.0275143686 -0.0166016175
## mpd$timeSinceFaceoff    0.0009867410  0.0003088239  0.0016646582

Intercept: The coefficient for the intercept is -2.4638, with a 95% confidence interval ranging from -2.5362 to -2.3914. This represents the baseline log-odds of scoring a goal when all explanatory variables are zero. The negative value indicates a low probability of scoring under these conditions.

mpd$arenaAdjustedXCord: The coefficient is -0.00033, with a confidence interval from -0.00097 to 0.00031. Since the interval includes zero and the p-value is not significant, this suggests that the x-coordinate does not have a meaningful impact on the likelihood of scoring a goal.

mpd$arenaAdjustedYCord: The coefficient is 0.0009, with a confidence interval from -0.00105 to 0.00285. Similar to the x-coordinate, the interval includes zero and the p-value is not significant, indicating that the y-coordinate does not significantly affect the probability of scoring.

mpd$speedFromLastEvent: The coefficient is -0.0221, with a confidence interval from -0.0275 to -0.0166. This negative coefficient is statistically significant, suggesting that higher speeds from the last event decrease the likelihood of scoring a goal. The confidence interval does not include zero, reinforcing the significance of this variable.

mpd$timeSinceFaceoff: The coefficient is 0.0010, with a confidence interval from 0.00031 to 0.00166. This positive coefficient is statistically significant, indicating that more time since the faceoff increases the probability of scoring. The confidence interval does not include zero, confirming the importance of this variable.

Diagnostics

Pseudo R2

library(pscl)

## Warning: package 'pscl' was built under R version 4.4.3

## Classes and Methods for R originally developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University (2002-2015),
## by and under the direction of Simon Jackman.
## hurdle and zeroinfl functions by Achim Zeileis.

pR2(model)

## fitting null model for pseudo-r2

##           llh       llhNull            G2      McFadden          r2ML 
## -8.942037e+03 -8.985545e+03  8.701731e+01  4.842072e-03  2.483192e-03 
##          r2CU 
##  6.183444e-03

The pseudo-R² values calculated for the logistic regression model indicate that, while some variables such as speedFromLastEvent and timeSinceFaceoff are statistically significant, the model as a whole explains only a very small portion of the variance in goal outcomes. Specifically, the McFadden R² value is approximately 0.0048, which is considered extremely low and suggests that the model’s predictive power is weak. This implies that although certain features may influence the likelihood of a goal, the current model does not capture enough information to reliably predict goal outcomes on its own. To improve model performance, additional variables, interaction terms, or alternative modeling approaches should be explored.

Confusion Matrix

library(caret)

## Warning: package 'caret' was built under R version 4.4.3

## Loading required package: lattice

## 
## Attaching package: 'lattice'

## The following object is masked from 'package:boot':
## 
##     melanoma

## 
## Attaching package: 'caret'

## The following object is masked from 'package:purrr':
## 
##     lift

# Predict probabilities
pred_probs <- predict(model, type = "response")

# Choose threshold (default: 0.5)
pred_class <- ifelse(pred_probs > 0.5, 1, 0)

# Confusion matrix
confusionMatrix(factor(pred_class), factor(mpd$goal))

## Warning in confusionMatrix.default(factor(pred_class), factor(mpd$goal)):
## Levels are not in the same order for reference and data. Refactoring data to
## match.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 32507  2492
##          1     0     0
##                                           
##                Accuracy : 0.9288          
##                  95% CI : (0.9261, 0.9315)
##     No Information Rate : 0.9288          
##     P-Value [Acc > NIR] : 0.5053          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.0000          
##          Pos Pred Value : 0.9288          
##          Neg Pred Value :    NaN          
##              Prevalence : 0.9288          
##          Detection Rate : 0.9288          
##    Detection Prevalence : 1.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : 0               
##

While the model achieves a high overall accuracy (92.88%), this is largely due to the overwhelming number of “no goal” outcomes in the dataset. The model fails to predict any actual goals, leading to a sensitivity of 0 and indicating that it is completely ineffective at identifying the minority class. This suggests the model is severely biased toward the majority class and lacks the ability to distinguish between goal and non-goal events. To address this, it may be necessary to rebalance the dataset (e.g., with oversampling, undersampling, or SMOTE), use class weighting, or explore more flexible modeling approaches.

###ROC

library(pROC)

## Warning: package 'pROC' was built under R version 4.4.3

## Type 'citation("pROC")' for a citation.

## 
## Attaching package: 'pROC'

## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

roc_obj <- roc(mpd$goal, pred_probs)

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

plot(roc_obj, col = "blue", print.auc = TRUE)

An AUC of 0.548 suggests that the model is only marginally better than flipping a coin when it comes to predicting goals. Combined with the confusion matrix (which shows no predicted goals at all), this reinforces the need to:

Address class imbalance (e.g., through oversampling/undersampling or weighted models)
Consider using a different modeling approach like Random Forest or Gradient Boosting
Revisit feature selection or engineering to see if more predictive variables can be included

Quick Random Forest Predictor

library(randomForest)

## Warning: package 'randomForest' was built under R version 4.4.3

## randomForest 4.7-1.2

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:dplyr':
## 
##     combine

## The following object is masked from 'package:ggplot2':
## 
##     margin

library(caret)

# 1. Prepare the data
mpd_rf <- mpd[, c("goal", "speedFromLastEvent", "timeSinceFaceoff")]
mpd_rf$goal <- as.factor(mpd_rf$goal)

# 2. Split data into training and testing sets
set.seed(42)
train_index <- createDataPartition(mpd_rf$goal, p = 0.7, list = FALSE)
train_data <- mpd_rf[train_index, ]
test_data <- mpd_rf[-train_index, ]

# 3. Calculate class weights
class_levels <- levels(train_data$goal)  # Get the actual class levels
class_counts <- table(train_data$goal)
weights <- as.numeric(c(class_counts[class_levels[2]] / sum(class_counts),
             class_counts[class_levels[1]] / sum(class_counts)))
names(weights) <- class_levels # set the names of the weights vector to the class levels

# 4. Train the Random Forest model with class weights
rf_model_weighted <- randomForest(goal ~ .,
                                  data = train_data,
                                  ntree = 100,
                                  classwt = weights)

# 5. Make predictions on the test set
predictions_weighted <- predict(rf_model_weighted, newdata = test_data)

# 6. Evaluate performance with a confusion matrix
conf_matrix_weighted <- confusionMatrix(predictions_weighted, test_data$goal)
print(conf_matrix_weighted)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 7607  596
##          1 2145  151
##                                           
##                Accuracy : 0.7389          
##                  95% CI : (0.7304, 0.7473)
##     No Information Rate : 0.9289          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : -0.0091         
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.78005         
##             Specificity : 0.20214         
##          Pos Pred Value : 0.92734         
##          Neg Pred Value : 0.06577         
##              Prevalence : 0.92885         
##          Detection Rate : 0.72455         
##    Detection Prevalence : 0.78131         
##       Balanced Accuracy : 0.49109         
##                                           
##        'Positive' Class : 0               
##

While the class-weighted Random Forest performs better than the unweighted version in terms of correctly identifying some goals, it still struggles significantly to predict goals accurately. The model is heavily biased towards predicting non-goals.