Main Report

This project was done over a semester in the Linear Model Theory class at Willamette. The following includes the final report and the code (in the appendix). This code was developed by first finding some basic information about our data and learning how to work Rstudio at the beginning of the semester (with our early “Milestone” assignments) and then working our way to more and more complicated tasks.

Introduction

The National Hockey League (NHL) is one of the top 4 major North American sports. Using the NHL Game Data, Kaggle, we had access to a plethora of NHL data. This paper examines how certain elements of a shot towards the goal affect the likelihood the shooter will score or miss. The first element we looked at was the hockey puck’s distance from the goal when the shot was taken. We then expanded our analysis to total distance from the goal, distance in the X direction, distance in the Y direction, and the time remaining in the period and how these affected the likelihood of making a goal. We found that the total distance from the goal had a relatively large impact, the distance in the X direction had a medium impact, and the time had a small impact on the likelihood a goal was made. We found that the distance from the goal in the Y direction had little to no effect on if the shot resulted in a goal. We also found that the likelihood of scoring from a distance more than a couple meters from the goal is incredibly small so most shots taken from a far distance (and the bulk of the data) was close to zero. We built this model in hopes of better understanding what elements were statistically at play when hockey players shoot the puck. The data was fitted to a linear model but in retrospect could be more accurate if fit to an exponential model.

Data & Application

The data used is pulled from data of all NHL games in the history of the NHL. The data was wrangled into a simple CSV that accounted for Euclidean distance, X and Y distance, and time remaining in the period, with whether or not a goal was scored serving as the response variable. The X and Y coordinates were taken originally with the origin at the center of the rink (0,0), but were adjusted to be in reference to the position of the goal. Furthermore, the X coordinate was standardized so that all X values would be positive, as if the entire game were played on one half of the rink.

The plot below shows the relationship between goal likelihood and a player’s distance to the goal. It shows a very strong evidence that distance is a driving force behind goal scoring. There is an exponential decrease in the likeness of scoring a goal the further away a player is from the crease. There are two important notes on this scatter plot. The first is that it is difficult to get into this zone next to the crease that is quite literally a kill zone sometimes. Therefore, the number of shots occurring at this distance is far fewer. Even more so, the area that is the one meter range is far smaller than the area for some of the further distances. With these two factors, far fewer shots on goal occurred at the one meter distance. This does not, however, mitigate the fact that when a player is able to get a shot off in this range, they are more likely than not to make it.

Methods

We first determined the fitted model for the data by calculating the slope ( \(\beta_1\) ) and intercept ( \(\beta_0\) ) of the line of best fit based on the likelihood of a goal being made in response to the distance from where the shot was made to the goal. Once there was a fitted model, we used the summary function in R to find the degrees of freedom, the test statistic, and the p-value of this fitted model.

Fitted line: \(y= 0.044882- 0.00064 x\)

T-stat: \(0.0002148\)

P-value: \(0.003171\)

Model & degrees of freedom: Normal reference distribution, 98 df

Some important diagnostics ran for this data was a residual plot and a QQ plot. The residual plot mapped out how the residuals compared to the line of best fit. This is a crucial property when deciding if the linear model was correctly chosen for the data. The QQ plot compared the quantiles in our graph to those of a theoretic standard distribution, and can display characteristics that denote if a graph has some major issues, i.e. fat tails and right/left skews.

Lastly, we expanded our model to include multiple variables and performed variable selection (forward selection) to determine what variables did not appear to have any effect on the model so they could be excluded from the final model. From here, we set up a new line of be and created a new ANOVA table.

Results & Analysis

Upon testing our single variable model (where only the total distance was evaluated in respect to the likelihood a goal would be made) a test statistic of \(0.0002148\), and a p-value of \(0.003171.\) This let us reject the null hypothesis with a p-value of \(0.003171\) at the \(0.01\) significance level. There is convincing evidence to suggest that there is a significant relationship between distance from the goal and the likelihood of making the goal. That being said, the slope was also incredibly small and showed there was a very small chance of scoring a goal unless very close to the net. So although this data appears to be very accurate from the hypothesis test, it isn’t incredibly helpful in seeing the bigger picture.

Although the raw data appeared very linear at a glance the diagnostic tests showed otherwise. The hypothesis test warranted the conclusion that the data was modeled well. The QQ plot also presented as if the data was normally distributed and linear. The residual plot, however, exposed some issues with these models.

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.0.6     ✓ dplyr   1.0.4
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Notice how the residual plot displays characteristics of strong outliers and non-linear properties. One issue here is that the “outliers” are the shots that were made near the goal and are crucial to the data set so they can’t be disregarded as user/information error. This lets us conclude that perhaps an exponential model would have been better for modeling the data.

Upon performing forward selection to determine what variables were most important to the model, the data showed that the total distance, the distance in the X direction, and the time remaining in the period had an effect on the likelihood of a shot being made. The distance from the goal in the Y direction appeared to have little to no effect on the likelihood a goal would be scored.

New Line: \(y= 9.313e-02 - (1.539e-03) x_1 + (5.923e-04) x_2 -(1.133e-06) x_3- (9.642e-04) x_4\) (See appendix for anova table).

Conclusions

This paper sought to find the most influential variables on scoring for any given player. These findings lead us to believe that Euclidean distance has the greatest impact on scoring, followed by distance from the goal in the X direction. There is some correlation between time remaining in the period, but this could be because the number of shots taken in the final seconds of the period is higher than other times in the period. The most useful information to gleam from this is that if a player can make it into this one foot range, they are more likely to score than not.

The most interesting find is the lack of impact that the horizontal distance from the crease had on likelihood of scoring. Intuitively, the horizontal distance augments the angle of the shot and therefore decreases the relative area to shoot on the goal and increases the goalies the area of the goalie in relation to the goal. This, however, is not the case. Whether this is due to the skill of the players or some other driving factor remains unclear. This could be an interesting follow up investigation into what drives this lack of correlation.

Finally, and most importantly, the diagnostic tests (specifically the residual plot) on the single variable linear model exposed that a linear model was not the best fit for this data. Although there is strong evidence to show the correlation between the likelihood of a shot being made and the total distance, the distance in the X direction, and the time remaining in the period, this information could be much stronger with an exponential model and would probably depict the relationships between the explored variables more accurately.

Bibliography

“National Hockey League.” Wikipedia. Wikimedia Foundation, April 28, 2021. https://en.wikipedia.org/wiki/National_Hockey_League#Popularity.

Ellis, Martin. “NHL Game Data.” Kaggle, December 11, 2020. https://www.kaggle.com/martinellis/nhl-game-data.

Appendix

 #MILESTONE 2

#Import our single variable data (here the goals have already been turned into a ratio)

library(readxl)
newplaysfiltered_1_ <- read_excel("~/Downloads/4) Second Semester Senior/Linear Modeling/newplaysfiltered (1).xlsx", 
    sheet = "distance vs goalRatio")
#View(newplaysfiltered_1_)

#Importing the data
filteredplays<-newplaysfiltered_1_
head(filteredplays)
## # A tibble: 6 x 4
##   Distance `Shot Taken` `Shots Made`  Ratio
##      <dbl>        <dbl>        <dbl>  <dbl>
## 1        1           13            8 0.615 
## 2        2           54            9 0.167 
## 3        3          321           34 0.106 
## 4        4         2020          140 0.0693
## 5        5         5130          203 0.0396
## 6        6        14352          410 0.0286
distance<-filteredplays$Distance
goalratio<-filteredplays$Ratio

b1<-(sum((distance-mean(distance))*goalratio))/(sum((distance-mean(distance))^2))
b0<-mean(goalratio)-(b1*mean(distance))

results<-lm(goalratio~distance) #Y variable first, then x(s)
results
## 
## Call:
## lm(formula = goalratio ~ distance)
## 
## Coefficients:
## (Intercept)     distance  
##   0.0448820   -0.0006499
b0
## [1] 0.044882
b1
## [1] -0.0006498731
summary(results)
## 
## Call:
## lm(formula = goalratio ~ distance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.03078 -0.02192 -0.00850  0.00700  0.57115 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.0448820  0.0124948   3.592 0.000515 ***
## distance    -0.0006499  0.0002148  -3.025 0.003171 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06201 on 98 degrees of freedom
## Multiple R-squared:  0.08542,    Adjusted R-squared:  0.07609 
## F-statistic: 9.153 on 1 and 98 DF,  p-value: 0.003171
plot(goalratio~distance)

#MILESTONE 3

library(tidyverse)

anova(results)
## Analysis of Variance Table
## 
## Response: goalratio
##           Df  Sum Sq  Mean Sq F value   Pr(>F)   
## distance   1 0.03519 0.035191  9.1531 0.003171 **
## Residuals 98 0.37678 0.003845                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ggplot(results, aes(distance, y=results$residuals)) + geom_point()

qqnorm(results$residuals)
qqline(results$residuals)

 #MILESTONE 4


#Import our multi-variable data (here the goals have already been turned into a ratio)


playsfiltered_M4_new_3_ <- read_excel("~/Downloads/4) Second Semester Senior/Linear Modeling/playsfilteredupdatedM4.xlsx")
#View(playsfiltered_M4_new_3_)


#Importing the data
samp<-sample(1:nrow(playsfiltered_M4_new_3_), size=10000, replace=FALSE)
hockey<-playsfiltered_M4_new_3_[samp,]
head(hockey)
## # A tibble: 6 x 11
##   play_id event secondaryType     x     y  st_x  st_y `Distance to Go…
##   <chr>   <chr> <chr>         <dbl> <dbl> <dbl> <dbl>            <dbl>
## 1 201702… Miss… NA              -82   -31    82    31             31.8
## 2 201702… Shot  Wrist Shot      -77    12    77   -12             17.0
## 3 201702… Bloc… NA              -47   -28   -47   -28             50.5
## 4 201702… Bloc… NA              -65   -13   -65   -13             27.3
## 5 201502… Bloc… NA               63   -15   -63    15             30.0
## 6 201702… Shot  Wrist Shot      -84   -17    84    17             17.7
## # … with 3 more variables: `Rounded Distance` <dbl>, Goal <dbl>, `Time
## #   Remaining in period` <dbl>
dim(hockey)
## [1] 10000    11
#Looking at the data
#library(tidyverse)

hockey%>%
  select(Goal, `Rounded Distance`, st_x, st_y, `Time Remaining in period`)%>%
  pairs()

hockey2<-hockey%>%
  select(Goal, `Rounded Distance`, st_x, st_y, `Time Remaining in period`)%>%
  na.omit()

hockey2$timemsec<- hockey2$`Time Remaining in period`/60

#dim(hockey)
#dim(hockey2)
  
#First we must create our response vector
y<-hockey2$Goal

attach(hockey2)

#Design Matrix
xMat<-matrix(c(rep(1, dim(hockey2)[1]), `Rounded Distance`, st_x, st_y, timemsec), nrow=dim(hockey2)[1])
head(xMat)
##      [,1] [,2] [,3] [,4]      [,5]
## [1,]    1   32   82   31  9.133333
## [2,]    1   17   77  -12  2.766667
## [3,]    1   50  -47  -28 14.766667
## [4,]    1   27  -65  -13 13.333333
## [5,]    1   30  -63   15  7.216667
## [6,]    1   18   84   17 17.966667
#Now we must find our fitted model from scratch

#first we find the transpose
#t(xMat)

#then we preform matrix multiplication
#t(xMat)%*%xMat

#we find the inverse
#solve(t(xMat)%*%xMat)

#finally bringing it all together to find our beta hat vector
betahatvec<-solve(t(xMat)%*%xMat)%*%t(xMat)%*%y
betahatvec
##               [,1]
## [1,]  9.028869e-02
## [2,] -1.456667e-03
## [3,]  6.065789e-04
## [4,] -5.948032e-05
## [5,] -9.039099e-04
#Double check using our snazzy lm() function
mod<-lm(y~`Rounded Distance`+ st_x + st_y+ timemsec)
mod
## 
## Call:
## lm(formula = y ~ `Rounded Distance` + st_x + st_y + timemsec)
## 
## Coefficients:
##        (Intercept)  `Rounded Distance`                st_x                st_y  
##          9.029e-02          -1.457e-03           6.066e-04          -5.948e-05  
##           timemsec  
##         -9.039e-04
#MILESTONE 5


#Design Matrix
xMat<-matrix(c(rep(1, dim(hockey2)[1]), `Rounded Distance`, st_x, st_y, timemsec), nrow=dim(hockey2)[1])
#head(xMat)

#Hat Matrix
hMat<-xMat%*%solve(t(xMat)%*%xMat)%*%t(xMat)
#head(hMat)

#Identity Matrix 
I<-diag(dim(hockey2)[1])

#Define J
J<-matrix( rep( 1, len=length(y)*length(y)), nrow = length(y))

n<-length(y)
p<-4


#sum of squares
ssReg<-t(y) %*% (hMat-((1/n)*J))%*% y
ssReg    
##          [,1]
## [1,] 19.69827
ssRes<- t(y) %*% (I-hMat) %*% y
ssRes
##          [,1]
## [1,] 463.3936
ssTot<-t(y) %*% (I-(1/n)*J)%*% y
ssTot
##          [,1]
## [1,] 483.0919
#mean squares                   
ms_res<-sum(mod$residuals^2)/(n-p-1)
ms_res
## [1] 0.04636254
ms_reg <- ssReg/p
ms_reg
##          [,1]
## [1,] 4.924568
#F-value
F <- ms_reg/ms_res
F
##          [,1]
## [1,] 106.2187
#P-Value
pf(F, p, n-p-1, lower.tail = FALSE)
##              [,1]
## [1,] 9.077977e-89
#R squared
R<-ssReg/ssTot
R
##            [,1]
## [1,] 0.04077541
#Double check using our snazzy lm() function
mod<-lm(y~`Rounded Distance`+ st_x + st_y+ timemsec)
mod
## 
## Call:
## lm(formula = y ~ `Rounded Distance` + st_x + st_y + timemsec)
## 
## Coefficients:
##        (Intercept)  `Rounded Distance`                st_x                st_y  
##          9.029e-02          -1.457e-03           6.066e-04          -5.948e-05  
##           timemsec  
##         -9.039e-04
#double check
anova(mod)
## Analysis of Variance Table
## 
## Response: y
##                      Df Sum Sq Mean Sq  F value  Pr(>F)    
## `Rounded Distance`    1   7.25  7.2484 156.3407 < 2e-16 ***
## st_x                  1  12.16 12.1645 262.3774 < 2e-16 ***
## st_y                  1   0.01  0.0103   0.2220 0.63755    
## timemsec              1   0.28  0.2751   5.9347 0.01486 *  
## Residuals          9995 463.39  0.0464                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#VARIABLE SELECTION

#STEP 1
n=dim(hockey2)[1]
alpha=0.15

# Look at models with one variable 
p=1
this.df=n-p-1

# Find the F_in cut-off
qf(alpha, df1=1, df2=this.df, lower.tail = FALSE)
## [1] 2.072569
## A: Rounded distance
mod1a<-lm(y~`Rounded Distance`, data=hockey2)
anova(mod1a)#F-value: 151.41
## Analysis of Variance Table
## 
## Response: y
##                      Df Sum Sq Mean Sq F value    Pr(>F)    
## `Rounded Distance`    1   7.25  7.2484   152.3 < 2.2e-16 ***
## Residuals          9998 475.84  0.0476                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## B: Distance in the x direction
mod1b<-lm(y~st_x, data=hockey2)
anova(mod1b)#F-value: 245.24 (BIGGEST)
## Analysis of Variance Table
## 
## Response: y
##             Df Sum Sq Mean Sq F value    Pr(>F)    
## st_x         1  12.91  12.906  274.44 < 2.2e-16 ***
## Residuals 9998 470.19   0.047                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## C: Distance in the y direction
mod1c<-lm(y~st_y, data=hockey2)
anova(mod1c)#F-value: 0.0201
## Analysis of Variance Table
## 
## Response: y
##             Df Sum Sq  Mean Sq F value Pr(>F)
## st_y         1   0.01 0.010990  0.2274 0.6334
## Residuals 9998 483.08 0.048318
## D: Time in minutes
mod1d<-lm(y~timemsec, data=hockey2)
anova(mod1d)#F-value: 17.747
## Analysis of Variance Table
## 
## Response: y
##             Df Sum Sq Mean Sq F value   Pr(>F)   
## timemsec     1   0.38 0.38012   7.873 0.005028 **
## Residuals 9998 482.71 0.04828                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#STEP 2 models with two variables
p=2
this.df=n-p-1

# Find the F_in cut-off
qf(alpha, df1=1, df2=this.df, lower.tail = FALSE)
## [1] 2.072569
## A: Rounded distance
mod2a<-lm(y~st_x+`Rounded Distance`, data=hockey2)
anova(mod2a)#F-value: 138.58 (BIGGEST)
## Analysis of Variance Table
## 
## Response: y
##                      Df Sum Sq Mean Sq F value    Pr(>F)    
## st_x                  1  12.91 12.9064  278.26 < 2.2e-16 ***
## `Rounded Distance`    1   6.51  6.5065  140.28 < 2.2e-16 ***
## Residuals          9997 463.68  0.0464                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## B: Distance in the y direction
mod2b<-lm(y~st_x+st_y, data=hockey2)
anova(mod2b)#F-value: 0.2246
## Analysis of Variance Table
## 
## Response: y
##             Df Sum Sq Mean Sq  F value Pr(>F)    
## st_x         1  12.91 12.9064 274.4166 <2e-16 ***
## st_y         1   0.01  0.0066   0.1402 0.7081    
## Residuals 9997 470.18  0.0470                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## C: Time remaining in min
mod2c<-lm(y~st_x+timemsec, data=hockey2)
anova(mod2c)#F-value: 18.30
## Analysis of Variance Table
## 
## Response: y
##             Df Sum Sq Mean Sq  F value    Pr(>F)    
## st_x         1  12.91 12.9064 274.6017 < 2.2e-16 ***
## timemsec     1   0.32  0.3234   6.8817  0.008721 ** 
## Residuals 9997 469.86  0.0470                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#STEP 3 models with three variables
p=3
this.df=n-p-1

# Find the F_in cut-off
qf(alpha, df1=1, df2=this.df, lower.tail = FALSE)
## [1] 2.072569
## A: Distance in the y direction
mod3a<-lm(y~st_x+`Rounded Distance`+st_y, data=hockey2)
anova(mod3a)#F-value: 0.2786
## Analysis of Variance Table
## 
## Response: y
##                      Df Sum Sq Mean Sq  F value Pr(>F)    
## st_x                  1  12.91 12.9064 278.2418 <2e-16 ***
## `Rounded Distance`    1   6.51  6.5065 140.2697 <2e-16 ***
## st_y                  1   0.01  0.0103   0.2219 0.6376    
## Residuals          9996 463.67  0.0464                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## B: time in minutes
mod3b<-lm(y~st_x+`Rounded Distance`+timemsec, data=hockey2)
anova(mod3b)#F-value: 16.91 (BIGGEST)
## Analysis of Variance Table
## 
## Response: y
##                      Df Sum Sq Mean Sq  F value  Pr(>F)    
## st_x                  1  12.91 12.9064 278.4003 < 2e-16 ***
## `Rounded Distance`    1   6.51  6.5065 140.3496 < 2e-16 ***
## timemsec              1   0.27  0.2743   5.9161 0.01502 *  
## Residuals          9996 463.40  0.0464                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#STEP 4 models with four variables
p=4
this.df=n-p-1

# Find the F_in cut-off
qf(alpha, df1=1, df2=this.df, lower.tail = FALSE)
## [1] 2.072569
## A: Distance in the y direction
mod4a<-lm(y~st_x+`Rounded Distance`+timemsec+st_y, data=hockey2)
anova(mod4a)#F-value: 0.2567
## Analysis of Variance Table
## 
## Response: y
##                      Df Sum Sq Mean Sq  F value  Pr(>F)    
## st_x                  1  12.91 12.9064 278.3791 < 2e-16 ***
## `Rounded Distance`    1   6.51  6.5065 140.3389 < 2e-16 ***
## timemsec              1   0.27  0.2743   5.9156 0.01502 *  
## st_y                  1   0.01  0.0112   0.2410 0.62351    
## Residuals          9995 463.39  0.0464                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Our final model is: y~st_x+`Rounded Distance`+timemsec