library(stargazer)
## 
## Please cite as:
##  Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
#import the data
library(readxl)
hockey2022data_player <- read_excel("hockey2022data.xlsx", 
    sheet = "Player Stats (2022-23)")


hockey2022data_goalie <- read_excel("hockey2022data.xlsx", 
    sheet = "Goalie Stats (2022-23)")
hockey_stats <- merge(hockey2022data_player, hockey2022data_goalie, by = "TEAM")  #Merge the data sets by team name
#include a new stat called standings_pts 
hockey_stats <- hockey_stats %>%
  mutate(standings_pts = (W * 2) + OTL)
#create linear model
model <- lm(data = hockey_stats, formula = standings_pts ~ GF_pg + PP_p + PK_p + GA_pg + SV_p)
#summary Stats
stargazer(hockey_stats, type = 'text')
## 
## =================================================
## Statistic     N    Mean    St. Dev.  Min    Max  
## -------------------------------------------------
## GP.x          32  82.000    0.000     82     82  
## GF_pg         32   3.139    0.364   2.460  3.960 
## A             32  435.594   51.819   331    545  
## PTS           32  693.062   81.181   533    870  
## PPG           32  53.656    12.853    35     89  
## PP_p          32  21.113    3.691   15.600 32.400
## SHG           32   7.812    3.383     1      18  
## S.x           32 2,561.438 215.153  2,119  3,019 
## S_p           32  10.050    0.843   8.700  11.800
## PIM           32  752.531  133.516   589   1,027 
## PK_p          32  78.703    4.148   71.600 87.300
## SOA           32  16.969    8.372     3      42  
## SOG           32   5.594    3.618     0      18  
## SO_p          32  33.469    13.789  0.000  75.000
## GP.y          32  82.000    0.000     82     82  
## GA_pg         32   3.140    0.482   2.120  4.090 
## W             32  41.000    9.899     22     65  
## L             32  31.562    9.249     12     49  
## OTL           32   9.438    3.151     3      17  
## SA            32 2,561.438 213.636  2,132  3,207 
## GA            32  257.469   39.475   174    335  
## S.y           32 2,303.844 188.598  1,923  2,872 
## SV_p          32   0.904    0.011   0.887  0.931 
## SO            32   3.594    2.077     0      9   
## SOSA          32  16.969    8.368     2      39  
## SOS           32  11.375    6.025     2      24  
## SOS_p         32   0.664    0.130   0.375  1.000 
## standings_pts 32  91.438    18.899    58    135  
## -------------------------------------------------

\(\widehat{standings\_pts} = \beta_{0} + \beta_{1}GF\_pg + \beta_{2}PP\_p + \beta_{3}PK\_p + \beta_{4}GA\_pg + \beta_{5}SV\_p\)

These are some summary stats for the data set I used. I am only using a few of these variables in the regression model, but the full table provides some cool insight.

Some cool stats:

Average Wins per team: 41

Average Losses: 31.6

Average Goals For Per Game and Goals Against Per Game are both about 3.14. I thought it was interesting that these stats were essentially the same, but on further thought it makes sense: the total goals scored in the league equals the total goals allowed in the league, so the league-wide averages must match. They will, of course, differ for specific teams.
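
A quick check straight from the same hockey_stats data frame used above:

#league-average goals for vs. goals against per game
mean(hockey_stats$GF_pg)
mean(hockey_stats$GA_pg)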

Power Play %: The league average is 21.113, and the standard deviation of 3.691 was larger than I expected.

Penalty Kill %: The league average is 78.703, and the standard deviation of 4.148 was also a lot larger than I expected.

#summary of model
stargazer(model, type = 'text')
## 
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                            standings_pts       
## -----------------------------------------------
## GF_pg                        19.167***         
##                               (3.890)          
##                                                
## PP_p                           0.188           
##                               (0.324)          
##                                                
## PK_p                         -0.721**          
##                               (0.282)          
##                                                
## GA_pg                       -32.430***         
##                               (3.832)          
##                                                
## SV_p                           1.958           
##                              (112.992)         
##                                                
## Constant                      184.073          
##                              (120.511)         
##                                                
## -----------------------------------------------
## Observations                    32             
## R2                             0.967           
## Adjusted R2                    0.960           
## Residual Std. Error       3.764 (df = 26)      
## F Statistic           151.126*** (df = 5; 26)  
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01

I ran a regression to model standings points in an NHL season (calculated as 2 points for a win, 1 point for an overtime loss, and 0 points for a regulation loss) as a function of team statistics. The predictor variables were goals for per game, power play success percentage, penalty kill percentage, goals against per game, and save percentage.

Goals For Per Game: The regression shows a coefficient of 19.167. This means that for each additional goal for per game, a team's expected standings points increase by 19.167. This does not surprise me, as the standard deviation of GF_pg is only about 0.36, so an additional full goal per game should shoot you up the standings.
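
To put the coefficient on a realistic scale, here is a quick sketch of the expected effect of a one-standard-deviation improvement in GF_pg:

#expected standings-point change for a one-SD improvement in goals for per game
coef(model)["GF_pg"] * sd(hockey_stats$GF_pg)  #about 19.167 * 0.364, roughly 7 points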

Goals Against Per Game: The regression model shows a GA_pg coefficient of -32.430. This means that for each additional goal against per game, a team's standings points are predicted to decrease by 32.430. This negative relationship makes sense because the more goals a team gives up, generally the more games it will lose. Goals against is also weighted more heavily than goals for, suggesting defense may be slightly more important than offense in predicting standings points.

Penalty Kill Percentage: This variable was surprising. It did not have much influence in the model, with a coefficient of only -0.721. This implies that for each percentage-point increase in penalty kill percentage, a team is expected to lose 0.721 standings points. Real-life observation suggests the opposite, so this estimate is counterintuitive (possibly an artifact of its correlation with the other predictors).

Save Percentage: Coefficient of 1.958. Because SV_p is recorded on a 0-1 scale (mean 0.904), this coefficient applies to a full one-unit change; a one-percentage-point (0.01) increase in save percentage therefore corresponds to only about 0.02 standings points, and the huge standard error (112.992) means the estimate is not statistically distinguishable from zero.
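
To make the per-percentage-point effect explicit, the model can be refit with save percentage rescaled to a 0-100 scale (a sketch reusing the same data; model_pct is a new name):

#refit with save percentage rescaled to percentage points (0-100)
model_pct <- lm(standings_pts ~ GF_pg + PP_p + PK_p + GA_pg + I(SV_p * 100),
                data = hockey_stats)
coef(model_pct)["I(SV_p * 100)"]  #equals 1.958 / 100, about 0.02 points per percentage point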

Constant: 184.073. This means that if all the stats above were 0, a team would be expected to have 184.073 points. We know this is not realistic because it exceeds the maximum possible 164 points in an 82-game season. But the intercept is this high because the negative coefficients we discovered (such as goals against per game) pull the prediction back down once realistic stat values are plugged in.
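
A more meaningful anchor than the intercept is the prediction at league-average stats, which for an OLS model with an intercept lands exactly at the league-average point total (a quick sketch using the already-loaded dplyr):

#predicted standings points for a team with league-average stats
league_avg <- hockey_stats %>%
  summarise(across(c(GF_pg, PP_p, PK_p, GA_pg, SV_p), mean))
predict(model, newdata = league_avg)  #equals the mean, about 91.4 points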

#residual plots
plot(model)
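
By default plot(model) shows the four diagnostic plots one at a time; a 2-by-2 panel layout displays them together:

#display all four diagnostic plots in one 2x2 panel
par(mfrow = c(2, 2))
plot(model)
par(mfrow = c(1, 1))  #reset the plotting layout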

Residuals Vs. Fitted:

The residuals seem to be spread around a relatively horizontal line, which indicates there are no strong non-linear relationships at work here. It is important to note there are some high-residual points at both extremes of the graph that could hint at a slightly parabolic pattern, but this relationship does not seem strong, with just a few points influencing the shape.

Normal Q-Q:

This plot shows that the residuals follow a normal distribution: they track the straight reference line quite well. This gives me confidence the residuals are normal.
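
A formal companion to the visual check is the Shapiro-Wilk test on the residuals (a quick sketch; shapiro.test is in base R's stats package):

#Shapiro-Wilk test: a high p-value is consistent with normally distributed residuals
shapiro.test(residuals(model))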

Scale-Location:

This plot is used to check the assumption of homoscedasticity, and it is not a great graph. The line is not particularly horizontal; it decreases and then changes to an upward slope in the second half. It is important to note that point 3, the outlier in the top-right corner, is pulling the line up. Without this point, our model may fit better.

Residuals Vs. Leverage:

This plot helps us find influential cases, as not all outliers are influential. There do not seem to be any influential points, since all points fall within the dashed Cook's distance lines. However, it is important to note that point 3 sits about on the line.
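
The Cook's distances behind this plot can also be inspected directly; a quick sketch to list the most influential observations:

#three largest Cook's distances, to see how influential point 3 really is
head(sort(cooks.distance(model), decreasing = TRUE), 3)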

Assumptions

  1. Exogeneity: Strictly speaking, this requires the predictors to be uncorrelated with the error term. The independent variables here likely influence each other: goals against per game and save percentage could be (and probably are) related, and PP_p may also be driving goals for per game, so factors left out of the model could easily be correlated with the included predictors. This condition is probably not met.

  2. Random: This model uses all of the data from the 2022-2023 population of teams, so this condition is satisfied. However, a more complete model would randomly sample teams across different years.

  3. Homoscedasticity: The variability of the residuals should be nearly constant. As seen in the Residuals vs. Fitted plot, the variability in the residuals is not perfect; there is some more extreme variation at each end of the plot. A formal check is sketched below.
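
    A formal check is the Breusch-Pagan test (a sketch, assuming the lmtest package is installed):

    #Breusch-Pagan test: a low p-value would indicate heteroscedasticity
    lmtest::bptest(model)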

  4. Residuals are independent.

    acf(residuals(model))

    pacf(residuals(model))

    There are no easily identifiable patterns in either plot, so the residuals appear to be independent.

  5. Linearity: Each variable is linearly related to the outcome.

    Since there are a lot of variables in my model I will use the gvlma package (output under the next item). Also, looking at the p-values for the coefficients, every predictor except SV_p was highly significant, suggesting a linear relationship with the outcome is likely.

  6. Multicollinearity: In my model, there is a good chance that certain predictors have a linear relationship with each other. Goals for per game and PP_p are expected to have a positive linear relationship, PK_p and goals against per game a negative one, and goals against per game and save percentage a negative one. A quick check is sketched below, before the gvlma output.
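
    A sketch to quantify this (the variance inflation factors assume the car package is installed):

    #pairwise correlations among the predictors
    cor(hockey_stats[, c("GF_pg", "PP_p", "PK_p", "GA_pg", "SV_p")])
    #variance inflation factors; values above roughly 5-10 are a common warning sign
    car::vif(model)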

    library(gvlma)
    gvlma(x = model)
    ## 
    ## Call:
    ## lm(formula = standings_pts ~ GF_pg + PP_p + PK_p + GA_pg + SV_p, 
    ##     data = hockey_stats)
    ## 
    ## Coefficients:
    ## (Intercept)        GF_pg         PP_p         PK_p        GA_pg         SV_p  
    ##    184.0732      19.1671       0.1878      -0.7206     -32.4297       1.9580  
    ## 
    ## 
    ## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
    ## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
    ## Level of Significance =  0.05 
    ## 
    ## Call:
    ##  gvlma(x = model) 
    ## 
    ##                        Value  p-value                   Decision
    ## Global Stat        8.7028489 0.068972    Assumptions acceptable.
    ## Skewness           0.0003839 0.984367    Assumptions acceptable.
    ## Kurtosis           0.2394021 0.624638    Assumptions acceptable.
    ## Link Function      8.0587208 0.004529 Assumptions NOT satisfied!
    ## Heteroscedasticity 0.4043420 0.524855    Assumptions acceptable.

    As we can see, exogeneity and multicollinearity are probably not satisfied, and gvlma flags the link function (linearity) assumption as well, so this model is not great.

BLUE

OLS being BLUE means it is the Best Linear Unbiased Estimator: among all estimators that are linear in the data and unbiased, the OLS coefficient estimates have the tightest possible sampling distribution (the smallest variance).
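
Formally, under the Gauss-Markov assumptions, for any other linear unbiased estimator \(\tilde{\beta}\) the difference \(\mathrm{Var}(\tilde{\beta}) - \mathrm{Var}(\hat{\beta}_{OLS})\) is positive semi-definite, so no linear unbiased estimator beats OLS on variance.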

II.

One of the main reasons taking the log of a variable can help in a linear regression model is when you are dealing with skewed data. Taking the log of heavily skewed data can bring it closer to a normal distribution, which is very helpful for linear models.

Another reason is to improve linearity. Sometimes taking the log of a variable lets you model a non-linear relationship as linear; exponentially growing data is the classic use case, as sketched below.
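
A minimal sketch with simulated data (hypothetical x and y, not from the hockey data set) shows how a log transform linearizes an exponential trend:

#simulated example: an exponential trend is linear on the log scale
set.seed(1)
x <- 1:50
y <- exp(0.1 * x + rnorm(50, sd = 0.2))
summary(lm(y ~ x))       #residuals show curvature on the raw scale
summary(lm(log(y) ~ x))  #log(y) is linear in x, slope close to 0.1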

One last reason (although there are many others) is to make the interpretation of the results easier. If your variables are on very different scales, or if you care about proportional rather than absolute changes, logs help: in a log-log model, for example, the coefficient is an elasticity, i.e., the expected percentage change in the outcome for a one-percent change in the predictor.