Brian Weinfeld
May 7th, 2018
Introduction
In the National Hockey League (NHL), is there a relationship between the number of goals scored and the number of shots taken when considering the period?
\[goals\sim shots + period\]
I am a big fan of the NHL, and one thing I have noticed is that there seem to be more goals scored in the 2nd and 3rd periods than in the 1st period. I decided to explore this relationship in order to determine whether the perceived difference is statistically significant.
I also believe that this investigation may offer insight into possible future rule changes the NHL could make to increase scoring while still keeping the “spirit” of the game.
I will be performing a multiple linear regression analysis considering goals scored, shots taken and period of play (1st, 2nd or 3rd).
Data Collection
The raw data was collected and tidied in R. Below is a sample of the collected data; each row records the goals and shots for one period of one game, with the game identified by id.
| id | period | goals | shots |
|---|---|---|---|
| 29019 | 1 | 1 | 18 |
| 29019 | 2 | 1 | 31 |
| 29019 | 3 | 2 | 17 |
| 29020 | 1 | 4 | 23 |
| 29020 | 2 | 1 | 17 |
| 29020 | 3 | 0 | 21 |
| 29021 | 1 | 2 | 24 |
| 29021 | 2 | 2 | 16 |
| 29021 | 3 | 2 | 34 |
| 29022 | 1 | 3 | 22 |
Exploratory Analysis
Summary statistics for shots taken, by period:

| period | n | mean | sd | median | min | max | range |
|---|---|---|---|---|---|---|---|
| 1 | 1230 | 19.06829 | 4.436173 | 19 | 7 | 37 | 30 |
| 2 | 1230 | 20.49512 | 4.540521 | 20 | 6 | 35 | 29 |
| 3 | 1230 | 18.90406 | 4.438228 | 19 | 4 | 40 | 36 |
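The distributions of shots per period are very similar, with means between roughly 19 and 20.5 shots and standard deviations near 4.5. This suggests that the period of play has at most a small effect on the number of shots taken, although the 2nd period averages about 1.5 more shots than the other two.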
Summary statistics for goals scored, by period:

| period | n | mean | sd | median | min | max | range |
|---|---|---|---|---|---|---|---|
| 1 | 1230 | 1.452846 | 1.193543 | 1 | 0 | 7 | 7 |
| 2 | 1230 | 1.833333 | 1.291152 | 2 | 0 | 7 | 7 |
| 3 | 1230 | 1.914634 | 1.252511 | 2 | 0 | 6 | 6 |
The number of goals scored per period shows some variation. The difference is small, but given the size of the sample (1,230 observations per period), a difference of nearly \(\frac{1}{2}\) goal between periods could be meaningful.
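The summary tables above can be reproduced with a few lines of base R. The following is a minimal sketch, not the original code: it uses only the ten sample rows shown under Data Collection, so its numbers will not match the full-season tables, and the names `box.sample` and `describe` are hypothetical.

```r
# Ten sample rows from the Data Collection table (hand-entered).
# The full data set has 1,230 rows per period, so results will differ.
box.sample <- data.frame(
  period = factor(c(1, 2, 3, 1, 2, 3, 1, 2, 3, 1)),
  goals  = c(1, 1, 2, 4, 1, 0, 2, 2, 2, 3),
  shots  = c(18, 31, 17, 23, 17, 21, 24, 16, 34, 22)
)

# Hypothetical helper: the statistics reported in the summary tables,
# computed for a single numeric vector.
describe <- function(x) {
  c(n = length(x), mean = mean(x), sd = sd(x), median = median(x),
    min = min(x), max = max(x), range = max(x) - min(x))
}

# One row of statistics per period, first for shots, then for goals.
shots.summary <- t(sapply(split(box.sample$shots, box.sample$period), describe))
goals.summary <- t(sapply(split(box.sample$goals, box.sample$period), describe))

print(shots.summary)
print(goals.summary)
```

Splitting by period and applying one helper keeps the shots and goals tables in the same layout, which makes them easy to compare side by side.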
One-way ANOVA of goals by period:
##               Df Sum Sq Mean Sq F value Pr(>F)
## period         2    149   74.75   48.12 <2e-16 ***
## Residuals   3687   5728    1.55
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The exploratory analysis is promising. A one-way ANOVA of goals by period (F = 48.12 on 2 and 3687 degrees of freedom, p < 2e-16) indicates a significant difference in the mean number of goals scored across periods.
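An ANOVA table of this shape is what `aov()` produces when period is treated as a factor. The sketch below assumes that setup and runs on the ten sample rows only (hypothetical name `box.sample`), so the F value it prints is not the one reported above.

```r
# Ten sample rows from the Data Collection table (hand-entered).
box.sample <- data.frame(
  period = factor(c(1, 2, 3, 1, 2, 3, 1, 2, 3, 1)),
  goals  = c(1, 1, 2, 4, 1, 0, 2, 2, 2, 3),
  shots  = c(18, 31, 17, 23, 17, 21, 24, 16, 34, 22)
)

# One-way ANOVA: does the mean number of goals differ across periods?
# Because period is a factor with three levels, it uses 2 degrees of freedom.
goals.aov <- aov(goals ~ period, data = box.sample)

# Prints Df, Sum Sq, Mean Sq, F value and Pr(>F), as in the output above.
summary(goals.aov)
```

With the full data set in place of `box.sample`, the residual degrees of freedom would be 3690 - 3 = 3687, matching the table.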
Precondition Verification
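Before running the regression, I needed to ensure that the preconditions for the analysis were met (for linear regression these are typically linearity, nearly normal residuals, constant variability, and independent observations).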
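The original diagnostic plots are not reproduced here. As a rough sketch of how such checks are commonly done in R, the code below fits the model `goals ~ shots + period` to the ten sample rows and draws four standard residual plots. The name `box.sample` and the choice of plots are assumptions, not the author's code.

```r
# Ten sample rows from the Data Collection table (hand-entered).
box.sample <- data.frame(
  period = factor(c(1, 2, 3, 1, 2, 3, 1, 2, 3, 1)),
  goals  = c(1, 1, 2, 4, 1, 0, 2, 2, 2, 3),
  shots  = c(18, 31, 17, 23, 17, 21, 24, 16, 34, 22)
)

# Same model formula as in the regression output.
fit <- lm(goals ~ shots + period, data = box.sample)

par(mfrow = c(2, 2))  # arrange the four plots in a 2 x 2 grid

# Linearity and constant variability: residuals should scatter evenly
# around zero with no curve or funnel shape.
plot(fitted(fit), resid(fit), xlab = "Fitted goals", ylab = "Residuals")
abline(h = 0, lty = 2)

# Nearly normal residuals: points should follow the reference line.
qqnorm(resid(fit))
qqline(resid(fit))

# Nearly normal residuals, second view: histogram of the residuals.
hist(resid(fit), main = "Histogram of residuals", xlab = "Residuals")

# Independence: residuals in row order should show no trend or pattern.
plot(resid(fit), type = "b", xlab = "Row order", ylab = "Residuals")
```

Because goals are small counts, some skew in the residuals is to be expected; with several thousand observations, moderate departures from normality are usually less of a concern.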
Regression Analysis
\[\widehat{goals}=0.066448\times shots+0.285678\times period2+0.472701\times period3 + 0.185791\]
##
## Call:
## lm(formula = goals ~ shots + period, data = box.data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -2.7848 -0.9134 -0.0662  0.8164  5.2859
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.185791   0.091754   2.025    0.043 *
## shots       0.066448   0.004458  14.904  < 2e-16 ***
## period2     0.285678   0.049229   5.803 7.06e-09 ***
## period3     0.472701   0.048822   9.682  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.211 on 3686 degrees of freedom
## Multiple R-squared:  0.08083,  Adjusted R-squared:  0.08008
## F-statistic:   108 on 3 and 3686 DF,  p-value: < 2.2e-16
While the \(R^2_{adj}\) is low (about 0.08), indicating that shots and period explain only a small portion of the variability in goals, the p-values for shots and for both period terms are well below 0.01, so these predictors are statistically significant.
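As a quick check, the adjusted value can be recovered from the reported \(R^2\) with the usual formula, taking \(n = 3690\) observations (3 periods for each of 1230 games, consistent with the 3686 residual degrees of freedom) and \(p = 3\) predictors:

\[R^2_{adj}=1-(1-R^2)\frac{n-1}{n-p-1}=1-(1-0.08083)\times\frac{3689}{3686}\approx 0.0801\]

With this many observations the adjustment is tiny, so the low value reflects the model's limited explanatory power rather than a penalty for the number of predictors.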
The regression indicates that, everything else being equal, the 3rd period is expected to have nearly an extra \(\frac{1}{2}\) goal (0.47) compared to the 1st period. The effect for the 2nd period is less pronounced but still amounts to about \(\frac{2}{7}\) of a goal (0.29).
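To make the coefficients concrete, here is a worked example for a hypothetical period with 20 shots. The terms \(period2\) and \(period3\) are indicator variables: each equals 1 for a row from that period and 0 otherwise, so the 1st period is the baseline.

\[\begin{aligned}
\text{1st period: } \widehat{goals} &= 0.066448\times 20 + 0.185791 \approx 1.51\\
\text{2nd period: } \widehat{goals} &= 0.066448\times 20 + 0.285678 + 0.185791 \approx 1.80\\
\text{3rd period: } \widehat{goals} &= 0.066448\times 20 + 0.472701 + 0.185791 \approx 1.99
\end{aligned}\]

The gaps between the three predictions are exactly the period coefficients, whatever shot count is plugged in.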
Conclusion
\[\widehat{goals}=0.066448\times shots+0.285678\times period2+0.472701\times period3 + 0.185791\]
With p-values close to 0, there is very strong evidence that, after accounting for shots taken, more goals are scored in the 2nd and 3rd periods of NHL games than in the 1st period.
For the 2nd period, this is most likely because the teams switch sides of the ice (the "long change"), which results in less consistent line changes.
For the 3rd period, the approaching end of the game likely motivates teams to take more risks, which leads to more goals scored and more goals allowed (this includes empty net goals).
In order to increase scoring, I would recommend swapping the starting sides of the ice. That is, the 1st and 3rd periods would be played with the long change while the 2nd would be played with the short change.
Theoretically, this could raise scoring in the 3rd period by as much as \(0.285678+0.472701=0.758379\) goals above the 1st-period baseline. Over a season of 1,230 games, the change is roughly equivalent to an extra \(1230\times 0.285678\approx 351\) goals.
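One way to read the season figure is as a net effect per game. This assumes that the entire 2nd-period coefficient is due to the long change and that the same effect would carry over unchanged to other periods; both are assumptions, not results of the regression. Under the proposal, the 1st and 3rd periods each gain the long change and the 2nd period loses it:

\[\underbrace{+0.285678}_{\text{1st period}}\;\underbrace{-\,0.285678}_{\text{2nd period}}\;\underbrace{+\,0.285678}_{\text{3rd period}} = +0.285678 \text{ goals per game}\]

\[1230 \times 0.285678 \approx 351.4 \text{ goals per season}\]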
Additional research would need to be conducted to determine whether such an increase would actually be seen or whether there are lurking variables not being accounted for.