Is there a relationship in the NHL between shots taken and goals scored?

I set out to discover the relationship between the number of shots taken and the number of goals scored in the NHL’s 2017-2018 regular season.

head(raw.data) %>%
  kable() %>%
  kable_styling(bootstrap_options=c('striped'))
X1 id period goals shots
1 29019 1 1 18
2 29019 2 1 31
3 29019 3 2 17
4 29020 1 4 23
5 29020 2 1 17
6 29020 3 0 21

I used a sports statistics website's API to query the number of shots taken and the number of goals scored in each period of every game in the 2017-2018 season. Visual inspection of the data suggests a positive relationship between the two variables, but that relationship is not particularly clear.

# alpha is a fixed setting, so it belongs outside aes()
ggplot(raw.data, aes(shots, goals)) +
  geom_jitter(alpha = 0.2) +
  geom_smooth(method = 'lm')

I fit a simple linear regression to the data; the summary is displayed below. The summary indicates a positive relationship between shots and goals that is statistically significant but weak: the \(R^2\) value of just \(\approx 0.06\) indicates that shots account for only a very small portion of the variation in the number of goals per period.

hockey.lm <- lm(goals ~ shots, data=raw.data)
summary(hockey.lm)
## 
## Call:
## lm(formula = goals ~ shots, data = raw.data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.5671 -0.9009 -0.1008  0.7660  5.0991 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.435159   0.089190   4.879 1.11e-06 ***
## shots       0.066624   0.004458  14.946  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.226 on 3688 degrees of freedom
## Multiple R-squared:  0.05711,    Adjusted R-squared:  0.05685 
## F-statistic: 223.4 on 1 and 3688 DF,  p-value: < 2.2e-16
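
As a rough illustration of where the \(R^2\) figure comes from: in a regression with a single predictor, \(R^2\) equals the squared Pearson correlation between the two variables. The sketch below uses simulated data as a stand-in for `raw.data` (the data frame `sim`, its size, and the Poisson parameters are assumptions for illustration, not the real NHL data).

```r
set.seed(1)

# Simulated stand-in for raw.data (NOT the real NHL data)
sim <- data.frame(shots = rpois(500, lambda = 20))
sim$goals <- rpois(500, lambda = 0.44 + 0.067 * sim$shots)

sim.lm <- lm(goals ~ shots, data = sim)

# R-squared as reported by summary(), and the squared correlation
r2 <- summary(sim.lm)$r.squared
r  <- cor(sim$shots, sim$goals)

c(r.squared = r2, cor.squared = r^2)  # the two values agree
```

Likewise, with a single predictor the F-statistic is the square of the slope's t value, which matches the output above (\(14.946^2 \approx 223.4\)).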

\[\widehat{goals} = 0.067 \times shots + 0.435\]
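
To make the fitted line concrete, here is a minimal sketch that plugs a few shot totals into it. The coefficients are copied from the summary output above; the helper name `predict.goals` and the example shot counts are my own, chosen only for illustration.

```r
# Coefficients copied from the regression summary
intercept <- 0.435159
slope     <- 0.066624

# Hypothetical helper: predicted goals in a period for a given shot count
predict.goals <- function(shots) intercept + slope * shots

# Predictions for 20, 30, and 40 shots in a period
predict.goals(c(20, 30, 40))

# Each additional 10 shots adds about two-thirds of a goal
predict.goals(30) - predict.goals(20)
```

With the fitted model object available, `predict(hockey.lm, newdata = data.frame(shots = c(20, 30, 40)))` should give the same values without copying coefficients by hand.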

Continuing, I examined the residuals.

# alpha is a fixed setting, so it belongs outside aes()
ggplot() +
  geom_jitter(aes(fitted(hockey.lm), resid(hockey.lm)), alpha = 0.2)

ggplot() +
  geom_histogram(aes(resid(hockey.lm)), bins=20)

The residuals are very roughly normally distributed and centered around 0. However, the variation in the residuals does not appear to be uniform. Furthermore, the independence requirement is violated. Each period is not independent of the previous one, as the score going into a period can affect subsequent play. A team may pull its goalie if it is down (increasing the likelihood that the other team scores), or a team may sit its best players if the game is out of hand.
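
The impression of non-uniform variance can be checked more formally. The sketch below hand-rolls a Breusch-Pagan-style check (the studentized variant): regress the squared residuals on the predictor and compare \(n \times R^2\) from that auxiliary regression to a chi-squared distribution with 1 degree of freedom. It runs on simulated data, not the real `raw.data`; the names `sim`, `aux`, `bp.stat`, and `bp.p` are assumptions for illustration.

```r
set.seed(42)
n <- 2000

# Simulated stand-in for raw.data (NOT the real NHL data)
sim <- data.frame(shots = rpois(n, lambda = 20))
sim$goals <- rpois(n, lambda = 0.44 + 0.067 * sim$shots)

sim.lm <- lm(goals ~ shots, data = sim)

# Auxiliary regression: do the squared residuals depend on shots?
aux <- lm(resid(sim.lm)^2 ~ shots, data = sim)

# Test statistic and p-value under the null of constant variance
bp.stat <- n * summary(aux)$r.squared
bp.p    <- pchisq(bp.stat, df = 1, lower.tail = FALSE)

c(statistic = bp.stat, p.value = bp.p)
```

A small p-value would point to non-constant variance. If the `lmtest` package is installed, `lmtest::bptest(hockey.lm)` should offer a packaged version of this kind of test. Note that this addresses only the variance concern, not the lack of independence between periods.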

Conclusion:

Although it is tempting to predict the number of goals scored from the number of shots, there is no strong evidence that this is feasible. The residual analysis raises concerns that the regression's assumptions are violated, and even if they held, the small \(R^2\) value indicates that shots are only weakly predictive.

Analysis:

Why are shots not predictive of goals? After all, if you want to score, you need to amass shots. Goals are, in effect, successful shots. There are a number of reasons for this result. Without getting too deep into the weeds, the most significant reason is that a shot count treats every shot as identical: it gives no indication of the quality of the shots. A team could take 40 low-quality shots and be unlikely to score any goals, while another team could take 10 high-quality shots and expect to score several. In theory, a metric that counts only high-quality shots (or weights shots based on quality) may have more predictive power.
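
A quality-weighted metric of this kind might look like the sketch below. Everything in it is hypothetical: the shot-level data frame `shot.data`, the two teams, and the per-shot scoring probabilities are made up to mirror the 40-versus-10 example, and a real metric would need those probabilities estimated from data.

```r
# Hypothetical shot-level data: team A takes 40 low-quality shots,
# team B takes 10 high-quality shots (probabilities are invented)
shot.data <- data.frame(
  team   = c(rep("A", 40), rep("B", 10)),
  p.goal = c(rep(0.02, 40), rep(0.25, 10))
)

# Raw shot counts favor team A
raw.counts <- table(shot.data$team)

# Weighting each shot by its scoring probability favors team B
weighted.shots <- tapply(shot.data$p.goal, shot.data$team, sum)

raw.counts
weighted.shots
```

Under these made-up numbers, team A leads 40 to 10 in raw shots but trails roughly 0.8 to 2.5 in quality-weighted shots, which is the gap a plain shot count cannot see.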