Previous research in the National Hockey League has suggested that teams’ decisions to gain the offensive zone with puck possession (“carry-ins”) is preferred over dumping the puck in and chasing after it (“dump-ins”). However, standard comparisons of zone entry strategy are confounded by factors such as offensive and defensive talent, location on the ice, and shift time, each of which impact player choice. Indeed, contrasting carry-ins to dump-ins isn’t exactly an apples-to-apples comparison; instead, it is more like studying grapes versus prunes. Using two matching methods – propensity score matching and Bayesian additive regression trees – we leverage player-tracking data to estimate the causal benefits due to zone-entry decisions. Both approaches better account for the variables that affect entry choice. We also highlight the wide-ranging potential of the causal inference framework with player tracking data in sports while emphasizing the challenges of using standard statistical methods to inform decision-making in the presence of substantial confounding.

In hockey, players can enter the offensive zone by either carrying the puck in or dumping it. This decision is typically made by the player or their head coach. This decision is informed by, among other factors, offensive and defensive talent, location on the ice, shift time, time left in the game and score differential.

Previous research done by Tulsky (2013) showed that:

“A team that moves the puck through the neutral zone most effectively will be able to enter the offensive zone with possession, carrying the puck across the blue line. This might be expected to result in more shots and goals than a play where the team dumps the puck into the corner hoping to retrieve it. […] Every data set collected showed entries with possession being more than twice as effective as dump-ins at 5-on- 5. The neutral zone play that sets a team up to enter the offensive zone with possession is a critical driver of success.”

The primary goal of this research is to estimate the potential benefit that teams can gain by carrying it in instead of dumping it using a causal inference framework.

If the player knew the outcome of both zone entry strategies, deciding whether to carry the puck in or dump it would be a no-brainer. This is not possible, as we only know the outcome of *one* of the strategies, i.e. the one that the player chose. This is the problem that causal inference attempts to solve. Causal inference deals with prediciting the outcomes for the other strategy that was not undertaken. It also mimics some of the advantages of a randomized experiment when randomization is difficult or impossible to implement, like it is in hockey.

Given that carrying it in is considered the most efficient zone strategy, it’s important to estimate its causal benefit among the group who carried it in. To do so, we selected the *Average Treatment Effect on Treated* (ATT) as the estimand. ATT reflects the expected causal effect of carry-ins among players who did carry it in. In other words, it approximates the value of carry it in among those who had the chance to do so. The “treated” in our experiment is the group of players who carried the puck in and the control are those that dumped it in. To estimate the ATT, we used two approaches: propensity score matching and Bayesian Additive Regression Trees (BART):

We use matched methods with finite population propensity scores (Imbens and Rubin, 2015) to help account for all the variables that could impact a player’s decision to carry the puck in or dump it. Propensity score, first introduced in 1983, corresponds to the probability of receiving a treatment given the observed variables (Rosenbaum and Rubin, 1983). The propensity score is used for flexible matching of treatment and control groups with similar distribution of variables. Grouping individuals with similar propensity scores works to replicate a randomized experiment (Stuart, 2010).

Another method used to estimate causal effects are Bayesian Additive Regression Trees (BART). BART is a Bayesian nonparametric modeling procedure that is used to fit the response surface model (Hill, 2012). Compared to propensity score matching, BART generates posterior intervals. Additionally, treatment effect estimates using BART have been shown to be significiantly more accurate than propensity score matching in certain settings. It is also believed to be the more superior approach when there are many confounding covariates as is the case in many causal inference problems. Finally, BART also has the advantage of readily identifying heterogeneous treatment effects which is of special interest to us.

We obtained zone-entry tracking data from Stathletes for the 2017-18 and 2018-19 NHL seasons. We matched only within even-strength situations, leaving `n = 277,661`

entries, of which `n = 158,808`

were carry-ins and `n = 118,853`

were dump-ins. We obtained Goals Plus-Minus per 60 minutes (`GPM_60`

) from Evolving-Hockey in order to account for player skill. From Evolving-Hockey, we also obtained expected goals (`xG`

) and calculated xG as a function of time using NHL play-by-play data.

Our propensity score model estimates the probability of a player carrying the puck in given game characteristics of each carry-in. This probability is called \(e(X)\). We used a multiple logistic regression model with spline terms to estimate the covariates defined in Table 1 below. The ultimate goal of estimating \(e(X)\) is to balance the variables in X between the players that carried it in and the players that did not. The full list of interaction terms and spline knots used in the propensity score model can also be found in Table 1. Note: the list of interaction terms were chosen using our understanding of what influences zone entry decision-making. The spline knots were optimized to obtain the lowest Akaike Information Criterion (AIC).

**Table 1: Descriptions of variables and interaction terms used in the propensity score model**

Variable | Description | Spline knots |
---|---|---|

Position | position of entry player, forward or defenseman | NA |

score_diff | score differential | NA |

length_of_shift | shift length for entry player | 10 |

game_seconds | seconds left in the game | 5 |

x_entry | x-coordinate of entry player | 5 |

y_entry | y-coordinate of entry player | 5 |

GPM_60 | Goals plus-minus per 60 minutes | 5 |

GPM_60*score_diff | interaction term | 5 & NA |

x_entry*y_entry | interaction term | 5 & 5 |

Matching is the most popular technique to estimate the ATT. Matching involves taking each subject in the treatment group and matching them with a subject in the comparator group based on their covariates and the propensity score. Using the `Matching`

package, we performed one-to-one matching with replacement using the estimated propensity score. Our matching cohort consisted of `n = 153,498`

matched pairs. Each pair includes one play where a team did *not* carry it in, as well as a corresponding match where a different player *did* carry it in. Matching success across all plays in the matched cohort is best represented by comparing the distributions of X between players that carried it in and players that did not, for example:

With our propensity score model, the estimated ATT was `0.0209`

. **Table 2** shows the estimated ATT’s for the three outcomes considered among the matched cohort:

**Table 2: Estimated ATT’s in matching cohort for the outcomes considered**

Avg. xG for per 30s | Avg. xG against per 30s | Avg. Shots on Goal per 30s | |
---|---|---|---|

Estimated ATT | 0.02095 | 0.000 | 0.2565 |

The estimated ATT’s in the table above represent the difference in the outcomes for an entry that was a carry-in (not a dump-in). Approximately, there is 1 extra goal produced every 50 carry-ins given that there are about 30 carry-ins per game per team. Additionally, xG against are not any worse after a carry-in. Of note, given the large number plays ( `n = 153,498`

), the standard errors were very small.

Compared to propensity score matching, BART yielded a similar ATT estimate of `0.0205`

. BART is able to identify heterogenous treatment effects across different groups. In other words, BART allows us to estimate the benefit of carrying it in for each variable. Comparing defensemen to forwards, we see that in the table below, it is approximately 23% more beneficial for a forward to carry it in:

Avg. ATT | Std Error | |
---|---|---|

Forward | 0.0211 | 0.0000134 |

Defenseman | 0.0171 | 0.0000296 |

There appears to be no association between Goals plus-minus per 60 minutes and position of entry player as shown in the figure below. The difference between carrying it in and dumping it appears to be the same for players of different skill level:

In the figure below, there was also no observable difference between causal estimates by score differential, game seconds left or shift length. We see that the marginal 95% posterior intervals – represented by the dotted lines – cover the true CATT’s well. The exception is for `game_seconds`

above 3000 seconds where no overlap exists: