Introduction

The inspiration for this post came after watching a random video on YouTube titled “Is Kyle Dubas a Fraud or a Hockey Genius?”

I know, I know, you should let bygones be bygones, but as a devout Maple Leafs fan, I was curious.

Now you might be thinking… why? Mr. Dubas isn’t even the GM of Toronto anymore… he’s with the Pens. Again… I know… but he oversaw so much of my heartbreak over the last 5+ years that I had to click.

Now, I don’t expect you to watch the video, and why would I, when I already mentioned in a prior post that I have a ChatGPT bot that summarizes things like that for us?

See below:

The video discusses various scenarios in hockey management, focusing on issues like player redundancy, salary cap management, and roster construction. It delves into the careers of several NHL players and teams, examining instances where having two star players in the same position (e.g., Brent Burns and Erik Karlsson of the San Jose Sharks) or overcommitting financially to multiple players (like the Toronto Maple Leafs with Tavares, Matthews, Marner, and Nylander) can lead to problems. It emphasizes the pitfalls of over-focusing on star players, especially when it neglects team depth and financial flexibility, using various team scenarios and management strategies as examples, and contrasting different team management philosophies and their impacts on team performance and financial health. It subtly critiques some managerial decisions and underscores the importance of having a well-rounded team and effective salary cap management in the NHL.

If you’re a Toronto Maple Leafs fan or a hockey fan at all, it’s easy to see how the Toronto Maple Leafs example relates to Dubas. If you’re not, then basically Dubas signed a small number of players to very large contracts and then really couldn’t afford anyone else… then the Leafs lost in the first round of the playoffs every year for six straight years until last year, when they lost in the second round.

It hurts to even write that… but it’s also not what I want to focus on. I want to focus on the first point, about having two star players who are similar, then paying them both a lot only to have them regress. This is effectively the Burns-Karlsson case study… both amazing defensemen when they played on different teams, but both… let’s say not so great when they played together. The video author’s concluding point was that Dubas had just brought Karlsson (who was coming off an amazing season after Burns had been traded to Carolina) to the Pens, adding not only one of the highest paid defensemen in the league but also a possibly redundant one, seeing as the Pens already have a very similar player in Kris Letang.

I thought this was interesting… and wanted to see if I could actually put some numbers to some of these ideas.

I’ve been interested for a long time in the concept of team construction because in essence it is a portfolio. Players are literally assets, different positions (Forward, Defense, Goalie) are like asset classes, sub-positions (Left wing, Right wing, Center) are kind of like sectors, and then the players themselves are obviously the underlying instruments.

When you construct a portfolio of financial assets, diversification is important and so are concepts like correlation. Well, it’s the same when constructing a team portfolio. For instance, Forwards and Goalies have very little in common (low correlation, good for diversification) while Defensemen have some elements in common with both Forwards and Goalies. This can get even more granular on the player type level (‘stay at home’ Defenseman, playmaker, goal scorer etc). But like any portfolio, understanding where you already have exposure and where you might need to look for undervalued exposure is imperative… So did Kyle Dubas just buy some more overpriced defensive exposure that his portfolio doesn’t need?

Let’s see if we can find out.

Data

The least fun part… the hardest part about putting numbers to anything is finding numbers to use. Luckily for me, a couple of years ago I built a web scraping function to pull in historical NHL data going back to 2010… (It was for a fantasy hockey league that I’m notoriously bad at, but that’s another topic).

The script effectively pulls in data for every NHL player who’s played a game in the league since 2010 and allows us to look at player and team stats.

That’s the good news…

The bad news is there are not a ton of variables… there are enough variables… but I’m not making any new advanced analytics friends with this.

This is what the data set looks like:

Player Stats Table

Now for those concerned individuals who are confused why McDavid only had 1.085 Points in 2022, let me start by saying that all of this data has been standardized to be per game so that we can actually compare players both cross-sectionally and on a time series basis.

Here are the variables we have to work with:

Player: Player’s Name
Season: Season year
Position: Simplified to Forward, Defense or Goalie
G: Goals per Game
Ast: Assists per Game
Pts: Points per Game
SOG: Shots on Goal per Game
Hit: Hits per Game
Blk: Blocked shots per Game
Shpct: Shooting percentage
SVP: Save percentage (Goalies only)
TOI: Time on Ice as a fraction of each game played (e.g. 0.1 implies the player is on the ice for 10% of the games they are in)

(If you are confused why some goalies have TOI greater than 1 this would be because in every game they played in they not only finished… but sometimes went to overtime. If TOI is less than 1 … this goalie got pulled or injured and didn’t finish some games)

It should also be noted that the data set has been filtered to only include players who played in at least half of the season.
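For anyone who wants to see what the per-game standardization and the half-season filter amount to, here is a minimal Python sketch. It is not the scraping script used for this post: the `SeasonTotals` record, the 82-game season length and the sample rows are all assumptions made up for illustration.

```python
from dataclasses import dataclass

SEASON_GAMES = 82  # assumption: a full regular season; shortened seasons need their own value


@dataclass
class SeasonTotals:
    """Hypothetical raw row: season totals for one player."""
    player: str
    season: int
    games: int
    goals: int
    assists: int


def per_game(rows, season_games=SEASON_GAMES):
    """Drop players with under half a season, then convert totals to per-game rates."""
    out = []
    for r in rows:
        if r.games * 2 < season_games:  # fewer than half the games: filtered out
            continue
        out.append({
            "Player": r.player,
            "Season": r.season,
            "G": r.goals / r.games,
            "Ast": r.assists / r.games,
            "Pts": (r.goals + r.assists) / r.games,
        })
    return out


rows = [
    SeasonTotals("Player A", 2022, 80, 40, 40),  # made-up totals
    SeasonTotals("Player B", 2022, 30, 10, 5),   # under half a season
]
table = per_game(rows)
print(table)
```

Dividing by games played is what makes a player with a short season comparable to one who played every night, which is why the filter on games played comes first.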

Exploratory Analysis

We have a lot of data here and so it’s good to get a sense of what it looks like and what we can realistically do with it.

One of my favorite ways to do this is with violin plots, which are basically a great way to visualize the distribution of a factor across a categorical variable. Using our earlier example of positions as asset classes, we can see if there are any interesting differences in factors across positions.

Goal Distribution

Assist Distribution

Points Distribution

SOG Distribution

Hit Distribution

Blk Distribution

Shpct Distribution

SVP Distribution

TOI Distribution

We can also view this all in a nice average table like this:

Summary Table

What can we take away from this? Well, Forwards definitely outscore Defensemen on average by a large margin… which makes sense. Assists, however, are much more evenly distributed. Points, which is obviously the sum of goals and assists, is then still skewed towards Forwards. Shots on Goal and Hits are another two that are actually pretty even, with Forwards edging out Defense in SOG and Defense edging out Forwards in Hits… but neither in any significant way. Blocked shots is another one that is heavily skewed towards Defensemen. And our final variable, shooting percentage, is heavily skewed towards Forwards. This last point is actually fairly interesting to me. In general the average Forward and Defenseman get around the same number of shots on net per game, however a Defenseman’s shot is likely coming from further out vs a Forward who would positionally be closer to the opposing team’s net. This likely accounts for the higher shooting percentage and also makes sense given the discrepancy in Goals.

Goalies are the only ones who record save percentages and they also spend the most time on the ice, followed by Defensemen and then Forwards.
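The per-position averages behind a summary table like this boil down to a group-by and a mean. A rough Python sketch, assuming each row is a dictionary with a `Position` key and per-game stats (the rows below are made up, not taken from the data set):

```python
from collections import defaultdict


def mean_by_position(rows, stats):
    """Average each stat within each position group."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for row in rows:
        counts[row["Position"]] += 1
        for stat in stats:
            sums[row["Position"]][stat] += row[stat]
    return {
        pos: {stat: sums[pos][stat] / counts[pos] for stat in stats}
        for pos in counts
    }


# Made-up per-game rows, purely for illustration.
rows = [
    {"Position": "Forward", "G": 0.25, "Blk": 0.5},
    {"Position": "Forward", "G": 0.75, "Blk": 0.5},
    {"Position": "Defense", "G": 0.125, "Blk": 1.5},
]
summary = mean_by_position(rows, ["G", "Blk"])
print(summary)
```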

Roster Optimization

Before we deal with Dubas I wanted to play around with the concept we discussed earlier around team construction as it relates to portfolio construction. Basically, can we create some kind of model that we can use to find an optimal roster of Forwards, Defensemen and Goalies, now that we have some common attributes of the average asset in each asset class?

Here is my approach: For my model I want to use only variables that are discriminative… that basically do well at dividing the attributes of my three assets.

In this case it would be Goals, Blocked Shots and Save Percentage.

Next I need a way of expressing these variables in an equation that I can then optimize to find the optimal number of holdings in each asset class.

Now for simplicity we will assume an average team has 12 Forwards, 8 Defensemen and 1 Goalie for each game. In total it can fill the roster with 21 positions.

Our objective is the net score for an average game:

\[ \text{Net Score} = \text{Team Goals} - \left(\text{Total Shots Towards Team's Net} - \text{Team Blocked Shots} \right) \times \left(1 - \text{Team Save Percentage} \right) \]

Where,

\[ \text{Team Goals} = \left(\text{Number of Forwards} \times \text{Avg Forward Goals/Game}\right) + \left(\text{Number of Defensemen} \times \text{Avg Defensemen Goals/Game}\right) \]

and,

\[ \text{Team Blocked Shots} = \left(\text{Number of Forwards} \times \text{Avg Forward Blocked Shots/Game}\right) + \left(\text{Number of Defensemen} \times \text{Avg Defensemen Blocked Shots/Game}\right) \]

and,

\[ \text{Team Save Percentage} = \frac{\sum \text{Average Goalie SVP}}{\text{Number of Goalies}} \]

Note on that last point with goalies… it basically means that if you have 1 goalie or 21 goalies your save percentage doesn’t change … if you don’t like that … how would 6 goalies playing on the ice look to you?

So basically I believe the model will figure out quickly that having 1 Goalie is beneficial but adding more doesn’t help its cause. The second trade-off then comes between offense and defense… add more offense and score more goals but more shots come towards your net… add more defense and score fewer goals but fewer shots come towards your net.

We already have most of these inputs, but what we don’t have is Total Shots Towards the Net. But we can find it. To do that, we take our data set and aggregate all of our players by their respective teams for each season. This gives us team stats, then we can calculate the average shots on goal each team took and also the number of shots each team blocked. If you are wondering what that looks like, here is the table (Note that this table is also adjusted on a “per game” basis):

One of my favorite observations (albeit obvious) from this data set is this beautiful graphical representation of my childhood hockey career:

Avg Team Shots on Net Per Game vs Avg Team Goals Scored Per Game

The $R^2$ is even better if we use our original player data set with more observations

Avg Player Shots on Net Per Game vs Avg Player Goals Scored Per Game

Pucks on net kids… Pucks on net.

Anyways, we can take the average team shots on goal from 2010 to 2022 which comes out to $30.85$ and average team blocked shots which comes out to $14.26$. So in total our hypothetical team needs to face $45.11$ shots towards their net. We can code the net score formula above into a function and then basically brute force optimize what the best combination is.

It’s actually very efficient, given the total number of possible combinations is only 253 (n = 3 choose r = 21 with replacement)…
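To make the brute force concrete, here is a minimal Python sketch of the net score formula above and the 253-roster search. It is not the code behind the results table: the per-position averages are the rounded figures quoted in this post, and the 0.902 save percentage is an assumption read off the goalie cluster center printed further down.

```python
# Rounded per-game averages quoted in the post; treat them as approximations.
FWD_GOALS, DEF_GOALS = 0.21, 0.07
FWD_BLOCKS, DEF_BLOCKS = 0.46, 1.46
GOALIE_SVP = 0.902          # assumption: roughly the goalie cluster mean
SHOTS_TOWARDS_NET = 45.11   # average shots on goal plus average blocked shots
ROSTER_SIZE = 21


def net_score(n_fwd, n_def, n_goalies):
    """Net score for an average game, following the formula in the post."""
    team_goals = n_fwd * FWD_GOALS + n_def * DEF_GOALS
    team_blocks = n_fwd * FWD_BLOCKS + n_def * DEF_BLOCKS
    # Averaging identical goalies gives the same SVP for any count >= 1;
    # with no goalie at all, nothing gets saved.
    team_svp = GOALIE_SVP if n_goalies > 0 else 0.0
    return team_goals - (SHOTS_TOWARDS_NET - team_blocks) * (1.0 - team_svp)


# Every way to split 21 roster spots across Forwards, Defensemen and Goalies.
rosters = [
    (f, d, ROSTER_SIZE - f - d)
    for f in range(ROSTER_SIZE + 1)
    for d in range(ROSTER_SIZE + 1 - f)
]
best = max(rosters, key=lambda r: net_score(*r))
print(len(rosters), best)
```

Looping over counts per position, rather than over individual roster slots, is what keeps the search at 253 candidates.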

The results are shown in the table below:

As someone who spent most of their minor hockey career playing defense, I have to say I was very disheartened by the fact that the optimal roster based on the model I created indicated that having no Defensemen was the best strategy…

That being said, this outcome was fairly obvious in hindsight. Why? Well, we are really only evaluating one trade-off here: Forwards for Defensemen. For each Forward we swap out for a Defenseman, we lose ~0.14 goals (-0.21 goals from the Forward we lost + 0.07 from the Defenseman we gained). But in terms of goals saved it’s only ~0.097 ([-0.46 Blocked shots from the Forward we lost + 1.46 Blocked shots from the Defenseman we gained] x (1-SVP)). Basically we lose more goals than we save by adding Defensemen.
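Written out as a worked equation, swapping one Forward for one Defenseman changes the two sides of the net score as follows (taking \(1 - \text{SVP} \approx 0.097\), which is the value implied by the figure quoted above rather than a number reported separately):

\[ \Delta \text{Goals For} = 0.07 - 0.21 = -0.14 \]

\[ \Delta \text{Goals Against} = -\left(1.46 - 0.46\right) \times \left(1 - \text{SVP}\right) \approx -1.00 \times 0.097 = -0.097 \]

\[ \Delta \text{Net Score} = \Delta \text{Goals For} - \Delta \text{Goals Against} \approx -0.14 + 0.097 = -0.043 \]

So under these averages every Forward-for-Defenseman swap costs roughly 0.04 net goals per game, no matter how many swaps have already been made.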

Our net score objective function is just the difference between these two trade-off curves but these curves are linear so all we can change are the slopes of these curves. Therefore, the optimal solution will always be at one of the extremes … unless the trade off is equal, in which case it doesn’t matter the combination.

Goals Scored vs Number of Forwards (Black) + Goals Against vs Number of Forwards (Blue)

You can see this in the visual above, basically the slope of goals gained by adding a forward is steeper than the slope of the incremental goals given up by adding a forward.

We will leave this alone for now … but maybe we pick it up in a bit …

Foreshadowing

Cluster Modeling

Alright, so my original idea was to see if we could build a cluster algorithm to classify players into specific groups. Then we could make assessments about the types of players in each group and show, for instance, that two players are very similar to each other and therefore have similar factor exposure. By doing this I was hoping to be able to support some of the claims in the video. The reality though, is that given the number of factors we have, and the simplicity of those factors, we won’t be able to build as detailed a classification engine as I had hoped. That being said, we can still give it a try and see what it spits out.

We are going to use one of my favorite clustering models: K-means (not to be confused with KNN, or K-nearest neighbors, which is a different algorithm). Basically what this algorithm does is split the observations into K groups, assigning each observation to the cluster whose center (mean) is closest to it in its n-dimensional space. The K is the parameter that can be tuned, and it equals the number of clusters to sort the data into.

KNN, by contrast, is built for supervised learning, where if two categories existed in a two-dimensional space, a new observation could be categorized based on its location in those two dimensions, using a vote of the K known observations closest (nearest) to this unknown observation.

What’s great about K-means is that it is unsupervised: the observations are not categorized ahead of time, and the algorithm finds the best way to discriminate the data into categories.

If you’re still curious you can read more here

First off, to run this model we are only going to use the following variables:

Next, to get an idea of the optimal number of clusters we should use to sort our data, we are going to use the “Elbow method”. Basically what we do is:

Compute the K-means clustering algorithm for different values of K (in this case 1-10)
For each K calculate the total within-cluster sum of squares
Plot the within-cluster sum of squares at each k value
Look for the bend in the plot as a general indicator of appropriate number of clusters

Elbow Method

You can see in our case 4 or 5 seems to be a decent number of clusters to choose, but I’m going to choose 6 because it’s the number of players on the ice at any one time.

Alright, we can now run our K-means model. If you’re curious about the output, this is what it looks like in all its glory:
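If you want to see the moving parts of the Elbow method, the steps listed above can be sketched in plain Python. This is a bare-bones version of K-means (Lloyd's algorithm) with a few random restarts, not the routine that produced the plot in this post, and the toy points are made up.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's algorithm: returns (centers, total within-cluster sum of squares)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k distinct observations
    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            groups[nearest].append(p)
        # Update step: each center moves to the mean of its group.
        new_centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
        if new_centers == centers:  # converged
            break
        centers = new_centers
    wcss = sum(min(sq_dist(p, c) for c in centers) for p in points)
    return centers, wcss


def elbow_curve(points, max_k, restarts=5):
    """Best within-cluster sum of squares found for each K from 1 to max_k."""
    return [
        min(kmeans(points, k, seed=s)[1] for s in range(restarts))
        for k in range(1, max_k + 1)
    ]


# Three made-up blobs in two dimensions.
points = [
    (0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (0.3, 0.2),
    (5.0, 5.0), (5.2, 5.1), (5.1, 5.3), (5.3, 5.2),
    (0.0, 5.0), (0.2, 5.1), (0.1, 5.3), (0.3, 5.2),
]
curve = elbow_curve(points, max_k=6)
for k, wcss in enumerate(curve, start=1):
    print(k, round(wcss, 3))
```

Plotting `curve` against K and looking for the bend is the visual step; the restarts are there because a single random start can settle on a poor split.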

## K-means clustering with 6 clusters of sizes 441, 2310, 601, 1261, 1345, 1168
## 
## Cluster means:
##            G       Ast      Shpct       toi       Blk       SVP
## 1 0.00000000 0.0000000 0.00000000 0.9384371 0.0000000 0.9021966
## 2 0.14958966 0.1919695 0.09530486 0.2244695 0.3702757 0.0000000
## 3 0.07451735 0.2603281 0.04654514 0.3558122 2.0383136 0.0000000
## 4 0.13468266 0.2346008 0.08174900 0.2705096 0.8250786 0.0000000
## 5 0.33479793 0.4861787 0.13045687 0.3071818 0.4350668 0.0000000
## 6 0.07712241 0.2634189 0.04849439 0.3347700 1.4066031 0.0000000
## 
## Clustering vector:
##    [1] 2 4 6 5 6 6 2 2 2 2 2 4 2 3 5 2 4 2 6 2 6 3 2 5 2 5 2 5 2 2 4 2 6 2 2 2 6 2 5 2 5 6 2 5 6 2 6 5 4 3 2 2 5 4 6 5 4 2 5 2 2 4 6 5 6 4 6 2 5 6 4 6 2 6 3 3 5
##   [78] 2 6 2 2 3 2 2 5 2 5 2 6 2 6 5 2 2 4 2 6 4 2 2 2 2 6 2 5 6 4 2 5 2 6 2 2 2 5 5 2 4 2 2 4 2 2 5 6 6 5 2 6 6 4 2 3 5 2 5 6 4 2 6 4 2 4 6 5 2 6 4 4 5 4 2 2 5
##  [155] 5 5 3 2 2 6 5 3 3 4 2 3 3 5 2 6 4 2 6 2 6 4 5 5 3 3 6 4 6 3 2 3 4 4 6 3 3 2 6 2 2 6 5 5 5 2 4 3 5 2 5 2 4 3 3 2 5 2 5 5 5 2 4 6 5 3 2 2 2 2 6 6 5 5 2 6 4
##  [232] 3 3 3 4 5 5 4 3 4 2 4 5 4 3 3 2 2 5 5 6 4 2 5 4 2 5 2 5 2 5 6 6 6 5 4 5 2 5 4 2 2 5 6 5 5 3 6 6 2 2 6 3 3 5 6 5 6 3 5 3 2 4 2 4 5 5 2 4 5 4 6 3 6 2 2 6 6
##  [309] 4 4 5 2 2 2 3 6 6 4 3 2 4 2 6 2 3 3 5 6 5 2 5 3 6 5 5 2 2 4 6 2 2 4 3 4 2 3 3 2 6 2 4 2 5 2 2 2 2 5 2 4 3 2 5 5 2 4 3 3 6 6 2 3 5 6 5 2 2 2 2 4 3 2 5 4 2
##  [386] 4 2 4 5 2 3 2 5 5 4 2 6 2 4 6 2 3 2 5 5 2 6 4 2 5 2 6 6 6 3 6 2 3 3 6 3 5 3 6 5 5 2 2 2 5 2 2 2 2 3 6 2 4 2 2 3 6 2 5 5 5 5 6 5 2 2 5 2 5 2 4 5 2 2 5 2 2
##  [463] 6 4 6 6 6 2 6 2 4 2 4 2 5 5 4 4 5 2 3 2 5 2 2 2 2 6 4 2 6 5 5 4 6 3 3 2 2 6 4 4 6 6 6 5 2 2 6 5 6 5 2 6 2 2 4 6 2 6 6 2 2 3 5 6 2 2 2 2 2 2 5 4 6 2 2 2 4
##  [540] 5 5 2 5 2 6 2 4 4 6 2 2 4 5 3 6 2 4 2 2 4 4 5 2 6 2 3 2 2 5 2 4 2 2 2 6 2 2 6 2 2 2 4 2 2 6 3 2 4 6 5 2 2 2 3 5 4 6 4 5 2 2 2 6 5 6 4 6 2 6 4 6 3 3 5 2 6
##  [617] 2 2 2 3 2 2 2 4 5 2 6 2 6 5 2 4 4 2 4 4 4 2 2 2 6 4 5 2 6 4 5 5 2 4 2 2 2 4 5 4 4 2 3 5 3 6 5 4 2 6 2 4 5 6 5 4 5 6 4 6 4 5 4 3 5 2 4 4 5 2 2 5 5 2 6 4 2
##  [694] 6 4 6 3 2 6 3 3 5 4 6 4 2 6 6 6 3 2 5 4 6 6 3 3 2 3 2 2 4 6 3 3 4 6 2 5 3 5 5 3 6 2 2 2 4 5 6 3 2 2 2 2 2 5 5 2 5 3 2 2 6 6 6 2 5 5 2 4 4 4 3 4 4 5 5 5 6
##  [771] 4 2 4 5 2 3 3 4 2 5 2 3 2 2 5 2 2 2 2 5 5 3 3 3 4 6 5 2 2 6 2 4 2 6 5 5 2 6 6 2 2 6 6 6 2 2 5 5 3 2 3 2 2 5 5 5 5 2 6 4 2 2 6 4 2 2 2 6 6 6 5 3 2 4 2 2 3
##  [848] 4 4 4 6 2 5 3 6 5 5 2 2 4 4 2 2 2 6 3 4 5 6 2 3 5 2 5 5 2 2 4 5 5 2 2 4 6 2 2 2 5 5 2 5 6 6 6 4 3 5 2 2 6 2 4 5 4 5 2 4 6 2 2 2 6 5 5 5 2 6 2 6 4 6 2 6 2
##  [925] 5 5 6 6 2 2 4 6 6 2 6 3 4 3 6 3 5 3 6 5 5 2 2 5 2 2 4 3 6 2 4 2 2 6 6 2 5 5 5 5 6 5 3 2 2 5 2 5 2 4 4 2 2 2 2 6 6 6 6 3 2 6 2 6 4 3 2 4 5 5 4 4 5 2 6 4
##  [ reached getOption("max.print") -- omitted 6126 entries ]
## 
## Within cluster sum of squares by cluster:
## [1]  0.9717061 74.9887130 63.3613031 65.4595535 78.9510761 72.4424307
##  (between_SS / total_SS =  89.2 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss" "betweenss"    "size"         "iter"         "ifault"

Here is a closer look at our cluster centers:

So the first cluster looks like we have the Goalies… makes sense. Now a little out of order here but the 5th cluster seems to be the top Forwards, fairly high in all categories, including blocked shots. Cluster 3 looks to be our best Defensemen… They play a lot of minutes, block a lot of shots, make a lot of assists but don’t score often. Next cluster 6 seems to be more second tier Defensemen… still block a lot of shots, play a lot and also score slightly more but not significantly. Clusters 2 and 4 seem to be pretty similar with group 4 scoring a bit less on average, but contributing more in assists, minutes and blocked shots. Let’s call these our versatility players, making cluster 2 our bottom six forward group.

Now let’s check the frequency of positions in our clusters:

Okay, so cluster 1, all Goalies… seems it was pretty easy for the clustering algo to figure that out. Cluster 5, our top Forwards, looks like it is mainly Forwards and 6 Defensemen… Who, you’re probably asking?

In Cluster 3 and Cluster 6, as expected, we have Defensemen almost exclusively. Cluster 4, our versatility players, seems to be our most balanced group, and finally Cluster 2 holds all other Forwards almost exclusively.
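A frequency check like this is just a cross-tabulation of cluster against position. A tiny Python sketch, using made-up (cluster, position) pairs rather than the real assignments:

```python
from collections import Counter

# Hypothetical (cluster, position) pairs, one per player-season.
assignments = [
    (1, "Goalie"),
    (5, "Forward"), (5, "Defense"),
    (3, "Defense"),
    (2, "Forward"), (2, "Forward"),
]

freq = Counter(assignments)
clusters = sorted({cluster for cluster, _ in assignments})
positions = ["Forward", "Defense", "Goalie"]

# One row per cluster, one column per position; missing pairs count as 0.
crosstab = {c: {p: freq[(c, p)] for p in positions} for c in clusters}
for c, row in crosstab.items():
    print(c, row)
```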

So for simplicity let’s call each group the following:

Cluster 1 = Goalies
Cluster 2 = Average Forwards
Cluster 3 = Top Defensemen
Cluster 4 = Versatility Players
Cluster 5 = Top Forwards
Cluster 6 = Average Defensemen

Here is the entire data set if you want to search for players:

We can also visualize our clusters on a plot. Now you might be asking… how can we do that, since a plot only has 2 dimensions and our model has 6? Well, we can use a technique called Principal Component Analysis. PCA is great because it allows us to reduce the number of dimensions. This helps in modeling with problems related to multicollinearity and factor reduction, but in our case it also helps us visualize. What PCA basically does is try to combine our variables into new “Principal Components” that explain most of the variation in our data. Now, while it does reduce dimensions, what we lose is interpretability. Basically I no longer know what variables are driving my model, only that they are some combination of my original variables. So in our case we can reduce our dimensions down to 2, then plot our clusters within these new principal components. Here is what it looks like:
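For the curious, the dimension reduction itself can be sketched without any libraries. The version below finds the first two principal components with power iteration and deflation on the covariance matrix. It is only a teaching sketch on made-up points, not what produced the plot in this post, and power iteration can struggle when two components explain nearly the same amount of variance.

```python
import math


def center(rows):
    """Subtract each column's mean so the data is centered at the origin."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def covariance(rows):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)] for i in range(d)]


def top_eigenvector(matrix, iters=500):
    """Power iteration: multiply and renormalize until the dominant direction emerges."""
    d = len(matrix)
    v = [1.0 / (i + 1) for i in range(d)]  # arbitrary non-zero starting vector
    for _ in range(iters):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


def pca_2d(rows):
    """Return the first two principal directions and each row's 2-D coordinates."""
    X = center(rows)
    cov = covariance(X)
    d = len(cov)
    pc1 = top_eigenvector(cov)
    lam1 = sum(pc1[i] * cov[i][j] * pc1[j] for i in range(d) for j in range(d))
    # Deflation: strip out the variance explained by pc1, then repeat.
    deflated = [[cov[i][j] - lam1 * pc1[i] * pc1[j] for j in range(d)] for i in range(d)]
    pc2 = top_eigenvector(deflated)
    coords = [
        (sum(a * b for a, b in zip(row, pc1)), sum(a * b for a, b in zip(row, pc2)))
        for row in X
    ]
    return pc1, pc2, coords


# Made-up 3-D points: most spread along the first axis, least along the third.
rows = [(3, 0, 0), (-3, 0, 0), (0, 2, 0), (0, -2, 0), (0, 0, 1), (0, 0, -1)]
pc1, pc2, coords = pca_2d(rows)
print(pc1, pc2)
```

The two coordinates in `coords` play the role of Dim1 and Dim2 on a cluster plot: each is a weighted mix of the original variables, which is exactly where the loss of interpretability comes from.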

Cluster Visual

This visual helps confirm a lot of what we already determined above. First of all, Dim1 and Dim2 are our new Principal Components, and within our new observation space we can see Cluster 1 (our Goalies) completely removed from the rest of the observations. Also interesting is that our Top Forwards (Cluster 5) and Top Defensemen (Cluster 3) are at the furthest points from each other. Cluster 2, our Average Forwards, is close in proximity to our Top Forwards group (Cluster 5), and Cluster 6, our Average Defensemen group, is closest to the Top Defensemen (Cluster 3), and in the middle (Cluster 4) are our Versatility Players.

Player Evaluation

Okay, now obviously we don’t have very specific classifications for our players, but using what we have let’s explore some of the output.

Let’s start by looking at some well-known defensemen and see how their classification has changed over time.

I really don’t think it’s right to make any grandiose statements off of these models. Trying to draw complex conclusions from what is really a very simplistic model seems counterproductive. But it is interesting how generally the Clusters actually do a decent job at describing some of the seasons these players have had over their careers and also the general type of player they are. I’m not going to go into detail on every player here but let’s take a look at the Burns/Karlsson case study. (You can isolate them by deselecting all other players). Before playing on the same team you can see there was much more volatility in both of their clustering. Both had stints in the “top Cluster 5” categories. Karlsson was classified more as a “Versatility Player” in two of those years, and the two seasons before Karlsson got traded to the Sharks they were both clustered as “Top Defensemen”. Then for the four years following they both jumped into Cluster 6 (Average Defensemen) and stayed. Now technically Burns didn’t get traded until 2022, but the trend breaks in 2021, with both showing better seasons in that year. In 2022 Burns is still shown as a Versatility Player in Cluster 4, but Karlsson gets grouped back in with the Average Defensemen… tough look for 101 points.

But that’s the reality with these things… This is just one (very imperfect) model’s opinion.

That being said, now contrast both of these players to Kris Letang and you’ll see Kris has been consistent, spending almost his entire career in Cluster 6. Now you might think, Letang’s not just an average Defenseman and I very much agree, but also appreciate our model is simple and only two classifications for defensemen isn’t enough to tell any stories. But one thing that I think is important is consistency. Every year with Letang (when he was healthy) you knew exactly what you were getting and every year he consistently delivered on that. Now if you want to talk about studs (at least according to this model) check out Mark Giordano … STUD.

I know we probably can’t draw any real detailed information from this clustering exercise, let alone judge Kyle Dubas for his off-season moves, but I do hope it helps demonstrate what kind of things are possible with this analysis. You can imagine actually developing a classification algorithm with more factors and more meaningful advanced stats. This could theoretically show which players are similar to each other. You lose a player to injury? Who is an undervalued player I can trade for, who will give me the same factor exposure. You’re a player in contract negotiation? The model says I’m most similar to these other players and they get paid X (could also work against you).

Six clusters is fun for a blog post but the applications, if done right, could actually be impactful.

Speaking of salaries…

While I was going through this exercise I thought it would be cool to see if we could merge salary data to this cluster data and see if we could show that our interpretation of each cluster had more merit based on the average salary of the individuals in that cluster. For instance we’ve been calling Cluster 3 and 5 our Top Defensemen and Forwards but if this is the case then on average shouldn’t they get paid more? We’ve classified Cluster 4 as our Versatility Players and Cluster 2 as our Average Forwards… was this fair?

So here is what we are going to do. I was able to scrape salary data from CapFriendly, but only for the latest season. So our data set is going to only look at players from 2022 who are still active in the league.

Another point is that not all the names line up perfectly, so merging the tables gets pretty close but not 100%. With all that being said, I still think that it’s a fairly indicative analysis.
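To give a flavor of why names don't line up and how a merge like this can be made a bit more forgiving, here is a hedged Python sketch. The `name_key` normalization, the player names and the salary figures are all made up for illustration and are not the matching rules or data used in this post.

```python
import unicodedata
from collections import defaultdict


def name_key(name):
    """Lowercase, strip accents and drop punctuation/spaces to build a join key."""
    ascii_name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    return "".join(ch for ch in ascii_name.lower() if ch.isalnum())


def merge_salaries(cluster_rows, salary_rows):
    """Join (name, cluster) rows to (name, salary) rows on the normalized name."""
    salaries = {name_key(name): salary for name, salary in salary_rows}
    merged, unmatched = [], []
    for name, cluster in cluster_rows:
        key = name_key(name)
        if key in salaries:
            merged.append((name, cluster, salaries[key]))
        else:
            unmatched.append(name)  # keep track of what fell through
    return merged, unmatched


def average_salary_by_cluster(merged):
    by_cluster = defaultdict(list)
    for _, cluster, salary in merged:
        by_cluster[cluster].append(salary)
    return {c: sum(v) / len(v) for c, v in by_cluster.items()}


# Hypothetical names and salaries.
cluster_rows = [("José Example", 5), ("J.T. Sample", 5), ("Pat Nomatch", 2)]
salary_rows = [("Jose Example", 6_000_000), ("JT Sample", 4_000_000)]

merged, unmatched = merge_salaries(cluster_rows, salary_rows)
averages = average_salary_by_cluster(merged)
print(averages, unmatched)
```

Keeping the list of unmatched names around makes it easy to see how far from 100% the merge actually is.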

What’s really cool to see is how well average salaries lined up with our intuition of each cluster. Our highest paid cluster is our Top Forwards (Cluster 5), followed by our Top Defensemen (Cluster 3), next we have our Average Defensemen (Cluster 6), followed by our Goalies (Cluster 1), then our Versatility Players (Cluster 4) and finally our Average Forwards (Cluster 2).

If you want to play around with some of the data here is the data set:

According to our model, here are some current undervalued players. We filter for Top Forwards and sort salary from lowest to highest:

According to our model, here are some current overvalued players. We filter for Average Forwards and sort salary from highest to lowest:

Roster Optimization (Part 2)

Alright, before we wrap up this post I wanted to try the Roster Optimization one more time. Why? Well we now have more than just two categories (Forwards vs Defensemen) and also salary information.

The set up is going to be similar. We will use the same net score function but now instead of having three player categories (Forward, Defense, Goalie) to choose from we will have our six clusters.

Now n=6 choose r = 21 with replacement means we technically have 65,780 possible combinations … which is a lot more than our previous 253. So to simplify things a bit we are going to say our average team will start with 1 goalie and remove that from our equation. Now our algorithm only needs to select a 20 man roster from our 5 player clusters. This simplifies our possible combinations down to 10,626.
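These counts all come from the standard "combinations with replacement" formula, where choosing r slots from n types gives C(n + r - 1, r) possibilities. A quick Python check (the helper name `multisets` is just for illustration):

```python
from math import comb


def multisets(n_types, n_slots):
    """Ways to fill n_slots from n_types when order doesn't matter and repeats are allowed."""
    return comb(n_types + n_slots - 1, n_slots)


print(multisets(3, 21))  # three positions, 21 roster spots
print(multisets(6, 21))  # six clusters, 21 roster spots
print(multisets(5, 20))  # five skater clusters, 20 spots after fixing the goalie
```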

BUT I also want to include salaries as a constraint. Basically the NHL cap this year is $83.5 Million, since we are going to start our average team with a goalie and the average goalie costs about $3.5 Million, that means our hypothetical team needs to select 20 players using about $80 Million.

Now, we know that this model without a cap constraint would probably just select 20 of the top forwards (I mean if I could, I probably would too). But that solution would put us far over budget so we can throw that combination out. In fact we can throw out any combination where the total sum of the team salary is greater than the salary cap. When we do this we are only left with 3,692 potential cap-compliant combinations (a lot easier than 65k).
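Here is a rough Python sketch of what a cap-constrained search like this looks like. The goals and blocked shots per cluster are rounded from the cluster centers printed earlier, but the salary figures are placeholders I made up (only their ordering follows the post), so the feasible count and the winning roster from this sketch will not match the results reported here.

```python
# (goals/game, blocked shots/game, average salary in $M) per skater cluster.
# Salaries are PLACEHOLDERS for illustration, not the scraped averages.
CLUSTERS = {
    "Average Forwards":   (0.150, 0.370, 2.0),
    "Top Defensemen":     (0.075, 2.038, 5.5),
    "Versatility":        (0.135, 0.825, 3.0),
    "Top Forwards":       (0.335, 0.435, 7.0),
    "Average Defensemen": (0.077, 1.407, 4.0),
}
SKATERS = 20
CAP_AFTER_GOALIE = 80.0    # $83.5M cap less roughly $3.5M for the goalie
SHOTS_TOWARDS_NET = 45.11
GOALIE_SVP = 0.902         # assumption: roughly the goalie cluster mean


def compositions(total, parts):
    """Yield every way to split `total` roster spots across `parts` clusters."""
    if parts == 1:
        yield (total,)
        return
    for first in range(total + 1):
        for rest in compositions(total - first, parts - 1):
            yield (first,) + rest


def evaluate(counts):
    """Return (net score, total salary) for a roster given as counts per cluster."""
    stats = list(CLUSTERS.values())
    goals = sum(n * s[0] for n, s in zip(counts, stats))
    blocks = sum(n * s[1] for n, s in zip(counts, stats))
    cost = sum(n * s[2] for n, s in zip(counts, stats))
    return goals - (SHOTS_TOWARDS_NET - blocks) * (1.0 - GOALIE_SVP), cost


all_rosters = list(compositions(SKATERS, len(CLUSTERS)))
feasible = [r for r in all_rosters if evaluate(r)[1] <= CAP_AFTER_GOALIE]
best = max(feasible, key=lambda r: evaluate(r)[0])

print(len(all_rosters), len(feasible))
print(dict(zip(CLUSTERS, best)), evaluate(best))
```

Filtering on salary before scoring is the whole trick: the cap removes the all-stars roster from the candidate list, so the search has to trade scoring for cost.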

Let’s run through these 3,692 possible solutions and see what we get. The results are displayed below:

You can see that our optimal team is composed of 7 Top Forwards, 9 Average Forwards, 4 Versatility Players and a Goalie. Still no one from our core defensive categories made it. The reality though is that the trade-off for ‘goals for’ vs ‘goals against’ is still too low in this model to include Defensemen. But you will notice that the optimal solution this time was not simply the solution that produced the most ‘goals for’. That solution would have been 7 Top Forwards and 13 Average Forwards. This means that there was some benefit to adding Versatility Players who may not score as much as Average Forwards but whose defensive capabilities compensate for this.

I think an obvious issue with the model as it stands is that it assumes that if you have 5 Forwards playing at a time they would all still produce the average goals of a Forward… but this assumption might not be fair. There is only so much space on the ice and Forwards are just generally closer to the opposing team’s net. We’ve already established that Forwards and Defensemen get approximately the same number of shots per game but these shots are coming from different locations on the ice. So it might make sense in this model to put in a lambda factor that reduces the Forwards’ goal contribution after a certain point. This would in a way force more Defensemen on the team as the goals from incrementally adding more Forwards decrease. Maybe something for another time. But we can actually empirically see the difference between where Forwards and where Defensemen shoot from, by using heat maps from icydata.hockey. Let’s compare Auston Matthews with Morgan Rielly from 2016 until 2022.

Auston’s Shots

Rielly’s Shots

The red zones indicate more shots from that location. You can see in general how much further back a Defenseman needs to shoot vs a Forward… so for now I won’t feel too bad about my model excluding Defensemen.
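As a teaser for that "another time", the lambda idea mentioned above could be sketched like this in Python. The decay rule is entirely hypothetical: each Forward beyond a threshold of 12 contributes `lam` times the goals of the one before, and the averages are the same rounded figures used earlier, with an assumed save percentage of 0.902.

```python
FWD_GOALS, DEF_GOALS = 0.21, 0.07
FWD_BLOCKS, DEF_BLOCKS = 0.46, 1.46
GOALIE_SVP = 0.902         # assumption: roughly the goalie cluster mean
SHOTS_TOWARDS_NET = 45.11
SKATERS = 20               # one goalie assumed, as in Part 2


def forward_goals(n_fwd, lam, threshold):
    """Total Forward goals when each Forward past `threshold` is discounted by lam."""
    return sum(FWD_GOALS * lam ** max(0, i - threshold) for i in range(1, n_fwd + 1))


def net_score(n_fwd, lam, threshold):
    n_def = SKATERS - n_fwd
    goals = forward_goals(n_fwd, lam, threshold) + n_def * DEF_GOALS
    blocks = n_fwd * FWD_BLOCKS + n_def * DEF_BLOCKS
    return goals - (SHOTS_TOWARDS_NET - blocks) * (1.0 - GOALIE_SVP)


def best_split(lam, threshold=12):
    """Return the (Forwards, Defensemen) split with the highest net score."""
    n_fwd = max(range(SKATERS + 1), key=lambda f: net_score(f, lam, threshold))
    return n_fwd, SKATERS - n_fwd


print(best_split(lam=1.0))  # no decay: same linear model as before
print(best_split(lam=0.9))  # hypothetical 10% decay per extra Forward
```

With `lam = 1.0` the model is the original linear one and still picks all Forwards; any `lam` below 1 makes the goals curve bend, which is what lets an interior mix of Forwards and Defensemen win.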

Conclusion

We did a lot in this post and explored a lot of different topics. We saw how we could create a simple model for team construction using only three factors and then expanded on that idea in a more complex team model. Both scenarios however, showed the trade off teams need to make between ‘goals for’ and ‘goals against’ albeit in a simplistic manner. We were also able to build a classification algorithm to group like players, although our model was fairly simple, we saw how a more complex version might actually be applicable.

Although we weren’t really able to determine if Mr. Dubas made a mistake drafting Erik Karlsson… From one (fantasy) manager to another… I hope it works out for him.

Dear Dubas

Jonathon Barbaro, CFA