Ok, with Corsica being down for now, there has been a lack of publicly available xG (Expected Goals) data. Obviously that will change when it comes back up, but it got me thinking about building my own model, which in turn reminded me of how, when I first heard about xG, I wanted to know how it worked. While Emmanuel Perry and DTMAboutHeart do an excellent job of explaining their models here and here, to the uninitiated the explanations can be quite daunting.
At that point in my hockey analytics education I understood Corsi, GF%, and why +/- is bad. But this was completely new and cool and I wanted to understand it more. I understood what xG was and what it said, but I wanted to understand how to actually come up with it. I wanted to peek behind the curtain and see the wizard at work, and unfortunately there was little in the way of that at the time. This article serves to fill that need.
First, a couple of things to get out of the way. I assume you have some familiarity with hockey stats like Corsi and Fenwick; I won't be explaining what these mean in the article, but this link has the basic rundown. This will be written in R, so experience with that language is a bonus. However, the article is written so that you can get the model up and running just by following the steps, as long as you have some familiarity with RStudio. RStudio can be downloaded here, and if you'd like to learn more R, here is a good link covering the basics. There is also the Swirl package that can be run in RStudio; this link will show you the directions to get that up and running.
Normally I work in Python, but I'm trying to improve my R knowledge, which is another reason this is written in R. A Python version will follow as soon as the R version is up and running, so if you don't want to learn R, or don't have time, I'll have a version for you soon.
Also, this model won't be perfect by any means. If you're expecting a polished model on the level of what is already publicly available, this won't be it. In fact, I'm not even sure how good it will be, because as I type this I haven't run the numbers yet; I'm learning things just as you are, since I'm running the code as I write. The purpose of this article is to familiarize you with the process of building the model and to look at some of the decisions behind which variables we include. There will be a part two where we fine tune the model and go deeper into the data.
The data for this project comes from the 2015, 2016, and 2017 NHL seasons. I'd like to thank EvolvingWild for helping me set up the scraper for the NHL data. I keep telling myself I'm going to write one and never get around to it, so his help was crucial to getting this underway. The Play by Play (PbP) data can be found here in CSV form, along with the file for the code and the model itself. The files are rather large, around .4 GB, as they contain every single event from each season, so they may take a while to download. You will also find the R Markdown file this article is written in and the code for the model, if you want to play around with them.
For this xG model we will be building what is called a logistic regression to predict our xG values. There are a few reasons why I am choosing this over other modeling options:
Logistic regressions work well when the output, or dependent variable, is categorical. In everyday speak this means there is a set number of outcomes the action can result in. In mathematical terms the output is referred to as a discrete outcome, as opposed to a continuous outcome, where the output can be any value within a range. If you'd like to read more about this, here is a handy link. Since each shot attempt either results in a goal or doesn't, it fits the output of this model very well.
Another reason is that logistic regressions are easier to implement than many other models; the code behind them is not all that complex. As I'm no R expert this is a non-trivial consideration, and I expect it is for many of my readers as well.
And last, they are computationally efficient to train and predict with, meaning you won't be sitting around for days waiting on the model to finish so you can work with the data it produces.
However, there are flaws to a logistic model as well:
You need lots of data to fit a logistic regression, especially when the outcomes are rare. This isn't as much of a problem for this model, since there are large amounts of data for each season, but keep it in mind if you want to use this approach elsewhere.
It doesn't handle large numbers of features very well, meaning the more variables you feed the regression, the worse it may become. In this simple model we won't be doing that, but it is still something to think about when we fine tune the model in the future.
Logistic regressions can be prone to high bias in the bias-variance tradeoff. We'll get more into that in part two when we fine tune the model, but for now just remember that a high-bias model can underfit, meaning it may miss the signal in the data.
Logistic regressions produce a probability of something happening given the independent variables. So when the model says a shot attempt is worth .60 xG, it means that, for an average player, six out of ten shot attempts with those exact same variables will produce a goal. This link talks about it from a soccer perspective, but a lot of the same themes apply to hockey as well.
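To make that concrete, here is a toy illustration (made-up numbers, not output from any real model) of how a fitted logistic model turns its linear predictor into a shot probability:

# Toy example with made-up linear-predictor values for three shots:
# the logistic (inverse logit) function squashes any real number into (0, 1)
plogis(0.4)     # roughly 0.599, a .60 xG chance like the one described above
plogis(-2.9)    # roughly 0.052, a low-danger attempt
plogis(-1.5)    # roughly 0.182; plogis(x) is just 1 / (1 + exp(-x))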
We then sum these probabilities to get a team's xG total per game, which is why game xG totals are usually decimals rather than whole numbers. Ok, enough backstory and info; it's time to move on to working with the data.
library(readr)    # read_csv()
library(dplyr)    # bind_rows(), filter(), group_by(), mutate(), and %>%
library(ggplot2)  # plotting later on
library(pROC)     # ROC curve and AUC near the end

setwd('~/DataAnalysis/Programming Stuff/R/NHLScraper/FullNHLSeasonPbP/')
PbP2015Data <- read_csv('2015NHLPbP.csv')
PbP2016Data <- read_csv('2016NHLPbP.csv')
Test_PbP_Data <- read_csv('2017NHLPbP.csv')
Point the `setwd()` function at wherever you downloaded the NHL PbP data. From there we use the `read_csv()` function from the `readr` package loaded at the top. We have three seasons worth of data, but right now I'm only reading in two. The reason is the validation step, where we test our model against data that is out of sample, i.e. data the model didn't use in its training set.
Usually the test set is around 30% of your dataset, but I broke the data apart by season instead of by percentage, for a couple of reasons. The main one is that I want the seasons separated by year, so no data from 2017 ends up in my training data; that lets me do a comparison at the end and see whether my model has predictive power or is merely descriptive. If you're unsure of the difference, here is a good link that describes it well. Secondly, a by-season split matches the way people around the NHL tend to segment their analysis anyway.
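For reference, the more common random 70/30 split would look something like the sketch below, where `All_PbP_Data` is a hypothetical dataframe holding every season combined; we won't use this approach here.

# Hypothetical random 70/30 train/test split (not used in this article)
set.seed(42)  # make the sampling reproducible
train_rows <- sample(nrow(All_PbP_Data),
                     size = floor(0.7 * nrow(All_PbP_Data)))
train_set <- All_PbP_Data[train_rows, ]
test_set  <- All_PbP_Data[-train_rows, ]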
Now that all the data from the 2015 and 2016 NHL seasons is imported, the first thing we'll do is combine it into one giant data frame with the `bind_rows()` function and then take a look at the structure of the resulting dataframe.
Train_PbP_Data <- bind_rows(PbP2015Data, PbP2016Data)
dim(Train_PbP_Data)
## [1] 2353074 42
str(Train_PbP_Data, max.level = 1)
## Classes 'tbl_df', 'tbl' and 'data.frame': 2353074 obs. of 42 variables:
## $ season : int 20152016 20152016 20152016 20152016 20152016 20152016 20152016 20152016 20152016 20152016 ...
## $ game_id : int 2015020001 2015020001 2015020001 2015020001 2015020001 2015020001 2015020001 2015020001 2015020001 2015020001 ...
## $ game_date : Date, format: "2015-10-07" "2015-10-07" ...
## $ session : chr "R" "R" "R" "R" ...
## $ event_index : int 1 2 3 4 5 6 7 8 9 10 ...
## $ game_period : int 1 1 1 1 1 1 1 1 1 1 ...
## $ game_seconds : int 0 0 0 0 14 14 36 36 37 37 ...
## $ event_type : chr "PSTR" "ON" "ON" "FAC" ...
## $ event_description : chr "Period Start- Local time: 7:20 EDT" NA NA "MTL won Neu. Zone - MTL #14 PLEKANEC vs TOR #42 BOZAK" ...
## $ event_detail : chr NA NA NA NA ...
## $ event_team : chr NA "MTL" "TOR" "MTL" ...
## $ event_player_1 : chr NA NA NA "TOMAS.PLEKANEC" ...
## $ event_player_2 : chr NA NA NA "TYLER.BOZAK" ...
## $ event_player_3 : chr NA NA NA NA ...
## $ event_length : int 0 0 0 14 0 22 0 1 0 3 ...
## $ coords_x : int 0 NA NA -1 48 19 NA NA NA NA ...
## $ coords_y : int 0 NA NA 0 25 22 NA NA NA NA ...
## $ players_substituted: chr NA "MTL11, MTL14, MTL31, MTL67, MTL76, MTL79" "TOR2, TOR3, TOR19, TOR23, TOR42, TOR45" NA ...
## $ home_on_1 : chr NA NA "MATT.HUNWICK" "MATT.HUNWICK" ...
## $ home_on_2 : chr NA NA "DION.PHANEUF" "DION.PHANEUF" ...
## $ home_on_3 : chr NA NA "JOFFREY.LUPUL" "JOFFREY.LUPUL" ...
## $ home_on_4 : chr NA NA "SHAWN.MATTHIAS" "SHAWN.MATTHIAS" ...
## $ home_on_5 : chr NA NA "TYLER.BOZAK" "TYLER.BOZAK" ...
## $ home_on_6 : chr NA NA "JONATHAN.BERNIER" "JONATHAN.BERNIER" ...
## $ away_on_1 : chr NA "BRENDAN.GALLAGHER" "BRENDAN.GALLAGHER" "BRENDAN.GALLAGHER" ...
## $ away_on_2 : chr NA "TOMAS.PLEKANEC" "TOMAS.PLEKANEC" "TOMAS.PLEKANEC" ...
## $ away_on_3 : chr NA "CAREY.PRICE" "CAREY.PRICE" "CAREY.PRICE" ...
## $ away_on_4 : chr NA "MAX.PACIORETTY" "MAX.PACIORETTY" "MAX.PACIORETTY" ...
## $ away_on_5 : chr NA "P.K..SUBBAN" "P.K..SUBBAN" "P.K..SUBBAN" ...
## $ away_on_6 : chr NA "ANDREI.MARKOV" "ANDREI.MARKOV" "ANDREI.MARKOV" ...
## $ home_goalie : chr NA NA "JONATHAN.BERNIER" "JONATHAN.BERNIER" ...
## $ away_goalie : chr NA "CAREY.PRICE" "CAREY.PRICE" "CAREY.PRICE" ...
## $ home_team : chr "TOR" "TOR" "TOR" "TOR" ...
## $ away_team : chr "MTL" "MTL" "MTL" "MTL" ...
## $ home_skaters : int 0 0 5 5 5 5 5 5 3 5 ...
## $ away_skaters : int 0 5 5 5 5 5 2 5 5 5 ...
## $ home_score : int 0 0 0 0 0 0 0 0 0 0 ...
## $ away_score : int 0 0 0 0 0 0 0 0 0 0 ...
## $ game_score_state : chr "0v0" "0v0" "0v0" "0v0" ...
## $ game_strength_state: chr "EvE" "Ev5" "5v5" "5v5" ...
## $ highlight_code : chr NA NA NA NA ...
## $ X1 : int NA NA NA NA NA NA NA NA NA NA ...
`dim()` gives us the dimensions of the dataframe of NHL PbP data: 2,353,074 observations of 42 different variables. Will we need all 42 of those variables? We would if we were trying to create a full stat database, but we'll mainly be working with the rows, or observations, that deal with Fenwick events. Why Fenwick and not Corsi? We are limited to unblocked events because the NHL records a blocked shot at the location of the block, not the location of the shot, and as you'll see later, distance plays a key role in our model; feeding it false locations could produce false outputs, hence Fenwick events only. Those events are labeled in the `event_type` variable of the dataframe, so let's look at what's there and pick out what we need.
unique(Train_PbP_Data$event_type)
## [1] "PSTR" "ON" "FAC" "STOP" "OFF" "HIT" "SHOT" "GIVE"
## [9] "MISS" "GOAL" "BLOCK" "PENL" "TAKE" "PEND" "GEND" "GOFF"
## [17] "SOC" "EIEND" "EISTR" "CHL" "EGT" "EGPID"
So the only events the model needs are `SHOT`, `MISS`, and `GOAL`. We're going to create what's called a vector in R so we don't have to type those over and over again. We'll also create a few functions that build some variables we need in the datasets, so we don't have to cut and paste. These functions can be saved and used on other NHL dataframes as well, so feel free to improve or reuse them as needed.
The first thing I'm going to do is filter out all the shootout attempts. The data lists the shootout as period five, so we'll just keep all rows with a value less than 5 in the `game_period` column.
fenwick_events <- c('SHOT', 'MISS', 'GOAL')
Train_PbP_Data <- Train_PbP_Data[Train_PbP_Data$game_period < 5,]
dim(Train_PbP_Data)
## [1] 2349958 42
The next couple of functions are pretty simple as well: I want to create new columns that tell me whether each event was a goal and whether it was an event by the home team. While this information is available in other columns, these flags make future checks, and thus the code, simpler. Creating a column whose sole job is to say whether the event was a goal also makes it easy to pass to the logistic regression later without any extra filtering, since the only two values will be 1 or 0. This sets up a response the logistic regression can model; the technical term is creating a dummy variable. “The dummy variables act like ‘switches’ that turn various parameters on and off in an equation.”
# Flag whether the event was made by the home team
is_home <- function(dataframe){
  dataframe$is_home <- ifelse(dataframe$event_team == dataframe$home_team, 1, 0)
  return(dataframe)
}

# Flag whether the event was a goal (our 1/0 dummy response variable)
is_goal <- function(dataframe){
  dataframe$is_goal <- ifelse(dataframe$event_type == "GOAL", 1, 0)
  return(dataframe)
}
Train_PbP_Data <- is_home(Train_PbP_Data)
Train_PbP_Data <- is_goal(Train_PbP_Data)
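A quick, optional sanity check (output not shown here) is to tabulate the new columns and confirm they contain only 0s, 1s, and the NAs we'll zero out shortly:

# Both dummy columns should hold only 0, 1, and possibly some NAs
table(Train_PbP_Data$is_goal, useNA = 'ifany')
table(Train_PbP_Data$is_home, useNA = 'ifany')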
Ok, now we have those two columns set up relatively easily. Reading the research on expected goals points to these variables as the most important for modeling xG: distance from the goal, angle to the goal, shot type, whether the shot was a rebound, whether it was a shot on the rush, and the shooting team's strength, i.e. PP, EV, or SH. This model will include all of these variables, but we need to set them up first.
The first thing I'm going to do is create a time difference between each event in the dataset, which we will use later to determine whether shots are on the rush or rebounds. We'll also set any NAs that result from the calculations to zero, in case there are NAs in the PbP variables we're calculating from.
Train_PbP_Data <- Train_PbP_Data %>% group_by(game_id) %>%
arrange(event_index, .by_group = TRUE) %>%
mutate(time_diff = game_seconds - lag(game_seconds))
Train_PbP_Data$time_diff[is.na(Train_PbP_Data$time_diff)] <- 0
Train_PbP_Data$is_home[is.na(Train_PbP_Data$is_home)] <- 0
Train_PbP_Data$is_rebound <- ifelse(Train_PbP_Data$time_diff < 3 &
Train_PbP_Data$event_type %in% fenwick_events &
Train_PbP_Data$event_team ==
lag(Train_PbP_Data$event_team),
1, 0)
Train_PbP_Data$is_rebound[is.na(Train_PbP_Data$is_rebound)] <- 0
Train_PbP_Data$is_rush <- ifelse(Train_PbP_Data$time_diff < 4 &
lag(abs(Train_PbP_Data$coords_x)) < 25 &
Train_PbP_Data$event_type %in% fenwick_events,
1, 0)
To make this easier, we'll filter the data down to just the Fenwick events before applying the rest of our data manipulations and the regression.
Train_PbP_Data$is_rebound[is.na(Train_PbP_Data$time_diff)] <- 0
Train_PbP_Data$is_rush[is.na(Train_PbP_Data$is_rush)] <- 0
Train_Fenwick_Data <- filter(Train_PbP_Data, event_type %in% fenwick_events)
The resulting dataframe is much smaller and should be easier to work with. Now, on to creating the variables for the logistic regression. Let's check the data we'll be working with to make sure there are no NA or NaN values we need to deal with before doing our calculations.
unique(Train_Fenwick_Data$event_detail)
## [1] "Wrist" "Backhand" "Snap" "Tip-In" "Slap"
## [6] "Wrap-around" "Deflected" NA
unique(Train_Fenwick_Data$coords_x)
## [1] 61 -54 -40 -72 37 -66 -57 -60 68 82 63 51 -58 -65 76 -47 -42
## [18] 75 53 40 77 -55 -83 -73 35 -71 39 66 -17 59 55 -39 49 -79
## [35] -77 -53 46 -64 -85 -89 -75 -14 -33 45 81 -80 84 62 36 60 57
## [52] 67 -52 -63 52 72 -48 6 -38 56 70 -81 -34 31 -43 42 -35 44
## [69] 34 -62 33 -56 65 38 64 41 80 -32 -70 -37 32 -84 85 73 18
## [86] -45 69 78 48 54 24 50 -61 43 -67 79 -59 86 -10 87 71 -82
## [103] -49 58 -50 83 -68 -78 47 -36 -69 -28 -87 -41 74 -76 -51 30 -74
## [120] 19 3 -31 88 -30 -46 -88 -22 4 11 -29 5 -6 91 -4 -16 -15
## [137] -12 -13 -23 27 -86 89 -7 -26 -3 -44 17 29 12 -91 28 -11 16
## [154] -8 -9 25 8 15 -1 -19 22 90 92 -20 9 -25 26 7 -21 -2
## [171] -24 10 1 -93 -97 21 -5 93 -90 -95 -94 13 2 0 -18 97 -92
## [188] 23 14 20 95 94 -27 NA 96 -96 -99 98
unique(Train_Fenwick_Data$coords_y)
## [1] 29 32 -8 30 1 -34 -13 -5 -15 5 -16 10 -18 22 31 37 25
## [18] -2 3 -20 33 21 8 12 27 0 2 4 17 -21 9 -25 -24 6
## [35] -31 34 -17 7 -32 16 -9 36 -29 -11 -27 -23 -36 -28 15 18 23
## [52] -33 -26 40 38 14 11 -35 -22 -30 -7 24 -4 -38 13 28 -1 -19
## [69] -14 19 -6 -3 26 20 -12 39 -39 -40 -10 35 -37 -41 NA 41 42
## [86] -42
unique(Train_Fenwick_Data$game_strength_state)
## [1] "5v5" "5vE" "4v5" "5v4" "Ev5" "4v4" "5v3" "3v4" "4vE" "4v3" "3v5"
## [12] "EvE" "Ev4" "5v2" "2v5" "3vE" "3v3" "Ev3" "Ev0"
It looks like there's some NA data in there, meaning some rows have no value for `event_detail`, so let's take a look at what's there.
head(Train_Fenwick_Data[Train_Fenwick_Data$event_detail == 'NA',], n = 15)
## # A tibble: 10 x 47
## # Groups: game_id [1]
## season game_id game_date session event_index game_period game_seconds
## <int> <int> <date> <chr> <int> <int> <int>
## 1 NA NA NA <NA> NA NA NA
## 2 NA NA NA <NA> NA NA NA
## 3 NA NA NA <NA> NA NA NA
## 4 NA NA NA <NA> NA NA NA
## 5 NA NA NA <NA> NA NA NA
## 6 NA NA NA <NA> NA NA NA
## 7 NA NA NA <NA> NA NA NA
## 8 NA NA NA <NA> NA NA NA
## 9 NA NA NA <NA> NA NA NA
## 10 NA NA NA <NA> NA NA NA
## # ... with 40 more variables: event_type <chr>, event_description <chr>,
## # event_detail <chr>, event_team <chr>, event_player_1 <chr>,
## # event_player_2 <chr>, event_player_3 <chr>, event_length <int>,
## # coords_x <int>, coords_y <int>, players_substituted <chr>,
## # home_on_1 <chr>, home_on_2 <chr>, home_on_3 <chr>, home_on_4 <chr>,
## # home_on_5 <chr>, home_on_6 <chr>, away_on_1 <chr>, away_on_2 <chr>,
## # away_on_3 <chr>, away_on_4 <chr>, away_on_5 <chr>, away_on_6 <chr>,
## # home_goalie <chr>, away_goalie <chr>, home_team <chr>,
## # away_team <chr>, home_skaters <int>, away_skaters <int>,
## # home_score <int>, away_score <int>, game_score_state <chr>,
## # game_strength_state <chr>, highlight_code <chr>, X1 <int>,
## # is_home <dbl>, is_goal <dbl>, time_diff <dbl>, is_rebound <dbl>,
## # is_rush <dbl>
So there are 10 rows that come back as nothing but NA values (a quirk of how R handles comparisons against NA), so the next move is to filter them out using `is.na()`.
Train_Fenwick_Data <- filter(Train_Fenwick_Data, !is.na(event_detail))
head(Train_Fenwick_Data[Train_Fenwick_Data$event_detail == 'NA',])
## # A tibble: 0 x 47
## # Groups: game_id [0]
## # ... with 47 variables: season <int>, game_id <int>, game_date <date>,
## # session <chr>, event_index <int>, game_period <int>,
## # game_seconds <int>, event_type <chr>, event_description <chr>,
## # event_detail <chr>, event_team <chr>, event_player_1 <chr>,
## # event_player_2 <chr>, event_player_3 <chr>, event_length <int>,
## # coords_x <int>, coords_y <int>, players_substituted <chr>,
## # home_on_1 <chr>, home_on_2 <chr>, home_on_3 <chr>, home_on_4 <chr>,
## # home_on_5 <chr>, home_on_6 <chr>, away_on_1 <chr>, away_on_2 <chr>,
## # away_on_3 <chr>, away_on_4 <chr>, away_on_5 <chr>, away_on_6 <chr>,
## # home_goalie <chr>, away_goalie <chr>, home_team <chr>,
## # away_team <chr>, home_skaters <int>, away_skaters <int>,
## # home_score <int>, away_score <int>, game_score_state <chr>,
## # game_strength_state <chr>, highlight_code <chr>, X1 <int>,
## # is_home <dbl>, is_goal <dbl>, time_diff <dbl>, is_rebound <dbl>,
## # is_rush <dbl>
Next we'll turn the categorical variables into factors so the model can use them in the training function later. It helps with graphing as well.
Train_Fenwick_Data$event_detail <- as.factor(Train_Fenwick_Data$event_detail)
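A quick check that the conversion worked: the factor levels should be the seven shot types we saw earlier with `unique()`, sorted alphabetically.

# The levels should come back as: Backhand, Deflected, Slap, Snap,
# Tip-In, Wrap-around, Wrist (as.factor sorts them alphabetically)
levels(Train_Fenwick_Data$event_detail)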
The next variable to calculate is the angle of the shot relative to the center of the goal. Let’s look at the range of our coordinate system.
range(Train_Fenwick_Data$coords_x)
## [1] NA NA
range(Train_Fenwick_Data$coords_y)
## [1] NA NA
Well, that didn't work, so we'll need to filter out the `NA` values:
Train_Fenwick_Data <- filter(Train_Fenwick_Data, !is.na(coords_x) & !is.na(coords_y))
range(Train_Fenwick_Data$coords_x)
## [1] -99  98
range(Train_Fenwick_Data$coords_y)
## [1] -42  42
So here we have the range of the NHL rink in x and y coordinates. An NHL rink is 200 feet long by 85 feet wide, so each y unit equals one foot and each x unit equals roughly .995 feet. The goal line is 11 feet from the end boards, which works out to 11.05 units in our coordinate system. With this knowledge we now know the coordinates of the center of the goal: (87.95, 0).
Now, given a shot location, say a test case of a shot taken from (50, 12), let's determine the angle using what we know about the coordinate system. The angle theta we are trying to find satisfies sin(theta) = opposite/hypotenuse. The opposite side is equal to the y coordinate. The hypotenuse's length comes from the trusty Pythagorean theorem: sqrt((87.95 - 50)^2 + 12^2). Putting that all together we have this:
asin(12/sqrt(1440.20+144))
## [1] 0.3062574
That answer is in radians; to convert it to degrees we multiply by 180 and divide by pi.
radian <- asin(12/sqrt(1440.20+144))
(radian * 180)/3.14
## [1] 17.55616
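As an aside, R has a built-in `pi` constant that is slightly more precise; the code in this article sticks with 3.14, which is why the outputs shown use it.

# Same conversion with R's built-in constant
(radian * 180) / pi   # roughly 17.547 degrees, about a hundredth off the 3.14 version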
So the shot is taken at roughly a 17.5 degree angle from the center of the goal. Let's put all this together and create a new column for each shot's angle from the goal. To simplify things I'm also going to mirror the x values by taking the `abs()` value of each, as I don't believe which end of the ice the shot is taken in has much of an effect. I'll also flip the sign of the y value for shots from the negative-x end, so the data still reflects the correct wing the shot was taken from.
Train_Fenwick_Data$coords_y <- ifelse(Train_Fenwick_Data$coords_x < 0,
-1 * Train_Fenwick_Data$coords_y, Train_Fenwick_Data$coords_y)
Train_Fenwick_Data$coords_x <- abs(Train_Fenwick_Data$coords_x)
Train_Fenwick_Data$shot_angle <- (asin(abs(Train_Fenwick_Data$coords_y)/sqrt((87.95
- abs(Train_Fenwick_Data$coords_x))^2
+ Train_Fenwick_Data$coords_y^2))*180)/ 3.14
With this metric, a shot with an angle of 0 is taken straight on from the middle of the ice, i.e. from the slot. There's another problem, though: shots from behind the goal line should produce an angle larger than 90 degrees, but our formula can't, because the negative difference between the goal line and the shot location washes out when we square it. To fix that, whenever the x coordinate is beyond the goal line's x coordinate we reflect the angle past 90 degrees (the adjustment below works out to 180 minus the computed angle). Now that the coordinates and angle are straightened out, the last piece is the distance from the goal, which is just the Pythagorean theorem again.
Train_Fenwick_Data$shot_angle <- ifelse(abs(Train_Fenwick_Data$coords_x) > 88, 90 +
(180-(90 + Train_Fenwick_Data$shot_angle)),
Train_Fenwick_Data$shot_angle)
Train_Fenwick_Data$distance <- sqrt((87.95 - abs(Train_Fenwick_Data$coords_x))^2 + Train_Fenwick_Data$coords_y^2)
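As a quick sanity check of the angle adjustment, here is a hypothetical shot from (95, 5), i.e. from behind the goal line:

# Hypothetical shot from behind the goal line (x = 95 > 88)
raw_angle <- (asin(5 / sqrt((87.95 - 95)^2 + 5^2)) * 180) / 3.14
raw_angle                        # roughly 35.4 degrees
90 + (180 - (90 + raw_angle))    # adjusted to roughly 144.6 degrees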
The strength state is important, but it's a little muddy and there are a lot of factors involved. While the data tells us the strength state of the game, it will be easier on the model if we instead tell it the strength state of the team taking the shot: even on a power play, a shot taken by the penalty-killing team is logically less likely to be a dangerous chance than a shot by the team on the PP. That is what the `shooter_strength` column used in the model captures; a sketch of one way to build it follows.
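The code that derives `shooter_strength` isn't shown step by step in this article (it's in the files linked earlier), so here is a minimal sketch of one plausible way to build it from columns we already have. The level names EN, EV, PP, PS, and SH are taken from the model summary further down; treat the exact rules below as assumptions.

# Sketch of a shooter_strength builder; the author's exact rules may differ
get_shooter_strength <- function(df){
  shooter_skaters  <- ifelse(df$is_home == 1, df$home_skaters, df$away_skaters)
  opponent_skaters <- ifelse(df$is_home == 1, df$away_skaters, df$home_skaters)
  opponent_goalie  <- ifelse(df$is_home == 1, df$away_goalie, df$home_goalie)

  df$shooter_strength <- ifelse(is.na(opponent_goalie), 'EN',  # shot at an empty net
                         ifelse(shooter_skaters > opponent_skaters, 'PP',
                         ifelse(shooter_skaters < opponent_skaters, 'SH', 'EV')))
  # 'PS' (penalty shot) presumably comes from a separate flag in the PbP data;
  # since that rule isn't shown, it is omitted from this sketch
  df$shooter_strength <- as.factor(df$shooter_strength)
  return(df)
}

Train_Fenwick_Data <- get_shooter_strength(Train_Fenwick_Data)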
So we've got all our variables set and ready to go: shot location, shot type, strength state, rebound and rush flags, and whether the shot came from the home team. For this base model we are going to ignore the player actually taking the shot and the goalie in net. Obviously these are important variables: a shot by Sidney Crosby can reasonably be expected to go in at a higher rate, with all other variables controlled, than the same shot by, say, Jay McClement. Likewise, that same shot against Carey Price or Braden Holtby would reasonably have a lower chance of going in than against Cam Ward or Chad Johnson.
So yes, I realize leaving out those variables makes a difference, but for the purpose of this model and article we can safely ignore them. Right now I'm not worried about presenting the best possible model; I'm trying to show a model and instruct people on how to build one.
Now, with all our variables set, we are finally ready to build the actual model. The first step is training the model on our training data, using all the variables we created. We'll do this with R's `glm()` function, which stands for generalized linear model.
xGmodel <- glm(is_goal ~ poly(distance, 3, raw = TRUE) +
poly(shot_angle, 3, raw = TRUE) + event_detail +
shooter_strength +
is_rebound + is_rush,
data = Train_Fenwick_Data,
family = binomial(link = 'logit'))
save(xGmodel, file = "xGmodelver2.rda")
The first argument of the `glm()` function is the formula we want to use to predict goals; `data` is our training dataframe. The method is a generalized linear model, and specifying the binomial family tells it to perform a logistic regression. And that's really all there is to building the model itself. In most model building scenarios, the time spent preparing the data beforehand and testing the model afterwards makes up the majority of the work.
summary(xGmodel)
##
## Call:
## glm(formula = is_goal ~ poly(distance, 3, raw = TRUE) + poly(shot_angle,
## 3, raw = TRUE) + event_detail + shooter_strength + is_rebound +
## is_rush, family = binomial(link = "logit"), data = Train_Fenwick_Data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4014 -0.3865 -0.2622 -0.1961 3.1432
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.190e+00 8.420e-02 26.007 < 2e-16
## poly(distance, 3, raw = TRUE)1 -9.059e-02 4.834e-03 -18.739 < 2e-16
## poly(distance, 3, raw = TRUE)2 4.495e-04 1.396e-04 3.221 0.00128
## poly(distance, 3, raw = TRUE)3 1.600e-06 1.137e-06 1.407 0.15933
## poly(shot_angle, 3, raw = TRUE)1 4.947e-03 2.356e-03 2.100 0.03574
## poly(shot_angle, 3, raw = TRUE)2 -4.741e-04 5.847e-05 -8.108 5.14e-16
## poly(shot_angle, 3, raw = TRUE)3 3.324e-06 4.026e-07 8.256 < 2e-16
## event_detailDeflected 3.155e-02 6.192e-02 0.510 0.61040
## event_detailSlap 7.702e-01 4.579e-02 16.822 < 2e-16
## event_detailSnap 7.289e-01 4.034e-02 18.067 < 2e-16
## event_detailTip-In 6.536e-02 4.382e-02 1.491 0.13585
## event_detailWrap-around -6.484e-01 1.148e-01 -5.646 1.64e-08
## event_detailWrist 4.390e-01 3.417e-02 12.847 < 2e-16
## shooter_strengthEV -3.347e+00 6.268e-02 -53.399 < 2e-16
## shooter_strengthPP -2.909e+00 6.525e-02 -44.579 < 2e-16
## shooter_strengthPS -1.051e+01 4.395e+01 -0.239 0.81100
## shooter_strengthSH -3.236e+00 8.676e-02 -37.294 < 2e-16
## is_rebound 4.300e-01 2.345e-02 18.338 < 2e-16
## is_rush 1.439e+00 6.734e-02 21.369 < 2e-16
##
## (Intercept) ***
## poly(distance, 3, raw = TRUE)1 ***
## poly(distance, 3, raw = TRUE)2 **
## poly(distance, 3, raw = TRUE)3
## poly(shot_angle, 3, raw = TRUE)1 *
## poly(shot_angle, 3, raw = TRUE)2 ***
## poly(shot_angle, 3, raw = TRUE)3 ***
## event_detailDeflected
## event_detailSlap ***
## event_detailSnap ***
## event_detailTip-In
## event_detailWrap-around ***
## event_detailWrist ***
## shooter_strengthEV ***
## shooter_strengthPP ***
## shooter_strengthPS
## shooter_strengthSH ***
## is_rebound ***
## is_rush ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 97215 on 203688 degrees of freedom
## Residual deviance: 85528 on 203670 degrees of freedom
## (131 observations deleted due to missingness)
## AIC: 85566
##
## Number of Fisher Scoring iterations: 7
coef(xGmodel)
## (Intercept) poly(distance, 3, raw = TRUE)1
## 2.189886e+00 -9.059070e-02
## poly(distance, 3, raw = TRUE)2 poly(distance, 3, raw = TRUE)3
## 4.495095e-04 1.599734e-06
## poly(shot_angle, 3, raw = TRUE)1 poly(shot_angle, 3, raw = TRUE)2
## 4.946806e-03 -4.741020e-04
## poly(shot_angle, 3, raw = TRUE)3 event_detailDeflected
## 3.323821e-06 3.155053e-02
## event_detailSlap event_detailSnap
## 7.702066e-01 7.288615e-01
## event_detailTip-In event_detailWrap-around
## 6.535929e-02 -6.484371e-01
## event_detailWrist shooter_strengthEV
## 4.389754e-01 -3.347231e+00
## shooter_strengthPP shooter_strengthPS
## -2.908783e+00 -1.051076e+01
## shooter_strengthSH is_rebound
## -3.235612e+00 4.299816e-01
## is_rush
## 1.438987e+00
In the output you can see that all but a few of the model's predictor variables are statistically significant at the .001 level. The shot types that aren't significant fit with conventional wisdom, as those types account for a small share of NHL goals. The other is the penalty shot, and while one would expect it to matter, there probably just wasn't enough data on penalty shots to make a determination one way or the other.
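As an aside, exponentiating a logistic regression coefficient turns it into an odds ratio, which can make the estimates above more tangible:

# exp() of a coefficient is an odds ratio: all else held equal, a rebound
# multiplies the odds of a goal by roughly exp(0.43), i.e. about 1.54
exp(coef(xGmodel)["is_rebound"])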
One way to test the fit of our model is to compare the difference between the null and residual deviances against a chi-square distribution with the difference in their respective degrees of freedom.
1 - pchisq(97215 - 85528, 203688 - 203670)
## [1] 0
This test tells us whether the difference in deviance between the null model, which is a model with only the intercept, and the logistic model we created is due to chance or is statistically significant. Subtracting from one gives the area to the left of our statistic on the chi-square distribution curve. Since the result is zero, the model is statistically significantly better at describing the variance in our data than the null model. This doesn't mean the model is "good" by any means; fortunately there are more tests for that. You can read more about null and residual deviances here and here.
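As an aside, you can read the deviances and degrees of freedom straight off the fitted model object rather than copying them from the printed summary:

# Same test, pulling the values from the glm object so they always match
with(xGmodel, 1 - pchisq(null.deviance - deviance, df.null - df.residual))
## [1] 0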
Ok, now that the model is ready to go and its fit is statistically significant over the null model, it's time to see how it handles our test data. I'll perform all the same steps on the test data to create the variables as I did for the training data. I won't walk through them again, but if you want to see the code, it's in the Google Drive with the rest of the files.
So now that our test data is ready to go, we will predict an expected goal value for each Fenwick attempt in it:
Test_Fenwick_Data$xG <- predict(xGmodel, Test_Fenwick_Data, type = "response")
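Since each attempt now carries an xG value, team totals per game are just sums of those probabilities, exactly as described at the start. For instance, a quick aggregation along those lines (output not shown; `xg_by_game` is just an illustrative name):

# Per-game, per-team xG: sum the per-shot probabilities and the actual goals
xg_by_game <- Test_Fenwick_Data %>%
  group_by(game_id, event_team) %>%
  summarise(xG = sum(xG), Goals = sum(is_goal))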
Let's graph our predictions, averaging the xG value at each location, to get a feel for the data.
avg_xG_by_coord <- Test_Fenwick_Data %>% group_by(coords_x, coords_y) %>%
summarise(xg = mean(xG))
ggplot(avg_xG_by_coord, aes(coords_x, coords_y, fill = xg)) + geom_raster() +
scale_fill_gradient(low = 'blue', high = 'red')+
geom_vline(xintercept = 0, color = 'red') +
geom_vline(xintercept = 25, color = 'blue') +
geom_vline(xintercept = 88, color = 'red') +
xlab('X Coordinates') + ylab('Y Coordinates') +
labs(title = 'Average xG Value by Coordinate')
As you can see, there are some weird outliers in the neutral zone. I'd assume these are caused by the inclusion of empty net goals in the prediction model. Other than that, the plot fits the intuition that the closer to the goal and the closer to the center of the ice a shooter is, the greater the chance of a shot becoming a goal. The next way to test the model's predictions is to plot its ROC curve and calculate the area underneath it. This will be accomplished with the `pROC` package in R.
g <- roc(is_goal ~ xG, data = Test_Fenwick_Data)
plot(g)
auc(g)
## Area under the curve: 0.7556
You can read more about interpreting the ROC curve at this link, but the gist is that the model is a fair one by this metric. It's not perfect, but it's not useless either, and for a first attempt it shows I'm working in the right direction and have a good foundation to build improvements on.
Ok, let's get to the fun stuff and see how individual players performed in the 2017 season by our model. I'm also going to plot each player's xG total against their goal total as another test of the model's predictiveness.
# Group xG and goals by player, and compute the difference between them
xg_player <- Test_Fenwick_Data %>%
group_by(event_player_1, event_team) %>%
summarise( xG = sum(xG), Goals = sum(is_goal), Difference = sum(xG) - sum(is_goal))
head(xg_player)
## # A tibble: 6 x 5
## # Groups: event_player_1 [6]
## event_player_1 event_team xG Goals Difference
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 AARON.EKBLAD FLA 11.02887921 10 1.02887921
## 2 AARON.NESS WSH 0.02557414 0 0.02557414
## 3 ADAM.CLENDENING NYR 1.35385413 2 -0.64614587
## 4 ADAM.CRACKNELL DAL 8.22744066 10 -1.77255934
## 5 ADAM.ERNE T.B 3.54879312 3 0.54879312
## 6 ADAM.HENRIQUE N.J 18.51550598 20 -1.48449402
arrange(xg_player, desc(xG))
## # A tibble: 941 x 5
## # Groups: event_player_1 [880]
## event_player_1 event_team xG Goals Difference
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 JEFF.SKINNER CAR 34.58222 37 -2.417784
## 2 AUSTON.MATTHEWS TOR 34.00383 40 -5.996170
## 3 JOHN.TAVARES NYI 33.65972 28 5.659719
## 4 CONNOR.MCDAVID EDM 33.59372 30 3.593719
## 5 WAYNE.SIMMONDS PHI 33.41914 31 2.419142
## 6 COREY.PERRY ANA 30.50376 19 11.503756
## 7 RYAN.KESLER ANA 30.48985 22 8.489851
## 8 JOE.PAVELSKI S.J 30.47975 28 2.479749
## 9 NATHAN.MACKINNON COL 29.61119 16 13.611189
## 10 PATRICE.BERGERON BOS 29.60397 21 8.603967
## # ... with 931 more rows
ggplot(aes(x = xG, y = Goals), data = xg_player) +
geom_point() +
geom_smooth(method = 'lm') +
labs(title = 'Expected Goals vs Goals by Player')
play_xg <- lm(Goals ~ xG, data = xg_player)
summary(play_xg)
##
## Call:
## lm(formula = Goals ~ xG, data = xg_player)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.7164 -1.2687 0.0663 0.9677 19.1669
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.46500 0.13777 -3.375 0.000768 ***
## xG 1.05303 0.01334 78.930 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.044 on 939 degrees of freedom
## Multiple R-squared: 0.869, Adjusted R-squared: 0.8689
## F-statistic: 6230 on 1 and 939 DF, p-value: < 2.2e-16
As the graph and regression line show, the xG model does very well at predicting individual goal totals from the data's inputs. But the graph also shows that the variance between predicted and actual goals is greatest at the high end of the scale. Now let's take a look at the team level.
xg_team <- Test_Fenwick_Data %>%
group_by(event_team) %>%
summarise( xG = sum(xG), Goals = sum(is_goal), Difference = sum(xG) - sum(is_goal))
arrange(xg_team, desc(xG))
## # A tibble: 30 x 4
## event_team xG Goals Difference
## <chr> <dbl> <dbl> <dbl>
## 1 TOR 261.5669 248 13.566932
## 2 PIT 260.5818 269 -8.418173
## 3 CAR 253.6819 212 41.681865
## 4 EDM 251.0783 241 10.078280
## 5 ANA 242.3017 218 24.301721
## 6 NYR 242.1009 253 -10.899078
## 7 DAL 238.9408 221 17.940751
## 8 CBJ 234.6076 247 -12.392359
## 9 BOS 233.8284 231 2.828355
## 10 NYI 231.9795 239 -7.020528
## # ... with 20 more rows
team_xg <- lm(Goals ~ xG, data = xg_team)
summary(team_xg)
##
## Call:
## lm(formula = Goals ~ xG, data = xg_team)
##
## Residuals:
## Min 1Q Median 3Q Max
## -44.583 -13.017 -1.734 14.759 39.242
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.8002 48.1668 0.328 0.74533
## xG 0.9178 0.2137 4.295 0.00019 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.76 on 28 degrees of freedom
## Multiple R-squared: 0.3972, Adjusted R-squared: 0.3756
## F-statistic: 18.45 on 1 and 28 DF, p-value: 0.0001897
ggplot(aes(x = xG, y = Goals), data = xg_team) +
geom_point() +
geom_smooth(method = 'lm') +
labs(title = 'Expected Goals vs Goals by Team')
At the team level the xG model isn't nearly as good at predicting goals as at the individual level. I'd expect this comes from the high variance at the upper end of the xG/goal spectrum: the differences add up and drive team xG totals further from actual goal totals, leading to a worse correlation.
I believe this model provides a reasonable method of predicting the probability that an average shot will be a goal given our inputs. Is it a perfect model? No, not by any means, but it is a good start towards building something more complex and perhaps more accurate. Including some form of shooter and/or goalie talent evaluation, for example, would help improve things. Despite that, the model did have a decent level of success by our various tests.
The regression was statistically significant over the null model, with a chi-square test p-value of 0, meaning the model explains the variance in the data better than the null model. The model also performed fairly in the ROC test, with a .7556 area under the curve. Another example of its accuracy was its ability to predict an individual player's goal output, as shown by the .8689 adjusted R squared between the two values.
Again, all of these performance metrics were run on the test data set, which had no influence on the model. One of the main uses of this metric is seeing whether teams and players are performing above or below their probabilities, given their results versus their expected results.
Where there is a large difference one way or the other, one could reasonably expect a team to regress to the mean at some point in the future; if there is no sign of regression, look deeper to see what factors outside the model could be affecting team and player goal outputs. Things such as injuries, playing time, and quality of teammates could all affect the number of high-probability shots a player attempts.
A couple of notes on what this model does not do. It does not project a player's likely goal output going forward. Tests could be run to check for intra- and inter-season repeatability at the player level, and given the model's performance at the individual level I would expect it to be there. Combined with an aging curve and some form of regression to the mean, I think a reasonable projection model could be created. But as of right now I would not be confident using this model in that capacity at all.
Again, all the code, data, and model files are in the Google Drive folder linked earlier.
Obviously I didn't come up with all this on my own. The writers of the articles below were highly influential in the decisions made in creating this model, and without them it wouldn't have turned out as well as it did. All credit to them for making the path a lot easier for me.
NHL Shot Quality by Ken Krzywicki
Evaluating NHL Goalies, Skaters, and Teams using Weighted Shots