Creating Your First xG Model

Ok, with Corsica being down for now, there has been a lack of publicly available xG (Expected Goals) data. Obviously that will change when it comes back up, but it got me thinking about building my own model, which in turn reminded me of when I first heard about xG and wanted to know how it worked. While Emmanuel Perry and DTMAboutHeart do an excellent job of explaining their models here and here, to the uninitiated the explanations can be quite daunting.

At that point in my hockey analytics education I understood Corsi, GF%, and why +/- is bad. But this was completely new and cool and I wanted to understand it more. I understood what xG was and what it said, but I wanted to understand how to actually come up with it. I wanted to peek behind the curtain and see the wizard at work, and unfortunately there was little in the way of that at the time. This article serves to fill that need.

Ground Rules

First, a couple of things to get out of the way. I'm assuming you have some familiarity with hockey stats like Corsi and Fenwick and the like. I won't be explaining what these mean in the article, but this link has the basic rundown. This will be written in R, so if you have some experience working with that language that is a bonus. However, this article is written so that if you just follow the steps you can get the model up and running as long as you have some familiarity with RStudio. RStudio can be downloaded here, and if you'd like to learn more R, here is a good link to learn the basics; there is also the Swirl package that can be run in RStudio. This link will show you the directions to get that up and running in RStudio.

Normally I work in Python, but I'm trying to improve my R knowledge, which is another reason this is written in R. A Python version will follow as soon as I get the R version up and running. So if you don't want to learn R, or don't have time, I'll have a version for you soon.

Also, this model won't be perfect by any means. If you're expecting a polished model on the level of what is publicly out there, this won't be it. In fact I'm not even sure how good it will be, because as I type this I haven't run the numbers yet. I will learn things just as you do, because I'm running the program as I write. The purpose of this article is to familiarize you with the process behind making the model and to look at some of the decisions behind which variables we include. There will be a part two where we fine tune the model and go deeper into the data.

The Data

The data for this project comes from the 2015, 2016, and 2017 NHL seasons. I'd like to thank EvolvingWild for helping me set up the scraper for the NHL data. I keep telling myself I'm going to write one and never get around to it, so his help was crucial to getting this underway. The Play by Play (PbP) data can be found here in csv form, along with the file for the code and the actual model itself. The files are rather large, around .4 GB, as they contain every single event from each season, so they may take a while to download. You will also find the R Markdown file that this is written in and the code for the model if you want to play around with it.

The Model

For this xG model we will be building what is called a logistic regression to predict our xG values. There are a few reasons why I am choosing this over other modeling options:

However, there are flaws to a logistic model as well:

Output of the Model

Logistic regressions produce a probability of something happening given the independent variables. So when the model says that a shot attempt is worth .60 xG, it means that, for an average player, six out of ten shot attempts with those exact same variables will produce a goal. This link talks about it from a soccer perspective, but a lot of the same themes apply to hockey as well.

We will then sum these probabilities to get a team's xG total per game, which is why game xG totals are often decimal rather than whole values.
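As a quick illustration of that summing, here are some made-up per-shot probabilities (not model output); a game's xG total is just their sum:

shot_probs <- c(0.03, 0.11, 0.60, 0.07) # hypothetical xG values for four shots
sum(shot_probs)
## [1] 0.81

Ok, enough backstory, it's time to move on to working with the data.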

Importing the Data

library(readr)   # read_csv()
library(dplyr)   # bind_rows(), filter(), group_by(), etc.
library(ggplot2) # plotting later on

setwd('~/DataAnalysis/Programming Stuff/R/NHLScraper/FullNHLSeasonPbP/')
PbP2015Data <- read_csv('2015NHLPbP.csv')
PbP2016Data <- read_csv('2016NHLPbP.csv')
Test_PbP_Data <- read_csv('2017NHLPbP.csv')

Put whatever directory you downloaded the NHL PbP data to in the setwd() function. From there we use the read_csv() function from the readr package we just loaded. We have three seasons' worth of data, but right now I'm only going to read in two of them for training. The reason is the validation step, where we test the model against data that is out of sample, i.e. data the model didn't use in its training set.

Usually the holdout is around 30% of your dataset, but I broke the data apart by season instead of by percentage for a couple of reasons. The main one is that I want no data from 2017 in my training data, so at the end I can do a comparison and see whether my model has any predictive power or is merely descriptive. If you are unsure of the difference, here is a good link that describes it well. Secondly, a season-by-season split matches the way people around the NHL segment their analysis. (For reference, a random percentage split is sketched below.)
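For comparison, a conventional random 70/30 split would look something like this. We don't use this approach in the article, and all_data here is a hypothetical combined dataframe:

set.seed(42) # make the random split reproducible
train_idx <- sample(nrow(all_data), size = floor(0.7 * nrow(all_data)))
train <- all_data[train_idx, ]
test  <- all_data[-train_idx, ]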

So now that all the data from the 2015 and 2016 NHL seasons is imported, the first thing we will do is combine it into one giant dataframe with the bind_rows function and then take a look at the structure of the result.

Train_PbP_Data <- bind_rows(PbP2015Data, PbP2016Data)
dim(Train_PbP_Data)
## [1] 2353074      42
str(Train_PbP_Data, max.level = 1)
## Classes 'tbl_df', 'tbl' and 'data.frame':    2353074 obs. of  42 variables:
##  $ season             : int  20152016 20152016 20152016 20152016 20152016 20152016 20152016 20152016 20152016 20152016 ...
##  $ game_id            : int  2015020001 2015020001 2015020001 2015020001 2015020001 2015020001 2015020001 2015020001 2015020001 2015020001 ...
##  $ game_date          : Date, format: "2015-10-07" "2015-10-07" ...
##  $ session            : chr  "R" "R" "R" "R" ...
##  $ event_index        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ game_period        : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ game_seconds       : int  0 0 0 0 14 14 36 36 37 37 ...
##  $ event_type         : chr  "PSTR" "ON" "ON" "FAC" ...
##  $ event_description  : chr  "Period Start- Local time: 7:20 EDT" NA NA "MTL won Neu. Zone - MTL #14 PLEKANEC vs TOR #42 BOZAK" ...
##  $ event_detail       : chr  NA NA NA NA ...
##  $ event_team         : chr  NA "MTL" "TOR" "MTL" ...
##  $ event_player_1     : chr  NA NA NA "TOMAS.PLEKANEC" ...
##  $ event_player_2     : chr  NA NA NA "TYLER.BOZAK" ...
##  $ event_player_3     : chr  NA NA NA NA ...
##  $ event_length       : int  0 0 0 14 0 22 0 1 0 3 ...
##  $ coords_x           : int  0 NA NA -1 48 19 NA NA NA NA ...
##  $ coords_y           : int  0 NA NA 0 25 22 NA NA NA NA ...
##  $ players_substituted: chr  NA "MTL11, MTL14, MTL31, MTL67, MTL76, MTL79" "TOR2, TOR3, TOR19, TOR23, TOR42, TOR45" NA ...
##  $ home_on_1          : chr  NA NA "MATT.HUNWICK" "MATT.HUNWICK" ...
##  $ home_on_2          : chr  NA NA "DION.PHANEUF" "DION.PHANEUF" ...
##  $ home_on_3          : chr  NA NA "JOFFREY.LUPUL" "JOFFREY.LUPUL" ...
##  $ home_on_4          : chr  NA NA "SHAWN.MATTHIAS" "SHAWN.MATTHIAS" ...
##  $ home_on_5          : chr  NA NA "TYLER.BOZAK" "TYLER.BOZAK" ...
##  $ home_on_6          : chr  NA NA "JONATHAN.BERNIER" "JONATHAN.BERNIER" ...
##  $ away_on_1          : chr  NA "BRENDAN.GALLAGHER" "BRENDAN.GALLAGHER" "BRENDAN.GALLAGHER" ...
##  $ away_on_2          : chr  NA "TOMAS.PLEKANEC" "TOMAS.PLEKANEC" "TOMAS.PLEKANEC" ...
##  $ away_on_3          : chr  NA "CAREY.PRICE" "CAREY.PRICE" "CAREY.PRICE" ...
##  $ away_on_4          : chr  NA "MAX.PACIORETTY" "MAX.PACIORETTY" "MAX.PACIORETTY" ...
##  $ away_on_5          : chr  NA "P.K..SUBBAN" "P.K..SUBBAN" "P.K..SUBBAN" ...
##  $ away_on_6          : chr  NA "ANDREI.MARKOV" "ANDREI.MARKOV" "ANDREI.MARKOV" ...
##  $ home_goalie        : chr  NA NA "JONATHAN.BERNIER" "JONATHAN.BERNIER" ...
##  $ away_goalie        : chr  NA "CAREY.PRICE" "CAREY.PRICE" "CAREY.PRICE" ...
##  $ home_team          : chr  "TOR" "TOR" "TOR" "TOR" ...
##  $ away_team          : chr  "MTL" "MTL" "MTL" "MTL" ...
##  $ home_skaters       : int  0 0 5 5 5 5 5 5 3 5 ...
##  $ away_skaters       : int  0 5 5 5 5 5 2 5 5 5 ...
##  $ home_score         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ away_score         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ game_score_state   : chr  "0v0" "0v0" "0v0" "0v0" ...
##  $ game_strength_state: chr  "EvE" "Ev5" "5v5" "5v5" ...
##  $ highlight_code     : chr  NA NA NA NA ...
##  $ X1                 : int  NA NA NA NA NA NA NA NA NA NA ...

dim() gives us the dimensions of the dataframe of NHL PbP data: 2,353,074 observations of 42 different variables. Will we need all 42 of those variables? We would if we were trying to create a full stat database, but we'll mainly be working with the rows, or observations, that deal with Fenwick events. Why Fenwick and not Corsi? We are limited to those events because the NHL records blocked shots at the location of the block, not the location of the shot. As you'll see later, distance plays a key role in our model, so if we feed it false information we could get a false output; hence Fenwick events only. Those events are labeled in the event_type variable of the dataframe, so let's look at what's there and pick out what we need.

unique(Train_PbP_Data$event_type)
##  [1] "PSTR"  "ON"    "FAC"   "STOP"  "OFF"   "HIT"   "SHOT"  "GIVE" 
##  [9] "MISS"  "GOAL"  "BLOCK" "PENL"  "TAKE"  "PEND"  "GEND"  "GOFF" 
## [17] "SOC"   "EIEND" "EISTR" "CHL"   "EGT"   "EGPID"

So the only events the model needs are SHOT, MISS, and GOAL. We're going to create what's called a vector in R so we don't have to type those over and over again. We are also going to create a few functions that build some variables we need in the datasets, so we don't have to cut and paste. These functions can be saved and used on other NHL dataframes as well, so feel free to improve or reuse them as needed.

The first thing I'm going to do is filter out all the shootout attempts. The data lists the shootout as period five, so we'll just keep all rows with a value less than 5 in the game_period column.

fenwick_events <- c('SHOT', 'MISS', 'GOAL')

Train_PbP_Data <- Train_PbP_Data[Train_PbP_Data$game_period < 5,]
dim(Train_PbP_Data)
## [1] 2349958      42

The next couple of functions are pretty simple as well: I want to create new columns that tell me whether each event was a goal and whether it was an event by the home team. While this information is available in other columns, these flags make my checks, and therefore my code, simpler later on. Also, a column whose sole job is to say whether the event was a goal makes it easy to pass to the logistic regression without any extra filtering, since the only two options will be 1 or 0. This sets up an outcome the logistic regression can model. The technical term for this is creating a dummy variable: "The dummy variables act like 'switches' that turn various parameters on and off in an equation."

# Flag events by the home team (1) vs the away team (0)
is_home <- function(dataframe){
    dataframe$is_home <- ifelse(dataframe$event_team == 
                                dataframe$home_team, 1, 0)
    return(dataframe)
}

# Flag whether the event was a goal (1) or not (0)
is_goal <- function(dataframe){
    dataframe$is_goal <- ifelse(dataframe$event_type == "GOAL", 1, 0)
    return(dataframe)
}

Train_PbP_Data <- is_home(Train_PbP_Data)
Train_PbP_Data <- is_goal(Train_PbP_Data)

Ok, now we have those two columns set up relatively easily. Reading a lot of different research on expected goals points to these variables as the most important when modeling xG: distance from goal, angle to the goal, shot type, whether the shot was a rebound, whether it was a shot on the rush, and the shooting team's strength state, i.e. PP, EV, SH. This model will include all of these variables, but we need to set them up first.

The first thing I'm going to do is create a time difference between each event in the dataset, which we will then use to determine whether a shot was on the rush or was a rebound. We'll also set any NAs that result from the calculations to zero, in case there are NAs in the PbP variables we use to calculate them.

Train_PbP_Data <- Train_PbP_Data %>% group_by(game_id) %>%
    arrange(event_index, .by_group = TRUE) %>%
    mutate(time_diff = game_seconds - lag(game_seconds))

Train_PbP_Data$time_diff[is.na(Train_PbP_Data$time_diff)] <- 0
Train_PbP_Data$is_home[is.na(Train_PbP_Data$is_home)] <- 0

# Rebound: a Fenwick event within 3 seconds of another event by the same team
Train_PbP_Data$is_rebound <- ifelse(Train_PbP_Data$time_diff < 3 & 
                                Train_PbP_Data$event_type %in% fenwick_events &
                                Train_PbP_Data$event_team == 
                                lag(Train_PbP_Data$event_team),
                                1, 0)

Train_PbP_Data$is_rebound[is.na(Train_PbP_Data$is_rebound)] <- 0

# Rush: a Fenwick event within 4 seconds of an event outside the offensive zone
Train_PbP_Data$is_rush <- ifelse(Train_PbP_Data$time_diff < 4 &
                                 lag(abs(Train_PbP_Data$coords_x)) < 25 &
                                 Train_PbP_Data$event_type %in% fenwick_events,
                                 1, 0)

To make this easier we are going to filter the data down to just the Fenwick events before applying the rest of our data manipulations and the regression.

Train_PbP_Data$is_rebound[is.na(Train_PbP_Data$time_diff)] <- 0
Train_PbP_Data$is_rush[is.na(Train_PbP_Data$is_rush)] <- 0
Train_Fenwick_Data <- filter(Train_PbP_Data, event_type %in% fenwick_events)

The resulting dataframe is much smaller and should be easier to work with. Now, on to creating the variables to include in the logistic regression. Let's check the data we'll be working with to make sure there are no NA or NaN values that need dealing with before doing our calculations.

unique(Train_Fenwick_Data$event_detail)
## [1] "Wrist"       "Backhand"    "Snap"        "Tip-In"      "Slap"       
## [6] "Wrap-around" "Deflected"   NA
unique(Train_Fenwick_Data$coords_x)
##   [1]  61 -54 -40 -72  37 -66 -57 -60  68  82  63  51 -58 -65  76 -47 -42
##  [18]  75  53  40  77 -55 -83 -73  35 -71  39  66 -17  59  55 -39  49 -79
##  [35] -77 -53  46 -64 -85 -89 -75 -14 -33  45  81 -80  84  62  36  60  57
##  [52]  67 -52 -63  52  72 -48   6 -38  56  70 -81 -34  31 -43  42 -35  44
##  [69]  34 -62  33 -56  65  38  64  41  80 -32 -70 -37  32 -84  85  73  18
##  [86] -45  69  78  48  54  24  50 -61  43 -67  79 -59  86 -10  87  71 -82
## [103] -49  58 -50  83 -68 -78  47 -36 -69 -28 -87 -41  74 -76 -51  30 -74
## [120]  19   3 -31  88 -30 -46 -88 -22   4  11 -29   5  -6  91  -4 -16 -15
## [137] -12 -13 -23  27 -86  89  -7 -26  -3 -44  17  29  12 -91  28 -11  16
## [154]  -8  -9  25   8  15  -1 -19  22  90  92 -20   9 -25  26   7 -21  -2
## [171] -24  10   1 -93 -97  21  -5  93 -90 -95 -94  13   2   0 -18  97 -92
## [188]  23  14  20  95  94 -27  NA  96 -96 -99  98
unique(Train_Fenwick_Data$coords_y)
##  [1]  29  32  -8  30   1 -34 -13  -5 -15   5 -16  10 -18  22  31  37  25
## [18]  -2   3 -20  33  21   8  12  27   0   2   4  17 -21   9 -25 -24   6
## [35] -31  34 -17   7 -32  16  -9  36 -29 -11 -27 -23 -36 -28  15  18  23
## [52] -33 -26  40  38  14  11 -35 -22 -30  -7  24  -4 -38  13  28  -1 -19
## [69] -14  19  -6  -3  26  20 -12  39 -39 -40 -10  35 -37 -41  NA  41  42
## [86] -42
unique(Train_Fenwick_Data$game_strength_state)
##  [1] "5v5" "5vE" "4v5" "5v4" "Ev5" "4v4" "5v3" "3v4" "4vE" "4v3" "3v5"
## [12] "EvE" "Ev4" "5v2" "2v5" "3vE" "3v3" "Ev3" "Ev0"

It looks like there is some NA data in the event_detail column, so let's take a look at those rows.

head(Train_Fenwick_Data[is.na(Train_Fenwick_Data$event_detail),], n = 15)
## # A tibble: 10 x 47
## # Groups:   game_id [1]
##    season game_id game_date session event_index game_period game_seconds
##     <int>   <int>    <date>   <chr>       <int>       <int>        <int>
##  1     NA      NA        NA    <NA>          NA          NA           NA
##  2     NA      NA        NA    <NA>          NA          NA           NA
##  3     NA      NA        NA    <NA>          NA          NA           NA
##  4     NA      NA        NA    <NA>          NA          NA           NA
##  5     NA      NA        NA    <NA>          NA          NA           NA
##  6     NA      NA        NA    <NA>          NA          NA           NA
##  7     NA      NA        NA    <NA>          NA          NA           NA
##  8     NA      NA        NA    <NA>          NA          NA           NA
##  9     NA      NA        NA    <NA>          NA          NA           NA
## 10     NA      NA        NA    <NA>          NA          NA           NA
## # ... with 40 more variables: event_type <chr>, event_description <chr>,
## #   event_detail <chr>, event_team <chr>, event_player_1 <chr>,
## #   event_player_2 <chr>, event_player_3 <chr>, event_length <int>,
## #   coords_x <int>, coords_y <int>, players_substituted <chr>,
## #   home_on_1 <chr>, home_on_2 <chr>, home_on_3 <chr>, home_on_4 <chr>,
## #   home_on_5 <chr>, home_on_6 <chr>, away_on_1 <chr>, away_on_2 <chr>,
## #   away_on_3 <chr>, away_on_4 <chr>, away_on_5 <chr>, away_on_6 <chr>,
## #   home_goalie <chr>, away_goalie <chr>, home_team <chr>,
## #   away_team <chr>, home_skaters <int>, away_skaters <int>,
## #   home_score <int>, away_score <int>, game_score_state <chr>,
## #   game_strength_state <chr>, highlight_code <chr>, X1 <int>,
## #   is_home <dbl>, is_goal <dbl>, time_diff <dbl>, is_rebound <dbl>,
## #   is_rush <dbl>

So there are 10 rows that contain nothing but NA values, so the next move is to go ahead and filter those out.

Train_Fenwick_Data <- filter(Train_Fenwick_Data, !is.na(event_detail))
head(Train_Fenwick_Data[is.na(Train_Fenwick_Data$event_detail),])
## # A tibble: 0 x 47
## # Groups:   game_id [0]
## # ... with 47 variables: season <int>, game_id <int>, game_date <date>,
## #   session <chr>, event_index <int>, game_period <int>,
## #   game_seconds <int>, event_type <chr>, event_description <chr>,
## #   event_detail <chr>, event_team <chr>, event_player_1 <chr>,
## #   event_player_2 <chr>, event_player_3 <chr>, event_length <int>,
## #   coords_x <int>, coords_y <int>, players_substituted <chr>,
## #   home_on_1 <chr>, home_on_2 <chr>, home_on_3 <chr>, home_on_4 <chr>,
## #   home_on_5 <chr>, home_on_6 <chr>, away_on_1 <chr>, away_on_2 <chr>,
## #   away_on_3 <chr>, away_on_4 <chr>, away_on_5 <chr>, away_on_6 <chr>,
## #   home_goalie <chr>, away_goalie <chr>, home_team <chr>,
## #   away_team <chr>, home_skaters <int>, away_skaters <int>,
## #   home_score <int>, away_score <int>, game_score_state <chr>,
## #   game_strength_state <chr>, highlight_code <chr>, X1 <int>,
## #   is_home <dbl>, is_goal <dbl>, time_diff <dbl>, is_rebound <dbl>,
## #   is_rush <dbl>

Next we'll turn the categorical variables into factors so the model can use them in the training function later. This helps with graphing as well.

Train_Fenwick_Data$event_detail<- as.factor(Train_Fenwick_Data$event_detail)

The next variable to calculate is the angle of the shot relative to the center of the goal. Let’s look at the range of our coordinate system.

range(Train_Fenwick_Data$coords_x)
## [1] NA NA
range(Train_Fenwick_Data$coords_y)
## [1] NA NA

Well that didn’t work so we’ll need to filter out the NA values:

Train_Fenwick_Data <- filter(Train_Fenwick_Data, !is.na(coords_x) & !is.na(coords_y))

range(Train_Fenwick_Data$coords_x)
## [1] -99  99
range(Train_Fenwick_Data$coords_y)
## [1] -42  42

So here we have the range of the NHL rink in x and y coordinates. An NHL rink is 200 feet long by 85 feet wide, so each y unit up or down is equal to about one foot and each x unit is equal to .995 feet. The goal line is 11 feet from the end boards, which equals 11.05 units in our coordinate system. With this knowledge we now know the coordinates of the center of the goal: (87.95, 0).
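Putting those conversions together confirms the goal line's x coordinate:

99 - 11.05 # x extreme minus the goal line's offset from the end boards
## [1] 87.95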

So, given a shot location, let's take a test case of a shot from (50, 12) and try to determine the angle using what we know about the coordinate system. The angle we're after, theta, satisfies sin(theta) = opposite/hypotenuse. The opposite side is equal to the y coordinate. The hypotenuse's length we get from the trusty Pythagorean theorem: sqrt((87.95 - 50)^2 + 12^2). Putting that all together we have this:

asin(12/sqrt(1440.20+144))
## [1] 0.3062574

That answer is in radians; to convert it to degrees we multiply by 180 and divide by pi.

radian <- asin(12/sqrt(1440.20+144))
(radian * 180)/3.14
## [1] 17.55616

So we have a shot taken at about a 17 degree angle from the center of the goal. Let's put all this together and create a new column for each shot's angle from the goal. To simplify things I'm also going to mirror the x values by taking the abs() value of each, as I don't believe which end of the ice the shot is taken at has much of an effect. I will also flip the sign of the y value for shots from the negative-x end so that the wing the shot was taken from is still displayed correctly.

Train_Fenwick_Data$coords_y <- ifelse(Train_Fenwick_Data$coords_x < 0,
    -1 * Train_Fenwick_Data$coords_y, Train_Fenwick_Data$coords_y)

Train_Fenwick_Data$coords_x <- abs(Train_Fenwick_Data$coords_x)


Train_Fenwick_Data$shot_angle <- (asin(abs(Train_Fenwick_Data$coords_y)/sqrt((87.95 
                                    - abs(Train_Fenwick_Data$coords_x))^2
                                    + Train_Fenwick_Data$coords_y^2))*180)/ 3.14

With this metric, shots with a 0 angle are straight on to the net, i.e. from the slot. There's another problem though: shots from behind the net should result in an angle larger than 90 degrees, but our formula won't produce that, because the negative difference between the goal line and the shot washes out when we square it. To fix that we adjust the angle whenever the x coordinate is beyond the x coordinate of the goal line. Now that we have the coordinates and angle straight, the next thing to do is construct the distance from the goal, which is just the simple Pythagorean theorem again:

Train_Fenwick_Data$shot_angle <- ifelse(abs(Train_Fenwick_Data$coords_x) > 88, 90 + 
                                  (180-(90 + Train_Fenwick_Data$shot_angle)), 
                                  Train_Fenwick_Data$shot_angle)

Train_Fenwick_Data$distance <- sqrt((87.95 - abs(Train_Fenwick_Data$coords_x))^2 + Train_Fenwick_Data$coords_y^2)

The strength state is important, but it is a little muddy and there are a lot of different factors. While game_strength_state gives the strength state of the game, it will be easier on the model if we instead tell it the strength state of the team taking the shot: even during a PP, a shot taken by the penalty-killing team is logically less likely to be a dangerous chance than a shot by the team on the PP. (A sketch of building that column follows.)
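The original code for building shooter_strength isn't shown in this article, so here is a minimal sketch of one way it could be derived from game_strength_state and is_home. The function name add_shooter_strength is mine; penalty shots ("PS") would need the event description and are not handled, and the case where the shooting team's own net is empty is glossed over:

# Hedged sketch: derive the shooting team's strength relative to the defenders.
# game_strength_state is "home skaters v away skaters", e.g. "5v4"; "E" marks
# an empty net. PS detection and the shooter's-own-net-empty case are omitted.
add_shooter_strength <- function(df){
    home_sk <- substr(df$game_strength_state, 1, 1)
    away_sk <- substr(df$game_strength_state, 3, 3)
    for_sk  <- ifelse(df$is_home == 1, home_sk, away_sk)
    against <- ifelse(df$is_home == 1, away_sk, home_sk)
    df$shooter_strength <- dplyr::case_when(
        against == "E"    ~ "EN", # shooting at an empty net
        for_sk == against ~ "EV", # even strength
        for_sk > against  ~ "PP", # skater advantage
        TRUE              ~ "SH"  # shorthanded
    )
    df$shooter_strength <- as.factor(df$shooter_strength)
    return(df)
}

Train_Fenwick_Data <- add_shooter_strength(Train_Fenwick_Data)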

So we've got all our variables set and ready to go: shot location, shot type, strength state, rebound and rush flags, and whether the shot came from the home team. Note that for this base model we are going to ignore the player actually taking the shot and the goalie in net. Obviously these are important variables: a shot by Sidney Crosby is reasonably expected to go in at a higher rate, with all other variables controlled, than the same shot by, say, Jay McClement. Likewise, that same shot against Carey Price or Braden Holtby would reasonably be expected to have a lower chance of going in than against Cam Ward or Chad Johnson.

So yes, I realize that leaving out those variables makes a big difference, but for the purposes of this model and article we can safely ignore them, because right now I'm not trying to show the best model. I'm trying to show a model and walk through how to build one.

Building the Model

Now, with all our variables set, we are finally ready to build the actual model. The first step is training the model on our training data, using all the variables we created. We'll do this with R's glm() function, which stands for generalized linear model.

xGmodel <- glm(is_goal ~ poly(distance, 3, raw = TRUE) + 
                 poly(shot_angle, 3, raw = TRUE) + event_detail + 
                 shooter_strength +
                 is_rebound + is_rush,
                 data = Train_Fenwick_Data, 
                 family = binomial(link = 'logit'))

save(xGmodel, file = "xGmodelver2.rda")

The first argument of the glm function is the formula we want to use to predict goals, and data is our training dataframe. The method is a generalized linear model, but specifying the binomial family tells it to fit a logistic regression. That's really all there is to building the model itself; in many modeling scenarios, getting the data ready beforehand and testing the model afterwards take up the majority of the time spent.
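As a quick sanity check (my addition, not part of the original walkthrough), we can confirm the fitted model produces goal probabilities between 0 and 1 for the training shots:

# Fitted probabilities for the first few training shots; all should be in (0, 1)
head(predict(xGmodel, type = "response"))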

summary(xGmodel)
## 
## Call:
## glm(formula = is_goal ~ poly(distance, 3, raw = TRUE) + poly(shot_angle, 
##     3, raw = TRUE) + event_detail + shooter_strength + is_rebound + 
##     is_rush, family = binomial(link = "logit"), data = Train_Fenwick_Data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4014  -0.3865  -0.2622  -0.1961   3.1432  
## 
## Coefficients:
##                                    Estimate Std. Error z value Pr(>|z|)
## (Intercept)                       2.190e+00  8.420e-02  26.007  < 2e-16
## poly(distance, 3, raw = TRUE)1   -9.059e-02  4.834e-03 -18.739  < 2e-16
## poly(distance, 3, raw = TRUE)2    4.495e-04  1.396e-04   3.221  0.00128
## poly(distance, 3, raw = TRUE)3    1.600e-06  1.137e-06   1.407  0.15933
## poly(shot_angle, 3, raw = TRUE)1  4.947e-03  2.356e-03   2.100  0.03574
## poly(shot_angle, 3, raw = TRUE)2 -4.741e-04  5.847e-05  -8.108 5.14e-16
## poly(shot_angle, 3, raw = TRUE)3  3.324e-06  4.026e-07   8.256  < 2e-16
## event_detailDeflected             3.155e-02  6.192e-02   0.510  0.61040
## event_detailSlap                  7.702e-01  4.579e-02  16.822  < 2e-16
## event_detailSnap                  7.289e-01  4.034e-02  18.067  < 2e-16
## event_detailTip-In                6.536e-02  4.382e-02   1.491  0.13585
## event_detailWrap-around          -6.484e-01  1.148e-01  -5.646 1.64e-08
## event_detailWrist                 4.390e-01  3.417e-02  12.847  < 2e-16
## shooter_strengthEV               -3.347e+00  6.268e-02 -53.399  < 2e-16
## shooter_strengthPP               -2.909e+00  6.525e-02 -44.579  < 2e-16
## shooter_strengthPS               -1.051e+01  4.395e+01  -0.239  0.81100
## shooter_strengthSH               -3.236e+00  8.676e-02 -37.294  < 2e-16
## is_rebound                        4.300e-01  2.345e-02  18.338  < 2e-16
## is_rush                           1.439e+00  6.734e-02  21.369  < 2e-16
##                                     
## (Intercept)                      ***
## poly(distance, 3, raw = TRUE)1   ***
## poly(distance, 3, raw = TRUE)2   ** 
## poly(distance, 3, raw = TRUE)3      
## poly(shot_angle, 3, raw = TRUE)1 *  
## poly(shot_angle, 3, raw = TRUE)2 ***
## poly(shot_angle, 3, raw = TRUE)3 ***
## event_detailDeflected               
## event_detailSlap                 ***
## event_detailSnap                 ***
## event_detailTip-In                  
## event_detailWrap-around          ***
## event_detailWrist                ***
## shooter_strengthEV               ***
## shooter_strengthPP               ***
## shooter_strengthPS                  
## shooter_strengthSH               ***
## is_rebound                       ***
## is_rush                          ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 97215  on 203688  degrees of freedom
## Residual deviance: 85528  on 203670  degrees of freedom
##   (131 observations deleted due to missingness)
## AIC: 85566
## 
## Number of Fisher Scoring iterations: 7
coef(xGmodel)
##                      (Intercept)   poly(distance, 3, raw = TRUE)1 
##                     2.189886e+00                    -9.059070e-02 
##   poly(distance, 3, raw = TRUE)2   poly(distance, 3, raw = TRUE)3 
##                     4.495095e-04                     1.599734e-06 
## poly(shot_angle, 3, raw = TRUE)1 poly(shot_angle, 3, raw = TRUE)2 
##                     4.946806e-03                    -4.741020e-04 
## poly(shot_angle, 3, raw = TRUE)3            event_detailDeflected 
##                     3.323821e-06                     3.155053e-02 
##                 event_detailSlap                 event_detailSnap 
##                     7.702066e-01                     7.288615e-01 
##               event_detailTip-In          event_detailWrap-around 
##                     6.535929e-02                    -6.484371e-01 
##                event_detailWrist               shooter_strengthEV 
##                     4.389754e-01                    -3.347231e+00 
##               shooter_strengthPP               shooter_strengthPS 
##                    -2.908783e+00                    -1.051076e+01 
##               shooter_strengthSH                       is_rebound 
##                    -3.235612e+00                     4.299816e-01 
##                          is_rush 
##                     1.438987e+00

In the output you can see that all but a few of the model's predictor variables are statistically significant at the .001 level. The exceptions fit conventional wisdom: shot types like deflections and tip-ins account for a small share of NHL goals, and for penalty shots there probably just wasn't enough data to make a determination one way or the other.
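If you'd rather pull the p-values out programmatically than eyeball the summary, the coefficient matrix from summary() has them as a column:

# Extract the p-values from the coefficient table of the fitted model
summary(xGmodel)$coefficients[, "Pr(>|z|)"]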

One way to test the fit of the model is to take the chi-square distribution of the difference between the null and residual deviances, with their respective degrees of freedom.

1 - pchisq(97215 - 85528, 203688 - 203670)
## [1] 0

What this test tells us is whether the difference in deviance between the null model, which is a model with only the intercept, and the logistic model we created is due to chance or is statistically significant. Subtracting from one gives the area to the right of our statistic on the chi-square distribution curve. Since it is zero, the model is statistically significantly better at describing the variance in our data than the null model. This doesn't mean the model is "good" by any means; fortunately there are more tests for that. You can read more about null and residual deviances here and here.

Ok, now that we have the model ready to go and its fit is statistically significant over the null model, it's time to see how it does on our test data. I'll perform all the same steps on the test data to create the variables as I did on the training data. I won't go through them all again, but if you want to see the code it's in the Google Drive with the rest of the files, and a condensed version is sketched below.
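For completeness, here is a condensed sketch of that test-set prep. It simply repeats the training-data steps from above, including the hypothetical add_shooter_strength helper sketched earlier, so treat the Google Drive code as the authoritative version:

# Repeat the training-data feature engineering on the 2017 test season
Test_PbP_Data <- Test_PbP_Data %>%
    filter(game_period < 5) %>% # drop shootout attempts
    is_home() %>%
    is_goal() %>%
    group_by(game_id) %>%
    arrange(event_index, .by_group = TRUE) %>%
    mutate(time_diff = game_seconds - lag(game_seconds))

Test_PbP_Data$time_diff[is.na(Test_PbP_Data$time_diff)] <- 0
Test_PbP_Data$is_home[is.na(Test_PbP_Data$is_home)] <- 0
Test_PbP_Data$is_rebound <- ifelse(Test_PbP_Data$time_diff < 3 &
                               Test_PbP_Data$event_type %in% fenwick_events &
                               Test_PbP_Data$event_team ==
                               lag(Test_PbP_Data$event_team), 1, 0)
Test_PbP_Data$is_rush <- ifelse(Test_PbP_Data$time_diff < 4 &
                            lag(abs(Test_PbP_Data$coords_x)) < 25 &
                            Test_PbP_Data$event_type %in% fenwick_events, 1, 0)
Test_PbP_Data$is_rebound[is.na(Test_PbP_Data$is_rebound)] <- 0
Test_PbP_Data$is_rush[is.na(Test_PbP_Data$is_rush)] <- 0

# Keep only Fenwick events with usable shot detail and coordinates
Test_Fenwick_Data <- filter(Test_PbP_Data, event_type %in% fenwick_events,
                            !is.na(event_detail), !is.na(coords_x), !is.na(coords_y))
Test_Fenwick_Data$event_detail <- as.factor(Test_Fenwick_Data$event_detail)

# Mirror the coordinates and build angle and distance exactly as before
Test_Fenwick_Data$coords_y <- ifelse(Test_Fenwick_Data$coords_x < 0,
    -1 * Test_Fenwick_Data$coords_y, Test_Fenwick_Data$coords_y)
Test_Fenwick_Data$coords_x <- abs(Test_Fenwick_Data$coords_x)
Test_Fenwick_Data$shot_angle <- (asin(abs(Test_Fenwick_Data$coords_y) /
    sqrt((87.95 - abs(Test_Fenwick_Data$coords_x))^2 +
    Test_Fenwick_Data$coords_y^2)) * 180) / 3.14
Test_Fenwick_Data$shot_angle <- ifelse(abs(Test_Fenwick_Data$coords_x) > 88,
    90 + (180 - (90 + Test_Fenwick_Data$shot_angle)),
    Test_Fenwick_Data$shot_angle)
Test_Fenwick_Data$distance <- sqrt((87.95 - abs(Test_Fenwick_Data$coords_x))^2 +
    Test_Fenwick_Data$coords_y^2)
Test_Fenwick_Data <- add_shooter_strength(Test_Fenwick_Data)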

So now that we have our test data ready to go, we will predict the expected goals for each Fenwick attempt in it:

Test_Fenwick_Data$xG <- predict(xGmodel, Test_Fenwick_Data, type = "response")

Let's graph our predictions by location, using the average xG value at each coordinate, to get a feel for the data.

avg_xG_by_coord <- Test_Fenwick_Data %>% group_by(coords_x, coords_y) %>%
    summarise(xg = mean(xG))

ggplot(avg_xG_by_coord, aes(coords_x, coords_y, fill = xg)) + geom_raster() +
    scale_fill_gradient(low = 'blue', high = 'red')+
    geom_vline(xintercept = 0, color = 'red') +
    geom_vline(xintercept = 25, color = 'blue') +
    geom_vline(xintercept = 88, color = 'red') +
    xlab('X Coordinates') + ylab('Y Coordinates') +
    labs(title = 'Average xG Value by Coordinate')

As you can see there are some weird outliers in the neutral zone. I'd assume these are caused by the inclusion of empty net goals in the prediction model. Other than that, the plot fits the intuition that the closer to the goal and the closer to the center of the ice a shooter is, the greater the chance of a shot becoming a goal. One quick way to poke at those outliers is shown below.
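This check is my addition rather than part of the original walkthrough: sorting shots from outside the offensive zone by predicted xG should surface the empty-net attempts if that guess is right (shooter_strength here is the sketched column from earlier):

Test_Fenwick_Data %>%
    filter(coords_x < 25) %>% # shots from outside the offensive zone
    arrange(desc(xG)) %>%
    select(event_player_1, event_detail, shooter_strength, distance, xG) %>%
    head()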

The next way to test the model's predictions is to plot its ROC curve and calculate the area under the curve, which we'll do with the pROC package in R.

library(pROC) # roc() and auc()

g <- roc(is_goal ~ xG, data = Test_Fenwick_Data)
plot(g)

auc(g)
## Area under the curve: 0.7556

You can read more about interpreting the ROC curve at this link, but the gist is that by this metric the model is a fair one. It's not a perfect model, but it's not completely useless either, and for a first attempt it shows I'm working in the right direction and that we have a good foundation on which to build improvements.

Ok, let's get to the fun stuff and see how individual players performed in the 2017 season according to our model. I'm also going to plot each player's xG total against their goal total as another check on the model's predictions.

#code to group xg and goals by player
xg_player <- Test_Fenwick_Data %>%
    group_by(event_player_1, event_team) %>%
    summarise( xG = sum(xG), Goals = sum(is_goal), Difference = sum(xG) - sum(is_goal))
head(xg_player)
## # A tibble: 6 x 5
## # Groups:   event_player_1 [6]
##    event_player_1 event_team          xG Goals  Difference
##             <chr>      <chr>       <dbl> <dbl>       <dbl>
## 1    AARON.EKBLAD        FLA 11.02887921    10  1.02887921
## 2      AARON.NESS        WSH  0.02557414     0  0.02557414
## 3 ADAM.CLENDENING        NYR  1.35385413     2 -0.64614587
## 4  ADAM.CRACKNELL        DAL  8.22744066    10 -1.77255934
## 5       ADAM.ERNE        T.B  3.54879312     3  0.54879312
## 6   ADAM.HENRIQUE        N.J 18.51550598    20 -1.48449402
arrange(xg_player, desc(xG))
## # A tibble: 941 x 5
## # Groups:   event_player_1 [880]
##      event_player_1 event_team       xG Goals Difference
##               <chr>      <chr>    <dbl> <dbl>      <dbl>
##  1     JEFF.SKINNER        CAR 34.58222    37  -2.417784
##  2  AUSTON.MATTHEWS        TOR 34.00383    40  -5.996170
##  3     JOHN.TAVARES        NYI 33.65972    28   5.659719
##  4   CONNOR.MCDAVID        EDM 33.59372    30   3.593719
##  5   WAYNE.SIMMONDS        PHI 33.41914    31   2.419142
##  6      COREY.PERRY        ANA 30.50376    19  11.503756
##  7      RYAN.KESLER        ANA 30.48985    22   8.489851
##  8     JOE.PAVELSKI        S.J 30.47975    28   2.479749
##  9 NATHAN.MACKINNON        COL 29.61119    16  13.611189
## 10 PATRICE.BERGERON        BOS 29.60397    21   8.603967
## # ... with 931 more rows
ggplot(aes(x = xG, y = Goals), data = xg_player) +
    geom_point() + 
    geom_smooth(method = 'lm') +
    labs(title = 'Expected Goals vs Goals by Player')