library(tidyverse)
### Importing data
df <- read.csv("C:/Users/matth/OneDrive/Documents/INFO_H510/spi_matches.csv")
### Subsetting to only include the top 5 leagues
df_top_leagues <- df |>
  filter(league %in% c("Barclays Premier League", "French Ligue 1",
                       "Italy Serie A", "Spanish Primera Division",
                       "German Bundesliga"))
Goals are the ultimate stat that determines who wins a soccer match. You can have all of the “underlying metrics” you want, but ultimately the only numbers that matter are the goals you score and the goals you allow. Thus, we want to look at how actual goals scored relate to a few different factors. Since goals are a count variable, we will use Poisson regression to model home goals as a function of xG, SPI, and win probability.
# Scale SPI and win prob for interpretation: each is multiplied by 10,
# so one unit of prob1_scaled equals a 0.1 change in win probability
df_top_leagues <- df_top_leagues |>
  mutate(spi1_scaled = spi1 * 10,
         prob1_scaled = prob1 * 10)
m1 <- glm(score1 ~ xg1 + spi1_scaled + prob1_scaled,
          family = poisson(link = 'log'),
          data = df_top_leagues)
summary(m1)
##
## Call:
## glm(formula = score1 ~ xg1 + spi1_scaled + prob1_scaled, family = poisson(link = "log"),
## data = df_top_leagues)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.4723152 0.0534303 -8.840 <2e-16 ***
## xg1 0.3834384 0.0089369 42.905 <2e-16 ***
## spi1_scaled -0.0001005 0.0001036 -0.970 0.332
## prob1_scaled 0.0638882 0.0068356 9.346 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 11292.8 on 9024 degrees of freedom
## Residual deviance: 8262.7 on 9021 degrees of freedom
## (105 observations deleted due to missingness)
## AIC: 25631
##
## Number of Fisher Scoring iterations: 5
Our fitted model is:
\[\log(\widehat{Goals}_i) = -0.47 + 0.38 \cdot xG_i - 0.0001 \cdot SPI_{scaled,i} + 0.06 \cdot WinProb_{scaled,i}\]
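To make the fitted equation concrete, here is a minimal sketch that evaluates it by hand. The coefficients are copied from the `summary(m1)` output; the input values (`xg1 = 1.5`, `spi1 = 70`, `prob1 = 0.5`) are hypothetical and chosen only for illustration, not taken from the data.

```r
# Coefficients copied from the summary(m1) output above
b <- c(intercept = -0.4723152, xg1 = 0.3834384,
       spi1_scaled = -0.0001005, prob1_scaled = 0.0638882)

# Hypothetical home side (illustrative values, not taken from the data)
xg1   <- 1.5   # expected goals
spi1  <- 70    # raw SPI rating
prob1 <- 0.5   # raw win probability

# Apply the same scaling as the model (both predictors multiplied by 10)
log_mu <- b[["intercept"]] +
  b[["xg1"]] * xg1 +
  b[["spi1_scaled"]] * (spi1 * 10) +
  b[["prob1_scaled"]] * (prob1 * 10)

mu <- exp(log_mu)  # back-transform from the log scale to goals
round(mu, 2)       # about 1.42 expected home goals
```

With the model object available, `predict(m1, newdata = ..., type = "response")` should give the same value up to rounding of the coefficients.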
First, we will check for overdispersion in the Poisson model, so that we are not underestimating the variance (and with it the standard errors).
deviance(m1) / df.residual(m1)
## [1] 0.9159401
Since the ratio is close to 1 (slightly below it), this check gives no sign of overdispersion, so we can move forward with the Poisson model without expecting the standard errors to be understated.
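The deviance ratio is one common check; a Pearson-based ratio is another and can serve as a cross-check. The sketch below computes both on simulated data, since the match file is not bundled here. The data frame `sim` and its columns `xg` and `goals` are hypothetical stand-ins, not the real variables.

```r
set.seed(1)

# Simulated stand-in for the match data (not the real spi_matches.csv)
n   <- 2000
xg  <- runif(n, 0.2, 3)
sim <- data.frame(xg = xg,
                  goals = rpois(n, lambda = exp(-0.4 + 0.4 * xg)))

fit <- glm(goals ~ xg, family = poisson(link = "log"), data = sim)

# Deviance-based ratio (the check used above)
dev_ratio <- deviance(fit) / df.residual(fit)

# Pearson-based ratio: sum of squared Pearson residuals over residual df
pearson_ratio <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)

c(deviance = dev_ratio, pearson = pearson_ratio)
```

For our model the same two lines would be applied to `m1`. Values well above 1 on either ratio would point toward a quasi-Poisson or negative binomial model instead.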
Next, we will check the residuals.
plot(fitted(m1), resid(m1))
abline(h = 0, col = "red")
hist(resid(m1), breaks = 30)
Based on the two plots, we should be cautious: the model tends to overestimate when it predicts a higher number of goals. However, this seems to occur mostly in outlying matches where the home team clearly dominated and the model predicts an outlandish number of goals. At more reasonable fitted values, the residuals show no clear pattern, which is good. The histogram shows no major outliers, but we again need to be cognizant of the second peak, where the model overestimates the number of goals actually scored.
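One possible contributor to a second peak of negative residuals is the discreteness of the response: with small counts, deviance residuals fall into one band per observed score, and every 0-goal match lands in the negative band. The sketch below illustrates this on simulated data (the `sim` data frame is a hypothetical stand-in), so it shows the mechanism only and does not confirm that it explains our plots.

```r
set.seed(2)

# Simulated low-count Poisson data (a stand-in, not the match data)
n   <- 1500
xg  <- runif(n, 0.2, 3)
sim <- data.frame(xg = xg,
                  goals = rpois(n, lambda = exp(-0.4 + 0.4 * xg)))
fit <- glm(goals ~ xg, family = poisson(link = "log"), data = sim)

# resid() on a glm returns deviance residuals by default
res <- resid(fit)

# Colour each point by its observed count to expose the bands
plot(fitted(fit), res, col = sim$goals + 1, pch = 16, cex = 0.6,
     xlab = "Fitted goals", ylab = "Deviance residual")
abline(h = 0, col = "red")

# Residuals of the 0-goal matches: all negative, since fitted > observed
zero_res <- res[sim$goals == 0]
range(zero_res)
```

Colouring our own residual plot by `score1` in the same way would show whether the second peak is made up mostly of scoreless home sides.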
cor(df_top_leagues[, c("xg1", "spi1_scaled", "prob1_scaled")],
use = "complete.obs")
## xg1 spi1_scaled prob1_scaled
## xg1 1.0000000 0.3777779 0.4734074
## spi1_scaled 0.3777779 1.0000000 0.6635510
## prob1_scaled 0.4734074 0.6635510 1.0000000
We see fairly high correlation, especially between SPI and win probability (0.66). There is also a correlation of nearly 0.5 between win probability and xG. This is something to be mindful of, and in the future we should perhaps consider interaction terms or keeping only one of win probability and xG. Since SPI showed low statistical significance, we will omit it from our further analysis anyway.
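Pairwise correlations only look at two predictors at a time. A variance inflation factor (VIF) summarizes how well each predictor is explained by all the others together. Below is a minimal base-R sketch on simulated predictors; the helper `vif_one` and the columns `xg`, `spi`, and `prob` are hypothetical names introduced for illustration.

```r
set.seed(3)

# Simulated correlated predictors (stand-ins for xg1, spi1_scaled, prob1_scaled)
n    <- 1000
spi  <- rnorm(n)
prob <- 0.7 * spi + rnorm(n, sd = 0.7)
xg   <- 0.5 * prob + rnorm(n, sd = 0.9)
X    <- data.frame(xg = xg, spi = spi, prob = prob)

# VIF for one predictor: regress it on the others, then 1 / (1 - R^2)
vif_one <- function(name, data) {
  others <- setdiff(names(data), name)
  r2 <- summary(lm(reformulate(others, response = name), data = data))$r.squared
  1 / (1 - r2)
}

vifs <- sapply(names(X), vif_one, data = X)
round(vifs, 2)
```

A VIF of 1 means no inflation; values above roughly 5 to 10 are a common rule of thumb for concern. Running the same helper on the three model predictors would put a number on how much the 0.66 correlation actually matters.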
Since a Poisson regression models the mean response on a log scale, we will apply an exp() transformation to our coefficients to get the multiplicative change in actual goals. Due to the high p-value (and thus low statistical significance) for SPI, we will only interpret the coefficients for xG and win probability:
exp(coef(m1)["xg1"])
## xg1
## 1.467321
exp(coef(m1)["prob1_scaled"])
## prob1_scaled
## 1.065973
From this, we can say that, holding SPI and win probability constant, we expect the number of actual goals scored by the home team to increase by a factor of 1.47 (about 47%) for a 1 unit increase in xG. This is consistent with what we would expect from expected goals. It is very rare to get a single chance above 0.9 xG, and chances above 0.5 xG would be considered extremely good chances that a player would likely score. Thus, it makes sense that an extra 1.00 xG would increase the expected actual goals scored by a factor greater than 1.
As for the coefficient of win probability, we expect the actual goals scored by the home team to increase by a factor of 1.07 for an increase of 0.1 in win probability, holding xG and SPI constant. Again, this feels fair. Intuitively, chance quality seems like a much stronger driver of how many goals a team scores than the team’s likelihood of victory. Some teams are strong because of a better defense rather than because they score more goals. However, given their higher likelihood of winning (according to the predictive modeling that creates win probability), we would still expect the number of actual goals scored to increase when win probability increases, which is echoed in this coefficient.
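The point estimates above can be paired with approximate 95% Wald intervals. The sketch below builds them from the estimates and standard errors reported in the `summary(m1)` output, so it runs without the data; the normal approximation is an assumption of this sketch.

```r
# Estimates and standard errors copied from the summary(m1) output
est <- c(xg1 = 0.3834384, prob1_scaled = 0.0638882)
se  <- c(xg1 = 0.0089369, prob1_scaled = 0.0068356)

z <- qnorm(0.975)  # about 1.96 for a 95% interval

# Build the interval on the log scale, then exponentiate
ci <- cbind(ratio = exp(est),
            lower = exp(est - z * se),
            upper = exp(est + z * se))
round(ci, 3)
```

Both intervals sit entirely above 1 (roughly 1.44 to 1.49 for xG and 1.05 to 1.08 for win probability), which matches the small p-values in the summary. With the model object in hand, `exp(confint.default(m1))` gives the same Wald intervals directly.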