### Packages
library(tidyverse)
### Load Data
### Importing data
df <- read.csv("C:/Users/matth/OneDrive/Documents/INFO_H510/spi_matches.csv")
### Subsetting to only include the top 5 leagues
df_top_leagues <- df |>
  filter(league %in% c("Barclays Premier League", "French Ligue 1", "Italy Serie A", "Spanish Primera Division", "German Bundesliga"))

Creating Binary Variable

This dataset does not contain a clean binary variable that really makes sense for analysis. However, there are a few options that we can create which open the doors for interesting analyses. The most impactful “stat” you can look at in soccer on the team level is if the team won or not. At the end of the day, this is the point of the game. Based on the structure of our dataset, with each row containing stats for the home and away team in a particular match, we will create 2 binary variables: home_winindicates if the home team won, and away_win indicates if the away team won. Keep in mind that soccer allows for the match to end in a tie, so these are not necessarily the opposite of each other. We will then fit logistic regression models for home wins and for away wins separately.

Another important variable shift is related to our interpretation of SPI in the model. Since we interpret a logistic regression model coefficient related to a 1 unit increase, and SPI only ranges from [0,1], the interpretation is really out of scope. Thus, we will create a spi_scaled variable for both the home and away team that multiplies SPI by 10. Thus, the coefficient will then be associated with an increase in 0.1 in SPI, which is a much more realistic difference in SPI.

### Creating variables
df_top_leagues <- df_top_leagues |>
  # Home win
  mutate(home_win = case_when(
    score1 > score2 ~ 1,
    TRUE ~ 0
  )) |>
  # Away win
  mutate(away_win = case_when(
    score2 > score1 ~ 1,
    TRUE ~ 0
  )) |>
  # Scaled SPI
  mutate(spi1_scaled = spi1*10) |>
  mutate(spi2_scaled = spi2*10)
head(df_top_leagues)
##   season       date league_id                  league         team1
## 1   2016 2016-08-12      1843          French Ligue 1        Bastia
## 2   2016 2016-08-12      1843          French Ligue 1     AS Monaco
## 3   2016 2016-08-13      2411 Barclays Premier League     Hull City
## 4   2016 2016-08-13      2411 Barclays Premier League       Burnley
## 5   2016 2016-08-13      2411 Barclays Premier League   Southampton
## 6   2016 2016-08-13      2411 Barclays Premier League Middlesbrough
##                 team2  spi1  spi2  prob1  prob2 probtie proj_score1 proj_score2
## 1 Paris Saint-Germain 51.16 85.68 0.0463 0.8380  0.1157        0.91        2.36
## 2            Guingamp 68.85 56.48 0.5714 0.1669  0.2617        1.82        0.86
## 3      Leicester City 53.57 66.81 0.3459 0.3621  0.2921        1.16        1.24
## 4        Swansea City 58.98 59.74 0.4482 0.2663  0.2854        1.37        1.05
## 5             Watford 69.49 59.33 0.5759 0.1874  0.2367        1.91        1.05
## 6          Stoke City 56.32 60.35 0.4380 0.2692  0.2927        1.30        1.01
##   importance1 importance2 score1 score2  xg1  xg2 nsxg1 nsxg2 adj_score1
## 1        32.4        67.7      0      1 0.97 0.63  0.43  0.45       0.00
## 2        53.7        22.9      2      2 2.45 0.77  1.75  0.42       2.10
## 3        38.1        22.2      2      1 0.85 2.77  0.17  1.25       2.10
## 4        36.5        29.1      0      1 1.24 1.84  1.71  1.56       0.00
## 5        34.1        30.7      1      1 1.05 0.22  1.52  0.41       1.05
## 6        33.9        32.5      1      1 1.40 0.55  1.13  1.06       1.05
##   adj_score2 home_win away_win spi1_scaled spi2_scaled
## 1       1.05        0        1       511.6       856.8
## 2       2.10        0        0       688.5       564.8
## 3       1.05        1        0       535.7       668.1
## 4       1.05        0        1       589.8       597.4
## 5       1.05        0        0       694.9       593.3
## 6       1.05        0        0       563.2       603.5

Model Structure

To model wins, we will look at a few different variables: the team’s xG, the opponent’s xG, and the team’s strength. Thus our models will include xg1, xg2, and spi1 or spi2 as explanatory variables. The only thing that will shift when modeling log odds for home and away wins is the coefficient interpretations. They will basically “flip” in interpretation as we move from the home to the away team, since xg1 looks at home xG, xg2 looks at away xG, and spi1 or spi2 correlates to either the home or away team. Thus, this is the base fitted model structure for the log odds of both home and away wins:

\[\log\left(\frac{\hat{p}}{1 - \hat{p}}\right) = \hat{\beta_0} + \hat{\beta_1}*(xG_{home}) + \hat{\beta_2}*(xG_{away}) + \hat{\beta_3}*(SPI_{team})\]

Home Wins

### Home Wins model:

m1 <- glm(home_win ~ xg1 + xg2 + spi1_scaled, data = df_top_leagues,
             family = binomial(link = 'logit'))

summary(m1)
## 
## Call:
## glm(formula = home_win ~ xg1 + xg2 + spi1_scaled, family = binomial(link = "logit"), 
##     data = df_top_leagues)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.7824021  0.1590644 -11.206  < 2e-16 ***
## xg1          0.9282467  0.0337253  27.524  < 2e-16 ***
## xg2         -0.9070155  0.0371240 -24.432  < 2e-16 ***
## spi1_scaled  0.0017902  0.0002265   7.903 2.72e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 12404  on 9024  degrees of freedom
## Residual deviance: 10099  on 9021  degrees of freedom
##   (105 observations deleted due to missingness)
## AIC: 10107
## 
## Number of Fisher Scoring iterations: 4

Here is the fitted model for home wins:

\[\log\left(\frac{\hat{p}}{1 - \hat{p}}\right) = -1.78 + 0.93*(xG_{home}) - 0.917*(xG_{away}) + 0.002*(SPI_{scaled})\]

Based on the t-test p-values, we can say that all 3 coefficients related to our explanatory variables are significant (extremely low p-values). Here is how we can interpret these:

\(\hat{\beta_1} = 0.93:\) Holding away xG (opponent xG) and the home team’s SPI constant, the log odds of the home team winning increase by 0.93 for every additional 1 unit increase in home xG. Converting to odds, we see that a +1 increase in home xG multiplies the odds of a home win by 2.53 (odds increase)

\(\hat{\beta_2} = -0.91:\) Holding home xG and home SPI constant, the log odds of the home team winning decrease by 0.91 for every additional 1 unit increase in away xG. In terms of the odds, we see that a +1 increase in away xG multiplies the odds of a home win by 0.4 (odds decrease).

\(\hat{\beta_1}\) and \(\hat{\beta_2}\) have very similar magnitudes, but in opposite direction. Thus, this model treats home and away chance creation nearly symmetrically, so the relationship in match outcome is highly driven by the difference in xG between the teams rather than absolute values alone.

\(\hat{\beta_3} = 0.002:\) Holding home and away xG constant, the log odds of the home team winning increase by 0.002 for an increase of 0.1 in home SPI. While this variable is significant, it has a very low magnitude. This implies that once match-level performance (through xG) is accounted for, the pre-match team strength rating contributes little additional explanatory power.

Model Diagnostics

# Compute the cost function
y <- df_top_leagues$home_win
p_hat <- m1$fitted.values
cost <- -(y * log(p_hat) + (1 - y) * log(1 - p_hat))

# Plot
ggplot() +
  geom_boxplot(
    aes(x = as.factor(y), y = cost),
    orientation = "h"
  ) +
  labs(
    x = "Home Win (0 = No, 1 = Yes)",
    y = "Cost"
  ) +
  scale_y_log10()

The model seems to be OK at least a noisy soccer context, but there is definitely still some uncertainty surrounding match outcomes. While there are many points with low cost values, more of the points are higher in value for cost. Soccer is a very noisy sport, and as we have seen in other data dives, xG and SPI are not perfect predictors of outcome. Finishing, goalkeeper performance, and general luck/randomness play a huge role in actual match outcomes. These are not captured in xG, hence the noisy predictions.

Away Wins

### Away Wins model:

m2 <- glm(away_win ~ xg1 + xg2 + spi2_scaled, data = df_top_leagues,
             family = binomial(link = 'logit'))

summary(m2)
## 
## Call:
## glm(formula = away_win ~ xg1 + xg2 + spi2_scaled, family = binomial(link = "logit"), 
##     data = df_top_leagues)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.4941288  0.1779491 -14.016   <2e-16 ***
## xg1         -0.9363005  0.0394308 -23.745   <2e-16 ***
## xg2          1.1273245  0.0389955  28.909   <2e-16 ***
## spi2_scaled  0.0021561  0.0002434   8.859   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 11146.2  on 9024  degrees of freedom
## Residual deviance:  8675.5  on 9021  degrees of freedom
##   (105 observations deleted due to missingness)
## AIC: 8683.5
## 
## Number of Fisher Scoring iterations: 5

We get the following fitted model for the log odds of away wins:

\[\log\left(\frac{\hat{p}}{1 - \hat{p}}\right) = -2.49 -0.94*(xG_{home}) + 1.13*(xG_{away}) + 0.002*(SPI_{scaled})\]

All 3 explanatory variables have significant coefficients under a t-test, so we can interpret them as follows:

\(\hat{\beta_1} = -0.94:\) We expect the log odds of an away win to decrease by 0.94 for every 1 unit increase in home team xG, holding away xG and away team SPI constant. Converted to odds, a +1 increase in home xG multiplies the odds of an away win by 0.39 (decreases the odds).

\(\hat{\beta_2} = 1.13:\) We expect the log odds of an away win to increase by 1.13 for every 1 unit increase in away team xG, holding home xG and away team SPI constant. Converted to odds, a +1 increase in away xG multiplies the odds of an away win by 3.1 (increases the odds).

The models of home and away win log odds are very similar. Home and away xG have similar magnitudes for their coefficients, but are in different directions. Thus, we can again say that the model treats home and away chance creation relatively symmetrically, with a slight lean towards away xG in modeling the log odds of an away win. But in general, it (in a sense) looks at the difference between the 2 more since they are so similar in magnitude, similar to with home wins.

We see nearly the same \(\hat{\beta_3}\) value in the away win model, which means that once again, once match-level performance factors like xG are accounted for, the pre-match predictive numbers contribute very little explanatory power.

Diagnostics

# Compute the cost function
y <- df_top_leagues$home_win
p_hat <- m2$fitted.values
cost <- -(y * log(p_hat) + (1 - y) * log(1 - p_hat))

# Plot
ggplot() +
  geom_boxplot(
    aes(x = as.factor(y), y = cost),
    orientation = "h"
  ) +
  labs(
    x = "Away Win (0 = No, 1 = Yes)",
    y = "Cost"
  ) +
  scale_y_log10()

Similarly to the home win model, we see a lot of noise in terms of modeling the actual wins of the away team. The model performs OK, but is generally better for when the away team did not win. Interestingly, though, when the away team did win, the model performed worse. The model systematically underestimates the probability of away wins in this model, according to the plot, since we have higher values when Away Win = 1. This does seem to make sense in a soccer context, since there is a general presence of a home field advantage and away wins are less frequent and are often associated with more noisy match-level performance indicators.