MK INFO-H510 Final Project

European Soccer Predictive and Performance Measures Evaluation

Presentation Link

Audience

This website is written for fans and entry-level soccer analysts who want to understand the relationships, strengths, and weaknesses of common soccer predictive and performance measures. The visualizations and analysis are designed to give the stats-loving soccer fan further insight into the metrics they see on ESPN, FotMob, or other sources of scores and statistics.

Background

Context

There are many metrics that try to determine how strong a soccer team is, who is supposed to win a match, and how a team performed in a match regardless of the official score or league position. A team can put in seemingly strong performances but not pull through with wins, create a lot of great chances but not put the ball in the back of the net, or do the opposite and “scam” its way to wins and goals despite poor underlying metrics. FiveThirtyEight, before the company shut down, was a major source of soccer analysis and statistics, and kept track of in-game performance statistics and pre-match performance predictors. This analysis is built on a former FiveThirtyEight dataset.

Dataset

The aforementioned dataset was collected from Kaggle, but the data collection and use process is described in this FiveThirtyEight article. The full dataset I used from these links is the spi_matches.csv dataset, which contains data from many leagues all across the world across multiple seasons. I chose to focus my analysis on the 5 “biggest” leagues in the world: The English Premier League, Spanish La Liga (Primera Division), German Bundesliga, Italian Serie A, and French Ligue 1. My analysis will include data from 5 different full seasons from 2016 to 2021. The data dictionary is linked here.
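As a rough illustration of the filtering step described above, the sketch below builds df_top_leagues from a small stand-in data frame. The stand-in rows are invented for illustration; the league labels are the ones that appear in the match count tables, and the assumption that season is labeled by its starting year (2016 through 2020) is inferred from those tables rather than confirmed by the data dictionary.

```r
# Toy stand-in for spi_matches.csv (hypothetical rows; the real file has
# many more rows and columns).
spi_matches <- data.frame(
  season = c(2016, 2018, 2020, 2021, 2019),
  league = c("Barclays Premier League", "German Bundesliga",
             "Major League Soccer", "Italy Serie A", "French Ligue 1"),
  stringsAsFactors = FALSE
)

# The 5 leagues analysed, using the labels shown in the tables.
top_leagues <- c("Barclays Premier League", "Spanish Primera Division",
                 "German Bundesliga", "Italy Serie A", "French Ligue 1")

# Keep only the 5 leagues and the 5 seasons labeled 2016 to 2020 (assumed labeling).
df_top_leagues <- subset(
  spi_matches,
  league %in% top_leagues & season %in% 2016:2020
)

df_top_leagues
```

With the real file, the stand-in data frame would be replaced by a call such as read.csv("spi_matches.csv").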

FiveThirtyEight has a number of different pre- and post-match predictive and performance measures. One such variable is ESPN’s Soccer Power Index, or SPI. The FiveThirtyEight article explains how the metric is calculated, but it generally evaluates how strong a team is based on squad value and performances throughout the season and/or previous seasons. It rates teams on a scale from 0 to 100, with higher values indicating a stronger team. FiveThirtyEight also includes multiple variables related to the pre-match predictions for each match. They include a win probability for each team and the probability of a draw, calculated from metrics like SPI, team form, and a baseline home field advantage factor. They also provide a projected score for each team, which is a prediction, from strength measures and previous performances, of how many goals each team will score in the match at hand. In terms of match performance metrics, we will include the actual score for each team and determine results as part of the analysis. We will also look at Expected Goals (xG), one of the most prominent measures in modern-day performance analysis. xG measures the probability that a given shot results in a goal, based on shot context and the outcomes of previous shots from similar situations; a team’s xG for a match is the sum over its shots.

The dataset is organized so that each row contains predictive and performance-related measures for a single match, for both the home and away team. We have basic identifying features such as the date, season, and league. We also have the names of the teams as team1 and team2, where the suffix 1 denotes the home team and the suffix 2 denotes the away team. This convention is consistent across the rest of the variables, so statistics for both the home and away team are recorded in the same row. For instance, we have score1 and score2 for the home and away team’s goals in the match, spi1 and spi2 for each team’s SPI rating, etc.

Problem

We clearly have access to a large number of predictive and performance measures, but how accurate are they? Do they actually show strong relationships with the measures they are supposed to be showing? This article seeks to answer the following questions:

How accurate and strong are pre-match predictive measures like SPI and Win Probability in their relationships with actual match outcomes?

How strong of a predictive measure is Expected Goals in terms of its relationship with actual goals scored?

Are there differences in these relationships for home and away teams?

EDA and Assumptions

Row Counts

First, we will look at the match counts for each league in each season to make sure the numbers seem reasonable. For the Bundesliga, we should see 306 matches per season (34 matches per team, 18 teams, no double counting since each game/row has 2 teams). For the other 4 leagues, we should see 380 matches per season (38 matches, 20 teams). The table below confirms these, so there are no missing matches for any league in any season.
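The expected counts follow from a double round-robin schedule, in which each of the \(n\) teams hosts every other team once:

\[\text{matches per season} = n(n-1): \qquad 18 \times 17 = 306, \qquad 20 \times 19 = 380\]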

Match Counts by League and Season
season league n_matches
2016 Barclays Premier League 380
2016 French Ligue 1 380
2016 Italy Serie A 380
2016 Spanish Primera Division 380
2016 German Bundesliga 306
2017 Barclays Premier League 380
2017 French Ligue 1 380
2017 Italy Serie A 380
2017 Spanish Primera Division 380
2017 German Bundesliga 306
2018 Barclays Premier League 380
2018 French Ligue 1 380
2018 Italy Serie A 380
2018 Spanish Primera Division 380
2018 German Bundesliga 306
2019 Barclays Premier League 380
2019 French Ligue 1 380
2019 Italy Serie A 380
2019 Spanish Primera Division 380
2019 German Bundesliga 306
2020 Barclays Premier League 380
2020 French Ligue 1 380
2020 Italy Serie A 380
2020 Spanish Primera Division 380
2020 German Bundesliga 306

Variable Distributions: Goals, xG, SPI, and Win Probability

There are a few potential outliers for both home and away goals. In a few matches the home team scored at least 7 goals, while in a majority of matches they scored 3 or fewer. Similarly, there is at least 1 match where the away team scored at least 7 goals, while a majority of away teams scored 2 or fewer. These will be interesting to look at later with the Expected Goals analysis.

Again, like with actual goals, there are a couple potential outliers with both home and away expected goals. A majority of home and away expected goals values fall between 0 and 2 to 2.5, but we see a few points with xG values of 5 or 6 and higher. It will be interesting to see if these values are associated with the potential outliers in actual goals, which is what we would expect, or if these occurred in matches where a team vastly underperformed their expected goals.

### SPI
ggplot(df_top_leagues) +
  geom_histogram(aes(x = spi1), bins = 30, color="gray", fill = "lightblue") +
  labs(title = "Home Team SPI Distribution", x = "SPI") + theme_bw()

ggplot(df_top_leagues) +
  geom_histogram(aes(x = spi2), bins = 30, color="gray", fill = "#e68f44") +
  labs(title = "Away Team SPI Distribution", x = "SPI") + theme_bw()

Given that each team plays both home and away, often in back-to-back weeks, these plots are nearly identical. Both are centered around an SPI of 55 to 65. Given that 50 is an average team and this data covers the 5 “best” leagues in the world, it makes sense that the distribution is shifted slightly towards better teams. We see 1 potential outlier with an SPI of around 30 in a few matches, but it is not extreme enough that we would expect it to distort our later analysis.

### Win Probability
ggplot(df_top_leagues) +
  geom_histogram(aes(x = prob1), bins = 30, color="gray", fill = "lightblue") +
  labs(title = "Home Win Probability Distribution", x = "Win Probability") + theme_bw()

ggplot(df_top_leagues) +
  geom_histogram(aes(x = prob2), bins = 30, color="gray", fill = "#e68f44") +
  labs(title = "Away Win Probability Distribution", x = "Win Probability") + theme_bw()

These plots show the baseline home field advantage built into the probability calculation. The 2 histograms have roughly the same shape, but the home team plot is centered around 0.35-0.4 while the away team plot is centered around 0.25-0.3. Playing at home is therefore associated with a win probability about 0.1-0.15 higher. Otherwise, we don’t notice any major outliers, and both distributions are relatively symmetric around their centers.

Variable Correlations

### Home Team
home_vars <- df_top_leagues %>%
  select(spi1, prob1, proj_score1, xg1, score1)

cor_home <- cor(home_vars, use = "complete.obs")

corrplot(cor_home, method = "color", addCoef.col = "black",
         title = "Home Team Correlations", mar = c(0,0,1,0))

### Away Team
away_vars <- df_top_leagues %>%
  select(spi2, prob2, proj_score2, xg2, score2)

cor_away <- cor(away_vars, use = "complete.obs")

corrplot(cor_away, method = "color", addCoef.col = "black",
         title = "Away Team Correlations", mar = c(0,0,1,0))

From these 2 charts, we can see how influential SPI is in the other pre-match predictor variables. Each square shows how strongly 2 variables are linked: the closer the value is to 1 or -1 (and the darker the square), the stronger the association. Each variable, naturally, has a correlation of 1 with itself. Win probability is very highly correlated with projected score (correlation coefficients of 0.92 and 0.95), which is natural since scoring more goals than the other team is how you win matches. SPI is also strongly correlated with these 2 variables (0.66 or 0.69 with win probability and 0.7 or 0.71 with projected score, depending on the plot). This shows that SPI is a strong factor in determining these win and scoring projections. For modeling, we need to be careful about using these 3 variables together, since such strong correlations between predictors can make coefficient estimates unstable.

For the in-game metrics of expected goals and actual goals, there is a moderately strong association between the two, as we would expect. We also see weaker, but still relatively strong, associations between xG and the predictive measures, and moderately weak associations between actual goals and these measures. These suggest a potential relationship, but also that many factors beyond the predictive strength measures influence match outcomes.

Missing Values

missing_summary <- data.frame(
  variable = c(
    "spi1", "prob1", "proj_score1", "xg1", "score1",
    "spi2", "prob2", "proj_score2", "xg2", "score2"
  ),
  missing_count = c(
    sum(is.na(df_top_leagues$spi1)),
    sum(is.na(df_top_leagues$prob1)),
    sum(is.na(df_top_leagues$proj_score1)),
    sum(is.na(df_top_leagues$xg1)),
    sum(is.na(df_top_leagues$score1)),
    sum(is.na(df_top_leagues$spi2)),
    sum(is.na(df_top_leagues$prob2)),
    sum(is.na(df_top_leagues$proj_score2)),
    sum(is.na(df_top_leagues$xg2)),
    sum(is.na(df_top_leagues$score2))
  )
)

kable(missing_summary, caption = "Missing Values by Variable")
Missing Values by Variable
variable missing_count
spi1 0
prob1 0
proj_score1 0
xg1 105
score1 101
spi2 0
prob2 0
proj_score2 0
xg2 105
score2 101

This table shows that, while no matches are missing from the data, some matches do not have the score and/or expected goals recorded. Since these numbers are low relative to the 9,130 matches, we will simply remove these rows from the relevant analyses. Below are the final counts of matches in each league in each season. Note that drop_na() is called here without column arguments, so it removes rows with a missing value in any column; this is why the table loses 314 matches in total rather than 105. Most league-seasons remain complete or nearly complete, with the largest gaps in the 2016 season and in 2019 French Ligue 1 (279 of 380 matches), so omitting these games should not introduce any major risk to the analysis:

league_season_counts <- df_top_leagues %>%
  drop_na() %>%
  group_by(season, league) %>%
  summarise(
    n_matches = n(),
    .groups = "drop"
  ) %>%
  arrange(season, desc(n_matches))

# Well-formatted table
kable(league_season_counts, caption = "Match Counts by League and Season")
Match Counts by League and Season
season league n_matches
2016 Italy Serie A 356
2016 Barclays Premier League 344
2016 Spanish Primera Division 343
2016 French Ligue 1 336
2016 German Bundesliga 279
2017 Barclays Premier League 380
2017 French Ligue 1 380
2017 Spanish Primera Division 372
2017 Italy Serie A 370
2017 German Bundesliga 297
2018 Barclays Premier League 380
2018 French Ligue 1 380
2018 Italy Serie A 380
2018 Spanish Primera Division 380
2018 German Bundesliga 306
2019 Barclays Premier League 379
2019 Italy Serie A 379
2019 Spanish Primera Division 379
2019 German Bundesliga 306
2019 French Ligue 1 279
2020 Italy Serie A 379
2020 Barclays Premier League 377
2020 Spanish Primera Division 377
2020 French Ligue 1 372
2020 German Bundesliga 306
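The counts above depend on which columns are checked for missing values. The sketch below, on an invented toy data frame, contrasts dropping rows with a missing value in any column against dropping only rows that are missing the modeled variables. It uses base R's complete.cases() so that it runs without extra packages, and other_metric is a hypothetical column standing in for any variable not used in the models.

```r
# Toy data: row 1 is missing only an unused column, row 2 is missing the
# modeled variables (all values are invented for illustration).
toy <- data.frame(
  xg1          = c(1.2, NA, 0.8, 2.1),
  score1       = c(1,   NA, 0,   3),
  other_metric = c(NA,  50, 30,  70)   # hypothetical unused column
)

# Drop rows with a missing value in ANY column.
any_col <- toy[complete.cases(toy), ]

# Drop rows only when a modeled variable is missing.
model_cols <- toy[complete.cases(toy[, c("xg1", "score1")]), ]

nrow(any_col)     # rows kept when every column must be present
nrow(model_cols)  # rows kept when only xg1 and score1 must be present
```

The second approach keeps row 1, which is the behavior of the glm() calls later on: they only discard rows missing a variable that appears in the model formula.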

Assumptions

In order to perform sound analyses, particularly in our modeling phases, we have a few general assumptions to make. First, we will assume that matches are independent of one another. This should hold approximately, but it is an important assumption to state. We also need to assume, based on the data description, that SPI, win probabilities, and projected score are pre-match estimates, and that in-game events from those matches do not influence them. Lastly, specifically for our models, we will assume that goals follow a Poisson-like distribution, and that a logit link is appropriate for the binary win/not-win outcomes. With these assumptions, we can move forward in our modeling.

While modeling, based on some of the high correlations with certain variables, we will not use any highly collinear predictors together. In particular, we will not use win probability and projected score together.

Models and Analysis

We will be fitting 2 sets of models for this analysis:

  1. Logistic regression models to look at the relationship between wins and the pre-match predictive measures. There will be one model each for the home and away teams, and we will compare the 2.

  2. Poisson regression models to look at the relationship between actual goals and both xG and the pre-match predictive measures. We will also have a home and an away model.

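Both are generalized linear models, so they describe trends between our response variables (wins and goals) and our predictors that are linear on the link scale: the log-odds of a win and the log of expected goals.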

Logistic Regression: Wins

Given the high correlations seen earlier, we will use only SPI and win probability as explanatory variables in this initial model. Because the two are moderately strongly linked, we also include an interaction term between them. We also rescale win probability so that its coefficient describes an increase in win probability of 0.1 rather than 1 (since probability is between 0 and 1). Thus, our base model is:
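The model inputs named in the output below (prob1_scaled, prob2_scaled, home_win, away_win) are not constructed in the text, so here is a minimal sketch of how they might be derived. The win indicators match the ifelse() definitions used in the diagnostics code; multiplying the probabilities by 10 is an assumption inferred from the "increase of 0.1" interpretation, and the toy values are invented.

```r
# Toy matches (invented values) with the dataset's column names.
matches <- data.frame(
  prob1  = c(0.62, 0.35, 0.48),
  prob2  = c(0.15, 0.38, 0.27),
  score1 = c(2, 1, 0),
  score2 = c(0, 1, 3)
)

# Assumed rescaling: one unit of the scaled variable = 0.1 of win probability.
matches$prob1_scaled <- matches$prob1 * 10
matches$prob2_scaled <- matches$prob2 * 10

# Binary outcomes: a draw counts as 0 for both teams.
matches$home_win <- as.integer(matches$score1 > matches$score2)
matches$away_win <- as.integer(matches$score2 > matches$score1)

matches
```

Under this coding, the second toy match (a 1-1 draw) is a non-win for both the home and away models.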

\[\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 \cdot SPI + \beta_2 \cdot Probability_{win} + \beta_3 \cdot (SPI \times Probability_{win})\]


Call:
glm(formula = home_win ~ spi1 * prob1_scaled, family = binomial, 
    data = df_top_leagues)

Coefficients:
                   Estimate Std. Error z value Pr(>|z|)    
(Intercept)       -3.313512   0.423786  -7.819 5.33e-15 ***
spi1               0.011995   0.006246   1.920   0.0548 .  
prob1_scaled       0.706051   0.088258   8.000 1.25e-15 ***
spi1:prob1_scaled -0.002989   0.001183  -2.528   0.0115 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 12409  on 9028  degrees of freedom
Residual deviance: 11024  on 9025  degrees of freedom
  (101 observations deleted due to missingness)
AIC: 11032

Number of Fisher Scoring iterations: 4

Call:
glm(formula = away_win ~ spi2 * prob2_scaled, family = binomial, 
    data = df_top_leagues)

Coefficients:
                   Estimate Std. Error z value Pr(>|z|)    
(Intercept)       -3.847452   0.371829 -10.347  < 2e-16 ***
spi2               0.017384   0.005465   3.181  0.00147 ** 
prob2_scaled       1.007642   0.109188   9.228  < 2e-16 ***
spi2:prob2_scaled -0.006027   0.001394  -4.325 1.53e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 11152.4  on 9028  degrees of freedom
Residual deviance:  9766.6  on 9025  degrees of freedom
  (101 observations deleted due to missingness)
AIC: 9774.6

Number of Fisher Scoring iterations: 4

From these 2 model outputs, we see that our fitted models are:

\[\log\left(\frac{P(\text{Home Win})}{1 - P(\text{Home Win})}\right) = -3.314 + 0.012 \cdot \text{SPI}_1 + 0.706 \cdot \text{Prob}_1^{(scaled)} - 0.003 \cdot (\text{SPI}_1 \times \text{Prob}_1^{(scaled)})\]

\[\log\left(\frac{P(\text{Away Win})}{1 - P(\text{Away Win})}\right) = -3.847 + 0.017 \cdot \text{SPI}_2 + 1.008 \cdot \text{Prob}_2^{(scaled)} - 0.006 \cdot (\text{SPI}_2 \times \text{Prob}_2^{(scaled)})\]

For the home team model, a 10 percentage point increase in win probability increases the log-odds of a home win by 0.706 at an SPI of 0, holding SPI constant, and this increase shrinks by about 0.003 for each additional point of SPI because of the negative interaction. An increase in SPI by 1 is associated with an increase in the log-odds of a home win of 0.012 at a win probability of 0, holding win probability constant, and this effect likewise shrinks as win probability increases. In terms of odds, these baseline effects correspond to multiplicative factors of 2.03 and 1.012, respectively.

For the away team model, a 10 percentage point increase in away win probability increases the log-odds of an away win by 1.008 at an SPI of 0, holding SPI constant, again shrinking as SPI increases (by about 0.006 per SPI point). For SPI, an increase of 1 is associated with the log-odds of an away win increasing by 0.017 at a win probability of 0, holding win probability constant, with the same kind of offset as in the home model. In terms of odds, these baseline effects correspond to multiplicative factors of 2.74 and 1.017, respectively.
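Because both models include an interaction, the coefficients 0.706 and 1.008 on their own describe a hypothetical team with an SPI of 0. The sketch below recomputes the odds ratio for a 0.1 increase in win probability at a more typical SPI. The coefficients are copied from the summaries above; the choice of SPI = 60 (near the center of the SPI histograms) and the helper name odds_ratio_at_spi are illustrative assumptions.

```r
# Win probability and interaction coefficients from the two model summaries.
home_coef <- c(prob = 0.706051, interaction = -0.002989)
away_coef <- c(prob = 1.007642, interaction = -0.006027)

# Odds ratio for a 0.1 increase in win probability at a given SPI
# (hypothetical helper, not part of the original analysis).
odds_ratio_at_spi <- function(coef, spi) {
  unname(exp(coef["prob"] + coef["interaction"] * spi))
}

home_or <- odds_ratio_at_spi(home_coef, spi = 60)
away_or <- odds_ratio_at_spi(away_coef, spi = 60)

round(c(home = home_or, away = away_or), 2)
```

At an SPI of 60 this gives odds ratios of roughly 1.69 (home) and 1.91 (away), noticeably smaller than the baseline factors of 2.03 and 2.74 quoted for an SPI of 0.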

Using the Wald z-tests reported in the summaries, most terms are significant at the 0.05 level; the exception is SPI in the home model (p = 0.055), which is only marginally significant. Even where SPI is significant, its effect is relatively small (odds ratio close to 1). However, since the interaction term is significant in both models, the relationship between win probability and match outcomes does depend somewhat on team strength. As SPI increases, the marginal impact of win probability on the likelihood of winning decreases slightly. This makes sense, since stronger teams should be more likely to win even if their win probabilities are considered to be lower.

Given the high correlation between SPI and win probability, and the smaller SPI coefficient, a simpler model with just win probability may be sufficient to capture the relationship between the predictive metrics and wins. This does run the risk of ignoring the subtle interaction between strength and predicted outcomes, but would still generally capture the same relationships.

Model Diagnostics

df_home_diag <- df_top_leagues %>%
  drop_na(spi1, prob1_scaled, score1, score2) %>%  # keep only rows with the model inputs and both scores
  mutate(
    home_win = ifelse(score1 > score2, 1, 0),
    p_hat_home = predict(home_logit, newdata = ., type = "response"),
    p_hat_home = pmin(pmax(p_hat_home, 1e-6), 1 - 1e-6),
    cost_home = -(home_win * log(p_hat_home) +
                  (1 - home_win) * log(1 - p_hat_home))
  )

ggplot(df_home_diag, aes(x = factor(home_win), y = cost_home)) +
  geom_boxplot(fill = "#0072B2") +  # colorblind-friendly blue
  scale_y_log10() +
  labs(
    x = "Home Win (0 = No, 1 = Yes)",
    y = "Log-Loss (Cost)",
    title = "Home Model Log-Loss by Outcome"
  ) +
  theme_minimal(base_size = 14)

Based on these boxplots, the model does better when predicting non-home wins (losses/draws). When the home team does not win, we see a lower cost so the model is generally more accurate. It struggles more with actual home wins, as shown by the high cost, and likely underpredicts home wins.

df_away_diag <- df_top_leagues %>%
  drop_na(spi2, prob2_scaled, score1, score2) %>%  # keep only rows with the model inputs and both scores
  mutate(
    away_win = ifelse(score2 > score1, 1, 0),
    p_hat_away = predict(away_logit, newdata = ., type = "response"),
    p_hat_away = pmin(pmax(p_hat_away, 1e-6), 1 - 1e-6),
    cost_away = -(away_win * log(p_hat_away) +
                  (1 - away_win) * log(1 - p_hat_away))
  )

ggplot(df_away_diag, aes(x = factor(away_win), y = cost_away)) +
  geom_boxplot(fill = "#009E73") +  # colorblind-friendly green
  scale_y_log10() +
  labs(
    x = "Away Win (0 = No, 1 = Yes)",
    y = "Log-Loss (Cost)",
    title = "Away Model Log-Loss by Outcome"
  ) +
  theme_minimal(base_size = 14)

The away win model has similar problems with actual wins, likely due to the home advantage factor applied. There is a much higher cost associated with actual away wins (close to 1), so the model likely underpredicts away wins.

Our models appear to be generally OK, but are biased towards the expected outcome. Home teams win more often, so the model leans that way, while away wins are rarer, so the model struggles more with them. This could also be due to the binarization of wins vs. losses and draws, which may cause the models to miss some nuance.
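The cost plotted in both diagnostics is the per-match binary log-loss. The sketch below restates it as a standalone function, using the same clipping constant as the diagnostics code, to show how the cost behaves; the function name log_loss and the example probabilities are illustrative.

```r
# Per-match binary log-loss with the same clipping as the diagnostics code.
log_loss <- function(y, p, eps = 1e-6) {
  p <- pmin(pmax(p, eps), 1 - eps)        # avoid log(0)
  -(y * log(p) + (1 - y) * log(1 - p))
}

# A 50/50 prediction costs log(2) whatever the outcome.
log_loss(1, 0.5)

# For a match that was won (y = 1), a confident correct prediction is cheap
# and a low predicted probability is expensive.
log_loss(1, 0.9)
log_loss(1, 0.2)

# The same low prediction is cheap when the team did not win (y = 0).
log_loss(0, 0.2)
```

One possible reading of the boxplots follows from this: whenever predicted win probabilities sit mostly below 0.5, as the away win probabilities do, actual wins will carry a higher cost than non-wins even if the probabilities are well calibrated.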

Poisson Regression: Goals

Our model for goals scored will use expected goals and SPI. Since SPI is very strongly linked with the other predictive measures, and we want to see how well SPI, as a strength measure, predicts goals, it is the only pre-match measure we include. It also acts as an adjustment alongside xG, accounting for team strength when comparing xG to actual goals, since better teams should, in theory, be more clinical with their chances. Our models will look like:

\[\log(\mathbb{E}[Y_i]) = \beta_0 + \beta_1 \cdot xG_i + \beta_2 \cdot SPI_i\]

home_pois <- glm(
  score1 ~ xg1 + spi1,
  family = poisson,
  data = df_top_leagues
)

summary(home_pois)

Call:
glm(formula = score1 ~ xg1 + spi1, family = poisson, data = df_top_leagues)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.6532997  0.0503965 -12.963  < 2e-16 ***
xg1          0.4108074  0.0084052  48.875  < 2e-16 ***
spi1         0.0054026  0.0007805   6.922 4.45e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 11292.8  on 9024  degrees of freedom
Residual deviance:  8351.6  on 9022  degrees of freedom
  (105 observations deleted due to missingness)
AIC: 25718

Number of Fisher Scoring iterations: 5

Our fitted home goals model is:

\[\log(\mathbb{E}[\text{score1}_i]) = -0.6533 + 0.4108 \cdot \text{xg1}_i + 0.0054 \cdot \text{spi1}_i\]

away_pois <- glm(
  score2 ~ xg2 + spi2,
  family = poisson,
  data = df_top_leagues
)

summary(away_pois)

Call:
glm(formula = score2 ~ xg2 + spi2, family = poisson, data = df_top_leagues)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.9295183  0.0561980 -16.540  < 2e-16 ***
xg2          0.5398972  0.0104256  51.786  < 2e-16 ***
spi2         0.0052001  0.0008621   6.032 1.62e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 11367.2  on 9024  degrees of freedom
Residual deviance:  8228.1  on 9022  degrees of freedom
  (105 observations deleted due to missingness)
AIC: 23190

Number of Fisher Scoring iterations: 5

Our fitted away goals model is:

\[\log(\mathbb{E}[\text{score2}_i]) = -0.9295 + 0.5399 \cdot \text{xg2}_i + 0.0052 \cdot \text{spi2}_i\]

For the home team model, an increase of 1 in expected goals (xG) is associated with an increase in the log of expected goals scored by 0.411, holding SPI constant. An increase in SPI by 1 is associated with an increase in the log of expected goals scored by 0.0054, holding xG constant.

In terms of multiplicative effects, this corresponds to:

A 1 unit increase in xG increases the expected actual goals by a factor of around 1.51.

A 1 unit increase in SPI increases the expected actual goals by a factor of around 1.005.

For the away team model, an increase of 1 in expected goals (xG) is associated with an increase in the log of expected goals scored by 0.540, holding SPI constant. An increase in SPI by 1 is associated with an increase in the log of expected goals scored by 0.0052, holding xG constant.

In terms of multiplicative effects, this corresponds to:

A 1 unit increase in xG increases the expected actual goals by a factor of around 1.72.

A 1 unit increase in SPI increases the expected actual goals by a factor of around 1.005.
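To make the multiplicative effects concrete, the sketch below plugs values into the two fitted equations and returns the expected number of goals. The coefficients are copied from the summaries above; the helper name expected_goals_scored and the example inputs (xG of 1 and 2, SPI of 60) are illustrative assumptions.

```r
# Coefficients from the home and away Poisson model summaries.
home_coef <- c(intercept = -0.6532997, xg = 0.4108074, spi = 0.0054026)
away_coef <- c(intercept = -0.9295183, xg = 0.5398972, spi = 0.0052001)

# Expected goals scored implied by a fitted log-link Poisson model
# (hypothetical helper, not part of the original analysis).
expected_goals_scored <- function(coef, xg, spi) {
  unname(exp(coef["intercept"] + coef["xg"] * xg + coef["spi"] * spi))
}

home_pred <- expected_goals_scored(home_coef, xg = 1, spi = 60)
away_pred <- expected_goals_scored(away_coef, xg = 1, spi = 60)
round(c(home = home_pred, away = away_pred), 3)

# Going from xG = 1 to xG = 2 multiplies the prediction by exp(0.4108), about 1.51.
expected_goals_scored(home_coef, xg = 2, spi = 60) / home_pred
```

For a team with an SPI of 60 and an xG of 1, the models predict about 1.09 goals at home and about 0.93 goals away, both close to a single goal.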

Using the Wald z-tests reported in the summaries, we can also see that each variable is significant, with very low p-values. Thus, both xG and SPI have meaningful relationships with actual goals scored. However, xG has a much larger effect size in both models, which suggests that it is a stronger predictor (as we would expect given that it is measured during the match itself). We also see a larger xG coefficient in the away model than in the home model (0.540 vs. 0.411), suggesting that xG may be a slightly better predictor for away goals than home goals, but the difference is small.

Overall, these suggest that xG is a highly effective predictive metric in terms of its relationship with actual goals. SPI has a smaller, but still significant contribution. A model without SPI may be effective, but still lacks the factor of accounting for differences in team strength which can increase the strength of the relationships.

Model Diagnostics

home_dispersion <- sum(residuals(home_pois, type = "pearson")^2) / home_pois$df.residual
home_dispersion
[1] 0.771206
away_dispersion <- sum(residuals(away_pois, type = "pearson")^2) / away_pois$df.residual
away_dispersion
[1] 0.7700792

Our Poisson models are a reasonable fit based on these dispersion checks. Ideally, this statistic should be close to 1, and values well above 1 indicate overdispersion. Both models are around 0.77, so we do not see any overdispersion. If anything, goals are slightly less variable than a Poisson model assumes (mild underdispersion), which makes the reported standard errors conservative rather than too small.
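For reference, the statistic computed in the code above is the Pearson chi-square divided by the residual degrees of freedom, where \(\hat{\mu}_i\) is the fitted mean for match \(i\) and \(n - p\) is the residual degrees of freedom (9022 in both models):

\[\hat{\phi} = \frac{1}{n - p}\sum_{i=1}^{n} r_i^2, \qquad r_i = \frac{y_i - \hat{\mu}_i}{\sqrt{\hat{\mu}_i}}\]

Under a correctly specified Poisson model the variance equals the mean, so \(\hat{\phi}\) should be close to 1.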

xG vs Actual Goals: Plotting

ggplot(df_top_leagues, aes(x = xg1, y = score1)) +
  geom_point(alpha = 0.25, color = "#0072B2") +
  geom_smooth(method = "lm", color = "black", se = TRUE) +
  labs(
    title = "Home Team: Expected Goals vs Actual Goals",
    x = "Expected Goals (xG)",
    y = "Actual Goals"
  ) +
  theme_minimal(base_size = 14)

ggplot(df_top_leagues, aes(x = xg2, y = score2)) +
  geom_point(alpha = 0.25, color = "#009E73") +
  geom_smooth(method = "lm", color = "black", se = TRUE) +
  labs(
    title = "Away Team: Expected Goals vs Actual Goals",
    x = "Expected Goals (xG)",
    y = "Actual Goals"
  ) +
  theme_minimal(base_size = 14)

As with the models, these 2 visuals highlight the strong relationship between xG and actual goals, and emphasize that xG is a good predictor of actual goals when aggregated over multiple games. There is still noise at the individual match level, but given that our lines of best fit have a slope of around 0.8 - 1, the two are closely related. A slope slightly below 1 is not surprising, since even the best chances do not carry xG values of exactly 1. Penalties, for instance, typically have an xG of around 0.75-0.8, despite the expectation that a player scores their penalty kick.

Conclusions and Insights

Key Findings

  1. Win Prediction (SPI and Probabilities)
    1. Win Probability seems to be a strong predictor of match outcomes
    2. Win probability outweighs SPI, and SPI adds incremental value
    3. Pre-match probability models take team strength into account, and capture a strong predictive signal for match outcomes. A fan or analyst can trust these predictions to be generally strong over the course of a season, though individual match noise will prevent them from ever being 100% accurate.
  2. Goal Scoring (xG and SPI)
    1. xG is a good predictor of actual goals, particularly when aggregated over longer periods of time
    2. Individual match noise still persists, but generally an xG of ~1 is associated with a single goal, for instance.
    3. SPI is significant in this relationship, but is small in magnitude. Stronger teams don’t necessarily always score more goals, as they could be stronger because of defensive strength.
    4. Expected Goals is a more direct measure of attacking output than team strength. Given its strong relationship with goals when aggregated over many games, a fan or analyst can use it as a reliable indicator of chance quality.
  3. Home and Away differences
    1. Away matches are harder to predict with simplistic “win/not-win” models, while home outcomes tend to be more stable
    2. xG outperforms SPI for away scoring
    3. The home advantage factor introduces additional structure that makes home outcomes more predictable, while away teams become harder to predict based on pre-match and live-match factors.

Overall, xG and the pre-match strength indicators do appear to be generally strong at predicting what they are meant to show: xG is good at predicting goals, while team strength and win probabilities are good at predicting match winners. Individual match noise and the home field advantage factor do weaken these metrics, but fans and analysts can reasonably trust them when judging who is likely to win a match and how strong a chance-creating performance a team put in.