### Packages
library(tidyverse)
### Importing data
df <- read.csv("C:/Users/matth/OneDrive/Documents/INFO_H510/spi_matches.csv")
### Subsetting to only include the top 5 leagues
df_top_leagues <- df |>
  filter(league %in% c("Barclays Premier League", "French Ligue 1", "Italy Serie A", "Spanish Primera Division", "German Bundesliga"))

xG vs Actual Goals

ggplot(df_top_leagues, aes(x = xg1, y = as.factor(score1))) + 
  geom_point(color = "lightblue") + 
  labs(title = "Relationship Between Expected Goals and Actual Goals (Home Team)", x = "Home Expected Goals", y = "Home Actual Goals") +
  theme_bw()

### Correlation Coefficient for xG vs Actual Goals
cor_xg_score <- cor(df_top_leagues$xg1,
                    df_top_leagues$score1,
                    use = "complete.obs",
                    method = "pearson")

cor_xg_score
## [1] 0.5833942

Home xG and home actual goals follow a roughly positive trend, though the relationship is not especially strong. Other factors, such as finishing quality, also play a role in how expected goals translate into actual goals. There are a couple of potential outliers, most notably the point with a home xG of about 5.9 and 0 actual goals, along with a few similar gaps between xG and actual goals, but none of them stands out sharply in the plot.

From the output, we can see that the correlation coefficient is 0.58. This indicates a moderately strong positive correlation, which is mirrored in the scatterplot. The points don’t follow a strict line in the plot, but rather a general positive trend with some spread.
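
For reference, `cor(..., method = "pearson")` is the sample covariance divided by the product of the two sample standard deviations. The sketch below checks this on a small made-up data set. The vectors `xg_toy` and `goals_toy` are hypothetical illustration values, not taken from `spi_matches.csv`.

```r
# Hypothetical toy values for illustration only (not from spi_matches.csv)
xg_toy    <- c(0.4, 1.1, 1.8, 2.5, 0.9, 3.2)
goals_toy <- c(0, 1, 1, 3, 2, 2)

# Pearson's r: sample covariance scaled by both sample standard deviations
r_manual  <- cov(xg_toy, goals_toy) / (sd(xg_toy) * sd(goals_toy))
r_builtin <- cor(xg_toy, goals_toy, method = "pearson")

# The two values should agree up to floating-point error
all.equal(r_manual, r_builtin)
```

Because the coefficient only measures linear association, a value around 0.6 can coexist with a wide spread of points around the trend, which is what the scatterplot shows.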

ggplot(df_top_leagues, aes(x = xg2, y = as.factor(score2))) + 
  geom_point(color = "lightblue") + 
  labs(title = "Relationship Between Expected Goals and Actual Goals (Away Team)", x = "Away Expected Goals", y = "Away Actual Goals") +
  theme_bw()

### Correlation Coefficient for xG vs Actual Goals
cor_xg_score <- cor(df_top_leagues$xg2,
                    df_top_leagues$score2,
                    use = "complete.obs",
                    method = "pearson")

cor_xg_score
## [1] 0.6041109

There is no major difference in this relationship when we switch to away xG vs actual goals. There again appears to be a moderately strong positive relationship between the two variables, with the points following this general trend. This time, though, there are few noticeable potential outliers. A few points with high xG still somewhat “underperform” their xG (actual goals < xG), but they aren’t far from the rest of the points. The correlation coefficient between xG and actual goals is 0.60, slightly higher than for the home team, perhaps because there are fewer potential outliers, and it again confirms the plot’s moderately strong positive relationship.

Confidence Intervals

t.test(df_top_leagues$score1)
## 
##  One Sample t-test
## 
## data:  df_top_leagues$score1
## t = 112.13, df = 9028, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  1.516598 1.570565
## sample estimates:
## mean of x 
##  1.543582

From this t-test output, we can see a 95% confidence interval for home actual goals of (1.517, 1.571). We are therefore 95% confident that the true average number of home goals in Europe’s top 5 leagues is between 1.517 and 1.571 goals per game.
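
As a rough illustration of where such an interval comes from, the sketch below rebuilds a 95% t-interval by hand (mean plus or minus the critical t value times the standard error) and compares it with what `t.test()` reports. The vector `home_goals_toy` is a hypothetical example, not data from `spi_matches.csv`.

```r
# Hypothetical toy goal counts for illustration only (not from spi_matches.csv)
home_goals_toy <- c(2, 0, 1, 3, 1, 2, 0, 1, 4, 1)

n       <- length(home_goals_toy)
std_err <- sd(home_goals_toy) / sqrt(n)   # standard error of the mean
t_crit  <- qt(0.975, df = n - 1)          # two-sided 95% critical value

# Manual interval: mean -/+ critical value * standard error
ci_manual <- mean(home_goals_toy) + c(-1, 1) * t_crit * std_err

# Interval reported by t.test() for the same data (attributes dropped)
ci_ttest <- as.numeric(t.test(home_goals_toy)$conf.int)

all.equal(ci_manual, ci_ttest)
```

With roughly 9,000 matches, the standard error is small, which is why the intervals in this report are so narrow.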

t.test(df_top_leagues$score2)
## 
##  One Sample t-test
## 
## data:  df_top_leagues$score2
## t = 99.302, df = 9028, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  1.208687 1.257367
## sample estimates:
## mean of x 
##  1.233027

The confidence interval for away goals has a similar width, but is shifted downward. From the output, we can conclude that we are 95% confident the true average number of away goals in Europe’s top 5 leagues is between 1.209 and 1.257 goals per game.

SPI vs Win Probability

ggplot(df_top_leagues, aes(x = spi1, y = prob1)) + 
  geom_point(color = "lightblue") + 
  labs(title = "Relationship Between SPI and Win Probability (Home Team)", x = "Home SPI", y = "Home Win Probability") +
  theme_bw()

### Correlation Coefficient for SPI vs Win Probability
cor_spi_prob <- cor(df_top_leagues$spi1,
                    df_top_leagues$prob1,
                    use = "complete.obs",
                    method = "pearson")

cor_spi_prob
## [1] 0.6603522

From the plot, we can see a generally strong relationship between home SPI and home win probability. This is what we would expect: the stronger a team is, the higher its win probability should be. Other factors also play a part, such as opponent strength, form, and injuries. These and other confounding variables weaken the relationship somewhat, but it is still generally strong. The only potential outliers appear among the lowest SPI values, where win probability falls between about 0.3 and 0.55, which is higher than the trend would suggest. However, I would not call them glaring outliers.

The correlation coefficient between home SPI and home win probability is 0.66, which again confirms this generally strong positive relationship. This indicates that SPI, and thus team strength, is a major factor in win probability, as we would expect.

ggplot(df_top_leagues, aes(x = spi2, y = prob2)) + 
  geom_point(color = "lightblue") + 
  labs(title = "Relationship Between SPI and Win Probability (Away Team)", x = "Away SPI", y = "Away Win Probability") +
  theme_bw()

### Correlation Coefficient for SPI vs Win Probability
cor_spi_prob <- cor(df_top_leagues$spi2,
                    df_top_leagues$prob2,
                    use = "complete.obs",
                    method = "pearson")

cor_spi_prob
## [1] 0.6897497

We see the same relationship between away SPI and away win probability as we did for the home teams, but the win probabilities are generally shifted down. This suggests that home field advantage is built into win probability, while away SPI remains a strong factor. We see similar potential outliers here as well, but again they are not particularly glaring.

The generally strong positive relationship is slightly stronger here than for the home team, with a correlation coefficient of 0.69. This again indicates that SPI is a key factor in win probability for the away team as well.

Confidence Intervals

t.test(df_top_leagues$prob1)
## 
##  One Sample t-test
## 
## data:  df_top_leagues$prob1
## t = 237.12, df = 9129, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  0.4508034 0.4583188
## sample estimates:
## mean of x 
## 0.4545611

From the output, we can conclude that we are 95% confident the true average home win probability is between 0.451 and 0.458 in Europe’s top leagues. This feels expected: given the presence of draws and the spread of team strength, a home team expected to win just under 50% of the time seems reasonable.

t.test(df_top_leagues$prob2)
## 
##  One Sample t-test
## 
## data:  df_top_leagues$prob2
## t = 172.69, df = 9129, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  0.2977378 0.3045746
## sample estimates:
## mean of x 
## 0.3011562

As with actual goals, the confidence interval for away win probability has a similar width to the home team’s, but it is shifted downward. We can say that we are 95% confident the true average away win probability is between 0.298 and 0.305. Again, this feels expected given the presence of draws and home field advantage. It would be interesting to see whether these numbers match the actual rates of home and away wins.
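
One way to run that comparison is to compute the observed share of home and away wins from the final scores and set them beside the mean of `prob1` and `prob2`. The sketch below uses a small made-up data frame, `matches_toy`, that borrows the column names `score1`, `score2`, `prob1`, and `prob2`; the values are hypothetical. It drops rows without a final score first, since the t-tests above show fewer non-missing values for `score1` than for `prob1`.

```r
# Hypothetical toy matches for illustration only (not from spi_matches.csv)
matches_toy <- data.frame(
  score1 = c(2, 1, 0, 3, 1, NA),
  score2 = c(1, 1, 2, 0, 1, NA),
  prob1  = c(0.55, 0.40, 0.30, 0.70, 0.45, 0.50),
  prob2  = c(0.20, 0.30, 0.45, 0.10, 0.28, 0.25)
)

# Keep only matches that have a final score
played <- matches_toy[complete.cases(matches_toy[, c("score1", "score2")]), ]

# Observed result rates next to the mean model probabilities
comparison <- data.frame(
  outcome       = c("home win", "away win"),
  observed_rate = c(mean(played$score1 > played$score2),
                    mean(played$score1 < played$score2)),
  mean_prob     = c(mean(played$prob1), mean(played$prob2))
)
comparison
```

Replacing `matches_toy` with `df_top_leagues` would give the real comparison; a large gap between `observed_rate` and `mean_prob` would suggest the probabilities are miscalibrated on average.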

Additions: Creating New Columns (xG Difference and SPI Advantage)

df_top_leagues <- df_top_leagues |>
  mutate(
    xg_diff = score1 - xg1,
    spi_advantage = spi1 - spi2
  )

For the subsequent analysis, I will look at two newly created columns: home xG difference (actual goals minus xG) and the home team’s SPI advantage (home SPI minus away SPI). First, we will explore whether matches with higher xG have larger differences between expected and actual scoring. In other words, we want to see whether teams tend to over- or underperform their xG in higher-xG games. Then, we will check whether teams that are stronger than their opponents (higher SPI than the opposing team) tend to have a higher win probability. We will again do this from the perspective of the home team.

Home xG vs Home xG Difference

# scatterplot
ggplot(df_top_leagues, aes(x = xg1, y = xg_diff)) + 
  geom_point(color = "lightblue") +
  labs(
    title = "Expected Goals vs Difference Between Actual and Expected Goals",
    x = "Home Expected Goals",
    y = "Actual Goals - Expected Goals"
  ) +
  theme_bw()

# Correlation coefficient
cor_xg_diff <- cor(df_top_leagues$xg1,
                   df_top_leagues$xg_diff,
                   use = "complete.obs",
                   method = "pearson")

cor_xg_diff
## [1] -0.1181812

From the plot and the correlation coefficient of -0.118, we can see a weak negative correlation between home xG and the difference between actual and expected goals. A stronger negative relationship would mean that teams tend to underperform their xG more as xG increases. With a relationship this weak, the gap between actual and expected goals appears to vary mostly at random rather than systematically with xG.
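
Part of this result is mechanical. Because `xg_diff` is defined as `score1 - xg1`, its covariance with `xg1` equals `cov(xg1, score1) - var(xg1)`, so the sign of the correlation depends on whether goals rise less than one-for-one with xG (a regression slope below 1). A minimal sketch on made-up numbers checks the identity; the vectors `xg_ex` and `goals_ex` are hypothetical and not taken from `spi_matches.csv`.

```r
# Hypothetical toy values for illustration only (not from spi_matches.csv)
xg_ex    <- c(0.4, 1.1, 1.8, 2.5, 0.9, 3.2)
goals_ex <- c(0, 1, 1, 3, 2, 2)
diff_ex  <- goals_ex - xg_ex              # same construction as xg_diff

# Identity: cov(x, y - x) = cov(x, y) - var(x)
lhs <- cov(xg_ex, diff_ex)
rhs <- cov(xg_ex, goals_ex) - var(xg_ex)
all.equal(lhs, rhs)

# Slope of goals on xG; a slope below 1 goes with a negative correlation
slope <- unname(coef(lm(goals_ex ~ xg_ex))[2])
c(slope = slope, cor_with_diff = cor(xg_ex, diff_ex))
```

Under this reading, a small negative coefficient like the one above would be consistent with actual goals increasing slightly less than one-for-one with xG, though that would need to be checked on the real data.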

Confidence Interval

# Confidence interval
t.test(df_top_leagues$xg_diff)
## 
##  One Sample t-test
## 
## data:  df_top_leagues$xg_diff
## t = 0.46353, df = 9024, p-value = 0.643
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  -0.01685848  0.02730058
## sample estimates:
##   mean of x 
## 0.005221053

Our confidence interval of (-0.017, 0.027) supports xG as an estimate of scoring. We are 95% confident that the true mean difference between actual and expected goals for the home team is between -0.017 and 0.027. Since this interval includes 0, there is no evidence that xG is systematically biased toward over- or underestimating actual goals. This is consistent with the weak relationship we saw above: on average, home teams neither over- nor underperform their xG.

Home SPI difference vs Win Probability

# scatterplot
ggplot(df_top_leagues, aes(x = spi_advantage, y = prob1)) + 
  geom_point(color = "lightblue") +
  labs(
    title = "SPI Advantage Over Opponent vs Home Win Probability",
    x = "SPI Difference (Home - Away)",
    y = "Home Win Probability"
  ) +
  theme_bw()

# correlation
cor_spi_adv <- cor(df_top_leagues$spi_advantage,
                   df_top_leagues$prob1,
                   use = "complete.obs",
                   method = "pearson")

cor_spi_adv
## [1] 0.9628978

From the plot and the correlation coefficient of 0.963, we can say that there is a very strong positive relationship between home win probability and the difference in SPI between the home and away teams. The stronger the home team is relative to the away team, the higher its win probability, and vice versa. This is what we would expect, but it shows just how large a factor SPI difference is in calculating win probability. The better team would naturally be expected to win, and a correlation coefficient of nearly 1 shows that the SPI difference accounts for most of the variation in win probability.

Confidence Interval

t.test(df_top_leagues$prob1)
## 
##  One Sample t-test
## 
## data:  df_top_leagues$prob1
## t = 237.12, df = 9129, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  0.4508034 0.4583188
## sample estimates:
## mean of x 
## 0.4545611

Here, we can re-emphasize the confidence interval we calculated earlier for home win probability. We are 95% confident the true average home win probability is between 0.451 and 0.458. This is largely what we would expect with home field advantage, given that draws are a third possible result.