### Packages
library(tidyverse)
### Importing data
df <- read.csv("C:/Users/matth/OneDrive/Documents/INFO_H510/spi_matches.csv")
### Subsetting to only include the top 5 leagues
df_top_leagues <- df |>
  filter(league %in% c("Barclays Premier League", "French Ligue 1", "Italy Serie A", "Spanish Primera Division", "German Bundesliga"))

xG vs Actual Goals

ggplot(df_top_leagues, aes(x = xg1, y = as.factor(score1))) + 
  geom_point(color = "lightblue") + 
  labs(title = "Relationship Between Expected Goals and Actual Goals (Home Team)", x = "Home Expected Goals", y = "Home Actual Goals") +
  theme_bw()

### Correlation Coefficient for xG vs Actual Goals
cor_xg_score <- cor(df_top_leagues$xg1,
                    df_top_leagues$score1,
                    use = "complete.obs",
                    method = "pearson")

cor_xg_score
## [1] 0.5833942

Home xG and home actual goals follow a roughly positive trend, though I would not say the relationship is extremely strong. Other game factors (like finishing quality) play a role in how expected goals translate into actual goals. There are a couple of potential outliers, most notably the point with a home xG of ~5.9 and 0 actual goals, along with a few similar gaps between xG and actual goals, but these are not glaring outliers based on the plot.

From the output, we can see that the correlation coefficient is 0.58. This indicates a moderately strong positive correlation, which is mirrored in the scatterplot. The points don’t follow a strict line in the plot, but rather a general positive trend with some spread.
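To make the coefficient above less of a black box, here is a minimal sketch of what `cor(..., use = "complete.obs", method = "pearson")` computes: the covariance of the two variables divided by the product of their standard deviations, using only rows where both values are present. The helper name `pearson_r` and the toy vectors are hypothetical and are not taken from spi_matches.csv.

```r
# Hypothetical helper (not part of the original analysis): Pearson's r from its definition.
pearson_r <- function(x, y) {
  keep <- complete.cases(x, y)   # same rows that use = "complete.obs" retains
  x <- x[keep]
  y <- y[keep]
  cov(x, y) / (sd(x) * sd(y))    # covariance scaled by both standard deviations
}

# Toy values for illustration only; these are not real match data.
toy_xg    <- c(0.4, 1.1, 1.8, 2.5, 0.9, NA, 3.0)
toy_score <- c(0, 1, 1, 3, 2, 1, 2)

pearson_r(toy_xg, toy_score)
```

On the real data, `pearson_r(df_top_leagues$xg1, df_top_leagues$score1)` should agree with the `cor()` call used above.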

ggplot(df_top_leagues, aes(x = xg2, y = as.factor(score2))) + 
  geom_point(color = "lightblue") + 
  labs(title = "Relationship Between Expected Goals and Actual Goals (Away Team)", x = "Away Expected Goals", y = "Away Actual Goals") +
  theme_bw()

### Correlation Coefficient for xG vs Actual Goals
cor_xg_score <- cor(df_top_leagues$xg2,
                    df_top_leagues$score2,
                    use = "complete.obs",
                    method = "pearson")

cor_xg_score
## [1] 0.6041109

There is no major difference in this relationship when we switch to away xG vs actual goals. There again appears to be a moderately strong positive relationship between the two variables, with the points following this general trend. This time, though, there aren’t many noticeable potential outliers. A few points with high xG still somewhat “underperform” their xG (actual goals < xG), but they aren’t far from the rest of the points. The correlation coefficient is slightly higher at 0.60, perhaps because there are fewer potential outliers, and it again confirms the plot’s moderately strong positive relationship.

Confidence Intervals

t.test(df_top_leagues$score1)
## 
##  One Sample t-test
## 
## data:  df_top_leagues$score1
## t = 112.13, df = 9028, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  1.516598 1.570565
## sample estimates:
## mean of x 
##  1.543582

From this t-test output, we can see a 95% confidence interval for home actual goals of (1.517, 1.571). We can conclude that we are 95% confident the true average number of home goals in Europe’s top 5 leagues is between 1.517 and 1.571 goals per game.
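The interval reported by `t.test()` can be rebuilt by hand as the sample mean plus or minus a t critical value times the standard error. The sketch below shows that calculation; the helper name `mean_ci` and the toy vector are hypothetical and are used only for illustration.

```r
# Hypothetical helper (not part of the original analysis): t-based CI for a mean.
mean_ci <- function(x, conf_level = 0.95) {
  x <- x[!is.na(x)]                                   # t.test() also drops missing values
  n <- length(x)
  se <- sd(x) / sqrt(n)                               # standard error of the mean
  t_crit <- qt(1 - (1 - conf_level) / 2, df = n - 1)  # two-sided critical value
  c(lower = mean(x) - t_crit * se, upper = mean(x) + t_crit * se)
}

# Toy goal counts for illustration only; these are not values from spi_matches.csv.
toy_home_goals <- c(0, 1, 1, 2, 3, 1, 0, 2, 4, 1, NA)

mean_ci(toy_home_goals)
```

The same arithmetic can be checked against the output above: the mean is 1.5436 and t = 112.13, so the standard error is about 1.5436 / 112.13 ≈ 0.0138. With 9028 degrees of freedom the critical value is about 1.96, giving 1.5436 ± 0.0270, which matches the reported interval.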

t.test(df_top_leagues$score2)
## 
##  One Sample t-test
## 
## data:  df_top_leagues$score2
## t = 99.302, df = 9028, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  1.208687 1.257367
## sample estimates:
## mean of x 
##  1.233027

The confidence interval for away goals has a similar width, but is shifted downwards. From the output, we can conclude that we are 95% confident the true average number of away goals in Europe’s top 5 leagues is between 1.209 and 1.257 goals per game.

SPI vs Win Probability

ggplot(df_top_leagues, aes(x = spi1, y = prob1)) + 
  geom_point(color = "lightblue") + 
  labs(title = "Relationship Between SPI and Win Probability (Home Team)", x = "Home SPI", y = "Home Win Probability") +
  theme_bw()

### Correlation Coefficient for SPI vs Win Probability
cor_spi_prob <- cor(df_top_leagues$spi1,
                    df_top_leagues$prob1,
                    use = "complete.obs",
                    method = "pearson")

cor_spi_prob
## [1] 0.6603522

From the plot, we can see a generally strong relationship between home SPI and home win probability. This is what we would expect: the stronger a team is, the higher its win probability should be. Of course, other factors also play a part, such as opponent strength, form, and injuries. These and other confounding variables weaken the relationship a bit, but it is still generally strong. The only potential outliers appear among the weakest SPI points, where win probability falls between about 0.3 and 0.55, which is higher than the trend would suggest. However, I would not say they are glaring outliers.

The correlation coefficient between home SPI and home win probability is 0.66, which again confirms this generally strong positive relationship. This indicates that SPI, and thus team strength, is a major factor in determining win probability, as we would expect.

ggplot(df_top_leagues, aes(x = spi2, y = prob2)) + 
  geom_point(color = "lightblue") + 
  labs(title = "Relationship Between SPI and Win Probability (Away Team)", x = "Away SPI", y = "Away Win Probability") +
  theme_bw()

### Correlation Coefficient for SPI vs Win Probability
cor_spi_prob <- cor(df_top_leagues$spi2,
                    df_top_leagues$prob2,
                    use = "complete.obs",
                    method = "pearson")

cor_spi_prob
## [1] 0.6897497

We see the same relationship between away SPI and win probability as we did for the home teams, but the win probability trend appears to have shifted down overall. This suggests that home field advantage is built into win probability, while away SPI is still a strong factor in determining it. We see similar potential outliers here as well, but again they are not particularly glaring.

The generally strong positive relationship is slightly stronger here than for the home team, with a correlation coefficient of 0.69. This again indicates that SPI is a key factor in win probability for the away team as well.

Confidence Intervals

t.test(df_top_leagues$prob1)
## 
##  One Sample t-test
## 
## data:  df_top_leagues$prob1
## t = 237.12, df = 9129, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  0.4508034 0.4583188
## sample estimates:
## mean of x 
## 0.4545611

From the output, we can conclude that we are 95% confident the true average home win probability is between 0.451 and 0.458 in Europe’s top leagues. This feels expected: with the presence of draws and the spread of team strength, the home team being expected to win just under 50% of the time seems plausible.

t.test(df_top_leagues$prob2)
## 
##  One Sample t-test
## 
## data:  df_top_leagues$prob2
## t = 172.69, df = 9129, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  0.2977378 0.3045746
## sample estimates:
## mean of x 
## 0.3011562

As with actual goals, the confidence interval for away win probability has a similar width to the home interval, but it is shifted downward. We can say that we are 95% confident the true average away win probability is between 0.298 and 0.305. Again, this feels expected given the presence of draws and home field advantage. It would be interesting to see whether these numbers match the actual rates of home and away wins.
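One way to follow up on that last question is to compute the observed share of home wins, draws, and away wins directly from the scores. The sketch below is only an illustration: the helper name `outcome_rates` and the toy data frame are hypothetical, and it assumes that fixtures without a recorded score should be skipped.

```r
# Hypothetical helper (not part of the original analysis):
# observed share of home wins, draws, and away wins.
outcome_rates <- function(home_goals, away_goals) {
  played <- !is.na(home_goals) & !is.na(away_goals)  # assumption: skip matches with no score
  h <- home_goals[played]
  a <- away_goals[played]
  c(home_win = mean(h > a), draw = mean(h == a), away_win = mean(h < a))
}

# Toy matches for illustration only; these are not rows from spi_matches.csv.
toy_matches <- data.frame(
  score1 = c(2, 1, 0, 3, 1, NA),
  score2 = c(1, 1, 2, 0, 1, NA)
)

rates <- outcome_rates(toy_matches$score1, toy_matches$score2)
rates
```

On the real data, `outcome_rates(df_top_leagues$score1, df_top_leagues$score2)` could be compared with the average probabilities of about 0.455 and 0.301 reported above. Note that the t-test outputs show 9,029 non-missing scores but 9,130 non-missing probabilities, so some matches appear to have a forecast but no recorded score; a fair comparison would probably restrict the probabilities to the same played matches.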