### Packages
library(tidyverse)
### Importing data
df <- read.csv("C:/Users/matth/OneDrive/Documents/INFO_H510/spi_matches.csv")
### Subsetting to only include the top 5 leagues
df_top_leagues <- df |>
  filter(league %in% c(
    "Barclays Premier League", "French Ligue 1", "Italy Serie A",
    "Spanish Primera Division", "German Bundesliga"
  ))
Goals, pun intended, are the ultimate goal of a soccer match. You
can’t win if you can’t score. Thus, I have chosen goals
(score1 and score2) as my primary response
variables. Many factors could influence the number of goals a team scores
in a match, but intuitively the most influential factor is the quality
of chances the team creates. Thus, I will later build simple
linear regression models that examine the relationship between
score1 and xg1 (home expected goals) and between
score2 and xg2 (away expected goals).
This dataset offers few categorical variables, so the most natural one to use is league. My ANOVAs will therefore test whether mean goals, first home and then away, differ across the 5 leagues in the subset.
For the first ANOVA, we look for differences in mean home goals across the 5 leagues: the Premier League, Ligue 1, the Bundesliga, Serie A, and the Primera Division.
We will build our analysis using the following hypotheses:
\(H_0:\) Mean home goals are the same across all leagues
\(H_A:\) At least one league has a different mean home goal value
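Before running the test, it can help to look at the per-league means that these hypotheses compare. The sketch below uses a small simulated data frame, `sim_matches`, as a stand-in for `df_top_leagues`; the league labels and Poisson rates are made up for illustration and are not taken from the real data. Base R is used so the sketch runs without extra packages.

```r
# Simulated stand-in for df_top_leagues; values are illustrative only
set.seed(42)
sim_matches <- data.frame(
  league = rep(c("League A", "League B", "League C"), each = 100),
  score1 = rpois(300, lambda = rep(c(1.4, 1.5, 1.7), each = 100))
)

# Mean home goals per league (rows with missing scores are dropped by default)
league_means <- aggregate(score1 ~ league, data = sim_matches, FUN = mean)

# Number of matches per league
league_counts <- table(sim_matches$league)

league_means
league_counts
```

Running the same `aggregate()` call on `df_top_leagues` should give one mean per league, which are the quantities the ANOVA compares.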
# ANOVA
anova_model1 <- aov(score1 ~ league, data = df_top_leagues)
summary(anova_model1)
##               Df Sum Sq Mean Sq F value   Pr(>F)
## league         4     41  10.177   5.961 8.69e-05 ***
## Residuals   9024  15405   1.707
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 101 observations deleted due to missingness
ggplot(df_top_leagues, aes(x = league, y = score1)) +
  geom_boxplot(fill = "lightblue") +
  labs(
    title = "Home Goals by League",
    x = "League",
    y = "Home Goals"
  ) +
  theme_bw()
This ANOVA produced a p-value of 8.69e-05 (about 0.000087). Given this, we can reject the null hypothesis that mean home goals are the same across the 5 leagues. The test does not tell us which league(s) differ, only that at least one does. This could reflect differences in play style: some leagues may play a slower, more defensive game in general, while others may be more open and fast-paced. Our F-statistic of 5.961 is the ratio of the between-league mean square to the within-league mean square, so the variation between league means is around 6 times larger than we would expect from random match-to-match differences alone. These results suggest that the league a match is played in plays a part in explaining differences in home scoring.
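As a quick check on how the F-statistic and p-value relate to the table, the sketch below recomputes both from the printed mean squares and degrees of freedom. The inputs are the rounded values from the output above, so the results should only match approximately.

```r
# Values copied from the printed ANOVA table for home goals
ms_league <- 10.177   # mean square for league (between groups)
ms_resid  <- 1.707    # residual mean square (within groups)
df_league <- 4
df_resid  <- 9024

# F is the ratio of the two mean squares
f_home <- ms_league / ms_resid

# The upper-tail probability of the F distribution gives the p-value
p_home <- pf(f_home, df_league, df_resid, lower.tail = FALSE)

c(F = f_home, p = p_home)
```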
Interestingly, though, the boxplots for the 5 leagues look very similar. The differences between league means are probably small, and because goals are whole-number counts, the quartiles shown in a boxplot can land on the same values in every league even when the means differ slightly.
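One way to put a number on "small" is eta squared, the share of the total sum of squares attributed to league. The sketch below computes it from the rounded sums of squares in the printed table. It also defines `eta_squared()`, a hypothetical helper (not part of the original analysis) that does the same for any one-way `aov()` object, demonstrated on R's built-in `PlantGrowth` data.

```r
# Rounded sums of squares from the printed ANOVA table for home goals
ss_league <- 41
ss_resid  <- 15405

# Eta squared: proportion of total variation attributed to league
eta_sq_home <- ss_league / (ss_league + ss_resid)
round(eta_sq_home, 4)

# Hypothetical helper: eta squared for the first term of a one-way aov() fit
eta_squared <- function(model) {
  ss <- summary(model)[[1]][["Sum Sq"]]  # term SS followed by residual SS
  ss[1] / sum(ss)
}

# Demonstration on a built-in dataset, since the match data is not bundled here
demo_fit <- aov(weight ~ group, data = PlantGrowth)
eta_squared(demo_fit)
```

With these rounded inputs, league accounts for well under 1% of the variation in home goals, which is consistent with the near-identical boxplots. Calling `eta_squared(anova_model1)` should give the unrounded figure.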
For away goals, we build the analysis on these hypotheses:
\(H_0:\) Mean away goals are the same across all leagues
\(H_A:\) At least one league has a different mean away goal value
# ANOVA
anova_model2 <- aov(score2 ~ league, data = df_top_leagues)
summary(anova_model2)
##               Df Sum Sq Mean Sq F value   Pr(>F)
## league         4     54  13.500   9.735 7.44e-08 ***
## Residuals   9024  12514   1.387
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 101 observations deleted due to missingness
ggplot(df_top_leagues, aes(x = league, y = score2)) +
  geom_boxplot(fill = "lightblue") +
  labs(
    title = "Away Goals by League",
    x = "League",
    y = "Away Goals"
  ) +
  theme_bw()
As with home goals, we have a very low p-value of 7.44e-08 (about 0.000000074). Thus, we can reject the null hypothesis that mean away goals are the same across leagues. Again, the test does not tell us which leagues differ, only that at least one does. League play style may be a factor here as well. The F-statistic of 9.735 means that the variation between league means is around 9.7 times larger than we would expect from random match-to-match differences alone. Thus, league-level factors may influence away scoring patterns too.
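Since the ANOVA alone cannot say which leagues differ, a common follow-up is a post-hoc pairwise comparison such as Tukey's HSD. This is a sketch of that technique, not something the original analysis ran. It uses simulated data (`sim`, with made-up league labels and Poisson rates) so that it runs on its own.

```r
# Simulated stand-in data (NOT the real spi_matches data)
set.seed(1)
sim <- data.frame(
  league = factor(rep(c("A", "B", "C"), each = 200)),
  score2 = rpois(600, lambda = rep(c(1.1, 1.2, 1.5), each = 200))
)

# One-way ANOVA, then Tukey's honest significant differences
sim_model <- aov(score2 ~ league, data = sim)
tukey <- TukeyHSD(sim_model, "league", conf.level = 0.95)

# One row per pair of leagues: difference in means, CI bounds, adjusted p-value
tukey$league
```

Applied to the models above, `TukeyHSD(anova_model1)` or `TukeyHSD(anova_model2)` would return 10 pairwise comparisons, one for each pair among the 5 leagues.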
Next, we model actual goals as a function of expected goals (xG), with xG as the predictor.
For home goals, we fit the following model:
\(score1_i = \beta_0 + \beta_1 (xG1)_i + \varepsilon_i\)
# Regression
m1 <- lm(score1 ~ xg1, data = df_top_leagues)
summary(m1)
##
## Call:
## lm(formula = score1 ~ xg1, data = df_top_leagues)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -5.2767 -0.7386 -0.1075  0.6229  5.0604
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.22389    0.02234   10.02   <2e-16 ***
## xg1          0.85786    0.01257   68.23   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.063 on 9023 degrees of freedom
## (105 observations deleted due to missingness)
## Multiple R-squared: 0.3403, Adjusted R-squared: 0.3403
## F-statistic: 4655 on 1 and 9023 DF, p-value: < 2.2e-16
The fitted model is:
\(\widehat{score1}_i = 0.224 + 0.858 \times (xG1)_i\)
This means that for each additional 1 xG, we expect the home team to score about 0.858 more actual goals. xG is a highly significant predictor of scoring (t = 68.23), and the slope is fairly close to 1, the value we would see if 1 xG translated into exactly 1 actual goal on average.
The model has an \(R^2\) of 0.34, which means that roughly 34% of the variance in actual home goals is explained by expected goals. This seems reasonable given how noisy goal scoring is in a single match.
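To make "close to 1" more concrete, one option is to test the slope against 1 rather than against the default of 0. The sketch below does this by hand from the printed estimate and standard error; it is an illustrative check under the usual linear-model assumptions, not part of the original analysis.

```r
# Estimates copied from the summary(m1) output
slope    <- 0.85786
se_slope <- 0.01257
df_resid <- 9023

# t statistic for H0: slope = 1 (instead of the default H0: slope = 0)
t_vs_one <- (slope - 1) / se_slope

# Two-sided p-value from the t distribution
p_vs_one <- 2 * pt(-abs(t_vs_one), df = df_resid)

# 95% confidence interval for the slope
ci_slope <- slope + c(-1, 1) * qt(0.975, df = df_resid) * se_slope

t_vs_one
p_vs_one
ci_slope
```

With these inputs the interval lies entirely below 1, so the slope is close to, but statistically distinguishable from, a one-to-one relationship. The same check can be repeated with the `m2` estimates.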
For away goals, we fit the analogous model:
\(score2_i = \beta_0 + \beta_1 (xG2)_i + \varepsilon_i\)
# Regression
m2 <- lm(score2 ~ xg2, data = df_top_leagues)
summary(m2)
##
## Call:
## lm(formula = score2 ~ xg2, data = df_top_leagues)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.4001 -0.6244 -0.1341  0.5383  5.2653
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.10572    0.01852   5.709 1.17e-08 ***
## xg2          0.91004    0.01264  72.009  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9401 on 9023 degrees of freedom
## (105 observations deleted due to missingness)
## Multiple R-squared: 0.3649, Adjusted R-squared: 0.3649
## F-statistic: 5185 on 1 and 9023 DF, p-value: < 2.2e-16
The fitted model is:
\(\widehat{score2}_i = 0.106 + 0.910 \times (xG2)_i\)
This implies that for each additional 1 xG, we expect the away team to score about 0.91 more actual goals. Again, xG is a highly significant predictor of actual goals. The \(R^2\) of 0.365 means around 36.5% of the variance in actual away goals is explained by expected goals. Once again, given how noisy goal scoring is in a single match, this seems fair.
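To see what the two fitted lines imply in practice, the sketch below plugs a few xG values into each equation. The coefficients are copied from the outputs above; `predict_goals()` and the chosen xG values are hypothetical and only for illustration.

```r
# Coefficients copied from summary(m1) and summary(m2)
coef_home <- c(intercept = 0.22389, slope = 0.85786)
coef_away <- c(intercept = 0.10572, slope = 0.91004)

# Hypothetical helper: predicted goals for a vector of xG values
predict_goals <- function(xg, coefs) {
  unname(coefs["intercept"] + coefs["slope"] * xg)
}

xg_values <- c(0.5, 1.0, 2.0)  # illustrative xG values, not from the data
predictions <- data.frame(
  xg        = xg_values,
  home_pred = predict_goals(xg_values, coef_home),
  away_pred = predict_goals(xg_values, coef_away)
)
predictions
```

With the model objects available, `predict(m1, newdata = data.frame(xg1 = xg_values))` should give the same home predictions without copying coefficients by hand.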
Overall, expected goals appear to be a good, meaningful predictor of actual goals. However, a lot of unexplained variation remains, likely reflecting finishing skill, goalkeeping performance, and random factors, among other external forces.