### Packages
library(tidyverse)
### Importing data
df <- read.csv("C:/Users/matth/OneDrive/Documents/INFO_H510/spi_matches.csv")
### Subsetting to only include the top 5 leagues
df_top_leagues <- df |>
  filter(league %in% c(
    "Barclays Premier League", "French Ligue 1", "Italy Serie A",
    "Spanish Primera Division", "German Bundesliga"
  ))

Response and Other Variable Selection

Goals, pun intended, are the ultimate goal of a soccer match: you can’t win if you can’t score. I have therefore chosen goals (score1 for the home team and score2 for the away team) as my response variables. Many factors could influence the number of goals a team scores in a match, but intuitively the most influential is the quality of the chances the team creates. I will therefore fit simple linear regression models of score1 on xg1 (home expected goals) and of score2 on xg2 (away expected goals).

The dataset offers few categorical variables, so the most natural grouping is the league. My ANOVA will therefore test for differences in mean goals, home and away separately, across the five leagues in the subset.

ANOVA: Mean Home/Away Goals

For this ANOVA, as mentioned, we look for differences in mean goals scored, first for home teams and then for away teams, across the five leagues: the Premier League, Ligue 1, the Bundesliga, Serie A, and the Primera Division.

Home Goals

We will build our analysis using the following hypotheses:

\(H_0:\) Mean home goals are the same across all leagues

\(H_A:\) At least one league has a different mean home goal value

# ANOVA
anova_model1 <- aov(score1 ~ league, data = df_top_leagues)

summary(anova_model1)
##               Df Sum Sq Mean Sq F value   Pr(>F)    
## league         4     41  10.177   5.961 8.69e-05 ***
## Residuals   9024  15405   1.707                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 101 observations deleted due to missingness
ggplot(df_top_leagues, aes(x = league, y = score1)) +
  geom_boxplot(fill = "lightblue") +
  labs(
    title = "Home Goals by League",
    x = "League",
    y = "Home Goals"
  ) +
  theme_bw()

This ANOVA produced a p-value of 0.0000869, so we reject the null hypothesis that mean home goals are the same across the five leagues. The test does not tell us which league(s) differ, only that at least one does. This could reflect differences in play style: some leagues may play a slower, more defensive game, while others may be more open and fast-paced. The F-statistic of 5.961 is the ratio of the between-league mean square (10.177) to the residual mean square (1.707), meaning the variation among league means is about 6 times what we would expect from match-to-match variation alone. These results suggest that the league a match is played in explains some of the variation in home scoring.

Interestingly, though, the boxplots look very similar across leagues. Goals are small whole numbers, so the medians and quartiles tend to fall on the same values in every league, and modest differences in the means are hard to see. With roughly 9,000 matches in the sample, even small differences in means can be statistically significant.
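One way to square a significant F-test with near-identical boxplots is to compute an effect size. The sketch below computes eta-squared (the share of total variance attributed to league) from the sums of squares printed in the home-goals ANOVA table. Those printed values are rounded, so the result is approximate, and the variable names are illustrative rather than part of the original analysis.

```r
# Effect size (eta-squared) for the home-goals ANOVA.
# Inputs are copied from the printed summary(anova_model1) table, where
# they are rounded, so treat the result as approximate.
ss_league    <- 41      # "league" Sum Sq
ss_residuals <- 15405   # "Residuals" Sum Sq

# Share of total variation in home goals attributed to league
eta_sq <- ss_league / (ss_league + ss_residuals)

# Expressed as a percentage of total variance
round(100 * eta_sq, 2)
```

With these inputs the share comes out to roughly 0.3%, which is consistent with differences that are real but too small to show up in a boxplot.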

Away Goals

We base this analysis on the following hypotheses:

\(H_0:\) Mean away goals are the same across all leagues

\(H_A:\) At least one league has a different mean away goal value

# ANOVA
anova_model2 <- aov(score2 ~ league, data = df_top_leagues)

summary(anova_model2)
##               Df Sum Sq Mean Sq F value   Pr(>F)    
## league         4     54  13.500   9.735 7.44e-08 ***
## Residuals   9024  12514   1.387                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 101 observations deleted due to missingness
ggplot(df_top_leagues, aes(x = league, y = score2)) +
  geom_boxplot(fill = "lightblue") +
  labs(
    title = "Away Goals by League",
    x = "League",
    y = "Away Goals"
  ) +
  theme_bw()

As with home goals, we get a very low p-value of 0.0000000744, so we reject the null hypothesis that mean away goals are the same across leagues. Again, we cannot conclude which leagues differ, only that at least one does. League play style may be a factor here as well. The F-statistic of 9.735 means that the variation among league means is about 9.7 times what we would expect from match-to-match variation alone. Thus, league-level factors may influence away scoring patterns.
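Because the F-test only says that at least one league differs, a possible follow-up is a post-hoc pairwise comparison. The sketch below wraps base R's `aov()` and `TukeyHSD()` in a hypothetical helper, `pairwise_league_diffs()`, which is not part of the original analysis. It assumes the grouping column can be converted to a factor and relies on `aov()` dropping rows with missing scores by default.

```r
# Hypothetical helper: Tukey HSD pairwise comparisons for one response.
# `data`, `response`, and `group` are illustrative argument names.
pairwise_league_diffs <- function(data, response, group = "league") {
  # TukeyHSD() needs a factor, and read.csv() may leave league as character
  data[[group]] <- factor(data[[group]])

  # Build the formula response ~ group and fit the one-way ANOVA
  fit <- aov(reformulate(group, response = response), data = data)

  # Matrix with one row per pair of groups: diff, lwr, upr, p adj
  TukeyHSD(fit, which = group)[[group]]
}

# Possible usage with the data frame from this report (not run here):
# pairwise_league_diffs(df_top_leagues, "score1")
# pairwise_league_diffs(df_top_leagues, "score2")
```

With five leagues this returns ten pairwise comparisons, each with an adjusted p-value and a confidence interval for the difference in mean goals.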

Regression: xG vs Goals Scored

Here, we model actual goals as a function of expected goals (xG), with xG as the predictor.

Home Goals

Here, we will fit the following model:

\(score1_i = \beta_0 + \beta_1(xG1)_i + \varepsilon_i\)

# Regression
m1 <- lm(score1 ~ xg1, data = df_top_leagues)
summary(m1)
## 
## Call:
## lm(formula = score1 ~ xg1, data = df_top_leagues)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.2767 -0.7386 -0.1075  0.6229  5.0604 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.22389    0.02234   10.02   <2e-16 ***
## xg1          0.85786    0.01257   68.23   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.063 on 9023 degrees of freedom
##   (105 observations deleted due to missingness)
## Multiple R-squared:  0.3403, Adjusted R-squared:  0.3403 
## F-statistic:  4655 on 1 and 9023 DF,  p-value: < 2.2e-16

We get the fitted model of:

\(\widehat{score1}_i = 0.224 + 0.858 \cdot (xG1)_i\)

This means that for each additional 1 xG, we expect the home team to score 0.858 more actual goals. The large t-value (68.23) shows a strong association between xG and scoring, and a slope reasonably close to 1 suggests that xG is roughly on the same scale as actual goals (ideally, 1 xG = 1 actual goal).

The \(R^2\) value of 0.34 means that roughly 34% of the variance in actual home goals is explained by expected goals. This seems reasonable, given how much randomness there is in goal scoring in a single match.
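The default regression output tests whether the slope differs from 0, but the more interesting benchmark here is a slope of 1. As a rough sketch, the t-statistic for that comparison can be computed from the estimate and standard error printed in `summary(m1)`. The variable names below are illustrative, and the inputs are the rounded values from the printed output.

```r
# Values copied from the printed summary(m1) output
slope    <- 0.85786   # xg1 estimate
se_slope <- 0.01257   # xg1 standard error
df_resid <- 9023      # residual degrees of freedom

# t-statistic for H0: slope = 1 (instead of the default H0: slope = 0)
t_vs_one <- (slope - 1) / se_slope

# Two-sided p-value from the t distribution
p_vs_one <- 2 * pt(-abs(t_vs_one), df = df_resid)

round(t_vs_one, 2)
```

With these inputs the slope sits about 11 standard errors below 1, so although xG tracks scoring closely, the fitted line is somewhat flatter than a one-for-one relationship.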

Away Goals

Here, we will fit the following model:

\(score2_i = \beta_0 + \beta_1(xG2)_i + \varepsilon_i\)

# Regression
m2 <- lm(score2 ~ xg2, data = df_top_leagues)
summary(m2)
## 
## Call:
## lm(formula = score2 ~ xg2, data = df_top_leagues)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4001 -0.6244 -0.1341  0.5383  5.2653 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.10572    0.01852   5.709 1.17e-08 ***
## xg2          0.91004    0.01264  72.009  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9401 on 9023 degrees of freedom
##   (105 observations deleted due to missingness)
## Multiple R-squared:  0.3649, Adjusted R-squared:  0.3649 
## F-statistic:  5185 on 1 and 9023 DF,  p-value: < 2.2e-16

Here, we calculate a fitted model of:

\(\widehat{score2}_i = 0.106 + 0.910 \cdot (xG2)_i\)

This implies that for each additional 1 xG, we expect the away team to score 0.91 more actual goals. Again, xG is strongly associated with actual goals (t = 72.01). The \(R^2\) of 0.365 means that around 36.5% of the variance in actual away goals is explained by expected goals. Once again, given how noisy scoring is in any single match, this seems fair.

Overall, expected goals appear to be a meaningful predictor of actual goals for both home and away teams. However, most of the variance in actual goals remains unexplained, which likely reflects finishing skill, goalkeeping performance, and chance, among other factors.
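To make the two fitted lines concrete, the sketch below plugs an illustrative xG value into each model using the coefficients printed above. The helper `predict_goals()` and the choice of 1.5 xG are assumptions for illustration only; with the fitted objects in the session, `predict(m1, newdata = data.frame(xg1 = 1.5))` would be the usual route.

```r
# Hypothetical helper: predicted goals from a fitted intercept and slope
predict_goals <- function(xg, intercept, slope) {
  intercept + slope * xg
}

xg_example <- 1.5  # illustrative xG value, not taken from the data

# Coefficients copied from summary(m1) and summary(m2)
home_pred <- predict_goals(xg_example, intercept = 0.22389, slope = 0.85786)
away_pred <- predict_goals(xg_example, intercept = 0.10572, slope = 0.91004)

round(c(home = home_pred, away = away_pred), 2)
```

At 1.5 xG, both models predict roughly 1.5 goals (about 1.51 at home and 1.47 away), which illustrates how closely the fitted lines follow a one-for-one relationship in the typical range of xG.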