R Markdown

library(tidyverse)
library(tsibble)
### Importing data
df <- read.csv("C:/Users/matth/OneDrive/Documents/INFO_H510/spi_matches.csv")
### Subsetting to only include the top 5 leagues
df_top_leagues <- df |>
  filter(league %in% c("Barclays Premier League", "French Ligue 1", "Italy Serie A", "Spanish Primera Division", "German Bundesliga"))

Data Preparation

We will be looking at how xG varies over time and throughout each season. To do this, we will first make sure our date variable is actually a Date datatype in R. Then, since each row corresponds to a single game and includes columns for the xG of each of the 2 teams, we will combine xg1 and xg2 into a total_xg variable to track the total xG across the game. Finally, we will aggregate the mean xG over each day, since there will be duplicate dates in the dataset as multiple matches occur on each day. What this will do is allow us to is smooth out match-level noise and track to see if certain time periods produce more goals in general.

# Convert date
df_top_leagues$date <- as.Date(df_top_leagues$date)

# Create combined xG
df_top_leagues$total_xg <- df_top_leagues$xg1 + df_top_leagues$xg2

# Aggregate to daily level
daily_xg <- df_top_leagues %>%
  group_by(date) %>%
  summarise(avg_xg = mean(total_xg, na.rm = TRUE))
ts_data <- daily_xg %>%
  as_tsibble(index = date)

Initial Plot

ggplot(ts_data, aes(x = date, y = avg_xg)) +
  geom_line(alpha = 0.6) +
  labs(
    title = "Average Combined xG Over Time",
    x = "Date",
    y = "Average xG per Match"
  )

One of the first big things we notice is the gap in 2020 where the COVID-19 Pandemic shut down the top 5 leagues in Europe. This is the reason for the complete gap in data during that time period. Also, we tend to see bigger peaks each year around December/January and right near the end of the season (May). In December/January, teams often play a much more congested fixture list, which can often lead to fatigue particularly for defenders and midfielders who tend to play longer portions of each match without being substituted. Forwards get rotated more, so the difference in fatigue during thise time period may mean that teams tend to score more or give up more goals (or at least more goal scoring chances). For the end of the season in May, some teams have very little to play for while others are fighting for titles, European places, or to avoid relegation and thus have a lot more on the line. Those with less to play for may play weaker lineups to give younger players some minutes or just generally have a more lax mindset on the game, potentially leading to the increase in average xG in those games.

We do notice a slight upward trend in xG as the years roll on, but this appears to be a very small increase. We will explore this trend next.

Trend Analysis

trend_model <- lm(avg_xg ~ date, data = ts_data)
summary(trend_model)
## 
## Call:
## lm(formula = avg_xg ~ date, data = ts_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.07643 -0.29679  0.00413  0.27384  2.12278 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.056e+00  5.877e-01  -1.797   0.0727 .  
## date         2.121e-04  3.277e-05   6.472 1.57e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5277 on 921 degrees of freedom
##   (15 observations deleted due to missingness)
## Multiple R-squared:  0.0435, Adjusted R-squared:  0.04246 
## F-statistic: 41.89 on 1 and 921 DF,  p-value: 1.571e-10

As we saw in the plot, there is a small positive slope between average xG and date (0.00021). While this does show statistical significance with a low p-value, our \(R^2\) value is very low (0.04). This means that only 4% of the variability in xG is accounted for by the model with date. Thus, date alone is not a strong factor in explaining changes in xG. This relationship is made clearer on the plot below:

ggplot(ts_data, aes(x = date, y = avg_xg)) +
  geom_line(alpha = 0.3) +
  geom_smooth(se = FALSE, color = "blue") +
  labs(title = "Smoothed Trend of Average xG Over Time")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 15 rows containing non-finite outside the scale range
## (`stat_smooth()`).

Seasonality

ts_clean <- ts_data %>%
  filter(!is.na(avg_xg))

acf(ts_clean$avg_xg, main = "ACF of Average xG")

pacf(ts_clean$avg_xg, main = "PACF of Average xG")

Most lags seem to stay within the dashed lines, outside of lags 4, 8, 11, and 17 in the PACF plot (plus a couple more in the ACF plot), but none rise highly above the bounds. This indicates that there is no strong repeating pattern in average xG over time, which mirrors what we saw in our linear model above. There may be some small short-term dependencies (like with the indicated lags above) but these appear inconsistent and are not generally evenly spaced. Thus, we don’t seem to have any true seasonality. There is no clear autoregressive structure, so it does not appear that average xG over 1 day is affected too much by previous days or the date in general.

We do see a slight oscillating pattern in the plots, but this is very weak and the lack of regular spacing + small magnitude of correlations in the plots show that any seasonality in the data is minimal. Overall, there are only some minor short term dependencies and no real strong, consistent seasonal structure.