### Packages
library(tidyverse)

### Importing data
df <- read.csv("C:/Users/matth/OneDrive/Documents/INFO_H510/spi_matches.csv")
### Subsetting to only include the top 5 leagues
df_top_leagues <- df |>
  filter(league %in% c("Barclays Premier League", "French Ligue 1", "Italy Serie A", "Spanish Primera Division", "German Bundesliga"))

The documentation for this dataset comes from two sources. From “jayb” on GitHub (https://github.com/fivethirtyeight/data/tree/master/soccer-spi), we can get the official data dictionary, which lists each variable name and what it means. However, the explanations of how certain variables are collected and calculated are provided in a FiveThirtyEight article about the methodology (https://fivethirtyeight.com/methodology/how-our-club-soccer-predictions-work/). The following questions are answered using information from these two sites.

Question 1: Originally Unclear Columns

Originally, I was unsure about the spi1 and spi2 columns. I knew that they related to the ESPN Soccer Power Index ratings for each team, but it wasn’t clear how these were calculated or what exactly they meant. The FiveThirtyEight article clarifies this. The rating is calculated from an offensive rating (how many goals the team would be expected to score against an average team on a neutral field) and a defensive rating (the same circumstance, but for expected goals conceded). Together, these represent the percentage of available points (3 for a win, 1 for a tie, 0 for a loss) the team would be expected to take if that match were played over and over again. The preseason ratings are calculated from the previous end-of-season SPI rating (67% of the rating) and a rating implied by the team’s market value, which serves as a proxy for team strength (33%). The rating for each team is then updated after each match based on how the team performed compared to how the model predicted it would perform. I think this is a smart approach to modeling team strength and performance throughout a season. While the result is obviously important, a team can play very poorly in a few games and still get results (or play very well and not get results), so performance relative to expectation can be a better measure of a team’s strength and expected performance than recent results alone. While I could still have used the variable without the exact definition (as I had a general understanding), the definition adds very important context to how we evaluate these ratings.
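To make the preseason weighting concrete, here is a minimal sketch of the 67/33 blend described above. The function name `preseason_spi`, its arguments, and the example ratings are all hypothetical and invented for illustration; only the weights come from the article, and the real model may combine the two inputs differently.

```r
# Hypothetical helper: blend last season's final SPI with a
# market-value-implied rating. The 0.67 weight is taken from the
# FiveThirtyEight description; everything else is an assumption.
preseason_spi <- function(prev_spi, market_spi, w_prev = 0.67) {
  stopifnot(w_prev >= 0, w_prev <= 1)
  w_prev * prev_spi + (1 - w_prev) * market_spi
}

# Invented example: a team that finished last season at 80,
# but whose squad value implies a rating of 70
example_rating <- preseason_spi(prev_spi = 80, market_spi = 70)
example_rating
```

Because the previous rating carries about two thirds of the weight, the blended value stays closer to 80 than to 70.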

Another set of variables I was unsure about were the match importance variables importance1 and importance2. As with SPI, I had a general understanding of what these implied, since certain matches are more important to one team than the other based on season context, but I was not sure how they were calculated. The FiveThirtyEight article offers a full section on the match importance variables. To calculate importance, they estimate the probabilities of a team winning the league, being promoted or relegated, and qualifying for the Champions League, first assuming a win and then assuming a loss. Then, they take the difference in probabilities between the two scenarios, scale it to a 0-100 scale, and use the factor (promotion, title, relegation, etc.) with the greatest difference as the match importance. Again, while I could have used this without the explicit definition, the definition provides proper context for the variable. Match importance is a useful way of adding context to team performances, as performances may vary when the stakes are higher or lower.
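The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not FiveThirtyEight’s actual code: the function `match_importance`, the outcome names, and the probabilities are hypothetical, and multiplying the probability swing by 100 is only one simple reading of “scale to a 0-100 scale”.

```r
# Hypothetical sketch of the importance idea.
# p_win / p_loss: named vectors of probabilities (0 to 1) of each
# season-level outcome if the team wins or loses this match.
match_importance <- function(p_win, p_loss) {
  stopifnot(identical(names(p_win), names(p_loss)))
  swing <- abs(p_win - p_loss) * 100  # assumed scaling to 0-100
  list(
    importance = max(swing),                 # largest swing across outcomes
    driver     = names(swing)[which.max(swing)]  # outcome producing it
  )
}

# Invented probabilities for a team chasing a Champions League place
p_win  <- c(title = 0.02, champions_league = 0.55, relegation = 0.00)
p_loss <- c(title = 0.01, champions_league = 0.30, relegation = 0.00)
res <- match_importance(p_win, p_loss)
res
```

In this made-up case the Champions League race drives the importance, because that is where a win versus a loss changes the team’s chances the most.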

Question 2: Still Unclear

The one set of variables that is still a little unclear to me is the adjusted score variables, adj_score1 and adj_score2. The article explains the context of the variables, but doesn’t do a great job of explaining the calculation and how the goals are actually “adjusted” for this metric. Basically, goals scored when a team has a man advantage or when it is already winning late in a match have a reduced value, and other goals have an increased value so that the adjusted total roughly adds up to the team’s total goals over the season. While this tells us which goals are adjusted in which direction, it is still unclear how much each individual goal is adjusted, which can matter in a match-level analysis or with a smaller sample of matches.

Question 3: Visualizations for Q2

Scatterplot of Adjusted vs Actual score

### Scatterplot of actual vs adjusted goals
ggplot(df_top_leagues, aes(x = score1, y = adj_score1)) +
  geom_point(alpha = 0.4) +
  # Reference line
  geom_abline(slope = 1, intercept = 0, color = "red", linewidth = 1) +
  labs(
    title = "Actual Goals vs Adjusted Goals (Team 1)",
    x = "Actual Goals",
    y = "Adjusted Goals"
  ) +
  theme_minimal()

From the plot, it appears that most matches, particularly those with more actual goals, fall below the red line, meaning the goals are downweighted. Thus, overall, adjusted goals are typically lower than actual goals, or at least the magnitude of downweighting is often greater than the magnitude of upweighting. The other interesting thing to note is that there is a single point where the team did not score an actual goal, but had around 2 adjusted goals.

Without a clear explanation of the individual-level adjustments and their magnitudes, we have to be cautious about whether unintended bias is introduced in a match-level analysis. While these differences may level out over a season, we must be very careful in smaller-scale analyses to avoid these potential biases.
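One way to put numbers on what the scatterplot suggests is to summarize the direction and size of the adjustments directly. The sketch below is a minimal example; the helper `adjustment_summary` and the toy data are hypothetical, and in practice the function could be pointed at df_top_leagues instead.

```r
# Summarize which direction the adjustments go and how large they are.
# `matches` is any data frame with score1 and adj_score1 columns.
adjustment_summary <- function(matches) {
  diff <- matches$adj_score1 - matches$score1
  diff <- diff[!is.na(diff)]
  data.frame(
    share_down = mean(diff < 0),        # share of matches adjusted downward
    share_up   = mean(diff > 0),        # share of matches adjusted upward
    mean_down  = mean(diff[diff < 0]),  # average size of downward adjustments
    mean_up    = mean(diff[diff > 0])   # average size of upward adjustments
  )
}

# Toy data, invented for illustration only
toy <- data.frame(
  score1     = c(0, 1, 2, 3, 4),
  adj_score1 = c(0, 1.05, 2.10, 2.80, 3.60)
)
toy_summary <- adjustment_summary(toy)
toy_summary
```

Comparing mean_down with mean_up gives a direct check on whether downweighting is larger in magnitude than upweighting.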

Average Adjustment by Goal Count

### Calculating average adjustment in goals based on goals scored
avg_adjustment <- df_top_leagues |>
  group_by(score1) |>
  summarise(
    avg_adj_diff = mean(adj_score1 - score1, na.rm = TRUE),
    n_games = n(),
    .groups = "drop"
  )

### Line Plot
ggplot(avg_adjustment, aes(x = score1, y = avg_adj_diff)) +
  geom_point(size = 3) +
  geom_line() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(
    title = "Average Adjustment by Actual Goals Scored",
    x = "Actual Goals",
    y = "Average (Adjusted - Actual)"
  ) +
  theme_minimal()

From here, we can see that when a team scores 1 or 2 goals (particularly 1 goal), the adjusted goals tend to be slightly inflated on average. When a team scores 3-9 goals, the opposite is true. This plot aligns with what we noted from the previous plot, where the magnitude of downweighting appears to be greater than the magnitude of upweighting: the points for 3-9 actual goals deviate from the “0-line” more than the points for 1 or 2 goals. Again, while these differences may even out on a season-wide level, it is very important to be aware of them in any match-level analysis to avoid bias in our explanations.

Question 4: Missing Data: Season and Teams

Explicit Missing Data

### Check for counts of missing season or team values
df_top_leagues |>
  summarise(
    missing_season = sum(is.na(season)),
    missing_team1  = sum(is.na(team1)),
    missing_team2  = sum(is.na(team2))
  )

##   missing_season missing_team1 missing_team2
## 1              0             0             0

Here, we checked that no match is missing a season and that each match has two teams assigned to it. We see no missing values, so every match in our analyses can be tied to a season and two teams.

Implicit Missing Data

# Making sure we have each season sequentially
sort(unique(df_top_leagues$season))

## [1] 2016 2017 2018 2019 2020

### Making sure each season has the same # of games
df_top_leagues |>
  count(season)

##   season    n
## 1   2016 1826
## 2   2017 1826
## 3   2018 1826
## 4   2019 1826
## 5   2020 1826

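Here again, we can see that there are no implicitly missing seasons: every season from 2016 through 2020 appears, with no gaps, so the seasons follow the proper progression. Each season also has the same number of matches (1,826). That total is what a complete schedule would produce, since four of these leagues have 20 teams (380 matches each in a double round-robin) and the Bundesliga has 18 teams (306 matches), and 4 × 380 + 306 = 1,826. So we do not appear to be missing matches in any of the seasons.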
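Season totals could still hide a gap in one league that is offset elsewhere, so a finer check is to count matches per league and season. The sketch below assumes each league plays a double round-robin, so a league with n teams should have n × (n - 1) matches. The helper `check_round_robin` and the toy league are hypothetical; the column names follow the dataset.

```r
# Compare observed matches per league-season with the double
# round-robin count n * (n - 1), where n is the number of distinct
# teams seen in that league-season.
check_round_robin <- function(matches) {
  groups <- split(matches, list(matches$league, matches$season), drop = TRUE)
  out <- lapply(groups, function(g) {
    n_teams <- length(unique(c(g$team1, g$team2)))
    data.frame(
      league   = g$league[1],
      season   = g$season[1],
      n_teams  = n_teams,
      observed = nrow(g),
      expected = n_teams * (n_teams - 1)
    )
  })
  out <- do.call(rbind, out)
  rownames(out) <- NULL
  out$complete <- out$observed == out$expected
  out
}

# Toy league with 3 teams (invented): a full schedule has 6 matches
teams <- c("A", "B", "C")
full <- expand.grid(team1 = teams, team2 = teams, stringsAsFactors = FALSE)
full <- full[full$team1 != full$team2, ]
full$league <- "Toy League"
full$season <- 2016

full_check    <- check_round_robin(full)        # complete schedule
missing_check <- check_round_robin(full[-1, ])  # one match dropped
missing_check
```

Any row with complete equal to FALSE points to a league-season worth inspecting by hand.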

Question 5: Outliers in Expected Goals

Most soccer matches see each team score anywhere from 0 to 2 goals. The same applies to expected goals (xG), where a team’s match performance typically produces a value between 0 and 2. Higher values, especially in the 2-3 range, are not uncommon, but the distribution is definitely right-skewed.

### Histogram of xG values for distribution and potential outliers
ggplot(df_top_leagues, aes(x = xg1)) + 
  geom_histogram(bins = 30, color = "gray", fill = "red") + 
  labs(title = "Histogram of xG Values", x = "xG") + 
  theme_minimal()

To look for outliers in the expected goals column, we will use a percentile method: find the matches where a team’s expected goals exceed the 99th percentile, and see what those values are.

### Getting the upper 99th percentile of games
upper_99 <- quantile(df_top_leagues$xg1, 0.99, na.rm = TRUE)

df_outliers <- df_top_leagues |>
  filter(xg1 > upper_99) |>
  select(season, date, team1, xg1) |>
  arrange(desc(xg1))

# Printing the top rows and looking for high xG values
head(df_outliers)

##   season       date               team1  xg1
## 1   2019 2019-09-21     Manchester City 7.07
## 2   2017 2017-12-17           Barcelona 7.04
## 3   2020 2021-04-21   Borussia Dortmund 6.72
## 4   2016 2016-08-20             AS Roma 6.12
## 5   2020 2021-02-07         Montpellier 6.07
## 6   2018 2019-01-19 Paris Saint-Germain 6.03

From the histogram, we can see the right skew in xG values. We can also see a few potential outliers where a team had an xG of 6 or more, and particularly those at 7 or more. The table lists the largest of these instances. In particular, the Manchester City and Barcelona games with 7.07 and 7.04 xG, respectively, would likely be considered outliers on a global scale.

It is also important to look team by team to find outliers relative to an individual team’s usual expected goals.

### Just Man City games (matches where they are listed as team1)
mci_df <- df_top_leagues |>
  filter(team1 == 'Manchester City') |>
  select(season, date, team1, xg1) |>
  arrange(desc(xg1))

head(mci_df)

##   season       date           team1  xg1
## 1   2019 2019-09-21 Manchester City 7.07
## 2   2018 2018-10-20 Manchester City 5.25
## 3   2017 2017-09-23 Manchester City 5.02
## 4   2018 2018-09-15 Manchester City 4.62
## 5   2018 2018-08-19 Manchester City 4.46
## 6   2018 2019-02-10 Manchester City 4.08

From this table of Manchester City matches, we can see that the roughly 7 xG game is nearly 2 xG higher than any other game. While I would not suggest removing it from any analysis, we should be mindful of this game and the effect it can have on any team-level xG metrics for Manchester City. The same applies to other teams, should we locate any outliers.
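To look for similar cases across all teams without inspecting each one by hand, one option is a within-team z-score. Note that this is a different technique from the 99th-percentile method used above, swapped in here because a fixed percentile would flag roughly the same share of games for every team. The helper `flag_team_outliers`, the threshold of 3, and the toy data are assumptions for illustration.

```r
# Flag matches where a team's xg1 is far above that team's own average.
# `matches` needs team1 and xg1 columns; `threshold` is in standard deviations.
flag_team_outliers <- function(matches, threshold = 3) {
  m <- matches[!is.na(matches$xg1), ]
  team_mean <- ave(m$xg1, m$team1, FUN = mean)  # per-team mean, one value per row
  team_sd   <- ave(m$xg1, m$team1, FUN = sd)    # per-team standard deviation
  m$z <- (m$xg1 - team_mean) / team_sd
  m[!is.na(m$z) & m$z > threshold, ]
}

# Toy data (invented): team A has one extreme game, team B has none
toy_xg <- data.frame(
  team1 = c(rep("A", 21), rep("B", 12)),
  xg1   = c(rep(c(0.8, 1.0, 1.2, 1.4), 5), 7.0,
            rep(c(0.5, 0.7, 0.9), 4))
)
flagged <- flag_team_outliers(toy_xg)
flagged
```

With few games per team, a single extreme value inflates that team’s standard deviation, so the threshold may need to be lowered for small samples.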

MK_Week5_DataDive

2026-02-15