### Packages
library(tidyverse)
### Importing data
df <- read.csv("C:/Users/matth/OneDrive/Documents/INFO_H510/spi_matches.csv")
### Subsetting to only include the top 5 leagues
df_top_leagues <- df |>
  filter(league %in% c("Barclays Premier League", "French Ligue 1", "Italy Serie A",
                       "Spanish Primera Division", "German Bundesliga"))
The data documentation for this dataset comes from a couple of different sources. From "jayb" on GitHub (https://github.com/fivethirtyeight/data/tree/master/soccer-spi), we can get the official data dictionary, which explains what each variable name means. However, the explanations of how certain variables are collected and calculated are provided in a FiveThirtyEight article about the methodology (https://fivethirtyeight.com/methodology/how-our-club-soccer-predictions-work/). The following questions are answered using information from these two sources.
Originally, I was unsure about the spi1 and spi2 columns. I knew that they related to the ESPN Soccer Power Index ratings for each team, but it wasn't clear how these were calculated and what exactly they meant. The FiveThirtyEight article gives some clarity here. The rating is calculated from an offensive rating (how many goals the team would be expected to score against an average team on a neutral field) and a defensive rating (the same circumstance, but for expected goals conceded). Together, these represent the percentage of points (3 for a win, 1 for a tie, 0 for a loss) the team would be expected to take from a match if it were played over and over again. The preseason rating is a blend of the previous end-of-season SPI rating (67% of the rating) and a market-value-implied rating that serves as a proxy for team strength (33%). Each team's rating is then updated after every match based on how its performance compares with what the model predicted. I think this is a smart approach to modeling team strength and performance throughout a season. While the result is obviously important, a team can play very poorly in a few games and still get results (or play very well and not get results), so underlying performance can be a better measure of a team's strength and expected output than recent results alone. While I could have used the variable without the exact definition (as I had a general understanding), this adds very important context to how we evaluate these ratings.
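As a rough illustration of the preseason blend described above, the 67/33 weighting can be written as a simple function. The function name and example ratings below are made up purely for illustration; this is only a sketch of the split described in the article, not FiveThirtyEight's actual implementation.
### Sketch of the preseason SPI blend (hypothetical function and inputs)
preseason_spi <- function(prev_season_spi, market_value_spi) {
  # 67% weight on last season's final SPI, 33% on the market-value implied rating
  0.67 * prev_season_spi + 0.33 * market_value_spi
}

# Example with made-up ratings
preseason_spi(prev_season_spi = 85, market_value_spi = 90)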
Another set of variables I was unsure about were the match importance variables importance1 and importance2. As with SPI, I had a general understanding of what these implied, since certain matches matter more to one team than another based on season context, but I was not sure how this was calculated. The FiveThirtyEight article offers a full section on the match importance variables. To calculate importance, they factor in each team's probabilities of winning the league, being promoted/relegated, and qualifying for the Champions League with a win versus a loss. They then take the difference in those probabilities, scale it to 0-100, and use the factor (promotion, Champions League qualification, relegation, etc.) with the greatest difference as the match importance. Again, while I could have used this without the explicit definition, since I had a general understanding, the definition is very important for providing proper context for the variable. Match importance is an important way of adding context to team performances, as performances may vary when the stakes are higher or lower.
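To make that "largest scaled difference" logic concrete, here is a simplified sketch. It assumes we already have the model's probabilities of each outcome conditional on a win and on a loss; the function and the probability values are hypothetical and only mirror the calculation described in the article.
### Sketch of the match importance logic (hypothetical probabilities)
match_importance <- function(prob_if_win, prob_if_loss) {
  # Difference in each outcome's probability between a win and a loss,
  # scaled to 0-100; the largest difference becomes the match importance
  max(abs(prob_if_win - prob_if_loss) * 100)
}

# Example: title, Champions League, and relegation probabilities given a win vs. a loss
prob_if_win  <- c(title = 0.20, champions_league = 0.65, relegation = 0.00)
prob_if_loss <- c(title = 0.12, champions_league = 0.45, relegation = 0.00)
match_importance(prob_if_win, prob_if_loss)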
The one set of variables that is still a little unclear to me is the adjusted score variables adj_score1 and adj_score2. The article explains the purpose of these variables, but it does not do a great job of actually explaining the calculation and how the number of goals is "adjusted" for this metric. Basically, goals scored while a team is at a man advantage, or while it is already winning late in a match, are given a reduced value, while other goals are given an increased value so that the adjusted totals generally add up to the team's total goals over the season. While this is important to understanding which goals are adjusted in which direction, it is still unclear how much the goals are actually adjusted on an individual level, which can matter in a match-level analysis or with a smaller sample of matches.
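While the article does not spell out the individual adjustments, we can at least measure how large they tend to be in this dataset by summarising the per-match difference between adjusted and actual goals. The specific quantile cutoffs below are an arbitrary choice for illustration.
### How large are the per-match goal adjustments?
df_top_leagues |>
  mutate(adj_diff = adj_score1 - score1) |>
  summarise(
    median_diff = median(adj_diff, na.rm = TRUE),
    p05 = quantile(adj_diff, 0.05, na.rm = TRUE),
    p95 = quantile(adj_diff, 0.95, na.rm = TRUE),
    max_abs_diff = max(abs(adj_diff), na.rm = TRUE)
  )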
### Scatterplot of actual vs adjusted goals
ggplot(df_top_leagues, aes(x = score1, y = adj_score1)) +
  geom_point(alpha = 0.4) +
  # Reference line
  geom_abline(slope = 1, intercept = 0, color = "red", linewidth = 1) +
  labs(
    title = "Actual Goals vs Adjusted Goals (Team 1)",
    x = "Actual Goals",
    y = "Adjusted Goals"
  ) +
  theme_minimal()
From the plot, it appears that most goals, particularly when the actual goal count is higher, are downweighted (they fall below the red line). Thus, adjusted goals are typically lower than actual goals overall, or at least the magnitude of the downweighting is often greater than the magnitude of the upweighting. The other interesting thing to note is that there is a single point where the team did not score an actual goal but was credited with around 2 adjusted goals.
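That unusual point can be pulled out directly from the data; the 1.5 cutoff below is just a convenient filter for isolating it.
### Finding the match with 0 actual goals but ~2 adjusted goals
df_top_leagues |>
  filter(score1 == 0, adj_score1 > 1.5) |>
  select(season, date, team1, team2, score1, adj_score1)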
Without a clear explanation of the individual-level adjustments and their magnitudes, we have to be cautious about whether unintended bias is introduced into a match-level analysis. While these differences may level out over a season, we must be very careful in smaller-scale analyses to avoid these potential biases.
### Calculating average adjustment in goals based on goals scored
avg_adjustment <- df_top_leagues |>
  group_by(score1) |>
  summarise(
    avg_adj_diff = mean(adj_score1 - score1, na.rm = TRUE),
    n_games = n(),
    .groups = "drop"
  )
### Line Plot
ggplot(avg_adjustment, aes(x = score1, y = avg_adj_diff)) +
  geom_point(size = 3) +
  geom_line() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(
    title = "Average Adjustment by Actual Goals Scored",
    x = "Actual Goals",
    y = "Average (Adjusted - Actual)"
  ) +
  theme_minimal()
From here, we can see that in 1- and 2-goal games (particularly 1-goal games), goals tend to be adjusted upward, while for 3-9-goal games the opposite is true. This aligns with what we noted from the previous plot, where the magnitude of the downweighting appears to be larger than the magnitude of the upweighting: the points for 3-9 actual goals deviate from the zero line more than the points for 1 or 2 goals. Again, while these differences may even out at the season-wide level, it is very important to be aware of them in any match-level analysis to avoid bias in our explanations.
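To check the claim that these adjustments roughly even out over a season, one can compare each team's total actual and adjusted goals by season. This is only a sanity check using the columns already in the data, not a definitive test of the adjustment method.
### Do adjustments roughly cancel out over a season?
season_totals <- df_top_leagues |>
  group_by(season, team1) |>
  summarise(
    total_goals = sum(score1, na.rm = TRUE),
    total_adj_goals = sum(adj_score1, na.rm = TRUE),
    .groups = "drop"
  ) |>
  mutate(season_diff = total_adj_goals - total_goals)

# Distribution of season-level differences across team-seasons
summary(season_totals$season_diff)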
### Check for counts of missing season or team values
df_top_leagues |>
  summarise(
    missing_season = sum(is.na(season)),
    missing_team1 = sum(is.na(team1)),
    missing_team2 = sum(is.na(team2))
  )
## missing_season missing_team1 missing_team2
## 1 0 0 0
Here, we checked that no matches are missing a season and that each match has two teams assigned to it. Thankfully, we see no missing values. Thus, we can proceed with our analyses knowing that every match is assigned a season and two teams.
# Making sure we have each season sequentially
sort(unique(df_top_leagues$season))
## [1] 2016 2017 2018 2019 2020
### Making sure each season has the same # of games
df_top_leagues |>
  count(season)
## season n
## 1 2016 1826
## 2 2017 1826
## 3 2018 1826
## 4 2019 1826
## 5 2020 1826
Here again, we can see that there are no implicitly missing seasons; each season from 2016 through 2020 is accounted for sequentially, so the matches follow the proper progression. We can also see that each season has the same number of matches, so we are not missing matches in any of the seasons.
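The count of 1,826 matches per season also lines up with a full schedule. Assuming the four 20-team leagues each schedule 20 × 19 = 380 matches and the 18-team Bundesliga schedules 18 × 17 = 306 (team counts are my assumption from general knowledge, not from the data documentation), we get 4 × 380 + 306 = 1,826. The check below compares the per-league counts to those expected double round-robin totals.
### Comparing per-league counts to a full double round-robin schedule
df_top_leagues |>
  count(season, league) |>
  # A league with n teams plays n * (n - 1) matches in a double round-robin:
  # 380 for 20 teams, 306 for the 18-team Bundesliga
  mutate(
    expected = if_else(league == "German Bundesliga", 18 * 17, 20 * 19),
    matches_full_schedule = n == expected
  )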
Most soccer matches see each team score anywhere from 0-2 goals. The same pattern applies to expected goals, where a team's expected goals (xG) value for a match typically falls between 0 and 2. Higher values, especially in the 2-3 range, are not necessarily uncommon, but the distribution is definitely right-skewed.
### Histogram of xG values for distribution and potential outliers
ggplot(df_top_leagues, aes(x = xg1)) +
  geom_histogram(color = "gray", fill = "red") +
  labs(title = "Histogram of xG Values", x = "xG") +
  theme_minimal()
To look for outliers in the expected goals column, we will use a percentile method to find matches where a team is in the 99th percentile of expected goals, and see what those expected goal counts are.
### Getting the upper 99th percentile of games
upper_99 <- quantile(df_top_leagues$xg1, 0.99, na.rm = TRUE)
df_outliers <- df_top_leagues |>
  filter(xg1 > upper_99) |>
  select(season, date, team1, xg1) |>
  arrange(desc(xg1))
# Printing the top rows and looking for high xG values
head(df_outliers)
## season date team1 xg1
## 1 2019 2019-09-21 Manchester City 7.07
## 2 2017 2017-12-17 Barcelona 7.04
## 3 2020 2021-04-21 Borussia Dortmund 6.72
## 4 2016 2016-08-20 AS Roma 6.12
## 5 2020 2021-02-07 Montpellier 6.07
## 6 2018 2019-01-19 Paris Saint-Germain 6.03
From the histogram, we can see this right skew of xG values. We can also see a few potential outliers, where a team had an xG value of 6+ goals, and particularly those at 7+ expected goals. From the table, we can see some of these instances. In particular, the Manchester City and Barcelona games with 7.07 and 7.04 xG, respectively, would likely be considered outliers on a global scale.
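As a complementary check to the percentile cutoff, the standard 1.5 × IQR rule (a common convention, not anything from the FiveThirtyEight methodology) gives an upper fence above which values like 7.07 would be flagged as outliers.
### IQR-based check on global xG outliers
q1 <- quantile(df_top_leagues$xg1, 0.25, na.rm = TRUE)
q3 <- quantile(df_top_leagues$xg1, 0.75, na.rm = TRUE)
upper_fence <- q3 + 1.5 * (q3 - q1)
upper_fence

# How many team-match xG values fall above the upper fence
sum(df_top_leagues$xg1 > upper_fence, na.rm = TRUE)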
It is also important to look at a team-by-team scale to find outliers relative to an individual team's expected goals.
### Just Man City Games
mci_df <- df_top_leagues |>
  filter(team1 == "Manchester City") |>
  select(season, date, team1, xg1) |>
  arrange(desc(xg1))
head(mci_df)
## season date team1 xg1
## 1 2019 2019-09-21 Manchester City 7.07
## 2 2018 2018-10-20 Manchester City 5.25
## 3 2017 2017-09-23 Manchester City 5.02
## 4 2018 2018-09-15 Manchester City 4.62
## 5 2018 2018-08-19 Manchester City 4.46
## 6 2018 2019-02-10 Manchester City 4.08
From this table of only Manchester City games, we can see that the \(\approx 7\) xG match is nearly 2 xG higher than any other match. While I would not suggest removing it from any analysis, we should be careful of this game and the effect it can have on any team-level xG metrics for Manchester City. The same applies to other teams, should we locate any outliers.
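To extend that idea to every team, a per-team scan can flag matches that sit far from a team's typical xG. The 3-standard-deviation cutoff below is an arbitrary choice for illustration, not a rule from the data documentation.
### Flagging per-team xG outliers (arbitrary 3 SD cutoff)
team_outliers <- df_top_leagues |>
  group_by(team1) |>
  mutate(
    team_mean_xg = mean(xg1, na.rm = TRUE),
    team_sd_xg = sd(xg1, na.rm = TRUE),
    xg_z = (xg1 - team_mean_xg) / team_sd_xg
  ) |>
  ungroup() |>
  filter(xg_z > 3) |>
  select(season, date, team1, xg1, xg_z) |>
  arrange(desc(xg_z))

head(team_outliers)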