### Loading the packages used throughout
library(dplyr)
library(tidyr)
library(ggplot2)
library(knitr)
### Importing data
df <- read.csv("C:/Users/matth/OneDrive/Documents/INFO_H510/spi_matches.csv")
### Subsetting to only include the top 5 leagues
df_top_leagues <- df %>%
  filter(league %in% c("Barclays Premier League", "French Ligue 1", "Italy Serie A", "Spanish Primera Division", "German Bundesliga"))
This analysis explores trends in ESPN and FiveThirtyEight’s predictive models for Europe’s top 5 leagues from the 2016/17 season to the 2020/21 season. The top 5 leagues are the English Premier League, French Ligue 1, Italian Serie A, German Bundesliga, and Spanish Primera Division.
I wanted to explore trends in how FiveThirtyEight + ESPN projected
the home and away teams to score goals in Europe’s top 5 leagues. I would
expect that, ignoring team strength, the home team would often be
projected to score more goals. This dive looks at the distributions
of projected home team goals (proj_score1) and away team goals
(proj_score2) by exploring central tendency, spread, and
potential differences in these between home and away teams.
### Cleanly assigning names to the variables, and calculating summary stats
df_top_leagues %>%
  select(proj_score1, proj_score2) %>%
  pivot_longer(
    cols = everything(),
    names_to = "team_type",
    values_to = "projected_goals"
  ) %>%
  mutate(
    ### Clean home/away team names
    team_type = recode(
      team_type,
      proj_score1 = "Home Team",
      proj_score2 = "Away Team"
    )
  ) %>%
  ### Summarizing home and away team distributions
  group_by(team_type) %>%
  summarise(
    Min = min(projected_goals, na.rm = TRUE),
    Q1 = quantile(projected_goals, 0.25, na.rm = TRUE),
    Median = median(projected_goals, na.rm = TRUE),
    Mean = mean(projected_goals, na.rm = TRUE),
    Q3 = quantile(projected_goals, 0.75, na.rm = TRUE),
    Max = max(projected_goals, na.rm = TRUE),
    SD = sd(projected_goals, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  ### Clean table
  kable(
    digits = 2,
    caption = "Summary Statistics for Projected Goals by Team Type"
  )
| team_type | Min | Q1 | Median | Mean | Q3 | Max | SD |
|---|---|---|---|---|---|---|---|
| Away Team | 0.20 | 0.87 | 1.10 | 1.18 | 1.40 | 3.43 | 0.49 |
| Home Team | 0.41 | 1.23 | 1.45 | 1.55 | 1.75 | 4.03 | 0.49 |
### Similar structure, but plotting at the end
df_top_leagues %>%
  select(proj_score1, proj_score2) %>%
  pivot_longer(
    cols = everything(),
    names_to = "team_type",
    values_to = "projected_goals"
  ) %>%
  mutate(
    team_type = recode(
      team_type,
      proj_score1 = "Home Team",
      proj_score2 = "Away Team"
    )
  ) %>%
  ### Box and Whisker plot
  ggplot(aes(x = team_type, y = projected_goals, fill = team_type)) +
  geom_boxplot(alpha = 0.7, outlier.alpha = 0.4) +
  labs(
    title = "Distribution of Projected Goals by Team Type",
    x = "",
    y = "Projected Goals"
  ) +
  theme_minimal() +
  theme(legend.position = "none")
From the table above, we can see that my initial intuition was correct: the home team tends to be projected to score more goals under these models. The home team has a higher minimum, first quartile, median, mean, third quartile, and maximum, and the box and whisker plot echoes this. It is also interesting that the standard deviation is 0.49 for both the home and away teams. The spread of the projections is therefore essentially the same for both sides, while the higher values for the home team point towards a general “home field advantage” built into the model.
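The summary table compares the two distributions as a whole, while the claim that the home team is projected to score more is really a per-match statement. One way to check it directly is to compute the per-match difference between the two projections. The sketch below uses a small invented data frame (`toy_matches`, a hypothetical stand-in for `df_top_leagues`) so that it runs on its own; the values are made up for illustration only.

```r
# Toy stand-in for df_top_leagues; these four rows are invented for illustration.
toy_matches <- data.frame(
  proj_score1 = c(1.60, 1.20, 2.10, 0.90),
  proj_score2 = c(1.10, 1.30, 0.80, 0.85)
)

# Per-match projected home edge: positive means the home side is projected
# to score more than the away side in that match.
toy_matches$proj_diff <- toy_matches$proj_score1 - toy_matches$proj_score2

# Average edge, and the share of matches where the home projection is higher.
mean_diff <- mean(toy_matches$proj_diff, na.rm = TRUE)
share_home_higher <- mean(toy_matches$proj_diff > 0, na.rm = TRUE)

mean_diff
share_home_higher
```

On the real data, the same two lines applied to `df_top_leagues` would show how often the home projection actually exceeds the away projection, which the marginal summaries alone cannot.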
The above dive into the projected goals variable is consistent with a home field advantage in the modeling projections. However, it says nothing about differences league by league. This next question asks whether projected goals differ among the top 5 leagues, which could hint that the model thinks one league is better or worse at attacking or defending overall compared to the others. These 5 leagues differ in both style of play and competitive balance, so this quick exploration looks at whether that is reflected in the projections.
df_top_leagues %>%
  ### Calculating league-wide averages
  group_by(league) %>%
  summarise(
    ### Average home and away projected goals
    Avg_Home_Proj_Goals = mean(proj_score1, na.rm = TRUE),
    Avg_Away_Proj_Goals = mean(proj_score2, na.rm = TRUE),
    Matches = n(),
    .groups = "drop"
  ) %>%
  arrange(desc(Avg_Home_Proj_Goals)) %>%
  kable(
    digits = 2,
    caption = "Average Projected Goals by League"
  )
| league | Avg_Home_Proj_Goals | Avg_Away_Proj_Goals | Matches |
|---|---|---|---|
| German Bundesliga | 1.59 | 1.23 | 1530 |
| Barclays Premier League | 1.57 | 1.22 | 1900 |
| Italy Serie A | 1.57 | 1.24 | 1900 |
| Spanish Primera Division | 1.53 | 1.10 | 1900 |
| French Ligue 1 | 1.49 | 1.11 | 1900 |
From this table, there are no major differences between leagues in projected goals. I calculated the league-wide average over the 5 seasons in question for both home and away goals. The German Bundesliga has the highest projected home goals per game (1.59), while French Ligue 1 has the lowest (1.49). This difference is small, which suggests that the league itself is not a major factor in projected goals. Similar results apply for the away team, though the gap between the highest (Italian Serie A, 1.24) and the lowest (Spanish Primera Division, 1.10) is slightly bigger. Taking home and away together, Ligue 1 has the lowest projections overall, while the Bundesliga has the highest. One interesting point is the difference between projected home and away goals in Spain: 0.43, which is greater than the 0.37 gap between the overall home and away means (1.55 and 1.18) in the original summary table. The home/away gap also varies across the other leagues, indicating a potential difference in the “home advantage” factor applied to the goal projections.
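The home/away gaps discussed above can be reproduced from the table itself. The sketch below hardcodes the rounded league averages from the table (so each gap may be off by about 0.01 relative to the unrounded data) and ranks the leagues by projected home edge; the data frame and column names are chosen here for illustration.

```r
# Rounded league averages copied from the "Average Projected Goals by League" table.
proj_by_league <- data.frame(
  league = c("German Bundesliga", "Barclays Premier League", "Italy Serie A",
             "Spanish Primera Division", "French Ligue 1"),
  avg_home = c(1.59, 1.57, 1.57, 1.53, 1.49),
  avg_away = c(1.23, 1.22, 1.24, 1.10, 1.11)
)

# Projected home edge = average home projection minus average away projection.
proj_by_league$home_edge <- round(proj_by_league$avg_home - proj_by_league$avg_away, 2)

# Leagues ordered from largest to smallest projected home edge.
proj_by_league[order(-proj_by_league$home_edge), c("league", "home_edge")]
```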
Now that we have an understanding of the distribution of projected
goals, let’s explore the documentation to understand where those numbers
actually come from and how they are calculated. From FiveThirtyEight’s
article about their Club Soccer Predictions (https://fivethirtyeight.com/methodology/how-our-club-soccer-predictions-work/),
we can see just what goes into their match forecasts. The article
explains how they incorporate ESPN’s SPI (Soccer Power Index), which is
updated after each game based on how a team performed relative to
expectations. They also factor in a league-specific
home field advantage, confirming our earlier suspicions, and the match
importance. From here, they project the number of goals each team is
expected to score and assume a Poisson process in soccer goal scoring to
generate Poisson distributions of the two projected scores to give a
likelihood of 0, 1, 2, … goals for each team. Finally, these are
combined into a matrix of all possible scores, from which a likelihood
of win, draw, and loss is given. An example from the article is included
below:
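Independently of the article’s example, the Poisson step described above can be sketched in a few lines of base R. This is a minimal illustration, not FiveThirtyEight’s implementation: it assumes the two teams’ goal counts are independent, uses the overall mean projections from the earlier summary table (1.55 and 1.18) as example inputs, truncates the matrix at 10 goals per team (an arbitrary cutoff chosen here), and does not attempt any further adjustments the article may describe.

```r
# Illustrative inputs: the overall mean projections from the summary table.
home_proj <- 1.55
away_proj <- 1.18
max_goals <- 10  # truncation point (an assumption); the tail beyond it is tiny

goals <- 0:max_goals

# Poisson probabilities of 0, 1, 2, ... goals for each side.
home_probs <- dpois(goals, lambda = home_proj)
away_probs <- dpois(goals, lambda = away_proj)

# Matrix of all scorelines, assuming independence.
# Rows = home goals, columns = away goals.
score_matrix <- outer(home_probs, away_probs)
dimnames(score_matrix) <- list(home = goals, away = goals)

# Below the diagonal the home side has more goals; above it, the away side.
p_home_win <- sum(score_matrix[lower.tri(score_matrix)])
p_draw     <- sum(diag(score_matrix))
p_away_win <- sum(score_matrix[upper.tri(score_matrix)])

round(c(home_win = p_home_win, draw = p_draw, away_win = p_away_win), 3)
```

The three probabilities sum to just under 1 because of the truncation at `max_goals`; raising the cutoff shrinks that gap.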
We now have an understanding of how FiveThirtyEight projected goals scored and the probabilities of winning. The article specifically referenced a league-specific home-field advantage, so I next wanted to look at what the actual home-field advantage is in each league.
### Same pivoting strategy at the start, but combining league and home/away
df_top_leagues %>%
  select(league, score1, score2) %>%
  pivot_longer(
    cols = c(score1, score2),
    names_to = "team_type",
    values_to = "actual_goals"
  ) %>%
  mutate(
    team_type = recode(
      team_type,
      score1 = "Home Team",
      score2 = "Away Team"
    )
  ) %>%
  ### Averages and Summary Statistics by league and home/away team
  group_by(league, team_type) %>%
  summarise(
    Min = min(actual_goals, na.rm = TRUE),
    Q1 = quantile(actual_goals, 0.25, na.rm = TRUE),
    Median = median(actual_goals, na.rm = TRUE),
    Mean = mean(actual_goals, na.rm = TRUE),
    Q3 = quantile(actual_goals, 0.75, na.rm = TRUE),
    Max = max(actual_goals, na.rm = TRUE),
    Matches = n(),
    .groups = "drop"
  ) %>%
  arrange(league, desc(team_type)) %>%
  ### Clean table
  kable(
    digits = 2,
    caption = "Actual Goals Scored by Home and Away Teams, League by League"
  )
| league | team_type | Min | Q1 | Median | Mean | Q3 | Max | Matches |
|---|---|---|---|---|---|---|---|---|
| Barclays Premier League | Home Team | 0 | 1 | 1 | 1.51 | 2 | 9 | 1900 |
| Barclays Premier League | Away Team | 0 | 0 | 1 | 1.23 | 2 | 9 | 1900 |
| French Ligue 1 | Home Team | 0 | 1 | 1 | 1.49 | 2 | 9 | 1900 |
| French Ligue 1 | Away Team | 0 | 0 | 1 | 1.15 | 2 | 7 | 1900 |
| German Bundesliga | Home Team | 0 | 1 | 1 | 1.68 | 2 | 8 | 1530 |
| German Bundesliga | Away Team | 0 | 0 | 1 | 1.34 | 2 | 6 | 1530 |
| Italy Serie A | Home Team | 0 | 1 | 1 | 1.57 | 2 | 7 | 1900 |
| Italy Serie A | Away Team | 0 | 0 | 1 | 1.31 | 2 | 7 | 1900 |
| Spanish Primera Division | Home Team | 0 | 1 | 1 | 1.49 | 2 | 8 | 1900 |
| Spanish Primera Division | Away Team | 0 | 0 | 1 | 1.15 | 2 | 6 | 1900 |
The first noticeable trend in the distribution of home and away goals is that, in every league, the first quartile is 1 goal for the home team and 0 for the away team. We can also see that in each league, the maximum goals scored by a home team is always greater than or equal to that of an away team. However, each league has the same median (1) for home and away goals, indicating that one-goal performances make up a large share of the distribution for both sides.
League differences in home-field advantage show up most clearly in the mean. Each league has a higher mean goals scored for the home team than for the away team. The Spanish Primera Division (which had the largest gap in projected goals), French Ligue 1, and German Bundesliga each have a difference of 0.34 between home and away goals scored. The English Premier League follows with 0.28 and the Italian Serie A with 0.26, based on the rounded means in the table. While the documentation does not specifically mention where the league-specific home advantage is applied, it is likely related to these differences in mean home and away goals scored.
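As a quick check on the arithmetic, the actual home/away gaps can be recomputed from the rounded means in the table above. The data frame and column names below are chosen here for illustration, and because the inputs are rounded to two decimals, each gap may differ slightly from one computed on the raw data.

```r
# Rounded mean goals copied from the "Actual Goals Scored" table.
actual_by_league <- data.frame(
  league = c("Barclays Premier League", "French Ligue 1", "German Bundesliga",
             "Italy Serie A", "Spanish Primera Division"),
  mean_home = c(1.51, 1.49, 1.68, 1.57, 1.49),
  mean_away = c(1.23, 1.15, 1.34, 1.31, 1.15)
)

# Actual home edge = mean home goals minus mean away goals.
actual_by_league$home_edge <- round(actual_by_league$mean_home - actual_by_league$mean_away, 2)

# Leagues ordered from largest to smallest actual home edge.
actual_by_league[order(-actual_by_league$home_edge), c("league", "home_edge")]
```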
The next step after exploring both projected and actual goals scored is to see how they relate. Are these predictions actually a good indicator of how many goals a team will score? We will explore this with a scatterplot, colored by home/away team, to see if there are differences in projection accuracy.
df_top_leagues %>%
  ### Same data aggregation structure, now to analyze differences in actual vs projected goals
  select(proj_score1, proj_score2, score1, score2) %>%
  pivot_longer(
    cols = everything(),
    names_to = c(".value", "team"),
    names_pattern = "(proj_score|score)([12])"
  ) %>%
  mutate(
    team_type = ifelse(team == "1", "Home Team", "Away Team")
  ) %>%
  ## Scatter Plot
  ggplot(aes(x = proj_score, y = score, color = team_type)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Projected vs Actual Goals Scored",
    x = "Projected Goals",
    y = "Actual Goals",
    # Colored by team type
    color = "Team Type"
  ) +
  theme_minimal()
From the plot, the home-field advantage factor is immediately noticeable, with more low projected-goal values for the away team than for the home team. While the plot is crowded, there is a slight positive trend for both home and away goals. Although not overly strong, the positive slope for both the home and away team indicates that the projected goals have at least some predictive value.
It is also interesting to point out that the trend is stronger for the home team than the away team, though this is a small difference. This merits further investigation into the home-field advantage factor.
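Because the scatterplot is crowded, it may help to put numbers on the two trend lines. The sketch below fits the same kind of linear model that `geom_smooth(method = "lm")` draws and reports the slope and Pearson correlation for each side. To stay self-contained it runs on simulated data (`sim`, a hypothetical stand-in whose column names mirror the real ones); the simulated projections and Poisson-distributed scores are assumptions for illustration and say nothing about the real results.

```r
set.seed(510)  # arbitrary seed so the simulated example is reproducible
n <- 2000

# Simulated stand-in for the real columns: projections drawn from plausible
# ranges, "actual" goals drawn as Poisson counts around those projections.
sim <- data.frame(
  proj_score1 = runif(n, 0.4, 4.0),
  proj_score2 = runif(n, 0.2, 3.4)
)
sim$score1 <- rpois(n, lambda = sim$proj_score1)
sim$score2 <- rpois(n, lambda = sim$proj_score2)

# Slope of actual on projected goals, plus the Pearson correlation, for one side.
fit_side <- function(proj, actual) {
  fit <- lm(actual ~ proj)
  c(slope = unname(coef(fit)["proj"]),
    correlation = cor(proj, actual, use = "complete.obs"))
}

rbind(
  home = fit_side(sim$proj_score1, sim$score1),
  away = fit_side(sim$proj_score2, sim$score2)
)
```

Replacing `sim` with `df_top_leagues` would give the slopes and correlations behind the plotted lines, making the home versus away comparison easier to judge than by eye.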
The other factor that FiveThirtyEight indicated was included in their projections was match importance. This is a measure of how much the match outcome would change each team’s statistical outlook for the season. For instance, a match that would determine if a team wins the league is more important than a match for a team sitting in the middle of the table with no possibility of winning the league or getting relegated. I wanted to explore how this related to actual goals scored, once again colored by home and away teams, to see if match importance was a factor in a team scoring more. A positive trend could indicate that teams push harder for goals in important matches.
### Same set up as before, but looking at match importance
df_top_leagues %>%
  select(importance1, importance2, score1, score2) %>%
  pivot_longer(
    cols = everything(),
    names_to = c(".value", "team"),
    names_pattern = "(importance|score)([12])"
  ) %>%
  mutate(
    team_type = ifelse(team == "1", "Home Team", "Away Team")
  ) %>%
  ## Same plotting structure
  ggplot(aes(x = importance, y = score, color = team_type)) +
  geom_jitter(alpha = 0.35, width = 0.5, height = 0) +
  geom_smooth(method = "loess", se = FALSE) +
  labs(
    title = "Match Importance vs Actual Goals Scored",
    x = "Match Importance",
    y = "Goals Scored",
    color = "Team Type"
  ) +
  theme_minimal()
From the scatterplot, there doesn’t appear to be any major relationship between match importance and how many goals a team scored. Goals increase slightly as match importance increases, but the effect appears to be negligible. From here, I think a logical next step would be to investigate how match importance is applied to the projections and how much weight it carries when calculating projected scores.
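A simple complement to the smoothed line is to bin match importance and compare mean goals across bins. The sketch below does this on a small invented data frame (`toy`, a hypothetical stand-in for the pivoted data) and assumes importance runs from 0 to 100; both the values and the four equal-width bins are choices made here for illustration.

```r
# Toy stand-in for the importance/score columns; values are invented.
toy <- data.frame(
  importance = c(5, 12, 30, 45, 55, 70, 85, 95),
  score      = c(1, 0, 2, 1, 1, 3, 2, 2)
)

# Split importance (assumed here to run from 0 to 100) into four equal-width bins.
toy$importance_bin <- cut(toy$importance,
                          breaks = c(0, 25, 50, 75, 100),
                          include.lowest = TRUE)

# Mean goals scored and number of observations within each bin.
mean_goals <- tapply(toy$score, toy$importance_bin, mean)
n_obs <- table(toy$importance_bin)

mean_goals
n_obs
```

On the real data, roughly flat bin means would support the reading that match importance has little effect on goals scored.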