MK_Week2_DataDive

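### Loading the packages used below (tidyverse for dplyr, tidyr, and ggplot2; knitr for kable)
library(tidyverse)
library(knitr)

### Importing data
df <- read.csv("C:/Users/matth/OneDrive/Documents/INFO_H510/spi_matches.csv")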

### Subsetting to only include the top 5 leagues
df_top_leagues <- df %>%
  filter(league %in% c("Barclays Premier League", "French Ligue 1", "Italy Serie A", "Spanish Primera Division", "German Bundesliga"))

Overview

This data dive explores trends in ESPN and FiveThirtyEight’s predictive models for Europe’s top 5 leagues from the 2016/17 season to the 2020/21 season. The top 5 leagues in Europe are the English Premier League, French Ligue 1, Italian Serie A, German Bundesliga, and Spanish Primera Division.

Numeric Summary: Projected Goals (Home vs Away)

I wanted to explore how FiveThirtyEight and ESPN projected the home and away teams to score goals in Europe’s top 5 leagues. I would expect that, ignoring team strength, the home team is usually projected to score more goals. This dive looks at the distribution of projected home team goals (proj_score1) and away team goals (proj_score2) by exploring central tendency, spread, and how these differ between home and away teams.

### Cleanly assigning names to the variables, and calculating summary stats
df_top_leagues %>%
  select(proj_score1, proj_score2) %>%
  pivot_longer(
    cols = everything(),
    names_to = "team_type",
    values_to = "projected_goals"
  ) %>%
  mutate(
    ### Clean home/away team names
    team_type = recode(
      team_type,
      proj_score1 = "Home Team",
      proj_score2 = "Away Team"
    )
  ) %>%
  ### Summarizing home and away team distributions
  group_by(team_type) %>%
  summarise(
    Min = min(projected_goals, na.rm = TRUE),
    Q1 = quantile(projected_goals, 0.25, na.rm = TRUE),
    Median = median(projected_goals, na.rm = TRUE),
    Mean = mean(projected_goals, na.rm = TRUE),
    Q3 = quantile(projected_goals, 0.75, na.rm = TRUE),
    Max = max(projected_goals, na.rm = TRUE),
    SD = sd(projected_goals, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  ### Clean table
  kable(
    digits = 2,
    caption = "Summary Statistics for Projected Goals by Team Type"
  )

Summary Statistics for Projected Goals by Team Type
team_type	Min	Q1	Median	Mean	Q3	Max	SD
Away Team	0.20	0.87	1.10	1.18	1.40	3.43	0.49
Home Team	0.41	1.23	1.45	1.55	1.75	4.03	0.49

### Similar structure, but plotting at the end
df_top_leagues %>%
  select(proj_score1, proj_score2) %>%
  pivot_longer(
    cols = everything(),
    names_to = "team_type",
    values_to = "projected_goals"
  ) %>%
  mutate(
    team_type = recode(
      team_type,
      proj_score1 = "Home Team",
      proj_score2 = "Away Team"
    )
  ) %>%
  ### Box and Whisker plot
  ggplot(aes(x = team_type, y = projected_goals, fill = team_type)) +
  geom_boxplot(alpha = 0.7, outlier.alpha = 0.4) +
  labs(
    title = "Distribution of Projected Goals by Team Type",
    x = "",
    y = "Projected Goals"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

The table above bears out my initial intuition: the home team is generally projected to score more goals under these models. The home distribution is higher at every summary point: minimum, first quartile, median, mean, third quartile, and maximum. The box and whisker plot echoes this. It is also interesting that the standard deviation is 0.49 for both the home and away teams. The two sets of projections therefore have the same spread, while the home mean sits 0.37 goals higher (1.55 vs 1.18), which points towards a general “home field advantage” term in the model.

Do Projected Goals Differ from League to League?

The above dive into the projected goals variable points to a home field advantage in the modeling projections. However, it says nothing about differences from league to league. This next question asks whether projected goals differ among the top 5 leagues, which could hint that the model rates one league as better or worse at attacking or defending overall than the others. These 5 leagues differ in both style of play and competitive balance, so this quick exploration checks whether that is reflected in the projections.

df_top_leagues %>%
  ### Calculating league-wide averages
  group_by(league) %>%
  summarise(
    ### Average home and away projected goals
    Avg_Home_Proj_Goals = mean(proj_score1, na.rm = TRUE),
    Avg_Away_Proj_Goals = mean(proj_score2, na.rm = TRUE),
    Matches = n(),
    .groups = "drop"
  ) %>%
  arrange(desc(Avg_Home_Proj_Goals)) %>%
  kable(
    digits = 2,
    caption = "Average Projected Goals by League"
  )

Average Projected Goals by League
league	Avg_Home_Proj_Goals	Avg_Away_Proj_Goals	Matches
German Bundesliga	1.59	1.23	1530
Barclays Premier League	1.57	1.22	1900
Italy Serie A	1.57	1.24	1900
Spanish Primera Division	1.53	1.10	1900
French Ligue 1	1.49	1.11	1900

This table shows no major differences between leagues in projected goals. I calculated the league-wide average over the 5 seasons in question for both home and away goals. The German Bundesliga has the highest projected home goals per game (1.59), while French Ligue 1 has the lowest (1.49). This gap of 0.10 goals is very small, suggesting that league strength is not a major factor in determining projected goals. Similar results apply for the away team, though the gap between the highest (Italian Serie A, 1.24) and the lowest (Spanish Primera Division, 1.10) is slightly bigger at 0.14. Adding home and away together, Ligue 1 has the lowest projections overall, while the Bundesliga has the highest. One interesting point is the difference between projected home and away goals in Spain. That difference is 0.43, which is greater than the 0.37 gap between the overall home and away means in the first summary table. The home/away gap also varies across the other leagues, from 0.33 in Italy to 0.38 in France, indicating a potential difference in the “home advantage” factor applied to the goal projections.

How are the Matches Actually Predicted?

Now that we have an understanding of the distribution of projected goals, let’s explore the documentation to understand where those numbers come from and how they are calculated. FiveThirtyEight’s article about their Club Soccer Predictions (https://fivethirtyeight.com/methodology/how-our-club-soccer-predictions-work/) explains what goes into their match forecasts. They incorporate ESPN’s SPI (Soccer Power Index), which is updated after each game based on how a team performed relative to expectations. They also factor in a league-specific home field advantage, confirming our earlier suspicions, and the match importance. From these inputs, they project the number of goals each team is expected to score. They then assume goal scoring follows a Poisson process and generate a Poisson distribution for each of the two projected scores, giving a likelihood of 0, 1, 2, … goals for each team. Finally, these are combined into a matrix of all possible scores, from which the likelihood of a win, draw, and loss is derived. The article illustrates this with an example score matrix.
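The last two steps of that description (Poisson distributions, then a score matrix) can be sketched in a few lines of base R. This is only an illustration under stated assumptions, not FiveThirtyEight’s implementation: it uses the overall mean projections from the first summary table (1.55 home, 1.18 away) as hypothetical inputs, treats the two teams’ goals as independent, cuts the matrix off at an arbitrary `max_goals`, and ignores any further adjustments the published model may make. The names `lambda_home`, `lambda_away`, `max_goals`, and `score_matrix` are my own.

```r
# Hypothetical inputs: overall mean projections from the first summary table
lambda_home <- 1.55
lambda_away <- 1.18
max_goals <- 10  # arbitrary cutoff; higher scores are very unlikely

# Probability of each team scoring 0, 1, ..., max_goals goals
p_home <- dpois(0:max_goals, lambda_home)
p_away <- dpois(0:max_goals, lambda_away)

# Rows = home goals, columns = away goals (assumes the two are independent)
score_matrix <- outer(p_home, p_away)
dimnames(score_matrix) <- list(home = 0:max_goals, away = 0:max_goals)

# Collapse the matrix into three outcome probabilities
p_home_win <- sum(score_matrix[lower.tri(score_matrix)])  # home goals > away goals
p_draw     <- sum(diag(score_matrix))                     # equal scores
p_away_win <- sum(score_matrix[upper.tri(score_matrix)])  # away goals > home goals

round(c(home_win = p_home_win, draw = p_draw, away_win = p_away_win), 3)
```

Replacing the two lambda values with a single match’s proj_score1 and proj_score2 gives that match’s matrix under the same simplifying assumptions.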

Are There Differences in Actual Goals Scored Between Home and Away Teams League by League?

We now have an understanding of how FiveThirtyEight projected goals scored and the probabilities of winning. The article specifically references a league-specific home-field advantage, so I next wanted to look at how large the actual home-field advantage is in each league.

### Same pivoting strategy at the start, but combining league and home/away
df_top_leagues %>%
  select(league, score1, score2) %>%
  pivot_longer(
    cols = c(score1, score2),
    names_to = "team_type",
    values_to = "actual_goals"
  ) %>%
  mutate(
    team_type = recode(
      team_type,
      score1 = "Home Team",
      score2 = "Away Team"
    )
  ) %>%
  
  ### Averages and Summary Statistics by league and home/away team
  group_by(league, team_type) %>%
  summarise(
    Min = min(actual_goals, na.rm = TRUE),
    Q1 = quantile(actual_goals, 0.25, na.rm = TRUE),
    Median = median(actual_goals, na.rm = TRUE),
    Mean = mean(actual_goals, na.rm = TRUE),
    Q3 = quantile(actual_goals, 0.75, na.rm = TRUE),
    Max = max(actual_goals, na.rm = TRUE),
    Matches = n(),
    .groups = "drop"
  ) %>%
  arrange(league, desc(team_type)) %>%
  ### Clean table
  kable(
    digits = 2,
    caption = "Actual Goals Scored by Home and Away Teams, League by League"
  )

Actual Goals Scored by Home and Away Teams, League by League
league	team_type	Q1	Median	Mean	Q3	Max	Matches
Barclays Premier League	Home Team	1	1	1.51	2	9	1900
Barclays Premier League	Away Team	0	1	1.23	2	9	1900
French Ligue 1	Home Team	1	1	1.49	2	9	1900
French Ligue 1	Away Team	0	1	1.15	2	7	1900
German Bundesliga	Home Team	1	1	1.68	2	8	1530
German Bundesliga	Away Team	0	1	1.34	2	6	1530
Italy Serie A	Home Team	1	1	1.57	2	7	1900
Italy Serie A	Away Team	0	1	1.31	2	7	1900
Spanish Primera Division	Home Team	1	1	1.49	2	8	1900
Spanish Primera Division	Away Team	0	1	1.15	2	6	1900

The first noticeable trend in the distribution of home and away goals is that in every league the first quartile is 1 goal for the home team and 0 for the away team. We can also see that in each league, the maximum goals scored by a home team is greater than or equal to that of an away team. However, every league has the same median (1) for both home and away goals, indicating that single-goal performances make up a large part of the distribution for either side.

League differences in home-field advantage show up most clearly in the means. Each league has a higher mean goals scored for the home team than for the away team. The Spanish Primera Division (which had the largest gap in projected goals), French Ligue 1, and German Bundesliga all have a difference of 0.34 between home and away goals scored. The English Premier League follows with 0.28 and the Italian Serie A with 0.26. While the documentation does not specifically mention where the league-specific home advantage is applied, it is likely related to these differences in mean home and away goals scored.
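To put the projected and actual home/away gaps side by side, the rounded league means from the two league tables can be entered by hand. This is a small sketch rather than part of the original analysis: the data frame `league_means` and its column names are made up for the illustration, and because the inputs are already rounded to two decimals, the gaps are approximate.

```r
# League means copied from the two league tables (rounded to 2 decimals)
league_means <- data.frame(
  league    = c("Barclays Premier League", "French Ligue 1", "German Bundesliga",
                "Italy Serie A", "Spanish Primera Division"),
  proj_home = c(1.57, 1.49, 1.59, 1.57, 1.53),
  proj_away = c(1.22, 1.11, 1.23, 1.24, 1.10),
  act_home  = c(1.51, 1.49, 1.68, 1.57, 1.49),
  act_away  = c(1.23, 1.15, 1.34, 1.31, 1.15)
)

# Home-minus-away gap in projected goals and in actual goals
league_means$proj_gap <- league_means$proj_home - league_means$proj_away
league_means$act_gap  <- league_means$act_home - league_means$act_away

# Sort by the projected gap, largest first
league_means[order(-league_means$proj_gap), c("league", "proj_gap", "act_gap")]
```

With these rounded inputs, the projected gap is larger than the actual gap in every league, which may be worth checking against the unrounded data.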

Projected vs Actual Goals Scored

The next step after exploring both projected and actual goals scored is to see how they relate. Are these predictions a good indicator of how many goals a team will score? We will explore this with a scatterplot, colored by home/away team, to see if there are differences in projection accuracy.

df_top_leagues %>%
  ### Same data aggregation structure, now to analyze differences in actual vs projected goals
  select(proj_score1, proj_score2, score1, score2) %>%
  pivot_longer(
    cols = everything(),
    names_to = c(".value", "team"),
    names_pattern = "(proj_score|score)([12])"
  ) %>%
  mutate(
    team_type = ifelse(team == "1", "Home Team", "Away Team")
  ) %>%
  ## Scatter Plot
  ggplot(aes(x = proj_score, y = score, color = team_type)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Projected vs Actual Goals Scored",
    x = "Projected Goals",
    y = "Actual Goals",
    # Colored by team type
    color = "Team Type"
  ) +
  theme_minimal()

The plot immediately shows the home-field advantage factor: the away team has more low projected-goal values than the home team. While the plot is crowded, both fitted lines slope upward. The trend is not overly strong, but the positive slope for both the home and away team indicates that the projected goals have at least some predictive value.

It is also interesting to point out that the trend is stronger for the home team than the away team, though this is a small difference. This merits further investigation into the home-field advantage factor.
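One way to put numbers on “stronger” is to compare the fitted slope and the correlation for each team type. The sketch below is an assumption about how that check could look, not something run in the original analysis. It uses a tiny made-up data frame, `toy_matches`, with the same column names as the dataset, so the printed values say nothing about the real data. It is written in base R so that it runs without extra packages.

```r
# Made-up rows with the dataset's column names (illustration only)
toy_matches <- data.frame(
  proj_score1 = c(1.2, 1.8, 2.4, 1.5, 0.9, 2.1),
  proj_score2 = c(1.1, 0.7, 0.6, 1.3, 1.6, 0.8),
  score1      = c(1, 2, 3, 1, 0, 2),
  score2      = c(1, 0, 1, 2, 2, 0)
)

# Stack home and away rows into one long data frame
long <- rbind(
  data.frame(team_type = "Home Team",
             proj_score = toy_matches$proj_score1, score = toy_matches$score1),
  data.frame(team_type = "Away Team",
             proj_score = toy_matches$proj_score2, score = toy_matches$score2)
)

# Slope of score ~ proj_score and Pearson correlation, per team type
trend_by_type <- do.call(rbind, lapply(split(long, long$team_type), function(d) {
  fit <- lm(score ~ proj_score, data = d)
  data.frame(
    team_type   = d$team_type[1],
    slope       = unname(coef(fit)["proj_score"]),
    correlation = cor(d$proj_score, d$score, use = "complete.obs")
  )
}))
trend_by_type
```

Swapping `toy_matches` for df_top_leagues would give the real slopes and correlations to compare between home and away teams.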

How Does Match Importance Relate to Goals Scored, by Home and Away Team?

The other factor that FiveThirtyEight includes in its projections is match importance, a measure of how much the match outcome would change each team’s statistical outlook for the season. For instance, a match that would determine whether a team wins the league is more important than a match for a team sitting in the middle of the table with no possibility of winning the league or getting relegated. I wanted to explore how this relates to actual goals scored, once again colored by home and away teams, to see if match importance is a factor in a team scoring more. A positive trend could indicate that teams push harder for goals in important matches.

### Same set up as before, but looking at match importance
df_top_leagues %>%
  select(importance1, importance2, score1, score2) %>%
  pivot_longer(
    cols = everything(),
    names_to = c(".value", "team"),
    names_pattern = "(importance|score)([12])"
  ) %>%
  mutate(
    team_type = ifelse(team == "1", "Home Team", "Away Team")
  ) %>%
  ## Same plotting structure
  ggplot(aes(x = importance, y = score, color = team_type)) +
  geom_jitter(alpha = 0.35, width = 0.5, height = 0) +
  geom_smooth(method = "loess", se = FALSE) +
  labs(
    title = "Match Importance vs Actual Goals Scored",
    x = "Match Importance",
    y = "Goals Scored",
    color = "Team Type"
  ) +
  theme_minimal()

From the scatterplot, there doesn’t appear to be any strong relationship between match importance and how many goals a team scored. Goals rise slightly as match importance increases, but the increase appears negligible. From here, I think a logical next step would be to investigate how match importance is applied to the projections, and to look at how much weight it carries when projected scores are calculated.