Introduction:

Bayern Munich is playing at the Allianz Arena next week. This made me think about the phenomenon that is home advantage. Is it real? As I am about to start my second semester as a statistics and data science student, I realized I’m slowly able to answer these types of questions.

Using 32 seasons' worth of Bundesliga match data from football-data.co.uk (1993–2025), this report estimates how much playing at home affects Bayern Munich’s results. I specifically set out to answer two questions:

- Exactly how many goals is the “Home Advantage” worth?
- Did this force disappear when stadiums were emptied during the global pandemic?

Process:

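In order to make this project possible, I used the tidyverse collection of packages in R. First, I compiled all the Bundesliga seasons into one data set using the purrr package:

library(tidyverse) # loads purrr, dplyr, stringr, ggplot2 and (since tidyverse 2.0.0) lubridate

# list.files() expects a regular expression, not a glob, so match the ".csv" suffix explicitly
files <- list.files(path = ".", pattern = "\\.csv$", full.names = TRUE)

# Read every season file and record which file each row came from
seasons <- files |> 
  set_names() |> 
  map_df(read.csv, .id = "source_file")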

With the help of the dplyr package I then extracted all the relevant information for the question:

bayern_munich_seasons <- seasons |>
  select(HomeTeam, AwayTeam, FTHG, FTAG, Date, source_file) |>
  filter(HomeTeam == "Bayern Munich" | AwayTeam == "Bayern Munich")
head(bayern_munich_seasons)
##        HomeTeam      AwayTeam FTHG FTAG     Date     source_file
## 1 Bayern Munich      Freiburg    3    1 07/08/93 ./1993_1994.csv
## 2    Leverkusen Bayern Munich    2    1 14/08/93 ./1993_1994.csv
## 3 Bayern Munich       Dresden    5    0 21/08/93 ./1993_1994.csv
## 4     Stuttgart Bayern Munich    2    2 28/08/93 ./1993_1994.csv
## 5 Bayern Munich       Leipzig    3    0 01/09/93 ./1993_1994.csv
## 6      Duisburg Bayern Munich    2    2 04/09/93 ./1993_1994.csv
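
Before modelling, it can help to check that the filter kept what we expect. The sketch below is not part of the original analysis: it uses a small made-up data frame (`toy` and its rows are hypothetical, only the column names follow the real files) and counts Bayern's home and away matches per file. With 18 teams in the league, the real data should show 17 home and 17 away matches per season.

```r
library(dplyr)

# Hypothetical toy rows that mimic the football-data.co.uk column names
toy <- data.frame(
  HomeTeam    = c("Bayern Munich", "Leverkusen", "Bayern Munich", "Dortmund"),
  AwayTeam    = c("Freiburg", "Bayern Munich", "Dresden", "Bayern Munich"),
  source_file = c("./1993_1994.csv", "./1993_1994.csv",
                  "./1994_1995.csv", "./1994_1995.csv")
)

# Count Bayern's home and away matches within each season file
matches_per_season <- toy |>
  group_by(source_file) |>
  summarise(
    home_matches = sum(HomeTeam == "Bayern Munich"),
    away_matches = sum(AwayTeam == "Bayern Munich"),
    .groups = "drop"
  )

matches_per_season
```

Running the same `group_by()` and `summarise()` on bayern_munich_seasons would flag any season with missing or duplicated matches.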

Exactly how many goals is the “Home Advantage” worth?

I answer the first question with a linear regression model. To prepare for it, I extended bayern_munich_seasons with dplyr functions, so that the match location and Bayern Munich’s goal difference are easily available for every game:

modified_bayern_munich_seasons <- bayern_munich_seasons |> mutate(
  Is_Home = ifelse(HomeTeam == "Bayern Munich", 1, 0),
  Goal_Difference = ifelse(Is_Home == 1, FTHG - FTAG, FTAG - FTHG)
)
head(modified_bayern_munich_seasons)
##        HomeTeam      AwayTeam FTHG FTAG     Date     source_file Is_Home
## 1 Bayern Munich      Freiburg    3    1 07/08/93 ./1993_1994.csv       1
## 2    Leverkusen Bayern Munich    2    1 14/08/93 ./1993_1994.csv       0
## 3 Bayern Munich       Dresden    5    0 21/08/93 ./1993_1994.csv       1
## 4     Stuttgart Bayern Munich    2    2 28/08/93 ./1993_1994.csv       0
## 5 Bayern Munich       Leipzig    3    0 01/09/93 ./1993_1994.csv       1
## 6      Duisburg Bayern Munich    2    2 04/09/93 ./1993_1994.csv       0
##   Goal_Difference
## 1               2
## 2              -1
## 3               5
## 4               0
## 5               3
## 6               0

Now we have a clean data set, where the column “Is_Home” tells us if Bayern played at home (1) or away (0). For each game we also have the final goal difference.
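
A quick way to get a feeling for the two new columns is to average the goal difference per location. The sketch below uses only the six rows printed above, so the numbers say nothing about the full data set; the names `first_six` and `mean_by_location` are made up for this illustration.

```r
library(dplyr)

# The six rows shown in the head() output above (Is_Home, Goal_Difference)
first_six <- data.frame(
  Is_Home         = c(1, 0, 1, 0, 1, 0),
  Goal_Difference = c(2, -1, 5, 0, 3, 0)
)

# Average goal difference and match count for away (0) and home (1)
mean_by_location <- first_six |>
  group_by(Is_Home) |>
  summarise(
    Mean_GD = mean(Goal_Difference),
    n       = n(),
    .groups = "drop"
  )

mean_by_location
```

Applied to modified_bayern_munich_seasons, the same summary gives the two averages that a regression on Is_Home compares.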

All that is left is to see whether there is some kind of advantage, i.e. a change in goal difference when playing at home. To make this visible to the naked eye, I plotted the data using the ggplot2 package:

plot_bayern_munich_performances <- ggplot(modified_bayern_munich_seasons, aes(x = Is_Home, y = Goal_Difference)) +
  scale_x_continuous(
    breaks = c(0,1),
    labels = c("Away", "Home")
  ) +
  geom_count(colour = "red") +
  theme_minimal() + labs(
    title = "Bayern Munich Home Advantage (1993-2025)",
    x = "Match Location",
    y = "Goal Difference",
    size = "Number of Matches"
  ) + geom_hline(yintercept = 0, linetype = "dashed", color = "gray50") 
plot_bayern_munich_performances

At first glance, the graph suggests that there is some kind of advantage when playing at home. Is this advantage quantifiable though? To find out, I fitted a linear regression model to modified_bayern_munich_seasons:

linear_regression <- lm(formula = Goal_Difference ~ Is_Home, data = modified_bayern_munich_seasons)
LR_result <- summary(linear_regression)
LR_result
## 
## Call:
## lm(formula = Goal_Difference ~ Is_Home, data = modified_bayern_munich_seasons)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.8658 -0.8658  0.1342  1.1342  6.1342 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.86581    0.08175  10.591   <2e-16 ***
## Is_Home      1.00000    0.11561   8.649   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.907 on 1086 degrees of freedom
## Multiple R-squared:  0.06445,    Adjusted R-squared:  0.06359 
## F-statistic: 74.81 on 1 and 1086 DF,  p-value: < 2.2e-16

Findings for the first question:

By looking at the “Estimate” column, we see that the intercept is \(\beta_0 \approx 0.87\) and the slope is \(\beta_1 \approx 1.00\). What does this mean? When playing away, Bayern Munich has an average goal difference of about +0.87. When they play at home, their average goal difference rises to about +1.87. In other words, playing at home is worth roughly one full goal of goal difference per match, whether through scoring more, conceding fewer, or both. The p-value is below \(2 \times 10^{-16}\): if match location truly had no effect, a gap this large would almost never arise by chance alone. Note that this is not the probability that the result is random. Moving on to the Multiple R-squared: \(R^2 \approx 0.065\). This tells us that match location explains around 6.5% of the variance in goal difference, which means the other 93.5% is left to everything the model does not include: tactics, weather, players, etc. For a single yes-or-no variable, that is a notable share.
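
Because Is_Home only takes the values 0 and 1, the fitted model reduces to two group averages, and the standard error gives a rough range for the home effect. The following is a worked sketch using the rounded values from the summary output; the 1.96 multiplier is a normal approximation, an assumption that seems reasonable with 1086 degrees of freedom.

\[
\begin{aligned}
\text{Away } (\text{Is\_Home} = 0):&\quad \widehat{GD} = \beta_0 \approx 0.87 \\
\text{Home } (\text{Is\_Home} = 1):&\quad \widehat{GD} = \beta_0 + \beta_1 \approx 0.87 + 1.00 = 1.87 \\
\text{Approx. 95\% interval for } \beta_1:&\quad 1.00 \pm 1.96 \times 0.116 \approx (0.77,\ 1.23)
\end{aligned}
\]

In R, calling confint() on the fitted model object should return the exact version of this interval.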

Did this force disappear when stadiums were emptied during the global pandemic?

Moving on to the second question, I was first curious how different the graph would look when focusing only on the 20/21 season. I achieved this by using dplyr, lubridate, base R and ggplot2 functions. While it is visually interesting, one season (n=34) lacks statistical power and is more likely to be dominated by noise from a few outlier performances. On top of that, the two plots rest on very different sample sizes (n=34 vs. n=1088), so they cannot be compared directly. I only did this to get a first impression.

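modified_bayern_munich_seasons_20_21 <- modified_bayern_munich_seasons |>
  mutate(Date = dmy(Date)) |> # parse day/month/year strings into Date objects
  # compare Date with Date (POSIXct counts seconds, Date counts days)
  filter(Date >= as.Date("2020-09-18") & Date <= as.Date("2021-05-22"))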
head(modified_bayern_munich_seasons_20_21)
##        HomeTeam      AwayTeam FTHG FTAG       Date     source_file Is_Home
## 1 Bayern Munich    Schalke 04    8    0 2020-09-18 ./2020_2021.csv       1
## 2    Hoffenheim Bayern Munich    4    1 2020-09-27 ./2020_2021.csv       0
## 3 Bayern Munich        Hertha    4    3 2020-10-04 ./2020_2021.csv       1
## 4     Bielefeld Bayern Munich    1    4 2020-10-17 ./2020_2021.csv       0
## 5 Bayern Munich Ein Frankfurt    5    0 2020-10-24 ./2020_2021.csv       1
## 6       FC Koln Bayern Munich    1    2 2020-10-31 ./2020_2021.csv       0
##   Goal_Difference
## 1               8
## 2              -3
## 3               1
## 4               3
## 5               5
## 6               1
plot_covid <- ggplot(modified_bayern_munich_seasons_20_21, aes(x = Is_Home, y = Goal_Difference)) +
  scale_x_continuous(
    breaks = c(0,1),
    labels = c("Away", "Home")
  ) +
  geom_count(colour = "red") +
  theme_minimal() + labs(
    title = "Bayern Munich 2020/21 Performance",
    x = "Match Location",
    y = "Goal Difference",
    size = "Number of Matches"
  ) + geom_hline(yintercept = 0, linetype = "dashed", color = "gray50") 
plot_covid

There seems to be some difference, but drawing conclusions purely from the graph would be misleading.
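
One way to go beyond the graph, which this report does not do, would be a regression with an interaction term: it asks whether the home effect itself differs between regular seasons and the empty-stadium season. The sketch below uses invented numbers, and the column `Is_Covid` is a hypothetical flag that would have to be built first (for example from source_file). Treating 20/21 as the only empty-stadium season is also a simplification, since matches behind closed doors already began in spring 2020.

```r
# Hypothetical toy data: four matches per combination, values invented
toy <- data.frame(
  Is_Home  = rep(c(0, 1, 0, 1), each = 4),
  Is_Covid = rep(c(0, 0, 1, 1), each = 4),
  Goal_Difference = c(0, 1, 1, 2,   # away, regular season
                      1, 2, 3, 2,   # home, regular season
                      0, 2, 1, 1,   # away, empty stadium
                      1, 2, 1, 2)   # home, empty stadium
)

# Is_Home * Is_Covid expands to both main effects plus their interaction
interaction_model <- lm(Goal_Difference ~ Is_Home * Is_Covid, data = toy)

# The interaction coefficient is the change in home advantage:
# (home - away with empty stadiums) minus (home - away in regular seasons)
coef(interaction_model)
```

A clearly negative `Is_Home:Is_Covid` estimate with a small p-value would point towards a weaker home advantage in empty stadiums. With only 34 matches in one season, the real data may well be too noisy to show it.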

In order to properly argue whether there is a difference or not, I decided to look at the median goal difference in each of the 32 seasons, as the median is a robust measure. Note that this median is taken over all of Bayern's matches in a season, home and away together. I decided to plot it (using ggplot2), as that makes it easier to reach a final statement. Before doing this, I needed to edit the modified data set once more to get the median goal difference for each season (using dplyr functions). I used a regular expression (regex) to clean the file names, transforming strings like ‘./2020_2021.csv’ into a more readable ‘2020/2021’ format for the plot axis.

median_goal_difference <- modified_bayern_munich_seasons |>
  group_by(source_file) |>
  summarize(Median_GD = median(Goal_Difference)) |>
  mutate(
    # str_extract() returns a character vector; str_extract_all() would return a list
    season = str_extract(source_file, "[[:alnum:]]+_[[:alnum:]]+"),
    season = str_replace_all(season, pattern = "_", replacement = "/")
  )
head(median_goal_difference)
## # A tibble: 6 × 3
##   source_file     Median_GD season   
##   <chr>               <dbl> <chr>    
## 1 ./1993_1994.csv       0.5 1993/1994
## 2 ./1994_1995.csv       0   1994/1995
## 3 ./1995_1996.csv       1   1995/1996
## 4 ./1996_1997.csv       1   1996/1997
## 5 ./1997_1998.csv       1   1997/1998
## 6 ./1998_1999.csv       2   1998/1999
median_GD_plot <- ggplot(median_goal_difference, aes(x = season, y = Median_GD, group = 1)) +
  geom_line(color = "lightblue", linetype = "dashed", linewidth = 0.75) +
  geom_point(color = "darkred", size = 2.5) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
  labs(
    x = "Season",
    y = "Median Goal Difference",
    title = "Median Goal Difference (1993-2025)"
  )
median_GD_plot
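
A season-wide median pools home and away matches, so a complementary view is the gap between the average home and the average away goal difference in each season. Here is a minimal sketch with invented numbers; the `toy` data frame and the column names `Mean_Home_GD`, `Mean_Away_GD` and `Home_Gap` are assumptions made for this illustration.

```r
library(dplyr)

# Hypothetical toy data: four matches per season, values invented
toy <- data.frame(
  source_file     = rep(c("./2019_2020.csv", "./2020_2021.csv"), each = 4),
  Is_Home         = rep(c(1, 0), times = 4),
  Goal_Difference = c(3, 1, 2, 0, 2, 1, 1, 2)
)

# Average goal difference at home and away per season, then the gap between them
home_gap <- toy |>
  group_by(source_file) |>
  summarise(
    Mean_Home_GD = mean(Goal_Difference[Is_Home == 1]),
    Mean_Away_GD = mean(Goal_Difference[Is_Home == 0]),
    .groups = "drop"
  ) |>
  mutate(Home_Gap = Mean_Home_GD - Mean_Away_GD)

home_gap
```

Plotting `Home_Gap` per season for the real data would show directly whether the 20/21 gap is unusual compared with the other 31 seasons.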

Hypothesis for the second question:

One could argue that home advantage vanished when the pandemic hit: Bayern had been on a three-season streak with a median goal difference of two, and it dropped to one in the 20/21 season. I do not think this conclusion holds, though. Judging by the last 32 seasons, the median goal difference has been one the majority of the time (specifically ~59% of the time), so a value of one is the norm rather than a collapse. This median also pools home and away matches, so it reflects overall performance more than the home/away gap itself. On this evidence, I believe the home advantage did not disappear during the 20/21 season. This could be due to the players being more familiar with the Allianz Arena pitch than with any other, or simply feeling more comfortable and confident when playing at their home ground.

Conclusion:

After reading this project, you have hopefully learned something. Next time you want to bet on Bayern Munich to score more goals than the opposing team, remember that match location alone explains roughly 6.5% of the variation in goal difference and is worth about one goal on average.

Limitations:

While this project has produced some interesting numbers and information, it has a few limitations:

- The project only focuses on one team and only one league that they play in. When looking at other teams or including Bayern Munich’s performances in the Champions League, the outcome could be different.
- Bayern Munich has had multiple home locations across the 32 seasons, meaning this project generalizes the meaning of “home location.”