The phenomenon of home advantage, where the home team generally performs better than the visiting team, is well documented across many sports. In football the effect has been studied extensively, with works such as Pollard et al. (2017) and Gómez-Ruano et al. (2021) providing comprehensive overviews. Pollard attributes home advantage to several key factors, including crowd effects, travel effects, and psychological factors. This report explores some of the factors that shape the outcomes of German Bundesliga matches, with a specific focus on goals scored, team performance, and fan engagement. Addressing these questions involves examining the dynamics of home-field advantage, the effects of travel, and the impact of fan presence on player performance, measured here by the number of goals scored. For instance, does a larger home crowd correlate with a stronger performance by the home team? Are teams more consistent in certain phases of the season, such as early matchdays versus mid-season? The timing of a match within a season, represented by the matchday (e.g., Matchday 1 through Matchday 34 in a typical Bundesliga season), may also influence the total number of goals scored. Early in the season, teams may still be finding their rhythm, integrating new players, and adapting to tactical adjustments, potentially leading to more cautious play and fewer goals. Conversely, mid-season and end-of-season matches might show different goal patterns driven by factors such as fatigue or team form. Therefore, one key analysis examines whether distinct phases of the season are associated with variations in goal-scoring patterns. We also investigate how viewer attendance influences goal-scoring patterns, focusing on whether it affects goals scored by home teams (goals_home) differently from goals scored by away teams (goals_away).
The German Bundesliga dataset was chosen because it gives us, as statisticians, an opportunity to approach the data from the viewpoint of sports analysts. Moreover, the Bundesliga is known for its competitive play and large worldwide fan base, which makes it interesting to study how different factors affect its matches. The dataset is available on Kaggle and includes a range of information about Bundesliga matches, such as the season in which the match was played, the teams involved, the number of goals scored, the city or location where the match was played, match outcomes, and viewer statistics. The data are structured around match-level observations, with each entry representing a single match, and the response variables of interest are counts. The dataset spans 18 seasons, covering matches from 2009-2010 to 2022-2023, with over 6,000 games featuring 30 teams competing over the span of the given seasons. The data may have been compiled from official league sources, match reports, and other sports data providers. By applying multilevel models to these data, we aim to account for the hierarchical relationships within the dataset while assessing the impact of fan presence and seasonal phases on team performance, and so gain better insight into the dynamics of football matches and what influences them. By analyzing total goals scored across these phases, we can assess trends and anomalies, such as whether late-season matches consistently produce higher or lower goal counts.
Identifying the distribution of total goals per match helps reveal typical scoring patterns and informs predictions about match outcomes. The histogram in Figure 1 shows the distribution of total goals scored per match. Most matches have 2 or 3 total goals, and the frequency decreases as the number of goals increases. The distribution is right-skewed, with very few matches having more than 6 goals. Most matches are therefore low- to moderate-scoring, and high-scoring games are rare.
Understanding scoring patterns for home and away teams helps in analyzing team performance and identifying factors that influence match outcomes. The bar plot in Figure 2 shows that home teams most commonly score 0–2 goals, with higher goal counts being rare, indicating a right-skewed distribution and a tendency toward low-scoring matches. The bar plot in Figure 3 shows that away teams most frequently score 0–1 goals, with higher goal counts being even rarer, indicating that away teams tend to score less than home teams.
Analyzing match outcomes is essential to understanding the role of home-field advantage and how it influences team performance and results. Table 1 shows 703 home wins, 480 away wins, and 499 draws, indicating that home teams win considerably more often than away teams. The bar plot in Figure 3 shows that home wins are the most common match outcome, followed by draws, while away wins are the least frequent, highlighting the advantage of playing at home.
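To make the comparison in Table 1 concrete, the counts can be converted to shares of all matches. The short sketch below uses only the three counts quoted above; the variable names are illustrative.

```python
# Match outcome counts as reported in Table 1.
outcomes = {"home_win": 703, "away_win": 480, "draw": 499}

total = sum(outcomes.values())  # total number of matches in the table
shares = {name: count / total for name, count in outcomes.items()}

for name, share in shares.items():
    # Print each outcome as a percentage of all matches.
    print(f"{name}: {share:.1%}")
```

Home wins account for roughly 42% of the matches in the table, compared with roughly 29% for away wins and 30% for draws.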
Analyzing top-scoring teams and locations can help identify factors contributing to high-scoring matches and team dominance. Table 2 shows that Bochum and Fürth are the top locations for high-scoring matches, with average goals exceeding 3 per game. Figure 4 highlights the top 10 goal-scoring teams, with FC Bayern München leading by a wide margin, followed by Borussia Dortmund. These teams contribute the highest total goals across matches. The plot shows a clear gap between the top-performing teams and the rest, emphasizing the dominance of a few teams in goal scoring.
Figures 5 and 6 illustrate the distribution of viewer attendance (viewer) grouped by the number of goals scored by the home team and the away team, respectively. Both plots use violin shapes to show the density and spread of viewer numbers for each goal count. In Figure 5, viewer attendance shows a clear relationship with goals scored by the home team. Matches where the home team scores 0–2 goals exhibit greater variability in viewer attendance, as seen in the wider violins. As the home team scores more goals (3–5), the viewer distribution narrows, indicating more consistent attendance. High-scoring games (5+ goals) are fewer and show slightly higher median attendance, suggesting a potential association between larger audiences and increased scoring by the home team. This could reflect the “home advantage” effect, where higher attendance positively influences home team performance. In Figure 6, matches with 0, 1, or 2 away goals show slightly higher median attendance than matches with more away goals, suggesting that matches with fewer away goals tend to have slightly higher attendance. The violins are generally narrow across all goal categories, suggesting that viewer attendance is less strongly associated with away team performance. Even in high-scoring away games (5+ goals), the medians and densities are less prominent than for the home team, indicating that any association between audience size and away goals is weaker than the one observed for home goals.
Figures 5 and 6 highlighted the differential
impact of viewership on performance where matches played at home showed
a stronger association with viewership and goal-scoring, as compared to
matches played away. This hinted at the role of home advantage in
amplifying performance under higher viewership. Since our EDA identified
these intriguing patterns, we wanted to explore building models that
could test these relationships and determine whether the observed trends
held statistical significance. Poisson regression is an appropriate
choice for our analysis because it is specifically designed for count
data, allowing us to model the log of the expected counts as a linear
function of the predictors. We selected this model to investigate the
relationship between goals scored (home and away) and viewership across
seasons. Both response variables, goals_home and goals_away, are counts,
making Poisson regression a natural framework for this analysis. For our
first research question, we fitted two multilevel Poisson models to
examine home goals and away goals separately. These models include random
effects at the season level to capture baseline differences in
goal-scoring patterns across seasons, while estimating fixed effects for
viewership and other predictors to identify overall trends. This
approach accounts for the nested nature of the data—matches nested
within seasons—and ensures that the impact of viewership on home and
away performance is modeled accurately, reflecting season-specific
variability as highlighted in Figures 5 and 6.
In general, a response variable \(Y_i\) (here, the number of goals scored) that follows a Poisson distribution is written as: \[ Y_i \sim \text{Poisson}(\lambda_i) \] where \(\lambda_i\) is the expected (mean) number of events for observation \(i\).
The expected count \(\lambda_i\) is related to the predictors \(X_1, X_2, \ldots, X_k\) through a log link function: \[ \log(\lambda_i) = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \ldots + \beta_k X_{ki} \]
Equivalently, the expected count \(\lambda_i\) is expressed as: \[ \lambda_i = e^{\beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \ldots + \beta_k X_{ki}} \]
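The log link above can be sketched in a few lines of code. The coefficients and predictor values below are hypothetical and chosen only to illustrate the formula; they are not estimates from this report.

```python
import math

def expected_count(intercept, coefficients, predictors):
    """Return lambda = exp(beta_0 + sum_k beta_k * x_k) for one observation."""
    linear_predictor = intercept + sum(
        b * x for b, x in zip(coefficients, predictors)
    )
    return math.exp(linear_predictor)

# Hypothetical coefficients and predictor values, for illustration only.
beta_0 = 0.2
betas = [0.01, -0.05]
x = [30.0, 2.0]

lam = expected_count(beta_0, betas, x)

# Increasing X_1 by one unit multiplies lambda by exp(beta_1).
lam_plus_one = expected_count(beta_0, betas, [x[0] + 1.0, x[1]])
ratio = lam_plus_one / lam
print(f"lambda = {lam:.3f}, rate ratio for X_1 = {ratio:.4f}")
```

The ratio of the two expected counts equals \(e^{\beta_1}\), which is why coefficients in a Poisson model are usually interpreted multiplicatively after exponentiation.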
Scaling is an important step in statistical modeling, especially when variables have vastly different magnitudes or units. We scale the viewer variable to thousands to reduce the magnitude of its values and make the models easier to interpret. Because the raw viewer variable contains large values (e.g., 10,000 or more), its coefficient per single viewer would be very small and difficult to interpret directly. Scaling reduces the range of values, making coefficients more meaningful and easier to compare across variables.
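A minimal sketch of this rescaling follows. The attendance figures are hypothetical; the slope is the per-1,000-viewer coefficient from the home-goals model in this report, used here only to show that rescaling changes the unit of the coefficient and not the fitted relationship.

```python
import math

# Hypothetical raw attendance figures (number of spectators).
viewers_raw = [15000, 42500, 81365]

# Rescale to thousands, as done for the viewer variable in the models.
viewers_k = [v / 1000 for v in viewers_raw]

# A slope expressed per 1,000 viewers corresponds to a slope
# 1,000 times smaller when expressed per single viewer.
beta_per_thousand = 0.006713
beta_per_viewer = beta_per_thousand / 1000

# Both parameterizations contribute the same amount to the linear predictor.
for raw, scaled in zip(viewers_raw, viewers_k):
    assert math.isclose(beta_per_viewer * raw, beta_per_thousand * scaled)

print(viewers_k, beta_per_viewer)
```

Only the unit of the coefficient changes: 0.006713 per 1,000 viewers is the same effect as about 0.0000067 per viewer, but the former is far easier to read.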
\[ \log(\lambda_{ij}) = -0.1058 + 0.006713 \cdot \text{viewer}_i + u_j \]
Where:
\(\lambda_{ij}\): The expected number of home goals for match \(i\) in season \(j\).
\(\text{viewer}_i\): The scaled number of viewers (in thousands) for match \(i\).
\(u_j\): The random effect for season \(j\), capturing season-specific deviations.
The Mean vs Variance Plots and the Residual Plot were used to check for overdispersion. They showed a generally linear relationship between the mean and variance, closely aligning with the dashed blue line, which supports the assumption of equidispersion in Poisson regression. While some variance values deviate slightly from the line, the deviations are not substantial, suggesting that the variance does not drastically exceed the mean. Furthermore, the Residual Plot for the Home Team Model displays a structured pattern, but the spread of Pearson residuals is relatively contained, with no severe outliers, indicating that any overdispersion in the model is minimal. Therefore, while there may be slight overdispersion, it is not substantial enough to warrant a transition to a quasi-Poisson or negative binomial model.
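A common numerical complement to these plots is the Pearson dispersion statistic: the sum of squared Pearson residuals divided by the residual degrees of freedom. The sketch below is not the check used in this report. The observed counts and fitted means are hypothetical toy values, and the degrees-of-freedom count is only approximate for a multilevel model.

```python
# Hypothetical observed goal counts and fitted Poisson means for a few matches.
observed = [0, 1, 2, 1, 3, 0, 2, 4]
fitted = [1.2, 1.4, 1.6, 1.3, 1.9, 1.1, 1.7, 2.0]
n_params = 2  # intercept and one slope; an assumption for this toy example

def pearson_dispersion(y, mu, n_parameters):
    """Sum of squared Pearson residuals divided by residual degrees of freedom."""
    # For a Poisson model Var(Y) = mu, so the Pearson residual is
    # (y - mu) / sqrt(mu) and its square is (y - mu)^2 / mu.
    chi_square = sum((yi - mi) ** 2 / mi for yi, mi in zip(y, mu))
    return chi_square / (len(y) - n_parameters)

dispersion = pearson_dispersion(observed, fitted, n_params)
print(f"Pearson dispersion statistic: {dispersion:.2f}")
```

Values close to 1 are consistent with equidispersion, while values well above 1 would point toward a quasi-Poisson or negative binomial model.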
Interpretation of Model 1 Coefficients:
Thus, the model suggests that increased viewership is associated with slightly better home team performance, which is consistent with a psychological boost or added motivation from the crowd.
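The fixed-effect estimates in the Model 1 equation can be translated to the goal scale as sketched below. The coefficients are those reported in the equation; the attendance levels in the loop are hypothetical, and the season effect \(u_j\) is set to zero, i.e., an average season.

```python
import math

# Fixed-effect estimates from the Model 1 equation (home goals).
INTERCEPT_HOME = -0.1058
SLOPE_HOME = 0.006713  # per 1,000 viewers

def expected_home_goals(viewers_thousands, season_effect=0.0):
    """Expected home goals for an attendance (in thousands) and season effect u_j."""
    return math.exp(INTERCEPT_HOME + SLOPE_HOME * viewers_thousands + season_effect)

baseline = expected_home_goals(0.0)        # exp(intercept)
rate_ratio = math.exp(SLOPE_HOME)          # multiplicative change per 1,000 viewers
percent_change = (rate_ratio - 1.0) * 100.0

print(f"baseline: {baseline:.3f}, change per 1,000 viewers: {percent_change:.2f}%")

# Hypothetical attendance levels, in thousands, for illustration only.
for attendance in (20, 40, 60, 80):
    print(attendance, round(expected_home_goals(attendance), 3))
```

Under these estimates the baseline is about 0.90 home goals at zero attendance, and each additional 1,000 viewers is associated with an increase of about 0.67% in expected home goals.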
The random effects in the model capture season-specific deviations in the baseline goal-scoring rate. These deviations reflect unobserved factors, such as differences in team performance, strategies, or rule changes across seasons.
This means that, due to random effects, the baseline goal-scoring rate can vary by as much as 54.3% (up or down) across seasons.
Range of variability:
Practical implication: The variability between seasons highlights the importance of accounting for unobserved season-specific factors, such as differences in team composition, tactics, or even rule changes, that can significantly influence scoring patterns.
Model 1 Summary: The viewer effect is highly significant (\(p = 2.31 \times 10^{-8}\)), showing that higher viewer attendance has a strong positive association with the home team’s scoring. The intercept (\(p = 0.435\)) is not significant, implying that the baseline log expected goals for the home team at zero attendance is not statistically different from zero, i.e., the baseline expected count is not distinguishable from one goal per match. In conclusion, the results indicate that higher attendance is significantly associated with better home team performance, consistent with the home advantage effect, while season-specific factors introduce variability in scoring patterns.
\[ \log(\lambda_{ij}) = 0.0585 - 0.004001 \cdot \text{viewer}_i + u_j \]
Where:
\(\lambda_{ij}\): Expected number of away goals for match \(i\) in season \(j\).
\(\text{viewer}_i\): The scaled number of viewers (in thousands) for match \(i\).
\(u_j \sim N(0, \sigma^2)\): The random effect for season \(j\), capturing season-specific deviations in baseline goal-scoring rates.
To check for overdispersion in Model 2, we again used the Mean vs Variance Plot, which indicates that the variance of the observed values generally aligns with the mean, as expected under the Poisson relationship. However, there are slight deviations, with some data points exhibiting marginally higher variance than the mean. The Residual Plot for Model 2 also showed structured patterns in the Pearson residuals, with some divergence at higher fitted values. Despite these patterns, the residual spread remains relatively contained, and there are no extreme outliers. In summary, while Model 2 exhibits minor overdispersion, Poisson regression remains appropriate.
Log-scale interpretation: The intercept represents the log of the expected number of away goals when there are no viewers (\(\text{viewer} = 0\)) and the season-specific random effect is at its average value (\(u_j = 0\)).
Exponentiated interpretation: \[ e^{0.0585} \approx 1.06 \] This means the baseline expected number of away goals is approximately 1.06 goals per game when there are no viewers. This provides a baseline to assess how viewer attendance affects away team performance.
Contextual Insight: The intercept suggests that, in a hypothetical scenario with no viewers, the away team is expected to score just over 1 goal on average. This reflects the baseline performance of away teams without the influence of a home crowd or attendance effects.
Log-scale interpretation: For every additional 1,000 viewers, the log of the expected number of away goals decreases by 0.004001.
Exponentiated interpretation: \[ e^{-0.004001} \approx 0.996 \] This means that for every 1,000 additional viewers, the expected number of away goals decreases by approximately 0.4%.
Contextual Insight: The negative coefficient for viewers implies that larger home crowds negatively impact the away team’s scoring ability. This aligns with the “home advantage” theory, where higher attendance motivates the home team and places psychological pressure on the away team, leading to a reduction in their performance.
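Because the viewer effect acts multiplicatively, larger attendance differences compound rather than add. The sketch below applies the Model 2 coefficient to a few hypothetical attendance increases; the function name is illustrative.

```python
import math

SLOPE_AWAY = -0.004001  # Model 2 viewer coefficient, per 1,000 viewers

def percent_change_away(extra_thousands):
    """Percent change in expected away goals for an attendance increase (in thousands)."""
    return (math.exp(SLOPE_AWAY * extra_thousands) - 1.0) * 100.0

# Hypothetical attendance increases, in thousands.
for extra in (1, 10, 30):
    print(f"+{extra}k viewers: {percent_change_away(extra):.2f}%")
```

An increase of 10,000 viewers corresponds to a reduction of about 3.9% in expected away goals, slightly less than ten times the 0.4% per-1,000 effect because the changes multiply.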
This means that, due to random effects, the baseline goal-scoring rate for away teams can vary by as much as 54% (up or down) across seasons.
Range of variability:
Contextual Insight: The random effects reflect that the away goal-scoring baseline varies significantly across seasons. Some seasons may favor higher-scoring games for away teams (e.g., due to more attacking tactics), while others may see reduced scoring due to tighter defenses or other contextual factors like rule changes.
Model 2 Summary
The viewer effect (\(p = 0.00374\)) is statistically significant at the \(1\%\) level, indicating strong evidence that higher viewer attendance is negatively associated with the away team’s scoring. The intercept, however, is not significant (\(p = 0.67189\)), meaning that the baseline log expected away goals at zero attendance is not significantly different from zero, i.e., the baseline expected count is not distinguishable from one goal per match. The model highlights the association between larger home crowds and reduced away team performance, supporting the theory of a “home advantage.” It also underscores the importance of accounting for season-to-season differences, as scoring dynamics vary from one season to another. In conclusion, the results show that higher viewer attendance is significantly associated with lower away team scoring, while season-specific factors cause notable variability in away goal-scoring patterns.
Figure 7 shows the average goals scored per matchday, revealing notable fluctuations throughout the season. Early matchdays show relatively stable scoring, while late matchdays show a sharp increase in average goals, particularly towards the end of the season. This pattern suggests that late matchdays may be affected by factors such as increased competitiveness or tactical shifts, which motivates Research Question 2 on the influence of matchday on goal counts. Additionally, Figure 8 shows the average goals per matchday across multiple seasons, revealing substantial variability both within and between seasons. Some seasons exhibit consistent scoring trends across matchdays, while others show sharp fluctuations, particularly towards the end of the season. This suggests that matchday-specific factors, such as team dynamics or end-of-season competitiveness, may influence goal counts differently each year. This variation emphasizes the importance of using multilevel modeling when analyzing matchday data, as matchdays are nested within seasons. Multilevel models account for this structure by allowing for random effects at the season level, capturing differences in baseline goal-scoring patterns across seasons. At the same time, they estimate a fixed effect for matchday number to explore overall trends across all seasons. By explicitly modeling this nested structure, multilevel models address the non-independence of observations within seasons, which simpler regression methods cannot handle effectively.
\[ \log(\lambda_{ij}) = 0.7100 + 0.001656 \cdot \text{matchday\_nr}_i + u_j \]
Where:
\(\lambda_{ij}\): Expected number of total goals for match \(i\) in season \(j\).
\(\text{matchday\_nr}_i\): The match day number (0 for the starting day of the league, followed by subsequent days).
\(u_j \sim N(0, \sigma^2)\): The random effect for season \(j\), capturing season-specific deviations in the baseline goal-scoring rate.
Similarly, when checking for overdispersion in Model 3, the variance of total goals appeared to grow slightly faster than the mean. The points exhibit moderate spread but do not show extreme deviations, and the pattern indicates that the variance scales roughly linearly with the mean rather than exponentially. The residual plot also showed a slightly structured pattern with heteroscedasticity (the spread increases at higher fitted values), but the residual spread is relatively controlled and does not exhibit severe skewness or excessive outliers.
Interpretation of Model 3:
Contextual Insight: The intercept indicates that the league typically begins with matches averaging around 2 goals on the opening day. This could reflect heightened intensity or evenly matched teams on the first day of the season.
Log-scale interpretation: For every additional match day, the log of the expected total number of goals increases by 0.001656.
Exponentiated interpretation: \[ e^{0.001656} \approx 1.00166 \] This means that for every additional match day, the expected total number of goals increases by approximately 0.166%.
Larger increments:
Contextual Insight: The match day effect is positive but very small, indicating that as the league progresses, the number of goals per match increases slightly. This could reflect factors such as teams settling into form, players gaining fitness, or changes in league dynamics (e.g., more open play toward the end of the league).
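To see what the small per-matchday effect amounts to over a whole season, the Model 3 estimates can be evaluated at the first and last matchday. The sketch assumes 34 matchdays coded 0 to 33, which follows the coding described above but is still an assumption about the data. The season effect \(u_j\) is set to zero.

```python
import math

# Fixed-effect estimates from the Model 3 equation (total goals).
INTERCEPT_TOTAL = 0.7100
SLOPE_MATCHDAY = 0.001656

def expected_total_goals(matchday_nr, season_effect=0.0):
    """Expected total goals on a given matchday for season effect u_j."""
    return math.exp(INTERCEPT_TOTAL + SLOPE_MATCHDAY * matchday_nr + season_effect)

first = expected_total_goals(0)
last = expected_total_goals(33)  # assumes 34 matchdays coded 0..33
season_ratio = last / first

print(f"first: {first:.3f}, last: {last:.3f}, ratio: {season_ratio:.4f}")
```

Under these estimates the expected total rises from about 2.03 goals on the opening matchday to about 2.15 on the last, an increase of roughly 5.6% over the season.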
This means that, due to random effects, the baseline total goal-scoring rate can vary by up to 50.2% (up or down) across seasons.
Range of variability:
Contextual Insight: Random effects reflect significant variability in scoring patterns between seasons, influenced by unobserved factors such as rule changes, playing styles, or changes in team quality.
Model 3 Summary:
This model examines how total goal counts vary across match days in the league, accounting for season-specific variability. The key findings are as follows. The intercept (\(p = 2.6 \times 10^{-9}\)) is highly significant, confirming that the baseline scoring rate on the starting day of the league is statistically distinguishable from one goal per match. The match day effect (\(p = 0.337\)) is not statistically significant, suggesting that the small positive trend in scoring across match days may not be consistent or reliable. While the league may exhibit slight increases in scoring as it progresses, the lack of statistical significance for the match day effect suggests that other factors, such as matchups, tactics, or league rules, may have a stronger influence on scoring trends.
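One way to gauge the uncertainty behind a non-significant estimate is to back out an approximate confidence interval from the reported p-value. The sketch below assumes the p-value comes from a two-sided Wald z-test, which the report does not state, so the resulting interval is illustrative only. The helper name is hypothetical.

```python
import math
from statistics import NormalDist

def wald_interval_from_p(estimate, p_value, level=0.95):
    """Back out an approximate CI from a two-sided Wald p-value (assumed test)."""
    std_normal = NormalDist()
    # |z| implied by the two-sided p-value.
    z_observed = std_normal.inv_cdf(1.0 - p_value / 2.0)
    std_error = abs(estimate) / z_observed
    # Critical value for the requested confidence level.
    z_crit = std_normal.inv_cdf(0.5 + level / 2.0)
    return estimate - z_crit * std_error, estimate + z_crit * std_error

# Match day coefficient and p-value as reported for Model 3.
low, high = wald_interval_from_p(0.001656, 0.337)
rate_ratio_interval = (math.exp(low), math.exp(high))
print(rate_ratio_interval)
```

Under this assumption the interval for the per-matchday rate ratio includes 1, which matches the conclusion that the trend across match days cannot be distinguished from no trend.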
Our analysis using multilevel Poisson models (model_home, model_away, and model_matchday) provides insight into how viewer attendance, matchday progression, and season-specific factors relate to goal-scoring dynamics in the Bundesliga.
The home goals model (model_home) revealed a strong and statistically significant positive relationship between viewer attendance and the number of goals scored by home teams. Higher attendance is consistently associated with an increase in home goals, supporting the idea of a ‘home advantage’ amplified by larger crowds. This phenomenon likely arises from a combination of psychological motivation provided by the home crowd, tactical advantages of familiarity with the venue, and the intimidating effect on visiting teams. The result is consistent with the view that playing at home, especially in front of a large audience, improves performance in terms of goals scored.
Conversely, the away goals model (model_away) shows a statistically significant negative association between viewer attendance and the number of goals scored by away teams. Larger home crowds appear to place psychological and tactical pressure on visiting teams, reducing their performance. The findings underscore how viewer attendance can exacerbate the challenges of playing away, highlighting the importance of mitigating these pressures through preparation and strategy. This inverse relationship between attendance and away goals further reinforces the broader impact of the home crowd on match outcomes.
The matchday progression model (model_matchday), which relates matchday number to total goals scored within a season, found no statistically significant trend linking the chronological progression of the season to variations in goal-scoring rates. While the estimated effect is slightly positive, indicating that goals may increase marginally as the season progresses, it was not statistically significant.
This suggests that other unobserved factors, such as team form, tactical decisions, or environmental conditions (e.g., weather or pitch quality), likely have a more substantial influence on goal-scoring than the timing of matches within the season. Overall, this analysis provides nuanced insights into the factors influencing football match outcomes. Viewer attendance plays a significant role: it is associated with better home team performance and worse away team performance. However, the timing of matches within a season, as reflected by matchday progression, appears to have minimal direct impact on total goals scored. These findings challenge certain traditional assumptions, such as a steady increase in goal-scoring over a season, and emphasize the importance of considering deeper strategic, psychological, and contextual factors in understanding league dynamics. This work lays the groundwork for further investigation into how teams might adapt to maximize their performance throughout a season. Understanding these dynamics can aid team planning and strategic decisions.
This analysis opens the door to several questions that could be explored in future research. For instance, it would be valuable to examine how large crowds affect individual players, with home players potentially feeling energized and away players potentially struggling under pressure. Adding factors such as weather, pitch conditions, or travel fatigue could give a fuller picture of what influences match outcomes. Extending the hierarchical models with additional levels, such as team-level random effects, could capture variation between teams, matches, and seasons and offer more nuanced insights. Similarly, time-series models could be used to explore trends over a season or across multiple years, helping to identify patterns that develop over time. Expanding the analysis to include these methods and data from other leagues or tournaments could help reveal whether these patterns hold elsewhere or are unique to specific contexts. This could provide teams and organizers with useful insights to improve strategies and performance.
Lack of External Factors Information:
The analysis does not account for external variables such as player
availability, tactical strategies, injuries, or environmental conditions
like weather, which could significantly impact match performance and the
number of goals scored.
Data Collection Transparency:
The dataset lacks clear documentation on how the data was collected,
processed, and verified. This raises concerns about the potential for
bias or inaccuracies in the data, such as errors in recording match
statistics or inconsistencies in defining variables.
Limited Feature Scope:
While the dataset provides core match details like goals, teams, and
matchdays, it does not include other critical variables like
substitutions, possession percentages, or fouls, which could provide
deeper insights into team and player performance.
Viewership Context:
The dataset does not differentiate between viewership types (e.g.,
in-stadium audience vs. broadcast audience) or their influence on
specific teams, which could offer additional context for understanding
the home-away dynamic.
Static Nature of Data:
The dataset focuses on match outcomes and aggregated statistics but does
not account for dynamic events within matches (e.g., key moments, red
cards, or momentum shifts) that might affect goal patterns or win
probabilities.
Generalizability:
The dataset is specific to Bundesliga matches and seasons, so the
findings may not generalize outside of Germany. Differences in play
style, team dynamics, audience behavior, and league structures in other
countries may mean these insights are not universally
applicable.
Pollard, R. (2008). Home advantage in football: A current review of an unsolved puzzle. The Open Sports Sciences Journal.
Gómez-Ruano, M. A., Pollard, R., and Lago-Peñas, C. (2021). Home advantage in sport. New York: Routledge.