This reproducible document models the fundamental strength of every national soccer team ever to have played a match. Using a comprehensive dataset of all international soccer matches since 1872, we employ a multilevel negative binomial regression model to assess team strength, controlling for strength of the opponent, the home-field advantage and the team-specific home-field advantage (“stadium effect”). A multilevel approach allows us to “nest” together teams of the same nation but different years, which squares neatly with our intuition that national teams employ similar players, coaches and strategies in nearby years. We model goals scored (and allowed) as a count with a negative binomial distribution, a generalization of the Poisson distribution for when the tendency to score (or allow) goals may be heterogeneous across teams. Among our findings are that 2018 Germany may be slightly over-hyped, that Bolivia easily has the greatest stadium effect (perhaps due to their high elevation of play) and that 1908 England was by far the greatest soccer team of all time.
We ingest this comprehensive dataset containing results from all ~40,000 international soccer matches ever played. Each match is a single row in the data and consists of information like home_team, away_team, home_score, away_score and the date and location of the match. The script below reads this raw dataset and converts it to df_summarized, a yearly summary for each team: its number of goals scored, allowed, wins, losses and draws.
Also note we create df_augmented, the “expanded” version of our original dataset. It contains the same columns and exactly twice as many rows: each match now has two rows, one representing each side. df_augmented is later transformed into the input data frame for our model.
# ingest & pre-process
library(tidyverse) # readr, dplyr, tidyr, ggplot2, purrr, stringr
library(lubridate) # year(), today()
library(scales)    # rescale()
library(lme4)      # glmer.nb()
df_raw <- read_csv("results.csv")
# assign matchID
matchID <- 1:nrow(df_raw)
df_raw <- data.frame(matchID, df_raw)
df_join <-
df_raw %>%
select(home_team, away_team, matchID) %>%
gather(key = type, value = key_team, -matchID)
df_processed <-
df_raw %>%
mutate(year = year(date))
df_augmented <-
df_join %>%
left_join(df_processed, by = "matchID") %>%
mutate(result = case_when(
type == "home_team" & home_score > away_score ~ "win",
type == "home_team" & home_score < away_score ~ "loss",
type == "away_team" & away_score > home_score ~ "win",
type == "away_team" & away_score < home_score ~ "loss",
home_score == away_score ~ "draw"
)) %>%
mutate(
goals_for = ifelse(type == "home_team", home_score, away_score),
goals_against = ifelse(type == "home_team", away_score, home_score)
) %>%
mutate(
tmp = 1
) %>%
spread(result, tmp, fill = 0)
df_summarized <-
df_augmented %>%
group_by(year, key_team) %>%
summarize_at(vars(win, draw, loss, goals_for, goals_against), sum)
write_csv(df_summarized, str_c(today(), "_summarized-yearly-results.csv"))
First we plot the distribution of goals scored per team per match across each of our ~40,000 matches. At first glance, the negative binomial appears to be a good fit for this distribution because of the long right tail and because the variance appears to be larger than the mean. In the next section (“Model”) we dive much deeper into the rationale behind the selection of the negative binomial distribution.
# distribution of all-time goals scored here (neg binom)
ggplot(data=df_augmented, aes(goals_for)) +
geom_histogram(binwidth = 1,
col="red",
fill="green",
alpha = .2) +
labs(title="Goals Scored by Team in Single Match") +
labs(x="Number of Goals Scored in Match")
mean(df_augmented$goals_for)
## [1] 1.460923
var(df_augmented$goals_for)
## [1] 2.52924
We also plot the total number of goals scored by team-season:
p1 <-
ggplot(data=df_summarized, aes(goals_for)) +
geom_histogram(binwidth = 1,
col="black",
fill="green",
alpha = .2) +
labs(title="Goals Scored by Team in Single Year") +
labs(x="Number of Goals Scored in Year")
p1
p2 <-
ggplot(data=df_summarized, aes(goals_against)) +
geom_histogram(binwidth = 1,
col="black",
fill="red",
alpha = .2) +
labs(title="Goals Allowed by Team in Single Year") +
labs(x="Number of Goals Allowed in Year")
p2
We see that the peaks sit higher for goals_for in a given year than for goals_against. This is perhaps because the worst teams often do not play very many games in total.
The negative binomial distribution is a generalization of the Poisson distribution which enables us to relax the assumption that the tendency to score (or allow) goals is homogeneous across teams. Recall that the Poisson distribution models the number of events that occur over a given interval of time or space. The Poisson distribution has a rate parameter, \(\lambda\), which equals both the mean and the variance of the distribution. Crucially, the Poisson’s rate parameter is constant, meaning that the rate at which events occur does not vary and that the probability of an event occurring in a sub-interval is proportional to the length of that sub-interval. As a Poisson-Gamma mixture, the negative binomial allows \(\lambda\) to differ across sample groups, meaning that the rate at which events occur may vary across these groups. Further, the variance of the negative binomial distribution must be greater than its mean, in contrast with the Poisson (where variance equals mean).
In the case of goals scored (or allowed) over the ~150-year history of international soccer, it simply is not reasonable to assume a constant rate parameter across sample groups. We must allow for the strong possibility that the frequency of goals scored across teams and seasons may vary considerably. Certain teams may have fundamentally different tendencies to score or allow goals, based on their own uniqueness (players, coaches and/or strategies). Allowing for this heterogeneity ought to increase the explanatory power and accuracy of our model.
Equally important, we see empirically that the variance of goals scored (or allowed) per match exceeds its mean. In our dataset df_augmented the mean goals/match (per team) is 1.46, and the variance is 2.53. This relationship (variance > mean) must hold for the negative binomial distribution because, in general, if \(x \sim NB(r, p)\), where \(r\) is the number of failures until the experiment is stopped and \(p\) is the probability of success in each trial, then \(\mathrm{E}(x) = \frac{pr}{1-p}\) and \(\mathrm{Var}(x) = \frac{pr}{(1-p)^2}\), so with \(p < 1\) the variance always exceeds the mean.
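To build intuition, here is a minimal simulation (with illustrative parameters, not values fitted to our data) showing that a Poisson-Gamma mixture produces overdispersed counts:
# minimal simulation: heterogeneous Poisson rates drawn from a Gamma
# yield negative-binomial counts whose variance exceeds the mean
set.seed(1872)
lambda <- rgamma(100000, shape = 2, rate = 1.2) # team-specific scoring rates
goals_sim <- rpois(100000, lambda)              # goals, given each team's rate
mean(goals_sim) # ~1.67 (the Gamma mean, shape/rate)
var(goals_sim)  # ~3.06 (mean + mean^2/shape), i.e. overdispersed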
We therefore elect to model goals by the negative binomial distribution because it allows for heterogeneity across national team-seasons and because our sample’s variance exceeds its mean.
A discussion of the multilevel (“hierarchical”) approach is warranted here as well. Multilevel models such as ours are popular in the biosciences and other domains where natural “nesting” relationships exist amongst groups of observations. In our case, since there is a great deal of similarity between national teams in nearby years (e.g. 1997 Brazil and 1998 Brazil), we have a natural opportunity to “nest” together national teams across different years and improve model performance, versus simply assuming every single team-season throughout history is an independent, random sample from the population. Clearly, large commonalities in players, coaches and strategies may exist for individual national teams across nearby years, which lends itself well to this hierarchical approach.
A classic use case for multilevel models is the analysis of student outcomes, where one seeks not just a residual for student-level performance but also for school-level performance. One can account for the “grouping” or “nesting” of students within schools, rather than viewing each student as an independent sample selected randomly from the global population. According to the University of Bristol’s Centre for Multilevel Modelling: link
Multilevel models recognise the existence of such data hierarchies by allowing for residual components at each level in the hierarchy. For example, a two-level model which allows for grouping of child outcomes within schools would include residuals at the child and school level. Thus the residual variance is partitioned into a between-school component (the variance of the school-level residuals) and a within-school component (the variance of the child-level residuals). The school residuals, often called ‘school effects’, represent unobserved school characteristics that affect child outcomes. It is these unobserved variables which lead to correlation between outcomes for children from the same school.
For these reasons, we feel a multilevel model on the negative binomial distribution is a compelling way to model performance for international soccer teams.
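To make the nesting concrete, the lme4 shorthand we use below, (1 | team/year), expands to a team-level intercept plus a year-within-team intercept, so 1997 Brazil and 1998 Brazil share a team-level residual while each keeps its own season-level residual (group labels here match the coefficient names we extract later):
# lme4 expands the nested shorthand into two random-effect terms
findbars(goals ~ (1 | team/year))
## [[1]]
## 1 | team
## 
## [[2]]
## 1 | year:team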
Returning to our data transformations, having doubled our number of observations (one for each team per match), we are ready to reshape the data for input into our model. We create columns for team, goals, opponent, and home_team. We also transform the current home/away boolean so that it contains “neutral” when the game is neither home nor away for team. This enables us to accurately assess home-field advantage and stadium effects.
# Model input dataset
df_model <-
df_augmented %>%
mutate(
team = ifelse(type == "home_team", home_team, away_team),
goals = ifelse(type == "home_team", home_score, away_score),
opponent = ifelse(type == "home_team", away_team, home_team),
type = ifelse(neutral == TRUE, "neutral", type),
home_team = ifelse(neutral == TRUE, "none", home_team)
) %>%
select(year, team, goals, opponent, type, home_team)
write_csv(df_model, str_c(today(), "_df_model.csv"))
Next we build our three models:
Important Note on Scoring: the Metric “Strength”: We refer to the term “Strength” throughout this paper. What does it mean? Team Strength is each team’s ability to score more goals than it allows; it is each team’s intrinsic skill at outscoring opponents. Each raw Strength calculation is rescaled on a 0-100 basis. All else equal, you should expect a team with a higher Strength to be more likely to win than a team with a lower Strength, assuming a neutral field location. Also note that a team with a Strength of 60 is 10% more efficient at outscoring opponents than a team with a Strength of 50.
Interestingly, the mean Team Strength across all teams is not 50, because there are more extremely poor teams than extremely good teams. So the median team is expected to score more goals than they allow, regardless of which of our three models you use. Note that because our model allows us to parse out offensive performance from defensive performance, we have estimates for both Offensive Strength and Defensive Strength as well.
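For intuition on the rescaling, here is a toy example (the raw scores are made up; scales::rescale() maps the sample minimum to 0 and the maximum to 100):
# toy example of the 0-100 rescaling used for Strength
raw_scores <- c(-2.1, 0.3, 1.9) # hypothetical raw overall scores
round(rescale(raw_scores, to = c(0, 100)), 1)
## [1]   0  60 100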
### Model 1 ###
# Run overnight:
model_nb <-
glmer.nb(goals ~ (1 | team/year) + (1 | opponent/year), df_model)
save(model_nb, file = str_c(today(), "_model-nb.rda"))
### Model 2 - Model 1 plus home/away as crossed random effect (non-nested) ###
# Run overnight:
model_nb2 <-
glmer.nb(goals ~ (1 | team/year) + (1 | opponent/year) + (1 | type), df_model)
save(model_nb2, file = str_c(today(), "_model-nb2.rda"))
### Model 3 - Model 1 plus home/away as nested random effect within team ###
# Run overnight:
model_nb3 <-
glmer.nb(goals ~ (1 | team/year) + (1 | opponent/year) + (1 | type/home_team), df_model)
save(model_nb3, file = str_c(today(), "_model-nb3.rda"))
We use an ANOVA test to find that our third model - the one accounting for team-specific home-field advantage, or “stadium effects” - has a lower AIC than the model which uses a constant home-field advantage across all teams. We use Model 3 for our results analysis below.
load("2018-05-10_model-nb.rda")
load("2018-05-12_model-nb2.rda")
load("2018-05-14_model-nb3.rda")
anova(model_nb, model_nb2, model_nb3)
The results of Model 3 are extremely revealing. First, let’s check our residuals.
# model 3, which ran overnight, should be ~13 MB; load it if needed
# source("000_libraries.R")
load("2018-05-14_model-nb3.rda")
model_coef <- coef(model_nb3)
# residuals checking: predicted vs. actual, then predicted vs. residual
model_predict <- predict(model_nb3, type = "response")
plot(model_predict, df_model$goals)
plot(model_predict, model_predict - df_model$goals)
There is no apparent skew or bias to our residuals, as the above plots demonstrate. Had we seen some clear pattern, we would re-evaluate our model selection.
We are almost ready to analyze the results, but first we create model_coef_comparable, which contains each team-season in history (e.g. 1978 Brazil, 1908 England) and their overall strength score as calculated by our model. This data frame also includes columns for offensive and defensive strength and thus allows us to easily compare teams from different seasons with each other by offense, defense and overall strength. We rescale the strength variable on a 0-100 basis to increase interpretability.
# munging coefficients
model_coef_offense <-
model_coef$team %>%
as.data.frame %>%
set_names("offense") %>%
mutate(team = rownames(.))
model_coef_defense <-
model_coef$opponent %>%
as.data.frame %>%
set_names("defense") %>%
mutate(team = rownames(.))
model_coef_type <-
model_coef$`home_team:type` %>%
as.data.frame %>%
set_names("type_home_team") %>%
mutate(team_home_team = rownames(.)) %>%
separate(team_home_team, c("team", "type"), sep = ":") %>%
spread(key = type, value = type_home_team) %>%
select(-neutral)
model_coef_offense_year <-
model_coef$`year:team` %>%
as.data.frame %>%
set_names("offense_year") %>%
mutate(team_year = rownames(.)) %>%
separate(team_year, c("year", "team"), sep = ":")
model_coef_defense_year <-
model_coef$`year:opponent` %>%
as.data.frame %>%
set_names("defense_year") %>%
mutate(team_year = rownames(.)) %>%
separate(team_year, c("year", "team"), sep = ":")
model_coef_type_home_team <-
model_coef$`home_team:type` %>%
as.data.frame %>%
set_names("type_home_team") %>%
mutate(team = rownames(.)) %>%
separate(team, c("team", "type"), sep = ":")
# rolling into one
model_coef_comparable <-
model_coef_defense_year %>%
left_join(model_coef_offense_year) %>%
left_join(model_coef_offense) %>%
left_join(model_coef_defense) %>%
left_join(model_coef_type) %>%
left_join(model_coef_type_home_team) %>%
mutate(overall_score = offense + offense_year - (defense + defense_year)) %>%
mutate(strength = round(rescale(overall_score, c(0,100)), 1)) %>%
rename(team_hf_adv = home_team) %>%
rename(team_af_disadv = away_team) %>%
select(-type) %>%
select(-type_home_team) %>%
na.omit()
Let’s peek at model_coef_comparable, which, again, is a table containing team-seasons as observations and our model’s coefficients as variables. In a single line of code we are able to unlock various important new insights from our data. For example, did you know that 1908 England is the highest-rated team in history, and that it’s not really even close? 1908 England was perhaps ~4% better than the #2 all-time team, 1965 Brazil. We also see that while Brazil is historically a better defense than England (defense for Brazil is lower than defense for England), the two teams historically have very similar offensive strength (offense for England is nearly identical to offense for Brazil).
Also interestingly, we see that England’s single-year offensive strength jumped greatly in 1908 (offense_year is relatively high), which means that, controlling for England’s overall historical offensive strength, the 1908 squad was particularly prolific at putting the ball in the net.
model_coef_comparable %>%
select(year, team, offense, offense_year, defense, defense_year, team_hf_adv, team_af_disadv, strength) %>%
arrange(desc(strength)) %>%
mutate_at(3:9, round, 2) %>%
select(c("year", "team", "strength", "offense", "offense_year", "defense", "defense_year", "team_hf_adv", "team_af_disadv")) %>%
filter(row_number() %% 2 == 0) # keep every other row
Next we evaluate talent distribution across teams. One might expect talent to be normally distributed, but we in fact see significant negative skew. As the boxplot below shows, there is a large “middle/upper class” of soccer teams, and a long tail of extremely poor teams. So, next time you see a team with a surprisingly strong win-loss record, check their list of opponents. It’s possible they’ve been “feasting” on the small class of extremely poor teams.
# team strength exhibits negative skew
(model_coef$team - model_coef$opponent) %>%
as.data.frame %>%
setNames("val") %>%
mutate(team = rownames(.)) %>%
mutate(strength = rescale(val, c(0, 100))) %>%
select(strength) %>%
boxplot(horizontal = TRUE)
Next, let’s parse offensive strength and defensive strength from overall team strength and analyze them.
Which were the greatest single-season offenses in history? This is another question we can easily answer:
# best offenses of all-time (single season)
model_coef_comparable %>%
mutate(total_offense = offense + offense_year) %>%
mutate(total_offense_scaled = rescale(total_offense, c(0,100))) %>%
arrange(desc(total_offense_scaled)) %>%
select(year, team, total_offense_scaled) %>%
mutate_at(3, round, 2)
How about the greatest single-season defenses in history? 1978 Brazil allowed only three goals in the entire World Cup, but was boxed out of the championship match when host country Argentina beat Peru 6-0 in the second round, thereby edging out Brazil on goal differential.
Unproven allegations of collusion followed that match.
# best defenses of all-time (single season)
model_coef_comparable %>%
mutate(total_defense = defense + defense_year) %>%
mutate(total_defense_scaled = 100 - rescale(total_defense, c(0,100))) %>%
arrange(desc(total_defense_scaled)) %>%
select(year, team, total_defense_scaled) %>%
mutate_at(3, round, 2)
Next we show that offensive and defensive skill are highly correlated (the correlation below is negative because a lower defense coefficient means fewer goals allowed), but this is explained by the fact that teams with good offenses and defenses are also strong overall. This is an intuitive and even obvious point, but it’s the reason sports like soccer are particularly “clean” to model. This opens a larger discussion (see the “Extending this Work” section).
with(model_coef_comparable, cor(offense + offense_year, defense + defense_year))
## [1] -0.899871
with(model_coef_comparable, cor(offense, defense))
## [1] -0.9456068
The year-to-year fluctuations in offensive and defensive strength are not highly correlated. In other words, just because a team had a sudden offensive boom one year doesn’t suggest they’re likely to have improved on defense as well.
with(model_coef_comparable, cor(offense_year, defense_year))
## [1] -0.01404749
We see that defensive skill is 49% more variable year to year than offensive skill.
with(model_coef_comparable, sd(defense_year)) / with(model_coef_comparable, sd(offense_year)) - 1
## [1] 0.4862493
We see that better teams have less variable defense and more variable offense than weaker teams. Still, even for good teams, defense is more variable than offensive skill.
model_coef_sd <-
model_coef_comparable %>%
group_by(team) %>%
mutate(sd_offense = sd(offense_year),
sd_defense = sd(defense_year)) %>%
summarize(avg_score = mean(overall_score), sd_offense = first(sd_offense),
sd_defense = first(sd_defense))
model_coef_sd_gathered <-
model_coef_sd %>%
gather(key = skill, value = value, -avg_score, -team)
ggplot(model_coef_sd_gathered) +
aes(x = avg_score, y = value, color = skill) +
geom_point() +
stat_smooth(method = 'gam')
Prior to the tournament, our model correctly predicted 13 of the 16 knockout-round teams (81.25%), missing only entrants Switzerland and Mexico, as well as Cinderella story Japan. In what many consider a wacky and wild World Cup, our model performed quite well in selecting the final 16 teams.
Which are the best teams heading into World Cup 2018? Let’s evaluate team strength since 2016 in an attempt to answer this question:
# best team since last World Cup (let's call it 2016 thru present)
model_coef_comparable %>%
filter(year %in% 2016:2018) %>%
group_by(team) %>%
summarise(mean = mean(strength)) %>%
arrange(desc(mean)) %>%
mutate_at(2, round, 2)
So, if we assume World Cup games are essentially “neutral” locations (neither team is “home” nor “away”), then we have little reason to select any other team to win the title than the Brazilians. Of course there is great uncertainty, but at least since 2016, Brazil has established themselves as the current strongest team globally.
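As a sketch of how these ratings could translate into an expected-goals figure for one side of a hypothetical neutral-site match (the matchup below is purely illustrative; predict() for merMod objects accepts new data):
# hedged sketch: expected goals for one side of a hypothetical
# neutral-site match (team names are illustrative)
newdata <- data.frame(year = 2018, team = "Brazil", opponent = "Germany",
type = "neutral", home_team = "none")
predict(model_nb3, newdata = newdata, type = "response", allow.new.levels = TRUE)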
Netherlands and Italy were the two strongest teams to miss the 2018 World Cup, grading out as the #6 and #7 strongest national teams since 2016 according to our model. Pre-tournament Vegas favorite Germany stumbled early in its opening 0-1 loss to Mexico and was ultimately eliminated by the South Koreans. While Germany was a favorite to many forecasters, our model found them to be only the #3 best team since 2016, behind Brazil and Spain.
We can parse out offense from defense, just as in the previous section. So which are the strongest offenses since 2016?
# best offenses since 2016
model_coef_comparable %>%
filter(year %in% 2016:2018) %>%
select(c(year, team, offense, offense_year)) %>%
mutate(total_offense = offense + offense_year) %>%
group_by(team) %>%
summarise(mean_total_offense = mean(total_offense)) %>%
mutate(mean_total_offense = rescale(mean_total_offense, c(0, 100))) %>%
arrange(desc(mean_total_offense)) %>%
mutate_at(2, round, 2)
Note above we scale the ratings on a 0-100 scale relative to the sample group (teams since 2016), not since 1872.
How about the strongest defenses since 2016?
# best defenses since 2016
model_coef_comparable %>%
filter(year %in% 2016:2018) %>%
select(c(year, team, defense, defense_year)) %>%
mutate(total_defense = defense + defense_year) %>%
group_by(team) %>%
summarise(mean_total_defense = mean(total_defense)) %>%
mutate(mean_total_defense = 100 - rescale(mean_total_defense, c(0, 100))) %>%
arrange(desc(mean_total_defense)) %>%
mutate_at(2, round, 2)
Clearly, there is opportunity to feature-engineer the model and improve accuracy. Conceived and published in less than six weeks, this paper’s original intent is to showcase the model structure more so than its features (predictor variables). Models such as ours have clear and natural extensions to business (particularly retail), where customer purchase frequency should be considered heterogeneous, and a “nested” approach such as ours makes intuitive sense and provides powerful predictive accuracy.
We quantify the home-field advantage to be worth:
model_coef$type
Which teams have the largest home-field advantages?
# which teams have the biggest home-field advantages ?
model_coef$`home_team:type` %>%
as.data.frame %>%
set_names("type_home_team") %>%
mutate(team_home_team = rownames(.)) %>%
separate(team_home_team, c("team", "type"), sep = ":") %>%
spread(key = type, value = type_home_team) %>%
mutate(diff = home_team - away_team) %>%
arrange(desc(diff)) %>%
select(team, home_team, away_team, diff) %>%
mutate_at(2:4, round, 2)
Bolivia isn’t known for its football prowess, having been outscored by a total of 834-450 in the team’s history. But Bolivia boasts our model’s #1 home-field advantage, at almost double the #2 team. Our model estimates a baseline increase of +0.47 goals scored/game for a typical team’s home-field advantage, but on top of that, Bolivia receives an additional “stadium effect” of +0.42 goals/game. This is by far the largest in soccer, at almost 4.5 standard deviations above the mean. This unique advantage may be tied to the team’s high elevation of play. In 2007 FIFA temporarily banned World Cup qualifying games from being played in Bolivia, Ecuador and Colombia due to high elevations affecting players’ health. Perhaps not surprisingly, Ecuador appears near the top of the home-field advantage charts as well, at #5.
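For reference, the “4.5 standard deviations” figure can be recovered from the coefficients directly (a sketch reusing the home-field pipeline above; exact output depends on the fit):
# sketch: Bolivia's net stadium effect in standard deviations above the mean
hf_adv <-
model_coef$`home_team:type` %>%
as.data.frame %>%
set_names("type_home_team") %>%
mutate(team_home_team = rownames(.)) %>%
separate(team_home_team, c("team", "type"), sep = ":") %>%
spread(key = type, value = type_home_team) %>%
mutate(diff = home_team - away_team)
with(hf_adv, (diff[team == "Bolivia"] - mean(diff)) / sd(diff))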
In the “Results” section above we show that offensive and defensive skill are highly correlated, and discuss why that is not a surprising result in a sport like soccer. The sport is “symmetrical” in the sense that both offense and defense are on the field simultaneously, and an individual player might be called upon to quickly alternate between offensive and defensive play. Plus, the field is symmetrical, with identical goals facing each other. Further, in soccer a team’s offense plus its defense comprise essentially the entirety of its fortunes, and good offensive players ought to be expected to be better defenders than poor offensive players at a given position. (For example, faster players may be better at both scoring and preventing goals.) We will describe this as a “symmetrical” and “exclusive”-style team sport. We use “exclusive” to mean that offense + defense encapsulate virtually the entirety of a team’s fortunes.
Contrast this “exclusivity” with baseball. In baseball, pitching is largely independent of defense overall. In fact, sabermetricians use the statistic Fielding Independent Pitching (FIP) in order to more accurately assess pitcher performance. Pitching is considered a distinct, third component which drives team performance, and there’s little reason to expect it correlates highly with defense, let alone offense. Baseball team performance is not “exclusive” to offense and defense because pitching is a mostly distinct third factor of success. (As an aside, baseball is often somewhat easier to model because each pitch is a distinct event which can be cleanly labeled as “ball 2”, “strikeout”, “home run”, etc. This contrasts with sports like soccer and basketball, which are mostly continual flows of semi-structured movements.)
The concept of a goalkeeper in soccer is indeed perhaps a “third variable” which could be expected to vary independently from offense or defense in our model. If individual goalkeeper strength were evaluated by our model, we could gain even further insights into “true” defensive performance, exclusive of goalkeeper ability. This is analogous to the concept of FIP. So, no, soccer team strength is not perfectly encapsulated purely by offensive and defensive strength. But, again, contrast this with baseball. It’s difficult to argue that a goalkeeper is as crucial to a team’s overall strength as the strength of a baseball team’s pitchers, who commence every single event during the game, and whose performance is a huge driver of final outcomes. The point is that our model’s current features fit a team sport like soccer better than baseball. We would need to include the concept of pitching in a baseball model.
Similarly, in American football we have little reason to expect that special teams performance (e.g. kicking, punting, returning etc.) is highly correlated to offensive or defensive strength. Personnel overlap between the three groups is typically minimal or zero. Coaches and strategies may be quite distinct as well. So, American football isn’t clean and “exclusive” like soccer. So, similar to a good baseball model requiring pitching, a good football model needs to account for special teams.
For these same reasons, volleyball or lacrosse may be terrific candidates for our model due to their symmetry and exclusivity. Basketball may lie somewhere in between: the game’s structure is similar to volleyball, lacrosse or soccer, but it may be the case that prolific offenses are often paired with poor or highly variable defenses (at least in the NBA). So our model may extend to basketball well, but we should not expect the same strong offensive/defensive strength correlations as in soccer.
Multilevel models on the negative binomial distribution may be a particularly effective approach for modeling team strength in symmetrical, “exclusive” sports such as soccer. This approach allows us to account for the natural “nested” relationship within a single country, and the carryover of players, coaches and strategies from year to year. Using a comprehensive dataset of all international soccer matches since 1872, we built a multilevel negative binomial regression model to assess team strength, controlling for strength of the opponent, the home-field advantage and the team-specific home-field advantage (“stadium effect”). Among our findings are that 1908 England was by far the strongest single team-season, that higher-elevation stadiums may confer outsized home-field advantages and that Germany may be slightly over-hyped heading into the 2018 World Cup.