The 2018 FIFA World Cup is a football tournament starting on 14 June 2018 in Russia. This document describes the process I followed to develop a model for determining the outcome of a match at this tournament, based on historic data from previous World Cups. I then use this model to simulate the tournament a number of times, in order to predict the most likely winners, as well as other finishes (most likely finalists, semi-finalists etc.).
The results of the modelling were as follows. A more detailed discussion of the results can be found in the Results section.
The results of all games from the tournaments held between 1978 and 2014 were downloaded from http://www.international-football.net/ . Ten World Cups were held over this period, which should represent enough games to build a model. I didn’t want to delve too far into history, out of a concern that the game may have changed so much that older results are not informative for our purposes. Note that I haven’t tested this assumption!
World Cups consist of a group stage followed by a knockout bracket. During the group stage a game can end in a draw; in the knockout stage, extra time is played if the game is level after normal time, and a penalty shoot-out takes place if it is still level after that. The source website lists the result of a knockout game after any extra time, and lists the game as a draw if it went to a penalty shoot-out. Including extra-time results could in theory lead to an under-representation of draws. In my modelling I haven’t treated knockout games differently; this could be a future step in the development of the model.
So now I have a list of game results; in order to develop a model, I need some predictors. Intuitively, the main factor will be some measurement of team strength. FIFA maintain their own rankings, which might suit our purposes, but I have some suspicions about them. In particular, these rankings have been used to seed the tournament, so some teams, such as Poland, have structured their schedules to maximise their ranking. If you’d wondered why Group H looks a bit weak (Poland, Colombia, Senegal, Japan), now you know why!
This fairly candid Wikipedia page makes the claim that an ELO rating performs better than the FIFA ranking system. Therefore, I’ve used the ELO rating of each country as a measure of a team’s relative strength. Each team’s ELO rating before each World Cup has been downloaded from http://www.international-football.net/ , which in turn uses http://www.eloratings.net/ as its source.
For the modelling, I’ve also captured each team’s continental affiliation, and the continent in which the tournament was played. Note that this is based on the team’s current affiliation. So, for instance, Australia’s move from OFC to AFC in 2006 is not reflected in the historic data; they are treated as being part of AFC throughout the analysis.
The below table lists the total win, loss and draw percentages within the data. Each game is either a draw or a win for one of the two teams. The underlying data lists one row per game, so I have duplicated the data with the teams and scores swapped, giving one row for each team in each game. This is why the win and loss counts are identical, and why each drawn game is counted twice.
Table one; overall results
| Total games | Win count | Draw count | Loss count | Win percentage | Draw percentage | Loss percentage |
|---|---|---|---|---|---|---|
| 1132 | 424 | 284 | 424 | 0.375 | 0.251 | 0.375 |
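The analysis itself was done in R; as a rough illustration only, the duplication step described above Table one might look like this in Python. The field names and the two games are made up for the example and are not taken from the source data.

```python
# Hypothetical field names; the layout of the source data is an assumption.
def mirror_games(games):
    """Return two rows per game: the original and a copy with the sides swapped."""
    rows = []
    for game in games:
        rows.append(dict(game))
        rows.append({
            "team": game["opponent"],
            "opponent": game["team"],
            "goals_for": game["goals_against"],
            "goals_against": game["goals_for"],
        })
    return rows


def result(row):
    """Classify a row from the point of view of row['team']."""
    if row["goals_for"] > row["goals_against"]:
        return "win"
    if row["goals_for"] < row["goals_against"]:
        return "loss"
    return "draw"


# Made-up games, for illustration only.
games = [
    {"team": "Team A", "opponent": "Team B", "goals_for": 2, "goals_against": 1},
    {"team": "Team C", "opponent": "Team D", "goals_for": 0, "goals_against": 0},
]
rows = mirror_games(games)

counts = {"win": 0, "draw": 0, "loss": 0}
for row in rows:
    counts[result(row)] += 1
print(counts)
```

Every decisive game contributes one win row and one loss row, and every drawn game contributes two draw rows, which matches the symmetry visible in the table.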
Now that I have some factors, it’s worth exploring their relationship with the result. Let’s start with ELO rating. Of course, a game involves two teams; rather than considering the rating of one team, it is better to consider the ratio of the two teams’ ratings.
Encouragingly, there does appear to be some correlation between the ratio of ratings and the result of the game.
Other factors may be worth considering in the modelling as well. Firstly, it may be worth considering the continental affiliation of the teams. The below table lists win rates by continent.
Table two; results by continental affiliation
| Confederation | Continent | Win percentage | Draw percentage | Loss percentage | Win count | Draw count | Loss count | Total games |
|---|---|---|---|---|---|---|---|---|
| AFC | Asia | 0.146 | 0.219 | 0.635 | 14 | 21 | 61 | 96 |
| CAF | Africa | 0.217 | 0.283 | 0.500 | 26 | 34 | 60 | 120 |
| CONCACAF | North America | 0.234 | 0.255 | 0.511 | 22 | 24 | 48 | 94 |
| CONMEBOL | South America | 0.490 | 0.216 | 0.293 | 102 | 45 | 61 | 208 |
| OFC | Oceania | 0.000 | 0.500 | 0.500 | 0 | 3 | 3 | 6 |
| UEFA | Europe | 0.428 | 0.258 | 0.314 | 260 | 157 | 191 | 608 |
Exploring the relationship between ELO rating and continental affiliation reveals some interesting trends. The below box plots consider the distribution of ELO rating differences by the result of the game, and continental affiliation. All continents seem to show a broadly similar trend, in which teams with a higher ELO rating are more likely to win.
Another factor to consider could be the location of the tournament, relative to the location of the teams. E.g. if the tournament is held in Europe then we might see European teams perform better due to the smaller distance of travel.
Ultimately we want a model that takes as input factors relating to the two teams (ELO ratings, continental affiliation etc.) and produces as output the probability of team one winning, of team two winning, and of a draw.
The modelling approach I have taken is to develop a binomial logistic regression model with “team one win” as the outcome variable. Once we have the model, we can apply it once from the point of view of team one and once from the point of view of team two, to get an estimate of the probability of each team winning. The probability of a draw is then 1 minus the sum of the two win probabilities.
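To make the derivation of the three probabilities concrete, here is a minimal Python sketch (the real model was fitted in R). The intercept and slope are invented for illustration and are not the fitted coefficients, and the sketch uses only the rating ratio as a predictor.

```python
import math

# Hypothetical coefficients, for illustration only; NOT the fitted values.
INTERCEPT = -0.5
SLOPE = 8.0


def win_probability(own_rating, opponent_rating):
    """Logistic model of P(win) from the ratio of the two ratings."""
    ratio = own_rating / opponent_rating
    z = INTERCEPT + SLOPE * (ratio - 1.0)
    return 1.0 / (1.0 + math.exp(-z))


def match_probabilities(rating_one, rating_two):
    """Apply the win model to each side; the draw takes the remainder."""
    p_one = win_probability(rating_one, rating_two)
    p_two = win_probability(rating_two, rating_one)
    p_draw = 1.0 - p_one - p_two
    return p_one, p_draw, p_two


p_one, p_draw, p_two = match_probabilities(1685, 1601)
print(round(p_one, 3), round(p_draw, 3), round(p_two, 3))
```

One thing this makes visible: nothing in the construction forces the two win probabilities to sum to less than 1, so for very lopsided match-ups the implied draw probability could in principle go negative. The sketch does not guard against that.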
The final model that I settled on uses the ELO rating, and a factor based on whether the opposing team was playing in their home continent. Applying this model to a hold-out sample resulted in an accuracy of 0.721 in predicting a win, with a 95% confidence interval for the true model accuracy of 0.658 to 0.779. In other words, the model correctly predicts whether or not a team wins 72.1% of the time on the hold-out sample.
This is an acceptable result for our modelling, without being incredibly accurate. As a baseline, roughly 37.5% of rows in the overall results table are a win for a given team, so a model that always predicted “no win” would be accurate about 62.5% of the time, and a fair coin flip between “win” and “no win” would be accurate about 50% of the time. Nevertheless, this model should be sufficient for the purposes of our simulations.
Monte Carlo simulation involves iterating a process with random factors in order to obtain a distribution of possible results (for more information, please see this Wikipedia link). Now that we have a model that can predict the result of a game, we can use it to simulate the entire World Cup.
The method that I have used is as follows. Teams have already been separated into groups by the World Cup draw. For the group stage, I apply the model to each team in each game. I then draw a random number between 0 and 1 for each game: if this number is less than the win estimate for team one, the game is a win for that team; if it is larger than that but less than the sum of the team one win estimate and the draw estimate, the game is a draw; otherwise the game is a win for team two.
The below table is an example of this.
Table three; example of group stage modelling
| Team one | Rating team one | Team two | Rating team two | Team one win likelihood | Draw likelihood | Team two win likelihood | Random number | Result based on random number |
|---|---|---|---|---|---|---|---|---|
| Russia | 1685 | Saudi Arabia | 1601 | 0.519 | 0.288 | 0.192 | 0.328 | Russia win |
| Russia | 1685 | Egypt | 1646 | 0.453 | 0.313 | 0.233 | 0.759 | Draw |
| Russia | 1685 | Uruguay | 1890 | 0.198 | 0.267 | 0.535 | 0.114 | Russia win |
| Saudi Arabia | 1601 | Egypt | 1646 | 0.300 | 0.283 | 0.416 | 0.691 | Egypt win |
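The rule used to turn a random number into a result can be sketched in Python as below. The function name is my own; the probabilities and random numbers are the ones from Table three, so the output can be checked against the last column of the table.

```python
def sample_result(p_team_one, p_draw, u):
    """Map a uniform random number u in [0, 1) to a match result.

    The interval [0, 1) is split into three bands: team one win,
    then draw, then team two win.
    """
    if u < p_team_one:
        return "team one win"
    if u < p_team_one + p_draw:
        return "draw"
    return "team two win"


# (team one, team two, P(team one win), P(draw), random number) from Table three.
table_three = [
    ("Russia", "Saudi Arabia", 0.519, 0.288, 0.328),
    ("Russia", "Egypt", 0.453, 0.313, 0.759),
    ("Russia", "Uruguay", 0.198, 0.267, 0.114),
    ("Saudi Arabia", "Egypt", 0.300, 0.283, 0.691),
]

outcomes = [sample_result(p1, pd, u) for _, _, p1, pd, u in table_three]
for (one, two, *_), outcome in zip(table_three, outcomes):
    print(f"{one} v {two}: {outcome}")
```

The Russia v Egypt row shows why the second threshold has to be cumulative: 0.759 is larger than the draw likelihood of 0.313 on its own, but smaller than 0.453 + 0.313 = 0.766, so the game is a draw.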
Using a random number generator, I can then determine a result for each game. Once I have done this, I can calculate the final standings of the group stage. In the event of two teams being tied in the standings, I have used a second random number to determine the ranking.
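A rough Python sketch of the group standings step follows. It assumes the usual 3 points for a win and 1 for a draw, and that teams are tied when they are level on points; the model produces no scorelines, so goal difference is not available. The random tie-break mirrors the second random number mentioned above, but the function and its details are my own illustration.

```python
import random


def group_ranking(results, rng):
    """Rank teams by points, breaking ties on points with a random number.

    results: list of (team_one, team_two, outcome) where outcome is
    "team one win", "draw" or "team two win".
    """
    points = {}
    for team_one, team_two, outcome in results:
        points.setdefault(team_one, 0)
        points.setdefault(team_two, 0)
        if outcome == "team one win":
            points[team_one] += 3
        elif outcome == "team two win":
            points[team_two] += 3
        else:
            points[team_one] += 1
            points[team_two] += 1
    # One random number per team, used only to separate teams level on points.
    tiebreak = {team: rng.random() for team in points}
    return sorted(points, key=lambda t: (points[t], tiebreak[t]), reverse=True)


# Made-up results, for illustration only.
results = [
    ("Team A", "Team B", "team one win"),
    ("Team A", "Team C", "team one win"),
    ("Team B", "Team C", "draw"),
]
ranking = group_ranking(results, random.Random(2018))
print(ranking)
```

Here Team A always finishes first with 6 points, while the order of Team B and Team C (1 point each) depends on the random tie-break.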
Once this has taken place, I can simulate the knockout stages. For these I have used the same method as above to determine the likelihood of each result. In the event of a draw, I have used a coin flip to determine the winner. This may weight results slightly towards the less favoured team, but it is a useful simplifying assumption.
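As a sketch of the knockout rule in Python (names are my own, and this is an illustration rather than the original R code): a drawn game is settled by a fair coin flip, so each side’s chance of advancing is its win probability plus half the draw probability.

```python
import random


def knockout_winner(team_one, team_two, p_team_one, p_draw, rng):
    """Simulate one knockout game; a draw is decided by a fair coin flip."""
    u = rng.random()
    if u < p_team_one:
        return team_one
    if u < p_team_one + p_draw:
        return team_one if rng.random() < 0.5 else team_two
    return team_two


def advance_probability(p_win, p_draw):
    """Chance of going through under the coin-flip assumption."""
    return p_win + p_draw / 2.0


# Probabilities taken from the Russia v Uruguay row of Table three.
print(round(advance_probability(0.198, 0.267), 4))
print(round(advance_probability(0.535, 0.267), 4))
print(knockout_winner("Russia", "Uruguay", 0.198, 0.267, random.Random(1)))
```

The even split of the draw probability is where the slight tilt towards the less favoured team comes from: the underdog receives the same share of drawn games as the favourite.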
We can then iterate this process a number of times; the share of iterations in which each outcome occurs is then an approximation of how likely that outcome is.
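The iteration step might be sketched in Python as follows. `toy_tournament` is a made-up stand-in for a full tournament simulation, used only so the example runs on its own; the 70% figure is arbitrary.

```python
import random
from collections import Counter


def estimate_frequencies(simulate_once, iterations, rng):
    """Run the simulation repeatedly and return the share of runs per outcome."""
    counts = Counter(simulate_once(rng) for _ in range(iterations))
    return {outcome: n / iterations for outcome, n in counts.items()}


def toy_tournament(rng):
    """Stand-in for a whole tournament: 'Team A' wins about 70% of the time."""
    return "Team A" if rng.random() < 0.7 else "Team B"


frequencies = estimate_frequencies(toy_tournament, 1000, random.Random(2018))
print(frequencies)
```

With more iterations the estimated shares settle closer to the underlying probabilities, at the cost of a longer run time.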
The below table lists how far each team progressed in the World Cup, based on 1,000 iterations of the simulation described above.
Table four; modelling results
Some interesting narratives present themselves when reviewing this table, in particular how some teams have markedly different likelihoods of winning their group compared with winning the World Cup. For instance, Poland have a higher likelihood than Portugal of winning their group, but Portugal still have a much higher likelihood of winning the World Cup. Recall that Poland are in a comparatively weak group, whereas Portugal’s group contains Spain.
The most readily available predictions of World Cup results come via betting websites. For easy comparison, the below table contains the same information presented as “decimal odds”. Decimal odds give the total payout, including the returned stake, for each unit staked on a winning bet on that team.
Table five; modelling results formatted as decimal odds
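As a hedged illustration of the conversion, the sketch below assumes “fair” decimal odds, i.e. the reciprocal of the simulated probability with no bookmaker margin. The probabilities in the example are made up and are not results from the simulation.

```python
def decimal_odds(probability):
    """Fair decimal odds implied by a probability (no bookmaker margin).

    Assumption: a probability of zero, e.g. an outcome that never occurred
    in the simulations, is reported as infinite odds.
    """
    if probability <= 0:
        return float("inf")
    return 1.0 / probability


# Made-up probabilities, for illustration only.
for p in (0.5, 0.25, 0.01):
    print(p, round(decimal_odds(p), 2))
```

Bookmakers build a margin into their prices, so published odds will generally be a little shorter than fair odds derived from the same probabilities.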
For all of this analysis, I have used RStudio. Here is a link to a GitHub repository with my R scripts. Please be forewarned that the exploratory analysis and modelling stages are quite messy.
http://github.com/chrishay1/wc2018_pred
I can be contacted at an_aucklander@hotmail.com if you wish to discuss further.