Predicting the FIDE Chess Candidates 2022
Introduction
The FIDE Chess Candidates tournament is just around the corner and is arguably the most anticipated tournament of the year. A few people have posted predictions for the winner, and I thought it would be a great idea to join in, using a framework that I have been actively developing for around a year.
Setup
Before presenting predictions, it is important to establish the baseline on which the predictions are made. Loads of tools, databases and lists are available to use for predictive modelling, and they will all produce different results. Below is an overview of what goes into my predictive framework.
Data
The initial measurement of player strength is based on the Universal Rating System (URS). One might question why the official FIDE Elo ratings are not used. Without going into much detail, I believe that a simultaneous rating calculation based on the current player pool solves some of the issues in the Elo system. However, this does result in the ranking order of top players (and hence the players participating in the Candidates) looking different from the official FIDE rating list and the popular live rating lists at 2700chess. This will inevitably impact the predictions.
The core of the game data is the TWIC database from January 2018 onwards with the following criteria:
- Games from players with a URS rating of 2550 and above on the most recent URS rating list.
- Games where both players are rated at least 2500. This criterion is implemented to partly protect against players (typically juniors) who have been rapidly improving and hence have a URS rating which does not reflect their actual playing strength.
- Games in “official” tournaments only, with a few exceptions. For example, the database does not contain many of the chess.com tournaments, such as Titled Tuesday, but does contain the Magnus Carlsen/Meltwater Chess Tour, for example. In the end, it is a list of tournaments where I have made a subjective evaluation of whether the format/prize fund etc. has warranted best play from the participants.
For each tournament on the final list, I have manually added the following bits of information:
- Tournament time control (classical, rapid, blitz)
- Tournament rounds
- Tournament format (swiss, round-robin, match, team tournament)
- Online (yes/no)
The above additions are quite important since they change the dynamics of a given game. Classical, rapid and blitz games are separated into three different databases. Tournament rounds can assist in capturing changes in the predicted outcome of a game based on tournament progress. Whether a game was played online or not is important following the popularity of online events from 2020 onwards: we have often seen players who excel in online tournaments but who may not have performed the same OTB. However, this indicator is irrelevant for classical games, since practically no classical tournaments have been played online at the highest level.
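To illustrate the shape of this metadata, here is a minimal sketch; field names and example rows are my assumptions, not the actual database schema:

```r
library(tibble)

# Illustrative only: per-tournament metadata added on top of the TWIC games.
# Field names and example values are assumptions, not the real schema.
tournaments <- tibble(
  tournament   = c("FIDE Candidates 2022", "Meltwater Tour Finals 2021"),
  time_control = c("classical", "rapid"),          # classical / rapid / blitz
  rounds       = c(14, 9),
  format       = c("round-robin", "round-robin"),  # swiss / round-robin / match / team
  online       = c(FALSE, TRUE)
)
```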
Shoutout to Mark Crowther for maintaining the TWIC database, which is an invaluable tool for people like me who like to fiddle with chess statistics on a weekly updated database.
Variables and modelling
Besides the simple game statistics such as URS rating of each player, tournament format, etc. as mentioned above, we can extract several statistics from a given matchup. The following is not a complete list, but contains some of the more interesting/relevant variables that are computed prior to setting up the model:
| Variable(s) | Explanation |
|---|---|
| Player rating development | How much a player's URS rating has developed since the same day the year prior |
| H2H statistics | Fraction of H2H wins/draws/losses going into a game, weighted by H2H game count |
| Player category | Category of the player, split by world rank on the URS rating list |
| Weighted player performance | Player's overall weighted performance, weighted performance with a given colour, and weighted performance against the opponent's category |
| Player tournament performance | Player tournament performance round-by-round and previous game result |
| Player activity | Number of games in the past 12 months and time since last game |
Aside from the above, more (although less interesting) variables are added. I also include relevant interactions between the variables.
A small note on “weighted player performance” is necessary. By “weighted”, I refer to a version of the calculated performance rating described on Chessmetrics. However, instead of using a linear formula, I treat the weight as exponential decay as shown below.
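As a sketch, the weight can be written as a plain exponential in the number of days since the game was played; the exact functional form below is my reconstruction from the half-life quoted next:

```r
# Exponential decay weight for a game played `days_ago` days in the past.
# A reconstruction assuming a plain exponential with a 157-day half-life.
half_life   <- 157
game_weight <- function(days_ago) exp(-log(2) / half_life * days_ago)

game_weight(0)    # 1.00  (a game played today)
game_weight(157)  # 0.50  (one half-life ago)
game_weight(365)  # ~0.20 (one year ago)
game_weight(730)  # ~0.04 (two years ago)
```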
The function has a half-life of around 157 days, i.e. a game played 157 days ago only carries half the weight of a game played today. Similarly, a game played one year ago has a weight of 0.2, and a game played two years ago has a weight of only 0.04. This plays a big role when connected to the uncertainty surrounding expected performance, especially during the corona period. Aside from weighting, I also use padding like Chessmetrics, although less punishing.
For modelling the predicted outcome of a given game, I use a machine learning approach, more specifically the CatBoost (version 1.0.6) gradient boosting framework. In essence, predicting the outcome of a given game is a classification task (win/draw/loss) based on a number of variables. CatBoost is convenient due to the way it handles categorical variables, which are common in the database. To optimise hyperparameters I make use of Bayesian optimisation (some more details here). Everything is run in the R programming language, utilising the Tidymodels framework.
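For illustration, a bare-bones win/draw/loss classifier with the catboost R package could look like the sketch below. This is not the actual Tidymodels pipeline, and the feature names and values are made up:

```r
library(catboost)

# Toy features for a handful of games; names and values are illustrative.
features <- data.frame(
  rating_diff = c(35, -12, 4),        # URS rating difference
  white_perf  = c(2789, 2741, 2756),  # weighted performance with white
  format      = factor(c("round-robin", "swiss", "round-robin"))
)
labels <- c(2, 1, 1)  # outcome for white: 0 = loss, 1 = draw, 2 = win

# Factor columns are treated as categorical features by CatBoost.
pool <- catboost.load_pool(features, label = labels)

model <- catboost.train(
  pool,
  params = list(
    loss_function = "MultiClass",  # three-class win/draw/loss task
    iterations    = 500,
    depth         = 6,
    logging_level = "Silent"
  )
)

# Predicted win/draw/loss probabilities per game
catboost.predict(model, pool, prediction_type = "Probability")
```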
I have been fine-tuning the model for more than a year, and it does achieve reasonable fit metrics. However, it is important to note that a model like this will (and should) never achieve perfect accuracy in terms of predicting the outcome of a game. Chess players will know that circumstances both on and outside the board can be a deciding factor. Most notable for high-level games are the tournament standings (e.g. “a draw is enough to win the tournament”), but circumstances may also be game-related (getting into time trouble, blunders). So a prediction is always an average in the long run, based on available information.
That being said, I have achieved great results using the model on the betting market, with an undisclosed but (very) high ROI, to the point where some bookies are not interested in me wagering anymore. Nothing significant, but it indicates that the model is “competitive”, so to speak.
Descriptive statistics
Before diving into the results, we should have a look at some initial descriptive statistics, just to get a feel for where we stand.
Player rating and performance
The following table is the June 2022 URS rating list of the players participating in the Candidates:
Yes, Nakamura is the highest rated on that list, and yes, Firouzja is the second lowest rated. This is quite different from the official FIDE rating lists as mentioned earlier. Let us have a look at their rating development over time:
There are two things to highlight. First, Firouzja is an interesting case, since he has been on the rise and is the player with the highest relative rating development. This is something which is challenging for a model to capture, but using weighted performance makes the games prior to 2021 close to negligible. Second, ratings are almost static during the corona crisis. There is a specific reason for this, and I suggest reading about it on the URS webpage (news post from November 20, 2020).
Now for the weighted performances. The following is a table (sortable) with the current weighted performance ratings for each candidate, both overall and with each colour, alongside weighted win/draw/loss percentages. The percentages are weighted using the exponential decay as well, with padded draws.
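As a sketch of the mechanics (the padding constant and exact scheme are my assumptions, not the precise Chessmetrics-style version used here), such padded, weighted percentages could be computed along these lines, reusing game_weight() from above:

```r
# Illustrative only: weighted win/draw/loss fractions with padded draws.
# `pad` adds a fixed amount of drawn-game weight, pulling small samples
# towards a draw-heavy baseline.
weighted_wdl <- function(results, weights, pad = 4) {
  stopifnot(length(results) == length(weights))
  w <- c(
    win  = sum(weights[results == "win"]),
    draw = sum(weights[results == "draw"]) + pad,
    loss = sum(weights[results == "loss"])
  )
  w / sum(w)
}

# Example: three games played 10, 40 and 200 days ago
weighted_wdl(c("win", "draw", "loss"), game_weight(c(10, 40, 200)))
```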
Alright, a couple of things worth mentioning. Ding has some impressive statistics, although most of the games in consideration are from his April speed run to reach 30 classical games to qualify. However, his performance is not as high as Nakamura's, who had an incredible performance in the Grand Chess Tour. Also noteworthy is Radjabov, with a weighted win-% of 1.2 and by far the lowest performance rating.
H2H statistics
Next, let us have a look at the H2H statistics between the players in the field (classical games only).
The table is read from left to right for white/black. The number in the bracket is the number of games. So, for example, Caruana has 4 games in the database with white against Ding, with a score of 50%. We immediately see that we have very little H2H information on Radjabov and Firouzja. In general, having white carries a notable advantage (we knew that), and in particular Ding, Nakamura and Nepomniachtchi have good scores with white against the field. With the black pieces, no one really stands out by performing significantly better than the others.
Predictions
Finally we arrive at some actual predictions. I set up the pairings and ran 50’000 simulations, updating the relevant variables between each round. I only simulate the main tournament, so no tiebreaks. The tiebreak format is tedious to code, but some players definitely have an advantage over others if it ends in a tie.
Since I include variables like tournament performance, the outcome of the previous game and the tournament score, each simulation is “unique”. That is to say, Player A facing Player B in round 4, where Player A has 3/4 and just won a game, will not result in the same prediction for that particular game as if Player A has 2/4 and just made a draw.
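To make this concrete, below is a heavily simplified sketch of a single simulated tournament. predict_game() is a placeholder for the fitted model, and the structure and names are assumptions rather than the actual code:

```r
# One simulated tournament; repeat ~50'000 times and tabulate the winners.
# predict_game() must return win/draw/loss probabilities for white.
simulate_tournament <- function(pairings, players, predict_game) {
  scores      <- setNames(rep(0, length(players)), players)
  last_result <- setNames(rep(NA_character_, length(players)), players)

  for (round in seq_along(pairings)) {
    for (game in pairings[[round]]) {
      white <- game$white
      black <- game$black
      # Round-dependent variables (current scores, previous results, ...)
      # are updated between rounds and fed back into the model.
      p <- predict_game(white, black, round, scores, last_result)
      outcome <- sample(c("win", "draw", "loss"), size = 1, prob = p)
      scores[white] <- scores[white] + switch(outcome, win = 1, draw = 0.5, loss = 0)
      scores[black] <- scores[black] + switch(outcome, win = 0, draw = 0.5, loss = 1)
      last_result[white] <- outcome
      last_result[black] <- switch(outcome, win = "loss", draw = "draw", loss = "win")
    }
  }
  scores
}
```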
Who is going to win?
So there it is. First of all, this is an indication that the tournament is definitely a very close affair! Second, the predicted probabilities of Nakamura and Duda winning the tournament outright vs Firouzja are most likely going to result in some comments and questions. Also, Radjabov with under 1% to win outright is very low. I have to reiterate that a model is no more than the data that goes into it (and the subjective choices we make with regard to that data).
Are the predictions reasonable? I believe so, in particular when we already have an idea about the variables going into the model. But I also acknowledge the fact that the variability in expected score and rank is quite high.
I will not get into the chances of each player winning the tournament from a subjective standpoint, but I do encourage people to listen to two great podcasts: The Chicken Chess Club and The Perpetual Chess, which both have a Candidates preview episode.
Other interesting observations
We can calculate a ton of different tournament statistics, and this is just to show a couple of them. One thing that sparked my interest was Firouzja's chance of winning outright vs finishing outright last. He has a higher chance of winning outright compared to Rapport, but also a higher chance of finishing outright last. This is due to the variability in expected score across the simulations:
It can be hard to see, but Firouzja has the highest variability in the expected score, followed by Duda and Nakamura. This is consistent with the weighted win/draw/loss rates presented earlier. Firouzja is often involved in decisive games, and this shows. If he can get off to a good start, it may further increase his chances of winning the following games. In other words, he is the player who is most subject to being “on a roll”. On the other end of the spectrum, we have Radjabov, who has a high weighted draw rate (and arguably loss rate) overall, which results in less variability in the expected score across 14 rounds.
Speaking of rounds, these are the projected expected scores in the first half of the tournament:
If the interest is there, I will provide an updated table of predicted winning probabilities alongside predictions for each round. The Candidates is definitely a tournament which is going to be interesting to follow, and all outcomes are possible!