Predicting the FIDE Chess Candidates 2022

Introduction

The FIDE Chess Candidates tournament is just around the corner and is arguably the most anticipated tournament of the year. A few people have posted predictions on the winner, and I thought it would be a great idea to join in, using a framework that I have been actively developing for around a year.

Setup

Before presenting predictions, it is important to establish the baseline on which the predictions are made. Loads of tools, databases and lists are available to use for predictive modelling, and they will all produce different results. Below is an overview of what goes into my predictive framework.

Data

The initial measurement of player strength is based on the Universal Rating System (URS). One might question why the official FIDE Elo ratings are not used. Without going into much detail, I believe that a simultaneous rating calculation based on the current player pool solves some of the issues in the Elo system. However, this does result in the ranking order of top players (and hence the players participating in the Candidates) looking different from the official FIDE rating list and the popular live rating lists at 2700chess. This will inevitably impact the predictions.

The core of the game data is the TWIC database from January 2018 onwards with the following criteria:

  • Games from players with a URS rating of 2550 and above on the most recent URS rating list.
  • Games where both players are rated at least 2500. This criterion is implemented to partly protect against players (typically juniors) who have been rapidly improving and hence have a URS rating which does not reflect their actual playing strength.
  • Games in “official” tournaments only, with a few exceptions. For example, the database does not contain many of the chess.com events, such as Titled Tuesday, but does contain the Magnus Carlsen/Meltwater Chess Tour. In the end, it is a list of tournaments where I have made a subjective evaluation of whether the format, prize fund etc. warranted best play from the participants. A sketch of the filtering is given after this list.
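As a minimal sketch of how these criteria might be applied (the game table, column names and example events below are my own invention, not the actual TWIC schema):

```r
library(dplyr)

# Toy stand-in for the TWIC-derived game table; the schema is invented
games <- data.frame(
  white     = c("Ding", "JuniorX", "Nakamura"),
  black     = c("Duda", "Rapport", "AmateurY"),
  white_urs = c(2788, 2470, 2775),
  black_urs = c(2760, 2745, 2430),
  event     = c("Tata Steel Masters", "Open A", "Titled Tuesday")
)
approved_events <- c("Tata Steel Masters", "Open A")

filtered <- games %>%
  filter(
    pmax(white_urs, black_urs) >= 2550,  # involves a player on the 2550+ list
    pmin(white_urs, black_urs) >= 2500,  # both players rated at least 2500
    event %in% approved_events           # subjectively vetted "official" events
  )
# Only the Ding-Duda game survives all three criteria here
```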

For each tournament on the final list, I have manually added the following bits of information:

  1. Tournament time control (classical, rapid, blitz)

  2. Tournament rounds

  3. The tournament format (swiss, round-robin, match, team tournament)

  4. Online (yes/no)

The above additions are quite important, since they change the dynamics of a given game. Classical, rapid and blitz games are separated into three different databases. Tournament rounds can help capture changes in the predicted outcome of a game based on tournament progress. Whether a game was played online or not matters given the popularity of online events from 2020 onwards: we have often seen players who excel in online tournaments but have not performed at the same level OTB. However, this indicator is irrelevant for classical games, since practically no classical tournaments have been played online at the highest level.

Shoutout to Mark Crowther for maintaining the TWIC database, which is an invaluable tool for people like me who like to fiddle with chess statistics on a weekly updated database.

Variables and modelling

Besides the simple game statistics mentioned above, such as the URS rating of each player and the tournament format, we can extract several statistics from a given matchup. The following is not a complete list, but it contains some of the more interesting/relevant variables computed prior to setting up the model:

  • Player rating development: how much a player's URS rating has developed since the same day the year prior.
  • H2H statistics: the fraction of H2H wins/draws/losses going into a game, weighted by the H2H game count.
  • Player category: the category of the player, split by world rank on the URS rating list.
  • Weighted player performance: the player's overall weighted performance, their weighted performance with a given colour, and their weighted performance against the opponent's category.
  • Player tournament performance: the player's round-by-round tournament performance and the result of the previous game.
  • Player activity: the number of games in the past 12 months and the time since the last game.

Aside from the above, further, though less interesting, variables are added, and I also include relevant interactions between the variables.
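To make one of these concrete: one possible reading of the H2H variable above is win/draw/loss fractions that are shrunk toward a neutral prior when few head-to-head games exist. The prior and the shrinkage constant k below are my assumptions, not necessarily the actual construction:

```r
# Sketch of an H2H feature: raw win/draw/loss fractions shrunk toward a
# neutral prior when the head-to-head sample is small (prior and k are
# illustrative assumptions)
h2h_fractions <- function(wins, draws, losses,
                          prior = c(0.25, 0.50, 0.25), k = 10) {
  n <- wins + draws + losses
  raw <- if (n > 0) c(wins, draws, losses) / n else prior
  (n * raw + k * prior) / (n + k)  # more H2H games -> closer to raw fractions
}

h2h_fractions(4, 6, 2)  # 12 games: dominated by the observed record
h2h_fractions(0, 1, 0)  # 1 game: pulled almost entirely toward the prior
```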

A small note on “weighted player performance” is necessary. By “weighted”, I refer to a version of the calculated performance rating described on Chessmetrics. However, instead of using a linear formula, I treat the weight as exponential decay as shown below.
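Concretely, as a sketch (the function name is my own; the half-life value comes from the description that follows):

```r
# Exponential-decay game weight: weight 1 today, halved every 157 days
game_weight <- function(days_ago, half_life = 157) {
  0.5 ^ (days_ago / half_life)
}

round(game_weight(c(0, 157, 365, 730)), 2)
# 1.00 0.50 0.20 0.04
```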

The function has a half-life of around 157 days, i.e. a game played 157 days ago carries only half the weight of a game played today. Similarly, a game played one year ago has a weight of 0.2, and a game played two years ago has a weight of only 0.04. This plays a big role in the uncertainty surrounding expected performance, especially during the corona period, when many players were inactive for long stretches. Aside from weighting, I also use padding as on Chessmetrics, although in a less punishing form.

For modelling the predicted outcome of a given game, I use a machine learning approach, more specifically the CatBoost (version 1.0.6) gradient boosting framework. In essence, predicting the outcome of a given game is a classification task (win/draw/loss) based on a number of variables. CatBoost is convenient due to the way it handles categorical variables, which are common in the database. To optimise hyperparameters I make use of Bayesian optimisation (some more details here). Everything is run in the R programming language, utilising the Tidymodels framework.
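As a rough sketch of what the modelling step could look like (this is not my actual pipeline as described; the toy data frame, column names and hyperparameters are invented for illustration), using the catboost R package directly:

```r
library(catboost)  # assumes the catboost R package is installed

# Invented toy feature frame: one row per game, outcome from white's
# perspective (0 = loss, 1 = draw, 2 = win)
train_df <- data.frame(
  rating_diff = c(25, -10, 60, -40, 15, -5),
  white_wperf = c(2780, 2745, 2802, 2710, 2764, 2733),
  format      = factor(c("round-robin", "swiss", "round-robin",
                         "match", "swiss", "round-robin")),
  online      = factor(c("no", "no", "yes", "no", "yes", "no"))
)
outcome <- c(2, 1, 2, 0, 1, 1)

# Factor columns are treated as categorical features automatically
pool <- catboost.load_pool(data = train_df, label = outcome)

model <- catboost.train(pool, params = list(
  loss_function = "MultiClass",   # three-class task: win/draw/loss
  iterations    = 50,
  logging_level = "Silent"
))

# Per-game win/draw/loss probabilities
catboost.predict(model, pool, prediction_type = "Probability")
```

On the Tidymodels side, the Bayesian optimisation step could, for instance, be done with tune_bayes() from the tune package, though whether that is the exact mechanism used is not stated here.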

I have been fine-tuning the model for more than a year, and it achieves reasonable fit metrics. However, it is important to note that a model like this will (and should) never achieve perfect accuracy in predicting the outcome of a game. Chess players will know that circumstances both on and outside the board can be deciding factors. Most notable for high-level games are the tournament standings (e.g. “a draw is enough to win the tournament”), but factors may also be game-related (getting into time trouble, blunders). So a prediction is always a long-run average based on the available information.

That being said, I have achieved great results using the model on the betting market, with an undisclosed but (very) high ROI, to the point where some bookies are no longer interested in taking my wagers. Nothing significant, but it indicates that the model is “competitive”, so to speak.

Descriptive statistics

Before diving into the results, we should have a look at some initial descriptive statistics, just to get a feel for where we stand.

Player rating and performance

The following table shows the June 2022 URS rating list for the players participating in the Candidates:


Yes, Nakamura is the highest rated on that list, and yes, Firouzja is the second lowest rated. This is quite different from the official FIDE rating lists as mentioned earlier. Let us have a look at their rating development over time:


There are two things to highlight. First, Firouzja is an interesting case, since he has been on the rise and is the player with the highest relative rating development. This is something which is challenging for a model to capture, but using weighted performance makes the games prior to 2021 close to negligible. Second, ratings are almost static during the corona crisis. There is a specific reason for this, and I suggest reading about it on the URS webpage (news post from November 20, 2020).

Now for the weighted performances. The following table shows the current weighted performance ratings for each candidate, both overall and with each colour, alongside weighted win/draw/loss percentages. The percentages are weighted using the same exponential decay, with padded draws.


Alright, a couple of things are worth mentioning. Ding has some impressive statistics, although most of the games in consideration are from his April speed run to reach 30 classical games to qualify. However, his performance is not as high as Nakamura's, who had an incredible performance in the Grand Chess Tour. Also noteworthy is Radjabov, with a weighted win-% of 1.2 and by far the lowest performance rating.

H2H statistics

Next, let us have a look at the H2H statistics between the players in the field (classical games only).


The table is read from left to right for white/black. The number in brackets is the number of games. So, for example, Caruana has 4 games in the database with white against Ding, with a score of 50%. We immediately see that we have very little H2H information on Radjabov and Firouzja. In general, having white carries a notable advantage (we knew that), and in particular Ding, Nakamura and Nepomniachtchi have good scores with white against the field. With the black pieces, no one really stands out by performing significantly better than the others.

Predictions

Finally, we arrive at some actual predictions. I set up the pairings and ran 50’000 simulations, updating the relevant variables between each round. I only simulate the main tournament, so no tiebreaks. The tiebreak format is tedious to code, but some players definitely have an advantage over others if the tournament ends in a tie.

Since I include variables like tournament performance, the outcome of the previous game and the tournament score, each simulation is “unique”. That is to say, a round-4 game where Player A faces Player B with 3/4 and coming off a win will not produce the same prediction as one where Player A has 2/4 and just made a draw. A heavily simplified sketch of the simulation loop follows below.
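In this sketch, predict_game() is a stub standing in for the fitted model; the real framework would condition its probabilities on the state variables above and follow the official pairings rather than a generated double round-robin:

```r
set.seed(2022)

players <- c("Nepomniachtchi", "Caruana", "Ding", "Nakamura",
             "Firouzja", "Duda", "Rapport", "Radjabov")

# Stub: fixed white-win/draw/black-win probabilities. In the real model these
# depend on ratings, H2H, current score, previous result, round, etc.
predict_game <- function(white, black, scores, game_no) {
  c(0.25, 0.55, 0.20)
}

simulate_tournament <- function() {
  scores <- setNames(rep(0, length(players)), players)
  # Double round-robin: each ordered (white, black) pair plays once
  pairings <- expand.grid(white = players, black = players,
                          stringsAsFactors = FALSE)
  pairings <- pairings[pairings$white != pairings$black, ]
  for (i in seq_len(nrow(pairings))) {
    p   <- predict_game(pairings$white[i], pairings$black[i], scores, i)
    res <- sample(c("white", "draw", "black"), 1, prob = p)
    if (res == "white") {
      scores[pairings$white[i]] <- scores[pairings$white[i]] + 1
    } else if (res == "black") {
      scores[pairings$black[i]] <- scores[pairings$black[i]] + 1
    } else {
      scores[pairings$white[i]] <- scores[pairings$white[i]] + 0.5
      scores[pairings$black[i]] <- scores[pairings$black[i]] + 0.5
    }
  }
  scores
}

# A handful of runs for illustration (the article uses 50'000); ties go to
# the first player by index here, since tiebreaks are not simulated
sims    <- replicate(500, simulate_tournament())
winners <- players[apply(sims, 2, which.max)]
round(sort(table(factor(winners, levels = players)) / length(winners),
           decreasing = TRUE), 3)
```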

Who is going to win?


So there it is. First of all, this is an indication that the tournament is definitely a very close affair! Second, the predicted probabilities of Nakamura and Duda winning the tournament outright versus Firouzja's are most likely going to prompt some comments and questions. Also, Radjabov's sub-1% chance of winning outright is very low. I have to reiterate that a model is no more than the data that goes into it (and the subjective choices we make with regard to that data).

Are the predictions reasonable? I believe so, in particular when we already have an idea about the variables going into the model. But I also acknowledge that the variability in expected score and rank is quite high.

I will not get into each player's chances of winning the tournament from a subjective standpoint, but I do encourage people to listen to two great podcasts, The Chicken Chess Club and The Perpetual Chess, which both have a Candidates preview episode.

Other interesting observations

We can calculate a ton of different tournament statistics, and this is just to show a couple of them. One thing that sparked my interest was Firouzja’s chance of winning outright versus finishing outright last. He has a higher chance of winning outright than Rapport, but also a higher chance of finishing outright last. This is due to the variability in expected score across the simulations:

It can be hard to see, but Firouzja has the highest variability in expected score, followed by Duda and Nakamura. This is consistent with the weighted win/draw/loss rates presented earlier. Firouzja is often involved in decisive games, and it shows. If he gets off to a good start, it may further increase his chances of winning the following games. In other words, he is the player most subject to being “on a roll”. At the other end of the spectrum, we have Radjabov, who has a high weighted draw rate (and, arguably, loss rate) overall, which results in less variability in expected score across 14 rounds.

Speaking of rounds, these are the projected expected scores in the first half of the tournament:


If the interest is there, I will provide an updated table of predicted winning probabilities alongside predictions for each round. The Candidates is definitely going to be an interesting tournament to follow, and all outcomes are possible!