Introduction

Data

The data is an extraction of Chess Game data, which was collected from Lichess.org using Lichess API (which enables collection of any given users game history) by Mian Faisal. The full dataset is available on Kaggle.

Initial modification made to the original data set

  • Only included data for rated chess games.
  • Replaced their created_at and last_move variables with game_time, which shows the time it takes to finish the game in seconds.
  • Added move_time, which shows the maximum game time (in seconds) allowed at the beginning, and increment, which shows the added time after each move (in seconds).
  • Added the rating differences rating_dif, which is rating of player playing white minuses rating of player playing black.
  • Removed all the duplicates.
  • Only included the generic categories of opening games, instead of having specific variations.

There are 4515 entries, 17 variables.

library(readr)
ChessData <- read_csv("~/Documents/Math_248/F22/Project/ChessData.csv")

The variables are:

  • id : unique game ID
  • game_time : time it takes to finish the game (in seconds)
  • turns : number of turns in the game (a turn is completed when both black and white have moved)
  • victory_status : how the game ended (black wins, white wins, draw)
  • winner : game result
  • move_time : maximum game time (in seconds) allowed at the beginning
  • increment : added time after each move (in seconds)
  • increment_code : codes corresponding to different total allowed game time with increments
  • white_id : player playing white’s ID
  • white_rating : player playing white’s Glicko-2 rating
  • black_id : player playing white’s ID
  • black_rating : player playing white’s Glicko-2 rating
  • rating_dif : rating of player playing white minues rating of player playing black
  • moves : list of all moves needed to end the game
  • opening_eco : ECO code corresponding to the chess opening moves used.
  • opening_name : chess opening strategy used
  • opening_ply : number of moves in the opening phrase

Research Idea

We will predict the chess game result winner for a player given the color they will play, and other appropriate predictors.

Potential Predictors

rating_dif, game_time, turns, opening_name, victory_status, move_time

library(ggplot2)
library(gridExtra)
grid.arrange(
  (ggplot(data=ChessData, aes(x=game_time, y = winner)) +
  geom_violin(draw_quantiles = c(0.25, 0.5, 0.75))),
  
  (ggplot(data=ChessData, aes(x=move_time, y = winner)) +
  geom_violin(draw_quantiles = c(0.25, 0.5, 0.75))),
  
  (ggplot(data=ChessData, aes(x=turns, y = winner)) +
  geom_violin(draw_quantiles = c(0.25, 0.5, 0.75))),
  
  (ggplot(data=ChessData, aes(x=opening_name, fill=winner)) +
  geom_histogram(stat = "count", position = "fill")),
  
  (ggplot(data=ChessData, aes(x=opening_name, fill=winner)) +
     geom_histogram(stat = "count")),
  
  (ggplot(data=ChessData, aes(x=victory_status, fill=winner)) +
     geom_histogram(stat = "count", position = "fill")),

  (ggplot(data=ChessData, aes(x=rating_dif, y=winner)) +
  geom_violin(draw_quantiles = c(0.25, 0.5, 0.75))),

nrow=4)

For game_time and victory_status, there do not seem to be clear differences given different outcomes of winner.

For opening_name, there do not seem to be clear differences given different outcomes of winner, especially among the popular opening games used. Some opening games show clear favor for an outcome of winner, but these opening games do not appear to be popularly used/appear less in our dataset.

For move_time and turns, it is clear that games with winner as draw have a higher IQR compared to games with winner as black or white.

For rating_dif, games with winner as draw has its median around 0 (two players having similar ratings). Games with winner as white has its median as well as the whole IQR larger than 0 (player playing white having a higher rating). Games with winner as black has its median as well as the whole IQR slightly less than 0 (player playing black having a higher rating).

rating_dif, move_time, and turns seem to be the best predictors out of the chosen six.

However, as move_time and turns both show the time it takes to complete the game (move_time shows the actual amount of time in second, while turns show the time in the number of steps), it is reasonable to question if we would need both of these in our model.

We will first create a model that only include rating_dif and turns as predictors, then compare the accuracy of that model to our second model that include all three.

New ChessData

NewChessData <- ChessData[c("winner", "rating_dif", "move_time", "turns")]
str(NewChessData)
## tibble [4,515 × 4] (S3: tbl_df/tbl/data.frame)
##  $ winner    : chr [1:4515] "black" "white" "white" "white" ...
##  $ rating_dif: num [1:4515] -475 316 -34 407 139 -74 -118 -127 -94 -193 ...
##  $ move_time : num [1:4515] 480 480 480 300 720 720 900 540 900 900 ...
##  $ turns     : num [1:4515] 10 21 43 29 34 40 41 78 32 34 ...

Research Question

We will predict the chess game result winner for a player given the color they will play, rating_dif, and turns.

Quantitative predictors: rating_dif, turns

Multinomial Logistic Regression - Theoretical Model

Our response variable winner is a categorical variable with three nominal/un-ordered possible outcomes/categories white, black, draw. Therefore, multinomial logistic regression can be used.

Linear predictor:

\[\eta = f(rating\_dif, turns)\]

  • Compare to Binomial Logistic Regression: Response variable is a categorical variable with two possible outcomes. This only requires one logistic regression model, for example, the a general link function of binomial logistic regression would look like this: \[log(\frac{\pi}{1 - \pi}) = \alpha + \beta X_1 + \theta X_2 + \phi X_3\] In multiple logistic regression, if we have k possible outcomes, it is required to have k-1 link functions, and we need to pick a reference outcome/category.

Let’s winner = white` be our reference category.

First model: Between winner = black and winner = white:

\[log(\frac{Prob(winner = black)}{Prob(winner = white)})\] \[= \alpha_1 + \theta_1 \times rating\_dif + \phi_1 \times turns\]

Second model: Between winner = draw and winner = white:

\[log(\frac{Prob(winner = draw)}{Prob(winner = white)})\] \[= \alpha_2 + \theta_2 \times rating\_dif + \phi_2 \times turns\]

Distribution:

\[winner \sim Multinomial(n, prob(winner=white))\]

Conditions:

  • Linearity: the relationship between the log(Probability of each outcome) and the predictors are linear

  • No multicollinearity between the possible outcomes of response variable

cor(NewChessData[, 2:3])
##              rating_dif    move_time
## rating_dif  1.000000000 -0.003865911
## move_time  -0.003865911  1.000000000

  • Independence: (potentially violated) as the data is collected over time on a chess-playing website, there is no guarantee that two players would not play multiple games together.

  • No outliers: (violated)

grid.arrange(ggplot(data = NewChessData, aes(x = turns)) + geom_boxplot(), ggplot(data = NewChessData,
    aes(x = rating_dif)) + geom_boxplot(), nrow = 1)

Multinomial Logistic Regression Model for our ChessData

Split the Data into Training and Testing Sets

set.seed(1234)
ChessData_split <- initial_split(NewChessData, prop = 0.75)

ChessData_train <- training(ChessData_split)
ChessData_test <- testing(ChessData_split)

Train the Model

ChessData_train$winner <- relevel(as.factor(ChessData_train$winner), ref = "white")

ChessData.model <- multinom(winner ~ rating_dif + turns, data = ChessData_train)
## # weights:  12 (6 variable)
## initial  value 3719.901209 
## iter  10 value 2552.965203
## final  value 2544.024695 
## converged

Predict

Predict on the Training Set

ChessData_train$winnerPredicted <- predict(ChessData.model, newdata = ChessData_train,
    "class")

# Classification table
train_table <- table(ChessData_train$winner, ChessData_train$winnerPredicted)
train_table
##        
##         white black draw
##   white  1206   490    2
##   black   672   871    1
##   draw     70    66    8
# accuracy
sum(train_table["white", "white"], train_table["black", "black"], train_table["draw",
    "draw"])/sum(train_table) * 100
## [1] 61.57708

The accuracy rate of our model on training set is 61.58%.

Predict on the Testing Set

ChessData_test$winnerPredicted <- predict(ChessData.model, newdata = ChessData_test,
    "class")

# Classification table
test_table <- table(ChessData_test$winner, ChessData_test$winnerPredicted)
test_table
##        
##         white black draw
##   black   195   307    0
##   draw     28    19    0
##   white   424   155    1
# accuracy
sum(test_table["white", "white"], test_table["black", "black"], test_table["draw",
    "draw"])/sum(test_table) * 100
## [1] 64.74756

The accuracy rate of our model on testing set is 64.75%.

Three-predictor Multinomial Logistic Model

Linear predictor:

\[\eta = f(rating\_dif, turns, move\_time)\]

First model: Between winner = black and winner = white:

\[log(\frac{Prob(winner = black)}{Prob(winner = white)})\] \[= -0.0486 + 0 \times move\_time - 0.0044 \times rating\_dif - 0.0006 \times turns\]

Second model: Between winner = draw and winner = white:

\[log(\frac{Prob(winner = draw)}{Prob(winner = white)})\] \[= -4.4857 + 0.0002 \times move\_time - 0.0028 \times rating\_dif + 0.025 \times turns\]

Distribution:

\[winner \sim Multinomial(n, prob(winner=white))\]

ChessData.model2 <- multinom(winner ~ rating_dif + move_time + turns, data = ChessData_train)
## # weights:  15 (8 variable)
## initial  value 3719.901209 
## iter  10 value 2695.150266
## final  value 2540.044202 
## converged
tidy(ChessData.model2) %>%
    knitr::kable(digits = 4, format = "markdown")
y.level term estimate std.error statistic p.value
black (Intercept) -0.0486 0.1036 -0.4690 0.6391
black rating_dif -0.0044 0.0002 -18.8210 0.0000
black move_time 0.0000 0.0001 0.1329 0.8943
black turns -0.0006 0.0013 -0.4721 0.6368
draw (Intercept) -4.4857 0.2409 -18.6221 0.0000
draw rating_dif -0.0028 0.0005 -5.5127 0.0000
draw move_time 0.0002 0.0001 3.1087 0.0019
draw turns 0.0250 0.0024 10.4937 0.0000

move_time and turns are not significant predictors given the winner as black, as their corresponding p-values are 0.89 and 0.64 > 0.05 respectively. rating_dif is the only significant predictor in this case.

all three predictors are significant given the winner as draw.

Conditions:

  • Linearity: the relationship between the log(Probability of each outcome) and the predictors are linear

  • No multicollinearity between the possible outcomes of response variable

cor(NewChessData[, 2:4])
##              rating_dif    move_time       turns
## rating_dif  1.000000000 -0.003865911 -0.02636707
## move_time  -0.003865911  1.000000000 -0.08012186
## turns      -0.026367069 -0.080121864  1.00000000

  • Independence: (potentially violated) as the data is collected over time on a chess-playing website, there is no guarantee that two players would not play multiple games together.

  • No outliers: (violated)

ggplot(data = NewChessData, aes(x = move_time)) + geom_boxplot()

ChessData_train$winnerPredicted2 <- predict(ChessData.model2, newdata = ChessData_train,
    "class")
# Classification table
train_table2 <- table(ChessData_train$winner, ChessData_train$winnerPredicted2)
train_table2
##        
##         white black draw
##   white  1206   489    3
##   black   671   871    2
##   draw     70    65    9
# accuracy
sum(train_table2["white", "white"], train_table2["black", "black"], train_table2["draw",
    "draw"])/sum(train_table2) * 100
## [1] 61.60662
ChessData_test$winnerPredicted2 <- predict(ChessData.model2, newdata = ChessData_test,
    "class")

# Classification table
test_table2 <- table(ChessData_test$winner, ChessData_test$winnerPredicted2)
test_table2
##        
##         white black draw
##   black   195   307    0
##   draw     28    19    0
##   white   424   155    1
# accuracy
sum(test_table2["white", "white"], test_table2["black", "black"], test_table2["draw",
    "draw"])/sum(test_table2) * 100
## [1] 64.74756

The accuracy rate on training data and testing data are 61.6% and 64.75%. These are similar to the accuracy rates that we get with our two-predictor model (without move_time). Therefore, it is not necessary to use the three-predictor multinomial logistic model for our chess data.

Conclusion

The overall accuracy rate of our model is above 60%, which is good as it is higher than the accuracy rate of random guess (33%), or of all-in (50%).

Interpret Coefficients

ChessData.coef <- t((coef(ChessData.model)))

exp(ChessData.coef)
##                 black       draw
## (Intercept) 0.9564340 0.01418898
## rating_dif  0.9956237 0.99719349
## turns       0.9993820 1.02466657

turns:

  • As the number of turns increases by 1, the odd of black winning will change by 0.999 fold, meaning that player playing black is slightly less likely to win.

  • As the number of turns increases by 1, the odd of draw will change by 1.025 fold, meaning that a draw is more likely to happen.

rating_dif:

  • As the rating difference between the player playing white and black increases by 1, the odd of black winning will change by 0.996 fold. It means that player playing black is slightly less likely to win if player playing white has a higher rating.

  • As the rating difference between the player playing white and black increases by 1, the odd of draw will change by 0.997 fold, meaning that a draw is slightly less likely to happen.

Therefore, a larger rating difference between the player playing whire and black will be more likely to result in a win for white. In the case where a player is more disadvantaged in other aspects of the game, they may want to keep the game longer so that they may end up with a draw (and get some points for the game). Player playing black may want to keep the game shorter as a longer game will decrease their odd of winning.

Limitation

From the Data

  • The data is collected over time, so the independence condition is not met. If we can make sure that two players only meet once in our dataset, then the prediction based on our model would be different.

  • The result come from this model could only be generalized to rated chess games on Lichess. Other platforms or competitions may use a different rating system like Elo instead of Glicko-2 as used on Lichess. The difference in the environment that the players are in could also affect their thought process (playing though online platforms like Lichess may arguably be less stressful than playing in person).

From the Model

  • There could be more predictors in this model. I only picked out five variables that seem most rational to me to investigate in the first place.

  • Multinomial Logistic Regression model tend to overfit the data. Apply k-fold cross-validation could be a better approach than hold-out in resolving this overfitting problem.

  • Our model does not do well in predicting games with winner as draw. It may be because of the large overlapping area seen on the distributions given different predictor (Potential Predictor part) that games with winner as draw has with either games with winner as white, or games with winner as black. There may be other predictors that we could have included to help with this.

  • If there is not much differences coming from a predictor for different outcomes of the response variable, as we seen in the cases of opening_name, game_time, and victory_status, Multinomial Logistic Regression is not an effective model to use.

References

Chess Game Dataset. (2022, October 5). Kaggle. https://www.kaggle.com/datasets/faisaljanjua0555/chess-game-dataset

Frequently Asked Questions • lichess.org. (n.d.). https://lichess.org/faq

StatsTest.com. (2020, May 19). Multinomial Logistic Regression. https://www.statstest.com/multinomial-logistic-regression/