The data is an extraction of chess game data collected from Lichess.org by Mian Faisal, using the Lichess API (which enables collection of any given user's game history). The full dataset is available on Kaggle.
There are 4515 entries and 17 variables. The created_at and last_move variables were combined into game_time, which shows the time it takes to finish the game in seconds. Other variables of note are move_time, which shows the maximum game time (in seconds) allowed at the beginning; increment, which shows the time added after each move (in seconds); and rating_dif, which is the rating of the player playing white minus the rating of the player playing black.
library(readr)
ChessData <- read_csv("~/Documents/Math_248/F22/Project/ChessData.csv")
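For context, a minimal sketch of how the derived variables might be constructed from the raw file; the column names created_at, last_move, white_rating, and black_rating are assumptions about the raw data, not confirmed from the source, and the timestamps are assumed to be parsed as date-times.
library(dplyr)

# Hypothetical construction of the derived variables
# (created_at, last_move, white_rating, black_rating are assumed column names)
ChessData <- ChessData %>%
  mutate(
    game_time  = as.numeric(difftime(last_move, created_at, units = "secs")),
    rating_dif = white_rating - black_rating  # white minus black rating
  )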
We will predict the chess game result winner for a player given the color they will play, and other appropriate predictors. The candidate predictors are: rating_dif, game_time, turns, opening_name, victory_status, and move_time.
library(ggplot2)
library(gridExtra)

grid.arrange(
  # violin plots of the quantitative candidates, by winner
  ggplot(ChessData, aes(x = game_time, y = winner)) +
    geom_violin(draw_quantiles = c(0.25, 0.5, 0.75)),
  ggplot(ChessData, aes(x = move_time, y = winner)) +
    geom_violin(draw_quantiles = c(0.25, 0.5, 0.75)),
  ggplot(ChessData, aes(x = turns, y = winner)) +
    geom_violin(draw_quantiles = c(0.25, 0.5, 0.75)),
  # bar charts of the categorical candidates, filled by winner
  ggplot(ChessData, aes(x = opening_name, fill = winner)) +
    geom_bar(position = "fill"),
  ggplot(ChessData, aes(x = opening_name, fill = winner)) +
    geom_bar(),
  ggplot(ChessData, aes(x = victory_status, fill = winner)) +
    geom_bar(position = "fill"),
  ggplot(ChessData, aes(x = rating_dif, y = winner)) +
    geom_violin(draw_quantiles = c(0.25, 0.5, 0.75)),
  nrow = 4
)
For game_time and victory_status, the distributions do not differ clearly across the outcomes of winner.
For opening_name, there are also no clear differences across the outcomes of winner, especially among the popular openings. Some openings do clearly favor one outcome, but those openings appear infrequently in our dataset.
For move_time and turns, games with winner as draw clearly have a wider IQR than games with winner as black or white.
For rating_dif, games with winner as draw have their median around 0 (the two players having similar ratings). Games with winner as white have their median, and the whole IQR, above 0 (the player playing white having a higher rating). Games with winner as black have their median, and the whole IQR, slightly below 0 (the player playing black having a higher rating).
rating_dif, move_time, and
turns seem to be the best predictors out of the chosen
six.
However, since move_time and turns both reflect the length of the game (move_time measured in seconds, turns in the number of moves), it is reasonable to question whether we need both in our model.
We will first create a model that includes only rating_dif and turns as predictors, then compare its accuracy to a second model that includes all three.
NewChessData <- ChessData[c("winner", "rating_dif", "move_time", "turns")]
str(NewChessData)
## tibble [4,515 × 4] (S3: tbl_df/tbl/data.frame)
## $ winner : chr [1:4515] "black" "white" "white" "white" ...
## $ rating_dif: num [1:4515] -475 316 -34 407 139 -74 -118 -127 -94 -193 ...
## $ move_time : num [1:4515] 480 480 480 300 720 720 900 540 900 900 ...
## $ turns : num [1:4515] 10 21 43 29 34 40 41 78 32 34 ...
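Before modeling, it may also help to check the class balance of the response. A quick look (output omitted, as this check was not part of the original analysis):
# distribution of the three outcomes in the full dataset
table(NewChessData$winner)
prop.table(table(NewChessData$winner))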
We will predict the chess game result winner for a player given the color they will play, using rating_dif and turns as predictors.
Quantitative predictors: rating_dif,
turns
Our response variable winner is a categorical variable with three unordered (nominal) categories: white, black, and draw. Therefore, multinomial logistic regression can be used.
\[\eta = f(rating\_dif, turns)\]
Let winner = white be our reference category.
For winner = black versus winner = white: \[\log\left(\frac{Prob(winner = black)}{Prob(winner = white)}\right) = \alpha_1 + \theta_1 \times rating\_dif + \phi_1 \times turns\]
For winner = draw versus winner = white: \[\log\left(\frac{Prob(winner = draw)}{Prob(winner = white)}\right) = \alpha_2 + \theta_2 \times rating\_dif + \phi_2 \times turns\]
\[winner \sim \text{Multinomial}(n;\; p_{white}, p_{black}, p_{draw})\]
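To make the link between the two log-odds equations and the three probabilities explicit, they combine via a softmax-style transform. A sketch with placeholder coefficients (the function and its arguments are illustrative, not part of the original analysis):
# Probabilities implied by the two log-odds equations.
# a1, t1, p1 are the black-vs-white coefficients; a2, t2, p2 the draw-vs-white ones.
multinom_probs <- function(rating_dif, turns, a1, t1, p1, a2, t2, p2) {
  eta1 <- a1 + t1 * rating_dif + p1 * turns  # log-odds of black vs. white
  eta2 <- a2 + t2 * rating_dif + p2 * turns  # log-odds of draw vs. white
  denom <- 1 + exp(eta1) + exp(eta2)         # normalizing constant
  c(white = 1 / denom,
    black = exp(eta1) / denom,
    draw  = exp(eta2) / denom)               # the three probabilities sum to 1
}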
Linearity: the relationship between the log-odds of each outcome and the predictors is linear
No multicollinearity among the predictors
cor(NewChessData[, c("rating_dif", "turns")])
##             rating_dif       turns
## rating_dif  1.00000000 -0.02636707
## turns      -0.02636707  1.00000000
Independence: (potentially violated) as the data is collected over time on a chess-playing website, there is no guarantee that two players would not play multiple games together.
No outliers: (violated)
grid.arrange(
  ggplot(NewChessData, aes(x = turns)) + geom_boxplot(),
  ggplot(NewChessData, aes(x = rating_dif)) + geom_boxplot(),
  nrow = 1
)
library(rsample)  # initial_split(), training(), testing()
library(nnet)     # multinom()

set.seed(1234)
# 75% / 25% train-test split
ChessData_split <- initial_split(NewChessData, prop = 0.75)
ChessData_train <- training(ChessData_split)
ChessData_test <- testing(ChessData_split)
# make "white" the reference level of the response
ChessData_train$winner <- relevel(as.factor(ChessData_train$winner), ref = "white")
ChessData.model <- multinom(winner ~ rating_dif + turns, data = ChessData_train)
## # weights: 12 (6 variable)
## initial value 3719.901209
## iter 10 value 2552.965203
## final value 2544.024695
## converged
For winner = black versus winner = white: \[\widehat{\log\left(\frac{Prob(winner = black)}{Prob(winner = white)}\right)} = -0.0445 - 0.0044 \times rating\_dif - 0.0006 \times turns\]
For winner = draw versus winner = white: \[\widehat{\log\left(\frac{Prob(winner = draw)}{Prob(winner = white)}\right)} = -4.2553 - 0.0028 \times rating\_dif + 0.0244 \times turns\]
library(broom)  # tidy()
library(dplyr)  # %>%

tidy(ChessData.model) %>%
  knitr::kable(digits = 4, format = "markdown")
| y.level | term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|---|
| black | (Intercept) | -0.0445 | 0.0924 | -0.4820 | 0.6298 |
| black | rating_dif | -0.0044 | 0.0002 | -18.8228 | 0.0000 |
| black | turns | -0.0006 | 0.0013 | -0.4655 | 0.6415 |
| draw | (Intercept) | -4.2553 | 0.2254 | -18.8788 | 0.0000 |
| draw | rating_dif | -0.0028 | 0.0005 | -5.5631 | 0.0000 |
| draw | turns | 0.0244 | 0.0024 | 10.2936 | 0.0000 |
Given winner as draw, the p-values for turns and rating_dif are both approximately 0 (< 0.05), meaning both turns and rating_dif are important predictors in this case.
Given winner as black, the p-values for turns and rating_dif are 0.64 (> 0.05) and approximately 0, respectively. So while rating_dif is an important predictor in this case, turns is not.
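The fitted model can also return the three predicted probabilities directly, for instance for a hypothetical game where white is rated 100 points higher and the game lasts 40 turns (illustrative values, not from the data):
# predicted probabilities for one hypothetical game
predict(ChessData.model,
        newdata = data.frame(rating_dif = 100, turns = 40),
        type = "probs")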
# predicted classes on the training set
ChessData_train$winnerPredicted <- predict(ChessData.model, newdata = ChessData_train,
                                           type = "class")
# Classification table
train_table <- table(ChessData_train$winner, ChessData_train$winnerPredicted)
train_table
##
## white black draw
## white 1206 490 2
## black 672 871 1
## draw 70 66 8
# accuracy
sum(train_table["white", "white"], train_table["black", "black"],
    train_table["draw", "draw"]) / sum(train_table) * 100
## [1] 61.57708
The accuracy rate of our model on the training set is 61.58%.
# predicted classes on the testing set
ChessData_test$winnerPredicted <- predict(ChessData.model, newdata = ChessData_test,
                                          type = "class")
# Classification table
test_table <- table(ChessData_test$winner, ChessData_test$winnerPredicted)
test_table
##
## white black draw
## black 195 307 0
## draw 28 19 0
## white 424 155 1
# accuracy
sum(test_table["white", "white"], test_table["black", "black"],
    test_table["draw", "draw"]) / sum(test_table) * 100
## [1] 64.74756
The accuracy rate of our model on the testing set is 64.75%.
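Since this accuracy computation is repeated for the second model below, a small helper could replace the by-name sums. A convenience sketch, not part of the original analysis; it matches classes by name, so the row order of the table does not matter:
# overall accuracy (%) from a confusion table, matching classes by name
accuracy <- function(tab) {
  classes <- intersect(rownames(tab), colnames(tab))
  sum(sapply(classes, function(cl) tab[cl, cl])) / sum(tab) * 100
}
accuracy(test_table)  # should reproduce the 64.75% above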
\[\eta = f(rating\_dif, turns, move\_time)\]
For winner = black versus winner = white: \[\widehat{\log\left(\frac{Prob(winner = black)}{Prob(winner = white)}\right)} = -0.0486 + 0 \times move\_time - 0.0044 \times rating\_dif - 0.0006 \times turns\]
For winner = draw versus winner = white: \[\widehat{\log\left(\frac{Prob(winner = draw)}{Prob(winner = white)}\right)} = -4.4857 + 0.0002 \times move\_time - 0.0028 \times rating\_dif + 0.025 \times turns\]
\[winner \sim \text{Multinomial}(n;\; p_{white}, p_{black}, p_{draw})\]
ChessData.model2 <- multinom(winner ~ rating_dif + move_time + turns, data = ChessData_train)
## # weights: 15 (8 variable)
## initial value 3719.901209
## iter 10 value 2695.150266
## final value 2540.044202
## converged
tidy(ChessData.model2) %>%
knitr::kable(digits = 4, format = "markdown")
| y.level | term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|---|
| black | (Intercept) | -0.0486 | 0.1036 | -0.4690 | 0.6391 |
| black | rating_dif | -0.0044 | 0.0002 | -18.8210 | 0.0000 |
| black | move_time | 0.0000 | 0.0001 | 0.1329 | 0.8943 |
| black | turns | -0.0006 | 0.0013 | -0.4721 | 0.6368 |
| draw | (Intercept) | -4.4857 | 0.2409 | -18.6221 | 0.0000 |
| draw | rating_dif | -0.0028 | 0.0005 | -5.5127 | 0.0000 |
| draw | move_time | 0.0002 | 0.0001 | 3.1087 | 0.0019 |
| draw | turns | 0.0250 | 0.0024 | 10.4937 | 0.0000 |
Given winner as black, move_time and turns are not significant predictors, as their p-values are 0.89 and 0.64 respectively (both > 0.05); rating_dif is the only significant predictor in this case.
Given winner as draw, all three predictors are significant.
Linearity: the relationship between the log-odds of each outcome and the predictors is linear
No multicollinearity among the predictors
cor(NewChessData[, 2:4])
## rating_dif move_time turns
## rating_dif 1.000000000 -0.003865911 -0.02636707
## move_time -0.003865911 1.000000000 -0.08012186
## turns -0.026367069 -0.080121864 1.00000000
Independence: (potentially violated) as the data is collected over time on a chess-playing website, there is no guarantee that two players would not play multiple games together.
No outliers: (violated)
ggplot(data = NewChessData, aes(x = move_time)) + geom_boxplot()
# predicted classes on the training set (three-predictor model)
ChessData_train$winnerPredicted2 <- predict(ChessData.model2, newdata = ChessData_train,
                                            type = "class")
# Classification table
train_table2 <- table(ChessData_train$winner, ChessData_train$winnerPredicted2)
train_table2
##
## white black draw
## white 1206 489 3
## black 671 871 2
## draw 70 65 9
# accuracy
sum(train_table2["white", "white"], train_table2["black", "black"],
    train_table2["draw", "draw"]) / sum(train_table2) * 100
## [1] 61.60662
# predicted classes on the testing set (three-predictor model)
ChessData_test$winnerPredicted2 <- predict(ChessData.model2, newdata = ChessData_test,
                                           type = "class")
# Classification table
test_table2 <- table(ChessData_test$winner, ChessData_test$winnerPredicted2)
test_table2
##
## white black draw
## black 195 307 0
## draw 28 19 0
## white 424 155 1
# accuracy
sum(test_table2["white", "white"], test_table2["black", "black"],
    test_table2["draw", "draw"]) / sum(test_table2) * 100
## [1] 64.74756
The accuracy rates on the training and testing data are 61.6% and 64.75%, respectively. These are nearly identical to the accuracy rates of our two-predictor model (without move_time). Therefore, the three-predictor multinomial logistic model is not necessary for our chess data.
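A likelihood-ratio test offers a complementary, formal comparison of the two nested models. A sketch assuming ChessData.model and ChessData.model2 from above are still in the workspace:
# likelihood-ratio test for the nested multinom models
anova(ChessData.model, ChessData.model2)
# information criteria: lower AIC favors the more parsimonious fit
AIC(ChessData.model, ChessData.model2)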
The overall accuracy rate of our model is above 60%, which is good: it is higher than the accuracy of random guessing among three outcomes (33%) or of always predicting a single outcome such as white (about 50%).
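For reference, the majority-class baseline can be computed directly from the test split; from the confusion matrix above, white accounts for 580 of the 1129 test games, so always predicting white would score about 51%:
# accuracy of always predicting the most common outcome in the test set
max(table(ChessData_test$winner)) / nrow(ChessData_test) * 100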
# exponentiate the coefficients to obtain odds ratios
ChessData.coef <- t(coef(ChessData.model))
exp(ChessData.coef)
## black draw
## (Intercept) 0.9564340 0.01418898
## rating_dif 0.9956237 0.99719349
## turns 0.9993820 1.02466657
turns: as the number of turns increases by 1, the odds of black winning change by a factor of 0.999, meaning the player playing black is slightly less likely to win in a longer game. The odds of a draw change by a factor of 1.025, meaning a draw becomes more likely.
rating_dif: as the rating difference between the players playing white and black increases by 1, the odds of black winning change by a factor of 0.996; the player playing black is slightly less likely to win when the player playing white has a higher rating. The odds of a draw change by a factor of 0.997, meaning a draw becomes slightly less likely.
Therefore, a larger rating difference between the players playing white and black makes a win for white more likely. A player who is disadvantaged in other aspects of the game may want to prolong it, since they are more likely to end up with a draw (and get some points for the game). The player playing black may want to keep the game shorter, as a longer game decreases their odds of winning.
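To put the per-point odds ratios on a more tangible scale, we can compound the rating_dif effect over a 100-point rating gap (a quick check using the fitted coefficient from above):
# odds of black winning when white is rated 100 points higher
0.9956237^100       # per-point odds ratio compounded, ~0.64
exp(-0.0044 * 100)  # same quantity from the raw coefficient, ~0.64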
The data is collected over time, so the independence condition is not met. If we could ensure that each pair of players meets only once in our dataset, the predictions from our model might differ.
The results from this model can only be generalized to rated chess games on Lichess. Other platforms or competitions may use a different rating system, such as Elo instead of the Glicko-2 system used on Lichess. The playing environment could also affect players' thought processes (playing through an online platform like Lichess is arguably less stressful than playing in person).
There could be more predictors in this model. We only picked out the handful of variables that seemed most rational to investigate in the first place.
Multinomial logistic regression models tend to overfit the data. Applying k-fold cross-validation could be a better approach than a single hold-out split for addressing this overfitting problem.
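A minimal sketch of what 5-fold cross-validation might look like for the two-predictor model (the fold count and fold assignment are our choices here, not part of the original analysis):
library(nnet)

set.seed(1234)
k <- 5
# randomly assign each game to one of k folds
folds <- sample(rep(1:k, length.out = nrow(NewChessData)))
cv_acc <- sapply(1:k, function(i) {
  train <- NewChessData[folds != i, ]
  test  <- NewChessData[folds == i, ]
  train$winner <- relevel(as.factor(train$winner), ref = "white")
  fit <- multinom(winner ~ rating_dif + turns, data = train, trace = FALSE)
  # proportion of correctly classified games in the held-out fold
  mean(predict(fit, newdata = test, type = "class") == test$winner)
})
round(mean(cv_acc) * 100, 2)  # average accuracy across folds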
Our model does not do well at predicting games with winner as draw. This may be because of the large overlap, seen in the distributions of each predictor (Potential Predictor part), between games with winner as draw and games with winner as white or black. There may be other predictors we could have included to help with this; the per-class recall computed below makes the problem concrete.
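To quantify this, the per-class recall (the proportion of each actual outcome that the model predicts correctly) can be read from the test confusion matrix. A sketch using test_table from above:
# per-class recall: correct predictions for each actual class / class total
recall <- sapply(rownames(test_table), function(cl) test_table[cl, cl]) /
  rowSums(test_table)
round(recall * 100, 2)  # recall for draw is 0%: no draws were predicted correctly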
If there is not much difference in a predictor across the different outcomes of the response variable, as we saw in the cases of opening_name, game_time, and victory_status, multinomial logistic regression is not an effective model to use.
Chess Game Dataset. (2022, October 5). Kaggle. https://www.kaggle.com/datasets/faisaljanjua0555/chess-game-dataset
Frequently Asked Questions • lichess.org. (n.d.). https://lichess.org/faq
StatsTest.com. (2020, May 19). Multinomial Logistic Regression. https://www.statstest.com/multinomial-logistic-regression/