Final Project

Introduction

Data

The data is an extraction of Chess Game data, which was collected from Lichess.org using Lichess API (which enables collection of any given users game history) by Mian Faisal. The full dataset is available on Kaggle.

Initial modification made to the original data set

Only included data for rated chess games.
Replaced their created_at and last_move variables with game_time, which shows the time it takes to finish the game in seconds.
Added move_time, which shows the maximum game time (in seconds) allowed at the beginning, and increment, which shows the added time after each move (in seconds).
Added the rating differences rating_dif, which is rating of player playing white minuses rating of player playing black.
Removed all the duplicates.
Only included the generic categories of opening games, instead of having specific variations.

There are 4515 entries, 17 variables.

library(readr)
ChessData <- read_csv("~/Documents/Math_248/F22/Project/ChessData.csv")

The variables are:

id : unique game ID
game_time : time it takes to finish the game (in seconds)
turns : number of turns in the game (a turn is completed when both black and white have moved)
victory_status : how the game ended (black wins, white wins, draw)
winner : game result
move_time : maximum game time (in seconds) allowed at the beginning
increment : added time after each move (in seconds)
increment_code : codes corresponding to different total allowed game time with increments
white_id : player playing white’s ID
white_rating : player playing white’s Glicko-2 rating
black_id : player playing white’s ID
black_rating : player playing white’s Glicko-2 rating
rating_dif : rating of player playing white minues rating of player playing black
moves : list of all moves needed to end the game
opening_eco : ECO code corresponding to the chess opening moves used.
opening_name : chess opening strategy used
opening_ply : number of moves in the opening phrase

Research Idea

We will predict the chess game result winner for a player given the color they will play, and other appropriate predictors.

Potential Predictors

rating_dif, game_time, turns, opening_name, victory_status, move_time

library(ggplot2)
library(gridExtra)
grid.arrange(
  (ggplot(data=ChessData, aes(x=game_time, y = winner)) +
  geom_violin(draw_quantiles = c(0.25, 0.5, 0.75))),
  
  (ggplot(data=ChessData, aes(x=move_time, y = winner)) +
  geom_violin(draw_quantiles = c(0.25, 0.5, 0.75))),
  
  (ggplot(data=ChessData, aes(x=turns, y = winner)) +
  geom_violin(draw_quantiles = c(0.25, 0.5, 0.75))),
  
  (ggplot(data=ChessData, aes(x=opening_name, fill=winner)) +
  geom_histogram(stat = "count", position = "fill")),
  
  (ggplot(data=ChessData, aes(x=opening_name, fill=winner)) +
     geom_histogram(stat = "count")),
  
  (ggplot(data=ChessData, aes(x=victory_status, fill=winner)) +
     geom_histogram(stat = "count", position = "fill")),

  (ggplot(data=ChessData, aes(x=rating_dif, y=winner)) +
  geom_violin(draw_quantiles = c(0.25, 0.5, 0.75))),

nrow=4)

For game_time and victory_status, there do not seem to be clear differences given different outcomes of winner.

For opening_name, there do not seem to be clear differences given different outcomes of winner, especially among the popular opening games used. Some opening games show clear favor for an outcome of winner, but these opening games do not appear to be popularly used/appear less in our dataset.

For move_time and turns, it is clear that games with winner as draw have a higher IQR compared to games with winner as black or white.

For rating_dif, games with winner as draw has its median around 0 (two players having similar ratings). Games with winner as white has its median as well as the whole IQR larger than 0 (player playing white having a higher rating). Games with winner as black has its median as well as the whole IQR slightly less than 0 (player playing black having a higher rating).

rating_dif, move_time, and turns seem to be the best predictors out of the chosen six.

However, as move_time and turns both show the time it takes to complete the game (move_time shows the actual amount of time in second, while turns show the time in the number of steps), it is reasonable to question if we would need both of these in our model.

We will first create a model that only include rating_dif and turns as predictors, then compare the accuracy of that model to our second model that include all three.

New ChessData

NewChessData <- ChessData[c("winner", "rating_dif", "move_time", "turns")]
str(NewChessData)

## tibble [4,515 × 4] (S3: tbl_df/tbl/data.frame)
##  $ winner    : chr [1:4515] "black" "white" "white" "white" ...
##  $ rating_dif: num [1:4515] -475 316 -34 407 139 -74 -118 -127 -94 -193 ...
##  $ move_time : num [1:4515] 480 480 480 300 720 720 900 540 900 900 ...
##  $ turns     : num [1:4515] 10 21 43 29 34 40 41 78 32 34 ...

Research Question

We will predict the chess game result winner for a player given the color they will play, rating_dif, and turns.

Quantitative predictors: rating_dif, turns

Multinomial Logistic Regression - Theoretical Model

Our response variable winner is a categorical variable with three nominal/un-ordered possible outcomes/categories white, black, draw. Therefore, multinomial logistic regression can be used.

Linear predictor:

\[\eta = f(rating\_dif, turns)\]

Compare to Binomial Logistic Regression: Response variable is a categorical variable with two possible outcomes. This only requires one logistic regression model, for example, the a general link function of binomial logistic regression would look like this: \[log(\frac{\pi}{1 - \pi}) = \alpha + \beta X_1 + \theta X_2 + \phi X_3\] In multiple logistic regression, if we have k possible outcomes, it is required to have k-1 link functions, and we need to pick a reference outcome/category.

Let’s winner = white` be our reference category.

First model: Between `winner` = `black` and `winner` = `white`:

\[log(\frac{Prob(winner = black)}{Prob(winner = white)})\] \[= \alpha_1 + \theta_1 \times rating\_dif + \phi_1 \times turns\]

Second model: Between `winner` = `draw` and `winner` = `white`:

\[log(\frac{Prob(winner = draw)}{Prob(winner = white)})\] \[= \alpha_2 + \theta_2 \times rating\_dif + \phi_2 \times turns\]

Distribution:

\[winner \sim Multinomial(n, prob(winner=white))\]

Conditions:

Linearity: the relationship between the log(Probability of each outcome) and the predictors are linear
No multicollinearity between the possible outcomes of response variable

cor(NewChessData[, 2:3])

##              rating_dif    move_time
## rating_dif  1.000000000 -0.003865911
## move_time  -0.003865911  1.000000000

Independence: (potentially violated) as the data is collected over time on a chess-playing website, there is no guarantee that two players would not play multiple games together.
No outliers: (violated)

grid.arrange(ggplot(data = NewChessData, aes(x = turns)) + geom_boxplot(), ggplot(data = NewChessData,
    aes(x = rating_dif)) + geom_boxplot(), nrow = 1)

Multinomial Logistic Regression Model for our ChessData

Split the Data into Training and Testing Sets

set.seed(1234)
ChessData_split <- initial_split(NewChessData, prop = 0.75)

ChessData_train <- training(ChessData_split)
ChessData_test <- testing(ChessData_split)

Train the Model

ChessData_train$winner <- relevel(as.factor(ChessData_train$winner), ref = "white")

ChessData.model <- multinom(winner ~ rating_dif + turns, data = ChessData_train)

## # weights:  12 (6 variable)
## initial  value 3719.901209 
## iter  10 value 2552.965203
## final  value 2544.024695 
## converged

Link functions:

First model: Between `winner` = `black` and `winner` = `white`:

\[\widehat{log(\frac{Prob(winner = black)}{Prob(winner = white)})}\] \[= - 0.157 - 0.0044 \times rating\_dif - 0.0006 turns\]

Second model: Between `winner` = `draw` and `winner` = `white`:

\[\widehat{log(\frac{Prob(winner = draw)}{Prob(winner = white)})}\] \[= - 3.576 - 0.0028 \times rating\_dif + 0.0244 \times turns\]

tidy(ChessData.model) %>%
    knitr::kable(digits = 4, format = "markdown")

y.level	term	estimate	std.error	statistic	p.value
black	(Intercept)	-0.0445	0.0924	-0.4820	0.6298
black	rating_dif	-0.0044	0.0002	-18.8228	0.0000
black	turns	-0.0006	0.0013	-0.4655	0.6415
draw	(Intercept)	-4.2553	0.2254	-18.8788	0.0000
draw	rating_dif	-0.0028	0.0005	-5.5631	0.0000
draw	turns	0.0244	0.0024	10.2936	0.0000

Given winner as draw, p-values according to turns, and rating_dif are approximately 0 < 0.05, meaning that both turns and rating_dif are important predictor in this case.

Given winner as black, p-values according to turns, and rating_dif are 0.64 > 0.05, and approximately 0 respectively. These mean that while rating_dif is an important predictor in this case, turns is not.

Predict

Predict on the Training Set

ChessData_train$winnerPredicted <- predict(ChessData.model, newdata = ChessData_train,
    "class")

# Classification table
train_table <- table(ChessData_train$winner, ChessData_train$winnerPredicted)
train_table

##        
##         white black draw
##   white  1206   490    2
##   black   672   871    1
##   draw     70    66    8

# accuracy
sum(train_table["white", "white"], train_table["black", "black"], train_table["draw",
    "draw"])/sum(train_table) * 100

## [1] 61.57708

The accuracy rate of our model on training set is 61.58%.

Predict on the Testing Set

ChessData_test$winnerPredicted <- predict(ChessData.model, newdata = ChessData_test,
    "class")

# Classification table
test_table <- table(ChessData_test$winner, ChessData_test$winnerPredicted)
test_table

##        
##         white black draw
##   black   195   307    0
##   draw     28    19    0
##   white   424   155    1

# accuracy
sum(test_table["white", "white"], test_table["black", "black"], test_table["draw",
    "draw"])/sum(test_table) * 100

## [1] 64.74756

The accuracy rate of our model on testing set is 64.75%.

Three-predictor Multinomial Logistic Model

Linear predictor:

\[\eta = f(rating\_dif, turns, move\_time)\]

First model: Between `winner` = `black` and `winner` = `white`:

\[log(\frac{Prob(winner = black)}{Prob(winner = white)})\] \[= -0.0486 + 0 \times move\_time - 0.0044 \times rating\_dif - 0.0006 \times turns\]

Second model: Between `winner` = `draw` and `winner` = `white`:

\[log(\frac{Prob(winner = draw)}{Prob(winner = white)})\] \[= -4.4857 + 0.0002 \times move\_time - 0.0028 \times rating\_dif + 0.025 \times turns\]

Distribution:

\[winner \sim Multinomial(n, prob(winner=white))\]

ChessData.model2 <- multinom(winner ~ rating_dif + move_time + turns, data = ChessData_train)

## # weights:  15 (8 variable)
## initial  value 3719.901209 
## iter  10 value 2695.150266
## final  value 2540.044202 
## converged

tidy(ChessData.model2) %>%
    knitr::kable(digits = 4, format = "markdown")

y.level	term	estimate	std.error	statistic	p.value
black	(Intercept)	-0.0486	0.1036	-0.4690	0.6391
black	rating_dif	-0.0044	0.0002	-18.8210	0.0000
black	move_time	0.0000	0.0001	0.1329	0.8943
black	turns	-0.0006	0.0013	-0.4721	0.6368
draw	(Intercept)	-4.4857	0.2409	-18.6221	0.0000
draw	rating_dif	-0.0028	0.0005	-5.5127	0.0000
draw	move_time	0.0002	0.0001	3.1087	0.0019
draw	turns	0.0250	0.0024	10.4937	0.0000

move_time and turns are not significant predictors given the winner as black, as their corresponding p-values are 0.89 and 0.64 > 0.05 respectively. rating_dif is the only significant predictor in this case.

all three predictors are significant given the winner as draw.

Conditions:

Linearity: the relationship between the log(Probability of each outcome) and the predictors are linear
No multicollinearity between the possible outcomes of response variable

cor(NewChessData[, 2:4])

##              rating_dif    move_time       turns
## rating_dif  1.000000000 -0.003865911 -0.02636707
## move_time  -0.003865911  1.000000000 -0.08012186
## turns      -0.026367069 -0.080121864  1.00000000

Independence: (potentially violated) as the data is collected over time on a chess-playing website, there is no guarantee that two players would not play multiple games together.
No outliers: (violated)

ggplot(data = NewChessData, aes(x = move_time)) + geom_boxplot()

ChessData_train$winnerPredicted2 <- predict(ChessData.model2, newdata = ChessData_train,
    "class")
# Classification table
train_table2 <- table(ChessData_train$winner, ChessData_train$winnerPredicted2)
train_table2

##        
##         white black draw
##   white  1206   489    3
##   black   671   871    2
##   draw     70    65    9

# accuracy
sum(train_table2["white", "white"], train_table2["black", "black"], train_table2["draw",
    "draw"])/sum(train_table2) * 100

## [1] 61.60662

ChessData_test$winnerPredicted2 <- predict(ChessData.model2, newdata = ChessData_test,
    "class")

# Classification table
test_table2 <- table(ChessData_test$winner, ChessData_test$winnerPredicted2)
test_table2

##        
##         white black draw
##   black   195   307    0
##   draw     28    19    0
##   white   424   155    1

# accuracy
sum(test_table2["white", "white"], test_table2["black", "black"], test_table2["draw",
    "draw"])/sum(test_table2) * 100

## [1] 64.74756

The accuracy rate on training data and testing data are 61.6% and 64.75%. These are similar to the accuracy rates that we get with our two-predictor model (without move_time). Therefore, it is not necessary to use the three-predictor multinomial logistic model for our chess data.

Conclusion

The overall accuracy rate of our model is above 60%, which is good as it is higher than the accuracy rate of random guess (33%), or of all-in (50%).

Interpret Coefficients

ChessData.coef <- t((coef(ChessData.model)))

exp(ChessData.coef)

##                 black       draw
## (Intercept) 0.9564340 0.01418898
## rating_dif  0.9956237 0.99719349
## turns       0.9993820 1.02466657

`turns`:

As the number of turns increases by 1, the odd of black winning will change by 0.999 fold, meaning that player playing black is slightly less likely to win.
As the number of turns increases by 1, the odd of draw will change by 1.025 fold, meaning that a draw is more likely to happen.

`rating_dif`:

As the rating difference between the player playing white and black increases by 1, the odd of black winning will change by 0.996 fold. It means that player playing black is slightly less likely to win if player playing white has a higher rating.
As the rating difference between the player playing white and black increases by 1, the odd of draw will change by 0.997 fold, meaning that a draw is slightly less likely to happen.

Therefore, a larger rating difference between the player playing whire and black will be more likely to result in a win for white. In the case where a player is more disadvantaged in other aspects of the game, they may want to keep the game longer so that they may end up with a draw (and get some points for the game). Player playing black may want to keep the game shorter as a longer game will decrease their odd of winning.

Limitation

From the Data

The data is collected over time, so the independence condition is not met. If we can make sure that two players only meet once in our dataset, then the prediction based on our model would be different.
The result come from this model could only be generalized to rated chess games on Lichess. Other platforms or competitions may use a different rating system like Elo instead of Glicko-2 as used on Lichess. The difference in the environment that the players are in could also affect their thought process (playing though online platforms like Lichess may arguably be less stressful than playing in person).

From the Model

There could be more predictors in this model. I only picked out five variables that seem most rational to me to investigate in the first place.
Multinomial Logistic Regression model tend to overfit the data. Apply k-fold cross-validation could be a better approach than hold-out in resolving this overfitting problem.
Our model does not do well in predicting games with winner as draw. It may be because of the large overlapping area seen on the distributions given different predictor (Potential Predictor part) that games with winner as draw has with either games with winner as white, or games with winner as black. There may be other predictors that we could have included to help with this.
If there is not much differences coming from a predictor for different outcomes of the response variable, as we seen in the cases of opening_name, game_time, and victory_status, Multinomial Logistic Regression is not an effective model to use.

References

Chess Game Dataset. (2022, October 5). Kaggle. https://www.kaggle.com/datasets/faisaljanjua0555/chess-game-dataset

Frequently Asked Questions • lichess.org. (n.d.). https://lichess.org/faq

StatsTest.com. (2020, May 19). Multinomial Logistic Regression. https://www.statstest.com/multinomial-logistic-regression/

Final Project

Thanh-Thuy Dau

2022-12-13

Introduction

Data

Initial modification made to the original data set

Research Idea

Potential Predictors

New ChessData

Research Question

Multinomial Logistic Regression - Theoretical Model

Linear predictor:

First model: Between `winner` = `black` and `winner` = `white`:

Second model: Between `winner` = `draw` and `winner` = `white`:

Distribution:

Conditions:

Multinomial Logistic Regression Model for our ChessData

Split the Data into Training and Testing Sets

Train the Model

Link functions:

First model: Between `winner` = `black` and `winner` = `white`:

Second model: Between `winner` = `draw` and `winner` = `white`:

Predict

Predict on the Training Set

Predict on the Testing Set

Three-predictor Multinomial Logistic Model

Linear predictor:

First model: Between `winner` = `black` and `winner` = `white`:

Second model: Between `winner` = `draw` and `winner` = `white`:

Distribution:

Conditions:

Conclusion

Interpret Coefficients

`turns`:

`rating_dif`:

Limitation

From the Data

From the Model

References

Final Project

Thanh-Thuy Dau

2022-12-13

Introduction

Data

Initial modification made to the original data set

Research Idea

Potential Predictors

New ChessData

Research Question

Multinomial Logistic Regression - Theoretical Model

Linear predictor:

First model: Between winner = black and winner = white:

Second model: Between winner = draw and winner = white:

Distribution:

Conditions:

Multinomial Logistic Regression Model for our ChessData

Split the Data into Training and Testing Sets

Train the Model

Link functions:

First model: Between winner = black and winner = white:

Second model: Between winner = draw and winner = white:

Predict

Predict on the Training Set

Predict on the Testing Set

Three-predictor Multinomial Logistic Model

Linear predictor:

First model: Between winner = black and winner = white:

Second model: Between winner = draw and winner = white:

Distribution:

Conditions:

Conclusion

Interpret Coefficients

turns:

rating_dif:

Limitation

From the Data

From the Model

References

First model: Between `winner` = `black` and `winner` = `white`:

Second model: Between `winner` = `draw` and `winner` = `white`:

First model: Between `winner` = `black` and `winner` = `white`:

Second model: Between `winner` = `draw` and `winner` = `white`:

First model: Between `winner` = `black` and `winner` = `white`:

Second model: Between `winner` = `draw` and `winner` = `white`:

`turns`:

`rating_dif`: