In this article, I will attempt to train a model to predict the outcome of a chess game without understanding the logic of the game itself. To do so, I will need plenty of metadata about each game. Luckily, lichess.org, an online amateur chess platform, keeps thorough records of every game played and provides a dedicated API for accessing electronic game records.
To train this model, I’ll be using roughly 20,000 game records in CSV format that I found on Kaggle. The link to the original dataset can be found here:
Our model’s dependent variable is the outcome of the game. However, it is less obvious what the independent variables should be. In the following section, I’m going to conduct some exploratory data analysis on variables that I suspect could help explain game outcome.
The moves a player makes have the greatest effect on the trajectory of a game. However, our model can’t comprehend chess rules, and is thus unable to assign merit to openings through minimax search.
The next best thing for us is to look at the outcomes of games featuring specific openings, and to gauge an opening’s favorability for white or black based on historical data.
Unfortunately, there are a lot of chess openings, most of which are played very infrequently. In the mere ~20,000 games we have access to, a whopping 1477 distinct opening sequences were played. Since the bulk of these games open with common openings, we expect a majority of the 1477 recorded sequences to have been played only a handful of times. The low sample counts of these openings make them liable to sampling error. Luckily, there is an easy way for us to increase the number of samples per opening.
Most of lichess’s recorded openings look like this: Modern Defense: Lizard Defense | Mittenberger Gambit. These long-winded names contain information not only about the opening played, but also its variation and response. Thus, we can reduce specificity, and hopefully bolster the number of games per opening, simply by parsing out the variation and response. I’ve done this below in Python, getting us from 1477 distinct openings down to 156.
import pandas as pd
games = pd.read_csv("./games.csv")
# refining the data because there are 1477 total possible openings
games["opening_name"].value_counts()
## Van't Kruijs Opening 368
## Sicilian Defense 358
## Sicilian Defense: Bowdler Attack 296
## Scotch Game 271
## French Defense: Knight Variation 271
## ...
## Blackmar-Diemer Gambit Declined | Weinsbach Declination 1
## Hungarian Opening: Buecker Gambit 1
## English Opening: Symmetrical Variation | Hedgehog Defense 1
## Alekhine Defense: Balogh Variation 1
## King's Indian Defense: Orthodox Variation | Glek Defense 1
## Name: opening_name, Length: 1477, dtype: int64
# split on any delimiter that introduces a variation or response,
# and keep only the base opening name (column 0)
delimiters = [":", "#", r"\|", "Accepted", "Declined", "Refused"]
games["reduced_openings"] = games["opening_name"].str.split("|".join(delimiters), expand=True)[0]
games["reduced_openings"].value_counts()
## Sicilian Defense 2573
## French Defense 1306
## Queen's Pawn Game 1059
## Italian Game 981
## King's Pawn Game 917
## ...
## Australian Defense 1
## Global Opening 1
## Pterodactyl Defense 1
## Valencia Opening 1
## Doery Defense 1
## Name: reduced_openings, Length: 156, dtype: int64
Having engineered a reduced_openings feature, we are now ready to explore whether openings have a significant impact on a game’s outcome. Since we want to investigate the relationship between two categorical variables, I will use a stacked bar chart of win proportions alongside a bar chart of raw game counts.
library(tidyverse)  # readr, dplyr, ggplot2
library(patchwork)  # for composing plots with p1 / p2
library(ggridges)   # for geom_density_ridges, used later

games <- read_csv("./games_refined.csv")

# win/draw proportions per opening
grouped_games <- games %>%
  group_by(reduced_openings, winner) %>%
  summarise(count = n())

p1 <- ggplot(data = grouped_games, aes(x = reduced_openings, y = count, fill = winner)) +
  geom_bar(position = "fill", stat = "identity") +
  theme_minimal() +
  xlab("Opening") +
  ylab("Win Percentages") +
  theme(
    axis.text.x = element_blank(),
    panel.background = element_blank()
  ) +
  scale_fill_viridis_d(end = .75, option = "C") +
  scale_color_viridis_d(end = .75, option = "C")

# raw game counts per opening
p2 <- ggplot(data = games, aes(x = reduced_openings, fill = winner)) +
  geom_bar() +
  theme_minimal() +
  xlab("Opening") +
  ylab("Games") +
  theme(
    axis.text.x = element_blank(),
    panel.background = element_blank()
  ) +
  scale_fill_viridis_d(end = .75, option = "C") +
  scale_color_viridis_d(end = .75, option = "C")

p1 / p2
Our stacked bar chart tells us that while most openings are balanced, some openings heavily favor one side over the other. It would thus appear that the opening played is highly predictive of the outcome in a nontrivial number of games.
Our count chart tells a different story: the most frequently played openings tend to be balanced, with unbalanced openings making up only a small minority of games. This raises the possibility that, once again, sampling error rather than real strategic elements may be responsible for an opening’s apparent favorability toward one side or the other.
However, a counterargument can be made that these openings are infrequent precisely because they are unbalanced. Since an opening only arises when both players play into it, it is unlikely that both sides will pursue lines that are clearly unfavorable for one of them; hence, niche openings may yet have a strong effect on game outcomes.
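Before moving on, we can quantify how exposed each opening still is to sampling error. The sketch below counts games per reduced opening; the 30-game cutoff is an arbitrary threshold I’m choosing purely for illustration.
# count games per reduced opening and check how many fall below the cutoff
opening_counts <- games %>%
  count(reduced_openings, name = "games_played")
mean(opening_counts$games_played < 30)  # share of openings with thin samples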
We would expect the side with a rating advantage to also win more frequently. To investigate the relationship between a quantitative variable and a categorical variable, I will use a violin plot.
# rating_diff = white rating minus black rating
games$rating_diff <- games$white_rating - games$black_rating

# encode outcomes: 1 = white victory, 0.5 = draw, 0 = black victory
games$outcome <- games$winner %>%
  sapply(function(outcome) {
    if (outcome == "white") {
      return(1)
    } else if (outcome == "black") {
      return(0)
    } else {
      return(0.5)
    }
  })
ggplot(data = games, aes(x = rating_diff, y = outcome, color = winner, fill = winner)) +
  geom_jitter(width = 0, height = 0.09, alpha = 0.5) +
  geom_violin(alpha = 0.8, color = "white") +
  theme_minimal() +
  ylab("Outcome") +
  xlab("White Rating Minus Black Rating") +
  scale_fill_viridis_d(end = .75, option = "C") +
  scale_color_viridis_d(end = .75, option = "C")
Although there is substantial overlap in rating_diff across black victories, white victories, and draws, we are not surprised to find that, overall, white tends to win when outranking black, and vice versa. Draws are also most likely to occur when the players are similarly rated.
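As a rough numerical companion to the violin plot (a sketch, not part of the original analysis), we can check how often the higher-rated side actually wins; note that draws count against the higher-rated player here.
# fraction of games (excluding equal ratings) won by the higher-rated player
games %>%
  filter(rating_diff != 0) %>%
  summarise(higher_rated_win_rate = mean(
    (rating_diff > 0 & winner == "white") |
      (rating_diff < 0 & winner == "black")
  ))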
We’d also like to know whether the number of moves played in a game influences its outcome.
# distribution of total game length per outcome
p1 <- ggplot(games, aes(x = turns, y = winner, fill = winner)) +
  geom_density_ridges(alpha = .8) +
  scale_fill_viridis_d(end = .75, option = "C") +
  scale_color_viridis_d(end = .75, option = "C") +
  theme_minimal() +
  theme(
    legend.position = "none",
    axis.title.y = element_blank()
  ) +
  xlab("Number of Turns the Game Takes")

# distribution of opening length per outcome
p2 <- ggplot(games, aes(x = opening_ply, y = winner, fill = winner)) +
  geom_density_ridges(alpha = .8) +
  scale_fill_viridis_d(end = .75, option = "C") +
  scale_color_viridis_d(end = .75, option = "C") +
  theme_minimal() +
  theme(
    legend.position = "none",
    axis.title.y = element_blank()
  ) +
  xlab("Number of Turns the Opening Takes")

p1 / p2
It appears that while white and black victories take a similar number of turns, draws tend to take longer on average.
In addition, there is no discernible difference across outcomes in the distribution of the number of turns the opening takes.
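A quick grouped summary backs this reading up; this is a sketch computed from the same data rather than a figure from the original analysis.
# median game and opening length per outcome; draws should show the
# largest median number of turns
games %>%
  group_by(winner) %>%
  summarise(
    median_turns = median(turns),
    median_opening_ply = median(opening_ply)
  )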
Finally, we want to know whether or not the cause of victory/defeat is at all predictive of the winning/losing color.
# table(games$victory_status)
grouped_games <- games %>%
  filter(victory_status != "draw") %>%
  group_by(victory_status, winner) %>%
  summarise(count = n())

ggplot(data = grouped_games, aes(x = victory_status, y = count, fill = winner)) +
  geom_bar(position = "fill", stat = "identity") +
  theme_minimal() +
  xlab("Reason for Decision") +
  ylab("Win Percentages") +
  theme(
    panel.background = element_blank()
  ) +
  scale_fill_viridis_d(end = .75, option = "C") +
  scale_color_viridis_d(end = .75, option = "C")
Besides the surprising discovery that running out of time does not always result in a loss for one side, the reason for the decision seems to be a poor predictor of game outcome, since all three columns have nearly identical ratios.
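We can confirm the near-identical ratios numerically with a row-normalized proportion table (a quick sketch, this time keeping draws in the table):
# share of black wins, draws, and white wins within each decision type
round(prop.table(table(games$victory_status, games$winner), margin = 1), 3)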
Aided by EDA, our domain-specific knowledge suggests that the opening played, the rating difference between the players, and the number of turns the game lasts are all predictive elements in a chess game.
The number of turns the opening takes and the reason for the game’s decision, by contrast, appear not to be particularly predictive.
Thus, we will proceed to fit a proportional odds model with three explanatory variables: opening sequence, number of turns, and rating difference; the first is categorical and the latter two are quantitative.
library(caret)  # train / trainControl
library(MASS)   # polr, the underlying proportional odds model

# colnames(games)
games$winner <- factor(games$winner)
train_control <- trainControl(method = "cv", number = 10)
model <- train(winner ~ reduced_openings + turns + rating_diff,
  data = games,
  trControl = train_control,
  method = "polr"
)
model$results
Our final model is an ordered probit regression. Using trainControl along with the MASS package allows us to train proportional odds models with several candidate link functions using 10-fold cross-validation. Extracting finalModel from the resultant train object gives us the probit link as the most accurate model, with an overall accuracy of 0.6253263.
Although our model’s performance is not stellar by any means, an accuracy of ~0.63 is significantly better than a random guess (which, given three possible outcomes, is around 0.33). Our model also outperforms an educated guess, like betting on white every time, which would yield an accuracy slightly below 0.5. Furthermore, we somewhat avoid the pitfall of overfitting by conducting 10-fold cross-validation.
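For reference, here is a minimal sketch of both baselines, computed directly from the class frequencies:
# accuracy of always betting on white (the best single-class guess)
mean(games$winner == "white")
# accuracy of a uniform random guess over the three outcomes
1 / length(unique(games$winner))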
We can also check to see if our model is intuitive.
fit <- model$finalModel
print("Effect of each turn:")
## [1] "Effect of each turn:"
fit$coefficients["turns"]
## turns
## -0.001048769
print("Effect of each point difference in rating:")
## [1] "Effect of each point difference in rating:"
fit$coefficients["rating_diff"]
## rating_diff
## 0.002238192
Based on our coefficients, we glean that each additional turn nudges the prediction slightly away from a white win, while each point of rating advantage for white nudges it toward one. We can verify that negative coefficients indicate black favorability and positive coefficients indicate white favorability by checking the zeta of our model.
print(fit$zeta)
## black|draw draw|white
## 0.1087551 0.2596580
The zeta, or intercepts, of our model indicate that a latent score (linear predictor) below 0.109 points to a likely black win, while a score above 0.260 points to a likely white win. Games whose scores fall between the two intercepts are likely draws.
Knowing this, we can infer from rating_diff’s positive coefficient, 0.0022, that a white advantage in rating improves white’s chances of winning. Similarly, we can infer from the extremely small but negative coefficient on turns, -0.0010, that longer games tend away from white wins, consistent with our earlier observation that draws tend to run longer.
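To see these coefficients in action, we can score a hypothetical game through the trained caret model; the specific inputs below (a Sicilian Defense lasting 60 turns with a 100-point rating edge for white) are made up purely for illustration.
# predicted probabilities of black win / draw / white win for one made-up game
new_game <- data.frame(
  reduced_openings = "Sicilian Defense",
  turns = 60,
  rating_diff = 100
)
predict(model, newdata = new_game, type = "prob")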
As an amateur chess player, perhaps the most exciting part of this model is the set of coefficients associated with specific openings. Let’s examine some of these openings in detail:
print("Best moves for black:")
## [1] "Best moves for black:"
head(sort(fit$coefficients))
## `reduced_openingsCrab Opening` `reduced_openingsIrish Gambit`
## -6.564325 -5.701535
## `reduced_openingsDoery Defense` `reduced_openingsValencia Opening`
## -5.434282 -4.685730
## `reduced_openingsSlav Indian` `reduced_openingsSodium Attack`
## -1.778037 -1.336999
print("Best moves for white:")
## [1] "Best moves for white:"
head(sort(fit$coefficients, decreasing=TRUE))
## `reduced_openingsRubinstein Opening`
## 5.236087
## `reduced_openingsQueen's Indian Accelerated`
## 4.715786
## `reduced_openingsGlobal Opening`
## 4.644578
## `reduced_openingsPterodactyl Defense`
## 4.475058
## `reduced_openingsLemming Defense`
## 4.307709
## `reduced_openingsGuatemala Defense`
## 4.256541
Let’s examine the “Crab Opening”, supposedly one of the worst openings white can execute.
The Crab Opening sees white pushing up both edge pawns, going against the traditional early-game maxims of controlling the center and developing minor pieces. As it is such a strategically unconventional opening, chess.com has only two records of previous Crab Opening games, both of them black victories.
We can also examine white’s most favorable opening: the Rubinstein Opening.
The Rubinstein Opening appears much more orthodox than the Crab Opening, with both sides vying for control of the center and clearing paths for early development. As a novice, I would be hard pressed to explain why white outperforms black so significantly on both lichess and chess.com. However, as stated on chess.com’s opening wiki, the Rubinstein does apparently diminish black’s chances of victory.
Our model also has its pitfalls.
The Doery Defense, apparently black’s most favorable matchup, is neither intuitively nor statistically better for black, according to chess.com. Thus, there exists the possibility that our model’s estimate of the Doery Defense’s favorability for black is due to sampling error.
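A quick look at the sample sizes behind these extreme coefficients lends weight to the sampling-error suspicion; this is a sketch, and any opening this rare will produce an unstable estimate.
# how many games actually feature the openings with the largest coefficients?
games %>%
  filter(reduced_openings %in% c("Crab Opening", "Doery Defense", "Rubinstein Opening")) %>%
  count(reduced_openings, winner)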
Also notable is that the most common openings tend to be fairly balanced, according to our model. Two examples are the Sicilian Defense and the Ruy Lopez, both well-studied openings that seem to favor neither side very strongly. And just for fun, the Queen’s Gambit appears fairly balanced as well.
fit$coefficients["`reduced_openingsSicilian Defense`"]
## `reduced_openingsSicilian Defense`
## -0.1410222
fit$coefficients["`reduced_openingsRuy Lopez`"]
## `reduced_openingsRuy Lopez`
## -0.0364424
fit$coefficients["`reduced_openingsQueen's Gambit`"]
## `reduced_openingsQueen's Gambit`
## -0.02150615
We can attempt to visualize the results of our regression in 3 dimensions using plotly.
# column 3 of the fitted probabilities is P(winner == "white"),
# since the factor levels are ordered black, draw, white
games$model_score <- fit$fitted.values[, 3]
library(plotly)
plot_ly(
games,
x = ~ rating_diff,
y = ~ turns,
z = ~ model_score,
color = ~ winner,
colors = c("#0D0887", "#A92395", "#F89441"),
type = "scatter3d",
mode = "markers",
text = ~reduced_openings
) %>%
layout(title = "Projected Odds over Rating Diff & #Turns w/Ground Truth Annotations",
scene = list(
xaxis = list(title = "Rating Difference"),
yaxis = list(title = "# Turns"),
zaxis = list(title = "White's Projected Odds of Winning")
))
In this visualization, the number of turns and the rating difference (white minus black) are the y- and x-axes, respectively. The z-axis is white’s projected odds of winning according to our model.
Although I was unable to plot the regression surface itself, it is evident from the 3D scatter that it assumes a sigmoidal shape. We glean that white is both projected to win, and actually does win, more often when outranking black (and vice versa). Furthermore, draws appear more frequently the longer a game lasts. These results are roughly consistent with our hypotheses.