Overview

In the game of chess, are there openings where white or black has a significant victory margin?

Introduction

The data used for this project come from lichess.org, an online chess server that also serves as an open-source database of over 5 billion chess games. This specific data set consists of 20,058 Lichess games collected in 2017 and compiled by a user on Kaggle (kaggle.com), an online data science platform. It was later featured on TidyTuesday, which is where I obtained it.

Exploring the Data

# Read the chess dataset from tidy tuesday and store into the environment as an object called chess
chess <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2024/2024-10-01/chess.csv')
## Rows: 20058 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): game_id, victory_status, winner, time_increment, white_id, black_id...
## dbl (6): start_time, end_time, turns, white_rating, black_rating, opening_ply
## lgl (1): rated
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Structure of the chess dataset
str(chess)
## spc_tbl_ [20,058 × 16] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ game_id       : chr [1:20058] "TZJHLljE" "l1NXvwaE" "mIICvQHh" "kWKvrqYL" ...
##  $ rated         : logi [1:20058] FALSE TRUE TRUE TRUE TRUE FALSE ...
##  $ start_time    : num [1:20058] 1.5e+12 1.5e+12 1.5e+12 1.5e+12 1.5e+12 ...
##  $ end_time      : num [1:20058] 1.5e+12 1.5e+12 1.5e+12 1.5e+12 1.5e+12 ...
##  $ turns         : num [1:20058] 13 16 61 61 95 5 33 9 66 119 ...
##  $ victory_status: chr [1:20058] "outoftime" "resign" "mate" "mate" ...
##  $ winner        : chr [1:20058] "white" "black" "white" "white" ...
##  $ time_increment: chr [1:20058] "15+2" "5+10" "5+10" "20+0" ...
##  $ white_id      : chr [1:20058] "bourgris" "a-00" "ischia" "daniamurashov" ...
##  $ white_rating  : num [1:20058] 1500 1322 1496 1439 1523 ...
##  $ black_id      : chr [1:20058] "a-00" "skinnerua" "a-00" "adivanov2009" ...
##  $ black_rating  : num [1:20058] 1191 1261 1500 1454 1469 ...
##  $ moves         : chr [1:20058] "d4 d5 c4 c6 cxd5 e6 dxe6 fxe6 Nf3 Bb4+ Nc3 Ba5 Bf4" "d4 Nc6 e4 e5 f4 f6 dxe5 fxe5 fxe5 Nxe5 Qd4 Nc6 Qe5+ Nxe5 c4 Bb4+" "e4 e5 d3 d6 Be3 c6 Be2 b5 Nd2 a5 a4 c5 axb5 Nc6 bxc6 Ra6 Nc4 a4 c3 a3 Nxa3 Rxa3 Rxa3 c4 dxc4 d5 cxd5 Qxd5 exd5 "| __truncated__ "d4 d5 Nf3 Bf5 Nc3 Nf6 Bf4 Ng4 e3 Nc6 Be2 Qd7 O-O O-O-O Nb5 Nb4 Rc1 Nxa2 Ra1 Nb4 Nxa7+ Kb8 Nb5 Bxc2 Bxc7+ Kc8 Qd"| __truncated__ ...
##  $ opening_eco   : chr [1:20058] "D10" "B00" "C20" "D02" ...
##  $ opening_name  : chr [1:20058] "Slav Defense: Exchange Variation" "Nimzowitsch Defense: Kennedy Variation" "King's Pawn Game: Leonardis Variation" "Queen's Pawn Game: Zukertort Variation" ...
##  $ opening_ply   : num [1:20058] 5 4 3 3 5 4 10 5 6 4 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   game_id = col_character(),
##   ..   rated = col_logical(),
##   ..   start_time = col_double(),
##   ..   end_time = col_double(),
##   ..   turns = col_double(),
##   ..   victory_status = col_character(),
##   ..   winner = col_character(),
##   ..   time_increment = col_character(),
##   ..   white_id = col_character(),
##   ..   white_rating = col_double(),
##   ..   black_id = col_character(),
##   ..   black_rating = col_double(),
##   ..   moves = col_character(),
##   ..   opening_eco = col_character(),
##   ..   opening_name = col_character(),
##   ..   opening_ply = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

The data set consists of 20,058 observations across 16 variables. Each observation represents one game, and each variable describes one aspect of that game: the game ID, whether or not the game was rated, the start and end times, the number of turns taken, how the game ended (mate, resignation, or running out of time), whether the winner was white, black, or neither (a draw), the time control (stored in time_increment, for example “15+2”), the players’ IDs and ratings, the moves played, the opening name and ECO code, and the number of plies (half-moves) in the opening phase.

Exploratory Graphs of the Raw Data

# Bar graph of opening frequency
barplot(table(chess$opening_name), cex.names = 0.2, las = 2, main = "Chess Opening Frequency")

Viewing a bar graph of the raw opening frequency data, we can plainly see that the data will need to be cleaned up to be useful for our purpose of finding out whether certain openings have a significant victory margin.

Cleaning and Transforming the Data

To clean the data, I first want to omit the columns that aren’t necessary for the analysis, keeping only the winner, the two player ratings, and the opening name. Since we are only looking at victories for white and black, I also want to eliminate any “draw” results from the data set.

# Omit the columns unnecessary for analysis
omitchess <- chess[-c(1:6, 8, 9, 11, 13, 14, 16)]

# Eliminate any observations where the winner was a draw
nodrawchess <- subset(omitchess, omitchess$winner != "draw")
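Dropping columns by position works, but it depends on the column order staying exactly as it is. As a hedged alternative, the same two steps can be written by column name. The sketch below runs on a tiny made-up data frame (`toy` is not the real data) so that it is self-contained; the names `toy`, `keep_cols`, and `toy_nodraw` are illustrative only.

```r
# Toy stand-in for the chess data; these rows are made up for illustration
toy <- data.frame(
  game_id      = c("a1", "b2", "c3"),
  winner       = c("white", "draw", "black"),
  white_rating = c(1500, 1400, 1600),
  black_rating = c(1450, 1420, 1580),
  opening_name = c("Slav Defense: Exchange Variation",
                   "Sicilian Defense",
                   "Indian Game"),
  stringsAsFactors = FALSE
)

# Keep columns by name instead of dropping them by position
keep_cols <- c("winner", "white_rating", "black_rating", "opening_name")
toy_small <- toy[, keep_cols]

# Remove drawn games
toy_nodraw <- toy_small[toy_small$winner != "draw", ]
print(toy_nodraw)
```

Selecting by name makes it obvious which four variables survive the cleaning step.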

Then, I want to clean up the opening names so they are condensed into their main lines instead of being split across many variations. I also noticed that some openings still appeared under more than one name, so I want to combine those.

# Load the dplyr, purrr, and stringr libraries
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.5.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(purrr)
## Warning: package 'purrr' was built under R version 4.5.3
library(stringr)
# Split the opening names by their main line and variations, extract the main line
mainlinechess <- nodrawchess %>%
  mutate(opening_name = strsplit(opening_name, "[:|#]") %>% 
           map_chr(1) %>%
           str_trim())

# Combine duplicate openings into one
mainlinechess$opening_name[mainlinechess$opening_name == "Queen's Gambit Refused"] <- "Queen's Gambit Declined"
mainlinechess$opening_name[mainlinechess$opening_name == "Queen's Pawn"] <- "Queen's Pawn Game"
mainlinechess$opening_name[mainlinechess$opening_name == "King's Pawn"] <- "King's Pawn Game"
mainlinechess$opening_name[mainlinechess$opening_name == "Petrov"] <- "Petrov's Defense"
mainlinechess$opening_name[mainlinechess$opening_name == "Russian Game"] <- "Petrov's Defense"
mainlinechess$opening_name[mainlinechess$opening_name == "King's Indian"] <- "King's Indian Defense"
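To make the name cleaning concrete, here is a minimal base R sketch of the same idea on a few example strings. The first string appears in the data set; the second and third are made-up examples of the other two separators the pattern handles. The lookup vector `aliases` is an assumed alternative to the one-line-per-name replacements above, not the method used in this report.

```r
# Example opening names (the "#" and "|" strings are illustrative only)
examples <- c("Slav Defense: Exchange Variation",
              "Queen's Pawn Game #2",
              "Italian Game | Two Knights",
              "Scotch Game")

# Split on ":", "|", or "#", keep the first piece, and trim whitespace
parts <- strsplit(examples, "[:|#]")
main_line <- trimws(vapply(parts, function(p) p[1], character(1)))
print(main_line)

# Recode duplicate names with a named lookup vector
aliases <- c("Petrov"        = "Petrov's Defense",
             "Russian Game"  = "Petrov's Defense",
             "King's Indian" = "King's Indian Defense")
names_in <- c("Petrov", "Russian Game", "King's Indian", "Scotch Game")
is_alias <- names_in %in% names(aliases)
names_out <- names_in
names_out[is_alias] <- unname(aliases[names_in[is_alias]])
print(names_out)
```

A lookup vector keeps all of the renaming rules in one place, which may be easier to extend if more duplicates turn up.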

Next, we can use the dplyr package to count the number of games played for each opening.

# Count the number of games for each chess opening in the dataset
chessopenings <- mainlinechess %>% 
  count(opening_name, sort = TRUE)

Knowing how many games were played for each opening is going to help pick out the top 25 openings used. The top 25 openings all have at least 185 games played.

Next, I am going to sort the openings by how often they were played, reorder the rows to match that ranking, and eliminate the rows whose openings are not among the 25 most used.

# Sort the openings by amount in decreasing order
counts <- sort(table(mainlinechess$opening_name), decreasing = TRUE)

# Reorder the rows so the openings run from most to least frequent
cleansortedopenings <- mainlinechess[order(match(mainlinechess$opening_name, names(counts))), ]

# Eliminate the rows with openings that are not within the top 25
top25chess <- cleansortedopenings[-(14968:19108), ]
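The row range above is specific to this exact data set, so it would need to be recomputed if the data changed. A hedged alternative is to keep rows by opening name, using the first names in the sorted table. The sketch below uses a made-up vector of openings and keeps the top 2 rather than the top 25; `openings`, `top_n`, and `top_names` are illustrative names.

```r
# Made-up opening labels: 5, 3, 2, and 1 games respectively
openings <- c(rep("Sicilian Defense", 5),
              rep("French Defense", 3),
              rep("Scotch Game", 2),
              "Slav Defense")

# Frequency table sorted from most to least played
counts <- sort(table(openings), decreasing = TRUE)

# Names of the most frequent openings (25 in the report, 2 in this toy case)
top_n <- 2
top_names <- names(counts)[seq_len(top_n)]

# Keep only the games whose opening is in that set
kept <- openings[openings %in% top_names]
print(table(kept))
```

If two openings were tied at the cutoff, this approach would keep whichever one the sort happens to place first, so ties are worth checking by hand.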

With the raw data cleaned and sorted into only the top 25 openings, the last task is to calculate how many times white or black won in a certain opening, compare that to the total number of games in that opening, and compute a win rate for white and black in that opening. Then, we can put this data into a new data frame to use for analysis.

# Table of the opening names and how many times white or black won as a matrix
df1 <- as.data.frame.matrix(table(top25chess$opening_name, top25chess$winner))

# Transforming the opening names from a row label into a variable
df2 <- cbind(opening_name = rownames(df1), data.frame(df1, row.names = NULL))

# Inserting data values into a new dataframe
openingrates <- data.frame(
  opening_name = c(df2$opening_name),
  total_games = c(df2$white + df2$black),
  white_wins = c(df2$white),
  black_wins = c(df2$black),
  white_win_rate = c(df2$white / (df2$black + df2$white)),
  black_win_rate = c(df2$black / (df2$black + df2$white))
)
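As a quick check of the arithmetic, the win rates can be recomputed by hand for two openings whose counts appear later in this report (Nimzowitsch Defense: 147 white wins and 69 black wins; Slav Defense: 111 each). The data frame `wins` below is a small illustrative stand-in, not the full table.

```r
# Win counts for two openings, taken from the results in this report
wins <- data.frame(
  opening_name = c("Nimzowitsch Defense", "Slav Defense"),
  white = c(147, 111),
  black = c(69, 111),
  stringsAsFactors = FALSE
)

# Total decisive games, then each side's share of them
wins$total_games    <- wins$white + wins$black
wins$white_win_rate <- wins$white / wins$total_games
wins$black_win_rate <- wins$black / wins$total_games
print(wins)
```

Because draws were removed, the two rates in each row always sum to 1.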

Analysis

Using the newly made “openingrates” data set, we can take a look at the opening win rates for white and black using bar graphs. Let’s start with the white win rates.

White Win Rates

# Load the ggplot2 library
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.5.3
# Barplot of white win rates using the ggplot function
ggplot(openingrates, aes(reorder(x = opening_name, white_win_rate), y = white_win_rate)) +
  geom_bar(stat = "identity", fill = "#69923e") +
  coord_flip() +                
  labs(title = "Chess Openings by White Win Rate",
       x = "Opening Name",
       y = "Win Rate (Proportion)",
       subtitle = "Out of 14967 Observations") +                
  theme_minimal(paper = "#4b4847", ink = "#EEEED2")

This graph ranks the 25 openings by white’s win rate, with the highest win rates at the top.

On a surface level, winning over 50% of the games played in one type of opening seems significant. The question that remains is whether these win rates are actually statistically significant. For example, can this sample of chess games provide evidence that, in the general population of games, white is actually more likely to win when the game begins with the Nimzowitsch Defense?

Because white moves first and is generally considered to have a slight advantage over black, quite a few openings have a white win rate over 50%. To narrow it down, I am only going to test the 10 openings with the highest white win rates in the graph for statistical significance.

Using a hypothesis test, we can test to see if the true population proportion is greater than 0.50, using a significance level of 0.05.

\(H_0: p = 0.50\)
\(H_1: p > 0.50\)
\(\alpha = 0.05\)
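To show what this test computes, here is a minimal sketch that reproduces a one-sided `prop.test` p-value by hand, using the Nimzowitsch Defense counts reported in the output that follows (147 white wins in 216 games). It assumes the default continuity correction of 0.5 and is only valid as written when the observed count is above the expected count; `p_manual` and `p_builtin` are illustrative names.

```r
# Observed white wins, total games, and the null proportion
x  <- 147
n  <- 216
p0 <- 0.50

# Expected number of white wins if the null hypothesis were true
expected <- n * p0

# z statistic with the continuity correction that prop.test applies by default
z <- (abs(x - expected) - 0.5) / sqrt(n * p0 * (1 - p0))

# One-sided p-value: probability of a z at least this large under the null
p_manual <- pnorm(z, lower.tail = FALSE)

# The same quantity from the built-in test
p_builtin <- prop.test(x = x, n = n, p = p0, alternative = "greater")$p.value

print(c(manual = p_manual, builtin = p_builtin))
```

The two numbers should agree, which shows that the test is a normal approximation to how surprising 147 wins would be if white truly won half the time.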

# Row values of Nimzowitsch Defense
openingrates[openingrates$opening_name == "Nimzowitsch Defense", ]
##           opening_name total_games white_wins black_wins white_win_rate
## 12 Nimzowitsch Defense         216        147         69      0.6805556
##    black_win_rate
## 12      0.3194444
# P-value
prop.test(x = 147, n = 216, p = 0.50, alternative = "greater")$p.value
## [1] 8.064304e-08
# Row values of Bishop's Opening
openingrates[openingrates$opening_name == "Bishop's Opening", ]
##       opening_name total_games white_wins black_wins white_win_rate
## 1 Bishop's Opening         306        186        120      0.6078431
##   black_win_rate
## 1      0.3921569
# P-value
prop.test(x = 186, n = 306, p = 0.50, alternative = "greater")$p.value
## [1] 0.0001012798
# Row values of Philidor Defense
openingrates[openingrates$opening_name == "Philidor Defense", ]
##        opening_name total_games white_wins black_wins white_win_rate
## 14 Philidor Defense         663        396        267      0.5972851
##    black_win_rate
## 14      0.4027149
# P-value
prop.test(x = 396, n = 663, p = 0.50, alternative = "greater")$p.value
## [1] 3.328566e-07
# Row values of Queen's Gambit Declined
openingrates[openingrates$opening_name == "Queen's Gambit Declined", ]
##               opening_name total_games white_wins black_wins white_win_rate
## 17 Queen's Gambit Declined         619        363        256      0.5864297
##    black_win_rate
## 17      0.4135703
# P-value
prop.test(x = 363, n = 619, p = 0.50, alternative = "greater")$p.value
## [1] 1.019852e-05
# Row values of Zukertort Opening
openingrates[openingrates$opening_name == "Zukertort Opening", ]
##         opening_name total_games white_wins black_wins white_win_rate
## 25 Zukertort Opening         308        179        129      0.5811688
##    black_win_rate
## 25      0.4188312
# P-value
prop.test(x = 179, n = 308, p = 0.50, alternative = "greater")$p.value
## [1] 0.002618892
# Row values of Pirc Defense
openingrates[openingrates$opening_name == "Pirc Defense", ]
##    opening_name total_games white_wins black_wins white_win_rate black_win_rate
## 15 Pirc Defense         270        156        114      0.5777778      0.4222222
# P-value
prop.test(x = 156, n = 270, p = 0.50, alternative = "greater")$p.value
## [1] 0.006294653
# Row values of English Opening
openingrates[openingrates$opening_name == "English Opening", ]
##      opening_name total_games white_wins black_wins white_win_rate
## 3 English Opening         691        395        296      0.5716353
##   black_win_rate
## 3      0.4283647
# P-value
prop.test(x = 395, n = 691, p = 0.50, alternative = "greater")$p.value
## [1] 9.646606e-05
# Row values of Queen's Gambit Accepted
openingrates[openingrates$opening_name == "Queen's Gambit Accepted", ]
##               opening_name total_games white_wins black_wins white_win_rate
## 16 Queen's Gambit Accepted         244        139        105      0.5696721
##    black_win_rate
## 16      0.4303279
# P-value
prop.test(x = 139, n = 244, p = 0.50, alternative = "greater")$p.value
## [1] 0.01731714
# Row values of Petrov's Defense
openingrates[openingrates$opening_name == "Petrov's Defense", ]
##        opening_name total_games white_wins black_wins white_win_rate
## 13 Petrov's Defense         329        184        145      0.5592705
##    black_win_rate
## 13      0.4407295
# P-value
prop.test(x = 184, n = 329, p = 0.50, alternative = "greater")$p.value
## [1] 0.01808515
# Row values of Scotch Game
openingrates[openingrates$opening_name == "Scotch Game", ]
##    opening_name total_games white_wins black_wins white_win_rate black_win_rate
## 21  Scotch Game         450        251        199      0.5577778      0.4422222
# P-value
prop.test(x = 251, n = 450, p = 0.50, alternative = "greater")$p.value
## [1] 0.008104771
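The ten tests above can also be run in one pass and ranked by p-value. The sketch below is a hedged alternative to calling `prop.test` once per opening; it retypes the win counts shown in the output above into an illustrative data frame named `white`.

```r
# White wins and total games for the ten openings tested above
white <- data.frame(
  opening_name = c("Nimzowitsch Defense", "Bishop's Opening", "Philidor Defense",
                   "Queen's Gambit Declined", "Zukertort Opening", "Pirc Defense",
                   "English Opening", "Queen's Gambit Accepted",
                   "Petrov's Defense", "Scotch Game"),
  white_wins  = c(147, 186, 396, 363, 179, 156, 395, 139, 184, 251),
  total_games = c(216, 306, 663, 619, 308, 270, 691, 244, 329, 450),
  stringsAsFactors = FALSE
)

# One-sided test of p > 0.50 for each row
white$p_value <- mapply(
  function(x, n) prop.test(x = x, n = n, p = 0.50, alternative = "greater")$p.value,
  white$white_wins, white$total_games
)

# Rank the openings from smallest to largest p-value
white_ranked <- white[order(white$p_value), ]
print(white_ranked, row.names = FALSE)
```

Keeping the counts and p-values in one table makes it easier to compare openings side by side.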

Since \(p < \alpha\) for every opening tested, we reject the null hypothesis in favor of the alternative in each case.

For all 10 of the openings with the highest white win rates, there is enough evidence to suggest that the true win rate of the population is greater than 50%.

Since we are looking for the strongest evidence of a white advantage, we can narrow down these results further by ordering them from the lowest to the highest p-value. The three openings with the lowest p-values are the Nimzowitsch Defense, the Philidor Defense, and the Queen’s Gambit Declined.

Black Win Rates

# Barplot of black win rates using the ggplot function
ggplot(openingrates, aes(reorder(x = opening_name, black_win_rate), y = black_win_rate)) +
  geom_bar(stat = "identity", fill = "#69923e") +
  coord_flip() +                
  labs(title = "Chess Openings by Black Win Rate",
       x = "Opening Name",
       y = "Win Rate (Proportion)",
       subtitle = "Out of 14967 Observations") +                
  theme_minimal(paper = "#4b4847", ink = "#EEEED2")

Since every game in this data set was won by either white or black, the black win rate is simply one minus the white win rate, so the bar plot of the black win rates is essentially the inversion of the white win rates bar plot. As stated before, white is generally considered to have a slight advantage over black, so there is a much smaller number of openings where black won at least 50% of the games.

Black won the most using the Van’t Kruijs Opening, the Indian Game, the Modern Defense, the Sicilian Defense, and the Slav Defense.

The hypothesis test here will be the same as white, using a significance level of 0.05.

\(H_0: p = 0.50\)
\(H_1: p > 0.50\)
\(\alpha = 0.05\)

# Row values of Van't Kruijs Opening
openingrates[openingrates$opening_name == "Van't Kruijs Opening", ]
##            opening_name total_games white_wins black_wins white_win_rate
## 24 Van't Kruijs Opening         352        126        226      0.3579545
##    black_win_rate
## 24      0.6420455
# P-value
prop.test(x = 226, n = 352, p = 0.50, alternative = "greater")$p.value
## [1] 6.575911e-08
# Row values of Indian Game
openingrates[openingrates$opening_name == "Indian Game", ]
##   opening_name total_games white_wins black_wins white_win_rate black_win_rate
## 7  Indian Game         299        123        176      0.4113712      0.5886288
# P-value
prop.test(x = 176, n = 299, p = 0.50, alternative = "greater")$p.value
## [1] 0.001318168
# Row values of Modern Defense
openingrates[openingrates$opening_name == "Modern Defense", ]
##      opening_name total_games white_wins black_wins white_win_rate
## 11 Modern Defense         216        101        115      0.4675926
##    black_win_rate
## 11      0.5324074
# P-value
prop.test(x = 115, n = 216, p = 0.50, alternative = "greater")$p.value
## [1] 0.1882029
# Row values of Sicilian Defense
openingrates[openingrates$opening_name == "Sicilian Defense", ]
##        opening_name total_games white_wins black_wins white_win_rate
## 22 Sicilian Defense        2502       1203       1299      0.4808153
##    black_win_rate
## 22      0.5191847
# P-value
prop.test(x = 1299, n = 2502, p = 0.50, alternative = "greater")$p.value
## [1] 0.02876643
# Row values of Slav Defense
openingrates[openingrates$opening_name == "Slav Defense", ]
##    opening_name total_games white_wins black_wins white_win_rate black_win_rate
## 23 Slav Defense         222        111        111            0.5            0.5
# P-value
prop.test(x = 111, n = 222, p = 0.50, alternative = "greater")$p.value
## [1] 0.5

For each opening where \(p < \alpha\), we reject the null hypothesis in favor of the alternative; where \(p \geq \alpha\), we fail to reject it.

There is enough evidence to suggest that the true population win rate for black is greater than 50% when using the Van’t Kruijs Opening, the Indian Game, and the Sicilian Defense. For the Modern Defense and the Slav Defense, however, there is not enough evidence to suggest that these openings have a win rate greater than 50%.
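The decision rule for the five black openings can be written out directly as a comparison against \(\alpha\). This is a small sketch using the black win counts from the output above; `black`, `alpha`, and `reject_null` are illustrative names.

```r
# Black wins and total games for the five openings tested above
black <- data.frame(
  opening_name = c("Van't Kruijs Opening", "Indian Game", "Modern Defense",
                   "Sicilian Defense", "Slav Defense"),
  black_wins  = c(226, 176, 115, 1299, 111),
  total_games = c(352, 299, 216, 2502, 222),
  stringsAsFactors = FALSE
)

alpha <- 0.05

# One-sided p-value for each opening
black$p_value <- mapply(
  function(x, n) prop.test(x = x, n = n, p = 0.50, alternative = "greater")$p.value,
  black$black_wins, black$total_games
)

# Reject the null hypothesis only where the p-value is below alpha
black$reject_null <- black$p_value < alpha
print(black, row.names = FALSE)
```

The Sicilian Defense is a useful contrast here: its win rate is only about 52%, but with 2,502 games that small margin is still enough to fall below \(\alpha\).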

Conclusions

Based on this analysis, the Nimzowitsch Defense gives white the most significant victory margin, and the Van’t Kruijs Opening gives black the most significant victory margin. However, this is not the full story. Why aren’t the Nimzowitsch Defense and the Van’t Kruijs Opening the most popular openings in the world? It is because both results rely on the other player making a serious mistake at the very beginning of the game. The Nimzowitsch Defense is described as “dubious”: white’s high win rate relies on the black player choosing it and disregarding the baseline strategy of chess, which is to control the center. The Van’t Kruijs Opening is the same way. Black’s success relies on the white player making an unfortunate move on their first play of the game, leading to black immediately having the upper hand.

The openings with the second-lowest p-values for white and black are the Philidor Defense and the Indian Game, respectively. These openings are much more solid and playable, which makes the results more applicable to real games, since they do not rely on your opponent making questionable moves in the beginning.

Limitations

The biggest limitation of this analysis is that it does not account for player rating, as a player with a higher rating would be expected to win more games regardless of the opening used. Additionally, this data comes from games that were played online, and many consider cheating to be widespread in online chess communities, where players use computer engines to find the best move for them. This could definitely impact the data. Also, having to condense the opening variations into their main lines could have impacted the analysis more than we realize. Because opening variations can be incredibly different from one another, the games grouped under one main line could have been played very differently.
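The rating columns were kept during cleaning but never used, so one way to start exploring the first limitation is to check how often the higher-rated player won. The sketch below is only an illustration on made-up rows (`toy` is not the real data), and `rating_diff`, `favorite`, and `favorite_won` are hypothetical names.

```r
# Made-up games for illustration only
toy <- data.frame(
  winner       = c("white", "black", "white", "black", "white", "black"),
  white_rating = c(1600, 1400, 1500, 1550, 1450, 1500),
  black_rating = c(1500, 1550, 1500, 1450, 1400, 1600),
  stringsAsFactors = FALSE
)

# Positive difference means white was the higher-rated player
toy$rating_diff <- toy$white_rating - toy$black_rating
toy$favorite <- ifelse(toy$rating_diff > 0, "white",
                       ifelse(toy$rating_diff < 0, "black", "even"))

# Share of games won by the higher-rated player, ignoring equal ratings
rated_apart <- toy$favorite != "even"
favorite_won <- mean(toy$winner[rated_apart] == toy$favorite[rated_apart])
print(favorite_won)
```

If that share were well above one half in the real data, it would suggest that rating differences explain part of the opening win rates.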


This document was produced as a final project for MAT 143H - Introduction to Statistics (Honors) at North Shore Community College.
The course was led by Professor Billy Jackson.
Student Name: Spencer Anderson
Semester: Spring 2026