Data Overview

I chose to analyze the “Chess Game Dataset (Lichess)” from Tidy Tuesday for my data dives because I have recently become very interested in chess. However, I am terrible at chess: I am, objectively, a below-average player. By studying the dataset, which includes moves, wins, and openings, I hope to improve my chess playing.

To begin this data dive, we’ll load the dataset into this notebook.

# Load the needed libraries and read the data into the notebook
library(readr)
chess_data <- read_csv("C:/Users/Lauren/Documents/chess.csv")
## Rows: 20058 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): game_id, victory_status, winner, time_increment, white_id, black_id...
## dbl (6): start_time, end_time, turns, white_rating, black_rating, opening_ply
## lgl (1): rated
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
chess_data
## # A tibble: 20,058 × 16
##    game_id  rated start_time end_time turns victory_status winner time_increment
##    <chr>    <lgl>      <dbl>    <dbl> <dbl> <chr>          <chr>  <chr>         
##  1 TZJHLljE FALSE    1.50e12  1.50e12    13 outoftime      white  15+2          
##  2 l1NXvwaE TRUE     1.50e12  1.50e12    16 resign         black  5+10          
##  3 mIICvQHh TRUE     1.50e12  1.50e12    61 mate           white  5+10          
##  4 kWKvrqYL TRUE     1.50e12  1.50e12    61 mate           white  20+0          
##  5 9tXo1AUZ TRUE     1.50e12  1.50e12    95 mate           white  30+3          
##  6 MsoDV9wj FALSE    1.50e12  1.50e12     5 draw           draw   10+0          
##  7 qwU9rasv TRUE     1.50e12  1.50e12    33 resign         white  10+0          
##  8 RVN0N3VK FALSE    1.50e12  1.50e12     9 resign         black  15+30         
##  9 dwF3DJHO TRUE     1.50e12  1.50e12    66 resign         black  15+0          
## 10 afoMwnLg TRUE     1.50e12  1.50e12   119 mate           white  10+0          
## # ℹ 20,048 more rows
## # ℹ 8 more variables: white_id <chr>, white_rating <dbl>, black_id <chr>,
## #   black_rating <dbl>, moves <chr>, opening_eco <chr>, opening_name <chr>,
## #   opening_ply <dbl>

Numeric and Visual Summaries

For the summaries, I’d like to take a look at the following columns: turns, white_rating, black_rating, and victory_status.

# a numeric summary of turns
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
chess_data |>
  summarize(median_turns = median(turns),
            mean_turns = mean(turns),
            min_turns = min(turns),
            max_turns = max(turns),
            quantile_25_turns = quantile(turns, probs = c(.25)),
            quantile_50_turns = quantile(turns, probs = c(.5)),
            quantile_75_turns = quantile(turns, probs = c(.75)))
## # A tibble: 1 × 7
##   median_turns mean_turns min_turns max_turns quantile_25_turns
##          <dbl>      <dbl>     <dbl>     <dbl>             <dbl>
## 1           55       60.5         1       349                37
## # ℹ 2 more variables: quantile_50_turns <dbl>, quantile_75_turns <dbl>

Here, we can see that the longest game took 349 turns, while the shortest took just a single turn! We can also see that the mean (60.5) is higher than the median (55) by almost 10%. This suggests the distribution is right-skewed: either there are a lot of high-turn-count games, or a few very-high-turn-count games are pulling the mean up. We can check this by charting the number of turns as a histogram.

# Make a histogram of turns to see the distribution of turns as a whole
library(ggplot2)
ggplot(chess_data, mapping = aes(x = turns)) +
  geom_histogram(binwidth = 4)

Since we know what the quartiles are, we can calculate the interquartile range (42) and the upper fence (142, which is the third quartile plus 1.5 times the interquartile range), so we can see how many of the games have noticeably more turns than the bulk of the games. (While we could look at the games outside the lower fence, that fence is negative, and there is no such thing as negative turns in chess.)
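The fence arithmetic can be sketched in a few lines. This is only a minimal illustration: the first quartile (37) comes from the summary above, but the third quartile is hidden in the truncated output, so the value of 79 used here is an assumption inferred from the stated interquartile range of 42.

```r
# Tukey fences from the quartiles of turns.
# q1 is printed in the summary above; q3 = 79 is an ASSUMPTION, inferred
# from q1 = 37 and the stated interquartile range of 42.
q1 <- 37
q3 <- 79

iqr_turns   <- q3 - q1                # interquartile range
upper_fence <- q3 + 1.5 * iqr_turns   # games above this are unusually long
lower_fence <- q1 - 1.5 * iqr_turns   # negative, so no game can fall below it

c(iqr = iqr_turns, upper = upper_fence, lower = lower_fence)
```

On the full dataset, the quartiles could instead be taken from `quantile(chess_data$turns, probs = c(.25, .75))`, and `sum(chess_data$turns > upper_fence)` would count the games beyond the upper fence.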

Let’s re-plot the graph with the upper fence line on the graph.

ggplot(chess_data, mapping = aes(x = turns)) +
  geom_histogram(binwidth = 4) +
  geom_vline(xintercept = 142)

We can see that the histogram has both a long tail of higher-turn games and a few very-high-turn games beyond the fence, confirming our suspicion.

# a numeric summary of ratings for both White and Black
chess_data |>
  summarize(mean_WhiteRating = mean(white_rating),
            mean_BlackRating = mean(black_rating))
## # A tibble: 1 × 2
##   mean_WhiteRating mean_BlackRating
##              <dbl>            <dbl>
## 1            1597.            1589.

Here we can see that the mean White rating is ever-so-slightly higher (.49%) than the mean Black rating. Lichess, the source of the data, randomizes most players’ colors, so we would expect only a negligible difference between the mean ratings of the two colors.

Does White also win more often? Let’s see:

ggplot(chess_data, mapping = aes(x = winner)) +
  geom_bar()

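It does: White wins more games than Black in this dataset. How interesting. A bar chart of win counts cannot tell us whether the small rating gap is responsible, so it may be important to see how the ratings impact the win rate in a later notebook.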

Speaking of wins, let’s look at how games end in the victory_status column.

#victory_status numeric summary
table(chess_data$victory_status)
## 
##      draw      mate outoftime    resign 
##       906      6325      1680     11147
prop.table(table(chess_data$victory_status))
## 
##       draw       mate  outoftime     resign 
## 0.04516901 0.31533553 0.08375710 0.55573836

Here we can see how the games end on Lichess. Over half (about 56%) end in a resignation! That’s impressive. Let’s take a look at these numbers in a bar chart.

# Victory status visual breakdown
library(ggplot2)
ggplot(chess_data, mapping = aes(x = victory_status)) +
  geom_bar()

Three Novel Questions

  1. Do higher-rated players make for longer or shorter games (by turn count)?

  2. Does White win more often in chess games given equivalent ratings?

  3. Which openings produce longer games (more turns) and higher mate rates?

Addressing: Do higher-rated players make for longer or shorter games (by turn count)?

I would like to know if the higher-rated players have noticeably different turn counts for their games than the lower-rated players.

# First, since there are two rating measures (one for Black and one for White),
# we need a single measure of the players' rating for each game. We'll take the
# mean of the two ratings and assign it to a new column, mean_game_rating.
chess_data_mean <- chess_data |>
  mutate(mean_game_rating = (white_rating + black_rating) / 2)
# Next, we'll plot it against the turns each game took. For fun, we'll color the
# points by the winner column in case there's a relationship between the winner
# and the turns taken.
ggplot(chess_data_mean, mapping = aes(x = turns, y = mean_game_rating)) +
  geom_point(mapping = aes(color = winner))

From this scatter plot, there is no obvious relationship between mean_game_rating and turns in this dataset, although with roughly 20,000 overlapping points a weak trend would be hard to see. It does, however, appear that draws are more common among games with higher turn counts. How interesting!
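Because a scatter plot this dense can hide a weak trend, a correlation coefficient is one way to back up the visual read. The sketch below is not part of the original analysis: it uses a small toy data frame with hypothetical values (only the column names match chess_data_mean), and Spearman's rank correlation is my own choice because it is less sensitive to the few very long games.

```r
# Toy stand-in for chess_data_mean. The values are HYPOTHETICAL and are not
# taken from the dataset; only the column names match.
toy_games <- data.frame(
  turns            = c(13, 16, 61, 61, 95, 5, 33, 9, 66, 119),
  mean_game_rating = c(1350, 1290, 1500, 1440, 1490, 1250, 1520, 1410, 1430, 1380)
)

# Spearman's rank correlation between game length and mean rating.
# Values near 0 suggest little monotonic relationship; values near
# -1 or 1 suggest a strong one.
rho <- cor(toy_games$turns, toy_games$mean_game_rating, method = "spearman")
rho
```

On the real data, the same call would be `cor(chess_data_mean$turns, chess_data_mean$mean_game_rating, method = "spearman")`.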