I chose to analyze the “Chess Game Dataset (Lichess)” from Tidy Tuesday for my data dives, as I have recently become very interested in chess. However, I am terrible at chess: I am, objectively, a below-average player. Thus, by studying the data set, which includes moves, wins, and openings, I hope to improve my chess playing.
To begin this data dive, we’ll load the dataset into this notebook.
#loading the data and needed libraries into the markdown file
library(readr)
chess_data <- read_csv("C:/Users/Lauren/Documents/chess.csv")
## Rows: 20058 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): game_id, victory_status, winner, time_increment, white_id, black_id...
## dbl (6): start_time, end_time, turns, white_rating, black_rating, opening_ply
## lgl (1): rated
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
chess_data
## # A tibble: 20,058 × 16
## game_id rated start_time end_time turns victory_status winner time_increment
## <chr> <lgl> <dbl> <dbl> <dbl> <chr> <chr> <chr>
## 1 TZJHLljE FALSE 1.50e12 1.50e12 13 outoftime white 15+2
## 2 l1NXvwaE TRUE 1.50e12 1.50e12 16 resign black 5+10
## 3 mIICvQHh TRUE 1.50e12 1.50e12 61 mate white 5+10
## 4 kWKvrqYL TRUE 1.50e12 1.50e12 61 mate white 20+0
## 5 9tXo1AUZ TRUE 1.50e12 1.50e12 95 mate white 30+3
## 6 MsoDV9wj FALSE 1.50e12 1.50e12 5 draw draw 10+0
## 7 qwU9rasv TRUE 1.50e12 1.50e12 33 resign white 10+0
## 8 RVN0N3VK FALSE 1.50e12 1.50e12 9 resign black 15+30
## 9 dwF3DJHO TRUE 1.50e12 1.50e12 66 resign black 15+0
## 10 afoMwnLg TRUE 1.50e12 1.50e12 119 mate white 10+0
## # ℹ 20,048 more rows
## # ℹ 8 more variables: white_id <chr>, white_rating <dbl>, black_id <chr>,
## # black_rating <dbl>, moves <chr>, opening_eco <chr>, opening_name <chr>,
## # opening_ply <dbl>
For the summaries, I’d like to take a look at the following columns: turns, white_rating, black_rating, and victory_status.
# a numeric summary of turns
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
chess_data |>
  summarize(median_turns = median(turns),
            mean_turns = mean(turns),
            min_turns = min(turns),
            max_turns = max(turns),
            quantile_25_turns = quantile(turns, probs = 0.25),
            quantile_50_turns = quantile(turns, probs = 0.5),
            quantile_75_turns = quantile(turns, probs = 0.75))
## # A tibble: 1 × 7
## median_turns mean_turns min_turns max_turns quantile_25_turns
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 55 60.5 1 349 37
## # ℹ 2 more variables: quantile_50_turns <dbl>, quantile_75_turns <dbl>
Here, we can see that the maximum number of turns is 349, while the minimum is just a single turn! We can also see that the mean is higher than the median (by about 10%). This suggests the distribution is right-skewed: we should expect to find a lot of high-turn-count games (or perhaps a few very-high-turn-count games) pulling the mean up. We can confirm this by charting the number of turns as a histogram.
#making a histogram of turns to see more information about turns as a whole.
library(ggplot2)
ggplot(chess_data, mapping = aes (x = turns)) +
geom_histogram(binwidth = 4)
Since we know the quartiles (the first is 37 and the third is 79), we can calculate the interquartile range (79 - 37 = 42) and the upper fence (79 + 1.5 × 42 = 142), so we can see how many of the games have turn counts noticeably higher than the bulk of the games. (While we could also look at games below the lower fence, it is negative (37 - 1.5 × 42 = -26), and there is no such thing as a negative number of turns in chess.)
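The fence arithmetic can be sketched in a few lines of base R. This is a minimal sketch, not part of the original notebook: the first quartile (37) comes from the summary output, while the third quartile (79) is an assumption inferred from the stated interquartile range of 42, since the printed tibble truncates that column.

```r
# Tukey's fences from the quartiles of `turns`.
# q1 is printed in the summary above; q3 is inferred from the stated IQR (37 + 42).
q1 <- 37
q3 <- 79

iqr <- q3 - q1                  # interquartile range
lower_fence <- q1 - 1.5 * iqr   # values below this are unusually low
upper_fence <- q3 + 1.5 * iqr   # values above this are unusually high

c(iqr = iqr, lower_fence = lower_fence, upper_fence = upper_fence)
```

With the full dataset loaded, the same quartiles could be read directly with `quantile(chess_data$turns, probs = c(0.25, 0.75))` rather than typed in by hand.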
Let’s re-plot the graph with the upper fence line on the graph.
ggplot(chess_data, mapping = aes (x = turns)) +
geom_histogram(binwidth = 4) +
geom_vline(xintercept = 142)
We can see that the histogram has both a tail of higher-turn games and a few very-high-turn games, confirming our suspicion.
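To put a number on how many games sit beyond the upper fence, a small helper like the one below could be used. It is a sketch only: `count_above_fence` is a hypothetical name, and the example vector is illustrative (the first ten values are the turn counts visible in the printed tibble, and the last two are made up to stand in for long games).

```r
# Hypothetical helper: how many values lie above a fence, and what share they are.
count_above_fence <- function(x, fence) {
  above <- x > fence
  c(n_above = sum(above), share_above = mean(above))
}

# Illustrative turn counts, NOT the full dataset.
example_turns <- c(13, 16, 61, 61, 95, 5, 33, 9, 66, 119, 150, 349)

result <- count_above_fence(example_turns, fence = 142)
result
```

On the real data, the call would be `count_above_fence(chess_data$turns, fence = 142)`.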
# a numeric summary of ratings for both White and Black
chess_data |>
  summarize(mean_WhiteRating = mean(white_rating),
            mean_BlackRating = mean(black_rating))
## # A tibble: 1 × 2
## mean_WhiteRating mean_BlackRating
## <dbl> <dbl>
## 1 1597. 1589.
Here we can see that the mean white rating is ever-so-slightly higher (0.49%) than the mean black rating. Lichess, the source of the data, randomizes most players’ colors, so a negligible difference between the mean ratings of the two colors is what we would expect.
Does this correlate to more wins? Let’s see:
ggplot(chess_data, mapping = aes(x = winner)) +
geom_bar()
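It does: white wins more often than black. How interesting! The rating gap is tiny, though, so the difference may owe more to white’s first-move advantage than to rating. It will be important to see how the ratings impact the win rate in a later notebook.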
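The bar chart can be backed by a numeric summary, in the same way `victory_status` is tabulated further on. The sketch below uses only the ten `winner` values visible in the printed tibble, so its counts are illustrative and do not reflect the full dataset.

```r
# The ten `winner` values shown in the tibble preview (illustrative sample only).
winner <- c("white", "black", "white", "white", "white",
            "draw", "white", "black", "black", "white")

win_counts <- table(winner)           # number of games per outcome
win_share <- prop.table(win_counts)   # the same counts as proportions

win_counts
win_share
```

On the real data, `table(chess_data$winner)` gives the counts behind the bar chart.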
Speaking of wins, let’s look at how games end in the victory_status column.
#victory_status numeric summary
table(chess_data$victory_status)
##
## draw mate outoftime resign
## 906 6325 1680 11147
prop.table(table(chess_data$victory_status))
##
## draw mate outoftime resign
## 0.04516901 0.31533553 0.08375710 0.55573836
Here we can see how the games end in Lichess. Over half result in a resignation! That’s impressive. Let’s take a look at these numbers in a bar chart.
#Victory status visual breakdown
library(ggplot2)
ggplot(chess_data, mapping = aes (x = victory_status)) +
geom_bar()
Do higher-rated players make for longer or shorter games (by turn count)?
Does white win more often in chess games given equivalent ratings?
Which openings produce longer games (more turns) and higher mate rates?
I would like to know if the higher-rated players have noticeably different turn counts for their games than the lower-rated players.
#First, since there are two measures for rating (one for black and one for white) we need to make a measurement of the rating of the players in general. We'll do that by taking the mean of the two ratings and assigning that to a new column, mean_game_rating
chess_data_mean <-
chess_data |>
mutate(mean_game_rating = (white_rating + black_rating)/2)
#Next, we'll plot it against the turns each game took. For fun, we'll color it using the winner column in case there's a relationship between the winner and the turns taken.
ggplot(chess_data_mean, mapping = aes(x=turns, y=mean_game_rating)) +
geom_point(mapping = aes(color=winner))
From this scatter plot, there is no obvious relationship between mean_game_rating and turns in this dataset. It does, however, appear that draws are somewhat more common among games with higher turn counts. How interesting!
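A visual impression like this can be checked numerically. The sketch below is one possible approach, not the notebook's own method: it uses a Spearman rank correlation for the rating-versus-turns question and a per-outcome median for the draw question. The data frame is a stand-in: the `turns` and `winner` values are the ten rows visible in the printed tibble, while the ratings are made up for illustration.

```r
# Stand-in data: turns and winner from the tibble preview, ratings MADE UP.
games <- data.frame(
  turns        = c(13, 16, 61, 61, 95, 5, 33, 9, 66, 119),
  winner       = c("white", "black", "white", "white", "white",
                   "draw", "white", "black", "black", "white"),
  white_rating = c(1500, 1300, 1450, 1600, 1550, 1200, 1700, 1400, 1650, 1350),
  black_rating = c(1400, 1350, 1500, 1580, 1450, 1250, 1600, 1800, 1700, 1300)
)

# Same definition as mean_game_rating in the notebook.
games$mean_game_rating <- (games$white_rating + games$black_rating) / 2

# Rank correlation between game rating and length; values near 0 suggest no monotonic relationship.
rating_turns_cor <- cor(games$mean_game_rating, games$turns, method = "spearman")

# Median turn count per outcome, to compare draws with decisive games.
median_turns_by_winner <- tapply(games$turns, games$winner, median)

rating_turns_cor
median_turns_by_winner
```

With only ten rows and invented ratings, the numbers here mean nothing on their own; the point is the shape of the check, which would be run on `chess_data_mean` instead.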