Chess Tournament Database Pipeline

Author

Kevin Havis

Setup

library(tidyverse)
library(purrr)
library(DBI)
library(duckdb)
library(tools)
library(knitr)

Introduction

We’ll be parsing a text file containing the results of recent chess tournament in this article. By leveraging dplyr and tidyverse libraries, we can convert this text file into a proper structured data set. Once we have the data formatted, we can review the results of the tournament.

Loading & cleaning the data

This data is easily human-readable, however it has a couple things we need to address to get it into proper shape. Let’s start with cleaning up the ASCII delimiters.

-----------------------------------------------------------------------------
 Pair | Player Name                     |Total|Round|Round|Round|Round|Round|
 Num  | USCF ID / Rtg (Pre->Post)       | Pts |  1  |  2  |  3  |  4  |  5  |
-----------------------------------------------------------------------------
    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|
   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |
-----------------------------------------------------------------------------
    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|
   MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |
-----------------------------------------------------------------------------
    3 | ADITYA BAJAJ                    |6.0  |L   8|W  61|W  25|W  21|W  11|
   MI | 14959604 / R: 1384   ->1640     |N:2  |W    |B    |W    |B    |W    |

ASCII clean up

The headers, rows, and columns are delimited by ASCII characters, which we will need to strip out. We’ll do this by reading each line in as a row and delimiting the columns.

# Pre-define our initial column names
init_cols = c(
  'pairNum_state',
  'player_info',
  'total',
  'round_1',
  'round_2',
  'round_3',
  'round_4',
  'round_5',
  'round_6',
  'round_7'
)

# Create a tibble by splitting row-wise by line breaks
# and column-wise by pipe
df <- tibble(txt = read_lines('tournament_info.txt')) |> 
  separate_rows(sep = '\\n') |> 
  separate(col = txt, into = init_cols, sep = '\\|')

# Drop --- row separators
df <- na.omit(df)

# Drop secondary header
df <- df |> 
  slice(-(1:2))
pairNum_state player_info total round_1 round_2 round_3 round_4 round_5 round_6 round_7
1 GARY HUA 6.0 W 39 W 21 W 18 W 14 W 7 D 12 D 4
ON 15445895 / R: 1794 ->1817 N:2 W B W B W B W
2 DAKSHESH DARURI 6.0 W 63 W 58 L 4 W 17 W 16 W 20 W 7
MI 14598900 / R: 1553 ->1663 N:2 B W B W B W B
3 ADITYA BAJAJ 6.0 L 8 W 61 W 25 W 21 W 11 W 13 W 12
MI 14959604 / R: 1384 ->1640 N:2 W B W B W B W

Ungrouping the rows

With the extra ASCII characters cleaned up, we’re in much better shape. We can now turn our attention to the stacked rows for each player. To do this, we’ll group the rows and then pivot them wider into their own columns.

# We have two rows per player so we group them and create new columns
df <- df |> 
  group_by(grp = ceiling(row_number()/2)) |> # create groups by rounding the mod of each row number
  mutate(row = row_number()) |> 
  pivot_wider(names_from = row, values_from = -c(row, grp)) |> # break out row pairs into columns
  ungroup() |> 
  select(-grp) |> 
  mutate(across(everything(), ~ str_trim(.x)))
# Update column names
col_names = c(
  "player",
  "state",
  "name",
  "rating_info",
  "total_pts",
  "n",
  "results_rd_1",
  "color_rd_1",
  "results_rd_2",
  "color_rd_2",
  "results_rd_3",
  "color_rd_3",
  "results_rd_4",
  "color_rd_4",
  "results_rd_5",
  "color_rd_5",
  "results_rd_6",
  "color_rd_6",
  "results_rd_7",
  "color_rd_7"
)

df <- df |> set_names(col_names)
player state name rating_info total_pts n results_rd_1 color_rd_1 results_rd_2 color_rd_2 results_rd_3 color_rd_3 results_rd_4 color_rd_4 results_rd_5 color_rd_5 results_rd_6 color_rd_6 results_rd_7 color_rd_7
1 ON GARY HUA 15445895 / R: 1794 ->1817 6.0 N:2 W 39 W W 21 B W 18 W W 14 B W 7 W D 12 B D 4 W
2 MI DAKSHESH DARURI 14598900 / R: 1553 ->1663 6.0 N:2 W 63 B W 58 W L 4 B W 17 W W 16 B W 20 W W 7 B
3 MI ADITYA BAJAJ 14959604 / R: 1384 ->1640 6.0 N:2 L 8 W W 61 B W 25 W W 21 B W 11 W W 13 B W 12 W
4 MI PATRICK H SCHILLING 12616049 / R: 1716 ->1744 5.5 N:2 W 23 W D 28 B W 2 W W 26 B D 5 W W 19 B D 1 B
5 MI HANSHI ZUO 14601533 / R: 1655 ->1690 5.5 N:2 W 45 B W 37 W D 12 B D 13 W D 4 B W 14 W W 17 B
6 OH HANSEN SONG 15055204 / R: 1686 ->1687 5.0 N:3 W 34 W D 29 B L 11 W W 35 B D 10 B W 27 W W 21 B

Concatenated values

Next, we’ll need to address the concatenated values, including the rating_info which contains the player’s USCF ID, rating before the tournament, and rating after the tournament.

We’ll also split the “results” columns for each round into the result (W / L) and the opponent’s ID number using regex.

# Split results columns into single columns
df <- df |> 
 separate_wider_regex(
   starts_with("results"),
   patterns = c(
     result = "[WLDBH]",
     opponent = "\\s+\\d+"),
   names_sep = "_",
   too_few = "align_start") |> 
  mutate(across(starts_with("results"), ~ str_trim(.x)))
# Create interim tibble
rating_df <- df |> 
  filter(str_detect(rating_info, "[0-9]{8}")) |> 
  select(rating_info)

# Retrieve USCF ID
uscf_id <- rating_df |> 
  mutate(uscf_id = str_extract(rating_info, "([0-9]{8})")) |> 
  select(uscf_id)

# Retrieve pre_rating
pre_rating <- rating_df |> 
  mutate(pre_rating = str_extract(rating_info, "(?<=R:\\s{1,5})\\d+")) |> 
  select(pre_rating)

# Retrieve post_rating
post_rating <- rating_df |> 
  mutate(post_rating = str_extract(rating_info, "(?<=>\\s{0,5})\\d{3,4}")) |> 
  select(post_rating)

# Add new columns to main tibble
df <- df |> 
  cbind(uscf_id) |>
  cbind(pre_rating) |>
  cbind(post_rating) |>
  mutate(pre_rating = as.numeric(pre_rating)) |> 
  mutate(post_rating = as.numeric(post_rating)) |> 
  select(everything(), -rating_info)

We now have a well organized and “tidy” data set!

player state name total_pts n results_rd_1_result results_rd_1_opponent color_rd_1 results_rd_2_result results_rd_2_opponent color_rd_2 results_rd_3_result results_rd_3_opponent color_rd_3 results_rd_4_result results_rd_4_opponent color_rd_4 results_rd_5_result results_rd_5_opponent color_rd_5 results_rd_6_result results_rd_6_opponent color_rd_6 results_rd_7_result results_rd_7_opponent color_rd_7 uscf_id pre_rating post_rating
1 ON GARY HUA 6.0 N:2 W 39 W W 21 B W 18 W W 14 B W 7 W D 12 B D 4 W 15445895 1794 1817
2 MI DAKSHESH DARURI 6.0 N:2 W 63 B W 58 W L 4 B W 17 W W 16 B W 20 W W 7 B 14598900 1553 1663
3 MI ADITYA BAJAJ 6.0 N:2 L 8 W W 61 B W 25 W W 21 B W 11 W W 13 B W 12 W 14959604 1384 1640
4 MI PATRICK H SCHILLING 5.5 N:2 W 23 W D 28 B W 2 W W 26 B D 5 W W 19 B D 1 B 12616049 1716 1744
5 MI HANSHI ZUO 5.5 N:2 W 45 B W 37 W D 12 B D 13 W D 4 B W 14 W W 17 B 14601533 1655 1690
6 OH HANSEN SONG 5.0 N:3 W 34 W D 29 B L 11 W W 35 B D 10 B W 27 W W 21 B 15055204 1686 1687

Save into database

While this data is useful and we could certainly perform any further analysis we’d like with it, we should store it properly in a database. This will allow us manage the data in a scalable way to accommodate new players and future tournaments.

Prepare our tables

We’ll first split the data into logical tables, players and rounds. The players data is already well organized, but we’ll add some finishing touches.

# Separate data into player table

players <- df |> 
  select(player, state, name, pre_rating, post_rating, total_pts) |> 
  mutate(rating_change = post_rating - pre_rating) |> 
  mutate(name = str_to_title(name))

The rounds table will contain information specific to this tournament. To make this more database-friendly, we’ll convert the data into a long format.

# Convert round columns into long format
rounds <- df |>
  pivot_longer(
    cols = starts_with("results_rd_"),
    names_to = c("round", "type"),
    names_pattern = "results_rd_(\\d+)_(result|opponent)",
    values_to = "value"
  ) |>
  pivot_longer(
    cols = starts_with("color"),
    names_to = "color_col",
    values_to = "color"
  ) |> 
  pivot_wider(
    names_from = c("type"),
    values_from = c("value")
  ) |> 
  select(player, color, round, result, opponent)

Write to the database

Now we can actually add our new tables to our database.

# Give rounds a unique ID as a key
rounds <- rounds |> 
  mutate(id = row_number())
  
conn <- DBI::dbConnect(duckdb::duckdb(), dbdir = "tournament.db")

# Write tables if they don't already exist
if (!dbExistsTable(conn, "players")) {
dbWriteTable(conn, "players", players)
} else {
  
}


if (!dbExistsTable(conn, "rounds")) {
dbWriteTable(conn, "rounds", rounds)
} else {

}


dbDisconnect(conn)

Tournament results

Now, finally, we can calculate the final results of the tournament.

We’ll calculate the average rating of each player’s opponents.

# Calculate average opponent's rating

average_ratings <- left_join(rounds, players, by = c("opponent" = "player")) |> 
  group_by(player) |> 
  summarize(average_opponent_rating = mean(pre_rating, na.rm = TRUE))

players <- left_join(players, average_ratings, by = "player")

Build our final table

Our final table contains the player number, the player’s name, their home state, the total points they won through the tournament, their rating before the tournament, and finally, our newly calculated average opponent’s rating.

# Create final table

final_table <- players |> 
  select(player, name, state, total_pts, pre_rating, average_opponent_rating)
player name state total_pts pre_rating average_opponent_rating
1 Gary Hua ON 6.0 1794 1605.286
2 Dakshesh Daruri MI 6.0 1553 1469.286
3 Aditya Bajaj MI 6.0 1384 1563.571
4 Patrick H Schilling MI 5.5 1716 1573.571
5 Hanshi Zuo MI 5.5 1655 1500.857
6 Hansen Song OH 5.0 1686 1518.714

Winners and losers

Now that we have all our data in order, we can review the results of the tournament. Let’s visualize the change in ratings after the tournament.

Conclusion

In this article we developed a full end-to-end pipeline to parse, organize, store, and visualize the results of a chess tournament. We used common packages from the tidyverse and leaned heavily on regex to extract the key values out of our data. Once the data was in tidy shape, we separated it into logical relational tables and uploaded to our database. Finally, we plotted the outcomes of the tournament to identify the winners and losers of the event.