Project 1

The Project

This project will import a chess tournament crosstable, transform and tidy the data, and produce a .csv file with summary information from the original crosstable. The information to be returned is the player names, player states, total number of points, player pre-ratings, and average pre-chess rating of the player’s opponents.

Load packages

The following code loads the tidyverse package.

knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Import the data

The data will be imported from github using read_delim to separate all the data by a vertical bar (“|”). Rows of dashes are removed by labeling them as comments. White space is trimmed, and the first three rows are skipped. Then a final empty column is dropped from the data frame.

tournament_info <-
  read_delim("https://raw.githubusercontent.com/mraynolds/data_607/refs/heads/main/tournamentinfo.txt", delim = "|", comment = "--", trim_ws = TRUE, skip = 3, col_names = FALSE, show_col_types = FALSE) |> 
  select(!X11)

head(tournament_info)
## # A tibble: 6 × 10
##   X1    X2                       X3    X4    X5    X6    X7    X8    X9    X10  
##   <chr> <chr>                    <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1     GARY HUA                 6.0   W  39 W  21 W  18 W  14 W   7 D  12 D   4
## 2 ON    15445895 / R: 1794   ->… N:2   W     B     W     B     W     B     W    
## 3 2     DAKSHESH DARURI          6.0   W  63 W  58 L   4 W  17 W  16 W  20 W   7
## 4 MI    14598900 / R: 1553   ->… N:2   B     W     B     W     B     W     B    
## 5 3     ADITYA BAJAJ             6.0   L   8 W  61 W  25 W  21 W  11 W  13 W  12
## 6 MI    14959604 / R: 1384   ->… N:2   W     B     W     B     W     B     W

Make observations a single row

The data is now laid out where each player’s data is laid out over two rows, one immediately following the other. The following code separates alternating lines of code, then combines them into single rows, so that each player’s data is contained on a single row. This change arranges the data so that columns now contain the same type of data. The code then renames the columns to better communicate what the data is that they contain.

players <- tournament_info |> 
  filter(str_detect(tournament_info$X1, "\\d"))

states <- tournament_info |> 
  filter(str_detect(tournament_info$X1, "\\D"))

tournament <- merge(players, states, by = "row.names", sort = TRUE) |> 
  mutate(Row.names = as.numeric(Row.names)) |> 
  arrange(Row.names) |> 
  select(!X1.x) |> 
  rename(player_number = Row.names,
         player_name = X2.x,
         total_pts = X3.x,
         round_1 = X4.x,
         round_2 = X5.x,
         round_3 = X6.x,
         round_4 = X7.x,
         round_5 = X8.x,
         round_6 = X9.x,
         round_7 = X10.x,
         state = X1.y,
         uscf_ID_rating = X2.y,
         norm_performance = X3.y,
         color_1 = X4.y,
         color_2 = X5.y,
         color_3 = X6.y,
         color_4 = X7.y,
         color_5 = X8.y,
         color_6 = X9.y,
         color_7 = X10.y) |> 
  relocate(state:uscf_ID_rating, .after = player_name)

head(tournament)
##   player_number         player_name state              uscf_ID_rating total_pts
## 1             1            GARY HUA    ON 15445895 / R: 1794   ->1817       6.0
## 2             2     DAKSHESH DARURI    MI 14598900 / R: 1553   ->1663       6.0
## 3             3        ADITYA BAJAJ    MI 14959604 / R: 1384   ->1640       6.0
## 4             4 PATRICK H SCHILLING    MI 12616049 / R: 1716   ->1744       5.5
## 5             5          HANSHI ZUO    MI 14601533 / R: 1655   ->1690       5.5
## 6             6         HANSEN SONG    OH 15055204 / R: 1686   ->1687       5.0
##   round_1 round_2 round_3 round_4 round_5 round_6 round_7 norm_performance
## 1   W  39   W  21   W  18   W  14   W   7   D  12   D   4              N:2
## 2   W  63   W  58   L   4   W  17   W  16   W  20   W   7              N:2
## 3   L   8   W  61   W  25   W  21   W  11   W  13   W  12              N:2
## 4   W  23   D  28   W   2   W  26   D   5   W  19   D   1              N:2
## 5   W  45   W  37   D  12   D  13   D   4   W  14   W  17              N:2
## 6   W  34   D  29   L  11   W  35   D  10   W  27   W  21              N:3
##   color_1 color_2 color_3 color_4 color_5 color_6 color_7
## 1       W       B       W       B       W       B       W
## 2       B       W       B       W       B       W       B
## 3       W       B       W       B       W       B       W
## 4       W       B       W       B       W       B       B
## 5       B       W       B       W       B       W       B
## 6       W       B       W       B       B       W       B

Make the Data Tidy

The code block below makes the data tidy. Several columns still contain multiple variables. Each column that contains multiple variables in each cell is separated wider. Once each cell contains a single variable or observation, the data is pivoted longer to tidy the data. As the data had been structured, there were multiple repeated columns. This would make adding additional data to the table challenging and does not meet Hadley Wickham’s standard for “tidy”. The pivot longer solves this issue.

Some of the column data types are changed, and then the order of columns is changed to make the fixed variables contiguous.

After running the below code, the data is now tidy and ready for analysis.

tournament <- tournament |> 
  separate_wider_delim(uscf_ID_rating, " / R: ", names = c("uscf_id", "rating")) |> 
  separate_wider_delim(rating, "->", names = c("pre_tournament_rating", "post_tournament_rating")) |> 
  separate_wider_delim(pre_tournament_rating, "P", names = c("pre_tourn_rating","pre_tourn_provisional_games"), too_few = "align_start") |> 
  separate_wider_delim(post_tournament_rating, "P", names = c("post_tourn_rating","post_tourn_provisional_games"), too_few = "align_start") |>
  separate_wider_regex(round_1, patterns = c(result_1 = "\\w", "\\s+", opponent_1 = "\\d+"), too_few = "align_start") |> 
  separate_wider_regex(round_2, patterns = c(result_2 = "\\w", "\\s+", opponent_2 = "\\d+"), too_few = "align_start") |>
  separate_wider_regex(round_3, patterns = c(result_3 = "\\w", "\\s+", opponent_3 = "\\d+"), too_few = "align_start") |>
  separate_wider_regex(round_4, patterns = c(result_4 = "\\w", "\\s+", opponent_4 = "\\d+"), too_few = "align_start") |>
  separate_wider_regex(round_5, patterns = c(result_5 = "\\w", "\\s+", opponent_5 = "\\d+"), too_few = "align_start") |>
  separate_wider_regex(round_6, patterns = c(result_6 = "\\w", "\\s+", opponent_6 = "\\d+"), too_few = "align_start") |>
  separate_wider_regex(round_7, patterns = c(result_7 = "\\w", "\\s+", opponent_7 = "\\d+"), too_few = "align_start") |>
  pivot_longer(
    cols = starts_with(c("result", "opponent", "color")),
    names_to = c(".value", "round"),
    names_sep = "_",
    values_drop_na = FALSE
  ) |> 
  mutate(
    pre_tourn_rating = as.numeric(pre_tourn_rating),
    pre_tourn_provisional_games = as.numeric(pre_tourn_provisional_games),
    post_tourn_rating = as.numeric(post_tourn_rating),
    post_tourn_provisional_games = as.numeric(post_tourn_provisional_games),
    total_pts = as.numeric(total_pts),
    opponent = as.numeric(opponent)
  ) |> 
  relocate(norm_performance, .after = total_pts) |> 
  relocate(pre_tourn_provisional_games, .after = post_tourn_rating)

tournament <- tournament |> mutate(norm_performance = str_remove_all(tournament$norm_performance, "N:"))

head(tournament)
## # A tibble: 6 × 14
##   player_number player_name state uscf_id  pre_tourn_rating post_tourn_rating
##           <dbl> <chr>       <chr> <chr>               <dbl>             <dbl>
## 1             1 GARY HUA    ON    15445895             1794              1817
## 2             1 GARY HUA    ON    15445895             1794              1817
## 3             1 GARY HUA    ON    15445895             1794              1817
## 4             1 GARY HUA    ON    15445895             1794              1817
## 5             1 GARY HUA    ON    15445895             1794              1817
## 6             1 GARY HUA    ON    15445895             1794              1817
## # ℹ 8 more variables: pre_tourn_provisional_games <dbl>,
## #   post_tourn_provisional_games <dbl>, total_pts <dbl>,
## #   norm_performance <chr>, round <chr>, result <chr>, opponent <dbl>,
## #   color <chr>

A Preliminary Summary

From the tidy data frame, the majority of the required summary data is extracted and saved to a new data frame. This data frame includes player numbers, player names, player states, and pre-tournament player ratings. The only thing missing is the average of each player’s opponents’ pre-tournament ratings.

player_summary <- tournament |>
  select(player_number, player_name, state, total_pts, pre_tourn_rating) |> 
  distinct(player_number, .keep_all = TRUE)

head(player_summary)
## # A tibble: 6 × 5
##   player_number player_name         state total_pts pre_tourn_rating
##           <dbl> <chr>               <chr>     <dbl>            <dbl>
## 1             1 GARY HUA            ON          6               1794
## 2             2 DAKSHESH DARURI     MI          6               1553
## 3             3 ADITYA BAJAJ        MI          6               1384
## 4             4 PATRICK H SCHILLING MI          5.5             1716
## 5             5 HANSHI ZUO          MI          5.5             1655
## 6             6 HANSEN SONG         OH          5               1686

The Opponent Averages

The following code block creates two new data frames. One data frame has basic information about each player including their number, name, pre-tournament rating, and opponent number.

The second data frame is a list of player numbers and their pre-tournament rating.

The code block then joins the two data frames by pairing the opponent numbers from the first data frame with the pre-tournament ratings from the second frame.

The data is then grouped by player and the opponent ratings are averaged.

Finally, the list of averages is joined to the player summary data that was created in the previous code block, and the player number column is dropped from the data frame.

The data summary is now complete.

averages <- tournament |> 
  select(player_number, player_name, player_pre_tourn_rating = pre_tourn_rating, opponent) |> 
  filter(opponent != is.na(NA))

ratings <- tournament |> 
  select(opponent_number = player_number, opponent_pre_tourn_rating = pre_tourn_rating) |> 
  distinct()

combined_rating <- left_join(averages, ratings, by = join_by(opponent == opponent_number))

avg_rating <- combined_rating |> 
  group_by(player_number) |> 
  summarize(avg_opponent_rating = round(mean(opponent_pre_tourn_rating)))

player_summary_final <- left_join(player_summary, avg_rating, by = join_by(player_number == player_number)) |> 
  select(!player_number)

head(player_summary_final)
## # A tibble: 6 × 5
##   player_name         state total_pts pre_tourn_rating avg_opponent_rating
##   <chr>               <chr>     <dbl>            <dbl>               <dbl>
## 1 GARY HUA            ON          6               1794                1605
## 2 DAKSHESH DARURI     MI          6               1553                1469
## 3 ADITYA BAJAJ        MI          6               1384                1564
## 4 PATRICK H SCHILLING MI          5.5             1716                1537
## 5 HANSHI ZUO          MI          5.5             1655                1501
## 6 HANSEN SONG         OH          5               1686                1519

Export to .csv

This code block will export the player summary final to a .csv file for future use.

write_csv(player_summary_final, "player_summary_final.csv")