This project will import a chess tournament crosstable, transform and tidy the data, and produce a .csv file with summary information from the original crosstable. The information to be returned is the player names, player states, total number of points, player pre-ratings, and average pre-chess rating of the player’s opponents.
The following code loads the tidyverse package.
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
The data will be imported from github using read_delim to separate all the data by a vertical bar (“|”). Rows of dashes are removed by labeling them as comments. White space is trimmed, and the first three rows are skipped. Then a final empty column is dropped from the data frame.
tournament_info <-
read_delim("https://raw.githubusercontent.com/mraynolds/data_607/refs/heads/main/tournamentinfo.txt", delim = "|", comment = "--", trim_ws = TRUE, skip = 3, col_names = FALSE, show_col_types = FALSE) |>
select(!X11)
head(tournament_info)
## # A tibble: 6 × 10
## X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1 GARY HUA 6.0 W 39 W 21 W 18 W 14 W 7 D 12 D 4
## 2 ON 15445895 / R: 1794 ->… N:2 W B W B W B W
## 3 2 DAKSHESH DARURI 6.0 W 63 W 58 L 4 W 17 W 16 W 20 W 7
## 4 MI 14598900 / R: 1553 ->… N:2 B W B W B W B
## 5 3 ADITYA BAJAJ 6.0 L 8 W 61 W 25 W 21 W 11 W 13 W 12
## 6 MI 14959604 / R: 1384 ->… N:2 W B W B W B W
The data is now laid out where each player’s data is laid out over two rows, one immediately following the other. The following code separates alternating lines of code, then combines them into single rows, so that each player’s data is contained on a single row. This change arranges the data so that columns now contain the same type of data. The code then renames the columns to better communicate what the data is that they contain.
players <- tournament_info |>
filter(str_detect(tournament_info$X1, "\\d"))
states <- tournament_info |>
filter(str_detect(tournament_info$X1, "\\D"))
tournament <- merge(players, states, by = "row.names", sort = TRUE) |>
mutate(Row.names = as.numeric(Row.names)) |>
arrange(Row.names) |>
select(!X1.x) |>
rename(player_number = Row.names,
player_name = X2.x,
total_pts = X3.x,
round_1 = X4.x,
round_2 = X5.x,
round_3 = X6.x,
round_4 = X7.x,
round_5 = X8.x,
round_6 = X9.x,
round_7 = X10.x,
state = X1.y,
uscf_ID_rating = X2.y,
norm_performance = X3.y,
color_1 = X4.y,
color_2 = X5.y,
color_3 = X6.y,
color_4 = X7.y,
color_5 = X8.y,
color_6 = X9.y,
color_7 = X10.y) |>
relocate(state:uscf_ID_rating, .after = player_name)
head(tournament)
## player_number player_name state uscf_ID_rating total_pts
## 1 1 GARY HUA ON 15445895 / R: 1794 ->1817 6.0
## 2 2 DAKSHESH DARURI MI 14598900 / R: 1553 ->1663 6.0
## 3 3 ADITYA BAJAJ MI 14959604 / R: 1384 ->1640 6.0
## 4 4 PATRICK H SCHILLING MI 12616049 / R: 1716 ->1744 5.5
## 5 5 HANSHI ZUO MI 14601533 / R: 1655 ->1690 5.5
## 6 6 HANSEN SONG OH 15055204 / R: 1686 ->1687 5.0
## round_1 round_2 round_3 round_4 round_5 round_6 round_7 norm_performance
## 1 W 39 W 21 W 18 W 14 W 7 D 12 D 4 N:2
## 2 W 63 W 58 L 4 W 17 W 16 W 20 W 7 N:2
## 3 L 8 W 61 W 25 W 21 W 11 W 13 W 12 N:2
## 4 W 23 D 28 W 2 W 26 D 5 W 19 D 1 N:2
## 5 W 45 W 37 D 12 D 13 D 4 W 14 W 17 N:2
## 6 W 34 D 29 L 11 W 35 D 10 W 27 W 21 N:3
## color_1 color_2 color_3 color_4 color_5 color_6 color_7
## 1 W B W B W B W
## 2 B W B W B W B
## 3 W B W B W B W
## 4 W B W B W B B
## 5 B W B W B W B
## 6 W B W B B W B
The code block below makes the data tidy. Several columns still contain multiple variables. Each column that contains multiple variables in each cell is separated wider. Once each cell contains a single variable or observation, the data is pivoted longer to tidy the data. As the data had been structured, there were multiple repeated columns. This would make adding additional data to the table challenging and does not meet Hadley Wickham’s standard for “tidy”. The pivot longer solves this issue.
Some of the column data types are changed, and then the order of columns is changed to make the fixed variables contiguous.
After running the below code, the data is now tidy and ready for analysis.
tournament <- tournament |>
separate_wider_delim(uscf_ID_rating, " / R: ", names = c("uscf_id", "rating")) |>
separate_wider_delim(rating, "->", names = c("pre_tournament_rating", "post_tournament_rating")) |>
separate_wider_delim(pre_tournament_rating, "P", names = c("pre_tourn_rating","pre_tourn_provisional_games"), too_few = "align_start") |>
separate_wider_delim(post_tournament_rating, "P", names = c("post_tourn_rating","post_tourn_provisional_games"), too_few = "align_start") |>
separate_wider_regex(round_1, patterns = c(result_1 = "\\w", "\\s+", opponent_1 = "\\d+"), too_few = "align_start") |>
separate_wider_regex(round_2, patterns = c(result_2 = "\\w", "\\s+", opponent_2 = "\\d+"), too_few = "align_start") |>
separate_wider_regex(round_3, patterns = c(result_3 = "\\w", "\\s+", opponent_3 = "\\d+"), too_few = "align_start") |>
separate_wider_regex(round_4, patterns = c(result_4 = "\\w", "\\s+", opponent_4 = "\\d+"), too_few = "align_start") |>
separate_wider_regex(round_5, patterns = c(result_5 = "\\w", "\\s+", opponent_5 = "\\d+"), too_few = "align_start") |>
separate_wider_regex(round_6, patterns = c(result_6 = "\\w", "\\s+", opponent_6 = "\\d+"), too_few = "align_start") |>
separate_wider_regex(round_7, patterns = c(result_7 = "\\w", "\\s+", opponent_7 = "\\d+"), too_few = "align_start") |>
pivot_longer(
cols = starts_with(c("result", "opponent", "color")),
names_to = c(".value", "round"),
names_sep = "_",
values_drop_na = FALSE
) |>
mutate(
pre_tourn_rating = as.numeric(pre_tourn_rating),
pre_tourn_provisional_games = as.numeric(pre_tourn_provisional_games),
post_tourn_rating = as.numeric(post_tourn_rating),
post_tourn_provisional_games = as.numeric(post_tourn_provisional_games),
total_pts = as.numeric(total_pts),
opponent = as.numeric(opponent)
) |>
relocate(norm_performance, .after = total_pts) |>
relocate(pre_tourn_provisional_games, .after = post_tourn_rating)
tournament <- tournament |> mutate(norm_performance = str_remove_all(tournament$norm_performance, "N:"))
head(tournament)
## # A tibble: 6 × 14
## player_number player_name state uscf_id pre_tourn_rating post_tourn_rating
## <dbl> <chr> <chr> <chr> <dbl> <dbl>
## 1 1 GARY HUA ON 15445895 1794 1817
## 2 1 GARY HUA ON 15445895 1794 1817
## 3 1 GARY HUA ON 15445895 1794 1817
## 4 1 GARY HUA ON 15445895 1794 1817
## 5 1 GARY HUA ON 15445895 1794 1817
## 6 1 GARY HUA ON 15445895 1794 1817
## # ℹ 8 more variables: pre_tourn_provisional_games <dbl>,
## # post_tourn_provisional_games <dbl>, total_pts <dbl>,
## # norm_performance <chr>, round <chr>, result <chr>, opponent <dbl>,
## # color <chr>
From the tidy data frame, the majority of the required summary data is extracted and saved to a new data frame. This data frame includes player numbers, player names, player states, and pre-tournament player ratings. The only thing missing is the average of each player’s opponents’ pre-tournament ratings.
player_summary <- tournament |>
select(player_number, player_name, state, total_pts, pre_tourn_rating) |>
distinct(player_number, .keep_all = TRUE)
head(player_summary)
## # A tibble: 6 × 5
## player_number player_name state total_pts pre_tourn_rating
## <dbl> <chr> <chr> <dbl> <dbl>
## 1 1 GARY HUA ON 6 1794
## 2 2 DAKSHESH DARURI MI 6 1553
## 3 3 ADITYA BAJAJ MI 6 1384
## 4 4 PATRICK H SCHILLING MI 5.5 1716
## 5 5 HANSHI ZUO MI 5.5 1655
## 6 6 HANSEN SONG OH 5 1686
The following code block creates two new data frames. One data frame has basic information about each player including their number, name, pre-tournament rating, and opponent number.
The second data frame is a list of player numbers and their pre-tournament rating.
The code block then joins the two data frames by pairing the opponent numbers from the first data frame with the pre-tournament ratings from the second frame.
The data is then grouped by player and the opponent ratings are averaged.
Finally, the list of averages is joined to the player summary data that was created in the previous code block, and the player number column is dropped from the data frame.
The data summary is now complete.
averages <- tournament |>
select(player_number, player_name, player_pre_tourn_rating = pre_tourn_rating, opponent) |>
filter(opponent != is.na(NA))
ratings <- tournament |>
select(opponent_number = player_number, opponent_pre_tourn_rating = pre_tourn_rating) |>
distinct()
combined_rating <- left_join(averages, ratings, by = join_by(opponent == opponent_number))
avg_rating <- combined_rating |>
group_by(player_number) |>
summarize(avg_opponent_rating = round(mean(opponent_pre_tourn_rating)))
player_summary_final <- left_join(player_summary, avg_rating, by = join_by(player_number == player_number)) |>
select(!player_number)
head(player_summary_final)
## # A tibble: 6 × 5
## player_name state total_pts pre_tourn_rating avg_opponent_rating
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 GARY HUA ON 6 1794 1605
## 2 DAKSHESH DARURI MI 6 1553 1469
## 3 ADITYA BAJAJ MI 6 1384 1564
## 4 PATRICK H SCHILLING MI 5.5 1716 1537
## 5 HANSHI ZUO MI 5.5 1655 1501
## 6 HANSEN SONG OH 5 1686 1519
This code block will export the player summary final to a .csv file for future use.
write_csv(player_summary_final, "player_summary_final.csv")