In this project, we worked with a text file containing chess tournament results. The goal was to create a CSV file that summarizes each player’s information, including their name, state, total points, pre-tournament rating, and the average pre-tournament rating of their opponents. The final CSV can be used for further analysis or imported into a database.
library(tidyverse)
library(stringr)
library(readr)
library(dplyr)
First, we read the text file from a GitHub repository (https://github.com/arutam-antunish/DATA607/blob/main/tournamentinfo.txt), using the raw-content URL so that read_lines() receives plain text rather than an HTML page. We then removed header lines, separator lines, and empty lines so that only the lines with player information remained.
tournament_raw_url <- "https://raw.githubusercontent.com/arutam-antunish/DATA607/refs/heads/main/tournamentinfo.txt"
tournament_lines <- read_lines(tournament_raw_url)
view(tournament_lines)
tournament_lines <- tournament_lines[!str_detect(tournament_lines, "^\\s*(Pair|Num)")]
tournament_lines <- tournament_lines[!str_detect(tournament_lines, "^-{3,}")]
tournament_lines <- tournament_lines[str_trim(tournament_lines) != ""]
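To make the three filters concrete, the sketch below applies them to a handful of made-up lines laid out like the tournament file. The object names sample_lines and kept, and the player shown, are illustrative assumptions and do not come from the data.

```r
library(stringr)

# Made-up lines that mimic the layout of the tournament file
sample_lines <- c(
  "-----------------------------------------------------------------",
  " Pair | Player Name                     |Total|Round|Round|",
  " Num  | USCF ID / Rtg (Pre->Post)       | Pts |  1  |  2  |",
  "-----------------------------------------------------------------",
  "    1 | ALICE EXAMPLE                   |6.0  |W  39|W  21|",
  "   ON | 10000001 / R: 1794   ->1817     |N:2  |W    |B    |",
  "",
  "-----------------------------------------------------------------"
)

is_header    <- str_detect(sample_lines, "^\\s*(Pair|Num)")  # the two header rows
is_separator <- str_detect(sample_lines, "^-{3,}")           # dashed rules
is_blank     <- str_trim(sample_lines) == ""                 # empty lines

# Keep only lines that are none of the above
kept <- sample_lines[!(is_header | is_separator | is_blank)]
print(kept)
```

Only the two player lines survive, which is the shape the next step expects: one main row followed by one detail row per player.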
The data had two types of rows: main rows with the player name, total points, and round results, and detail rows with the player’s state and ratings. We separated these two types of rows and prepared them for further processing.
is_main <- str_detect(tournament_lines, "^\\s*\\d+")
main_lines <- tournament_lines[is_main]
detail_lines <- tournament_lines[!is_main]
main_df <- str_split_fixed(main_lines, "\\|", 10) %>% as.data.frame(stringsAsFactors = FALSE)
detail_df <- str_split_fixed(detail_lines, "\\|", 10) %>% as.data.frame(stringsAsFactors = FALSE)
main_df <- main_df %>% mutate(across(everything(), str_squish))
detail_df <- detail_df %>% mutate(across(everything(), str_squish))
colnames(main_df) <- paste0("V", 1:10)
colnames(detail_df) <- paste0("V", 1:10)
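One detail of this split is easy to miss: with a fixed count of 10 pieces, the last piece keeps whatever remains of the line, including the trailing '|'. The sketch below shows this on a made-up main row (the player name is hypothetical). The round-7 opponent number is still recoverable from the tenth piece, because only its digits are extracted later.

```r
library(stringr)

# Made-up main row in the same layout as the tournament file
main_line <- "    1 | ALICE EXAMPLE                   |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|"

# Split into exactly 10 pieces; row 1 of the result matrix is a character vector
pieces <- str_squish(str_split_fixed(main_line, "\\|", 10)[1, ])
print(pieces)

# The tenth piece still carries the trailing delimiter,
# but a digits-only extraction ignores it
round7_opponent <- str_extract(pieces[10], "\\d+")
print(round7_opponent)
```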
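With each row already split into columns on the ‘|’ symbol, we combined the main rows with the detail rows to create a table with all relevant information for each player. This works because the rows alternate: the n-th main row and the n-th detail row describe the same player.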
players <- main_df %>%
  mutate(
    Player_State  = detail_df$V1,
    Player_Detail = detail_df$V2,
    Player_Name   = V2,
    Total_Points  = as.numeric(V3),
    R1_raw = V4, R2_raw = V5, R3_raw = V6, R4_raw = V7,
    R5_raw = V8, R6_raw = V9, R7_raw = V10,
    # Pre-rating: the run of digits right after "R: "; anything following the digits is ignored
    Pre_Rating = as.numeric(str_extract(Player_Detail, "(?<=R: )\\d+"))
  )
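The pre-rating pattern relies on exactly one space after "R:", which holds here only because every cell was squished earlier. The sketch below illustrates this on made-up detail cells. The IDs, the ratings, and the assumption that a rating can carry a letter suffix or extra padding are all hypothetical. It also shows a more tolerant alternative pattern, offered as one possible option rather than the method used above.

```r
library(stringr)

# Made-up detail cells, already squished to single spaces
detail_cells <- c(
  "10000001 / R: 1794 ->1817",
  "10000002 / R: 1641P17->1657P24",
  "10000003 / R: 377P3 ->1076P10"
)

# Lookbehind pattern from the report: digits immediately after "R: "
pre <- as.numeric(str_extract(detail_cells, "(?<=R: )\\d+"))
print(pre)

# Without squishing, a cell padded with two spaces no longer matches
unsquished <- "10000003 / R:  377P3 ->1076P10"
strict <- str_extract(unsquished, "(?<=R: )\\d+")
print(strict)

# A capture group with \\s* tolerates any amount of whitespace
loose <- as.numeric(str_match(unsquished, "R:\\s*(\\d+)")[, 2])
print(loose)
```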
With each player’s pre-tournament rating already extracted, we pulled the pair number of each opponent from the seven round columns. Rounds with no opponent number yield NA and are dropped later. Then, we reshaped the data into a long format with one row per player per game, and joined it with the pre-ratings to calculate the average rating of each player’s opponents.
players <- players %>% mutate(opp1 = as.numeric(str_extract(R1_raw, "\\d+")),
opp2 = as.numeric(str_extract(R2_raw, "\\d+")),
opp3 = as.numeric(str_extract(R3_raw, "\\d+")),
opp4 = as.numeric(str_extract(R4_raw, "\\d+")),
opp5 = as.numeric(str_extract(R5_raw, "\\d+")),
opp6 = as.numeric(str_extract(R6_raw, "\\d+")),
opp7 = as.numeric(str_extract(R7_raw, "\\d+")))
# Lookup table: pair number -> pre-tournament rating (V1 holds the pair number from the file)
rating_lookup <- players %>% mutate(Pair = as.integer(V1)) %>% select(Pair, Pre_Rating)
# One row per player per game played; rounds with no opponent number are dropped
opp_long <- players %>%
  mutate(Pair = as.integer(V1)) %>%
  select(Pair, Player_Name, opp1:opp7) %>%
  pivot_longer(cols = starts_with("opp"), names_to = "round", values_to = "opp_pair") %>%
  filter(!is.na(opp_pair))
opp_long <- opp_long %>% left_join(rating_lookup, by = c("opp_pair" = "Pair")) %>% rename(Opp_Pre_Rating = Pre_Rating)
View(opp_long)
avg_opp <- opp_long %>% group_by(Player_Name) %>%
summarize(Avg_Opponents_Pre_Rating = mean(Opp_Pre_Rating, na.rm = TRUE), .groups = "drop")
View(avg_opp)
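The reshape-and-join logic can be checked by hand on a tiny example. The sketch below uses a made-up three-player, two-round table (all names and ratings are hypothetical) in which one player sits out each round, so every average is taken over the games actually played.

```r
library(dplyr)
library(tidyr)

# Made-up data: round 1 is A vs B (C sits out), round 2 is A vs C (B sits out)
toy <- tibble(
  Pair        = 1:3,
  Player_Name = c("A", "B", "C"),
  Pre_Rating  = c(1500, 1700, 1200),
  opp1        = c(2L, 1L, NA),
  opp2        = c(3L, NA, 1L)
)

# Pair number -> rating, renamed so it reads as the opponent's rating after the join
lookup <- toy %>% select(Pair, Opp_Pre_Rating = Pre_Rating)

toy_avg <- toy %>%
  pivot_longer(starts_with("opp"), names_to = "round", values_to = "opp_pair") %>%
  filter(!is.na(opp_pair)) %>%                        # drop rounds not played
  left_join(lookup, by = c("opp_pair" = "Pair")) %>%  # attach each opponent's rating
  group_by(Pair, Player_Name) %>%
  summarize(Avg_Opponents_Pre_Rating = mean(Opp_Pre_Rating), .groups = "drop")

print(toy_avg)
```

Player A faced B (1700) and C (1200), so A's average is 1450. B and C each faced only A, so both averages are 1500. Grouping by pair number as well as name, as in this sketch, would also keep two players with the same name apart.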
Finally, we created a summary table with the player’s name, state, total points, pre-rating, and average rating of opponents. We saved this table as a CSV file.
players_tournament <- players %>% select(Player_Name, Player_State, Total_Points, Pre_Rating) %>% left_join(avg_opp, by = "Player_Name")
View(players_tournament)
write_csv(players_tournament, "players_tournament.csv")
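A quick way to confirm that the exported file is usable downstream is to read it back and compare its shape with the table that was written. The sketch below does this with a made-up two-row table and a temporary file; the values and the names toy_summary, tmp, and check are illustrative assumptions.

```r
library(readr)
library(tibble)

# Made-up summary table with the same columns as the exported CSV
toy_summary <- tibble(
  Player_Name              = c("ALICE EXAMPLE", "BOB SAMPLE"),
  Player_State             = c("ON", "MI"),
  Total_Points             = c(6.0, 5.5),
  Pre_Rating               = c(1794, 1553),
  Avg_Opponents_Pre_Rating = c(1605.3, 1469.3)
)

# Write to a temporary file, then read it back without the column-type message
tmp <- tempfile(fileext = ".csv")
write_csv(toy_summary, tmp)
check <- read_csv(tmp, show_col_types = FALSE)

# Column names and row count should survive the round trip
print(identical(names(check), names(toy_summary)))
print(nrow(check) == nrow(toy_summary))
```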
This histogram shows how the players’ total points are distributed. Chess scores move in half-point steps (a draw is worth 0.5), so with a bin width of 0.5 each bar counts the players who finished on one particular score. It shows which scores were most common and how the field performed overall.
ggplot(players_tournament, aes(x = Total_Points)) +
geom_histogram(binwidth = 0.5, fill = "darkblue") +
theme_classic() + labs(title = "Distribution of Total Points", x = "Total Points", y = "Number of Players")
This scatter plot compares each player’s pre-tournament rating with the average rating of their opponents. Each point represents one player. This graph helps us understand whether stronger players faced stronger opponents on average, and if there is any relationship between a player’s rating and the difficulty of their matches.
ggplot(players_tournament, aes(x = Pre_Rating, y = Avg_Opponents_Pre_Rating)) +
geom_point(color = "darkred") + theme_classic() +
labs(title = "Player Rating vs Average Opponents' Rating", x = "Pre-Rating", y = "Avg Opponents Pre-Rating")
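The relationship suggested by the scatter plot could also be summarized with a single number, for example a Pearson correlation. The sketch below computes one on made-up values (not the tournament data). The NA stands for a hypothetical player with no recorded opponents, which is why use = "complete.obs" is passed.

```r
# Made-up ratings; the last player has no opponent average
toy_ratings <- data.frame(
  Pre_Rating               = c(1200, 1400, 1600, 1800),
  Avg_Opponents_Pre_Rating = c(1300, 1350, 1500, NA)
)

# Pearson correlation over the rows where both values are present
r <- cor(
  toy_ratings$Pre_Rating,
  toy_ratings$Avg_Opponents_Pre_Rating,
  use = "complete.obs"
)
print(r)
```

A value near 1 would indicate that higher-rated players tended to face higher-rated opposition, while a value near 0 would indicate little linear relationship.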
We successfully transformed a text file of chess tournament results into a structured CSV file containing each player’s name, state, total points, pre-tournament rating, and average opponent pre-rating. Beyond the two plots above, this data could support further visualizations, an exploration of patterns in player performance, or comparisons of ratings between states or rounds.