The purpose of this document is to systematically import a text file containing data from a chess tournament cross table. The below R code uses string manipulations and extractions in order to clean up the text file and extract relevant information formatted in a clean, reusable, and exported dataframe with the goal of writing the newly created table to a CSV file for easy use.
library(stringr)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ purrr 1.0.2
## ✔ forcats 1.0.0 ✔ readr 2.1.5
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
First you’ll need to import the raw data from my github repo. You can find a link to the data here: <https://raw.githubusercontent.com/awrubes/Project-1-v2/main/tournamentinfo.txt>. Instead of reading the file directly into a structured format, the data is read in as a string vector to allow for specific string manipulation. This allows us to parse and clean the data in subsequent steps.
#import chess data as txt file
chess_data <- suppressWarnings(readLines("https://raw.githubusercontent.com/awrubes/Project-1-v2/main/tournamentinfo.txt"
))
head(chess_data)
## [1] "-----------------------------------------------------------------------------------------"
## [2] " Pair | Player Name |Total|Round|Round|Round|Round|Round|Round|Round| "
## [3] " Num | USCF ID / Rtg (Pre->Post) | Pts | 1 | 2 | 3 | 4 | 5 | 6 | 7 | "
## [4] "-----------------------------------------------------------------------------------------"
## [5] " 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|"
## [6] " ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |"
chess_df <-tibble(raw=chess_data)
head(chess_df)
## # A tibble: 6 × 1
## raw
## <chr>
## 1 "----------------------------------------------------------------------------…
## 2 " Pair | Player Name |Total|Round|Round|Round|Round|Round…
## 3 " Num | USCF ID / Rtg (Pre->Post) | Pts | 1 | 2 | 3 | 4 | 5 …
## 4 "----------------------------------------------------------------------------…
## 5 " 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7…
## 6 " ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W …
Next, we’ll want to clean up the text file so that it’s easier to parse. We’ll do this by getting rid of the header rows and removing the “—” row separators. This way the only text in the file will be the data we want to selectively extract.
chess_df_clean <- chess_df %>%
filter(!grepl("^-+$", raw),
!grepl("^\\s*(Pair|Num)", raw))
head(chess_df_clean)
## # A tibble: 6 × 1
## raw
## <chr>
## 1 " 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7…
## 2 " ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W …
## 3 " 2 | DAKSHESH DARURI |6.0 |W 63|W 58|L 4|W 17|W 16…
## 4 " MI | 14598900 / R: 1553 ->1663 |N:2 |B |W |B |W |B …
## 5 " 3 | ADITYA BAJAJ |6.0 |L 8|W 61|W 25|W 21|W 11…
## 6 " MI | 14959604 / R: 1384 ->1640 |N:2 |W |B |W |B |W …
Once the unnecessary rows are removed, we proceed to further
structure the dataset. Each player’s data spans two rows—one for their
general tournament information and another for their rating information.
We separate these rows using the mutate()
and
pivot_wider()
functions, which allow us to organize the
data into separate columns for easier extraction.
chess_df_clean <-chess_df_clean %>%
mutate(row_num = row_number()) %>%
mutate(info_type = ifelse(row_num %%2 == 1, "player_info", "player_rating_info")) %>%
group_by(info_type) %>%
mutate(group = row_number())%>%
pivot_wider(names_from = info_type, values_from = raw) %>%
select(-group)
chess_df_clean <- chess_df_clean %>%
mutate(
player_rating_info = ifelse(is.na(player_rating_info), lead(player_rating_info), player_info)
)%>%
filter(!is.na(player_info), !is.na(player_rating_info))
head(chess_df_clean)
## # A tibble: 6 × 3
## row_num player_info player_rating_info
## <int> <chr> <chr>
## 1 1 " 1 | GARY HUA |6.0 |W … " ON | 15445895…
## 2 3 " 2 | DAKSHESH DARURI |6.0 |W … " MI | 14598900…
## 3 5 " 3 | ADITYA BAJAJ |6.0 |L … " MI | 14959604…
## 4 7 " 4 | PATRICK H SCHILLING |5.5 |W … " MI | 12616049…
## 5 9 " 5 | HANSHI ZUO |5.5 |W … " MI | 14601533…
## 6 11 " 6 | HANSEN SONG |5.0 |W … " OH | 15055204…
With the data structured, we now extract specific columns that
contain player names, state information, total points, pre-tournament
ratings, and games played. String extraction functions
(str_extract
) are applied to extract relevant data from
each row. The mutate()
function is used to create new
columns that hold this information.
We also calculate the average rating of opponents each player faced,
using a combination of lapply()
to extract the IDs of
opponents and sapply()
to compute their average ratings
based on the opponent ratings in the data set.
chess_df_clean <- chess_df_clean %>%
mutate(player_name = str_extract(player_info, "(?<=\\|)\\s*[A-Z\\s]+(?=\\s*\\|)"))%>%
mutate(state = trimws(str_extract(player_rating_info, "^\\s*[A-Z]{2}\\b")))%>%
mutate(total_points = as.numeric(str_extract(player_info, "(?<=|)\\d\\.\\d")))%>%
mutate(pre_rating = as.numeric(str_extract(player_rating_info, "(?<=R:)\\s*(\\d+)(?=\\s*->|\\w)")))%>%
mutate(games_played = as.numeric(str_count(player_info, "\\b(W|L|D){1}\\s*\\d+")))%>%
mutate(
opponent_ids = lapply(player_info, function(info){
as.numeric(trimws(unlist(str_extract_all(info, "(?<=\\b[WLD])\\s*\\d+"))))
})
)%>%
mutate(
avg_op_rating=sapply(opponent_ids, function(ids){
opponent_ratings <- .data$pre_rating[ids]
total_opponent_ratings <- sum(opponent_ratings, na.rm = TRUE)
avg_op_rating <- total_opponent_ratings / length(ids)
return(avg_op_rating)
})
)
head(chess_df_clean)
## # A tibble: 6 × 10
## row_num player_info player_rating_info player_name state total_points
## <int> <chr> <chr> <chr> <chr> <dbl>
## 1 1 " 1 | GARY HUA … " ON | 15445895… " GARY HUA… ON 6
## 2 3 " 2 | DAKSHESH D… " MI | 14598900… " DAKSHESH… MI 6
## 3 5 " 3 | ADITYA BAJ… " MI | 14959604… " ADITYA B… MI 6
## 4 7 " 4 | PATRICK H … " MI | 12616049… " PATRICK … MI 5.5
## 5 9 " 5 | HANSHI ZUO… " MI | 14601533… " HANSHI Z… MI 5.5
## 6 11 " 6 | HANSEN SON… " OH | 15055204… " HANSEN S… OH 5
## # ℹ 4 more variables: pre_rating <dbl>, games_played <dbl>,
## # opponent_ids <list>, avg_op_rating <dbl>
This step produces a clean data set that includes critical player statistics such as total points, pre-tournament ratings, the number of games played, and the average rating of their opponents.
After extracting the relevant information, the final step is to create a dataframe that includes only the necessary columns: player name, state, total points, pre-tournament rating, and average opponent rating. This dataframe is then exported to a CSV file for future use.
selected_columns <- chess_df_clean[, c("player_name", "state", "total_points", "pre_rating", "avg_op_rating")]
head(selected_columns)
## # A tibble: 6 × 5
## player_name state total_points pre_rating avg_op_rating
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 " GARY HUA … ON 6 1794 1605.
## 2 " DAKSHESH DARURI … MI 6 1553 1469.
## 3 " ADITYA BAJAJ … MI 6 1384 1564.
## 4 " PATRICK H SCHILLING … MI 5.5 1716 1574.
## 5 " HANSHI ZUO … MI 5.5 1655 1501.
## 6 " HANSEN SONG … OH 5 1686 1519.
The resulting CSV file contains a clean and reusable dataset that can be used for further analysis or reporting. The streamlined approach ensures that the data is both accurate and well-organized for future use.
write.csv(selected_columns, file = "chess_tournament_selected_columns.csv", row.names = FALSE)