Introduction

This project involves taking a semi-structured chess tournament text file and transforming it into a clean CSV dataset containing summarized player statistics. The workflow programmatically extracts relevant player information and computes each player’s average opponent pre-tournament rating. The resulting dataset is structured for reproducible analysis and potential database integration.

Approach

For this project, I will develop an R Markdown workflow to extract and transform data from a semi-structured chess tournament text file stored in the Project1 directory of my public GitHub repository. In order to generate the required CSV output, I will first read the raw text file into R and examine its structure. Because the file is formatted as a fixed-width cross table, I will reconstruct each player’s complete record by combining the two lines that represent a single player.

Next, I will extract the required fields for each player: name, state, total points, and pre-tournament rating. I will parse opponent numbers from the round result columns on the first line and include only valid games (wins, losses, and draws) in the calculation, excluding byes, unplayed rounds, and forfeit results (X).

To compute the average pre-tournament rating of each player’s opponents, I will construct a lookup structure that maps player pairing numbers to their pre-tournament ratings. Using this mapping, I will retrieve the corresponding ratings for each opponent and calculate the mean opponent rating based only on games actually played. Finally, I will assemble the cleaned and computed data into a structured data frame and export the results as a CSV file containing the required fields (Player Name, State, Total Points, Pre-Rating, and Average Opponent Pre-Rating), suitable for reproducible analysis and potential database integration.

Code Base

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.0     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.2     ✔ tibble    3.3.1
## ✔ lubridate 1.9.5     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Read the tournament text file from GitHub as raw text
# Each element of 'txt' represents one line from the file

url <- "https://raw.githubusercontent.com/MKudanova/Data607/main/Project%201/tournamentinfo.txt"
txt <- read_lines(url)

# Quick sanity check: formatting, character, preview
class(txt)
## [1] "character"
length(txt)
## [1] 196
txt[1:5]
## [1] "-----------------------------------------------------------------------------------------" 
## [2] " Pair | Player Name                     |Total|Round|Round|Round|Round|Round|Round|Round| "
## [3] " Num  | USCF ID / Rtg (Pre->Post)       | Pts |  1  |  2  |  3  |  4  |  5  |  6  |  7  | "
## [4] "-----------------------------------------------------------------------------------------" 
## [5] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|"
# Since the data file is not structured as a standard CSV or table,
# I am working with semi-structured text.
# Each player is represented by two different line types:
#   -  A stat line that begins with a numeric pair number
#   -  A detail line that begins with a two-letter state code
#
# The vertical bar "|" acts as a column separator within each line.
# To distinguish between the two line types, I use pattern matching
# with regular expressions (regex). This allows me to identify rows
# based on their structural format rather than specific text values.
#
# The first pattern selects lines that start with a number followed by
# a vertical bar, representing the main player rows.
# The second pattern selects lines that start with a two-letter state code
# followed by a vertical bar, representing the detail rows that contain
# rating information.

stat_rows <- txt[str_detect(txt, "^\\s*\\d+\\s*\\|")]
detail_rows <- txt[str_detect(txt, "^\\s*[A-Z]{2}\\s*\\|")]

# Another sanity check point: there should be one main row and one detail row per player
stopifnot(length(stat_rows) == length(detail_rows))
# Splitting stat rows by the vertical bar "|" to separate fields
stat_split <- str_split(stat_rows, "\\|")

# Extract pair number, player name, and total points
stat_df <- tibble(parts = stat_split) %>%
  mutate(
    PairNum     = as.integer(str_trim(map_chr(parts, 1))),
    PlayerName  = str_trim(map_chr(parts, 2)),
    TotalPoints = as.numeric(str_trim(map_chr(parts, 3)))
  ) %>%
  select(PairNum, PlayerName, TotalPoints)

stat_df %>% slice(1)
## # A tibble: 1 × 3
##   PairNum PlayerName TotalPoints
##     <int> <chr>            <dbl>
## 1       1 GARY HUA             6
# Extract opponent pair numbers from round result fields
stat_df <- tibble(parts = stat_split) %>%
  mutate(
    PairNum     = as.integer(str_trim(map_chr(parts, 1))),
    PlayerName  = str_trim(map_chr(parts, 2)),
    TotalPoints = as.numeric(str_trim(map_chr(parts, 3))),
    
    # Extract opponent numbers from columns 4–10
    Opponents = map(parts, ~ str_extract_all(.x[4:10], "\\d+"))
  ) %>%
  select(PairNum, PlayerName, TotalPoints, Opponents)
# Convert opponent pair numbers to integers
stat_df <- stat_df %>%
  mutate(
    Opponents = map(Opponents, as.integer)
  )
# Split detail rows and extract State + PreRating (number after "R:")
detail_split <- str_split(detail_rows, "\\|")

detail_df <- tibble(parts = detail_split) %>%
  mutate(
    State = str_trim(map_chr(parts, 1)),
    PreRating = as.integer(str_extract(map_chr(parts, 2), "(?<=R:)\\s*\\d+"))
  ) %>%
  select(State, PreRating)

detail_df %>% slice(1)
## # A tibble: 1 × 2
##   State PreRating
##   <chr>     <int>
## 1 ON         1794
# Combine stat_df and detail_df
players <- bind_cols(stat_df, detail_df)

# Build lookup table (PairNum → PreRating)
rating_lookup <- players %>% select(PairNum, PreRating)

# Compute average opponent pre-rating
final_df <- players %>%
  
  # Expand each player's opponent list so each row represents one opponent matchup
  unnest_longer(Opponents, values_to = "OppPairNum") %>%
  
  # Join opponent pairing numbers to their corresponding pre-tournament ratings
  left_join(rating_lookup,
            by = c("OppPairNum" = "PairNum"),
            suffix = c("", "_opp")) %>%
  
  # Group back to one row per player (PairNum is safest unique identifier)
  group_by(PairNum, PlayerName, State, TotalPoints, PreRating) %>%
  
  # Compute the average opponent pre-rating
  # na.rm = TRUE ensures byes/unplayed rounds are excluded
  summarise(
    AvgOppPreRating = round(mean(PreRating_opp, na.rm = TRUE), 0),
    .groups = "drop"
  ) %>%
  
  # Sort by tournament performance:
  # 1) Highest TotalPoints first
  # 2) If tied, higher opponent average first
  arrange(desc(TotalPoints), desc(AvgOppPreRating))
    
# Another sanity check point
final_df %>% filter(PlayerName == "GARY HUA")
## # A tibble: 1 × 6
##   PairNum PlayerName State TotalPoints PreRating AvgOppPreRating
##     <int> <chr>      <chr>       <dbl>     <int>           <dbl>
## 1       1 GARY HUA   ON              6      1794            1605
final_df %>% filter(PlayerName == "ASHWIN BALAJI")
## # A tibble: 1 × 6
##   PairNum PlayerName    State TotalPoints PreRating AvgOppPreRating
##     <int> <chr>         <chr>       <dbl>     <int>           <dbl>
## 1      62 ASHWIN BALAJI MI              1      1530            1186
players %>% filter(PlayerName == "ASHWIN BALAJI") %>% pull(Opponents)
## [[1]]
## [1] 55 NA NA NA NA NA NA
players %>% filter(PairNum == 55) %>% select(PlayerName, PreRating)
## # A tibble: 1 × 2
##   PlayerName PreRating
##   <chr>          <int>
## 1 ALEX KONG       1186
# Write CSV
write_csv(final_df, "chess_players.csv")   
getwd()
## [1] "/Users/madinak/Documents/Data607/Project1"

Final Result!

# CSV output with the required fields:
# PlayerName, State, TotalPoints, PreRating, AvgOppPreRating

csv_url <- "https://raw.githubusercontent.com/MKudanova/Data607/refs/heads/main/Project%201/chess_players.csv"

csv_preview <- read_csv(csv_url)
## Rows: 64 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): PlayerName, State
## dbl (4): PairNum, TotalPoints, PreRating, AvgOppPreRating
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(csv_preview, 10)
## # A tibble: 10 × 6
##    PairNum PlayerName          State TotalPoints PreRating AvgOppPreRating
##      <dbl> <chr>               <chr>       <dbl>     <dbl>           <dbl>
##  1       1 GARY HUA            ON            6        1794            1605
##  2       3 ADITYA BAJAJ        MI            6        1384            1564
##  3       2 DAKSHESH DARURI     MI            6        1553            1469
##  4       4 PATRICK H SCHILLING MI            5.5      1716            1574
##  5       5 HANSHI ZUO          MI            5.5      1655            1501
##  6      10 ANVIT RAO           MI            5        1365            1554
##  7       9 STEFANO LEE         ON            5        1411            1523
##  8       6 HANSEN SONG         OH            5        1686            1519
##  9       8 EZEKIEL HOUGHTON    MI            5        1641            1468
## 10       7 GARY DEE SWATHELL   MI            5        1649            1372

Conclusion

In this project, I transformed a semi-structured chess tournament crosstable into a structured dataset suitable for analysis in R. Using regular expressions and controlled parsing logic, I extracted player information, opponent identifiers, and pre-tournament ratings.

I then calculated each player’s average opponent pre-rating by joining opponent IDs to their corresponding ratings and excluding non-played rounds from the denominator. Two hand calculated test cases one player who completed all rounds and one who played fewer than all rounds were used to verify the correctness of the computation. The manually computed averages matched the programmatic results, confirming the accuracy of the implementation.

Finally, the dataset was explicitly ordered by total points and average opponent pre-rating to reflect tournament performance rather than alphabetical ordering.