PROJECT 1: Chess Tournament Results: Data Extraction and Parsing

Author

Pascal Hermann Kouogang Tafo

INTRODUCTION

In this project, we work with a structured text file of chess tournament results for 64 players. Our goal is to extract meaningful data for each player (name, state, total tournament points, pre-tournament rating, and the average pre-tournament rating of the opponents they faced) and then export it to a clean, relational CSV file.

APPROACH

The text file is pipe-delimited (|), and each player's record is spread across two consecutive lines. To accomplish this goal, I will take the following steps:

Read the file, strip out the dashed separator lines, and pair the two consecutive rows that belong to each player.
Extract each field of interest (Name, State, Points, and Pre-Rating) using regular expressions and string parsing functions.
Create a lookup table, because opponents are listed by “Pair Number” rather than by name, then make a second pass through each player’s round results to match opponents’ pair numbers to their respective Pre-Ratings.
Compute the rounded mean for the average opponent rating column.
Export the final results to a clean CSV file.

Load R packages

library(stringr)

Warning: package 'stringr' was built under R version 4.5.2

library(dplyr)

Warning: package 'dplyr' was built under R version 4.5.2


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

library(readr)

Warning: package 'readr' was built under R version 4.5.2

Read the raw text file

url <- "https://raw.githubusercontent.com/Pascaltafo2025/PROJECT1-DATA-607/refs/heads/main/tournamentinfo.txt"

rawdata_Chess_tournament <- readLines(url)

Warning in readLines(url): incomplete final line found on
'https://raw.githubusercontent.com/Pascaltafo2025/PROJECT1-DATA-607/refs/heads/main/tournamentinfo.txt'

head(rawdata_Chess_tournament,10)

 [1] "-----------------------------------------------------------------------------------------" 
 [2] " Pair | Player Name                     |Total|Round|Round|Round|Round|Round|Round|Round| "
 [3] " Num  | USCF ID / Rtg (Pre->Post)       | Pts |  1  |  2  |  3  |  4  |  5  |  6  |  7  | "
 [4] "-----------------------------------------------------------------------------------------" 
 [5] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|" 
 [6] "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |" 
 [7] "-----------------------------------------------------------------------------------------" 
 [8] "    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|" 
 [9] "   MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |" 
[10] "-----------------------------------------------------------------------------------------"

Let’s clean our raw data.

To clean the raw data, we remove the dashed lines that separate the rows using the “grepl” function. grepl returns TRUE or FALSE for every line, depending on whether the line matches a pattern. In our case, the pattern matches every line that consists entirely of dashes, and we keep only the lines that do not match.

clean_data_Chess_tournament <- rawdata_Chess_tournament[!grepl("^-+$", rawdata_Chess_tournament)] 


glimpse(clean_data_Chess_tournament)

 chr [1:130] " Pair | Player Name                     |Total|Round|Round|Round|Round|Round|Round|Round| " ...

head(clean_data_Chess_tournament,10)

 [1] " Pair | Player Name                     |Total|Round|Round|Round|Round|Round|Round|Round| "
 [2] " Num  | USCF ID / Rtg (Pre->Post)       | Pts |  1  |  2  |  3  |  4  |  5  |  6  |  7  | "
 [3] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|" 
 [4] "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |" 
 [5] "    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|" 
 [6] "   MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |" 
 [7] "    3 | ADITYA BAJAJ                    |6.0  |L   8|W  61|W  25|W  21|W  11|W  13|W  12|" 
 [8] "   MI | 14959604 / R: 1384   ->1640     |N:2  |W    |B    |W    |B    |W    |B    |W    |" 
 [9] "    4 | PATRICK H SCHILLING             |5.5  |W  23|D  28|W   2|W  26|D   5|W  19|D   1|" 
[10] "   MI | 12616049 / R: 1716   ->1744     |N:2  |W    |B    |W    |B    |W    |B    |B    |"
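
To make the filtering step concrete, here is a minimal sketch on four lines copied from the output above. It uses only base R. The names `sample_lines`, `dash_index`, and `is_dash` are introduced here for illustration and are not part of the project code. The sketch contrasts grep, which returns the positions of matching lines, with grepl, which returns one TRUE/FALSE value per line.

```r
# Four lines copied from the head() output above (illustrative sample only)
sample_lines <- c(
  "-----------------------------------------------------------------------------------------",
  "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|",
  "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |",
  "-----------------------------------------------------------------------------------------"
)

# grep() returns the positions of the lines made up entirely of dashes
dash_index <- grep("^-+$", sample_lines)

# grepl() returns one logical value per line, which is easy to negate
is_dash <- grepl("^-+$", sample_lines)

# Both forms keep the same two player lines
kept_by_index <- sample_lines[-dash_index]
kept_by_mask  <- sample_lines[!is_dash]
print(identical(kept_by_index, kept_by_mask))
```

The logical mask is the safer of the two forms: if no line matched, `sample_lines[-dash_index]` would return an empty vector, while `sample_lines[!is_dash]` would return every line unchanged.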

Extract fields of interest

Here, we extract the fields needed to complete our goal: the player’s name, state, total tournament points, and pre-tournament rating. To extract them, we use regular expressions, which match patterns in text and pull the matching pieces out of a string. After removing the dashed lines, the cleaned tournament data consists of a header (the first 2 rows) followed by a pair of consecutive rows for each player. We will combine the information from these 2 rows into a single record per player.

Here I used the help of “Claude Sonnet 4.6” for the code.

# Remove the headers of the data

# The first two lines of "clean_data_Chess_tournament" contain the column headers, so we can skip them and keep only the player data, which simplifies the field extraction.

Players_DataInfo <- clean_data_Chess_tournament[-(1:2)]

head(Players_DataInfo,10)

 [1] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|"
 [2] "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |"
 [3] "    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|"
 [4] "   MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |"
 [5] "    3 | ADITYA BAJAJ                    |6.0  |L   8|W  61|W  25|W  21|W  11|W  13|W  12|"
 [6] "   MI | 14959604 / R: 1384   ->1640     |N:2  |W    |B    |W    |B    |W    |B    |W    |"
 [7] "    4 | PATRICK H SCHILLING             |5.5  |W  23|D  28|W   2|W  26|D   5|W  19|D   1|"
 [8] "   MI | 12616049 / R: 1716   ->1744     |N:2  |W    |B    |W    |B    |W    |B    |B    |"
 [9] "    5 | HANSHI ZUO                      |5.5  |W  45|W  37|D  12|D  13|D   4|W  14|W  17|"
[10] "   MI | 14601533 / R: 1655   ->1690     |N:2  |B    |W    |B    |W    |B    |W    |B    |"

# 1) Let's split the row pairs for each player

row1 <- Players_DataInfo[seq(1, length(Players_DataInfo), by = 2)]
row2 <- Players_DataInfo[seq(2, length(Players_DataInfo), by = 2)]

# 2) Let's extract each field of interest:

# Player's Name: It is located between the first and second pipe (|).

Name <- str_trim(str_extract(row1, "(?<=\\|)[^|]+(?=\\|)"))

# Player's State: It is represented by the two-letter code at the beginning of the second row.

State <- str_extract(row2, "([A-Z]{2})")

# Player's Total Points: It is the first decimal number after the second pipe in the first row.

Total_Points <- as.numeric(str_extract(row1, "(?<=\\|)\\s*\\d+\\.\\d+"))


# Player's Pre-Rating: It is located after "R: " in the second row. 

Pre_rating <- as.integer(str_extract(row2, "(?<=R:\\s{0,4})\\d+"))

# Let's combine all our fields of interest into a data frame for a clearer view


PlayerDataInfo_df <- data.frame(
  Name       = Name,
  State      = State,
  Points     = Total_Points,
  Pre_Rating = Pre_rating,
  stringsAsFactors = FALSE
)

head(PlayerDataInfo_df,10)

                  Name State Points Pre_Rating
1             GARY HUA    ON    6.0       1794
2      DAKSHESH DARURI    MI    6.0       1553
3         ADITYA BAJAJ    MI    6.0       1384
4  PATRICK H SCHILLING    MI    5.5       1716
5           HANSHI ZUO    MI    5.5       1655
6          HANSEN SONG    OH    5.0       1686
7    GARY DEE SWATHELL    MI    5.0       1649
8     EZEKIEL HOUGHTON    MI    5.0       1641
9          STEFANO LEE    ON    5.0       1411
10           ANVIT RAO    MI    5.0       1365
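
As a cross-check on the regular expressions above, the same four fields can also be read by splitting each line on the pipe character. The following is a minimal sketch using only base R on the two lines for the first player. It is an alternative technique, not the method used in this project, and the names `line1`, `line2`, `cells1`, and `cells2` are introduced here for illustration.

```r
# Two lines for one player, copied from the output above (illustrative sample only)
line1 <- "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|"
line2 <- "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |"

# Split each line on the literal pipe and trim the padding around every cell
cells1 <- trimws(strsplit(line1, "|", fixed = TRUE)[[1]])
cells2 <- trimws(strsplit(line2, "|", fixed = TRUE)[[1]])

# Row 1: cell 2 holds the name, cell 3 the total points
name   <- cells1[2]
points <- as.numeric(cells1[3])

# Row 2: cell 1 holds the state; the pre-rating is the run of digits after "R:" in cell 2
state      <- cells2[1]
pre_rating <- as.integer(sub(".*R:\\s*(\\d+).*", "\\1", cells2[2]))

print(data.frame(Name = name, State = state, Points = points, Pre_Rating = pre_rating))
```

If both approaches return the same values for every player, that is reasonable evidence that the patterns are picking up the intended columns.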

We now have a table containing the Name, State, Points, and Pre-Rating of all 64 players. To complete the data frame, we still need the average pre-tournament rating of each player’s opponents. To calculate it, we treat the data like a database in which the Pair Number acts as the primary key, because the raw text file is structured like two separate but related tables that have been flattened into one.

Compute the Average Opponent Pre-Rating for each player

Since opponents are listed by “Pair Number” rather than by name, we first map every Pair Number to its corresponding Pre-Rating (the lookup table). Then, for each player, we find the Pair Numbers of their opponents, look up those ratings, and calculate the average opponent pre-rating.

Here I used the help of “Gemini 3” for the code:

## 1. Let's create the Lookup Table 
  
# The 'Pair Number' corresponds to the row index, 1 to 64.

lookup_table <- data.frame(
  PairNum = seq_along(Pre_rating),
  Rating = Pre_rating
)

## 2. Let's extract Opponent Pair Numbers using regular expressions

# Opponents are in Row 1, following the "W", "L", or "D" indicators.

opponents_list <- str_extract_all(row1, "(?<=[WLD]\\s{1,5})\\d+")

## 3. Match Opponents to Ratings and Calculate the average Opponent pre-rating

Avg_Opp_Rating <- sapply(opponents_list, function(opp_ids) {
  # Convert extracted strings to integers
  ids <- as.integer(opp_ids)
  
  # Drop any NAs as a safeguard (the regex above extracts digits only, so this is normally a no-op)
  ids <- ids[!is.na(ids)]
  
  # Look up ratings for these IDs from our lookup_table
  opp_ratings <- lookup_table$Rating[match(ids, lookup_table$PairNum)]
  
  ## 4. Calculate the average Opponent pre-rating (rounding to the nearest whole number)
  return(round(mean(opp_ratings, na.rm = TRUE)))
})


## 5. Final Data Frame

Final_tournament_PlayersInfo_df <- data.frame(
  PlayerName = Name,
  State = State,
  TotalPoints = Total_Points,
  PreRating = Pre_rating,
  AvgOpponentRating = Avg_Opp_Rating,
  stringsAsFactors = FALSE
)

head(Final_tournament_PlayersInfo_df,10)

            PlayerName State TotalPoints PreRating AvgOpponentRating
1             GARY HUA    ON         6.0      1794              1605
2      DAKSHESH DARURI    MI         6.0      1553              1469
3         ADITYA BAJAJ    MI         6.0      1384              1564
4  PATRICK H SCHILLING    MI         5.5      1716              1574
5           HANSHI ZUO    MI         5.5      1655              1501
6          HANSEN SONG    OH         5.0      1686              1519
7    GARY DEE SWATHELL    MI         5.0      1649              1372
8     EZEKIEL HOUGHTON    MI         5.0      1641              1468
9          STEFANO LEE    ON         5.0      1411              1523
10           ANVIT RAO    MI         5.0      1365              1554
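
The primary-key idea described earlier can also be written as an explicit join. The following is a minimal sketch in base R on a hypothetical four-player tournament. The pair numbers, ratings, and pairings in `players` and `games` are invented for illustration and do not come from the tournament file. Players 3 and 4 have only two games each, to mimic a round without an opponent.

```r
# Hypothetical toy tournament: pair numbers and pre-ratings are invented for illustration
players <- data.frame(PairNum = 1:4, PreRating = c(1800L, 1600L, 1500L, 1400L))

# One row per game: which player (PairNum) faced which opponent (OppNum)
games <- data.frame(
  PairNum = c(1, 1, 1, 2, 2, 2, 3, 3, 4, 4),
  OppNum  = c(2, 3, 4, 1, 3, 4, 1, 2, 1, 2)
)

# Join each game to the opponent's rating, using the pair number as the key
joined <- merge(games, players, by.x = "OppNum", by.y = "PairNum")

# Average the opponents' ratings per player and round to the nearest whole number
avg_opp <- aggregate(PreRating ~ PairNum, data = joined,
                     FUN = function(r) round(mean(r)))
names(avg_opp)[2] <- "AvgOpponentRating"
print(avg_opp)
```

In this toy data, player 2 faces opponents rated 1800, 1500, and 1400, so the mean is 4700 / 3 ≈ 1566.67, which rounds to 1567. The join gives the same result as a lookup with match, and it keeps the player-to-opponent relationship visible as a table.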

Export the final results into a clean CSV file

write.csv(Final_tournament_PlayersInfo_df, "Final_tournament_PlayersInfo.csv", row.names = FALSE)
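
A quick way to confirm that the export worked is to read the file back and compare it with the data frame that was written. The sketch below does this in base R with a three-row stand-in built from the first rows shown above. The names `sample_df`, `csv_path`, and `round_trip` are introduced here for illustration, and the file goes to a temporary location so that the project directory is not touched.

```r
# Small stand-in data frame using the first three rows shown above
sample_df <- data.frame(
  PlayerName = c("GARY HUA", "DAKSHESH DARURI", "ADITYA BAJAJ"),
  State = c("ON", "MI", "MI"),
  TotalPoints = c(6.0, 6.0, 6.0),
  PreRating = c(1794L, 1553L, 1384L),
  AvgOpponentRating = c(1605, 1469, 1564),
  stringsAsFactors = FALSE
)

# Write to a temporary file so the sketch does not touch the project directory
csv_path <- tempfile(fileext = ".csv")
write.csv(sample_df, csv_path, row.names = FALSE)

# Read the file back and compare column names, row count, and values
round_trip <- read.csv(csv_path, stringsAsFactors = FALSE)
print(identical(names(round_trip), names(sample_df)))
print(nrow(round_trip) == nrow(sample_df))
print(all(round_trip$PlayerName == sample_df$PlayerName))
print(all(round_trip$PreRating == sample_df$PreRating))
```

For the real export, the same comparison would be run against `Final_tournament_PlayersInfo_df` and the file name used above.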

Create a scatter plot of Pre-Rating vs. Average Opponent Rating

Here I used the help of “Gemini 3” for the code

library(ggplot2)


# Create the scatter plot

ggplot(Final_tournament_PlayersInfo_df, aes(x = PreRating, y = AvgOpponentRating)) +
  geom_point(color = "blue", size = 2, alpha = 0.7) +
  geom_smooth(method = "lm", color = "darkorange", se = FALSE) + # Add a trend line
  labs(
    title = "Player Pre-Rating vs. Average Opponent Rating",
    subtitle = "Analysis of Tournament Pairings",
    x = "Player Pre-Rating",
    y = "Average Opponent Rating"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    axis.title = element_text(size = 11)
  )

`geom_smooth()` using formula = 'y ~ x'

The trend line on the scatter plot has a positive slope, which indicates that, overall, the Average Opponent Rating increases as a player’s Pre-Rating increases. This relationship suggests that the pairings tended to avoid “mismatches” and gave players in each skill tier competitive games, although a trend line alone does not prove it.
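
The direction and strength of the trend can be put into numbers with a correlation coefficient and the slope of the fitted line. The sketch below uses base R and only the ten rows of the final table shown earlier, so its result describes that small sample and not the whole tournament. The vector names `pre_rating` and `avg_opp` are introduced here for illustration. For the full analysis, the PreRating and AvgOpponentRating columns of the final data frame would be used instead.

```r
# First ten rows of the final table shown earlier (a small, non-representative sample)
pre_rating <- c(1794, 1553, 1384, 1716, 1655, 1686, 1649, 1641, 1411, 1365)
avg_opp    <- c(1605, 1469, 1564, 1574, 1501, 1519, 1372, 1468, 1523, 1554)

# Pearson correlation: the sign gives the direction, the magnitude the strength (-1 to 1)
r <- cor(pre_rating, avg_opp)

# Slope of the least-squares line, the same kind of fit geom_smooth(method = "lm") draws
fit   <- lm(avg_opp ~ pre_rating)
slope <- unname(coef(fit)["pre_rating"])

print(c(correlation = r, slope = slope))
```

A correlation close to 0 would mean the trend line explains little of the spread, even if its slope is positive.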

CONCLUSION

The analysis of the tournament data suggests a positive relationship between a player’s pre-tournament rating and the difficulty of their schedule, which points to sound pairing. We can therefore conclude that players’ final scores were, for the most part, earned against appropriately rated opposition.