The objective of this Project One exercise is to extract and analyze structured information from a text file containing chess tournament results. The provided dataset is not in a tidy format; instead, it is presented as a formatted cross table containing fields such as player identifiers, ratings, and round-by-round results.
The primary goal of this assignment is therefore to parse the raw text file and transform it into a structured dataframe for analysis within R. Beyond extraction, the project also requires calculating, for each player, the average pre-tournament rating of their opponents, which adds an analytical component to the task.
All of the work will be conducted in R, using packages such as stringr, dplyr, and potentially tidyr to assist with the data transformation process.
Data Structure
From an initial inspection of the text file, each player record appears to contain:
The player name
The player state
Their total points
USCF ID
Their pre-tournament rating
Their post-tournament rating
Their round-by-round opponents
However, these values are laid out in text rows separated by divider lines, so precise string parsing is required. Moreover, opponent ratings are not stored in a player’s own row; only the opponents’ pair numbers are listed for each round. Computing the average opponent pre-rating therefore requires looking up each opponent’s extracted pre-rating by pair number.
Proposed Plan
The analytical approach will likely follow the steps outlined below:
Import the text file using readLines().
Identify and subsequently isolate the player record blocks
Use regular expressions to extract:
Player name
Player state
Total Points
Pre-Tournament rating
Opponent pair numbers for each round
Construct an initial dataframe containing one row per player, inclusive of their pre-rating.
Create a look-up structure that maps pair number to pre-rating.
For each player, retrieve the list of opponent pair numbers, match them against the corresponding pre-ratings, and compute the average of those pre-ratings, excluding games not played (byes and forfeits) from the calculation.
Append the computed averages as a new variable in the final dataframe.
Finally, the completed dataset will be exported to a .csv file using the write.csv() function, as required by the project specifications.
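The look-up and averaging steps in this plan can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the code used later in the report: the ratings and pairings are made up, the names `pre_ratings`, `opponents`, and `avg_opp` are hypothetical, and unplayed rounds are represented as `NA` so that they drop out of the mean.

```r
# Made-up pre-ratings; position i holds the rating of pair number i
pre_ratings <- c(1500, 1400, 1300, 1200)

# Made-up opponent pair numbers per player; NA marks a bye or forfeit
opponents <- list(
  c(2, 3, 4),
  c(1, 4, 3),
  c(NA, 1, 2),
  c(NA, 2, 1)
)

# Index the rating vector by opponent pair number, then average.
# Indexing with NA yields NA, which na.rm = TRUE removes from the mean.
avg_opp <- sapply(opponents, function(opp) {
  round(mean(pre_ratings[opp], na.rm = TRUE))
})

avg_opp
```

Indexing a vector by pair number only works when pair numbers run from 1 to n without gaps; a join on an explicit pair-number column avoids that assumption.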
Anticipated Challenges
One expected challenge is correctly parsing the inconsistent spacing within the text file. Because the dataset is neither comma- nor tab-separated, the extraction process will depend heavily on identifying recurring patterns.
Another challenge is ensuring that opponent references are matched to the correct player records. Because opponent ratings must be derived indirectly via pair numbers, careful indexing and validation will be required to prevent mismatches.
Furthermore, some entries contain provisional ratings (for example, “P” designations), which may require cleaning prior to numeric conversion. Games marked as byes or unplayed rounds must also be identified and excluded from the opponent-average calculations.
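One way to handle provisional ratings is to capture only the digits that directly follow the `R:` label, so that any trailing “P” suffix is never part of the extracted value. The sketch below assumes the rating text follows an `R: pre ->post` layout; the sample strings are made up and `rating_text` is a hypothetical name.

```r
library(stringr)

# Made-up rating fragments; the second carries provisional "P" suffixes
rating_text <- c("R: 1500   ->1520", "R: 1400P12->1410P15")

# str_match() returns a matrix: column 1 is the full match,
# column 2 is the first capture group (the digits after "R:")
pre_rating <- as.numeric(str_match(rating_text, "R:\\s*(\\d+)")[, 2])

pre_rating
```

Because `\\d+` stops at the first non-digit character, no separate step is needed to strip the suffix before calling `as.numeric()`.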
Code Base/Body
The first step, as usual, is to load the required libraries.
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
library(stringr)
Warning: package 'stringr' was built under R version 4.5.2
library(tidyr)
Warning: package 'tidyr' was built under R version 4.5.2
Next, we import the contents of the text file, which is hosted in my GitHub repository so that the code is reproducible.
Warning in readLines(url): incomplete final line found on
'https://raw.githubusercontent.com/bkchanderban/CUNY_SPS/refs/heads/main/DATA607/DATA607/Project%20One%20Assignment/tournamentinfo.txt'
The resulting raw_lines object is a character vector in which each element represents one line of the text file.
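A minimal sketch of this import step is shown below. To keep it runnable offline, it reads a temporary file containing a few made-up lines; in the report, the raw GitHub URL of tournamentinfo.txt would be passed to readLines() instead. The name `sample_path` is hypothetical.

```r
# Write three made-up lines to a temporary file, without a trailing newline
sample_path <- tempfile(fileext = ".txt")
cat("-----", "    1 | ALPHA ONE | 2.5 |", "-----", sep = "\n", file = sample_path)

# warn = FALSE suppresses the "incomplete final line" warning, which only
# signals that the file does not end with a newline character
raw_lines <- readLines(sample_path, warn = FALSE)

length(raw_lines)
```

The warning is harmless either way, since every line is still read.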
Removing The Divider Lines
Examining the head of the raw_lines character vector shows that some lines (for example, lines 4 and 10) are composed entirely of dashes. These lines are not required in the final players dataframe.
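The divider lines can be dropped with a pattern that matches lines made up solely of dashes. The following is a sketch under assumptions: the sample lines are made up to mimic the file layout, and the pattern tolerates trailing whitespace in case the real file contains any.

```r
library(stringr)

# Made-up lines mimicking the cross-table layout
raw_lines <- c(
  "-----------------------------",
  "    1 | ALPHA ONE       |2.5  |W   2|W   3|D   4|",
  "   NY | 10000001 / R: 1500   ->1520     |N:2  |W    |B    |W    |",
  "-----------------------------"
)

# Keep every line that is NOT composed entirely of dashes
clean_lines <- raw_lines[!str_detect(raw_lines, "^-+\\s*$")]

length(clean_lines)
```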
Within the clean_lines character vector, the record for each player is spread across two consecutive lines.
The first line contains the pair number, the player name, the total points, and the round results. The second line contains the player’s state, their USCF ID, and their pre- and post-tournament ratings.
We therefore first isolate only the lines that begin with a player’s pair number.
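The filtering can be sketched with anchored regular expressions. The sample lines are made up; `player_lines` is the name used later in this report, while `state_lines` is a hypothetical name for the second line of each record.

```r
library(stringr)

# Made-up header and record lines mimicking the file layout
clean_lines <- c(
  " Pair | Player Name     |Total|Round|Round|Round|",
  " Num  | USCF ID / Rtg (Pre->Post) | Pts |  1  |  2  |  3  |",
  "    1 | ALPHA ONE       |2.5  |W   2|W   3|D   4|",
  "   NY | 10000001 / R: 1500   ->1520     |N:2  |W    |B    |W    |",
  "    2 | BETA TWO        |1.5  |L   1|D   4|W   3|",
  "   NJ | 10000002 / R: 1400P12->1410P15  |N:3  |B    |W    |B    |"
)

# Lines whose first field is a number hold the pair number, name, and results
player_lines <- clean_lines[str_detect(clean_lines, "^\\s*\\d+\\s*\\|")]

# Lines whose first field is a two-letter uppercase code hold state and ratings
state_lines <- clean_lines[str_detect(clean_lines, "^\\s*[A-Z]{2}\\s*\\|")]

length(player_lines)
length(state_lines)
```

Neither pattern matches the two header lines, so they fall away without a separate removal step.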
With all of the players’ base components extracted from the tournament text file, we can create the initial players dataframe.
# A tibble: 6 × 5
pair_number player_name player_state total_points pre_rating
<dbl> <chr> <chr> <dbl> <dbl>
1 1 GARY HUA ON 6 1794
2 2 DAKSHESH DARURI MI 6 1553
3 3 ADITYA BAJAJ MI 6 1384
4 4 PATRICK H SCHILLING MI 5.5 1716
5 5 HANSHI ZUO MI 5.5 1655
6 6 HANSEN SONG OH 5 1686
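A table with these five columns could be assembled along the following lines. This is a sketch, not necessarily the code that produced the output above: the input lines are made up, and it assumes the fields of each record are separated by `|` characters.

```r
library(dplyr)
library(stringr)

# Made-up record lines: one player line and one state line per player
player_lines <- c(
  "    1 | ALPHA ONE       |2.5  |W   2|W   3|D   4|",
  "    2 | BETA TWO        |1.5  |L   1|D   4|W   3|"
)
state_lines <- c(
  "   NY | 10000001 / R: 1500   ->1520     |N:2  |W    |B    |W    |",
  "   NJ | 10000002 / R: 1400P12->1410P15  |N:3  |B    |W    |B    |"
)

# Split each player line on "|" into pair number, name, points, and the rest
fields <- str_split_fixed(player_lines, "\\|", 4)

players_df <- tibble(
  pair_number  = as.numeric(str_trim(fields[, 1])),
  player_name  = str_trim(fields[, 2]),
  player_state = str_trim(str_extract(state_lines, "^\\s*[A-Z]{2}")),
  total_points = as.numeric(str_trim(fields[, 3])),
  pre_rating   = as.numeric(str_match(state_lines, "R:\\s*(\\d+)")[, 2])
)

players_df
```

This relies on player_lines and state_lines being in the same order, which holds when both are filtered from the same vector.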
Extract Opponent Pair Numbers
With the initial player dataframe in hand, our next task is to calculate the average pre-tournament rating of each player’s opponents. The first step is to extract the opponent pair number for each game a given player actually played.
# Extract all the occurrences of a letter followed by spaces and digits from within the player_lines
opponent_list <- str_extract_all(player_lines, "[WLDHUX]\\s*\\d+")
# Extract just the numeric portion
opponent_list <- lapply(opponent_list, function(x) {
  as.numeric(str_extract(x, "\\d+"))
})
head(opponent_list)
Now that we have the opponent pair numbers for each player’s games, we convert the list into a long format, with one row per player-opponent pairing, to make the subsequent join and aggregation straightforward.
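One way to reshape the list is to place it in a list-column and unnest it with tidyr. The sketch below uses a made-up opponent list, assumes that element i of the list belongs to pair number i, and uses `opponent_pair` as a hypothetical column name.

```r
library(dplyr)
library(tidyr)

# Made-up opponent pair numbers for three players
opponent_list <- list(c(2, 3), c(1), c(1))

# One row per player first, then one row per (player, opponent) pairing
opponent_df <- tibble(
  pair_number   = seq_along(opponent_list),
  opponent_pair = opponent_list
) %>%
  unnest(cols = opponent_pair)

opponent_df
```

By default, unnest() drops players whose opponent vector is empty, which is harmless here because such players contribute nothing to an average.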
Now that we have a tidy dataframe containing the players’ pair numbers and their opponents’, the next step involves appending the opponents’ pre-rating figures to the dataframe. This will be accomplished by joining the opponent_df to the players_df.
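The join can be sketched as follows, matching each opponent’s pair number against the pair number in the players table. The data is made up, and `opponent_pair` is a hypothetical column name for the opponent’s pair number.

```r
library(dplyr)

# Made-up stand-ins for the two dataframes
players_df <- tibble(
  pair_number = c(1, 2, 3),
  pre_rating  = c(1500, 1400, 1300)
)
opponent_df <- tibble(
  pair_number   = c(1, 1, 2, 3),
  opponent_pair = c(2, 3, 1, 1)
)

# Rename pair_number to opponent_pair in the look-up table so that the
# rating attached to each row belongs to the opponent, not the player
opponent_df <- opponent_df %>%
  left_join(
    players_df %>% select(opponent_pair = pair_number, pre_rating),
    by = "opponent_pair"
  )

opponent_df
```

Selecting only the two needed columns before joining keeps the player’s own name, state, and points out of the long table.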
Now that opponent_df includes the opponents’ pre-ratings, we can calculate the average pre-tournament rating of each player’s opponents.
avg_opponent_rating <- opponent_df %>%
  group_by(pair_number) %>%
  summarise(avg_opponents_pre_rating = round(mean(pre_rating, na.rm = TRUE)))
# Note: the average is rounded to match the given example of GARY HUA,
# whose opponent pre-rating average was given as 1605 instead of 1605.286...
head(avg_opponent_rating)
Having obtained the target figure of this analysis (the average pre-tournament rating of each player’s opponents), we can attach these values to our final dataframe.
final_df <- players_df %>%
  left_join(avg_opponent_rating, by = "pair_number")
head(final_df)
# A tibble: 6 × 6
pair_number player_name player_state total_points pre_rating
<dbl> <chr> <chr> <dbl> <dbl>
1 1 GARY HUA ON 6 1794
2 2 DAKSHESH DARURI MI 6 1553
3 3 ADITYA BAJAJ MI 6 1384
4 4 PATRICK H SCHILLING MI 5.5 1716
5 5 HANSHI ZUO MI 5.5 1655
6 6 HANSEN SONG OH 5 1686
# ℹ 1 more variable: avg_opponents_pre_rating <dbl>
Sanity Check
Using the data from within the original text file, the opponent pre-ratings for two players were hand-calculated.
The first player, Anvit Rao (pair number 10), played in all seven rounds, and his opponents had an average pre-rating of 1,554 (1,554.1428571429).
final_df %>% filter(pair_number == 10)
# A tibble: 1 × 6
pair_number player_name player_state total_points pre_rating
<dbl> <chr> <chr> <dbl> <dbl>
1 10 ANVIT RAO MI 5 1365
# ℹ 1 more variable: avg_opponents_pre_rating <dbl>
The second player, Eugene L McClure (pair number 22), played in six of the seven rounds, and his opponents had an average pre-rating of 1,300 (1,300.333…).
final_df %>% filter(pair_number == 22)
# A tibble: 1 × 6
pair_number player_name player_state total_points pre_rating
<dbl> <chr> <chr> <dbl> <dbl>
1 22 EUGENE L MCCLURE MI 4 1555
# ℹ 1 more variable: avg_opponents_pre_rating <dbl>
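Because the printed tibbles truncate the avg_opponents_pre_rating column, the comparison with the hand-calculated values can be made explicit in code. The sketch below uses a hypothetical helper, `check_avg()`, and a made-up stand-in for final_df so that it runs on its own.

```r
library(dplyr)

# Stand-in for final_df with made-up values
final_df <- tibble(
  pair_number = c(1, 2),
  player_name = c("ALPHA ONE", "BETA TWO"),
  avg_opponents_pre_rating = c(1300, 1333)
)

# Hypothetical helper: TRUE when the computed average for one pair number
# equals the rounded hand-calculated value, FALSE otherwise
check_avg <- function(df, pair, expected) {
  computed <- df %>%
    filter(pair_number == pair) %>%
    pull(avg_opponents_pre_rating)
  isTRUE(all.equal(computed, round(expected)))
}

check_avg(final_df, 2, 1333.333)
```

On the report’s own final_df, the corresponding calls would be check_avg(final_df, 10, 1554.1428571429) and check_avg(final_df, 22, 1300.333), using the hand-calculated figures quoted above.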
Export To CSV
Our final_df dataframe now contains the required components: each player’s pair number, name, state, total points, own pre-rating, and the average pre-tournament rating of their opponents. The final requirement of the Project One assignment is to export this dataframe to a CSV file, which can then be imported into an SQL database for further analysis.
Completing this assignment showed that unrefined yet structured data can be systematically transformed into meaningful analytical results. In this project, the transformation involved precise parsing and relational mapping: string manipulation techniques were combined with the dplyr library’s grouping functionality to extract the required player information and compute the average pre-tournament rating of each player’s opponents.
The most technically demanding portion of the project was the parsing process itself. This step required close attention to detail, as well as multiple iterations, to achieve the desired functionality.
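A minimal sketch of the export step is shown below. The dataframe is a made-up stand-in, and the output file name is hypothetical; a temporary directory is used so the example has no lasting side effects.

```r
# Made-up stand-in for the final dataframe
final_df <- data.frame(
  pair_number = c(1, 2),
  player_name = c("ALPHA ONE", "BETA TWO"),
  avg_opponents_pre_rating = c(1300, 1333)
)

# Hypothetical output path inside the session's temporary directory
out_path <- file.path(tempdir(), "tournament_results.csv")

# row.names = FALSE omits the automatic row-number column, which would
# otherwise appear as an unnamed first column when loaded into a database
write.csv(final_df, out_path, row.names = FALSE)

# Read the file back to confirm the round trip
check_df <- read.csv(out_path)
names(check_df)
```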
LLM Used
OpenAI. (2026). ChatGPT (Version 4o) [Large language model]. https://chat.openai.com . Accessed February 21, 2026.