DATA 607 - Project 1 (Chess Tournament)

Author

Denise Atherley

Approach

I will take the provided structured .txt file and upload it into my github repository to ensure my work is reproducible. I will examine the file’s information using tidyverse syntax in RStudio and ultimately generate a .csv file with columns for:

“Player’s Name”

“Total # of Points”

“Player’s State”

“Player’s Pre-Rating”

“Average Pre Chess Rating of Opponents”

Given that the .txt file is formatted in a multi-line structure with pipe delimiters, I believe I will be most challenged by trying to manipulate the columns to create a clean .csv output.

In addition to that, the column for “Average Pre Chess Rating of Opponents” consists of summing all of the player’s opponents’ Pre-Rating score and dividing by the total number of games played. This includes any wins, losses or draws. I will hand-calculate at least two test cases against my coded results to verify my work.

LLM support and enhancement

I will use Google Gemini to assist me in generating the correct code to read the structured .txt file and create the needed columns for my .csv output. Because the original .txt file has some mixed formatting, I will likely have to perform the column extraction in a couple of steps. Once those columns are generated, I will then calculate the average pre chess opponent rating.

AI citation -

(Google DeepMind. (2026). Gemini Pro [Large language model].

https://gemini.google.com. Accessed February 19, 2026

Code Deliverable:

I ensure that I am able to use tidyverse syntax in my code by loading the necessary package.

library (tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

I examine the information by loading the data and then previewing the top 5 players.

# Read the .txt file with all lines  

raw_chess_data <- read_lines("https://raw.githubusercontent.com/DRA-SPS27/DATA-607---Project-1/refs/heads/main/tournamentinfo.txt")
# Preview first 5 players  

head(raw_chess_data, 19)
 [1] "-----------------------------------------------------------------------------------------" 
 [2] " Pair | Player Name                     |Total|Round|Round|Round|Round|Round|Round|Round| "
 [3] " Num  | USCF ID / Rtg (Pre->Post)       | Pts |  1  |  2  |  3  |  4  |  5  |  6  |  7  | "
 [4] "-----------------------------------------------------------------------------------------" 
 [5] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|" 
 [6] "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |" 
 [7] "-----------------------------------------------------------------------------------------" 
 [8] "    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|" 
 [9] "   MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |" 
[10] "-----------------------------------------------------------------------------------------" 
[11] "    3 | ADITYA BAJAJ                    |6.0  |L   8|W  61|W  25|W  21|W  11|W  13|W  12|" 
[12] "   MI | 14959604 / R: 1384   ->1640     |N:2  |W    |B    |W    |B    |W    |B    |W    |" 
[13] "-----------------------------------------------------------------------------------------" 
[14] "    4 | PATRICK H SCHILLING             |5.5  |W  23|D  28|W   2|W  26|D   5|W  19|D   1|" 
[15] "   MI | 12616049 / R: 1716   ->1744     |N:2  |W    |B    |W    |B    |W    |B    |B    |" 
[16] "-----------------------------------------------------------------------------------------" 
[17] "    5 | HANSHI ZUO                      |5.5  |W  45|W  37|D  12|D  13|D   4|W  14|W  17|" 
[18] "   MI | 14601533 / R: 1655   ->1690     |N:2  |B    |W    |B    |W    |B    |W    |B    |" 
[19] "-----------------------------------------------------------------------------------------" 

Remove unneeded formatting

I clean up the formatting of the data to prepare my columns

# Remove the dashed separator lines  

clean_chess_data <- raw_chess_data[str_detect(raw_chess_data, "---", negate = TRUE)] 
# Remove the header rows (the first two rows after separators are gone)  

headerless_data <- clean_chess_data[-(1:2)]

Separate multi-line rows

The data follows a pattern: Row A contains the name and points, while Row B contains the state and rating. I will use indexing to split them.

# All of the odd numbered rows are the "Name" rows  

row_group1 <- headerless_data[seq(1, length(headerless_data), 2)]  

# All of the even numbered rows are the "State/Rating" rows  

row_group2 <- headerless_data[seq(2, length(headerless_data), 2)]

Extract specific fields

I use read_delim on these subsets to turn the text into columns based on the pipe (|) delimiter.

library(readr)
# Extract Name and Total Points  

df1 <- read_delim(paste(row_group1, collapse = "\n"),                   
                  delim = "|", col_names = FALSE, trim_ws = TRUE) %>%   
  select(Player_Name = X2, Total_Points = X3)
Warning: The `file` argument of `vroom()` must use `I()` for literal data as of vroom
1.5.0.
  
  # Bad:
  vroom("X,Y\n1.5,2.3\n")
  
  # Good:
  vroom(I("X,Y\n1.5,2.3\n"))
ℹ The deprecated feature was likely used in the readr package.
  Please report the issue at <https://github.com/tidyverse/readr/issues>.
Rows: 64 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: "|"
chr (8): X2, X4, X5, X6, X7, X8, X9, X10
dbl (2): X1, X3
lgl (1): X11

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Extract State and Pre-Rating # I will use regex to pull just the numbers before the "->" in the rating column  

df2 <- read_delim(paste(row_group2, collapse = "\n"),                   
                  delim = "|", col_names = FALSE, trim_ws = TRUE) %>%   
  mutate(State = X1,           
         Pre_Rating = str_extract(X2, "(?<=R: )\\s*\\d+")) %>%   
  select (State, Pre_Rating)
Rows: 64 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: "|"
chr (10): X1, X2, X3, X4, X5, X6, X7, X8, X9, X10
lgl  (1): X11

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Combine needed columns

I bind the columns together and generate a new data frame with the created columns.

# Merge the two pieces side-by-side  

chess_tournament_data <- bind_cols(df1, df2)

Calculate the Average Pre Chess Rating of Opponents

To calculate the Average Opponent Rating, I need to treat the tournament data like a relational database where I will have each player’s unqiue ID along with their opponent’s ID for each round. This will allow me to look-up their pre rating score and bring in the average based on games played.

Extract Opponent IDs

I need to pull the tournament rounds from the row_group1 I created earlier. The opponent IDs are listed after the win/loss/draw indicator (e.g., “W 39” means they played player 39).

# Extract the rounds (Columns 4 - 10 in the pipe-delimited structure)  

opponents_df <- read_delim (paste(row_group1, collapse = "\n"),                            delim = "|", col_names = FALSE, trim_ws = TRUE) %>%   
  select (Player_ID = X1, X4:X10) %>%   
  pivot_longer(cols = X4:X10, names_to = "Round", values_to = "Opponent_Info") %>%   
  mutate(Opponent_ID = as.numeric(str_extract(Opponent_Info, "\\d+"))) #Pulls '39' from 'W 39'
Rows: 64 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: "|"
chr (8): X2, X4, X5, X6, X7, X8, X9, X10
dbl (2): X1, X3
lgl (1): X11

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Create a rating look-up table

I need a simple table where each Player ID is linked to their Pre Rating so I can look up how “strong” each opponent was.

# Use the df2 created previously to make a lookup map  

rating_lookup <- df1 %>%   
  mutate(Player_ID = row_number(),          
         Pre_Rating = as.numeric(df2$Pre_Rating)) %>%   
  select(Player_ID, Pre_Rating)

Map ratings and calculate average

I will now join the opponent IDs with their actual ratings and calculate the mean for each player

# Join the tables to find the rating for every opponent played  

avg_opp_rating <- opponents_df %>%   
  left_join(rating_lookup, by = c("Opponent_ID" = "Player_ID")) %>%   
  group_by(Player_ID) %>%   
  summarize(avg_opp_rating = round(mean(Pre_Rating, na.rm = TRUE), 0))
# Attach this to final data frame  

chess_tournament_data <- chess_tournament_data %>%   
  mutate(Player_ID = row_number()) %>%   
  left_join(avg_opp_rating, by = "Player_ID") %>%   
  select(-Player_ID)  #Remove the ID helper column before export
# Show preview of newly created data frame  

library(gt)  

gt(chess_tournament_data)
Player_Name Total_Points State Pre_Rating avg_opp_rating
GARY HUA 6.0 ON 1794 1605
DAKSHESH DARURI 6.0 MI 1553 1469
ADITYA BAJAJ 6.0 MI 1384 1564
PATRICK H SCHILLING 5.5 MI 1716 1574
HANSHI ZUO 5.5 MI 1655 1501
HANSEN SONG 5.0 OH 1686 1519
GARY DEE SWATHELL 5.0 MI 1649 1372
EZEKIEL HOUGHTON 5.0 MI 1641 1468
STEFANO LEE 5.0 ON 1411 1523
ANVIT RAO 5.0 MI 1365 1554
CAMERON WILLIAM MC LEMAN 4.5 MI 1712 1468
KENNETH J TACK 4.5 MI 1663 1506
TORRANCE HENRY JR 4.5 MI 1666 1498
BRADLEY SHAW 4.5 MI 1610 1515
ZACHARY JAMES HOUGHTON 4.5 MI 1220 1484
MIKE NIKITIN 4.0 MI 1604 1386
RONALD GRZEGORCZYK 4.0 MI 1629 1499
DAVID SUNDEEN 4.0 MI 1600 1480
DIPANKAR ROY 4.0 MI 1564 1426
JASON ZHENG 4.0 MI 1595 1411
DINH DANG BUI 4.0 ON 1563 1470
EUGENE L MCCLURE 4.0 MI 1555 1300
ALAN BUI 4.0 ON 1363 1214
MICHAEL R ALDRICH 4.0 MI 1229 1357
LOREN SCHWIEBERT 3.5 MI 1745 1363
MAX ZHU 3.5 ON 1579 1507
GAURAV GIDWANI 3.5 MI 1552 1222
SOFIA ADINA STANESCU-BELLU 3.5 MI 1507 1522
CHIEDOZIE OKORIE 3.5 MI 1602 1314
GEORGE AVERY JONES 3.5 ON 1522 1144
RISHI SHETTY 3.5 MI 1494 1260
JOSHUA PHILIP MATHEWS 3.5 ON 1441 1379
JADE GE 3.5 MI 1449 1277
MICHAEL JEFFERY THOMAS 3.5 MI 1399 1375
JOSHUA DAVID LEE 3.5 MI 1438 1150
SIDDHARTH JHA 3.5 MI 1355 1388
AMIYATOSH PWNANANDAM 3.5 MI 980 1385
BRIAN LIU 3.0 MI 1423 1539
JOEL R HENDON 3.0 MI 1436 1430
FOREST ZHANG 3.0 MI 1348 1391
KYLE WILLIAM MURPHY 3.0 MI 1403 1248
JARED GE 3.0 MI 1332 1150
ROBERT GLEN VASEY 3.0 MI 1283 1107
JUSTIN D SCHILLING 3.0 MI 1199 1327
DEREK YAN 3.0 MI 1242 1152
JACOB ALEXANDER LAVALLEY 3.0 MI 377 1358
ERIC WRIGHT 2.5 MI 1362 1392
DANIEL KHAIN 2.5 MI 1382 1356
MICHAEL J MARTIN 2.5 MI 1291 1286
SHIVAM JHA 2.5 MI 1056 1296
TEJAS AYYAGARI 2.5 MI 1011 1356
ETHAN GUO 2.5 MI 935 1495
JOSE C YBARRA 2.0 MI 1393 1345
LARRY HODGE 2.0 MI 1270 1206
ALEX KONG 2.0 MI 1186 1406
MARISA RICCI 2.0 MI 1153 1414
MICHAEL LU 2.0 MI 1092 1363
VIRAJ MOHILE 2.0 MI 917 1391
SEAN M MC CORMICK 2.0 MI 853 1319
JULIA SHEN 1.5 MI 967 1330
JEZZEL FARKAS 1.5 ON 955 1327
ASHWIN BALAJI 1.0 MI 1530 1186
THOMAS JOSEPH HOSMER 1.0 MI 1175 1350
BEN LI 1.0 MI 1163 1263

Manual review of average

I will hand-calculate at least two test cases against my coded results to verify my work.

Test Case #1 -

Player 3, Aditya Bajaj, played all games. Their opponents’ ratings are:

1641, 955, 1745, 1563, 1712, 1666, 1663

This is a total of 10,945 divided by 7 is 1564

Test Case #2 -

Player 16, Mike Nikitin, played only 5 games. Their opponents’ ratings are:

1365, 1220, 1436, 1553, 1355

This is a total of 6,929 divided by 5 is 1386

Both tests verified that my code was correct!

Export .csv file

# Write to csv  

write_csv(chess_tournament_data, "chess_tournament_results.csv")