DATA 607 - Project 1 (Chess Tournament)

Author

Denise Atherley

Approach

I will take the provided structured .txt file and upload it into my github repository to ensure my work is reproducible. I will examine the file’s information using tidyverse syntax in RStudio and ultimately generate a .csv file with columns for:

“Player’s Name”

“Total # of Points”

“Player’s State”

“Player’s Pre-Rating”

“Average Pre Chess Rating of Opponents”

Given that the .txt file is formatted in a multi-line structure with pipe delimiters, I believe I will be most challenged by trying to manipulate the columns to create a clean .csv output.

In addition to that, the column for “Average Pre Chess Rating of Opponents” consists of summing all of the player’s opponents’ Pre-Rating score and dividing by the total number of games played. This includes any wins, losses or draws. I will hand-calculate at least two test cases against my coded results to verify my work.

LLM support and enhancement

I will use Google Gemini to assist me in generating the correct code to read the structured .txt file and create the needed columns for my .csv output. Because the original .txt file has some mixed formatting, I will likely have to perform the column extraction in a couple of steps. Once those columns are generated, I will then calculate the average pre chess opponent rating.

AI citation -

(Google DeepMind. (2026). Gemini Pro [Large language model].

https://gemini.google.com. Accessed February 19, 2026

Code Deliverable:

I ensure that I am able to use tidyverse syntax in my code by loading the necessary package.

library (tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

I examine the information by loading the data and then previewing the top 5 players.

# Read the .txt file with all lines  

raw_chess_data <- read_lines("https://raw.githubusercontent.com/DRA-SPS27/DATA-607---Project-1/refs/heads/main/tournamentinfo.txt")

# Preview first 5 players  

head(raw_chess_data, 19)

 [1] "-----------------------------------------------------------------------------------------" 
 [2] " Pair | Player Name                     |Total|Round|Round|Round|Round|Round|Round|Round| "
 [3] " Num  | USCF ID / Rtg (Pre->Post)       | Pts |  1  |  2  |  3  |  4  |  5  |  6  |  7  | "
 [4] "-----------------------------------------------------------------------------------------" 
 [5] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|" 
 [6] "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |" 
 [7] "-----------------------------------------------------------------------------------------" 
 [8] "    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|" 
 [9] "   MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |" 
[10] "-----------------------------------------------------------------------------------------" 
[11] "    3 | ADITYA BAJAJ                    |6.0  |L   8|W  61|W  25|W  21|W  11|W  13|W  12|" 
[12] "   MI | 14959604 / R: 1384   ->1640     |N:2  |W    |B    |W    |B    |W    |B    |W    |" 
[13] "-----------------------------------------------------------------------------------------" 
[14] "    4 | PATRICK H SCHILLING             |5.5  |W  23|D  28|W   2|W  26|D   5|W  19|D   1|" 
[15] "   MI | 12616049 / R: 1716   ->1744     |N:2  |W    |B    |W    |B    |W    |B    |B    |" 
[16] "-----------------------------------------------------------------------------------------" 
[17] "    5 | HANSHI ZUO                      |5.5  |W  45|W  37|D  12|D  13|D   4|W  14|W  17|" 
[18] "   MI | 14601533 / R: 1655   ->1690     |N:2  |B    |W    |B    |W    |B    |W    |B    |" 
[19] "-----------------------------------------------------------------------------------------"

Remove unneeded formatting

I clean up the formatting of the data to prepare my columns

# Remove the dashed separator lines  

clean_chess_data <- raw_chess_data[str_detect(raw_chess_data, "---", negate = TRUE)]

# Remove the header rows (the first two rows after separators are gone)  

headerless_data <- clean_chess_data[-(1:2)]

Separate multi-line rows

The data follows a pattern: Row A contains the name and points, while Row B contains the state and rating. I will use indexing to split them.

# All of the odd numbered rows are the "Name" rows  

row_group1 <- headerless_data[seq(1, length(headerless_data), 2)]  

# All of the even numbered rows are the "State/Rating" rows  

row_group2 <- headerless_data[seq(2, length(headerless_data), 2)]

Extract specific fields

I use read_delim on these subsets to turn the text into columns based on the pipe (|) delimiter.

library(readr)

# Extract Name and Total Points  

df1 <- read_delim(paste(row_group1, collapse = "\n"),                   
                  delim = "|", col_names = FALSE, trim_ws = TRUE) %>%   
  select(Player_Name = X2, Total_Points = X3)

Warning: The `file` argument of `vroom()` must use `I()` for literal data as of vroom
1.5.0.
  
  # Bad:
  vroom("X,Y\n1.5,2.3\n")
  
  # Good:
  vroom(I("X,Y\n1.5,2.3\n"))
ℹ The deprecated feature was likely used in the readr package.
  Please report the issue at <https://github.com/tidyverse/readr/issues>.

Rows: 64 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: "|"
chr (8): X2, X4, X5, X6, X7, X8, X9, X10
dbl (2): X1, X3
lgl (1): X11

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Extract State and Pre-Rating # I will use regex to pull just the numbers before the "->" in the rating column  

df2 <- read_delim(paste(row_group2, collapse = "\n"),                   
                  delim = "|", col_names = FALSE, trim_ws = TRUE) %>%   
  mutate(State = X1,           
         Pre_Rating = str_extract(X2, "(?<=R: )\\s*\\d+")) %>%   
  select (State, Pre_Rating)

Rows: 64 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: "|"
chr (10): X1, X2, X3, X4, X5, X6, X7, X8, X9, X10
lgl  (1): X11

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Combine needed columns

I bind the columns together and generate a new data frame with the created columns.

# Merge the two pieces side-by-side  

chess_tournament_data <- bind_cols(df1, df2)

Calculate the Average Pre Chess Rating of Opponents

To calculate the Average Opponent Rating, I need to treat the tournament data like a relational database where I will have each player’s unqiue ID along with their opponent’s ID for each round. This will allow me to look-up their pre rating score and bring in the average based on games played.

Extract Opponent IDs

I need to pull the tournament rounds from the row_group1 I created earlier. The opponent IDs are listed after the win/loss/draw indicator (e.g., “W 39” means they played player 39).

# Extract the rounds (Columns 4 - 10 in the pipe-delimited structure)  

opponents_df <- read_delim (paste(row_group1, collapse = "\n"),                            delim = "|", col_names = FALSE, trim_ws = TRUE) %>%   
  select (Player_ID = X1, X4:X10) %>%   
  pivot_longer(cols = X4:X10, names_to = "Round", values_to = "Opponent_Info") %>%   
  mutate(Opponent_ID = as.numeric(str_extract(Opponent_Info, "\\d+"))) #Pulls '39' from 'W 39'

Rows: 64 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: "|"
chr (8): X2, X4, X5, X6, X7, X8, X9, X10
dbl (2): X1, X3
lgl (1): X11

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Create a rating look-up table

I need a simple table where each Player ID is linked to their Pre Rating so I can look up how “strong” each opponent was.

# Use the df2 created previously to make a lookup map  

rating_lookup <- df1 %>%   
  mutate(Player_ID = row_number(),          
         Pre_Rating = as.numeric(df2$Pre_Rating)) %>%   
  select(Player_ID, Pre_Rating)

Map ratings and calculate average

I will now join the opponent IDs with their actual ratings and calculate the mean for each player

# Join the tables to find the rating for every opponent played  

avg_opp_rating <- opponents_df %>%   
  left_join(rating_lookup, by = c("Opponent_ID" = "Player_ID")) %>%   
  group_by(Player_ID) %>%   
  summarize(avg_opp_rating = round(mean(Pre_Rating, na.rm = TRUE), 0))

# Attach this to final data frame  

chess_tournament_data <- chess_tournament_data %>%   
  mutate(Player_ID = row_number()) %>%   
  left_join(avg_opp_rating, by = "Player_ID") %>%   
  select(-Player_ID)  #Remove the ID helper column before export

# Show preview of newly created data frame  

library(gt)  

gt(chess_tournament_data)

Player_Name	Total_Points	State	Pre_Rating	avg_opp_rating
GARY HUA	6.0	ON	1794	1605
DAKSHESH DARURI	6.0	MI	1553	1469
ADITYA BAJAJ	6.0	MI	1384	1564
PATRICK H SCHILLING	5.5	MI	1716	1574
HANSHI ZUO	5.5	MI	1655	1501
HANSEN SONG	5.0	OH	1686	1519
GARY DEE SWATHELL	5.0	MI	1649	1372
EZEKIEL HOUGHTON	5.0	MI	1641	1468
STEFANO LEE	5.0	ON	1411	1523
ANVIT RAO	5.0	MI	1365	1554
CAMERON WILLIAM MC LEMAN	4.5	MI	1712	1468
KENNETH J TACK	4.5	MI	1663	1506
TORRANCE HENRY JR	4.5	MI	1666	1498
BRADLEY SHAW	4.5	MI	1610	1515
ZACHARY JAMES HOUGHTON	4.5	MI	1220	1484
MIKE NIKITIN	4.0	MI	1604	1386
RONALD GRZEGORCZYK	4.0	MI	1629	1499
DAVID SUNDEEN	4.0	MI	1600	1480
DIPANKAR ROY	4.0	MI	1564	1426
JASON ZHENG	4.0	MI	1595	1411
DINH DANG BUI	4.0	ON	1563	1470
EUGENE L MCCLURE	4.0	MI	1555	1300
ALAN BUI	4.0	ON	1363	1214
MICHAEL R ALDRICH	4.0	MI	1229	1357
LOREN SCHWIEBERT	3.5	MI	1745	1363
MAX ZHU	3.5	ON	1579	1507
GAURAV GIDWANI	3.5	MI	1552	1222
SOFIA ADINA STANESCU-BELLU	3.5	MI	1507	1522
CHIEDOZIE OKORIE	3.5	MI	1602	1314
GEORGE AVERY JONES	3.5	ON	1522	1144
RISHI SHETTY	3.5	MI	1494	1260
JOSHUA PHILIP MATHEWS	3.5	ON	1441	1379
JADE GE	3.5	MI	1449	1277
MICHAEL JEFFERY THOMAS	3.5	MI	1399	1375
JOSHUA DAVID LEE	3.5	MI	1438	1150
SIDDHARTH JHA	3.5	MI	1355	1388
AMIYATOSH PWNANANDAM	3.5	MI	980	1385
BRIAN LIU	3.0	MI	1423	1539
JOEL R HENDON	3.0	MI	1436	1430
FOREST ZHANG	3.0	MI	1348	1391
KYLE WILLIAM MURPHY	3.0	MI	1403	1248
JARED GE	3.0	MI	1332	1150
ROBERT GLEN VASEY	3.0	MI	1283	1107
JUSTIN D SCHILLING	3.0	MI	1199	1327
DEREK YAN	3.0	MI	1242	1152
JACOB ALEXANDER LAVALLEY	3.0	MI	377	1358
ERIC WRIGHT	2.5	MI	1362	1392
DANIEL KHAIN	2.5	MI	1382	1356
MICHAEL J MARTIN	2.5	MI	1291	1286
SHIVAM JHA	2.5	MI	1056	1296
TEJAS AYYAGARI	2.5	MI	1011	1356
ETHAN GUO	2.5	MI	935	1495
JOSE C YBARRA	2.0	MI	1393	1345
LARRY HODGE	2.0	MI	1270	1206
ALEX KONG	2.0	MI	1186	1406
MARISA RICCI	2.0	MI	1153	1414
MICHAEL LU	2.0	MI	1092	1363
VIRAJ MOHILE	2.0	MI	917	1391
SEAN M MC CORMICK	2.0	MI	853	1319
JULIA SHEN	1.5	MI	967	1330
JEZZEL FARKAS	1.5	ON	955	1327
ASHWIN BALAJI	1.0	MI	1530	1186
THOMAS JOSEPH HOSMER	1.0	MI	1175	1350
BEN LI	1.0	MI	1163	1263

Manual review of average

I will hand-calculate at least two test cases against my coded results to verify my work.

Test Case #1 -

Player 3, Aditya Bajaj, played all games. Their opponents’ ratings are:

1641, 955, 1745, 1563, 1712, 1666, 1663

This is a total of 10,945 divided by 7 is 1564

Test Case #2 -

Player 16, Mike Nikitin, played only 5 games. Their opponents’ ratings are:

1365, 1220, 1436, 1553, 1355

This is a total of 6,929 divided by 5 is 1386

Both tests verified that my code was correct!

Export .csv file

# Write to csv  

write_csv(chess_tournament_data, "chess_tournament_results.csv")