project_one_assignment

Author

Brandon Chanderban

Introduction/Approach

The objective of this Project One exercise is to extract and analyze structured information from a text file containing chess tournament results. The provided dataset is not in a tidy format; instead, it is presented as a formatted cross table containing fields such as player identifiers, ratings, and round-by-round results.

The primary goal of this assignment is therefore to parse the raw text file and transform it into a structured dataframe for analysis within R. Beyond extraction, the project also requires calculating, for each player, the average pre-tournament rating of their opponents, which adds an analytical component to the task.

All of the work will be conducted in R, using packages such as stringr, dplyr, and potentially tidyr to assist with the data transformation process.

Data Structure

From an initial inspection of the text file, each player instance appears to possess:

  • The player name

  • The player state

  • Their total points

  • USCF ID

  • Their pre-tournament rating

  • Their post-tournament rating

  • Their round-by-round opponents

However, these values are formatted within text rows separated by divider lines, meaning that precise string parsing will be called for. Moreover, the opponent ratings are not directly housed within a player’s row. Instead, only the opponent pair numbers are listed for each round. As such, computing the average opponent pre-rating will require looking up the extracted pre-ratings of the other players.

Proposed Plan

The analytical approach will likely follow the steps outlined below:

  • Import the text file using readLines().

  • Identify and subsequently isolate the player record blocks.

  • Use regular expressions to extract:

    • Player name

    • Player state

    • Total Points

    • Pre-Tournament rating

    • Opponent pair numbers for each round

  • Construct an initial dataframe containing one row per player, inclusive of their pre-rating.

  • Create a look-up structure that maps pair number to pre-rating.

  • For each player, retrieve the list of opponent pair numbers, match them against the corresponding pre-ratings, and compute the average of those pre-ratings, excluding games not played (byes and forfeits) from the calculation.

  • The computed averages will then be appended as a novel variable within the final dataframe.

  • Finally, the completed dataset will be exported to a .csv file using the write.csv() function, as required by the project specifications.
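
The look-up step in the plan above can be sketched as follows. This is a minimal illustration on toy data, assuming a named vector as the look-up structure; the four ratings, the opponent lists, and the names rating_lookup and avg_opponent_pre_rating are hypothetical, and the sketch uses base R only so that it runs without additional packages.

```r
# Toy data: pair numbers and pre-ratings for four hypothetical players.
pair_number <- c(1, 2, 3, 4)
pre_rating  <- c(1794, 1553, 1384, 1716)

# Look-up structure: a vector of pre-ratings named by pair number.
rating_lookup <- setNames(pre_rating, pair_number)

# Opponent pair numbers per player; NA marks a round with no game (bye or forfeit).
opponents <- list(
  c(2, 3, 4),
  c(1, NA, 4),
  c(1, 4, NA),
  c(1, 2, 3)
)

# Average opponent pre-rating per player, skipping unplayed rounds.
avg_opponent_pre_rating <- vapply(
  opponents,
  function(opp) {
    played <- opp[!is.na(opp)]
    mean(rating_lookup[as.character(played)])
  },
  numeric(1)
)

avg_opponent_pre_rating
```

Indexing by the pair number as a character name, rather than by position, keeps the look-up correct even if the players are not stored in pair-number order.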

Anticipated Challenges

One expected challenge involves correctly parsing the inconsistent spacing within the text file. Since the dataset itself is not comma or tab separated, the extraction process will depend heavily on deciphering recurring patterns.

Another focal challenge will be ensuring that the opponent references are correctly matched to the corresponding player records. Because opponent ratings are to be derived indirectly via pair numbers, careful indexing and validation will be required to prevent mismatches from transpiring.

Furthermore, some entries contain instances of provisional ratings (for example, “P” designations), which may require cleaning prior to numeric conversion. Games marked as byes or unplayed rounds must also be considered and excluded from the opponent-average calculations.
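
As a rough illustration of the last two challenges, the sketch below parses one rating line with a provisional designation and one result line with unplayed rounds. Both lines are made-up examples in the same layout as the tournament file (the USCF ID and the name SAMPLE PLAYER are hypothetical), and base R regular-expression functions are used here instead of stringr so that the sketch is self-contained.

```r
# Illustrative lines in the tournament file's layout (not copied from the file).
rating_line <- "   MI | 12345678 / R: 1641P17->1657P24  |N:3  |W    |B    |W    |B    |W    |B    |W    |"
player_line <- "   22 | SAMPLE PLAYER                   |4.0  |W  10|H    |L   3|B    |W   7|U    |D   5|"

# Pre-rating: keep the digits directly after "R:", which stops before the "P".
pre_match  <- regmatches(rating_line, regexpr("R:\\s*\\d+", rating_line, perl = TRUE))
pre_rating <- as.numeric(sub("R:\\s*", "", pre_match, perl = TRUE))

# Opponents: only W, L, or D codes followed by a pair number count as played games,
# so rounds marked H, B, or U (no pair number) are never captured.
round_hits <- regmatches(
  player_line,
  gregexpr("\\|[WLD]\\s+\\d+", player_line, perl = TRUE)
)[[1]]
opponents <- as.numeric(gsub("\\D", "", round_hits, perl = TRUE))

pre_rating
opponents
```

Anchoring the opponent pattern on the leading pipe is a design choice in this sketch: it prevents a letter inside a player name from being mistaken for a result code.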

Code Base/Body

The first step, as usual, involves loading the required libraries.

library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(stringr)
Warning: package 'stringr' was built under R version 4.5.2
library(tidyr)
Warning: package 'tidyr' was built under R version 4.5.2

Next, we will import the contents of the text file, which is housed within my GitHub repository, for the purpose of code reproducibility.

url <- "https://raw.githubusercontent.com/bkchanderban/CUNY_SPS/refs/heads/main/DATA607/DATA607/Project%20One%20Assignment/tournamentinfo.txt"

raw_lines <- readLines(url)
Warning in readLines(url): incomplete final line found on
'https://raw.githubusercontent.com/bkchanderban/CUNY_SPS/refs/heads/main/DATA607/DATA607/Project%20One%20Assignment/tournamentinfo.txt'
length(raw_lines)
[1] 196
head(raw_lines, 10)
 [1] "-----------------------------------------------------------------------------------------" 
 [2] " Pair | Player Name                     |Total|Round|Round|Round|Round|Round|Round|Round| "
 [3] " Num  | USCF ID / Rtg (Pre->Post)       | Pts |  1  |  2  |  3  |  4  |  5  |  6  |  7  | "
 [4] "-----------------------------------------------------------------------------------------" 
 [5] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|" 
 [6] "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |" 
 [7] "-----------------------------------------------------------------------------------------" 
 [8] "    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|" 
 [9] "   MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |" 
[10] "-----------------------------------------------------------------------------------------" 

As it stands, the previously created raw_lines object serves as a character vector wherein each element represents one line of the text file.

Removing The Divider Lines

In examining the head of the raw_lines character vector, we can observe that some of the lines (for example, lines 4 and 10) are composed entirely of dashes. These lines are not required in the final players dataframe.

clean_lines <- raw_lines[!str_detect(raw_lines,"^[-]+$")]

head(clean_lines, 10)
 [1] " Pair | Player Name                     |Total|Round|Round|Round|Round|Round|Round|Round| "
 [2] " Num  | USCF ID / Rtg (Pre->Post)       | Pts |  1  |  2  |  3  |  4  |  5  |  6  |  7  | "
 [3] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|" 
 [4] "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |" 
 [5] "    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|" 
 [6] "   MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |" 
 [7] "    3 | ADITYA BAJAJ                    |6.0  |L   8|W  61|W  25|W  21|W  11|W  13|W  12|" 
 [8] "   MI | 14959604 / R: 1384   ->1640     |N:2  |W    |B    |W    |B    |W    |B    |W    |" 
 [9] "    4 | PATRICK H SCHILLING             |5.5  |W  23|D  28|W   2|W  26|D   5|W  19|D   1|" 
[10] "   MI | 12616049 / R: 1716   ->1744     |N:2  |W    |B    |W    |B    |W    |B    |B    |" 

Identifying The Player Blocks

Currently, within the clean_lines character vector, the record corresponding to each player is spread across two consecutive lines.

The first one contains features such as the pair number, the player name, the total points, and the round results. The second line, on the other hand, contains facets such as the player’s state, their USCF ID, and their pre and post-ratings.

As such, we will first strive to isolate/filter only the lines that commence with the pair numbers of the players.

player_line_index <- which(str_detect(clean_lines, "^\\s*\\d+\\s*\\|"))

player_lines <- clean_lines[player_line_index]

length(player_lines)
[1] 64
head(player_lines,10)
 [1] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|"
 [2] "    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|"
 [3] "    3 | ADITYA BAJAJ                    |6.0  |L   8|W  61|W  25|W  21|W  11|W  13|W  12|"
 [4] "    4 | PATRICK H SCHILLING             |5.5  |W  23|D  28|W   2|W  26|D   5|W  19|D   1|"
 [5] "    5 | HANSHI ZUO                      |5.5  |W  45|W  37|D  12|D  13|D   4|W  14|W  17|"
 [6] "    6 | HANSEN SONG                     |5.0  |W  34|D  29|L  11|W  35|D  10|W  27|W  21|"
 [7] "    7 | GARY DEE SWATHELL               |5.0  |W  57|W  46|W  13|W  11|L   1|W   9|L   2|"
 [8] "    8 | EZEKIEL HOUGHTON                |5.0  |W   3|W  32|L  14|L   9|W  47|W  28|W  19|"
 [9] "    9 | STEFANO LEE                     |5.0  |W  25|L  18|W  59|W   8|W  26|L   7|W  20|"
[10] "   10 | ANVIT RAO                       |5.0  |D  16|L  19|W  55|W  31|D   6|W  25|W  18|"

Now, we will attempt to extract the second lines of each player’s information (which contains the rating figures).

rating_lines <- clean_lines[player_line_index + 1]

length(rating_lines)
[1] 64
head(rating_lines)
[1] "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |"
[2] "   MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |"
[3] "   MI | 14959604 / R: 1384   ->1640     |N:2  |W    |B    |W    |B    |W    |B    |W    |"
[4] "   MI | 12616049 / R: 1716   ->1744     |N:2  |W    |B    |W    |B    |W    |B    |B    |"
[5] "   MI | 14601533 / R: 1655   ->1690     |N:2  |B    |W    |B    |W    |B    |W    |B    |"
[6] "   OH | 15055204 / R: 1686   ->1687     |N:3  |W    |B    |W    |B    |B    |W    |B    |"
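
Because the rating lines are taken purely by position (index + 1), it may be worth confirming that each one really looks like a rating line. The following is a small sketch of such a check, assuming the first six non-divider lines shown earlier as stand-in data; the names clean_lines_demo and aligned are hypothetical, and base R is used so that the check runs on its own.

```r
# First six non-divider lines, as printed by head(clean_lines, 10) above.
clean_lines_demo <- c(
  " Pair | Player Name                     |Total|Round|Round|Round|Round|Round|Round|Round| ",
  " Num  | USCF ID / Rtg (Pre->Post)       | Pts |  1  |  2  |  3  |  4  |  5  |  6  |  7  | ",
  "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|",
  "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |",
  "    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|",
  "   MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |"
)

# Lines that start with a pair number.
player_line_index <- grep("^\\s*\\d+\\s*\\|", clean_lines_demo, perl = TRUE)

# The line after each player line should start with a two-letter state code
# and contain the "R:" rating marker.
rating_lines <- clean_lines_demo[player_line_index + 1]
aligned <- all(player_line_index + 1 <= length(clean_lines_demo)) &&
  all(grepl("^\\s*[A-Z]{2}\\s*\\|.*R:", rating_lines, perl = TRUE))

aligned
```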

Extracting The Required Player Information

At this stage, our prevailing goal is to extract the core player information from within the isolated lines of player information.

We will first begin by obtaining features from the player_lines character vector, starting with the pair number.

pair_number <- str_extract(player_lines, "^\\s*\\d+") %>%
  str_trim() %>%
  as.numeric()

head(pair_number)
[1] 1 2 3 4 5 6

Next we extract the players’ names.

player_name <- str_extract(player_lines, "\\|\\s*[A-Z ,.'-]+\\s*\\|") %>%
  str_remove_all("\\|") %>%
  str_trim()

head(player_name)
[1] "GARY HUA"            "DAKSHESH DARURI"     "ADITYA BAJAJ"       
[4] "PATRICK H SCHILLING" "HANSHI ZUO"          "HANSEN SONG"        

Then, we extract the total number of points.

total_points <- str_extract(player_lines, "\\|\\s*\\d+\\.\\d\\s*\\|") %>%
  str_remove_all("\\|") %>%
  str_trim() %>%
  as.numeric()

head(total_points)
[1] 6.0 6.0 6.0 5.5 5.5 5.0

Now, we shift our focus to deriving features from within the rating_lines, beginning with the players’ states.

player_state <- str_extract(rating_lines, "^\\s*[A-Z]{2}") %>%
  str_trim()

head(player_state, 10)
 [1] "ON" "MI" "MI" "MI" "MI" "OH" "MI" "MI" "ON" "MI"

Finally, we extract the players’ pre-ratings.

pre_rating <- str_extract(rating_lines, "R:\\s*\\d+") %>%
  str_extract("\\d+") %>%
  as.numeric()

head(pre_rating, 10)
 [1] 1794 1553 1384 1716 1655 1686 1649 1641 1411 1365
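
An alternative to writing one regular expression per field is to split each line on the pipe delimiter and pick fields by position. The sketch below shows this on the first player's two lines from the output above; it is offered only as a possible cross-check of the regex-based extraction, not as the method used in this report, and the helper split_fields is a hypothetical name.

```r
player_line <- "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|"
rating_line <- "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |"

# Split a line on "|" and trim the surrounding whitespace from every field.
split_fields <- function(line) {
  trimws(strsplit(line, "|", fixed = TRUE)[[1]])
}

p <- split_fields(player_line)
r <- split_fields(rating_line)

# Fields by position: pair number, name, points on the first line;
# state and the "ID / R: pre ->post" field on the second.
record <- list(
  pair_number  = as.numeric(p[1]),
  player_name  = p[2],
  total_points = as.numeric(p[3]),
  player_state = r[1],
  pre_rating   = as.numeric(sub(".*R:\\s*(\\d+).*", "\\1", r[2], perl = TRUE))
)

str(record)
```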

Construct The Initial Players Dataframe

Now that we have obtained all of the players’ base components from within the chess tournament ratings text file, we can proceed with the creation of the initial players dataframe.

players_df <- tibble(
  pair_number = pair_number,
  player_name = player_name,
  player_state = player_state,
  total_points = total_points,
  pre_rating = pre_rating
  )

head(players_df)
# A tibble: 6 × 5
  pair_number player_name         player_state total_points pre_rating
        <dbl> <chr>               <chr>               <dbl>      <dbl>
1           1 GARY HUA            ON                    6         1794
2           2 DAKSHESH DARURI     MI                    6         1553
3           3 ADITYA BAJAJ        MI                    6         1384
4           4 PATRICK H SCHILLING MI                    5.5       1716
5           5 HANSHI ZUO          MI                    5.5       1655
6           6 HANSEN SONG         OH                    5         1686

Extract Opponent Pair Numbers

Now in possession of the initial player information dataframe, our next task is to calculate the average pre-tournament rating of each player’s opponents. The first step entails extracting the opponent pair number for each game that a given player actually played.

# Extract all occurrences of a result letter followed by spaces and digits from within the player_lines

opponent_list <- str_extract_all(player_lines, "[WLDHUX]\\s*\\d+")

# Extract just the numeric portion
opponent_list <- lapply(opponent_list, function(x) {
  as.numeric(str_extract(x, "\\d+"))
})

head(opponent_list)
[[1]]
[1] 39 21 18 14  7 12  4

[[2]]
[1] 63 58  4 17 16 20  7

[[3]]
[1]  8 61 25 21 11 13 12

[[4]]
[1] 23 28  2 26  5 19  1

[[5]]
[1] 45 37 12 13  4 14 17

[[6]]
[1] 34 29 11 35 10 27 21

Now that we have obtained the opponent pair numbers for each player’s games, we must convert the resulting list into a long format (one row per player-opponent pairing) so that it can be joined and summarised.

opponent_df <- tibble(
  pair_number = pair_number,
  opponents = opponent_list
  ) %>%
  unnest(opponents)

head(opponent_df, 10)
# A tibble: 10 × 2
   pair_number opponents
         <dbl>     <dbl>
 1           1        39
 2           1        21
 3           1        18
 4           1        14
 5           1         7
 6           1        12
 7           1         4
 8           2        63
 9           2        58
10           2         4

Map The Opponent Pair Numbers To Pre-Rating

Now that we have a tidy dataframe containing the players’ pair numbers and their opponents’, the next step involves appending the opponents’ pre-rating figures to the dataframe. This will be accomplished by joining the opponent_df to the players_df.

opponent_df <- opponent_df %>%
  left_join(
    players_df %>% select(pair_number, pre_rating),
    by = c("opponents" = "pair_number")
    )

opponent_df
# A tibble: 408 × 3
   pair_number opponents pre_rating
         <dbl>     <dbl>      <dbl>
 1           1        39       1436
 2           1        21       1563
 3           1        18       1600
 4           1        14       1610
 5           1         7       1649
 6           1        12       1663
 7           1         4       1716
 8           2        63       1175
 9           2        58        917
10           2         4       1716
# ℹ 398 more rows
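
Since a later step averages with na.rm = TRUE, an opponent pair number that failed to match any player would be dropped silently rather than raising an error. A minimal sketch of a check for this is given below, assuming toy stand-in data frames (players_demo and games_demo are hypothetical, with one deliberately unmatched opponent) and base R's match() in place of the join.

```r
# Toy stand-ins for players_df and the long-format opponent table.
players_demo <- data.frame(
  pair_number = c(1, 2, 3),
  pre_rating  = c(1794, 1553, 1384)
)
games_demo <- data.frame(
  pair_number = c(1, 1, 2, 3),
  opponents   = c(2, 3, 1, 4)   # opponent 4 is deliberately absent from players_demo
)

# Look up each opponent's pre-rating; unmatched opponents become NA.
games_demo$pre_rating <- players_demo$pre_rating[
  match(games_demo$opponents, players_demo$pair_number)
]

# Any rows listed here point to a parsing or indexing problem.
unmatched <- games_demo[is.na(games_demo$pre_rating), ]
unmatched
```

On the real data, an empty result from this kind of check would suggest that every extracted opponent number corresponds to a parsed player.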

Compute The Average Opponent Pre-Rating

Now that our opponent_df dataframe has been updated to include the opponents’ pre-ratings, we can calculate the average pre-tournament rating of each player’s opponents.

avg_opponent_rating <- opponent_df %>%
  group_by(pair_number) %>%
  summarise(avg_opponents_pre_rating = round(mean(pre_rating, na.rm = TRUE)))
# Note: the average is rounded to match the assignment's example for GARY HUA, whose opponent average is given as 1605 rather than 1605.286...

head(avg_opponent_rating)
# A tibble: 6 × 2
  pair_number avg_opponents_pre_rating
        <dbl>                    <dbl>
1           1                     1605
2           2                     1469
3           3                     1564
4           4                     1574
5           5                     1501
6           6                     1519
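
The first row can be reproduced by hand from the seven opponent pre-ratings listed for pair number 1 in the opponent_df output above. The short sketch below repeats that arithmetic; the variable names are illustrative only.

```r
# Pre-ratings of GARY HUA's seven opponents, taken from the opponent_df output.
gary_hua_opponents <- c(1436, 1563, 1600, 1610, 1649, 1663, 1716)

total       <- sum(gary_hua_opponents)   # 11237
raw_avg     <- total / length(gary_hua_opponents)
rounded_avg <- round(raw_avg)            # 1605

c(total = total, raw_avg = raw_avg, rounded_avg = rounded_avg)
```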

Attach Discerned Averages Back To Main Dataframe

As we have obtained the figure this analysis set out to compute (the average pre-tournament rating of each player’s opponents), we can attach these values to our final dataframe.

final_df <- players_df %>%
  left_join(avg_opponent_rating, by = "pair_number")

head(final_df)
# A tibble: 6 × 6
  pair_number player_name         player_state total_points pre_rating
        <dbl> <chr>               <chr>               <dbl>      <dbl>
1           1 GARY HUA            ON                    6         1794
2           2 DAKSHESH DARURI     MI                    6         1553
3           3 ADITYA BAJAJ        MI                    6         1384
4           4 PATRICK H SCHILLING MI                    5.5       1716
5           5 HANSHI ZUO          MI                    5.5       1655
6           6 HANSEN SONG         OH                    5         1686
# ℹ 1 more variable: avg_opponents_pre_rating <dbl>

Sanity Check

Using the data from within the original text file, the average opponent pre-ratings for two players were hand-calculated.

The first player, Anvit Rao (pair number 10), played in all seven rounds, and his opponents had an average pre-rating of 1,554 (1,554.1428571429).

final_df %>% filter(pair_number == 10)
# A tibble: 1 × 6
  pair_number player_name player_state total_points pre_rating
        <dbl> <chr>       <chr>               <dbl>      <dbl>
1          10 ANVIT RAO   MI                      5       1365
# ℹ 1 more variable: avg_opponents_pre_rating <dbl>

The second player, Eugene L McClure (pair number 22), played in six of the seven rounds, and his opponents had an average pre-rating of 1,300 (1,300.333…).

final_df %>% filter(pair_number == 22) 
# A tibble: 1 × 6
  pair_number player_name      player_state total_points pre_rating
        <dbl> <chr>            <chr>               <dbl>      <dbl>
1          22 EUGENE L MCCLURE MI                      4       1555
# ℹ 1 more variable: avg_opponents_pre_rating <dbl>
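
In the printed tibbles above, the avg_opponents_pre_rating column is truncated, so the comparison cannot be read directly from the output. One way to make such a check explicit is sketched below. It assumes a small stand-in data frame built from the first three rows of the avg_opponent_rating output shown earlier; final_df_demo and check_average are hypothetical names, and the expected value is the unrounded GARY HUA figure quoted in the rounding note.

```r
# Stand-in for final_df, using values printed earlier in this report.
final_df_demo <- data.frame(
  pair_number = c(1, 2, 3),
  player_name = c("GARY HUA", "DAKSHESH DARURI", "ADITYA BAJAJ"),
  avg_opponents_pre_rating = c(1605, 1469, 1564)
)

# Compare the stored average for one pair number against a hand-calculated value.
check_average <- function(df, pair, expected) {
  actual <- df$avg_opponents_pre_rating[df$pair_number == pair]
  data.frame(
    pair_number = pair,
    expected    = round(expected),
    actual      = actual,
    matches     = actual == round(expected)
  )
}

result <- check_average(final_df_demo, pair = 1, expected = 1605.286)
result
```

The same helper could be pointed at final_df with pair numbers 10 and 22 and the hand-calculated figures above.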

Export To CSV

Our final_df dataframe now contains each player’s pair number, name, state, total points, pre-rating, and the average pre-tournament rating of their opponents. The final requirement of the Project One assignment is to export this dataframe to a CSV file, which can then be imported into an SQL database for further analysis.

write.csv(
  final_df %>%
    select(player_name, player_state, total_points, pre_rating, avg_opponents_pre_rating),
  "tournament_results.csv",
  row.names = FALSE
)
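
A quick way to confirm that an export of this kind round-trips cleanly is to read the file back and compare its shape. The sketch below does this with a two-row stand-in data frame (demo_df, built from values shown earlier) and a temporary file path, both of which are assumptions made for illustration rather than part of the actual export.

```r
# Two-row stand-in with the same columns as the exported data.
demo_df <- data.frame(
  player_name  = c("GARY HUA", "DAKSHESH DARURI"),
  player_state = c("ON", "MI"),
  total_points = c(6.0, 6.0),
  pre_rating   = c(1794, 1553),
  avg_opponents_pre_rating = c(1605, 1469)
)

# Write to a temporary file so the sketch leaves the working directory untouched.
csv_path <- tempfile(fileext = ".csv")
write.csv(demo_df, csv_path, row.names = FALSE)

# Read the file back and confirm the columns and row count survived.
read_back <- read.csv(csv_path, stringsAsFactors = FALSE)
identical(names(read_back), names(demo_df))
nrow(read_back) == nrow(demo_df)
```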

Conclusion

In completing this assignment, it became evident that unrefined yet structured data can be systematically transformed into meaningful analytical results. In this project, that transformation involved precise parsing and relational mapping. String manipulation techniques were combined with the dplyr library’s grouping functionality to extract the required player information and compute the sought-after average opponent pre-tournament ratings.

With respect to the most technically demanding portion of the project’s execution, I would attribute that to the parsing process itself. This step required close attention to detail, as well as multiple iterations, in order to achieve the desired functionality.

LLM Used

  • OpenAI. (2026). ChatGPT (Version 4o) [Large language model]. https://chat.openai.com. Accessed February 21, 2026.