Goal

In this project, I worked with a text file (tournamentinfo.txt) containing raw chess tournament results. My objective was to transform the text into a clean dataset with the following variables:
- Player’s Name
- Player’s State
- Total Points
- Pre-Tournament Rating
- Average Opponent Pre-Rating

1. Load Libraries

To start, I loaded the libraries I used:

- readr for reading the text file
- stringr for working with regular expressions
- dplyr for data manipulation
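
The loading step itself is not shown in this write-up. A minimal version, assuming the three packages are already installed, might look like this:

```r
# Attach the three packages named above.
# Assumption: they were installed beforehand, e.g. with
# install.packages(c("readr", "stringr", "dplyr")).
library(readr)    # read_lines()
library(stringr)  # str_match(), str_extract_all()
library(dplyr)    # data manipulation helpers
```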

2. Read the Raw Data

I read the text file into R. Each line was stored as one element in a character vector.

lines <- read_lines("tournamentinfo.txt")
length(lines) 
## [1] 196
head(lines, 10)
##  [1] "-----------------------------------------------------------------------------------------" 
##  [2] " Pair | Player Name                     |Total|Round|Round|Round|Round|Round|Round|Round| "
##  [3] " Num  | USCF ID / Rtg (Pre->Post)       | Pts |  1  |  2  |  3  |  4  |  5  |  6  |  7  | "
##  [4] "-----------------------------------------------------------------------------------------" 
##  [5] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|" 
##  [6] "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |" 
##  [7] "-----------------------------------------------------------------------------------------" 
##  [8] "    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|" 
##  [9] "   MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |" 
## [10] "-----------------------------------------------------------------------------------------"

3. Identify Player Records

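Each player’s information spans two lines: the first holds the pair number, name, total points, and per-round results; the second holds the state, USCF ID, and ratings. Result lines are the only ones that begin with a number followed by a pipe, so I matched those and took the line immediately after each match as the rating line.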

idx <- grep("^\\s*[0-9]+\\s+\\|", lines)
player_lines <- lines[idx]
rating_lines <- lines[idx + 1]
length(idx) 
## [1] 64
head(player_lines, 3)
## [1] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|"
## [2] "    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|"
## [3] "    3 | ADITYA BAJAJ                    |6.0  |L   8|W  61|W  25|W  21|W  11|W  13|W  12|"
head(rating_lines, 3)
## [1] "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |"
## [2] "   MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |"
## [3] "   MI | 14959604 / R: 1384   ->1640     |N:2  |W    |B    |W    |B    |W    |B    |W    |"
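
As a quick sanity check on the idx + 1 assumption, the same pattern can be run on a few lines copied from the file. This is only a sketch: sample_lines is a hypothetical stand-in for the full lines vector.

```r
# Sample of the raw file, copied from the head(lines, 10) output in section 2
sample_lines <- c(
  "-----------------------------------------------------------------------------------------",
  "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|",
  "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |",
  "-----------------------------------------------------------------------------------------",
  "    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|",
  "   MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |",
  "-----------------------------------------------------------------------------------------"
)

# Result rows start with a pair number followed by a pipe
idx <- grep("^\\s*[0-9]+\\s+\\|", sample_lines)
idx  # positions 2 and 5

# Every line right after a result row should start with a two-letter state code
stopifnot(all(grepl("^\\s*[A-Z]{2}\\s+\\|", sample_lines[idx + 1])))
```

If the stopifnot() call fails on the real file, the two-line layout does not hold for some record and idx + 1 would pick up the wrong line.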

4. Extract Core Fields

From the player and rating lines, I extracted the player’s name, state, total points, and pre-tournament rating.

names <- str_match(player_lines, "\\|\\s*([A-Z .'-]+)")[,2]
points <- as.numeric(str_match(player_lines, "\\|(\\d+\\.\\d)")[,2])
state <- str_match(rating_lines, "^\\s*([A-Z]{2})")[,2]
pre_rating <- as.numeric(str_match(rating_lines, "R:\\s*(\\d+)")[,2])

df <- data.frame(
  Name = names,
  State = state,
  Points = points,
  PreRating = pre_rating,
  stringsAsFactors = FALSE
)
head(df, 5)
##                               Name State Points PreRating
## 1 GARY HUA                            ON    6.0      1794
## 2 DAKSHESH DARURI                     MI    6.0      1553
## 3 ADITYA BAJAJ                        MI    6.0      1384
## 4 PATRICK H SCHILLING                 MI    5.5      1716
## 5 HANSHI ZUO                          MI    5.5      1655
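
The captured names keep the padding that fills the fixed-width column, which is why the Name column prints left-aligned with trailing spaces. A minimal sketch of removing that padding with stringr's str_trim(), shown on one row copied from the file rather than on df itself:

```r
library(stringr)

# One result row copied from the tournament file
player_line <- "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|"

# Same pattern as in the extraction step: it also captures the trailing spaces
raw_name <- str_match(player_line, "\\|\\s*([A-Z .'-]+)")[, 2]

# str_trim() drops leading and trailing whitespace
clean_name <- str_trim(raw_name)

nchar(raw_name) > nchar(clean_name)  # TRUE: padding was removed
clean_name                           # "GARY HUA"
```

Trimming before export keeps the padding out of the CSV, which matters if the names are later compared or joined on.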

5. Extract Opponent IDs

I extracted the opponent IDs from the round columns of each results row. Matching only the numbers that follow a W, L, or D result code keeps the pair number and the total points out of the list. Extracting every run of digits instead would return "1", "6", and "0" ahead of the real opponents for the first player, and those extra values would then be treated as opponents in the average.

opponents <- lapply(
  str_match_all(player_lines, "[WLD]\\s+(\\d+)"),
  function(m) as.integer(m[, 2])  # column 2 holds the captured opponent number
)
opponents[[1]]
## [1] 39 21 18 14  7 12  4

6. Calculate Average Opponent Pre-Rating

For each player, I used the opponent IDs as row indices into df to look up pre-ratings and calculated the average. This works because the rows of df are in pair-number order.

avg_opponent <- sapply(opponents, function(opp) {
  mean(df$PreRating[opp], na.rm = TRUE)
})
df$AvgOpponent <- round(avg_opponent, 0)
head(df, 10)
##                                Name State Points PreRating AvgOpponent
## 1  GARY HUA                            ON    6.0      1794        1605
## 2  DAKSHESH DARURI                     MI    6.0      1553        1469
## 3  ADITYA BAJAJ                        MI    6.0      1384        1564
## 4  PATRICK H SCHILLING                 MI    5.5      1716        1574
## 5  HANSHI ZUO                          MI    5.5      1655        1501
## 6  HANSEN SONG                         OH    5.0      1686        1519
## 7  GARY DEE SWATHELL                   MI    5.0      1649        1372
## 8  EZEKIEL HOUGHTON                    MI    5.0      1641        1468
## 9  STEFANO LEE                         ON    5.0      1411        1523
## 10 ANVIT RAO                           MI    5.0      1365        1554
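
Rounds that were not played need some care, since they should not count toward the average. The sketch below assumes such rounds appear as a result code with no opponent number; the row and the ratings are hypothetical, made up only to show the arithmetic.

```r
library(stringr)

# Hypothetical result row: rounds 3 and 6 carry a code but no opponent number
row <- "   12 | SAMPLE PLAYER                   |4.0  |W   2|L   3|H    |W   1|D   2|U    |W   3|"

# Hypothetical pre-ratings for pair numbers 1, 2 and 3
pre_rating <- c(1500, 1600, 1700)

# Capture only numbers that directly follow a W, L, or D result code
opp <- as.integer(str_match_all(row, "[WLD]\\s+(\\d+)")[[1]][, 2])
opp  # 2 3 1 2 3

# Average over the five games actually played:
# (1600 + 1700 + 1500 + 1600 + 1700) / 5 = 1620
mean(pre_rating[opp])
```

Because the pair number, the points, and the empty rounds never enter opp, the denominator is the number of games played rather than the number of rounds.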

7. Export to CSV

Finally, I exported the cleaned dataset to a CSV file that can be loaded into a SQL database or used for further analysis.

write.csv(df, "chess_players.csv", row.names = FALSE)
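
To confirm that an export like this round-trips cleanly, the file can be read back and compared with the data frame. A small sketch, using a hypothetical two-row data frame named demo and a temporary file instead of the real df and chess_players.csv:

```r
# Hypothetical two-row stand-in with the same core columns as df
demo <- data.frame(
  Name = c("GARY HUA", "DAKSHESH DARURI"),
  State = c("ON", "MI"),
  Points = c(6.0, 6.0),
  PreRating = c(1794, 1553),
  stringsAsFactors = FALSE
)

# Write to a temporary path so no existing file is overwritten
out_file <- tempfile(fileext = ".csv")
write.csv(demo, out_file, row.names = FALSE)

# Read the file back and confirm the shape and column names survived
check <- read.csv(out_file, stringsAsFactors = FALSE)
stopifnot(
  identical(names(check), names(demo)),
  nrow(check) == nrow(demo)
)
```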

8. Conclusion

I transformed the raw chess tournament text into a structured dataset. My final CSV contains each player’s name, state, total points, pre-rating, and average opponent pre-rating. The dataset is now clean and ready for further analysis.