In this project, I worked with a text file
(tournamentinfo.txt) containing raw chess tournament
results. My objective was to transform the text into a clean dataset
with the following variables:
- Player’s Name
- Player’s State
- Total Points
- Pre-Tournament Rating
- Average Opponent Pre-Rating
To start, I loaded the libraries I used:
- readr for reading the text file
- stringr for working with regular expressions
- dplyr for data manipulation
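The loading chunk itself is not shown in this write-up. A minimal sketch, assuming the three packages are already installed, would be:

```r
# Load the packages named above (assumes they are already installed).
library(readr)    # read_lines()
library(stringr)  # str_match(), str_extract_all()
library(dplyr)    # data manipulation helpers
```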
I read the text file into R. Each line was stored as one element in a character vector.
lines <- read_lines("tournamentinfo.txt")
length(lines)
## [1] 196
head(lines, 10)
## [1] "-----------------------------------------------------------------------------------------"
## [2] " Pair | Player Name |Total|Round|Round|Round|Round|Round|Round|Round| "
## [3] " Num | USCF ID / Rtg (Pre->Post) | Pts | 1 | 2 | 3 | 4 | 5 | 6 | 7 | "
## [4] "-----------------------------------------------------------------------------------------"
## [5] " 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|"
## [6] " ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |"
## [7] "-----------------------------------------------------------------------------------------"
## [8] " 2 | DAKSHESH DARURI |6.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7|"
## [9] " MI | 14598900 / R: 1553 ->1663 |N:2 |B |W |B |W |B |W |B |"
## [10] "-----------------------------------------------------------------------------------------"
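Each player’s record spans two lines: the first holds the pair number, name, total points, and round results, and the second holds the state and ratings. Player lines begin with a pair number followed by a pipe, so I matched those lines and took the line immediately after each one as its rating line.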
idx <- grep("^\\s*[0-9]+\\s+\\|", lines)
player_lines <- lines[idx]
rating_lines <- lines[idx + 1]
length(idx)
## [1] 64
head(player_lines, 3)
## [1] " 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|"
## [2] " 2 | DAKSHESH DARURI |6.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7|"
## [3] " 3 | ADITYA BAJAJ |6.0 |L 8|W 61|W 25|W 21|W 11|W 13|W 12|"
head(rating_lines, 3)
## [1] " ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |"
## [2] " MI | 14598900 / R: 1553 ->1663 |N:2 |B |W |B |W |B |W |B |"
## [3] " MI | 14959604 / R: 1384 ->1640 |N:2 |W |B |W |B |W |B |W |"
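Because the pairing relies on each rating line sitting directly below its player line, a quick check is cheap insurance. The sketch below runs on one player block copied from the file preview (sample_lines is a stand-in for lines, and the column spacing is approximated). The checks are my own addition, not part of the original workflow.

```r
library(stringr)

# Stand-in for `lines`: one player block from the file preview (spacing approximated).
sample_lines <- c(
  "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|",
  "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |",
  "-----------------------------------------------------------------------------------------"
)

idx <- grep("^\\s*[0-9]+\\s+\\|", sample_lines)
rating_lines <- sample_lines[idx + 1]

# Every line after a player line should start with a two-letter state code
# and contain a pre-rating marker; stopifnot() errors if either check fails.
stopifnot(
  length(idx) == 1,
  all(str_detect(rating_lines, "^\\s*[A-Z]{2}\\s+\\|")),
  all(str_detect(rating_lines, "R:\\s*\\d+"))
)
```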
From the player and rating lines, I extracted the player’s name, state, total points, and pre-tournament rating.
# str_trim() removes the padding spaces that the pattern captures after each name
names <- str_trim(str_match(player_lines, "\\|\\s*([A-Z .'-]+)")[,2])
points <- as.numeric(str_match(player_lines, "\\|(\\d+\\.\\d)")[,2])
state <- str_match(rating_lines, "^\\s*([A-Z]{2})")[,2]
pre_rating <- as.numeric(str_match(rating_lines, "R:\\s*(\\d+)")[,2])
df <- data.frame(
Name = names,
State = state,
Points = points,
PreRating = pre_rating,
stringsAsFactors = FALSE
)
head(df, 5)
## Name State Points PreRating
## 1 GARY HUA ON 6.0 1794
## 2 DAKSHESH DARURI MI 6.0 1553
## 3 ADITYA BAJAJ MI 6.0 1384
## 4 PATRICK H SCHILLING MI 5.5 1716
## 5 HANSHI ZUO MI 5.5 1655
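str_match() returns a character matrix whose first column is the full match and whose later columns are the capture groups, which is why each extraction above ends in [,2]. A small illustration on one rating line from the output (spacing approximated), with a completeness check that is my addition:

```r
library(stringr)

# One rating line from the preview (spacing approximated).
rating_line <- "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |"

m <- str_match(rating_line, "R:\\s*(\\d+)")
m[, 1]  # full match, including the "R:" prefix
m[, 2]  # first capture group: the pre-rating digits only

pre_rating <- as.numeric(m[, 2])

# A failed match yields NA, so this catches any line the pattern missed.
stopifnot(!anyNA(pre_rating))
```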
I extracted the opponent pair numbers from each player line. Matching every number on the line would also pick up the player’s own pair number and the digits of the points total, which would distort the average, so I matched only numbers that follow a W, L, or D result code.
opponents <- lapply(str_match_all(player_lines, "\\|[WLD]\\s+(\\d+)"), function(m) m[, 2])
opponents[[1]]
## [1] "39" "21" "18" "14" "7" "12" "4"
For each player, I used the opponent pair numbers to look up pre-ratings and calculated the average. This works because the rows of df are in pair-number order, so pair number k is row k.
avg_opponent <- sapply(opponents, function(opp) {
opp <- as.numeric(opp)
mean(df$PreRating[opp], na.rm = TRUE)
})
df$AvgOpponent <- round(avg_opponent, 0)
head(df, 10)
## Name State Points PreRating AvgOpponent
## 1 GARY HUA ON 6.0 1794 1605
## 2 DAKSHESH DARURI MI 6.0 1553 1469
## 3 ADITYA BAJAJ MI 6.0 1384 1564
## 4 PATRICK H SCHILLING MI 5.5 1716 1574
## 5 HANSHI ZUO MI 5.5 1655 1501
## 6 HANSEN SONG OH 5.0 1686 1519
## 7 GARY DEE SWATHELL MI 5.0 1649 1372
## 8 EZEKIEL HOUGHTON MI 5.0 1641 1468
## 9 STEFANO LEE ON 5.0 1411 1523
## 10 ANVIT RAO MI 5.0 1365 1554
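One edge case worth guarding against is a round with no opponent, such as a bye. Assuming such rounds appear as a letter code with no number after it, a pattern anchored on the W, L, or D result code picks up only real opponent numbers, so the average is taken over games actually played. The ratings and the player line below are hypothetical and chosen only to make the arithmetic easy to follow.

```r
library(stringr)

# Hypothetical pre-ratings, indexed by pair number (not real tournament data).
pre_rating <- c(1500, 1600, 1700, 1800)

# Hypothetical player line for pair number 1: two games played and one round
# that carries a code but no opponent number.
player_line <- "    1 | SAMPLE PLAYER                   |1.5  |W   2|H    |D   4|"

# Keep only numbers that directly follow a W, L, or D result code.
opp <- as.numeric(str_match_all(player_line, "\\|[WLD]\\s+(\\d+)")[[1]][, 2])
opp                     # 2 4

# Average over the two games played: (1600 + 1800) / 2
mean(pre_rating[opp])   # 1700
```

Matching every number on the line instead would return 1, 1, 5, 2, and 4 here, mixing the player’s own pair number and the points digits into the lookup.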
Finally, I exported the cleaned dataset to a CSV file that can be loaded into a SQL database or used for further analysis.
write.csv(df, "chess_players.csv", row.names = FALSE)
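To confirm the export round-trips cleanly, the file can be read back and compared with the data frame. This is a sketch of my own rather than a step from the original analysis. It uses a two-row stand-in for df and a temporary file so that it runs on its own without touching chess_players.csv.

```r
# Two-row stand-in for df so the example runs on its own.
df <- data.frame(
  Name = c("GARY HUA", "DAKSHESH DARURI"),
  State = c("ON", "MI"),
  Points = c(6.0, 6.0),
  PreRating = c(1794, 1553),
  stringsAsFactors = FALSE
)

out_path <- tempfile(fileext = ".csv")  # temporary file, not the real output
write.csv(df, out_path, row.names = FALSE)

# Read the file back and compare column names, row count, and values.
check <- read.csv(out_path, stringsAsFactors = FALSE)
stopifnot(
  identical(names(check), names(df)),
  nrow(check) == nrow(df),
  all(check$Name == df$Name),
  all(check$PreRating == df$PreRating)
)
```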
I transformed the raw chess tournament text into a structured dataset. My final CSV contains each player’s name, state, total points, pre-rating, and average opponent pre-rating. The dataset is now clean and ready for further analysis.