The purpose of this document is to systematically import a text file containing data from a chess tournament cross table. The below R code uses string manipulations and extractions in order to clean up the text file and extract relevant information formatted in a clean, reusable, and exported dataframe with the goal of writing the newly created table to a CSV file for easy use.
library(stringr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
First you’ll need to import the raw data from my github repo. You can find a link to the data here: <https://raw.githubusercontent.com/awrubes/DATA607_Project1/refs/heads/main/tournamentinfo.txt?token=GHSAT0AAAAAACW23YBZCXZQRWYMYX2AVM46ZXQO5ZA>. I find it easier to read in the file as lines in a string vector so that we can use stringr functions to extract specific pieces of data.
#import chess data as txt file
chess_data <- readLines("https://raw.githubusercontent.com/awrubes/DATA607_Project1/main/tournamentinfo.txt")
## Warning in
## readLines("https://raw.githubusercontent.com/awrubes/DATA607_Project1/main/tournamentinfo.txt"):
## incomplete final line found on
## 'https://raw.githubusercontent.com/awrubes/DATA607_Project1/main/tournamentinfo.txt'
Next, we’ll want to clean up the text file so that it’s easier to parse. We’ll do this by getting rid of the header rows and removing the “—” row separators. This way the only text in the file will be the data we want to selectively extract.
#remove the header lines
chess_data <- chess_data[-(1:3)]
head(chess_data)
## [1] "-----------------------------------------------------------------------------------------"
## [2] " 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|"
## [3] " ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |"
## [4] "-----------------------------------------------------------------------------------------"
## [5] " 2 | DAKSHESH DARURI |6.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7|"
## [6] " MI | 14598900 / R: 1553 ->1663 |N:2 |B |W |B |W |B |W |B |"
#removes the "-" lines separating the rows of player data
chess_data <- chess_data[!grepl("^-+$", chess_data)]
head(chess_data)
## [1] " 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|"
## [2] " ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |"
## [3] " 2 | DAKSHESH DARURI |6.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7|"
## [4] " MI | 14598900 / R: 1553 ->1663 |N:2 |B |W |B |W |B |W |B |"
## [5] " 3 | ADITYA BAJAJ |6.0 |L 8|W 61|W 25|W 21|W 11|W 13|W 12|"
## [6] " MI | 14959604 / R: 1384 ->1640 |N:2 |W |B |W |B |W |B |W |"
Before we begin extracting values from the text file, we’ll want to initialize the dataframe with appropriate headers and cell value types.
#creates the dataframe with desired columns
chess_dataFrame <- data.frame(
PlayerID = numeric(),
PlayerName = character(),
State = character(),
TotalPoints = numeric(),
PreRating = numeric(),
GameCount = character(),
AvgOpponentRating = numeric(),
stringsAsFactors = FALSE
)
Once the dataframe is initialized, now we can start extracting data from the text file and adding it to the appropriate cells. In this particular case, I took a more cautious approach to extracting the data (which you’ll see momentarily). Instead of extracting everything simultaneously in a single loop indexing through vector rows, I extracted data one piece at a time, which allowed me to appropriately debug at every step to ensure the correct data was being extracted. Understood, this method is by no means the most computationally efficient.
You’ll also notice that for the total games and opponent ID’s, I extract both together and include in a single cell as values to parse later on, once the foundations of the dataframe are set.
#sequentially move through file and add player ID's dynamically
start_index <- 1
#extract player names
for (i in seq(1, length(chess_data), by = 2)) {
# Get player info from the current line
player_info <- chess_data[i]
playerID <- start_index
# Extract the names of players
player_names <- paste(unlist(str_extract(player_info, "(?<=\\|)\\s*[A-Z\\s]+(?=\\s*\\|)")))
player_names <- trimws(player_names)
# Add ID and names to the data frame, with placeholders for others
chess_dataFrame <- rbind(chess_dataFrame, data.frame(
PlayerID = playerID,
PlayerName = player_names,
State = NA, # Placeholder
TotalPoints = NA, # Placeholder
PreRating = NA, # Placeholder
GameCount = NA, # Placeholder
AvgOpponentRating = NA, # Placeholder
stringsAsFactors = FALSE
))
# Increment ID
start_index <- start_index + 1
}
#extract states
start_index <- 1
for (i in seq(1, length(chess_data),by=2)) {
#split player info into lines
player_info <- chess_data[i]
player_rating_info <- chess_data[i+1]
#extracts the states, replace NA with "na" for easier differentiation
player_state <- paste(str_extract(player_rating_info, "^\\s*[A-Z]{2}\\b"))
player_state <- trimws(player_state)
chess_dataFrame$State[start_index] <- player_state
start_index <- start_index + 1
}
#extract total points
start_index <- 1
for (i in seq(1, length(chess_data),by=2)) {
#split player info into lines
player_info <- chess_data[i]
player_rating_info <- chess_data[i+1]
#player total points
total_points <- paste(unlist(str_extract_all(player_info, "(?<=|)\\d\\.\\d")))
chess_dataFrame$TotalPoints[start_index] <- total_points
start_index <- start_index + 1
}
#extract player preratings
start_index <- 1
for (i in seq(1, length(chess_data),by=2)) {
#split player info into lines
player_info <- chess_data[i]
player_rating_info <- chess_data[i+1]
#pre-ratings for players
player_rating <- paste(unlist(str_extract_all(player_rating_info, "(?<=R:)\\s*(\\d+)(?=\\s*->|\\w)")))
player_rating <- trimws(player_rating)
chess_dataFrame$PreRating[start_index] <- as.numeric(player_rating)
start_index <- start_index + 1
}
#extract game outcomes + opponent IDs
start_index <- 1
for (i in seq(1, length(chess_data),by=2)) {
#split player info into lines
player_info <- chess_data[i]
player_rating_info <- chess_data[i+1]
#game outcomes + opponent ID
games_played <- paste(unlist(str_extract_all(player_info, "\\b(W|L|D){1}\\s*\\d+")), collapse =", ")
chess_dataFrame$GameCount[start_index] <- games_played
start_index <- start_index + 1
}
Now you’ll see we should have a pretty solid table built out. We still need to calculate avg opponent rating (sum of opponent pre-tournament score divided by games played)
head(chess_dataFrame)
## PlayerID PlayerName State TotalPoints PreRating
## 1 1 GARY HUA ON 6.0 1794
## 2 2 DAKSHESH DARURI MI 6.0 1553
## 3 3 ADITYA BAJAJ MI 6.0 1384
## 4 4 PATRICK H SCHILLING MI 5.5 1716
## 5 5 HANSHI ZUO MI 5.5 1655
## 6 6 HANSEN SONG OH 5.0 1686
## GameCount AvgOpponentRating
## 1 W 39, W 21, W 18, W 14, W 7, D 12, D 4 NA
## 2 W 63, W 58, L 4, W 17, W 16, W 20, W 7 NA
## 3 L 8, W 61, W 25, W 21, W 11, W 13, W 12 NA
## 4 W 23, D 28, W 2, W 26, D 5, W 19, D 1 NA
## 5 W 45, W 37, D 12, D 13, D 4, W 14, W 17 NA
## 6 W 34, D 29, L 11, W 35, D 10, W 27, W 21 NA
We can do this by looping through the chess dataframe extracting the number of games played and then dynamically looping through the vector of opponent ID’s to locate their pre-tournament score.
#calculate avg opponent rating
for(i in 1:nrow(chess_dataFrame)){
game_count_str <- chess_dataFrame$GameCount[i]
total_games <- str_count(game_count_str, "[A-Z]")
opponent_ids <- as.numeric(unlist(str_extract_all(game_count_str, "\\d+")))
total_opponent_rating <- 0
index <- 1
for(opponent_id in opponent_ids){
opponent_rating <- chess_dataFrame$PreRating[opponent_id]
total_opponent_rating <- total_opponent_rating + opponent_rating
}
avg_op_rating <- total_opponent_rating / total_games
chess_dataFrame$AvgOpponentRating[i] = avg_op_rating
}
The table is pretty much complete. However, there are some unnecessary columns that we can trim before exporting the final to CSV.
head(chess_dataFrame)
## PlayerID PlayerName State TotalPoints PreRating
## 1 1 GARY HUA ON 6.0 1794
## 2 2 DAKSHESH DARURI MI 6.0 1553
## 3 3 ADITYA BAJAJ MI 6.0 1384
## 4 4 PATRICK H SCHILLING MI 5.5 1716
## 5 5 HANSHI ZUO MI 5.5 1655
## 6 6 HANSEN SONG OH 5.0 1686
## GameCount AvgOpponentRating
## 1 W 39, W 21, W 18, W 14, W 7, D 12, D 4 1605.286
## 2 W 63, W 58, L 4, W 17, W 16, W 20, W 7 1469.286
## 3 L 8, W 61, W 25, W 21, W 11, W 13, W 12 1563.571
## 4 W 23, D 28, W 2, W 26, D 5, W 19, D 1 1573.571
## 5 W 45, W 37, D 12, D 13, D 4, W 14, W 17 1500.857
## 6 W 34, D 29, L 11, W 35, D 10, W 27, W 21 1518.714
In the final table, we only need 5 columns: Name, State, TotalPoints, PreRating, and Avg Opponent Rating. So, we’ll subset this dataframe in a new one, excluding the unneeded columns.
After doing that, we’re ready to export as a CSV!
#subset the dataframe
selected_columns <- chess_dataFrame[, c("PlayerName", "State", "TotalPoints", "PreRating", "AvgOpponentRating")]
#write to a csv
write.csv(selected_columns, file = "chess_tournament_selected_columns.csv", row.names = FALSE)
head(selected_columns)
## PlayerName State TotalPoints PreRating AvgOpponentRating
## 1 GARY HUA ON 6.0 1794 1605.286
## 2 DAKSHESH DARURI MI 6.0 1553 1469.286
## 3 ADITYA BAJAJ MI 6.0 1384 1563.571
## 4 PATRICK H SCHILLING MI 5.5 1716 1573.571
## 5 HANSHI ZUO MI 5.5 1655 1500.857
## 6 HANSEN SONG OH 5.0 1686 1518.714