For Project 1, we’re given a text file with chess tournament results where the information has some structure.
The below code is creating an R Markdown file that generates a .CSV
file (that could for example be imported into a SQL database) with the
following information for all of the players:
Player’s Name, Player’s State, Total Number of Points, Player’s
Pre-Rating, and Average Pre Chess Rating of Opponents
# Fetching chess tournament data
url <- "https://raw.githubusercontent.com/hbedros/DATA607_Proj1/main/chessTextData.txt"
textData <- suppressWarnings(readLines(url))
First, we’re removing separator lines unnecessary headers, and unwanted elements from textData and creating a new vector called filtered_textData.
filtered_textData <- textData[!grepl("----", textData) & nchar(trimws(textData)) > 0 & !grepl("Pair | Player", textData)]
head(filtered_textData)
## [1] " Num | USCF ID / Rtg (Pre->Post) | Pts | 1 | 2 | 3 | 4 | 5 | 6 | 7 | "
## [2] " 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|"
## [3] " ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |"
## [4] " 2 | DAKSHESH DARURI |6.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7|"
## [5] " MI | 14598900 / R: 1553 ->1663 |N:2 |B |W |B |W |B |W |B |"
## [6] " 3 | ADITYA BAJAJ |6.0 |L 8|W 61|W 25|W 21|W 11|W 13|W 12|"
Next, making a player_df dataframe from raw data
using a loop. This loop looks at filtered_textData.
Even-numbered lines give player’s ID and name. The line right after
gives their state and rating.
The loop is using these regex patterns:
\\|: Splits data at the pipe (|) character./ R:: Splits data at the pattern / R: .P.*$: Removes ‘P’ and everything after it in a
string.# Initializing an empty data frame
player_df <- data.frame(
PlayerID = integer(),
Player = character(),
State = character(),
PreRating = integer(),
stringsAsFactors = FALSE
)
for (i in seq(2, length(filtered_textData) - 1, by = 2)) {
# Splitting each line by the delimiter "|"
line1 <- trimws(unlist(strsplit(filtered_textData[i], "\\|")))
line2 <- trimws(unlist(strsplit(filtered_textData[i+1], "\\|")))
playerID <- as.integer(line1[1])
player <- line1[2]
state <- line2[1]
preRatingString <- strsplit(strsplit(line2[2], "/ R: ")[[1]][2], " ->")[[1]][1]
# Removing 'P' and any characters or digits following it
preRatingString <- gsub("P.*$", "", preRatingString)
preRating <- as.integer(preRatingString)
# Binding player's data to the data frame
player_df <- rbind(player_df, data.frame(
PlayerID = playerID,
Player = player,
State = state,
PreRating = preRating
))
}
head(player_df)
## PlayerID Player State PreRating
## 1 1 GARY HUA ON 1794
## 2 2 DAKSHESH DARURI MI 1553
## 3 3 ADITYA BAJAJ MI 1384
## 4 4 PATRICK H SCHILLING MI 1716
## 5 5 HANSHI ZUO MI 1655
## 6 6 HANSEN SONG OH 1686
Next, setting up schedule_df before making the final table. I’m using a loop to get match details from filtered_textData. Even-numbered lines have player info. Another loop inside this gets the match results and opponent info.
The loop uses these regex patterns:
\\|: Splits data at the pipe (|) character.[0-9]: Checks if there’s any digit in match_data.[^0-9]: Matches any character that’s not a digit.# Initializing an empty data frame
schedule_df <- data.frame(
PlayerID = integer(),
OponentID = integer(),
OponentRating = integer(),
Result = character(),
Points = integer(),
stringsAsFactors = FALSE
)
for (i in seq(2, length(filtered_textData) - 1, by = 2)) {
line1 <- trimws(unlist(strsplit(filtered_textData[i], "\\|")))
playerID <- as.integer(line1[1])
for (j in 4:10) {
match_data <- line1[j]
result <- substr(match_data, 1, 1) # First char is the result
# Extracting opponentID if present, else give it value 0
if (grepl("[0-9]", match_data)) {
opponentID <- as.integer(gsub("[^0-9]", "", match_data))
} else {
opponentID <- 0
}
# Assigning points based on match result
if (result == "W") {
points <- 1
} else if (result == "D" || result == "H") {
points <- 0.5
} else {
points <- 0
}
# Fetching opponent's PreRating, default to 0 if not found
opponentPreRating <- ifelse(length(player_df$PreRating[player_df$PlayerID == opponentID]) > 0,
player_df$PreRating[player_df$PlayerID == opponentID],
0)
# Adding match data to `schedule_df`
schedule_df <- rbind(schedule_df, data.frame(
PlayerID = playerID,
OpponentID = opponentID,
OpponentPreRating = opponentPreRating,
Result = result,
Points = points
))
}
}
head(schedule_df)
## PlayerID OpponentID OpponentPreRating Result Points
## 1 1 39 1436 W 1.0
## 2 1 21 1563 W 1.0
## 3 1 18 1600 W 1.0
## 4 1 14 1610 W 1.0
## 5 1 7 1649 W 1.0
## 6 1 12 1663 D 0.5
Finally, creating the final dataframe final_df with the following criteria:
# Fetching the already existing data
final_df <- player_df[, c("PlayerID", "Player", "State", "PreRating")]
# Calculating games played for each Player ID
games_played <- as.data.frame(table(schedule_df$PlayerID[schedule_df$Result %in% c("W", "D", "L")]))
# Calculating total points for each Player ID
total_points <- aggregate(Points ~ PlayerID, data = schedule_df, sum)
colnames(total_points)[2] <- "TotalPts"
final_df <- merge(final_df, total_points, by = "PlayerID")
# Calculating the average opponent rating for each Player ID
avg_opponent_rating <- aggregate(OpponentPreRating ~ PlayerID, data = schedule_df, FUN = function(x) round(sum(x)/length(x)))
colnames(avg_opponent_rating)[2] <- "AvgOponentRating"
final_df <- merge(final_df, avg_opponent_rating, by = "PlayerID")
# This is the final layout for the chess tournament data
head(final_df)
## PlayerID Player State PreRating TotalPts AvgOponentRating
## 1 1 GARY HUA ON 1794 6.0 1605
## 2 2 DAKSHESH DARURI MI 1553 6.0 1469
## 3 3 ADITYA BAJAJ MI 1384 6.0 1564
## 4 4 PATRICK H SCHILLING MI 1716 5.5 1574
## 5 5 HANSHI ZUO MI 1655 5.5 1501
## 6 6 HANSEN SONG OH 1686 5.0 1519
This piece generates a CSV file using the write.csv() function so the output can be loaded to a sql database:
# Specify the filename and path
file_path <- "chess_tournament_data.csv"
# Write the dataframe to the CSV
write.csv(final_df, file_path, row.names = FALSE)
# Notify the user about the location
message("The CSV file was created at: ", getwd(), "/", file_path)