For Project 1, we’re given a text file with chess tournament results where the information has some structure.
The below code is creating an R Markdown file that generates a .CSV
file (that could for example be imported into a SQL database) with the
following information for all of the players:
Player’s Name, Player’s State, Total Number of Points, Player’s
Pre-Rating, and Average Pre Chess Rating of Opponents
# Fetching chess tournament data
url <- "https://raw.githubusercontent.com/hbedros/DATA607_Proj1/main/chessTextData.txt"
textData <- suppressWarnings(readLines(url))
First, we’re removing separator lines unnecessary headers, and unwanted elements from textData and creating a new vector called filtered_textData.
filtered_textData <- textData[!grepl("----", textData) & nchar(trimws(textData)) > 0 & !grepl("Pair | Player", textData)]
head(filtered_textData)
## [1] " Num | USCF ID / Rtg (Pre->Post) | Pts | 1 | 2 | 3 | 4 | 5 | 6 | 7 | "
## [2] " 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|"
## [3] " ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |"
## [4] " 2 | DAKSHESH DARURI |6.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7|"
## [5] " MI | 14598900 / R: 1553 ->1663 |N:2 |B |W |B |W |B |W |B |"
## [6] " 3 | ADITYA BAJAJ |6.0 |L 8|W 61|W 25|W 21|W 11|W 13|W 12|"
Next, making a player_df dataframe from raw data
using a loop. This loop looks at filtered_textData.
Even-numbered lines give player’s ID and name. The line right after
gives their state and rating.
The loop is using these regex patterns:
\\|: Splits data at the pipe (|) character./ R:: Splits data at the pattern / R: .P.*$: Removes ‘P’ and everything after it in a
string.# Initializing an empty data frame
player_df <- data.frame(
PlayerID = integer(),
Player = character(),
State = character(),
PreRating = integer(),
stringsAsFactors = FALSE
)
for (i in seq(2, length(filtered_textData) - 1, by = 2)) {
# Splitting each line by the delimiter "|"
line1 <- trimws(unlist(strsplit(filtered_textData[i], "\\|")))
line2 <- trimws(unlist(strsplit(filtered_textData[i+1], "\\|")))
playerID <- as.integer(line1[1])
player <- line1[2]
state <- line2[1]
preRatingString <- strsplit(strsplit(line2[2], "/ R: ")[[1]][2], " ->")[[1]][1]
# Removing 'P' and any characters or digits following it
preRatingString <- gsub("P.*$", "", preRatingString)
preRating <- as.integer(preRatingString)
# Binding player's data to the data frame
player_df <- rbind(player_df, data.frame(
PlayerID = playerID,
Player = player,
State = state,
PreRating = preRating
))
}
head(player_df)
## PlayerID Player State PreRating
## 1 1 GARY HUA ON 1794
## 2 2 DAKSHESH DARURI MI 1553
## 3 3 ADITYA BAJAJ MI 1384
## 4 4 PATRICK H SCHILLING MI 1716
## 5 5 HANSHI ZUO MI 1655
## 6 6 HANSEN SONG OH 1686
Next, setting up schedule_df before making the final table. I’m using a loop to get match details from filtered_textData. Even-numbered lines have player info. Another loop inside this gets the match results and opponent info.
The loop uses these regex patterns:
\\|: Splits data at the pipe (|) character.[0-9]: Checks if there’s any digit in match_data.[^0-9]: Matches any character that’s not a digit.# Initializing an empty data frame
schedule_df <- data.frame(
PlayerID = integer(),
OponentID = integer(),
OponentRating = integer(),
Result = character(),
Points = integer(),
stringsAsFactors = FALSE
)
for (i in seq(2, length(filtered_textData) - 1, by = 2)) {
line1 <- trimws(unlist(strsplit(filtered_textData[i], "\\|")))
playerID <- as.integer(line1[1])
for (j in 4:10) {
match_data <- line1[j]
result <- substr(match_data, 1, 1) # First char is the result
# Extracting opponentID if present, else give it value 0
if (grepl("[0-9]", match_data)) {
opponentID <- as.integer(gsub("[^0-9]", "", match_data))
} else {
opponentID <- 0
}
# Assigning points based on match result
if (result == "W") {
points <- 1
} else if (result == "D" || result == "H") {
points <- 0.5
} else {
points <- 0
}
# Fetching opponent's PreRating, default to 0 if not found
opponentPreRating <- ifelse(length(player_df$PreRating[player_df$PlayerID == opponentID]) > 0,
player_df$PreRating[player_df$PlayerID == opponentID],
0)
# Adding match data to `schedule_df`
schedule_df <- rbind(schedule_df, data.frame(
PlayerID = playerID,
OpponentID = opponentID,
OpponentPreRating = opponentPreRating,
Result = result,
Points = points
))
}
}
head(schedule_df)
## PlayerID OpponentID OpponentPreRating Result Points
## 1 1 39 1436 W 1.0
## 2 1 21 1563 W 1.0
## 3 1 18 1600 W 1.0
## 4 1 14 1610 W 1.0
## 5 1 7 1649 W 1.0
## 6 1 12 1663 D 0.5
Finally, creating the final dataframe final_df with the following criteria:
# Fetching the already existing data
final_df <- player_df[, c("PlayerID", "Player", "State", "PreRating")]
# Calculating games played for each Player ID
games_played <- as.data.frame(table(schedule_df$PlayerID[schedule_df$Result %in% c("W", "D", "L")]))
# Calculating total points for each Player ID
total_points <- aggregate(Points ~ PlayerID, data = schedule_df, sum)
colnames(total_points)[2] <- "TotalPts"
final_df <- merge(final_df, total_points, by = "PlayerID")
# Calculating the average opponent rating for each Player ID
avg_opponent_rating <- aggregate(OpponentPreRating ~ PlayerID, data = schedule_df, FUN = function(x) round(sum(x)/length(x)))
colnames(avg_opponent_rating)[2] <- "AvgOponentRating"
final_df <- merge(final_df, avg_opponent_rating, by = "PlayerID")
# This is the final layout for the chess tournament data
head(final_df)
## PlayerID Player State PreRating TotalPts AvgOponentRating
## 1 1 GARY HUA ON 1794 6.0 1605
## 2 2 DAKSHESH DARURI MI 1553 6.0 1469
## 3 3 ADITYA BAJAJ MI 1384 6.0 1564
## 4 4 PATRICK H SCHILLING MI 1716 5.5 1574
## 5 5 HANSHI ZUO MI 1655 5.5 1501
## 6 6 HANSEN SONG OH 1686 5.0 1519
This piece generates a CSV file using the write.csv() function so the output can be loaded to a sql database:
# Specify the filename and path
file_path <- "chess_tournament_data.csv"
# Write the dataframe to the CSV
write.csv(final_df, file_path, row.names = FALSE)
# Notify the user about the location
message("The CSV file was created at: ", getwd(), "/", file_path)
Based on difference in ratings between the chess players and each of their opponents in our Project 1 tournament, calculate each player’s expected score (e.g. 4.3) and the difference from their actual score (e.g 4.0). List the five players who most overperformed relative to their expected score, and the five players that most underperformed relative to their expected score.
You’ll find some small differences in different implementation of ELO formulas. You may use any reasonably-sourced formula, but please cite your source.
library(elo)
# Using the elo.prob() function for expected score
# Reference for the ELO computation: https://cran.r-project.org/web/packages/elo/vignettes/intro.html
schedule_df$ExpectedScore <- mapply(function(player_rating, opponent_rating) {
elo.prob(player_rating, opponent_rating)
}, player_df$PreRating[schedule_df$PlayerID], schedule_df$OpponentPreRating)
# Aggregating the expected scores for each player
expected_scores <- aggregate(ExpectedScore ~ PlayerID, data = schedule_df, sum)
colnames(expected_scores)[2] <- "ExpectedTotalPts"
final_df <- merge(final_df, expected_scores, by = "PlayerID")
# Calculating the difference between actual and expected scores
final_df$ScoreDifference <- final_df$TotalPts - final_df$ExpectedTotalPts
# Identifying top performers and underperformers
overperformers <- final_df[order(-final_df$ScoreDifference), ][1:5, ]
underperformers <- final_df[order(final_df$ScoreDifference), ][1:5, ]
overperformers
## PlayerID Player State PreRating TotalPts AvgOponentRating
## 3 3 ADITYA BAJAJ MI 1384 6.0 1564
## 15 15 ZACHARY JAMES HOUGHTON MI 1220 4.5 1484
## 10 10 ANVIT RAO MI 1365 5.0 1554
## 46 46 JACOB ALEXANDER LAVALLEY MI 377 3.0 1358
## 9 9 STEFANO LEE ON 1411 5.0 1523
## ExpectedTotalPts ScoreDifference
## 3 1.94508791 4.054912
## 15 1.37330887 3.126691
## 10 1.94485405 3.055146
## 46 0.04324981 2.956750
## 9 2.28654888 2.713451
underperformers
## PlayerID Player State PreRating TotalPts AvgOponentRating
## 62 62 ASHWIN BALAJI MI 1530 1.0 169
## 53 53 JOSE C YBARRA MI 1393 2.0 577
## 54 54 LARRY HODGE MI 1270 1.0 1034
## 41 41 KYLE WILLIAM MURPHY MI 1403 2.0 713
## 25 25 LOREN SCHWIEBERT MI 1745 3.5 1363
## ExpectedTotalPts ScoreDifference
## 62 6.877807 -5.877807
## 53 5.718298 -3.718298
## 54 4.398246 -3.398246
## 41 5.363915 -3.363915
## 25 6.275650 -2.775650
The above code performs a comprehensive analysis of a chess tournament dataset. It begins by fetching the raw data and then processes this data to derive relevant player information and their match details.
By leveraging the ELO rating system (referenced from the CRAN documentation), the code calculates the expected scores of each player based on their rating differences with opponents. The results highlight players who overperformed and underperformed relative to their expected scores.