The tournament file is saved on my github account where I am pulling it using the function “read.delim”. I tried multiple separators, but decided on “\t”.
tournamentInfo <- read.delim("https://raw.githubusercontent.com/jglendrange/DATA607/main/tournamentinfo.txt", header=FALSE, sep="\t")
head(tournamentInfo)
## V1
## 1 -----------------------------------------------------------------------------------------
## 2 Pair | Player Name |Total|Round|Round|Round|Round|Round|Round|Round|
## 3 Num | USCF ID / Rtg (Pre->Post) | Pts | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
## 4 -----------------------------------------------------------------------------------------
## 5 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|
## 6 ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |
Currently all of the fields are stored in 1 field: V1. In the next block of code I clean up the data using regular expressions. Then, once I’m comfortable with how the data is formatted I split V1 into 10 columns by the character “|”.
tournamentInfo <- filter(tournamentInfo, !grepl('-{89}', V1))
tournamentInfo$V1 <- gsub("\\s+\\|", "|", tournamentInfo$V1)
tournamentInfo$V1 <- gsub("\\|\\s+", "|", tournamentInfo$V1)
tournamentInfo$V1 <- sub("\\s+", "", tournamentInfo$V1)
tournamentInfo$V1 <- sub("\\|$", "", tournamentInfo$V1)
tournamentInfo <- data.frame(do.call("rbind", strsplit(as.character(tournamentInfo$V1), "|", fixed=TRUE)))
## Warning in rbind(c("Pair", "Player Name", "Total", "Round", "Round", "Round", :
## number of columns of result is not a multiple of vector length (arg 34)
colnames(tournamentInfo) <- c("pair","player_name","total_points","round1","round2","round3","round4","round5", "round6","round7")
tournamentInfo <- tournamentInfo[-1,]
tournamentInfo <- tournamentInfo[-1,]
head(tournamentInfo)
## pair player_name total_points round1 round2 round3 round4
## 3 1 GARY HUA 6.0 W 39 W 21 W 18 W 14
## 4 ON 15445895 / R: 1794 ->1817 N:2 W B W B
## 5 2 DAKSHESH DARURI 6.0 W 63 W 58 L 4 W 17
## 6 MI 14598900 / R: 1553 ->1663 N:2 B W B W
## 7 3 ADITYA BAJAJ 6.0 L 8 W 61 W 25 W 21
## 8 MI 14959604 / R: 1384 ->1640 N:2 W B W B
## round5 round6 round7
## 3 W 7 D 12 D 4
## 4 W B W
## 5 W 16 W 20 W 7
## 6 B W B
## 7 W 11 W 13 W 12
## 8 W B W
The next problem we face is 1 chess players data is spread across 2 rows. In the next chunk I split the dataframe into 2. The first contains the chess player information and the second contains their state and elo information. In the elo_rating table I’m only interested in the elo_rating and the state, so I remove all the other columns
players <- filter(tournamentInfo, !grepl('\\d+', player_name))
elo_rating <- filter(tournamentInfo, grepl('\\d+', player_name))
elo_rating <- elo_rating %>% select(1,2)
colnames(elo_rating) <- c('state','elo_rating')
head(players)
## pair player_name total_points round1 round2 round3 round4 round5
## 1 1 GARY HUA 6.0 W 39 W 21 W 18 W 14 W 7
## 2 2 DAKSHESH DARURI 6.0 W 63 W 58 L 4 W 17 W 16
## 3 3 ADITYA BAJAJ 6.0 L 8 W 61 W 25 W 21 W 11
## 4 4 PATRICK H SCHILLING 5.5 W 23 D 28 W 2 W 26 D 5
## 5 5 HANSHI ZUO 5.5 W 45 W 37 D 12 D 13 D 4
## 6 6 HANSEN SONG 5.0 W 34 D 29 L 11 W 35 D 10
## round6 round7
## 1 D 12 D 4
## 2 W 20 W 7
## 3 W 13 W 12
## 4 W 19 D 1
## 5 W 14 W 17
## 6 W 27 W 21
head(elo_rating)
## state elo_rating
## 1 ON 15445895 / R: 1794 ->1817
## 2 MI 14598900 / R: 1553 ->1663
## 3 MI 14959604 / R: 1384 ->1640
## 4 MI 12616049 / R: 1716 ->1744
## 5 MI 14601533 / R: 1655 ->1690
## 6 OH 15055204 / R: 1686 ->1687
In the elo rating field I have both the pre and post elo rating. Since I am only interested in each players pre rating I extract it using regular expressions. Then I add the 2 fields to the players table. From now on I will only be using the players table.
elo_rating$elo_rating <- gsub(".*R:\\s+(\\d+).*->.+", "\\1", elo_rating$elo_rating)
players$elo_rating <- elo_rating$elo_rating
players$state <- elo_rating$state
head(players)
## pair player_name total_points round1 round2 round3 round4 round5
## 1 1 GARY HUA 6.0 W 39 W 21 W 18 W 14 W 7
## 2 2 DAKSHESH DARURI 6.0 W 63 W 58 L 4 W 17 W 16
## 3 3 ADITYA BAJAJ 6.0 L 8 W 61 W 25 W 21 W 11
## 4 4 PATRICK H SCHILLING 5.5 W 23 D 28 W 2 W 26 D 5
## 5 5 HANSHI ZUO 5.5 W 45 W 37 D 12 D 13 D 4
## 6 6 HANSEN SONG 5.0 W 34 D 29 L 11 W 35 D 10
## round6 round7 elo_rating state
## 1 D 12 D 4 1794 ON
## 2 W 20 W 7 1553 MI
## 3 W 13 W 12 1384 MI
## 4 W 19 D 1 1716 MI
## 5 W 14 W 17 1655 MI
## 6 W 27 W 21 1686 OH
Here I define 2 functions. The first, “extract_pair”, uses regular expressions to pull out the pair ID and then filter the plaeyers table to grab the corresponding elo rating. The second grabs the game status (Win, Lose, Draw, etc) and return either 1 or 0 depending on if they actually played a game.
extract_pair <- function(a) {
extracted_pair <- str_replace(a,"[A-Z]\\s+(\\d+)", "\\1")
if (extracted_pair == "B" | extracted_pair == "H" | extracted_pair == "U" | extracted_pair == "X") {
return(0)
}
else {
return(as.integer(filter(players, pair == as.integer(extracted_pair))$elo_rating))
}
}
extract_status <- function(a) {
val <- str_replace(a,"([A-Z])\\s+\\d+", "\\1")
if (val == "W" | val == "L" | val == "D") {
return(1)
}
else {
return(0)
}
}
In a for loop I iterate through each row in the players data frame and use the two functions I defined on each round. I sum up all the returned results and store them in 2 separate lists. Then, after the for loop I divide the 2 lists to get the average elo of opponents faced.
games_played <- c()
total_elo <- c()
for (i in 1:nrow(players)) {
total_sum <- extract_status(players[i,]$round1) + extract_status(players[i,]$round2) + extract_status(players[i,]$round3) + extract_status(players[i,]$round4) + extract_status(players[i,]$round5) + extract_status(players[i,]$round6) + extract_status(players[i,]$round7)
games_played <- c(games_played, total_sum)
elo_sum <- extract_pair(players[i,]$round1) + extract_pair(players[i,]$round2) + extract_pair(players[i,]$round3) + extract_pair(players[i,]$round4) + extract_pair(players[i,]$round5) + extract_pair(players[i,]$round6) + extract_pair(players[i,]$round7)
total_elo <- c(total_elo, elo_sum)
}
average_faced_rating <- round(total_elo / games_played)
Lastly, I select the columns I’m interested in and add the field “average_faced_rating” to the table. With my completed table I write the results to a csv.
cleaned_tournament_info <- players %>% select(2,12,3,11)
cleaned_tournament_info$avg_opponent_rating <- average_faced_rating
write.csv(cleaned_tournament_info,"./tournamentResults.csv")
head(cleaned_tournament_info)
## player_name state total_points elo_rating avg_opponent_rating
## 1 GARY HUA ON 6.0 1794 1605
## 2 DAKSHESH DARURI MI 6.0 1553 1469
## 3 ADITYA BAJAJ MI 6.0 1384 1564
## 4 PATRICK H SCHILLING MI 5.5 1716 1574
## 5 HANSHI ZUO MI 5.5 1655 1501
## 6 HANSEN SONG OH 5.0 1686 1519