Chess ELO Table - Clean and Extract
Using tidyverse, ggplot2, and DT
I collaborated with Neil Shah on this project for some of the regex statements as well as general cleaning strategy. Once I completed this project I did a code review with Angel Claudio to see how we each approached the problem. Lastly, Sam Bellows was generous enough to provide a real, recent chess cross table from a tournament he competed in so that I could test the robustness of my code.
The crosstable information is stored in a .txt
file and will be scrubbed of all formatting and presented in a data frame of the following columns:
Name, State, Points Scored, Prior Rank, Average Opponent Ranking
General Approach
First I am going to develop the regex expressions necessary to pull the desired information from the .txt
file. Once I get the general information, I will use the data frame features of R to further clean and prepare the data in pursuit of the final goal.
Initial Formatting
There is an alternative link provided below to a real chess ELO table from a recent tournament. The table is much less well-organized and has several missing values. This code will also correctly run that data. To view those results, please use the second link labeled challenge_data
.
original_data<-'https://raw.githubusercontent.com/Shampjeff/cuny_msds/master/DATA_607/projects/Project_1/tournamentinfo.txt'
challenge_data<-'https://raw.githubusercontent.com/Shampjeff/cuny_msds/master/DATA_607/projects/Project_1/USATE_crosstable.txt'
data<-readLines(
original_data,
warn = F)
# This is a base R function to remove the "-" in every two lines
data<- data[data != strrep("-",89)]
data<-data[1:length(data)]
data<-unlist(lapply(data, function(x) x[!x %in% ""]))
data[1:10]
## [1] " Pair | Player Name |Total|Round|Round|Round|Round|Round|Round|Round| "
## [2] " Num | USCF ID / Rtg (Pre->Post) | Pts | 1 | 2 | 3 | 4 | 5 | 6 | 7 | "
## [3] " 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|"
## [4] " ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |"
## [5] " 2 | DAKSHESH DARURI |6.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7|"
## [6] " MI | 14598900 / R: 1553 ->1663 |N:2 |B |W |B |W |B |W |B |"
## [7] " 3 | ADITYA BAJAJ |6.0 |L 8|W 61|W 25|W 21|W 11|W 13|W 12|"
## [8] " MI | 14959604 / R: 1384 ->1640 |N:2 |W |B |W |B |W |B |W |"
## [9] " 4 | PATRICK H SCHILLING |5.5 |W 23|D 28|W 2|W 26|D 5|W 19|D 1|"
## [10] " MI | 12616049 / R: 1716 ->1744 |N:2 |W |B |W |B |W |B |B |"
Regex
Using regex to create vectors for data fields that will go into the data frame. The USCF id number was not needed in this case, but I wanted to extract it in the event that I might need it later.
## [1] "15445895" "14598900" "14959604" "12616049" "14601533" "15055204"
## [7] "11146376" "15142253" "14954524" "14150362"
The final score of each player at the end of the tournament.
## [1] "6.0" "6.0" "6.0" "5.5" "5.5" "5.0" "5.0" "5.0" "5.0" "5.0"
The home state of each player.
state_regex<-"\\s\\s.*\\|\\s[:digit:]{2,}"
states<-unlist(str_extract_all(data, state_regex))
states<-str_remove_all(states, "\\|.*")
states[1:10]
## [1] " ON " " MI " " MI " " MI " " MI " " OH " " MI " " MI "
## [9] " ON " " MI "
I wanted to be sure to get everyone’s full name. Many players had multiple names, and I know how annoying it is to see everyone with only first and last names have their names displayed but not the person with three (or more). For example, the player Dinh Dang Bui; Bui is likely the last name (it’s a very common Vietnamese last name) and Dinh Dang would be the first name. I want to capture this player’s full name
name_regex<-"[0-9] \\| [:alpha:].*\\|*"
name<- unlist(str_extract_all(data, name_regex))
name<- str_remove_all(name, "[:digit:]\\s\\|")
name<- str_remove_all(name, "\\|.*")
name[1:10]
## [1] " GARY HUA " " DAKSHESH DARURI "
## [3] " ADITYA BAJAJ " " PATRICK H SCHILLING "
## [5] " HANSHI ZUO " " HANSEN SONG "
## [7] " GARY DEE SWATHELL " " EZEKIEL HOUGHTON "
## [9] " STEFANO LEE " " ANVIT RAO "
The rank of each player is that player’s USCF pre-touranment numerical value. The cross table also shows the player’s new rank after the tournament. One thing I learned from the cross table from a recent tournament is that it is very helpful to start with a large string extract command and then iteratively break it down. Below is an example of that procedure where I start with a long string line (everything after a “/”) and progressively take it down to the pre-tournament rank. This way I catch all the provisional ranks, as well as any non-existent rankings.
rank_regex<-"/.*\\|"
rank<-unlist(str_extract_all(data, rank_regex))
rank<-str_remove_all(rank, "->.*")
rank<-str_remove_all(rank, "R:")
rank<-str_remove_all(rank, "[P].*")
rank<-str_remove_all(rank, "/")
rank<-str_remove_all(rank, "\\D")
rank<-rank[2:length(rank)]
rank[1:10]
## [1] "1794" "1553" "1384" "1716" "1655" "1686" "1649" "1641" "1411" "1365"
This regex should pull in the outcomes of each match in the form, for example, “|W 39|…” for all seven rounds. Again, it starts with a very long string line and progressively breaks it down to just outcome and opponent number.
games_regex<-"\\|([:upper:]|[:digit:]).*\\|"
games<-unlist(str_extract_all(data, name_regex))
games<-str_remove_all(games, "[:digit:]\\s\\|")
games<-str_remove_all(games, "\\s[:alpha:]+")
games<-str_remove_all(games, "[:blank:]+\\|[:digit:][.][:digit:]")
games<- unlist(str_replace_all(games,"\\|", ","))
games<- strsplit(games, ",")
for (i in 1:length(games)){
games[[i]]<-games[[i]][2:length(games[[i]])]
}
games[1:2]
## [[1]]
## [1] "W 39" "W 21" "W 18" "W 14" "W 7" "D 12" "D 4"
##
## [[2]]
## [1] "W 63" "W 58" "L 4" "W 17" "W 16" "W 20" "W 7"
Outcome and Opponent Values
Below are two functions that work together to extract the outcome of each game per player and the rank of the opponent played. make.game.outcome
returns a list of the outcomes of one game for every player and the rank of the opponent played. make.outcome.opp.data
takes the output of make.game.outcome
and returns a dataframe column for the rank of each oppoent played. It also stores the outcome of each game, if that might be needed in the future.
I will further scrub the matches to the individual Win, Loss, Draw, etc and the opponent number. The following functions separate and organize those results. Pulling out the game outcome was not required, but I wanted to extract and store it for further analysis.
make.game.outcome<- function(games_list, game_num){
game_outcome<- rep(NA, length(games_list))
game_op<- rep(NA, length(games_list))
for (i in 1:length(games_list)){
# This pulls out the win, loss outcome. I'm extracting it now for use later.
game_outcome[i]<-str_extract(games_list[[i]][game_num],
"[WL]{1}")
game_outcome[i]<-str_replace_all(game_outcome[i], "[W]{1}", "1")
game_outcome[i]<-str_replace_all(game_outcome[i], "[L]{1}", "0")
# This pulls out the opponent rank
game_op[i]<-str_extract(games_list[[i]][game_num],
"[0-9]{1,}")
}
result<-list(game_outcome, game_op)
return (result)
}
Dataframe Creation
The following builds the dataframe and recasts some of the columns as integers or numerics.
df<-data.frame(name, states, uscf_id, score, rank,
make.outcome.opp.data(games,1), # this make a column of the opponent played in game 1
make.outcome.opp.data(games,2), # game 2 - 7 follow below
make.outcome.opp.data(games,3),
make.outcome.opp.data(games,4),
make.outcome.opp.data(games,5),
make.outcome.opp.data(games,6),
make.outcome.opp.data(games,7)
)
non_int<-c('name', 'states', 'score')
# The following cast the player ranking values to integers and the point column to numeric.
cols<- colnames(df)
cols<-cols[! cols %in% non_int]
cols <- cols[!is.na(cols)]
df[,cols] <- apply(df[,cols], 2,
function(x) as.integer(as.character(x)))
df$score<-as.numeric(as.character(df$score))
Now for each row I need to replace the opponent number with that opponent’s pre-tournament ranking. The following function iterates through each column and all the rows and swaps the opponent number with their respective pre-tournament rank.
game_cols<- c('game_opp', 'game_opp.1', 'game_opp.2', 'game_opp.3',
'game_opp.4', 'game_opp.5', 'game_opp.6')
make.opp.rank<- function(df,cols){
for (j in cols){
for (i in 1:nrow(df)){
df[[j]][i]<-df$rank[df[[j]][i]]
}
}
return(df)
}
df<-make.opp.rank(df, game_cols)
Once we have the rankings in place, we can calculate the average opponent ranking for each player. I also count the total wins of each player for later data visualization.
Result
Data Visualization
Lastly, I wanted to visualize who in this tournament played better ranked players than themselves. Chess ranks are gained by beating players better than yourself. The reward (in ranking) for beating a better player increases with the larger difference in rank between the two players. Below we can see who, on average, played better players than themselves and as such who had the most to gain from this tournament. Red is the opponent average rank and blue is the rank of the player. So places where red is higher than blue would be a player whose rank is lower than the average rank of their opponents. The bigger the gap in the colors, the more the player has to gain by winning. Conversely, if blue is higher than red, that player has a lot to lose in terms of rank for each match loss. These possible gains are only meaningful if the player actually wins so I added the total wins to the end of the plot.
We see that there are several players who, on average, played and beat higher ranked opponents. Aditya Bajaj, Anvit Rao, and Stefano Lee did very well this tournament as they won many games against better ranked players, whereas Loren Schweibert had only three wins against lower ranked players (on average) and Ashwin Balaji had only one win against lower ranked players.
ggplot(data = df, aes(x=name)) +
geom_col(aes(x=reorder(name, rank), y=rank, color='rank' ),position='dodge',
size=2,alpha=.5, fill='lightblue') +
geom_col(aes(x=name, y=opp_average, color='opponent average'), position='dodge',
size=2, alpha=.3, fill='red') +
geom_text(aes(label = total_wins, y=rank, angle=270), vjust=-1, fontface="bold",size=4) +
coord_flip() +
labs(y = "Ranking",x ="Player", color="legend") +
theme(legend.position = "top")