In this analysis we will be taking a close look at a cross table, or scoring sheet, for a particular chess tournament. In particular, we will be highlighting a particular competitor scoring system called “ELO.” ELO is a method that allocated points to the participants base on their relative rating. Using the ELO scores from the start of the tournament, we can calculate the expected outcome of the tournament and make comparisons to the actual result. Ultimately, we want to give credit to the chess players who beat the odds, and we want to examine how effective the ELO rating is for predictions.
Pull data down from repository.
Extract the variables using regular expressions to build our data frame.
Transform the data frame to facilitate competitor comparisons.
Add calculated fields related to ELO rating.
Give credit to the tournaments over-performers.
Examine the accuracy of ELO predictions.
We will start with a raw text file stored on a git repository.
library(RCurl)
library(stringr)
library(ggplot2)
library(tidyverse)
getchessdata <- getURL('https://raw.githubusercontent.com/mtelab1/Chess_ELO/master/tournamentinfo.webarchive.txt')
tourny_results <- read.csv(text = getchessdata)
Given the current state of the text. We will develope the regular expressions necessary to extract the relevant vectors of data.
rxname <- "[[:digit:]]+[[:space:]]+[|]([:space:]+[:alpha:]+){2,}"
rxstate <- "(?<=[:space:])[:alpha:]{2}(?=[:space:][|])"
rxtotalpoints <-"[:digit:][:punct:][:digit:]"
rxplayerstartingscore <- "(?<=R:[:space:]{1,2})[:digit:]+"
rxplayersfinalscore <- "(?<=->[:space:]?)[:digit:]+"
rxrnds <- "(?<=[:digit:][:punct:][:digit:][:space:]{1,2}[|]{1}).{40,}"
Here we use the regular expressions to create the vectors.
player_name <- unlist(str_extract_all(getchessdata,rxname))
player_state <- unlist(str_extract_all(getchessdata,rxstate))
total_points <- unlist(str_extract_all(getchessdata,rxtotalpoints))
player_starting_score <- unlist(str_extract_all(getchessdata,rxplayerstartingscore))
player_final_score <- unlist(str_extract_all(getchessdata,rxplayersfinalscore))
rounds <- unlist(str_extract_all(getchessdata,rxrnds))
Now we continue by combining the above vectors into a data frame for further processing.
chess_cross_table <- data.frame(
player_name,
player_state,
player_starting_score,
player_final_score,
total_points,
rounds)
The rounds were original collected above as one column, so now we need to parse those rounds out.
col_names <- c("round_1","round_2","round_3","round_4","round_5","round_6","round_7")
chess_cross_table <- chess_cross_table%>%
separate(rounds,col_names,sep = "[|]")
The analysis is more efficient with elongated data, so we want to parse out the “round” columns to their own records.
chess_cross_table_pivot <- chess_cross_table%>%
pivot_longer(col_names,names_to = "round",values_to = "result")
result_col_names <- c("result","opp_id")
id_col_names <- c("id","player_name")
chess_cross_table_pivot <- chess_cross_table_pivot%>%
separate(result,result_col_names)%>%
separate(player_name,id_col_names, sep = " [|]")
Here we create a side table for the purpose of joining back to the original data with the opponent information.
chessinfomergetable <- chess_cross_table_pivot%>%
select(id, player_state, player_starting_score,player_final_score)%>%
distinct()%>%
rename(opp_id=id,
opp_player_state = player_state ,
opp_player_starting_score = player_starting_score,
opp_player_final_score = player_final_score)
chess_cross_table_comp<- merge(chess_cross_table_pivot,chessinfomergetable,all = TRUE)%>%
mutate(starting_rating_diff = as.numeric(player_starting_score) - as.numeric(opp_player_starting_score))
str(chess_cross_table_comp)
## 'data.frame': 448 obs. of 13 variables:
## $ opp_id : chr "" "" "" "" ...
## $ id : chr "56" "53" "41" "53" ...
## $ player_name : chr " MARISA RICCI" " JOSE C YBARRA" " KYLE WILLIAM MURPHY" " JOSE C YBARRA" ...
## $ player_state : chr "MI" "MI" "MI" "MI" ...
## $ player_starting_score : chr "1153" "1393" "1403" "1393" ...
## $ player_final_score : chr "1140" "1359" "1341" "1359" ...
## $ total_points : chr "2.0" "2.0" "3.0" "2.0" ...
## $ round : chr "round_5" "round_1" "round_5" "round_3" ...
## $ result : chr "H" "H" "X" "H" ...
## $ opp_player_state : chr NA NA NA NA ...
## $ opp_player_starting_score: chr NA NA NA NA ...
## $ opp_player_final_score : chr NA NA NA NA ...
## $ starting_rating_diff : num NA NA NA NA NA NA NA NA NA NA ...
Now that we have the data in the format we want, we can summarize the opponent statistics.
chess_cross_table_smry <- chess_cross_table_comp%>%
mutate(opp_player_starting_score = as.numeric(opp_player_starting_score), opp_player_final_score = as.numeric(opp_player_final_score), player_starting_score = as.numeric(player_starting_score), player_final_score = as.numeric(player_final_score), id = as.integer(id), total_points = as.numeric(total_points))%>%
#Filter out the records with no match played
filter(!is.na(opp_player_state))%>%
#Group by player
group_by(id,player_name, player_starting_score,player_final_score, total_points)%>%
#Add average opponent rating and number of matches
summarise(avg_opp_starting_score = round(mean(opp_player_starting_score),2), avg_opp_final_score = round(mean(opp_player_final_score),2), num_of_matches = length(round))
Using the definitions and formulas found in the link below, we move forward generating the ELO expected score, which is essentially the probability of winning a match. With the expected score based on the average opponent rating, we can predict the expected point by multiplying by number of matches. link
chess_cross_table_smry <- chess_cross_table_smry%>%
mutate('expected_score' = round((1/(1+10^((avg_opp_starting_score-player_starting_score)/400))),2))%>%
mutate('expected_points' = round( expected_score * num_of_matches,2))
chess_cross_table_smry <- chess_cross_table_smry%>%
mutate(percent_deviation = round((total_points - expected_points)/total_points,2))
The below plot shows the distribution of the prediction accuracy, by looking at the total performance by participant. By plotting the total score prediction/outcome difference, we see a cluster roughly around 0 which is expected; those are the outcomes that are very similar to the predictions.
As we move closer the the right(towards 1+), we see the heros of the tournament. These participants score way more points than expected. To keep things positive, we wont dwell on the left side of the graph.
ggplot(chess_cross_table_smry, aes(x= percent_deviation))+ geom_histogram(binwidth = .1)
Here we examine the results by only looking at the underdog matches individually. In other words, from the perspective of the participant in a match with a lower starting rating.
As we can see the proportion of upsets does indeed get larger as the ELO rating difference gets smaller.
chess_wins <- chess_cross_table_comp %>%
filter(starting_rating_diff <0)
ggplot(data = chess_wins, mapping = aes(x = starting_rating_diff, fill = result)) +
geom_histogram(binwidth = 50)
And finally, we give credit to the overachievers of the tournament. The below participants did the best in terms of beating their ELO expectation.
chess_cross_table_smry%>%
subset(select = c(player_name, total_points, expected_points, percent_deviation))%>%
arrange(desc(percent_deviation))%>%
head(10)
## # A tibble: 10 x 4
## # Groups: player_name [10]
## player_name total_points expected_points percent_deviation
## <chr> <dbl> <dbl> <dbl>
## 1 " JACOB ALEXANDER LAVALLEY" 3 0 1
## 2 " ETHAN GUO" 2.5 0.28 0.89
## 3 " AMIYATOSH PWNANANDAM" 3.5 0.45 0.87
## 4 " VIRAJ MOHILE" 2 0.36 0.82
## 5 " SEAN M MC CORMICK" 2 0.36 0.82
## 6 " ZACHARY JAMES HOUGHTON" 4.5 1.26 0.72
## 7 " ADITYA BAJAJ" 6 1.82 0.7
## 8 " TEJAS AYYAGARI" 2.5 0.84 0.66
## 9 " ANVIT RAO" 5 1.75 0.65
## 10 " JULIA SHEN" 1.5 0.55 0.63
After calculating expected outcomes based on the ELO rating, it is clear that the rating system is delivering higher scores to the most effective chess players. However as the expression goes, “there is a reason we play the game.” This tournament was riddled with upsets. If I were a chess coach, I would utilize the above analysis to motivate my chess player if he has a particularly low ELO.
“Beware of barking at underdogs; don’t fight with people who have nothing to lose.” – Dory Previn
write.csv(chess_cross_table_comp,"chess_cross-table.csv", row.names = FALSE)
write.csv(chess_cross_table_smry,"chess_cross-table-summary.csv", row.names = FALSE)