# load dplyr library
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
To evaluate how the chess players from the Project 1 data performed relative to expectation, I will use the Elo expected score formula as implemented here: https://mattmazzola.medium.com/implementing-the-elo-rating-system-a085f178e065. I will loop through each player to calculate their expected score per game, sum those into a total expected score, and take the difference between each player's actual and expected scores. Finally, I will sort the players by that difference to identify the biggest overperformers and underperformers.
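For reference, the expected score formula can be written out as below. The symbols are notation assumed here rather than taken from the linked article: $R_p$ is the player's pre-tournament rating, $R_{o,i}$ is the rating of the opponent in game $i$, $n$ is the number of games played, and 400 is the conventional Elo scale factor.

```latex
E_i = \frac{1}{1 + 10^{(R_{o,i} - R_p)/400}},
\qquad
E_{\text{total}} = \sum_{i=1}^{n} E_i
```

A player rated the same as their opponent has $E_i = 0.5$, and the value approaches 1 as the player's rating advantage grows.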
The data needed for this assignment can be found in the .CSV file generated from last week’s Project 1.
# read CSV
df <- read.csv("tournament_info.csv")
The loop mentioned in my approach turns out to be unnecessary, because the dataset already includes each player's average opponent rating. The Elo formula can therefore be applied directly to create a new expected score column. Note that using the average opponent rating approximates, rather than exactly reproduces, the sum of per-game expected scores.
# Elo formula: expected score per game against the average opponent,
# multiplied by 7 (this assumes every player played 7 games)
df$expected_score <- 1 / (1 + 10^((df$avg_opps_rating - df$pre_rating) / 400)) * 7
head(df)
                 name state total_points pre_rating avg_opps_rating expected_score
1            GARY HUA    ON          6.0       1794        1605.286       5.233826
2     DAKSHESH DARURI    MI          6.0       1553        1469.286       4.327372
3        ADITYA BAJAJ    MI          6.0       1384        1563.571       1.836577
4 PATRICK H SCHILLING    MI          5.5       1716        1573.571       4.859483
5          HANSHI ZUO    MI          5.5       1655        1500.857       4.958354
6         HANSEN SONG    OH          5.0       1686        1518.714       5.066018
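Because the Elo formula is nonlinear, applying it once to the average opponent rating is not guaranteed to match the sum of per-game expected scores. The sketch below compares the two approaches. It assumes a hypothetical helper, `elo_expected()`, and uses made-up ratings, since the per-game opponent ratings are not part of tournament_info.csv.

```r
# Hypothetical helper: Elo expected score for a single game.
# `scale` defaults to 400, the value used in the formula above.
elo_expected <- function(player_rating, opp_rating, scale = 400) {
  1 / (1 + 10^((opp_rating - player_rating) / scale))
}

# Made-up ratings for illustration only (not taken from the tournament data)
player_rating <- 1500
opp_ratings <- c(1000, 1400, 1450, 1500, 1550, 1600, 1700)

# Per-game approach: one expected score per opponent, then summed
per_game_total <- sum(elo_expected(player_rating, opp_ratings))

# Shortcut: formula applied once to the average opponent rating,
# multiplied by the number of games played
shortcut_total <- length(opp_ratings) *
  elo_expected(player_rating, mean(opp_ratings))

print(c(per_game = per_game_total, shortcut = shortcut_total))
```

With these made-up ratings the two totals differ by roughly a quarter of a point. They coincide when all opponents share the same rating.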
A difference column is also needed to identify the top 5 overperformers and underperformers.
# difference between actual and expected score
df$difference <- df$total_points - df$expected_score
head(df)
                 name state total_points pre_rating avg_opps_rating expected_score  difference
1            GARY HUA    ON          6.0       1794        1605.286       5.233826  0.76617424
2     DAKSHESH DARURI    MI          6.0       1553        1469.286       4.327372  1.67262801
3        ADITYA BAJAJ    MI          6.0       1384        1563.571       1.836577  4.16342302
4 PATRICK H SCHILLING    MI          5.5       1716        1573.571       4.859483  0.64051685
5          HANSHI ZUO    MI          5.5       1655        1500.857       4.958354  0.54164582
6         HANSEN SONG    OH          5.0       1686        1518.714       5.066018 -0.06601795
# top 5 overperformers (largest positive difference)
df %>%
  arrange(desc(difference)) %>%
  slice_head(n = 5)
                      name state total_points pre_rating avg_opps_rating expected_score difference
1             ADITYA BAJAJ    MI          6.0       1384        1563.571     1.83657698   4.163423
2   ZACHARY JAMES HOUGHTON    MI          4.5       1220        1483.857     1.25738160   3.242618
3                ANVIT RAO    MI          5.0       1365        1554.143     1.76291836   3.237082
4 JACOB ALEXANDER LAVALLEY    MI          3.0        377        1357.714     0.02464793   2.975352
5     AMIYATOSH PWNANANDAM    MI          3.5        980        1384.800     0.62055841   2.879442
# top 5 underperformers (largest negative difference)
df %>%
  arrange(difference) %>%
  slice_head(n = 5)
                name state total_points pre_rating avg_opps_rating expected_score difference
1      ASHWIN BALAJI    MI          1.0       1530        1186.000       6.150935  -5.150935
2   LOREN SCHWIEBERT    MI          3.5       1745        1363.286       6.300063  -2.800063
3 GEORGE AVERY JONES    ON          3.5       1522        1144.143       6.285951  -2.785951
4     GAURAV GIDWANI    MI          3.5       1552        1221.667       6.090469  -2.590469
5   CHIEDOZIE OKORIE    MI          3.5       1602        1313.500       5.882361  -2.382361
The Elo formula is an interesting method of predicting score outcomes that I did not know existed prior to this assignment. I did not expect the differences for the top overperformers and underperformers to be so large. It makes me wonder how accurate this rating system is in practice, and whether such large gaps between expected and earned scores are common in chess tournaments. To extend this work, one might recompute the expected scores with a different scale factor in place of 400. I would predict that a larger scale factor produces less extreme expected scores, because each rating difference is divided by a larger number and so has less effect on the result, while a smaller scale factor would do the opposite. It would be interesting to see how the values in the difference column change as a result.
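As a starting point for that extension, the sketch below recomputes one player's expected score under three scale factors. The helper `expected_total()` and its `scale` and `games` parameters are hypothetical names introduced for illustration. The defaults of 400 and 7 match the formula used earlier, and the ratings are those shown for ADITYA BAJAJ in the output above.

```r
# Hypothetical helper: total expected score with an adjustable scale factor.
# `scale` and `games` are illustrative parameters, not part of the original analysis.
expected_total <- function(pre_rating, avg_opps_rating, scale = 400, games = 7) {
  games / (1 + 10^((avg_opps_rating - pre_rating) / scale))
}

# Ratings for ADITYA BAJAJ, taken from the output above
pre_rating <- 1384
avg_opps_rating <- 1563.571

# Try a smaller, the original, and a larger scale factor
scales <- c(200, 400, 800)
totals <- sapply(scales, function(s) {
  expected_total(pre_rating, avg_opps_rating, scale = s)
})
names(totals) <- scales

print(round(totals, 3))
```

In this sketch the expected score moves toward 3.5 (half of the 7 games) as the scale factor grows, which is consistent with the prediction that larger scale factors give less extreme values.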