In Project 1, a text file with chess tournament results was provided. The objective was to create an R Markdown file that generates a .CSV file with the following information for all of the players:
Player’s Name, Player’s State, Total Number of Points, Player’s Pre-Rating, and Average Pre Chess Rating of Opponents
For example:
Gary Hua, ON, 6.0, 1794, 1605
For the extra credit project the objective was based on the ratings between the chess players and each of their opponents to:
I based my calculations on:
Based on these sources, the probability that a player wins can be used as the expected score for that round.
The formula for the probability that a player wins a round is:
\[ P(A wins) = \frac{1}{1+ 10^{(R_B - R_A)/400)}} \]
To calculate a players expected score for the tournament:
A players actual score is the sum of the points from their wins (1 point), their losses (0 points), and their draws (0.5 points).
To take the example of Gary Hua again:
Gary Huas expected score is based on seven rounds of opponents. Gary Huas pre-chess-rating was 1794. His opponents and their pre-chess-ratings are:
Gary Huas expected score for each opponent can be calculated.
P39 <- 1/(1+(10^((1436 - 1794)/400)))
P21 <- 1/(1+(10^((1563 - 1794)/400)))
P18 <- 1/(1+(10^((1600 - 1794)/400)))
P14 <- 1/(1+(10^((1610 - 1794)/400)))
P7 <- 1/(1+(10^((1649 - 1794)/400)))
P12 <- 1/(1+(10^((1663 - 1794)/400)))
P4 <- 1/(1+(10^((1716 - 1794)/400)))
Gary Huas expected score for the entire tournament can be calculated.
P_expected <- P39 + P21 + P18 + P14 + P7 + P12 + P4
P_expected
## [1] 5.161574
His actual score is based on: W, W, W, W, W, D, D which translates to:
P_actual <- 1 + 1 + 1 + 1 + 1 + 0.5 + 0.5
P_actual
## [1] 6
The difference between the actual and expected is
P_d <- P_actual - P_expected
P_d
## [1] 0.8384264
If you wanted to calculate the new rating of a player than the formula is
\[ New Rating = rating + 32(score - expected score) \]
In Gary Huas case his new rating is:
round(1794 + 32*P_d)
## [1] 1821
The result here is 1821, in the score sheet the post score was 1817 - the discrepancy I speculate is due to rounding differences.
Some of the work for Project 1 is repeated here because it provides a basis for the work for the extra credit.
Load required packages.
library(RCurl)
library(tidyverse)
library(readr)
library(kableExtra)
Read in the text file from github and take a look at the start.
x <- getURL("https://raw.githubusercontent.com/klgriffen96/spring23_data607_proj1/main/chess_scores.txt")
glimpse(x)
## chr "-----------------------------------------------------------------------------------------\r\n Pair | Player Nam"| __truncated__
Text files are generally broken into new lines using
\r\n
which can be seen in this file. Start by splitting on
\r\n
so each line is separated.
# Split x
s <- str_split_fixed(x, "\r\n", n=Inf)
# View first ten entries
s[1:10]
## [1] "-----------------------------------------------------------------------------------------"
## [2] " Pair | Player Name |Total|Round|Round|Round|Round|Round|Round|Round| "
## [3] " Num | USCF ID / Rtg (Pre->Post) | Pts | 1 | 2 | 3 | 4 | 5 | 6 | 7 | "
## [4] "-----------------------------------------------------------------------------------------"
## [5] " 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|"
## [6] " ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |"
## [7] "-----------------------------------------------------------------------------------------"
## [8] " 2 | DAKSHESH DARURI |6.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7|"
## [9] " MI | 14598900 / R: 1553 ->1663 |N:2 |B |W |B |W |B |W |B |"
## [10] "-----------------------------------------------------------------------------------------"
From the file extract the:
# Get start row and last PID
for (i in 1:length(s)){
temp <- as.integer(str_extract_all(s[i],"^\\s+[1]\\s+"))
temp2 <- as.integer(str_extract_all(s[i],"^\\s+[0-9]+"))
if (is.na(temp) == FALSE){
start_row <- i # Get the row that the actual games start on
cat("Start row is: ", start_row)
}
if (is.na(temp2) == FALSE){
last_id <- temp2 # Get the PID of the last game
}
}
## Start row is: 5
cat("Last PID is: ",last_id)
## Last PID is: 64
# Get the number of rounds
for (i in 1:length(s)){
temp <- str_extract_all(s[i],"\\s+[0-9]+\\s+\\|\\s+$")
temp <- str_split_fixed(temp, "\\|", n=Inf)
if (temp[1][1] != "character(0)"){
n_rounds <- as.integer(temp[1][1])
cat("Number of rounds: ", n_rounds)
break
}
}
## Number of rounds: 7
Form one dataframe that has the following information:
Form a matrix that has the PID followed by the opponent IDs, with NAs for no opponent for that round.
# Make the dataframes
p_info <- data.frame(
p_id = integer(),
p_name = character(),
p_points = double(),
p_state = character(),
p_prerating = integer()
)
o_ids <- matrix(, nrow = last_id, ncol = n_rounds + 1)
for (i in seq(start_row,length(s),by =3)){
ss_1 <- str_split_fixed(s[i],regex("|", literal=TRUE),n=Inf)
# PID | P Name | Total Points | X Opponent PID | X Opponent PID | etc...
# If X is W,L or D - read Opponent PID, increment total opponents
p_id <- as.integer(ss_1[1])
p_name <- str_trim(ss_1[2],side="both")
p_points <- as.double(ss_1[3])
ss_2 <- str_split_fixed(s[i+1],regex("|", literal=TRUE), n=Inf)
# State | X/R: Player’s Pre-Rating
p_state <- str_trim(ss_2[1],side="both")
temp <- str_extract_all(ss_2[2],"[0-9]+")
p_prerating <- as.integer(temp[[1]][2])
start_i <- 4
start_o <- 2
o_ids[p_id, 1] <- p_id
for (ii in start_i:(length(ss_1)-1)){
o_ids[p_id, start_o] <- as.integer(str_extract(ss_1[ii],"[0-9]+"))
start_o <- start_o + 1
}
p_temp <- data.frame(
p_id,
p_name,
p_points,
p_state,
p_prerating)
p_info <- rbind(p_info, p_temp)
}
Take a look at the dataframe and do a quick check of it.
head(p_info)
## p_id p_name p_points p_state p_prerating
## 1 1 GARY HUA 6.0 ON 1794
## 2 2 DAKSHESH DARURI 6.0 MI 1553
## 3 3 ADITYA BAJAJ 6.0 MI 1384
## 4 4 PATRICK H SCHILLING 5.5 MI 1716
## 5 5 HANSHI ZUO 5.5 MI 1655
## 6 6 HANSEN SONG 5.0 OH 1686
Take a look at the matrix with players/ opponent ids and do a quick check.
head(o_ids)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,] 1 39 21 18 14 7 12 4
## [2,] 2 63 58 4 17 16 20 7
## [3,] 3 8 61 25 21 11 13 12
## [4,] 4 23 28 2 26 5 19 1
## [5,] 5 45 37 12 13 4 14 17
## [6,] 6 34 29 11 35 10 27 21
tail(o_ids)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [59,] 59 41 NA 9 40 43 54 44
## [60,] 60 33 34 45 42 24 NA NA
## [61,] 61 32 3 54 47 42 30 37
## [62,] 62 55 NA NA NA NA NA NA
## [63,] 63 2 48 49 43 45 NA NA
## [64,] 64 22 30 31 49 46 42 54
Now that there is a dataframe with all the player information and a matrix with all of the player-opponent information each players expected score can be calculated.
p_expected <- matrix(, nrow = last_id, ncol = 2)
for (i in 1:dim(o_ids)[1]){
temp_sum_expected <- 0
for (ii in 2:dim(o_ids)[2]){
temp_o_id <- o_ids[i,ii]
if (is.na(temp_o_id) == FALSE){
temp_ind_expected <- 1/
(1+(10^((p_info$p_prerating[temp_o_id] - p_info$p_prerating[i])/400)))
temp_sum_expected <- temp_sum_expected + temp_ind_expected
}
}
p_expected[i,1] <- o_ids[i,1]
p_expected[i,2] <- temp_sum_expected
}
Take a look at the expected.
head(p_expected)
## [,1] [,2]
## [1,] 1 5.161574
## [2,] 2 3.778825
## [3,] 3 1.945088
## [4,] 4 4.741764
## [5,] 5 4.382484
## [6,] 6 4.944596
A quick check passes.
Now each players expected score has been calculated, the difference from their actual score can be calculated. A new dataframe containing all of this information can be created.
# Make the dataframes
p_scores <- data.frame(
p_id = p_info$p_id,
p_name = p_info$p_name,
p_expected = p_expected[,2],
p_actual = p_info$p_points,
p_difference = p_info$p_points - p_expected[,2]
)
Take a look to do a quick check.
head(p_scores)
## p_id p_name p_expected p_actual p_difference
## 1 1 GARY HUA 5.161574 6.0 0.83842636
## 2 2 DAKSHESH DARURI 3.778825 6.0 2.22117517
## 3 3 ADITYA BAJAJ 1.945088 6.0 4.05491209
## 4 4 PATRICK H SCHILLING 4.741764 5.5 0.75823568
## 5 5 HANSHI ZUO 4.382484 5.5 1.11751602
## 6 6 HANSEN SONG 4.944596 5.0 0.05540355
tail(p_scores)
## p_id p_name p_expected p_actual p_difference
## 59 59 SEAN M MC CORMICK 0.4147074 2.0 1.5852926
## 60 60 JULIA SHEN 0.5960637 1.5 0.9039363
## 61 61 JEZZEL FARKAS 0.9667233 1.5 0.5332767
## 62 62 ASHWIN BALAJI 0.8787050 1.0 0.1212950
## 63 63 THOMAS JOSEPH HOSMER 1.4280600 1.0 -0.4280600
## 64 64 BEN LI 2.2746706 1.0 -1.2746706
The 5 players who most outperformed their expected score would be the
players who had the greatest number for the
p_difference
.
kable(head(arrange(p_scores, desc(p_difference)), 5)) |>
kable_styling("striped")
p_id | p_name | p_expected | p_actual | p_difference |
---|---|---|---|---|
3 | ADITYA BAJAJ | 1.9450879 | 6.0 | 4.054912 |
15 | ZACHARY JAMES HOUGHTON | 1.3733089 | 4.5 | 3.126691 |
10 | ANVIT RAO | 1.9448541 | 5.0 | 3.055146 |
46 | JACOB ALEXANDER LAVALLEY | 0.0432498 | 3.0 | 2.956750 |
37 | AMIYATOSH PWNANANDAM | 0.7734529 | 3.5 | 2.726547 |
The 5 players who most underperformed their expected score would be
the players who had the least numbers for the
p_difference
.
kable(head(arrange(p_scores, (p_difference)), 5)) |>
kable_styling("striped")
p_id | p_name | p_expected | p_actual | p_difference |
---|---|---|---|---|
25 | LOREN SCHWIEBERT | 6.275650 | 3.5 | -2.775650 |
30 | GEORGE AVERY JONES | 6.018220 | 3.5 | -2.518220 |
42 | JARED GE | 5.010416 | 3.0 | -2.010416 |
31 | RISHI SHETTY | 5.092465 | 3.5 | -1.592465 |
35 | JOSHUA DAVID LEE | 4.957890 | 3.5 | -1.457890 |
In conclusion, the work from project 1 was adapted to meet the objective of this extra credit assignment.
For each player the following was calculated:
Then the five players who most overperformed relative to their expected score and the five players that most underperformed relative to their expected score were listed.