library(tidyverse)
library(dplyr)607 Assignment 5B Dylan Gold
607 Assignment 5B Dylan Gold
Approach
In this assignment we need to evaluate elo scores from last week’s project of formatting a data frame from a chess tournament txt.
Primarily, we must find the expected score of each player, then the actual score of each player. With these show the top 5 performers, and the bottom 5 performers based on this difference.
I will first get a csv from project 1 with the data needed for this. I will probably modify the project to give me a csv with the data needed for this assignment.
I will use the implementation of the elo system shown in the video given in the assignment. It seems appropriate because it is also chess related elo system. The formula for the expected score is 1/(1 + 10^((Rb - Ra)/400)). This formula is based off the fact that you would expect someone 400 rating higher to be 10 times more likely to win. It does not look like we need the rating change but the formula for that would be new_rating = rating + 32(score - expected_score). I will probably just get the expected score and compare to their actual score.
Codebase
First I have to get the data in a format I can work with. I modified my Project 1 File to create a new csv for this.
This new file has the opponents in a single column separated with -
url <- "https://raw.githubusercontent.com/DylanGoldJ/607-Project-1/refs/heads/main/player_data.csv"
df <- read_csv(
file = url,
col_names = TRUE # We have column names
)
head(df, 10)# A tibble: 10 × 7
number name state points rating opponents avg_opponent_rating
<dbl> <chr> <chr> <dbl> <dbl> <chr> <dbl>
1 1 Gary Hua ON 6 1794 W-39 W-21… 1605.
2 2 Dakshesh Daruri MI 6 1553 W-63 W-58… 1469.
3 3 Aditya Bajaj MI 6 1384 L-8 W-61 … 1564.
4 4 Patrick H Schilling MI 5.5 1716 W-23 D-28… 1574.
5 5 Hanshi Zuo MI 5.5 1655 W-45 W-37… 1501.
6 6 Hansen Song OH 5 1686 W-34 D-29… 1519.
7 7 Gary Dee Swathell MI 5 1649 W-57 W-46… 1372.
8 8 Ezekiel Houghton MI 5 1641 W-3 W-32 … 1468.
9 9 Stefano Lee ON 5 1411 W-25 L-18… 1523.
10 10 Anvit Rao MI 5 1365 D-16 L-19… 1554.
We can practice tidying our data to make the gather of expected score easier.
I will make longer to create more rows for each opponent faced.
# Pivot longer is not needed here because the values are all in the row already. We can use separate_rows()
player_data <- separate_rows(
df, opponents, sep = " "
) # Create new rows with - as the delimiter
head(player_data, 12)# A tibble: 12 × 7
number name state points rating opponents avg_opponent_rating
<dbl> <chr> <chr> <dbl> <dbl> <chr> <dbl>
1 1 Gary Hua ON 6 1794 W-39 1605.
2 1 Gary Hua ON 6 1794 W-21 1605.
3 1 Gary Hua ON 6 1794 W-18 1605.
4 1 Gary Hua ON 6 1794 W-14 1605.
5 1 Gary Hua ON 6 1794 W-7 1605.
6 1 Gary Hua ON 6 1794 D-12 1605.
7 1 Gary Hua ON 6 1794 D-4 1605.
8 2 Dakshesh Daruri MI 6 1553 W-63 1469.
9 2 Dakshesh Daruri MI 6 1553 W-58 1469.
10 2 Dakshesh Daruri MI 6 1553 L-4 1469.
11 2 Dakshesh Daruri MI 6 1553 W-17 1469.
12 2 Dakshesh Daruri MI 6 1553 W-16 1469.
Now we need to separate the state of the match and the opponent. We can do this with mutate
player_data <- player_data %>%
mutate(
match_result = substring(opponents, 1,1), #Get the first character, this is always the match_result
opponent = abs(parse_number(opponents)) # Because theres a dash it interprets it as negative, abs to fix
)
#Show the new columns
head(player_data %>% select(c(number, name, opponents, match_result, opponent )), 8)# A tibble: 8 × 5
number name opponents match_result opponent
<dbl> <chr> <chr> <chr> <dbl>
1 1 Gary Hua W-39 W 39
2 1 Gary Hua W-21 W 21
3 1 Gary Hua W-18 W 18
4 1 Gary Hua W-14 W 14
5 1 Gary Hua W-7 W 7
6 1 Gary Hua D-12 D 12
7 1 Gary Hua D-4 D 4
8 2 Dakshesh Daruri W-63 W 63
We also need to drop rows where the opponent is na.
We only have NA values when there is no opponent because they did not face one. We can ignore these non existent matches.
player_data <- player_data %>%
drop_na()We now have a tidy version of our data.
I will also select the necessary rows for this task.
player_matches <- player_data %>%
select(c("number", "name", "actual_score" = "points", "rating", "opponent"))
head(player_matches, 12)# A tibble: 12 × 5
number name actual_score rating opponent
<dbl> <chr> <dbl> <dbl> <dbl>
1 1 Gary Hua 6 1794 39
2 1 Gary Hua 6 1794 21
3 1 Gary Hua 6 1794 18
4 1 Gary Hua 6 1794 14
5 1 Gary Hua 6 1794 7
6 1 Gary Hua 6 1794 12
7 1 Gary Hua 6 1794 4
8 2 Dakshesh Daruri 6 1553 63
9 2 Dakshesh Daruri 6 1553 58
10 2 Dakshesh Daruri 6 1553 4
11 2 Dakshesh Daruri 6 1553 17
12 2 Dakshesh Daruri 6 1553 16
Lets create a function that will calculate the expected score for a match.
It will take the data frame, players number and the opponent number and output our expected score
We can manually calculate the expected score of player 1 vs player 39 and compare to test. I expect a score of .887
# Uses the player_matches, input player and opponent numbers
get_expected_score <- function(player_num, opponent_num){
#Get player rating
player_rating <- player_matches %>%
filter(opponent == opponent_num & number == player_num) %>%
select(rating) %>%
pull()
#Get opponent rating
opponent_rating <- player_matches %>%
filter(opponent == player_num & number == opponent_num) %>%
select(rating) %>%
pull()
#Calculate the expected score
expected_score <- 1/(1 + 10^((opponent_rating - player_rating)/400))
#print(paste(player_num, "-" , opponent_num, "-" , expected_score, " " ))
return(expected_score)
}
get_expected_score(1, 39)[1] 0.8870357
We now have a function for the expected score. We can use this to create a column of each matches expected score, group by player, then compare to their actual score. First get the expected score for each match
#Get the expected scores
player_scores <- player_matches %>%
rowwise() %>%
mutate(
expected_score = get_expected_score(number,opponent)
)
player_scores# A tibble: 408 × 6
# Rowwise:
number name actual_score rating opponent expected_score
<dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 Gary Hua 6 1794 39 0.887
2 1 Gary Hua 6 1794 21 0.791
3 1 Gary Hua 6 1794 18 0.753
4 1 Gary Hua 6 1794 14 0.743
5 1 Gary Hua 6 1794 7 0.697
6 1 Gary Hua 6 1794 12 0.680
7 1 Gary Hua 6 1794 4 0.610
8 2 Dakshesh Daruri 6 1553 63 0.898
9 2 Dakshesh Daruri 6 1553 58 0.975
10 2 Dakshesh Daruri 6 1553 4 0.281
# ℹ 398 more rows
Group by the shared columns, while getting the sum of the expected_score.
player_scores_grp <- player_scores %>%
group_by(number, name, actual_score) %>%
summarise(
expected_score = sum(expected_score),
)
head(player_scores_grp, 12)# A tibble: 12 × 4
# Groups: number, name [12]
number name actual_score expected_score
<dbl> <chr> <dbl> <dbl>
1 1 Gary Hua 6 5.16
2 2 Dakshesh Daruri 6 3.78
3 3 Aditya Bajaj 6 1.95
4 4 Patrick H Schilling 5.5 4.74
5 5 Hanshi Zuo 5.5 4.38
6 6 Hansen Song 5 4.94
7 7 Gary Dee Swathell 5 4.58
8 8 Ezekiel Houghton 5 5.03
9 9 Stefano Lee 5 2.29
10 10 Anvit Rao 5 1.94
11 11 Cameron William Mc Leman 4.5 5.34
12 12 Kenneth J Tack 4.5 4.11
Get the difference
player_scores_grp <- player_scores_grp %>%
mutate(score_diff = actual_score - expected_score)
head(player_scores_grp, 12)# A tibble: 12 × 5
# Groups: number, name [12]
number name actual_score expected_score score_diff
<dbl> <chr> <dbl> <dbl> <dbl>
1 1 Gary Hua 6 5.16 0.838
2 2 Dakshesh Daruri 6 3.78 2.22
3 3 Aditya Bajaj 6 1.95 4.05
4 4 Patrick H Schilling 5.5 4.74 0.758
5 5 Hanshi Zuo 5.5 4.38 1.12
6 6 Hansen Song 5 4.94 0.0554
7 7 Gary Dee Swathell 5 4.58 0.419
8 8 Ezekiel Houghton 5 5.03 -0.0343
9 9 Stefano Lee 5 2.29 2.71
10 10 Anvit Rao 5 1.94 3.06
11 11 Cameron William Mc Leman 4.5 5.34 -0.839
12 12 Kenneth J Tack 4.5 4.11 0.392
We can see the values, they make sense, if gary scored 6 points but was predicted to get around 5.16 points he did .83 points better than predicted.
Now we can show the greatest values and worse values to see who performed the best.
best_performances <- arrange(player_scores_grp, desc(score_diff))
head(best_performances,10)# A tibble: 10 × 5
# Groups: number, name [10]
number name actual_score expected_score score_diff
<dbl> <chr> <dbl> <dbl> <dbl>
1 3 Aditya Bajaj 6 1.95 4.05
2 15 Zachary James Houghton 4.5 1.37 3.13
3 10 Anvit Rao 5 1.94 3.06
4 46 Jacob Alexander Lavalley 3 0.0432 2.96
5 37 Amiyatosh Pwnanandam 3.5 0.773 2.73
6 9 Stefano Lee 5 2.29 2.71
7 2 Dakshesh Daruri 6 3.78 2.22
8 52 Ethan Guo 2.5 0.295 2.20
9 59 Sean M Mc Cormick 2 0.415 1.59
10 58 Viraj Mohile 2 0.426 1.57
We can see that our top 5 performers are Aditya Bajaj, Zachary James Houghton, Anvit Rao, Jacob Alexander Lavalley and Amiyatosh Pwnanandam
worse_performances <- arrange(player_scores_grp, score_diff)
head(worse_performances,10)# A tibble: 10 × 5
# Groups: number, name [10]
number name actual_score expected_score score_diff
<dbl> <chr> <dbl> <dbl> <dbl>
1 25 Loren Schwiebert 3.5 6.28 -2.78
2 30 George Avery Jones 3.5 6.02 -2.52
3 42 Jared Ge 3 5.01 -2.01
4 31 Rishi Shetty 3.5 5.09 -1.59
5 35 Joshua David Lee 3.5 4.96 -1.46
6 54 Larry Hodge 2 3.40 -1.40
7 45 Derek Yan 3 4.37 -1.37
8 43 Robert Glen Vasey 3 4.33 -1.33
9 64 Ben Li 1 2.27 -1.27
10 33 Jade Ge 3.5 4.64 -1.14
We can see our bottom 5 performers are Loren Schwiebert, George Avery Jones, Jared Ge, Rishi Shetty and Joshua David Lee
Conclusion
In this assignment we were able to futher build off of our project 1 and apply the concepts of tidying data to a data set. Because of the way I was able to export my data in Project 1 I ended up using separate_rows() to make my dataframe longer. In scenarios where we have more columns rather than a list of values in a column we could use pivot_longer to reach similar results. After the data was lengthened I could see how it would be very useful due to having access to groupby and other such features. Some ways I could further build on this project would be to look at the rating changes. We were given the rating changes already but calculating it ourselves or perhaps adjusting the rating change could be interesting.