607 Assignment 5B Dylan Gold

Approach

In this assignment we need to evaluate elo scores from last week’s project of formatting a data frame from a chess tournament txt.
Primarily, we must find the expected score of each player, then the actual score of each player. With these show the top 5 performers, and the bottom 5 performers based on this difference.

I will first get a csv from project 1 with the data needed for this. I will probably modify the project to give me a csv with the data needed for this assignment.
I will use the implementation of the elo system shown in the video given in the assignment. It seems appropriate because it is also chess related elo system. The formula for the expected score is 1/(1 + 10^((Rb - Ra)/400)). This formula is based off the fact that you would expect someone 400 rating higher to be 10 times more likely to win. It does not look like we need the rating change but the formula for that would be new_rating = rating + 32(score - expected_score). I will probably just get the expected score and compare to their actual score.

Codebase

library(tidyverse)
library(dplyr)

First I have to get the data in a format I can work with. I modified my Project 1 File to create a new csv for this.
This new file has the opponents in a single column separated with -

url <- "https://raw.githubusercontent.com/DylanGoldJ/607-Project-1/refs/heads/main/player_data.csv"

df <- read_csv(
  file = url,
  col_names = TRUE # We have column names
)

head(df, 10)

# A tibble: 10 × 7
   number name                state points rating opponents  avg_opponent_rating
    <dbl> <chr>               <chr>  <dbl>  <dbl> <chr>                    <dbl>
 1      1 Gary Hua            ON       6     1794 W-39 W-21…               1605.
 2      2 Dakshesh Daruri     MI       6     1553 W-63 W-58…               1469.
 3      3 Aditya Bajaj        MI       6     1384 L-8 W-61 …               1564.
 4      4 Patrick H Schilling MI       5.5   1716 W-23 D-28…               1574.
 5      5 Hanshi Zuo          MI       5.5   1655 W-45 W-37…               1501.
 6      6 Hansen Song         OH       5     1686 W-34 D-29…               1519.
 7      7 Gary Dee Swathell   MI       5     1649 W-57 W-46…               1372.
 8      8 Ezekiel Houghton    MI       5     1641 W-3 W-32 …               1468.
 9      9 Stefano Lee         ON       5     1411 W-25 L-18…               1523.
10     10 Anvit Rao           MI       5     1365 D-16 L-19…               1554.

We can practice tidying our data to make the gather of expected score easier.
I will make longer to create more rows for each opponent faced.

# Pivot longer is not needed here because the values are all in the row already. We can use separate_rows()
player_data <- separate_rows(
  df, opponents, sep = " "
  ) # Create new rows with - as the delimiter

head(player_data, 12)

# A tibble: 12 × 7
   number name            state points rating opponents avg_opponent_rating
    <dbl> <chr>           <chr>  <dbl>  <dbl> <chr>                   <dbl>
 1      1 Gary Hua        ON         6   1794 W-39                    1605.
 2      1 Gary Hua        ON         6   1794 W-21                    1605.
 3      1 Gary Hua        ON         6   1794 W-18                    1605.
 4      1 Gary Hua        ON         6   1794 W-14                    1605.
 5      1 Gary Hua        ON         6   1794 W-7                     1605.
 6      1 Gary Hua        ON         6   1794 D-12                    1605.
 7      1 Gary Hua        ON         6   1794 D-4                     1605.
 8      2 Dakshesh Daruri MI         6   1553 W-63                    1469.
 9      2 Dakshesh Daruri MI         6   1553 W-58                    1469.
10      2 Dakshesh Daruri MI         6   1553 L-4                     1469.
11      2 Dakshesh Daruri MI         6   1553 W-17                    1469.
12      2 Dakshesh Daruri MI         6   1553 W-16                    1469.

Now we need to separate the state of the match and the opponent. We can do this with mutate

player_data <- player_data %>%
  mutate(
    match_result = substring(opponents, 1,1), #Get the first character, this is always the match_result
    opponent = abs(parse_number(opponents)) # Because theres a dash it interprets it as negative, abs to fix
  )
#Show the new columns
head(player_data %>% select(c(number, name, opponents, match_result, opponent )), 8)

# A tibble: 8 × 5
  number name            opponents match_result opponent
   <dbl> <chr>           <chr>     <chr>           <dbl>
1      1 Gary Hua        W-39      W                  39
2      1 Gary Hua        W-21      W                  21
3      1 Gary Hua        W-18      W                  18
4      1 Gary Hua        W-14      W                  14
5      1 Gary Hua        W-7       W                   7
6      1 Gary Hua        D-12      D                  12
7      1 Gary Hua        D-4       D                   4
8      2 Dakshesh Daruri W-63      W                  63

We also need to drop rows where the opponent is na.
We only have NA values when there is no opponent because they did not face one. We can ignore these non existent matches.

player_data <- player_data %>%
  drop_na()

We now have a tidy version of our data.
I will also select the necessary rows for this task.

player_matches <- player_data %>% 
  select(c("number", "name", "actual_score" = "points", "rating", "opponent"))

head(player_matches, 12)

# A tibble: 12 × 5
   number name            actual_score rating opponent
    <dbl> <chr>                  <dbl>  <dbl>    <dbl>
 1      1 Gary Hua                   6   1794       39
 2      1 Gary Hua                   6   1794       21
 3      1 Gary Hua                   6   1794       18
 4      1 Gary Hua                   6   1794       14
 5      1 Gary Hua                   6   1794        7
 6      1 Gary Hua                   6   1794       12
 7      1 Gary Hua                   6   1794        4
 8      2 Dakshesh Daruri            6   1553       63
 9      2 Dakshesh Daruri            6   1553       58
10      2 Dakshesh Daruri            6   1553        4
11      2 Dakshesh Daruri            6   1553       17
12      2 Dakshesh Daruri            6   1553       16

Lets create a function that will calculate the expected score for a match.
It will take the data frame, players number and the opponent number and output our expected score
We can manually calculate the expected score of player 1 vs player 39 and compare to test. I expect a score of .887

# Uses the player_matches, input player and opponent numbers
get_expected_score <- function(player_num, opponent_num){
  #Get player rating
  player_rating <- player_matches %>% 
    filter(opponent == opponent_num & number == player_num) %>%
      select(rating) %>%
        pull()

  #Get opponent rating
  opponent_rating <- player_matches %>%
    filter(opponent == player_num & number == opponent_num) %>%
      select(rating) %>%
        pull()
  
  #Calculate the expected score
  expected_score <- 1/(1 + 10^((opponent_rating - player_rating)/400))
  #print(paste(player_num, "-" , opponent_num, "-" , expected_score, "    " ))
  
  return(expected_score)
  
} 

get_expected_score(1, 39)

[1] 0.8870357

We now have a function for the expected score. We can use this to create a column of each matches expected score, group by player, then compare to their actual score. First get the expected score for each match

#Get the expected scores
player_scores <- player_matches %>%
  rowwise() %>%
    mutate(
      expected_score = get_expected_score(number,opponent)
      )
player_scores

# A tibble: 408 × 6
# Rowwise: 
   number name            actual_score rating opponent expected_score
    <dbl> <chr>                  <dbl>  <dbl>    <dbl>          <dbl>
 1      1 Gary Hua                   6   1794       39          0.887
 2      1 Gary Hua                   6   1794       21          0.791
 3      1 Gary Hua                   6   1794       18          0.753
 4      1 Gary Hua                   6   1794       14          0.743
 5      1 Gary Hua                   6   1794        7          0.697
 6      1 Gary Hua                   6   1794       12          0.680
 7      1 Gary Hua                   6   1794        4          0.610
 8      2 Dakshesh Daruri            6   1553       63          0.898
 9      2 Dakshesh Daruri            6   1553       58          0.975
10      2 Dakshesh Daruri            6   1553        4          0.281
# ℹ 398 more rows

Group by the shared columns, while getting the sum of the expected_score.

player_scores_grp <- player_scores %>%
  group_by(number, name, actual_score) %>%
    summarise(
      expected_score = sum(expected_score),
      )
head(player_scores_grp, 12)

# A tibble: 12 × 4
# Groups:   number, name [12]
   number name                     actual_score expected_score
    <dbl> <chr>                           <dbl>          <dbl>
 1      1 Gary Hua                          6             5.16
 2      2 Dakshesh Daruri                   6             3.78
 3      3 Aditya Bajaj                      6             1.95
 4      4 Patrick H Schilling               5.5           4.74
 5      5 Hanshi Zuo                        5.5           4.38
 6      6 Hansen Song                       5             4.94
 7      7 Gary Dee Swathell                 5             4.58
 8      8 Ezekiel Houghton                  5             5.03
 9      9 Stefano Lee                       5             2.29
10     10 Anvit Rao                         5             1.94
11     11 Cameron William Mc Leman          4.5           5.34
12     12 Kenneth J Tack                    4.5           4.11

Get the difference

player_scores_grp <- player_scores_grp %>%
  mutate(score_diff = actual_score - expected_score)

head(player_scores_grp, 12)

# A tibble: 12 × 5
# Groups:   number, name [12]
   number name                     actual_score expected_score score_diff
    <dbl> <chr>                           <dbl>          <dbl>      <dbl>
 1      1 Gary Hua                          6             5.16     0.838 
 2      2 Dakshesh Daruri                   6             3.78     2.22  
 3      3 Aditya Bajaj                      6             1.95     4.05  
 4      4 Patrick H Schilling               5.5           4.74     0.758 
 5      5 Hanshi Zuo                        5.5           4.38     1.12  
 6      6 Hansen Song                       5             4.94     0.0554
 7      7 Gary Dee Swathell                 5             4.58     0.419 
 8      8 Ezekiel Houghton                  5             5.03    -0.0343
 9      9 Stefano Lee                       5             2.29     2.71  
10     10 Anvit Rao                         5             1.94     3.06  
11     11 Cameron William Mc Leman          4.5           5.34    -0.839 
12     12 Kenneth J Tack                    4.5           4.11     0.392

We can see the values, they make sense, if gary scored 6 points but was predicted to get around 5.16 points he did .83 points better than predicted.
Now we can show the greatest values and worse values to see who performed the best.

best_performances <- arrange(player_scores_grp, desc(score_diff))
head(best_performances,10)

# A tibble: 10 × 5
# Groups:   number, name [10]
   number name                     actual_score expected_score score_diff
    <dbl> <chr>                           <dbl>          <dbl>      <dbl>
 1      3 Aditya Bajaj                      6           1.95         4.05
 2     15 Zachary James Houghton            4.5         1.37         3.13
 3     10 Anvit Rao                         5           1.94         3.06
 4     46 Jacob Alexander Lavalley          3           0.0432       2.96
 5     37 Amiyatosh Pwnanandam              3.5         0.773        2.73
 6      9 Stefano Lee                       5           2.29         2.71
 7      2 Dakshesh Daruri                   6           3.78         2.22
 8     52 Ethan Guo                         2.5         0.295        2.20
 9     59 Sean M Mc Cormick                 2           0.415        1.59
10     58 Viraj Mohile                      2           0.426        1.57

We can see that our top 5 performers are Aditya Bajaj, Zachary James Houghton, Anvit Rao, Jacob Alexander Lavalley and Amiyatosh Pwnanandam

worse_performances <- arrange(player_scores_grp, score_diff)
head(worse_performances,10)

# A tibble: 10 × 5
# Groups:   number, name [10]
   number name               actual_score expected_score score_diff
    <dbl> <chr>                     <dbl>          <dbl>      <dbl>
 1     25 Loren Schwiebert            3.5           6.28      -2.78
 2     30 George Avery Jones          3.5           6.02      -2.52
 3     42 Jared Ge                    3             5.01      -2.01
 4     31 Rishi Shetty                3.5           5.09      -1.59
 5     35 Joshua David Lee            3.5           4.96      -1.46
 6     54 Larry Hodge                 2             3.40      -1.40
 7     45 Derek Yan                   3             4.37      -1.37
 8     43 Robert Glen Vasey           3             4.33      -1.33
 9     64 Ben Li                      1             2.27      -1.27
10     33 Jade Ge                     3.5           4.64      -1.14

We can see our bottom 5 performers are Loren Schwiebert, George Avery Jones, Jared Ge, Rishi Shetty and Joshua David Lee

Conclusion

In this assignment we were able to futher build off of our project 1 and apply the concepts of tidying data to a data set. Because of the way I was able to export my data in Project 1 I ended up using separate_rows() to make my dataframe longer. In scenarios where we have more columns rather than a list of values in a column we could use pivot_longer to reach similar results. After the data was lengthened I could see how it would be very useful due to having access to groupby and other such features. Some ways I could further build on this project would be to look at the rating changes. We were given the rating changes already but calculating it ourselves or perhaps adjusting the rating change could be interesting.