Overview

For this extra credit assignment, we will be calculating each players’ expected score and the difference between their calculated score and their actual score. Then, we will list the 5 players who most overperformed relative to their expected score and the top 5 players who have underperformed relative to their expected score.

1. What is ELO?

By using the below sources, I was able to learn the following about ELO and their scores:

Elo is a system that is used to rate players skill levels. This model assumes that actual skill and performance results from a player vary along a normal distribution and that each players’ normal distribution has an average that is somewhere along that bell curve.

A concept that allows this is the Expected Score. The Expected Score is essentially the Probability of winning plus half the Probability of drawing.

You can calculate Expected score by:

\[ S_{exp} = \dfrac{1}{1+10^{(\dfrac{R_{opponent}-R_{player}}{400})}} \] \[ S_{exp} = \text{Expected Score} \] \[ R_{player} = \text{Rating of the player of interest} \] \[ R_{opponent} = \text{Rating of the opponent} \]

With the definition, let’s create a function that calculates expected score:

calculate_expected_score <- function(r_player, r_opponent) {
  denom <- 1 + 10^((r_opponent - r_player) / 400)

  return(1 / denom)
}

Additionally, with the ELO framework, we can calculate the new rating of a player by using this expected formula. This can be done with:

\[ R_{new} = R_{old} + 32(S_{act} - S_{exp}) \] \[ R_{new} = \text{The player's new/updated rating} \] \[ R_{old} = \text{The player's rating before the match} \] \[ S_{act} = \text{The player's actual score} \]

Now with this defined, let’s create a function that performs this calculation:

calculate_new_rating <- function(r_old, s_act, s_exp) {
  return(r_old + (32 * (s_act - s_exp)))
}

2. Getting the matches data

Luckily, we can reuse my results from project 1:

players_df <- read.csv("https://raw.githubusercontent.com/riverar9/cuny-msds/main/data607/projects/project-1/players_data.csv") #nolint

head(players_df)
##   player_id         player_name player_uscfid player_state player_points
## 1         1            GARY HUA      15445895           ON           6.0
## 2         2     DAKSHESH DARURI      14598900           MI           6.0
## 3         3        ADITYA BAJAJ      14959604           MI           6.0
## 4         4 PATRICK H SCHILLING      12616049           MI           5.5
## 5         5          HANSHI ZUO      14601533           MI           5.5
## 6         6         HANSEN SONG      15055204           OH           5.0
##   player_prerating player_postrating
## 1             1794              1817
## 2             1553              1663
## 3             1384              1640
## 4             1716              1744
## 5             1655              1690
## 6             1686              1687
matches_df <- read.csv("https://raw.githubusercontent.com/riverar9/cuny-msds/main/data607/projects/project-1/match_data.csv") # nolint

head(matches_df)
##   player_id round_number round_outcome opponent_id
## 1         1            1             W          39
## 2         1            2             W          21
## 3         1            3             W          18
## 4         1            4             W          14
## 5         1            5             W           7
## 6         1            6             D          12

3. Calculating expected scores

With the data, we can simply apply our calculate_expected_score function on the dataset to calculate our expected score. But to do so, we first need to ensure that player score and opponent score are in the same dataset. Just like with the project, by joining players_df to matches_df, we will automatically remove matches that didn’t happen:

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
full_data <- matches_df |>
  select(player_id, opponent_id)

player_rating_data <- players_df |>
  select(player_id, player_points, player_prerating, player_name)

opponent_rating_data <- players_df |>
  rename(
    opponent_id = player_id,
    opponent_prerating = player_prerating
  ) |>
  select(opponent_id, opponent_prerating)

full_data <- merge(
  full_data,
  player_rating_data,
  by = "player_id"
)

full_data <- merge(
  full_data,
  opponent_rating_data,
  by = "opponent_id"
)

head(full_data)
##   opponent_id player_id player_points player_prerating         player_name
## 1           1        18           4.0             1600       DAVID SUNDEEN
## 2           1        21           4.0             1563       DINH DANG BUI
## 3           1        14           4.5             1610        BRADLEY SHAW
## 4           1         4           5.5             1716 PATRICK H SCHILLING
## 5           1        12           4.5             1663      KENNETH J TACK
## 6           1        39           3.0             1436       JOEL R HENDON
##   opponent_prerating
## 1               1794
## 2               1794
## 3               1794
## 4               1794
## 5               1794
## 6               1794

With this data, we can calculate the expected score of each match:

full_data$expected_score <- calculate_expected_score(
  full_data$player_prerating,
  full_data$opponent_prerating
)

head(full_data)
##   opponent_id player_id player_points player_prerating         player_name
## 1           1        18           4.0             1600       DAVID SUNDEEN
## 2           1        21           4.0             1563       DINH DANG BUI
## 3           1        14           4.5             1610        BRADLEY SHAW
## 4           1         4           5.5             1716 PATRICK H SCHILLING
## 5           1        12           4.5             1663      KENNETH J TACK
## 6           1        39           3.0             1436       JOEL R HENDON
##   opponent_prerating expected_score
## 1               1794      0.2466139
## 2               1794      0.2092019
## 3               1794      0.2574644
## 4               1794      0.3895976
## 5               1794      0.3199293
## 6               1794      0.1129643

And with the expected score of each match, we can aggregate by player to get their respective expected and actual scores:

final_scores <- full_data |>
  select(player_name, player_prerating,player_points, expected_score) |> # nolint
  group_by(player_name, player_prerating, player_points) |>
  summarise(total_expected_score = sum(expected_score))
## `summarise()` has grouped output by 'player_name', 'player_prerating'. You can
## override using the `.groups` argument.
head(final_scores)
## # A tibble: 6 × 4
## # Groups:   player_name, player_prerating [6]
##   player_name          player_prerating player_points total_expected_score
##   <chr>                           <int>         <dbl>                <dbl>
## 1 ADITYA BAJAJ                     1384           6                  1.95 
## 2 ALAN BUI                         1363           4                  3.94 
## 3 ALEX KONG                        1186           2                  1.44 
## 4 AMIYATOSH PWNANANDAM              980           3.5                0.773
## 5 ANVIT RAO                        1365           5                  1.94 
## 6 ASHWIN BALAJI                    1530           1                  0.879

Now let’s calculate the total score and see those who overperformed and underperformed:

score_diffs <- final_scores |>
  mutate(score_diff = player_points - total_expected_score) |>
  arrange(desc(score_diff)) |>
  select(player_name, player_points, total_expected_score, score_diff)
## Adding missing grouping variables: `player_prerating`
head(score_diffs)
## # A tibble: 6 × 5
## # Groups:   player_name, player_prerating [6]
##   player_prerating player_name     player_points total_expected_score score_diff
##              <int> <chr>                   <dbl>                <dbl>      <dbl>
## 1             1384 ADITYA BAJAJ              6                 1.95         4.05
## 2             1220 ZACHARY JAMES …           4.5               1.37         3.13
## 3             1365 ANVIT RAO                 5                 1.94         3.06
## 4              377 JACOB ALEXANDE…           3                 0.0432       2.96
## 5              980 AMIYATOSH PWNA…           3.5               0.773        2.73
## 6             1411 STEFANO LEE               5                 2.29         2.71

With this, we can see that Aditya Bajaj overperformed the most with a difference of 4.05. Following her are:

  1. Zachary James Houghton
  2. Anvit Rao
  3. Jacob Alexander Lavalley
  4. Amiyatosh Pwnanandam

For those who underperformed, we can look at the same dataset but use the “tail()” function:

tail(score_diffs)
## # A tibble: 6 × 5
## # Groups:   player_name, player_prerating [6]
##   player_prerating player_name     player_points total_expected_score score_diff
##              <int> <chr>                   <dbl>                <dbl>      <dbl>
## 1             1270 LARRY HODGE               2                   3.40      -1.40
## 2             1438 JOSHUA DAVID L…           3.5                 4.96      -1.46
## 3             1494 RISHI SHETTY              3.5                 5.09      -1.59
## 4             1332 JARED GE                  3                   5.01      -2.01
## 5             1522 GEORGE AVERY J…           3.5                 6.02      -2.52
## 6             1745 LOREN SCHWIEBE…           3.5                 6.28      -2.78

From here we have to read the data from the bottom to the top but we can see that Loren Schwiebert underperformed the most with a difference of -2.78. Following her are:

  1. George Avery Jones
  2. Jared Ge
  3. Rishi Shetty
  4. Joshua David Lee

Lastly, and for fun, let’s calculate each players new rating and see who gained the most ratings during this tournament:

score_diffs$new_rating <- calculate_new_rating(
  score_diffs$player_prerating,
  score_diffs$player_points,
  score_diffs$total_expected_score
)

score_diffs <- score_diffs |>
  select(player_name, player_prerating, new_rating) |>
  mutate(rating_change = new_rating - player_prerating) |>
  arrange(desc(rating_change))

head(score_diffs)
## # A tibble: 6 × 4
## # Groups:   player_name, player_prerating [6]
##   player_name              player_prerating new_rating rating_change
##   <chr>                               <int>      <dbl>         <dbl>
## 1 ADITYA BAJAJ                         1384      1514.         130. 
## 2 ZACHARY JAMES HOUGHTON               1220      1320.         100. 
## 3 ANVIT RAO                            1365      1463.          97.8
## 4 JACOB ALEXANDER LAVALLEY              377       472.          94.6
## 5 AMIYATOSH PWNANANDAM                  980      1067.          87.2
## 6 STEFANO LEE                          1411      1498.          86.8