Assignment 5B ELO Calculations

Author

Long Lin

Overview

For this assignment, I will use the chess tournament information from Project 1 and calculate each player’s expected score and the difference from their actual score. From there, I will list the five players who overperformed the most relative to their expected score, as well as the five players that underperformed the most relative to their expected score. I think the formula given in the reference video is a good start and should be reasonable for this assignment.

source: https://raw.githubusercontent.com/longflin/DATA-607/refs/heads/main/Assignment%205B/tournamentinfo.txt

The Process

I will start with creating a data frame from the chess tournament information provided in project 1.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.5.2
✔ ggplot2   4.0.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(gt)

# 1. Load the raw text
raw_lines <- readLines("https://raw.githubusercontent.com/longflin/DATA-607/refs/heads/main/Assignment%205B/tournamentinfo.txt")
Warning in
readLines("https://raw.githubusercontent.com/longflin/DATA-607/refs/heads/main/Assignment%205B/tournamentinfo.txt"):
incomplete final line found on
'https://raw.githubusercontent.com/longflin/DATA-607/refs/heads/main/Assignment%205B/tournamentinfo.txt'
# 2. Remove dashes and empty lines
clean_lines <- raw_lines[!grepl("^-+$", raw_lines) & nzchar(raw_lines)]

# 3. Skip the header (first 2 lines of the remaining text)
data_lines <- clean_lines[3:length(clean_lines)]

# 4. Separate the two-row structure
# Every odd line is a "Name" row, every even line is a "ID/Rating" row
name_rows <- data_lines[seq(1, length(data_lines), by = 2)]
info_rows <- data_lines[seq(2, length(data_lines), by = 2)]

# 5. Parse the Name Rows (Fixed Width)
# Column positions: Pair Num (1-5), Name (9-40), Total Pts (36-40), Rounds (42-end)
names_df <- data.frame(
  pair_num = as.numeric(str_sub(name_rows, 1, 5)),
  player_name = str_trim(str_sub(name_rows, 9, 40)),
  total_pts = as.numeric(str_sub(name_rows, 42, 44)),
  stringsAsFactors = FALSE
)

# 6. Parse the Info Rows (Extracting State and Pre-Rating)
info_df <- data.frame(
  state = str_trim(str_sub(info_rows, 1, 5)),
  uscf_id = str_extract(info_rows, "\\d{8}"),
  pre_rating = as.numeric(str_extract(info_rows, "(?<=R: )\\s*\\d+")),
  stringsAsFactors = FALSE
)

# 7. Merge names with info
tournament_df <- cbind(names_df, info_df)

head(tournament_df, 100)
   pair_num                player_name total_pts state  uscf_id pre_rating
1         1                   GARY HUA       6.0    ON 15445895       1794
2         2            DAKSHESH DARURI       6.0    MI 14598900       1553
3         3               ADITYA BAJAJ       6.0    MI 14959604       1384
4         4        PATRICK H SCHILLING       5.5    MI 12616049       1716
5         5                 HANSHI ZUO       5.5    MI 14601533       1655
6         6                HANSEN SONG       5.0    OH 15055204       1686
7         7          GARY DEE SWATHELL       5.0    MI 11146376       1649
8         8           EZEKIEL HOUGHTON       5.0    MI 15142253       1641
9         9                STEFANO LEE       5.0    ON 14954524       1411
10       10                  ANVIT RAO       5.0    MI 14150362       1365
11       11   CAMERON WILLIAM MC LEMAN       4.5    MI 12581589       1712
12       12             KENNETH J TACK       4.5    MI 12681257       1663
13       13          TORRANCE HENRY JR       4.5    MI 15082995       1666
14       14               BRADLEY SHAW       4.5    MI 10131499       1610
15       15     ZACHARY JAMES HOUGHTON       4.5    MI 15619130       1220
16       16               MIKE NIKITIN       4.0    MI 10295068       1604
17       17         RONALD GRZEGORCZYK       4.0    MI 10297702       1629
18       18              DAVID SUNDEEN       4.0    MI 11342094       1600
19       19               DIPANKAR ROY       4.0    MI 14862333       1564
20       20                JASON ZHENG       4.0    MI 14529060       1595
21       21              DINH DANG BUI       4.0    ON 15495066       1563
22       22           EUGENE L MCCLURE       4.0    MI 12405534       1555
23       23                   ALAN BUI       4.0    ON 15030142       1363
24       24          MICHAEL R ALDRICH       4.0    MI 13469010       1229
25       25           LOREN SCHWIEBERT       3.5    MI 12486656       1745
26       26                    MAX ZHU       3.5    ON 15131520       1579
27       27             GAURAV GIDWANI       3.5    MI 14476567       1552
28       28 SOFIA ADINA STANESCU-BELLU       3.5    MI 14882954       1507
29       29           CHIEDOZIE OKORIE       3.5    MI 15323285       1602
30       30         GEORGE AVERY JONES       3.5    ON 12577178       1522
31       31               RISHI SHETTY       3.5    MI 15131618       1494
32       32      JOSHUA PHILIP MATHEWS       3.5    ON 14073750       1441
33       33                    JADE GE       3.5    MI 14691842       1449
34       34     MICHAEL JEFFERY THOMAS       3.5    MI 15051807       1399
35       35           JOSHUA DAVID LEE       3.5    MI 14601397       1438
36       36              SIDDHARTH JHA       3.5    MI 14773163       1355
37       37       AMIYATOSH PWNANANDAM       3.5    MI 15489571        980
38       38                  BRIAN LIU       3.0    MI 15108523       1423
39       39              JOEL R HENDON       3.0    MI 12923035       1436
40       40               FOREST ZHANG       3.0    MI 14892710       1348
41       41        KYLE WILLIAM MURPHY       3.0    MI 15761443       1403
42       42                   JARED GE       3.0    MI 14462326       1332
43       43          ROBERT GLEN VASEY       3.0    MI 14101068       1283
44       44         JUSTIN D SCHILLING       3.0    MI 15323504       1199
45       45                  DEREK YAN       3.0    MI 15372807       1242
46       46   JACOB ALEXANDER LAVALLEY       3.0    MI 15490981        377
47       47                ERIC WRIGHT       2.5    MI 12533115       1362
48       48               DANIEL KHAIN       2.5    MI 14369165       1382
49       49           MICHAEL J MARTIN       2.5    MI 12531685       1291
50       50                 SHIVAM JHA       2.5    MI 14773178       1056
51       51             TEJAS AYYAGARI       2.5    MI 15205474       1011
52       52                  ETHAN GUO       2.5    MI 14918803        935
53       53              JOSE C YBARRA       2.0    MI 12578849       1393
54       54                LARRY HODGE       2.0    MI 12836773       1270
55       55                  ALEX KONG       2.0    MI 15412571       1186
56       56               MARISA RICCI       2.0    MI 14679887       1153
57       57                 MICHAEL LU       2.0    MI 15113330       1092
58       58               VIRAJ MOHILE       2.0    MI 14700365        917
59       59          SEAN M MC CORMICK       2.0    MI 12841036        853
60       60                 JULIA SHEN       1.5    MI 14579262        967
61       61              JEZZEL FARKAS       1.5    ON 15771592        955
62       62              ASHWIN BALAJI       1.0    MI 15219542       1530
63       63       THOMAS JOSEPH HOSMER       1.0    MI 15057092       1175
64       64                     BEN LI       1.0    MI 15006561       1163

For the round data, I created a data frame called opponents_df and stored the rounds as well as the opponent_pre_rating for each round.

# Extract the string containing the rounds
rounds_raw <- str_sub(name_rows, 42, -1)

# Extract just the digits (Opponent Pair Numbers)
# We use str_extract_all to get every number in the row
opponents_list <- str_extract_all(rounds_raw, "(?<=[WLD])\\s*\\d+")

# Convert the list into a data frame where each row is a game
opponents_df <- data.frame(
  pair_num = rep(tournament_df$pair_num, lengths(opponents_list)),
  opponent_id = as.numeric(unlist(opponents_list))
)

#Grab the opponent's pre rating from the tournament_df
opponents_df$opponent_pre_rating <- tournament_df$pre_rating[match(opponents_df$opponent_id, tournament_df$pair_num)]

head(opponents_df)
  pair_num opponent_id opponent_pre_rating
1        1          39                1436
2        1          21                1563
3        1          18                1600
4        1          14                1610
5        1           7                1649
6        1          12                1663

Once I parsed the data from the tournamentinfo.txt file, I created a matches data frame by taking the opponents_df and adding the pre_rating value associated with each pair_num in the opponents_df using a left_join.

matches_df <- opponents_df |>
  left_join(
    tournament_df |> select(pair_num, pre_rating),
    by = "pair_num"
  ) |>
  relocate(pre_rating, .after = pair_num)
head(matches_df, 10)
   pair_num pre_rating opponent_id opponent_pre_rating
1         1       1794          39                1436
2         1       1794          21                1563
3         1       1794          18                1600
4         1       1794          14                1610
5         1       1794           7                1649
6         1       1794          12                1663
7         1       1794           4                1716
8         2       1553          63                1175
9         2       1553          58                 917
10        2       1553           4                1716

In order to calculate the expected score, I’ll use the formula from singingbanana’s video “The ELO Rating System for Chess and Beyond”.

singingbanana. (2019, February 15). The Elo Rating System for Chess and Beyond [Video]. YouTube. https://www.youtube.com/watch?v=AsYfbmp0To0

P(A wins) = 1/(1+(10^((RB-RA)/400)))

I used the formula to add another column of the expected_score for each match.

matches_df <- matches_df |>
  mutate(expected_score = 1 / (1 + (10^((opponent_pre_rating - pre_rating) / 400))))

head(matches_df)
  pair_num pre_rating opponent_id opponent_pre_rating expected_score
1        1       1794          39                1436      0.8870357
2        1       1794          21                1563      0.7907981
3        1       1794          18                1600      0.7533861
4        1       1794          14                1610      0.7425356
5        1       1794           7                1649      0.6973451
6        1       1794          12                1663      0.6800707

Next I calculated the expected points for each player by adding up all the expected scores for each pair_num.

expected_points_df <- matches_df |>
  group_by(pair_num) |>
  summarise(expected_points = sum(expected_score), .groups = "drop")

head(expected_points_df)
# A tibble: 6 × 2
  pair_num expected_points
     <dbl>           <dbl>
1        1            5.16
2        2            3.78
3        3            1.95
4        4            4.74
5        5            4.38
6        6            4.94

Next I calculated the differential between the player’s total points and their expected points.

differential_df <- tournament_df |>
  left_join(expected_points_df, by = "pair_num") |>
  mutate(
    differential = total_pts - expected_points
  )

head(differential_df)
  pair_num         player_name total_pts state  uscf_id pre_rating
1        1            GARY HUA       6.0    ON 15445895       1794
2        2     DAKSHESH DARURI       6.0    MI 14598900       1553
3        3        ADITYA BAJAJ       6.0    MI 14959604       1384
4        4 PATRICK H SCHILLING       5.5    MI 12616049       1716
5        5          HANSHI ZUO       5.5    MI 14601533       1655
6        6         HANSEN SONG       5.0    OH 15055204       1686
  expected_points differential
1        5.161574   0.83842636
2        3.778825   2.22117517
3        1.945088   4.05491209
4        4.741764   0.75823568
5        4.382484   1.11751602
6        4.944596   0.05540355

In order to get the top 5 overperformers, I used dplyr’s arrange function to arrange the list in descending order and then used the head function to grab the top 5.

overperformed_df <- differential_df |>
  arrange(desc(differential))

top5_overperformers <- head(overperformed_df, 5)

top5_overperformers |>
  gt() |>
  cols_label(
    pair_num = "Pair Number",
    player_name = "Player Name",
    total_pts = "Total Pts",
    state = "State",
    uscf_id = "USCF ID",
    pre_rating = "Pre-rating",
    expected_points = "Expected Pts",
    differential = "Differential"
  ) |>
  tab_header(
    title = "Top 5 Overperformers",
  )
Top 5 Overperformers
Pair Number Player Name Total Pts State USCF ID Pre-rating Expected Pts Differential
3 ADITYA BAJAJ 6.0 MI 14959604 1384 1.94508791 4.054912
15 ZACHARY JAMES HOUGHTON 4.5 MI 15619130 1220 1.37330887 3.126691
10 ANVIT RAO 5.0 MI 14150362 1365 1.94485405 3.055146
46 JACOB ALEXANDER LAVALLEY 3.0 MI 15490981 377 0.04324981 2.956750
37 AMIYATOSH PWNANANDAM 3.5 MI 15489571 980 0.77345290 2.726547

In order to get the top 5 underperformers, I used dplyr’s arrange function to arrange the list in the default ascending order and then used the head function to grab the top 5.

underperformed_df <- differential_df |>
  arrange(differential)

top5_underperformers <- head(underperformed_df, 5)

top5_underperformers |>
  gt() |>
  cols_label(
    pair_num = "Pair Number",
    player_name = "Player Name",
    total_pts = "Total Pts",
    state = "State",
    uscf_id = "USCF ID",
    pre_rating = "Pre-rating",
    expected_points = "Expected Pts",
    differential = "Differential"
  ) |>
  tab_header(
    title = "Top 5 Underperformers",
  )
Top 5 Underperformers
Pair Number Player Name Total Pts State USCF ID Pre-rating Expected Pts Differential
25 LOREN SCHWIEBERT 3.5 MI 12486656 1745 6.275650 -2.775650
30 GEORGE AVERY JONES 3.5 ON 12577178 1522 6.018220 -2.518220
42 JARED GE 3.0 MI 14462326 1332 5.010416 -2.010416
31 RISHI SHETTY 3.5 MI 15131618 1494 5.092465 -1.592465
35 JOSHUA DAVID LEE 3.5 MI 14601397 1438 4.957890 -1.457890

Finally I checked my work by hand calculating to see if it was correct. I used Gary Hua for the calculation.

tournament_df |>
  filter(pair_num == 1)
  pair_num player_name total_pts state  uscf_id pre_rating
1        1    GARY HUA         6    ON 15445895       1794
opponents_df |>
  filter(pair_num ==1)
  pair_num opponent_id opponent_pre_rating
1        1          39                1436
2        1          21                1563
3        1          18                1600
4        1          14                1610
5        1           7                1649
6        1          12                1663
7        1           4                1716
#opponent 39
expected_score39 = 1 / (1 + (10^((1436 - 1794) / 400)))

#opponent 21
expected_score21 = 1 / (1 + (10^((1563 - 1794) / 400)))

#opponent 18
expected_score18 = 1 / (1 + (10^((1600 - 1794) / 400)))

#opponent 14
expected_score14 = 1 / (1 + (10^((1610 - 1794) / 400)))

#opponent 7
expected_score7 = 1 / (1 + (10^((1649 - 1794) / 400)))

#opponent 12
expected_score12 = 1 / (1 + (10^((1663 - 1794) / 400)))

#opponent 4
expected_score4 = 1 / (1 + (10^((1716 - 1794) / 400)))

expected_gary_hua = sum(expected_score39, expected_score21, expected_score18, expected_score14, expected_score7, expected_score12, expected_score4)

expected_gary_hua
[1] 5.161574
expected_points_df |> filter(pair_num == 1)
# A tibble: 1 × 2
  pair_num expected_points
     <dbl>           <dbl>
1        1            5.16

Hand calculating for Gary Hua resulted in an expected score of 5.161574 while the algorithm resulted in the same number so it is correct.

Conclusion

The ELO formula provided by singingbanana’s youtube video was reliable in calculating each player’s expected score. In the end I was able to find the top 5 overperformers and the top 5 underperformers. I was also able to verify the accuracy of the expected score by comparing GARY HUA’s score with my own hand calculations.