For this assignment, I will use the chess tournament information from Project 1 and calculate each player’s expected score and the difference from their actual score. From there, I will list the five players who overperformed the most relative to their expected score, as well as the five players that underperformed the most relative to their expected score. I think the formula given in the reference video is a good start and should be reasonable for this assignment.
I will start with creating a data frame from the chess tournament information provided in project 1.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.2.0 ✔ readr 2.1.6
✔ forcats 1.0.1 ✔ stringr 1.5.2
✔ ggplot2 4.0.2 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.1.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(gt)# 1. Load the raw textraw_lines <-readLines("https://raw.githubusercontent.com/longflin/DATA-607/refs/heads/main/Assignment%205B/tournamentinfo.txt")
Warning in
readLines("https://raw.githubusercontent.com/longflin/DATA-607/refs/heads/main/Assignment%205B/tournamentinfo.txt"):
incomplete final line found on
'https://raw.githubusercontent.com/longflin/DATA-607/refs/heads/main/Assignment%205B/tournamentinfo.txt'
# 2. Remove dashes and empty linesclean_lines <- raw_lines[!grepl("^-+$", raw_lines) &nzchar(raw_lines)]# 3. Skip the header (first 2 lines of the remaining text)data_lines <- clean_lines[3:length(clean_lines)]# 4. Separate the two-row structure# Every odd line is a "Name" row, every even line is a "ID/Rating" rowname_rows <- data_lines[seq(1, length(data_lines), by =2)]info_rows <- data_lines[seq(2, length(data_lines), by =2)]# 5. Parse the Name Rows (Fixed Width)# Column positions: Pair Num (1-5), Name (9-40), Total Pts (36-40), Rounds (42-end)names_df <-data.frame(pair_num =as.numeric(str_sub(name_rows, 1, 5)),player_name =str_trim(str_sub(name_rows, 9, 40)),total_pts =as.numeric(str_sub(name_rows, 42, 44)),stringsAsFactors =FALSE)# 6. Parse the Info Rows (Extracting State and Pre-Rating)info_df <-data.frame(state =str_trim(str_sub(info_rows, 1, 5)),uscf_id =str_extract(info_rows, "\\d{8}"),pre_rating =as.numeric(str_extract(info_rows, "(?<=R: )\\s*\\d+")),stringsAsFactors =FALSE)# 7. Merge names with infotournament_df <-cbind(names_df, info_df)head(tournament_df, 100)
pair_num player_name total_pts state uscf_id pre_rating
1 1 GARY HUA 6.0 ON 15445895 1794
2 2 DAKSHESH DARURI 6.0 MI 14598900 1553
3 3 ADITYA BAJAJ 6.0 MI 14959604 1384
4 4 PATRICK H SCHILLING 5.5 MI 12616049 1716
5 5 HANSHI ZUO 5.5 MI 14601533 1655
6 6 HANSEN SONG 5.0 OH 15055204 1686
7 7 GARY DEE SWATHELL 5.0 MI 11146376 1649
8 8 EZEKIEL HOUGHTON 5.0 MI 15142253 1641
9 9 STEFANO LEE 5.0 ON 14954524 1411
10 10 ANVIT RAO 5.0 MI 14150362 1365
11 11 CAMERON WILLIAM MC LEMAN 4.5 MI 12581589 1712
12 12 KENNETH J TACK 4.5 MI 12681257 1663
13 13 TORRANCE HENRY JR 4.5 MI 15082995 1666
14 14 BRADLEY SHAW 4.5 MI 10131499 1610
15 15 ZACHARY JAMES HOUGHTON 4.5 MI 15619130 1220
16 16 MIKE NIKITIN 4.0 MI 10295068 1604
17 17 RONALD GRZEGORCZYK 4.0 MI 10297702 1629
18 18 DAVID SUNDEEN 4.0 MI 11342094 1600
19 19 DIPANKAR ROY 4.0 MI 14862333 1564
20 20 JASON ZHENG 4.0 MI 14529060 1595
21 21 DINH DANG BUI 4.0 ON 15495066 1563
22 22 EUGENE L MCCLURE 4.0 MI 12405534 1555
23 23 ALAN BUI 4.0 ON 15030142 1363
24 24 MICHAEL R ALDRICH 4.0 MI 13469010 1229
25 25 LOREN SCHWIEBERT 3.5 MI 12486656 1745
26 26 MAX ZHU 3.5 ON 15131520 1579
27 27 GAURAV GIDWANI 3.5 MI 14476567 1552
28 28 SOFIA ADINA STANESCU-BELLU 3.5 MI 14882954 1507
29 29 CHIEDOZIE OKORIE 3.5 MI 15323285 1602
30 30 GEORGE AVERY JONES 3.5 ON 12577178 1522
31 31 RISHI SHETTY 3.5 MI 15131618 1494
32 32 JOSHUA PHILIP MATHEWS 3.5 ON 14073750 1441
33 33 JADE GE 3.5 MI 14691842 1449
34 34 MICHAEL JEFFERY THOMAS 3.5 MI 15051807 1399
35 35 JOSHUA DAVID LEE 3.5 MI 14601397 1438
36 36 SIDDHARTH JHA 3.5 MI 14773163 1355
37 37 AMIYATOSH PWNANANDAM 3.5 MI 15489571 980
38 38 BRIAN LIU 3.0 MI 15108523 1423
39 39 JOEL R HENDON 3.0 MI 12923035 1436
40 40 FOREST ZHANG 3.0 MI 14892710 1348
41 41 KYLE WILLIAM MURPHY 3.0 MI 15761443 1403
42 42 JARED GE 3.0 MI 14462326 1332
43 43 ROBERT GLEN VASEY 3.0 MI 14101068 1283
44 44 JUSTIN D SCHILLING 3.0 MI 15323504 1199
45 45 DEREK YAN 3.0 MI 15372807 1242
46 46 JACOB ALEXANDER LAVALLEY 3.0 MI 15490981 377
47 47 ERIC WRIGHT 2.5 MI 12533115 1362
48 48 DANIEL KHAIN 2.5 MI 14369165 1382
49 49 MICHAEL J MARTIN 2.5 MI 12531685 1291
50 50 SHIVAM JHA 2.5 MI 14773178 1056
51 51 TEJAS AYYAGARI 2.5 MI 15205474 1011
52 52 ETHAN GUO 2.5 MI 14918803 935
53 53 JOSE C YBARRA 2.0 MI 12578849 1393
54 54 LARRY HODGE 2.0 MI 12836773 1270
55 55 ALEX KONG 2.0 MI 15412571 1186
56 56 MARISA RICCI 2.0 MI 14679887 1153
57 57 MICHAEL LU 2.0 MI 15113330 1092
58 58 VIRAJ MOHILE 2.0 MI 14700365 917
59 59 SEAN M MC CORMICK 2.0 MI 12841036 853
60 60 JULIA SHEN 1.5 MI 14579262 967
61 61 JEZZEL FARKAS 1.5 ON 15771592 955
62 62 ASHWIN BALAJI 1.0 MI 15219542 1530
63 63 THOMAS JOSEPH HOSMER 1.0 MI 15057092 1175
64 64 BEN LI 1.0 MI 15006561 1163
For the round data, I created a data frame called opponents_df and stored the rounds as well as the opponent_pre_rating for each round.
# Extract the string containing the roundsrounds_raw <-str_sub(name_rows, 42, -1)# Extract just the digits (Opponent Pair Numbers)# We use str_extract_all to get every number in the rowopponents_list <-str_extract_all(rounds_raw, "(?<=[WLD])\\s*\\d+")# Convert the list into a data frame where each row is a gameopponents_df <-data.frame(pair_num =rep(tournament_df$pair_num, lengths(opponents_list)),opponent_id =as.numeric(unlist(opponents_list)))#Grab the opponent's pre rating from the tournament_dfopponents_df$opponent_pre_rating <- tournament_df$pre_rating[match(opponents_df$opponent_id, tournament_df$pair_num)]head(opponents_df)
Once I parsed the data from the tournamentinfo.txt file, I created a matches data frame by taking the opponents_df and adding the pre_rating value associated with each pair_num in the opponents_df using a left_join.
pair_num player_name total_pts state uscf_id pre_rating
1 1 GARY HUA 6.0 ON 15445895 1794
2 2 DAKSHESH DARURI 6.0 MI 14598900 1553
3 3 ADITYA BAJAJ 6.0 MI 14959604 1384
4 4 PATRICK H SCHILLING 5.5 MI 12616049 1716
5 5 HANSHI ZUO 5.5 MI 14601533 1655
6 6 HANSEN SONG 5.0 OH 15055204 1686
expected_points differential
1 5.161574 0.83842636
2 3.778825 2.22117517
3 1.945088 4.05491209
4 4.741764 0.75823568
5 4.382484 1.11751602
6 4.944596 0.05540355
In order to get the top 5 overperformers, I used dplyr’s arrange function to arrange the list in descending order and then used the head function to grab the top 5.
In order to get the top 5 underperformers, I used dplyr’s arrange function to arrange the list in the default ascending order and then used the head function to grab the top 5.
Hand calculating for Gary Hua resulted in an expected score of 5.161574 while the algorithm resulted in the same number so it is correct.
Conclusion
The ELO formula provided by singingbanana’s youtube video was reliable in calculating each player’s expected score. In the end I was able to find the top 5 overperformers and the top 5 underperformers. I was also able to verify the accuracy of the expected score by comparing GARY HUA’s score with my own hand calculations.