Approach

For this project, the previous data set created in Project 1 chess tournament will be used to calculate the expected scores and the difference of their expected scores to their actual scores. After calculation to performed, the 5 players that overperformed/underperformed relatively to their expected scores can be extracted and listed.

Steps:

  1. Data Loading: To utilize the data set used in Project 1 chess tournament tables, I sourced from the uploaded R script file URL in Github

  2. Data Manipulation: From there I will create a longer formatted table and transform it so that it is easier to work with for calculations

  3. Elo Calculation: According to the video, The Elo Rating System for Chess and Beyond, to calculate a player’s expected score, the formula is:

    $$E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}$$

E is the expected score, R is the rating and A/B under is referred to as the players.

This formula will be applied to each player to get their expected score for comparison of the actual expected score vs the calculated expected score and will be use to calculate the difference.

  1. Extraction: 5 Over performer and 5 Under performers will be extracted from the table to create a list based on the differences.

Source:

Execution Code Base

Data loading

source("https://raw.githubusercontent.com/meiqing39/DATA-607/refs/heads/main/Project_1/Project1_Chess_codebase.R")
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.0     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.2     ✔ tibble    3.3.1
## ✔ lubridate 1.9.5     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Data Manipulation

head(combined_data)
## # A tibble: 6 × 13
##   Player_Num Player_Name  Total_Points R1    R2    R3    R4    R5    R6    R7   
##        <dbl> <chr>               <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1          1 GARY HUA              6   W  39 W  21 W  18 W  14 W   7 D  12 D   4
## 2          2 DAKSHESH DA…          6   W  63 W  58 L   4 W  17 W  16 W  20 W   7
## 3          3 ADITYA BAJAJ          6   L   8 W  61 W  25 W  21 W  11 W  13 W  12
## 4          4 PATRICK H S…          5.5 W  23 D  28 W   2 W  26 D   5 W  19 D   1
## 5          5 HANSHI ZUO            5.5 W  45 W  37 D  12 D  13 D   4 W  14 W  17
## 6          6 HANSEN SONG           5   W  34 D  29 L  11 W  35 D  10 W  27 W  21
## # ℹ 3 more variables: Player_State <chr>, Rating_Info <chr>, Pre_Rating <dbl>

Data Manipulation

newchess_df <-combined_data |> 
  select(Player_Num, Total_Points, Pre_Rating, R1:R7) |> 
  pivot_longer(R1:R7, names_to = "Rounds", values_to = "Match_Result") |>  #create long table for all the rounds played
  mutate(Opponent_Number = as.numeric(str_extract(Match_Result, "\\d+")))      #extract from Match_Result table to Opponent#

head(newchess_df)                  #Check for both tables to check variables and column names
## # A tibble: 6 × 6
##   Player_Num Total_Points Pre_Rating Rounds Match_Result Opponent_Number
##        <dbl>        <dbl>      <dbl> <chr>  <chr>                  <dbl>
## 1          1            6       1794 R1     W  39                     39
## 2          1            6       1794 R2     W  21                     21
## 3          1            6       1794 R3     W  18                     18
## 4          1            6       1794 R4     W  14                     14
## 5          1            6       1794 R5     W   7                      7
## 6          1            6       1794 R6     D  12                     12
head(ratings_lookup)
## # A tibble: 6 × 2
##   Player_Num Pre_Rating
##        <dbl>      <dbl>
## 1          1       1794
## 2          2       1553
## 3          3       1384
## 4          4       1716
## 5          5       1655
## 6          6       1686

ELO Calculation

#Join old table with new ratings look up table 
newchess_df1 <- newchess_df|> left_join(ratings_lookup, by = c("Opponent_Number" = "Player_Num")) |>  
  rename(Opponent_Rating = Pre_Rating.y)          

#Calcuate expected elo score of each player then add everything up for all rounds for each player to get Total expected
ELO_Cal <-newchess_df1 |> 
  mutate(Expected_score = 1/(1+10^((Opponent_Rating - Pre_Rating.x)/400))) |> 
  group_by(Player_Num, Total_Points) |> 
  summarise(Expected_total_score = sum(Expected_score), .groups = "drop") |> 
  mutate(Score_Difference = Total_Points - Expected_total_score)  #Subtract the Total expected score of players with their actual total score 

Extract the Top 5s

Overperformers_5 <- ELO_Cal |>                
  arrange(desc(Score_Difference)) |>  #Arranging by descending order for Top 5 Overperformers
  head(5)

Underperformers_5 <- ELO_Cal |> 
  arrange(Score_Difference) |>  #Arranging by ascending order for Top 5 Underperformers
  head(5)

print(Overperformers_5)
## # A tibble: 5 × 4
##   Player_Num Total_Points Expected_total_score Score_Difference
##        <dbl>        <dbl>                <dbl>            <dbl>
## 1          3          6                 1.95               4.05
## 2         15          4.5               1.37               3.13
## 3         10          5                 1.94               3.06
## 4         46          3                 0.0432             2.96
## 5          9          5                 2.29               2.71
print(Underperformers_5)
## # A tibble: 5 × 4
##   Player_Num Total_Points Expected_total_score Score_Difference
##        <dbl>        <dbl>                <dbl>            <dbl>
## 1         25          3.5                 6.28            -2.78
## 2         30          3.5                 6.02            -2.52
## 3         42          3                   5.01            -2.01
## 4         31          3.5                 5.09            -1.59
## 5         35          3.5                 4.96            -1.46

Conclusion

The initial challenge was to tidy and transform a unstructured and messy formatted chess tournament text file where string manipulation, tidyr, and regular expression functions were used to create a clean and tidy data set that can be used for SQL relational database integration. After this data set was used in this project to evaluate the player’s ratings using the Arpad Elo rating formula. As this data set was very wide, it was necessary to transform the data table into a long format in order to record the match history and used the numbers for expected win probability calculation for every round for each player. The expected scores is then compared with the actual tournament points to identify the top 5 overperformers and underperformers.

Due to the reproducibility of this R codebase after cleaning the data set, I was able to quickly generate metrics that can be used for rating analysis and mathematic calculations.