Introduction: Chess Tournament ELO Calculations

Using the same chess tournament data from Project 1, we will use the difference between each chess player and their opponents pre-ratings to calculate each chess player’s expected score. We will then find the difference between their expected score and their actual score in the tournament. Finally, we will list the five players who most overperformed and the five players that most underperformed relative to their expected score.

To calculate each player’s expected score the following ELO formula:

knitr::include_graphics(img)

For each player the tournament info CSV file will include the following information:

The Player’s Name, Player’s State, Total Number of Points, Player’s Pre-Rating, and Average Pre Chess Rating of Opponents.

We will use the following libraries:

The tidyverse library

If you do not want to go through the data cleaning steps taken in Project 1, feel free to skip ahead to START OF ASSIGNMENT 5B on the menu, as this is where the work for Assignment 5B begins:

We will be redoing the data cleaning steps from Project 1, staring with reading in our data from Github and using the readLines() function:

#reading in the csv from Project1's repository
url <-("https://raw.githubusercontent.com/WendyR20/DATA-607-Project-1/refs/heads/main/tournamentinfo.txt")
lines <- readLines(url)

## Warning in readLines(url): incomplete final line found on
## 'https://raw.githubusercontent.com/WendyR20/DATA-607-Project-1/refs/heads/main/tournamentinfo.txt'

head(lines,12)

##  [1] "-----------------------------------------------------------------------------------------" 
##  [2] " Pair | Player Name                     |Total|Round|Round|Round|Round|Round|Round|Round| "
##  [3] " Num  | USCF ID / Rtg (Pre->Post)       | Pts |  1  |  2  |  3  |  4  |  5  |  6  |  7  | "
##  [4] "-----------------------------------------------------------------------------------------" 
##  [5] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|" 
##  [6] "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |" 
##  [7] "-----------------------------------------------------------------------------------------" 
##  [8] "    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|" 
##  [9] "   MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |" 
## [10] "-----------------------------------------------------------------------------------------" 
## [11] "    3 | ADITYA BAJAJ                    |6.0  |L   8|W  61|W  25|W  21|W  11|W  13|W  12|" 
## [12] "   MI | 14959604 / R: 1384   ->1640     |N:2  |W    |B    |W    |B    |W    |B    |W    |"

#taking a glimpse at our data
glimpse(lines)

##  chr [1:196] "-----------------------------------------------------------------------------------------" ...

First we cleaned the data:

#replacing every dash with emptiness
lines <- gsub("-", "", lines)

#deleting every empty line
lines <- lines[nzchar(lines)]

#deleting the header
lines <- lines[-c(1,2)]

#let's group every two lines 
blocks <- split(lines, ceiling(seq_along(lines)/2))

head(blocks)

## $`1`
## [1] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|"
## [2] "   ON | 15445895 / R: 1794   >1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |" 
## 
## $`2`
## [1] "    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|"
## [2] "   MI | 14598900 / R: 1553   >1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |" 
## 
## $`3`
## [1] "    3 | ADITYA BAJAJ                    |6.0  |L   8|W  61|W  25|W  21|W  11|W  13|W  12|"
## [2] "   MI | 14959604 / R: 1384   >1640     |N:2  |W    |B    |W    |B    |W    |B    |W    |" 
## 
## $`4`
## [1] "    4 | PATRICK H SCHILLING             |5.5  |W  23|D  28|W   2|W  26|D   5|W  19|D   1|"
## [2] "   MI | 12616049 / R: 1716   >1744     |N:2  |W    |B    |W    |B    |W    |B    |B    |" 
## 
## $`5`
## [1] "    5 | HANSHI ZUO                      |5.5  |W  45|W  37|D  12|D  13|D   4|W  14|W  17|"
## [2] "   MI | 14601533 / R: 1655   >1690     |N:2  |B    |W    |B    |W    |B    |W    |B    |" 
## 
## $`6`
## [1] "    6 | HANSEN SONG                     |5.0  |W  34|D  29|L  11|W  35|D  10|W  27|W  21|"
## [2] "   OH | 15055204 / R: 1686   >1687     |N:3  |W    |B    |W    |B    |B    |W    |B    |"

Then created a function to extract the player information we needed:

#creating a function can extract player information from the split blocks

player_info <- function(block) {
  
  player_line1 <- block[1]
  player_line2 <- block[2]
  
  
  #extracting each round result and the opponent number
  rounds <- str_extract_all(player_line1, "[WLDB]\\s*\\d+")[[1]] 
  
  #first time using a tibble 
  tibble::tibble(
    
    Player_Number = str_extract(player_line1, "^\\s*\\d+"), #extracting the number at the new line
    State = str_extract(player_line2,"^\\s*\\w+" ), # extracting the first word at the new line
    Name = str_trim(str_extract(player_line1, "(?<=\\| ).*?(?=\\|)")),
    Total_Points = str_extract(player_line1, "\\d+\\.\\d"  ), 
    Pre_Rating = str_match(player_line2, "R:\\s+(\\d+)")[,2],
    
    #giving each round result it's own column
    !!!setNames(as.list(rounds), paste0("Round", seq_along(rounds))),
  )
  
}

# bind all players
player_data <- bind_rows(lapply(blocks,player_info))

head(player_data)

## # A tibble: 6 × 12
##   Player_Number State  Name  Total_Points Pre_Rating Round1 Round2 Round3 Round4
##   <chr>         <chr>  <chr> <chr>        <chr>      <chr>  <chr>  <chr>  <chr> 
## 1 "    1"       "   O… GARY… 6.0          1794       W  39  W  21  W  18  W  14 
## 2 "    2"       "   M… DAKS… 6.0          1553       W  63  W  58  L   4  W  17 
## 3 "    3"       "   M… ADIT… 6.0          1384       L   8  W  61  W  25  W  21 
## 4 "    4"       "   M… PATR… 5.5          1716       W  23  D  28  W   2  W  26 
## 5 "    5"       "   M… HANS… 5.5          1655       W  45  W  37  D  12  D  13 
## 6 "    6"       "   O… HANS… 5.0          1686       W  34  D  29  L  11  W  35 
## # ℹ 3 more variables: Round5 <chr>, Round6 <chr>, Round7 <chr>

#convert the tibble to a data frame
Chess_Players <- as.data.frame(player_data)

#extracting the numbers from the rounds columns.

Chess_Players <- Chess_Players %>%
  mutate(across(c(Round1, Round2, Round3, Round4, Round5, Round6,Round7),
                ~ as.numeric(str_remove_all(., "[^0-9]"))))

#Let's take a look at our data so far
head(Chess_Players)

##   Player_Number State                Name Total_Points Pre_Rating Round1 Round2
## 1             1    ON            GARY HUA          6.0       1794     39     21
## 2             2    MI     DAKSHESH DARURI          6.0       1553     63     58
## 3             3    MI        ADITYA BAJAJ          6.0       1384      8     61
## 4             4    MI PATRICK H SCHILLING          5.5       1716     23     28
## 5             5    MI          HANSHI ZUO          5.5       1655     45     37
## 6             6    OH         HANSEN SONG          5.0       1686     34     29
##   Round3 Round4 Round5 Round6 Round7
## 1     18     14      7     12      4
## 2      4     17     16     20      7
## 3     25     21     11     13     12
## 4      2     26      5     19      1
## 5     12     13      4     14     17
## 6     11     35     10     27     21

glimpse(Chess_Players)

## Rows: 64
## Columns: 12
## $ Player_Number <chr> "    1", "    2", "    3", "    4", "    5", "    6", " …
## $ State         <chr> "   ON", "   MI", "   MI", "   MI", "   MI", "   OH", " …
## $ Name          <chr> "GARY HUA", "DAKSHESH DARURI", "ADITYA BAJAJ", "PATRICK …
## $ Total_Points  <chr> "6.0", "6.0", "6.0", "5.5", "5.5", "5.0", "5.0", "5.0", …
## $ Pre_Rating    <chr> "1794", "1553", "1384", "1716", "1655", "1686", "1649", …
## $ Round1        <dbl> 39, 63, 8, 23, 45, 34, 57, 3, 25, 16, 38, 42, 36, 54, 19…
## $ Round2        <dbl> 21, 58, 61, 28, 37, 29, 46, 32, 18, 19, 56, 33, 27, 44, …
## $ Round3        <dbl> 18, 4, 25, 2, 12, 11, 13, 14, 59, 55, 6, 5, 7, 8, 30, 39…
## $ Round4        <dbl> 14, 17, 21, 26, 13, 35, 11, 9, 8, 31, 7, 38, 5, 1, 22, 2…
## $ Round5        <dbl> 7, 16, 11, 5, 4, 10, 1, 47, 26, 6, 3, 1, 33, 27, 54, 36,…
## $ Round6        <dbl> 12, 20, 13, 19, 14, 27, 9, 28, 7, 25, 34, 3, 3, 5, 33, N…
## $ Round7        <dbl> 4, 7, 12, 1, 17, 21, 2, 19, 20, 18, 26, NA, 32, 31, 38, …

#changing the pre-rating column to an integer since it will do calculations with this column later on.

Chess_Players$Pre_Rating <- as.integer(Chess_Players$Pre_Rating)

#replacing any null values in the Round Columns with zero.

Chess_Players <- Chess_Players %>%
  mutate(across(c(Round1, Round2, Round3, Round4, Round5, Round6,Round7),
                ~replace_na(., 0)))

glimpse(Chess_Players)

## Rows: 64
## Columns: 12
## $ Player_Number <chr> "    1", "    2", "    3", "    4", "    5", "    6", " …
## $ State         <chr> "   ON", "   MI", "   MI", "   MI", "   MI", "   OH", " …
## $ Name          <chr> "GARY HUA", "DAKSHESH DARURI", "ADITYA BAJAJ", "PATRICK …
## $ Total_Points  <chr> "6.0", "6.0", "6.0", "5.5", "5.5", "5.0", "5.0", "5.0", …
## $ Pre_Rating    <int> 1794, 1553, 1384, 1716, 1655, 1686, 1649, 1641, 1411, 13…
## $ Round1        <dbl> 39, 63, 8, 23, 45, 34, 57, 3, 25, 16, 38, 42, 36, 54, 19…
## $ Round2        <dbl> 21, 58, 61, 28, 37, 29, 46, 32, 18, 19, 56, 33, 27, 44, …
## $ Round3        <dbl> 18, 4, 25, 2, 12, 11, 13, 14, 59, 55, 6, 5, 7, 8, 30, 39…
## $ Round4        <dbl> 14, 17, 21, 26, 13, 35, 11, 9, 8, 31, 7, 38, 5, 1, 22, 2…
## $ Round5        <dbl> 7, 16, 11, 5, 4, 10, 1, 47, 26, 6, 3, 1, 33, 27, 54, 36,…
## $ Round6        <dbl> 12, 20, 13, 19, 14, 27, 9, 28, 7, 25, 34, 3, 3, 5, 33, 0…
## $ Round7        <dbl> 4, 7, 12, 1, 17, 21, 2, 19, 20, 18, 26, 0, 32, 31, 38, 0…

#pivoting the data frame so each round has it's own row instead of it's own column for each player.

Chess_Players_Pivot <- Chess_Players %>%
  pivot_longer(cols = starts_with("Round"),
               names_to = "Round_Column",
               values_to = "Opp_Number") %>%
  filter(!is.na(Opp_Number))

glimpse(Chess_Players_Pivot)

## Rows: 448
## Columns: 7
## $ Player_Number <chr> "    1", "    1", "    1", "    1", "    1", "    1", " …
## $ State         <chr> "   ON", "   ON", "   ON", "   ON", "   ON", "   ON", " …
## $ Name          <chr> "GARY HUA", "GARY HUA", "GARY HUA", "GARY HUA", "GARY HU…
## $ Total_Points  <chr> "6.0", "6.0", "6.0", "6.0", "6.0", "6.0", "6.0", "6.0", …
## $ Pre_Rating    <int> 1794, 1794, 1794, 1794, 1794, 1794, 1794, 1553, 1553, 15…
## $ Round_Column  <chr> "Round1", "Round2", "Round3", "Round4", "Round5", "Round…
## $ Opp_Number    <dbl> 39, 21, 18, 14, 7, 12, 4, 63, 58, 4, 17, 16, 20, 7, 8, 6…

#joining the original Chess_Players data frame to the Chess_Players_Pivot2 data frame
#both player_number columns must be a double type in both data frames.

Chess_Players$Player_Number <- as.double(Chess_Players$Player_Number)

Chess_Players_Pivot2 <- Chess_Players_Pivot %>%
  left_join(
    Chess_Players %>% 
      select(Player_Number, Pre_Rating),
    by = c("Opp_Number" = "Player_Number")
  )

#renaming the pre-ratings columns so we can tell them apart for the sake of our calculations
names(Chess_Players_Pivot2)

## [1] "Player_Number" "State"         "Name"          "Total_Points" 
## [5] "Pre_Rating.x"  "Round_Column"  "Opp_Number"    "Pre_Rating.y"

Chess_Players_Pivot2 <- Chess_Players_Pivot2 %>%
  rename(Player_PreRating = Pre_Rating.x,
         Opp_PreRating = Pre_Rating.y)

START OF ASSIGNMENT 5B

To find the five most underperforming players in the tournament (based on the difference between their actual and expected scores), we need to find the tournament expected scores for each player in the tournament and compare them with their actual tournament scores.

Calculating Expected Scores

Let’s Calculate the expected score after each round for each player.

#first let's create a function to calculate the expected score
expected_score <- function(player_a, player_b) {
  1 / (1 + 10 ^ ((player_b - player_a) / 400))
}

#now we can use our function and add a column to our data frame
Chess_Players_Pivot2 <- Chess_Players_Pivot2 %>%
  mutate( Player_Expected_Score = expected_score(Player_PreRating, Opp_PreRating))

Calculating the Tournament Expected Score for Each Player

Let’s find the overall expected score for the tournament for each player.

PlayersTournamnet_ExpScores <- Chess_Players_Pivot2 %>%
  group_by(Player_Number) %>%
  summarise(
    Tournament_ExpScore = sum(Player_Expected_Score, na.rm = TRUE)
  )
  
  head(PlayersTournamnet_ExpScores)

## # A tibble: 6 × 2
##   Player_Number Tournament_ExpScore
##   <chr>                       <dbl>
## 1 "    1"                      5.16
## 2 "    2"                      3.78
## 3 "    3"                      1.95
## 4 "    4"                      4.74
## 5 "    5"                      4.38
## 6 "    6"                      4.94

Each player’s Expected score vs. Actual score

Let’s join with the original chess players table. We need the Player Number columns for the join so we need to check the data types of columns we’ll be using for the join and make sure they are of the same type. If the columns are different data types we will need to change them.

glimpse(PlayersTournamnet_ExpScores)

## Rows: 64
## Columns: 2
## $ Player_Number       <chr> "    1", "    2", "    3", "    4", "    5", "    …
## $ Tournament_ExpScore <dbl> 5.161574, 3.778825, 1.945088, 4.741764, 4.382484, …

Since the Player Number columns are of different types, let’s convert them and make absolutely sure the column names are the exact same by trimming any white space that might exist in the column names.

Chess_Players <- Chess_Players %>%
  mutate(Player_Number = trimws(as.character(Player_Number)))

PlayersTournamnet_ExpScores <- PlayersTournamnet_ExpScores %>%
  mutate(Player_Number = trimws(as.character(Player_Number)))

Okay, now that both our Player Number columns in both data frame are of the same type, we can go ahead and join the data frames.

Chess_Players <- Chess_Players %>%
  left_join(PlayersTournamnet_ExpScores, by = "Player_Number")

Let’s subset and keep only the Total Points, Player Names and Expected Score columns in a new data frame.

Chess_Player_Scores <- Chess_Players %>%
  select(-Player_Number,-State,-Pre_Rating,-Round1, -Round2, -Round3, -Round4, -Round5, -Round6,-Round7)

Let’s round each player’s expected scores.

Chess_Player_Scores$Tournament_ExpScore <- round(Chess_Player_Scores$Tournament_ExpScore, 
                                                 digits = 1)

We will need to use the values in the Total Points column in our calculations of each player’s actual score minus their expected score, so let us change the data type to numeric.

Chess_Player_Scores <- Chess_Player_Scores %>%
  mutate(
    Total_Points = as.numeric(Total_Points)
  )

To find the five players that most overperformed and underperformed we we will need to find the difference between each player’s actual score and their expected score.

Chess_Player_Scores <- Chess_Player_Scores %>%
  mutate(ActScore_ExpMinus = Total_Points -Tournament_ExpScore )

Now, we can finally find and print the five players who most overperformed and underperformed based on the difference between their actual score and their expected score.

The Five Players Who Most Overerperformed

Let’s put the data frame in descending order so we can find the five most overperforming players:

Overperforming_Players <- Chess_Player_Scores %>%
  arrange(desc(ActScore_ExpMinus))

Okay, great now let’s print only the five most overperforming players.

head(Overperforming_Players,5)

##                       Name Total_Points Tournament_ExpScore ActScore_ExpMinus
## 1             ADITYA BAJAJ          6.0                 1.9               4.1
## 2                ANVIT RAO          5.0                 1.9               3.1
## 3   ZACHARY JAMES HOUGHTON          4.5                 1.4               3.1
## 4 JACOB ALEXANDER LAVALLEY          3.0                 0.0               3.0
## 5              STEFANO LEE          5.0                 2.3               2.7

The Five Players Who Most Underperformed

Now, let’s put the data frame in ascending order so we can find the five most underperforming players :

Underperforming_Players <- Chess_Player_Scores %>%
  arrange(ActScore_ExpMinus)

Okay, great now let’s print only the five most underperforming players.

head(Underperforming_Players,5)

##                 Name Total_Points Tournament_ExpScore ActScore_ExpMinus
## 1   LOREN SCHWIEBERT          3.5                 6.3              -2.8
## 2 GEORGE AVERY JONES          3.5                 6.0              -2.5
## 3           JARED GE          3.0                 5.0              -2.0
## 4       RISHI SHETTY          3.5                 5.1              -1.6
## 5   JOSHUA DAVID LEE          3.5                 5.0              -1.5

DATA 607 Assignment 5B