Using the same chess tournament data from Project 1, we will use the difference between each chess player and their opponents pre-ratings to calculate each chess player’s expected score. We will then find the difference between their expected score and their actual score in the tournament. Finally, we will list the five players who most overperformed and the five players that most underperformed relative to their expected score.
To calculate each player’s expected score the following ELO formula:
For each player the tournament info CSV file will include the following information:
The Player’s Name, Player’s State, Total Number of Points, Player’s Pre-Rating, and Average Pre Chess Rating of Opponents.
We will use the following libraries:
If you do not want to go through the data cleaning steps taken in Project 1, feel free to skip ahead to START OF ASSIGNMENT 5B on the menu, as this is where the work for Assignment 5B begins:
We will be redoing the data cleaning steps from Project 1, staring with reading in our data from Github and using the readLines() function:
#reading in the csv from Project1's repository
url <-("https://raw.githubusercontent.com/WendyR20/DATA-607-Project-1/refs/heads/main/tournamentinfo.txt")
lines <- readLines(url)
## Warning in readLines(url): incomplete final line found on
## 'https://raw.githubusercontent.com/WendyR20/DATA-607-Project-1/refs/heads/main/tournamentinfo.txt'
## [1] "-----------------------------------------------------------------------------------------"
## [2] " Pair | Player Name |Total|Round|Round|Round|Round|Round|Round|Round| "
## [3] " Num | USCF ID / Rtg (Pre->Post) | Pts | 1 | 2 | 3 | 4 | 5 | 6 | 7 | "
## [4] "-----------------------------------------------------------------------------------------"
## [5] " 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|"
## [6] " ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |"
## [7] "-----------------------------------------------------------------------------------------"
## [8] " 2 | DAKSHESH DARURI |6.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7|"
## [9] " MI | 14598900 / R: 1553 ->1663 |N:2 |B |W |B |W |B |W |B |"
## [10] "-----------------------------------------------------------------------------------------"
## [11] " 3 | ADITYA BAJAJ |6.0 |L 8|W 61|W 25|W 21|W 11|W 13|W 12|"
## [12] " MI | 14959604 / R: 1384 ->1640 |N:2 |W |B |W |B |W |B |W |"
## chr [1:196] "-----------------------------------------------------------------------------------------" ...
First we cleaned the data:
#replacing every dash with emptiness
lines <- gsub("-", "", lines)
#deleting every empty line
lines <- lines[nzchar(lines)]
## $`1`
## [1] " 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|"
## [2] " ON | 15445895 / R: 1794 >1817 |N:2 |W |B |W |B |W |B |W |"
##
## $`2`
## [1] " 2 | DAKSHESH DARURI |6.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7|"
## [2] " MI | 14598900 / R: 1553 >1663 |N:2 |B |W |B |W |B |W |B |"
##
## $`3`
## [1] " 3 | ADITYA BAJAJ |6.0 |L 8|W 61|W 25|W 21|W 11|W 13|W 12|"
## [2] " MI | 14959604 / R: 1384 >1640 |N:2 |W |B |W |B |W |B |W |"
##
## $`4`
## [1] " 4 | PATRICK H SCHILLING |5.5 |W 23|D 28|W 2|W 26|D 5|W 19|D 1|"
## [2] " MI | 12616049 / R: 1716 >1744 |N:2 |W |B |W |B |W |B |B |"
##
## $`5`
## [1] " 5 | HANSHI ZUO |5.5 |W 45|W 37|D 12|D 13|D 4|W 14|W 17|"
## [2] " MI | 14601533 / R: 1655 >1690 |N:2 |B |W |B |W |B |W |B |"
##
## $`6`
## [1] " 6 | HANSEN SONG |5.0 |W 34|D 29|L 11|W 35|D 10|W 27|W 21|"
## [2] " OH | 15055204 / R: 1686 >1687 |N:3 |W |B |W |B |B |W |B |"
Then created a function to extract the player information we needed:
#creating a function can extract player information from the split blocks
player_info <- function(block) {
player_line1 <- block[1]
player_line2 <- block[2]
#extracting each round result and the opponent number
rounds <- str_extract_all(player_line1, "[WLDB]\\s*\\d+")[[1]]
#first time using a tibble
tibble::tibble(
Player_Number = str_extract(player_line1, "^\\s*\\d+"), #extracting the number at the new line
State = str_extract(player_line2,"^\\s*\\w+" ), # extracting the first word at the new line
Name = str_trim(str_extract(player_line1, "(?<=\\| ).*?(?=\\|)")),
Total_Points = str_extract(player_line1, "\\d+\\.\\d" ),
Pre_Rating = str_match(player_line2, "R:\\s+(\\d+)")[,2],
#giving each round result it's own column
!!!setNames(as.list(rounds), paste0("Round", seq_along(rounds))),
)
}
## # A tibble: 6 × 12
## Player_Number State Name Total_Points Pre_Rating Round1 Round2 Round3 Round4
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 " 1" " O… GARY… 6.0 1794 W 39 W 21 W 18 W 14
## 2 " 2" " M… DAKS… 6.0 1553 W 63 W 58 L 4 W 17
## 3 " 3" " M… ADIT… 6.0 1384 L 8 W 61 W 25 W 21
## 4 " 4" " M… PATR… 5.5 1716 W 23 D 28 W 2 W 26
## 5 " 5" " M… HANS… 5.5 1655 W 45 W 37 D 12 D 13
## 6 " 6" " O… HANS… 5.0 1686 W 34 D 29 L 11 W 35
## # ℹ 3 more variables: Round5 <chr>, Round6 <chr>, Round7 <chr>
#extracting the numbers from the rounds columns.
Chess_Players <- Chess_Players %>%
mutate(across(c(Round1, Round2, Round3, Round4, Round5, Round6,Round7),
~ as.numeric(str_remove_all(., "[^0-9]"))))
## Player_Number State Name Total_Points Pre_Rating Round1 Round2
## 1 1 ON GARY HUA 6.0 1794 39 21
## 2 2 MI DAKSHESH DARURI 6.0 1553 63 58
## 3 3 MI ADITYA BAJAJ 6.0 1384 8 61
## 4 4 MI PATRICK H SCHILLING 5.5 1716 23 28
## 5 5 MI HANSHI ZUO 5.5 1655 45 37
## 6 6 OH HANSEN SONG 5.0 1686 34 29
## Round3 Round4 Round5 Round6 Round7
## 1 18 14 7 12 4
## 2 4 17 16 20 7
## 3 25 21 11 13 12
## 4 2 26 5 19 1
## 5 12 13 4 14 17
## 6 11 35 10 27 21
## Rows: 64
## Columns: 12
## $ Player_Number <chr> " 1", " 2", " 3", " 4", " 5", " 6", " …
## $ State <chr> " ON", " MI", " MI", " MI", " MI", " OH", " …
## $ Name <chr> "GARY HUA", "DAKSHESH DARURI", "ADITYA BAJAJ", "PATRICK …
## $ Total_Points <chr> "6.0", "6.0", "6.0", "5.5", "5.5", "5.0", "5.0", "5.0", …
## $ Pre_Rating <chr> "1794", "1553", "1384", "1716", "1655", "1686", "1649", …
## $ Round1 <dbl> 39, 63, 8, 23, 45, 34, 57, 3, 25, 16, 38, 42, 36, 54, 19…
## $ Round2 <dbl> 21, 58, 61, 28, 37, 29, 46, 32, 18, 19, 56, 33, 27, 44, …
## $ Round3 <dbl> 18, 4, 25, 2, 12, 11, 13, 14, 59, 55, 6, 5, 7, 8, 30, 39…
## $ Round4 <dbl> 14, 17, 21, 26, 13, 35, 11, 9, 8, 31, 7, 38, 5, 1, 22, 2…
## $ Round5 <dbl> 7, 16, 11, 5, 4, 10, 1, 47, 26, 6, 3, 1, 33, 27, 54, 36,…
## $ Round6 <dbl> 12, 20, 13, 19, 14, 27, 9, 28, 7, 25, 34, 3, 3, 5, 33, N…
## $ Round7 <dbl> 4, 7, 12, 1, 17, 21, 2, 19, 20, 18, 26, NA, 32, 31, 38, …
#changing the pre-rating column to an integer since it will do calculations with this column later on.
Chess_Players$Pre_Rating <- as.integer(Chess_Players$Pre_Rating)
#replacing any null values in the Round Columns with zero.
Chess_Players <- Chess_Players %>%
mutate(across(c(Round1, Round2, Round3, Round4, Round5, Round6,Round7),
~replace_na(., 0)))
glimpse(Chess_Players)
## Rows: 64
## Columns: 12
## $ Player_Number <chr> " 1", " 2", " 3", " 4", " 5", " 6", " …
## $ State <chr> " ON", " MI", " MI", " MI", " MI", " OH", " …
## $ Name <chr> "GARY HUA", "DAKSHESH DARURI", "ADITYA BAJAJ", "PATRICK …
## $ Total_Points <chr> "6.0", "6.0", "6.0", "5.5", "5.5", "5.0", "5.0", "5.0", …
## $ Pre_Rating <int> 1794, 1553, 1384, 1716, 1655, 1686, 1649, 1641, 1411, 13…
## $ Round1 <dbl> 39, 63, 8, 23, 45, 34, 57, 3, 25, 16, 38, 42, 36, 54, 19…
## $ Round2 <dbl> 21, 58, 61, 28, 37, 29, 46, 32, 18, 19, 56, 33, 27, 44, …
## $ Round3 <dbl> 18, 4, 25, 2, 12, 11, 13, 14, 59, 55, 6, 5, 7, 8, 30, 39…
## $ Round4 <dbl> 14, 17, 21, 26, 13, 35, 11, 9, 8, 31, 7, 38, 5, 1, 22, 2…
## $ Round5 <dbl> 7, 16, 11, 5, 4, 10, 1, 47, 26, 6, 3, 1, 33, 27, 54, 36,…
## $ Round6 <dbl> 12, 20, 13, 19, 14, 27, 9, 28, 7, 25, 34, 3, 3, 5, 33, 0…
## $ Round7 <dbl> 4, 7, 12, 1, 17, 21, 2, 19, 20, 18, 26, 0, 32, 31, 38, 0…
#pivoting the data frame so each round has it's own row instead of it's own column for each player.
Chess_Players_Pivot <- Chess_Players %>%
pivot_longer(cols = starts_with("Round"),
names_to = "Round_Column",
values_to = "Opp_Number") %>%
filter(!is.na(Opp_Number))
glimpse(Chess_Players_Pivot)
## Rows: 448
## Columns: 7
## $ Player_Number <chr> " 1", " 1", " 1", " 1", " 1", " 1", " …
## $ State <chr> " ON", " ON", " ON", " ON", " ON", " ON", " …
## $ Name <chr> "GARY HUA", "GARY HUA", "GARY HUA", "GARY HUA", "GARY HU…
## $ Total_Points <chr> "6.0", "6.0", "6.0", "6.0", "6.0", "6.0", "6.0", "6.0", …
## $ Pre_Rating <int> 1794, 1794, 1794, 1794, 1794, 1794, 1794, 1553, 1553, 15…
## $ Round_Column <chr> "Round1", "Round2", "Round3", "Round4", "Round5", "Round…
## $ Opp_Number <dbl> 39, 21, 18, 14, 7, 12, 4, 63, 58, 4, 17, 16, 20, 7, 8, 6…
#joining the original Chess_Players data frame to the Chess_Players_Pivot2 data frame
#both player_number columns must be a double type in both data frames.
Chess_Players$Player_Number <- as.double(Chess_Players$Player_Number)
Chess_Players_Pivot2 <- Chess_Players_Pivot %>%
left_join(
Chess_Players %>%
select(Player_Number, Pre_Rating),
by = c("Opp_Number" = "Player_Number")
)
#renaming the pre-ratings columns so we can tell them apart for the sake of our calculations
names(Chess_Players_Pivot2)
## [1] "Player_Number" "State" "Name" "Total_Points"
## [5] "Pre_Rating.x" "Round_Column" "Opp_Number" "Pre_Rating.y"
To find the five most underperforming players in the tournament (based on the difference between their actual and expected scores), we need to find the tournament expected scores for each player in the tournament and compare them with their actual tournament scores.
Let’s Calculate the expected score after each round for each player.
#first let's create a function to calculate the expected score
expected_score <- function(player_a, player_b) {
1 / (1 + 10 ^ ((player_b - player_a) / 400))
}
#now we can use our function and add a column to our data frame
Chess_Players_Pivot2 <- Chess_Players_Pivot2 %>%
mutate( Player_Expected_Score = expected_score(Player_PreRating, Opp_PreRating))
Let’s find the overall expected score for the tournament for each player.
PlayersTournamnet_ExpScores <- Chess_Players_Pivot2 %>%
group_by(Player_Number) %>%
summarise(
Tournament_ExpScore = sum(Player_Expected_Score, na.rm = TRUE)
)
head(PlayersTournamnet_ExpScores)
## # A tibble: 6 × 2
## Player_Number Tournament_ExpScore
## <chr> <dbl>
## 1 " 1" 5.16
## 2 " 2" 3.78
## 3 " 3" 1.95
## 4 " 4" 4.74
## 5 " 5" 4.38
## 6 " 6" 4.94
Let’s join with the original chess players table. We need the Player Number columns for the join so we need to check the data types of columns we’ll be using for the join and make sure they are of the same type. If the columns are different data types we will need to change them.
## Rows: 64
## Columns: 2
## $ Player_Number <chr> " 1", " 2", " 3", " 4", " 5", " …
## $ Tournament_ExpScore <dbl> 5.161574, 3.778825, 1.945088, 4.741764, 4.382484, …
Since the Player Number columns are of different types, let’s convert them and make absolutely sure the column names are the exact same by trimming any white space that might exist in the column names.
Chess_Players <- Chess_Players %>%
mutate(Player_Number = trimws(as.character(Player_Number)))
PlayersTournamnet_ExpScores <- PlayersTournamnet_ExpScores %>%
mutate(Player_Number = trimws(as.character(Player_Number)))
Okay, now that both our Player Number columns in both data frame are of the same type, we can go ahead and join the data frames.
Let’s subset and keep only the Total Points, Player Names and Expected Score columns in a new data frame.
Chess_Player_Scores <- Chess_Players %>%
select(-Player_Number,-State,-Pre_Rating,-Round1, -Round2, -Round3, -Round4, -Round5, -Round6,-Round7)
Let’s round each player’s expected scores.
Chess_Player_Scores$Tournament_ExpScore <- round(Chess_Player_Scores$Tournament_ExpScore,
digits = 1)
We will need to use the values in the Total Points column in our calculations of each player’s actual score minus their expected score, so let us change the data type to numeric.
To find the five players that most overperformed and underperformed we we will need to find the difference between each player’s actual score and their expected score.
Chess_Player_Scores <- Chess_Player_Scores %>%
mutate(ActScore_ExpMinus = Total_Points -Tournament_ExpScore )
Now, we can finally find and print the five players who most overperformed and underperformed based on the difference between their actual score and their expected score.
Let’s put the data frame in descending order so we can find the five most overperforming players:
Okay, great now let’s print only the five most overperforming players.
## Name Total_Points Tournament_ExpScore ActScore_ExpMinus
## 1 ADITYA BAJAJ 6.0 1.9 4.1
## 2 ANVIT RAO 5.0 1.9 3.1
## 3 ZACHARY JAMES HOUGHTON 4.5 1.4 3.1
## 4 JACOB ALEXANDER LAVALLEY 3.0 0.0 3.0
## 5 STEFANO LEE 5.0 2.3 2.7
Now, let’s put the data frame in ascending order so we can find the five most underperforming players :
Okay, great now let’s print only the five most underperforming players.
## Name Total_Points Tournament_ExpScore ActScore_ExpMinus
## 1 LOREN SCHWIEBERT 3.5 6.3 -2.8
## 2 GEORGE AVERY JONES 3.5 6.0 -2.5
## 3 JARED GE 3.0 5.0 -2.0
## 4 RISHI SHETTY 3.5 5.1 -1.6
## 5 JOSHUA DAVID LEE 3.5 5.0 -1.5