This project is near and dear to me because I worked as a chess tournament director for many years. Chess in the Schools (CIS) is a non-profit organization focused on bringing chess to low-income and under-represented schools. They offer a variety of programs to teach and reinforce chess, including but not limited to summer chess camps, after-school programs, weekly tournaments, national tournament entry and employment as an assistant tournament director. Through my employment with them, I managed to play many chess games, oversee tournaments and learn how to read result sheets.
I’ll be using a dataset from a recent tournament held by Chess in the Schools. The tournament took place on January 22, 2022, and we will be looking at the Open section, where the highest-rated players compete. It is primarily for students rated over 1500, although there are exceptions from time to time. If a coach is aware of a student’s capabilities, they sometimes expose the student to stronger players by placing them in the Open section. It’s also a sure-fire way to increase your rating if you win or draw against someone rated higher than you.
library('rvest')
library('tidyverse')
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.6 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.4 ✓ stringr 1.4.0
## ✓ readr 2.1.1 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x readr::guess_encoding() masks rvest::guess_encoding()
## x dplyr::lag() masks stats::lag()
Unlike the assignment, the information isn’t found in a .txt file, so we need to do some mild web-scraping to pull the dataset from the Chess in the Schools website into a dataframe. We begin by reading the page with read_html, and then create the dataframe using html_element and html_table.
link <- 'https://chessintheschools.org/tournament/2022-1-22nyc-council-district-5-chess-challenge/'
link_read <- read_html(link)
data_table <- link_read %>%
  html_element('table') %>%
  html_table(header = TRUE)
Below is a quick view of this dataframe before any more manipulation. Everything seems to have imported nicely.
data_table
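As a small offline illustration of the two rvest calls used above, the sketch below parses an inline HTML table instead of the live page. The rows are invented for the example, and minimal_html() is only used here to build a document from a string; it is not part of the original workflow.

```r
library(rvest)

# Hypothetical stand-in for the tournament page (invented rows).
page <- minimal_html('
  <table>
    <tr><th>#</th><th>Name</th><th>Rtng</th><th>Rd 1</th></tr>
    <tr><td>1</td><td>Player A</td><td>1800</td><td>W 2</td></tr>
    <tr><td>2</td><td>Player B</td><td>unr.</td><td>L 1</td></tr>
  </table>
')

# html_element() picks the first <table>; html_table() converts it to a tibble,
# taking the first row as column names because header = TRUE.
toy_table <- page %>%
  html_element("table") %>%
  html_table(header = TRUE)

toy_table
```

Because one rating is the text unr., the Rtng column comes back as character rather than numeric, which is why the later steps convert ratings explicitly.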
If there is an odd number of players in a tournament, one person each round is assigned a Full Point Bye. It’s meant to compensate that player for not being given a match. Additionally, there is something called a Half Point Bye, which is very similar to the Full Point Bye but worth 0.5 points instead. This is usually given when a player is known in advance to be absent for a round, but not for the remainder of the tournament. Traditionally, you’ll see more Half Point Byes at the beginning or end of tournaments.
Below, we are replacing the notation with the full breakdown for clarity.
data_table[data_table == "H---"] <- "Half Point Bye"
data_table[data_table == "B---"] <- "Full Point Bye"
data_table
Next, we want to start extracting information from the round columns. We want to match each round’s opponent to that opponent’s rating. To do this, we must remove all instances of L, W, or D. Since Full Point Bye and Half Point Bye are important pieces of information, we will keep them for now. The regex pattern below selects either the numeric value in the string or, failing that, any string of 7 or more characters.
# str_extract() returns the first match as a plain character vector
data_rounds <- data_table %>%
  mutate(across(`Rd 1`:`Rd 3`, ~ str_extract(.x, "[1-9][0-9]*|0|.{7,}")))
data_rounds
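To make the pattern’s behaviour concrete, here is a minimal sketch that applies it to a handful of made-up cells with str_extract, which returns the first match as a plain character vector. The cell formats are assumptions; the real table’s formatting may differ slightly.

```r
library(stringr)

# Hypothetical round cells (invented for illustration).
cells <- c("W 12", "L 3", "D 10", "Full Point Bye", "Half Point Bye")

# Same pattern as in the chunk above: a number without leading zeros,
# a lone 0, or any run of seven or more characters (the bye labels).
pattern <- "[1-9][0-9]*|0|.{7,}"

extracted <- str_extract(cells, pattern)
extracted
# "12" "3" "10" "Full Point Bye" "Half Point Bye"
```

The short result cells never reach seven characters, so only their digits match, while the bye labels contain no digits and are kept whole by the last alternative.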
Now that we have extracted the players’ index numbers, we can link them to their ratings. To retain as much information as possible, I created three new columns to store the opponents’ ratings.
data_rounds <- data_rounds %>%
  add_column(`Rd 1 Opponent` = NA,
             `Rd 2 Opponent` = NA,
             `Rd 3 Opponent` = NA)
I then have a for loop that iterates through the index and substitutes it with the appropriate rating. I then repeat this for all three rounds.
for (row in data_rounds$`#`) {
  data_rounds$`Rd 1 Opponent`[row] <- data_table$`Rtng`[as.numeric(data_rounds$`Rd 1`[row])]
}
## Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion
for (row in data_rounds$`#`) {
  data_rounds$`Rd 2 Opponent`[row] <- data_table$`Rtng`[as.numeric(data_rounds$`Rd 2`[row])]
}
## Warning: NAs introduced by coercion
for (row in data_rounds$`#`) {
  data_rounds$`Rd 3 Opponent`[row] <- data_table$`Rtng`[as.numeric(data_rounds$`Rd 3`[row])]
}
## Warning: NAs introduced by coercion
You may see the warning “NAs introduced by coercion” above. That is expected: the Bye entries cannot be converted to numbers, and we will use the resulting NAs later. Below is another glimpse at the dataframe to see the changes made so far.
data_rounds
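The three loops above all perform the same lookup, so they could be folded into one helper. The following is only a sketch on invented toy data: lookup_opponent_rating and the toy column names are hypothetical, and it assumes, as the loops do, that a player’s number equals their row position.

```r
# Hypothetical helper: map one round's opponent numbers to ratings.
lookup_opponent_rating <- function(opponent, ratings) {
  # Byes are text, so as.numeric() turns them into NA (warning suppressed here).
  idx <- suppressWarnings(as.numeric(opponent))
  # Indexing with NA yields NA, so byes stay NA in the result.
  ratings[idx]
}

# Invented toy data: row position doubles as the player's number.
toy <- data.frame(
  Rtng = c("1800", "1650", "unr.", "1500"),
  Rd1  = c("2", "1", "Full Point Bye", "3"),
  Rd2  = c("3", "4", "1", "2"),
  stringsAsFactors = FALSE
)

# Apply the helper to every round column and store the results
# in new "<round> Opponent" columns.
rounds <- c("Rd1", "Rd2")
toy[paste(rounds, "Opponent")] <- lapply(toy[rounds], lookup_opponent_rating,
                                         ratings = toy$Rtng)
toy
```

Indexing the ratings vector with a whole column at once replaces the row-by-row loop, and unrated opponents keep their unr. label so they can still be counted as games.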
In order to calculate the average rating of opponents, we first need to calculate the number of games played. We are not counting any form of Bye as a game played, so Byes are excluded from the calculation. For that reason, in the chunk above we converted the opponent numbers to numerics, which turned every Bye into NA. This was intentional, because we can now count how many records aren’t NA, giving us an accurate measure of games played. The count is taken over the three opponent columns that we created above.
# Select the opponent columns by name rather than by position
data_rounds$Num_Games_Played <- rowSums(!is.na(data_rounds[, c('Rd 1 Opponent', 'Rd 2 Opponent', 'Rd 3 Opponent')]))
data_rounds
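As a quick check of the counting logic, the sketch below runs it on three invented players. The column names and values are made up; NA stands for a Bye and unr. for an unrated opponent.

```r
# Invented opponent ratings: NA marks a Bye, "unr." an unrated opponent.
opponents <- data.frame(
  rd1 = c("1650", "unr.", NA),
  rd2 = c(NA, "1500", NA),
  rd3 = c("1500", "1800", "1720"),
  stringsAsFactors = FALSE
)

# Count the non-NA entries in each row: every real game, rated or not.
games_played <- rowSums(!is.na(opponents))
games_played
```

The first player has one Bye and two games, the second has three games including one against an unrated opponent, and the third played only once.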
Unrated players have made their way into this dataset. All of their games should still be counted, so I didn’t change their value to NA earlier, because that would have affected the game count. Now that we have a count of the games played, we can change their rating to NA. When the ratings are summed with NA values removed, an NA contributes the same as a 0 would, so the change does not affect the calculation. It also ensures that our columns contain only numeric information.
data_rounds[data_rounds == "unr."] <- NA
We will transform the columns once again and assign the result back to the dataframe, because right now the records are of type character, which isn’t suitable for the function we will use to calculate the average.
data_rounds <- transform(data_rounds,
                         `Rd 1 Opponent` = as.numeric(`Rd 1 Opponent`),
                         `Rd 2 Opponent` = as.numeric(`Rd 2 Opponent`),
                         `Rd 3 Opponent` = as.numeric(`Rd 3 Opponent`))
Now that the data is in the format that we want, we can sum the opponent ratings across the three rounds and then divide by the number of games played. This gives us an accurate average regardless of Byes. Note that transform() returns a data.frame with syntactic column names, so Rd 1 Opponent is now Rd.1.Opponent.
data_rounds$Average_Opponent_Rating <- rowSums(data_rounds[, c('Rd.1.Opponent', 'Rd.2.Opponent', 'Rd.3.Opponent')], na.rm = TRUE) / data_rounds$Num_Games_Played
data_rounds
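A short worked example of this arithmetic, on invented numbers, may help. The sketch assumes the game counts were taken before unrated opponents were set to NA, as in the steps above; all names are hypothetical.

```r
# Invented opponent ratings after conversion to numeric.
# Player 1: 1650, a Bye, 1500. Player 2: an unrated opponent, 1500, 1800.
opponents <- data.frame(
  rd1 = c(1650, NA),
  rd2 = c(NA, 1500),
  rd3 = c(1500, 1800)
)

# Game counts taken earlier: the Bye is excluded, the unrated game is included.
games_played <- c(2, 3)

# Sum the available ratings and divide by the number of games played.
average_opponent <- rowSums(opponents, na.rm = TRUE) / games_played
average_opponent
# 1575 1100
```

For the first player the Bye is ignored entirely, giving (1650 + 1500) / 2 = 1575. For the second, the unrated game adds nothing to the sum but still counts as a game, giving (1500 + 1800) / 3 = 1100; whether that is the desired treatment of unrated opponents is a judgment call.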
chess_data <- data_rounds[, c('Name', 'St', 'Tot', 'Rtng', 'Average_Opponent_Rating')]
colnames(chess_data) <- c("Player's Name", "Player State", "Total Number of Points", "Player's Pre-Rating", "Average Pre Chess Rating of Opponents")
chess_data
# row.names = FALSE keeps R's row numbers out of the CSV
write.csv(chess_data, '/Users/carlos/Desktop/ChessTournament.csv', row.names = FALSE)
Next steps would be to write a function or nested loop for the section where I replace the opponent number with the opponent’s rating. That could look significantly cleaner. Additionally, I’d like to cover the other sections of the tournament and practice more web-scraping. This would mean we could show the standings for Open, Reserve, Novice and Rookie. It would be nice to see which school has the greatest impact across the whole tournament.