This project will process a text file of chess tournament results stored in a cross table format as shown below. The tournament results have some structure but will also require the use of regular expressions to extract key fields contained within.
Below is an example of the chess tournament cross table structure.
The goal of this project is to generate a .CSV file (which could for example be imported into a SQL database) with the following information for all of the players contained in the chess tournament cross table:
For the first player list aboved, Gary Hua, the information to be captured, calculated and ultimately exported will be:
Player’s Name | Player’s State | Total Number of Points | Player’s Pre-Rating | Average Pre Chess Rating of Opponents |
---|---|---|---|---|
Gary Hua | ON | 6.0 | 1794 | 1605 |
Addtionally, the results of each player’s seven matches will be converted from wide to long format. This information will also be exported to a .CSV file, which could also be used as the basis for an import into a SQL database.
The R code for this project can be found on GitHub here under “Project 1 - Chess Tournament Results”. If executing the code, be sure to set your working directory to an appropriate directory in your environment. Three files will be created in this directory: 1.) tournamentinfo.txt 2.) Chess_Players.csv and 3.) Chess_Player_Matches.csv.
The following R packages are required for this project:
The tournament results data used for the project can be found on GitHub below:
Start by downloading the results file, called `tournamentinfo.txt’, from the GitHub location. This file will be downloaded to the current working directory.
# location of the population CSV file on GitHub
url <- "https://raw.githubusercontent.com/kfolsom98/DATA607/master/tournamentinfo.txt"
download_file <- "tournamentinfo.txt" # name of the file on the local machine after download
# the file will be downloaded to the working directory
downloader::download(url, download_file)
# read the file into a vector
conn <- file(download_file, open="r")
tournament.info.raw <- readLines(conn, warn = FALSE)
close(conn)
Let’s look at the tournament info file which has been loaded into a character vector.
## [1] "-----------------------------------------------------------------------------------------"
## [2] " Pair | Player Name |Total|Round|Round|Round|Round|Round|Round|Round| "
## [3] " Num | USCF ID / Rtg (Pre->Post) | Pts | 1 | 2 | 3 | 4 | 5 | 6 | 7 | "
## [4] "-----------------------------------------------------------------------------------------"
## [5] " 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|"
## [6] " ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |"
## [7] "-----------------------------------------------------------------------------------------"
## [8] " 2 | DAKSHESH DARURI |6.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7|"
## [9] " MI | 14598900 / R: 1553 ->1663 |N:2 |B |W |B |W |B |W |B |"
## [10] "-----------------------------------------------------------------------------------------"
Looking at the file’s structure, we can see that the file has a header comprised of two rows:
The file also follows a fixed width structure but uses two different row formats to record information for a single chess player. The formats are described below.
In the first record format shown below, the information for Player Number, Player Name, Total Points, and the Round 1 - 7 results can be found. The results for each round capture the outcome (Win, Loss, Draw) and the Opponent’s Player Number.
In the second record format below, the Player’s State, USCF ID, Pre-Rating and Post-Rating values can be found. The remaining fields are not considered in this project.
Finally, we see that file structure uses a hyphen character “-” as a formatting line between player entries. Note that the combination of record format 1 and 2 form a complete record for each chess player.
To successfully process the tournament results file, several adjustments to the file contents will be made. These adjustments will ultimately make processing the data into an R data frame easier.
tournament.info <- str_replace_all(str_replace_all(tournament.info.raw, "->", ">>"), "-{3,}", "")
tournament.info <- tournament.info[tournament.info != ""]
tournament.info <- tournament.info[-(1:2)]
# determine the format by looking at the contents of the first field in the record
tournament.format1 <- tournament.info[str_detect(str_sub(tournament.info, 1, 6), "[0-9]")]
tournament.format2 <- tournament.info[str_detect(str_sub(tournament.info, 1, 6), "[A-Z]{2,2}")]
Vector containing record format 1 contents:
head(tournament.format1, 5)
## [1] " 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|"
## [2] " 2 | DAKSHESH DARURI |6.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7|"
## [3] " 3 | ADITYA BAJAJ |6.0 |L 8|W 61|W 25|W 21|W 11|W 13|W 12|"
## [4] " 4 | PATRICK H SCHILLING |5.5 |W 23|D 28|W 2|W 26|D 5|W 19|D 1|"
## [5] " 5 | HANSHI ZUO |5.5 |W 45|W 37|D 12|D 13|D 4|W 14|W 17|"
Vector containing record format 2 contents:
head(tournament.format2, 5)
## [1] " ON | 15445895 / R: 1794 >>1817 |N:2 |W |B |W |B |W |B |W |"
## [2] " MI | 14598900 / R: 1553 >>1663 |N:2 |B |W |B |W |B |W |B |"
## [3] " MI | 14959604 / R: 1384 >>1640 |N:2 |W |B |W |B |W |B |W |"
## [4] " MI | 12616049 / R: 1716 >>1744 |N:2 |W |B |W |B |W |B |B |"
## [5] " MI | 14601533 / R: 1655 >>1690 |N:2 |B |W |B |W |B |W |B |"
With the two record types split into vectors, we can now parse the records’ fields while loading them into data frames. Loading the data frames format1_df
and format2_df
from the vectors leverages the fixed-width nature of the chess results. Regular expressions will be used to extract the values for uscf_id
, pre_rating
, and post_rating
.
Step One - parse the first record format for each player load into a data frame called format1_df
.
format1_df <- data.frame(player_num = as.numeric(substr(tournament.format1, 1, 6)),
player_name = str_trim(substr(tournament.format1, 8, 40), side="both"),
total_pts = as.numeric(substr(tournament.format1, 42, 46)),
round1 = substr(tournament.format1, 48, 52),
round2 = substr(tournament.format1, 54, 58),
round3 = substr(tournament.format1, 60, 64),
round4 = substr(tournament.format1, 66, 70),
round5 = substr(tournament.format1, 72, 76),
round6 = substr(tournament.format1, 78, 82),
round7 = substr(tournament.format1, 84, 88), stringsAsFactors=FALSE)
Step Two - parse the second record format for each player into a data frame called format2_df
.
format2_df <- data.frame(player_state = str_trim(substr(tournament.format2, 1, 6), side="both"),
uscf_id = str_extract(substr(tournament.format2, 8, 40), "\\d+"),
pre_rating = as.numeric(str_extract(substr(tournament.format2, 8, 40), "(?<=R: ).\\d+(?=)")),
post_rating = as.numeric(str_extract(substr(tournament.format2, 8, 40), "(?<=>>).\\d+(?=)")),
stringsAsFactors=FALSE)
Combine the two data frames to create a single, consolidated record for each player.
results_df <- cbind(format1_df, format2_df)
Tournament Results – Consolidated Record Layout | ||||||||||||||
Player Num | Player Name | Total Points | Round 1 | Round 2 | Round 3 | Round 4 | Round 5 | Round 6 | Round 7 | Player State | USCF ID | Pre-Rating | Post-Rating | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | GARY HUA | 6 | W 39 | W 21 | W 18 | W 14 | W 7 | D 12 | D 4 | ON | 15445895 | 1794 | 1817 |
2 | 2 | DAKSHESH DARURI | 6 | W 63 | W 58 | L 4 | W 17 | W 16 | W 20 | W 7 | MI | 14598900 | 1553 | 1663 |
3 | 3 | ADITYA BAJAJ | 6 | L 8 | W 61 | W 25 | W 21 | W 11 | W 13 | W 12 | MI | 14959604 | 1384 | 1640 |
4 | 4 | PATRICK H SCHILLING | 5.5 | W 23 | D 28 | W 2 | W 26 | D 5 | W 19 | D 1 | MI | 12616049 | 1716 | 1744 |
5 | 5 | HANSHI ZUO | 5.5 | W 45 | W 37 | D 12 | D 13 | D 4 | W 14 | W 17 | MI | 14601533 | 1655 | 1690 |
6 | 6 | HANSEN SONG | 5 | W 34 | D 29 | L 11 | W 35 | D 10 | W 27 | W 21 | OH | 15055204 | 1686 | 1687 |
7 | 7 | GARY DEE SWATHELL | 5 | W 57 | W 46 | W 13 | W 11 | L 1 | W 9 | L 2 | MI | 11146376 | 1649 | 1673 |
8 | 8 | EZEKIEL HOUGHTON | 5 | W 3 | W 32 | L 14 | L 9 | W 47 | W 28 | W 19 | MI | 15142253 | 1641 | 1657 |
9 | 9 | STEFANO LEE | 5 | W 25 | L 18 | W 59 | W 8 | W 26 | L 7 | W 20 | ON | 14954524 | 1411 | 1564 |
10 | 10 | ANVIT RAO | 5 | D 16 | L 19 | W 55 | W 31 | D 6 | W 25 | W 18 | MI | 14150362 | 1365 | 1544 |
Creating the final output will focus on calculating the value of each player’s Average Pre Chess Rating of Opponents played. To do this, break the chess results data into a dataset only containing player information, excluding any information pertaining to matches played. Matches played in rounds 1 - 7 will be converted to a separate more normalized or tidy dataset.
Chess Player Dataset
player_df <- select(results_df, player_num:total_pts, player_state:post_rating)
Chess Player Information Dataset | |||||||
Player Num | Player Name | Total Points | PlayerState | USCF ID | Pre-Rating | Post-Rating | |
---|---|---|---|---|---|---|---|
1 | 1 | GARY HUA | 6 | ON | 15445895 | 1794 | 1817 |
2 | 2 | DAKSHESH DARURI | 6 | MI | 14598900 | 1553 | 1663 |
3 | 3 | ADITYA BAJAJ | 6 | MI | 14959604 | 1384 | 1640 |
4 | 4 | PATRICK H SCHILLING | 5.5 | MI | 12616049 | 1716 | 1744 |
5 | 5 | HANSHI ZUO | 5.5 | MI | 14601533 | 1655 | 1690 |
6 | 6 | HANSEN SONG | 5 | OH | 15055204 | 1686 | 1687 |
7 | 7 | GARY DEE SWATHELL | 5 | MI | 11146376 | 1649 | 1673 |
8 | 8 | EZEKIEL HOUGHTON | 5 | MI | 15142253 | 1641 | 1657 |
9 | 9 | STEFANO LEE | 5 | ON | 14954524 | 1411 | 1564 |
10 | 10 | ANVIT RAO | 5 | MI | 14150362 | 1365 | 1544 |
Chess Player Matches Normalized Dataset
The logic below creates a normalized dataset of the player’s matches, where the match results are changed from wide to long format. The data captured in the round 1 - 7 fields will be split to indicate the outcome of the round (Win, Loss, Draw, etc.) and the opponent’s player number.
player_matches_df <-
results_df %>% select(player_num, round1:round7) %>%
melt(id.var=c("player_num"), value.name="outcome_opp") %>%
mutate( round = as.numeric(str_replace(variable, "round", "")),
outcome = str_extract(outcome_opp, "^\\w+"),
opponent_num = as.numeric(str_extract(outcome_opp, "\\d+$"))) %>%
select(-c(variable, outcome_opp)) %>%
inner_join(select(player_df, player_num, pre_rating), c("opponent_num" = "player_num")) %>%
select(player_num, round, outcome, opponent_num, pre_rating) %>%
arrange(player_num, round)
# rename pre_rating to opponent_pre_rating
names(player_matches_df)[names(player_matches_df) == "pre_rating"] <- "opponent_pre_rating"
Example Chess Player Matches Normalized Dataset | |||||
PlayerNum | Round | Outcome | Opponent’sPlayer Num | Opponent’sPre-Chess Rating | |
---|---|---|---|---|---|
1 | 1 | 1 | W | 39 | 1436 |
2 | 1 | 2 | W | 21 | 1563 |
3 | 1 | 3 | W | 18 | 1600 |
4 | 1 | 4 | W | 14 | 1610 |
5 | 1 | 5 | W | 7 | 1649 |
6 | 1 | 6 | D | 12 | 1663 |
7 | 1 | 7 | D | 4 | 1716 |
8 | 2 | 1 | W | 63 | 1175 |
9 | 2 | 2 | W | 58 | 917 |
10 | 2 | 3 | L | 4 | 1716 |
Aggregate the normalized match dataset to create the value for opponents_avg_pre_rating
. Note: In this calculation, only matches played with a Win, Loss, or Draw outcome are considered.
player_df <-
player_matches_df %>%
group_by(player_num) %>% summarise(opponents_avg_pre_rating = round(mean(opponent_pre_rating))) %>%
inner_join(player_df, by="player_num") %>%
select(player_num, player_name, player_state, total_pts, uscf_id, pre_rating, post_rating, opponents_avg_pre_rating)
Example Final Player Result | ||||||||
Player Num | Player Name | Player State | Total Points | USCF ID | Pre-Rating | Post-Rating | Opponents’ Average Pre-Chess Rating | |
---|---|---|---|---|---|---|---|---|
1 | 1 | GARY HUA | ON | 6 | 15445895 | 1794 | 1817 | 1605 |
2 | 2 | DAKSHESH DARURI | MI | 6 | 14598900 | 1553 | 1663 | 1469 |
3 | 3 | ADITYA BAJAJ | MI | 6 | 14959604 | 1384 | 1640 | 1564 |
4 | 4 | PATRICK H SCHILLING | MI | 5.5 | 12616049 | 1716 | 1744 | 1574 |
5 | 5 | HANSHI ZUO | MI | 5.5 | 14601533 | 1655 | 1690 | 1501 |
6 | 6 | HANSEN SONG | OH | 5 | 15055204 | 1686 | 1687 | 1519 |
7 | 7 | GARY DEE SWATHELL | MI | 5 | 11146376 | 1649 | 1673 | 1372 |
8 | 8 | EZEKIEL HOUGHTON | MI | 5 | 15142253 | 1641 | 1657 | 1468 |
9 | 9 | STEFANO LEE | ON | 5 | 14954524 | 1411 | 1564 | 1523 |
10 | 10 | ANVIT RAO | MI | 5 | 14150362 | 1365 | 1544 | 1554 |
Create the output .CSV files: Chess_Players.csv and Chess_Player_Matches.csv.
The output files can be found on GitHub:
write.csv(player_df, "Chess_Players.csv", row.names=FALSE)
write.csv(player_matches_df, "Chess_Player_Matches.csv", row.names=FALSE)