Project Overview

This project will process a text file of chess tournament results stored in a cross table format as shown below. The tournament results have some structure but will also require the use of regular expressions to extract key fields contained within.

Below is an example of the chess tournament cross table structure.

Chess Tournament Cross Table

The goal of this project is to generate a .CSV file (which could for example be imported into a SQL database) with the following information for all of the players contained in the chess tournament cross table:

For the first player list aboved, Gary Hua, the information to be captured, calculated and ultimately exported will be:

Player’s Name Player’s State Total Number of Points Player’s Pre-Rating Average Pre Chess Rating of Opponents
Gary Hua ON 6.0 1794 1605

Addtionally, the results of each player’s seven matches will be converted from wide to long format. This information will also be exported to a .CSV file, which could also be used as the basis for an import into a SQL database.

Setup

The R code for this project can be found on GitHub here under “Project 1 - Chess Tournament Results”. If executing the code, be sure to set your working directory to an appropriate directory in your environment. Three files will be created in this directory: 1.) tournamentinfo.txt 2.) Chess_Players.csv and 3.) Chess_Player_Matches.csv.

The following R packages are required for this project:

  1. reshape2
  2. dplyr
  3. downloader
  4. htmlTable
  5. stringr

Data

Loading and Understanding the File Format

The tournament results data used for the project can be found on GitHub below:

Chess Tournament Results File

Start by downloading the results file, called `tournamentinfo.txt’, from the GitHub location. This file will be downloaded to the current working directory.

# location of the population CSV file on GitHub

url <- "https://raw.githubusercontent.com/kfolsom98/DATA607/master/tournamentinfo.txt"

download_file <- "tournamentinfo.txt" # name of the file on the local machine after download

# the file will be downloaded to the working directory
downloader::download(url, download_file)

# read the file into a vector
conn <- file(download_file, open="r")
tournament.info.raw <- readLines(conn, warn = FALSE)

close(conn)

Let’s look at the tournament info file which has been loaded into a character vector.

##  [1] "-----------------------------------------------------------------------------------------" 
##  [2] " Pair | Player Name                     |Total|Round|Round|Round|Round|Round|Round|Round| "
##  [3] " Num  | USCF ID / Rtg (Pre->Post)       | Pts |  1  |  2  |  3  |  4  |  5  |  6  |  7  | "
##  [4] "-----------------------------------------------------------------------------------------" 
##  [5] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|" 
##  [6] "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |" 
##  [7] "-----------------------------------------------------------------------------------------" 
##  [8] "    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|" 
##  [9] "   MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |" 
## [10] "-----------------------------------------------------------------------------------------"

Looking at the file’s structure, we can see that the file has a header comprised of two rows:

Header

The file also follows a fixed width structure but uses two different row formats to record information for a single chess player. The formats are described below.

Record Format 1

In the first record format shown below, the information for Player Number, Player Name, Total Points, and the Round 1 - 7 results can be found. The results for each round capture the outcome (Win, Loss, Draw) and the Opponent’s Player Number.

Record Format 1

Record Format 2

In the second record format below, the Player’s State, USCF ID, Pre-Rating and Post-Rating values can be found. The remaining fields are not considered in this project.

Record Format 2

Finally, we see that file structure uses a hyphen character “-” as a formatting line between player entries. Note that the combination of record format 1 and 2 form a complete record for each chess player.

Processing the Data

To successfully process the tournament results file, several adjustments to the file contents will be made. These adjustments will ultimately make processing the data into an R data frame easier.

  1. Convert the “->” used before the Post-Rating value to “>>”
  2. Remove all lines only containing hyphens. This step will remove the formatting rows between player entries
  3. Remove the two rows comprising the header values
  4. Split entries into one of two vectors based on the record format.
tournament.info <- str_replace_all(str_replace_all(tournament.info.raw, "->", ">>"), "-{3,}", "")

tournament.info <- tournament.info[tournament.info  != ""]

tournament.info <- tournament.info[-(1:2)]

# determine the format by looking at the contents of the first field in the record

tournament.format1 <- tournament.info[str_detect(str_sub(tournament.info, 1, 6), "[0-9]")]

tournament.format2 <- tournament.info[str_detect(str_sub(tournament.info, 1, 6), "[A-Z]{2,2}")]

Vector containing record format 1 contents:

head(tournament.format1, 5)
## [1] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|"
## [2] "    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|"
## [3] "    3 | ADITYA BAJAJ                    |6.0  |L   8|W  61|W  25|W  21|W  11|W  13|W  12|"
## [4] "    4 | PATRICK H SCHILLING             |5.5  |W  23|D  28|W   2|W  26|D   5|W  19|D   1|"
## [5] "    5 | HANSHI ZUO                      |5.5  |W  45|W  37|D  12|D  13|D   4|W  14|W  17|"

Vector containing record format 2 contents:

head(tournament.format2, 5)
## [1] "   ON | 15445895 / R: 1794   >>1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |"
## [2] "   MI | 14598900 / R: 1553   >>1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |"
## [3] "   MI | 14959604 / R: 1384   >>1640     |N:2  |W    |B    |W    |B    |W    |B    |W    |"
## [4] "   MI | 12616049 / R: 1716   >>1744     |N:2  |W    |B    |W    |B    |W    |B    |B    |"
## [5] "   MI | 14601533 / R: 1655   >>1690     |N:2  |B    |W    |B    |W    |B    |W    |B    |"

With the two record types split into vectors, we can now parse the records’ fields while loading them into data frames. Loading the data frames format1_df and format2_df from the vectors leverages the fixed-width nature of the chess results. Regular expressions will be used to extract the values for uscf_id, pre_rating, and post_rating.

Step One - parse the first record format for each player load into a data frame called format1_df.

format1_df <- data.frame(player_num   = as.numeric(substr(tournament.format1, 1,  6)),
                         player_name  = str_trim(substr(tournament.format1, 8,  40), side="both"),
                         total_pts    = as.numeric(substr(tournament.format1, 42, 46)),
                         round1       = substr(tournament.format1, 48, 52),
                         round2       = substr(tournament.format1, 54, 58),
                         round3       = substr(tournament.format1, 60, 64),
                         round4       = substr(tournament.format1, 66, 70),
                         round5       = substr(tournament.format1, 72, 76),
                         round6       = substr(tournament.format1, 78, 82),
                         round7       = substr(tournament.format1, 84, 88), stringsAsFactors=FALSE)

Step Two - parse the second record format for each player into a data frame called format2_df.

format2_df <- data.frame(player_state  = str_trim(substr(tournament.format2, 1,  6), side="both"),
                         uscf_id       = str_extract(substr(tournament.format2, 8,  40), "\\d+"),
                         pre_rating    = as.numeric(str_extract(substr(tournament.format2, 8,  40), "(?<=R: ).\\d+(?=)")),
                         post_rating   = as.numeric(str_extract(substr(tournament.format2, 8,  40), "(?<=>>).\\d+(?=)")),
                         stringsAsFactors=FALSE)

Combine the two data frames to create a single, consolidated record for each player.

results_df <- cbind(format1_df, format2_df)
Tournament Results – Consolidated Record Layout
Player Num Player Name Total Points Round 1 Round 2 Round 3 Round 4 Round 5 Round 6 Round 7 Player State USCF ID Pre-Rating Post-Rating
1 1 GARY HUA 6 W 39 W 21 W 18 W 14 W 7 D 12 D 4 ON 15445895 1794 1817
2 2 DAKSHESH DARURI 6 W 63 W 58 L 4 W 17 W 16 W 20 W 7 MI 14598900 1553 1663
3 3 ADITYA BAJAJ 6 L 8 W 61 W 25 W 21 W 11 W 13 W 12 MI 14959604 1384 1640
4 4 PATRICK H SCHILLING 5.5 W 23 D 28 W 2 W 26 D 5 W 19 D 1 MI 12616049 1716 1744
5 5 HANSHI ZUO 5.5 W 45 W 37 D 12 D 13 D 4 W 14 W 17 MI 14601533 1655 1690
6 6 HANSEN SONG 5 W 34 D 29 L 11 W 35 D 10 W 27 W 21 OH 15055204 1686 1687
7 7 GARY DEE SWATHELL 5 W 57 W 46 W 13 W 11 L 1 W 9 L 2 MI 11146376 1649 1673
8 8 EZEKIEL HOUGHTON 5 W 3 W 32 L 14 L 9 W 47 W 28 W 19 MI 15142253 1641 1657
9 9 STEFANO LEE 5 W 25 L 18 W 59 W 8 W 26 L 7 W 20 ON 14954524 1411 1564
10 10 ANVIT RAO 5 D 16 L 19 W 55 W 31 D 6 W 25 W 18 MI 14150362 1365 1544

Processing the Final Output

Creating the final output will focus on calculating the value of each player’s Average Pre Chess Rating of Opponents played. To do this, break the chess results data into a dataset only containing player information, excluding any information pertaining to matches played. Matches played in rounds 1 - 7 will be converted to a separate more normalized or tidy dataset.

Chess Player Dataset

player_df <- select(results_df, player_num:total_pts, player_state:post_rating)
Chess Player Information Dataset
Player Num Player Name   Total
Points  
  Player
State  
USCF ID   Pre-Rating     Post-Rating  
1 1 GARY HUA 6 ON 15445895 1794 1817
2 2 DAKSHESH DARURI 6 MI 14598900 1553 1663
3 3 ADITYA BAJAJ 6 MI 14959604 1384 1640
4 4 PATRICK H SCHILLING 5.5 MI 12616049 1716 1744
5 5 HANSHI ZUO 5.5 MI 14601533 1655 1690
6 6 HANSEN SONG 5 OH 15055204 1686 1687
7 7 GARY DEE SWATHELL 5 MI 11146376 1649 1673
8 8 EZEKIEL HOUGHTON 5 MI 15142253 1641 1657
9 9 STEFANO LEE 5 ON 14954524 1411 1564
10 10 ANVIT RAO 5 MI 14150362 1365 1544

Chess Player Matches Normalized Dataset

The logic below creates a normalized dataset of the player’s matches, where the match results are changed from wide to long format. The data captured in the round 1 - 7 fields will be split to indicate the outcome of the round (Win, Loss, Draw, etc.) and the opponent’s player number.

player_matches_df <-  
    results_df %>% select(player_num, round1:round7) %>%
    melt(id.var=c("player_num"), value.name="outcome_opp")  %>% 
    mutate( round        = as.numeric(str_replace(variable, "round", "")),
            outcome      = str_extract(outcome_opp, "^\\w+"),
            opponent_num = as.numeric(str_extract(outcome_opp, "\\d+$"))) %>%
    select(-c(variable, outcome_opp)) %>%
    inner_join(select(player_df, player_num, pre_rating), c("opponent_num" = "player_num")) %>%
    select(player_num, round, outcome, opponent_num, pre_rating) %>%
    arrange(player_num, round)
    
# rename pre_rating to opponent_pre_rating
names(player_matches_df)[names(player_matches_df) == "pre_rating"] <- "opponent_pre_rating"
Example Chess Player Matches Normalized Dataset
Player
Num
  Round     Outcome     Opponent’s
Player Num    
  Opponent’s
Pre-Chess Rating    
1 1 1 W 39 1436
2 1 2 W 21 1563
3 1 3 W 18 1600
4 1 4 W 14 1610
5 1 5 W 7 1649
6 1 6 D 12 1663
7 1 7 D 4 1716
8 2 1 W 63 1175
9 2 2 W 58 917
10 2 3 L 4 1716

Aggregate the normalized match dataset to create the value for opponents_avg_pre_rating. Note: In this calculation, only matches played with a Win, Loss, or Draw outcome are considered.

player_df <-  
    player_matches_df %>%
    group_by(player_num) %>% summarise(opponents_avg_pre_rating = round(mean(opponent_pre_rating))) %>%
    inner_join(player_df, by="player_num") %>%
    select(player_num, player_name, player_state, total_pts,  uscf_id, pre_rating, post_rating, opponents_avg_pre_rating)
Example Final Player Result
Player
Num
Player Name Player
State
Total
Points
USCF
ID
  Pre-Rating     Post-Rating   Opponents’ Average
Pre-Chess Rating
1 1 GARY HUA ON 6 15445895 1794 1817 1605
2 2 DAKSHESH DARURI MI 6 14598900 1553 1663 1469
3 3 ADITYA BAJAJ MI 6 14959604 1384 1640 1564
4 4 PATRICK H SCHILLING MI 5.5 12616049 1716 1744 1574
5 5 HANSHI ZUO MI 5.5 14601533 1655 1690 1501
6 6 HANSEN SONG OH 5 15055204 1686 1687 1519
7 7 GARY DEE SWATHELL MI 5 11146376 1649 1673 1372
8 8 EZEKIEL HOUGHTON MI 5 15142253 1641 1657 1468
9 9 STEFANO LEE ON 5 14954524 1411 1564 1523
10 10 ANVIT RAO MI 5 14150362 1365 1544 1554

Final Output

Create the output .CSV files: Chess_Players.csv and Chess_Player_Matches.csv.

The output files can be found on GitHub:

write.csv(player_df, "Chess_Players.csv", row.names=FALSE)
write.csv(player_matches_df, "Chess_Player_Matches.csv", row.names=FALSE)