DATA 607 - Project 1 Chess Tournament Results

Player’s Name	Player’s State	Total Number of Points	Player’s Pre-Rating	Average Pre Chess Rating of Opponents
Gary Hua	ON	6.0	1794	1605

Loading and Understanding the File Format

The tournament results data used for the project can be found on GitHub below:

Start by downloading the results file, called `tournamentinfo.txt’, from the GitHub location. This file will be downloaded to the current working directory.

# location of the population CSV file on GitHub

url <- "https://raw.githubusercontent.com/kfolsom98/DATA607/master/tournamentinfo.txt"

download_file <- "tournamentinfo.txt" # name of the file on the local machine after download

# the file will be downloaded to the working directory
downloader::download(url, download_file)

# read the file into a vector
conn <- file(download_file, open="r")
tournament.info.raw <- readLines(conn, warn = FALSE)

close(conn)

Let’s look at the tournament info file which has been loaded into a character vector.

##  [1] "-----------------------------------------------------------------------------------------" 
##  [2] " Pair | Player Name                     |Total|Round|Round|Round|Round|Round|Round|Round| "
##  [3] " Num  | USCF ID / Rtg (Pre->Post)       | Pts |  1  |  2  |  3  |  4  |  5  |  6  |  7  | "
##  [4] "-----------------------------------------------------------------------------------------" 
##  [5] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|" 
##  [6] "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |" 
##  [7] "-----------------------------------------------------------------------------------------" 
##  [8] "    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|" 
##  [9] "   MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |" 
## [10] "-----------------------------------------------------------------------------------------"

Looking at the file’s structure, we can see that the file has a header comprised of two rows:

The file also follows a fixed width structure but uses two different row formats to record information for a single chess player. The formats are described below.

Record Format 1

In the first record format shown below, the information for Player Number, Player Name, Total Points, and the Round 1 - 7 results can be found. The results for each round capture the outcome (Win, Loss, Draw) and the Opponent’s Player Number.

Record Format 2

In the second record format below, the Player’s State, USCF ID, Pre-Rating and Post-Rating values can be found. The remaining fields are not considered in this project.

Finally, we see that file structure uses a hyphen character “-” as a formatting line between player entries. Note that the combination of record format 1 and 2 form a complete record for each chess player.

Processing the Data

To successfully process the tournament results file, several adjustments to the file contents will be made. These adjustments will ultimately make processing the data into an R data frame easier.

Convert the “->” used before the Post-Rating value to “>>”
Remove all lines only containing hyphens. This step will remove the formatting rows between player entries
Remove the two rows comprising the header values
Split entries into one of two vectors based on the record format.

tournament.info <- str_replace_all(str_replace_all(tournament.info.raw, "->", ">>"), "-{3,}", "")

tournament.info <- tournament.info[tournament.info  != ""]

tournament.info <- tournament.info[-(1:2)]

# determine the format by looking at the contents of the first field in the record

tournament.format1 <- tournament.info[str_detect(str_sub(tournament.info, 1, 6), "[0-9]")]

tournament.format2 <- tournament.info[str_detect(str_sub(tournament.info, 1, 6), "[A-Z]{2,2}")]

Vector containing record format 1 contents:

head(tournament.format1, 5)

## [1] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|"
## [2] "    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|"
## [3] "    3 | ADITYA BAJAJ                    |6.0  |L   8|W  61|W  25|W  21|W  11|W  13|W  12|"
## [4] "    4 | PATRICK H SCHILLING             |5.5  |W  23|D  28|W   2|W  26|D   5|W  19|D   1|"
## [5] "    5 | HANSHI ZUO                      |5.5  |W  45|W  37|D  12|D  13|D   4|W  14|W  17|"

Vector containing record format 2 contents:

head(tournament.format2, 5)

## [1] "   ON | 15445895 / R: 1794   >>1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |"
## [2] "   MI | 14598900 / R: 1553   >>1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |"
## [3] "   MI | 14959604 / R: 1384   >>1640     |N:2  |W    |B    |W    |B    |W    |B    |W    |"
## [4] "   MI | 12616049 / R: 1716   >>1744     |N:2  |W    |B    |W    |B    |W    |B    |B    |"
## [5] "   MI | 14601533 / R: 1655   >>1690     |N:2  |B    |W    |B    |W    |B    |W    |B    |"

With the two record types split into vectors, we can now parse the records’ fields while loading them into data frames. Loading the data frames format1_df and format2_df from the vectors leverages the fixed-width nature of the chess results. Regular expressions will be used to extract the values for uscf_id, pre_rating, and post_rating.

Step One - parse the first record format for each player load into a data frame called format1_df.

format1_df <- data.frame(player_num   = as.numeric(substr(tournament.format1, 1,  6)),
                         player_name  = str_trim(substr(tournament.format1, 8,  40), side="both"),
                         total_pts    = as.numeric(substr(tournament.format1, 42, 46)),
                         round1       = substr(tournament.format1, 48, 52),
                         round2       = substr(tournament.format1, 54, 58),
                         round3       = substr(tournament.format1, 60, 64),
                         round4       = substr(tournament.format1, 66, 70),
                         round5       = substr(tournament.format1, 72, 76),
                         round6       = substr(tournament.format1, 78, 82),
                         round7       = substr(tournament.format1, 84, 88), stringsAsFactors=FALSE)

Step Two - parse the second record format for each player into a data frame called format2_df.

format2_df <- data.frame(player_state  = str_trim(substr(tournament.format2, 1,  6), side="both"),
                         uscf_id       = str_extract(substr(tournament.format2, 8,  40), "\\d+"),
                         pre_rating    = as.numeric(str_extract(substr(tournament.format2, 8,  40), "(?<=R: ).\\d+(?=)")),
                         post_rating   = as.numeric(str_extract(substr(tournament.format2, 8,  40), "(?<=>>).\\d+(?=)")),
                         stringsAsFactors=FALSE)

Combine the two data frames to create a single, consolidated record for each player.

results_df <- cbind(format1_df, format2_df)

	Player Num	Player Name	Total Points	Round 1	Round 2	Round 3	Round 4	Round 5	Round 6	Round 7	Player State	USCF ID	Pre-Rating	Post-Rating
Tournament Results – Consolidated Record Layout
1	1	GARY HUA	6	W 39	W 21	W 18	W 14	W 7	D 12	D 4	ON	15445895	1794	1817
2	2	DAKSHESH DARURI	6	W 63	W 58	L 4	W 17	W 16	W 20	W 7	MI	14598900	1553	1663
3	3	ADITYA BAJAJ	6	L 8	W 61	W 25	W 21	W 11	W 13	W 12	MI	14959604	1384	1640
4	4	PATRICK H SCHILLING	5.5	W 23	D 28	W 2	W 26	D 5	W 19	D 1	MI	12616049	1716	1744
5	5	HANSHI ZUO	5.5	W 45	W 37	D 12	D 13	D 4	W 14	W 17	MI	14601533	1655	1690
6	6	HANSEN SONG	5	W 34	D 29	L 11	W 35	D 10	W 27	W 21	OH	15055204	1686	1687
7	7	GARY DEE SWATHELL	5	W 57	W 46	W 13	W 11	L 1	W 9	L 2	MI	11146376	1649	1673
8	8	EZEKIEL HOUGHTON	5	W 3	W 32	L 14	L 9	W 47	W 28	W 19	MI	15142253	1641	1657
9	9	STEFANO LEE	5	W 25	L 18	W 59	W 8	W 26	L 7	W 20	ON	14954524	1411	1564
10	10	ANVIT RAO	5	D 16	L 19	W 55	W 31	D 6	W 25	W 18	MI	14150362	1365	1544

Processing the Final Output

Creating the final output will focus on calculating the value of each player’s Average Pre Chess Rating of Opponents played. To do this, break the chess results data into a dataset only containing player information, excluding any information pertaining to matches played. Matches played in rounds 1 - 7 will be converted to a separate more normalized or tidy dataset.

Chess Player Dataset

player_df <- select(results_df, player_num:total_pts, player_state:post_rating)

	Player Num	Player Name	Total Points	Player State	USCF ID	Pre-Rating	Post-Rating
Chess Player Information Dataset
1	1	GARY HUA	6	ON	15445895	1794	1817
2	2	DAKSHESH DARURI	6	MI	14598900	1553	1663
3	3	ADITYA BAJAJ	6	MI	14959604	1384	1640
4	4	PATRICK H SCHILLING	5.5	MI	12616049	1716	1744
5	5	HANSHI ZUO	5.5	MI	14601533	1655	1690
6	6	HANSEN SONG	5	OH	15055204	1686	1687
7	7	GARY DEE SWATHELL	5	MI	11146376	1649	1673
8	8	EZEKIEL HOUGHTON	5	MI	15142253	1641	1657
9	9	STEFANO LEE	5	ON	14954524	1411	1564
10	10	ANVIT RAO	5	MI	14150362	1365	1544

Chess Player Matches Normalized Dataset

The logic below creates a normalized dataset of the player’s matches, where the match results are changed from wide to long format. The data captured in the round 1 - 7 fields will be split to indicate the outcome of the round (Win, Loss, Draw, etc.) and the opponent’s player number.

player_matches_df <-  
    results_df %>% select(player_num, round1:round7) %>%
    melt(id.var=c("player_num"), value.name="outcome_opp")  %>% 
    mutate( round        = as.numeric(str_replace(variable, "round", "")),
            outcome      = str_extract(outcome_opp, "^\\w+"),
            opponent_num = as.numeric(str_extract(outcome_opp, "\\d+$"))) %>%
    select(-c(variable, outcome_opp)) %>%
    inner_join(select(player_df, player_num, pre_rating), c("opponent_num" = "player_num")) %>%
    select(player_num, round, outcome, opponent_num, pre_rating) %>%
    arrange(player_num, round)
    
# rename pre_rating to opponent_pre_rating
names(player_matches_df)[names(player_matches_df) == "pre_rating"] <- "opponent_pre_rating"

	Player Num	Round	Outcome	Opponent’s Player Num	Opponent’s Pre-Chess Rating
Example Chess Player Matches Normalized Dataset
1	1	1	W	39	1436
2	1	2	W	21	1563
3	1	3	W	18	1600
4	1	4	W	14	1610
5	1	5	W	7	1649
6	1	6	D	12	1663
7	1	7	D	4	1716
8	2	1	W	63	1175
9	2	2	W	58	917
10	2	3	L	4	1716

Aggregate the normalized match dataset to create the value for opponents_avg_pre_rating. Note: In this calculation, only matches played with a Win, Loss, or Draw outcome are considered.

player_df <-  
    player_matches_df %>%
    group_by(player_num) %>% summarise(opponents_avg_pre_rating = round(mean(opponent_pre_rating))) %>%
    inner_join(player_df, by="player_num") %>%
    select(player_num, player_name, player_state, total_pts,  uscf_id, pre_rating, post_rating, opponents_avg_pre_rating)

	Player Num	Player Name	Player State	Total Points	USCF ID	Pre-Rating	Post-Rating	Opponents’ Average Pre-Chess Rating
Example Final Player Result
1	1	GARY HUA	ON	6	15445895	1794	1817	1605
2	2	DAKSHESH DARURI	MI	6	14598900	1553	1663	1469
3	3	ADITYA BAJAJ	MI	6	14959604	1384	1640	1564
4	4	PATRICK H SCHILLING	MI	5.5	12616049	1716	1744	1574
5	5	HANSHI ZUO	MI	5.5	14601533	1655	1690	1501
6	6	HANSEN SONG	OH	5	15055204	1686	1687	1519
7	7	GARY DEE SWATHELL	MI	5	11146376	1649	1673	1372
8	8	EZEKIEL HOUGHTON	MI	5	15142253	1641	1657	1468
9	9	STEFANO LEE	ON	5	14954524	1411	1564	1523
10	10	ANVIT RAO	MI	5	14150362	1365	1544	1554

Final Output

Create the output .CSV files: Chess_Players.csv and Chess_Player_Matches.csv.

The output files can be found on GitHub:

write.csv(player_df, "Chess_Players.csv", row.names=FALSE)
write.csv(player_matches_df, "Chess_Player_Matches.csv", row.names=FALSE)

DATA 607 - Project 1 Chess Tournament Results

Keith Folsom

February 27, 2016

Project Overview

Setup

Data