Load general libraries:

library(dplyr)
library(stringr)
library(tidyr)

Introduction

The purpose this document is to describe the process of importing the data in tournamentinfo.txt and making it amenable to R.

The data is highly formatted, using sequences of pipe and hyphen characters as column and row delimiters respectively, and splits each record across two lines. Below are the first few records of data:

##  [1] "-----------------------------------------------------------------------------------------" 
##  [2] " Pair | Player Name                     |Total|Round|Round|Round|Round|Round|Round|Round| "
##  [3] " Num  | USCF ID / Rtg (Pre->Post)       | Pts |  1  |  2  |  3  |  4  |  5  |  6  |  7  | "
##  [4] "-----------------------------------------------------------------------------------------" 
##  [5] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|" 
##  [6] "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |" 
##  [7] "-----------------------------------------------------------------------------------------" 
##  [8] "    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|" 
##  [9] "   MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |" 
## [10] "-----------------------------------------------------------------------------------------" 
## [11] "    3 | ADITYA BAJAJ                    |6.0  |L   8|W  61|W  25|W  21|W  11|W  13|W  12|" 
## [12] "   MI | 14959604 / R: 1384   ->1640     |N:2  |W    |B    |W    |B    |W    |B    |W    |" 
## [13] "-----------------------------------------------------------------------------------------" 
## [14] "    4 | PATRICK H SCHILLING             |5.5  |W  23|D  28|W   2|W  26|D   5|W  19|D   1|" 
## [15] "   MI | 12616049 / R: 1716   ->1744     |N:2  |W    |B    |W    |B    |W    |B    |B    |" 
## [16] "-----------------------------------------------------------------------------------------"

Each record is a player in the chess tournament. Records include player ID, player state, total points, pre- and post-tournament rating, outcome of all rounds (seven total), the opponents’ IDs for each match, and the color of the player that match.

The end goal is a clean dataframe with player name, player state, total points, pre-tournament rating, and average pre-tournament rating of opponents.

Aproach

RegEx always feels slightly dirty to me—like a last resort—so I’d like to minimize its use. Even minor changes to the source data can ruin RegEx matching, even the most thoughtful RegEx can behave unexpectedly, and besides, RegEx are notoriously hard to read.

Instead I’m going to try to make R see the the data ‘by itself’ as much as possible. Hopefully this will make it easier to understand the code, make it more robust to changes such as columns added or discarded, and thus easier to maintain.

Loading the Raw Text

Rather than trying to force this messy text data into vector-dataframe form right out of the gate, I’m going to start by treating it as lines of text.

First, run each line in a for loop, categorizing them by simple RegEx expressions. One reason this data table is tricky is that the relevant data is spread over two lines. I will refer to the first line in a player row as ‘player row 1’ and the second as ‘player row 2.’ After I seperate out the relevant lines from the table into dataframes, I’ll combine the pieces later for further manipulation.

The first kind of line begins with a sequence of digits surrounded by white space, the player’s ID. The second kind starts with the player’s home state—two capital letters flanked by white space.

raw <- readLines('tournamentinfo.txt')

player_row1 <- list()
player_row2 <- list()

for (line in raw) {
    if (str_detect(line, '^\\s+\\d+\\s+') == TRUE) {
        player_row1 <- c(player_row1, line)
    }
    else if (str_detect(line, '^\\s+[A-Z]{2}\\s+') == TRUE) {
        player_row2 <- c(player_row2, line)
    }
}

Clean Tournament Data

Now to prepare player_row1, currently a list of messy text, into a dataframe of clean text.

df1_colnames <- c('player_id', 'player_name', 'total_points', 'match_1', 'match_2',
                  'match_3', 'match_4', 'match_5', 'match_6', 'match_7')

player_row1_df <- data.frame(unlist(unlist(player_row1)))
player_row1_df <- player_row1_df %>%
    separate(colnames(player_row1_df)[1], into=df1_colnames, sep='\\|') %>%
    mutate_all(str_squish)
row.names(player_row1_df) <- NULL

head(player_row1_df)

##   player_id         player_name total_points match_1 match_2 match_3
## 1         1            GARY HUA          6.0    W 39    W 21    W 18
## 2         2     DAKSHESH DARURI          6.0    W 63    W 58     L 4
## 3         3        ADITYA BAJAJ          6.0     L 8    W 61    W 25
## 4         4 PATRICK H SCHILLING          5.5    W 23    D 28     W 2
## 5         5          HANSHI ZUO          5.5    W 45    W 37    D 12
## 6         6         HANSEN SONG          5.0    W 34    D 29    L 11
##   match_4 match_5 match_6 match_7
## 1    W 14     W 7    D 12     D 4
## 2    W 17    W 16    W 20     W 7
## 3    W 21    W 11    W 13    W 12
## 4    W 26     D 5    W 19     D 1
## 5    D 13     D 4    W 14    W 17
## 6    W 35    D 10    W 27    W 21

Next is to prepare ‘player row 2’:

df2_colnames <- c('state', 'ident', 'n', 'm1_color', 'm2_color', 'm3_color',
                  'm4_color', 'm5_color', 'm6_color', 'm7_color', 'nulls')

player_row2_df <- data.frame(unlist(unlist(player_row2)))
player_row2_df <- player_row2_df %>%
    separate(colnames(player_row2_df)[1], into=df2_colnames, sep='\\|') %>%
    select(-nulls) %>%  # remove column of nulls
    mutate_all(str_squish)
row.names(player_row2_df) <- NULL

head(player_row2_df)

##   state                     ident   n m1_color m2_color m3_color m4_color
## 1    ON 15445895 / R: 1794 ->1817 N:2        W        B        W        B
## 2    MI 14598900 / R: 1553 ->1663 N:2        B        W        B        W
## 3    MI 14959604 / R: 1384 ->1640 N:2        W        B        W        B
## 4    MI 12616049 / R: 1716 ->1744 N:2        W        B        W        B
## 5    MI 14601533 / R: 1655 ->1690 N:2        B        W        B        W
## 6    OH 15055204 / R: 1686 ->1687 N:3        W        B        W        B
##   m5_color m6_color m7_color
## 1        W        B        W
## 2        B        W        B
## 3        W        B        W
## 4        W        B        B
## 5        B        W        B
## 6        B        W        B

Bringing it together:

tournaments <- cbind(player_row1_df, player_row2_df)
head(tournaments)

##   player_id         player_name total_points match_1 match_2 match_3
## 1         1            GARY HUA          6.0    W 39    W 21    W 18
## 2         2     DAKSHESH DARURI          6.0    W 63    W 58     L 4
## 3         3        ADITYA BAJAJ          6.0     L 8    W 61    W 25
## 4         4 PATRICK H SCHILLING          5.5    W 23    D 28     W 2
## 5         5          HANSHI ZUO          5.5    W 45    W 37    D 12
## 6         6         HANSEN SONG          5.0    W 34    D 29    L 11
##   match_4 match_5 match_6 match_7 state                     ident   n
## 1    W 14     W 7    D 12     D 4    ON 15445895 / R: 1794 ->1817 N:2
## 2    W 17    W 16    W 20     W 7    MI 14598900 / R: 1553 ->1663 N:2
## 3    W 21    W 11    W 13    W 12    MI 14959604 / R: 1384 ->1640 N:2
## 4    W 26     D 5    W 19     D 1    MI 12616049 / R: 1716 ->1744 N:2
## 5    D 13     D 4    W 14    W 17    MI 14601533 / R: 1655 ->1690 N:2
## 6    W 35    D 10    W 27    W 21    OH 15055204 / R: 1686 ->1687 N:3
##   m1_color m2_color m3_color m4_color m5_color m6_color m7_color
## 1        W        B        W        B        W        B        W
## 2        B        W        B        W        B        W        B
## 3        W        B        W        B        W        B        W
## 4        W        B        W        B        W        B        B
## 5        B        W        B        W        B        W        B
## 6        W        B        W        B        B        W        B

str(tournaments)

## 'data.frame':    64 obs. of  20 variables:
##  $ player_id   : chr  "1" "2" "3" "4" ...
##  $ player_name : chr  "GARY HUA" "DAKSHESH DARURI" "ADITYA BAJAJ" "PATRICK H SCHILLING" ...
##  $ total_points: chr  "6.0" "6.0" "6.0" "5.5" ...
##  $ match_1     : chr  "W 39" "W 63" "L 8" "W 23" ...
##  $ match_2     : chr  "W 21" "W 58" "W 61" "D 28" ...
##  $ match_3     : chr  "W 18" "L 4" "W 25" "W 2" ...
##  $ match_4     : chr  "W 14" "W 17" "W 21" "W 26" ...
##  $ match_5     : chr  "W 7" "W 16" "W 11" "D 5" ...
##  $ match_6     : chr  "D 12" "W 20" "W 13" "W 19" ...
##  $ match_7     : chr  "D 4" "W 7" "W 12" "D 1" ...
##  $ state       : chr  "ON" "MI" "MI" "MI" ...
##  $ ident       : chr  "15445895 / R: 1794 ->1817" "14598900 / R: 1553 ->1663" "14959604 / R: 1384 ->1640" "12616049 / R: 1716 ->1744" ...
##  $ n           : chr  "N:2" "N:2" "N:2" "N:2" ...
##  $ m1_color    : chr  "W" "B" "W" "W" ...
##  $ m2_color    : chr  "B" "W" "B" "B" ...
##  $ m3_color    : chr  "W" "B" "W" "W" ...
##  $ m4_color    : chr  "B" "W" "B" "B" ...
##  $ m5_color    : chr  "W" "B" "W" "W" ...
##  $ m6_color    : chr  "B" "W" "B" "B" ...
##  $ m7_color    : chr  "W" "B" "W" "B" ...

Pre-Rating

Pre-rating is stored in the ident column:

head(tournaments$ident)

## [1] "15445895 / R: 1794 ->1817" "14598900 / R: 1553 ->1663"
## [3] "14959604 / R: 1384 ->1640" "12616049 / R: 1716 ->1744"
## [5] "14601533 / R: 1655 ->1690" "15055204 / R: 1686 ->1687"

We can use Regex to extract what we need into the pre_rating column:

tournaments$pre_rating <- sapply(tournaments$ident, function(x) as.numeric(str_match(x, '(R: )(\\d+)')[3]))

Average Pre Chess Rating of Opponents

First, let’s create a function that provides the average for any vector of player IDs:

calc_pre_avg <- function(x) {
    pre_ratings <- tournaments$pre_rating[x]
    return(round(mean(pre_ratings, na.rm=TRUE)))
}

We know player 1’s average pre rating of opponents in 1605. Let’s test this function to make sure it works:

calc_pre_avg(c(39, 21, 18, 14, 7, 12, 4))

## [1] 1605

Perfect!

Note that there are NA’s in some of this data. I have added na.rm=TRUE to ignore them when calculating the pre average. The denominator will decrease by 1 for each NA.

Next, let’s make a column that contains each player’s match opponents’ IDs. This process gets a little messy, as we shift the data around in different R data types to ultimately subject it tocalc_pre_avg().

Using str_extract() and sapply we can get a nice matrix containing the opponent ID for each match:

match_opponents <- as.data.frame(sapply(tournaments[4:10], str_extract, '\\d+'))
head(match_opponents)

##   match_1 match_2 match_3 match_4 match_5 match_6 match_7
## 1      39      21      18      14       7      12       4
## 2      63      58       4      17      16      20       7
## 3       8      61      25      21      11      13      12
## 4      23      28       2      26       5      19       1
## 5      45      37      12      13       4      14      17
## 6      34      29      11      35      10      27      21

To apply to our custom function, arrange this dataframe into a list of vectors, each containing the IDs of opponents:

match_opponents_list <- apply(match_opponents, 1, as.list)
match_opponents_num <- lapply(match_opponents_list, function(x) as.numeric(unlist(x)))
head(match_opponents_num)

## [[1]]
## [1] 39 21 18 14  7 12  4
## 
## [[2]]
## [1] 63 58  4 17 16 20  7
## 
## [[3]]
## [1]  8 61 25 21 11 13 12
## 
## [[4]]
## [1] 23 28  2 26  5 19  1
## 
## [[5]]
## [1] 45 37 12 13  4 14 17
## 
## [[6]]
## [1] 34 29 11 35 10 27 21

str(head(match_opponents_num))

## List of 6
##  $ : num [1:7] 39 21 18 14 7 12 4
##  $ : num [1:7] 63 58 4 17 16 20 7
##  $ : num [1:7] 8 61 25 21 11 13 12
##  $ : num [1:7] 23 28 2 26 5 19 1
##  $ : num [1:7] 45 37 12 13 4 14 17
##  $ : num [1:7] 34 29 11 35 10 27 21

Finally, calcuate the average opp pre-rating and insert as a column:

match_pre_avg <- lapply(match_opponents_num, calc_pre_avg)
tournaments$avg_opp_pre_rating <- as.numeric(match_pre_avg)
head(tournaments[c(1, 2, 22)])

##   player_id         player_name avg_opp_pre_rating
## 1         1            GARY HUA               1605
## 2         2     DAKSHESH DARURI               1469
## 3         3        ADITYA BAJAJ               1564
## 4         4 PATRICK H SCHILLING               1574
## 5         5          HANSHI ZUO               1501
## 6         6         HANSEN SONG               1519

Final Output

Create a dataframe df with only the relevant columns:

df <- tournaments %>% 
    select(player_name, state, total_points, pre_rating, avg_opp_pre_rating) %>%
    mutate(total_points = as.numeric(total_points))

head(df)

##           player_name state total_points pre_rating avg_opp_pre_rating
## 1            GARY HUA    ON          6.0       1794               1605
## 2     DAKSHESH DARURI    MI          6.0       1553               1469
## 3        ADITYA BAJAJ    MI          6.0       1384               1564
## 4 PATRICK H SCHILLING    MI          5.5       1716               1574
## 5          HANSHI ZUO    MI          5.5       1655               1501
## 6         HANSEN SONG    OH          5.0       1686               1519

str(df)

## 'data.frame':    64 obs. of  5 variables:
##  $ player_name       : chr  "GARY HUA" "DAKSHESH DARURI" "ADITYA BAJAJ" "PATRICK H SCHILLING" ...
##  $ state             : chr  "ON" "MI" "MI" "MI" ...
##  $ total_points      : num  6 6 6 5.5 5.5 5 5 5 5 5 ...
##  $ pre_rating        : num  1794 1553 1384 1716 1655 ...
##  $ avg_opp_pre_rating: num  1605 1469 1564 1574 1501 ...

write.csv(df, 'output.csv', row.names=FALSE)

Bonus: Do higher-ELO players always beat lower-ELO players?

How useful is the ELO rating? How often does a higher-ELO player beat a lesser-ELO player?

First let’s create an index of players and their pre-ratings:

pre_ratings <- tournaments[c('player_id', 'pre_rating')]
head(pre_ratings)

##   player_id pre_rating
## 1         1       1794
## 2         2       1553
## 3         3       1384
## 4         4       1716
## 5         5       1655
## 6         6       1686

Next let’s convert the (wide) tournament data into (long) match data with reshape2::melt(). I’m also including only matches with a win or loss outcome. Then I get the player and opponent’s ratings from pre_ratings.

library(reshape2)
matches <- tournaments[c(1, 2, 4:10)] %>%
    melt(id.vars=c('player_id', 'player_name')) %>%
    filter(str_detect(value, 'W|L') == TRUE) %>%
    separate('value', into=c('outcome', 'opponent_id'), sep=' ') %>%
    inner_join(pre_ratings, by='player_id') %>%
    inner_join(pre_ratings, by=c('opponent_id'='player_id')) %>%
    `colnames<-`(c('player_id', 'player_name', 'match', 'outcome',
                   'opponent_id', 'player_rating', 'opponent_rating')) %>%
    select(player_id, match, outcome, opponent_id, player_rating, 
           opponent_rating)

head(matches)

##   player_id   match outcome opponent_id player_rating opponent_rating
## 1         1 match_1       W          39          1794            1436
## 2         2 match_1       W          63          1553            1175
## 3         3 match_1       L           8          1384            1641
## 4         4 match_1       W          23          1716            1363
## 5         5 match_1       W          45          1655            1242
## 6         6 match_1       W          34          1686            1399

There are obviously duplicates in this dataset, e.g., there’s a record for (player_id=X, opponent_id=Y) and one for (player_id=Y, opponent_id=X), even though it’s the same match.

To de-dupe, I’m going to create a unique ID for each match, then dedupe on that column. To create the unique ID, I put player_id and opponent_id in a two-value numeric vector, than sort numerically. Then I append the match number.

player_ids <- paste(matches$player_id, matches$opponent_id, sep=',')
player_ids_sorted <- lapply(player_ids, function(i) sort(as.numeric(unlist(str_split(i, ',')))))
player_ids_char <- lapply(player_ids_sorted, function(x) paste(x, collapse='_'))
matches$match_id <- paste(matches$match, player_ids_char, sep="_")

head(matches$match_id)

## [1] "match_1_1_39" "match_1_2_63" "match_1_3_8"  "match_1_4_23"
## [5] "match_1_5_45" "match_1_6_34"

length(matches$match_id)

## [1] 350

length(unique(matches$match_id))

## [1] 175

We see match ID has done its job: There are half as many unique IDs as there are records. Dedupe on match_id:

matches <- distinct(matches, match_id, .keep_all=TRUE)

Finally, let’s calculate how often higher-ELO beats lower-ELO:

matches$player_gt_opponent <- ifelse(matches$player_rating > matches$opponent_rating,
                                     TRUE, FALSE)
head(matches)

##   player_id   match outcome opponent_id player_rating opponent_rating
## 1         1 match_1       W          39          1794            1436
## 2         2 match_1       W          63          1553            1175
## 3         3 match_1       L           8          1384            1641
## 4         4 match_1       W          23          1716            1363
## 5         5 match_1       W          45          1655            1242
## 6         6 match_1       W          34          1686            1399
##       match_id player_gt_opponent
## 1 match_1_1_39               TRUE
## 2 match_1_2_63               TRUE
## 3  match_1_3_8              FALSE
## 4 match_1_4_23               TRUE
## 5 match_1_5_45               TRUE
## 6 match_1_6_34               TRUE

Examining the results of this analysis:

table(matches$outcome, matches$player_gt_opponent)

##    
##     FALSE TRUE
##   L    12   21
##   W    26  116

This strikes me as odd.

There where 21 + 116 = 137 matches where player had a higher ELO rating than opponent, and player won \(116/137 =\) almost 85 percent of the time.

However, of the \(12 + 26 = 38\) matches where player did not have a higher rating than opponent, player still won \(26/38 =\) 68 percent of the time.

This suggests that ELO doesn’t matter so much. It seems to me there is something else going on in this data, that some other variable is more highly associated with player winning than ELO, e.g., ranking. Perhaps what this data really says is that currently top ranked chess players are likely to beat lower ranked chess players (with higher ELO ranking increasing the probability somewhat). The data also suggest upsets in chess rankings should be few.

Cleaning Chess Tournement Data with Minimal RegEx

Ben Horvath

September 23, 2018