Load general libraries:
Introduction
The purpose this document is to describe the process of importing the data in tournamentinfo.txt and making it amenable to R.
The data is highly formatted, using sequences of pipe and hyphen characters as column and row delimiters respectively, and splits each record across two lines. Below are the first few records of data:
## [1] "-----------------------------------------------------------------------------------------"
## [2] " Pair | Player Name |Total|Round|Round|Round|Round|Round|Round|Round| "
## [3] " Num | USCF ID / Rtg (Pre->Post) | Pts | 1 | 2 | 3 | 4 | 5 | 6 | 7 | "
## [4] "-----------------------------------------------------------------------------------------"
## [5] " 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|"
## [6] " ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |"
## [7] "-----------------------------------------------------------------------------------------"
## [8] " 2 | DAKSHESH DARURI |6.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7|"
## [9] " MI | 14598900 / R: 1553 ->1663 |N:2 |B |W |B |W |B |W |B |"
## [10] "-----------------------------------------------------------------------------------------"
## [11] " 3 | ADITYA BAJAJ |6.0 |L 8|W 61|W 25|W 21|W 11|W 13|W 12|"
## [12] " MI | 14959604 / R: 1384 ->1640 |N:2 |W |B |W |B |W |B |W |"
## [13] "-----------------------------------------------------------------------------------------"
## [14] " 4 | PATRICK H SCHILLING |5.5 |W 23|D 28|W 2|W 26|D 5|W 19|D 1|"
## [15] " MI | 12616049 / R: 1716 ->1744 |N:2 |W |B |W |B |W |B |B |"
## [16] "-----------------------------------------------------------------------------------------"
Each record is a player in the chess tournament. Records include player ID, player state, total points, pre- and post-tournament rating, outcome of all rounds (seven total), the opponents’ IDs for each match, and the color of the player that match.
The end goal is a clean dataframe with player name, player state, total points, pre-tournament rating, and average pre-tournament rating of opponents.
Aproach
RegEx always feels slightly dirty to me—like a last resort—so I’d like to minimize its use. Even minor changes to the source data can ruin RegEx matching, even the most thoughtful RegEx can behave unexpectedly, and besides, RegEx are notoriously hard to read.
Instead I’m going to try to make R see the the data ‘by itself’ as much as possible. Hopefully this will make it easier to understand the code, make it more robust to changes such as columns added or discarded, and thus easier to maintain.
Loading the Raw Text
Rather than trying to force this messy text data into vector-dataframe form right out of the gate, I’m going to start by treating it as lines of text.
First, run each line in a for loop, categorizing them by simple RegEx expressions. One reason this data table is tricky is that the relevant data is spread over two lines. I will refer to the first line in a player row as ‘player row 1’ and the second as ‘player row 2.’ After I seperate out the relevant lines from the table into dataframes, I’ll combine the pieces later for further manipulation.
The first kind of line begins with a sequence of digits surrounded by white space, the player’s ID. The second kind starts with the player’s home state—two capital letters flanked by white space.
Clean Tournament Data
Now to prepare player_row1, currently a list of messy text, into a dataframe of clean text.
df1_colnames <- c('player_id', 'player_name', 'total_points', 'match_1', 'match_2',
'match_3', 'match_4', 'match_5', 'match_6', 'match_7')
player_row1_df <- data.frame(unlist(unlist(player_row1)))
player_row1_df <- player_row1_df %>%
separate(colnames(player_row1_df)[1], into=df1_colnames, sep='\\|') %>%
mutate_all(str_squish)
row.names(player_row1_df) <- NULL
head(player_row1_df)## player_id player_name total_points match_1 match_2 match_3
## 1 1 GARY HUA 6.0 W 39 W 21 W 18
## 2 2 DAKSHESH DARURI 6.0 W 63 W 58 L 4
## 3 3 ADITYA BAJAJ 6.0 L 8 W 61 W 25
## 4 4 PATRICK H SCHILLING 5.5 W 23 D 28 W 2
## 5 5 HANSHI ZUO 5.5 W 45 W 37 D 12
## 6 6 HANSEN SONG 5.0 W 34 D 29 L 11
## match_4 match_5 match_6 match_7
## 1 W 14 W 7 D 12 D 4
## 2 W 17 W 16 W 20 W 7
## 3 W 21 W 11 W 13 W 12
## 4 W 26 D 5 W 19 D 1
## 5 D 13 D 4 W 14 W 17
## 6 W 35 D 10 W 27 W 21
Next is to prepare ‘player row 2’:
df2_colnames <- c('state', 'ident', 'n', 'm1_color', 'm2_color', 'm3_color',
'm4_color', 'm5_color', 'm6_color', 'm7_color', 'nulls')
player_row2_df <- data.frame(unlist(unlist(player_row2)))
player_row2_df <- player_row2_df %>%
separate(colnames(player_row2_df)[1], into=df2_colnames, sep='\\|') %>%
select(-nulls) %>% # remove column of nulls
mutate_all(str_squish)
row.names(player_row2_df) <- NULL
head(player_row2_df)## state ident n m1_color m2_color m3_color m4_color
## 1 ON 15445895 / R: 1794 ->1817 N:2 W B W B
## 2 MI 14598900 / R: 1553 ->1663 N:2 B W B W
## 3 MI 14959604 / R: 1384 ->1640 N:2 W B W B
## 4 MI 12616049 / R: 1716 ->1744 N:2 W B W B
## 5 MI 14601533 / R: 1655 ->1690 N:2 B W B W
## 6 OH 15055204 / R: 1686 ->1687 N:3 W B W B
## m5_color m6_color m7_color
## 1 W B W
## 2 B W B
## 3 W B W
## 4 W B B
## 5 B W B
## 6 B W B
Bringing it together:
## player_id player_name total_points match_1 match_2 match_3
## 1 1 GARY HUA 6.0 W 39 W 21 W 18
## 2 2 DAKSHESH DARURI 6.0 W 63 W 58 L 4
## 3 3 ADITYA BAJAJ 6.0 L 8 W 61 W 25
## 4 4 PATRICK H SCHILLING 5.5 W 23 D 28 W 2
## 5 5 HANSHI ZUO 5.5 W 45 W 37 D 12
## 6 6 HANSEN SONG 5.0 W 34 D 29 L 11
## match_4 match_5 match_6 match_7 state ident n
## 1 W 14 W 7 D 12 D 4 ON 15445895 / R: 1794 ->1817 N:2
## 2 W 17 W 16 W 20 W 7 MI 14598900 / R: 1553 ->1663 N:2
## 3 W 21 W 11 W 13 W 12 MI 14959604 / R: 1384 ->1640 N:2
## 4 W 26 D 5 W 19 D 1 MI 12616049 / R: 1716 ->1744 N:2
## 5 D 13 D 4 W 14 W 17 MI 14601533 / R: 1655 ->1690 N:2
## 6 W 35 D 10 W 27 W 21 OH 15055204 / R: 1686 ->1687 N:3
## m1_color m2_color m3_color m4_color m5_color m6_color m7_color
## 1 W B W B W B W
## 2 B W B W B W B
## 3 W B W B W B W
## 4 W B W B W B B
## 5 B W B W B W B
## 6 W B W B B W B
## 'data.frame': 64 obs. of 20 variables:
## $ player_id : chr "1" "2" "3" "4" ...
## $ player_name : chr "GARY HUA" "DAKSHESH DARURI" "ADITYA BAJAJ" "PATRICK H SCHILLING" ...
## $ total_points: chr "6.0" "6.0" "6.0" "5.5" ...
## $ match_1 : chr "W 39" "W 63" "L 8" "W 23" ...
## $ match_2 : chr "W 21" "W 58" "W 61" "D 28" ...
## $ match_3 : chr "W 18" "L 4" "W 25" "W 2" ...
## $ match_4 : chr "W 14" "W 17" "W 21" "W 26" ...
## $ match_5 : chr "W 7" "W 16" "W 11" "D 5" ...
## $ match_6 : chr "D 12" "W 20" "W 13" "W 19" ...
## $ match_7 : chr "D 4" "W 7" "W 12" "D 1" ...
## $ state : chr "ON" "MI" "MI" "MI" ...
## $ ident : chr "15445895 / R: 1794 ->1817" "14598900 / R: 1553 ->1663" "14959604 / R: 1384 ->1640" "12616049 / R: 1716 ->1744" ...
## $ n : chr "N:2" "N:2" "N:2" "N:2" ...
## $ m1_color : chr "W" "B" "W" "W" ...
## $ m2_color : chr "B" "W" "B" "B" ...
## $ m3_color : chr "W" "B" "W" "W" ...
## $ m4_color : chr "B" "W" "B" "B" ...
## $ m5_color : chr "W" "B" "W" "W" ...
## $ m6_color : chr "B" "W" "B" "B" ...
## $ m7_color : chr "W" "B" "W" "B" ...
Pre-Rating
Pre-rating is stored in the ident column:
## [1] "15445895 / R: 1794 ->1817" "14598900 / R: 1553 ->1663"
## [3] "14959604 / R: 1384 ->1640" "12616049 / R: 1716 ->1744"
## [5] "14601533 / R: 1655 ->1690" "15055204 / R: 1686 ->1687"
We can use Regex to extract what we need into the pre_rating column:
Average Pre Chess Rating of Opponents
First, let’s create a function that provides the average for any vector of player IDs:
calc_pre_avg <- function(x) {
pre_ratings <- tournaments$pre_rating[x]
return(round(mean(pre_ratings, na.rm=TRUE)))
}We know player 1’s average pre rating of opponents in 1605. Let’s test this function to make sure it works:
## [1] 1605
Perfect!
Note that there are NA’s in some of this data. I have added na.rm=TRUE to ignore them when calculating the pre average. The denominator will decrease by 1 for each NA.
Next, let’s make a column that contains each player’s match opponents’ IDs. This process gets a little messy, as we shift the data around in different R data types to ultimately subject it tocalc_pre_avg().
Using str_extract() and sapply we can get a nice matrix containing the opponent ID for each match:
match_opponents <- as.data.frame(sapply(tournaments[4:10], str_extract, '\\d+'))
head(match_opponents)## match_1 match_2 match_3 match_4 match_5 match_6 match_7
## 1 39 21 18 14 7 12 4
## 2 63 58 4 17 16 20 7
## 3 8 61 25 21 11 13 12
## 4 23 28 2 26 5 19 1
## 5 45 37 12 13 4 14 17
## 6 34 29 11 35 10 27 21
To apply to our custom function, arrange this dataframe into a list of vectors, each containing the IDs of opponents:
match_opponents_list <- apply(match_opponents, 1, as.list)
match_opponents_num <- lapply(match_opponents_list, function(x) as.numeric(unlist(x)))
head(match_opponents_num)## [[1]]
## [1] 39 21 18 14 7 12 4
##
## [[2]]
## [1] 63 58 4 17 16 20 7
##
## [[3]]
## [1] 8 61 25 21 11 13 12
##
## [[4]]
## [1] 23 28 2 26 5 19 1
##
## [[5]]
## [1] 45 37 12 13 4 14 17
##
## [[6]]
## [1] 34 29 11 35 10 27 21
## List of 6
## $ : num [1:7] 39 21 18 14 7 12 4
## $ : num [1:7] 63 58 4 17 16 20 7
## $ : num [1:7] 8 61 25 21 11 13 12
## $ : num [1:7] 23 28 2 26 5 19 1
## $ : num [1:7] 45 37 12 13 4 14 17
## $ : num [1:7] 34 29 11 35 10 27 21
Finally, calcuate the average opp pre-rating and insert as a column:
match_pre_avg <- lapply(match_opponents_num, calc_pre_avg)
tournaments$avg_opp_pre_rating <- as.numeric(match_pre_avg)
head(tournaments[c(1, 2, 22)])## player_id player_name avg_opp_pre_rating
## 1 1 GARY HUA 1605
## 2 2 DAKSHESH DARURI 1469
## 3 3 ADITYA BAJAJ 1564
## 4 4 PATRICK H SCHILLING 1574
## 5 5 HANSHI ZUO 1501
## 6 6 HANSEN SONG 1519
Final Output
Create a dataframe df with only the relevant columns:
df <- tournaments %>%
select(player_name, state, total_points, pre_rating, avg_opp_pre_rating) %>%
mutate(total_points = as.numeric(total_points))
head(df)## player_name state total_points pre_rating avg_opp_pre_rating
## 1 GARY HUA ON 6.0 1794 1605
## 2 DAKSHESH DARURI MI 6.0 1553 1469
## 3 ADITYA BAJAJ MI 6.0 1384 1564
## 4 PATRICK H SCHILLING MI 5.5 1716 1574
## 5 HANSHI ZUO MI 5.5 1655 1501
## 6 HANSEN SONG OH 5.0 1686 1519
## 'data.frame': 64 obs. of 5 variables:
## $ player_name : chr "GARY HUA" "DAKSHESH DARURI" "ADITYA BAJAJ" "PATRICK H SCHILLING" ...
## $ state : chr "ON" "MI" "MI" "MI" ...
## $ total_points : num 6 6 6 5.5 5.5 5 5 5 5 5 ...
## $ pre_rating : num 1794 1553 1384 1716 1655 ...
## $ avg_opp_pre_rating: num 1605 1469 1564 1574 1501 ...
Bonus: Do higher-ELO players always beat lower-ELO players?
How useful is the ELO rating? How often does a higher-ELO player beat a lesser-ELO player?
First let’s create an index of players and their pre-ratings:
## player_id pre_rating
## 1 1 1794
## 2 2 1553
## 3 3 1384
## 4 4 1716
## 5 5 1655
## 6 6 1686
Next let’s convert the (wide) tournament data into (long) match data with reshape2::melt(). I’m also including only matches with a win or loss outcome. Then I get the player and opponent’s ratings from pre_ratings.
library(reshape2)
matches <- tournaments[c(1, 2, 4:10)] %>%
melt(id.vars=c('player_id', 'player_name')) %>%
filter(str_detect(value, 'W|L') == TRUE) %>%
separate('value', into=c('outcome', 'opponent_id'), sep=' ') %>%
inner_join(pre_ratings, by='player_id') %>%
inner_join(pre_ratings, by=c('opponent_id'='player_id')) %>%
`colnames<-`(c('player_id', 'player_name', 'match', 'outcome',
'opponent_id', 'player_rating', 'opponent_rating')) %>%
select(player_id, match, outcome, opponent_id, player_rating,
opponent_rating)
head(matches)## player_id match outcome opponent_id player_rating opponent_rating
## 1 1 match_1 W 39 1794 1436
## 2 2 match_1 W 63 1553 1175
## 3 3 match_1 L 8 1384 1641
## 4 4 match_1 W 23 1716 1363
## 5 5 match_1 W 45 1655 1242
## 6 6 match_1 W 34 1686 1399
There are obviously duplicates in this dataset, e.g., there’s a record for (player_id=X, opponent_id=Y) and one for (player_id=Y, opponent_id=X), even though it’s the same match.
To de-dupe, I’m going to create a unique ID for each match, then dedupe on that column. To create the unique ID, I put player_id and opponent_id in a two-value numeric vector, than sort numerically. Then I append the match number.
player_ids <- paste(matches$player_id, matches$opponent_id, sep=',')
player_ids_sorted <- lapply(player_ids, function(i) sort(as.numeric(unlist(str_split(i, ',')))))
player_ids_char <- lapply(player_ids_sorted, function(x) paste(x, collapse='_'))
matches$match_id <- paste(matches$match, player_ids_char, sep="_")
head(matches$match_id)## [1] "match_1_1_39" "match_1_2_63" "match_1_3_8" "match_1_4_23"
## [5] "match_1_5_45" "match_1_6_34"
## [1] 350
## [1] 175
We see match ID has done its job: There are half as many unique IDs as there are records. Dedupe on match_id:
Finally, let’s calculate how often higher-ELO beats lower-ELO:
matches$player_gt_opponent <- ifelse(matches$player_rating > matches$opponent_rating,
TRUE, FALSE)
head(matches)## player_id match outcome opponent_id player_rating opponent_rating
## 1 1 match_1 W 39 1794 1436
## 2 2 match_1 W 63 1553 1175
## 3 3 match_1 L 8 1384 1641
## 4 4 match_1 W 23 1716 1363
## 5 5 match_1 W 45 1655 1242
## 6 6 match_1 W 34 1686 1399
## match_id player_gt_opponent
## 1 match_1_1_39 TRUE
## 2 match_1_2_63 TRUE
## 3 match_1_3_8 FALSE
## 4 match_1_4_23 TRUE
## 5 match_1_5_45 TRUE
## 6 match_1_6_34 TRUE
Examining the results of this analysis:
##
## FALSE TRUE
## L 12 21
## W 26 116
This strikes me as odd.
There where 21 + 116 = 137 matches where player had a higher ELO rating than opponent, and player won \(116/137 =\) almost 85 percent of the time.
However, of the \(12 + 26 = 38\) matches where player did not have a higher rating than opponent, player still won \(26/38 =\) 68 percent of the time.
This suggests that ELO doesn’t matter so much. It seems to me there is something else going on in this data, that some other variable is more highly associated with player winning than ELO, e.g., ranking. Perhaps what this data really says is that currently top ranked chess players are likely to beat lower ranked chess players (with higher ELO ranking increasing the probability somewhat). The data also suggest upsets in chess rankings should be few.