This is the Chess tournament dataset from our previous project. We created a tidy-ish dataset at the end, but I thought I could take it a bit further by separating the matches from the players and doing some analysis on the matches. e main thing I wanted to see was if the results of a player’s previous match could help predict the outcome of this match, given theirs and their opponent rating.
I won’t explain much on the first steps here as they are explained in more detail here: http://rpubs.com/StevenEllingson/530769
library(stringr)
library(dplyr)
library(nnet)
library(tidyr)
raw.data <- readChar('tournamentinfo.txt',file.info('tournamentinfo.txt')$size)
data.raw.2 <-unlist(str_replace_all(str_replace_all(raw.data,'\n',''),'-{89}','\n'))
data.lines <- unlist(strsplit(data.raw.2,'\n'))
data.lines <- data.lines[-1]
data.lines <- str_replace_all(data.lines,'\r','')
data.lines.2 <- str_replace_all(data.lines,'/','|')
data.lines.2 <- str_replace_all(data.lines.2,'->','|')
data.lines.2 <- str_trim(str_replace_all(data.lines.2,' *\\| *','|'))
head(data.lines.2)
## [1] "Pair|Player Name|Total|Round|Round|Round|Round|Round|Round|Round|Num|USCF ID|Rtg (Pre|Post)|Pts|1|2|3|4|5|6|7|"
## [2] "1|GARY HUA|6.0|W 39|W 21|W 18|W 14|W 7|D 12|D 4|ON|15445895|R: 1794|1817|N:2|W|B|W|B|W|B|W|"
## [3] "2|DAKSHESH DARURI|6.0|W 63|W 58|L 4|W 17|W 16|W 20|W 7|MI|14598900|R: 1553|1663|N:2|B|W|B|W|B|W|B|"
## [4] "3|ADITYA BAJAJ|6.0|L 8|W 61|W 25|W 21|W 11|W 13|W 12|MI|14959604|R: 1384|1640|N:2|W|B|W|B|W|B|W|"
## [5] "4|PATRICK H SCHILLING|5.5|W 23|D 28|W 2|W 26|D 5|W 19|D 1|MI|12616049|R: 1716|1744|N:2|W|B|W|B|W|B|B|"
## [6] "5|HANSHI ZUO|5.5|W 45|W 37|D 12|D 13|D 4|W 14|W 17|MI|14601533|R: 1655|1690|N:2|B|W|B|W|B|W|B|"
This function will take a line of match data (opponent/win/color) and insert into dataframe.
The match data is “relative” to the player, so we want to translate to a more static language.
The match should look the same in the dataset whether it came from one player’s record or the others.
match_insert = function(x) {
if(x['winLoss']=='W' | x['winLoss']=='L' | x['winLoss']=='D' ) {
if (x['color']=='W') {
white = as.integer(x['id'])
black = as.integer(x['opponent'])
if (x['winLoss']=='W') {
winner = 'W'
} else if (x['winLoss']=='L') {
winner = 'B'
} else {
winner = 'D'
}
} else {
white = as.integer(x['opponent'])
black = as.integer(x['id'])
if (x['winLoss']=='W') {
winner = 'B'
} else if (x['winLoss']=='L') {
winner = 'W'
} else {
winner = 'D'
}
}
round = as.integer(x['round'])
return(as.data.frame(t(c('round'=round,'white'=white,'black'=black,'winner'=winner)),stringsAsFactors=FALSE))
}
}
This function is similar to the one I used to pull the needed information for the first project. I’m changing it a bit here since I want to use the above function on each record to insert the match data.
pullData = function(v) {
data = unlist(str_split(v, '\\|'))
id = data[1]
name = data[2]
points = data[3]
state = data[11]
prerating = unlist(str_split(unlist(str_split(data[13], ' +'))[2], 'P'))[1]
postrating = data[14]
uscf_id = data[12]
opponent = sapply(data[4:10],function(x) unlist(str_extract_all(x, '\\d+')))
winLoss = sapply(data[4:10],function(x) unlist(str_extract_all(x, '^.')))
color = data[16:22]
temp_matches <- cbind(id,opponent,winLoss,color,round=1:7)
temp_matches_2 <- data.frame(do.call(rbind,apply(temp_matches, 1, match_insert)),stringsAsFactors = FALSE)
matches <<- matches
matches <<- as.data.frame(rbind(matches, temp_matches_2),stringsAsFactors=FALSE)
return(c(id,name,state,points,prerating,postrating,uscf_id))
}
Now we’ll run the pullData function for each player, which will runn the match_insert function for each match.
#Create blank DF for matches
matches = data.frame(round=integer(),white=integer(),black = integer(),winner=character(),
stringsAsFactors = FALSE)
#Apply the function over our list to create a dataframe
players <- data.frame(do.call(rbind,lapply(data.lines.2[-1], pullData)),stringsAsFactors = FALSE)
#Fix column names and remove the first line, as that is the header
colnames(players) <- c('ID','Name','State','Points','Pre-Rating','Post-Rating','USCF_ID')
#Fix our numeric data.
players$ID = type.convert(players$ID)
players['Pre-Rating'] = type.convert(players[,'Pre-Rating'])
players['Post-Rating'] = type.convert(players[,'Post-Rating'])
matches$round = type.convert(matches$round)
matches$white = type.convert(matches$white)
matches$black = type.convert(matches$black)
matches = drop_na(matches,round)
#Set the rows namesto the ID, and remove the weird rownames from the matches data set
rownames(players) = players$ID
rownames(matches) = c()
head(players)
## ID Name State Points Pre-Rating Post-Rating USCF_ID
## 1 1 GARY HUA ON 6.0 1794 1817 15445895
## 2 2 DAKSHESH DARURI MI 6.0 1553 1663 14598900
## 3 3 ADITYA BAJAJ MI 6.0 1384 1640 14959604
## 4 4 PATRICK H SCHILLING MI 5.5 1716 1744 12616049
## 5 5 HANSHI ZUO MI 5.5 1655 1690 14601533
## 6 6 HANSEN SONG OH 5.0 1686 1687 15055204
head(matches)
## round white black winner
## 1 1 1 39 W
## 2 2 21 1 B
## 3 3 1 18 W
## 4 4 14 1 B
## 5 5 1 7 W
## 6 6 12 1 D
The only issue now is that the matches are all in there twice, since both players have a record for each game. As I said above, the match should look exactly the same despite which record it came from. Let’s make sure that this is true.
c(nrow(matches),
nrow(unique(matches)))
## [1] 408 204
Exactly half of the matches are duplicates. Let’s remove them.
matches <- unique(matches)
I would say our datasets are now tidy. Unfortunately for the analysis I wanted to do I need to create a less tidy combination of the two datasets, with some repeated information about the players appended to the match results.
While I do think the way the data is structured is the tidiest form, it does make it tougher to use in some ways. Finding a player’s last game is a bit of a struggle since you need to join on a different column depending on if they were using the black or white pieces.
I’m going to add a column which indicates what the pre-ratings are for the players, and the result of their last round. I’m going to put an ‘F’ for First round for all the first round matches
match_data = inner_join(inner_join(matches,players[c('ID','Pre-Rating')],by=c('white'='ID')), players[c('ID','Pre-Rating')],by=c('black'='ID'))
head(match_data)
## round white black winner Pre-Rating.x Pre-Rating.y
## 1 1 1 39 W 1794 1436
## 2 2 21 1 B 1563 1794
## 3 3 1 18 W 1794 1600
## 4 4 14 1 B 1610 1794
## 5 5 1 7 W 1794 1649
## 6 6 12 1 D 1663 1794
#Fix the auto-named columns
match_data = rename(match_data,'pre_rating_black' = 'Pre-Rating.x','pre_rating_white' = 'Pre-Rating.y')
#This function return the result find the last match, by searching in both the "white" and "black" columns for their ID
last_match = function(id,round) {
last_round = as.numeric(round) - 1
if(last_round == 0) {
return('F')
} else {
winner = unlist(filter(match_data, round == last_round & white == id)[1,'winner'])
if (!is.na(winner)) {
if (winner == 'W') {
result = 'W'
} else if (winner == 'B') {
result = 'L'
} else {
result = winner
}
} else {
winner = unlist(filter(match_data, round == last_round & black == id)[1,'winner'])
if (!is.na(winner)) {
if (winner == 'W') {
result = 'L'
} else if (winner == 'B') {
result = 'W'
} else {
result = winner
}
} else {
result = 'B'
}
}
return(result)
}
}
match_data['last_result_white'] = apply(match_data,1,function(x) last_match(x['white'],x['round']))
match_data['last_result_black'] = apply(match_data,1,function(x) last_match(x['black'],x['round']))
cbind(table(match_data$last_result_black),
table(match_data$last_result_white))
## [,1] [,2]
## B 40 34
## D 23 21
## F 30 30
## L 63 66
## W 48 53
Now we can create a multinomail classification model to predict the winner. Because draws are a possibility, we can’t use a simple logistic regression.
white_winner =
#refactor so the "baseline" is "draw"
match_data$winner <- relevel(as.factor(match_data$winner), ref = "D")
model <- multinom(data=match_data,winner ~pre_rating_black+pre_rating_white+last_result_white +last_result_black)
## # weights: 36 (22 variable)
## initial value 224.116907
## iter 10 value 173.037459
## iter 20 value 170.488626
## iter 30 value 170.420186
## iter 40 value 170.418502
## iter 40 value 170.418500
## iter 40 value 170.418500
## final value 170.418500
## converged
summ <- summary(model)
#This package doesn't automatically calculate p values, so calculate manually
z <- summ$coefficients/summ$standard.errors
p <- (1 - pnorm(abs(z), 0, 1)) * 2
summ$coefficients
## (Intercept) pre_rating_black pre_rating_white last_result_whiteD
## B 2.607432 -0.001050122 -0.0003124556 0.2055672
## W 3.515568 0.002018561 -0.0037886107 -1.5304512
## last_result_whiteF last_result_whiteL last_result_whiteW
## B 0.4023402 -0.1951350 0.4592739
## W 0.1173897 -0.6321496 -0.7627893
## last_result_blackD last_result_blackF last_result_blackL
## B 14.56880 0.4023402 -0.02094317
## W 15.87816 0.1173897 0.23333956
## last_result_blackW
## B 0.08908705
## W 0.76879465
summ2 <- as.data.frame(t(rbind(summ$coefficients,p)))
names(summ2) = c('coef_black','coef_white','p_black','p_white')
summ2
## coef_black coef_white p_black p_white
## (Intercept) 2.6074318654 3.515568321 0.0000000000 0.000000e+00
## pre_rating_black -0.0010501217 0.002018561 0.1737040228 1.401258e-02
## pre_rating_white -0.0003124556 -0.003788611 0.6679252899 2.674569e-06
## last_result_whiteD 0.2055672321 -1.530451205 0.2591604651 0.000000e+00
## last_result_whiteF 0.4023402131 0.117389747 0.0003724361 2.833879e-01
## last_result_whiteL -0.1951349787 -0.632149556 0.4352638909 8.597048e-03
## last_result_whiteW 0.4592738628 -0.762789273 0.0299697529 2.691781e-04
## last_result_blackD 14.5688031499 15.878157933 0.0000000000 0.000000e+00
## last_result_blackF 0.4023402131 0.117389747 0.0003724361 2.833879e-01
## last_result_blackL -0.0209431697 0.233339559 0.9378840143 3.804700e-01
## last_result_blackW 0.0890870512 0.768794652 0.6413053358 4.759342e-05
OK, so there’s a lot of information crammed together here, but I like being able to see the coefficients as well as the standard errors and p values at a glance. It’s easier when we have a binary operator.
The first thing to notice is that the coefficients for “last_result_black=D” are very high. Because they are high for both Black and White means that essentially it’s saying that there is a very small chance of a draw if black’s last game was a draw. I have no idea why this would be the case, and it makes me wonder if I’m missing something structural about chess tournaments that is showing up in the data.
There are more confusing results, like the last_result_white =‘W’ having a negative coefficient. The thing to remember is that one result is always excluded, in this case ‘B’ for Bye. So the white player having won their last match means their less likely to win this one “given the rest of the data” but only compared to them having a Bye in the last match. If they had a draw in their last round their probability of winning is even smaller. The coefficient for “Loss” is very similiar, so it doesn’t seem to matter if they won or lost, but it does matter if it was a draw or wasn’t played at all.
Most likely all of these things are not actual causative relationships. I know nothing about how these tournaments are run, but I imaginine the decisions for who plays who are not random and what goes into those decisions is showing up in the analysis.