Introduction

The idea behind this paper is to explore the possibility of clustering chess-opening data from games played by Lichess users, taken from a Kaggle dataset.

As every chess player knows, chess openings have been refined over the years to give the player as good a position as possible, and one could suspect that there are a few main schemas used at the beginning of every chess game.

It would be nice to be able to perform such clustering in real time, to extract the schema a given opponent tends to use (for AI engines, for example).

Data Transformation

The raw data was downloaded from https://www.kaggle.com/datasnaek/chess.

##       id               rated             created_at         last_move_at      
##  Length:20058       Length:20058       Min.   :1.377e+12   Min.   :1.377e+12  
##  Class :character   Class :character   1st Qu.:1.478e+12   1st Qu.:1.478e+12  
##  Mode  :character   Mode  :character   Median :1.496e+12   Median :1.496e+12  
##                                        Mean   :1.484e+12   Mean   :1.484e+12  
##                                        3rd Qu.:1.503e+12   3rd Qu.:1.503e+12  
##                                        Max.   :1.504e+12   Max.   :1.504e+12  
##      turns        victory_status        winner          increment_code    
##  Min.   :  1.00   Length:20058       Length:20058       Length:20058      
##  1st Qu.: 37.00   Class :character   Class :character   Class :character  
##  Median : 55.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 60.47                                                           
##  3rd Qu.: 79.00                                                           
##  Max.   :349.00                                                           
##    white_id          white_rating    black_id          black_rating 
##  Length:20058       Min.   : 784   Length:20058       Min.   : 789  
##  Class :character   1st Qu.:1398   Class :character   1st Qu.:1391  
##  Mode  :character   Median :1567   Mode  :character   Median :1562  
##                     Mean   :1597                      Mean   :1589  
##                     3rd Qu.:1793                      3rd Qu.:1784  
##                     Max.   :2700                      Max.   :2723  
##     moves           opening_eco        opening_name        opening_ply    
##  Length:20058       Length:20058       Length:20058       Min.   : 1.000  
##  Class :character   Class :character   Class :character   1st Qu.: 3.000  
##  Mode  :character   Mode  :character   Mode  :character   Median : 4.000  
##                                                           Mean   : 4.817  
##                                                           3rd Qu.: 6.000  
##                                                           Max.   :28.000

The dataset contains a lot of technical variables which are not useful in our case; the only information we will use is whether the game was rated, the opening name, and of course the moves themselves. Although the dataset contains opening names, for the purpose of unsupervised learning we will pretend this information is unavailable; at the end we will check what the distributions of the real openings look like within the clusters.

There are plenty of games anyway, which is why we will keep only the rated ones, eliminating games without much motivation, bots, etc.

An obvious condition is that a game must last at least 10 moves per player, as we do not really examine whole games, just openings.

import pandas as pd

df = pd.read_csv("games.csv")

# Keep only the first 20 half-moves of each game (10 per player).
df["moves"] = df["moves"].apply(lambda x: x.split(" ", 20)[:20])

# Spread the moves into columns W1..W10 (White) and B1..B10 (Black);
# the move string alternates between the two players.
move_cols = sum([["W" + str(x), "B" + str(x)] for x in range(1, 11)], [])
df[move_cols] = pd.DataFrame(df.moves.tolist(), index=df.index)

# Rated games only, lasting at least 10 moves per player.
df = df.loc[df["rated"]]
df = df.loc[(~df["B10"].isnull()) & (~df["W10"].isnull())]
df = df[move_cols + ["opening_name"]]

# Stack White's and Black's sequences into one frame with columns M1..M10.
df2 = pd.concat(
    [
        df[[c for c in df.columns if c.startswith("W")] + ["opening_name"]]
          .rename(lambda c: c.replace("W", "M"), axis=1),
        df[[c for c in df.columns if c.startswith("B")] + ["opening_name"]]
          .rename(lambda c: c.replace("B", "M"), axis=1),
    ]
)
df2.to_csv("lichess_final.csv", index=False)
##    M1   M2  M3  M4   M5    M6   M7   M8    M9  M10
## 1  e4   d3 Be3 Be2  Nd2    a4 axb5 bxc6   Nc4   c3
## 2  d4  Nf3 Nc3 Bf4   e3   Be2  O-O  Nb5   Rc1  Ra1
## 3  e4  Nf3  d4  d5   a3   Nc3   b4  Bg5    b5 Bxf6
## 4  d4   e4 Nc3  f3 Nxf3   Bb5  Bd3  O-O   Be2 Qxe2
## 5  e4  Bc4 Nf3  d3 Qxf3    h3   a3  Be3  Qxe3  Qf3
## 6  e4 exd5 Nc3 Be2   d4  Bxa6  Nf3  Be3   Ng5  Qh5
## 7  c4  Nc3  g3 Bg2   e3    a3   b4  Nb5  Nxd6  Ne2
## 8  d4  Nc3 Bf4 Nf3   e3  Bb5+  Bd3 Qxd3   Bg5 Bxe7
## 9  e4  Nf3  d4 Bc4   c3 Bxf7+ Qd5+ Qxc5 Qxe7+ Nxc3
## 10 d4  Nc3 Nf3  e3   h4    g3   a3 bxc3   Rh2  Qd2
##                                                             opening_name
## 1                                  King's Pawn Game: Leonardis Variation
## 2                                 Queen's Pawn Game: Zukertort Variation
## 3                                                       Philidor Defense
## 4                             Blackmar-Diemer Gambit: Pietrowsky Defense
## 5                                  Italian Game: Schilling-Kostic Gambit
## 6                          Scandinavian Defense: Mieses-Kotroc Variation
## 7  English Opening: King's English Variation |  Reversed Closed Sicilian
## 8                                  Queen's Pawn Game: Chigorin Variation
## 9                                               Scotch Game: Haxo Gambit
## 10                                 Queen's Pawn Game: Chigorin Variation

The resulting dataframe contains nearly 30,000 ten-move opening sequences (one per player per game) written in algebraic chess notation. For the purpose of this test we will randomly select 100 of them and perform the clustering on this smaller subsample, in order to achieve better readability. For more details please visit https://en.wikipedia.org/wiki/Algebraic_notation_(chess)

For example:

e4 means moving a pawn to e4.

Be3 means moving a bishop to e3.

Qxe7+ means the queen capturing something on e7 and giving check at the same time.

O-O means kingside castling.
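Before clustering, the 100-opening subsample mentioned above is drawn; a minimal sketch (the seed below is hypothetical, as the original one is not given):

library(dplyr)

set.seed(1)                              # hypothetical seed, for reproducibility
df <- read.csv("lichess_final.csv") %>%  # columns M1..M10 plus opening_name
  sample_n(100)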

Hierarchical Clustering

Methodology

Hierarchical clustering is a type of clustering performed iteratively, from the situation where each observation forms its own cluster up to one big cluster. The first merges join the points closest to each other; then, based on a linkage criterion calculated between groups of points (here ward.D is used), clusters are merged until only one big cluster remains.

Source: https://en.wikipedia.org/wiki/Hierarchical_clustering

Calculation

In order to perform hierarchical clustering, the dissimilarity matrix must first be calculated; for that, the nomclust package is going to be used. The selected distance formula is the Inverse Occurrence Frequency (IOF). Originally designed for text mining, this metric is robust for analyzing highly categorical datasets; it also assigns higher similarity to mismatches on less frequent values, which may be crucial for rare events such as whether a capture occurred.
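The exact nomclust call is not shown here; as a version-independent sketch, the IOF measure (Boriah et al., 2008) can be hand-rolled in a few lines and fed to hclust with the ward.D linkage mentioned above. The 1 - similarity conversion below is one common convention; nomclust's internal conversion may differ.

# Hand-rolled IOF dissimilarity over the ten move columns:
# matching values get similarity 1; a mismatch between values x and y gets
# 1 / (1 + log f(x) * log f(y)), so mismatches on rare moves stay similar.
iof_diss <- function(X) {
  n <- nrow(X)
  d <- matrix(0, n, n)
  for (k in seq_along(X)) {
    v   <- as.character(X[[k]])
    f   <- table(v)                  # occurrence count of each move
    lf  <- log(as.numeric(f[v]))     # log-frequency of each observation's move
    sim <- 1 / (1 + outer(lf, lf))   # mismatch similarity
    sim[outer(v, v, "==")] <- 1      # matches are maximally similar
    d   <- d + (1 - sim)             # one common sim-to-diss conversion
  }
  as.dist(d / length(X))
}

diss <- iof_diss(df[, 1:10])
hc   <- hclust(diss, method = "ward.D")
plot(hc, labels = FALSE)             # the dendrogram discussed below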

The dendrogram suggests that there are 3 main 'types' of opening: hierarchical clustering with the IOF metric is able to split the data points into 3 main groups, with smaller families of points under them. The silhouette is surprisingly good, showing misallocation in only one of the clusters, which is a pretty good score. This suggests that 3 is the number of clusters that should also be used for k-means and other centroid-based algorithms; still, before drawing too many conclusions, let us check what is suggested by k-means, PAM and CLARA.

Kmeans, PAM, CLARA

Kmeans

An algorithm of vector quantization, used as one of the main clustering algorithms. Given the number of clusters to split the data into, the initial centroids are chosen randomly, and then in each iteration: 1. Points are assigned to the nearest centroid. 2. Each centroid's position is corrected so that it lies exactly at the arithmetic mean of all the points assigned to it.
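To make the two steps concrete, here is a minimal hand-rolled sketch of the iteration (illustrative only: the analysis below uses stats::kmeans, and the sketch assumes numeric input and that no cluster ends up empty):

lloyd <- function(X, k, iters = 25) {
  X <- as.matrix(X)
  centers <- X[sample(nrow(X), k), , drop = FALSE]  # random initial centroids
  for (i in seq_len(iters)) {
    # step 1: assign every point to its nearest centroid
    d  <- as.matrix(dist(rbind(centers, X)))[-(1:k), 1:k]
    cl <- max.col(-d)
    # step 2: move each centroid to the mean of its assigned points
    for (j in seq_len(k)) centers[j, ] <- colMeans(X[cl == j, , drop = FALSE])
  }
  list(cluster = cl, centers = centers)
}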

PAM

Similar to k-means, but the centroid (the medoid) must be one of the data points; after each iteration every medoid is checked for whether it is the best possible one, i.e. the one with the lowest sum of distances to the points assigned to it (see the sketch after the CLARA description below).

CLARA

An equivalent of PAM for big datasets: it operates on subsamples rather than on the whole sample.
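A minimal sketch of how both algorithms are invoked from the cluster package: pam() can reuse the IOF dissimilarity computed earlier, while clara() requires a raw numeric matrix, so the dummy coding below is our assumption rather than the author's code.

library(cluster)

# PAM accepts a precomputed dissimilarity, so the IOF matrix is reused.
pm <- pam(diss, k = 3, diss = TRUE)
table(pm$clustering)                 # cluster sizes

# CLARA needs raw numeric data, so the moves are (hypothetically) dummy-coded;
# it then repeats PAM on random subsamples.
cl <- clara(model.matrix(~ . - 1, data = df[, 1:10]), k = 3, samples = 20)
table(cl$clustering)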

Calculation

Depending on the method, the suggested optimal number of clusters is 2.

What we see here are two things. First, the eclust function does not handle custom metrics, so I am forced to use the 'binary' distance on dummy-coded data, which results in potentially undesirable clusters; the binary metric is simply not correct for this type of data. Fortunately, there are other ways to perform the clustering: a combination of MDS and regular Euclidean clustering should work.
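For reference, a sketch of the constrained route just described, assuming the moves are dummy-coded with model.matrix (the `dummies` name is ours):

library(factoextra)

# Dummy-code the ten move columns and cluster with the ill-suited
# 'binary' distance; this is the workaround criticized above.
dummies <- model.matrix(~ . - 1, data = df[, 1:10])
ec <- eclust(dummies, FUNcluster = "hclust", k = 2,
             hc_metric = "binary", hc_method = "ward.D")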

MDS adjustment

One way to use k-means on highly dimensional data is to perform additional preprocessing. Multidimensional scaling (MDS) reduces the dimensionality of the dataset, here to exactly 2: the positions of the points are chosen strictly to best replicate, in 2 dimensions, the distances of the N-dimensional dataset.
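The MDS call itself is not shown in this excerpt; a minimal sketch using classical MDS (stats::cmdscale) on the IOF dissimilarity computed earlier, producing the mds object used below:

# Embed the IOF dissimilarities in 2 dimensions; pairwise Euclidean
# distances between the rows of `mds` approximate the original ones.
mds <- as.data.frame(cmdscale(diss, k = 2))
colnames(mds) <- c("Dim.1", "Dim.2")   # naming convention used by the plots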

library(factoextra); library(gridExtra)
# Silhouette and total within-sum-of-squares diagnostics for the number of clusters
b3b <- fviz_nbclust(mds, kmeans, method = "silhouette") + ggtitle("Kmeans")
b31 <- fviz_nbclust(mds, kmeans, method = "wss") + ggtitle("Kmeans")
grid.arrange(b3b, b31)

Kmeans on reduced dataset

Based on the metrics I will choose 4 clusters: the silhouette suggests 5, but the sum of squares does not change that much, and when deciding between a higher and a lower number of clusters, sticking with the lower one gives more stable results.

library(dplyr); library(ggpubr)

set.seed(123)  # kmeans starts from random centroids, so fix the RNG
clust <- kmeans(mds, 4)$cluster %>%
  as.factor()
mds <- mds %>%
  mutate(groups = clust)
# Plot and color by groups; labels are the true opening names (column 11)
ggscatter(mds, x = "Dim.1", y = "Dim.2",
          label = df[, 11],
          color = "groups",
          palette = "jco",
          size = 1,
          ellipse = TRUE,
          ellipse.type = "convex",
          repel = TRUE)

# Map the true opening names back onto the clustered points
mds$true_gr <- df[, 11]
occurrences <- mds %>% select(groups, true_gr) %>% table() %>% data.frame()
occurrences %>% filter(grepl("Scotch Game", true_gr) & Freq > 0)
##   groups                  true_gr Freq
## 1      4              Scotch Game    2
## 2      4 Scotch Game: Haxo Gambit    1
occurences%>% filter(grepl("Queen's Pawn", true_gr) & Freq>0)
##    groups                                   true_gr Freq
## 1       1                              Queen's Pawn    1
## 2       3                              Queen's Pawn    1
## 3       3     Queen's Pawn Game: Chigorin Variation    2
## 4       1          Queen's Pawn Game: London System    4
## 5       3          Queen's Pawn Game: London System    5
## 6       1           Queen's Pawn Game: Mason Attack    2
## 7       3           Queen's Pawn Game: Mason Attack    4
## 8       3 Queen's Pawn Game: Steinitz Countergambit    1
## 9       2    Queen's Pawn Game: Zukertort Variation    1
## 10      3    Queen's Pawn Game: Zukertort Variation    1
occurences%>% filter(grepl("Queen's Gambit", true_gr) & Freq>0)
##   groups                                           true_gr Freq
## 1      2            Queen's Gambit Accepted: Old Variation    2
## 2      2 Queen's Gambit Declined: Queen's Knight Variation    1
## 3      3    Queen's Gambit Declined: Traditional Variation    1
## 4      2          Queen's Gambit Refused: Marshall Defense    5
occurences%>% filter(grepl("Sicilian", true_gr) & Freq>0)
##   groups                                                               true_gr
## 1      2 English Opening: King's English Variation |  Reversed Closed Sicilian
## 2      4                                                      Sicilian Defense
## 3      4                           Sicilian Defense: Canal Attack |  Main Line
## 4      4                      Sicilian Defense: Modern Variations |  Main Line
## 5      2                               Sicilian Defense: Nimzowitsch Variation
## 6      4                               Sicilian Defense: Smith-Morra Gambit #2
##   Freq
## 1    1
## 2    1
## 3    1
## 4    1
## 5    1
## 6    1
occurences%>% filter(grepl("French", true_gr) & Freq>0)
##    groups                                                     true_gr Freq
## 1       4                                           French Defense #2    1
## 2       4               French Defense: Advance |  Steinitz Variation    1
## 3       4              French Defense: Advance Variation |  Main Line    1
## 4       4         French Defense: Advance Variation |  Paulsen Attack    1
## 5       2                          French Defense: Chigorin Variation    1
## 6       4                          French Defense: Exchange Variation    1
## 7       4                            French Defense: Knight Variation    3
## 8       4                    French Defense: La Bourdonnais Variation    1
## 9       4                          French Defense: Marshall Variation    1
## 10      2                              French Defense: Queen's Knight    2
## 11      4 French Defense: Rubinstein Variation |  Fort Knox Variation    1
## 12      2   French Defense: Winawer Variation |  Advance Variation #2    1
## 13      2                      King's Indian Attack: French Variation    1

The real names of the openings are mapped as labels on the plot; for the comparison with the true openings we used the summary tables above. They show that the most 'specific' openings are the easiest to recognize, and at the end of the day what counts is exactly predictive power. In this case the predictive power, even for an unsupervised task, seems to be satisfactory.

For example:

- Almost all Queen's Gambits are in cluster 2.
- French Defenses are in clusters 2 and 4 (all the classical ones in cluster 4).
- Sicilian Defenses are in cluster 4 (all but one).
- Queen's Pawn games are in clusters 3 and 1 (ideally these would be one cluster).
- Scotch Games are in cluster 4.

Conclusions

Chess data seem similar in character to the text data handled by text mining: they are characterized by long vectors and a largely (or rather completely) nominal structure. We managed to deal with this difficult data using a properly chosen metric and MDS. I treat it as a complete success: the learning was done in a fully unsupervised fashion, and the assignments are satisfactory.