The idea behind this paper is to explore the possibility of clustering chess openings played by Lichess users, using a dataset from Kaggle.
As every chess player knows, chess openings are globally refined to give the player as good a position as possible, and one could suspect that there are a few main schemes used at the beginning of every chess game.
What would be nice is the ability to perform real-time clustering to extract the scheme a given opponent tends to use (for AI engines, for example).
The raw data was downloaded from https://www.kaggle.com/datasnaek/chess.
## id rated created_at last_move_at
## Length:20058 Length:20058 Min. :1.377e+12 Min. :1.377e+12
## Class :character Class :character 1st Qu.:1.478e+12 1st Qu.:1.478e+12
## Mode :character Mode :character Median :1.496e+12 Median :1.496e+12
## Mean :1.484e+12 Mean :1.484e+12
## 3rd Qu.:1.503e+12 3rd Qu.:1.503e+12
## Max. :1.504e+12 Max. :1.504e+12
## turns victory_status winner increment_code
## Min. : 1.00 Length:20058 Length:20058 Length:20058
## 1st Qu.: 37.00 Class :character Class :character Class :character
## Median : 55.00 Mode :character Mode :character Mode :character
## Mean : 60.47
## 3rd Qu.: 79.00
## Max. :349.00
## white_id white_rating black_id black_rating
## Length:20058 Min. : 784 Length:20058 Min. : 789
## Class :character 1st Qu.:1398 Class :character 1st Qu.:1391
## Mode :character Median :1567 Mode :character Median :1562
## Mean :1597 Mean :1589
## 3rd Qu.:1793 3rd Qu.:1784
## Max. :2700 Max. :2723
## moves opening_eco opening_name opening_ply
## Length:20058 Length:20058 Length:20058 Min. : 1.000
## Class :character Class :character Class :character 1st Qu.: 3.000
## Mode :character Mode :character Mode :character Median : 4.000
## Mean : 4.817
## 3rd Qu.: 6.000
## Max. :28.000
The dataset contains a lot of technical variables that are not useful in our case; the only information we will use is whether the game was rated, the opening name and, of course, the moves themselves. While the dataset contains opening names, for the purpose of unsupervised learning we will pretend this information is unavailable; at the end we will check how the distributions of the real openings look within the clusters.
There are plenty of games anyway, so we will keep only the rated ones to eliminate games without much motivation, bots, etc.
The obvious condition is for a game to last at least 10 moves per player, as we do not really examine whole games, just openings.
import pandas as pd

df = pd.read_csv("games.csv")
# keep only the first 20 half-moves (10 per player) of every game
df["moves"] = df["moves"].apply(lambda x: x.split(' ', 20)[:20])
# spread them into alternating columns W1, B1, ..., W10, B10
df[sum([["W"+str(x), "B"+str(x)] for x in range(1, 11)], [])] = pd.DataFrame(df.moves.tolist(), index=df.index)
# rated games only
df = df.loc[df["rated"]]
# drop games shorter than 10 full moves
df = df.loc[(~df["B10"].isnull()) & (~df["W10"].isnull())]
df = df[sum([["W"+str(x), "B"+str(x)] for x in range(1, 11)], [])+['opening_name']]
# stack the white and black move sequences as separate observations M1..M10
df2 = pd.concat(
    [
        df[[x for x in df.columns if x.startswith("W")]+['opening_name']].rename(lambda x: x.replace("W", "M"), axis=1),
        df[[x for x in df.columns if x.startswith("B")]+['opening_name']].rename(lambda x: x.replace("B", "M"), axis=1)
    ]
)
df2.to_csv("lichess_final.csv", index=False)
## M1 M2 M3 M4 M5 M6 M7 M8 M9 M10
## 1 e4 d3 Be3 Be2 Nd2 a4 axb5 bxc6 Nc4 c3
## 2 d4 Nf3 Nc3 Bf4 e3 Be2 O-O Nb5 Rc1 Ra1
## 3 e4 Nf3 d4 d5 a3 Nc3 b4 Bg5 b5 Bxf6
## 4 d4 e4 Nc3 f3 Nxf3 Bb5 Bd3 O-O Be2 Qxe2
## 5 e4 Bc4 Nf3 d3 Qxf3 h3 a3 Be3 Qxe3 Qf3
## 6 e4 exd5 Nc3 Be2 d4 Bxa6 Nf3 Be3 Ng5 Qh5
## 7 c4 Nc3 g3 Bg2 e3 a3 b4 Nb5 Nxd6 Ne2
## 8 d4 Nc3 Bf4 Nf3 e3 Bb5+ Bd3 Qxd3 Bg5 Bxe7
## 9 e4 Nf3 d4 Bc4 c3 Bxf7+ Qd5+ Qxc5 Qxe7+ Nxc3
## 10 d4 Nc3 Nf3 e3 h4 g3 a3 bxc3 Rh2 Qd2
## opening_name
## 1 King's Pawn Game: Leonardis Variation
## 2 Queen's Pawn Game: Zukertort Variation
## 3 Philidor Defense
## 4 Blackmar-Diemer Gambit: Pietrowsky Defense
## 5 Italian Game: Schilling-Kostic Gambit
## 6 Scandinavian Defense: Mieses-Kotroc Variation
## 7 English Opening: King's English Variation | Reversed Closed Sicilian
## 8 Queen's Pawn Game: Chigorin Variation
## 9 Scotch Game: Haxo Gambit
## 10 Queen's Pawn Game: Chigorin Variation
The resulting dataframe contains nearly 30,000 chess opening sequences written in algebraic chess notation. For this test we will randomly select 100 of them and perform the clustering on that smaller subsample in order to achieve better readability. For more details please visit https://en.wikipedia.org/wiki/Algebraic_notation_(chess)
For example:
e4 means moving a pawn to e4
Be3 means moving a bishop to e3
Qxe7+ means the queen captures something on e7, giving check at the same time
O-O means kingside castling.
Hierarchical clustering is a type of clustering performed iteratively, starting from the situation where each observation forms its own cluster and ending with one big cluster. The first merge joins the two points closest to each other; then, based on a linkage criterion calculated between groups of points (here ward.D is used), we keep merging clusters until only one big cluster remains.
Source: https://en.wikipedia.org/wiki/Hierarchical_clustering
In order to perform hierarchical clustering, the dissimilarity matrix must be calculated first; for that, the nomclust package is going to be used. The formula selected for the distance is the Inverse Occurrence Frequency (IOF). Originally designed for text mining, this metric is robust when analyzing highly categorical datasets; it also assigns higher similarity to rare values, which may be crucial here, for example for moves that do or do not capture a piece.
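For illustration, here is a minimal sketch of this step on the subsample, implementing the IOF similarity of Boriah et al. by hand instead of calling nomclust (the function name iof_dist and the per-attribute conversion d = 1 - s are choices of this sketch, not nomclust's API): matching moves score 1, while mismatches are down-weighted by the log-frequencies of the two values, so rare moves end up more similar.

moves <- read.csv("lichess_final.csv", stringsAsFactors = FALSE)
set.seed(42)
moves <- moves[sample(nrow(moves), 100), 1:10]  # the random subsample of 100 openings

iof_dist <- function(X) {
  p <- ncol(X)
  D <- matrix(0, nrow(X), nrow(X))
  for (k in seq_len(p)) {
    f  <- table(X[[k]])                     # frequency of every distinct move
    lf <- log(as.numeric(f[X[[k]]]))        # log-frequency of each observed value
    s  <- 1 / (1 + outer(lf, lf))           # mismatch similarity: frequent moves count less
    s[outer(X[[k]], X[[k]], "==")] <- 1     # identical moves are maximally similar
    D  <- D + (1 - s)                       # accumulate per-attribute dissimilarity
  }
  as.dist(D / p)                            # average over the 10 half-moves
}

hc <- hclust(iof_dist(moves), method = "ward.D")
plot(hc, labels = FALSE)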
Now we can be fairly sure that there are 3 main 'types' of opening, but before drawing too many conclusions let's check what is suggested when using K-means, PAM and CLARA.
The silhouette is surprisingly good; it shows some misallocation in only one of the clusters. That's a pretty good score. Hierarchical clustering with the IOF metric is able to split the data points into 3 main groups, with smaller families of points under them. This suggests that 3 clusters may also be the number to use for K-means and the other centroid-based algorithms.
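This check can be reproduced with the cluster package, reusing hc and iof_dist() from the sketch above and cutting the dendrogram at 3 clusters:

library(cluster)
d3  <- iof_dist(moves)
sil <- silhouette(cutree(hc, k = 3), d3)   # silhouette of the 3-cluster cut
plot(sil)
mean(sil[, "sil_width"])                   # average silhouette width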
K-means is an algorithm of vector quantization, used as one of the main clustering algorithms. Given the number of clusters to split the data into, the initial points are chosen randomly and then, in each iteration: 1. Points are assigned to the nearest centroid. 2. Each centroid's position is corrected so that it lies exactly at the arithmetic mean of all the points assigned to it. A toy sketch of these two steps is shown below.
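The sketch below illustrates Lloyd's two-step iteration on numeric data; the analysis itself relies on the built-in kmeans(), and empty clusters are ignored here for brevity.

lloyd <- function(X, k, iters = 25) {
  X <- as.matrix(X)
  centers <- X[sample(nrow(X), k), , drop = FALSE]  # 0. random initial centroids
  for (i in seq_len(iters)) {
    # 1. assign every point to its nearest centroid (squared Euclidean distance)
    d  <- outer(rowSums(X^2), rowSums(centers^2), "+") - 2 * X %*% t(centers)
    cl <- max.col(-d)
    # 2. move each centroid to the arithmetic mean of the points assigned to it
    for (j in seq_len(k))
      centers[j, ] <- colMeans(X[cl == j, , drop = FALSE])
  }
  list(cluster = cl, centers = centers)
}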
PAM (Partitioning Around Medoids) is similar to K-means, but the centroid (medoid) must be one of the data points; after each iteration the current medoid is evaluated to check whether it is the best possible one (the sum of distances from each point to it should be the lowest).
CLARA is the equivalent of PAM for big datasets: it operates on subsamples rather than on the whole sample, as sketched below.
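Both are available in the cluster package. A sketch under the assumptions made so far (d3 holds the IOF dissimilarities; mds is the 2-D embedding derived further below, since clara() needs raw coordinates rather than a distance matrix):

library(cluster)
pam_fit   <- pam(d3, k = 3, diss = TRUE)   # medoids are actual observations
clara_fit <- clara(as.matrix(mds), k = 3,  # PAM refitted on repeated subsamples
                   samples = 50)
table(pam = pam_fit$clustering, clara = clara_fit$clustering)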
Depending on the method, the suggested optimal number of clusters is 2.
What we see here are two things. First, the eclust function does not handle custom metrics, so I am forced to use the 'binary' distance on dummy-encoded data, which results in potentially undesirable clusters; the binary metric is simply not correct for this type of data. Fortunately there are other ways to perform the clustering: a combination of MDS and regular Euclidean clustering should work.
In order to use k-means on highly dimensional data, one option is to perform additional preprocessing. Multidimensional scaling (MDS) is a way to reduce the dataset to exactly 2 dimensions: the positions of the points are chosen to best replicate, in 2 dimensions, the distances of the N-dimensional dataset.
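A minimal sketch, reusing d3 from above: classical (metric) MDS via cmdscale() places every opening in 2 dimensions, with the columns named to match the plotting code below.

mds <- as.data.frame(cmdscale(d3, k = 2))  # classical MDS onto 2 dimensions
colnames(mds) <- c("Dim.1", "Dim.2")
head(mds)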
library(factoextra)  # fviz_nbclust()
library(gridExtra)   # grid.arrange()
b3b <- fviz_nbclust(mds, kmeans, method = "silhouette") + ggtitle("Kmeans")
b31 <- fviz_nbclust(mds, kmeans, method = "wss") + ggtitle("Kmeans")
grid.arrange(b3b, b31)
Based on these metrics I will choose 4 clusters: while the silhouette suggests 5, the sum of squares does not change that much, and when deciding between a higher and a lower number of clusters, sticking with the lower one gives more stable results.
library(dplyr)
library(ggpubr)  # ggscatter()
# k-means on the 2-D MDS coordinates, keeping the cluster id as a factor
clust <- kmeans(mds, 4)$cluster %>%
  as.factor()
mds <- mds %>%
  mutate(groups = clust)
# Plot and color by cluster; column 11 of df holds the true opening names
ggscatter(mds, x = "Dim.1", y = "Dim.2",
          label = df[, 11],
          color = "groups",
          palette = "jco",
          size = 1,
          ellipse = TRUE,
          ellipse.type = "convex",
          repel = TRUE)
# Cross-tabulate the k-means clusters against the true opening names
mds$true_gr <- df[, 11]
occurences <- mds %>% select(groups, true_gr) %>% table() %>% data.frame()
occurences %>% filter(grepl("Scotch Game", true_gr) & Freq > 0)
## groups true_gr Freq
## 1 4 Scotch Game 2
## 2 4 Scotch Game: Haxo Gambit 1
occurences %>% filter(grepl("Queen's Pawn", true_gr) & Freq > 0)
## groups true_gr Freq
## 1 1 Queen's Pawn 1
## 2 3 Queen's Pawn 1
## 3 3 Queen's Pawn Game: Chigorin Variation 2
## 4 1 Queen's Pawn Game: London System 4
## 5 3 Queen's Pawn Game: London System 5
## 6 1 Queen's Pawn Game: Mason Attack 2
## 7 3 Queen's Pawn Game: Mason Attack 4
## 8 3 Queen's Pawn Game: Steinitz Countergambit 1
## 9 2 Queen's Pawn Game: Zukertort Variation 1
## 10 3 Queen's Pawn Game: Zukertort Variation 1
occurences %>% filter(grepl("Queen's Gambit", true_gr) & Freq > 0)
## groups true_gr Freq
## 1 2 Queen's Gambit Accepted: Old Variation 2
## 2 2 Queen's Gambit Declined: Queen's Knight Variation 1
## 3 3 Queen's Gambit Declined: Traditional Variation 1
## 4 2 Queen's Gambit Refused: Marshall Defense 5
occurences %>% filter(grepl("Sicilian", true_gr) & Freq > 0)
## groups true_gr Freq
## 1 2 English Opening: King's English Variation | Reversed Closed Sicilian 1
## 2 4 Sicilian Defense 1
## 3 4 Sicilian Defense: Canal Attack | Main Line 1
## 4 4 Sicilian Defense: Modern Variations | Main Line 1
## 5 2 Sicilian Defense: Nimzowitsch Variation 1
## 6 4 Sicilian Defense: Smith-Morra Gambit #2 1
occurences %>% filter(grepl("French", true_gr) & Freq > 0)
## groups true_gr Freq
## 1 4 French Defense #2 1
## 2 4 French Defense: Advance | Steinitz Variation 1
## 3 4 French Defense: Advance Variation | Main Line 1
## 4 4 French Defense: Advance Variation | Paulsen Attack 1
## 5 2 French Defense: Chigorin Variation 1
## 6 4 French Defense: Exchange Variation 1
## 7 4 French Defense: Knight Variation 3
## 8 4 French Defense: La Bourdonnais Variation 1
## 9 4 French Defense: Marshall Variation 1
## 10 2 French Defense: Queen's Knight 2
## 11 4 French Defense: Rubinstein Variation | Fort Knox Variation 1
## 12 2 French Defense: Winawer Variation | Advance Variation #2 1
## 13 2 King's Indian Attack: French Variation 1
The real names of the openings are mapped as labels; for the third group we used the summary table. It shows that the most 'specific' openings are the easiest to recognize; at the end of the day what counts is predictive power, and in this case the predictive power, even for an unsupervised task, seems to be satisfactory.
For example:
- Almost all Queen's Gambit games are in cluster 2.
- French Defenses are in clusters 2 and 4 (all the classical ones in cluster 4).
- Sicilian Defenses are in cluster 4 (all but one).
- Queen's Pawn games are in clusters 3 and 1 (ideally that would have been a single cluster).
- Scotch Games are in cluster 4.
Chess data seems similar in its characteristics to the text data obtained by text mining: it is characterized by long vectors and a relatively (or rather completely) nominal structure. We managed to deal with this difficult data by choosing the metric properly and using MDS. I treat it as a complete success: the learning was done in a fully unsupervised way and the assignments are satisfactory.