Chess

Introduction

Unless you have been living under a rock, you will know by now that China’s Ding Liren beat Russia’s Ian Nepomniachtchi in the recent FIDE World Chess Championship to become the new world chess champion. This came after the reigning champion, Norway’s Magnus Carlsen, announced that he would be stepping away from the championship this year.

Chess is a game played by many and loved by all. Every chess game can be seen as a war story: the strategies, the positioning, the defence, and the attack all unfold over a humble 8x8 board. Mathematically, there are an estimated 10^111 to 10^123 possible variations, all with a single objective: to checkmate the king.

Ask

This is a fun data analysis project that aims to answer the following questions:

Do higher rated chess players perform better?
What is the most common opening move?
Is it better to play as white or black?

Prepare

A summary of the dataset is shown below:

##     game_id        rated             turns        victory_status    
##  Min.   :    1   Mode :logical   Min.   :  1.00   Length:20058      
##  1st Qu.: 5015   FALSE:3903      1st Qu.: 37.00   Class :character  
##  Median :10030   TRUE :16155     Median : 55.00   Mode  :character  
##  Mean   :10030                   Mean   : 60.47                     
##  3rd Qu.:15044                   3rd Qu.: 79.00                     
##  Max.   :20058                   Max.   :349.00                     
##     winner          time_increment       white_id          white_rating 
##  Length:20058       Length:20058       Length:20058       Min.   : 784  
##  Class :character   Class :character   Class :character   1st Qu.:1398  
##  Mode  :character   Mode  :character   Mode  :character   Median :1567  
##                                                           Mean   :1597  
##                                                           3rd Qu.:1793  
##                                                           Max.   :2700  
##    black_id          black_rating     moves           opening_code      
##  Length:20058       Min.   : 789   Length:20058       Length:20058      
##  Class :character   1st Qu.:1391   Class :character   Class :character  
##  Mode  :character   Median :1562   Mode  :character   Mode  :character  
##                     Mean   :1589                                        
##                     3rd Qu.:1784                                        
##                     Max.   :2723                                        
##  opening_moves    opening_fullname   opening_shortname  opening_response  
##  Min.   : 1.000   Length:20058       Length:20058       Length:20058      
##  1st Qu.: 3.000   Class :character   Class :character   Class :character  
##  Median : 4.000   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 4.817                                                           
##  3rd Qu.: 6.000                                                           
##  Max.   :28.000                                                           
##  opening_variation 
##  Length:20058      
##  Class :character  
##  Mode  :character  
##                    
##                    
##

Remove unnecessary columns:

chess=chess %>% select(-c(opening_code, opening_moves, opening_variation, opening_response))

Find out how many unique players there are:

white_players=as.data.frame(chess$white_id)
colnames(white_players)[1]="players"
black_players=as.data.frame(chess$black_id)
colnames(black_players)[1]="players"
# Stack the White and Black IDs into one table
all_players=rbind(white_players,black_players)
# Unique IDs taken from the White column only
players=as.data.frame(unique(chess$white_id))

dim(players)

## [1] 9438    1

There are 9,438 unique players who appear as White.
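The `players` table above is built from `white_id`, so anyone who only ever played Black is not counted. A minimal sketch of counting across both colours, using a made-up toy data frame (`toy` and its IDs are assumptions for illustration, not rows from the dataset):

```r
# Toy stand-in for the chess data frame; the IDs are invented.
toy <- data.frame(
  white_id = c("anna", "boris", "anna", "chen"),
  black_id = c("boris", "dina", "chen", "anna"),
  stringsAsFactors = FALSE
)

# Pool both ID columns into one vector, then deduplicate.
all_players <- unique(c(toy$white_id, toy$black_id))

n_either_side <- length(all_players)           # players seen as White or Black
n_white_only  <- length(unique(toy$white_id))  # players seen as White

n_either_side
n_white_only
```

On the real data the same idea would be `length(unique(c(chess$white_id, chess$black_id)))`.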

Process

What is the most popular opening move?

openings_plot=chess %>% group_by(opening_fullname) %>% 
    summarize(count=n())%>%arrange(desc(count))

head(openings_plot)

The Van’t Kruijs Opening is used most often (368 times), followed by the Sicilian Defense (358 times).

Do higher-rated players perform better, i.e. score more victories?

chess$victory_status=as.factor(chess$victory_status)

chess$winner=as.factor(chess$winner)


ggplot(chess, aes(x=black_rating, y=white_rating))+geom_point(aes(color=winner))+
    geom_smooth(method="lm")

## `geom_smooth()` using formula = 'y ~ x'

Regardless of whether a player has the white or the black pieces, a higher rating appears to be positively associated with a higher chance of victory.
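One way to put a number on this reading of the scatter plot is the share of decisive games won by the higher-rated side. The sketch below uses an invented toy data frame; the `"Draw"` label and all ratings are assumptions for illustration.

```r
# Toy games: ratings and results are invented for illustration.
toy <- data.frame(
  white_rating = c(1500, 1800, 1400, 1600, 1700),
  black_rating = c(1450, 1600, 1650, 1900, 1700),
  winner       = c("White", "Black", "Black", "Draw", "White"),
  stringsAsFactors = FALSE
)

# Keep decisive games in which the two ratings differ.
decisive <- toy[toy$winner %in% c("White", "Black") &
                  toy$white_rating != toy$black_rating, ]

# Label the higher-rated side of each remaining game.
favourite <- ifelse(decisive$white_rating > decisive$black_rating,
                    "White", "Black")

# Share of those games won by the higher-rated player.
favourite_win_rate <- mean(decisive$winner == favourite)
favourite_win_rate
```

A value clearly above 0.5 on the real data would support the claim; a value near 0.5 would not.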

counts=chess %>% group_by(winner) %>% summarize(count=n())

ggplot(counts, aes(x=winner, y=count, fill=winner))+geom_bar(stat="identity")+labs(title="Number of Games won by black, white, or both(draw)")

White won 894 more games than Black.

victory_type=chess %>% group_by(victory_status) %>% summarize(count=n())


ggplot(victory_type, aes(x=victory_status, y=count, fill=victory_status))+geom_bar(stat="identity")+labs(title="How do games end")+
  scale_fill_manual(values=c("#9933FF",
                             "#33FFFF",
                             "darkblue", "red"))

Most games end by resignation, followed by checkmate, then running out of time, and lastly draws.

chess1=chess %>% filter(winner=="White") %>% 
    mutate(difference_in_ratings=white_rating-black_rating)

mean(chess1$difference_in_ratings)

## [1] 95.30747

In games that White wins, White is rated about 95 points higher than Black on average.

chess2=chess %>% filter(winner=="Black") %>% 
    mutate(difference_in_ratings1=black_rating-white_rating)

mean(chess2$difference_in_ratings1)

## [1] 88.98111

In games that Black wins, Black is rated about 89 points higher than White on average.

Who are the top 5 best players in terms of number of victories?

chicken_circle <- data.frame(ifelse(chess$winner == "White",  chess$white_id, chess$black_id), ifelse(chess$winner == "White", chess$white_rating, chess$black_rating))

colnames(chicken_circle)[1:2]=c("winners", "rating")

winner_circle=chicken_circle %>% group_by(winners) 
winner_circle_above2400=chicken_circle %>% group_by(winners) %>% filter(rating>=2400)

ggplot(winner_circle_above2400, aes(x=winners, y=rating)) + geom_point(color="pink", size=1) + coord_flip()

chicken_circle1=chicken_circle %>% group_by(winners)%>% summarize(count_of=n()) 

chicken_circle1[order(desc(chicken_circle1$count_of)),]

The player with the most wins (72 wins) is taranga.
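The `ifelse()` call above credits every game that White did not win to the Black player, which includes drawn games whenever `winner` records them. A hedged sketch of a tally that drops draws first, using invented toy data (the `"Draw"` label and player IDs are assumptions):

```r
# Toy games; IDs and results are invented for illustration.
toy <- data.frame(
  white_id = c("anna", "boris", "anna", "chen", "boris"),
  black_id = c("boris", "chen", "chen", "anna", "chen"),
  winner   = c("White", "Draw", "White", "Black", "Black"),
  stringsAsFactors = FALSE
)

# Naive tally: the drawn game is credited to its Black player.
naive_winners <- ifelse(toy$winner == "White", toy$white_id, toy$black_id)
naive_wins <- table(naive_winners)

# Safer tally: keep decisive games only, then pick the winner's ID.
decisive <- toy[toy$winner %in% c("White", "Black"), ]
winners  <- ifelse(decisive$winner == "White",
                   decisive$white_id, decisive$black_id)
wins <- sort(table(winners), decreasing = TRUE)
wins
```

In this toy data the naive tally gives `chen` one extra win from the drawn game.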

chicken_circle %>% filter(winners=="taranga") %>%arrange(-rating) %>% head()

Surprisingly, taranga’s highest rating is only 1307.

winner_circle[order(desc(winner_circle$rating)),] %>% head()

The highest-rated winning player is justicebot, with an Elo rating of 2700.

Is there a best opening move (for white)?

white_victories=chess %>% filter(winner=="White")


white_victories_count=white_victories %>% group_by(opening_fullname) %>% summarize(count=n()) %>% arrange(-count) %>% filter(count>=100)

ggplot(white_victories_count, aes(x=opening_fullname, y=count)) +
  geom_segment( aes(x=opening_fullname, xend=opening_fullname, y=0, yend=count), color="skyblue") +
  geom_point( color="blue", size=4, alpha=0.6) +
  theme_light() +
  coord_flip() +
  theme(
    panel.grid.major.y = element_blank(),
    panel.border = element_blank(),
    axis.ticks.y = element_blank()
  )+labs(title="White's Opening Move where White wins")

By raw count, White’s victories most often begin with the Scandinavian Defense, followed by the Sicilian Defense.

What is the “worst” opening move (for white)?

white_losses=chess %>% filter(winner=="Black")


white_losses_count=white_losses %>% group_by(opening_fullname) %>% summarize(count=n()) %>% arrange(-count) %>% filter(count>=100)

ggplot(white_losses_count, aes(x=opening_fullname, y=count)) +
  geom_segment( aes(x=opening_fullname, xend=opening_fullname, y=0, yend=count), color="green") +
  geom_point( color="gold", size=4, alpha=0.6) +
  theme_light() +
  coord_flip() +
  theme(
    panel.grid.major.y = element_blank(),
    panel.border = element_blank(),
    axis.ticks.y = element_blank()
  )+labs(title="White's opening move where Black wins")

By raw count, White’s losses most often begin with the Van’t Kruijs Opening, followed by the Sicilian Defense.
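Raw counts favour openings that are simply played often, which is why the Sicilian Defense ranks second in both the win and the loss charts. A win rate per opening separates popularity from results. The sketch below uses invented toy rows, not figures from the dataset:

```r
# Toy games; the results are invented for illustration.
toy <- data.frame(
  opening_fullname = c("Sicilian Defense", "Sicilian Defense",
                       "Sicilian Defense", "Sicilian Defense",
                       "Scandinavian Defense", "Scandinavian Defense",
                       "Scandinavian Defense"),
  winner = c("White", "Black", "Black", "White",
             "White", "White", "Black"),
  stringsAsFactors = FALSE
)

# Games played and White's win rate for each opening.
games      <- tapply(toy$winner, toy$opening_fullname, length)
white_rate <- tapply(toy$winner == "White", toy$opening_fullname, mean)

# Combine into one table, sorted by White's win rate.
rates <- data.frame(opening = names(games),
                    games = as.vector(games),
                    white_win_rate = as.vector(white_rate))
rates[order(-rates$white_win_rate), ]
```

On the real data it would be sensible to keep a minimum number of games per opening, as the `count>=100` filter above does.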

What is the “best” defense for black?

# Characters 3-6 of the move string hold Black's first reply when White's first move is two characters long (e.g. "e4")
black_wins=white_losses %>% mutate(defence=str_sub(moves, 3, 6))

black_wins_freq=black_wins %>% group_by(defence) %>% summarize(count=n()) %>% arrange(-count) %>% filter(count>=100)

ggplot(black_wins_freq, aes(x=defence, y=count)) +
  geom_segment( aes(x=defence, xend=defence, y=0, yend=count), color="purple") +
  geom_point( color="pink", size=4, alpha=0.6) +
  theme_light() +
  coord_flip() +
  theme(
    panel.grid.major.y = element_blank(),
    panel.border = element_blank(),
    axis.ticks.y = element_blank()
  )+labs(title="Black's 1st Move Frequency where Black wins")

e5, followed by d5, are Black’s most frequent first replies in games that Black wins, which suggests they are solid defences.
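Taking characters 3 to 6 of the move string only isolates Black's reply when White's first move is exactly two characters long. A minimal alternative sketch splits on spaces and takes the second token; it assumes moves are separated by single spaces, and the move strings are invented:

```r
# Invented move strings; the last game ends before Black moves.
moves <- c("e4 e5 Nf3 Nc6", "Nf3 d5 g3", "d4 Nf6 c4 e6", "e4")

# Split each game into its individual moves.
tokens <- strsplit(moves, " ", fixed = TRUE)

# The second token is Black's first move; NA if Black never moved.
black_first <- vapply(
  tokens,
  function(x) if (length(x) >= 2) x[2] else NA_character_,
  character(1)
)
black_first

# Fixed character positions break on a three-character first move.
substr(moves[2], 3, 6)
```

For the second game, the fixed-position approach returns `"3 d5"`, while the token approach returns `"d5"`.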

Analyse

Apriori Principle:

The Apriori Principle is a fundamental concept in data mining: if an itemset is frequent, then all of its subsets must also be frequent.

In simpler terms, if a particular itemset (a set of items) is rare in a given dataset, then any larger itemset that contains it must be at least as rare. This lets the algorithm discard such larger itemsets without counting them.

This principle is often used in market basket analysis, where it helps identify relationships between items that tend to be purchased together. By identifying frequent itemsets, businesses can gain insights into which products or services are often purchased together and use this information to make decisions about pricing, promotions, and product placement.

Overall, the Apriori Principle is a useful tool for identifying patterns and relationships in large datasets and is widely used in data mining and machine learning.
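The monotonicity behind the principle can be checked on a small example: adding an item to an itemset can never raise its support. The sketch below uses four invented "games" as transactions and a hypothetical helper, `support()`, written in base R rather than taken from arules.

```r
# Four invented games, each reduced to the set of moves it contains.
transactions <- list(
  c("e4", "e5", "Nf3"),
  c("e4", "c5", "Nf3"),
  c("d4", "d5", "c4"),
  c("e4", "e5", "Bc4")
)

# Hypothetical helper: share of transactions containing every item.
support <- function(itemset) {
  mean(vapply(transactions,
              function(tx) all(itemset %in% tx),
              logical(1)))
}

s1 <- support("e4")                  # 0.75
s2 <- support(c("e4", "e5"))         # 0.5
s3 <- support(c("e4", "e5", "Nf3"))  # 0.25

c(s1, s2, s3)
```

With a minimum support of 0.5, `{e4, e5, Nf3}` is infrequent, so any itemset that extends it can be skipped without being counted.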

library(tm)
library(text2vec)
library(arules)
library(RColorBrewer)
library(arulesViz)
tokens=space_tokenizer(chess$moves)

it=itoken(tokens, progressbar = FALSE)
vocab=create_vocabulary(it, stopwords = 'english') %>% prune_vocabulary()
vectorizer=vocab_vectorizer(vocab)

dtm=create_dtm(it, vectorizer)
cat('The dimensions of the Document Term Matrix is, ', dim(dtm))

## The dimensions of the Document Term Matrix is,  20058 4447

itemMat=dtm %>% as.matrix() %>% as("itemMatrix")

## Warning in asMethod(object): matrix contains values other than 0 and 1! Setting
## all entries != 0 to 1.

itemFrequencyPlot(itemMat, topN=20, type="relative", col=brewer.pal(8, 'Accent'),
                  main="Support of most common items")

Knight to f3 and pawn to e4 (central pawn move) are the most common moves.

Extract association rules with minimum support 0.005, minimum confidence 0.5.

association.rules=apriori(itemMat, parameter=list(supp=0.005, conf=0.5))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5   0.005      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 100 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[4447 item(s), 20058 transaction(s)] done [0.14s].
## sorting and recoding items ... [1056 item(s)] done [0.02s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4

## Warning in apriori(itemMat, parameter = list(supp = 0.005, conf = 0.5)): Mining
## stopped (time limit reached). Only patterns up to a length of 4 returned!

##  done [6.99s].
## writing ... [11332193 rule(s)] done [0.49s].
## creating S4 object  ... done [1.09s].

summary(association.rules)

## set of 11332193 rules
## 
## rule length distribution (lhs + rhs):sizes
##        1        2        3        4 
##        9    13153   772823 10546208 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   4.000   4.000   3.929   4.000   4.000 
## 
## summary of quality measures:
##     support           confidence        coverage             lift        
##  Min.   :0.005035   Min.   :0.5000   Min.   :0.005035   Min.   : 0.5976  
##  1st Qu.:0.005983   1st Qu.:0.5986   1st Qu.:0.008426   1st Qu.: 1.0713  
##  Median :0.007678   Median :0.7393   Median :0.011118   Median : 1.1721  
##  Mean   :0.010813   Mean   :0.7287   Mean   :0.015233   Mean   : 1.4579  
##  3rd Qu.:0.011367   3rd Qu.:0.8529   3rd Qu.:0.016353   3rd Qu.: 1.6020  
##  Max.   :0.836723   Max.   :1.0000   Max.   :1.000000   Max.   :82.6409  
##      count        
##  Min.   :  101.0  
##  1st Qu.:  120.0  
##  Median :  154.0  
##  Mean   :  216.9  
##  3rd Qu.:  228.0  
##  Max.   :16783.0  
## 
## mining info:
##     data ntransactions support confidence
##  itemMat         20058   0.005        0.5
##                                                                 call
##  apriori(data = itemMat, parameter = list(supp = 0.005, conf = 0.5))

lift10=association.rules %>% subset(lift>=10)


plot(lift10, method='graph')

## Warning: Too many rules supplied. Only plotting the best 100 using 'lift'
## (change control parameter max if needed).

In the context of the Apriori algorithm, support is a measure of the frequency with which an itemset appears in a dataset. It represents the proportion of transactions that contain the itemset.

For example, if you are analyzing a dataset of customer transactions, and you are interested in the itemset {milk, bread}, the support of this itemset is the number of transactions that contain both milk and bread divided by the total number of transactions.

Based on the graph above, the moves that appear most often in the high-lift rules are pawn promotions (g8=Q, c8=Q), along with h4 and h5.

ar=DATAFRAME(association.rules)



ar %>% arrange(-lift) %>% head

In the context of the Apriori algorithm, lift is a measure of the strength of association between two items, and it measures the degree to which the occurrence of one item (e.g., item A) is dependent on the occurrence of another item (e.g., item B).

A higher lift value means that the two items are more strongly associated with each other, i.e., the occurrence of one item is more dependent on the occurrence of the other item. A lift value of 1 indicates that the two items are independent, while a lift value greater than 1 indicates a positive association between the two items, and a lift value less than 1 indicates a negative association.
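As a worked example, confidence and lift for a rule {milk} => {bread} can be computed by hand from supports. The five transactions and the `support()` helper below are invented for illustration and do not use arules.

```r
# Five invented shopping baskets.
transactions <- list(
  c("milk", "bread"),
  c("milk", "bread", "eggs"),
  c("milk"),
  c("bread"),
  c("eggs")
)

# Hypothetical helper: share of baskets containing every item.
support <- function(itemset) {
  mean(vapply(transactions,
              function(tx) all(itemset %in% tx),
              logical(1)))
}

supp_milk  <- support("milk")              # 3 of 5 baskets
supp_bread <- support("bread")             # 3 of 5 baskets
supp_both  <- support(c("milk", "bread"))  # 2 of 5 baskets

# Confidence: how often bread appears in baskets that contain milk.
confidence <- supp_both / supp_milk

# Lift: confidence relative to bread's overall frequency.
lift <- confidence / supp_bread
c(confidence = confidence, lift = lift)
```

Here the lift is 10/9, slightly above 1, so milk and bread appear together a little more often than independence would predict.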

As seen in the results above, several of the strongest associations involve g8=Q, which denotes a pawn promoting to a queen on g8.

Another strong association is between h7, b4, h6, and h8=Q, indicating that pawn moves are closely associated with each other.

Share

The main findings from my results are summarized below:

Overall, players with the white pieces win more often than players with the black pieces.
While a higher rating is generally positively related with the number of victories, the best players in terms of number of victories are not the players with the highest ratings.
The Van’t Kruijs Opening is the most common opening, followed by the Sicilian Defense.
Interestingly, these are also the two openings most often associated with White losing the match.
e5 and d5 are the first replies played most often by Black in games that Black wins.
Using the Apriori algorithm, it seems that pawn moves are highly associated with one another, followed by knight moves.

Act

Some information that could be useful for further analysis:

The actual time taken for each match.
The cheating records of players, if future studies aim to detect cheating by identifying habitual move patterns among cheaters.

2023-05-05