Introduction
Unless you have been living under a rock, you will know by now that China’s Ding Liren beat Russia’s Ian Nepomniachtchi in the recent FIDE World Chess Championship to become the new world chess champion. This came after the reigning champion, Magnus Carlsen of Norway, announced that he would be stepping away from the championship this year.
Chess is a game played by many and loved by all. Every chess game can be seen as a war story: the strategies, the positioning, the defence, and the attack all unfold over a humble 8x8 board. Mathematically, there are an estimated 10^111 to 10^123 possible variations, with one single objective: to checkmate the king.
Ask
This is a fun data analysis project that aims to answer the following questions:
Do higher rated chess players perform better?
What is the most common opening move?
Is it better to play as white or black?
Prepare
A summary of the dataset is shown below:
##     game_id         rated             turns        victory_status
##  Min.   :    1   Mode :logical   Min.   :  1.00   Length:20058
##  1st Qu.: 5015   FALSE:3903      1st Qu.: 37.00   Class :character
##  Median :10030   TRUE :16155     Median : 55.00   Mode  :character
##  Mean   :10030                   Mean   : 60.47
##  3rd Qu.:15044                   3rd Qu.: 79.00
##  Max.   :20058                   Max.   :349.00
##     winner          time_increment       white_id          white_rating
##  Length:20058       Length:20058       Length:20058       Min.   : 784
##  Class :character   Class :character   Class :character   1st Qu.:1398
##  Mode  :character   Mode  :character   Mode  :character   Median :1567
##                                                           Mean   :1597
##                                                           3rd Qu.:1793
##                                                           Max.   :2700
##    black_id          black_rating      moves           opening_code
##  Length:20058       Min.   : 789   Length:20058       Length:20058
##  Class :character   1st Qu.:1391   Class :character   Class :character
##  Mode  :character   Median :1562   Mode  :character   Mode  :character
##                     Mean   :1589
##                     3rd Qu.:1784
##                     Max.   :2723
##  opening_moves    opening_fullname   opening_shortname  opening_response
##  Min.   : 1.000   Length:20058       Length:20058       Length:20058
##  1st Qu.: 3.000   Class :character   Class :character   Class :character
##  Median : 4.000   Mode  :character   Mode  :character   Mode  :character
##  Mean   : 4.817
##  3rd Qu.: 6.000
##  Max.   :28.000
##  opening_variation
##  Length:20058
##  Class :character
##  Mode  :character
Remove unnecessary columns:
# Packages used by the code in this report
library(dplyr)
library(ggplot2)
library(stringr)
chess=chess %>% select(-c(opening_code, opening_moves, opening_variation, opening_response))
Find out how many unique players there are:
white_players=as.data.frame(chess$white_id)
colnames(white_players)[1]="players"
black_players=as.data.frame(chess$black_id)
colnames(black_players)[1]="players"
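# The combined data frame has to be assigned before it can be used
all_players=rbind(white_players,black_players)
# Note: the count below is based on white_id only
players=as.data.frame(unique(chess$white_id))
dim(players)
## [1] 9438 1
There are 9,438 unique players among those who played as White. Counting the unique IDs in all_players would also include players who only ever played as Black.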
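As a minimal sketch of how players could be counted across both colours, the snippet below pools the two ID columns before de-duplicating. The `toy` data frame and its player names are made up for illustration; only the column names `white_id` and `black_id` come from the dataset.

```r
# Hypothetical toy data standing in for the chess data frame
toy <- data.frame(
  white_id = c("anna", "bob", "anna", "dave"),
  black_id = c("bob", "carl", "carl", "anna"),
  stringsAsFactors = FALSE
)

# Pool the IDs from both colours, then count the distinct values
all_ids <- c(toy$white_id, toy$black_id)
n_players <- length(unique(all_ids))

# Using white_id alone misses anyone who never had the white pieces
n_white_only <- length(unique(toy$white_id))

print(n_players)     # 4 (anna, bob, carl, dave)
print(n_white_only)  # 3 (carl is missed)
```

On the real data, the same idea would be `length(unique(c(chess$white_id, chess$black_id)))`.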
Process
What is the most popular opening move?
openings_plot=chess %>% group_by(opening_fullname) %>%
summarize(count=n())%>%arrange(desc(count))
head(openings_plot)
The Van’t Kruijs Opening is used most often (368 times), followed by the Sicilian Defense (358 times).
Do higher-rated players perform better, i.e., score more victories?
chess$victory_status=as.factor(chess$victory_status)
chess$winner=as.factor(chess$winner)
ggplot(chess, aes(x=black_rating, y=white_rating))+geom_point(aes(color=winner))+
geom_smooth(method="lm")
## `geom_smooth()` using formula = 'y ~ x'
Regardless of whether a player has the black or the white pieces, a higher rating seems to be positively correlated with a higher chance of victory.
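One way to put a number on this impression is to measure how often the higher-rated side wins a decisive game. The sketch below is only an illustration on made-up data: the `toy` data frame is hypothetical, and the label `"Draw"` for drawn games is an assumption about how the `winner` column is coded.

```r
# Hypothetical toy games; column names mirror the dataset
toy <- data.frame(
  white_rating = c(1500, 1800, 1200, 1600, 2000),
  black_rating = c(1400, 1900, 1300, 1500, 1700),
  winner = c("White", "Black", "White", "Draw", "White"),
  stringsAsFactors = FALSE
)

# Keep decisive games between players with different ratings
# ("Draw" is an assumed label for drawn games)
decisive <- toy[toy$winner != "Draw" & toy$white_rating != toy$black_rating, ]

# Which side had the higher rating in each game?
favourite <- ifelse(decisive$white_rating > decisive$black_rating, "White", "Black")

# Share of decisive games won by the higher-rated side
favourite_win_rate <- mean(decisive$winner == favourite)
print(favourite_win_rate)  # 0.75 for this toy data
```

A value well above 0.5 on the real data would support the claim more directly than the scatter plot does.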
counts=chess %>% group_by(winner) %>% summarize(count=n())
ggplot(counts, aes(x=winner, y=count, fill=winner))+geom_bar(stat="identity")+labs(title="Number of Games won by black, white, or both(draw)")
White won 894 more games than Black.
victory_type=chess %>% group_by(victory_status) %>% summarize(count=n())
ggplot(victory_type, aes(x=victory_status, y=count, fill=victory_status))+geom_bar(stat="identity")+labs(title="How do games end")+
scale_fill_manual(values=c("#9933FF",
"#33FFFF",
"darkblue", "red"))
Most games end in resignation, followed by checkmate, then running out of time, and lastly draws.
chess1=chess %>% filter(winner=="White") %>%
mutate(difference_in_ratings=white_rating-black_rating)
mean(chess1$difference_in_ratings)
## [1] 95.30747
In games that White wins, White's rating exceeds Black's by about 95 points on average.
chess2=chess %>% filter(winner=="Black") %>%
mutate(difference_in_ratings1=black_rating-white_rating)
mean(chess2$difference_in_ratings1)
## [1] 88.98111
In games that Black wins, Black's rating exceeds White's by about 89 points on average.
Who are the top 5 best players in terms of number of victories?
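# Caveat: any game not won by White (including drawn games) falls into the else branch and is credited to the black player
chicken_circle <- data.frame(ifelse(chess$winner == "White", chess$white_id, chess$black_id), ifelse(chess$winner == "White", chess$white_rating, chess$black_rating))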
colnames(chicken_circle)[1:2]=c("winners", "rating")
winner_circle=chicken_circle %>% group_by(winners)
winner_circle_above2400=chicken_circle %>% group_by(winners) %>% filter(rating>=2400)
ggplot(winner_circle_above2400, aes(x=winners, y=rating)) + geom_point(color="pink", size=1) + coord_flip()
chicken_circle1=chicken_circle %>% group_by(winners)%>% summarize(count_of=n())
chicken_circle1[order(desc(chicken_circle1$count_of)),]
The player with the most wins (72 wins) is taranga.
chicken_circle %>% filter(winners=="taranga") %>%arrange(-rating) %>% head()
Surprisingly, taranga’s highest rating is only 1307.
winner_circle[order(desc(winner_circle$rating)),] %>% head()
The highest-rated player among the winners is justicebot, with an Elo rating of 2700.
Is there a best opening move (for white)?
white_victories=chess %>% filter(winner=="White")
white_victories_count=white_victories %>% group_by(opening_fullname) %>% summarize(count=n()) %>% arrange(-count) %>% filter(count>=100)
ggplot(white_victories_count, aes(x=opening_fullname, y=count)) +
geom_segment( aes(x=opening_fullname, xend=opening_fullname, y=0, yend=count), color="skyblue") +
geom_point( color="blue", size=4, alpha=0.6) +
theme_light() +
coord_flip() +
theme(
panel.grid.major.y = element_blank(),
panel.border = element_blank(),
axis.ticks.y = element_blank()
)+labs(title="White's Opening Move where White wins")
The largest number of White's victories begin with the Scandinavian Defense, followed by the Sicilian Defense.
What is the “worst” opening move (for white)?
white_losses=chess %>% filter(winner=="Black")
white_losses_count=white_losses %>% group_by(opening_fullname) %>% summarize(count=n()) %>% arrange(-count) %>% filter(count>=100)
ggplot(white_losses_count, aes(x=opening_fullname, y=count)) +
geom_segment( aes(x=opening_fullname, xend=opening_fullname, y=0, yend=count), color="green") +
geom_point( color="gold", size=4, alpha=0.6) +
theme_light() +
coord_flip() +
theme(
panel.grid.major.y = element_blank(),
panel.border = element_blank(),
axis.ticks.y = element_blank()
)+labs(title="White's opening move where Black wins")
The largest number of White's losses begin with the Van’t Kruijs Opening, followed by the Sicilian Defense.
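Raw win and loss counts favour openings that are simply played a lot, which may be why the same openings show up in both charts. A possible refinement, sketched below on made-up data, is to compare White's win rate per opening instead. The `toy` data frame is hypothetical; only the column names `opening_fullname` and `winner` come from the dataset.

```r
# Hypothetical toy games; column names mirror the dataset
toy <- data.frame(
  opening_fullname = c("Sicilian Defense", "Sicilian Defense", "Sicilian Defense",
                       "Scandinavian Defense", "Scandinavian Defense",
                       "Van't Kruijs Opening", "Van't Kruijs Opening"),
  winner = c("White", "Black", "White", "White", "White", "Black", "Black"),
  stringsAsFactors = FALSE
)

# Split the "did White win?" flag by opening
white_won <- split(toy$winner == "White", toy$opening_fullname)

# Games played and share won by White, per opening
n_games <- sapply(white_won, length)
white_win_rate <- sapply(white_won, mean)

result <- data.frame(
  opening = names(white_win_rate),
  games = as.vector(n_games),
  white_win_rate = as.vector(white_win_rate)
)

# Best openings for White first
print(result[order(-result$white_win_rate), ])
```

On the real data it would still make sense to keep a minimum-games filter, similar to the `count>=100` filter used here, so that rarely played openings do not dominate the ranking.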
What is the “best” defense for black?
# Black's first move is taken from characters 3-6 of the move string.
# This assumes White's first move is two characters long (e.g. "e4").
black_wins=white_losses %>% mutate(defence=str_trim(str_sub(moves, 3, 6)))
black_wins_freq=black_wins %>% group_by(defence) %>% summarize(count=n()) %>% arrange(-count) %>% filter(count>=100)
ggplot(black_wins_freq, aes(x=defence, y=count)) +
geom_segment( aes(x=defence, xend=defence, y=0, yend=count), color="purple") +
geom_point( color="pink", size=4, alpha=0.6) +
theme_light() +
coord_flip() +
theme(
panel.grid.major.y = element_blank(),
panel.border = element_blank(),
axis.ticks.y = element_blank()
)+labs(title="Black's 1st Move Frequency where Black wins")
Among the games that Black wins, e5 is Black's most frequent first move, followed by d5. By this count, they seem to be the best defences for Black.
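Reading Black's first move from fixed character positions only lines up when White's first move is two characters long. A more robust alternative, sketched here with made-up move strings, is to split each game on spaces and take the second token.

```r
# Hypothetical move strings in the same space-separated format as the moves column
moves <- c("e4 e5 Nf3 Nc6", "Nf3 d5 g3", "d4 d5 c4", "e4")

# Fixed positions go wrong when White opens with a three-character move
print(substr(moves[2], 3, 6))  # "3 d5"

# Token-based extraction: the second token is Black's first move,
# or NA when the game ended before Black moved
tokens <- strsplit(moves, " ", fixed = TRUE)
black_first <- sapply(tokens, function(x) if (length(x) >= 2) x[2] else NA_character_)

print(black_first)         # "e5" "d5" "d5" NA
print(table(black_first))  # d5: 2, e5: 1 (NA is dropped by table)
```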
Analyse
Apriori Principle:
The Apriori Principle is a fundamental concept in data mining: if an itemset (a set of items) is frequent, then all of its subsets must also be frequent.
In simpler terms, a larger itemset can never appear more often than the smaller itemsets it contains. Conversely, if an itemset is infrequent, every larger itemset that includes it must also be infrequent, which lets the algorithm skip those supersets without counting them.
This principle is often used in market basket analysis, where it helps identify relationships between items that tend to be purchased together. By identifying frequent itemsets, businesses can gain insights into which products or services are often purchased together and use this information to make decisions about pricing, promotions, and product placement.
Overall, the Apriori Principle is a useful tool for identifying patterns and relationships in large datasets and is widely used in data mining and machine learning.
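To make the pruning idea concrete, here is a small base-R sketch on made-up shopping baskets. It is not the arules implementation, only an illustration of the principle; the transactions, the `support` helper, and the 0.3 threshold are all assumptions chosen for the example.

```r
# Hypothetical toy transactions (each element is one basket)
transactions <- list(
  c("milk", "bread", "eggs"),
  c("milk", "bread"),
  c("milk", "eggs"),
  c("bread"),
  c("milk", "bread", "jam")
)
min_support <- 0.3

# Support = share of transactions that contain every item of the itemset
support <- function(itemset) {
  mean(sapply(transactions, function(tr) all(itemset %in% tr)))
}

# Level 1: keep only the frequent single items
items <- sort(unique(unlist(transactions)))
item_support <- sapply(items, support)
frequent_items <- items[item_support >= min_support]

# Level 2: candidate pairs are built from frequent items only, so any pair
# containing an infrequent item (here "jam") is never counted at all
candidates <- combn(frequent_items, 2, simplify = FALSE)
pair_support <- sapply(candidates, support)
frequent_pairs <- candidates[pair_support >= min_support]

print(frequent_items)  # "bread" "eggs" "milk"
print(frequent_pairs)  # {bread, milk} and {eggs, milk}
```

Each frequent pair has a support no larger than that of either of its items, which is exactly the property the principle relies on.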
library(tm)
library(text2vec)
library(arules)
library(RColorBrewer)
library(arulesViz)
tokens=space_tokenizer(chess$moves)
it=itoken(tokens, progressbar = FALSE)
vocab=create_vocabulary(it, stopwords = 'english') %>% prune_vocabulary()
vectorizer=vocab_vectorizer(vocab)
dtm=create_dtm(it, vectorizer)
cat('The dimensions of the Document Term Matrix are:', dim(dtm))
## The dimensions of the Document Term Matrix are: 20058 4447
itemMat=dtm %>% as.matrix() %>% as("itemMatrix")
## Warning in asMethod(object): matrix contains values other than 0 and 1! Setting
## all entries != 0 to 1.
itemFrequencyPlot(itemMat, topN=20, type="relative", col=brewer.pal(8, 'Accent'),
main="Support of most common items")
Knight to f3 and pawn to e4 (central pawn move) are the most common moves.
Extract association rules with minimum support 0.005, minimum confidence 0.5.
association.rules=apriori(itemMat, parameter=list(supp=0.005, conf=0.5))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.005 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 100
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[4447 item(s), 20058 transaction(s)] done [0.14s].
## sorting and recoding items ... [1056 item(s)] done [0.02s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4
## Warning in apriori(itemMat, parameter = list(supp = 0.005, conf = 0.5)): Mining
## stopped (time limit reached). Only patterns up to a length of 4 returned!
## done [6.99s].
## writing ... [11332193 rule(s)] done [0.49s].
## creating S4 object ... done [1.09s].
summary(association.rules)
## set of 11332193 rules
##
## rule length distribution (lhs + rhs):sizes
## 1 2 3 4
## 9 13153 772823 10546208
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 4.000 4.000 3.929 4.000 4.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.005035 Min. :0.5000 Min. :0.005035 Min. : 0.5976
## 1st Qu.:0.005983 1st Qu.:0.5986 1st Qu.:0.008426 1st Qu.: 1.0713
## Median :0.007678 Median :0.7393 Median :0.011118 Median : 1.1721
## Mean :0.010813 Mean :0.7287 Mean :0.015233 Mean : 1.4579
## 3rd Qu.:0.011367 3rd Qu.:0.8529 3rd Qu.:0.016353 3rd Qu.: 1.6020
## Max. :0.836723 Max. :1.0000 Max. :1.000000 Max. :82.6409
## count
## Min. : 101.0
## 1st Qu.: 120.0
## Median : 154.0
## Mean : 216.9
## 3rd Qu.: 228.0
## Max. :16783.0
##
## mining info:
## data ntransactions support confidence
## itemMat 20058 0.005 0.5
## call
## apriori(data = itemMat, parameter = list(supp = 0.005, conf = 0.5))
lift10=association.rules %>% subset(lift>=10)
plot(lift10, method='graph')
## Warning: Too many rules supplied. Only plotting the best 100 using 'lift'
## (change control parameter max if needed).
In the context of the Apriori algorithm, support is a measure of the frequency with which an itemset appears in a dataset. It represents the proportion of transactions that contain the itemset.
For example, if you are analyzing a dataset of customer transactions, and you are interested in the itemset {milk, bread}, the support of this itemset is the number of transactions that contain both milk and bread divided by the total number of transactions.
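The milk and bread example can be worked through by hand. The sketch below uses a made-up binary transaction matrix (rows are transactions, columns are items), which is roughly the shape of data the item matrix in this analysis holds.

```r
# Hypothetical binary transaction matrix: rows = transactions, columns = items
m <- matrix(
  c(1, 1, 0,
    1, 1, 1,
    1, 0, 0,
    0, 1, 1),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("milk", "bread", "eggs"))
) == 1

# A transaction contains {milk, bread} when both columns are TRUE
has_both <- m[, "milk"] & m[, "bread"]

# Support = transactions containing the itemset / all transactions
support_milk_bread <- sum(has_both) / nrow(m)
print(support_milk_bread)  # 0.5 (2 of the 4 transactions)
```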
Based on the results above, some of the moves that appear most often in the high-lift rules are pawn promotions (g8=Q, c8=Q), h4, and h5.
ar=DATAFRAME(association.rules)
ar %>% arrange(-lift) %>% head
In the context of the Apriori algorithm, lift is a measure of the strength of association between two items, and it measures the degree to which the occurrence of one item (e.g., item A) is dependent on the occurrence of another item (e.g., item B).
A higher lift value means that the two items are more strongly associated with each other, i.e., the occurrence of one item is more dependent on the occurrence of the other item. A lift value of 1 indicates that the two items are independent, while a lift value greater than 1 indicates a positive association between the two items, and a lift value less than 1 indicates a negative association.
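As a worked illustration of these definitions, the sketch below computes support, confidence, and lift for the rule {milk} => {bread} on made-up baskets. It uses the standard definition lift = confidence / support of the right-hand side; the data frame is hypothetical.

```r
# Hypothetical baskets: TRUE means the item was bought in that transaction
baskets <- data.frame(
  milk  = c(TRUE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE),
  bread = c(TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE)
)

supp_milk  <- mean(baskets$milk)                  # 4/8 = 0.5
supp_bread <- mean(baskets$bread)                 # 4/8 = 0.5
supp_both  <- mean(baskets$milk & baskets$bread)  # 3/8 = 0.375

# Confidence of {milk} => {bread}: how often bread appears given milk
confidence <- supp_both / supp_milk               # 0.75

# Lift: confidence relative to how common bread is overall
lift <- confidence / supp_bread                   # 1.5

print(c(confidence = confidence, lift = lift))
```

Here the lift of 1.5 means bread is 1.5 times as likely in baskets with milk as in baskets overall. If the two items were independent, the lift would be 1.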
As seen in the results above, some of the more common associations involve g8=Q, which refers to a pawn’s promotion to a queen on g8.
Another strong association is h7, b4, h6, and h8=Q, indicating that pawn movements are closely associated with each other.
Share
The main findings from my results are summarized below:
Overall, players with the white pieces win more often than players with the black pieces.
While a higher rating is generally positively related to the number of victories, the players with the most victories are not the players with the highest ratings.
The Van’t Kruijs Opening is the most common opening, followed by the Sicilian Defense.
Interestingly, these are also the two openings most often seen in games that White loses.
e5 and d5 are Black's most frequent first moves in the games that Black wins.
Using the Apriori algorithm, pawn moves appear to be highly associated with one another, followed by knight moves.
Act
Some additional information that could be useful for further analysis:
The actual time taken for each match.
The cheating records of players, in case future studies aim to detect cheating by checking whether cheaters follow a habitual pattern of moves.