The purpose of this analysis is to determine whether playing a game of chess as Black or as White makes a difference to the outcome, since the color is an option you can select on chess.com.
## Rows: 20058 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (11): victory_status, winner, time_increment, white_id, black_id, moves,...
## dbl (5): game_id, turns, white_rating, black_rating, opening_moves
## lgl (1): rated
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 6 × 17
## game_id rated turns victory_status winner time_increment white_id white_rating
## <dbl> <lgl> <dbl> <chr> <chr> <chr> <chr> <dbl>
## 1 1 FALSE 13 Out of Time White 15+2 bourgris 1500
## 2 2 TRUE 16 Resign Black 5+10 a-00 1322
## 3 3 TRUE 61 Mate White 5+10 ischia 1496
## 4 4 TRUE 61 Mate White 20+0 daniamu… 1439
## 5 5 TRUE 95 Mate White 30+3 nik2211… 1523
## 6 6 FALSE 5 Draw Draw 10+0 trelynn… 1250
## # ℹ 9 more variables: black_id <chr>, black_rating <dbl>, moves <chr>,
## # opening_code <chr>, opening_moves <dbl>, opening_fullname <chr>,
## # opening_shortname <chr>, opening_response <chr>, opening_variation <chr>
## game_id rated turns victory_status
## Min. : 1 Mode :logical Min. : 1.00 Length:20058
## 1st Qu.: 5015 FALSE:3903 1st Qu.: 37.00 Class :character
## Median :10030 TRUE :16155 Median : 55.00 Mode :character
## Mean :10030 Mean : 60.47
## 3rd Qu.:15044 3rd Qu.: 79.00
## Max. :20058 Max. :349.00
## winner time_increment white_id white_rating
## Length:20058 Length:20058 Length:20058 Min. : 784
## Class :character Class :character Class :character 1st Qu.:1398
## Mode :character Mode :character Mode :character Median :1567
## Mean :1597
## 3rd Qu.:1793
## Max. :2700
## black_id black_rating moves opening_code
## Length:20058 Min. : 789 Length:20058 Length:20058
## Class :character 1st Qu.:1391 Class :character Class :character
## Mode :character Median :1562 Mode :character Mode :character
## Mean :1589
## 3rd Qu.:1784
## Max. :2723
## opening_moves opening_fullname opening_shortname opening_response
## Min. : 1.000 Length:20058 Length:20058 Length:20058
## 1st Qu.: 3.000 Class :character Class :character Class :character
## Median : 4.000 Mode :character Mode :character Mode :character
## Mean : 4.817
## 3rd Qu.: 6.000
## Max. :28.000
## opening_variation
## Length:20058
## Class :character
## Mode :character
##
##
##
This provides visibility into the data, along with a summary of each column in the dataset.
## [1] 9438 1
There are a total of 9438 unique chess players in this dataset.
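The `[1] 9438 1` output above has a single column, so it may have been computed from only one of the ID columns. As a minimal sketch (the toy IDs below are made up, and this is not necessarily how the count above was produced), players can be counted once across both sides in base R like this:

```r
# Toy stand-in for the games table; these player IDs are hypothetical.
games <- data.frame(
  white_id = c("alice", "bob", "carol", "alice"),
  black_id = c("bob", "dave", "alice", "erin"),
  stringsAsFactors = FALSE
)

# Pool both ID columns so a player who appears as White or Black
# (or both) is only counted once.
players <- unique(c(games$white_id, games$black_id))
n_players <- length(players)
print(n_players)
```

Counting only `white_id` would miss anyone who played exclusively as Black, which is why the sketch pools both columns.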
## # A tibble: 6 × 2
## opening_fullname count
## <chr> <int>
## 1 Van't Kruijs Opening 368
## 2 Sicilian Defense 358
## 3 Sicilian Defense: Bowdler Attack 296
## 4 French Defense: Knight Variation 271
## 5 Scotch Game 271
## 6 Scandinavian Defense: Mieses-Kotroc Variation 259
Van’t Kruijs Opening (1. e3) is the most common opening in this dataset, followed by the Sicilian Defense.
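A frequency table like the one above can be sketched in base R as follows. The six opening names are a toy sample for illustration, and the column names simply mirror the output shown; the original code may well have used different functions.

```r
# Toy sample of opening names (illustrative only, not the real data).
openings <- c(
  "Sicilian Defense", "Van't Kruijs Opening", "Scotch Game",
  "Van't Kruijs Opening", "Sicilian Defense", "Van't Kruijs Opening"
)

# Count each opening and sort from most to least frequent.
counts <- sort(table(openings), decreasing = TRUE)

# Convert to a data frame and keep the top two rows.
top <- head(as.data.frame(counts, stringsAsFactors = FALSE), 2)
names(top) <- c("opening_fullname", "count")
print(top)
```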
Now let’s see if playing as White or Black has an impact on the winning results.
Now that we know playing as White has a higher likelihood of winning,
let’s see if these winning outcomes are mainly by checkmate (“Mate” in
the victory_status column), resignation, running out of time, etc.
## [1] 95.30747
And finally, in games where White wins, the average difference in ratings between White and Black is about 95 rating points. Let’s see the difference for Black.
## [1] 88.98111
In the games where Black wins, the average rating difference is about
89 points, slightly smaller than the gap for White. Maybe time to start
those chess games playing as White?
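The code behind these two averages is not shown, so the following is only one plausible reading: take `rating_difference` as White's rating minus Black's (which matches the sample values printed later in this report) and average the gap from the winner's point of view. The ratings below are invented for illustration.

```r
# Hypothetical mini-dataset; ratings are invented for illustration.
games <- data.frame(
  winner       = c("White", "White", "Black", "Black", "Draw"),
  white_rating = c(1500, 1600, 1400, 1550, 1500),
  black_rating = c(1400, 1650, 1500, 1600, 1500)
)

# Assumption: rating_difference is White's rating minus Black's.
games$rating_difference <- games$white_rating - games$black_rating

# Average gap in the winner's favor: positive means the winner was
# rated higher on average.
white_gap <- mean(games$rating_difference[games$winner == "White"])
black_gap <- mean(-games$rating_difference[games$winner == "Black"])

print(c(white_gap = white_gap, black_gap = black_gap))
```

Flipping the sign for Black's wins keeps both numbers on the same scale, so they can be compared directly.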
Now let’s see what Black’s winning opening moves are:
## tibble [20,058 × 4] (S3: tbl_df/tbl/data.frame)
## $ winner : chr [1:20058] "White" "Black" "White" "White" ...
## $ turns : num [1:20058] 13 16 61 61 95 5 33 9 66 119 ...
## $ rating_difference: num [1:20058] 309 61 -4 -15 54 248 97 -695 47 172 ...
## $ time_increment : chr [1:20058] "15+2" "5+10" "5+10" "20+0" ...
## turns rating_difference
## turns 1.00000000 -0.03578093
## rating_difference -0.03578093 1.00000000
The diagonal of the correlation matrix is 1 because each variable is perfectly correlated with itself. The correlation between the number of turns and the rating difference is about -0.04, which is basically nonexistent, meaning that the rating gap between the players has essentially no linear relationship with the number of turns in a game.
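As a small sketch of how such a matrix is built, the base R `cor()` function can be applied to the two numeric columns. The six rows below are the first values of `turns` and `rating_difference` printed above, so the resulting coefficient will differ from the full-data value.

```r
# First six rows of the two numeric columns, copied from the output above.
games <- data.frame(
  turns             = c(13, 16, 61, 61, 95, 5),
  rating_difference = c(309, 61, -4, -15, 54, 248)
)

# Pearson correlation matrix: the diagonal is always 1, and the
# off-diagonal entry is the correlation between the two columns.
cor_matrix <- cor(games[, c("turns", "rating_difference")])
print(cor_matrix)
```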
## # weights: 12 (6 variable)
## initial value 4406.533890
## iter 10 value 3060.133711
## iter 20 value 3058.247057
## final value 3058.243818
## converged
| y.level | term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|---|
| Black | (Intercept) | -0.2311 | 0.0738 | -3.1328 | 0.0017 |
| Black | rating_difference | -0.0039 | 0.0002 | -21.2244 | 0.0000 |
| Black | turns | 0.0024 | 0.0011 | 2.2891 | 0.0221 |
| Draw | (Intercept) | -3.6949 | 0.1749 | -21.1245 | 0.0000 |
| Draw | rating_difference | -0.0017 | 0.0004 | -4.4089 | 0.0000 |
| Draw | turns | 0.0206 | 0.0019 | 10.8097 | 0.0000 |
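For each outcome in this output, we have the following (White does not appear in the table because it is the reference outcome that Black and Draw are compared against):

1. Black has a p-value of ~0.0221 for turns and ~0.0 for rating_difference. Both are below the usual 0.05 threshold, so both are statistically significant predictors, although the effect of turns is small (0.0024 per turn) and its evidence is much weaker than for rating_difference.
2. Both of Draw’s p-values are ~0.0, meaning that both variables are significant predictors of a draw.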
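To make the coefficients more concrete, here is a hedged sketch that turns them into predicted probabilities. It assumes White is the reference outcome with a linear predictor of 0, and that `rating_difference` is White's rating minus Black's. The helper `predict_probs()` is a hypothetical name introduced for illustration, not part of the original analysis.

```r
# Coefficients copied from the table above (rows: non-reference outcomes).
coefs <- rbind(
  Black = c(intercept = -0.2311, rating_difference = -0.0039, turns = 0.0024),
  Draw  = c(intercept = -3.6949, rating_difference = -0.0017, turns = 0.0206)
)

# Hypothetical helper: multinomial-logit probabilities for one game.
# Assumption: White is the baseline, so its linear predictor is fixed at 0.
predict_probs <- function(rating_difference, turns) {
  x <- c(1, rating_difference, turns)
  eta <- c(White = 0, drop(coefs %*% x))
  exp(eta) / sum(exp(eta))
}

# Evenly matched players versus White rated 200 points higher,
# both in a 60-turn game.
even     <- predict_probs(rating_difference = 0, turns = 60)
favoured <- predict_probs(rating_difference = 200, turns = 60)

print(round(even, 3))
print(round(favoured, 3))
```

Under these assumptions, an evenly matched 60-turn game comes out at roughly 50% White, 46% Black, and 4% Draw, and a 200-point edge for White shifts the balance clearly in White's favor.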
##
## White Black Draw
## White 1457 544 3
## Black 744 1062 1
## Draw 105 93 2
## [1] 62.85216
The accuracy of the model on the training set is 62.85%.
##
## White Black Draw
## Black 3030 4262 8
## Draw 375 363 12
## White 5715 2280 2
## [1] 62.2484
The accuracy of the model on the testing set is 62.25%.
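Both accuracy figures can be reproduced from the confusion matrices printed above. The sketch below is one way to do it in base R. The helper name `accuracy_pct()` is an assumption for illustration. Because the second matrix lists its rows in a different order from its columns, the correct predictions are looked up by label rather than taken from the diagonal.

```r
classes <- c("White", "Black", "Draw")

# Counts copied from the two confusion matrices above.
conf_train <- matrix(
  c(1457,  544, 3,
     744, 1062, 1,
     105,   93, 2),
  nrow = 3, byrow = TRUE, dimnames = list(classes, classes)
)
conf_test <- matrix(
  c(3030, 4262,  8,
     375,  363, 12,
    5715, 2280,  2),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("Black", "Draw", "White"), classes)
)

# Hypothetical helper: percentage of cells where row label == column label.
accuracy_pct <- function(conf) {
  labels <- rownames(conf)
  correct <- sum(conf[cbind(labels, labels)])  # index cells by name
  100 * correct / sum(conf)
}

print(accuracy_pct(conf_train))
print(accuracy_pct(conf_test))
```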
In conclusion, this regression model suggests that the outcome of a game is driven mainly by the rating difference between the two players. Do not let this discourage you from playing a game of chess, as the outcome depends far more on who you are playing against than on who you’re playing as.
But let’s see if a different machine learning model says the same. The next model to predict the outcome of the game is the Support Vector Classification Model.
So that you can understand the SVM results better, I have encoded the winner column as the following three classes: 0 (draw), 1 (White wins), and 2 (Black wins).
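One way this encoding could be done is with a factor whose levels are relabelled, as in the sketch below. The column name `winner_numeric` is taken from the model call in the output, but the exact recoding step used in the analysis is not shown, so treat this as an assumption.

```r
# Toy winner column using the labels seen in the dataset.
winner <- c("White", "Black", "White", "Draw")

# Map Draw -> 0, White -> 1, Black -> 2. Keeping the result as a factor
# signals that these are class labels, not quantities.
winner_numeric <- factor(
  winner,
  levels = c("Draw", "White", "Black"),
  labels = c("0", "1", "2")
)

print(table(winner, winner_numeric))
```

The numbers are only labels: class 2 is not "more" than class 1, which is why a classification model is used rather than a regression on these codes.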
##
## Call:
## svm(formula = winner_numeric ~ rating_difference + turns, data = m2_train,
## type = "C-classification", kernel = "radial", cost = 1, )
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
##
## Number of Support Vectors: 3305
##
## ( 1574 1531 200 )
##
##
## Number of Classes: 3
##
## Levels:
## 0 1 2
## Confusion Matrix:
##
## 0 1 2
## 0 0 385 365
## 1 0 5867 2130
## 2 0 3189 4111
##
## Accuracy: 0.6217985
The confusion matrix summarizes the model’s predictions versus the actual outcomes for the test set:

1. Rows represent actual classes.
2. Columns represent predicted classes.

True Positives (diagonal):

- Class 0: None of the actual draws (0) were predicted correctly.
- Class 1: 5867 instances were correctly classified as 1 (White wins).
- Class 2: 4111 instances were correctly classified as 2 (Black wins).
The SVM has an accuracy of ~62.18%, meaning that only ~62% of the games in the test set were classified correctly.
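Overall accuracy hides how differently the three classes fare. As a short sketch using the counts from the SVM confusion matrix above (rows as actual classes, columns as predicted), the share of each actual class that was predicted correctly can be computed like this:

```r
# Counts copied from the SVM confusion matrix above.
conf_svm <- matrix(
  c(0,  385,  365,
    0, 5867, 2130,
    0, 3189, 4111),
  nrow = 3, byrow = TRUE,
  dimnames = list(actual = c("0", "1", "2"), predicted = c("0", "1", "2"))
)

# Per-class recall: correct predictions divided by actual class size.
recall <- diag(conf_svm) / rowSums(conf_svm)

# Overall accuracy: all correct predictions divided by all games.
accuracy <- sum(diag(conf_svm)) / sum(conf_svm)

print(round(recall, 3))
print(accuracy)
```

The all-zero first column shows that the model never predicts a draw, so its recall for class 0 is exactly 0 even though the overall accuracy looks moderate.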
To conclude this Chess Exploratory Analysis:

1. The Support Vector model confirms that White has a statistical advantage, likely due to White having the first move.
2. The Support Vector model also reveals that Black has a strong chance of winning, but success is harder to predict based on the available features.
3. Both the regression and classification models have an accuracy of ~62%, meaning that roughly 38% of games are misclassified, so features beyond rating difference and number of turns could help predict the game outcome.