The purpose of this analysis is to determine whether playing a game of chess as White or as Black makes a difference to the outcome, since color is an option you can select on chess.com.

## Rows: 20058 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (11): victory_status, winner, time_increment, white_id, black_id, moves,...
## dbl  (5): game_id, turns, white_rating, black_rating, opening_moves
## lgl  (1): rated
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 6 × 17
##   game_id rated turns victory_status winner time_increment white_id white_rating
##     <dbl> <lgl> <dbl> <chr>          <chr>  <chr>          <chr>           <dbl>
## 1       1 FALSE    13 Out of Time    White  15+2           bourgris         1500
## 2       2 TRUE     16 Resign         Black  5+10           a-00             1322
## 3       3 TRUE     61 Mate           White  5+10           ischia           1496
## 4       4 TRUE     61 Mate           White  20+0           daniamu…         1439
## 5       5 TRUE     95 Mate           White  30+3           nik2211…         1523
## 6       6 FALSE     5 Draw           Draw   10+0           trelynn…         1250
## # ℹ 9 more variables: black_id <chr>, black_rating <dbl>, moves <chr>,
## #   opening_code <chr>, opening_moves <dbl>, opening_fullname <chr>,
## #   opening_shortname <chr>, opening_response <chr>, opening_variation <chr>
##     game_id        rated             turns        victory_status    
##  Min.   :    1   Mode :logical   Min.   :  1.00   Length:20058      
##  1st Qu.: 5015   FALSE:3903      1st Qu.: 37.00   Class :character  
##  Median :10030   TRUE :16155     Median : 55.00   Mode  :character  
##  Mean   :10030                   Mean   : 60.47                     
##  3rd Qu.:15044                   3rd Qu.: 79.00                     
##  Max.   :20058                   Max.   :349.00                     
##     winner          time_increment       white_id          white_rating 
##  Length:20058       Length:20058       Length:20058       Min.   : 784  
##  Class :character   Class :character   Class :character   1st Qu.:1398  
##  Mode  :character   Mode  :character   Mode  :character   Median :1567  
##                                                           Mean   :1597  
##                                                           3rd Qu.:1793  
##                                                           Max.   :2700  
##    black_id          black_rating     moves           opening_code      
##  Length:20058       Min.   : 789   Length:20058       Length:20058      
##  Class :character   1st Qu.:1391   Class :character   Class :character  
##  Mode  :character   Median :1562   Mode  :character   Mode  :character  
##                     Mean   :1589                                        
##                     3rd Qu.:1784                                        
##                     Max.   :2723                                        
##  opening_moves    opening_fullname   opening_shortname  opening_response  
##  Min.   : 1.000   Length:20058       Length:20058       Length:20058      
##  1st Qu.: 3.000   Class :character   Class :character   Class :character  
##  Median : 4.000   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 4.817                                                           
##  3rd Qu.: 6.000                                                           
##  Max.   :28.000                                                           
##  opening_variation 
##  Length:20058      
##  Class :character  
##  Mode  :character  
##                    
##                    
## 

This gives a first look at the data, along with a summary of each column in the dataset.

## [1] 9438    1

There are a total of 9438 unique chess players in this dataset.

## # A tibble: 6 × 2
##   opening_fullname                              count
##   <chr>                                         <int>
## 1 Van't Kruijs Opening                            368
## 2 Sicilian Defense                                358
## 3 Sicilian Defense: Bowdler Attack                296
## 4 French Defense: Knight Variation                271
## 5 Scotch Game                                     271
## 6 Scandinavian Defense: Mieses-Kotroc Variation   259

Van’t Kruijs Opening (1. e3) is the most common opening, followed by the Sicilian Defense.

Now let’s see whether starting the game as White or Black has an impact on who wins.

Now that we know playing as White has a higher likelihood of winning, let’s see whether these wins come mainly from checkmate (recorded as “Mate” in the victory_status column), resignation, running out of time, etc.

## [1] 95.30747

And finally, in games that White wins, the average difference in ratings between White and Black is about 95 rating points. Let’s see the difference for Black.

## [1] 88.98111

In games that Black wins, the corresponding average rating difference is about 89 rating points. Maybe time to start those chess games playing as White? Now let’s see what Black’s winning opening moves are:

## tibble [20,058 × 4] (S3: tbl_df/tbl/data.frame)
##  $ winner           : chr [1:20058] "White" "Black" "White" "White" ...
##  $ turns            : num [1:20058] 13 16 61 61 95 5 33 9 66 119 ...
##  $ rating_difference: num [1:20058] 309 61 -4 -15 54 248 97 -695 47 172 ...
##  $ time_increment   : chr [1:20058] "15+2" "5+10" "5+10" "20+0" ...
##                         turns rating_difference
## turns              1.00000000       -0.03578093
## rating_difference -0.03578093        1.00000000

The diagonal of the correlation matrix is 1 simply because each variable is perfectly correlated with itself. The value of interest is the off-diagonal one: the correlation between turns and rating_difference is about -0.04, which is basically nonexistent. In other words, there is essentially no linear relationship between the rating gap of the two players and the number of turns in a game.
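To make the correlation figure more concrete, here is a minimal sketch of how a Pearson correlation coefficient is computed. It uses only the first ten values of turns and rating_difference shown in the structure output above, so the result will not match the full-dataset value of about -0.04. The helper name `pearson` is my own and is not part of the original analysis.

```python
import math

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

# First ten rows only, copied from the structure output (illustration, not the full data)
turns = [13, 16, 61, 61, 95, 5, 33, 9, 66, 119]
rating_difference = [309, 61, -4, -15, 54, 248, 97, -695, 47, 172]

print(pearson(turns, turns))              # a variable against itself
print(pearson(turns, rating_difference))  # the off-diagonal entry for this small sample
```

A variable compared with itself always gives 1, which is why the diagonal of any correlation matrix carries no information.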

## # weights:  12 (6 variable)
## initial  value 4406.533890 
## iter  10 value 3060.133711
## iter  20 value 3058.247057
## final  value 3058.243818 
## converged
| y.level | term | estimate | std.error | statistic | p.value |
|:--|:--|--:|--:|--:|--:|
| Black | (Intercept) | -0.2311 | 0.0738 | -3.1328 | 0.0017 |
| Black | rating_difference | -0.0039 | 0.0002 | -21.2244 | 0.0000 |
| Black | turns | 0.0024 | 0.0011 | 2.2891 | 0.0221 |
| Draw | (Intercept) | -3.6949 | 0.1749 | -21.1245 | 0.0000 |
| Draw | rating_difference | -0.0017 | 0.0004 | -4.4089 | 0.0000 |
| Draw | turns | 0.0206 | 0.0019 | 10.8097 | 0.0000 |

For each outcome in this output (each is compared against the baseline outcome, a White win), we have the following:

1. For Black, the p-value is ~0.0221 for turns and ~0.0 for rating_difference. Both are below 0.05, so both are statistically significant predictors, although the evidence for turns is much weaker and its estimated effect is small.
2. Both of Draw’s p-values are ~0.0, meaning turns and rating_difference are both significant predictors of a game ending in a draw rather than a White win.
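The coefficients are on the log-odds scale, which is hard to read directly. The sketch below converts them into an odds ratio and into predicted probabilities, using the estimates from the table above. It assumes that rating_difference is White’s rating minus Black’s (this matches the first rows of the data, but the code that created the column is not shown), and the function name `outcome_probabilities` is hypothetical.

```python
import math

# Estimates copied from the coefficient table; White is the baseline outcome.
COEFS = {
    "Black": {"intercept": -0.2311, "rating_difference": -0.0039, "turns": 0.0024},
    "Draw": {"intercept": -3.6949, "rating_difference": -0.0017, "turns": 0.0206},
}

def outcome_probabilities(rating_difference, turns):
    """Softmax over the linear predictors, with the baseline score fixed at 0."""
    scores = {"White": 0.0}
    for outcome, c in COEFS.items():
        scores[outcome] = (c["intercept"]
                           + c["rating_difference"] * rating_difference
                           + c["turns"] * turns)
    total = sum(math.exp(s) for s in scores.values())
    return {outcome: math.exp(s) / total for outcome, s in scores.items()}

# Change in the odds of "Black wins" vs "White wins" when White is 100 points stronger
odds_ratio_100 = math.exp(COEFS["Black"]["rating_difference"] * 100)
print(round(odds_ratio_100, 3))

# Predicted outcome probabilities for evenly rated players in a 60-turn game
print(outcome_probabilities(rating_difference=0, turns=60))
```

Under these assumptions, every extra 100 rating points in White’s favor multiplies the odds of a Black win (relative to a White win) by roughly 0.68, which is why rating_difference dominates the model while turns barely moves the result.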

##        
##         White Black Draw
##   White  1457   544    3
##   Black   744  1062    1
##   Draw    105    93    2
## [1] 62.85216

The model’s accuracy on the training data is 62.85%.

##        
##         White Black Draw
##   Black  3030  4262    8
##   Draw    375   363   12
##   White  5715  2280    2
## [1] 62.2484

The model’s accuracy on the testing data is 62.25%.
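Both accuracy figures can be checked by hand from the two tables above: accuracy is the number of games where the row label equals the column label, divided by the total number of games. Note that the rows of the second table are ordered Black, Draw, White while its columns are ordered White, Black, Draw, so the correct cells are not on the visual diagonal. The sketch below matches cells by label for that reason; the function name `accuracy` is my own.

```python
def accuracy(matrix):
    """Percentage of games whose row label equals their column label."""
    correct = sum(row[label] for label, row in matrix.items())
    total = sum(sum(row.values()) for row in matrix.values())
    return 100 * correct / total

# Counts copied from the training table
train = {
    "White": {"White": 1457, "Black": 544, "Draw": 3},
    "Black": {"White": 744, "Black": 1062, "Draw": 1},
    "Draw": {"White": 105, "Black": 93, "Draw": 2},
}

# Counts copied from the testing table (rows in the order they are printed)
test = {
    "Black": {"White": 3030, "Black": 4262, "Draw": 8},
    "Draw": {"White": 375, "Black": 363, "Draw": 12},
    "White": {"White": 5715, "Black": 2280, "Draw": 2},
}

print(round(accuracy(train), 5))
print(round(accuracy(test), 4))
```

The two results agree with the 62.85% and 62.25% reported above.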

In conclusion, this regression model indicates that the result of a game is driven mainly by the rating difference between the two players rather than by color alone. Do not let this discourage you from playing a game of chess, as the outcome depends far more on who you are playing against than on who you’re playing as.

But let’s see if a different machine learning model says the same. The next model to predict the outcome of the game is the Support Vector Classification Model.

So that you can understand the SVM results better, I have encoded the winner column as the following three classes: 0 (draw), 1 (White wins), and 2 (Black wins).

## 
## Call:
## svm(formula = winner_numeric ~ rating_difference + turns, data = m2_train, 
##     type = "C-classification", kernel = "radial", cost = 1, )
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  3305
## 
##  ( 1574 1531 200 )
## 
## 
## Number of Classes:  3 
## 
## Levels: 
##  0 1 2
## Confusion Matrix:
##    
##        0    1    2
##   0    0  385  365
##   1    0 5867 2130
##   2    0 3189 4111
## 
## Accuracy: 0.6217985

The confusion matrix summarizes the model’s predictions versus the actual outcomes for the test set:

1. Rows represent actual classes.
2. Columns represent predicted classes.

True Positives (diagonal):

- Class 0: None of the actual draws (0) were predicted correctly.
- Class 1: 5867 instances were correctly classified as 1 (White wins).
- Class 2: 4111 instances were correctly classified as 2 (Black wins).

The SV model has an accuracy of ~62.18%, meaning that it classified only ~62% of the games in the test set correctly.
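Overall accuracy hides how differently the model treats each class. The sketch below recomputes the accuracy from the confusion matrix above and adds two figures that are not in the original output: per-class recall (the share of each actual class that was predicted correctly) and a simple baseline that always predicts the most common class. The variable names are my own.

```python
LABELS = {0: "Draw", 1: "White wins", 2: "Black wins"}

# Rows are actual classes 0, 1, 2; columns are predicted classes 0, 1, 2.
confusion = [
    [0, 385, 365],
    [0, 5867, 2130],
    [0, 3189, 4111],
]

total = sum(sum(row) for row in confusion)

# Share of all games on the diagonal
accuracy = sum(confusion[i][i] for i in range(3)) / total

# Share of each actual class that the model recovered
recall = {LABELS[i]: confusion[i][i] / sum(confusion[i]) for i in range(3)}

# Accuracy of always guessing the most frequent actual class
baseline = max(sum(row) for row in confusion) / total

print(round(accuracy, 7))
print({name: round(value, 3) for name, value in recall.items()})
print(round(baseline, 3))
```

Read this way, the model recovers roughly 73% of White’s wins but only about 56% of Black’s wins and none of the draws, while always guessing “White wins” would already be right about half the time.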

In conclusion of this Chess Exploratory Analysis:

1. The Support Vector model supports the finding that White has a statistical advantage, likely due to White having the first move.
2. The Support Vector model also reveals that Black has a strong chance of winning, but Black’s success is harder to predict based on the available features.
3. Both the regression and the classification models have an accuracy of ~62%, meaning that roughly 38% of games are misclassified. Features beyond rating difference and number of turns could help predict the game outcome.