The topic of our project is chess. We looked at what determines the winning side in a chess game. Either player can come out on top, so it is often difficult to tell which player is winning until the game is over, at least without advanced computer algorithms. However, we have compiled our best interpretation of a few indicators of who will win before the match even starts.

Thesis: Which indicators in a chess game, such as the color of the pieces, the number of turns, and the rating difference between the players, most influence the chance of winning a chess match?

First we bring in our data:

library(tidyverse)  # provides %>%, mutate(), filter(), view() and ggplot() used below
Project_Chess <- read.csv("~/Mscs 150 I20/Project/Amitabh-Nils Project/Project-Chess.csv")

Let us start by taking a closer look at exactly what our data consists of.

view(Project_Chess)

There are 5 main variables we will look at to analyze what advantages or disadvantages a player can have in a game.

  1. turns - This is the number of turns taken in the game by both players. It is recorded as quantitative but is limited to integer values because you cannot have less than a whole turn.
  2. victory_status - This is a categorical variable with 4 possibilities:
     1. outoftime - one player runs out of time and loses
     2. resign - one player resigns and forfeits the game
     3. mate - one player checkmates the opponent's king and wins
     4. draw - the game ends in a stalemate/draw situation, or an endless loop of moves
  3. winner (Black or White) - This is also a categorical variable, but once drawn games are set aside it is practically an indicator, as it is either one or the other.
  4, 5. white_rating and black_rating - Both players have a rating, a score based on their performance in previous matches. This score is quantitative but is also limited to integer values because of the nature of chess ratings.

Next, we have built some pipelines to filter and create variables to suit the needs of our data exploration.

Here we create two new variables:

  1. Rate_Difference - the rating of the white player minus the rating of the black player.
  2. Advantage - for each game, which player has the higher rating.

##For finding rating differences and assigning them their own new variable "Advantage"
Project_Chess2 = Project_Chess %>% mutate(Rate_Difference=white_rating-black_rating)
##Note: games with equal ratings (Rate_Difference of 0) fall into "Black advantage"
Project_Chess3 = Project_Chess2 %>% mutate(Advantage=if_else(Rate_Difference>0, "White advantage", "Black advantage",missing=NULL))
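
As a quick check of how these two variables behave, here is a minimal sketch that applies the same logic in base R to a small made-up data frame. The `toy` games and their ratings are hypothetical and are not taken from our dataset. It shows that a game between equally rated players is labelled "Black advantage", because the condition only tests for a strictly positive difference.

```r
# Hypothetical toy games; the ratings are invented for illustration.
toy <- data.frame(
  white_rating = c(1500, 1320, 1800),
  black_rating = c(1400, 1450, 1800)
)

# Same definitions as in the pipeline above, written in base R.
toy$Rate_Difference <- toy$white_rating - toy$black_rating
toy$Advantage <- ifelse(toy$Rate_Difference > 0,
                        "White advantage", "Black advantage")

# The third game has a difference of 0 and is labelled "Black advantage".
print(toy)
```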

Next we made subgroups of advantage based on how large the advantaged player's advantage was. We called this variable Adv_Size and gave it these values:

  1. LAB - Large Advantage Black (Rate_Difference at or below its first quartile)
  2. SAB - Small Advantage Black (above the first quartile, up to the mean)
  3. SAW - Small Advantage White (above the mean, up to the third quartile)
  4. LAW - Large Advantage White (above the third quartile)

We based these levels on summary data from Rate_Difference, as you can see in our mutate function.

##For defining levels to rating advantage
##summary() returns Min, 1st Qu., Median, Mean, 3rd Qu., Max, in that order
Q1dif=summary(Project_Chess3$Rate_Difference) [2]  # first quartile
Q2dif=summary(Project_Chess3$Rate_Difference) [4]  # mean (index 3 would be the median)
Q3dif=summary(Project_Chess3$Rate_Difference) [5]  # third quartile
Project_Chess5 = mutate(Project_Chess3, Adv_Size=ifelse(Rate_Difference<=Q1dif, "LAB", ifelse(Rate_Difference<=Q2dif, "SAB", ifelse(Rate_Difference<=Q3dif, "SAW", ifelse(Rate_Difference> Q3dif, "LAW", NA)))))
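
The nested `ifelse()` calls can also be written with `cut()`, which assigns each value to the interval between two consecutive cutoffs. The sketch below is an alternative formulation, not the code used for our graphs. The `rate_diff` vector is made up, and the cutoffs are taken from `summary()` by name to make it explicit which statistic each one is.

```r
# Hypothetical rating differences (white minus black), for illustration only.
rate_diff <- c(-400, -150, -30, -5, 0, 10, 45, 120, 300, 110)

s <- summary(rate_diff)
cutoffs <- c(-Inf, s[["1st Qu."]], s[["Mean"]], s[["3rd Qu."]], Inf)

# Intervals are closed on the right, matching the "<=" comparisons above.
adv_size <- cut(rate_diff, breaks = cutoffs,
                labels = c("LAB", "SAB", "SAW", "LAW"))

# Number of toy games in each advantage level.
print(table(adv_size))
```

A side effect of `cut()` is that the result is a factor whose levels run from LAB to LAW, so a bar graph would show them in that order instead of alphabetically.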

Finally, we filtered out matches ending in a draw for later use.

##For filtering out the matches that resulted in a draw
Project_Chess4 = filter(Project_Chess3, winner!="draw")

1

How does the win rate depend on which piece color has the rating advantage?

p = ggplot(Project_Chess4, aes(x=Advantage, color=winner, fill=winner))
p + geom_bar(alpha=.7) + 
  labs(title="Games won per Piece Color",
       subtitle="As compared by Rating Advantage",
       x="Rating Advantage",
       y="Number of Games",
       caption="Source: Chess Game Dataset (Lichess)")

Here the x-axis shows Advantage, the piece color whose player has the higher rating in the game. "Black advantage" means the player with the black pieces has a higher rating than the player with the white pieces, and "White advantage" means the opposite. From the bar graph, we can see that a piece color wins more games when it has the advantage. However, the proportion of games won is larger when white is the advantaged color than when black is. Overall, the player with the rating advantage wins nearly 2/3 of the games.
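
To read such proportions directly instead of estimating them from bar heights, one option is a table of row proportions. This is a minimal base-R sketch on a made-up data frame (`toy` and its values are hypothetical). On the real data the same calls would presumably use `Project_Chess4$Advantage` and `Project_Chess4$winner`.

```r
# Hypothetical decisive games, invented for illustration.
toy <- data.frame(
  Advantage = c("White advantage", "White advantage", "White advantage",
                "Black advantage", "Black advantage", "Black advantage"),
  winner    = c("white", "white", "black", "black", "black", "white")
)

counts <- table(toy$Advantage, toy$winner)

# margin = 1 divides each row by its total, so every row sums to 1.
win_share <- prop.table(counts, margin = 1)
print(round(win_share, 2))
```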

2

Is there a correlation between the amount someone is favored and their winning rates?

p = ggplot(Project_Chess5, aes(x=Adv_Size, color=winner, fill=winner))
p + geom_bar(alpha=.7) + 
  labs(title="Games Won per Piece Color",
       subtitle="As compared by Rating Advantage",
       x="Rating Advantage",
       y="Number of Games",
       caption="Source: Chess Game Dataset (Lichess)")

We divided each piece color's advantage into two parts: large advantage and small advantage. Thus, the Rating Advantage variable has four categories: LAB, LAW, SAB and SAW. From the new bar graph, we can observe that when the advantage for a piece color is large, its chance of winning is higher than when the advantage is small. When white or black has a large advantage, the advantaged color wins almost 3/4 of the games, and when the advantage is small, it wins about half.

One thing that is particularly interesting is that the number of games played between more evenly matched opponents is about equal to the number played between unevenly matched players. Part of this follows from how we defined the levels: about a quarter of the games fall below the first quartile and about a quarter above the third quartile, so the two large-advantage groups together hold roughly half of the games. To investigate how players are matched, we made a scatter plot of the games by the two players' ratings.

p = ggplot(Project_Chess4, aes(x=black_rating, y=white_rating, color=winner))
p + geom_point(alpha=0.3) + geom_smooth(method="lm", size=1.5) + 
  labs(title="Correlation between Ratings in each Game",
       subtitle="Colored by winner",
       x="Black's Rating",
       y="White's Rating",
       caption="Source: Chess Game Dataset (Lichess)")

Here we can see quite a strong positive correlation, indicating that the higher your opponent's rating, the higher your own rating likely is. This tells us that matchups are typically formed between similarly rated opponents. Finally, the trend lines show that, at a given black rating, white is typically rated higher in the games white wins than in the games black wins.

3

Now we create this graph to see if the length of the game (in turns) has an effect on the result.

p = ggplot(Project_Chess4, aes(x=turns, fill=factor(victory_status))) + 
  facet_wrap(~ victory_status) +
  labs(title="Game Results",
       subtitle="analysed by number of turns and faceted by victory status",
       x="Number of Turns",
       y="Density",
       caption="Source: Chess Game Dataset (Lichess)")
p2 <- p + geom_density(alpha = .7) 
p2

This graph shows how likely the different outcomes of a game are depending on how many moves are made during the game. A resignation or mate is most likely at about 50 turns, and then at about 75 turns, running out of time becomes the main concern. This is logical because the more turns a game lasts, the less time each player has left, depending on the time format.

4

Do I win or lose more as the games get longer?

p = ggplot(Project_Chess4, aes(x=turns, fill=victory_status))
p + geom_density(alpha=0.4) + facet_wrap(~winner, ncol=1) + scale_x_sqrt() +
  labs(title="Game Results",
       subtitle="Based on Number of Turns",
       x="Number of Turns (square-root scale)",
       y="Density",
       caption="Source: Chess Game Dataset (Lichess)")

Now we have broken the graph of game results out by piece color, and we see that black has the higher mate peak of the two piece colors. We can also see that white's peak for running out of time is higher. One argument for why this makes sense is that white begins every game, so it may take white more time to choose each move, while black is simply reacting in a natural way in the opening.

5

How is the possibility of winning affected by the rating difference, and how is the rating difference related to the number of turns?

p = ggplot(Project_Chess4, aes(x=abs(Rate_Difference), color=winner, fill=winner)) +
  labs(title="Relationship of Winning Possibility with Rate Difference",
       subtitle="Based on Rating Difference between the Players",
       x="Rating Difference",
       y="Density",
       caption="Source: Chess Game Dataset (Lichess)")
p2 <- p + geom_density() + facet_grid(~winner)
p2

From the density graph, we can observe that wins for either color are concentrated at small rating differences and become rarer as the rating difference increases. In that sense, there is a negative relationship between the density of wins and the rating difference between the players. Because the curve is a density of games, it also reflects how often games are played at each rating difference, so it should not be read as the probability that a given player wins.
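
One way to estimate a win probability more directly is to compute, within bins of rating difference, the share of games won by the higher-rated player. The sketch below does this in base R on invented data. The `toy` data frame, the bin edges, and the columns `higher_rated_won` and `diff_bin` are all assumptions made for illustration and are not part of our dataset.

```r
# Hypothetical decisive games; Rate_Difference is white minus black.
toy <- data.frame(
  Rate_Difference = c(20, -35, 60, -80, 150, -220, 310, -400),
  winner = c("black", "black", "white", "white",
             "white", "black", "white", "black")
)

# TRUE when the winner is the player with the higher rating.
toy$higher_rated_won <- ifelse(toy$Rate_Difference > 0,
                               toy$winner == "white",
                               toy$winner == "black")

# Bin the absolute difference; the edges are arbitrary choices.
toy$diff_bin <- cut(abs(toy$Rate_Difference),
                    breaks = c(0, 100, Inf),
                    labels = c("0-100", "over 100"),
                    include.lowest = TRUE)

# The mean of a logical vector is the proportion of TRUE values.
win_rate <- tapply(toy$higher_rated_won, toy$diff_bin, mean)
print(win_rate)
```

Unlike a density, each value here is a proportion within its own bin, so it does not depend on how many games were played at that rating difference.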

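## Note: Rate_Difference is white minus black, so the log scale drops every game
## where it is zero or negative (see the warnings below)
q = ggplot(Project_Chess4, aes(x = turns, y=(Rate_Difference), fill=winner)) +
  scale_y_log10() +
  labs(title="Relationship of Rate Difference with The Number of Turns",
       subtitle="Based on Number of Turns in the Whole Game",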
       y="Log of Rating Difference",
       x="Number of Turns",
       caption="Source: Chess Game Dataset (Lichess)")
q2 <- q + geom_violin(alpha=.6) + geom_smooth(color="yellow", size=1.5, method="lm") + facet_wrap(~winner)
q2
## Warning in self$trans$transform(x): NaNs produced
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning in self$trans$transform(x): NaNs produced
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 9390 rows containing non-finite values (stat_ydensity).
## Warning: Removed 9390 rows containing non-finite values (stat_smooth).

In this case, we are looking at two numeric variables. We first tried a scatterplot, but because our data set is quite large, the points were too dense to read. We therefore created a violin graph with a linear smoother for better interpretation. Because the rating difference is plotted on a log scale, the games where it is zero or negative (9390 rows) are dropped, so the graph covers only games in which white has the higher rating. It turns out that there is a slight relationship between the rating difference and the number of turns. For games won by black, the correlation is weaker. For both colors, the rating difference is negatively correlated with the number of turns: as the number of turns increases, the rating difference decreases.
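
The strength of a relationship like this can also be summarised by a correlation coefficient, which does not need a log scale and therefore keeps games with a zero or negative difference. This is a minimal sketch with invented numbers (the `toy` data frame is hypothetical). The absolute difference is used so that games in which black is the higher-rated player are kept.

```r
# Hypothetical games, invented so that the rating gap shrinks as games get longer.
toy <- data.frame(
  turns = c(20, 35, 50, 65, 80, 95),
  Rate_Difference = c(-300, 250, -120, 90, -40, 10)
)

# Spearman's rank correlation is less sensitive to a few extreme values
# than the default Pearson correlation.
rho <- cor(toy$turns, abs(toy$Rate_Difference), method = "spearman")
print(rho)
```

A value near -1 means the gap shrinks steadily as games get longer, and a value near 0 means there is little monotone relationship.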

Who cares? Well, our data is a testament to the fact that there are advantages to be gained in some parts of the game of chess regardless of tactics, slight as they may be. But we also discovered that rating is the best indicator of who is likely to win any given match. One issue we ran into was that many of the variables given in our data set were not very usable in R in their original format. One thing we would change is to find a data set that lets us look at how some of these things have changed over time.

Conclusion: From our analysis of a dataset of chess matches, we found that the number of turns, the rating difference, the rating advantage and the color of the pieces are major indicators, usable as categorical or quantitative variables, for estimating the possibility of winning a chess game. From the five sub-questions related to our main research question, we can conclude that the possibility of winning is higher for the piece color with the rating advantage, and that this possibility varies with the size of the advantage. We also determined that the most typical chess match finishes at about 100 turns, and that the rating difference between the players is negatively correlated with the number of turns. In addition, the density of wins is highest where the rating difference between the players is small.