These are the packages used in this assignment.
library(magrittr) # Allowed by Sona on the discussion board. Enables use of %<>%
library(MVN)
library(forecast)
library(editrules)
library(readr)
library(rvest)
library(tidyr)
library(dplyr)
library(lubridate)
In this assignment the data for ~20,000 chess games played on the online platform Lichess was preprocessed. The Lichess data was joined with expected win rate data from The Chess World website using a combination of programmatic and manual data manipulation to match opening move names between the two data sets.
The variables were explained and underwent appropriate data type conversion to date/times and factors.
The data set was initially untidy due to compound values in some cells, so was tidied using separate. The game duration was calculated and added to the data set as a new variable.
The data was checked for missing and special values, and for obvious errors, although no significant problems were detected.
All numeric variables were checked for outliers. Some of these outliers were removed from the data set, but others were deemed appropriate and left in. These values tended to be a result of the variability of the game of Chess itself, rather than errors in data collection.
The turns variable was transformed to reduce the skew and allow for better analysis.
Two data sets are used:
id: Lichess game IDrated: Whether or not the game affects the players’ ratingscreated_at: The POSIX timestamp the game started, in millisecondslast_move_at: The POSIX timestamp the game ended, in millisecondsturns: The number of turns takenvictory_status: What the result waswinner: The colour that won (if any) or a drawincrement_code: The code used for the time limit mechanicwhite_id: The user name playing whitewhite_rating: White’s skill ratingblack_id: The user name playing blackblack_rating: Black’s skill ratingmoves: The full movelist of the gameopening_eco: The Encyclopedia of Chess Openings (ECO) classification code for the opening movesopening_name: The ECO classification names of the openingopening_ply: The number of moves contributing to the ECO classificationRank: The rank of the opening, as determined by Points per 100 GamesOpening: The opening nameWhite Win %: How often white won with this openingDraw %: How often games were drawn with this openingPoints per 100 games: How many points this opening received for white per 100 games, calculated as 1 point per win and 0.5 points per draw.Load Lichess data from csv with automatic column detection
lichess <- read_csv("games.csv")
-- Column specification ---------------------------------------------------------------------------------------------------
cols(
id = col_character(),
rated = col_logical(),
created_at = col_double(),
last_move_at = col_double(),
turns = col_double(),
victory_status = col_character(),
winner = col_character(),
increment_code = col_character(),
white_id = col_character(),
white_rating = col_double(),
black_id = col_character(),
black_rating = col_double(),
moves = col_character(),
opening_eco = col_character(),
opening_name = col_character(),
opening_ply = col_double()
)
head(lichess)
Load opening move data from a html table at The Chess World
openings_page <- read_html("https://thechessworld.com/articles/openings/chess-statistics-top-10-best-openings-for-white-and-black/")
chess_world <- html_table(
html_nodes(openings_page, "table")[[1]],
header = TRUE)
head(chess_world)
There are two issues preventing a straightforward join based on the ovening move name:
opening_name column, but the Chess World data only works on overall archetype.
#1. Separate Lichess names on : or | and discard all pieces after the first.
lichess %<>% separate(opening_name,
into = c("opening_name_archetype"),
sep = "[:|]",
extra = "drop")
#2. Adjust Chess World opening names to match Lichess data
chess_world$Opening[chess_world$Opening %in%
c("Queen’s Gambit",
"Blackmar Diemer Gambit",
"Bishop’s Opening",
"Benko Opening",
"Centre Game")] <-
c("Queen's Gambit", #Smart quote to regular quote
"Blackmar-Diemer Gambit", #Add hyphen
"Bishop's Opening", #Smart quote to regular quote
"Benko Gambit", #'Opening' to 'Gambit'
"Center Game") #Change 'Center' spelling from British to US
Now the columns match and the join can be completed. left-join is used to preserve all game data, including games with openings not in the ‘top 10’.
games <- lichess %>% left_join(chess_world,
by = c("opening_name_archetype" = "Opening"))
head(games)
Check the structure of the data frame:
str(games)
tibble [20,058 x 20] (S3: tbl_df/tbl/data.frame)
$ id : chr [1:20058] "TZJHLljE" "l1NXvwaE" "mIICvQHh" "kWKvrqYL" ...
$ rated : logi [1:20058] FALSE TRUE TRUE TRUE TRUE FALSE ...
$ created_at : num [1:20058] 1.5e+12 1.5e+12 1.5e+12 1.5e+12 1.5e+12 ...
$ last_move_at : num [1:20058] 1.5e+12 1.5e+12 1.5e+12 1.5e+12 1.5e+12 ...
$ turns : num [1:20058] 13 16 61 61 95 5 33 9 66 119 ...
$ victory_status : chr [1:20058] "outoftime" "resign" "mate" "mate" ...
$ winner : chr [1:20058] "white" "black" "white" "white" ...
$ increment_code : chr [1:20058] "15+2" "5+10" "5+10" "20+0" ...
$ white_id : chr [1:20058] "bourgris" "a-00" "ischia" "daniamurashov" ...
$ white_rating : num [1:20058] 1500 1322 1496 1439 1523 ...
$ black_id : chr [1:20058] "a-00" "skinnerua" "a-00" "adivanov2009" ...
$ black_rating : num [1:20058] 1191 1261 1500 1454 1469 ...
$ moves : chr [1:20058] "d4 d5 c4 c6 cxd5 e6 dxe6 fxe6 Nf3 Bb4+ Nc3 Ba5 Bf4" "d4 Nc6 e4 e5 f4 f6 dxe5 fxe5 fxe5 Nxe5 Qd4 Nc6 Qe5+ Nxe5 c4 Bb4+" "e4 e5 d3 d6 Be3 c6 Be2 b5 Nd2 a5 a4 c5 axb5 Nc6 bxc6 Ra6 Nc4 a4 c3 a3 Nxa3 Rxa3 Rxa3 c4 dxc4 d5 cxd5 Qxd5 exd5 "| __truncated__ "d4 d5 Nf3 Bf5 Nc3 Nf6 Bf4 Ng4 e3 Nc6 Be2 Qd7 O-O O-O-O Nb5 Nb4 Rc1 Nxa2 Ra1 Nb4 Nxa7+ Kb8 Nb5 Bxc2 Bxc7+ Kc8 Qd"| __truncated__ ...
$ opening_eco : chr [1:20058] "D10" "B00" "C20" "D02" ...
$ opening_name_archetype: chr [1:20058] "Slav Defense" "Nimzowitsch Defense" "King's Pawn Game" "Queen's Pawn Game" ...
$ opening_ply : num [1:20058] 5 4 3 3 5 4 10 5 6 4 ...
$ Rank : chr [1:20058] NA NA NA NA ...
$ White Win % : int [1:20058] NA NA NA NA NA NA 49 NA NA NA ...
$ Draw % : int [1:20058] NA NA NA NA NA NA 16 NA NA NA ...
$ Points per 100 games : num [1:20058] NA NA NA NA NA NA 57 NA NA NA ...
Many of these types were correctly detected during load, but some adjustments must be made:
created at and last_move_at are loaded as numbers, but should be date/times. as_datetime is used for conversion, but the loaded number (in milliseconds) must be divided by the 1000 to get the required number in seconds.
games$created_at <- as_datetime(games$created_at/1000)
games$last_move_at <- as_datetime(games$last_move_at/1000)
victory_status should be a factor. These codes can be labeled with more readable names.
games$victory_status %<>% factor(
levels = c("draw", "mate", "outoftime", "resign"),
labels = c("Draw", "Checkmate", "Out of Time", "Resigned")
)
winner and opening_name_archetype should be factors, but are already descriptive enough and so don’t need labeling.
#winner
games$winner %<>% as.factor
#opening archetype
games$opening_name_archetype %<>% as.factor
opening_eco should be a factor. Although ECO codes are used to classify openings, it’s not possible to label each ECO code with a opening name, since a single code can have many different names. This is demonstrated below.
#Count distinct names for each opening code
games %>%
group_by(opening_eco) %>%
summarise(archetypes = length(unique(opening_name_archetype)),
.groups = "drop") %>%
arrange(desc(archetypes)) %>%
head()
#opening ECO code as a factor without labels
games$opening_eco %<>% as.factor
All other variables were loaded correctly.
The data is not tidy, as there are values which don’t have their own cells.
increment_code comprises two values for each observation; An initial time limit measured in minutes, and a time increment measured in seconds. This will be handled by splitting the data on the ‘+’ character, then converting the results into duration data types.
#split increment code into time limit (mins) and increment (secs)
games %<>% separate(increment_code,
into = c("time_limit_initial", "time_increment"),
sep = "\\+",
convert = TRUE)
#convert both columns to durations
games$time_limit_initial %<>% dminutes
games$time_increment %<>% dseconds
The rank column was loaded as part of the Chess World data set, but has no utility when applied to individual games in the Lichess data set. This column will be dropped.
games %<>% select(-Rank)
The duration between created_at and last_move_at data will be saved as a new column, duration, with a data type duration. This will be more useful for analysis than either date/times or an interval.
This is achieved subtracting created_at from last_move_at, converting the resulting interval into a duration with as.duration, and then adding it to the data set with mutate.
games %<>% mutate(
duration = as.duration(last_move_at - created_at)
)
Missing values are first checked for globally using complete.cases() and all(), which reveals there are indeed missing values in the data.
A per-column summary is generated using ‘sapply()’ which shows missing values only in columns from the Chess World data set.
This is expected due to using left-join to combine the data sets, which leaves missing values for games not using one of the openers listed in the top 10.
#scan for missing values
complete.cases(games) %>% all
[1] FALSE
games %>% sapply(function(x) sum(is.na(x)))
id rated created_at last_move_at turns
0 0 0 0 0
victory_status winner time_limit_initial time_increment white_id
0 0 0 0 0
white_rating black_id black_rating moves opening_eco
0 0 0 0 0
opening_name_archetype opening_ply White Win % Draw % Points per 100 games
0 0 17320 17320 17320
duration
0
#Check the proportion of rows which are complete
games %>% complete.cases %>% table %>% prop.table
.
FALSE TRUE
0.8634959 0.1365041
Due to the large proportion of observations that have missing values, these observations will not be removed to allow better analysis on the columns with no missing values.
If required, this data can easily be filtered for analysis on columns with missing values using comlpete.cases():
games_top_ten <- games[complete.cases(games),]
complete.cases(games_top_ten) %>% all
[1] TRUE
Numerical columns are checked for special values (Infinity and NaN) using a custom function is.special(). No special values are present.
is.special <- function(x){
if (is.numeric(x))
(is.infinite(x) | is.nan(x))
}
games %>% sapply(function(x) sum(is.special(x)))
id rated created_at last_move_at turns
0 0 0 0 0
victory_status winner time_limit_initial time_increment white_id
0 0 0 0 0
white_rating black_id black_rating moves opening_eco
0 0 0 0 0
opening_name_archetype opening_ply White Win % Draw % Points per 100 games
0 0 0 0 0
duration
0
Check for obvious errors using violatedEdits(). That function seems to have issues with spaces and ‘%’ characters in column names, so those are adjusted first.
#replace column name spaces with underscores, and '%' with 'Pct'.
games %<>% rename(
White_Win_Pct = `White Win %`,
Draw_Pct = `Draw %`,
Points_per_100_games = `Points per 100 games`
)
A set of rules is created to check for obvious errors. Most are straightforward but some may require explanation:
time_limit_initial >= 0: Games start with (initial + increment) time, so zero is allowed herewhite_rating < 3500, black_rating < 3500: The current world record highest rating is 28822*White_Win_Pct+Draw_Pct == 2*Points_per_100_games Points = wins + (draws/2), but editset doesn’t support division so multiply everything by 2.The data set was then tested with these rules.
No inconsistencies were detected.
#rules for checking inconsistencies
rules <- editset(expression(
duration >= 0,
turns > 0,
time_limit_initial >= 0,
time_increment >= 0,
white_rating > 0, white_rating < 3500,
black_rating > 0, black_rating < 3500,
opening_ply <= turns,
White_Win_Pct >= 0, White_Win_Pct <= 100,
Draw_Pct >= 0, Draw_Pct <= 100,
White_Win_Pct + Draw_Pct <= 100,
2*White_Win_Pct+Draw_Pct == 2*Points_per_100_games ))
#test the rules and display results
Violations <- violatedEdits(rules, games)
summary(Violations)
No violations detected, 103920 checks evaluated to NA
NULL
Numeric columns must be checked for outliers. For each column, a box plot and histogram are used to visually inspect the data distribution, followed by other analysis as required
#Set up side-by-side plots
par(mfrow = c(1,2))
#turns
boxplot(games$turns, xlab = "Turns")
hist(games$turns)
#inspect high-turn games
upper_fence <- quantile(games$turns, 0.75, names=F) + 1.5*IQR(games$turns)
games %>% filter(turns > upper_fence)
#inspect low-turn games
games %>% filter(turns < 3)
#Remove low-turn games from the dataset
games %<>% filter(turns >= 3)
The turns boxplot shows many possible outliers, however manual inspection on a sample of these shows that these are valid games. The nature of Chess means that some games legitimately last much longer than average, and these have statistical value when doing some kinds of analysis. For this reason, these observations are left in the data set.
However, there are also a number of games with less than 3 turns. It is unlikely that these represent actual games of Chess – standard rules do not allow Checkmate before turn 3, and time limits are usually not a factor for each player’s first turn. These could be due to absent players or other issues, and so were removed from the data set as invalid.
#Set up side-by-side plots
par(mfrow = c(1,3))
#player ratings
boxplot(games$white_rating, games$black_rating,
names = c("White Rating", "Black Rating"))
hist(games$white_rating)
hist(games$black_rating)
#Check for multivariate outliers
par(mfrow = c(1,1))
games %>% plot(white_rating ~ black_rating, data = .)
The box plots for player ratings (both black and white) show some possible outliers, however the histograms show a relatively normal distribution. These values are likely a result of the ELO rating system used, and so none have been removed.
The black and white ratings were also plotted against each other to check for multivariate outliers, but no obvious outliers were detected.
#Set up side-by-side plots
par(mfrow = c(1,2))
#opening plays
boxplot(games$opening_ply)
hist(games$opening_ply)
#Check for patterns in long opening games
games %>%
filter(opening_ply > 10) %>%
group_by(opening_name_archetype) %>%
summarise(games = length(id), longest_opening = max(opening_ply)) %>%
arrange(desc(longest_opening))
`summarise()` ungrouping output (override with `.groups` argument)
The box plot for opening_ply shows some possible outliers with very long opening sequences. These outliers were summarised to look for patterns, but none were detected. Investigation showed the longest ECO opening at 36 moves, above even the longest in this dataset (itub, 2019). Thus, these values were kept as valid data.
#Set up side-by-side plots
par(mfrow = c(2,2))
#win rate, draw rate, and points per 100 games
boxplot(games$White_Win_Pct, games$Draw_Pct, games$Points_per_100_games,
names = c("White Win %", "Draw %", "Points"))
hist(games$White_Win_Pct)
hist(games$Draw_Pct)
hist(games$Points_per_100_games)
Possible outliers shown for White_Win_Pct and Draw_Pct are a result of the small number of possible values in the Chess World data, and were not discarded.
#Set up side-by-side plots
par(mfrow = c(1,2))
boxplot(games$duration)
hist(games$duration)
#inspect values
games %>%
filter(duration > 10000) %>%
select(id,
time_limit_initial,
time_increment,
turns,
duration,
victory_status)
#remove two outliers
games %<>% filter(duration < ddays(1))
#turn off tiling
par(mfrow = c(1,1))
The duration data show a number of outliers with extremely high values. Manual inspection shows two games that have a duration higher than should be possible based on time_limit_initial, time_increment and turns. These outliers were removed.
Several records show a zero duration, or duration 2-3 hours longer than should be possible. Inspection of the original import data shows that some timestamp entries store the exact number of milliseconds (e.g. “1499811847779”) but others are expressed in scientific notation (e.g. “1.50421e+12”). The latter has a precision of about 3 hours. This means that durations calculated on these entries could be up to 3 hours longer or shorter than the actual duration, if the actual timestamps were rounded in opposite directions when converting to scientific notation.
Because of this, all games within 3 hours of the expected time (including zero duration games) were accepted as valid and retained in the data set.
The turns data will be normalized.
The shape before transformation shows a right skew, so a number of transformations that reduce right skew will be tested: logarithmic base 10, natural logarithmic, and square root transformations.
#set up tiling for histograms
par(mfrow = c(2,2))
#unmodified
hist(games$turns)
#natural log transformation
hist(log(games$turns))
#log(base 10) transformation
hist(log10(games$turns))
#square root transformation
hist(sqrt(games$turns))
#turn off tiling
par(mfrow = c(1,1))
The square root transformation gives the best result, so this transformation will be added to the data set for future analysis.
games %<>% mutate(turns_sqrt = sqrt(turns))
itub. (2019). Which opening requires the longest amount of chess moves? Chess Stack Exchange. https://chess.stackexchange.com/questions/24608/which-opening-requires-the-longest-amount-of-chess-moves
Markushin, Y. (2012). Chess statistics: Top 10 best openings for white and black. https://thechessworld.com/articles/openings/chess-statistics-top-10-best-openings-for-white-and-black/
Mitchell, J. (2017). Chess game dataset(Lichess). https://kaggle.com/datasnaek/chess