# Load packages
library(tidyverse)
library(here)
library(skimr)
# Import dataset
chess <- read_csv(here("Week_3", "games.csv"))
BGEN516 - Intro to Stats Thinking Lab
Lab Overview
In this lab, I’ll import the games.csv dataset, explore its structure, check for missing values, create a new column, and filter the data to focus on rated games. I’ll also consider whether the dataset represents a sample or a population, and then classify its variables.
Prep Workspace
I’ll be working with the tidyverse in this lab. I’m also going to work with the skimr package. You’ll need to install the package if you want to load it and run the corresponding code in this document.
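A minimal sketch of checking that the packages loaded in this lab are available before running the rest of the document. It only reports what is missing rather than installing anything, and the names pkgs, installed, and missing_pkgs are illustrative, not part of the lab.

```r
# Packages this lab loads
pkgs <- c("tidyverse", "here", "skimr")

# requireNamespace() returns TRUE if a package can be loaded, without attaching it
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Install before continuing: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All packages are available.")
}
```

Anything reported as missing can be installed once with install.packages(), for example install.packages("skimr").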
Inspect Data
To start, I’ll quickly inspect the data. Several functions can do this: for example, head(chess), which ships with base R, or glimpse(chess) from dplyr in the tidyverse.
I’m going to demonstrate the skim function from the skimr package. It summarizes the data, reporting the number of rows and columns, the data type of each column, and summary statistics for each variable.
# Using tidyverse-style workflow - this code is commented out, R won't evaluate it
#chess %>%
# skim()
# Using a direct function call - this code will be run and its output displayed
skim(chess)
Name | chess |
Number of rows | 20058 |
Number of columns | 16 |
_______________________ | |
Column type frequency: | |
character | 9 |
logical | 1 |
numeric | 6 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
id | 0 | 1 | 8 | 8 | 0 | 19113 | 0 |
victory_status | 0 | 1 | 4 | 9 | 0 | 4 | 0 |
winner | 0 | 1 | 4 | 5 | 0 | 3 | 0 |
increment_code | 0 | 1 | 3 | 7 | 0 | 400 | 0 |
white_id | 0 | 1 | 2 | 20 | 0 | 9438 | 0 |
black_id | 0 | 1 | 2 | 20 | 0 | 9331 | 0 |
moves | 0 | 1 | 2 | 1413 | 0 | 18920 | 0 |
opening_eco | 0 | 1 | 3 | 3 | 0 | 365 | 0 |
opening_name | 0 | 1 | 9 | 91 | 0 | 1477 | 0 |
Variable type: logical
skim_variable | n_missing | complete_rate | mean | count |
---|---|---|---|---|
rated | 0 | 1 | 0.81 | TRU: 16155, FAL: 3903 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
created_at | 0 | 1 | 1.483617e+12 | 2.850151e+10 | 1.376772e+12 | 1.477548e+12 | 1.49601e+12 | 1.50317e+12 | 1.504493e+12 | ▁▁▁▂▇ |
last_move_at | 0 | 1 | 1.483618e+12 | 2.850140e+10 | 1.376772e+12 | 1.477548e+12 | 1.49601e+12 | 1.50317e+12 | 1.504494e+12 | ▁▁▁▂▇ |
turns | 0 | 1 | 6.047000e+01 | 3.357000e+01 | 1.000000e+00 | 3.700000e+01 | 5.50000e+01 | 7.90000e+01 | 3.490000e+02 | ▇▃▁▁▁ |
white_rating | 0 | 1 | 1.596630e+03 | 2.912500e+02 | 7.840000e+02 | 1.398000e+03 | 1.56700e+03 | 1.79300e+03 | 2.700000e+03 | ▁▇▇▂▁ |
black_rating | 0 | 1 | 1.588830e+03 | 2.910400e+02 | 7.890000e+02 | 1.391000e+03 | 1.56200e+03 | 1.78400e+03 | 2.723000e+03 | ▁▇▇▂▁ |
opening_ply | 0 | 1 | 4.820000e+00 | 2.800000e+00 | 1.000000e+00 | 3.000000e+00 | 4.00000e+00 | 6.00000e+00 | 2.800000e+01 | ▇▂▁▁▁ |
Relevant to the next prompt in our assignment, the output also tells us how many missing values exist in each column. This information appears under the n_missing header.
Missing Values
If we only wanted to count missing values without summarizing the entire dataset, we could combine is.na with the colSums function in base R:
# Count missing values in each column
colSums(is.na(chess))
            id          rated     created_at   last_move_at          turns
             0              0              0              0              0
victory_status         winner increment_code       white_id   white_rating
             0              0              0              0              0
      black_id   black_rating          moves    opening_eco   opening_name
             0              0              0              0              0
   opening_ply
             0
I can confidently say that there are no missing values in this dataset.
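Since the chess data has no NAs, the counts above are all zero. As a minimal sketch of what the same check looks like when values are missing, here is a made-up data frame (hypothetical values, not taken from games.csv) with one NA per column:

```r
# Toy data frame with deliberately missing values (not the chess data)
toy <- data.frame(
  white_rating = c(1500, NA, 1700),
  black_rating = c(1480, 1620, NA),
  winner       = c("white", NA, "black")
)

# is.na() marks each cell TRUE/FALSE; colSums() counts the TRUEs per column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Total number of missing cells across the whole data frame
total_missing <- sum(is.na(toy))

# Rows with no missing values in any column
complete_rows <- toy[complete.cases(toy), ]
```

Here each column reports one missing value, and only the first row is fully observed.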
New Column: elo_diff
Next, I’ll create a new column that lets us examine the difference in Elo ratings between the two players in each game.
Elo is a rating system that represents the skill level of a chess player, where a higher rating indicates a higher skill level.
In chess, each player controls 16 pieces. One player controls the white (light-colored) pieces and the other the black (dark-colored) pieces, and the colors identify the two sides. Our data thus has two relevant variables: white_rating and black_rating. We will compute the difference between the two players like so:
elo_diff = white_rating - black_rating
Our new variable elo_diff represents the difference in Elo ratings between the two players in a chess game. Specifically, it tells us the rating advantage that the white player has over the black player in each game.
- A positive elo_diff score means the white player was rated higher.
- A negative elo_diff score means the black player was rated higher.
- An elo_diff score close to zero means both players were similarly rated.
Let’s compute elo_diff and find the largest difference.
# Create new column for diff between players
chess <- chess %>%
  mutate(elo_diff = white_rating - black_rating)
# Find the maximum value of the new column
max_elo_diff <- max(chess$elo_diff)
The largest value of elo_diff in the dataset is 1499, meaning white out-rated black by 1,499 points in at least one game. That’s a pretty massive gap in skill! 👀
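Because elo_diff is signed, max() returns the largest gap in white's favor, which is not necessarily the largest gap in either direction. A minimal sketch on made-up ratings (hypothetical values, not taken from games.csv) shows the distinction. The column abs_elo_diff and the objects max_signed and max_gap are illustrative names, not part of the lab.

```r
library(dplyr)

# Toy ratings (hypothetical values, not the chess data)
toy <- tibble(
  white_rating = c(1500, 1200, 2100),
  black_rating = c(1450, 2000, 2050)
)

toy <- toy %>%
  mutate(
    elo_diff     = white_rating - black_rating,  # signed: positive favors white
    abs_elo_diff = abs(elo_diff)                 # unsigned size of the gap
  )

max_signed <- max(toy$elo_diff)      # largest gap in white's favor
max_gap    <- max(toy$abs_elo_diff)  # largest gap in either direction
```

In this toy example the two answers differ because the biggest mismatch favors black. Running max(abs(chess$elo_diff)) on the real data would show whether the same is true there.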
Filter Data: Rated Games Only
Moving on to my next task, I want to examine rated games.
# Filter to only rated games
rated_games <- chess %>%
  filter(rated == TRUE)
Rated games are games that count toward a player’s rating. By focusing on them, we analyze competitive games rather than casual or practice matches.
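As a minimal sketch of how this filter behaves, the toy tibble below (hypothetical values, not the chess data) includes an NA to show that filter() keeps only rows where the condition is TRUE. Since rated is already logical, filter(rated) is an equivalent shorthand for filter(rated == TRUE).

```r
library(dplyr)

# Toy games with a logical 'rated' column, including one missing value
toy <- tibble(
  id    = c("a", "b", "c", "d"),
  rated = c(TRUE, FALSE, TRUE, NA)
)

# Explicit comparison, as used in the lab
rated_explicit <- toy %>% filter(rated == TRUE)

# Shorthand: a logical column can be used as the condition directly
rated_short <- toy %>% filter(rated)

# Both keep rows "a" and "c"; FALSE and NA rows are dropped
nrow(rated_explicit)
nrow(rated_short)
```

For the chess data, the skim output above reports 16155 TRUE values for rated, so nrow(rated_games) should match that count.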
Sample or Population?
This dataset is a sample of chess games. It includes games played on lichess.org, an open-source platform for online chess. The dataset is large (20,058 rows), but it does not include all chess games. It excludes games played on other online platforms (e.g., chess.com) and games played in person (e.g., over-the-board tournaments). It is also only a fraction of the games played on lichess.org itself.
Data Types
To wrap up, let’s classify the variables in our dataset. The Markdown table below describes each variable in the dataset, including its statistical data type classification.
Variable | Description | Classification |
---|---|---|
id | Unique game identifier | Nominal |
rated | Indicates if game affects ratings | Nominal |
created_at | Date and time game started | Interval |
last_move_at | Date and time of last move | Interval |
turns | Total moves made by both players | Ratio |
victory_status | How the game ended | Nominal |
winner | Which side won | Nominal |
increment_code | Time allowed plus extra per move | Nominal |
white_id | Username of white player | Nominal |
white_rating | Elo rating of white player | Ratio |
black_id | Username of black player | Nominal |
black_rating | Elo rating of black player | Ratio |
moves | Full move list in chess notation | Nominal |
opening_eco | ECO (Encyclopaedia of Chess Openings) code for the opening | Nominal |
opening_name | Name of the chess opening | Nominal |
opening_ply | Number of opening moves | Ratio |
elo_diff | Difference between white and black ratings | Interval |
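The classifications above can guide how each column is stored and used in R. The sketch below uses made-up rows (hypothetical values, not taken from games.csv) and assumes that created_at holds Unix epoch time in milliseconds, which the magnitude of the values in the skim output suggests but the lab does not confirm.

```r
# Hypothetical rows mimicking three columns of the chess data
toy <- data.frame(
  winner     = c("white", "black", "draw"),
  created_at = c(1.504210e12, 1.504130e12, 1.504130e12),
  turns      = c(13, 16, 61)
)

# Nominal: store as a factor so R treats the values as unordered categories
toy$winner <- factor(toy$winner)

# Interval: ASSUMING epoch milliseconds, convert to seconds and then to a date-time
toy$created_at <- as.POSIXct(toy$created_at / 1000, origin = "1970-01-01", tz = "UTC")

# Ratio: counts have a true zero, so ratios between values are meaningful
turn_ratio <- toy$turns[3] / toy$turns[1]

str(toy)
```

Differences between two date-times are meaningful (interval), but a statement like "twice as late" is not, whereas "about 4.7 times as many turns" is a valid ratio comparison.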
And that wraps up this lab.