BGEN516 - Intro to Stats Thinking Lab

Author
Affiliation

University of Montana

Published

September 23, 2025

Lab Overview

In this lab, I’ll import the games.csv dataset, explore its structure, check for missing values, create a new column, and filter the data to focus on rated games. I’ll also consider whether the dataset represents a sample or a population, and then classify its variables.

Prep Workspace

I’ll be working with the tidyverse in this lab, along with the here package for building file paths and the skimr package for summarizing data. You’ll need to install these packages before you can load them and run the corresponding code in this document.
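
If you’re not sure which of these packages are already installed, a small check like the one below can help. This is only a sketch: missing_packages is a hypothetical helper name I made up for illustration, and the install line is left commented out so nothing gets installed by accident.

```r
# Hypothetical helper: return the packages in `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Packages loaded in this lab
needed <- c("tidyverse", "here", "skimr")
to_install <- missing_packages(needed)
print(to_install)

# Uncomment to install whatever is missing (needs an internet connection).
# if (length(to_install) > 0) install.packages(to_install)
```

An empty result means everything is already available and the library() calls should work as written.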

# Load packages
library(tidyverse)
library(here)
library(skimr)

# Import dataset
chess <- read_csv(here("Week_3", "games.csv"))

Inspect Data

To start, I’ll quickly inspect the data. We have a number of functions available to do this. For example, we can use head(chess) which is included in base R or we can use glimpse(chess) from dplyr in the tidyverse.
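
To show what those two functions return without needing games.csv, here is a minimal sketch on a toy tibble. The column names mirror the chess data, but the values are made up for illustration.

```r
library(dplyr)

# Toy stand-in for the chess data (made-up values, not from games.csv).
toy <- tibble(
  id = c("g1", "g2", "g3"),
  rated = c(TRUE, FALSE, TRUE),
  turns = c(13L, 61L, 95L)
)

# head() returns the first n rows, laid out one row per game.
first_rows <- head(toy, 2)
print(first_rows)

# glimpse() prints one line per column: its name, type, and first values.
glimpse(toy)
```

head() is handy for eyeballing a few records, while glimpse() scales better when a dataset has many columns.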

I’m going to demonstrate the use of the skim function from the skimr package. This function provides a summary of the data, including number of rows and columns in addition to data types. It also provides a quantitative summary for each variable in the dataset.

# Using tidyverse-style workflow - this code is commented out, R won't evaluate it
#chess %>%
#  skim()

# Using a direct function call - this code will be run and its output displayed
skim(chess)
Data summary
Name chess
Number of rows 20058
Number of columns 16
_______________________
Column type frequency:
character 9
logical 1
numeric 6
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
id 0 1 8 8 0 19113 0
victory_status 0 1 4 9 0 4 0
winner 0 1 4 5 0 3 0
increment_code 0 1 3 7 0 400 0
white_id 0 1 2 20 0 9438 0
black_id 0 1 2 20 0 9331 0
moves 0 1 2 1413 0 18920 0
opening_eco 0 1 3 3 0 365 0
opening_name 0 1 9 91 0 1477 0

Variable type: logical

skim_variable n_missing complete_rate mean count
rated 0 1 0.81 TRU: 16155, FAL: 3903

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
created_at 0 1 1.483617e+12 2.850151e+10 1.376772e+12 1.477548e+12 1.49601e+12 1.50317e+12 1.504493e+12 ▁▁▁▂▇
last_move_at 0 1 1.483618e+12 2.850140e+10 1.376772e+12 1.477548e+12 1.49601e+12 1.50317e+12 1.504494e+12 ▁▁▁▂▇
turns 0 1 6.047000e+01 3.357000e+01 1.000000e+00 3.700000e+01 5.50000e+01 7.90000e+01 3.490000e+02 ▇▃▁▁▁
white_rating 0 1 1.596630e+03 2.912500e+02 7.840000e+02 1.398000e+03 1.56700e+03 1.79300e+03 2.700000e+03 ▁▇▇▂▁
black_rating 0 1 1.588830e+03 2.910400e+02 7.890000e+02 1.391000e+03 1.56200e+03 1.78400e+03 2.723000e+03 ▁▇▇▂▁
opening_ply 0 1 4.820000e+00 2.800000e+00 1.000000e+00 3.000000e+00 4.00000e+00 6.00000e+00 2.800000e+01 ▇▂▁▁▁

Relevant to the next prompt in our assignment, the output tells us how many missing values exist in each column. This information is provided under the n_missing header.

Missing Values

If we only wanted to count missing values without inspecting the entire dataset, we could use the colSums function in base R to check for NAs:

# Count missing values in each column
colSums(is.na(chess))
            id          rated     created_at   last_move_at          turns 
             0              0              0              0              0 
victory_status         winner increment_code       white_id   white_rating 
             0              0              0              0              0 
      black_id   black_rating          moves    opening_eco   opening_name 
             0              0              0              0              0 
   opening_ply 
             0 

I can confidently say that there are no missing values in this dataset.
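
Since every count above is zero, it can be useful to see what the same check looks like when a value really is missing. The sketch below uses a toy data frame with one NA that I planted on purpose; the values are made up.

```r
# Toy data frame with one deliberately missing rating (made-up values).
toy <- data.frame(
  white_rating = c(1500, NA, 1610),
  black_rating = c(1400, 1500, 1600)
)

# Missing values per column, same approach as above.
na_counts <- colSums(is.na(toy))
print(na_counts)

# Single TRUE/FALSE answer: is anything missing anywhere?
any_missing <- anyNA(toy)
print(any_missing)
```

anyNA() is a quick yes/no check, while colSums(is.na()) tells you which columns need attention.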

New Column: elo_diff

Next, I’ll create a new column that allows us to examine the difference in Elo ratings between the two players in each game.

Elo is a rating system that represents the skill level of a chess player, where a higher rating indicates a higher skill level.

In chess, each player controls 16 pieces. One player’s pieces are white (or light-colored) and the other’s are black (or dark-colored), which is how the two sides are identified. Our data thus has two relevant variables: white_rating and black_rating. We will compute the difference between the two players’ ratings like so:

elo_diff = white_rating - black_rating

Our new variable elo_diff represents the difference in Elo ratings between the two players in a chess game. Specifically, it tells us the advantage that the white player has over the black player in each game.

  • A positive elo_diff score means the white player was rated higher.

  • A negative elo_diff score means the black player was rated higher.

  • An elo_diff score close to zero means both players were similarly rated.
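
The three cases above can be sketched in code on toy data. Note that the 50-point cutoff for "close to zero" is my own assumption for illustration, as are the rating_edge column name and the ratings themselves; the lab does not define a threshold.

```r
library(dplyr)

# Assumption: treat gaps of 50 points or fewer as "similar" (illustrative only).
similar_cutoff <- 50

# Toy games with made-up ratings.
toy <- tibble(
  white_rating = c(1500, 1200, 1610),
  black_rating = c(1400, 1500, 1600)
) %>%
  mutate(
    elo_diff = white_rating - black_rating,
    # Label each game according to the sign and size of elo_diff.
    rating_edge = case_when(
      abs(elo_diff) <= similar_cutoff ~ "similar",
      elo_diff > 0 ~ "white higher",
      TRUE ~ "black higher"
    )
  )

print(toy)
```

The order of the case_when() conditions matters here: the "similar" check comes first so that small positive or negative gaps are not labeled as an edge for either side.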

Let’s compute elo_diff and find the largest difference.

# Create new column for diff between players
chess <- chess %>%
  mutate(elo_diff = white_rating - black_rating)

# Find the maximum value of the new column
max_elo_diff <- max(chess$elo_diff)

The largest value of elo_diff in the dataset is 1499, meaning the white player out-rated the black player by 1499 points in at least one game. That’s a pretty massive gap in skill! 👀
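
One thing to keep in mind: because elo_diff is signed, max() only finds the biggest gap in White’s favor. If we wanted the biggest gap in either direction, we could look at absolute values instead. Here is a small sketch on made-up numbers, not the lab data.

```r
# Made-up signed rating gaps (white minus black).
elo_diff <- c(100, -300, 10)

# Largest gap in White's favor.
max_white_edge <- max(elo_diff)

# Largest gap in Black's favor (most negative value).
max_black_edge <- min(elo_diff)

# Largest gap regardless of which side was stronger.
max_abs_gap <- max(abs(elo_diff))

# Position of the game with that largest absolute gap.
widest_game <- which.max(abs(elo_diff))

print(c(max_white_edge, max_black_edge, max_abs_gap, widest_game))
```

On the real data, comparing max(chess$elo_diff) with max(abs(chess$elo_diff)) would show whether the most lopsided pairing favored White or Black.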

Filter Data: Rated Games Only

Moving on to my next task, I want to examine rated games.

# Filter to only rated games
rated_games <- chess %>%
  filter(rated == TRUE)

Rated games are games that count toward a player’s rating. By focusing on these games, we’re able to analyze competitive games rather than casual or practice matches.
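
Because rated is already a logical column, the comparison to TRUE is optional. The sketch below, on a toy tibble with made-up values, shows that both forms keep the same rows and that filter() drops rows where the condition is NA.

```r
library(dplyr)

# Toy games, including one with an unknown rated status (made-up values).
toy <- tibble(
  id = c("g1", "g2", "g3", "g4"),
  rated = c(TRUE, FALSE, TRUE, NA)
)

# Explicit comparison, as in the lab code.
rated_a <- toy %>% filter(rated == TRUE)

# Shorthand: a logical column can be used as the condition directly.
rated_b <- toy %>% filter(rated)

print(rated_a)
print(nrow(rated_a))
```

A quick sanity check on the lab data is to confirm that nrow(rated_games) matches the 16155 TRUE values that skim() reported for the rated column.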

Sample or Population?

This dataset is a sample of chess games. It includes games played on lichess.org, an open-source platform for online chess. The dataset is large, but at 20,058 games it does not include all chess games. It excludes games played on other online platforms (e.g., chess.com) and games played in person (e.g., over-the-board tournaments).

Data Types

To wrap up, let’s classify the variables in our dataset. The Markdown table below describes each variable in the dataset, including its statistical data type classification.

| Variable | Description | Classification |
|---|---|---|
| id | Unique game identifier | Nominal |
| rated | Indicates if game affects ratings | Nominal |
| created_at | Date and time game started | Interval |
| last_move_at | Date and time of last move | Interval |
| turns | Total moves made by both players | Ratio |
| victory_status | How the game ended | Nominal |
| winner | Which side won | Nominal |
| increment_code | Time allowed plus extra per move | Nominal |
| white_id | Username of white player | Nominal |
| white_rating | Elo rating of white player | Ratio |
| black_id | Username of black player | Nominal |
| black_rating | Elo rating of black player | Ratio |
| moves | Full move list in chess notation | Nominal |
| opening_eco | Identifies the type of opening move | Nominal |
| opening_name | Name of the chess opening | Nominal |
| opening_ply | Number of opening moves | Ratio |
| elo_diff | Difference between white and black ratings | Interval |
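
The interval classification for created_at and last_move_at is easier to see once the numbers are converted to date-times. The sketch below assumes the values are milliseconds since 1970-01-01 UTC. That is my reading of their magnitude (around 1.5e12), not something the dataset documents here, and the two timestamps are made up rather than taken from games.csv.

```r
# Made-up timestamps in the same numeric format as created_at / last_move_at.
created_at_ms <- 1504493000000
last_move_at_ms <- 1504493600000

# Assumption: values are milliseconds since the Unix epoch (1970-01-01 UTC),
# so divide by 1000 to get seconds before converting.
created_at_dt <- as.POSIXct(created_at_ms / 1000, origin = "1970-01-01", tz = "UTC")
last_move_at_dt <- as.POSIXct(last_move_at_ms / 1000, origin = "1970-01-01", tz = "UTC")

print(created_at_dt)

# Differences between interval-scale values are meaningful: game length in minutes.
game_minutes <- as.numeric(difftime(last_move_at_dt, created_at_dt, units = "mins"))
print(game_minutes)
```

The difference between two timestamps is a meaningful quantity even though the zero point of the scale (the epoch) is arbitrary, which is what makes these variables interval rather than ratio.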

And that wraps up this lab.