BGEN516 - Intro to Stats Thinking Lab

Author

Affiliation

University of Montana

Published

September 23, 2025

Lab Overview

In this lab, I’ll import the games.csv dataset, explore its structure, handle missing values, create a new column, and filter the data to focus on rated games. I’ll also consider whether the dataset represents a sample or a population, and then classify its variables.

Prep Workspace

I’ll be working with the tidyverse in this lab, along with the here package for building file paths and the skimr package for summarizing data. You’ll need to install each package before you can load it and run the corresponding code in this document.

# Load packages
library(tidyverse)
library(here)
library(skimr)

# Import dataset
chess <- read_csv(here("Week_3", "games.csv"))
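If you are not sure whether the packages loaded above are installed, a quick check like the following can help. This is a minimal sketch: the pkgs vector simply repeats the three packages this lab loads, and the code only reports what is missing rather than installing anything.

```r
# Packages this lab loads; adjust the vector if your setup differs
pkgs <- c("tidyverse", "here", "skimr")

# requireNamespace() returns FALSE (instead of erroring) when a package is absent
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  # Print the install command to run once in the console
  message(
    "Install with: install.packages(c(",
    paste0('"', missing_pkgs, '"', collapse = ", "),
    "))"
  )
} else {
  message("All packages are available.")
}
```

Run the printed install.packages() command once in the console, then re-run the library() calls.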

Inspect Data

To start, I’ll quickly inspect the data. We have a number of functions available to do this. For example, we can use head(chess) which is included in base R or we can use glimpse(chess) from dplyr in the tidyverse.
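As a quick illustration of those two functions, here is a sketch on a small made-up tibble. The toy object and its values are hypothetical stand-ins for a few columns of games.csv, not real rows from the dataset.

```r
library(dplyr)

# Toy stand-in for a few columns of games.csv (values are made up)
toy <- tibble(
  rated  = c(TRUE, FALSE, TRUE),
  turns  = c(13, 16, 61),
  winner = c("white", "black", "white")
)

# Base R: show the first rows (here, the first two)
first_rows <- head(toy, n = 2)
first_rows

# dplyr: one line per column with its name, type, and first values
glimpse(toy)
```

head() is handy for eyeballing actual rows, while glimpse() scales better when a dataset has many columns.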

I’m going to demonstrate the use of the skim function from the skimr package. This function provides a summary of the data, including the number of rows and columns and the data type of each column. It also provides summary statistics for each variable in the dataset.

# Using tidyverse-style workflow - this code is commented out, R won't evaluate it
#chess %>%
#  skim()

# Using a direct function call - this code will be run and its output displayed
skim(chess)

Data summary
Name	chess
Number of rows	20058
Number of columns	16
_______________________
Column type frequency:
character	9
logical	1
numeric	6
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
id	1	8	8	19113
victory_status	1	4	9	4
winner	1	4	5	3
increment_code	1	3	7	400
white_id	1	2	20	9438
black_id	1	2	20	9331
moves	1	2	1413	18920
opening_eco	1	3	3	365
opening_name	1	9	91	1477

Variable type: logical

skim_variable	n_missing	complete_rate	mean	count
rated	0	1	0.81	TRU: 16155, FAL: 3903

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
created_at	1	1.483617e+12	2.850151e+10	1.376772e+12	1.477548e+12	1.49601e+12	1.50317e+12	1.504493e+12	▁▁▁▂▇
last_move_at	1	1.483618e+12	2.850140e+10	1.376772e+12	1.477548e+12	1.49601e+12	1.50317e+12	1.504494e+12	▁▁▁▂▇
turns	1	6.047000e+01	3.357000e+01	1.000000e+00	3.700000e+01	5.50000e+01	7.90000e+01	3.490000e+02	▇▃▁▁▁
white_rating	1	1.596630e+03	2.912500e+02	7.840000e+02	1.398000e+03	1.56700e+03	1.79300e+03	2.700000e+03	▁▇▇▂▁
black_rating	1	1.588830e+03	2.910400e+02	7.890000e+02	1.391000e+03	1.56200e+03	1.78400e+03	2.723000e+03	▁▇▇▂▁
opening_ply	1	4.820000e+00	2.800000e+00	1.000000e+00	3.000000e+00	4.00000e+00	6.00000e+00	2.800000e+01	▇▂▁▁▁

Relevant to the next prompt in our assignment, the output tells us how many missing values exist in each column. This information is provided under the n_missing header, and a complete_rate of 1 likewise indicates that a column has no missing values.

Missing Values

If we only wanted to count missing values without inspecting the entire dataset, we could use the colSums function in base R to check for NAs:

# Count missing values in each column
colSums(is.na(chess))

            id          rated     created_at   last_move_at          turns 
             0              0              0              0              0 
victory_status         winner increment_code       white_id   white_rating 
             0              0              0              0              0 
      black_id   black_rating          moves    opening_eco   opening_name 
             0              0              0              0              0 
   opening_ply 
             0

I can confidently say that there are no missing values in this dataset.
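For readers who prefer to stay in the tidyverse, the same count can be sketched with summarise() and across(). The toy tibble below is hypothetical and deliberately contains NAs so the counts are non-zero; on the real chess data every count would be 0, matching the output above.

```r
library(dplyr)

# Made-up data with some missing values, for illustration only
toy <- tibble(
  white_rating = c(1500, NA, 1322),
  black_rating = c(1191, 1261, NA),
  winner       = c("white", "black", NA)
)

# Count NAs per column: one row, one column per variable
na_counts <- toy %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

na_counts

# Single TRUE/FALSE answer: is anything missing anywhere?
anyNA(toy)
```

anyNA() is a convenient one-line check when you only need to know whether any value is missing at all.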

New Column: elo_diff

Next, I’ll create a new column that allows us to examine the difference in Elo ratings between players in the dataset.

Elo is a rating system that represents the skill level of a chess player, where a higher rating indicates a higher skill level.

In chess, each player controls 16 pieces. One player controls the white (light-colored) pieces and the other the black (dark-colored) pieces, and these colors identify the two sides. Our data thus has two relevant variables: white_rating and black_rating. We will compute the difference between the two players like so:

elo_diff = white_rating - black_rating

Our new variable elo_diff represents the difference in Elo ratings between the two players in a chess game. Specifically, it tells us the advantage that the white player has over the black player in each game.

A positive elo_diff score means the white player was rated higher.
A negative elo_diff score means the black player was rated higher.
An elo_diff score close to zero means both players were similarly rated.
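The sign rules above can be sketched with case_when() on a few made-up ratings. The toy tibble and the higher_rated column are hypothetical names used only for this illustration.

```r
library(dplyr)

# Made-up ratings for three games
toy <- tibble(
  white_rating = c(1500, 1322, 1496),
  black_rating = c(1191, 1261, 1500)
)

toy <- toy %>%
  mutate(
    elo_diff = white_rating - black_rating,
    # Translate the sign of elo_diff into a readable label
    higher_rated = case_when(
      elo_diff > 0 ~ "white",
      elo_diff < 0 ~ "black",
      TRUE         ~ "equal"
    )
  )

toy
```

In the third game the difference is only -4, so Black is nominally higher rated even though the players are very closely matched.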

Let’s compute elo_diff and find the largest difference.

# Create new column for diff between players
chess <- chess %>%
  mutate(elo_diff = white_rating - black_rating)

# Find the maximum value of the new column
max_elo_diff <- max(chess$elo_diff)

The largest Elo difference present in the dataset, measured in White’s favor since max() returns the largest signed value, is 1499. That’s a pretty massive gap in skill! 👀
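Because elo_diff is signed, max() only captures gaps where White is the stronger player. If the question is the biggest mismatch in either direction, one option is to look at the absolute value. The sketch below uses made-up ratings, not rows from games.csv.

```r
library(dplyr)

# Made-up ratings; the last game has a large gap in Black's favor
toy <- tibble(
  white_rating = c(1500, 1322, 1496, 900),
  black_rating = c(1191, 1261, 1500, 2100)
) %>%
  mutate(elo_diff = white_rating - black_rating)

max_signed <- max(toy$elo_diff)       # largest gap in White's favor
min_signed <- min(toy$elo_diff)       # largest gap in Black's favor (most negative)
max_abs    <- max(abs(toy$elo_diff))  # largest gap in either direction

# Pull the game(s) with the largest absolute gap
biggest_mismatch <- toy %>%
  slice_max(abs(elo_diff), n = 1)

biggest_mismatch
```

In this toy data max() reports 309, while the largest mismatch overall is the 1200-point gap in Black's favor.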

Filter Data: Rated Games Only

Moving on to my next task, I want to examine rated games.

# Filter to only rated games
rated_games <- chess %>%
  filter(rated == TRUE)

Rated games are games that count toward a player’s rating. By focusing on these games, we’re able to analyze competitive games, rather than casual or practice matches.
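A quick sanity check after filtering is to compare the row count of the filtered data with the number of TRUE values in rated. The sketch below uses a hypothetical four-row tibble. On the real data, the skim output earlier reports 16155 rated games, so rated_games should have that many rows.

```r
library(dplyr)

# Made-up games: three rated, one casual
toy <- tibble(
  id    = c("g1", "g2", "g3", "g4"),
  rated = c(TRUE, FALSE, TRUE, TRUE)
)

# For a logical column, filter(rated) keeps the same rows as filter(rated == TRUE)
rated_toy <- toy %>%
  filter(rated)

# Tally rated vs. unrated games in the unfiltered data
rated_counts <- toy %>%
  count(rated)

rated_counts

# The filtered row count should equal the number of TRUE values
nrow(rated_toy) == sum(toy$rated)
```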

Sample or Population?

This dataset is a sample of chess games. The dataset includes chess games played on lichess.org, an open-source platform for online chess. The dataset is large, but it does not include all chess games. It excludes chess games played on other online platforms (e.g., chess.com) and chess games played in person (e.g., over-the-board tournaments).

Data Types

To wrap up, let’s classify the variables in our dataset. The Markdown table below describes each variable in the dataset, including its statistical data type classification.

Variable	Description	Classification
id	Unique game identifier	Nominal
rated	Indicates if game affects ratings	Nominal
created_at	Date and time game started	Interval
last_move_at	Date and time of last move	Interval
turns	Total moves made by both players	Ratio
victory_status	How the game ended	Nominal
winner	Which side won	Nominal
increment_code	Time allowed plus extra per move	Nominal
white_id	Username of white player	Nominal
white_rating	Elo rating of white player	Interval
black_id	Username of black player	Nominal
black_rating	Elo rating of black player	Interval
moves	Full move list in chess notation	Nominal
opening_eco	ECO (Encyclopaedia of Chess Openings) code classifying the opening	Nominal
opening_name	Name of the chess opening	Nominal
opening_ply	Number of plies (half-moves) in the opening phase	Ratio
elo_diff	Difference between white and black ratings	Interval
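These classifications can also inform how the columns are stored in R. The sketch below assumes that created_at holds Unix timestamps in milliseconds, which is consistent with the magnitudes in the skim output (around 1.5e12) but is not confirmed by the source. The toy values are made up.

```r
library(dplyr)

# Made-up rows mimicking three columns of games.csv
toy <- tibble(
  created_at = c(1.504210e12, 1.504130e12),  # assumed: milliseconds since 1970-01-01 UTC
  winner     = c("white", "black"),
  turns      = c(13, 16)
)

typed <- toy %>%
  mutate(
    # Interval: convert milliseconds to seconds, then to a date-time
    created_at = as.POSIXct(created_at / 1000, origin = "1970-01-01", tz = "UTC"),
    # Nominal: store categories as a factor
    winner = factor(winner),
    # Ratio (a count): store as an integer
    turns = as.integer(turns)
  )

glimpse(typed)
```

Storing nominal variables as factors and timestamps as date-times helps functions like skim() summarize each column in a way that matches its classification.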

And that wraps up this lab.