BGEN516 - Descriptive Stats Lab

Author

Affiliation

Olivia B. Newton

University of Montana

Published

September 25, 2025

Lab Overview

In this lab, I’ll work with Olympic athlete data to demonstrate the computation of descriptive statistics.

Prep Workspace

As usual, I’ll be working with the tidyverse and here packages. Additionally, I’ll use functions from the knitr and gt packages, which offer enhanced formatting options for tables.

# Load required packages
library(tidyverse)
library(here)
library(knitr)
library(gt)

# Import the dataset
olympics <- read_csv(here("olympics.csv"))

First, I’ll take a look at the data:

# Quick inspect
glimpse(olympics)

Rows: 271,116
Columns: 15
$ id     <dbl> 1, 2, 3, 4, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, …
$ name   <chr> "A Dijiang", "A Lamusi", "Gunnar Nielsen Aaby", "Edgar Lindenau…
$ sex    <chr> "M", "M", "M", "M", "F", "F", "F", "F", "F", "F", "M", "M", "M"…
$ age    <dbl> 24, 23, 24, 34, 21, 21, 25, 25, 27, 27, 31, 31, 31, 31, 33, 33,…
$ height <dbl> 180, 170, NA, NA, 185, 185, 185, 185, 185, 185, 188, 188, 188, …
$ weight <dbl> 80, 60, NA, NA, 82, 82, 82, 82, 82, 82, 75, 75, 75, 75, 75, 75,…
$ team   <chr> "China", "China", "Denmark", "Denmark/Sweden", "Netherlands", "…
$ noc    <chr> "CHN", "CHN", "DEN", "DEN", "NED", "NED", "NED", "NED", "NED", …
$ games  <chr> "1992 Summer", "2012 Summer", "1920 Summer", "1900 Summer", "19…
$ year   <dbl> 1992, 2012, 1920, 1900, 1988, 1988, 1992, 1992, 1994, 1994, 199…
$ season <chr> "Summer", "Summer", "Summer", "Summer", "Winter", "Winter", "Wi…
$ city   <chr> "Barcelona", "London", "Antwerpen", "Paris", "Calgary", "Calgar…
$ sport  <chr> "Basketball", "Judo", "Football", "Tug-Of-War", "Speed Skating"…
$ event  <chr> "Basketball Men's Basketball", "Judo Men's Extra-Lightweight", …
$ medal  <chr> NA, NA, NA, "Gold", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…

This glimpse of the data shows us that each row, or observation, represents a single athlete. However, a closer look at the name column reveals that some athletes appear multiple times in the dataset.

Filtering Data: Olympics After 1991

My next task is to filter the data to include only Olympic Games that took place after 1991. Why 1991? It marks the end of the Cold War and a significant shift in global geopolitics. By focusing on Olympic data from this period onward, we can explore how geopolitical shifts, newly independent nations, and changing global dynamics influenced athlete participation and performance.

# Filter for data after 1991
olympics_post_1991 <- olympics %>%
  filter(year > 1991)

# Confirm filtering
summary(olympics_post_1991$year)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1992    1996    2004    2004    2012    2016

The output of summary(olympics_post_1991$year) confirms that we’ve only kept data from Olympic games that took place after 1991. Specifically, the minimum value for the year column is 1992.

Summarizing Athlete Data: The Physical Profile of Modern Olympians

Next, I want to zoom in on unique athlete profiles. I’ll do this by removing duplicate combinations of ID, age, height, and weight, capturing only distinct records of each athlete’s physical traits. By summarizing these key characteristics through summary statistics, we generate an overview of the physiques of Olympic athletes competing after the Cold War era.

# Subset unique athlete data
unique_athletes <- olympics_post_1991 %>%
  select(id, age, height, weight) %>%
  distinct()

# Quick inspect
head(unique_athletes)

# A tibble: 6 × 4
     id   age height weight
  <dbl> <dbl>  <dbl>  <dbl>
1     1    24    180     80
2     2    23    170     60
3     5    25    185     82
4     5    27    185     82
5     6    31    188     75
6     6    33    188     75

Did you notice the repeated values in the id column? Some athletes compete in multiple Olympic Games over the years, which is reflected in the age column. For example, the athlete with ID 5 competed in the Olympics at age 25 and again at age 27.

Now let’s use statistics to summarize the physical attributes of these athletes.

# Summary statistics for athletes
athlete_summary <- unique_athletes %>%
  summarise(
    mean_age = mean(age, na.rm = TRUE),
    median_age = median(age, na.rm = TRUE),
    sd_age = sd(age, na.rm = TRUE),
    iqr_age = IQR(age, na.rm = TRUE),
    min_age = min(age, na.rm = TRUE),
    max_age = max(age, na.rm = TRUE),
    
    mean_height = mean(height, na.rm = TRUE),
    median_height = median(height, na.rm = TRUE),
    sd_height = sd(height, na.rm = TRUE),
    iqr_height = IQR(height, na.rm = TRUE),
    min_height = min(height, na.rm = TRUE),
    max_height = max(height, na.rm = TRUE),
    
    mean_weight = mean(weight, na.rm = TRUE),
    median_weight = median(weight, na.rm = TRUE),
    sd_weight = sd(weight, na.rm = TRUE),
    iqr_weight = IQR(weight, na.rm = TRUE),
    min_weight = min(weight, na.rm = TRUE),
    max_weight = max(weight, na.rm = TRUE)
  )

I could display the summary statistics by simply adding a line of code with the name of my new data frame athlete_summary. Alternatively, I can use kable() to create a table.

athlete_summary %>%
  kable()

mean_age	median_age	sd_age	iqr_age	min_age	max_age	mean_height	median_height	sd_height	iqr_height	min_height	max_height	mean_weight	median_weight	sd_weight	iqr_weight	min_weight	max_weight
25.77801	25	5.276563	7	11	71	176.794	177	10.78911	15	133	226	72.56303	70	15.6738	21	28	214

The table in its current form is quite wide and difficult to read. It may be easier to interpret in a long format instead.

Recall from Week 3 that we discussed long and wide data formats, as well as how to reshape data using pivot_longer() and pivot_wider() from the tidyr package.

In the code below, I’ll first convert the data from wide to long format. Then, I’ll clean and organize the resulting data frame using mutate(), select(), and arrange().

# Reshape from wide to long
athlete_long <- athlete_summary %>%
  pivot_longer(
    cols = everything(),           # Take all columns and gather them into key-value pairs
    names_to = c("stat", "trait"), # Split column names 2 new columns: 'stat' and 'trait'
    names_sep = "_"                # Use underscore to separate the original column names
    ) %>%
  # Make the 'stat' and 'trait' names look nicer by capitalizing each word
  mutate(
    # Tools package is provided in base R (you don't need to install it)
    stat = if_else(stat %in% c("sd", "iqr"),  # Make "sd" and "iqr" fully uppercase
                   toupper(stat),         
                   tools::toTitleCase(stat) # Title case all other stats
                   ),
    trait = tools::toTitleCase(trait)
    ) %>%
  # Select columns in the order: Trait, Statistic, Value
  select(trait, stat, value) %>%
  # Arrange rows alphabetically by Trait and then by Statistic
  arrange(trait, stat)

# Inspect first few rows
head(athlete_long)

# A tibble: 6 × 3
  trait stat   value
  <chr> <chr>  <dbl>
1 Age   IQR     7   
2 Age   Max    71   
3 Age   Mean   25.8 
4 Age   Median 25   
5 Age   Min    11   
6 Age   SD      5.28

Great, the data is in long format and values are nicely formatted.

athlete_long %>%
  kable()

trait	stat	value
Age	IQR	7.000000
Age	Max	71.000000
Age	Mean	25.778005
Age	Median	25.000000
Age	Min	11.000000
Age	SD	5.276563
Height	IQR	15.000000
Height	Max	226.000000
Height	Mean	176.793958
Height	Median	177.000000
Height	Min	133.000000
Height	SD	10.789109
Weight	IQR	21.000000
Weight	Max	214.000000
Weight	Mean	72.563029
Weight	Median	70.000000
Weight	Min	28.000000
Weight	SD	15.673798

While the table is no longer wide, it has become quite long and remains difficult to read. This format still impedes interpretation and discussion of the summary statistics of each individual variable. But now that I have the data cleaned and formatted, I can easily examine each variable on its own by filtering the data and displaying the data subset in a nicely formatted table.

Athlete Age

# Subset age stats
age_table <- athlete_long %>%
  filter(trait == "Age") %>%
  select(Statistic = stat, Value = value)

# Display age stats as table
age_table %>%
  kable(
    format = "html",
    caption = "Summary Statistics: Age",
    digits = 2,
    align = c("l", "r")
  )

Summary Statistics: Age
Statistic	Value
IQR	7.00
Max	71.00
Mean	25.78
Median	25.00
Min	11.00
SD	5.28

I could spend time adjusting this code to customize the formatting and make the table appear more compact in the rendered document. For example, I might explore additional formatting options for kable in the kableExtra package. Alternatively, I can utilize the gt package.

The gt package works well with Quarto because it creates tables that already include all necessary styling. These tables do not rely on extra style files or external programs to look good. This means the tables display correctly and consistently, without running into issues caused by missing or conflicting styles.

age_table %>%
  gt() %>%
  tab_header(title = "Athlete Age") %>%
  fmt_number(columns = c(Value), decimals = 2)

Athlete Age
Statistic	Value
IQR	7.00
Max	71.00
Mean	25.78
Median	25.00
Min	11.00
SD	5.28

This table neatly summarizes the distribution of the age variable. The mean and median ages are close (25.8 and 25), which may indicate a roughly symmetric distribution, though this alone is not conclusive. The wide range of values, from 11 to 71 years, suggests potential skewness or the presence of outliers, which would require further exploration using visualizations or additional statistics.

Athlete Height

# Subset height stats
height_table <- athlete_long %>%
  filter(trait == "Height") %>%
  select(Statistic = stat, Value = value)

# Display height stats as table
height_table %>%
  gt() %>%
  tab_header(title = "Athlete Height") %>%
  fmt_number(columns = c(Value), decimals = 2)

Athlete Height
Statistic	Value
IQR	15.00
Max	226.00
Mean	176.79
Median	177.00
Min	133.00
SD	10.79

This table neatly summarizes the distribution of the height variable. The mean and median heights are nearly identical (176.8 and 177), suggesting the distribution is approximately symmetric. The values range from 133 to 226 cm, with an interquartile range of 15 and a standard deviation of 10.79, indicating a moderate spread.

Athlete Weight

# Subset weight stats
weight_table <- athlete_long %>%
  filter(trait == "Weight") %>%
  select(Statistic = stat, Value = value)

weight_table %>%
  gt() %>%
  tab_header(title = "Athlete Weight") %>%
  fmt_number(columns = c(Value), decimals = 2)

Athlete Weight
Statistic	Value
IQR	21.00
Max	214.00
Mean	72.56
Median	70.00
Min	28.00
SD	15.67

This table neatly summarizes the distribution of the weight variable. Although the mean and median weights are close (72.6 and 70), the large maximum value of 214 suggests a right-skewed distribution or the presence of high outliers. The interquartile range is 21, and the standard deviation is 15.67, showing considerable variability in weights.

Summarizing Years: Top Olympic Medal Winners

Next, I want to compute statistics for each Olympic year. Specifically, I’m interested in the distribution of medals awarded in Olympic games.

# Filter for medal winners
medalists <- olympics_post_1991 %>%
  filter(!is.na(medal))

# Group and summarize
yearly_summary <- medalists %>%
  group_by(year) %>%
  summarise(
    total_medals = n(),
    mean_medals_per_country = n() / n_distinct(noc),
    top_country = names(sort(table(noc), decreasing = TRUE)[1]),
    top_country_medals = max(table(noc))
  )

# Create table
yearly_summary %>%
  gt() %>%
  cols_label(
    year = "Year",
    total_medals = "Total Medals",
    mean_medals_per_country = "Mean Medals per Country",
    top_country = "Top Country",
    top_country_medals = "Top Country Medal Count"
    ) %>%
  fmt_number(
    columns = mean_medals_per_country,
    decimals = 2
    )

Year	Total Medals	Mean Medals per Country	Top Country	Top Country Medal Count
1992	2030	31.23	EUN	279
1994	331	15.05	GER	40
1996	1842	23.32	USA	259
1998	440	18.33	FIN	58
2000	2004	25.05	USA	242
2002	478	19.92	USA	84
2004	2001	27.04	USA	263
2006	526	20.23	CAN	69
2008	2048	23.81	USA	317
2010	520	20.00	USA	97
2012	1941	22.84	USA	248
2014	597	22.96	CAN	86
2016	2023	23.52	USA	264

Based on the table above, the United States clearly dominates the Olympics in terms of medal counts during the post-1991 period. The U.S. appears as the top medal-winning country in 8 out of the 13 Olympic years listed. Notably, the U.S. reached its peak in 2008 with 317 medals.

Interestingly, EUN led in 1992 with 279 medals, which reflects the temporary coalition of former Soviet republics after the USSR dissolved.

Comparing Seasons: Differences between Summer and Winter

Lastly, I want to compute statistics to examine differences between the Summer Olympic games and the Winter Olympics games.

# Summarize medals by season and country
season_country_summary <- medalists %>%
  group_by(season, noc) %>%
  summarise(medals = n(), .groups = 'drop')

# Summary statistics
season_comparison <- season_country_summary %>%
  group_by(season) %>%
  summarise(
    mean_medals = mean(medals),
    median_medals = median(medals),
    sd_medals = sd(medals)
  )

# Create table
season_comparison %>%
  gt() %>%
  cols_label(
    season = "Season",
    mean_medals = "Mean Medals",
    median_medals = "Median Medals",
    sd_medals = "Standard Deviation"
    ) %>%
  fmt_number(
    columns = c(mean_medals, median_medals, sd_medals),
    decimals = 2
    )

Season	Mean Medals	Median Medals	Standard Deviation
Summer	106.86	17.00	238.02
Winter	86.76	25.00	115.14

This table above medal counts by season and highlights key differences in how medals are distributed across countries in the Summer and Winter Olympics.

While the mean number of medals is slightly higher in the Summer Olympics (106.86) compared to the Winter Olympics (86.76), the standard deviation is much larger in the Summer data (238.02 vs. 115.14). This suggests that medal distribution in the Summer Games is more uneven, with a few countries winning a disproportionately large share of the medals. In contrast, the Winter Games have a more balanced distribution, as indicated by a higher median (25 vs. 17) and lower variability.

In summary, while both seasons show variation, Summer Olympics are more dominated by top-performing countries, whereas Winter Games tend to have a more even spread of medals among participating nations.