# Load required packages
library(tidyverse)
library(here)
library(knitr)
library(gt)
# Import the dataset
<- read_csv(here("olympics.csv")) olympics
BGEN516 - Descriptive Stats Lab
Lab Overview
In this lab, I’ll work with Olympic athlete data to demonstrate the computation of descriptive statistics.
Prep Workspace
As usual, I’ll be working with the tidyverse and here packages. Additionally, I’ll use functions from the knitr
and gt
packages, which offer enhanced formatting options for tables.
First, I’ll take a look at the data:
# Quick inspect
glimpse(olympics)
Rows: 271,116
Columns: 15
$ id <dbl> 1, 2, 3, 4, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, …
$ name <chr> "A Dijiang", "A Lamusi", "Gunnar Nielsen Aaby", "Edgar Lindenau…
$ sex <chr> "M", "M", "M", "M", "F", "F", "F", "F", "F", "F", "M", "M", "M"…
$ age <dbl> 24, 23, 24, 34, 21, 21, 25, 25, 27, 27, 31, 31, 31, 31, 33, 33,…
$ height <dbl> 180, 170, NA, NA, 185, 185, 185, 185, 185, 185, 188, 188, 188, …
$ weight <dbl> 80, 60, NA, NA, 82, 82, 82, 82, 82, 82, 75, 75, 75, 75, 75, 75,…
$ team <chr> "China", "China", "Denmark", "Denmark/Sweden", "Netherlands", "…
$ noc <chr> "CHN", "CHN", "DEN", "DEN", "NED", "NED", "NED", "NED", "NED", …
$ games <chr> "1992 Summer", "2012 Summer", "1920 Summer", "1900 Summer", "19…
$ year <dbl> 1992, 2012, 1920, 1900, 1988, 1988, 1992, 1992, 1994, 1994, 199…
$ season <chr> "Summer", "Summer", "Summer", "Summer", "Winter", "Winter", "Wi…
$ city <chr> "Barcelona", "London", "Antwerpen", "Paris", "Calgary", "Calgar…
$ sport <chr> "Basketball", "Judo", "Football", "Tug-Of-War", "Speed Skating"…
$ event <chr> "Basketball Men's Basketball", "Judo Men's Extra-Lightweight", …
$ medal <chr> NA, NA, NA, "Gold", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
This glimpse of the data shows us that each row, or observation, represents a single athlete. However, a closer look at the name
column reveals that some athletes appear multiple times in the dataset.
Filtering Data: Olympics After 1991
My next task is to filter the data to include only Olympic Games that took place after 1991. Why 1991? It marks the end of the Cold War and a significant shift in global geopolitics. By focusing on Olympic data from this period onward, we can explore how geopolitical shifts, newly independent nations, and changing global dynamics influenced athlete participation and performance.
# Filter for data after 1991
<- olympics %>%
olympics_post_1991 filter(year > 1991)
# Confirm filtering
summary(olympics_post_1991$year)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1992 1996 2004 2004 2012 2016
The output of summary(olympics_post_1991$year)
confirms that we’ve only kept data from Olympic games that took place after 1991. Specifically, the minimum value for the year
column is 1992.
Summarizing Athlete Data: The Physical Profile of Modern Olympians
Next, I want to zoom in on unique athlete profiles. I’ll do this by removing duplicate combinations of ID, age, height, and weight, capturing only distinct records of each athlete’s physical traits. By summarizing these key characteristics through summary statistics, we generate an overview of the physiques of Olympic athletes competing after the Cold War era.
# Subset unique athlete data
<- olympics_post_1991 %>%
unique_athletes select(id, age, height, weight) %>%
distinct()
# Quick inspect
head(unique_athletes)
# A tibble: 6 × 4
id age height weight
<dbl> <dbl> <dbl> <dbl>
1 1 24 180 80
2 2 23 170 60
3 5 25 185 82
4 5 27 185 82
5 6 31 188 75
6 6 33 188 75
Did you notice the repeated values in the id
column? Some athletes compete in multiple Olympic Games over the years, which is reflected in the age
column. For example, the athlete with ID 5 competed in the Olympics at age 25 and again at age 27.
Now let’s use statistics to summarize the physical attributes of these athletes.
# Summary statistics for athletes
<- unique_athletes %>%
athlete_summary summarise(
mean_age = mean(age, na.rm = TRUE),
median_age = median(age, na.rm = TRUE),
sd_age = sd(age, na.rm = TRUE),
iqr_age = IQR(age, na.rm = TRUE),
min_age = min(age, na.rm = TRUE),
max_age = max(age, na.rm = TRUE),
mean_height = mean(height, na.rm = TRUE),
median_height = median(height, na.rm = TRUE),
sd_height = sd(height, na.rm = TRUE),
iqr_height = IQR(height, na.rm = TRUE),
min_height = min(height, na.rm = TRUE),
max_height = max(height, na.rm = TRUE),
mean_weight = mean(weight, na.rm = TRUE),
median_weight = median(weight, na.rm = TRUE),
sd_weight = sd(weight, na.rm = TRUE),
iqr_weight = IQR(weight, na.rm = TRUE),
min_weight = min(weight, na.rm = TRUE),
max_weight = max(weight, na.rm = TRUE)
)
I could display the summary statistics by simply adding a line of code with the name of my new data frame athlete_summary
. Alternatively, I can use kable()
to create a table.
%>%
athlete_summary kable()
mean_age | median_age | sd_age | iqr_age | min_age | max_age | mean_height | median_height | sd_height | iqr_height | min_height | max_height | mean_weight | median_weight | sd_weight | iqr_weight | min_weight | max_weight |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
25.77801 | 25 | 5.276563 | 7 | 11 | 71 | 176.794 | 177 | 10.78911 | 15 | 133 | 226 | 72.56303 | 70 | 15.6738 | 21 | 28 | 214 |
The table in its current form is quite wide and difficult to read. It may be easier to interpret in a long format instead.
Recall from Week 3 that we discussed long and wide data formats, as well as how to reshape data using pivot_longer()
and pivot_wider()
from the tidyr package.
In the code below, I’ll first convert the data from wide to long format. Then, I’ll clean and organize the resulting data frame using mutate()
, select()
, and arrange()
.
# Reshape from wide to long
<- athlete_summary %>%
athlete_long pivot_longer(
cols = everything(), # Take all columns and gather them into key-value pairs
names_to = c("stat", "trait"), # Split column names 2 new columns: 'stat' and 'trait'
names_sep = "_" # Use underscore to separate the original column names
%>%
) # Make the 'stat' and 'trait' names look nicer by capitalizing each word
mutate(
# Tools package is provided in base R (you don't need to install it)
stat = if_else(stat %in% c("sd", "iqr"), # Make "sd" and "iqr" fully uppercase
toupper(stat),
::toTitleCase(stat) # Title case all other stats
tools
),trait = tools::toTitleCase(trait)
%>%
) # Select columns in the order: Trait, Statistic, Value
select(trait, stat, value) %>%
# Arrange rows alphabetically by Trait and then by Statistic
arrange(trait, stat)
# Inspect first few rows
head(athlete_long)
# A tibble: 6 × 3
trait stat value
<chr> <chr> <dbl>
1 Age IQR 7
2 Age Max 71
3 Age Mean 25.8
4 Age Median 25
5 Age Min 11
6 Age SD 5.28
Great, the data is in long format and values are nicely formatted.
%>%
athlete_long kable()
trait | stat | value |
---|---|---|
Age | IQR | 7.000000 |
Age | Max | 71.000000 |
Age | Mean | 25.778005 |
Age | Median | 25.000000 |
Age | Min | 11.000000 |
Age | SD | 5.276563 |
Height | IQR | 15.000000 |
Height | Max | 226.000000 |
Height | Mean | 176.793958 |
Height | Median | 177.000000 |
Height | Min | 133.000000 |
Height | SD | 10.789109 |
Weight | IQR | 21.000000 |
Weight | Max | 214.000000 |
Weight | Mean | 72.563029 |
Weight | Median | 70.000000 |
Weight | Min | 28.000000 |
Weight | SD | 15.673798 |
While the table is no longer wide, it has become quite long and remains difficult to read. This format still impedes interpretation and discussion of the summary statistics of each individual variable. But now that I have the data cleaned and formatted, I can easily examine each variable on its own by filtering the data and displaying the data subset in a nicely formatted table.
Athlete Age
# Subset age stats
<- athlete_long %>%
age_table filter(trait == "Age") %>%
select(Statistic = stat, Value = value)
# Display age stats as table
%>%
age_table kable(
format = "html",
caption = "Summary Statistics: Age",
digits = 2,
align = c("l", "r")
)
Statistic | Value |
---|---|
IQR | 7.00 |
Max | 71.00 |
Mean | 25.78 |
Median | 25.00 |
Min | 11.00 |
SD | 5.28 |
I could spend time adjusting this code to customize the formatting and make the table appear more compact in the rendered document. For example, I might explore additional formatting options for kable
in the kableExtra
package. Alternatively, I can utilize the gt
package.
The gt package works well with Quarto because it creates tables that already include all necessary styling. These tables do not rely on extra style files or external programs to look good. This means the tables display correctly and consistently, without running into issues caused by missing or conflicting styles.
%>%
age_table gt() %>%
tab_header(title = "Athlete Age") %>%
fmt_number(columns = c(Value), decimals = 2)
Athlete Age | |
---|---|
Statistic | Value |
IQR | 7.00 |
Max | 71.00 |
Mean | 25.78 |
Median | 25.00 |
Min | 11.00 |
SD | 5.28 |
This table neatly summarizes the distribution of the age variable. The mean and median ages are close (25.8 and 25), which may indicate a roughly symmetric distribution, though this alone is not conclusive. The wide range of values, from 11 to 71 years, suggests potential skewness or the presence of outliers, which would require further exploration using visualizations or additional statistics.
Athlete Height
# Subset height stats
<- athlete_long %>%
height_table filter(trait == "Height") %>%
select(Statistic = stat, Value = value)
# Display height stats as table
%>%
height_table gt() %>%
tab_header(title = "Athlete Height") %>%
fmt_number(columns = c(Value), decimals = 2)
Athlete Height | |
---|---|
Statistic | Value |
IQR | 15.00 |
Max | 226.00 |
Mean | 176.79 |
Median | 177.00 |
Min | 133.00 |
SD | 10.79 |
This table neatly summarizes the distribution of the height variable. The mean and median heights are nearly identical (176.8 and 177), suggesting the distribution is approximately symmetric. The values range from 133 to 226 cm, with an interquartile range of 15 and a standard deviation of 10.79, indicating a moderate spread.
Athlete Weight
# Subset weight stats
<- athlete_long %>%
weight_table filter(trait == "Weight") %>%
select(Statistic = stat, Value = value)
%>%
weight_table gt() %>%
tab_header(title = "Athlete Weight") %>%
fmt_number(columns = c(Value), decimals = 2)
Athlete Weight | |
---|---|
Statistic | Value |
IQR | 21.00 |
Max | 214.00 |
Mean | 72.56 |
Median | 70.00 |
Min | 28.00 |
SD | 15.67 |
This table neatly summarizes the distribution of the weight variable. Although the mean and median weights are close (72.6 and 70), the large maximum value of 214 suggests a right-skewed distribution or the presence of high outliers. The interquartile range is 21, and the standard deviation is 15.67, showing considerable variability in weights.
Summarizing Years: Top Olympic Medal Winners
Next, I want to compute statistics for each Olympic year. Specifically, I’m interested in the distribution of medals awarded in Olympic games.
# Filter for medal winners
<- olympics_post_1991 %>%
medalists filter(!is.na(medal))
# Group and summarize
<- medalists %>%
yearly_summary group_by(year) %>%
summarise(
total_medals = n(),
mean_medals_per_country = n() / n_distinct(noc),
top_country = names(sort(table(noc), decreasing = TRUE)[1]),
top_country_medals = max(table(noc))
)
# Create table
%>%
yearly_summary gt() %>%
cols_label(
year = "Year",
total_medals = "Total Medals",
mean_medals_per_country = "Mean Medals per Country",
top_country = "Top Country",
top_country_medals = "Top Country Medal Count"
%>%
) fmt_number(
columns = mean_medals_per_country,
decimals = 2
)
Year | Total Medals | Mean Medals per Country | Top Country | Top Country Medal Count |
---|---|---|---|---|
1992 | 2030 | 31.23 | EUN | 279 |
1994 | 331 | 15.05 | GER | 40 |
1996 | 1842 | 23.32 | USA | 259 |
1998 | 440 | 18.33 | FIN | 58 |
2000 | 2004 | 25.05 | USA | 242 |
2002 | 478 | 19.92 | USA | 84 |
2004 | 2001 | 27.04 | USA | 263 |
2006 | 526 | 20.23 | CAN | 69 |
2008 | 2048 | 23.81 | USA | 317 |
2010 | 520 | 20.00 | USA | 97 |
2012 | 1941 | 22.84 | USA | 248 |
2014 | 597 | 22.96 | CAN | 86 |
2016 | 2023 | 23.52 | USA | 264 |
Based on the table above, the United States clearly dominates the Olympics in terms of medal counts during the post-1991 period. The U.S. appears as the top medal-winning country in 8 out of the 13 Olympic years listed. Notably, the U.S. reached its peak in 2008 with 317 medals.
Interestingly, EUN led in 1992 with 279 medals, which reflects the temporary coalition of former Soviet republics after the USSR dissolved.
Comparing Seasons: Differences between Summer and Winter
Lastly, I want to compute statistics to examine differences between the Summer Olympic games and the Winter Olympics games.
# Summarize medals by season and country
<- medalists %>%
season_country_summary group_by(season, noc) %>%
summarise(medals = n(), .groups = 'drop')
# Summary statistics
<- season_country_summary %>%
season_comparison group_by(season) %>%
summarise(
mean_medals = mean(medals),
median_medals = median(medals),
sd_medals = sd(medals)
)
# Create table
%>%
season_comparison gt() %>%
cols_label(
season = "Season",
mean_medals = "Mean Medals",
median_medals = "Median Medals",
sd_medals = "Standard Deviation"
%>%
) fmt_number(
columns = c(mean_medals, median_medals, sd_medals),
decimals = 2
)
Season | Mean Medals | Median Medals | Standard Deviation |
---|---|---|---|
Summer | 106.86 | 17.00 | 238.02 |
Winter | 86.76 | 25.00 | 115.14 |
This table above medal counts by season and highlights key differences in how medals are distributed across countries in the Summer and Winter Olympics.
While the mean number of medals is slightly higher in the Summer Olympics (106.86) compared to the Winter Olympics (86.76), the standard deviation is much larger in the Summer data (238.02 vs. 115.14). This suggests that medal distribution in the Summer Games is more uneven, with a few countries winning a disproportionately large share of the medals. In contrast, the Winter Games have a more balanced distribution, as indicated by a higher median (25 vs. 17) and lower variability.
In summary, while both seasons show variation, Summer Olympics are more dominated by top-performing countries, whereas Winter Games tend to have a more even spread of medals among participating nations.