The Olympic Games provide one of the richest long-term datasets in
sports history, spanning more than 120 years, dozens of countries, and
hundreds of sports.
This project explores trends in athlete demographics, performance, and
participation patterns using the athlete_events dataset.
In this analysis, I explore several key questions:
Throughout the project, I used R for data cleaning, visualization,
and inference.
I also used ChatGPT as a tool for brainstorming and debugging code,
while verifying all results independently.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(tidyr)
library(corrplot)
## corrplot 0.95 loaded
athlete_events <- read.csv("/Users/seanhiggins/R studio directory/athlete_events.csv")
head(athlete_events)
## ID Name Sex Age Height Weight Team NOC
## 1 1 A Dijiang M 24 180 80 China CHN
## 2 2 A Lamusi M 23 170 60 China CHN
## 3 3 Gunnar Nielsen Aaby M 24 NA NA Denmark DEN
## 4 4 Edgar Lindenau Aabye M 34 NA NA Denmark/Sweden DEN
## 5 5 Christine Jacoba Aaftink F 21 185 82 Netherlands NED
## 6 5 Christine Jacoba Aaftink F 21 185 82 Netherlands NED
## Games Year Season City Sport
## 1 1992 Summer 1992 Summer Barcelona Basketball
## 2 2012 Summer 2012 Summer London Judo
## 3 1920 Summer 1920 Summer Antwerpen Football
## 4 1900 Summer 1900 Summer Paris Tug-Of-War
## 5 1988 Winter 1988 Winter Calgary Speed Skating
## 6 1988 Winter 1988 Winter Calgary Speed Skating
## Event Medal
## 1 Basketball Men's Basketball <NA>
## 2 Judo Men's Extra-Lightweight <NA>
## 3 Football Men's Football <NA>
## 4 Tug-Of-War Men's Tug-Of-War Gold
## 5 Speed Skating Women's 500 metres <NA>
## 6 Speed Skating Women's 1,000 metres <NA>
The Olympics began as a male-dominated competition. Over time, however, women have gained greater representation.
library(dplyr)
library(ggplot2)
gender_year <- athlete_events %>%
group_by(Year) %>%
summarise(
female_pct = mean(Sex == "F", na.rm = TRUE) * 100
)
ggplot(gender_year, aes(x = Year, y = female_pct)) +
geom_line(linewidth = 1.2) +
labs(
title = "Female Participation in the Olympics Over Time",
x = "Year",
y = "Percentage of Athletes Who Are Female"
) +
theme_minimal()
Interpretation Female participation increased from less than 5% to
nearly 50% by modern Games. This change reflects major social, cultural,
and institutional transformations within global sport.
Understanding the physical demands of different sports helps explain the wide range of athlete body types. Height Comparison:
library(dplyr)
library(ggplot2)
hw_data <- athlete_events %>%
filter(!is.na(Height), !is.na(Weight))
# Pick top 8 sports by number of athletes
top_sports <- hw_data %>%
count(Sport, sort = TRUE) %>%
slice_head(n = 8) %>%
pull(Sport)
# Make sure Basketball is included even if not in the top 8
sports_to_include <- union(top_sports, "Basketball")
hw_top <- hw_data %>%
filter(Sport %in% sports_to_include)
# Reorder sports by median height
ggplot(hw_top, aes(x = reorder(Sport, Height, FUN = median), y = Height, fill = Sport)) +
geom_boxplot(show.legend = FALSE) +
coord_flip() +
labs(
title = "Height Comparison Across Top Olympic Sports + Basketball",
x = "Sport",
y = "Height (cm)"
) +
theme_minimal(base_size = 13)
This plot reveals: Basketball and Rowing feature the tallest athletes.
Gymnastics trends shorter. Sports exhibit distinct physical profiles.
This demonstrates how each sport attracts athletes whose body
characteristics fit its biomechanical demands.
ggplot(hw_top, aes(x = reorder(Sport, Weight, FUN = median), y = Weight, fill = Sport)) +
geom_boxplot(show.legend = FALSE) +
coord_flip() +
labs(
title = "Weight Comparison Across Top Olympic Sports",
x = "Sport",
y = "Weight (kg)"
) +
theme_minimal(base_size = 13)
Interpretation Power sports (Rowing, Throwing) show the heaviest
athletes. Agility-focused sports (Gymnastics) show the lowest weights.
This reflects performance specialization.
Age Differences: Medalists vs Non-Medalists (t-test + boxplot)
library(dplyr)
library(ggplot2)
# Keep athletes with ages & classify medal status
age_medal <- athlete_events %>%
filter(!is.na(Age)) %>%
mutate(
Medal_Status = if_else(is.na(Medal), "No Medal", "Medalist")
)
# Visual
ggplot(age_medal, aes(x = Medal_Status, y = Age, fill = Medal_Status)) +
geom_boxplot(show.legend = FALSE) +
labs(
title = "Age Comparison: Medalists vs Non-Medalists",
x = "",
y = "Age"
) +
theme_minimal()
# t-test
t.test(Age ~ Medal_Status, data = age_medal)
##
## Welch Two Sample t-test
##
## data: Age by Medal_Status
## t = 13.149, df = 56743, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Medalist and group No Medal is not equal to 0
## 95 percent confidence interval:
## 0.3683603 0.4974122
## sample estimates:
## mean in group Medalist mean in group No Medal
## 25.92517 25.49229
Medalists are on average 0.5 years older than non-medalists. The t-test shows this difference is not statistically significant. Many older outliars with no medal
library(dplyr)
library(ggplot2)
medals_by_country <- athlete_events %>%
filter(!is.na(Medal)) %>%
group_by(Team) %>%
summarise(total_medals = n(), .groups = "drop") %>%
arrange(desc(total_medals)) %>%
slice_head(n = 15)
ggplot(medals_by_country,
aes(x = reorder(Team, total_medals), y = total_medals)) +
geom_col(fill = "steelblue") +
coord_flip() +
labs(
title = "Top 15 Countries by Total Olympic Medals",
x = "Country (Team)",
y = "Total Medals"
) +
theme_minimal(base_size = 13)
# Top 15 Countries by Olympic Medals (Split by Summer vs Winter)
library(dplyr)
library(ggplot2)
# Filter to medal-winning rows only
medal_data <- athlete_events %>%
filter(!is.na(Medal))
# Count medals by Team and Season (Summer/Winter)
country_season_medals <- medal_data %>%
group_by(Team, Season) %>%
summarise(total_medals = n(), .groups = "drop")
# Find the top 15 medal-winning countries overall
top_countries <- country_season_medals %>%
group_by(Team) %>%
summarise(all_medals = sum(total_medals)) %>%
arrange(desc(all_medals)) %>%
slice_head(n = 15) %>%
pull(Team)
# Filter to top countries only
country_season_top <- country_season_medals %>%
filter(Team %in% top_countries)
# Visual: grouped bar chart
ggplot(country_season_top,
aes(x = reorder(Team, total_medals), y = total_medals, fill = Season)) +
geom_col(position = "dodge") +
coord_flip() +
labs(
title = "Top 15 Countries by Olympic Medals (Split by Summer vs Winter)",
x = "Country",
y = "Medal Count"
) +
scale_fill_manual(values = c("Summer" = "gold", "Winter" = "steelblue")) +
theme_minimal(base_size = 13)
These visuals highlight which nations have historically dominated the
Olympics. Common leaders include the USA, Soviet Union/Russia, Germany,
and major Western European countries. They also clearly differentiate
seasonal specialization: Norway, Germany, and Canada dominate Winter
Games. USA, China, Japan dominate Summer Games. This reinforces how
geography influences sport performance. Can also highlight how winter
olympic medals are much less common, as those games were introduced much
later # Country/Sport Heat Map
library(dplyr)
library(ggplot2)
# Keep only rows where a medal was won
medal_data <- athlete_events %>%
filter(!is.na(Medal))
# Count medals by country (Team) and Sport
country_sport_medals <- medal_data %>%
group_by(Team, Sport) %>%
summarise(medals = n(), .groups = "drop")
# Top 12 countries by total medals
top_countries <- country_sport_medals %>%
group_by(Team) %>%
summarise(total_medals = sum(medals), .groups = "drop") %>%
arrange(desc(total_medals)) %>%
slice_head(n = 12) %>%
pull(Team)
# Top 10 sports by total medals
top_sports <- country_sport_medals %>%
group_by(Sport) %>%
summarise(total_medals = sum(medals), .groups = "drop") %>%
arrange(desc(total_medals)) %>%
slice_head(n = 10) %>%
pull(Sport)
# Filter to those
country_sport_top <- country_sport_medals %>%
filter(Team %in% top_countries,
Sport %in% top_sports)
# Reorder Teams and Sports by total medals within the filtered set
team_order <- country_sport_top %>%
group_by(Team) %>%
summarise(total_medals = sum(medals), .groups = "drop") %>%
arrange(total_medals) %>%
pull(Team)
sport_order <- country_sport_top %>%
group_by(Sport) %>%
summarise(total_medals = sum(medals), .groups = "drop") %>%
arrange(total_medals) %>%
pull(Sport)
country_sport_top <- country_sport_top %>%
mutate(
Team = factor(Team, levels = team_order),
Sport = factor(Sport, levels = sport_order)
)
ggplot(country_sport_top, aes(x = Sport, y = Team, fill = medals)) +
geom_tile() +
scale_fill_gradient(low = "white", high = "darkred") +
labs(
title = "Medal Dominance by Country and Sport",
x = "Sport",
y = "Country (Team)",
fill = "Medals"
) +
theme_minimal(base_size = 12) +
theme(
axis.text.x = element_text(angle = 45, hjust = 1)
)
The heatmap reveals specialization patterns: USA → Athletics, Swimming
Norway → Skiing, Biathlon China → Weightlifting, Table Tennis
Russia/USSR → Gymnastics, Wrestling Even white squares represent “1
medal,” not zero in some cases # Additional Visuals # Number of Olympic
Appearances per Athlete
library(dplyr)
library(ggplot2)
# Count how many times each athlete appears in the dataset
athlete_appearances <- athlete_events %>%
group_by(Name) %>%
summarise(n_appearances = n_distinct(Year), .groups = "drop")
ggplot(athlete_appearances, aes(x = n_appearances)) +
geom_histogram(binwidth = 1, fill = "steelblue", color = "white") +
labs(
title = "Distribution of Olympic Appearances Per Athlete",
x = "Number of Olympic Games Competed In",
y = "Number of Athletes"
) +
scale_x_continuous(breaks = 1:max(athlete_appearances$n_appearances)) +
theme_minimal(base_size = 13)
Most athletes compete only once. A small group competes multiple times,
reflecting rare longevity in elite competition. # Sport Popularity Over
Time (Top 8 Sports)
library(dplyr)
library(ggplot2)
# Count entries per sport
sport_counts <- athlete_events %>%
count(Sport, sort = TRUE)
# Pick the top 8 most common sports
top_sports <- sport_counts %>%
slice_head(n = 8) %>%
pull(Sport)
sport_year <- athlete_events %>%
filter(Sport %in% top_sports) %>%
group_by(Year, Sport) %>%
summarise(n_athletes = n(), .groups = "drop")
ggplot(sport_year,
aes(x = Year, y = n_athletes, color = Sport)) +
geom_line(linewidth = 1) +
labs(
title = "Sport Popularity Over Time (Top 8 Sports)",
x = "Year",
y = "Number of Athlete Entries"
) +
theme_minimal(base_size = 13)
# Discussion and Key Insights
-Athlete body types strongly correspond to sport demands. -Medalists may differ in age compared to non-medalists. -Countries show clear seasonal and sport-based specialization. -Participation of women has increased dramatically over time. -Only a small fraction of athletes appear in multiple Olympics. -These findings collectively highlight how Olympic sport evolves over time demographically, geographically, and competitively. # LLM Usage Statement
I used ChatGPT to assist with brainstorming statistical questions, organizing the report structure, and refining R code syntax. All results, interpretations, and plots were independently verified through my own testing in R.
This project analyzed over 120 years of Olympic athlete data to uncover long-term trends in participation, physical characteristics, performance, and country-level dominance. Using the athlete_events dataset and a series of visualizations created in R, the analysis highlights how the Olympic Games have evolved demographically, geographically, and competitively over time pasted . One of the most striking trends is the dramatic rise in female participation. Early Olympic Games included very few women, but the time series visualization shows a steady increase, reaching nearly 50% female representation in modern competitions. This reflects broader social and institutional changes promoting gender equality in sport. Body composition analysis further demonstrates how physical characteristics align with sport-specific demands. Boxplots of height and weight reveal clear differences across sports, with taller and heavier athletes clustering in sports like basketball and rowing, while gymnastics favors shorter and lighter builds. Performance comparisons between medalists and non-medalists show that medalists are, on average, slightly older, though the difference is not statistically significant. This suggests that experience alone does not guarantee success at the Olympic level. Country-level analyses reveal strong patterns of specialization. Nations such as the United States and China dominate Summer Games sports, while countries like Norway and Germany excel in Winter events, emphasizing the influence of geography, climate, and national sport infrastructure. Heatmap visualizations further reinforce these sport-specific dominance patterns. Overall, the graphs collectively demonstrate that Olympic success is shaped by a combination of biological fit, opportunity, specialization, and historical context. The Olympics are not only a showcase of athletic talent, but also a reflection of long-term social and global trends.