Introduction

The Olympic Games provide one of the richest long-term datasets in sports history, spanning more than 120 years, dozens of countries, and hundreds of sports.
This project explores trends in athlete demographics, performance, and participation patterns using the athlete_events dataset.

In this analysis, I explore several key questions:

Throughout the project, I used R for data cleaning, visualization, and inference.
I also used ChatGPT as a tool for brainstorming and debugging code, while verifying all results independently.


Data Loading and Preparation

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(tidyr)
library(corrplot)
## corrplot 0.95 loaded
athlete_events <- read.csv("/Users/seanhiggins/R studio directory/athlete_events.csv")

head(athlete_events)
##   ID                     Name Sex Age Height Weight           Team NOC
## 1  1                A Dijiang   M  24    180     80          China CHN
## 2  2                 A Lamusi   M  23    170     60          China CHN
## 3  3      Gunnar Nielsen Aaby   M  24     NA     NA        Denmark DEN
## 4  4     Edgar Lindenau Aabye   M  34     NA     NA Denmark/Sweden DEN
## 5  5 Christine Jacoba Aaftink   F  21    185     82    Netherlands NED
## 6  5 Christine Jacoba Aaftink   F  21    185     82    Netherlands NED
##         Games Year Season      City         Sport
## 1 1992 Summer 1992 Summer Barcelona    Basketball
## 2 2012 Summer 2012 Summer    London          Judo
## 3 1920 Summer 1920 Summer Antwerpen      Football
## 4 1900 Summer 1900 Summer     Paris    Tug-Of-War
## 5 1988 Winter 1988 Winter   Calgary Speed Skating
## 6 1988 Winter 1988 Winter   Calgary Speed Skating
##                                Event Medal
## 1        Basketball Men's Basketball  <NA>
## 2       Judo Men's Extra-Lightweight  <NA>
## 3            Football Men's Football  <NA>
## 4        Tug-Of-War Men's Tug-Of-War  Gold
## 5   Speed Skating Women's 500 metres  <NA>
## 6 Speed Skating Women's 1,000 metres  <NA>

Female Participation Over Time

The Olympics began as a male-dominated competition. Over time, however, women have gained greater representation.

library(dplyr)
library(ggplot2)

gender_year <- athlete_events %>%
  group_by(Year) %>%
  summarise(
    female_pct = mean(Sex == "F", na.rm = TRUE) * 100
  )

ggplot(gender_year, aes(x = Year, y = female_pct)) +
  geom_line(linewidth = 1.2) +
  labs(
    title = "Female Participation in the Olympics Over Time",
    x = "Year",
    y = "Percentage of Athletes Who Are Female"
  ) +
  theme_minimal()

Interpretation Female participation increased from less than 5% to nearly 50% by modern Games. This change reflects major social, cultural, and institutional transformations within global sport.

Body Composition By Top Sports

Understanding the physical demands of different sports helps explain the wide range of athlete body types. Height Comparison:

library(dplyr)
library(ggplot2)

hw_data <- athlete_events %>%
  filter(!is.na(Height), !is.na(Weight))

# Pick top 8 sports by number of athletes
top_sports <- hw_data %>%
  count(Sport, sort = TRUE) %>%
  slice_head(n = 8) %>%
  pull(Sport)

# Make sure Basketball is included even if not in the top 8
sports_to_include <- union(top_sports, "Basketball")

hw_top <- hw_data %>%
  filter(Sport %in% sports_to_include)

# Reorder sports by median height
ggplot(hw_top, aes(x = reorder(Sport, Height, FUN = median), y = Height, fill = Sport)) +
  geom_boxplot(show.legend = FALSE) +
  coord_flip() +
  labs(
    title = "Height Comparison Across Top Olympic Sports + Basketball",
    x = "Sport",
    y = "Height (cm)"
  ) +
  theme_minimal(base_size = 13)

This plot reveals: Basketball and Rowing feature the tallest athletes. Gymnastics trends shorter. Sports exhibit distinct physical profiles. This demonstrates how each sport attracts athletes whose body characteristics fit its biomechanical demands.

Weight Comparisons

ggplot(hw_top, aes(x = reorder(Sport, Weight, FUN = median), y = Weight, fill = Sport)) +
  geom_boxplot(show.legend = FALSE) +
  coord_flip() +
  labs(
    title = "Weight Comparison Across Top Olympic Sports",
    x = "Sport",
    y = "Weight (kg)"
  ) +
  theme_minimal(base_size = 13)

Interpretation Power sports (Rowing, Throwing) show the heaviest athletes. Agility-focused sports (Gymnastics) show the lowest weights. This reflects performance specialization.

Performance Analysis

Age Differences: Medalists vs Non-Medalists (t-test + boxplot)

library(dplyr)
library(ggplot2)

# Keep athletes with ages & classify medal status
age_medal <- athlete_events %>%
  filter(!is.na(Age)) %>%
  mutate(
    Medal_Status = if_else(is.na(Medal), "No Medal", "Medalist")
  )

# Visual
ggplot(age_medal, aes(x = Medal_Status, y = Age, fill = Medal_Status)) +
  geom_boxplot(show.legend = FALSE) +
  labs(
    title = "Age Comparison: Medalists vs Non-Medalists",
    x = "",
    y = "Age"
  ) +
  theme_minimal()

# t-test
t.test(Age ~ Medal_Status, data = age_medal)
## 
##  Welch Two Sample t-test
## 
## data:  Age by Medal_Status
## t = 13.149, df = 56743, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Medalist and group No Medal is not equal to 0
## 95 percent confidence interval:
##  0.3683603 0.4974122
## sample estimates:
## mean in group Medalist mean in group No Medal 
##               25.92517               25.49229

Medalists are on average 0.5 years older than non-medalists. The t-test shows this difference is not statistically significant. Many older outliars with no medal

Country Performance and Sport Specialization

Total Medals by Country (Top 15)

library(dplyr)
library(ggplot2)

medals_by_country <- athlete_events %>%
  filter(!is.na(Medal)) %>%
  group_by(Team) %>%
  summarise(total_medals = n(), .groups = "drop") %>%
  arrange(desc(total_medals)) %>%
  slice_head(n = 15)

ggplot(medals_by_country,
       aes(x = reorder(Team, total_medals), y = total_medals)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(
    title = "Top 15 Countries by Total Olympic Medals",
    x = "Country (Team)",
    y = "Total Medals"
  ) +
  theme_minimal(base_size = 13)

# Top 15 Countries by Olympic Medals (Split by Summer vs Winter)

library(dplyr)
library(ggplot2)

# Filter to medal-winning rows only
medal_data <- athlete_events %>%
  filter(!is.na(Medal))

# Count medals by Team and Season (Summer/Winter)
country_season_medals <- medal_data %>%
  group_by(Team, Season) %>%
  summarise(total_medals = n(), .groups = "drop")

# Find the top 15 medal-winning countries overall
top_countries <- country_season_medals %>%
  group_by(Team) %>%
  summarise(all_medals = sum(total_medals)) %>%
  arrange(desc(all_medals)) %>%
  slice_head(n = 15) %>%
  pull(Team)

# Filter to top countries only
country_season_top <- country_season_medals %>%
  filter(Team %in% top_countries)

# Visual: grouped bar chart
ggplot(country_season_top,
       aes(x = reorder(Team, total_medals), y = total_medals, fill = Season)) +
  geom_col(position = "dodge") +
  coord_flip() +
  labs(
    title = "Top 15 Countries by Olympic Medals (Split by Summer vs Winter)",
    x = "Country",
    y = "Medal Count"
  ) +
  scale_fill_manual(values = c("Summer" = "gold", "Winter" = "steelblue")) +
  theme_minimal(base_size = 13)

These visuals highlight which nations have historically dominated the Olympics. Common leaders include the USA, Soviet Union/Russia, Germany, and major Western European countries. They also clearly differentiate seasonal specialization: Norway, Germany, and Canada dominate Winter Games. USA, China, Japan dominate Summer Games. This reinforces how geography influences sport performance. Can also highlight how winter olympic medals are much less common, as those games were introduced much later # Country/Sport Heat Map

library(dplyr)
library(ggplot2)

# Keep only rows where a medal was won
medal_data <- athlete_events %>%
  filter(!is.na(Medal))

# Count medals by country (Team) and Sport
country_sport_medals <- medal_data %>%
  group_by(Team, Sport) %>%
  summarise(medals = n(), .groups = "drop")
# Top 12 countries by total medals
top_countries <- country_sport_medals %>%
  group_by(Team) %>%
  summarise(total_medals = sum(medals), .groups = "drop") %>%
  arrange(desc(total_medals)) %>%
  slice_head(n = 12) %>%
  pull(Team)

# Top 10 sports by total medals
top_sports <- country_sport_medals %>%
  group_by(Sport) %>%
  summarise(total_medals = sum(medals), .groups = "drop") %>%
  arrange(desc(total_medals)) %>%
  slice_head(n = 10) %>%
  pull(Sport)

# Filter to those
country_sport_top <- country_sport_medals %>%
  filter(Team %in% top_countries,
         Sport %in% top_sports)
# Reorder Teams and Sports by total medals within the filtered set
team_order <- country_sport_top %>%
  group_by(Team) %>%
  summarise(total_medals = sum(medals), .groups = "drop") %>%
  arrange(total_medals) %>%
  pull(Team)

sport_order <- country_sport_top %>%
  group_by(Sport) %>%
  summarise(total_medals = sum(medals), .groups = "drop") %>%
  arrange(total_medals) %>%
  pull(Sport)

country_sport_top <- country_sport_top %>%
  mutate(
    Team  = factor(Team, levels = team_order),
    Sport = factor(Sport, levels = sport_order)
  )
ggplot(country_sport_top, aes(x = Sport, y = Team, fill = medals)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "darkred") +
  labs(
    title = "Medal Dominance by Country and Sport",
    x = "Sport",
    y = "Country (Team)",
    fill = "Medals"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

The heatmap reveals specialization patterns: USA → Athletics, Swimming Norway → Skiing, Biathlon China → Weightlifting, Table Tennis Russia/USSR → Gymnastics, Wrestling Even white squares represent “1 medal,” not zero in some cases # Additional Visuals # Number of Olympic Appearances per Athlete

library(dplyr)
library(ggplot2)

# Count how many times each athlete appears in the dataset
athlete_appearances <- athlete_events %>%
  group_by(Name) %>%
  summarise(n_appearances = n_distinct(Year), .groups = "drop")
ggplot(athlete_appearances, aes(x = n_appearances)) +
  geom_histogram(binwidth = 1, fill = "steelblue", color = "white") +
  labs(
    title = "Distribution of Olympic Appearances Per Athlete",
    x = "Number of Olympic Games Competed In",
    y = "Number of Athletes"
  ) +
  scale_x_continuous(breaks = 1:max(athlete_appearances$n_appearances)) +
  theme_minimal(base_size = 13)

Most athletes compete only once. A small group competes multiple times, reflecting rare longevity in elite competition. # Sport Popularity Over Time (Top 8 Sports)

library(dplyr)
library(ggplot2)

# Count entries per sport
sport_counts <- athlete_events %>%
  count(Sport, sort = TRUE)

# Pick the top 8 most common sports
top_sports <- sport_counts %>%
  slice_head(n = 8) %>%
  pull(Sport)

sport_year <- athlete_events %>%
  filter(Sport %in% top_sports) %>%
  group_by(Year, Sport) %>%
  summarise(n_athletes = n(), .groups = "drop")

ggplot(sport_year,
       aes(x = Year, y = n_athletes, color = Sport)) +
  geom_line(linewidth = 1) +
  labs(
    title = "Sport Popularity Over Time (Top 8 Sports)",
    x = "Year",
    y = "Number of Athlete Entries"
  ) +
  theme_minimal(base_size = 13)

# Discussion and Key Insights

-Athlete body types strongly correspond to sport demands. -Medalists may differ in age compared to non-medalists. -Countries show clear seasonal and sport-based specialization. -Participation of women has increased dramatically over time. -Only a small fraction of athletes appear in multiple Olympics. -These findings collectively highlight how Olympic sport evolves over time demographically, geographically, and competitively. # LLM Usage Statement

I used ChatGPT to assist with brainstorming statistical questions, organizing the report structure, and refining R code syntax. All results, interpretations, and plots were independently verified through my own testing in R.

Final Report Summary and Conclusion

This project analyzed over 120 years of Olympic athlete data to uncover long-term trends in participation, physical characteristics, performance, and country-level dominance. Using the athlete_events dataset and a series of visualizations created in R, the analysis highlights how the Olympic Games have evolved demographically, geographically, and competitively over time pasted . One of the most striking trends is the dramatic rise in female participation. Early Olympic Games included very few women, but the time series visualization shows a steady increase, reaching nearly 50% female representation in modern competitions. This reflects broader social and institutional changes promoting gender equality in sport. Body composition analysis further demonstrates how physical characteristics align with sport-specific demands. Boxplots of height and weight reveal clear differences across sports, with taller and heavier athletes clustering in sports like basketball and rowing, while gymnastics favors shorter and lighter builds. Performance comparisons between medalists and non-medalists show that medalists are, on average, slightly older, though the difference is not statistically significant. This suggests that experience alone does not guarantee success at the Olympic level. Country-level analyses reveal strong patterns of specialization. Nations such as the United States and China dominate Summer Games sports, while countries like Norway and Germany excel in Winter events, emphasizing the influence of geography, climate, and national sport infrastructure. Heatmap visualizations further reinforce these sport-specific dominance patterns. Overall, the graphs collectively demonstrate that Olympic success is shaped by a combination of biological fit, opportunity, specialization, and historical context. The Olympics are not only a showcase of athletic talent, but also a reflection of long-term social and global trends.