library(ggplot2)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(tidyr)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats   1.0.0     ✔ readr     2.1.4
✔ lubridate 1.9.2     ✔ stringr   1.5.0
✔ purrr     1.0.2     ✔ tibble    3.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Question 1: Distribution of Medals

df <- read.csv("data/athlete_events.csv")

medals <- df %>% 
  filter(!is.na(Medal))

medal_counts <- medals %>%
  group_by(Year, Season, Medal) %>%
  summarise(count = n(), .groups = "drop") %>%
  pivot_wider(names_from = Medal,
              values_from = count,
              values_fill = list(count = 0)) %>%
  arrange(Year)

medal_counts <- medal_counts %>%
  mutate(
    total = Gold + Silver + Bronze,
    prop_gold = Gold / total,
    prop_silver = Silver / total,
    prop_bronze = Bronze / total
  )

ggplot(medal_counts, aes(x = Year)) +
  geom_line(aes(y = Gold, color = "Gold"), linewidth = 1) +
  geom_line(aes(y = Silver, color = "Silver"), linewidth = 1) +
  geom_line(aes(y = Bronze, color = "Bronze"), linewidth = 1) +
  facet_wrap(~ Season, scales = "free_y") +
  labs(title = "Medal Counts Over Time",
       y = "Number of Medals",
       color = "Medal Type")

ggplot(medal_counts, aes(x = Year)) +
  geom_line(aes(y = prop_gold, color = "Gold"), linewidth = 1) +
  geom_line(aes(y = prop_silver, color = "Silver"), linewidth = 1) +
  geom_line(aes(y = prop_bronze, color = "Bronze"), linewidth = 1) +
  facet_wrap(~ Season) +
  labs(title = "Proportion of Medal Types Over Time",
       y = "Proportion",
       color = "Medal Type")

This data shows that the overall medals have generally increased over the years. While there was a decrease early in the Summer Olympics, it has continued to increase over time.

Question 2: Average Age of Olympians

ages <- df %>%
  filter(!is.na(Age), Age > 0)

avg_age <- ages %>%
  group_by(Year) %>%
  summarise(mean_age = mean(Age, na.rm = TRUE))

ggplot(avg_age, aes(x = Year, y = mean_age)) +
  geom_line(linewidth = 1, color = "darkred") +
  geom_point(color = "darkred") +
  labs(title = "Average Age of Olympic Athletes Over Time",
       x = "Year",
       y = "Average Age") +
  theme_minimal()

The graph shows that while Olympic athletes have generally remained in their twenties over the years, 1932 was a noticeable outlier, as it had a much higher average age at around 32.

Question 3: Historic Medal Domination

medals <- df %>% 
  filter(!is.na(Medal))

country_totals <- medals %>%
  group_by(NOC) %>%
  summarise(total_medals = n()) %>%
  arrange(desc(total_medals))

head(country_totals, 10)
# A tibble: 10 × 2
   NOC   total_medals
   <chr>        <int>
 1 USA           5637
 2 URS           2503
 3 GER           2165
 4 GBR           2068
 5 FRA           1777
 6 ITA           1637
 7 SWE           1536
 8 CAN           1352
 9 AUS           1320
10 RUS           1165
top10 <- country_totals %>% top_n(10, total_medals)

ggplot(top10, aes(x = reorder(NOC, total_medals), 
                  y = total_medals, 
                  fill = NOC)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(title = "Top 10 Countries by Total Medals (All Years)",
       x = "Country (NOC)",
       y = "Total Medals") +
  theme_minimal() +
  theme(legend.position = "none")

The data shows that the USA has absolutely dominated in total medals taken home. A far second is the USR, otherwise known as the Soviet Union, which is impressive considering they don’t exist anymore. It is worth noting that Russia has it’s own separate count.

Question 4: Physiological Change over Time

athletes <- read_csv("data/athlete_events.csv")
Rows: 271116 Columns: 15
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): Name, Sex, Team, NOC, Games, Season, City, Sport, Event, Medal
dbl  (5): ID, Age, Height, Weight, Year

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
phys_trends <- athletes %>%
  group_by(Year) %>%
  summarize(
    avg_height = mean(Height, na.rm = TRUE),
    avg_weight = mean(Weight, na.rm = TRUE),
    avg_age    = mean(Age, na.rm = TRUE)
  )

ggplot(phys_trends, aes(x = Year, y = avg_height)) +
  geom_line(color = "steelblue", linewidth = 1.2) +
  geom_point(color = "steelblue") +
  labs(
    title = "Average Athlete Height Over Time",
    x = "Year",
    y = "Height (cm)"
  ) +
  theme_minimal(base_size = 14)

ggplot(phys_trends, aes(x = Year, y = avg_weight)) +
  geom_line(color = "darkgreen", linewidth = 1.2) +
  geom_point(color = "darkgreen") +
  labs(
    title = "Average Athlete Weight Over Time",
    x = "Year",
    y = "Weight (kg)"
  ) +
  theme_minimal(base_size = 14)

ggplot(phys_trends, aes(x = Year, y = avg_age)) +
  geom_line(color = "purple", linewidth = 1.2) +
  geom_point(color = "purple") +
  labs(
    title = "Average Athlete Age Over Time",
    x = "Year",
    y = "Age"
  ) +
  theme_minimal(base_size = 14)

Strangely enough, even those these are showing three different factors, the graphs look similar. They show that for height, weight, and age, they slowly increased, and when they peaked, they dramatically decreased, before leveling out recently, remaining fairly consistent. Notably, height and weight peaked at the same time, while age peaked a little later.