1. Abstract

This project explores 120 years of Olympic data (starting in 1896 and ending in 2016), considering variables such as medals awarded, countries, and gender, to name a few; to answer research questions on how the Olympics have changed over time, we visualize the results using R programming.

2. Introduction

Many wonder how the Olympics’ long history has changed over time. In this project, we use R programming methods to explore questions related to medal distribution, top sports and countries, athletes’ ages, gender distribution, and winning trends.

3. Methodology

  • Data Source: Kaggle 120 years of Olympic History dataset.
  • Tools: R programming language, base packages, ggplot2, dplyr, tidyr, reshape2.
  • Data Pre-processing: cleaning, merging, grouping, and transforming data.
  • Data Visualization: Creating bar plots, line plots and heatmaps using ggplot2.

4. Results

4.1 Stacked Bar plot of Medal Distribution

This bar plot shows the distribution of gold, silver, and bronze medals over time. We can see a general trend that the total medals awarded have increased over time; the total amount of gold, silver, and bronze medals has also increased to reflect the change in the total. We can see that in 1994, when the summer and Winter Olympics changed to being held every two years instead of occurring in the same year, the Winter Olympics had much fewer medals awarded, likely due to a more narrow range of sports. However, the trend for total medals awarded in the Winter Olympics still increases over time.

4.2 Bar plot of Top Ten Countries in the Olympics Overall

The bar plot above shows the top ten countries that dominate the Olympics based on total medals awarded over the years. The US has historically dominated the Olympics overall. Although Russia and Germany, in second and third place respectively, have way more medals than the rest of the remaining top ten countries, they still fall short of the US by a similar margin by which the remaining top ten countries fall short of them.

4.3 Bar plot of Top Ten Sports in the Olympics Overall

In this bar plot, we can see the top ten sports based on the most medals awarded in the Olympics over the years. Unsurprisingly, Athletics, Swimming, and Gymnastics are all in the top five. However, Rowing is a surprise, as it is not often thought of when considering the most popular sport in the Olympics.

4.4 Line plot of Average and Median Athlete Ages Over Time

The line plot above shows the average and median ages of Olympic athletes over time. We included the median age in addition to the average because it is more resistant to being skewed by outliers in the data. The general trend is an increase in ages as we approach 1930. Athletes start getting younger and younger as we approach 1980, and ages begin to increase towards the present day.

4.5 Heatmap of Medal Count of Top Ten Countries vs Top Ten Sports

This heatmap shows which countries perform well in specific sports. We can see that the US dominated Athletics and Swimming, France and Italy did quite well in Fencing, Australia did well in Swimming, and unsurprisingly Canada did well in Ice Hockey which makes sense due to how popular the sport is in that country.

4.6 Line plot of Gender Distribution of Athletes Over Time

The line plot above displays the gender distribution of athletes over the years. Women were first allowed to compete in the Olympics in 1900, and female participation slowly increased until around 1950, when the increase sped up a little as we approached 1980. After 1980, there was a steady increase in female participation until around 1994, when the increase began to slow, likely due to when the summer and Winter Olympics split that same year. During the split in 1994, we can see fewer overall athletes in the Winter Olympics. However, female participation continues to slowly increase. Male participation generally increases over the years but decreases slightly after 1994, likely due to the split.

4.7 Line plot of USA’s Performance in the Olympics Over Time

In this line plot, we see the US performance in the Olympics over the years. The general trend is an increased performance over the years for both the summer and winter Olympics. However, for the summer Olympics, their best years were around 1904, 1984, and 2008. Their performance significantly decreased in the games prior to WWII.

4.8 Bar plot of USA’s Perfomance Over Time with Season and Hosting Status

With this bar plot, we analyze the USA’s performance when they hosted the Olympics versus when they did not. We can see that the US tends to do very well when they host in the summer but not as much when they host in the winter.

4.9 Line plot of Amount of Medals Won Over Time Based on Season

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Based on our other graphs, we hypothesized that many fewer medals are awarded in the Winter Olympics than in the Summer Olympics, but we wanted to plot that relationship directly to support our educated guesses and visualize the trend. Medals awarded in the summer steadily increased until 1920, when they fell dramatically, probably leading up to WWII. Afterward, they increased very quickly up until the end of our dataset. Since the Winter Olympics began, the increase in the amount of medals awarded has been much slower.

5. Conclusion

This project aimed to answer research questions that considered Olympic participation and performance. We were able to see how history has affected these two metrics, from the First and Second World Wars to when women were first allowed to compete, and we took these events into account during our analysis of the results.

7. Appendices

7.1 Setup Code, Loading Libraries, Reading in Data

knitr::opts_chunk$set(echo = TRUE)
library(ggplot2)
library(dplyr)
library(tidyr)
library(reshape2)

olympic_data <- read.csv("athlete_events.csv")

noc_regions <- read.csv("noc_regions.csv")

7.2 Grouping Data, Stacked Bar plot 4.1

df_medals <- olympic_data %>%
  filter(!is.na(Medal)) %>%
  group_by(Year, Medal) %>%
  summarise(MedalCount = n())

df_medals$Medal <- factor(df_medals$Medal, levels = c("Gold", "Silver", "Bronze"))

ggplot(df_medals, aes(x = Year, y = MedalCount, fill = Medal)) +
  geom_bar(stat = "identity", position = "stack") + 
  labs(title = "Distribution of Gold, Silver, and Bronze Medals Over Time",
       x = "Year", 
       y = "Number of Medals") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_fill_manual(values = c("Gold" = "gold", 
                               "Silver" = "gray", 
                               "Bronze" = "brown"))

7.3 Cleaning, Grouping and Merging Data, Bar plot 4.2

olympic_data_combined <- olympic_data %>%
  mutate(NOC = ifelse(NOC == "USR", "RUS", NOC))

total_medals_by_country <- olympic_data_combined %>%
  filter(!is.na(Medal)) %>%
  group_by(NOC) %>%
  summarise(TotalMedals = n(), .groups = 'drop') %>%
  arrange(desc(TotalMedals))  

total_medals_with_names <- total_medals_by_country %>%
  left_join(noc_regions, by = "NOC")

total_medals_with_names_combined <- total_medals_with_names %>%
  mutate(region = ifelse(NOC == "USR", "Russia", region)) %>% 
  group_by(region) %>%
  summarise(TotalMedals = sum(TotalMedals), .groups = 'drop')

top_10_countries <- total_medals_with_names_combined %>%
  slice_max(TotalMedals, n = 10)

top_10_countries <- top_10_countries %>%
  arrange(TotalMedals) %>%  
  mutate(region = factor(region, levels = region))

ggplot(top_10_countries, aes(x = reorder(region, TotalMedals), y = TotalMedals, fill = region)) +
  geom_bar(stat = "identity") +  
  labs(title = "Top 10 Countries by Total Medals Awarded in the Olympics",
       x = "Country",
       y = "Total Medals") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 0),  
        legend.position = "none") + 
  scale_fill_viridis_d() + 
  coord_flip()

7.4 Grouping and Filtering Data, Bar plot 4.3

medals_by_sport <- olympic_data %>%
  filter(!is.na(Medal)) %>%  
  group_by(Sport) %>%
  summarise(TotalMedals = n()) %>%
  arrange(desc(TotalMedals)) 

top_10_sports <- medals_by_sport %>%
  top_n(10, TotalMedals) 

ggplot(top_10_sports, aes(x = reorder(Sport, TotalMedals), y = TotalMedals, fill = Sport)) +
  geom_bar(stat = "identity") +  
  labs(title = "Top 10 Sports by Total Medals Awarded in the Olympics",
       x = "Sport",
       y = "Total Medals") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none") +  
  scale_fill_viridis_d() +
  coord_flip()

7.5 Cleaing and Grouping Data, Line plot 4.4

age_by_year <- olympic_data %>%
  filter(!is.na(Age)) %>%  
  group_by(Year) %>%
  summarise(AverageAge = mean(Age, na.rm = TRUE),  
            MedianAge = median(Age, na.rm = TRUE))  

ggplot(age_by_year, aes(x = Year)) +
  geom_line(aes(y = AverageAge, color = "Average Age"), linewidth = 1) + 
  geom_line(aes(y = MedianAge, color = "Median Age"), linewidth = 1, linetype = "dashed") +  
  labs(title = "Average and Median Age of Athletes Over Time",
       x = "Year",
       y = "Age",
       color = "Age Type") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

7.6 Cleaning, Grouping, and Merging Data, Heatmap 4.5

medal_counts <- olympic_data_combined %>%
  filter(!is.na(Medal)) %>%
  group_by(NOC, Sport) %>%
  summarise(MedalCount = n()) %>%
  ungroup()

top_countries <- medal_counts %>%
  group_by(NOC) %>%
  summarise(TotalMedals = sum(MedalCount)) %>%
  top_n(10, TotalMedals) %>%
  pull(NOC)

top_sports <- medal_counts %>%
  group_by(Sport) %>%
  summarise(TotalMedals = sum(MedalCount)) %>%
  top_n(10, TotalMedals) %>%
  pull(Sport)

filtered_data <- medal_counts %>%
  filter(NOC %in% top_countries & Sport %in% top_sports)

noc_regions_clean <- noc_regions %>%
  select(-notes) 

total_medals_with_regions <- filtered_data %>%
  left_join(noc_regions_clean, by = "NOC")

total_medals_with_regions <- total_medals_with_regions %>%
  mutate(MedalCount = as.numeric(MedalCount)) %>%
  filter(!is.na(MedalCount)) 

heatmap_data <- total_medals_with_regions %>%
  pivot_wider(names_from = Sport, values_from = MedalCount, values_fill = list(MedalCount = 0))

heatmap_data_long <- heatmap_data %>%
  pivot_longer(cols = -c(NOC, region), names_to = "Sport", values_to = "MedalCount")

ggplot(heatmap_data_long, aes(x = Sport, y = region, fill = MedalCount)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "blue") + 
  labs(title = "Top 10 Countries' Medal Counts by Sport (Region)",
       x = "Sport",
       y = "Region",
       fill = "Medal Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),  
        axis.text.y = element_text(size = 8)) 

7.7 Grouping Data, Line plot 4.6

gender_distribution <- olympic_data %>%
  group_by(Year, Sex) %>%
  summarise(Athletes = n()) %>%
  ungroup()

ggplot(gender_distribution, aes(x = Year, y = Athletes, color = Sex, group = Sex)) +
  geom_line() + 
  geom_point() +
  labs(title = "Gender Distribution of Athletes Over the Years",
       x = "Year", 
       y = "Number of Athletes",
       color = "Sex") +
  theme_minimal() +
  scale_color_manual(values = c("lightcoral", "steelblue")) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

7.8 Filtering and Grouping Data, Line plot 4.7

usa_medals <- olympic_data %>%
  filter(NOC == 'USA', !is.na(Medal))

usa_medal_trends <- usa_medals %>%
  group_by(Year, Season) %>%
  summarise(Medals = n()) %>%
  ungroup()

ggplot(usa_medal_trends, aes(x = Year, y = Medals, color = Season, group = Season)) +
  geom_line() + 
  geom_point() +
  labs(title = "USA's Medal Trends Over the Years",
       x = "Year", 
       y = "Number of Medals",
       color = "Season") +
  scale_color_manual(values = c("Summer" = "lightcoral", "Winter" = "steelblue")) +  
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

7.9 Grouping and Filtering Data, Bar plot 4.8

hosting_years <- c(1904, 1932, 1984, 1996, 1932, 1960, 1980, 2002)

usa_medal_trends <- usa_medals %>%
  group_by(Year, Season) %>%
  summarise(Medals = n()) %>%
  ungroup() %>%
  mutate(Hosting = ifelse(Year %in% hosting_years, "Hosting", "Non-Hosting"))

ggplot(usa_medal_trends, aes(x = Year, y = Medals, fill = interaction(Season, Hosting))) +
  geom_bar(stat = "identity", position = "stack") + 
  labs(title = "USA's Medal Trends Over the Years (Season and Hosting)",
       x = "Year", 
       y = "Number of Medals",
       fill = "Season and Hosting Status") +
  scale_fill_manual(values = c("Summer.Hosting" = "lightcoral", "Winter.Hosting" = "steelblue", 
                               "Summer.Non-Hosting" = "darkred", "Winter.Non-Hosting" = "darkblue")) +  
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

7.10 Filtering and Grouping Data, Line plot 4.9

medals_overtime <- olympic_data %>%
  filter(!is.na(Medal)) %>%  
  group_by(Year, Season) %>%
  summarise(MedalCount = n()) %>%
  ungroup()

ggplot(medals_overtime, aes(x = Year, y = MedalCount, color = Season)) +
  geom_line(linewidth = 1.2) +  
  geom_point(size = 2) +   
  labs(title = "Amount of Medals Won Over Time in Summer vs Winter Olympics",
       x = "Year",
       y = "Number of Medals",
       color = "Olympic Season") +
  scale_color_manual(values = c("Summer" = "darkorange", "Winter" = "steelblue")) + 
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))