Project Topic

My project explores data related to the Winter and Summer Olympics. The data available is for both Summer and Winter Olympics that were held between 1896 and 2016.

The first year of the Winter Olympics was in 1924, so prior to that, there was only Summer Olympics. Noting as well that Winter and Summer Games were held in the same year up until 1992 and then they were staggered two-years apart, each running every four years (i.e., Winter Games in 1994, then Summer in 1996, and so forth).

The dataset includes information about each individual athlete that competed in the respective games for each sport/event. Different subsets of years, events/sports, countries, etc. will be used throughout the visualizations within this report.

This report includes eight figures as part of the Capstone Project (including dumbbell plots, bar plots, line plots, and choropleths) and some are interactive. For completeness, I’ve included the code I used to create each figure, as well as the code required to clean the data in preparation for each figure. Each figure includes short description. Enjoy!

Data Sources

rgriffin. (2018). 120 years of Olympic history: athletes and results, Version 2. Retrieved March 25, 2025 from https://www.kaggle.com/datasets/heesoo37/120-years-of-olympic-history-athletes-and-results/version/2.

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.4.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(maps)
## Warning: package 'maps' was built under R version 4.4.3
## 
## Attaching package: 'maps'
## 
## The following object is masked from 'package:purrr':
## 
##     map
library(mapproj)
## Warning: package 'mapproj' was built under R version 4.4.3
library(ggplot2)
library(plotly)
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout

Import Data

## import raw data
dat <- read_csv(paste0("athlete_events.csv"), show_col_types = FALSE)

noc_regions <- read_csv(paste0("noc_regions.csv"), show_col_types = FALSE)

## combine the above
dat <- dat %>%
  left_join(noc_regions,by="NOC") 

my_world_map <- map_data("world") ## note i attached library(maps) to access this data

Figure 1

The following line chart shows the number of participants for each year from 1924 to 2016 for the Winter and Summer Olympics. The chart shows that the number of participants is typically increasing, however, there are instances where participation rates dropped (e.g., the boycott from Summer 1980 where some larger countries (e.g., United States, Canada) did not send athletes, etc.). The Winter games has far fewer athletes than the Summer games in every case.

## prepare data
fig_dat1 <- dat %>% 
  filter(Year >= 1924) %>% 
  select(ID,Year,Season) %>% 
  distinct() %>% 
  group_by(Year, Season) %>% 
  summarise(`Count of Participants`=n(), .groups = "drop") %>% ungroup()
## create plot
ggplot(fig_dat1, aes(x=Year, y = `Count of Participants`, color=Season)) +
  geom_line(aes(linetype = Season)) + geom_point(size = 3) +
  labs(title = "Participants in Olympic Games from 1924 to 2016") +
  scale_color_manual(values = c("darkgreen","purple")) +
  theme_minimal()

Figure 2

The following line chart shows the number of events for each year from 1924 to 2016 for the Winter and Summer Olympics. The chart shows that there are far less events in the Winter than in the Summer Olympics, and that there has been some increase in events over time, however, it has been relatively stable for the last few games. The fact that there are more summer events helps explain the difference in number of participants in the former figure.

## prepare data
fig_dat2 <- dat %>% 
  filter(Year >= 1924) %>% 
  select(Year,Season,Event) %>% 
  distinct() %>% 
  group_by(Year, Season) %>% 
  summarise(`Count of Events`=n(), .groups = "drop") %>% ungroup()
## create plot
ggplot(fig_dat2, aes(x=Year, y = `Count of Events`, color=Season)) +
  geom_line(aes(linetype = Season)) + geom_point(size = 3) +
  labs(title = "Events in Olympic Games from 1924 to 2016") +
  scale_color_manual(values = c("darkgreen","purple")) +
  theme_minimal()

Figure 3

The following choropleth chart was created with the use of the maps package, and provides information on the athletes who competed in the 2016 Summer Olympics in Rio de Janeiro (Brazil). The counts of the participants in each country is used as a gradient to demonstrate where the most (lighter greens) or least (darker greens) athletes were from across the World. As you can see within the figure, there are athletes competing from many areas of the World (note: grey depicts countries with zero athletes).

## prepare data
fig_dat3_tmp <- dat %>%
  filter(!is.na(region)) %>%
  select(ID,Year,Season,region) %>%
  distinct() %>%
  filter(Season == "Summer" & Year == "2016") %>%
  group_by(Year, Season,region) %>%
  summarise(`Count of Participants`=n(), .groups = "drop") %>% ungroup()

fig_dat3 <- my_world_map %>% left_join(fig_dat3_tmp, by="region")
## create plot
ggplot(data = fig_dat3, mapping = aes(x= long, y= lat,group=group,fill=`Count of Participants`)) +
  geom_polygon(color="black") +
  theme_classic() +
  labs(x="",y="",title="Participants in the 2016 Summer Olympics across the World") +
  theme(axis.text.x = element_blank(),axis.text.y=element_blank(),
        axis.line = element_blank(),axis.ticks = element_blank()) +
  theme(legend.position = "inside", legend.position.inside = c(0.1, 0.25),
        plot.title = element_text(size = 18)) +
  scale_fill_distiller(palette="Greens", na.value = "grey50")

Figure 4

The following choropleth chart was created with the use of the maps package, and provides information on the athletes who competed in the 2014 Winter Olympics in Sochi (Russia). The counts of the participants in each country is used as a gradient to demonstrate where the most (lighter purples) or least (darker purples) athletes were from across the World. As you can see within the figure, and in contrast to the Summer Olympics figure from above, far fewer countries send athletes to the Olympic games or participate in Winter sports (note: grey depicts countries with zero athletes). This is understandable, as climate would play a larger role in Winter sports than more universal sports for the Summer events.

## prepare data
fig_dat4_tmp <- dat %>% 
  filter(!is.na(region)) %>% 
  select(ID,Year,Season,region) %>% 
  distinct() %>% 
  filter(Season == "Winter" & Year == "2014") %>% 
  group_by(Year, Season,region) %>% 
  summarise(`Count of Participants`=n(), .groups = "drop") %>% ungroup()

fig_dat4 <- my_world_map %>% left_join(fig_dat4_tmp, by="region") 
## create plot
ggplot(data = fig_dat4, mapping = aes(x= long, y= lat,group=group,fill=`Count of Participants`)) +
  geom_polygon(color="black") +
  theme_classic() +
  labs(x="",y="",title="Participants in the 2014 Winter Olympics across the World") +
  theme(axis.text.x = element_blank(),axis.text.y=element_blank(),
        axis.line = element_blank(),axis.ticks = element_blank()) +
  theme(legend.position = "inside", legend.position.inside = c(0.1, 0.25),
        plot.title = element_text(size = 18)) +
  scale_fill_distiller(palette="Purples", na.value = "grey50")

Figure 5

The following bar chart shows the different medals (Gold, Silver, or Bronze) received by Canadian athletes in Summer or Winter Olympics from the 1994 Winter Olympics through to the 2016 Summer Olympics. I’ve made the chart interactive using plotly, so that you can see the change in medal counts across these games - feel free to press ‘Play’ to automatically see these changes over time or select a specific Olympic Games of interest and/or hover over the bars for more specific information. You’ll notice that the most Gold medals won by Canadians (14) was done in the 2010 Winter Olympics (which was hosted in Canada - Vancouver, B.C.).

## prepare data
fig_dat5 <- dat %>% 
  filter(!is.na(Medal)) %>%
  filter(Year >= 1994) %>% 
  filter(NOC=="CAN") %>% 
  mutate(Sex = ifelse(Sex=="M","Male","Female")) %>% 
  mutate(Medal = factor(Medal,levels=c("Gold","Silver","Bronze"))) %>% 
  select(Event,Games,Medal) %>% distinct() %>% 
  group_by(Medal,Games) %>% summarise(Count=n(), .groups = "drop") %>% ungroup()
## create plot
fig5 <- ggplot(fig_dat5, aes(x=Medal,y=Count,fill=Medal,frame=Games)) +
  geom_col(position = "identity") +
  scale_fill_manual(values = c("gold","darkgrey","#CD7F32")) +
  labs(title = "Medals Won by Canada in the Olympics from 1994 to 2016") +
  guides(fill = "none") +
  theme_bw()

ggplotly(fig5)

Figure 6

The following bar chart shows historical medals (Gold, Silver, or Bronze) won by Team from Summer or Winter Olympics from 1896 to 2016. The chart focuses only on teams that have won a total of at least 400 medals of any type during this time. As you can see right away, the United States of America have the most medals of any colour by a large margin, with the next closest being the Soviet Union, where their number of gold medals is less than the americans number of bronze medals.

## prepare data
fig_dat6_tmp <- dat %>% 
  filter(!is.na(Medal)) %>%
  mutate(Medal = factor(Medal,levels=c("Bronze","Silver","Gold"))) %>% 
  select(Team,Event,Games,Medal) %>% distinct() %>% 
  group_by(Team, Medal) %>% summarise(Count=n(), .groups = "drop") %>% ungroup()

tot_medals_won <- fig_dat6_tmp %>% 
  group_by(Team) %>% summarise(Count=sum(Count), .groups = "drop") %>% ungroup() %>% 
  filter(Count >= 400)

fig_dat6 <- fig_dat6_tmp %>% 
  filter(Team %in% tot_medals_won$Team) %>% 
  arrange(desc(Count))
## create plot
ggplot(fig_dat6, aes(x=reorder(Team, -Count, decreasing = TRUE),y=Count,fill=Medal, label = Count)) +
  geom_bar(stat = "identity",position="dodge") +
  coord_flip() +
  scale_fill_manual(values = rev(c("gold","darkgrey","#CD7F32"))) +
  labs(title = "Highest Medal Counts for all Olympics from 1896 to 2016", x="Team") +
  guides(fill=guide_legend(title="Medal",reverse = TRUE)) +
  theme_minimal() +
  ylim(0,1200)

Figure 7

The following dumbbell plot illustrates the differences between the counts of both Medal and Non-Medal recipients at the 2016 Summer Olympics. To simplify the plot, only those teams with at least 100 athletes are included in the figure. This plot helps at a glance see the teams where their ratio to medalled versus non-medalled is high or low. For example, we can see that Brazil had the most Non-Medal recipients, however, not many medalled (why the distant between the two dots is huge); alternatively, we can see that Serbia had the opposite; more of Serbia’s atheletes received a medal than did not (why the red dot is to the left of the blue dot). As seen depicted in prior images, the United States typically sends a lot of athletes to the games, but they also perform well and receive a lot of medals, which is also shown in this figure, where the distance between the dots isn’t super large, but counts are high for both.

## prepare data
fig_dat7_tmp <- dat %>% 
  filter(Year == 2016) %>% 
  select(ID,Medal, Team) %>% distinct() %>% 
  mutate(Status = ifelse(is.na(Medal),"Non-Medal Recipients","Medal Recipients")) %>% 
  group_by(Status,Team) %>% 
  summarise(Count=n(), .groups = "drop") %>% ungroup()

tot_participants <- dat %>% 
  filter(Year == 2016) %>% 
  select(ID,Medal, Team) %>% distinct() %>% 
  group_by(Team) %>% 
  summarise(Count=n(), .groups = "drop") %>% ungroup() %>% 
  filter(Count >=100)

fig_dat7 <- fig_dat7_tmp %>% 
  filter(Team %in% tot_participants$Team)
## create plot
ggplot(fig_dat7, aes(x = Count, y = reorder(Team,Count, FUN = max))) +
  geom_line() +
  geom_point(aes(color = Status), size = 3) +
  scale_color_brewer(palette = "Set1", direction = -1) +
  theme(legend.position = "bottom") +
  labs(title = "Difference between Number of Medal vs. Non-Medal Recipients at 2016 Summer Olympics by Team",
       x = "Athlete Counts", y = "Team", color = "Medal Status")

Figure 8

The following scatter plot provides information on the average heights and average weights of athletes in each sport by sex for the 2016 Summer Olympic games in Rio de Janeiro as well as the 2014 Winter Olympic games in Sochi. I’ve generated the plot using plotly so that the chart can be interactive - filter for a specific Sex, or hover over a point to learn more (i.e., specific values of height/weight, as well as the Sport associated with this average). For example, the tallest/heaviest athlete in the Summer games was a male basketball player.

## prepare data
fig_dat8 <- dat %>% 
  filter(Games=="2016 Summer"| Games == "2014 Winter") %>% 
  mutate(Sex = ifelse(Sex=="M","Male","Female")) %>% 
  filter(!is.na(Height)) %>% filter(!is.na(Weight)) %>% 
  select(ID,Weight,Height,Sport,Season,Sex) %>% distinct()  %>% 
  group_by(Sport,Sex) %>% mutate(`Average Height (cm)` = mean(Height),
                                 `Average Weight (kg)` = mean(Weight)) %>% ungroup() %>% 
  select(-ID, - Weight, - Height) %>% distinct()
## create plot
fig8 <- ggplot(fig_dat8, 
               aes(y=`Average Height (cm)`,x=`Average Weight (kg)`, color=Sex, group=Season, label=Sport)) + 
  geom_point(size=3) +
  scale_color_manual(values=c("#D41159","#1A85FF")) +
  facet_wrap(~Season) +
  theme_bw() +
  theme(panel.spacing.x = unit(1, "lines")) +
  labs(title = "Average Size of Olympic Athletes by Sex and Season for Each Sport") +
  theme(legend.position="bottom",legend.background = element_blank(),
        legend.box.background = element_rect(colour = "black"),
        strip.text = element_text(size = 10))

ggplotly(fig8)