This project explores 120 years of Olympic data (starting in 1896 and ending in 2016), considering variables such as medals awarded, countries, and gender, to name a few; to answer research questions on how the Olympics have changed over time, we visualize the results using R programming.
Many wonder how the Olympics’ long history has changed over time. In this project, we use R programming methods to explore questions related to medal distribution, top sports and countries, athletes’ ages, gender distribution, and winning trends.
This bar plot shows the distribution of gold, silver, and bronze medals over time. We can see a general trend that the total medals awarded have increased over time; the total amount of gold, silver, and bronze medals has also increased to reflect the change in the total. We can see that in 1994, when the summer and Winter Olympics changed to being held every two years instead of occurring in the same year, the Winter Olympics had much fewer medals awarded, likely due to a more narrow range of sports. However, the trend for total medals awarded in the Winter Olympics still increases over time.
The bar plot above shows the top ten countries that dominate the Olympics based on total medals awarded over the years. The US has historically dominated the Olympics overall. Although Russia and Germany, in second and third place respectively, have way more medals than the rest of the remaining top ten countries, they still fall short of the US by a similar margin by which the remaining top ten countries fall short of them.
In this bar plot, we can see the top ten sports based on the most medals awarded in the Olympics over the years. Unsurprisingly, Athletics, Swimming, and Gymnastics are all in the top five. However, Rowing is a surprise, as it is not often thought of when considering the most popular sport in the Olympics.
The line plot above shows the average and median ages of Olympic athletes over time. We included the median age in addition to the average because it is more resistant to being skewed by outliers in the data. The general trend is an increase in ages as we approach 1930. Athletes start getting younger and younger as we approach 1980, and ages begin to increase towards the present day.
This heatmap shows which countries perform well in specific sports. We can see that the US dominated Athletics and Swimming, France and Italy did quite well in Fencing, Australia did well in Swimming, and unsurprisingly Canada did well in Ice Hockey which makes sense due to how popular the sport is in that country.
The line plot above displays the gender distribution of athletes over the years. Women were first allowed to compete in the Olympics in 1900, and female participation slowly increased until around 1950, when the increase sped up a little as we approached 1980. After 1980, there was a steady increase in female participation until around 1994, when the increase began to slow, likely due to when the summer and Winter Olympics split that same year. During the split in 1994, we can see fewer overall athletes in the Winter Olympics. However, female participation continues to slowly increase. Male participation generally increases over the years but decreases slightly after 1994, likely due to the split.
In this line plot, we see the US performance in the Olympics over the years. The general trend is an increased performance over the years for both the summer and winter Olympics. However, for the summer Olympics, their best years were around 1904, 1984, and 2008. Their performance significantly decreased in the games prior to WWII.
With this bar plot, we analyze the USA’s performance when they hosted the Olympics versus when they did not. We can see that the US tends to do very well when they host in the summer but not as much when they host in the winter.
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Based on our other graphs, we hypothesized that many fewer medals are awarded in the Winter Olympics than in the Summer Olympics, but we wanted to plot that relationship directly to support our educated guesses and visualize the trend. Medals awarded in the summer steadily increased until 1920, when they fell dramatically, probably leading up to WWII. Afterward, they increased very quickly up until the end of our dataset. Since the Winter Olympics began, the increase in the amount of medals awarded has been much slower.
This project aimed to answer research questions that considered Olympic participation and performance. We were able to see how history has affected these two metrics, from the First and Second World Wars to when women were first allowed to compete, and we took these events into account during our analysis of the results.
knitr::opts_chunk$set(echo = TRUE)
library(ggplot2)
library(dplyr)
library(tidyr)
library(reshape2)
olympic_data <- read.csv("athlete_events.csv")
noc_regions <- read.csv("noc_regions.csv")
df_medals <- olympic_data %>%
filter(!is.na(Medal)) %>%
group_by(Year, Medal) %>%
summarise(MedalCount = n())
df_medals$Medal <- factor(df_medals$Medal, levels = c("Gold", "Silver", "Bronze"))
ggplot(df_medals, aes(x = Year, y = MedalCount, fill = Medal)) +
geom_bar(stat = "identity", position = "stack") +
labs(title = "Distribution of Gold, Silver, and Bronze Medals Over Time",
x = "Year",
y = "Number of Medals") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_fill_manual(values = c("Gold" = "gold",
"Silver" = "gray",
"Bronze" = "brown"))
olympic_data_combined <- olympic_data %>%
mutate(NOC = ifelse(NOC == "USR", "RUS", NOC))
total_medals_by_country <- olympic_data_combined %>%
filter(!is.na(Medal)) %>%
group_by(NOC) %>%
summarise(TotalMedals = n(), .groups = 'drop') %>%
arrange(desc(TotalMedals))
total_medals_with_names <- total_medals_by_country %>%
left_join(noc_regions, by = "NOC")
total_medals_with_names_combined <- total_medals_with_names %>%
mutate(region = ifelse(NOC == "USR", "Russia", region)) %>%
group_by(region) %>%
summarise(TotalMedals = sum(TotalMedals), .groups = 'drop')
top_10_countries <- total_medals_with_names_combined %>%
slice_max(TotalMedals, n = 10)
top_10_countries <- top_10_countries %>%
arrange(TotalMedals) %>%
mutate(region = factor(region, levels = region))
ggplot(top_10_countries, aes(x = reorder(region, TotalMedals), y = TotalMedals, fill = region)) +
geom_bar(stat = "identity") +
labs(title = "Top 10 Countries by Total Medals Awarded in the Olympics",
x = "Country",
y = "Total Medals") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 0),
legend.position = "none") +
scale_fill_viridis_d() +
coord_flip()
medals_by_sport <- olympic_data %>%
filter(!is.na(Medal)) %>%
group_by(Sport) %>%
summarise(TotalMedals = n()) %>%
arrange(desc(TotalMedals))
top_10_sports <- medals_by_sport %>%
top_n(10, TotalMedals)
ggplot(top_10_sports, aes(x = reorder(Sport, TotalMedals), y = TotalMedals, fill = Sport)) +
geom_bar(stat = "identity") +
labs(title = "Top 10 Sports by Total Medals Awarded in the Olympics",
x = "Sport",
y = "Total Medals") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "none") +
scale_fill_viridis_d() +
coord_flip()
age_by_year <- olympic_data %>%
filter(!is.na(Age)) %>%
group_by(Year) %>%
summarise(AverageAge = mean(Age, na.rm = TRUE),
MedianAge = median(Age, na.rm = TRUE))
ggplot(age_by_year, aes(x = Year)) +
geom_line(aes(y = AverageAge, color = "Average Age"), linewidth = 1) +
geom_line(aes(y = MedianAge, color = "Median Age"), linewidth = 1, linetype = "dashed") +
labs(title = "Average and Median Age of Athletes Over Time",
x = "Year",
y = "Age",
color = "Age Type") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
medal_counts <- olympic_data_combined %>%
filter(!is.na(Medal)) %>%
group_by(NOC, Sport) %>%
summarise(MedalCount = n()) %>%
ungroup()
top_countries <- medal_counts %>%
group_by(NOC) %>%
summarise(TotalMedals = sum(MedalCount)) %>%
top_n(10, TotalMedals) %>%
pull(NOC)
top_sports <- medal_counts %>%
group_by(Sport) %>%
summarise(TotalMedals = sum(MedalCount)) %>%
top_n(10, TotalMedals) %>%
pull(Sport)
filtered_data <- medal_counts %>%
filter(NOC %in% top_countries & Sport %in% top_sports)
noc_regions_clean <- noc_regions %>%
select(-notes)
total_medals_with_regions <- filtered_data %>%
left_join(noc_regions_clean, by = "NOC")
total_medals_with_regions <- total_medals_with_regions %>%
mutate(MedalCount = as.numeric(MedalCount)) %>%
filter(!is.na(MedalCount))
heatmap_data <- total_medals_with_regions %>%
pivot_wider(names_from = Sport, values_from = MedalCount, values_fill = list(MedalCount = 0))
heatmap_data_long <- heatmap_data %>%
pivot_longer(cols = -c(NOC, region), names_to = "Sport", values_to = "MedalCount")
ggplot(heatmap_data_long, aes(x = Sport, y = region, fill = MedalCount)) +
geom_tile() +
scale_fill_gradient(low = "white", high = "blue") +
labs(title = "Top 10 Countries' Medal Counts by Sport (Region)",
x = "Sport",
y = "Region",
fill = "Medal Count") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
axis.text.y = element_text(size = 8))
gender_distribution <- olympic_data %>%
group_by(Year, Sex) %>%
summarise(Athletes = n()) %>%
ungroup()
ggplot(gender_distribution, aes(x = Year, y = Athletes, color = Sex, group = Sex)) +
geom_line() +
geom_point() +
labs(title = "Gender Distribution of Athletes Over the Years",
x = "Year",
y = "Number of Athletes",
color = "Sex") +
theme_minimal() +
scale_color_manual(values = c("lightcoral", "steelblue")) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
usa_medals <- olympic_data %>%
filter(NOC == 'USA', !is.na(Medal))
usa_medal_trends <- usa_medals %>%
group_by(Year, Season) %>%
summarise(Medals = n()) %>%
ungroup()
ggplot(usa_medal_trends, aes(x = Year, y = Medals, color = Season, group = Season)) +
geom_line() +
geom_point() +
labs(title = "USA's Medal Trends Over the Years",
x = "Year",
y = "Number of Medals",
color = "Season") +
scale_color_manual(values = c("Summer" = "lightcoral", "Winter" = "steelblue")) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
hosting_years <- c(1904, 1932, 1984, 1996, 1932, 1960, 1980, 2002)
usa_medal_trends <- usa_medals %>%
group_by(Year, Season) %>%
summarise(Medals = n()) %>%
ungroup() %>%
mutate(Hosting = ifelse(Year %in% hosting_years, "Hosting", "Non-Hosting"))
ggplot(usa_medal_trends, aes(x = Year, y = Medals, fill = interaction(Season, Hosting))) +
geom_bar(stat = "identity", position = "stack") +
labs(title = "USA's Medal Trends Over the Years (Season and Hosting)",
x = "Year",
y = "Number of Medals",
fill = "Season and Hosting Status") +
scale_fill_manual(values = c("Summer.Hosting" = "lightcoral", "Winter.Hosting" = "steelblue",
"Summer.Non-Hosting" = "darkred", "Winter.Non-Hosting" = "darkblue")) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
medals_overtime <- olympic_data %>%
filter(!is.na(Medal)) %>%
group_by(Year, Season) %>%
summarise(MedalCount = n()) %>%
ungroup()
ggplot(medals_overtime, aes(x = Year, y = MedalCount, color = Season)) +
geom_line(linewidth = 1.2) +
geom_point(size = 2) +
labs(title = "Amount of Medals Won Over Time in Summer vs Winter Olympics",
x = "Year",
y = "Number of Medals",
color = "Olympic Season") +
scale_color_manual(values = c("Summer" = "darkorange", "Winter" = "steelblue")) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))