The Olympic Games are a global competition that currently take place every 2 years, alternating between Summer and Winter editions with 3 Medal winners in each event. The data analyzed in this report looks at Games that took place from 1896 to 2014. It will also find trends in which countries are most successful at the games while also discovering important nuances in Olympic history.
This Olympic Dataset covers 120 years of Historical data (1896-2014) to uncover trends in athlete performance, country dominance, and sport evolution over time. The Data is supplied in three separate tables, one for Summer Games, one for Winter Games, and a third with Country information.
URL: https://www.kaggle.com/datasets/hassanjameelahmed/olympic-dataset/data
The datapoints included across all tables were: Year, City, Sport, Discipline, Athlete, Code, Gender, Event, Medal, Country, Population, GDP per Capita.
setwd("C://Users//Perry//OneDrive - University of Pittsburgh//Desktop//Loyola Classes//DS 736//Module 1//Project")
#library statements
library(data.table)
library(dplyr)
library(ggplot2)
library(scales)
library(RColorBrewer)
library(ggthemes)
library(plotly)
library(tidyr)
#read in csv file
country_df <- read.csv("CountriesSD.csv")
summer_df <- read.csv("SummerSD.csv")
winter_df <- read.csv("WinterSD.csv")
There were some simple data cleaning before the data was fully usable.
Removed a blank index column from the tables.
Filled in missing population and GDP data using 2014 data.
Adjusted column names between tables and removed duplicate data.
Joined the Summer and Winter Games Data Sets together.
# simple data cleaning from initial imports
#removing blank column off dfs
ColsToDrop <- c("X")
country_df <- country_df[, !names(country_df) %in% ColsToDrop]
summer_df <- summer_df[, !names(summer_df) %in% ColsToDrop]
winter_df = winter_df[, !names(winter_df) %in% ColsToDrop]
#filling in missing population data
country_df$Population[country_df$Code %in% c("COK", "ERI", "AHO", "PLE", "TPE")] <- c(17532, 3095173, 24279, 4550000, 23433753)
#filling in missing GDP data
country_df$GDP.per.Capita[country_df$Code %in%
c("ASA", "AND", "ARU", "BER", "IVB", "CAY", "COK", "CUB", "ERI", "GUM", "IRI", "PRK", "LBA", "LIE", "MTN", "MON", "AHO", "PLE", "PNG", "PUR", "SMR", "SYR", "TPE", "VEN", "ISV")] <-
c(12026.3, 44369.7, 26129.8, 100961.6, 0.0, 75845.3, 19523.0, 7146.7, 688.7, 33900.1, 5672.1, 0.0, 8926.4, 178735.2, 1713.0, 195693.6, 25800.0, 3352.1, 2723.2, 28576.1, 51260.5, 1060.6, 22700.0, 7096.2, 33045.4)
#removing country column
code_col_drop <- c("Country")
summer_df <- summer_df[, !names(summer_df) %in% code_col_drop]
#renaming code column
colnames(winter_df)[colnames(winter_df) == "Country"] <- "Code"
#adding labels to tables before join
summer_df$Season <- "Summer"
winter_df$Season <- "Winter"
#rearranging columns and joining into one table
winter_df <- winter_df[, colnames(summer_df)]
both_games_df <- rbind(summer_df, winter_df)
Using R to analyze the overall data set I was able to find different trends in the data that were outliers to the expected outcome. I then looked closer at the top 5 performing countries in overall medal count to see if there were any consistent trends between the top countries.
In the first visual, looking at total number of countries that medaled in the Olympics by year you see a clear consistent increase in the number of countries that won medals increasing with each consecutive games. However, you will notice that between the 1992 Games and the 1994 games there begins a new trend showing a large oscillation between a high number of winning countries and a low number. The growing number of countries earning medals is due to the continual growth of participating countries in the games, starting with 14 in the 1896 Games and growing to 204 in the 2012 Summer Games. The oscillation we see at the top of the graph notes the divide of the games into their current format of Winter or Summer Games happening every 2 years instead of having both in the same year but 4 years apart.
Moving down to the second visual, I wanted to see how the different countries performed in the medals. You can see a clear trend across all the games with a gradient moving from higher number of countries winning Bronze Medals and a lower number winning Gold Medals. That is with the exception of the first games in the data set in 1896 Games and in the 1928 Games.
#Bar Chart showing number of countries that won medals in each Olympic Games by Year
year_country_df <- both_games_df %>%
select(Year, Code) %>%
group_by(Year) %>%
summarise(n = length(unique(Code)), .groups = 'keep') %>%
data.frame()
#Convert Year to Factor to remove blank spaces when using as axis
year_country_df$Year <- as.factor(year_country_df$Year)
#plotting the data
ggplot(year_country_df, aes(x = Year, y = n)) +
geom_bar(color = "black", fill = "lightblue", stat = "identity") +
geom_text(aes(label = n), hjust = -0.25, size = 4) +
labs(title = "Number of Countries that Medaled in the Olympic Games by Year",
x = "Year",
y = "Number of Countries") +
theme_light() +
theme(plot.title = element_text(hjust = 0.5)) +
scale_y_continuous(breaks = seq(min(year_country_df$n), max(year_country_df$n), by = 10)) +
coord_flip()
#create the dataframe for heatmap
medal_heatmap_df <- both_games_df %>%
select(Year, Medal, Code) %>%
group_by(Year, Medal) %>%
summarise(n = length(unique(Code)), .groups = 'keep') %>%
data.frame()
medal_heatmap_df$Year <- factor(medal_heatmap_df$Year)
medal_heatmap_df$Medal <- factor(medal_heatmap_df$Medal, levels = c("Bronze", "Silver", "Gold"))
breaks <- seq(0, max(medal_heatmap_df$n), by = 10)
ggplot(medal_heatmap_df, aes(x = Medal, y = Year, fill = n)) +
geom_tile(color = "black") +
geom_text(aes(label = comma(n)), size = 4) +
labs(title = "Heatmap: Number of Medal Winning Countries by Type and Year",
x = "Medal Type",
y = "Year",
fill = "Winner Count") +
theme_light() +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_continuous(low = "white", high = "red", breaks = breaks) +
guides(fill = guide_legend(reverse = TRUE))
I wanted to take a look at the Top performing countries in the Olympics overall through medal count and see if there were any major trends. *An important note about the medal counts used for this analysis is that it is overall total medal counts based on athletes, meaning a team with 12 athletes would count as 12 medals.
In the first visual, the total medals earned by year are included. There are some very notable spikes on the graph where countries performed much better than expected. The USA in 1904, marking the first Games held in the Western Hemisphere (St. Louis, Missouri) and also notably low participation by other countries. GBR in 1908 held in London, England with new event inclusions that were English specialties and Irish athletes having to participate under the British team. GER in 1936, hosted in Berlin, Germany.These Games were also seen as a propaganda machine for the party in power in Germany at that time. Finally, the URS in 1980, hosted in Moscow, Russia, was boycotted by many nations over the Soviet invasion of Afghanistan. The high oscillation on the graph in more recent years is due to the chance in format between the Summer and Winter Games with less events and medal chances in the Winter Games versus the Summer Games.
Moving down to the second visual you will see the overall medal type breakdown for each of the top countries. It is interesting to see how Great Britain, Germany, and France all have a rather balanced ratio of Gold, Silver, and Bronze medals while the United States and Soviet Union are both skewed towards Gold Medals over Silver and Bronze.
The Third Visual looks at there total medal count in the line versus the total events won in the bar. An interesting point to notice on this graph is that while the Soviet Union (URS) only participated in the Games from 1952 to 1988, their total medal count is the second highest on the list compared to all the other countries that have participated from the start of the games in 1896 to 2014.
#Line Graph showing top 5 countries with the most medals and their medal count over time
country_total <- count(both_games_df, Code)
country_total <- country_total[order(country_total$n, decreasing = TRUE),]
top5 <- country_total$Code[1:5]
medal_year <- both_games_df %>%
filter(Code %in% top5) %>%
group_by(Year, Code) %>%
summarise(n = length(Code), .groups = 'keep') %>%
data.frame()
#graph the data
ggplot(medal_year, aes(x = Year, y = n, group = Code)) +
geom_line(aes(color = Code), size = 1) +
labs(title = "Top 5 Countries in Total Medal Count: Medals Earned by Year",
x = "Year",
y = "Number of Medals Won") +
theme_light() +
theme(plot.title = element_text(hjust = 0.5)) +
geom_point(shape = 23, size = 2, color = "black", fill = "gold") +
scale_x_continuous(breaks = seq(min(medal_year$Year), max(medal_year$Year), by = 10)) +
scale_color_brewer(palette = "Paired", guide = guide_legend(reverse = TRUE))
# Creating dataframe with appropriate data from top 5 countries
top5_pie_df <- both_games_df %>%
filter(Code %in% top5) %>%
group_by(Code, Medal) %>%
summarise(n = length(Medal), .groups = 'keep') %>%
data.frame()
top5_pie_df$Medal <- factor(top5_pie_df$Medal, levels = c("Bronze", "Silver", "Gold"))
#create trellis donut pie charts,
#not sure why the order of the marker list color is different from the vector order,
#guessing it is alphabetical
plot_ly(textposition = "inside", labels = ~Medal, values = ~n, hole = 0.6) %>%
add_pie(data = top5_pie_df[top5_pie_df$Code == top5[1],],
name = top5[1],
title = top5[1],
domain = list(row = 0, column = 0),
marker = list(colors = c("peru", "gold", "silver"))) %>%
add_pie(data = top5_pie_df[top5_pie_df$Code == top5[2],],
name = top5[2],
title = top5[2],
domain = list(row = 0, column = 1),
marker = list(colors = c("peru", "gold", "silver"))) %>%
add_pie(data = top5_pie_df[top5_pie_df$Code == top5[3],],
name = top5[3],
title = top5[3],
domain = list(row = 0, column = 2),
marker = list(colors = c("peru", "gold", "silver"))) %>%
add_pie(data = top5_pie_df[top5_pie_df$Code == top5[4],],
name = top5[4],
title = top5[4],
domain = list(row = 1, column = 0),
marker = list(colors = c("peru", "gold", "silver"))) %>%
add_pie(data = top5_pie_df[top5_pie_df$Code == top5[5],],
name = top5[5],
title = top5[5],
domain = list(row = 1, column = 1),
marker = list(colors = c("peru", "gold", "silver"))) %>%
layout(title = "Top 5 Countries: Medal Type Breakdown",
showlegend = TRUE,
grid = list(rows = 2, columns = 5))
#creating dataframe for combo graph
combo_df <- both_games_df %>%
filter(Code %in% top5) %>%
group_by(Code) %>%
summarise(events_medaled = length(unique(Event)),
total_medals = length(Medal), .groups = 'keep') %>%
data.frame()
combo_df$Code <- factor(combo_df$Code, levels = top5)
#scale for different y axis without hardcoding numbers and keeping the ratio similar
scale_factor <- max(combo_df$events_medaled) / max(combo_df$total_medals)
#create combo graph of bar and line graph for events medaled and total medals
ggplot(combo_df, aes(x = Code)) +
geom_bar(aes(y = events_medaled), stat = "identity", fill = "lightblue", color = "black") +
geom_text(aes(y = 0, label = events_medaled), vjust = -1) +
geom_line(aes(y = total_medals * scale_factor, group = 1), color = "gold", size = 1) +
geom_point(aes(y = total_medals * scale_factor), color = "gold", size = 1) +
geom_text(aes(y = total_medals * scale_factor, label = total_medals)) +
labs(title = "Top 5 Countries: Events Medaled In vs Total Medals",
x = "Country",
y = "Number of Events Medaled In") +
scale_y_continuous(sec.axis = sec_axis(~ . / scale_factor, name = "Total Medals Won")) +
theme_light() +
theme(plot.title = element_text(hjust = 0.5))
The analysis of this dataset shows the continual growth of the Olympic Games through history to include nearly all of the countries in the world. Hopefully we will see a continued increase in the number of countries earning medals in the Games as well as increased participation in the Winter Games. The “Home Field Advantage” of hosting the Olympics is also show to be a very important indicator for how well a country will perform. Continued analysis of this data set should explore the relationship between GDP and population in producing medal earning Olympic athletes while also exploring specific country specialization in sports.