With the 2022 Olympics underway, I chose to focus my project on a dataset with various information from the Olympics. Although I had originally chose a dataset with billboard hot 100 data, I found that there was a lot of data missing and found the variables overall hard to work with so I decided to switch to this dataset instead. With this dataset I was interested in looking at what countries had high success rates in the Olympics, as well as what factors contributed to the United States success in past Olympics. I focused mainly on how the sport, year, and athlete’s age contributed to a higher count of a bronze, silver, or gold medal achievement for various countries as well as the United States. I also focused a majority of my charts on summer Olympic games and events.
This dataset contains 271,116 observations of data from all Olympic games from Athens 1896 to Rio 2016. It contains many useful variables about the athlete’s including an athlete id, name, sex, age, height and weight. It also contains the athlete’s team, national Olympic committee region, sport, event, and what medal they won if any. In addition it has information about what games it was including year, season, host city, and official name of the Olympic games that took place.
Overall, I found the average age of Olympic athlete’s are going up. I also found a majority of medals won in the Olympics was done so by athletes in between 20 and 30 years old. The summer sports I found to be most prominent were Athletics (Track and Field) and Swimming because they have so many events they show highest in medal counts. I also found the United States holds the most Olympic medals by quite a wide margin and the countries that follow are mainly found in Europe.
Here is where I have loaded and created the dataframes I will be using.
#------------------------------------------------------------------------------#
library(ggplot2)
library(scales)
library(RColorBrewer)
library(ggthemes)
library(plyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ tibble 3.1.6 ✓ purrr 0.3.4
## ✓ tidyr 1.2.0 ✓ stringr 1.4.0
## ✓ readr 2.1.1 ✓ forcats 0.5.1
## Warning: package 'tidyr' was built under R version 4.1.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::arrange() masks plyr::arrange()
## x readr::col_factor() masks scales::col_factor()
## x purrr::compact() masks plyr::compact()
## x dplyr::count() masks plyr::count()
## x purrr::discard() masks scales::discard()
## x dplyr::failwith() masks plyr::failwith()
## x dplyr::filter() masks stats::filter()
## x dplyr::id() masks plyr::id()
## x dplyr::lag() masks stats::lag()
## x dplyr::mutate() masks plyr::mutate()
## x dplyr::rename() masks plyr::rename()
## x dplyr::summarise() masks plyr::summarise()
## x dplyr::summarize() masks plyr::summarize()
library(plotly)
##
## Attaching package: 'plotly'
## The following objects are masked from 'package:plyr':
##
## arrange, mutate, rename, summarise
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(data.table)
##
## Attaching package: 'data.table'
## The following object is masked from 'package:purrr':
##
## transpose
## The following objects are masked from 'package:dplyr':
##
## between, first, last
library(ggrepel)
#------------------------------------------------------------------------------#
### SETTING UP DATA FRAMES ###
# Load dataset
olympicdata <- tidytuesdayR::tt_load(2021, week = 31)
## --- Compiling #TidyTuesday Information for 2021-07-27 ----
## --- There are 2 files available ---
## --- Starting Download ---
##
## Downloading file 1 of 2: `olympics.csv`
## Downloading file 2 of 2: `regions.csv`
## --- Download complete ---
# Create Dataframe
olympicdf <- data.frame(olympicdata$olympics)
olympicdf$medal[is.na(olympicdf$meda)] <- "No Medal" ### Changes NA's to no medal
# Create Dataframe with only available data for athlete age, height, weight
revisedolympicdf <- na.omit(olympicdf) ### Drops 64951
# Create Dataframes with only Bronze, Silver, Gold Medals
gsbolympicdf <-subset(revisedolympicdf, medal == "Gold" | medal == "Silver" | medal == "Bronze")
### Drops 175984 rows
# Subset US team, year, and medal for stacked bar chart
stackedchartdf <- data.frame(table(olympicdf$team, olympicdf$medal))
stackedchartdf <- subset(stackedchartdf, Var1 == "United States" | Var1 == "Italy" | Var1 == "Great Britain" | Var1 == "France" | Var1 == "Canada")
# Create dataframe with age averages for line graphs
linegraphdf <- data.frame(revisedolympicdf$age, revisedolympicdf$year)
linegraphdf <- subset(linegraphdf, revisedolympicdf.year >= 2000)
linegraphdf <- aggregate(linegraphdf$revisedolympicdf.age, list(linegraphdf$revisedolympicdf.year), mean)
# Create mins/maxs for linegraph
x_axis_labels = min(linegraphdf$Group.1):max(linegraphdf$Group.1)
hi_lo <- linegraphdf %>%
filter((x == min(x)) | x == max(x)) %>%
data.frame()
# Subset US team, year, and medal for piechart
piechartdf <- data.frame(table(revisedolympicdf$team, revisedolympicdf$year, revisedolympicdf$medal, revisedolympicdf$age))
piechartdf <- subset(piechartdf, Var1 == "United States")
# Convert Medal Count to Numeric
piechartdf$Var4 <- as.numeric(as.character(piechartdf$Var4))
# Create new column
piechartdf <- piechartdf %>% mutate(AgeGroup =
case_when(Var4 < 20 ~ "Under 20",
Var4 <= 30 ~ "20-30",
Var4 > 30 ~ "Above 30"))
# Create dataframe with sports, medal counts, and athelete ages for multi bar chart
multibardf <- data.frame(table(gsbolympicdf$sport, gsbolympicdf$year, gsbolympicdf$medal, gsbolympicdf$age, gsbolympicdf$team))
multibardf$Var4 <- as.numeric(as.character(multibardf$Var4))
# Subset US and popular sports
multibardf <- subset(multibardf, Var5 == "United States")
multibardf <- subset(multibardf, Var1 == "Swimming" | Var1 == "Athletics" | Var1 == "Diving" | Var1 == "Gymnastics"| Var1 == "Volleyball")
# Create age group column
multibardf <- multibardf %>% mutate(AgeGroup =
case_when(Var4 < 20 ~ "Under 20",
Var4 <= 30 ~ "20-30",
Var4 > 30 ~ "Above 30"))
# Ordering x axis
myagegroups <- c('Under 20', '20-30', 'Above 30')
age_order <- factor(multibardf$AgeGroup, level=myagegroups )
# Create and subset dataframe with teams, sports, and medals for heatmap
# Create and subset dataframe with teams, sports, and medals for heatmap
heatmapdf <- data.frame(table(gsbolympicdf$team, gsbolympicdf$sport, gsbolympicdf$medal))
heatmapdf <- subset(heatmapdf, Var1 == "United States")
heatmapdf <- subset(heatmapdf, Var2 == "Swimming" | Var2 == "Athletics" | Var2 == "Gymnastics"| Var2 == "Volleyball"| Var2 == "Diving")
I wanted to create a visualization showing the top countries and the amount of medals they have gotten over the course of the Olympics. I created a data frame with an overall medal count for gold, silver, and bronze and sorted the amount of gold medals from highest to lowest to see which countries had the most. I used the top 5 countries with the most gold medals in my visualization, and displayed the amount of medals they have won or not won overall. It was interesting to see the top 5 countries all came from Europe with the exception of the United States which dominated the medal count by a wide margin.
### Plot Olympics Stacked Bar Chart ###
stackedchart <- ggplot(stackedchartdf, aes(stackedchartdf, x = Freq, y = reorder(Var1, +Freq), fill = Var2))+
geom_bar(stat = "identity") +
labs(title = "Count of Medals for Top 5 Countries", x = "Medal Count", y="Country") +
theme_light() +
theme(plot.title = element_text(hjust = 0.5))
stackedchart
I was interested to see what the average age of an Olympic athlete was and how it has changed over the past 20 or so years. I created a data frame that calculated the average age of an Olympic athlete for each game from 2000 to 2016. As we can see the average age of the Olympic athletes has gone up slightly over the past two decades. This could mean that as time goes on athletes are able to stay in better shape longer. From 2000 to 2016 the average age of an Olympian went up my nearly a full year with the lowest and highest average being shown on the chart. Although there are some visible dips, there appears to be a steady incline across the 16 years.
### Plot Olympic Line Graph ###
linegraph <- ggplot(linegraphdf, aes(linegraphdf, x = Group.1, y = x)) +
geom_line(aes(color='black'), size=1) +
geom_point(shape=21, size=4, color='pink', fill= 'white' ) +
labs(title = "Average Age of Oympic Athletes by Year 2000-2016", x = "Year", y="Age") +
scale_y_continuous(labels=comma) +
theme_light() +
theme(plot.title = element_text(hjust = 0.5)) +
scale_x_continuous(labels=x_axis_labels, breaks = x_axis_labels, minor_breaks = NULL) +
geom_point(data = hi_lo, aes(x = Group.1, y = x), shape=21, size=4, fill='red', color='red') +
geom_label_repel(aes(label= ifelse(x == max(x) | x == min(x), scales::comma(x), "")),
box.padding=1,
point.padding=1,
size=4,
color='Grey50',
segment.color = 'darkblue')
linegraph
I was also interested to see how the amount of medals one by each age group has changed if at all over the past few summer Olympics, specifically for the United States. Using a data frame that sorted athletes into three age groups; under 20, 20-30, or above 30, I created a nested pie chart that showed the overall amount of gold, silver, and bronze medal by each age group for the 2008, 2012, and 2016 Summer Olympic Games. Not surprisingly, the age group that achieved the most medals was 20-30 year olds. Although athletes are typically associated with being young, athletes over 30 brought home more medals then those under 20 although this could be due to different team’s minimum age requirements.
### Plot Nested Pie Chart ###
piechart <- plot_ly(hole=0.7) %>%
layout(title="United States Summer Olympic Medals (2012-2016)") %>%
add_trace(data = piechartdf[piechartdf$Var2 ==2008,],
labels = ~AgeGroup,
values = ~piechartdf[piechartdf$Var2 == 2008, "Freq"],
type = "pie",
textposition = "inside",
hovertemplate = "Year: 2008<br>AgeGroup:%{label}<br>Percent:%{percent}<br>Medal Count: %{value}<extra></extra>") %>%
add_trace(data = piechartdf[piechartdf$Var2 ==2012,],
labels = ~AgeGroup,
values = ~piechartdf[piechartdf$Var2 == 2012, "Freq"],
type = "pie",
textposition = "inside",
hovertemplate = "Year: 2012<br>AgeGroup:%{label}<br>Percent:%{percent}<br>Medal Count: %{value}<extra></extra>",
domain = list(
x = c(0.16, 0.84),
y = c(0.16, 0.84))) %>%
add_trace(data = piechartdf[piechartdf$Var2 ==2016,],
labels = ~AgeGroup,
values = ~piechartdf[piechartdf$Var2 == 2016, "Freq"],
type = "pie",
textposition = "inside",
hovertemplate = "Year: 2016<br>AgeGroup:%{label}<br>Percent:%{percent}<br>Medal Count: %{value}<extra></extra>",
domain = list(
x = c(0.27, 0.73),
y = c(0.27, 0.73)))
piechart
Next, I wanted to see how ages of Olympic athletes were distributed among popular summer Olympic sports and how if age impacted the amount of medals won. I created a data frame that sorted athletes into three age groups; under 20, 20-30, or above 30 then created a graph that showed the amount of gold, silver, and bronze medals for each sport by age group.I found that 20-30 year olds brought home the most medals fror 4/5 sports with atheletes younger then 20 winning the most for gymnastics which makes sense since most gymmnasts retire at an early age.
### Plot Multi Bar Chart ###
multibarchart <- ggplot(multibardf, aes( x= age_order, y = Freq, fill = Var1)) +
geom_bar(stat="identity", position = "dodge") +
theme_light() +
theme(plot.title = element_text(hjust = 0.5)) +
scale_y_continuous(labels = comma) +
labs(title = "Total Medals by Athlete Age by Sport for the United States",
x = "Age Group",
y = "Medal Count",
fill = "Sport" +
scale_fill_brewer(palette="Set2")+
facet_wrap(~sport, ncol = 3, nrow = 2))
multibarchart
For my final visualization I wanted to investigate what summer Olympic sports the United States had the most success in. To do this, I used subset on my Olympic data to narrow down 8 popular Summer Olympic sports in the US and gold, silver, and bronze medals. Next I created a heatmap, of these sports and the amount of medals one in each by Team USA. I found that overall the US wins the most medals overall athletics (in track and field) and swimming. Like with the multi bar chart could be because there are more events in these sports and therfore more oppurtunities to succeed.
### Plot Olympic Medal/Sports Heatmap ###
heatmap <- ggplot(heatmapdf, aes(x= Var2, y = Var3, fill = Freq)) +
geom_tile(color="black") +
geom_text(aes(label=comma(Freq))) +
coord_equal(ratio=1) +
labs(title = "Olympic Medal Counts by Sport for the United States",
x = "Sport",
y = "Medal Type",
fill = "Medal Count") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5)) +
scale_y_discrete(limits =rev(levels(heatmapdf$Var3))) +
scale_fill_continuous(low="white", high="pink") +
guides(fill = guide_legend(reverse = TRUE, override.aes = list(colour="black")))
heatmap