The dataset used for this project contains information on the modern
Olympics, including all Games from 1896 to 2016. The variables included
in the dataset are the ID number for each athlete, name, sex, age,
height (cm), weight (kg), team (country represented), NOC (National
Olympic Committee 3-letter code), Games (year and season), year, season
(Summer or Winter), host city, sport, event, and medal (Gold, Silver,
Bronze, or NA). The data is sourced from Sports Reference (http://www.sports-reference.com/). The aim of this
project was to illustrate the distribution of the different types of
medals between the top participating countries in the time period of the
data collection, in addition to further exploring the correlation
between participation and success in the Olympics. The only cleaning I
did in this project was setting all variable names to lowercase. The NAs
in this dataset were either in the medal column, which
meant the athlete did not win a medal in that year’s event, or in the
age, height, and/or weight
columns, which meant that the athlete was a posthumous competitor (e.g.,
artwork from dead artists can be submitted into the Olympic Art
competitions) or those measurements were not recorded during the time of
the event. Since this project is focused on team participation and
medals won, the NAs in the dataset did not affect the results of the
project. I chose this topic and dataset because I love playing and
watching sports and the dataset is very rich in its content as it has
120 years of data on the Olympics.
The Olympic Games are an international sporting event that occurs every four years, dating back ancient Greece in 776 B.C. The modern era of the Games started in 1896 and persisted despite interruptions during World Wars I and II, maintaining numerical designations for all years of the Games (“The Olympic Games”). The iconic symbol of interlocking colored rings representing continents was introduced at the Antwerp Games in 1920. The Olympics gained global recognition after the 1924 Games in Paris, which featured 3,000 athletes, including more than 100 women, from 44 participating nations, marking the introduction of a closing ceremony. The Winter Olympics made their debut that same year, showcasing events such as figure skating, ice hockey, and bobsledding. While the Olympics have faced challenges like political boycotts and controversies, they persist as a globally celebrated event, transcending borders and promoting ideals of peace and cooperation through sports.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(plotly)
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
olympics <- read_csv("athlete_events.csv")
## Rows: 271116 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): Name, Sex, Team, NOC, Games, Season, City, Sport, Event, Medal
## dbl (5): ID, Age, Height, Weight, Year
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
names(olympics) <- tolower(names(olympics)) #set all column names to lowercase
head(olympics)
## # A tibble: 6 × 15
## id name sex age height weight team noc games year season city
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 1 A Dijiang M 24 180 80 China CHN 1992… 1992 Summer Barc…
## 2 2 A Lamusi M 23 170 60 China CHN 2012… 2012 Summer Lond…
## 3 3 Gunnar N… M 24 NA NA Denm… DEN 1920… 1920 Summer Antw…
## 4 4 Edgar Li… M 34 NA NA Denm… DEN 1900… 1900 Summer Paris
## 5 5 Christin… F 21 185 82 Neth… NED 1988… 1988 Winter Calg…
## 6 5 Christin… F 21 185 82 Neth… NED 1988… 1988 Winter Calg…
## # ℹ 3 more variables: sport <chr>, event <chr>, medal <chr>
top20 <- olympics |> group_by(team) |> summarise(count = n_distinct(id)) |> arrange(desc(count)) |> head(20)
top20
## # A tibble: 20 × 2
## team count
## <chr> <int>
## 1 United States 9115
## 2 France 5784
## 3 Great Britain 5764
## 4 Italy 4698
## 5 Germany 4579
## 6 Canada 4547
## 7 Japan 4012
## 8 Australia 3768
## 9 Sweden 3618
## 10 Poland 2938
## 11 Netherlands 2864
## 12 Soviet Union 2864
## 13 Hungary 2726
## 14 Switzerland 2722
## 15 China 2599
## 16 Spain 2597
## 17 South Korea 2353
## 18 Russia 2324
## 19 Finland 2276
## 20 Austria 2212
medals <- olympics |> filter(team %in% c(top20$team)) |> group_by(team, medal) |> filter(medal != "NA") |> summarise(count = n())
## `summarise()` has grouped output by 'team'. You can override using the
## `.groups` argument.
medals
## # A tibble: 60 × 3
## # Groups: team [20]
## team medal count
## <chr> <chr> <int>
## 1 Australia Bronze 511
## 2 Australia Gold 342
## 3 Australia Silver 453
## 4 Austria Bronze 150
## 5 Austria Gold 95
## 6 Austria Silver 168
## 7 Canada Bronze 408
## 8 Canada Gold 422
## 9 Canada Silver 413
## 10 China Bronze 268
## # ℹ 50 more rows
p1 <- medals |> ggplot(aes(x = team, y = count, fill = medal)) +
geom_col() +
# make bar graph horizontal
coord_flip() +
# associate medal types with their corresponding colors
scale_fill_manual(values=c("gold2", "#B4B4B4", "#AD8A56")) +
labs(
x = "Team",
y = "Number of Medals",
title = "Distribution of Olympic Medals of \nTop Participating Teams (1896-2016)",
fill = "Medal Type"
) +
theme_minimal()
#use `ggplotly()` to add interactivty
p1 <- ggplotly(p1)
# use `layout()` to create a caption and adjust its position on the plot
p1 <- p1 |> layout(annotations = list(
text = "Source: Sports Reference",
x = 1,
y = -0.105,
xref = "paper",
yref = "paper",
showarrow = FALSE,
font = list(size = 10)
))
p1
# Modify the factor levels of the `medal` variable so that the medal types are organized in descending order in the legend.
medals$medal <- factor(medals$medal, levels = c("Gold", "Silver", "Bronze"))
levels(medals$medal)
## [1] "Gold" "Silver" "Bronze"
# Do the same with the `team` variable so the countries are arranged in ascending order on the x-axis (`coord_flip()` will show the countries in descending order)
medals$team <- factor(medals$team, levels = c(top20$team))
medals$team <- factor(medals$team, levels = rev(levels(medals$team)))
levels(medals$team)
## [1] "Austria" "Finland" "Russia" "South Korea"
## [5] "Spain" "China" "Switzerland" "Hungary"
## [9] "Soviet Union" "Netherlands" "Poland" "Sweden"
## [13] "Australia" "Japan" "Canada" "Germany"
## [17] "Italy" "Great Britain" "France" "United States"
p1 <- medals |> ggplot(aes(x = team, y = count, fill = medal)) +
geom_col() +
# make bar graph horizontal
coord_flip() +
# associate medal types with their corresponding colors
scale_fill_manual(values=c("gold2", "#B4B4B4", "#AD8A56")) +
labs(
x = "Team",
y = "Number of Medals",
title = "Distribution of Olympic Medals of \nTop Participating Teams (1896-2016)",
fill = "Medal Type"
) +
theme_minimal()
# use `ggplotly()` to add interactivty
p1 <- ggplotly(p1)
# use `layout()` to create a caption and adjust its font size and position on the plot
p1 <- p1 |> layout(annotations = list(
text = "Source: Sports Reference",
x = 1,
y = -0.105,
xref = "paper",
yref = "paper",
showarrow = FALSE,
font = list(size = 10)
))
p1
par_med <- olympics |> filter(team %in% c(top20$team)) |> group_by(team) |> summarise(participants = n_distinct(id), medals = sum(!is.na(medal)))
par_med
## # A tibble: 20 × 3
## team participants medals
## <chr> <int> <int>
## 1 Australia 3768 1306
## 2 Austria 2212 413
## 3 Canada 4547 1243
## 4 China 2599 901
## 5 Finland 2276 876
## 6 France 5784 1550
## 7 Germany 4579 1984
## 8 Great Britain 5764 1673
## 9 Hungary 2726 1127
## 10 Italy 4698 1527
## 11 Japan 4012 911
## 12 Netherlands 2864 988
## 13 Poland 2938 563
## 14 Russia 2324 1110
## 15 South Korea 2353 592
## 16 Soviet Union 2864 2451
## 17 Spain 2597 483
## 18 Sweden 3618 1434
## 19 Switzerland 2722 588
## 20 United States 9115 5219
p2 <- par_med |> ggplot(aes(x = participants, y = medals)) +
geom_point() +
# add a linear regression line
geom_smooth(method = "lm", se = FALSE, color = "red") +
labs(
x = "Total Number of Participants",
y = "Total Number of Medals",
title = "Olympic Participation-Success Correlation",
) +
theme_minimal()
# use `ggplotly() to add interactivity
p2 <- ggplotly(p2)
## `geom_smooth()` using formula = 'y ~ x'
# use `layout()` to create a caption and adjust its font size and position on the plot
p2<- p2 |> layout(annotations = list(
text = "Source: Sports Reference",
x = 1,
y = -0.105,
xref = "paper",
yref = "paper",
showarrow = FALSE,
font = list(size = 10)
))
p2
team variable in the hover information for
each pointSince ggplotly() would remove the regression line when
adding a text aesthetic to the points, I used
plot_ly() and add_trace() to remake the
scatter plot with the regression line intact along with the edited hover
information for each point.
p2.5 <- plot_ly() |>
add_trace(data = par_med, x = ~participants, y = ~medals, text = ~paste(
"Participants: ", participants,
"\nMedals: ", medals,
"\nTeam: ", team),
mode = "markers",
type = "scatter",
marker = list(color = 'black') # keep all the points black
) |>
add_trace(data = par_med, x = ~participants, y = ~predict(lm(medals ~ participants)),
mode = "lines",
line = list(color = 'red')) |>
layout(title = "Olympic Participation-Success Correlation",
xaxis = list(title = "Total Participants"),
yaxis = list(title = "Total Medals"),
showlegend = FALSE) # remove legend
p2.5 <- p2.5 |> layout(annotations = list(
text = "Source: Sports Reference",
x = 1,
y = -0.105,
xref = "paper",
yref = "paper",
showarrow = FALSE,
font = list(size = 10)
))
p2.5
## No trace type specified:
## Based on info supplied, a 'scatter' trace seems appropriate.
## Read more about this trace type -> https://plotly.com/r/reference/#scatter
cor(par_med$participants, par_med$medals)
## [1] 0.8349943
summary(lm(medals ~ participants, data = par_med))
##
## Call:
## lm(formula = medals ~ participants, data = par_med)
##
## Residuals:
## Min 1Q Median 3Q Max
## -866.13 -340.41 -57.69 212.43 1546.01
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -577.16138 327.13605 -1.764 0.0946 .
## participants 0.51751 0.08038 6.438 4.65e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 594.9 on 18 degrees of freedom
## Multiple R-squared: 0.6972, Adjusted R-squared: 0.6804
## F-statistic: 41.45 on 1 and 18 DF, p-value: 4.654e-06
p1
p2.5
## No trace type specified:
## Based on info supplied, a 'scatter' trace seems appropriate.
## Read more about this trace type -> https://plotly.com/r/reference/#scatter
The first visualization depicts the distribution of gold, silver, and bronze medals among the top 20 participating teams in the Olympics from 1896 to 2016. Based on the bar chart, the US not only won the most medals of each type, but also had the most athletes participating in the Olympics during this time frame, based on the descending arrangement of the teams on the plot by total number of distinct participants. I wanted to further explore the correlation between participation and success in the Olympics, so I created a scatter plot in which the x-axis represented the total number distinct participants of a team from 1896 to 2016 and the y-axis represented the total number of medals won by the team in that time frame. The correlation coefficient of the regression line was calculated to be approximately 0.83, which shows there is a strong positive correlation between participation and medals won. In the summary statistics for the regression model, the 3 asterisks next to participants suggest it is a meaningful variable to explain the linear increase in total medals won, and the adjusted r-squared value of about 0.68 means 68% of the variation in the data is likely to be explained by the model. Thus, the US’s large teams over the years could have played a factor in their historical domination in the Olympics. However, this is not the case for all teams; for example, during the 1896-2016 period, the Soviet Union (USSR) team had the 11th most total participants but was second in total medals won, only behind the United States. This may have been due to the fact that the Soviet Union only competed in the Olympics from 1952 to 1988. If I were to use this dataset for further research, I would want to explore the success of host cities’ teams in the years they hosted the Olympics.