Introduction

The dataset used for this project contains information on the modern Olympics, including all Games from 1896 to 2016. The variables included in the dataset are the ID number for each athlete, name, sex, age, height (cm), weight (kg), team (country represented), NOC (National Olympic Committee 3-letter code), Games (year and season), year, season (Summer or Winter), host city, sport, event, and medal (Gold, Silver, Bronze, or NA). The data is sourced from Sports Reference (http://www.sports-reference.com/). The aim of this project was to illustrate the distribution of the different types of medals between the top participating countries in the time period of the data collection, in addition to further exploring the correlation between participation and success in the Olympics. The only cleaning I did in this project was setting all variable names to lowercase. The NAs in this dataset were either in the medal column, which meant the athlete did not win a medal in that year’s event, or in the age, height, and/or weight columns, which meant that the athlete was a posthumous competitor (e.g., artwork from dead artists can be submitted into the Olympic Art competitions) or those measurements were not recorded during the time of the event. Since this project is focused on team participation and medals won, the NAs in the dataset did not affect the results of the project. I chose this topic and dataset because I love playing and watching sports and the dataset is very rich in its content as it has 120 years of data on the Olympics.

Background Research

The Olympic Games are an international sporting event that occurs every four years, dating back ancient Greece in 776 B.C. The modern era of the Games started in 1896 and persisted despite interruptions during World Wars I and II, maintaining numerical designations for all years of the Games (“The Olympic Games”). The iconic symbol of interlocking colored rings representing continents was introduced at the Antwerp Games in 1920. The Olympics gained global recognition after the 1924 Games in Paris, which featured 3,000 athletes, including more than 100 women, from 44 participating nations, marking the introduction of a closing ceremony. The Winter Olympics made their debut that same year, showcasing events such as figure skating, ice hockey, and bobsledding. While the Olympics have faced challenges like political boycotts and controversies, they persist as a globally celebrated event, transcending borders and promoting ideals of peace and cooperation through sports.

Load libraries and read in the data

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(plotly)
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout
olympics <- read_csv("athlete_events.csv")
## Rows: 271116 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): Name, Sex, Team, NOC, Games, Season, City, Sport, Event, Medal
## dbl  (5): ID, Age, Height, Weight, Year
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Clean variable names

names(olympics) <- tolower(names(olympics)) #set all column names to lowercase
head(olympics)
## # A tibble: 6 × 15
##      id name      sex     age height weight team  noc   games  year season city 
##   <dbl> <chr>     <chr> <dbl>  <dbl>  <dbl> <chr> <chr> <chr> <dbl> <chr>  <chr>
## 1     1 A Dijiang M        24    180     80 China CHN   1992…  1992 Summer Barc…
## 2     2 A Lamusi  M        23    170     60 China CHN   2012…  2012 Summer Lond…
## 3     3 Gunnar N… M        24     NA     NA Denm… DEN   1920…  1920 Summer Antw…
## 4     4 Edgar Li… M        34     NA     NA Denm… DEN   1900…  1900 Summer Paris
## 5     5 Christin… F        21    185     82 Neth… NED   1988…  1988 Winter Calg…
## 6     5 Christin… F        21    185     82 Neth… NED   1988…  1988 Winter Calg…
## # ℹ 3 more variables: sport <chr>, event <chr>, medal <chr>

Arrange countries in descending order of total number of participants, then show the top 20

top20 <- olympics |> group_by(team) |> summarise(count = n_distinct(id)) |> arrange(desc(count)) |> head(20)
top20
## # A tibble: 20 × 2
##    team          count
##    <chr>         <int>
##  1 United States  9115
##  2 France         5784
##  3 Great Britain  5764
##  4 Italy          4698
##  5 Germany        4579
##  6 Canada         4547
##  7 Japan          4012
##  8 Australia      3768
##  9 Sweden         3618
## 10 Poland         2938
## 11 Netherlands    2864
## 12 Soviet Union   2864
## 13 Hungary        2726
## 14 Switzerland    2722
## 15 China          2599
## 16 Spain          2597
## 17 South Korea    2353
## 18 Russia         2324
## 19 Finland        2276
## 20 Austria        2212

Summarize each country’s counts of each type medal

medals <- olympics |> filter(team %in% c(top20$team)) |> group_by(team, medal) |> filter(medal != "NA") |> summarise(count = n())
## `summarise()` has grouped output by 'team'. You can override using the
## `.groups` argument.
medals
## # A tibble: 60 × 3
## # Groups:   team [20]
##    team      medal  count
##    <chr>     <chr>  <int>
##  1 Australia Bronze   511
##  2 Australia Gold     342
##  3 Australia Silver   453
##  4 Austria   Bronze   150
##  5 Austria   Gold      95
##  6 Austria   Silver   168
##  7 Canada    Bronze   408
##  8 Canada    Gold     422
##  9 Canada    Silver   413
## 10 China     Bronze   268
## # ℹ 50 more rows

Make a horizontal stacked bar graph with the countries (“Teams”) on the x-axis (positioned vertically) and the counts of medals on the y-axis(positioned horizontally), color-coding the stacked bars by medal types

p1 <- medals |> ggplot(aes(x = team, y = count, fill = medal)) +
  geom_col() +
  # make bar graph horizontal
  coord_flip() + 
  # associate medal types with their corresponding colors
  scale_fill_manual(values=c("gold2", "#B4B4B4", "#AD8A56")) +
  labs(
    x = "Team",
    y = "Number of Medals",
    title = "Distribution of Olympic Medals of \nTop Participating Teams (1896-2016)",
    fill = "Medal Type"
  ) +
  theme_minimal()
#use `ggplotly()` to add interactivty
p1 <- ggplotly(p1) 
# use `layout()` to create a caption and adjust its position on the plot
p1 <- p1 |> layout(annotations = list(
    text = "Source: Sports Reference",
    x = 1,
    y = -0.105,
    xref = "paper",
    yref = "paper",
    showarrow = FALSE,
    font = list(size = 10)
  ))
p1
# Modify the factor levels of the `medal` variable so that the medal types are organized in descending order in the legend.  

medals$medal <- factor(medals$medal, levels = c("Gold", "Silver", "Bronze"))
levels(medals$medal)
## [1] "Gold"   "Silver" "Bronze"
# Do the same with the `team` variable so the countries are arranged in ascending order on the x-axis (`coord_flip()` will show the countries in descending order)

medals$team <- factor(medals$team, levels = c(top20$team))
medals$team <- factor(medals$team, levels = rev(levels(medals$team)))
levels(medals$team)
##  [1] "Austria"       "Finland"       "Russia"        "South Korea"  
##  [5] "Spain"         "China"         "Switzerland"   "Hungary"      
##  [9] "Soviet Union"  "Netherlands"   "Poland"        "Sweden"       
## [13] "Australia"     "Japan"         "Canada"        "Germany"      
## [17] "Italy"         "Great Britain" "France"        "United States"

Re-plot

p1 <- medals |> ggplot(aes(x = team, y = count, fill = medal)) +
  geom_col() +
  # make bar graph horizontal
  coord_flip() + 
  # associate medal types with their corresponding colors
  scale_fill_manual(values=c("gold2", "#B4B4B4", "#AD8A56")) +
  labs(
    x = "Team",
    y = "Number of Medals",
    title = "Distribution of Olympic Medals of \nTop Participating Teams (1896-2016)",
    fill = "Medal Type"
  ) +
  theme_minimal()
# use `ggplotly()` to add interactivty
p1 <- ggplotly(p1) 
# use `layout()` to create a caption and adjust its font size and position on the plot
p1 <- p1 |> layout(annotations = list(
    text = "Source: Sports Reference",
    x = 1,
    y = -0.105,
    xref = "paper",
    yref = "paper",
    showarrow = FALSE,
    font = list(size = 10)
  ))
p1

Summarise each country’s total participants and medals

par_med <- olympics |> filter(team %in% c(top20$team)) |> group_by(team) |> summarise(participants = n_distinct(id), medals = sum(!is.na(medal)))
par_med
## # A tibble: 20 × 3
##    team          participants medals
##    <chr>                <int>  <int>
##  1 Australia             3768   1306
##  2 Austria               2212    413
##  3 Canada                4547   1243
##  4 China                 2599    901
##  5 Finland               2276    876
##  6 France                5784   1550
##  7 Germany               4579   1984
##  8 Great Britain         5764   1673
##  9 Hungary               2726   1127
## 10 Italy                 4698   1527
## 11 Japan                 4012    911
## 12 Netherlands           2864    988
## 13 Poland                2938    563
## 14 Russia                2324   1110
## 15 South Korea           2353    592
## 16 Soviet Union          2864   2451
## 17 Spain                 2597    483
## 18 Sweden                3618   1434
## 19 Switzerland           2722    588
## 20 United States         9115   5219

Create a scatter plot showing the correlation between the total number of participants (x-axis) and the total number of medals won (y-axis) for each country from the last plot.

p2 <- par_med |> ggplot(aes(x = participants, y = medals)) +
  geom_point() +
  # add a linear regression line
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(
    x = "Total Number of Participants",
    y = "Total Number of Medals",
    title = "Olympic Participation-Success Correlation",
  ) + 
  theme_minimal()
# use `ggplotly() to add interactivity
p2 <- ggplotly(p2)
## `geom_smooth()` using formula = 'y ~ x'
# use `layout()` to create a caption and adjust its font size and position on the plot
p2<- p2 |> layout(annotations = list(
    text = "Source: Sports Reference",
    x = 1,
    y = -0.105,
    xref = "paper",
    yref = "paper",
    showarrow = FALSE,
    font = list(size = 10)
  ))
p2

Include the team variable in the hover information for each point

Since ggplotly() would remove the regression line when adding a text aesthetic to the points, I used plot_ly() and add_trace() to remake the scatter plot with the regression line intact along with the edited hover information for each point.

p2.5 <- plot_ly() |>
  add_trace(data = par_med, x = ~participants, y = ~medals, text = ~paste(
    "Participants: ", participants,
    "\nMedals: ", medals,
    "\nTeam: ", team),
    mode = "markers",
    type = "scatter",
    marker = list(color = 'black') # keep all the points black
    ) |>
  add_trace(data = par_med, x = ~participants, y = ~predict(lm(medals ~ participants)),
            mode = "lines",
            line = list(color = 'red')) |>
  layout(title = "Olympic Participation-Success Correlation",
         xaxis = list(title = "Total Participants"),
         yaxis = list(title = "Total Medals"),
         showlegend = FALSE) # remove legend
p2.5 <- p2.5 |> layout(annotations = list(
    text = "Source: Sports Reference",
    x = 1,
    y = -0.105,
    xref = "paper",
    yref = "paper",
    showarrow = FALSE,
    font = list(size = 10)
  ))
p2.5
## No trace type specified:
##   Based on info supplied, a 'scatter' trace seems appropriate.
##   Read more about this trace type -> https://plotly.com/r/reference/#scatter

Calculate the correlation coefficient of the regression line

cor(par_med$participants, par_med$medals)
## [1] 0.8349943

Show the summary statistics for the linear regression model

summary(lm(medals ~ participants, data = par_med))
## 
## Call:
## lm(formula = medals ~ participants, data = par_med)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -866.13 -340.41  -57.69  212.43 1546.01 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -577.16138  327.13605  -1.764   0.0946 .  
## participants    0.51751    0.08038   6.438 4.65e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 594.9 on 18 degrees of freedom
## Multiple R-squared:  0.6972, Adjusted R-squared:  0.6804 
## F-statistic: 41.45 on 1 and 18 DF,  p-value: 4.654e-06

Final Visualizations

p1
p2.5
## No trace type specified:
##   Based on info supplied, a 'scatter' trace seems appropriate.
##   Read more about this trace type -> https://plotly.com/r/reference/#scatter

Conclusion

The first visualization depicts the distribution of gold, silver, and bronze medals among the top 20 participating teams in the Olympics from 1896 to 2016. Based on the bar chart, the US not only won the most medals of each type, but also had the most athletes participating in the Olympics during this time frame, based on the descending arrangement of the teams on the plot by total number of distinct participants. I wanted to further explore the correlation between participation and success in the Olympics, so I created a scatter plot in which the x-axis represented the total number distinct participants of a team from 1896 to 2016 and the y-axis represented the total number of medals won by the team in that time frame. The correlation coefficient of the regression line was calculated to be approximately 0.83, which shows there is a strong positive correlation between participation and medals won. In the summary statistics for the regression model, the 3 asterisks next to participants suggest it is a meaningful variable to explain the linear increase in total medals won, and the adjusted r-squared value of about 0.68 means 68% of the variation in the data is likely to be explained by the model. Thus, the US’s large teams over the years could have played a factor in their historical domination in the Olympics. However, this is not the case for all teams; for example, during the 1896-2016 period, the Soviet Union (USSR) team had the 11th most total participants but was second in total medals won, only behind the United States. This may have been due to the fact that the Soviet Union only competed in the Olympics from 1952 to 1988. If I were to use this dataset for further research, I would want to explore the success of host cities’ teams in the years they hosted the Olympics.

Bibliography

“The Olympic Games: Locations, Facts, Ancient & Modern.” History.Com, A&E Television Networks, www.history.com/topics/sports/olympic-games#the- olympics-through-the-years. Accessed 14 Nov. 2023.