Project 1 Sophia Duran

This is the dataset you will be working with:

olympics <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-07-27/olympics.csv')



olympics_alpine <- olympics %>% 
  filter(!is.na(weight)) %>%             # only keep athletes with known weight
  filter(sport == "Alpine Skiing") %>%   # keep only alpine skiers
  mutate(
    medalist = case_when(                # add column to 
      is.na(medal) ~ FALSE,              # NA values go to FALSE
      !is.na(medal) ~ TRUE               # non-NA values (Gold, Silver, Bronze) go to TRUE
    )
  )

olympics_alpine is a subset of olympics and contains only the data for alpine skiers. More information about the original olympics dataset can be found at https://github.com/rfordatascience/tidytuesday/tree/master/data/2021/2021-07-27/readme.md and https://www.sports-reference.com/olympics.html.

For this project, use olympics_alpine to answer the questions 1, 2 and 3, about the weights of alpine skiers. For Question 4, you should use the FULL dataset, olympics

Are there weight differences for male and female Olympic skiers who were successful or not in earning a medal?
Are there weight differences for skiers who competed in different alpine skiing events?
How has the weight distribution of alpine skiers changed over the years?
Create ANY interesting animation from the full olympics dataset.

You should make one plot per question.

Hints:

We recommend you use a violin plot for question 1 and boxplots for questions 2 and 3. However, you are free to use any of the plots we have discussed in class.
For question 3, it may be helpful to consider only a subset of alpine skiers, such as those who competed in a specific event.
To make a series of boxplots over time, you will have to add the following to your aes() statement: group = year.
It can be a bit tricky to re-label facets generated with facet_wrap(). The trick is to add a labeller argument, for example:

 + facet_wrap(
    # your other arguments to facet_wrap() go here
    ...,
    # this replaces "TRUE" with "medaled" and "FALSE" with "did not medal"
    labeller = as_labeller(c(`TRUE` = "medaled", `FALSE` = "did not medal"))
  )

Introduction:

This project uses the Olympic Games dataset, which includes athlete records from 1896 to 2016 with variables such as name, sex, age, weight, height, team, year, and medal outcome. For the first three questions, I focused on Alpine Skiing, exploring weight patterns among male and female athletes and how these differences appeared across events and years.

In the final question, I expanded the analysis to all sports, visualizing how total medals accumulated by country over time. While the earlier plots relied on simpler static visualizations, the animation required layering cumulative counts, transitions, and formatting to show meaningful motion. It was especially interesting to see how my skills and visual choices evolved over time—last year, my graphs would have been plain black-and-white bar or scatter plots. Now, the first three questions show cleaner, more deliberate visual design, and the final animated graph represents the level of creativity and technical control I’ve reached in R (and am still experimenting with) through this class.

Key variables include sex, weight, event, year, team, and medal. I created a new logical variable, medalist, to distinguish those who won medals from those who did not. For the first three questions, I subset to Alpine Skiing athletes, and for Question 4, I used the entire dataset to analyze medal accumulation across all sports.

Approach: I began this project by following the visualization guidelines provided in the project instructions and referring back to class slides for examples—especially for violin and box plots. These helped me shape the early stages of my analysis, where I focused on the weights of Alpine Skiers and compared trends between men and women. I experimented with geom_boxplot() and geom_violin() to highlight differences in distribution and used dplyr functions to filter, group, and summarize the data. Throughout the process, I also played around with different themes and layouts to find one that made the plots feel both clear and polished.

Each visualization was selected for a specific reason: - Violin plots (Q1): Ideal for comparing full weight distributions and shapes (density, tails) between medalists and non-medalists.
- Boxplots (Q2): Show medians and variation across different events, allowing for quick comparison by sex.
- Temporal boxplots (Q3): Emphasize how the weight range shifts over time.
- Animated cumulative bars (Q4): Visually convey growth and rank changes of countries’ total medals through time.

At the start of the course, I was comfortable making simple black-and-white graphs with minimal labels, but as I progressed, I began to understand how much presentation affects interpretation. I started prioritizing axis titles, legends, color schemes, and readability—details that make a visualization look intentional rather than basic. Those design elements became a key part of my workflow as I moved from just displaying data to communicating it effectively.

For the final question, I revisited the lecture slides on animation and applied the same design principles to the dynamic Olympics dataset. I aggregated medal totals by country and year, grouped them into multi-Olympic intervals, and used gganimate to show how the top ten countries’ medal counts grew over time. I combined cumulative totals with smoother transitions, consistent color palettes, and formatted labels to build a visual story that evolves naturally—showing both growth and hierarchy in a way that static charts can’t.

Analysis:

ggplot(olympics_alpine, aes(x = sex, y = weight, fill = sex)) +
  geom_violin(trim = FALSE, alpha = 0.7) +
  facet_wrap(~medalist, labeller = as_labeller(c(`TRUE` = "Medaled", `FALSE` = "Did Not Medal"))) +
  scale_fill_brewer(palette = "Set2") +
  labs(
    title = "Weight Distribution of Alpine Skiers by Sex and Medal Status",
    x = "Sex",
    y = "Weight (kg)"
  ) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none",
        plot.title = element_text(face = "bold", hjust = 0.5))

ggplot(olympics_alpine, aes(x = event, y = weight, fill = sex)) +
  geom_boxplot(outlier.alpha = 0.3) +
  scale_fill_brewer(palette = "Set2") +
  labs(
    title = "Weight Differences Among Alpine Skiing Events",
    x = "Event",
    y = "Weight (kg)",
    fill = "Sex"
  ) +
  theme_minimal(base_size = 12) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        plot.title = element_text(face = "bold", hjust = 0.5))

alpine_downhill <- olympics_alpine %>%
  filter(event == "Alpine Skiing Men's Downhill" | event == "Alpine Skiing Women's Downhill")

ggplot(alpine_downhill, aes(x = year, y = weight, group = year, fill = sex)) +
  geom_boxplot(outlier.alpha = 0.2) +
  scale_fill_brewer(palette = "Set1") +
  labs(
    title = "Weight Distribution of Alpine Downhill Skiers Over Time",
    x = "Olympic Year",
    y = "Weight (kg)",
    fill = "Sex"
  ) +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold", hjust = 0.5))

## Warning: The following aesthetics were dropped during statistical transformation: fill.
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?

medals <- olympics %>%
  filter(!is.na(medal)) %>%                                
  distinct(year, team, event, medal) %>%                   
  count(year, team, name = "medals") %>%                   
  arrange(year)

library(tidyverse)
library(gganimate)  

stopifnot(all(c("year","team","medals") %in% names(medals)))

oly_years <- medals %>% distinct(year) %>% arrange(year) %>% pull(year)

cum_tbl <- medals %>%
  mutate(medals = replace_na(medals, 0)) %>%
  complete(team, year = sort(unique(year)), fill = list(medals = 0)) %>%
  arrange(team, year) %>%
  group_by(team) %>%
  mutate(cum_medals = cumsum(medals)) %>%
  ungroup() %>%
  filter(year %in% oly_years)

top10 <- cum_tbl %>%
  group_by(team) %>%
  summarise(final_total = max(cum_medals), .groups = "drop") %>%
  slice_max(final_total, n = 10, with_ties = FALSE) %>%
  pull(team)

race <- cum_tbl %>%
  filter(team %in% top10) %>%
  complete(year = oly_years, team = top10, fill = list(medals = 0)) %>%
  arrange(team, year) %>%
  group_by(team) %>%
  mutate(cum_medals = cummax(replace_na(cum_medals, 0))) %>%
  ungroup() %>%
  group_by(year) %>%
  mutate(rank = min_rank(desc(cum_medals))) %>%
  ungroup()


max_y <- max(race$cum_medals)


p <- ggplot(race, aes(x = rank, y = cum_medals, group = team, fill = team)) +
  geom_col(width = 0.8, show.legend = FALSE) +
  geom_text(aes(x = rank, y = 0, label = team),
            hjust = 1, size = 4.5) +
  geom_text(aes(label = round(cum_medals, 0)),
          hjust = 0, nudge_y = max_y * 0.01, size = 4)+
  coord_flip(clip = "off") +                # <- horizontal bars
  scale_x_reverse() +                       # rank 1 at the top
  coord_cartesian(ylim = c(0, max_y * 1.15)) +
  labs(
    title = "Top 10 Countries — Cumulative Olympic Medals",
    subtitle = "Year: {closest_state}",
    x = NULL, y = NULL
  ) +
  theme_minimal(base_size = 14) +
  theme(
    axis.text = element_blank(),
    axis.ticks = element_blank(),
    plot.margin = margin(10, 80, 10, 120)
  ) +
  transition_states(year, transition_length = 6, state_length = 1) +
  ease_aes("cubic-in-out")

animate(p, fps = 10, duration = 20, width = 900, height = 600)

anim_save("Sophia Duran Project1.gif")

Discussion: For Question 1, the violin plots comparing the weight distributions of medalists versus non-medalists show similar overall ranges but subtle differences in shape. The non-medalist violins have more visible ridges and uneven peaks, suggesting a wider variety of body types and less consistency in optimal weight. In contrast, the medalists’ distributions appear smoother and more centered, implying that successful skiers tend to fall within a more specific and balanced weight range—possibly reflecting physical traits that are advantageous in performance, control, and speed.

In Question 2, examining weights across different Alpine Skiing events revealed small but noticeable shifts. While all events share similar average weights, the “Combined” event shows slightly heavier athletes, which could reflect the need for more strength and endurance to handle multiple disciplines. Meanwhile, “Giant Slalom” skiers appear just a bit lighter on average, perhaps benefiting from greater agility and precision through tight turns. These subtle differences, though minimal, highlight how each event may favor slightly different physical builds.

For Question 3, the boxplots over time suggest a gradual increase in the average weight of alpine skiers, with a noticeable spike in the late 1990s. This could relate to advances in equipment technology and training methods that shifted the ideal athlete profile toward stronger, more powerful builds. In more recent years, the trend seems to have stabilized, implying that the sport may have reached a balance between speed, strength, and control.

Finally, Question 4’s animated bar chart vividly demonstrates the cumulative medal dominance of countries such as the United States, followed by other top nations like Russia and Germany. Seeing the bars grow over time highlights both the historical consistency of strong Olympic programs and how emerging countries occasionally rise in the rankings. The small decimal values in the data stem from how R handles numeric types—tiny rounding differences from summing and cumulative calculations, even though the underlying medal counts are whole numbers. Overall, this visualization ties together the storytelling aspect of the project, using motion and accumulation to represent Olympic history in an engaging and data-driven way.

Also, While the analysis captures clear patterns, medal counts were grouped by team rather than individual athletes, and future work could separate Summer and Winter Games or normalize totals by number of events to compare across eras.