Shalini Das, sd28397
This is the dataset you will be working with:
library(tidyverse) # provides %>%, the dplyr verbs, ggplot2 and forcats used below
olympics <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-07-27/olympics.csv')
olympics_alpine <- olympics %>%
  filter(!is.na(weight)) %>% # only keep athletes with known weight
  filter(sport == "Alpine Skiing") %>% # keep only alpine skiers
  mutate(
    medalist = case_when( # add column flagging medal winners
      is.na(medal) ~ FALSE, # NA values go to FALSE
      !is.na(medal) ~ TRUE # non-NA values (Gold, Silver, Bronze) go to TRUE
    )
  )
olympics_alpine is a subset of olympics and
contains only the data for alpine skiers. More information about the
original olympics dataset can be found at https://github.com/rfordatascience/tidytuesday/tree/master/data/2021/2021-07-27/readme.md
and https://www.sports-reference.com/olympics.html.
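As a quick check on the preprocessing above, the case_when() step reduces to a single logical test. The sketch below uses a small made-up vector (toy_medal is hypothetical and not taken from the dataset) and base R only.

```r
# Hypothetical medal values; NA means the athlete did not win a medal.
toy_medal <- c("Gold", NA, "Bronze", NA, "Silver")

# Equivalent of the case_when() step: TRUE for any non-missing medal.
toy_medalist <- !is.na(toy_medal)

print(toy_medalist)
```

Either form yields one logical value per row, which is what the medalist column stores.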
For this project, use olympics_alpine to answer the
following questions about the weights of alpine skiers:
Introduction: This report takes a closer look at a
subset of the olympics dataset, namely Alpine
Skiing (a single Winter Olympics sport). The
olympics_alpine data frame contains 6350 rows, one for each
entry of an athlete into an Alpine Skiing event between 1936 and 2014
for which the athlete’s weight at the time of the games was recorded.
The data frame has 16 columns,
which provide information about (1) the Olympic athlete, (2) their
associated sporting event(s) and (3) how they performed at the games.
Information about the Olympic athlete includes their name, age, ID
number, assigned sex, height, weight and nationality/3-letter team code.
Information about the associated sporting event(s) includes the
season, year, city and official name of the event for each
competition entry. Finally, information about performance at the
games includes whether or not the athlete medaled and, if so, which medal
was won (gold, silver or bronze).
The aim of this report is to give comprehensive answers to the
three questions posed about Olympic-level Alpine Skiers and weight. To
accomplish this, it is essential to narrow the
focus of the dataset to a select few variables. The five
variables needed to
fully answer the questions are the athlete’s weight (column
weight), assigned sex at birth (column sex),
official competition event (column event), medal status
(column medalist) and competing year
at the games (column year). The athlete’s weight is
provided as a numeric value, in kilograms. The assigned sex at birth of
the athlete is provided as one of two character values, ‘F’ or ‘M’, where
‘F’ means that the athlete competed in Women’s sporting
events and ‘M’ means that the athlete competed in Men’s
sporting events. Within the Alpine Skiing sport at the Olympics there
are five types of events, each held separately for men and women;
therefore, the official
competition event takes one of ten different character
values. Whether or not the athlete medaled is encoded as
TRUE/FALSE, where TRUE represents the most successful
Olympians (medalists) and FALSE represents non-medalists. The competing
year at the games is provided as a numeric value, starting in 1936 and
ending in 2014.
Approach: To show weight differences between Olympic qualifiers and medalists in Alpine Skiing, I have chosen to use violin plots (geom_violin()). To avoid skewing the comparison, male and female athletes are visualized separately, since female athletes weigh approximately 20 kg less than male athletes on average. Violins allow easy side-by-side comparisons between medalists and non-medalists (all other qualifiers who did not earn a medal) within each sex.
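The comparison the violins are meant to support can be sketched numerically as a grouped summary. The data frame below is a made-up stand-in (toy_q1 and its values are hypothetical, not drawn from olympics_alpine), and the sketch uses base R only.

```r
# Hypothetical weights in kg; values are illustrative only.
toy_q1 <- data.frame(
  sex      = c("F", "F", "F", "F", "M", "M", "M", "M"),
  medalist = c(FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, TRUE),
  weight   = c(58, 62, 63, 65, 78, 82, 84, 88)
)

# Median weight for each combination of sex (rows) and medal status (columns).
median_by_group <- with(
  toy_q1,
  tapply(weight, list(sex = sex, medalist = medalist), median)
)

print(median_by_group)
```

Reading across a row compares non-medalists with medalists of the same sex, which mirrors the side-by-side layout of the faceted violins.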
For question two, boxplots (geom_boxplot()) will be used to show the weight distribution of athletes by event category. Since the dataset contains relatively few events of interest, all boxplots are laid out in a single horizontal chart so that they can be compared in an organized and effective manner. The boxplots are colored and filled by sex to make event categories of the same name easier to tell apart. To answer the question more completely, the boxplots are also reordered by weight, from the event with the greatest median weight to the event with the lowest (fct_reorder() summarizes by the median unless told otherwise).
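The choice of summary statistic can change the ordering of the boxplots. The sketch below uses base R's reorder() on made-up weights (toy_events is hypothetical) to show an ordering by median next to an ordering by mean; it is an illustration of the idea, not the code used for the figure.

```r
# Hypothetical weights in kg for three event categories.
toy_events <- data.frame(
  event  = rep(c("Slalom", "Downhill", "Super G"), each = 3),
  weight = c(70, 72, 74,   # Slalom:   median 72, mean 72
             80, 83, 95,   # Downhill: median 83, mean 86
             78, 79, 104)  # Super G:  median 79, mean 87
)

# reorder() sorts factor levels by a summary of a second variable.
by_median <- reorder(factor(toy_events$event), toy_events$weight, FUN = median)
by_mean   <- reorder(factor(toy_events$event), toy_events$weight, FUN = mean)

print(levels(by_median))  # lowest to highest median
print(levels(by_mean))    # lowest to highest mean
```

In this toy example a single heavy Super G skier lifts that event's mean above Downhill while its median stays below, so the two orderings disagree.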
To show how the weight distribution of alpine skiers has changed over the years, I will build on the results from part two. The event with the widest distribution and/or highest median weight will be selected for a closer look, as it may be most representative of the entire dataset. For continuity’s sake, this plot will also use geom_boxplot() and be separated by the assigned sex of the athletes at the Olympic games. In addition, a time series graph (geom_line()) of athlete weights grouped by event and sex is well suited to showing changes over time.
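A numeric counterpart to the time series idea is a per-year summary. The sketch below computes the median weight per games on made-up data (toy_years is hypothetical) using base R only.

```r
# Hypothetical weights in kg at three games.
toy_years <- data.frame(
  year   = rep(c(1948, 1980, 2014), each = 3),
  weight = c(68, 70, 75,
             74, 78, 80,
             82, 85, 90)
)

# Median weight per year, then the change from one games to the next.
yearly_median <- tapply(toy_years$weight, toy_years$year, median)
change <- diff(yearly_median)

print(yearly_median)
print(change)
```

Plotting such a summary against year gives one line per group rather than one point per athlete.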
Analysis:
Question 1: Are there weight differences for male and female Olympic skiers who were successful or not in earning a medal?
library(RColorBrewer)
library(gganimate)
## Warning: package 'gganimate' was built under R version 4.3.2
## No renderer backend detected. gganimate will default to writing frames to separate files
## Consider installing:
## - the `gifski` package for gif output
## - the `av` package for video output
## and restarting the R session
library(stringr)
library(dplyr)
library(ggiraph)
library(ggrepel)
# Attach per-group summary statistics (by sex and medal status) to every row
# so they can be shown in the interactive tooltips.
weight_stats_by_sex_and_medalist <- olympics_alpine %>%
  group_by(sex, medalist) %>%
  mutate(max = max(weight),
         q3 = quantile(weight, 0.75),
         mean = mean(weight),
         q1 = quantile(weight, 0.25),
         min = min(weight))
weight_by_sex_and_medalist <- weight_stats_by_sex_and_medalist %>%
  ggplot(aes(x = medalist, y = weight, color = medalist)) +
  geom_violin(draw_quantiles = c(0.5)) + # solid line marks the median
  geom_violin(aes(fill = medalist, alpha = 0.25), # dashed overlay adds the quartile lines
              draw_quantiles = c(0.25, 0.75), linetype = "dashed") +
  facet_wrap(~sex, labeller = as_labeller(c(`F` = "Female Athletes", `M` = "Male Athletes"))) +
  scale_x_discrete(name = "", labels = c("Qualifiers\n(non-medalists)", "Medalists\n(gold, silver & bronze)")) +
  scale_y_continuous(name = "Weight (kg)",
                     limits = c(40,110),
                     breaks = seq(40, 110, by = 10)) +
  scale_color_brewer_interactive(palette = "Paired") +
  scale_fill_brewer_interactive(palette = "Paired") +
  theme_bw() +
  theme(legend.position = "bottom",
        strip.background = element_rect(colour = "black", fill = "white")) +
  guides(alpha = "none", color = "none", fill = "none") +
  # Invisible text layer that only carries the tooltip. paste0() has no sep
  # argument, so the separators are passed as ordinary strings.
  geom_text_interactive(aes(label = "|", hjust = 0.9,
                            tooltip = paste0("Max: ", max, " kg\n", "3rd Quartile: ", q3, " kg\n",
                                             "Mean: ", round(mean, 2), " kg\n", "1st Quartile: ", q1, " kg\n",
                                             "Min: ", min, " kg")), color = NA)
girafe(
  ggobj = weight_by_sex_and_medalist,
  width_svg = 6,
  height_svg = 6*0.618)
There is a clear weight difference between male and female Olympic skiers, and a non-negligible rise in the weights for medalists across both men’s and women’s alpine skiing in comparison to their non-medalist counterparts.
Question 2: Are there weight differences for skiers who competed in different alpine skiing events?
olympics_alpine %>%
  mutate(event = str_remove(event, "^Alpine Skiing "), # drop the sport prefix from event names
         sex = recode(sex, 'M' = "Male Athletes",
                      'F' = "Female Athletes")) %>%
  # fct_reorder() orders the events by median weight (its default summary)
  ggplot(aes(x = weight, y = fct_reorder(event, weight), color = sex, fill = sex, alpha = 0.25)) +
  geom_boxplot(outlier.size = 1, outlier.shape = 1, outlier.alpha = 0.4) +
  scale_x_continuous(name = "Weight (kg)", limits = c(40,110), breaks = seq(40,110, by = 5)) +
  scale_y_discrete(name = "Event") +
  scale_color_brewer_interactive(palette = "Dark2") +
  scale_fill_brewer_interactive(palette = "Dark2") +
  theme_bw() +
  theme(legend.position = c(.99, .022),
        legend.justification = c("right", "bottom"),
        legend.box.just = "right",
        legend.margin = margin(6, 6, 6, 6)) +
  guides(alpha = "none",
         color = guide_legend(title = "Assigned Sex"),
         fill = guide_legend(title = "Assigned Sex", override.aes = list(alpha = 0.8)))
There is a clear, observable trend in the weight distributions of athletes across alpine skiing events, and the ordering of these distributions is identical for men’s and women’s events.
Question 3: How has the weight distribution of alpine skiers changed over the years?
# Split each event name into its category (e.g. "Downhill") and its sex grouping
alpine_df <- olympics_alpine %>%
  mutate(event = as.factor(event)) %>%
  mutate(eventby_category = recode(event,
                                   "Alpine Skiing Men's Downhill" = "Downhill",
                                   "Alpine Skiing Women's Downhill" = "Downhill",
                                   "Alpine Skiing Men's Slalom" = "Slalom",
                                   "Alpine Skiing Women's Slalom" = "Slalom",
                                   "Alpine Skiing Men's Giant Slalom" = "Giant Slalom",
                                   "Alpine Skiing Women's Giant Slalom" = "Giant Slalom",
                                   "Alpine Skiing Men's Super G" = "Super G",
                                   "Alpine Skiing Women's Super G" = "Super G",
                                   "Alpine Skiing Men's Combined" = "Combined",
                                   "Alpine Skiing Women's Combined" = "Combined"),
         eventby_sex = recode(event,
                              "Alpine Skiing Men's Downhill" = "Men's Events",
                              "Alpine Skiing Women's Downhill" = "Women's Events",
                              "Alpine Skiing Men's Slalom" = "Men's Events",
                              "Alpine Skiing Women's Slalom" = "Women's Events",
                              "Alpine Skiing Men's Giant Slalom" = "Men's Events",
                              "Alpine Skiing Women's Giant Slalom" = "Women's Events",
                              "Alpine Skiing Men's Super G" = "Men's Events",
                              "Alpine Skiing Women's Super G" = "Women's Events",
                              "Alpine Skiing Men's Combined" = "Men's Events",
                              "Alpine Skiing Women's Combined" = "Women's Events"))
# Per-year, per-sex summary statistics for the Downhill events
downhill_stats <- alpine_df %>%
  filter(eventby_category == "Downhill") %>%
  group_by(year, sex) %>%
  mutate(max = max(weight),
         q3 = quantile(weight, 0.75),
         mean = mean(weight),
         q1 = quantile(weight, 0.25),
         min = min(weight),
         uf = q3 + 1.5 * (q3 - q1), # upper fence: Q3 + 1.5 * IQR
         lf = q1 - 1.5 * (q3 - q1)) # lower fence: Q1 - 1.5 * IQR
downhill_yearly <- downhill_stats %>%
  ggplot(aes(x = year, y = weight, group = interaction(year,sex), alpha = 0.25, color = sex, fill = sex)) +
  geom_boxplot(position = position_dodge2(width = 0.3),
               outlier.size = 1, outlier.shape = 1, outlier.alpha = 0.25, width = 3) +
  scale_x_continuous(name = "Olympic Year",
                     limits = c(1946, 2016),
                     breaks = c(1948, 1952, 1956, 1960, 1964, 1968, 1972, 1976, 1980,
                                1984, 1988, 1992, 1994, 1998, 2002, 2006, 2010, 2014),
                     labels = c(1948, 1952, 1956, 1960, 1964, 1968, 1972, 1976, 1980,
                                1984, 1988, 1992, 1994, 1998, 2002, 2006, 2010, 2014),
                     expand = expansion()) + # named so it is not matched to minor_breaks
  scale_y_continuous(limits = c(40,110), breaks = seq(40,110, by = 10)) +
  scale_color_brewer_interactive(palette = "Dark2") +
  scale_fill_brewer_interactive(palette = "Dark2") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.x.bottom = element_text(angle = 90, vjust = 0.5, hjust = 1),
        legend.position = "none") +
  labs(title = "Weight Distribution of Downhill Alpine Skiers over Time",
       x = "Olympic Year", y = "Weight (kg)") +
  # Invisible text layers that only carry the tooltips. paste0() has no sep
  # argument, so the separators are passed as ordinary strings.
  geom_text_interactive(aes(label = sex,
                            tooltip = paste0("Year: ", year, "\n",
                                             "Max: ", max, " kg\n", "3rd Quartile: ", q3, " kg\n",
                                             "Mean: ", round(mean, 2), " kg\n", "1st Quartile: ", q1, " kg\n",
                                             "Min: ", min, " kg")), color = NA) +
  geom_text_interactive(aes(label = year,
                            tooltip = paste0("Year: ", year, "\n",
                                             "Max: ", max, " kg\n", "3rd Quartile: ", q3, " kg\n",
                                             "Mean: ", round(mean, 2), " kg\n", "1st Quartile: ", q1, " kg\n",
                                             "Min: ", min, " kg")), color = NA)
girafe(
  ggobj = downhill_yearly,
  width_svg = 8,
  height_svg = 8*0.618)
selected <- c("Downhill", "Slalom", "Giant Slalom","Super G","Combined")
alpineski_weights <- alpine_df %>%
  filter(eventby_category %in% selected) %>%
  ggplot(aes(year, weight, color = eventby_category)) +
  geom_line() +
  geom_point(size = 3) +
  geom_segment_interactive(aes(xend = year, yend = weight), linetype = 2) +
  facet_grid(~eventby_sex) +
  scale_x_continuous(name = "Year",
                     limits = c(1934, 2018),
                     breaks = c(1936, 1948, 1952, 1956, 1960, 1964, 1968, 1972, 1976,
                                1980, 1984, 1988, 1992, 1994, 1998, 2002, 2006, 2010, 2014),
                     labels = c(1936, 1948, 1952, 1956, 1960, 1964, 1968, 1972, 1976, 1980,
                                1984, 1988, 1992, 1994, 1998, 2002, 2006, 2010, 2014)) +
  scale_y_continuous(limits = c(40,110), breaks = seq(40,110, by = 10),
                     sec.axis = dup_axis(name = "")) +
  scale_color_brewer(palette = "Dark2") +
  guides(color = "none") +
  coord_cartesian(clip = "off") +
  transition_reveal(year) +
  theme_light() +
  labs(title = "Olympic Alpine Skier Weights (in kg) over Time",
       x = "Year", y = "Weight (kg)") +
  theme(aspect.ratio = 7/11,
        axis.text.x.bottom = element_text(angle = 90, vjust = 0.5, hjust = 1),
        strip.background = element_rect(color = "black", fill = "white"),
        strip.text = element_text(color = "black"),
        panel.background = element_rect(fill = NA, color = NA),
        plot.margin = margin(5,5,10,20),
        plot.title = element_text(hjust = 0.5)) +
  geom_text_repel(aes(label = eventby_category),
                  hjust = 0,
                  nudge_x = 2,
                  direction = "y",
                  xlim = c(NA, Inf))
animate(alpineski_weights, fps = 4, width=1200, height=500)
## Warning: No renderer available. Please install the gifski, av, or magick
## package to create animated output
According to the data set, the weight distribution of Olympic alpine skiers shifted upward between 1936 and 2014, albeit not always steadily.
Discussion: It appears that Olympic medalists in Alpine Skiing have, on average, higher weights than their counterparts who did not earn a medal. In Figure 1, the light blue violins are much taller than the dark blue violins, indicating that the range of weights among non-medalists was greater. Furthermore, the lower, middle and upper quartile values of weight for female Olympic medalists were 1 to 2 kg higher than those of their qualifying counterparts who did not earn a medal. Likewise, the lower, middle and upper quartile values of weight for male Olympic medalists were 3 kg higher than those of their qualifying counterparts who did not earn a medal. In short, the distribution for medalists was narrower, yet also a few kilograms higher. To confirm this association, it would be advisable to run a multivariate analysis that considers the effect of variables other than an athlete’s weight on the chance of earning a medal.
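One possible form of such a follow-up analysis is a logistic regression of medal status on weight together with other variables. The sketch below is only an illustration on simulated data: the data frame sim, the choice of sex and age as additional predictors, and the rule used to generate medal status are all assumptions, not results from olympics_alpine.

```r
set.seed(42)  # make the simulated data reproducible

# Simulated stand-in for olympics_alpine; every value here is made up.
n <- 200
sim <- data.frame(sex = rep(c("F", "M"), each = n / 2))
sim$weight <- rnorm(n, mean = ifelse(sim$sex == "M", 80, 62), sd = 7)
sim$age    <- rnorm(n, mean = 25, sd = 4)

# Assumed data-generating rule: heavier skiers medal slightly more often.
sim$medalist <- rbinom(n, size = 1, prob = plogis(-2 + 0.05 * (sim$weight - 70)))

# Logistic regression of medal status on weight, adjusting for sex and age.
fit <- glm(medalist ~ weight + sex + age, data = sim, family = binomial)

print(round(coef(summary(fit)), 3))
```

On the real data, the coefficient on weight would indicate whether the association with medaling persists once the other variables are accounted for.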
Figure 2 shows noticeable patterns in the weights of athletes competing in different Alpine Skiing event categories. The top-most boxplot represents the event category with the highest weight on average, while the bottom-most boxplot represents the lowest. The order of these event categories from highest to lowest average weight is consistent across both Men’s and Women’s competitions. In order from highest to lowest average weight, the event categories are as follows: (1) combined, (2) super-G, (3) downhill, (4) slalom and (5) giant slalom. In the Men’s events, the bottom three categories show a considerably greater spread in weight than the top two categories.
Lastly, Figures 3 and 4 establish the upward trend over time in the weights of Alpine Skiers at the Olympics. In the Women’s Downhill event, the lower, middle and upper quartile values of weight for female athletes each rose approximately 10 kg over the course of 66 years and 18 games. Likewise, the lower, middle and upper quartile values of weight for male athletes competing in the Downhill event each rose approximately 15 kg over the same period.
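The quartile shifts described above can be computed directly by comparing the quartiles of the first and last games. The sketch below does this in base R on made-up weights (first_games and last_games are hypothetical vectors, not values from the dataset).

```r
# Hypothetical Downhill weights in kg at the first and last games.
first_games <- c(60, 64, 68, 72, 76)
last_games  <- c(70, 75, 80, 85, 90)

# Lower quartile, median and upper quartile for each games.
probs <- c(0.25, 0.5, 0.75)
quartile_shift <- quantile(last_games, probs) - quantile(first_games, probs)

print(quartile_shift)
```

Applied to the real data, the same subtraction would be run separately for each sex after filtering to the Downhill events.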