BGEN516 - Data Viz Lab
Lab Overview
In this lab, I’ll use data visualization to better understand the Olympic athlete data we explored in the descriptive statistics lab. The dataset contains 271,116 observations for 15 variables, but we’ll focus on observations collected after 1991 (\(n = 122{,}216\)). From the descriptive stats lab:
Why 1991? It marks the end of the Cold War and a significant shift in global geopolitics. By focusing on Olympic data from this period onward, we can explore how geopolitical shifts, newly independent nations, and changing global dynamics influenced athlete participation and performance.
Data Preparation
I’ve filtered the data and stored it as follows:
olympics_post_1991: observations collected after 1991
unique_athletes: duplicate observations removed from olympics_post_1991
pivoted_athletes: transformed unique_athletes to long format
I’ll also remove NAs to prep the data for plotting.
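The preparation steps above can be sketched roughly as follows. This is a minimal illustration on toy data, not the exact code behind this lab: the column names (id, year, age, height, weight) and the definition of a duplicate (same athlete in the same Games) are assumptions.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the full Olympic dataset; column names and values are assumed
olympics <- tibble(
  id     = c(1, 1, 2, 2, 3, 4),
  year   = c(1988, 1992, 1996, 1996, 1996, 2000),
  age    = c(24, 28, 22, 22, NA, 30),
  height = c(180, 180, 165, 165, 172, 190),
  weight = c(75, 76, 58, 58, 64, NA)
)

# Keep observations collected after 1991
olympics_post_1991 <- olympics |>
  filter(year > 1991)

# Assumed rule: one row per athlete per Games (athlete 2 appears twice in 1996)
unique_athletes <- olympics_post_1991 |>
  distinct(id, year, .keep_all = TRUE)

# Long format: one row per athlete-feature pair, with missing values dropped
pivoted_athletes <- unique_athletes |>
  pivot_longer(c(age, height, weight),
               names_to = "variable", values_to = "value") |>
  drop_na(value)
```

The long format puts every measurement in a single value column, which is what makes it possible to facet a plot by variable later on.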
The Modern Olympian Profile: Summarizing Athlete Data
To start, I’ll examine the overall distribution of athlete features: age, height, and weight. I’ll then take a closer look at how these features vary over time.
In this document, I’ve used the label and fig-cap options for code chunks where I create a plot:
#| label: fig-your-label
#| fig-cap: "A Figure Caption"
The fig-cap option provides the text for a figure caption. If the code chunk label starts with fig-, Quarto will automatically number the figure, prepend the word “Figure” to the caption, and allow you to reference the figure inline by using @fig-your-label (e.g., Figure 1). If the label doesn’t start with fig-, then the caption will still appear, but without “Figure.”
Athlete overview
Individual boxplot displays
The individual boxplots for each variable allow us to focus on each athlete feature independently. We can see that height appears approximately normally distributed, while age and weight appear skewed.
Faceted boxplot display
The faceted boxplot allows us to examine multiple athlete features simultaneously. Note that the y-axis differs for each variable.
Although age and weight are measured on different scales, they display similar patterns, with most outliers occurring in the upper range, indicating right-skewed distributions. In contrast, height shows outliers on both the lower and upper ends, suggesting greater variability at both extremes.
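A faceted boxplot with a separate y-axis per variable can be sketched along these lines. I’m assuming ggplot2 and a long-format data frame with columns named variable and value; the data below are simulated for illustration and are not the Olympic data.

```r
library(ggplot2)

set.seed(516)

# Simulated long-format data standing in for pivoted_athletes (values are made up)
toy_long <- data.frame(
  variable = rep(c("age", "height", "weight"), each = 200),
  value = c(
    20 + rgamma(200, shape = 2, scale = 3),   # right-skewed, like age
    rnorm(200, mean = 175, sd = 10),          # roughly symmetric, like height
    55 + rgamma(200, shape = 2, scale = 10)   # right-skewed, like weight
  )
)

# scales = "free" gives each facet its own axes, so one variable's units
# do not compress the boxes of the others
p <- ggplot(toy_long, aes(x = variable, y = value)) +
  geom_boxplot() +
  facet_wrap(~ variable, scales = "free") +
  labs(x = NULL, y = NULL)

p
```

With a shared y-axis, the age box would be squeezed near the bottom of the panel because height and weight take much larger values.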
Athletes over the years
In the previous plots, we collapsed each variable across time to produce a single distribution. However, because we have year labels, we can see how variable distributions change over time.
Individual boxplot displays
Overall, athlete ages remain fairly consistent across Olympic Games. Most outliers appear at the upper end of the boxplots, indicating a right-skewed distribution. It seems that the number of outliers tends to vary in alternating years. The line in the middle of each boxplot represents the median; it hovers around 25 across all Olympic years. Let’s confirm the median age of athletes across all years:
median(unique_athletes$age, na.rm = TRUE)
[1] 25
Indeed, the median age across all athletes is exactly 25, matching what we observed in the boxplots!
We can see that median height hovers above 175 cm (or approximately 5 ft 9 in). Compared to age, athlete height varies more noticeably over the years. It exhibits a cyclical pattern, with spikes in outliers every other Olympic year. One year has no outliers at all: 2002. What might cause this trend?
This alternating pattern reflects the Summer–Winter Olympic cycle. Summer and Winter Games differ in types of sports, the physical requirements of those sports, and the number of athletes, all of which influence the distribution of heights.
The median weight hovers above 75 kg (or approximately 165 lbs). Athlete weight also varies more noticeably over the years, exhibiting a similar cyclical pattern where every other Olympic year sees a spike in outliers. Two years have no outliers at all: 1994 and 2010. Again, the alternating pattern reflects the Summer–Winter Olympic cycle.
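Rather than reading outlier counts off the plots, we could count them per year. Here is a minimal sketch using base R’s boxplot.stats(), which flags points more than 1.5 times the hinge spread beyond the hinges. The data frame and its values are invented for illustration.

```r
# Toy data: heights for two Olympic years; values are invented for illustration
toy <- data.frame(
  year   = rep(c(1992, 1994), each = 7),
  height = c(170, 172, 175, 176, 178, 180, 215,   # 215 falls beyond the upper fence
             168, 170, 172, 174, 176, 178, 180)   # nothing falls beyond the fences
)

# Count the points that boxplot.stats() flags as outliers within each year
outliers_per_year <- tapply(
  toy$height, toy$year,
  function(x) length(boxplot.stats(x)$out)
)

outliers_per_year
```

Note that ggplot2’s geom_boxplot() applies the same 1.5 × IQR rule but computes the quartiles with quantile(), so counts from this sketch may differ slightly from what a ggplot2 boxplot shows.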
Faceted boxplot display
Here, the faceted boxplot is less effective than the individual boxplots. Each facet contains many boxes, creating clutter, and the overlapping outliers produce dense clusters that visually dominate the plot. In this case, focusing on each variable separately over time, as in the individual boxplots, follows a core data visualization principle: avoid clutter! When a plot contains too many elements, it can overwhelm your audience and reduce clarity and interpretability.
A Song of Ice and Fire Sun: Comparing Winter and Summer Games
Okay, now that I’ve gotten my Game of Thrones reference out of the way, let’s first visualize the distribution of medal counts by season to explore differences between the Summer and Winter Olympics. After that, we’ll take a closer look at the top-performing countries. To do this, I first summarize the data by country and season for all medal-winning athletes.
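The summary step might look something like the sketch below. The column names (noc, season, medal) are assumptions, as is the convention that a missing medal value means no medal was won; the records themselves are invented.

```r
library(dplyr)

# Toy medal records; column names and values are assumed for illustration
toy_medals <- tibble(
  noc    = c("USA", "USA", "CAN", "CAN", "GER", "USA"),
  season = c("Summer", "Summer", "Winter", "Winter", "Summer", "Winter"),
  medal  = c("Gold", NA, "Silver", "Gold", "Bronze", NA)
)

# Keep medal-winning rows only, then count medals per country and season
medal_counts <- toy_medals |>
  filter(!is.na(medal)) |>
  count(noc, season, name = "n_medals")

medal_counts
```

Country-season pairs with no medals drop out entirely, so the resulting distribution describes medal-winning countries only.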
Medal counts
Paired boxplots
This plot displays the distribution of the total number of medals won by countries by season, allowing us to make comparisons between top-performing countries.
The two seasons have fairly similar median medal counts, which we can compare directly:
\(M_{Summer}= 17\)
\(M_{Winter}= 25\)
However, the Summer Olympics show more variation, with some countries winning a very large number of medals across years, as indicated by the outliers in the plot. This helps explain the relatively large standard deviation for Summer medal counts (SD = 238) observed in the Descriptive Statistics lab.
The Winter Olympics have fewer countries with extremely high medal counts, showing a more consistent distribution overall with a smaller standard deviation (SD = 115). This difference is likely influenced by the fact that fewer countries participate in the Winter Olympics, limiting the range of total medal counts.
Overall, the differences in variability between the Summer and Winter Olympics reflect both the number of participating countries and the spread of medal-winning performances.
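To see why a handful of very high medal counts can inflate the standard deviation while the median barely moves, here is a small sketch on invented numbers. These are not the lab’s medal counts, and the column names are assumptions.

```r
library(dplyr)

# Invented medal totals per country; the 380 plays the role of a dominant country
toy_counts <- tibble(
  season   = c(rep("Summer", 5), rep("Winter", 3)),
  n_medals = c(3, 12, 20, 45, 380, 6, 18, 50)
)

# Median and standard deviation of medal counts within each season
season_summary <- toy_counts |>
  group_by(season) |>
  summarise(
    median_medals = median(n_medals),
    sd_medals     = sd(n_medals)
  )

season_summary
```

In this toy example, the two medians are close (20 and 18), but the single extreme Summer value pushes the Summer standard deviation far above the Winter one.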
Top countries
Lollipop plots
Across all Games from 1992 through 2016 combined, the United States (U.S.) leads in total medal count. In the Summer Games plot (Figure 10 A), Germany, Russia, Australia, and China follow the U.S. but with considerably fewer medals. In the Winter Games plot (Figure 10 B), Canada surpasses the U.S., with Germany also ranking high in medal count.
The U.S. lead over the next-ranked countries in the Summer Games is notably larger than Canada’s lead in the Winter Games. In other words, the top countries cluster more tightly in the Winter Games than in the Summer Games, where the spread is wider. This suggests that medal counts in Winter Games are more closely contested among leading countries, while in the Summer Games, the U.S. held a more dominant position from 1992 through 2016.
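For reference, a lollipop plot can be built from a line segment plus a point for each category. The sketch below assumes ggplot2 and uses placeholder countries with invented totals, not the values behind Figure 10.

```r
library(ggplot2)

# Placeholder countries with invented medal totals
toy_totals <- data.frame(
  country = c("A", "B", "C", "D"),
  medals  = c(120, 85, 60, 40)
)

# Each lollipop is a segment from 0 to the value (the stick) plus a point (the head);
# reorder() sorts countries by medal count so the ranking is easy to read
p <- ggplot(toy_totals, aes(x = medals, y = reorder(country, medals))) +
  geom_segment(aes(x = 0, xend = medals, yend = reorder(country, medals))) +
  geom_point(size = 3) +
  labs(x = "Total medals", y = NULL)

p
```

Sorting the categories by value is a design choice: it lets the eye compare neighboring countries directly, which is how the margin between the leader and the runners-up becomes visible.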
Discussion
In the descriptive statistics lab, we computed summary statistics and displayed them in tables. These statistics provided us with precise measures to represent our data. But descriptive statistics alone can hide important aspects of a dataset, including outliers, and they can’t show us how the data are arranged within those measures.
With visualization, we can refine and update our understanding of the data. Some visualizations emphasize overall patterns and trends, while others make it easier to compare points and identify differences.
Boxplots highlight spread and potential outliers in addition to symmetry and skewness. They allow us to see where data cluster, where outliers occur, and how evenly the data are distributed.
Lollipop plots emphasize the magnitude of specific values. They represent each value with a line and a point, using minimal ink, which makes it easier to compare the magnitude (i.e., size) of values across categories.
Exploration vs. explanation
When engaged in exploration, tables give precise values that allow us to check calculations, compare specific observations, and detect subtle differences that may not be visible to the human eye in a plot. In contrast, plots reveal trends, clusters, and outliers, helping us understand the structure of the data.
When engaged in explanation, plots are often more effective than tables because they communicate trends and distributions at a glance. However, tables can also support explanation by providing the exact numbers behind visual patterns (e.g., precise values of outliers or the counts in each section of a distribution), allowing us to verify claims based on plots.