The Canadian Ski Marathon

About two weeks ago, I participated in my favourite community event, the Canadian Ski Marathon, or CSM. The CSM has been running for 52 years. It currently covers 160 km spread over two days and two provinces (Québec and Ontario). Each day challenges the skier to complete five sections, each roughly 10 to 20 km long and rated easy, intermediate, or difficult. While it is not a race, skiers attempting the complete course register as “Coureur des Bois” (CDB, or Fur Traders) and must reach a checkpoint by 3:15 pm, after which they are cut off. Participants who complete the Bronze CDB attempt the Silver CDB the following year, which involves carrying a 5 kg (11 lb) pack. Gold CDB skiers carry the pack and sleep outside at Gold Camp. The rest of us attempt fewer sections and take later start times. The marathon is wonderful, friendly, visually stunning, challenging, and an escape from city life and slushy sidewalks, if just for two days.

During my first year participating in this event, in 2017, I was met with a wonderful surprise at dinner after the first day of skiing. In the crowd I spotted my first-ever academic supervisor, Stephen Walter, whom I had not seen since 2009. Stephen will always be Dr. Walter to me, and I owe him a great debt of gratitude for taking me on as an undergraduate for a four-month internship during my first foray into biostatistics. That year Dr. Walter gave me the opportunity to contribute to projects on genotyping for cervical cancer, childhood cerebral palsy developmental trajectories, and the Titanic.

Back then, I was not the least bit athletic, and while I knew of some of Stephen’s adventures, I didn’t really know. Since then I have become hungrier for these outdoor challenges myself, which brought me to the ski marathon. Seeing Stephen at the CSM was a bit of an a-ha moment. I saw another side of a great academic and shared in his excitement for skiing. We talked about work, skiing, and the impressive community of skiers who come out again and again to this event.

This year, I spent an embarrassing amount of my time in the ski tracks thinking about the data on the CSM website. In particular, I wanted to see how participants of different ages performed. One of the wonderful things about the ski marathon is the sheer variability in age. The youngest competitor in 2017 was 7 and the oldest was 80. And it is not uncommon to be passed by a bad-ass 50-something with a fanny pack covered in CSM badges from past events.

Pulling the data

I followed Cory Nissen’s advice on how to use rvest to scrape an HTML table. Cory explains how to use the inspect tool in your web browser to identify the name of the web table and extract these data into an R data frame object. Here is the code to do this for the 2018 CSM results, which stores the results for men and women in two separate tables:

# Install these first if you don't have them already.
library(rvest)
library(tidyverse)
url <- "http://skimarathon.ca/2018-results/2018-individual-results/"

csm.females <- url %>%
  read_html() %>% 
  html_nodes(xpath = '//*[@id="tablepress-58"]') %>%
  html_table()

csm.females <- csm.females[[1]]
csm.females <- csm.females %>% mutate(gender = "Girls and women")

csm.males <- url %>%
  read_html() %>% 
  html_nodes(xpath = '//*[@id="tablepress-58-no-2"]') %>%
  html_table()

csm.males <- csm.males[[1]]
csm.males <- csm.males %>% mutate(gender = "Boys and Men")

csm <- rbind(csm.females, csm.males)

no.shows <- csm %>% filter(Sections == 0) %>% tally()

csm <- csm %>% filter(Sections != 0)

Because data for men and women were in separate tables, I used mutate to add a variable denoting gender to each table before rbinding them together. I only considered skiers who completed at least one section, by filtering on the Sections variable.
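As a rough sketch of how the steps above could be folded into one function, the snippet below parses the page once and reuses it for both tables. The helper name read_results_table and the tiny inline page are my own assumptions for illustration (the rows are made up so the example runs offline); they are not part of the CSM site or of Cory's post.

```r
library(rvest)
library(dplyr)

# Hypothetical helper (not from the original analysis): pull one table out of
# an already-parsed page by its HTML id and tag every row with a gender label.
read_results_table <- function(page, table_id, gender_label) {
  page %>%
    html_element(xpath = sprintf('//*[@id="%s"]', table_id)) %>%
    html_table() %>%
    mutate(gender = gender_label)
}

# Stand-in page with made-up rows; with the real site you would instead use
# page <- read_html(url)
page <- minimal_html('
  <table id="tablepress-58">
    <tr><th>Age</th><th>Sections</th></tr>
    <tr><td>34</td><td>6</td></tr>
    <tr><td>12</td><td>0</td></tr>
  </table>
  <table id="tablepress-58-no-2">
    <tr><th>Age</th><th>Sections</th></tr>
    <tr><td>51</td><td>10</td></tr>
  </table>')

# Stack the two tables and drop skiers with zero completed sections.
toy <- bind_rows(
  read_results_table(page, "tablepress-58", "Girls and women"),
  read_results_table(page, "tablepress-58-no-2", "Boys and Men")
) %>%
  filter(Sections != 0)

toy
```

Parsing the page once avoids downloading it a second time, and bind_rows behaves like rbind here because both tables share the same columns.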

Exploratory data analysis

I then quickly examined the distributions of age and sections completed by gender. I like to look at the age distributions by single-year age groups as a first step to detect coding errors or strange findings that would otherwise be hidden by grouping age into categories.

Interestingly, there are a few distinct peaks, with a nadir in participation around 19 or 20 years old, about a year or two after students begin college or university in Canada and the US. Eyeballing it, one of the later peaks is near age 50, illustrating the sheer density of older skiers who left me in the dust (or snow)!

ggplot(csm, aes(x = Age)) + 
  geom_histogram(col = "white", binwidth = 1, aes(fill = gender)) +
  facet_wrap(~ gender, nrow = 2) +
  theme_minimal() + 
  scale_fill_manual(values = c("#66c2a5", "#f46d43")) +
  ggtitle("Canadian ski marathon age distribution by gender, 2018") +
  guides(fill = "none") +
  ylab("Number of participants")

Next up, I examined the distribution of the number of sections completed by gender. Remember that the maximum is ten. There is also a “half marathon” that is really a 60% marathon, where skiers aim for 6 out of 10 sections.

ggplot(csm, aes(x = Sections)) + 
  geom_histogram(aes(fill = gender), col = "white", binwidth = 1) + 
  facet_wrap(~ gender, nrow = 2) + 
  scale_fill_manual(values = c("#66c2a5", "#f46d43")) +
  theme_minimal() + 
  ggtitle("Distribution of the number of completed \nsections by gender, CSM (2018)") +
  ylab("Number of participants") + 
  scale_x_continuous(breaks = 1:10, labels = 1:10) +
  guides(fill = "none") +
  theme(panel.grid.major.x = element_blank(), panel.grid.minor.x = element_blank())

From these graphs, we can perform a back-of-the-envelope calculation and find that roughly 5 in every 6 Coureur des Bois finishers are men. More on this later!
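The same kind of estimate could also be computed directly instead of read off the histograms. Here is a minimal sketch, assuming a Coureur des Bois finisher is anyone with Sections equal to 10. The toy data frame is made up purely so the snippet runs on its own; it is not the real CSM results and will not reproduce the 5-in-6 figure.

```r
library(dplyr)

# Made-up toy rows for illustration only; NOT the real CSM results.
toy <- tibble(
  gender   = c("Boys and Men", "Boys and Men", "Boys and Men",
               "Girls and women", "Girls and women"),
  Sections = c(10, 10, 4, 10, 6)
)

# Keep only skiers who completed all ten sections, then compute each
# gender's share of those finishers.
cdb_share <- toy %>%
  filter(Sections == 10) %>%
  count(gender) %>%
  mutate(share = n / sum(n))

cdb_share
```

Swapping toy for the scraped csm data frame should give the actual split for 2018.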

Augment the data: adding age groupings

We’re about ready for our plots de resistance! But first, let’s categorize age so we can use it as a grouping variable, using mutate and cut. I agonized about this for a little too long. Basically, the tails of the age distribution are the most interesting but have sparser data, so I wanted the right balance of grouping folks with similar physical capabilities and not making the data too sparse. I also wanted to respect the requirement that solo CDB skiers must be 18+. Here are the groupings I used:

csm <- csm %>% mutate(`Age Group` = cut(Age, 
                                      breaks = c(min(csm$Age) - 1, 11, 17, 29, 44, 59, max(csm$Age)),
                                      labels = c("5-11", "12-17", "18-29", "30-44", "45-59", "60-83")))
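One detail worth checking is where the boundary ages land. By default cut uses intervals that are open on the left and closed on the right, which is why the first break sits one below the minimum age: without the subtraction, the youngest skiers would fall outside every interval and come back as NA. The sketch below runs the same breaks on a hypothetical vector of boundary ages (assuming a minimum of 5 and a maximum of 83, as the labels suggest) rather than on the real data.

```r
# Hypothetical boundary ages, chosen to sit on either side of each break.
ages <- c(5, 11, 12, 17, 18, 29, 30, 44, 45, 59, 60, 83)

# Same breaks and labels as above; cut() defaults to right = TRUE, so each
# interval includes its upper break and excludes its lower one.
groups <- cut(ages,
              breaks = c(min(ages) - 1, 11, 17, 29, 44, 59, max(ages)),
              labels = c("5-11", "12-17", "18-29", "30-44", "45-59", "60-83"))

data.frame(age = ages, group = groups)
```

Each pair of neighbouring ages should share a label, with 18-year-olds starting the first adult group, which matches the solo CDB age requirement.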

Histograms of performance by age group and gender

ggplot(csm, aes(x = Sections)) + 
  geom_histogram(binwidth = 1, col = "white", aes(fill = gender)) + 
  facet_grid(gender ~ `Age Group`, labeller = labeller(`Age Group` = label_both, gender = label_value)) + 
  scale_fill_manual(values = c("#66c2a5", "#f46d43")) +
  scale_x_continuous(breaks = c(1:10), labels = c(1:10)) + 
  ggtitle("Distribution of the number of completed sections by age group and gender") +
  xlab("Number of sections completed") +
  ylab("Number of participants") +
  guides(fill = "none") +
  theme_minimal() + 
  theme(
    panel.grid.minor = element_blank(), 
    panel.grid.major.x = element_blank())

These plots enable easy comparison across gender, or across age. A striking feature of this year’s results is the increasing number of Coureur des Bois (those completing 10 sections) when you look across the age spectrum for boys and men. It’s also interesting to compare across genders to see how the histograms are similar for the youngest kids and slowly diverge with age. A few other patterns can be noted, like the threshold between 6 and 7 sections, marking the 6-section half marathon finishers, most obvious among those aged 45-59. There is also a subtle 2-4-6 pattern, especially for women aged 30-44, perhaps reflecting participants choosing to complete 1, 2, or 3 sections on each day of the marathon.

As I mulled over these results, I discussed them with a close friend, Jen Murray, who is an avid skier and all-around stellar athlete. Jen mentioned that in recent years women have been closing the performance gap across many sports. A fantastic recent example of this occurred when Margo Hayes broke the “5.15” ceiling on women’s rock climbing, a grade of climbing that had previously only been completed by men. After Margo, two other women quickly climbed 5.15. It would certainly be interesting to look back 10, 20, or 50 years and see how these histograms have evolved, to see how things have changed over the CSM’s 52-year tenure.