Introduction

Rock climbing popularity has been dramatically increasing in the past 10 years. As more people have been getting involved in this sport, there has been a increased interest in understanding the sport. This leads to the center of my research questions to try and understand the demographic of climbers, their progression through the sport over time, and characterize expected performance. Specifically, my questions fall into the following three topics:

  1. Hardest grades between different disciplines:
  • What is the distribution of hardest grades achieved for each discipline (sport climbing and bouldering)?
  • Is there a advantage for pursuing both disciplines? Do people achieve a height max grade if they train both sport and bouldering?
  • For people that pursue both disciplines, do they typically climb harder in sport or bouldering?
  1. Climber composition relationship to performance:
  • Is there a relationship between maximum grades achieved and gender?
  • Is there a relationship between maximum grades achieved and height?
  • Is there a relationship between maximum grades achieved and weight (or BMI)?
  • Is there a peak age in ability?
  • What is the distribution of countries in the climber population? Is there a country that is more dominant in the sport?
  1. Progression over time
  • What can people expect from time of starting to climb to achieve a specific grade? For instance, how long will it take on average to climber your first 5.12a?
  • What is the typical progression from grade to grade? For instance, how long does it take to climb a 5.12b after you are able to climb a 5.12a?

Now that we have described all the questions I would like to pursue, let’s dive into the data!

Data

Luckily, there has been an effort to gather user information and the corresponding climbing information from a popular website, 8a.nu. This dataset was gathered by David Cohen and is open source on Kaggle. The 8a.nu website is a platform for climbers to log their climbs. It has been active since 1999 and was created by Jens Larssen in Sweden.

The data set contains logs from over 60 thousand users with over 4 million entries. It is broken into four different tables: ascent, grade, user, and method. Below, we will dive into the fields that we will use.

  1. Ascent: contains all of the logs with user_id, grade_id, and method_id to join the other tables
  • id (integer) - unique id for each entry
  • user_id (integer) - unique id for each user
  • grade_id (integer) - unique for the ranges grades across different metrics(e.g. french grade system versus the Yosemite(USA) grade system)
  • date (integer) - the date that the climb was accomplished - can be converted to date time
  1. Grade:
  • id (integer) - a unique id that corresponds to the grade_id with the other tables
  • usa_boulders (character) - the bouldering grade system used in the US, V0-V20 (where V0 is the easiest and V20 is the hardest)
  • usa_routes (character) - the sport grade system used in the US, 5.1 - 5.15c (where 5.1 is the easiest and 5.15c is the hardest)
  1. User:
  • id (integer) - unique id that corresponds to the user_id with the other tables
  • country (character) - the abreviation of the country the user is from
  • sex (integer) - a binary code of 0 for male, 1 for female and 255 for not applicable
  • height (integer) - self entry height in cm
  • weight (integer) - self entry of weight in kg
  • birth (date) - the user’s birth date in YYYY-MM-DD format

The method table was not used for this analysis.

Reading in the Data

The data was scraped from the website and saved as a sqlite format. To read in the data, we need to create a connection to the database. Then we can query the database, to save off the table we will use for analysis. Note, it is important to close the connection after saving off the tables.

# create a connection to the sqlite database 
data_path <-  paste(here(), "/climber_data/climber_database.sqlite", sep="")
db_conn <-  dbConnect(SQLite(), data_path)

# query the database to get the separate tables 
grade <- as_tibble(tbl(db_conn, "grade"))
ascent <- as_tibble(tbl(db_conn, "ascent"))
user <- as_tibble(tbl(db_conn, "user"))

dbDisconnect(db_conn)

The ascent data is very large with 4 million entries. We will mainly be focusing on the ascent data and join the method and grade table to convert the unique ids into meaningful information. We are going to take the first 10,000 users to reduce the size of the dataset. This sample is large enough and random, so it should be a good representative of the larger dataset.

# select the first 10,000 users
top_users_ascents <- ascent %>% filter(user_id >= 1, user_id <= 10000)

Cleaning the data

There are some values that are noted about this data set that need to be cleaned. For instance, when examining the values of the years by running a unique on the year, there are values that are not complete.

unique(top_users_ascents$year)
##  [1] 1999 1998 2000 2001 1997 1993 2002 1995 1996 1994 1992 1991 1981 2004 2003
## [16] 1989 1990 2005    0 2006 2007 1985 1980 2010 2008 2012 2009 2011 1987 1986
## [31] 2015 2014 1982 1983 1984 1988 2013 2017 2016   11  201  207

We are going to filter this data set to entries that only contain valid year values.

top_users_ascents <- top_users_ascents %>% filter(grepl("^(19|20)\\d{2}", year))

There are also many columns of data that are not very useful for our investigations. We can reduce this data set even further by deleting these columns.

top_users_ascents <-  top_users_ascents %>% select(id, user_id, grade_id, method_id, climb_type, date, year, crag_id, crag, country, rating)
head(top_users_ascents)
## # A tibble: 6 x 11
##      id user_id grade_id method_id climb_type   date  year crag_id crag  country
##   <int>   <int>    <int>     <int>      <int>  <int> <int>   <int> <chr> <chr>  
## 1     2       1       36         3          0 9.18e8  1999   16596 Rail~ "THA"  
## 2     3       1       36         3          0 9.26e8  1999       0 Nya ~ ""     
## 3     4       1       36         3          0 9.33e8  1999     209 Sjöä~ "SWE"  
## 4     5       1       36         3          0 9.33e8  1999     209 Sjöä~ "SWE"  
## 5     6       1       36         3          0 9.33e8  1999       0 Rank~ ""     
## 6     7       1       38         3          0 9.14e8  1998   16596 Rail~ "THA"  
## # ... with 1 more variable: rating <int>

Later in this post, we are going to join the grade and user table with our ascent data. First, we will clean the user and grade table to contain only the vital information for our analysis. We will be simplifying the grade table by only selecting the USA grades (instead of the other measurement systems) based on with the associated grade IDs. Then we removed any grade IDs that are not associated with either USA sport grades or USA boulder grades.

# select only the usa routes for our analysis 
grade_usa <- grade %>% select(id, usa_routes, usa_boulders)
grade_usa <-  grade_usa %>% filter(usa_routes != "" | usa_boulders != "")
head(grade_usa)
## # A tibble: 6 x 3
##      id usa_routes usa_boulders
##   <int> <chr>      <chr>       
## 1     1 3/4        VB          
## 2     7 5.1        VB          
## 3    11 5.2        VB          
## 4    12 5.3        VB          
## 5    13 5.3        VB          
## 6    15 5.3        VB

We will do something similar to the user table as we did with the grade table. We selected only the columns that contain information that we will use in our trend analysis: id, country, sex, height, weight, year user started, birth date of user

user_trend <- user %>% select(id, country, sex, height, weight, started, birth)
user_trend <- user_trend %>% filter(!is.na(birth))
head(user_trend)
## # A tibble: 6 x 7
##      id country   sex height weight started birth     
##   <int> <chr>   <int>  <int>  <int>   <int> <chr>     
## 1     1 SWE         0    177     73    1996 1976-03-10
## 2     3 SWE         0    180     78    1995 1973-09-09
## 3     4 SWE         1    165     58    2001 1984-07-26
## 4     5 USA         0      0      0    1991 1969-05-07
## 5     6 AUS         0    185     73    1992 1970-12-02
## 6    10 SWE         0    167     63    1992 1965-06-22

Analysis

We will now start to answer our research questions through data transformations and visualizations.

Data transformation

We are going to split the data set into three different specific categories: users that sport climbing, users that boulder, and users that do both. This will allow us to conduct a more focus analysis on these three different groups. First, we will create the sport and boulder data sets. This is done with filtering out based on the climbing type flag, which is 0 if it is sport or 1 if it is bouldering. Additionally, we will also joined the grade table to include a translation of the grade_id to a the USA ranking for boulders or sport climbs. This will be helpful when we are looking at grade progressions.

sport_ascents = top_users_ascents %>% filter(climb_type == 0) %>%
  inner_join(grade_usa, 
             by = c("grade_id" = "id")) %>%
  mutate(grade=usa_routes) %>%
  select(-usa_boulders, -usa_routes) %>% 
  arrange(user_id)

boulder_ascents = top_users_ascents %>% filter(climb_type == 1) %>%
  inner_join(grade_usa,
            by = c("grade_id" = "id")) %>%
  mutate(grade=usa_boulders) %>%
  select(-usa_boulders, -usa_routes) %>% 
  arrange(user_id)

Now that we created the two different separate datasets, let us look at the distribution of grades within the two disciplines.

# keys for ordering plots
sport_key <- grade_usa$usa_routes %>% unique()
boulder_key <- grade_usa$usa_boulders %>% unique()

ggplot(sport_ascents, aes(x=factor(grade, levels = sport_key))) + geom_bar() + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) + ggtitle("Distribution of Sport Grades that People Climb") + xlab("Sport Grades")

ggplot(boulder_ascents, aes(x=factor(grade, levels = boulder_key))) + geom_bar() + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) + ggtitle("Distribution of Boulder Grades that People Climb") + xlab("Boulder Grades")

This distribution is interesting to note that the maximum count of sport grade logged is 5.11d and seems to also have high counts around 5.12a, 5.12b, 5.12c, 5.12d. On the bouldering side, it is also important to note the obvious maximum count of bouldering grade is V5 with some high counts V8 -V9. This shows that people tend to log harder grades. For instance, if people were logging since they started climbing, there would be more logs of the easier climbs that then harder climbs because everyone starts in the beginning grades (with few exceptions). This shows there will already be a bias in the data due to the self entry factor. Another fact that will be good to look at in the amount of sport and bouldering logs, shown below.

length(sport_ascents$user_id)
## [1] 586467
length(boulder_ascents$user_id)
## [1] 306878

This shows there is more popularity in the sport climbing discipline versus the bouldering discipline with this dataset.

Now to create the dataset of climbers that pursue both disciplines. A naive approach would be to loop through each user id to see if it exists in both the sport and bouldering dataset. Another approach could be to set a current user id as you loop through each entry of the entire dataset and check if the climbing type ever changes. However, both of these approaches are very time extensive since it involves multiple loops across the dataset. The work around that is more time efficient is to create two vectors of the unique user IDs within the sport and bouldering datasets. These vectors can be used to loop once through and search for the intersection between the two to create the dataset of interest. We will do this with the built in intersect function in the DPLYR package. After this intersection is found, we can filter the user IDs that are in the intersection vector and join the grade table with the correct grades.

uniq_ids_sport = unique(sport_ascents$user_id)
uniq_ids_boulder = unique(boulder_ascents$user_id)
user_id_of_both <- intersect(uniq_ids_sport, uniq_ids_boulder)


both_ascents_grade = top_users_ascents %>% filter(user_id %in% user_id_of_both) %>%
  inner_join(grade_usa,
            by = c("grade_id" = "id")) %>% 
  mutate(grade=if_else(climb_type==0, 
                       usa_routes, usa_boulders)) %>%
  arrange(user_id) %>%
  group_by(usa_routes) %>% 
  mutate(route_count = n()) %>% 
  ungroup() %>%  
  group_by(usa_boulders) %>% 
  mutate(boulder_count = n()) %>% 
  ungroup()

both_ascents <- both_ascents_grade %>% 
  select(-usa_boulders, -usa_routes)

From these plots, the distribution of the sport climbers overall versus the climbers that do both seem pretty consistent. However, it looks like across all of the bouldering climbers versus the climbers do both actually perform better. This is a contradiction to the previous belief that climbing both discplines have inherent advantages.

Hardest grades between different disciplines

We will first find the hardest sport grade and hardest boulder grade for each user within the users that climb both disciplines. Then we will take the difference in grades to find whether they climb a harder boulder grade or a harder sport climb. The idea that is communicated through the climbing community is that a boulder grade translates to a specific sport grade, which is why they have the same grade_id.

# max sport grade
sport_both_hardest <- sport_ascents %>% filter(user_id %in% user_id_of_both) %>% 
  select(user_id, grade_id, grade) %>% 
  arrange(user_id, desc(grade_id)) %>% 
  filter(!duplicated(user_id)) %>% 
  rename("sport_grade_id" = "grade_id", "sport_max_grade"="grade")

# max boulder grade 
boulder_both_hardest <- boulder_ascents %>% filter(user_id %in% user_id_of_both) %>% 
  select(user_id, grade_id, grade) %>% 
  arrange(user_id, desc(grade_id)) %>% 
  filter(!duplicated(user_id)) %>% 
  rename("boulder_grade_id" = "grade_id", "boulder_max_grade"="grade")

# the difference between the two grades 
diff_hardest <- sport_both_hardest %>% 
  inner_join(boulder_both_hardest) %>% 
  mutate(grade_diff = boulder_grade_id - sport_grade_id)
## Joining, by = "user_id"

Now that we have the data sets for the hardest climbs between the grades, let us look at distribution of the hardest grades in each discipline, and the categories of the max grade found.

ggplot(sport_both_hardest, aes(x=factor(sport_max_grade, levels = sport_key))) + geom_bar(fill="darkslateblue") + theme_minimal() + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) + ggtitle("Distribution of Hardest Sport Grades for Both") + xlab("Max Sport Grades")

ggplot(boulder_both_hardest, aes(x=factor(boulder_max_grade, levels = boulder_key))) + geom_bar(fill="darkseagreen4") + theme_minimal() + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) + ggtitle("Distribution of Hardest Boulder Grades for Both") + xlab("Max Boulder Grades")

#zoomed 
ggplot(diff_hardest, aes(x=grade_diff)) + geom_density() +  geom_vline(aes(xintercept = mean(grade_diff)),
             linetype = "dashed", size = 0.6,
             color = "#FC4E07") + xlim(-20, 20) + ggtitle("Distribution of the Difference between Sport and Boulder Max Grades") + xlab("Difference (Sport - Boulder) Max Grades")
## Warning: Removed 54 rows containing non-finite values (stat_density).

diff_hardest_groups <- diff_hardest %>% 
  mutate(groups=case_when(grade_diff < -4 ~ "sport", 
                          grade_diff >= -4 & grade_diff <= 0 ~ "equal",
                          grade_diff > 0 ~ "boulder")) 

ggplot(diff_hardest_groups, aes(x=groups, fill=groups)) + geom_bar() + theme_minimal() +  ggtitle("Max Grades Groups") + xlab("Groups: Sport = x < -4, Equal = -4 <= x <= 0, Boulder x > 0")

This is interesting because it show that the majority of climbers that pursue both discipline, climb a harder sport grade than a boulder grade.

Climber composition relationship to performance

Since sport climbing is more popular, we will focus on the sport climbing dataset. First, we will look at the distribution of the gender of within the data set.

sport_users <- sport_both_hardest %>% 
  inner_join(user_trend, by = c("user_id" = "id"))

top_users <- top_users_ascents %>% 
  inner_join(user_trend, by = c("user_id" = "id")) %>% 
  filter(sex != 255) %>% 
  rename("country_origin" = "country.y", "country_crag" = "country.x")

# plot gender  
ggplot(top_users, aes(x=factor(sex, labels = c("Male", "Female"), levels = c("0", "1")), fill=sex)) + geom_bar(aes(y = (..count..)/sum(..count..))) + theme_minimal() + ylab("Percent") + ggtitle("Distribution of Gender within Data") + xlab("Gender") + theme(plot.title = element_text(size = 20), legend.position = "none") 

This shows that this data is very biased towards male climbers. There is not enough female entries, so we will not focus on gender as being a contributor to performance. We will focus on BMI and height to see if the body composition has a correlation on performance.

First we will plot the max grade given a certain height. This would be useful for people to see if there is a expected max grade given your height.

#grade vs height
grade_height <- sport_users %>% 
  select(sport_grade_id, height) %>% 
  filter(height > 0 & !is.na(height)) %>% 
  # converting from cm to ft
  mutate(height = (height/2.54)/12) %>% 
  arrange(height) %>% 
  group_by(height) %>% 
  summarise(avg_grade = mean(sport_grade_id), sd = sd(sport_grade_id)) %>% 
  filter(height < 7 & height > 4) %>% 
  mutate(avg_grade = as.integer(avg_grade)) %>% 
  inner_join(grade_usa, by = c("avg_grade" = "id")) %>% 
  select(-usa_boulders)

ggplot(grade_height, aes(x=height, y=usa_routes)) + geom_point() + theme_minimal() + theme(axis.text.x = element_text(angle = 0, vjust = 0.5, hjust=1)) + xlab("Height (ft)") + ylab("Sport Grades")

We focus on the heights between 4 feet and 7 feet. This is due to the fact that many people put in 0 or other small numbers as their height. This plot looks to have two distinct minimal of average max grades. It looks like people that are around 4“7’ to 5” may have a disadvantage. It also seems after reaching the height of 6" the max grade decreases. However, there is no distinct trend and the data seems inclusive if height gives an advantage. Another way to view this data would be to see what the average height is for each grade. This can show us if there is an upward or downward trend in height was the grade increases.

#height vs max grade
height_grade <- sport_users %>% 
  select(sport_max_grade, height) %>% 
  mutate(height = (height/2.54)/12) %>% 
  filter(height < 7 & height > 4) %>% 
  arrange(sport_max_grade) %>% 
  filter()

summary_height_grade <- height_grade %>% 
  group_by(sport_max_grade) %>% 
  dplyr::summarise(avg_height = mean(height))

ggplot(summary_height_grade, aes(x=factor(sport_max_grade, levels=sport_key), y=avg_height)) + geom_point() + theme_minimal() + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) + xlab("Sport Grades") + ylab("Height(ft)")

This chart definitely shows a more steady trend, with a solid average height for most grades around 5“8’ to 5”10’. Oddly enough, it also shows a downward trend of height as the grades progress after 5.12d and 5.13c. This is actually contrary to the current belief that taller is better for climbing. However, the downward trend only reaches a minimal of about 5"7’. There is an outlier of the grade 5.9 and 5.13c. These outliers could be due to the fact there are only a few entries of each of these grades out of 1600 entries total.

New we are going to look at weight and performance. Instead of looking purely at weight in kg, we are going to look at BMI since it takes in height as a consideration.

#weight vs max grade
weight_grade <- sport_users %>% 
  select(sport_max_grade, weight, height) %>% 
  arrange(sport_max_grade) %>% 
  mutate(bmi = weight/(height/100)^2) %>% 
  filter(bmi > 1  & bmi < 100)

summary_weight_grade <- weight_grade %>% 
  group_by(sport_max_grade) %>% 
  dplyr::summarise(avg_bmi = mean(bmi))

ggplot(summary_weight_grade, aes(x=factor(sport_max_grade, levels=sport_key), y=avg_bmi)) + geom_point() + theme_minimal() + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) + xlab("Grade") + ylab("BMI")

This also shows a downward trend in BMI as the max grades increase.

Next we are going to look at the distribution of climbers across the different countries. This will show which countries have a higher concentration of climbers.

country_users <- top_users %>% 
  filter(country_origin != "") %>% 
  group_by(country_origin) %>% 
  count() %>% 
  ungroup() %>% 
  arrange(desc(n)) %>% 
  top_n(15)
## Selecting by n
ggplot(country_users, aes(x=country_origin, y=n)) + geom_bar(stat = "identity") + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) + ggtitle("Countries with the Top 15 Climber Population") + xlab("Countries") + scale_y_continuous(name="Count", labels = scales::comma)

Above are the top 15 countries.

Next, we are going to look to see if there is a peak age for climbers. We will do this by examining the age of each person when they achieve their max grade (sport or bouldering). We first do this through calculating their age with the lubridate package.

age_users <- top_users %>% 
  select(id, user_id, grade_id, date, birth, climb_type) %>% 
  mutate(date = date(as_datetime(date))) %>% 
  mutate(age = interval(ymd(birth), date) %/% years(1))

max_grades <- top_users %>%  
  select(id, user_id, grade_id) %>% 
  arrange(user_id, desc(grade_id)) %>%
  filter(!duplicated(user_id))

Now that we have the max grade_id of all users and another table with the user information including the age when the climb was accomplished. This allows us to find the age when the person climbed their max grade by doing an inner_join of the two tables.

peak_age <- age_users %>% 
  inner_join(max_grades, by="id") %>% 
  filter(user_id.x == user_id.y & grade_id.x == grade_id.y) %>% 
  rename("user_id" = "user_id.x", "grade_id" = "grade_id.x") %>% 
  select("user_id", "grade_id", "age", "climb_type") %>% 
  arrange(user_id) %>% 
  inner_join(grade_usa, by = c("grade_id"= "id")) %>% 
  mutate(grade=case_when(climb_type == 0 ~ usa_routes, climb_type == 1 ~ usa_boulders)) %>% 
  select(-usa_routes, -usa_boulders)

We will then plot the distribution of ages. This shows that the average age is around 26 and 27 for when a climber achieves their max grade climb.

#zoomed 
ggplot(peak_age, aes(x=age)) + geom_density() +  geom_vline(aes(xintercept = mean(age)),
             linetype = "dashed", size = 0.6,
             color = "#FC4E07") + ggtitle("Distribution of Peak Age") 

mean(peak_age$age)
## [1] 26.66706

Another interesting view is to see if there is an average age for each grade accomplished.

avg_age <- age_users %>% 
  group_by(grade_id) %>% 
  summarise(avg_age = mean(age), climb_type = first(climb_type)) %>% 
  inner_join(grade_usa, by = c("grade_id" = "id")) %>% 
  filter(climb_type != 1) %>% 
  select(-usa_boulders, -climb_type)

avg_grade_age <- age_users %>% 
  group_by(age) %>% 
  summarise(avg_grade = integer(mean(grade_id)), climb_type = first(climb_type)) %>% 
  inner_join(grade_usa, by = c("avg_grade" = "id")) %>% 
  filter(climb_type != 1) %>% 
  select(-usa_boulders, -climb_type)
## `summarise()` has grouped output by 'age'. You can override using the `.groups` argument.
ggplot(avg_age, aes(x=factor(usa_routes, levels=sport_key), y=avg_age)) + geom_point() + theme_minimal() + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) + xlab("Grade") + ylab("Age")

Progression over time

Now we are going to look two different aspects of progression over time. The first one will be how we progress since we start climbing. So, this would be useful for someone to understand how long they should be climbing until they can expect to reach a grade of a 5.12a. Another aspect we will investigate is the progression from grade to grade. This is helpful to differentiate between a plateu in performance or to understand that certain grades can be harder to move between. For instance, it might be easier to go from a V1 to V2, but not a V8 to a V9.

We first need to find when a person first reached a grade. This can be done by ordering the grade_id by the date for each user in descending order. This means for each user, each grade category will have the first ascent at the top of the table. We can then do a distinct to find the unique values and remove the repeating entries from the other dates.

# find the first time someone reached a grade 
first_grade <- sport_ascents %>% 
  select(id, user_id, grade_id, grade, date) %>% 
  arrange(user_id, grade_id, date) %>% 
  mutate(date_readable = date(as_datetime(date))) %>% 
  distinct(user_id, grade, .keep_all = TRUE)

Here we are just using the first_grade table above to join with the user data. Then we calculate the time it takes to accomplish the climb by converting the date into a Date class and taking the lubridate interval.

# dictionary of user and their start date 
started <- user_trend %>% 
  select(id, started) %>% 
  filter(started != 0 & !is.na(started) & !is_empty(started)) %>% 
  mutate(started_date =ymd(started, truncated = 2L)) %>% 
  right_join(first_grade, by=c("id" =  "user_id")) %>% 
  mutate(time_to_first_grade_years = interval(started_date, date_readable) %/% years(1)) %>% 
  select(-id.y)
head(started)
## # A tibble: 6 x 8
##      id started started_date grade_id grade      date date_readable
##   <int>   <int> <date>          <int> <chr>     <int> <date>       
## 1     1    1996 1996-01-01         36 5.10a 918342000 1999-02-06   
## 2     1    1996 1996-01-01         38 5.10b 914022000 1998-12-18   
## 3     1    1996 1996-01-01         40 5.10c 913503600 1998-12-12   
## 4     1    1996 1996-01-01         42 5.10d 914022000 1998-12-18   
## 5     1    1996 1996-01-01         44 5.11a 916614000 1999-01-17   
## 6     1    1996 1996-01-01         46 5.11b 916527600 1999-01-16   
## # ... with 1 more variable: time_to_first_grade_years <dbl>

Then we take the summary to plot the average time it takes for each grade.

started_avg_years <- started %>% 
  select(grade, time_to_first_grade_years) %>% arrange(grade) %>% 
  dplyr::group_by(grade) %>% 
  dplyr::summarise(avg_time = mean(time_to_first_grade_years, na.rm=TRUE), 
                   min=min(time_to_first_grade_years, na.rm = TRUE), 
                   max=max(time_to_first_grade_years, na.rm = TRUE), 
                   sd=sd(time_to_first_grade_years, na.rm = TRUE)) 

ggplot(started_avg_years, aes(x=factor(grade, levels=sport_key), y=avg_time)) + geom_point() + theme_minimal() + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) + ylab("Time in Years") + xlab("Sport Grades")

This plot shows that the progression over the total time of climbing has an exponetial curve. This is expected since the time is cummulative. So it makes sense that it will take longer to reach a 5.13 than to reach a 5.10. It also show that there seems to be a bias of the data showing 6 year entry for most beginner grades. This might be due to the fact that more entry grades were not logged or that the website might have not been active for people in the more beginning phases of their climbing careers.

Now for the progression from grade to grade is a bit tricker. We first need to calculate the first ascent of each grade for each user. We do this by pivoting the table wider to have only the date values show for each grade.

# for sport only 
first_time_ascent <- started %>% 
  select(id, grade, date) %>% 
  pivot_wider(names_from = grade, values_from = date)

head(first_time_ascent)
## # A tibble: 6 x 34
##      id  `5.10a` `5.10b` `5.10c` `5.10d` `5.11a` `5.11b` `5.11d` `5.12a` `5.12b`
##   <int>    <int>   <int>   <int>   <int>   <int>   <int>   <int>   <int>   <int>
## 1     1   9.18e8  9.14e8  9.14e8  9.14e8  9.17e8  9.17e8  9.14e8  9.17e8  9.22e8
## 2     3  NA      NA      NA      NA      NA      NA       8.07e8  8.31e8  8.36e8
## 3     4   1.17e9  1.16e9 NA      NA       1.13e9  1.14e9  1.13e9  1.06e9  1.06e9
## 4    10   1.06e9  1.18e9  8.36e8  6.94e8  8.99e8  1.13e9  7.89e8  8.20e8  7.26e8
## 5    15  NA      NA      NA      NA      NA       1.09e9  1.08e9  1.08e9  1.08e9
## 6    16  NA      NA       9.79e8  9.78e8  9.80e8  9.99e8  8.60e8  9.26e8  8.89e8
## # ... with 24 more variables: 5.12c <int>, 5.12d <int>, 5.13a <int>,
## #   5.13b <int>, 5.13c <int>, 5.1 <int>, 5.3 <int>, 5.4 <int>, 5.5 <int>,
## #   5.6 <int>, 5.7 <int>, 5.8 <int>, 5.9 <int>, 5.13d <int>, 5.14a <int>,
## #   5.14b <int>, 5.14c <int>, 3/4 <int>, 5.14d <int>, 5.15a <int>,
## #   5.14b/c <int>, 5.15b <int>, 5.15c <int>, 5.14c/d <int>

Then we separately create a table for the grades and their corresponding progression.

boulder_grades <- grade_usa %>%
    transmute(grade=usa_boulders) %>%
    unique() %>%
    filter(grade != "") %>%
    mutate(next_grade = lead(grade), progression=paste(grade, "-", next_grade))
head(boulder_grades)
## # A tibble: 6 x 3
##   grade next_grade progression
##   <chr> <chr>      <chr>      
## 1 VB    V0-        VB - V0-   
## 2 V0-   V0         V0- - V0   
## 3 V0    V1         V0 - V1    
## 4 V1    V2         V1 - V2    
## 5 V2    V3         V2 - V3    
## 6 V3    V3/4       V3 - V3/4
sport_grades <- grade_usa %>%
    transmute(grade=usa_routes) %>%
    unique() %>%
    filter(grade != "") %>%
    mutate(next_grade = lead(grade), progression=paste(grade, "-", next_grade))
head(sport_grades)
## # A tibble: 6 x 3
##   grade next_grade progression
##   <chr> <chr>      <chr>      
## 1 3/4   5.1        3/4 - 5.1  
## 2 5.1   5.2        5.1 - 5.2  
## 3 5.2   5.3        5.2 - 5.3  
## 4 5.3   5.4        5.3 - 5.4  
## 5 5.4   5.5        5.4 - 5.5  
## 6 5.5   5.6        5.5 - 5.6

Now we can use both of these tables to find the time interval between each grade. Since there are times that the next grade might not exist, we have to create a function that can do error handling. This function takes the first_time_ascents table with the dates and the grades from the sport_grades progression tables to calulate the time interval.

find_diff_time <- function(df, next_grade, grade)
{
  tryCatch(
    {
      new_df <- df[next_grade] - df[grade]
      new_df <- new_df %>% 
        setNames("time_diff") %>% 
        mutate(grade_progression = paste(grade,  " - ", next_grade)) %>% 
        filter(!is.na(time_diff) & time_diff > 0) %>% 
        filter(time_diff < quantile(time_diff, 0.75))
      return (new_df)
    }, 
    error = function(e) {
      return (data.frame(grade_progression = paste(grade,  " - ", next_grade), time_diff = NA))
    }
  )
  
}

This takes the time intervals found (in months) and summarizes the average time per grade progression.

sport_progression <- sport_grades %>%  
  rowwise() %>% 
  do(find_diff_time(first_time_ascent, .$next_grade, .$grade)) %>% 
  filter(!is.na(time_diff)) %>% 
  mutate(time_diff_months = time_diff/(3600*24*30))

summary_sport_progression <- sport_progression %>% 
  dplyr::group_by(grade_progression) %>% 
  dplyr::summarise(avg_time = mean(time_diff_months, na.rm=TRUE), 
                   sd=sd(time_diff_months, na.rm = TRUE)) %>% 
  filter(avg_time < quantile(avg_time, 0.90))

Below is the plot of the average time per grade progression

grade_progression_key <- unique(sport_progression$grade_progression)

ggplot(summary_sport_progression, aes(x=factor(grade_progression, levels=grade_progression_key), y=avg_time)) + theme_minimal() + geom_point() + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) + ylab("Time Intervals (Months)") + xlab("Sport Grade Progressions")

This is very interesting because it shows an exponential relationship between the time progression and grades. This means as the grades get more difficult, it can be more difficult to progress to the next grade.