Rating models give us a good comparison of the relative strength of different climbers. However, climbers may have different skills or specialties (e.g. more powerful, more technical, more flexible), so they may excel at particular types of boulders. To identify these specialties, we need to build a multi-dimensional representation of each climber (also known as a vector embedding).

Unfortunately, we don’t have any information about the attributes of each boulder (e.g. we don’t know that a given boulder is very technical). Therefore, we will have to learn these multi-dimensional representations automatically from the data. One way to do this is with probabilistic matrix factorization (PMF), a technique commonly used for recommender systems.
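To make the idea concrete, here is a minimal sketch of PMF in base R on simulated data. It assumes the classic Gaussian setup, where the MAP estimate minimizes squared error on the observed entries plus an L2 penalty on the factors. The matrix sizes, the function name `pmf_fit`, and the tuning values are all illustrative; real attempt or success outcomes would likely need a different likelihood.

```r
# Minimal PMF sketch on simulated data (all names and sizes are illustrative)
set.seed(1)
n_climbers <- 30; n_problems <- 40; k <- 2

# Simulate latent factors and a noisy outcome matrix, then hide about 30% of the entries
U_true <- matrix(rnorm(n_climbers * k), n_climbers, k)
V_true <- matrix(rnorm(n_problems * k), n_problems, k)
noise <- matrix(rnorm(n_climbers * n_problems, sd = 0.1), n_climbers, n_problems)
R <- U_true %*% t(V_true) + noise
observed <- matrix(runif(n_climbers * n_problems) < 0.7, n_climbers, n_problems)

# MAP estimate by gradient descent: squared error on observed entries plus an
# L2 penalty (which corresponds to zero-mean Gaussian priors on U and V)
pmf_fit <- function(R, observed, k, lambda = 0.1, lr = 0.005, n_iter = 3000) {
  U <- matrix(rnorm(nrow(R) * k, sd = 0.1), nrow(R), k)
  V <- matrix(rnorm(ncol(R) * k, sd = 0.1), ncol(R), k)
  for (iter in seq_len(n_iter)) {
    E <- (R - U %*% t(V)) * observed   # residuals, zeroed where unobserved
    grad_U <- -E %*% V + lambda * U
    grad_V <- -t(E) %*% U + lambda * V
    U <- U - lr * grad_U
    V <- V - lr * grad_V
  }
  list(U = U, V = V)
}

fit <- pmf_fit(R, observed, k = k)
pred <- fit$U %*% t(fit$V)

# Error on the hidden entries, next to a predict-the-mean baseline
rmse_heldout <- sqrt(mean((R - pred)[!observed]^2))
rmse_baseline <- sqrt(mean((R[!observed] - mean(R[observed]))^2))
```

Each row of `fit$U` is a climber embedding and each row of `fit$V` is a problem embedding. The embeddings are only identified up to rotation, so individual dimensions should not be read as named skills without further checks.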

The goal is to learn representations with probabilistic matrix factorization and show that (A) these representations are more predictive than a benchmark unidimensional rating model, and (B) the multidimensional embeddings capture some aspect of climber specialties (we need some domain knowledge to validate this). For the benchmark model, I suggest a generalized linear model with climber coefficients.
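As a rough illustration of what the benchmark comparison could look like, the sketch below fits a GLM with one coefficient per climber to simulated data and scores it by mean held-out log-likelihood. The logistic link, the binary `topped` outcome, and the simulated ratings are assumptions made for this example only.

```r
# Simulated data: 10 climbers with a one-dimensional "strength" each (made up)
set.seed(7)
n <- 2000
climber <- factor(sample(paste0("climber_", 1:10), n, replace = TRUE))
strength <- rnorm(10)
topped <- rbinom(n, 1, plogis(strength[as.integer(climber)]))
d <- data.frame(climber, topped)

# Hold out every fifth row; every climber still has plenty of training rows
is_test <- seq_len(n) %% 5 == 0
fit <- glm(topped ~ climber, family = binomial, data = d[!is_test, ])

# Mean held-out log-likelihood: the number a multidimensional model would need to beat
p <- predict(fit, newdata = d[is_test, ], type = "response")
y <- d$topped[is_test]
test_loglik <- mean(y * log(p) + (1 - y) * log(1 - p))

# Reference point: a model that ignores the climber entirely
p0 <- mean(d$topped[!is_test])
null_loglik <- mean(y * log(p0) + (1 - y) * log(1 - p0))
```

Scoring every model on the same held-out rows with the same log-likelihood keeps the comparison in (A) like for like.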

(Possibly) Relevant Literature

Data Preparation

library(openxlsx)   # read.xlsx
library(tidyverse)  # tibble, dplyr, tidyr, stringr
url = "https://github.com/DavidBreuer/ifsc-analysis/raw/main/ifsc_Boulder.xlsx"
df = read.xlsx(url, sheet = 1)

str(df)

data_clean = tibble(df) |>
  # Get rid of columns we don't need
  select(-c(Unique, Discipline, Number, Group), -matches("Route|Run")) |>
  # Capitalize climber names consistently
  mutate(Name = str_to_title(Name)) |>
  # "Unpivot" so it's one row per climber-problem
  # I'm treating tops and zones as separate problems even though there's obviously a correlation
  pivot_longer(
    Top1:Zone5,
    names_to = "problem",
    values_to = "attempts",
    values_drop_na = TRUE,
  ) |>
  mutate(attempts = as.integer(attempts)) |>
  # Keep only the men's results for now
  filter(Gender == "M") |>
  # Only keep problems that at least one climber completed (topped, or reached the zone)
  filter(any(is.finite(attempts)), .by = c(Competition, Level, problem)) |>
  # Failures are assumed to be NA at this point (values as.integer() could not parse;
  # cells that were already NA were dropped by values_drop_na above). Censor them at
  # the maximum number of attempts observed on that problem.
  mutate(max_attempts = max(attempts, na.rm = TRUE), .by = c(Competition, Level, problem)) |>
  # Survival model features
  mutate(
    status = !is.na(attempts), # TRUE if they succeeded
    time = ifelse(is.na(attempts), max_attempts, attempts)
  ) |>
  # Keep climber name only for climbers with lots of data, use "Other" as replacement level
  mutate(climber = ifelse(n() >= 1000, Name, "Other"), .by = Name) |>
  mutate(climber = relevel(factor(climber), "Other"))

Basic Survival Models

library(survival)   # Surv, survfit, coxph
library(ggfortify)  # autoplot method for survfit objects
km_fit = survfit(Surv(time, event = status) ~ 1, data = data_clean)
autoplot(km_fit)

km_fit = survfit(Surv(time, event = status) ~ climber, data = data_clean)
autoplot(km_fit)

cox_fit = coxph(Surv(time, event = status) ~ Level + climber, data = data_clean)
summary(cox_fit)

Regularized Survival Models

We may eventually want to add regularization, especially for the climber coefficients (for example, to shrink the estimates for climbers with few observations more strongly). In R we can do this with the glmnet package; see the Cox vignette at https://glmnet.stanford.edu/articles/Coxnet.html. I’m less familiar with generalized linear models in Python, but there should be a way to do it (sklearn itself has no Cox model, but lifelines and scikit-survival do).
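One possible way to penalize some climber coefficients more than others is glmnet’s `penalty.factor` argument, which sets a relative penalty per column. The sketch below runs on simulated data; the weighting scheme (penalty growing as a climber’s row count shrinks) is an assumption chosen for illustration, not a recommendation.

```r
library(glmnet)
library(survival)

# Simulated stand-in for data_clean (all values here are made up)
set.seed(1)
n <- 600
climber <- factor(sample(c("Other", "A", "B", "C"), n, replace = TRUE,
                         prob = c(0.5, 0.3, 0.15, 0.05)))
climber <- relevel(climber, "Other")
time <- rpois(n, lambda = 2) + 1     # number of attempts, always >= 1
status <- rbinom(n, 1, 0.7) == 1     # TRUE if the climber succeeded

X <- model.matrix(~ climber)[, -1]   # drop the intercept column
y <- Surv(time, event = status)

# Assumed weighting scheme: the fewer rows a climber has, the larger the penalty
n_rows <- colSums(X)
pf <- sqrt(max(n_rows) / n_rows)

fit <- cv.glmnet(X, y, family = "cox", alpha = 0.5, penalty.factor = pf)
coef(fit, s = "lambda.min")
```

glmnet rescales the penalty factors internally, so only their relative sizes matter. Even with equal penalty factors, coefficients backed by little data tend to be shrunk more, because the likelihood pulls on them less strongly.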

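library(glmnet)
# Drop the intercept column: the Cox model has no intercept term
X = model.matrix(~ climber + Level, data = data_clean)[, -1]
y = Surv(data_clean$time, event = data_clean$status)
# alpha = 0.5 is an elastic net, halfway between ridge (0) and lasso (1)
fit = cv.glmnet(X, y, family = "cox", alpha = 0.5)
plot(fit)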

coef(fit, s = "lambda.min")

Matrix Factorization Models

Train/Test Split

We need to ensure that every climber and every climb in the test set also has some data in the training set (otherwise we won’t be able to learn their embeddings). Ideally, we can then compare the test log-likelihood of PMF against that of the GLM survival model.
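A minimal sketch of such a split, in base R on toy data: rows are assigned to the test set at random, and any test row whose climber or problem is missing from training is moved back to training. The column names `climber` and `problem_id` and the helper `split_with_coverage` are hypothetical; in the real data, a problem would be identified by the combination of Competition, Level, and problem.

```r
# Toy stand-in for data_clean: one row per climber-problem (column names are assumptions)
set.seed(42)
toy <- data.frame(
  climber    = sample(paste0("climber_", 1:20), 500, replace = TRUE),
  problem_id = sample(paste0("problem_", 1:50), 500, replace = TRUE)
)

split_with_coverage <- function(data, test_frac = 0.2) {
  is_test <- runif(nrow(data)) < test_frac
  # A test row is usable only if its climber and its problem both occur in training
  covered <- data$climber %in% data$climber[!is_test] &
    data$problem_id %in% data$problem_id[!is_test]
  # Move uncovered rows back to training; training only grows, so one pass is enough
  is_test <- is_test & covered
  list(train = data[!is_test, ], test = data[is_test, ])
}

parts <- split_with_coverage(toy)
```

The realized test fraction can end up slightly below `test_frac`, since rows for rarely seen climbers or problems are pushed back into training.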

2020 has much less data because of COVID, so we need to be careful if we do anything with Year as a variable.

boulder2vec