Explore Covariates

Overview

Okay mate…the goal of this analysis is to understand which covariates should be included in the agreement analysis for exercise heart rate (HRex) between the standardized shuttle run and your bespoke ‘Y’ passing drill.

My main thoughts for the exploration are:

View the raw data and see how variables distribute: this will tell us a little about how they behave and if/how we should subsequently include them.
Understand the variability of the covariates: if a variable isn’t changing much over the observation period then it wouldn’t make sense to include it.
Determine how strong the covriates are associated to one another: we want to avoid multicollinearity.
Assess the strength of the relationship between covariates and drill HRex: a stronger association could suggest a greater influence (when we dont want it).

I won’t worry too much about the formatting of outputs (e.g., plot appearances, decimal places), because these can all be tweaked if we summarise a final covariate analysis (likely a supplementary file). But I’ll try get the code into a reproducible format so it can be re-used or edited, and I’ll use a minimum amount of tidying on plots to make them presentable for the report.

Below, I’ll walk through the full code as it appears logically and so it can be copied into a script if you want to play around. In the script saved in Dropbox, I’ve moved all functions to a separate script and called them into the main script using source(). It’s just a good way to keep things clean,tidy and managable.

Load Packages

library(tidyverse)
library(rstudioapi)
library(rmcorr)
library(corrplot)       # corrplot()
library(lme4)
library(broom.mixed)    # tidy()
library(sjPlot)         # plot_model()
library(insight)        # get_deviance()
library(ggrepel)        # geom_text_repel()
library(ggpubr)

Get Data

setwd(dirname(getActiveDocumentContext()$path))

source("1. Data Import & Wrangle.R")

str(submax_standrill_data)

## 'data.frame':    85 obs. of  25 variables:
##  $ Date               : POSIXct, format: "2021-12-13" "2021-12-13" ...
##  $ Week               : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Player.Name        : chr  "32 Daniel Alessi" "29 James Demetriou" "22 Connor Shadock" "25 Mitch Knight" ...
##  $ n.observe          : num  7 4 5 7 4 7 6 4 6 5 ...
##  $ total.dis          : num  454 408 481 497 470 442 448 497 455 389 ...
##  $ mean.hrex          : num  150 167 172 168 151 178 133 188 155 145 ...
##  $ low.speed.run      : num  243 278 193 287 296 218 255 205 255 194 ...
##  $ high.speed.run     : num  99 7 180 94 50 110 55 187 91 98 ...
##  $ very.high.speed.run: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ acc.dis            : num  213 218 238 212 250 204 218 236 227 181 ...
##  $ player.load        : num  52.8 40 63.2 54.9 58.5 ...
##  $ smft.hrex          : num  146 178 177 164 151 ...
##  $ max.hr             : num  179 191 213 213 204 208 197 204 192 189 ...
##  $ Player.id          : Factor w/ 16 levels "2","3","4","5",..: 10 7 1 4 3 8 6 11 5 12 ...
##  $ Date_upd           : chr  "13/12/2021" "13/12/2021" "13/12/2021" "13/12/2021" ...
##  $ Temp.              : num  20 20 20 20 20 20 20 20 20 20 ...
##  $ Humidity           : num  64 64 64 64 64 64 64 64 64 64 ...
##  $ Day.Week           : chr  "Mon" "Mon" "Mon" "Mon" ...
##  $ smft.hrex.rel      : num  81.7 92.9 83.1 77.1 73.9 ...
##  $ mean.hrex.rel      : num  83.8 87.4 80.8 78.9 74 ...
##  $ total.dis.stan     : num  0.0151 -0.2192 0.1527 0.2342 0.0966 ...
##  $ low.speed.run.stan : num  0.21576 0.36463 0.00309 0.40292 0.4412 ...
##  $ high.speed.run.stan: num  0.365 -0.145 0.814 0.337 0.093 ...
##  $ acc.dis.stand      : num  -0.0094 0.0327 0.2012 -0.0178 0.3023 ...
##  $ player.load.stand  : num  0.0309 -0.3255 0.319 0.0891 0.1881 ...

str(env_condition_data)

## spc_tbl_ [7 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Date_upd: chr [1:7] "13/12/2021" "11/1/2022" "18/1/2022" "25/1/2022" ...
##  $ Temp.   : num [1:7] 20 24 21 23 27 22 24
##  $ Humidity: num [1:7] 64 78 94 65 74 69 65
##  $ Day.Week: chr [1:7] "Mon" "Tue" "Tue" "Tue" ...
##  $ Date    : POSIXct[1:7], format: "2021-12-13" "2022-01-11" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Date_upd = col_character(),
##   ..   Temp. = col_double(),
##   ..   Humidity = col_double(),
##   ..   Day.Week = col_character(),
##   ..   Date = col_datetime(format = "")
##   .. )
##  - attr(*, "problems")=<externalptr>

Make Covariate Lists

el_covs <-  c("total.dis", 
              "low.speed.run", 
              "high.speed.run", 
              "very.high.speed.run", 
              "acc.dis", 
              "player.load")
              
env_covs <-  c("Temp.", "Humidity")

all_covs <- c(el_covs, env_covs)

Plot Raw Data

submax_standrill_data %>% 
  pivot_longer(cols = all_of(el_covs), 
               names_to = "covar", 
               values_to = "value") %>% 
  ggplot(data = ., 
       aes(y = value)) + 
  geom_boxplot(width = 0.01, 
               alpha = 0.5, 
               position = position_nudge(x = -0.01)) + 
  geom_density(aes(fill = Player.id), 
               adjust = 1, 
               alpha = 0.5, 
               linetype = "blank") + 
  theme_minimal() +
  theme(axis.text.x = element_blank(), 
        axis.title.x = element_blank()) + 
  facet_wrap(facets = vars(covar), 
             scales = "free")

Most of the external training load variables seem to have a reasonable spread. There’s a decent bit of variation which, for the most part, looks normally distributed. The obvious exception is very high-seed running – most of the values were 0. We’ll check this in the variability analysis but looks like it might not be useful and could cause issues. We should also keep an eye on high-speed running (possibly similar issues) and there could be a few high extremes for total distance and acc distance? Legit or errors?

Let’s take a look at the environmental data. Slightly different plot set up because the data are collected at the level of the group, not the individual

submax_standrill_data %>% 
  pivot_longer(cols = all_of(env_covs), 
               names_to = "covar", 
               values_to = "value") %>% 
  ggplot(data = ., 
         aes(y = value)) + 
  geom_boxplot(width = 0.01, 
               alpha = 0.5, 
               position = position_nudge(x = -0.01)) + 
  geom_density(adjust = 1, 
               alpha = 0.5, 
               fill = "blue", 
               linetype = "blank") + 
  theme_minimal() +
  theme(axis.text.x = element_blank(), 
        axis.title.x = element_blank()) + 
  facet_wrap(facets = vars(covar), 
             scales = "free")

Again, no obvious issues here, less the fact we’ll need to run different models on these outcomes data.

Look at Variability

For the external training load data (and also drill HRex), we can use a linear mixed effects model with a random intercept to estimate thew residual SD (and it’s CI), which is conceptually identical to the SEM in a test-retest trial.

For the environmental data, we can use the SEE from a linear model, which is again equivalent to the SEM.

I’ll also pull the mean of each variable and use it to roughly estimate the CV (SEM as a %). The right way is log transforming, but that wont work if the data has zero values (which a few of our variables do).

It’d be fairly inefficient to copy the code for every variable so time for a few functions! I’ll set the only argument as ‘covar’, which will be replaced by a given covariate when we run the lists through the fucntions.

First, for the external training loads. The function to get the mean is straightforward. To get the residual SD, I’ll run a random intercept model then use broom::tidy() to extract the random effects. The last function is similar to the previous, but instead of extracting the random effects I’ll make a plot of each players random intercept along with their raw data. This will be a good visual to inspect the within-player variability.

# el mean
el_mean_fun <- function(covar) {
  
  submax_standrill_data %>% 
    summarise(mean(.data[[covar]])) %>%
    as.numeric
}

# el variability
el_var_fun <- function(covar) {
  
  # random intercept model
  var_mod <- lmer(formula = paste0(covar," ~ 1 + (1 | Player.id)"),
                  data = submax_standrill_data)
  
  # extract tidy random effects
  tidy(var_mod, 
       effects = "ran_pars", 
       conf.method = "profile", 
       conf.int = TRUE, 
       conf.level = 0.90)
}

# el plots
el_var_plot_fun <- function(covar) { 
  
  # random intercept model
  var_mod <- lmer(formula = paste0(covar," ~ 1 + (1 | Player.id)"),
                  data = submax_standrill_data)
  
  # get random intercepts
  intercepts <- tidy(var_mod, 
                     effects = "ran_coefs") %>% 
    rename(Player.id = level, 
           int.est = estimate) %>%
    select(Player.id, 
           int.est)
  
  # plot
  submax_standrill_data %>%
    left_join(y = intercepts, 
              by = "Player.id") %>% 
    ggplot(data = ., 
           aes(x = reorder(x = Player.id, 
                           X = int.est), 
               y = .data[[covar]], 
               colour = Player.id)) + 
    geom_point(alpha = 0.5, 
               stroke = NA, 
               size = 2) + 
    geom_point(aes(y = int.est, 
                   colour = Player.id), 
               shape = 15, 
               size = 3) + 
    labs(x = "Player ID") + 
    theme_minimal() + 
    theme(legend.position = "none")
}

Next up the environmental data. The mean is the same as previous and for the plot we just need to run a geom_point() with geom_smooth(method = “lm”). For the variability, I’ve written a custom function to get the SEE from the lm() and calculate is CI before putting it into a tibble. I got the inspiration from get_sigma() in the ‘insight’ package, but it gives the values in a weird format that I couldn’t get into a tibble.

# env mean
env_mean_fun <- function(covar) {
  
  env_condition_data %>% 
    summarise(mean(.data[[covar]])) %>%
    as.numeric
  }

# env variability
env_var_fun <- function(covar) {
  
  # linear model
  env_var_mod <- lm(formula = paste0(covar," ~ Date"), 
                    data = env_condition_data)
  
  # extract tidy random effects
  sem_ci <- function(model, ci) {
    
    alpha <- 1 - ci
    
    dev <- insight::get_deviance(model)
    
    n <- n_obs(model)
    
    low <- dev / stats::qchisq(1 - alpha / 2, n)
    high <- dev / stats::qchisq(alpha / 2, n)
    
    tibble(sem = sigma(model), 
           sem.lcl = sqrt(low), 
           sem.ucl = sqrt(high))
    }
  
  sem_ci(model = env_var_mod, 
         ci = 0.9)
  }

# env plotss
env_var_plot_fun <- function(covar) {
  
  ggplot(data = env_condition_data, 
         aes(x = Date, 
             y = .data[[covar]])) + 
    geom_point() + 
    geom_smooth(method = "lm") + 
    theme_minimal()
}

Now to put them into action. I’ll create a tibble with column 1 as the covariate and drill HRex, then use use map() to loop each variable through the functions before mutating the outputs onto the relevant rows. They’ll append as nested tibbles or plots, so a little bit of tidying up to do before creating the CVs. We’ll need to do all this separately for the environmental data which can then be bound together with external training load.

# el summary tibble
el_var <- 
  
  # convert the covariate vector to a tibble and add drill HRex
  tibble("covar" = el_covs) %>%
  add_row(covar = "mean.hrex.rel") %>%
  
  # loop covars through functions
  mutate(mean = map_dbl(.x = covar, 
                        .f = el_mean_fun), 
         var_mod = map(.x = covar, 
                       .f = el_var_fun), 
         var_plot = map(.x = covar, 
                        .f = el_var_plot_fun)) %>%
  unnest(cols = c(var_mod)) %>% 
  
  # get rid of redundant cols and rows
  filter(group == "Residual") %>%
  select(covar, 
         mean, 
         estimate, 
         conf.low, 
         conf.high, 
         var_plot) 

  # calculate approx CVs & tidy names
  el_var <- el_var %>% 
    rename(sem = estimate, 
           sem.lcl = conf.low, 
           sem.ucl = conf.high) %>% 
    mutate(cv = (sem/mean)*100, 
           cv.lcl = (sem.lcl/mean)*100, 
           cv.ucl = (sem.ucl/mean)*100) 

# env summary tibble
env_var <- 
  tibble("covar" = env_covs) %>%
  mutate(mean = map_dbl(.x = covar, 
                        .f = env_mean_fun), 
         env_var_mod = map(.x = covar, 
                           .f = env_var_fun), 
         var_plot = map(.x = covar, 
                        .f = env_var_plot_fun)) %>% 
  unnest(cols = c(env_var_mod)) %>% 
  mutate(cv = (sem/mean)*100, 
         cv.lcl = (sem.lcl/mean)*100, 
         cv.ucl = (sem.ucl/mean)*100) 

# combine
variability <- rbind(el_var, env_var)

variability

## # A tibble: 9 × 9
##   covar                 mean   sem sem.lcl sem.ucl var_plot     cv cv.lcl cv.ucl
##   <chr>                <dbl> <dbl>   <dbl>   <dbl> <list>    <dbl>  <dbl>  <dbl>
## 1 total.dis          453.    39.4    34.5    45.5  <gg>       8.71   7.62  10.1 
## 2 low.speed.run      192.    46.8    41.0    54.1  <gg>      24.4   21.3   28.1 
## 3 high.speed.run      33.2   34.1    29.8    39.6  <gg>     103.    89.7  119.  
## 4 very.high.speed.r…   0.341  1.72    1.52    1.95 <gg>     505.   445.   572.  
## 5 acc.dis            214.    24.2    21.4    27.5  <gg>      11.3    9.95  12.8 
## 6 player.load         51.7    6.10    5.33    7.09 <gg>      11.8   10.3   13.7 
## 7 mean.hrex.rel       75.0    4.46    3.90    5.18 <gg>       5.95   5.20   6.90
## 8 Temp.               23      2.10    1.25    3.19 <gg>       9.14   5.45  13.9 
## 9 Humidity            72.7   11.8     7.01   17.8  <gg>      16.2    9.63  24.5

So, for context, the within-player variability of drill HRex is about 4.5% points. We can see how this looks for each player in the random intercept plots:

variability$var_plot[7]

Lets have a look at the other variables as well as the data from the variability table up above:

ggarrange(plotlist = variability$var_plot)

The first standout, like the raw data, is VHSR. The CV is 500% (!!) and we can see that’s because most people didn’t clock up any meters other than the odd occasion. I wonder why they would have clocked up meters at these very high speeds if the drill was standardized? Anyway, VHSR is clearly not behaving the same as HRex or any other of the external load variables, and those zeros will cause problems in the analysis, so ill remove it from any further consideration.

Another issue is acceleration distance: everyone has the same intercept! That’s because the between-player variability in everyone’s mean was about 10m (<5%) so the model didn’t converge (i.e., it didn’t recognize the intercepts as having substantial enough variability to give everyone their own). That’s fine though, because we are estimating the within-player SD.

Lets take a closer look at the CVs:

# get rid of VHSRD and drill HRex
variability2 <- variability %>% 
  filter(covar != "very.high.speed.run", 
         covar != "mean.hrex.rel") 

# make a forest potof the CVs
ggplot(data = variability2, 
       aes (x = cv, 
            y = reorder(x = covar, 
                        X = cv), 
            colour = cv, 
            label = round(cv, digits = 1))) + 
geom_linerange(mapping = aes(xmin = cv.lcl, 
                             xmax = cv.ucl), 
               colour = "grey5", 
               linewidth = 1, 
               alpha = 0.3) + 
geom_point(size = 4) + 
geom_text_repel() + 
scale_x_continuous(name = "CV (%)") + 
scale_colour_viridis_c(direction = -1) + 
theme_minimal() + 
theme(axis.title.y = element_blank(), 
      legend.position = "none")

Most covariates have about a 10% variabiltiy but HSR seems to be far more erratic.Let’s keep it in for now and see how it compares to other covariates and drill HRex.

Relationship Between Covariates

We’ll run a repeated measures correlation matrix between all the covariates then visualize it:

# revise covars
new_covs <- all_covs[ !all_covs == "very.high.speed.run"]

# create rmcorr matrix
rmcmat <- 
  rmcorr_mat(dataset = submax_standrill_data, 
             participant = Player.id, 
             variables = new_covs, 
             CI.level = 0.90)

# plot the matrix with coefficients

  # clean names to fit
  colnames(rmcmat$matrix) <- c("TD", 
                               "LSR", 
                               "HSR", 
                               "ACC", 
                               "PL", 
                               "TEM", 
                               "HUM")
  # mixed correlation matrix
  corrplot.mixed(rmcmat$matrix, 
                 order = "AOE", # this re-orders the matrix by correlated vars
                 tl.col = "black")

Like we suspected, TD, LSR PL are highly correlated. Interestingly, ACC is also highly correlated to all 3. If we are picking 1 variable to retain, I think TD makes the most sense from these 3, but I wonder if other acceleration variables (count of accels/ decels >/< 2 m/s/s, acc density, acc load index, etc.) would have been different?

Another interesting finding is that humidity and temperature aren’t associated with each other (a normal linear model - which is better for these variables vs rmc - confirms this), but both have a negative association with HSR: when temperature and humidity were lower, players performed more high-speed running in the drill. Is this something you noticed/ observed, or does it seem strange?

Relationship Between Covariates and Drill HRex

Finally, lets look at how well each covariate correlates with drill HRex on it’s own:

# create functions
hr_rmc_fun <- function(covar) {
  
  # run model
  rmc <- rmcorr(participant = Player.id, 
                measure1 = covar, 
                measure2 = mean.hrex.rel, 
                dataset = submax_standrill_data, 
                CI.level = 0.90)
  
  # make summary tibble
  rmcs <- tibble(r = rmc$r, 
                 lcl = rmc$CI[1], 
                 ucl = rmc$CI[2]) %>%
    round(digits = 2)
}

hr_rmc_plot_fun <- function(covar) {

  # run model
  rmc <- rmcorr(participant = Player.id, 
                measure1 = covar, 
                measure2 = mean.hrex.rel, 
                dataset = submax_standrill_data)
  
  # make plot
  ggplot(data = submax_standrill_data, 
                 aes(x = .data[[covar]], 
                     y = mean.hrex.rel, 
                     group = Player.id, 
                     colour = Player.id)) + 
    geom_point() + 
    geom_line(aes(y = rmc$model$fitted.values), 
              linetype = 1) + 
    theme_minimal()
}
  
# loop and create summary tibble 
hr_correls <- tibble("covar" = new_covs) %>% 
  mutate(hr_rmc = map(.x = covar, 
                      .f = hr_rmc_fun), 
         plot = map(.x = covar, 
                    .f = hr_rmc_plot_fun)) %>%
  unnest(cols = hr_rmc) %>% 
  filter(covar != "very.high.speed.run")

# plot
ggarrange(plotlist = hr_correls$plot, 
          legend = "none")

These plots are a little small - I can’t get ggarrage to play nice with Markdown :) - so i’ll summarise with a forest:

hr_correls %>% 
  filter(covar != "very.high.speed.run") %>% 
  ggplot(data = ., 
       aes (x = r, 
            y = reorder(x = covar, 
                        X = r), 
            colour = r)) + 
  geom_linerange(mapping = aes(xmin = lcl, 
                               xmax = ucl), 
                 colour = "grey5", 
                 linewidth = 1, 
                 alpha = 0.3) + 
  geom_point(size = 4) + 
  scale_x_continuous(name = "r value", 
                     limits = c(-1.0, 1.0)) + 
  scale_colour_viridis_c(direction = -1) + 
  theme_minimal() + 
  theme(axis.title.y = element_blank(), 
        legend.position = "none")

Looks like all the external load measures have a large relationship with drill HRex. Since we know that TD, LSR, PL and ACC all correlate strongly with one another, but not with HSR, we could probably retain TD and HSR for the final analysis.

Interesting story for the environmental factors. Their association with drill HRex is trivial. I guess the jury is out on these for now.

Really interested to hear your thoughts!