Okay mate…the goal of this analysis is to understand which covariates should be included in the agreement analysis for exercise heart rate (HRex) between the standardized shuttle run and your bespoke ‘Y’ passing drill.
My main thoughts for the exploration are:
View the raw data and see how variables distribute: this will tell us a little about how they behave and if/how we should subsequently include them.
Understand the variability of the covariates: if a variable isn’t changing much over the observation period then it wouldn’t make sense to include it.
Determine how strong the covriates are associated to one another: we want to avoid multicollinearity.
Assess the strength of the relationship between covariates and drill HRex: a stronger association could suggest a greater influence (when we dont want it).
I won’t worry too much about the formatting of outputs (e.g., plot appearances, decimal places), because these can all be tweaked if we summarise a final covariate analysis (likely a supplementary file). But I’ll try get the code into a reproducible format so it can be re-used or edited, and I’ll use a minimum amount of tidying on plots to make them presentable for the report.
Below, I’ll walk through the full code as it appears logically and so it can be copied into a script if you want to play around. In the script saved in Dropbox, I’ve moved all functions to a separate script and called them into the main script using source(). It’s just a good way to keep things clean,tidy and managable.
library(tidyverse)
library(rstudioapi)
library(rmcorr)
library(corrplot) # corrplot()
library(lme4)
library(broom.mixed) # tidy()
library(sjPlot) # plot_model()
library(insight) # get_deviance()
library(ggrepel) # geom_text_repel()
library(ggpubr)
setwd(dirname(getActiveDocumentContext()$path))
source("1. Data Import & Wrangle.R")
str(submax_standrill_data)
## 'data.frame': 85 obs. of 25 variables:
## $ Date : POSIXct, format: "2021-12-13" "2021-12-13" ...
## $ Week : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Player.Name : chr "32 Daniel Alessi" "29 James Demetriou" "22 Connor Shadock" "25 Mitch Knight" ...
## $ n.observe : num 7 4 5 7 4 7 6 4 6 5 ...
## $ total.dis : num 454 408 481 497 470 442 448 497 455 389 ...
## $ mean.hrex : num 150 167 172 168 151 178 133 188 155 145 ...
## $ low.speed.run : num 243 278 193 287 296 218 255 205 255 194 ...
## $ high.speed.run : num 99 7 180 94 50 110 55 187 91 98 ...
## $ very.high.speed.run: num 0 0 0 0 0 0 0 0 0 0 ...
## $ acc.dis : num 213 218 238 212 250 204 218 236 227 181 ...
## $ player.load : num 52.8 40 63.2 54.9 58.5 ...
## $ smft.hrex : num 146 178 177 164 151 ...
## $ max.hr : num 179 191 213 213 204 208 197 204 192 189 ...
## $ Player.id : Factor w/ 16 levels "2","3","4","5",..: 10 7 1 4 3 8 6 11 5 12 ...
## $ Date_upd : chr "13/12/2021" "13/12/2021" "13/12/2021" "13/12/2021" ...
## $ Temp. : num 20 20 20 20 20 20 20 20 20 20 ...
## $ Humidity : num 64 64 64 64 64 64 64 64 64 64 ...
## $ Day.Week : chr "Mon" "Mon" "Mon" "Mon" ...
## $ smft.hrex.rel : num 81.7 92.9 83.1 77.1 73.9 ...
## $ mean.hrex.rel : num 83.8 87.4 80.8 78.9 74 ...
## $ total.dis.stan : num 0.0151 -0.2192 0.1527 0.2342 0.0966 ...
## $ low.speed.run.stan : num 0.21576 0.36463 0.00309 0.40292 0.4412 ...
## $ high.speed.run.stan: num 0.365 -0.145 0.814 0.337 0.093 ...
## $ acc.dis.stand : num -0.0094 0.0327 0.2012 -0.0178 0.3023 ...
## $ player.load.stand : num 0.0309 -0.3255 0.319 0.0891 0.1881 ...
str(env_condition_data)
## spc_tbl_ [7 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Date_upd: chr [1:7] "13/12/2021" "11/1/2022" "18/1/2022" "25/1/2022" ...
## $ Temp. : num [1:7] 20 24 21 23 27 22 24
## $ Humidity: num [1:7] 64 78 94 65 74 69 65
## $ Day.Week: chr [1:7] "Mon" "Tue" "Tue" "Tue" ...
## $ Date : POSIXct[1:7], format: "2021-12-13" "2022-01-11" ...
## - attr(*, "spec")=
## .. cols(
## .. Date_upd = col_character(),
## .. Temp. = col_double(),
## .. Humidity = col_double(),
## .. Day.Week = col_character(),
## .. Date = col_datetime(format = "")
## .. )
## - attr(*, "problems")=<externalptr>
el_covs <- c("total.dis",
"low.speed.run",
"high.speed.run",
"very.high.speed.run",
"acc.dis",
"player.load")
env_covs <- c("Temp.", "Humidity")
all_covs <- c(el_covs, env_covs)
submax_standrill_data %>%
pivot_longer(cols = all_of(el_covs),
names_to = "covar",
values_to = "value") %>%
ggplot(data = .,
aes(y = value)) +
geom_boxplot(width = 0.01,
alpha = 0.5,
position = position_nudge(x = -0.01)) +
geom_density(aes(fill = Player.id),
adjust = 1,
alpha = 0.5,
linetype = "blank") +
theme_minimal() +
theme(axis.text.x = element_blank(),
axis.title.x = element_blank()) +
facet_wrap(facets = vars(covar),
scales = "free")
Most of the external training load variables seem to have a reasonable spread. There’s a decent bit of variation which, for the most part, looks normally distributed. The obvious exception is very high-seed running – most of the values were 0. We’ll check this in the variability analysis but looks like it might not be useful and could cause issues. We should also keep an eye on high-speed running (possibly similar issues) and there could be a few high extremes for total distance and acc distance? Legit or errors?
Let’s take a look at the environmental data. Slightly different plot set up because the data are collected at the level of the group, not the individual
submax_standrill_data %>%
pivot_longer(cols = all_of(env_covs),
names_to = "covar",
values_to = "value") %>%
ggplot(data = .,
aes(y = value)) +
geom_boxplot(width = 0.01,
alpha = 0.5,
position = position_nudge(x = -0.01)) +
geom_density(adjust = 1,
alpha = 0.5,
fill = "blue",
linetype = "blank") +
theme_minimal() +
theme(axis.text.x = element_blank(),
axis.title.x = element_blank()) +
facet_wrap(facets = vars(covar),
scales = "free")
Again, no obvious issues here, less the fact we’ll need to run different models on these outcomes data.
For the external training load data (and also drill HRex), we can use a linear mixed effects model with a random intercept to estimate thew residual SD (and it’s CI), which is conceptually identical to the SEM in a test-retest trial.
For the environmental data, we can use the SEE from a linear model, which is again equivalent to the SEM.
I’ll also pull the mean of each variable and use it to roughly estimate the CV (SEM as a %). The right way is log transforming, but that wont work if the data has zero values (which a few of our variables do).
It’d be fairly inefficient to copy the code for every variable so time for a few functions! I’ll set the only argument as ‘covar’, which will be replaced by a given covariate when we run the lists through the fucntions.
First, for the external training loads. The function to get the mean is straightforward. To get the residual SD, I’ll run a random intercept model then use broom::tidy() to extract the random effects. The last function is similar to the previous, but instead of extracting the random effects I’ll make a plot of each players random intercept along with their raw data. This will be a good visual to inspect the within-player variability.
# el mean
el_mean_fun <- function(covar) {
submax_standrill_data %>%
summarise(mean(.data[[covar]])) %>%
as.numeric
}
# el variability
el_var_fun <- function(covar) {
# random intercept model
var_mod <- lmer(formula = paste0(covar," ~ 1 + (1 | Player.id)"),
data = submax_standrill_data)
# extract tidy random effects
tidy(var_mod,
effects = "ran_pars",
conf.method = "profile",
conf.int = TRUE,
conf.level = 0.90)
}
# el plots
el_var_plot_fun <- function(covar) {
# random intercept model
var_mod <- lmer(formula = paste0(covar," ~ 1 + (1 | Player.id)"),
data = submax_standrill_data)
# get random intercepts
intercepts <- tidy(var_mod,
effects = "ran_coefs") %>%
rename(Player.id = level,
int.est = estimate) %>%
select(Player.id,
int.est)
# plot
submax_standrill_data %>%
left_join(y = intercepts,
by = "Player.id") %>%
ggplot(data = .,
aes(x = reorder(x = Player.id,
X = int.est),
y = .data[[covar]],
colour = Player.id)) +
geom_point(alpha = 0.5,
stroke = NA,
size = 2) +
geom_point(aes(y = int.est,
colour = Player.id),
shape = 15,
size = 3) +
labs(x = "Player ID") +
theme_minimal() +
theme(legend.position = "none")
}
Next up the environmental data. The mean is the same as previous and for the plot we just need to run a geom_point() with geom_smooth(method = “lm”). For the variability, I’ve written a custom function to get the SEE from the lm() and calculate is CI before putting it into a tibble. I got the inspiration from get_sigma() in the ‘insight’ package, but it gives the values in a weird format that I couldn’t get into a tibble.
# env mean
env_mean_fun <- function(covar) {
env_condition_data %>%
summarise(mean(.data[[covar]])) %>%
as.numeric
}
# env variability
env_var_fun <- function(covar) {
# linear model
env_var_mod <- lm(formula = paste0(covar," ~ Date"),
data = env_condition_data)
# extract tidy random effects
sem_ci <- function(model, ci) {
alpha <- 1 - ci
dev <- insight::get_deviance(model)
n <- n_obs(model)
low <- dev / stats::qchisq(1 - alpha / 2, n)
high <- dev / stats::qchisq(alpha / 2, n)
tibble(sem = sigma(model),
sem.lcl = sqrt(low),
sem.ucl = sqrt(high))
}
sem_ci(model = env_var_mod,
ci = 0.9)
}
# env plotss
env_var_plot_fun <- function(covar) {
ggplot(data = env_condition_data,
aes(x = Date,
y = .data[[covar]])) +
geom_point() +
geom_smooth(method = "lm") +
theme_minimal()
}
Now to put them into action. I’ll create a tibble with column 1 as the covariate and drill HRex, then use use map() to loop each variable through the functions before mutating the outputs onto the relevant rows. They’ll append as nested tibbles or plots, so a little bit of tidying up to do before creating the CVs. We’ll need to do all this separately for the environmental data which can then be bound together with external training load.
# el summary tibble
el_var <-
# convert the covariate vector to a tibble and add drill HRex
tibble("covar" = el_covs) %>%
add_row(covar = "mean.hrex.rel") %>%
# loop covars through functions
mutate(mean = map_dbl(.x = covar,
.f = el_mean_fun),
var_mod = map(.x = covar,
.f = el_var_fun),
var_plot = map(.x = covar,
.f = el_var_plot_fun)) %>%
unnest(cols = c(var_mod)) %>%
# get rid of redundant cols and rows
filter(group == "Residual") %>%
select(covar,
mean,
estimate,
conf.low,
conf.high,
var_plot)
# calculate approx CVs & tidy names
el_var <- el_var %>%
rename(sem = estimate,
sem.lcl = conf.low,
sem.ucl = conf.high) %>%
mutate(cv = (sem/mean)*100,
cv.lcl = (sem.lcl/mean)*100,
cv.ucl = (sem.ucl/mean)*100)
# env summary tibble
env_var <-
tibble("covar" = env_covs) %>%
mutate(mean = map_dbl(.x = covar,
.f = env_mean_fun),
env_var_mod = map(.x = covar,
.f = env_var_fun),
var_plot = map(.x = covar,
.f = env_var_plot_fun)) %>%
unnest(cols = c(env_var_mod)) %>%
mutate(cv = (sem/mean)*100,
cv.lcl = (sem.lcl/mean)*100,
cv.ucl = (sem.ucl/mean)*100)
# combine
variability <- rbind(el_var, env_var)
variability
## # A tibble: 9 × 9
## covar mean sem sem.lcl sem.ucl var_plot cv cv.lcl cv.ucl
## <chr> <dbl> <dbl> <dbl> <dbl> <list> <dbl> <dbl> <dbl>
## 1 total.dis 453. 39.4 34.5 45.5 <gg> 8.71 7.62 10.1
## 2 low.speed.run 192. 46.8 41.0 54.1 <gg> 24.4 21.3 28.1
## 3 high.speed.run 33.2 34.1 29.8 39.6 <gg> 103. 89.7 119.
## 4 very.high.speed.r… 0.341 1.72 1.52 1.95 <gg> 505. 445. 572.
## 5 acc.dis 214. 24.2 21.4 27.5 <gg> 11.3 9.95 12.8
## 6 player.load 51.7 6.10 5.33 7.09 <gg> 11.8 10.3 13.7
## 7 mean.hrex.rel 75.0 4.46 3.90 5.18 <gg> 5.95 5.20 6.90
## 8 Temp. 23 2.10 1.25 3.19 <gg> 9.14 5.45 13.9
## 9 Humidity 72.7 11.8 7.01 17.8 <gg> 16.2 9.63 24.5
So, for context, the within-player variability of drill HRex is about 4.5% points. We can see how this looks for each player in the random intercept plots:
variability$var_plot[7]
Lets have a look at the other variables as well as the data from the variability table up above:
ggarrange(plotlist = variability$var_plot)
The first standout, like the raw data, is VHSR. The CV is 500% (!!) and we can see that’s because most people didn’t clock up any meters other than the odd occasion. I wonder why they would have clocked up meters at these very high speeds if the drill was standardized? Anyway, VHSR is clearly not behaving the same as HRex or any other of the external load variables, and those zeros will cause problems in the analysis, so ill remove it from any further consideration.
Another issue is acceleration distance: everyone has the same intercept! That’s because the between-player variability in everyone’s mean was about 10m (<5%) so the model didn’t converge (i.e., it didn’t recognize the intercepts as having substantial enough variability to give everyone their own). That’s fine though, because we are estimating the within-player SD.
Lets take a closer look at the CVs:
# get rid of VHSRD and drill HRex
variability2 <- variability %>%
filter(covar != "very.high.speed.run",
covar != "mean.hrex.rel")
# make a forest potof the CVs
ggplot(data = variability2,
aes (x = cv,
y = reorder(x = covar,
X = cv),
colour = cv,
label = round(cv, digits = 1))) +
geom_linerange(mapping = aes(xmin = cv.lcl,
xmax = cv.ucl),
colour = "grey5",
linewidth = 1,
alpha = 0.3) +
geom_point(size = 4) +
geom_text_repel() +
scale_x_continuous(name = "CV (%)") +
scale_colour_viridis_c(direction = -1) +
theme_minimal() +
theme(axis.title.y = element_blank(),
legend.position = "none")
Most covariates have about a 10% variabiltiy but HSR seems to be far more erratic.Let’s keep it in for now and see how it compares to other covariates and drill HRex.
We’ll run a repeated measures correlation matrix between all the covariates then visualize it:
# revise covars
new_covs <- all_covs[ !all_covs == "very.high.speed.run"]
# create rmcorr matrix
rmcmat <-
rmcorr_mat(dataset = submax_standrill_data,
participant = Player.id,
variables = new_covs,
CI.level = 0.90)
# plot the matrix with coefficients
# clean names to fit
colnames(rmcmat$matrix) <- c("TD",
"LSR",
"HSR",
"ACC",
"PL",
"TEM",
"HUM")
# mixed correlation matrix
corrplot.mixed(rmcmat$matrix,
order = "AOE", # this re-orders the matrix by correlated vars
tl.col = "black")
Like we suspected, TD, LSR PL are highly correlated. Interestingly, ACC is also highly correlated to all 3. If we are picking 1 variable to retain, I think TD makes the most sense from these 3, but I wonder if other acceleration variables (count of accels/ decels >/< 2 m/s/s, acc density, acc load index, etc.) would have been different?
Another interesting finding is that humidity and temperature aren’t associated with each other (a normal linear model - which is better for these variables vs rmc - confirms this), but both have a negative association with HSR: when temperature and humidity were lower, players performed more high-speed running in the drill. Is this something you noticed/ observed, or does it seem strange?
Finally, lets look at how well each covariate correlates with drill HRex on it’s own:
# create functions
hr_rmc_fun <- function(covar) {
# run model
rmc <- rmcorr(participant = Player.id,
measure1 = covar,
measure2 = mean.hrex.rel,
dataset = submax_standrill_data,
CI.level = 0.90)
# make summary tibble
rmcs <- tibble(r = rmc$r,
lcl = rmc$CI[1],
ucl = rmc$CI[2]) %>%
round(digits = 2)
}
hr_rmc_plot_fun <- function(covar) {
# run model
rmc <- rmcorr(participant = Player.id,
measure1 = covar,
measure2 = mean.hrex.rel,
dataset = submax_standrill_data)
# make plot
ggplot(data = submax_standrill_data,
aes(x = .data[[covar]],
y = mean.hrex.rel,
group = Player.id,
colour = Player.id)) +
geom_point() +
geom_line(aes(y = rmc$model$fitted.values),
linetype = 1) +
theme_minimal()
}
# loop and create summary tibble
hr_correls <- tibble("covar" = new_covs) %>%
mutate(hr_rmc = map(.x = covar,
.f = hr_rmc_fun),
plot = map(.x = covar,
.f = hr_rmc_plot_fun)) %>%
unnest(cols = hr_rmc) %>%
filter(covar != "very.high.speed.run")
# plot
ggarrange(plotlist = hr_correls$plot,
legend = "none")
These plots are a little small - I can’t get ggarrage to play nice with Markdown :) - so i’ll summarise with a forest:
hr_correls %>%
filter(covar != "very.high.speed.run") %>%
ggplot(data = .,
aes (x = r,
y = reorder(x = covar,
X = r),
colour = r)) +
geom_linerange(mapping = aes(xmin = lcl,
xmax = ucl),
colour = "grey5",
linewidth = 1,
alpha = 0.3) +
geom_point(size = 4) +
scale_x_continuous(name = "r value",
limits = c(-1.0, 1.0)) +
scale_colour_viridis_c(direction = -1) +
theme_minimal() +
theme(axis.title.y = element_blank(),
legend.position = "none")
Looks like all the external load measures have a large relationship with drill HRex. Since we know that TD, LSR, PL and ACC all correlate strongly with one another, but not with HSR, we could probably retain TD and HSR for the final analysis.
Interesting story for the environmental factors. Their association with drill HRex is trivial. I guess the jury is out on these for now.
Really interested to hear your thoughts!