Introduction


In the early days of my cycling career I only monitored speed. The distance was determined with a curvimeter and your watch indicated the time. Everything was neatly written down on paper. Then the analog bicycle speedometer appeared. And much later the heart rate monitor followed by the wattage meter, a big step forward in determining your cycling intensity. Combining heart rate and power is a way to see what your body is doing (pedaling power) and how your body responds to it (heart rate).

A linear relation between power output and heart rate was observed.2 During aerobic exercise, an increase in power must be compensated by an increase in oxygen consumption, resulting in an increase in heart rate. Both phenomena are regarded as linear, causal relationships. From this it has been deduced that the relationship between heart rate and power output during physical effort can also be considered linear.

This writing is inspired by and based on publications by David Johnstone3, Marek Śmigielski4 and an article in Critical Powers5. We put things together to create a potentially useful model for two basic types of rides I cycle on the home trainer: 1. a time trial of any type, flat, hilly or heavily uphill; 2. a ride with a number of short intervals, for example Strava segments or short, vicious climbs. This article is about a ride of the second type of ride. This concerns strictly personal cycling data seen through the eyes of a data analyst. The intention is not to become a better, data-driven cyclist, but to - in addition to cycling - also enjoy beautiful “R”-generated graphs and models based on the ever-expanding collection of cycling data.
And perhaps an interesting numerical spin-off will reveal itself.


A sample virtual ride


The Rouvy ride “Nukerke to Oudenaarde” serves as an example. It is a part of the Tour of Flanders, one of the UCI spring classics and the most important cycling race in Flanders. During the ride we are presented with five well-known climbs: the Koppenberg, Taaienberg, Kruisberg, Oude Kwaremont, and finally the Paterberg.
The next R code reads the tcx file.

version[['version.string']]
## [1] "R version 4.3.2 (2023-10-31 ucrt)"
library(gridExtra)       # Plots aggregator
library(TTR)             # Exponential Moving Average (EMA) en Moving Average (SMA)
library(xml2)
library(XML)
library(hms)    
library(data.table)
library(methods)
library(scales)          # function "comma_format"
library(tidyverse) 

options("scipen" = 100, "digits" = 4)         # print numbers in normal notation

# Read Rouvy tcx file.

rouvy_tcx_file <- "C:/Users/acmsc/Documents/Strava uploads/activity_14547951175 ROUVY - Nukerke to Oudenaarde.tcx"

mark_als_XML <- XML::xmlParse(rouvy_tcx_file)

df1 <- xmlToDataFrame(mark_als_XML, nodes=getNodeSet(mark_als_XML, "//nm:HeartRateBpm", 
          namespaces=c(nm="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2"))) %>% rename(HR = Value) %>% mutate(HR = as.numeric(HR))

df2 <- xmlToDataFrame(mark_als_XML, nodes=getNodeSet(mark_als_XML, "//nm:Position",
          namespaces=c(nm="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2")))

df3 <- xmlToDataFrame(mark_als_XML, nodes=getNodeSet(mark_als_XML, "//nm:Time",
          namespaces=c(nm="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2"))) %>% rename(Time = text)

timestamp <- min(df3$Time)

df4 <- xmlToDataFrame(mark_als_XML, nodes=getNodeSet(mark_als_XML, "//nm:Cadence",
          namespaces=c(nm="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2"))) %>% filter(row_number()!=1) %>% rename(Cadence = text) %>% mutate(Cadence = as.numeric(Cadence))

df5 <- xmlToDataFrame(mark_als_XML, nodes=getNodeSet(mark_als_XML, "//nm:DistanceMeters",
          namespaces=c(nm="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2"))) %>% filter(row_number()!=1)  %>% rename(Cum_Distance = text) %>% mutate(Cum_Distance = round(as.numeric(Cum_Distance),0)) 

df6 <- xmlToDataFrame(mark_als_XML, nodes=getNodeSet(mark_als_XML, "//nm:Extensions",
          namespaces=c(nm="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2"))) %>% filter(row_number()!=nrow(.)) %>% select(-LX) %>% rename(Speed_kmh = TPX) %>% mutate(Speed_kmh = round(as.numeric(Speed_kmh) * 3.6,1))   

df7 <- xmlToDataFrame(mark_als_XML, nodes=getNodeSet(mark_als_XML, "//nm:AltitudeMeters",
          namespaces=c(nm="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2"))) %>% rename(Altitude = text) %>% mutate(Altitude = as.numeric(Altitude))

# Function to force equal row number.

bind_cols_fill <- function(df_list) {
  max_rows <- map_int(df_list, nrow) %>% max()
  map(df_list, function(df) {
    if(nrow(df) == max_rows) return(df)
    first <- names(df)[1] %>% sym()
    df %>% add_row(!!first := rep(NA, max_rows - nrow(df)))
  }) %>% bind_cols()
}

part1_df <- bind_cols_fill(list(df1,df2,df3,df4,df5,df6,df7)) %>% 
  mutate_all(~ifelse(is.na(.x), mean(.x, na.rm = TRUE), .x))  

# Since TPX has two child nodes, Watt and Speed, xmlToDataFrame will mash them together.
# They must be linked separately.

part2_df <-  do.call("rbind", xpathApply(mark_als_XML, "//ns3:TPX", function(N) {
  data.frame(
    Watt = xmlValue(N[["Watts"]]),
    Speed = xmlValue(N[["Speed"]])
  )
}))


tcx_df <- cbind(part1_df, part2_df) %>% select(-Speed) %>% mutate(Watt = round(as.numeric(Watt),0)) %>% 
  mutate(Elevation_change_m = Altitude - lag(Altitude)) %>% 
  mutate(Elevation_change_m = if_else(is.na(Elevation_change_m), 0, Elevation_change_m)) %>% 
  mutate(Cum_Elevation_m = cumsum(if_else(Elevation_change_m < 0, 0, Elevation_change_m))) %>% 
  mutate(Timestamp = hms::as_hms(row_number())) %>% 
  mutate(Date_ride = as.Date(substr(Time,1,10))) %>% 
  mutate(Time_stamp_ext = as.ITime(substr(Time,12,19))) %>% 
  select(Date_ride, Time_stamp_ext, Timestamp, LatitudeDegrees, LongitudeDegrees, Cum_Distance, Speed_kmh, Altitude, Elevation_change_m, Cum_Elevation_m, HR, Cadence, Watt) %>% 
  mutate(Cum_Distance = Cum_Distance / 1000) %>% 
  mutate(Cadence = if_else(Cadence > 120, 120,Cadence)) %>%
  mutate(Power_to_heart_ratio = round(Watt / HR,2))

rm(mark_als_XML,df1,df2,df3,df4,df5,df6,df7,part1_df,part2_df)


Right now we have all our data available and we can plot power and heart rate. The next R code produces the graph with summary stats inside:

# Descriptive statistics.

tot_Elevation_m <- round(last(tcx_df$Cum_Elevation_m),1)
mean_Speed <- round(mean(tcx_df$Speed_kmh),1)
max_Speed <- round(max(tcx_df$Speed_kmh),1)
mean_Watt <- round(mean(tcx_df$Watt),0)
max_Watt <- round(max(tcx_df$Watt),0)
mean_Cadence <- round(mean(tcx_df$Cadence),0)
mean_HR <- round(mean(tcx_df$HR),0)
max_HR <- round(max(tcx_df$HR),0)
Distance_km <- max(tcx_df$Cum_Distance)
Duration <- as.ITime(max(row_number(tcx_df)))
Power_to_heart_ratio <- round(mean_Watt / mean_HR,2)
sd_Watt <- round(sd(tcx_df$Watt),0)
sd_HR <- round(sd(tcx_df$HR),0)
normalized_power <- trunc(mean((SMA(tcx_df$Watt,n=30))^4,na.rm = TRUE)^(1/4))  # SMA is Simple MA, EMA Exponential MA. NP uses 30 seconds by definition.
variability_index <-  round(normalized_power / mean_Watt,2)

#  Line chart Watt and HR, elevation profile.

max_Altitude <- round(max(tcx_df$Altitude),0)
min_Altitude <- round(min(tcx_df$Altitude),0)

ylim.prim <- c(0, ceiling(max_Watt/10)*10)   # wattage
ylim.sec <- c(floor(min_Altitude/10)*10, ceiling(max_Altitude/10)*10)    # altitude 

b <- diff(ylim.prim)/diff(ylim.sec)  
a <- ylim.prim[1] - b*ylim.sec[1]    

tcx_chart <- tcx_df %>%        
  ggplot() +
  geom_area(aes(x = Cum_Distance, y = a+Altitude*b), fill = "grey", alpha = 0.5) +
  geom_line(aes(x = Cum_Distance, y = SMA(Watt,n=5), colour = "watt")) + 
  geom_line(aes(x = Cum_Distance, y = HR, colour = "HR")) + 
  scale_y_continuous("SMA(Watt,n=5) c.q. HR", sec.axis = sec_axis(~ (. - a)/b, name = "altitude (m)"), 
                     labels = comma_format(big.mark = ".", decimal.mark = ",")) +
  labs(color = "line",
      x="distance (km)",
      title = "Trend of power (W) and HR (bpm) in a virtual ride",
      subtitle = paste0("Device: Wahoo Kickr, app: Rouvy, ride: ", basename(rouvy_tcx_file)),
      caption  = paste0("Time stamp virtual ride: ", timestamp)) +
  geom_label(aes(36, 350, hjust = 0, vjust = 0, label = paste0("average power: ", mean_Watt, "\n",
                                                                "s.d. power: ", sd_Watt, "\n", 
                                                                "normalized power: ", normalized_power, "\n", 
                                                                "variability index: ", variability_index, "\n", 
                                                                "average HR: ", mean_HR, "\n",
                                                                "s.d. HR: ", sd_HR, "\n", 
                                                                "power/hr ratio: ", Power_to_heart_ratio, "\n", 
                                                                "average speed (km/h): ", mean_Speed, "\n",
                                                                "average cadence (rpm): ", mean_Cadence, "\n",
                                                                "elevation (m): ", tot_Elevation_m, "\n", 
                                                                "duration: ", Duration)), data.frame())
tcx_chart
Figure 1.

Figure 1.


The values of power and heart rate are on the left y-axis, the altitude is on the right hand side, the x-axis is the distance.6
Compared to the erratic pattern of the power curve, the heart rate is quite smooth. Power has greater variability than heart rate. Power can fluctuate, e.g. 100 W one second, 250 W a few moments later. This is reflected by the high standard deviation of the power (80) and the low heart rate s.d. (12). Average power is 203 W. Normalized power (NP) is sometimes a better metric for the overall strain of a ride. NP takes into account the variance between the fluctuations in a ride to give you a number that better matches your effort level. Average power works for steady efforts, such as an endurance ride or intervals requiring a steady power output. If the pace is steady enough, NP and average power will be nearly identical. Another way to express how variable your effort was throughout the ride, is the variability index (VI). It is calculated as follows: normalized power / average power. The higher the VI, the greater the discrepancy between average and normalized power. A high VI means that the ride had a lot of surges and coasting. For properly executed endurance rides, we want the VI to be as low as possible. On the other hand, a criterium, a training with intervals or chasing KOMs will all show a high VI, due to all of the zero power values from coasting.
The power/hr ratio is an input output measure, for what’s it worth.
Heart rate is a lagging phenomenon. This mean that it takes time for your body to react to harder or easier efforts. If you’re doing an interval, it will take time for your heart rate to rise in reaction to your increased effort.
To properly model the relationship between heart rate and wattage, we will adjust both variables: 1. a moving average for wattage, and 2. a lead time for heart rate.
In our model we will be using “simple moving average” (SMA). Exponential moving average (EMA) is more convenient for endurance data as past effort should have only minimal influence.
Linear regression is a commonly used method of predictive analysis. It uses linear relationships between a dependent variable to be explained and one or more independent variables (predictors) to predict. It shows whether changes observed in the dependent variable are associated with changes in one or more of the explanatory or independent variables.
Heart rate is the dependent variable. In addition to power, fatigue and gradient are the predictors. The simplest way to operationalize fatigue is to add time from start workout computed as an amount of seconds. The elevation change in meters represents the gradient. You can also think of it as a resistance measure.
First, let’s see which combination of ‘moving average wattage’ (referred to as the power interval, in seconds) and ‘lead seconds heart rate’ yield the best model fit. Let’s treat both as parameters i en j that will be iteratively put in i x j regression models.
The R code and its output is shown below. The resulting heatmap allows checking both window size parameter and lead altogether.


# Which combination of mean power (W) window and lead HR gives the highest R-square?

iterations_i = 160
iterations_j = 60
variables = 7
count <- 0
output <- matrix(ncol=variables, nrow = iterations_i * iterations_j)
for (i in c(1:iterations_i)){
  for (j in c(1:iterations_j)){
    count <-  count + 1
    #print(count)
    linearMod <- lm(lead(HR, n = j) ~ SMA(Watt, n = i) + Timestamp + Elevation_change_m, data=tcx_df)
    summary <- summary(linearMod)
    output[count,] <-  c(i,j,summary$r.squared, summary$coefficients[1],summary$coefficients[2],summary$coefficients[3],summary$coefficients[4])
  }
}

# X1 and X2 is the best pair i-j; X3 is the maximum R-square in the i-j range defined.
# X4 intercept b0, X5-X7 parameters b1-b3 viz watt, time, elevation change meters.

summary.df <- data.frame(output)
summary.df[summary.df$X3==max(summary.df$X3),]    
##       X1 X2     X3    X4     X5       X6     X7
## 9001 151  1 0.8245 84.07 0.2291 0.001782 -1.313
ggplot(summary.df, aes(x = X1, y = X2, color = X3)) +  
  geom_point(size = 6) + 
  scale_colour_gradientn(colours=rainbow(4)) +
  labs(x = "mean power (W) window (in seconds)", y = "lead HR (in seconds)", color ="R²",
     title = "Which combination of mean power (W) window and lead HR returns the highest R² ?",
     subtitle = paste0("model range: lm(lead(HR, n = j) ~ SMA(Watt, n = i) + Timestamp + Elevation_change_m\nhighest R² : ", round(as.data.frame(summary.df[summary.df$X3==max(summary.df$X3),]    )[,3],3),
                       " at ", as.data.frame(summary.df[summary.df$X3==max(summary.df$X3),]    )[,2], " lead seconds ", " and mean Watt window ",
                       as.data.frame(summary.df[summary.df$X3==max(summary.df$X3),]    )[,1]),
     caption  = paste0("Time stamp virtual ride: ", timestamp)) 
Figure 2.

Figure 2.



Somewhere roughly in the blue area of the heatmap is the maximum R2, more exact at the intersection of one second HR lead and a power interval of 151 seconds. The number of seconds is very low in the first case and very high in the other case. That’s why we’re now going to shift each interval separately, starting with the power, see script and output below (summary regression model with optimal R2, Figure 3).

##########   plot series 1-5 a: impact mean power window.

iteraties <-  160    # max 'mean power window' 160 seconds 
variables <- 6
output <- matrix(ncol=variables, nrow=(iteraties))
for (i in c(1:iteraties)){
  #print(i)
  linearMod <- lm(HR ~ SMA(Watt,n=i) + Timestamp + Elevation_change_m, data=tcx_df)
  summary <- summary(linearMod)
  output[i,] <- c(i,summary$r.squared, summary$coefficients[1],summary$coefficients[2],summary$coefficients[3],summary$coefficients[4])
}

summary.df <- data.frame(output)
window  <- summary.df[summary.df$X2 == max(summary.df$X2),1] # use this window size for MA.

plot1a <- ggplot(summary.df, aes(x = X1)) + 
  geom_line(aes(y = X2), colour = "blue") +
  labs(x = "different intervals (sec.) moving average power (W)") +
  labs(y = "R²") +
  geom_vline(xintercept = window, linetype="dashed", 
             color = "brown", linewidth=1 ) +
  ggtitle("Increase R² by applying a SMA in power (W)") +
  annotate(x = window, y = +Inf,label = paste0("max. R² at interval ", window), vjust=4, hjust=1.2, geom="label", color = "brown")

plot2a <- ggplot(summary.df, aes(x = X1)) + 
  geom_line(aes(y = X3, colour = "blue"), show.legend=FALSE) +
  labs(x = "different intervals (sec.) moving average power (W)") +
  labs(y = "intercept    b0") +
  geom_vline(xintercept = window, linetype="dashed", 
             color = "brown", linewidth=1 ) +
  ggtitle("Decrease intercept (b0) by applying a SMA in power (W)") +
  annotate(x = window, y = +Inf,label = paste0("max. R² at interval ", window), vjust=3, hjust=1.2, geom="label", color = "brown")

plot3a <- ggplot(summary.df, aes(x = X1)) + 
  geom_line(aes(y = X4, colour = "blue"), show.legend=FALSE) +
  labs(x = "different intervals (sec.) moving average power (W)") +
  labs(y = "power (W)    b1") +
  geom_vline(xintercept = window, linetype="dashed", 
             color = "brown", linewidth=1 ) +
  ggtitle("Increase parameter power (b1) by applying a SMA in power (W)") +
  annotate(x = window, y = +Inf,label = paste0("max. R² at interval ", window), vjust=4, hjust=1.2, geom="label", color = "brown")

plot4a <- ggplot(summary.df, aes(x = X1)) + 
  geom_line(aes(y = X5, colour = "blue"), show.legend=FALSE) +
  labs(x = "different intervals (sec.) moving average power (W)") +
  labs(y = "Timestamp    b2") +
  geom_vline(xintercept = window, linetype="dashed", 
             color = "brown", linewidth=1 ) +
  ggtitle("Change parameter timestamp (b2) by applying a SMA in power (W)") +
  annotate(x = window, y = +Inf,label = paste0("max. R² at interval ", window), vjust=3, hjust=1.2, geom="label", color = "brown")

plot5a <- ggplot(summary.df, aes(x = X1)) + 
  geom_line(aes(y = X6, colour = "blue"), show.legend=FALSE) +
  labs(x = "different intervals (sec.) moving average power (W)") +
  labs(y = "resistance    b3") +
  geom_vline(xintercept = window, linetype="dashed", 
             color = "brown", linewidth=1 ) +
  ggtitle("Change parameter resistance (b3) by applying a SMA in power (W)") +
  annotate(x = window, y = +Inf,label = paste0("max. R² at interval ", window), vjust=3, hjust=1.2, geom="label", color = "brown")

summary.df[summary.df$X2 == max(summary.df$X2),]
##      X1     X2    X3     X4       X5     X6
## 153 153 0.8246 83.79 0.2306 0.001779 -1.483
grid.arrange(plot1a, plot2a, plot3a, plot4a, plot5a, nrow=3)
Figure 3.

Figure 3.



The best fit of the model was achieved when the window for the mean power was 153 seconds, as the blue line in first of the five plots in Figure 3 shows. A thing worth checking is the behavior of the regression parameters, b0 (base HR) and b1 to b3 (the predictors). You can see that with the increase in window size the base HR is dropping constantly. The time variable is dropping a while and stabilize at a much smaller value. The resistance parameter drops. The power growth ratio increases.
Now the same analysis, but with a varying lead HR given a constant power window of five seconds, see the output below (summary regression model with optimal R2, Figure 4).

##    X1     X2    X3      X4       X5     X6
## 51 51 0.4381 111.2 0.09384 0.001824 -1.309
Figure 4.

Figure 4.



The best fit of the model was achieved with a HR lead of 51 seconds. Compared to the previous model, the R2 is lower, the constant is higher and the importance of b1 is reduced. In short, the model-importance of the HR lead is very small in this indoor effort.
Conclusion: suppose we were to consider the best model fit as most important, we would indeed set the HR lead to one second, and the average power interval to 153.
But based on our own virtual cycling experience and reasonableness, we set the HR lead at 10 seconds, and the power SMA at 25. Due to the analyzes shown, we opt for a higher interval value than initially intended.
Both constants will be used in the correlation plot and regression analysis.

Correlation plot


The R script below generates the correlation plot, see Figure 5.


library(RColorBrewer)

# Correlation, with 'lag/lead HR' +10 and 'mean power interval window' 25.

lr <- cor.test(SMA(tcx_df$Watt,25), lead(tcx_df$HR,10))
Signif <- as.character(symnum(lr$p.value, corr = FALSE, na = FALSE, cutpoints = c(0,0.001,0.01,0.05,1), 
                              symbols = c("p-value < .001", "p-value < .01", "p-value < .05", "n.s.")))

lr_txt <- paste0("Pearson's r: ", round(lr$estimate,3), "; df = ", lr$parameter, "; ", Signif)

# Simple regression model with HR as dependent variable. lag/lead HR +10 and 'mean watt interval window 25.

linearMod_bivariate <- lm(lead(HR,10) ~ SMA(Watt,25) , data=tcx_df, na.action = na.exclude)

# get F-statistic and p-value from linearMod_bivariate fit.

Regressionp <- function (modelobject) {
  if (class(modelobject) != "lm") stop("Not an object of class 'lm' ")
  f <- summary(modelobject)$fstatistic
  p <- pf(f[1],f[2],f[3],lower.tail=FALSE)
  attributes(p) <- NULL
  return(p)
}

Signif <- as.character(symnum(Regressionp(linearMod_bivariate), corr = FALSE, na = FALSE, cutpoints = c(0,0.001,0.01,0.05,1), 
                              symbols = c("p-value < .001", "p-value < .01", "p-value < .05", "n.s.")))

linearMod_bivariate_txt <- paste0("Simple regression: b0 = ", 
                                  round(linearMod_bivariate$coefficients[1],1), 
                                  "; b1 = ",
                                  round(linearMod_bivariate$coefficients[2],3),
                                  "; R² = ",   # t.b.v. R-square gebruik ik unicode  R² ! See Unicode Character SUPERSCRIPT TWO
                                  round(summary(linearMod_bivariate)$r.squared,3),
                                  ";  ", Signif)

Table_HR_Watt <- matrix(ncol=2, nrow=500)

for (i in c(1:500)){
  pred_HR <- coef(linearMod_bivariate)[1] + coef(linearMod_bivariate)[2] * i
  Table_HR_Watt[i,] <- c(i,pred_HR)
}

Table_HR_Watt <- as.data.frame(Table_HR_Watt) %>% 
  rename(Watt = V1, pred_HR = V2) %>% 
  mutate(pred_HR = round(pred_HR,0)) %>% 
  group_by(pred_HR) %>% 
  summarise(Watt = max(Watt)) %>% 
  filter(pred_HR %in% (c(140,150)))

# the table above must not be empty (check row 140 and 150).
Table_HR_Watt
## # A tibble: 2 × 2
##   pred_HR  Watt
##     <dbl> <dbl>
## 1     140   261
## 2     150   366
pwc_140 <- as.integer(Table_HR_Watt %>% filter(pred_HR == 140) %>% select(Watt))
pwc_150 <- as.integer(Table_HR_Watt %>% filter(pred_HR == 150) %>% select(Watt))

lrp <- ggplot(tcx_df, aes(x = SMA(Watt,25), y = lead(HR,10),colour=Cum_Distance)) + 
  geom_point(size = 2, shape = 1) + 
  scale_color_distiller(palette = 'Spectral') +
  geom_smooth(formula = y ~ x , color = "red") +
  annotate("text", x =  155, y = 105, label = lr_txt, size=4, color = "darkblue") +
  annotate("text", x =  155, y = 102, label = linearMod_bivariate_txt, size=4, color = "darkblue") +
  annotate(x = 140, y = 152, label = paste0("pwc150: ", pwc_150), geom="label", color = "brown") +
  geom_hline(yintercept = 150, linetype="dashed", color = "brown", linewidth=1 ) +
  annotate(x = 140, y = 142, label = paste0("pwc140: ", pwc_140), geom="label", color = "brown") +
  geom_hline(yintercept = 140, linetype="dashed", color = "brown", linewidth=1 ) +
  coord_cartesian(ylim = c(100, 170)) +
  labs(color = "distance \nfrom \nstart (km)",
     title = "Relation power (W) and HR (bpm) in a virtual ride",
     subtitle = paste0("Device: Wahoo Kickr, app: Rouvy, ride: ", basename(rouvy_tcx_file)),
     caption  = paste0("Time stamp virtual ride: ", timestamp)) 

lrp 
Figure 5.

Figure 5.



Power is on the horizontal axis, heart rate on the vertical axis. The red diagonal line is the calculated bivariate linear relationship between power and heart rate.
All data points are plotted from the start (blue) of the activity to the end (red). The graph shows that at the beginning of the ride the heart rate is still low, while a slightly higher wattage is immediately pedaled. Generally speaking, in the lower right quadrant you could trace the effect of a strong anaerobic sprint just after the start, or the lack of warm-up.
Pearson’s correlation coefficient is 0.552.
The intercept of the simple regression model (R2 = 0.31) is 116, the power growth ratio is 0.1.
The physical working capacity (PWC) indicator has been added to the graph for information purposes only. The PWC refers to the amount of power that is produced at a specific heart rate. PWC 140 and PWC 150 are the power produced when the heart rate is raised to 140 bpm and 150 bpm, respectively.



Multiple regression model



Script and output regression model:

library(jtools)
library(ggfortify)

linearMod <- lm(lead(HR, n = 10) ~ SMA(Watt, n = 25) + Timestamp + Elevation_change_m, data=tcx_df)

summ(linearMod, scale = FALSE, digits=4)
Observations 5360 (34 missing obs. deleted)
Dependent variable lead(HR, n = 10)
Type OLS linear regression
F(3,5356) 1110.2740
0.3834
Adj. R² 0.3831
Est. S.E. t val. p
(Intercept) 108.6795 0.5119 212.3198 0.0000
SMA(Watt, n = 25) 0.1034 0.0022 46.4711 0.0000
Timestamp 0.0019 0.0001 25.2487 0.0000
Elevation_change_m -2.3690 0.3807 -6.2224 0.0000
Standard errors: OLS



Based on the output above, we can write out the formula to estimate the heart rate:

## [1] "lead(HR, n = 10) ~ 108.679 + 0.103 * SMA(Watt, n = 25) + 0.002 * Timestamp - 2.369 * Elevation_change_m"



According to the model intercept, when there is no power output heart rate already is at a base level, in this case 108. The intercept however should not be interpreted as the heart rate when the body is in a resting state.< br> A power coefficient of 0.1 means that if your cycling increases by 1 watt, the heart rate increases on average 0.1 bpm in the model.
The fatigue parameter is estimated at 0.0019. This means that in this ride, after one hour of cycling, the heart rate increased by 3600 times 0.0019, which is 6.9 bpm, on average under the same conditions.
The sign of the relationship between HR and the gradient in the model is negative (-2.3690), which is remarkable because the correlation coefficient is positive (0.2238), see the printout below.

corr_df <- tcx_df %>% mutate(HR_lead_10 = lead(HR, n = 10))
cor.test(corr_df$HR_lead_10, corr_df$Elevation_change_m)
## 
##  Pearson's product-moment correlation
## 
## data:  corr_df$HR_lead_10 and corr_df$Elevation_change_m
## t = 17, df = 5382, p-value <0.0000000000000002
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1983 0.2490
## sample estimates:
##    cor 
## 0.2238
rm(corr_df)


If we run the model again, but now without power, the HR - gradient relationship is positive (7.3962) just as expected:

## 
## Call:
## lm(formula = lead(HR, n = 10) ~ Timestamp + Elevation_change_m, 
##     data = tcx_df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -66.65  -6.77  -1.72   6.08  31.20 
## 
## Coefficients:
##                       Estimate  Std. Error t value            Pr(>|t|)    
## (Intercept)        128.6282757   0.3008686   427.5 <0.0000000000000002 ***
## Timestamp            0.0022607   0.0000968    23.4 <0.0000000000000002 ***
## Elevation_change_m   7.3962344   0.4030739    18.4 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11 on 5381 degrees of freedom
##   (10 observations deleted due to missingness)
## Multiple R-squared:  0.137,  Adjusted R-squared:  0.137 
## F-statistic:  429 on 2 and 5381 DF,  p-value: <0.0000000000000002


In the model including power, the gradient apparently functions as an adjustment of the weight of power.

Finally, we perform some calculations for control purposes and to print out some observations.

library(modelr)

linearMod_check <- modelr::add_predictions(tcx_df, linearMod, var = "pred", type = NULL) %>% 
  mutate(HR_predicted = round(pred,0),
         HR_lead_10 = lead(HR, n = 10),
         Watt_mean_window_25 = SMA(Watt, n = 25)) %>% 
  filter(!is.na(HR_predicted) & !is.na(HR_lead_10)) %>% 
  mutate(residual = round(resid(linearMod),3),
         stand_residual = rstandard(linearMod), 
         constant = as.numeric(coef(linearMod)[1]),
         b1 = as.numeric(coef(linearMod)[2]),
         b2 = as.numeric(coef(linearMod)[3]),
         b3 = as.numeric(coef(linearMod)[4]),
         watt_model_part = as.numeric(coef(linearMod)[2]) * Watt_mean_window_25,
         time_model_part = as.numeric(coef(linearMod)[3]) * as.numeric(Timestamp),
         elevation_model_part = as.numeric(coef(linearMod)[4]) * Elevation_change_m,
         HR_pred_check = round(as.numeric(coef(linearMod)[1]) +     
                               as.numeric(coef(linearMod)[2]) * Watt_mean_window_25 +
                               as.numeric(coef(linearMod)[3]) * as.numeric(Timestamp) +
                               as.numeric(coef(linearMod)[4]) * Elevation_change_m,0),
         identical_pred = (HR_predicted == HR_pred_check)) %>% 
  select(HR, HR_lead_10, Watt, Watt_mean_window_25,Timestamp,Elevation_change_m,Speed_kmh,HR_predicted,constant,b1,b2,b3,watt_model_part,time_model_part,
         elevation_model_part, residual,stand_residual, identical_pred)
  
table(linearMod_check$identical_pred)  # outcome must be TRUE
## 
## TRUE 
## 5360



Below are seven numerical elaborations of observations, four with the lowest measured speed (climbing), and three with the highest (descent). Observed model input in green, output in blue. Residual in red.

linearMod_check_examples <- linearMod_check %>% select(-identical_pred) %>%  filter(Speed_kmh < 6.8 | Speed_kmh > 70 ) %>% 
  arrange(Speed_kmh)

linearMod_check_examples <- t(linearMod_check_examples)

linearMod_check_examples <- as.data.frame(format(linearMod_check_examples, justify = "left"))

library(kableExtra)
library(knitr)

knitr::kable(linearMod_check_examples, format = "html", escape = FALSE) %>% 
  row_spec(1:nrow(linearMod_check_examples),font_size=10) %>% 
  row_spec(0,bold=TRUE) %>% 
  row_spec(2,bold=TRUE,color = "#006D2C") %>% 
  row_spec(4:6,bold=TRUE,color = "#006D2C") %>% 
  row_spec(16:17,bold=TRUE,color = "#FC9272") %>% 
  row_spec(8:15,color = "blue") %>% 
  kableExtra::kable_styling(full_width = TRUE, font_size = 10) 
V1 V2 V3 V4 V5 V6 V7
HR 136 136 150 150 126 126 126
HR_lead_10 143 143 150 151 127 128 129
Watt 358 362 298 313 336 242 333
Watt_mean_window_25 368.4 369.0 289.5 290.6 186.0 200.5 193.6
Timestamp 00:05:53 00:05:52 01:08:22 01:08:23 00:10:28 00:10:30 00:10:29
Elevation_change_m 0.4 0.6 0.4 0.4 -1.6 -1.2 -1.4
Speed_kmh 6.2 6.3 6.6 6.7 70.2 70.3 70.5
HR_predicted 147 146 146 146 133 133 133
constant 108.7 108.7 108.7 108.7 108.7 108.7 108.7
b1 0.1034 0.1034 0.1034 0.1034 0.1034 0.1034 0.1034
b2 0.001934 0.001934 0.001934 0.001934 0.001934 0.001934 0.001934
b3 -2.369 -2.369 -2.369 -2.369 -2.369 -2.369 -2.369
watt_model_part 38.10 38.15 29.93 30.05 19.24 20.74 20.02
time_model_part 0.6826 0.6807 7.9322 7.9342 1.2144 1.2183 1.2163
elevation_model_part -0.9476 -1.4214 -0.9476 -0.9476 3.7903 2.8428 3.3166
residual -3.514 -3.092 4.402 5.284 -5.922 -5.476 -4.236
stand_residual -0.4055 -0.3568 0.5078 0.6095 -0.6846 -0.6324 -0.4895


We conclude the model explanation with a touch of residual analysis based on the autoplot output below.


autoplot(linearMod)



Residuals may correlate with another variable. This is usually identified visually using graphs such as the quartet above. The residual versus fitted plot can reveal non-linearity because of inequality of the variances of the error terms. For instance a parabola indicates non-linear relationship. The plot that is in the right upper corner, is the normal probability plot of residuals. It is an indication to see if the model residuals follow a normal distribution. Do residuals follow a straight line well or do they deviate severely? The Scale-Location plot shows if residuals are spread equally along the ranges of predictors. It’s good if you see a horizontal line with equally (randomly) spread points. The leverage idea picks out the potential for a specific value to be influential.

Conclusion


What conclusion can we draw? What would a cyclist like me want to see on screen after every virtual ride?
First of all, some general ride characteristics: distance, elapsed time, elevation meters, average heart rate, maximum heart rate, average power, maximum power, normalized power, the power variability index, average speed, the power/HR ratio and the average cadence. Furthermore, a graph with distance on the x-axis, the altitude profile in the background and wattage and heart rate on the y-axis. We have put all this in one picture, see Figure 1.

The HR and wattage trends from Figure 1 can be plotted concisely, as in the image below. Here you can see in the shape of a normalized trend line to what extent your heart rate increases compared to the also normalized power trend.

ggplot() +
  geom_smooth(aes(x = Cum_Distance, y = scale(HR)), data = tcx_df, 
              method = "lm", se = FALSE, color = "red") + 
  geom_smooth(aes(x = Cum_Distance, y = scale(Watt)), data = tcx_df, # normalize both variables
              method = "lm", se = FALSE, color = "darkgreen") +
  annotate("text", x =  10.1, y = -0.40, label = paste0("HR", " (slope = ",round(coef(lm(scale(tcx_df$HR)~tcx_df$Cum_Distance))[2],4), ")"), size=4, color = "red") +
  annotate("text", x =  4.8, y =  0.07, label = paste0("power", " (slope = ",round(coef(lm(scale(tcx_df$Watt)~tcx_df$Cum_Distance))[2],4), ")"), size=4, color = "darkgreen") +
  labs(x = "distance from start (km)", y = "z-score normalization", 
       title = "Trend HR (bpm) and power (W) presented concisely",
       subtitle = paste0("Device: Wahoo Kickr, app: Rouvy, ride: ", basename(rouvy_tcx_file)),
       caption  = paste0("Time stamp virtual ride: ", timestamp)) 
Figure 6.

Figure 6.


To estimate the heart rate, I would prefer a regression model with three predictors: power, elapsed time (a proxy of fatigue), gradient (expressed by elevation change meters, a measure of resistance). With two constants, viz. a HR lead of 10 seconds, and a mean power window of 25 seconds, see the equation below as an example. With an output display such as provided by the R package ‘jtools’.

## [1] "lead(HR, n = 10) ~ 108.679 + 0.103 * SMA(Watt, n = 25) + 0.002 * Timestamp - 2.369 * Elevation_change_m"


The fatigue parameter is estimated at 0.0019. This means that in this ride, after one hour of cycling, the heart rate increased by 3600 times 0.0019, which is 6.9 bpm, on average under the same power and resistance conditions. From now on we will add this as a bicycle ride characteristic to Figure 1 and see how this indicator behaves during various virtual rides. The fatigue metric seems like an interesting spin-off from this exercise. Instead of fatigue we will call it “hr_HR_change”, defined as the linear HR change after one hour of cycling. Calculated in a model with power and resistance as covariates, in other words, taking power and resistance into account. In a next R markdown blog we will test the hr_HR_change indicator using personal virtual bicycle ride data that can be accessed with VeloViewer.

hr_HR_change <- round(as.numeric(coef(linearMod)[3]) * 3600,0)
hr_HR_change
## [1] 7



Footnotes


  1. Geographer, data-analist and cyclist↩︎

  2. F.J. Arts and H. Kuipers: “The relation between power output, oxygen uptake and heart rate in male athletes”, Int J Sports Med, July 15, 1994
    https://pubmed.ncbi.nlm.nih.gov/7960315/↩︎

  3. David Johnstone: “Heart rate vs. power chart”, Cycling Analytics, June 8, 2013
    https://www.cyclinganalytics.com/blog/2013/06/heart-rate-vs-power-chart↩︎

  4. Marek Śmigielski: “Does power and heart rate go hand in hand when you ride a bike?”, Towards Data Science, September 6, 2018
    https://towardsdatascience.com/does-power-and-heart-rate-go-hand-in-hand-when-you-ride-a-bike-37a174785f37↩︎

  5. Critical Powers: “Modelling Heart Rate and Power Output”, Aug 1, 2016
    https://medium.com/critical-powers/modelling-heart-rate-and-power-output-d5fb654eae9d↩︎

  6. To better represent the power curve, we have calculated a moving average with a five-second interval in this graph.↩︎