Data was obtained from here and saved as a csv file. Since my goal is not to build any serious app, I wanted to play with the caret package and its models. That's why I used this raw data and built models instead of using the well-known closed-form equations.
eqt <- read.csv('equivtable.csv', sep = ';', stringsAsFactors = FALSE)
head(eqt)
## X1.mile X5.km X10.km Half.Marathon Marathon VO2
## 1 00:04:12 00:14:24 00:30:00 01:05:54 02:18:03 73
## 2 00:04:21 00:14:53 00:31:00 01:08:06 02:22:39 70
## 3 00:04:29 00:15:22 00:32:00 01:10:19 02:27:16 67
## 4 00:04:38 00:15:51 00:33:00 01:12:31 02:31:53 65
## 5 00:04:46 00:16:20 00:34:00 01:14:43 02:36:29 63
## 6 00:04:55 00:16:49 00:35:00 01:16:56 02:41:05 61
I need these libraries:
library(tidyr)
library(caret)
library(ggplot2)
library(lubridate)
library(brnn)
Next, I use the tidyr package to transform my data for learning and introduce a new variable, $Seconds. This is my function:
# 1.6|5.0|10.0|21.1|42.2|VO2max --> VO2max|Distance|Time|Seconds
eqtNames <- c("1.6", "5.0", "10.0", "21.1", "42.2", "VO2max")
eqtBrush <- function(dat) {
  names(dat) <- eqtNames
  # wide -> long: one row per (VO2max, Distance) combination
  dat <- gather(dat, Distance, Time, 1:5)
  # prepend lubridate's epoch origin ("1970-01-01") to each "HH:MM:SS"
  # string so parse_date_time returns a POSIXct value
  dat$Time <- parse_date_time(paste(rep(origin, length(dat$Time)),
                                    dat$Time), "Ymd HMS")
  # seconds since the epoch equal the race duration in seconds
  dat$Seconds <- as.numeric(dat$Time)
  dat
}
Now I prepare all the data frames I need. Note that $Distance is kept as a factor for plotting; in eqtTrain it is coerced to a numeric (km).
eqtTrain <- eqtBrush(eqt) # training set
eqtTrain$Distance <- as.numeric(as.character(eqtTrain$Distance)) # in km
eqtGraph <- eqtBrush(eqt[c(1,21,50), ]) # data for plot
userT <- eqtBrush(eqt[11,]) # default values of user variable
head(eqtTrain)
## VO2max Distance Time Seconds
## 1 73 1.6 1970-01-01 00:04:12 252
## 2 70 1.6 1970-01-01 00:04:21 261
## 3 67 1.6 1970-01-01 00:04:29 269
## 4 65 1.6 1970-01-01 00:04:38 278
## 5 63 1.6 1970-01-01 00:04:46 286
## 6 61 1.6 1970-01-01 00:04:55 295
str(eqtGraph)
## 'data.frame': 15 obs. of 4 variables:
## $ VO2max : int 73 40 23 73 40 23 73 40 23 73 ...
## $ Distance: Factor w/ 5 levels "1.6","5.0","10.0",..: 1 1 1 2 2 2 3 3 3 4 ...
## $ Time : POSIXct, format: "1970-01-01 00:04:12" "1970-01-01 00:07:03" ...
## $ Seconds : num 252 423 671 864 1442 ...
I chose 3 values of the VO2max parameter and assigned the related data to runner levels. There's also an orange line in the plot: the user's estimate. It's interactive in shiny and moves up or down depending on the user's data. The plot uses a logarithmic scale, and its breaks were found iteratively.
yBreaks <- c(5*60, 10*60, 18*60, 30*60, 50*60, 80*60,
2*60*60, 3*60*60, 4*60*60, 5*60*60, 6*60*60)
yLabels <- c("5:00", "10:00", "18:00", "30:00", "50:00", "1:20:00",
"2:00:00", "3:00:00", "4:00:00", "5:00:00", "6:00:00")
graph <- ggplot(data = eqtGraph, aes(x = Distance, y = Seconds)) +
  geom_line(aes(group = VO2max, colour = factor(VO2max))) +
  scale_y_log10(breaks = yBreaks, labels = yLabels, limits = c(220, 22000)) +
  geom_point(aes(colour = factor(VO2max))) +
  labs(x = "Distance, km", y = "Time, h:m:s", colour = "VO2max")
graph + geom_point(data = userT, size = 2.7, aes(group = VO2max, colour = "you")) +
  geom_line(data = userT, aes(group = VO2max, colour = "you"), size = 1.3) +
  scale_colour_manual(limits = c("you", levels(factor(eqtGraph$VO2max))),
                      labels = c("you", "newbie", "amateur", "pro"),
                      values = c("orange", 2:4))
If you're close to the runners' world and its math and numbers, you'll notice that these VO2max values were probably produced by Daniels' equation. There are other formulas too. So, if necessary, I can test my model against Daniels' equation. That's what I actually did, but I didn't report it here (maybe later I'll write a separate note about it).
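For reference, here is a minimal sketch of the Daniels & Gilbert formula with its commonly cited coefficients (my transcription, not code from the app): the oxygen cost of running at velocity v, in m/min, divided by the fraction of VO2max sustainable for t minutes.
danielsVO2max <- function(dist_km, seconds) {
  t <- seconds / 60                                # race time, minutes
  v <- dist_km * 1000 / t                          # velocity, m/min
  vo2 <- -4.60 + 0.182258 * v + 0.000104 * v^2     # oxygen cost of the pace
  pct <- 0.8 + 0.1894393 * exp(-0.012778 * t) +
    0.2989558 * exp(-0.1932605 * t)                # sustainable fraction of VO2max
  vo2 / pct
}
danielsVO2max(10, 2400)  # 10 km in 40:00 gives about 52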
Let's start by trying a linear model. I have 2 predictors: distance and time in seconds. As a measure of success, I'll plot the error on the training(!) set in %.
fit <- lm(VO2max ~ Seconds * Distance, data = eqtTrain)
summary(fit)
##
## Call:
## lm(formula = VO2max ~ Seconds * Distance, data = eqtTrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.5852 -6.6653 -0.5484 4.6804 29.7209
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.232e+01 1.211e+00 34.955 < 2e-16 ***
## Seconds -6.502e-03 4.951e-04 -13.131 < 2e-16 ***
## Distance 1.613e+00 1.179e-01 13.677 < 2e-16 ***
## Seconds:Distance 4.549e-05 1.027e-05 4.428 1.43e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.563 on 246 degrees of freedom
## Multiple R-squared: 0.5224, Adjusted R-squared: 0.5165
## F-statistic: 89.68 on 3 and 246 DF, p-value: < 2.2e-16
plot((predict(fit, eqtTrain) - eqtTrain$VO2max)/eqtTrain$VO2max*100, ylab="% error")
Not a great start. Errors reach 80%, even though the t-values and Pr(>|t|) look very promising. Let's try to add some nonlinearity to this model.
fit <- lm(VO2max ~ log10(Seconds) * log10(Distance), data = eqtTrain)
plot((predict(fit, eqtTrain) - eqtTrain$VO2max)/eqtTrain$VO2max*100, ylab="% error")
Better, closer, warmer. But still not good enough as a prediction. I tried most of the caret package's models with their default settings, and the results were pretty similar.
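A minimal sketch of such a survey, with a few illustrative method names (my picks, not necessarily the exact models I tried):
# loop over a few caret methods with defaults and report
# the worst-case training error in %
for (m in c("lm", "rpart", "knn")) {
  f <- train(VO2max ~ log10(Seconds) * log10(Distance),
             data = eqtTrain, method = m)
  err <- (predict(f, eqtTrain) - eqtTrain$VO2max)/eqtTrain$VO2max*100
  cat(m, "max training error:", round(max(abs(err)), 1), "%\n")
}
Sometimes, though, the training error plot looks fantastic! M5, for example: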
fit <- train(VO2max ~ log10(Seconds) * log10(Distance),
data = eqtTrain, method = "M5")
plot((predict(fit, eqtTrain) - eqtTrain$VO2max)/eqtTrain$VO2max*100, ylab="% error")
But tree-like models don't generalize well (I'll also show this later). You don't even need Daniels' equation to prove it. The next two values show that the VO2max estimates are equal for distances of 15 and 21 kilometers run in the same time, which is completely incorrect. Obviously, if you ran a half marathon while I ran just 15 km in the same time, your VO2max is higher.
predict(fit, data.frame(Distance = 15, Seconds = 4500))
## [1] 62.93563
predict(fit, data.frame(Distance = 21, Seconds = 4500))
## [1] 62.93563
Now it's neural network time. There's an opinion that BRNNs, Bayesian regularized neural networks, generalize better than neural networks with traditional learning algorithms (such as back propagation, etc.). In my case, a BRNN with 5 neurons in the hidden layer gave a brilliant result: it learned extremely fast, had no problems with generalization, and shows just 1-2% error. It suits me perfectly, and I use it in my shiny app.
set.seed(19)
fit <- brnn(VO2max ~ log10(Seconds) * log10(Distance), data = eqtTrain,
neurons = 5)
## Number of parameters (weights and biases) to estimate: 25
## Nguyen-Widrow method
## Scaling factor= 0.704521
## gamma= 14.7369 alpha= 0.2453 beta= 4051.137
plot((predict(fit, eqtTrain) - eqtTrain$VO2max)/eqtTrain$VO2max*100, ylab="% error")
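As a quick sanity check (my suggestion; output omitted), the two queries that fooled the M5 tree can be repeated here. The BRNN should return different VO2max estimates for the two distances:
predict(fit, data.frame(Distance = 15, Seconds = 4500))
predict(fit, data.frame(Distance = 21, Seconds = 4500))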
The idea is to use the VO2max value to obtain predicted times for distances not included in the original table. For example, if my best result is 10 km in 40 minutes, what's my time for 15 km? I can't just use a proportion (which would give exactly 60 minutes), because it's not possible to keep my 10 km pace over a longer distance. I use the same training set.
First, a linear model gave a pretty good result:
fitRaces <- lm(Seconds ~ log10(VO2max) * Distance, data = eqtTrain)
plot((predict(fitRaces, eqtTrain) - eqtTrain$Seconds)/eqtTrain$Seconds*100, ylab="% error")
I tried the cubist method, which is one of the M5 variations. It produces low error on the training set, but its generalization is very poor; look at the second plot.
fitRaces <- train(Seconds ~ log10(VO2max) * Distance,
data = eqtTrain, method = "cubist")
plot((predict(fitRaces, eqtTrain) - eqtTrain$Seconds)/eqtTrain$Seconds*100, ylab="% error")
plot(seq(1.6,42.2,by=1.4), predict(fitRaces, data.frame(VO2max = 52,
Distance = seq(1.6,42.2,by=1.4))), xlab="Distance, km", ylab="Time, sec")
Using BRNNs again, I made two plots to show what the last two plots should look like. I used 7 neurons in the hidden layer this time. The black dots were taken from the original table.
set.seed(1902)
fitRaces <- brnn(Seconds ~ log10(VO2max) * Distance, data = eqtTrain,
neurons = 7)
## Number of parameters (weights and biases) to estimate: 35
## Nguyen-Widrow method
## Scaling factor= 0.7054698
## gamma= 13.9638 alpha= 0.7288 beta= 13959.01
plot((predict(fitRaces, eqtTrain) - eqtTrain$Seconds)/eqtTrain$Seconds*100, ylab="% error")
plot(seq(1.6,42.2,by=1.4), predict(fitRaces, data.frame(VO2max = 52,
Distance = seq(1.6,42.2,by=1.4))), xlab="Distance, km", ylab="Time, sec")
points(as.numeric(as.character(userT$Distance)), userT$Seconds, pch=19)
plot(seq(1.6, 42.2, by = 1.4),
     predict(fitRaces, data.frame(VO2max = 52,
                                  Distance = seq(1.6, 42.2, by = 1.4))) /
       seq(1.6, 42.2, by = 1.4),
     xlab = "Distance, km", ylab = "Pace, sec/km")
The last plot shows that pace differs across distances.
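Finally, to answer the question posed earlier, here is a usage sketch of mine (output omitted) chaining the two BRNN fits: estimate VO2max from a 10 km / 40:00 result with fit, then feed it into fitRaces for the 15 km prediction.
vo2 <- predict(fit, data.frame(Distance = 10, Seconds = 2400))  # VO2max from 10 km in 40:00
predict(fitRaces, data.frame(VO2max = vo2, Distance = 15))      # predicted 15 km time, seconds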
Bayesian regularized neural networks showed the best performance for both models, VO2max and race predictions. M5-like algorithms come close on the training set but generalize very poorly. Linear models generalize much better but have significant errors. This doesn't mean BRNNs are better than algebraic models, but out of the box, with default values, BRNNs gave an extremely quick and precise model.
The full code of the shiny app is available in my repo: https://github.com/yurkai/DDP-CP
You can find the rHelper app at https://yurkai.shinyapps.io/rHelper. Since shinyapps.io provides limited uptime, it may be inaccessible at times.