Data was obtained from here and saved as a csv file. Since my goal is not to build any serious app, I wanted to play with the caret package and its models. That's why I used this raw data and built models instead of using the well-known closed-form equations.
eqt <- read.csv('equivtable.csv', sep = ';', stringsAsFactors = FALSE)
head(eqt)
## X1.mile X5.km X10.km Half.Marathon Marathon VO2
## 1 00:04:12 00:14:24 00:30:00 01:05:54 02:18:03 73
## 2 00:04:21 00:14:53 00:31:00 01:08:06 02:22:39 70
## 3 00:04:29 00:15:22 00:32:00 01:10:19 02:27:16 67
## 4 00:04:38 00:15:51 00:33:00 01:12:31 02:31:53 65
## 5 00:04:46 00:16:20 00:34:00 01:14:43 02:36:29 63
## 6 00:04:55 00:16:49 00:35:00 01:16:56 02:41:05 61
I need these libraries:
library(tidyr)
library(caret)
library(ggplot2)
library(lubridate)
library(brnn)
Next, I use the tidyr package to transform my data for learning and introduce a new variable, $Seconds. This is my function:
# 1.6|5.0|10.0|21.1|42.2|VO2max --> VO2max|Distance|Time|Seconds
eqtNames <- c("1.6", "5.0", "10.0", "21.1", "42.2", "VO2max")
eqtBrush <- function(dat) {
  names(dat) <- eqtNames
  # wide -> long: one row per (VO2max, Distance) combination
  dat <- gather(dat, Distance, Time, 1:5)
  # prepend lubridate's epoch origin ("1970-01-01") to each "HH:MM:SS"
  # string so parse_date_time returns a POSIXct value
  dat$Time <- parse_date_time(paste(rep(origin, length(dat$Time)),
                                    dat$Time), "Ymd HMS")
  # seconds since the epoch equal the race duration in seconds
  dat$Seconds <- as.numeric(dat$Time)
  dat
}
Now I prepare all the data frames I need. Note that $Distance is kept as a factor for plotting; in eqtTrain it is coerced to a numeric (km).
eqtTrain <- eqtBrush(eqt) # training set
eqtTrain$Distance <- as.numeric(as.character(eqtTrain$Distance)) # in km
eqtGraph <- eqtBrush(eqt[c(1,21,50), ]) # data for plot
userT <- eqtBrush(eqt[11,]) # default values of user variable
head(eqtTrain)
## VO2max Distance Time Seconds
## 1 73 1.6 1970-01-01 00:04:12 252
## 2 70 1.6 1970-01-01 00:04:21 261
## 3 67 1.6 1970-01-01 00:04:29 269
## 4 65 1.6 1970-01-01 00:04:38 278
## 5 63 1.6 1970-01-01 00:04:46 286
## 6 61 1.6 1970-01-01 00:04:55 295
str(eqtGraph)
## 'data.frame': 15 obs. of 4 variables:
## $ VO2max : int 73 40 23 73 40 23 73 40 23 73 ...
## $ Distance: Factor w/ 5 levels "1.6","5.0","10.0",..: 1 1 1 2 2 2 3 3 3 4 ...
## $ Time : POSIXct, format: "1970-01-01 00:04:12" "1970-01-01 00:07:03" ...
## $ Seconds : num 252 423 671 864 1442 ...
I chose 3 values of the VO2max parameter and assigned the related data to runner levels. There's also an orange line in the plot: the user's estimate. It's interactive in shiny and moves up or down depending on the user's data. The plot uses a logarithmic scale, and its breaks were found iteratively.
yBreaks <- c(5*60, 10*60, 18*60, 30*60, 50*60, 80*60,
2*60*60, 3*60*60, 4*60*60, 5*60*60, 6*60*60)
yLabels <- c("5:00", "10:00", "18:00", "30:00", "50:00", "1:20:00",
"2:00:00", "3:00:00", "4:00:00", "5:00:00", "6:00:00")
graph <- ggplot(data = eqtGraph, aes(x = Distance, y = Seconds)) +
  geom_line(aes(group = VO2max, colour = factor(VO2max))) +
  scale_y_log10(breaks = yBreaks, labels = yLabels, limits = c(220, 22000)) +
  geom_point(aes(colour = factor(VO2max))) +
  labs(x = "Distance, km", y = "Time, h:m:s", colour = "VO2max")
graph + geom_point(data = userT, size = 2.7, aes(group = VO2max, colour = "you")) +
  geom_line(data = userT, aes(group = VO2max, colour = "you"), size = 1.3) +
  scale_colour_manual(limits = c("you", levels(factor(eqtGraph$VO2max))),
                      labels = c("you", "newbie", "amateur", "pro"),
                      values = c("orange", 2:4))
If you're close to the runners' world and its math and numbers, you'll notice that these VO2max values were probably produced by Daniels' equation. There are other formulas too. So, if necessary, I can test my model against Daniels' equation. That's what I actually did, but I didn't report it here (maybe later I'll write a separate note about it).
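For reference, here is a minimal sketch of the Daniels & Gilbert formula with its commonly cited coefficients (my transcription, not code from the app): the oxygen cost of running at velocity v, in m/min, divided by the fraction of VO2max sustainable for t minutes.
danielsVO2max <- function(dist_km, seconds) {
  t <- seconds / 60                                # race time, minutes
  v <- dist_km * 1000 / t                          # velocity, m/min
  vo2 <- -4.60 + 0.182258 * v + 0.000104 * v^2     # oxygen cost of the pace
  pct <- 0.8 + 0.1894393 * exp(-0.012778 * t) +
    0.2989558 * exp(-0.1932605 * t)                # sustainable fraction of VO2max
  vo2 / pct
}
danielsVO2max(10, 2400)  # 10 km in 40:00 gives about 52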
Let's start by trying a linear model. I have 2 predictors: distance and time in seconds. As a measure of success, I'll plot the error on the training(!) set in %.
fit <- lm(VO2max ~ Seconds * Distance, data = eqtTrain)
summary(fit)
##
## Call:
## lm(formula = VO2max ~ Seconds * Distance, data = eqtTrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.5852 -6.6653 -0.5484 4.6804 29.7209
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.232e+01 1.211e+00 34.955 < 2e-16 ***
## Seconds -6.502e-03 4.951e-04 -13.131 < 2e-16 ***
## Distance 1.613e+00 1.179e-01 13.677 < 2e-16 ***
## Seconds:Distance 4.549e-05 1.027e-05 4.428 1.43e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.563 on 246 degrees of freedom
## Multiple R-squared: 0.5224, Adjusted R-squared: 0.5165
## F-statistic: 89.68 on 3 and 246 DF, p-value: < 2.2e-16
plot((predict(fit, eqtTrain) - eqtTrain$VO2max)/eqtTrain$VO2max*100, ylab="% error")
Not a great start. Errors reach 80%, even though the t-values and Pr(>|t|) look very promising. Let's try to add some nonlinearity to this model.
fit <- lm(VO2max ~ log10(Seconds) * log10(Distance), data = eqtTrain)
plot((predict(fit, eqtTrain) - eqtTrain$VO2max)/eqtTrain$VO2max*100, ylab="% error")
Better, closer, warmer. But still not good enough as a prediction. I tried most of the caret package's models with their default settings, and the results were pretty similar.
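A minimal sketch of such a survey, with a few illustrative method names (my picks, not necessarily the exact models I tried):
# loop over a few caret methods with defaults and report
# the worst-case training error in %
for (m in c("lm", "rpart", "knn")) {
  f <- train(VO2max ~ log10(Seconds) * log10(Distance),
             data = eqtTrain, method = m)
  err <- (predict(f, eqtTrain) - eqtTrain$VO2max)/eqtTrain$VO2max*100
  cat(m, "max training error:", round(max(abs(err)), 1), "%\n")
}
Sometimes, though, the training error plot looks fantastic! M5, for example: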
fit <- train(VO2max ~ log10(Seconds) * log10(Distance),
data = eqtTrain, method = "M5")
plot((predict(fit, eqtTrain) - eqtTrain$VO2max)/eqtTrain$VO2max*100, ylab="% error")
But tree-like models don't generalize well (I'll also show this later). You don't even need Daniels' equation to prove it. The next two values show that the VO2max estimates are equal for distances of 15 and 21 kilometers run in the same time, which is completely incorrect. Obviously, if you ran a half marathon while I ran just 15 km in the same time, your VO2max is higher.
predict(fit, data.frame(Distance = 15, Seconds = 4500))
## [1] 62.93563
predict(fit, data.frame(Distance = 21, Seconds = 4500))
## [1] 62.93563
Now it's neural network time. There's an opinion that BRNNs, Bayesian regularized neural networks, generalize better than neural networks with traditional learning algorithms (such as back propagation, etc.). In my case, a BRNN with 5 neurons in the hidden layer gave a brilliant result: it learned extremely fast, had no problems with generalization, and shows just 1-2% error. It suits me perfectly, and I use it in my shiny app.
set.seed(19)
fit <- brnn(VO2max ~ log10(Seconds) * log10(Distance), data = eqtTrain,
neurons = 5)
## Number of parameters (weights and biases) to estimate: 25
## Nguyen-Widrow method
## Scaling factor= 0.704521
## gamma= 14.7369 alpha= 0.2453 beta= 4051.137
plot((predict(fit, eqtTrain) - eqtTrain$VO2max)/eqtTrain$VO2max*100, ylab="% error")
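As a quick sanity check (my suggestion; output omitted), the two queries that fooled the M5 tree can be repeated here. The BRNN should return different VO2max estimates for the two distances:
predict(fit, data.frame(Distance = 15, Seconds = 4500))
predict(fit, data.frame(Distance = 21, Seconds = 4500))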
The idea is to use the VO2max value to obtain predicted times for distances not included in the original table. For example, if my best result is 10 km in 40 minutes, what's my time for 15 km? I can't just use a proportion (which would give exactly 60 minutes), because it's not possible to keep my 10 km pace over a longer distance. I use the same training set.
First, a linear model gave a pretty good result:
fitRaces <- lm(Seconds ~ log10(VO2max) * Distance, data = eqtTrain)
plot((predict(fitRaces, eqtTrain) - eqtTrain$Seconds)/eqtTrain$Seconds*100, ylab="% error")
I tried the cubist method, which is one of the M5 variations. It produces low error on the training set, but its generalization is very poor; look at the second plot.
fitRaces <- train(Seconds ~ log10(VO2max) * Distance,
data = eqtTrain, method = "cubist")
plot((predict(fitRaces, eqtTrain) - eqtTrain$Seconds)/eqtTrain$Seconds*100, ylab="% error")
plot(seq(1.6,42.2,by=1.4), predict(fitRaces, data.frame(VO2max = 52,
Distance = seq(1.6,42.2,by=1.4))), xlab="Distance, km", ylab="Time, sec")
Using BRNNs again, I made two plots to show what the last two plots should look like. I used 7 neurons in the hidden layer this time. The black dots were taken from the original table.
set.seed(1902)
fitRaces <- brnn(Seconds ~ log10(VO2max) * Distance, data = eqtTrain,
neurons = 7)
## Number of parameters (weights and biases) to estimate: 35
## Nguyen-Widrow method
## Scaling factor= 0.7054698
## gamma= 13.9638 alpha= 0.7288 beta= 13959.01
plot((predict(fitRaces, eqtTrain) - eqtTrain$Seconds)/eqtTrain$Seconds*100, ylab="% error")
plot(seq(1.6,42.2,by=1.4), predict(fitRaces, data.frame(VO2max = 52,
Distance = seq(1.6,42.2,by=1.4))), xlab="Distance, km", ylab="Time, sec")
points(as.numeric(as.character(userT$Distance)), userT$Seconds, pch=19)
plot(seq(1.6, 42.2, by = 1.4),
     predict(fitRaces, data.frame(VO2max = 52,
                                  Distance = seq(1.6, 42.2, by = 1.4))) /
       seq(1.6, 42.2, by = 1.4),
     xlab = "Distance, km", ylab = "Pace, sec/km")
The last plot shows that pace differs across distances.
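Finally, to answer the question posed earlier, here is a usage sketch of mine (output omitted) chaining the two BRNN fits: estimate VO2max from a 10 km / 40:00 result with fit, then feed it into fitRaces for the 15 km prediction.
vo2 <- predict(fit, data.frame(Distance = 10, Seconds = 2400))  # VO2max from 10 km in 40:00
predict(fitRaces, data.frame(VO2max = vo2, Distance = 15))      # predicted 15 km time, seconds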
Bayesian regularized neural networks showed the best performance for both models, VO2max and race predictions. M5-like algorithms come close on the training set but generalize very poorly. Linear models generalize much better but have significant errors. This doesn't mean BRNNs are better than algebraic models, but out of the box, with default values, BRNNs gave an extremely quick and precise model.
The full code of the shiny app is available in my repo: https://github.com/yurkai/DDP-CP
You can find the rHelper app at https://yurkai.shinyapps.io/rHelper. Since shinyapps.io provides limited uptime, it may be inaccessible at times.