Intro

Strava.com, a social networking website for endurance athletes, offers a set of metrics to its premium users. We will reverse engineer its Fitness metric, a score that changes over time with the athlete's level of activity. It appears to be built on a per-workout metric called Relative Effort, previously named “Suffer Score.”

Fitness

The graph showing Fitness Score over time includes a metric called “Training Impulse,” which is consistently 1.3x the Suffer Score for that day's workout. It's unclear why two such closely related metrics have different names, or why they are scaled differently, but for our purposes this does not matter. Suffer Score seems to be calculated from the amount of time spent in each HR zone.

Suffer Score

We will start by scraping the page listing each of my workouts (120 in total), pulling the time spent in each HR zone and the Suffer Score for each workout. From here we can fit a linear model of Suffer Score on the time spent in each HR zone. If this relationship is in fact linear, we can then study the relationship between the Fitness metric and the per-workout Suffer Score over time by way of a second linear model.

WebScrape HR & Suffer Score

Pull Workout URLS

Go to the training page and scrape workout URLs from 6 pages (20 workouts per page, so 120 workouts, minus those without HR data).

## [1] "\nchris bloome\n" "chris bloome"
## Class method definition for method refresh()
## function () 
## {
##     "Reload the current page."
##     qpath <- sprintf("%s/session/%s/refresh", serverURL, sessionInfo[["id"]])
##     queryRD(qpath, "POST")
## }
## <environment: 0x00000000193121e8>
## 
## Methods used: 
##      "queryRD"
## [1] "656 Activities"
## [1] "21-40 of 656\n<U+2190>\n  \n  <U+2192>\n  \n"
## [1] "41-60 of 656\n<U+2190>\n  \n  <U+2192>\n  \n"
## [1] "61-80 of 656\n<U+2190>\n  \n  <U+2192>\n  \n"
## [1] "81-100 of 656\n<U+2190>\n  \n  <U+2192>\n  \n"
## [1] "101-120 of 656\n<U+2190>\n  \n  <U+2192>\n  \n"
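
The pagination labels printed above ("21-40 of 656", and so on) are enough to decide when to stop paging. The sketch below only parses such a label; it does not touch Strava or a browser session. The function name `parse_page_label` and the variable `pages_wanted` are hypothetical, introduced for illustration, and the label string is copied from the output above.

```r
# Hypothetical helper: pull "first-last of total" out of a pagination label.
parse_page_label <- function(label) {
  m <- regmatches(label, regexec("^([0-9]+)-([0-9]+) of ([0-9]+)", label))[[1]]
  if (length(m) == 0) stop("unrecognised pagination label: ", label)
  setNames(as.integer(m[-1]), c("first", "last", "total"))
}

# Label text as it appears in the scrape output above.
label <- "21-40 of 656\n<U+2190>\n  \n  <U+2192>\n  \n"
page <- parse_page_label(label)

pages_wanted <- 6                                   # the post scrapes 6 pages
per_page <- page[["last"]] - page[["first"]] + 1    # workouts shown per page

# Keep paging until either the site or our own page limit runs out.
has_next <- page[["last"]] < min(page[["total"]], pages_wanted * per_page)
```

Stopping on the parsed label rather than on a fixed loop count means the scrape also ends cleanly for an account with fewer than 120 activities.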

Linear Model - HR to Suffer Score

## 
## Call:
## lm(formula = Suffer_Score ~ 0 + Z1 + Z2 + Z3 + Z4 + Z5, data = Strava_Table3)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.1065  -3.9876  -0.2565   1.6473  15.0685 
## 
## Coefficients:
##     Estimate Std. Error t value Pr(>|t|)    
## Z1 0.0013937  0.0004927   2.829  0.00559 ** 
## Z2 0.0072173  0.0004814  14.992  < 2e-16 ***
## Z3 0.0360673  0.0007313  49.320  < 2e-16 ***
## Z4 0.0560018  0.0012160  46.054  < 2e-16 ***
## Z5 0.0629874  0.0055289  11.392  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.213 on 106 degrees of freedom
##   (9 observations deleted due to missingness)
## Multiple R-squared:  0.9966, Adjusted R-squared:  0.9964 
## F-statistic:  6138 on 5 and 106 DF,  p-value: < 2.2e-16

##          Z1 
## 0.001393704
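
The scraped table itself is not reproduced in this post, so here is a minimal sketch of the same `lm()` call on synthetic data. Everything in `toy` is made up: the zone times are random draws and `true_w` is an arbitrary set of weights chosen for the illustration, not Strava's. The point is only to show that a no-intercept fit recovers per-zone weights even when the response has been rounded to an integer.

```r
set.seed(42)
n <- 100

# Synthetic time-in-zone columns; NOT the scraped workout data.
toy <- data.frame(
  Z1 = runif(n, 0, 1800),
  Z2 = runif(n, 0, 1800),
  Z3 = runif(n, 0, 1200),
  Z4 = runif(n, 0, 600),
  Z5 = runif(n, 0, 120)
)

# Weights picked for this illustration only.
true_w <- c(Z1 = 0.001, Z2 = 0.007, Z3 = 0.036, Z4 = 0.056, Z5 = 0.063)

# Rounded weighted sum, mimicking an integer Suffer Score.
toy$Suffer_Score <- round(as.numeric(as.matrix(toy[names(true_w)]) %*% true_w))

# "0 +" drops the intercept: no time in any zone should give a score of zero.
fit <- lm(Suffer_Score ~ 0 + Z1 + Z2 + Z3 + Z4 + Z5, data = toy)
round(coef(fit), 4)
```

Dropping the intercept encodes the assumption that a workout with no recorded time in any zone scores zero.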

With an \(R^2\) of .99 this is exactly as expected, though R computes \(R^2\) for a model without an intercept against zero rather than against the mean, which tends to inflate it. Suffer Score can be calculated by \[Z_1*.001+Z_2*.007+Z_3*.036+Z_4*.056+Z_5*.063\] where \(Z_i\) is the time spent in HR zone \(i\). That being said, each Suffer Score is rounded, so there is some variation in our model. It is probable that a trim or rounding function in the Strava data is creating some noise in our data. I would wager that the coefficient for Zone 3 is actually .033, Zone 4 is .05 and Zone 5 is .066.
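
As a worked example, the fitted coefficients from the model output can be wrapped in a small function. This is a sketch, not Strava's calculation: `estimate_suffer_score` is a hypothetical name, the example workout is invented, and the zone times are assumed to be in the same unit as the scraped data (the size of the coefficients suggests seconds, but the unit is not stated here).

```r
# Coefficients copied from the lm() output above.
zone_weights <- c(Z1 = 0.0013937, Z2 = 0.0072173, Z3 = 0.0360673,
                  Z4 = 0.0560018, Z5 = 0.0629874)

# Hypothetical helper: weighted sum of time in zone, rounded like the site's score.
estimate_suffer_score <- function(zone_time, weights = zone_weights) {
  stopifnot(length(zone_time) == length(weights))
  round(sum(zone_time * weights))
}

# Invented workout: 10, 20, 15, 5 and 1 minutes in zones 1 to 5, in seconds.
workout <- c(Z1 = 600, Z2 = 1200, Z3 = 900, Z4 = 300, Z5 = 60)
estimate_suffer_score(workout)  # 63
```

With a residual standard error of about 5.2, an estimate like this can easily be several points away from the score Strava reports.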

Fitness Data Aggregation

The webpage hosting Fitness over time is more challenging to scrape than other pages on Strava. The data is only visible when the mouse hovers over the graph, and these metrics are not hosted elsewhere on the site. That being said, we do not need all that many observations to come to a conclusion on the relationship between Suffer Score and Fitness. Let's manually pull 35 observations of Fitness from 3 different segments over the last year, including July, when I was playing rugby without my heart rate monitor, and April, when I was running regularly and returning to form.

Of note - when digging into the above, I noticed that all my data from the late fall/early winter was off by one day. Workouts were listed one day off on the Fitness over time page when compared to the date listed on the workout itself. I avoided these dates altogether when modeling to keep things simple.
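
Once the daily values are in hand, they need to be paired up as "yesterday" and "today" before modeling. The sketch below shows one way to build such a table in base R. The numbers in `daily` are placeholders, not values read from Strava, and pairing each transition with the Suffer Score of the later day is an assumption. The column names follow the model call in the next section.

```r
# Placeholder daily series with a one-day gap; values are illustrative only.
daily <- data.frame(
  date         = as.Date("2000-01-01") + c(0, 1, 2, 4, 5),
  fitness      = c(40, 41, 40, 42, 41),
  Suffer_Score = c(10, 60, 0, 95, 0)
)

n <- nrow(daily)

# One row per day-to-day transition: yesterday's fitness, today's fitness,
# and (by assumption) today's Suffer Score.
combo <- data.frame(
  date         = daily$date[-1],
  starting_fit = daily$fitness[-n],
  ending_fit   = daily$fitness[-1],
  Suffer_Score = daily$Suffer_Score[-1]
)

# Drop transitions that span more than one day, since the segments
# pulled by hand are not contiguous.
combo <- combo[as.numeric(diff(daily$date)) == 1, ]
```

Filtering on consecutive dates also gives a natural place to exclude periods where the dates are known to be unreliable.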

Linear Model - Suffer Score to Fitness

## 
## Call:
## lm(formula = ending_fit ~ 0 + starting_fit + Suffer_Score, data = Combo_Table)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.93494 -0.05920  0.01467  0.17748  0.75998 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## starting_fit 0.9763320  0.0014479   674.3   <2e-16 ***
## Suffer_Score 0.0311109  0.0007701    40.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3991 on 30 degrees of freedom
## Multiple R-squared:  0.9999, Adjusted R-squared:  0.9999 
## F-statistic: 2.513e+05 on 2 and 30 DF,  p-value: < 2.2e-16

With a .9999 \(R^2\) and residuals between -1 and 1, this model also seems to be dead-on. We can say with some confidence that Fitness is calculated by \[FitnessToday = .976*FitnessYesterday + .031*SufferScore\]

Again, because both Fitness and Suffer Score are rounded, there is some variation here. It is probable that these coefficients are .975 and .03 respectively.
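
To see what this recurrence implies, here is a minimal sketch that applies it day by day. The function names `update_fitness` and `project_fitness` are hypothetical, the defaults are the fitted coefficients rounded to three decimals, and the starting value and Suffer Scores in the example are invented.

```r
# One day of the fitted recurrence: decay yesterday's fitness, add today's effort.
update_fitness <- function(fitness_yesterday, suffer_score,
                           decay = 0.976, gain = 0.031) {
  decay * fitness_yesterday + gain * suffer_score
}

# Apply the update across a vector of daily Suffer Scores (0 on rest days).
# Returns the starting value followed by one fitness value per day.
project_fitness <- function(start_fitness, suffer_scores,
                            decay = 0.976, gain = 0.031) {
  Reduce(
    function(fit, score) update_fitness(fit, score, decay, gain),
    suffer_scores, accumulate = TRUE, init = start_fitness
  )
}

# Invented example: one hard workout followed by two rest days.
path <- project_fitness(50, c(100, 0, 0))

# Days of complete rest needed for fitness to halve under this model.
half_life <- log(0.5) / log(0.976)
```

Under these coefficients, a stretch of complete rest halves Fitness in roughly four weeks. Holding Suffer Score constant at \(S\), the recurrence settles at \(.031*S/(1-.976)\), or about 1.29 times \(S\), which is close to the 1.3x Training Impulse scaling noted earlier; whether that is by design is not something this data can confirm.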