Exploration of NordicTrack S22i Data

NordicTrack S22i

The NordicTrack S22i studio cycle offers a variety of pre-recorded workouts for its users. The summary and detail information for these workouts are available on the iFit website. Users can download information in the form of a TCX or CSV.

Data Formats

The two files look quite different, and the TCX is missing useful information that the CSV includes. The information available in each file compares as follows:

Information Comparison

Category	CSV	TCX
Summary		\(\checkmark\)
Absolute Time		\(\checkmark\)
Relative Time	\(\checkmark\)
Distance	\(\checkmark\) (Miles)	\(\checkmark\) (meters)
Speed	\(\checkmark\) (MPH)
Cadence	\(\checkmark\) (RPM)	\(\checkmark\)
Power	\(\checkmark\) (Watts)
Calories		\(\checkmark\)
Heart Rate	\(\checkmark\)	\(\checkmark\)
Resistance	\(\checkmark\)
Relative Resistance	\(\checkmark\)
Incline	\(\checkmark\)
Altitude		\(\checkmark\) (usually, meters)

Obviously, Relative Time and Speed can be derived from the TCX file. Mostly, so can the incline (when the altitude information is present).
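As a rough illustration of those derivations, here is a minimal base R sketch. The three trackpoints are made up, the column names simply mirror the TCX tags, and treating rise over run as a percent grade is an assumption (it may not match the unit iFit uses for Incline).

```r
# Toy trackpoints; the values are made up for illustration only
tp <- data.frame(
  Time = as.POSIXct(c("2020-10-12 11:29:09", "2020-10-12 11:29:10",
                      "2020-10-12 11:29:11"), tz = "UTC"),
  DistanceMeters = c(0, 5, 10),
  AltitudeMeters = c(0.00, 0.10, 0.20)
)

# Relative time: seconds since the first trackpoint
tp$relSec <- as.numeric(difftime(tp$Time, tp$Time[1], units = "secs"))

# Speed: change in distance over change in time, converted from m/s to MPH
dDist <- c(NA, diff(tp$DistanceMeters))
dSec  <- c(NA, diff(tp$relSec))
tp$MPH <- dDist / dSec * 2.23694

# Grade (%): rise over run between consecutive trackpoints (assumed unit)
tp$gradePct <- c(NA, diff(tp$AltitudeMeters)) / dDist * 100
```

The first row of each derived column is NA because there is no previous trackpoint to difference against.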

Format Details

TCX

Here is an example of a recent ride of mine. Indenting added to make it easier to see the track points:

<?xml version="1.0"?>
<TrainingCenterDatabase xmlns:ns2="http://www.garmin.com/xmlschemas/UserProfile/v2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:ns4="http://www.garmin.com/xmlschemas/ProfileExtension/v1" xmlns:ns5="http://www.garmin.com/xmlschemas/ActivityGoals/v1" xmlns:tpx="http://www.garmin.com/xmlschemas/ActivityExtension/v2" xmlns="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2" xsi:schemaLocation="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2 http://www.garmin.com/xmlschemas/TrainingCenterDatabasev2.xsd">
  <Activities>
    <Activity Sport="Other">
      <Notes>Bridal Veil Falls Threshold Ride, Telluride, Colorado</Notes>
      <Id>2020-10-12T11:29:09.223Z</Id>
      <Lap StartTime="2020-10-12T11:29:09.223Z">
        <TotalTimeSeconds>2276</TotalTimeSeconds>
        <DistanceMeters>6649.066633846156</DistanceMeters>
        <MaximumSpeed>11.011</MaximumSpeed>
        <Calories>553.826</Calories>
        <AverageHeartRateBpm><Value>56.896657871591906</Value></AverageHeartRateBpm>
        <MaximumHeartRateBpm><Value>168</Value></MaximumHeartRateBpm>
        <Intensity>Active</Intensity>
        <TriggerMethod>Manual</TriggerMethod>
        <Track>
          <Trackpoint>
            <DistanceMeters>0</DistanceMeters>
            <Cadence>74</Cadence>
            <Calories>0</Calories>
            <HeartRateBpm xsi:type="HeartRateInBeatsPerMinute_t"><Value>0</Value></HeartRateBpm>
            <Time>2020-10-12T11:29:09.223Z</Time>
            <AltitudeMeters>-0.02</AltitudeMeters>  <!-- Note that this is not always present -->
          </Trackpoint>
          ...
        </Track>
        <Extensions>
          <LX xmlns="http://www.garmin.com/xmlschemas/ActivityExtension/v2">
            <AvgSpeed>10.146984601847887</AvgSpeed>
          </LX>
        </Extensions>
      </Lap>
    </Activity>
  </Activities>
  <Author xsi:type="Application_t">
    <Name>iFit.com</Name>
    <Build>
      <Version>
        <VersionMajor>20</VersionMajor>
        <VersionMinor>10</VersionMinor>
        <BuildMajor>08</BuildMajor>
        <BuildMinor>23</BuildMinor>
      </Version>
    </Build>
    <LangID>EN</LangID>
  </Author>
</TrainingCenterDatabase>

CSV

These are the first four rows of the CSV. There is no end-of-data marker; the rows simply stop at the end of the file.

Stages_Data,,,,,
English,,,,,
Time,Miles,MPH,Watts,HR,RPM,Resistance,Relative Resistance,Incline
00:00,0.0000,13.20,73,0,74,10,42,0
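A minimal sketch of reading that layout with base R, using the four sample rows above as inline text. It assumes the Time column is always MM:SS; the variable names are illustrative.

```r
# The four sample rows shown above, pasted in as text
csv_text <- "Stages_Data,,,,,
English,,,,,
Time,Miles,MPH,Watts,HR,RPM,Resistance,Relative Resistance,Incline
00:00,0.0000,13.20,73,0,74,10,42,0"

# Skip the two title lines so that the third line becomes the header
ride <- read.csv(text = csv_text, skip = 2, stringsAsFactors = FALSE)

# Convert "MM:SS" to seconds from the start of the ride
mmss <- strsplit(ride$Time, ":", fixed = TRUE)
ride$tmFromStartSec <- vapply(
  mmss,
  function(p) as.numeric(p[1]) * 60 + as.numeric(p[2]),
  numeric(1)
)
```

Note that read.csv() rewrites the header "Relative Resistance" as Relative.Resistance, which is the name that appears later in the str() output.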

Data Exploration

This review will use R to explore the data. I focused on the CSV format because it is simpler to import and because the TCX is missing the resistance and incline information (though I realized after the fact that the latter could have been derived). I will show the code that was used, in case it is helpful for you.

Import and prep data

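# Packages assumed by the code in this post
library(data.table)  # data.table(), rbindlist(), melt()
library(zoo)         # rollmean()
library(ggplot2)     # plotting
library(GGally)      # ggpairs()
library(gridExtra)   # grid.arrange()

# Import data from iFit
chr.location <- '~/Dropbox/Fitness/iFitWorkouts/Rides/'
# list.files() takes a regular expression, not a glob
vec.workouts <- list.files(path = chr.location, pattern = '\\.csv$')

dt.s22 <- rbindlist(lapply(vec.workouts, function(fl){
    # Skip the two title lines ("Stages_Data" and "English") above the column names
    dt.temp <- data.table(read.csv(paste0(chr.location, fl), skip = 2, as.is = TRUE))
    dt.temp[, source := substr(fl, 1, nchar(fl)-4)]  # file name without ".csv"
    return(dt.temp)
}))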

dt.s22[, `:=` (Miles = as.numeric(Miles), MPH = as.numeric(MPH), Incline = as.numeric(Incline))]
dt.s22[, meanPower := rollmean(Watts, 1200, fill = NA, align = "right")]
dt.s22[, tm := as.numeric(substr(Time, 1, 2)) + as.numeric(substr(Time, 4, 5))/60]
dt.s22[, tmFromStartSec := as.numeric(substr(Time, 1, 2)) * 60 + as.numeric(substr(Time, 4, 5))]
dt.s22[, resistanceChange := 'steady']
dt.s22[, prevResistance := c(NA, Resistance[1:(nrow(dt.s22)-1)])]
dt.s22[, nextResistance := c(Resistance[2:nrow(dt.s22)] , NA)]
dt.s22[prevResistance != Resistance | Resistance != nextResistance, resistanceChange := 'change']
dt.s22[, lastWatts := c(NA, Watts[1:(nrow(dt.s22)-1)])]
dt.s22[abs(lastWatts - Watts) > 20, resistanceChange := 'change']
dt.s22[, finalTime := max(tm), by = source]
dt.s22[, perComplete := tm / as.numeric(finalTime)]

From here, we can look at the format of all of our data:

str(dt.s22)

## Classes 'data.table' and 'data.frame':   27406 obs. of  19 variables:
##  $ Time               : chr  "00:00" "00:01" "00:02" "00:03" ...
##  $ Miles              : num  0 0 0.0085 0.0171 0.0256 0.0342 0.0391 0.0441 0.0485 0.0578 ...
##  $ MPH                : num  18 18 19.4 20.8 22.2 ...
##  $ Watts              : int  155 155 52 104 156 208 192 182 215 208 ...
##  $ HR                 : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ RPM                : int  109 109 107 104 102 99 94 91 91 89 ...
##  $ Resistance         : int  9 9 9 9 9 9 9 9 17 17 ...
##  $ Relative.Resistance: int  38 38 38 38 38 38 38 38 71 71 ...
##  $ Incline            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ source             : chr  "2020_08_29_13_08_Chamonix_Ride_Part_3,_France" "2020_08_29_13_08_Chamonix_Ride_Part_3,_France" "2020_08_29_13_08_Chamonix_Ride_Part_3,_France" "2020_08_29_13_08_Chamonix_Ride_Part_3,_France" ...
##  $ meanPower          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ tm                 : num  0 0.0167 0.0333 0.05 0.0667 ...
##  $ tmFromStartSec     : num  0 1 2 3 4 5 6 7 8 9 ...
##  $ resistanceChange   : chr  "steady" "steady" "change" "change" ...
##  $ prevResistance     : int  NA 9 9 9 9 9 9 9 9 17 ...
##  $ nextResistance     : int  9 9 9 9 9 9 9 17 17 18 ...
##  $ lastWatts          : int  NA 155 155 52 104 156 208 192 182 215 ...
##  $ finalTime          : num  29.5 29.5 29.5 29.5 29.5 ...
##  $ perComplete        : num  0 0.000565 0.001131 0.001696 0.002261 ...
##  - attr(*, ".internal.selfref")=<externalptr>

And we can quickly look at some of the key variables and their relationships with each other:

ggpairs(dt.s22[, .(MPH, Watts, RPM, Resistance)]) +
  theme_nicely() +
  ggtitle("Initial Data Exploration") +
  theme(plot.title = element_text(hjust = 0.5))

Power Versus Speed

How realistic is the speed that NordicTrack is calculating versus the power that I’m putting into the pedals? Insofar as the power readings are accurate, we can check against a site like Bike Calculator. On that site, I entered the same weight I have on file at iFit, 160 pounds, and then the following:

Input	Value
Bike Weight (lbs)	0 or 20
Grade (Incline)	0%
Headwind	0
Temperature	70°F
Elevation	328 ft

I recorded the speeds the site returned for a range of power values:

# Data from http://bikecalculator.com/index.html with a rider weight of 160, bicycle weight of 0 or 20, 0% grade, 0 headwind, 70°F and 328 ft elevation
dt.truePower <- rbind(data.table(bikeWeight = 0,
                           Watts = c(0, 10, 25, 50, 75, 10*(10:30)),
                           MPH = c(0, 4.66, 8.09, 11.29, 13.44, 15.11, 15.70, 16.25, 16.77, 17.25, 17.72, 18.16, 18.59, 18.99, 19.38, 19.76, 20.12, 20.48, 20.82, 21.15, 21.47, 21.78, 22.08, 22.38, 22.67, 22.95)),
                      data.table(bikeWeight = 20,
                           Watts = c(0, 10, 25, 50, 75, 10*(10:30)),
                           MPH = c(0, 4.36, 7.81, 11.05, 13.23, 14.92, 15.52, 16.07, 16.59, 17.08, 17.55, 18.00, 18.42, 18.83, 19.23, 19.61, 19.97, 20.33, 20.67, 21.00, 21.33, 21.64, 21.95, 22.24, 22.53, 22.82)))

and then graphed the results:

ggplot() +
    geom_hline(yintercept = 0) + theme_nicely() +
    geom_point(data = dt.s22[RPM >= 55 & Incline == 0 & resistanceChange == 'steady'], 
               mapping = aes(y = MPH, x = Watts, col = as.factor(Resistance))) +
    geom_line(data = dt.truePower[bikeWeight == 0], mapping = aes(y = MPH, x = Watts)) +
    geom_line(data = dt.truePower[bikeWeight == 20], mapping = aes(y = MPH, x = Watts), col = 'grey') +
    scale_y_continuous(name = "Speed (MPH)") +
    scale_x_continuous(name = "Power Output (Watts)") +
    scale_color_discrete(name = "Resistance Level") +
    ggtitle("Speed as a Function of Power Input") +
    theme(plot.title = element_text(hjust = 0.5)) +
    theme(legend.position = c(0.8, 0.3))

It seems pretty clear that the S22i is over-estimating speed for a given power input. I guess they want you to feel good about yourself. Also, a nice way to “cheat” on your Garmin cycling distance challenges. Not that I would do that …
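To put a number on the gap rather than eyeballing it, one option is to interpolate the reference curve at each recorded wattage. This is only a sketch: referenceMPH() and speedExcess() are hypothetical helper names, and it assumes linear interpolation between the Bike Calculator points is good enough.

```r
# Reference speeds copied from the bikeWeight = 20 series above
refWatts <- c(0, 10, 25, 50, 75, 10 * (10:30))
refMPH   <- c(0, 4.36, 7.81, 11.05, 13.23, 14.92, 15.52, 16.07, 16.59, 17.08,
              17.55, 18.00, 18.42, 18.83, 19.23, 19.61, 19.97, 20.33, 20.67,
              21.00, 21.33, 21.64, 21.95, 22.24, 22.53, 22.82)

# Hypothetical helper: linearly interpolate the reference speed at any wattage
# (returns NA outside the 0 to 300 W range covered by the table)
referenceMPH <- function(watts) approx(x = refWatts, y = refMPH, xout = watts)$y

# Hypothetical helper: how many MPH the bike reports above the reference
speedExcess <- function(watts, reportedMPH) reportedMPH - referenceMPH(watts)
```

With the ride data loaded, something like speedExcess(dt.s22$Watts, dt.s22$MPH) would give the per-row difference, which could then be summarized or plotted.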

How is Speed calculated?

Now that we can see that power is probably not related to speed correctly, it would be interesting to see how they are calculating speed. To do this, we should probably be aware of some of the data limitations that we have using the CSV file. I will go back to our comparison of the TCX and the CSV, but here focus on the data format:

Category	CSV	TCX
Time	second	millisecond
Distance	Miles (ten thousandth)	meter (integer)
Speed	MPH (hundredth)	Derivable
Cadence	RPM (integer)	RPM (integer)
Power	Watt (integer)	Derivable (from calories)?
Calories	Derivable (from Power)?	Calorie (thousandth)
Heart Rate	BPM (integer)	BPM (integer)
Resistance	Value (integer)	NA
Relative Resistance	Value (integer)	NA
Incline	Degrees (1/2 degree)	NA
Altitude	Relative altitude (Derivable)	Meter (hundredth)

I’m pretty sure that power is a possible entry in the Garmin TCX format, so I’m not sure why they don’t include it. I’ve tried importing the file into Strava to see if they derive it, but Strava chokes on the file type (despite it being TCX, which is supposed to be compatible). IDK.

Further Data Exploration: CSV

Quirky Integers

The first thing to note from our graph above is that there appear to be stripes in the power data. Let’s explore that variable some more by looking at the probability density function of the data:

ggplot(data = dt.s22, mapping = aes(x = Watts)) +
  geom_hline(yintercept = 0) + theme_nicely() +
  ggtitle("Distribution of Power") +
  theme(plot.title = element_text(hjust = 0.5), axis.text.y=element_blank(), axis.ticks.y=element_blank()) +
  scale_y_continuous(name = 'Probability Density Function') +
  geom_density(col = 'steelblue2')

Looks like a pretty reasonable PDF for an integer variable, but let’s dig deeper (given the striping seen above):

ggplot(data = dt.s22, mapping = aes(x = Watts)) +
  geom_hline(yintercept = 0) + theme_nicely() +
  ggtitle("Histogram of Power") +
  scale_y_continuous(name = 'Count') +
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_histogram(binwidth = 1, fill = 'steelblue2')

Weird, it seems. Not all of the integers have values. Really? Let’s zoom in to see more:

ggplot(data = dt.s22, mapping = aes(x = Watts)) +
  geom_hline(yintercept = 0) + theme_nicely() +
  geom_histogram(binwidth = 1, fill = 'steelblue2') +
  ggtitle("Zoom in of Power Histogram") +
  scale_y_continuous(name = 'Count') +
  theme(plot.title = element_text(hjust = 0.5)) +
  coord_cartesian(xlim = c(180, 220))
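The gaps can also be listed directly instead of read off a histogram. A small sketch, using only the first ten Watts readings from the str() output above as toy input:

```r
# First ten Watts readings from the str() output above
watts <- c(155, 155, 52, 104, 156, 208, 192, 182, 215, 208)

# Every integer inside the observed range
fullRange <- seq(min(watts), max(watts))

# Integers inside the observed range that never occur
missingWatts <- setdiff(fullRange, watts)

# Share of the range that is never used
shareMissing <- length(missingWatts) / length(fullRange)

# Spacing between the distinct values that do occur
gaps <- diff(sort(unique(watts)))
```

On the full data set the same idea would be setdiff(seq(min(dt.s22$Watts), max(dt.s22$Watts)), dt.s22$Watts); ten rows are obviously far too few to draw conclusions from.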

At the end of the day, we have the data we have. I just find it odd that NordicTrack’s algorithms come up with numbers in this way. Maybe a look-up table? Not sure, but their hardware is pretty crappy, so maybe they are trying to reduce CPU load. Even so, I would have expected a more continuous function. I hope that this doesn’t mess up our data exploration too badly.

Noisy Data

Next thing to note is that there appears to be some noise in the data. To help remove at least some of the noise, we are going to impose a couple of restrictions on any data that we put into our Speed model. First, we want to make sure that incline doesn’t play any role in what we are looking at. So we will restrict all input rows to Incline == 0. In addition, we don’t want to use rows where the Resistance or the power (Watts) was changing, so above I calculated the following:

dt.s22[, resistanceChange := 'steady']
dt.s22[, prevResistance := c(NA, Resistance[1:(nrow(dt.s22)-1)])]
dt.s22[, nextResistance := c(Resistance[2:nrow(dt.s22)] , NA)]
dt.s22[prevResistance != Resistance | Resistance != nextResistance, resistanceChange := 'change']
dt.s22[, lastWatts := c(NA, Watts[1:(nrow(dt.s22)-1)])]
dt.s22[abs(lastWatts - Watts) > 20, resistanceChange := 'change']

So, we will include a restriction on resistanceChange == 'steady' in our models.
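One caveat with the code above: the lags run over the whole stacked table, so the first row of each workout is compared against the last row of the previous one. A possible variant that lags within each workout is sketched below in base R on made-up data; lagBy() and leadBy() are hypothetical helpers (data.table users could reach for shift() with by = source instead).

```r
# Toy data: two workouts stacked in one table (values are illustrative)
toy <- data.frame(
  source     = c("a", "a", "a", "a", "b", "b", "b"),
  Resistance = c(9, 9, 17, 17, 10, 10, 10),
  Watts      = c(150, 152, 210, 212, 90, 91, 140)
)

# Previous/next value within each workout (NA at the workout's edges)
lagBy  <- function(x, g) ave(x, g, FUN = function(v) c(NA, head(v, -1)))
leadBy <- function(x, g) ave(x, g, FUN = function(v) c(tail(v, -1), NA))

prevR <- lagBy(toy$Resistance, toy$source)
nextR <- leadBy(toy$Resistance, toy$source)
prevW <- lagBy(toy$Watts, toy$source)

# A row is a 'change' if resistance differs from either neighbour
# or power moved by more than 20 W since the previous row
changed <- (!is.na(prevR) & prevR != toy$Resistance) |
           (!is.na(nextR) & nextR != toy$Resistance) |
           (!is.na(prevW) & abs(prevW - toy$Watts) > 20)
toy$resistanceChange <- ifelse(changed, "change", "steady")
```

In this toy table the first row of workout "b" stays 'steady', whereas a whole-table lag would flag it because the previous workout ended at a different resistance.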

Finally, if you go back up and look at the bottom left hand region of the Power-Speed graph, it appears that there might be some wonky stuff going on over there. Therefore, let’s add another restriction to our model of MPH > 10. For now, we will also include a restriction of source != '2020_09_30_22_09_Manual_Workout' but save the explanation until a bit later.

Model Building

Cadence Only

From our ggpairs plot above, we can see that Cadence (RPM) is highly correlated with Speed (MPH). Let’s first build a model just using that variable and see how we do:

lm.mph <- lm(MPH ~ RPM, data = dt.s22[resistanceChange == 'steady' & Incline == 0 & MPH > 10 & source != '2020_09_30_22_09_Manual_Workout'])
summary(lm.mph)

## 
## Call:
## lm(formula = MPH ~ RPM, data = dt.s22[resistanceChange == "steady" & 
##     Incline == 0 & MPH > 10 & source != "2020_09_30_22_09_Manual_Workout"])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.2936  -0.6426   0.3630   0.8619   2.6966 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.304181   0.119221   -36.1   <2e-16 ***
## RPM          0.308145   0.001412   218.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.107 on 22466 degrees of freedom
## Multiple R-squared:  0.6794, Adjusted R-squared:  0.6794 
## F-statistic: 4.761e+04 on 1 and 22466 DF,  p-value: < 2.2e-16

Not bad, we get an r² of 67.94%. Let’s look to see how the predicted values compare to the actual values:

dt.s22[, predictedMPH := lm.mph[["coefficients"]][1] + lm.mph[["coefficients"]][2] * RPM]

ggplot(data = dt.s22[resistanceChange == 'steady' & Incline == 0 & MPH > 10 & source != '2020_09_30_22_09_Manual_Workout'],
       mapping = aes(x = predictedMPH, y = MPH)) +
  geom_hline(yintercept = 0) + geom_vline(xintercept = 0) + theme_nicely() +
  geom_point(alpha = 0.2, col = 'steelblue2') +
  geom_abline(slope = 1, col = 'grey', lwd = 2) +
  scale_x_continuous(name = "Predicted Speed (MPH)") +
  scale_y_continuous(name = 'Actual Speed (MPH)') +
  ggtitle("Prediction Using Cadence, Only") +
  theme(plot.title = element_text(hjust = 0.5)) +
  annotate("text", x = 5, y = 20, label = paste0("italic(R) ^ 2 == ", round(summary(lm.mph)$adj.r.squared,4)), parse = TRUE)

Not too shabby, but can we do better?
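The prediction above is assembled by hand from the two coefficients. An equivalent route is predict(), which becomes handier once the model has more terms. A minimal sketch on made-up, exactly linear data:

```r
# Toy data with an exactly linear relationship (illustrative values)
toy <- data.frame(RPM = c(60, 70, 80, 90, 100))
toy$MPH <- -4 + 0.3 * toy$RPM

fit <- lm(MPH ~ RPM, data = toy)

# By hand: intercept plus slope times the predictor
byHand <- coef(fit)[1] + coef(fit)[2] * toy$RPM

# With predict(), which handles any number of terms
viaPredict <- predict(fit, newdata = toy)
```

Applied to the real data, this would look something like dt.s22[, predictedMPH := predict(lm.mph, newdata = dt.s22)].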

Power Only

You would think (as my original graph indicated) that Power and Speed would have the best relationship. Let’s explore that single variable using a simple linear model:

lm.mph <- lm(MPH ~ Watts, data = dt.s22[resistanceChange == 'steady' & Incline == 0 & MPH > 10 & source != '2020_09_30_22_09_Manual_Workout'])
summary(lm.mph)

## 
## Call:
## lm(formula = MPH ~ Watts, data = dt.s22[resistanceChange == "steady" & 
##     Incline == 0 & MPH > 10 & source != "2020_09_30_22_09_Manual_Workout"])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4580 -0.2962 -0.0213  0.3231  4.1223 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.106e+01  1.768e-02   625.4   <2e-16 ***
## Watts       5.237e-02  8.596e-05   609.3   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.467 on 22466 degrees of freedom
## Multiple R-squared:  0.9429, Adjusted R-squared:  0.9429 
## F-statistic: 3.712e+05 on 1 and 22466 DF,  p-value: < 2.2e-16

dt.s22[, predictedMPH := lm.mph[["coefficients"]][1] + lm.mph[["coefficients"]][2] * Watts]

ggplot(data = dt.s22[resistanceChange == 'steady' & Incline == 0 & MPH > 10 & source != '2020_09_30_22_09_Manual_Workout'],
       mapping = aes(x = predictedMPH, y = MPH)) +
  geom_hline(yintercept = 0) + geom_vline(xintercept = 0) + theme_nicely() +
  geom_point(alpha = 0.2, col = 'steelblue2') +
  geom_abline(slope = 1, col = 'grey', lwd = 2) +
  scale_x_continuous(name = "Predicted Speed (MPH)") +
  scale_y_continuous(name = 'Actual Speed (MPH)') +
  ggtitle("Prediction Using Power, Only") +
  theme(plot.title = element_text(hjust = 0.5)) +
  annotate("text", x = 5, y = 20, label = paste0("italic(R) ^ 2 == ", round(summary(lm.mph)$adj.r.squared,4)), parse = TRUE)

It’s pretty good, too. But the r² is only 94.29%.

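But, should we be using a linear relationship? The short answer is NO. The somewhat longer answer can be found here. On a flat road at steady speed, most of the rider’s power goes into overcoming air resistance, which grows roughly with the cube of speed, so speed should grow roughly with the cube root of power. Based on that, let’s try a model that uses the cube root of power in place of the linear term.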

dt.s22[, cubeRootWatts := Watts ^ (1/3)]
lm.mph <- lm(MPH ~ cubeRootWatts, data = dt.s22[resistanceChange == 'steady' & Incline == 0 & MPH > 10 & source != '2020_09_30_22_09_Manual_Workout'])
summary(lm.mph)

## 
## Call:
## lm(formula = MPH ~ cubeRootWatts, data = dt.s22[resistanceChange == 
##     "steady" & Incline == 0 & MPH > 10 & source != "2020_09_30_22_09_Manual_Workout"])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.7926 -0.2963  0.0235  0.3145  6.8687 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -5.367374   0.044482  -120.7   <2e-16 ***
## cubeRootWatts  4.624141   0.007592   609.1   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4671 on 22466 degrees of freedom
## Multiple R-squared:  0.9429, Adjusted R-squared:  0.9429 
## F-statistic: 3.71e+05 on 1 and 22466 DF,  p-value: < 2.2e-16

dt.s22[, predictedMPH := lm.mph[["coefficients"]][1] + lm.mph[["coefficients"]][2] * cubeRootWatts]

ggplot(data = dt.s22[resistanceChange == 'steady' & Incline == 0 & MPH > 10 & source != '2020_09_30_22_09_Manual_Workout'],
       mapping = aes(x = predictedMPH, y = MPH)) +
  geom_hline(yintercept = 0) + geom_vline(xintercept = 0) + theme_nicely() +
  geom_point(alpha = 0.2, col = 'steelblue2') +
  geom_abline(slope = 1, col = 'grey', lwd = 2) +
  scale_x_continuous(name = "Predicted Speed (MPH)") +
  scale_y_continuous(name = 'Actual Speed (MPH)') +
  ggtitle("Prediction Using Power with Cube Root Term") +
  theme(plot.title = element_text(hjust = 0.5)) +
  annotate("text", x = 5, y = 20, label = paste0("italic(R) ^ 2 == ", round(summary(lm.mph)$adj.r.squared,4)), parse = TRUE)

It doesn’t help at all: the r² is unchanged at 94.29% and the residual standard error is essentially identical (0.4671 versus 0.467). Weird.
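For intuition on why the cube root was worth trying, here is a small sketch. It assumes a deliberately simplified model in which power goes only into aerodynamic drag (power = k × speed³) and ignores rolling resistance and drivetrain losses; speedFromPower() is a hypothetical helper.

```r
# Simplified model: power = k * speed^3 (aerodynamic drag only; an assumption)
speedFromPower <- function(watts, k = 1) (watts / k)^(1 / 3)

# Under this model, doubling power raises speed by a factor of 2^(1/3)
modelRatio <- speedFromPower(200) / speedFromPower(100)

# Same comparison using the Bike Calculator values for bikeWeight = 0
# (15.11 MPH at 100 W, 19.76 MPH at 200 W)
refRatio <- 19.76 / 15.11
```

Both ratios land in the neighbourhood of 1.3, far from the factor of 2 that a proportional relationship would imply. Over a limited range of wattages, though, a cube-root curve is close to a straight line, which may be one reason the two fits above are nearly indistinguishable.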

Whole Hog: Kitchen sink approach

Let’s put every variable into the equation that we can think of and see what falls out. You know, good statistical practice. Here we go:

lm.mph <- lm(MPH ~ Watts + RPM + Resistance + cubeRootWatts, data = dt.s22[resistanceChange == 'steady' & Incline == 0 & MPH > 10 & source != '2020_09_30_22_09_Manual_Workout'])
summary(lm.mph)

## 
## Call:
## lm(formula = MPH ~ Watts + RPM + Resistance + cubeRootWatts, 
##     data = dt.s22[resistanceChange == "steady" & Incline == 0 & 
##         MPH > 10 & source != "2020_09_30_22_09_Manual_Workout"])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8953 -0.0206  0.0060  0.0206  2.5894 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -1.9284929  0.0980549  -19.67   <2e-16 ***
## Watts          0.0255855  0.0002404  106.42   <2e-16 ***
## RPM            0.1201198  0.0010226  117.46   <2e-16 ***
## Resistance     0.1253298  0.0082695   15.16   <2e-16 ***
## cubeRootWatts  1.0475348  0.0357274   29.32   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2006 on 22463 degrees of freedom
## Multiple R-squared:  0.9895, Adjusted R-squared:  0.9895 
## F-statistic: 5.276e+05 on 4 and 22463 DF,  p-value: < 2.2e-16

dt.s22[, predictedMPH := lm.mph[["coefficients"]][1] + lm.mph[["coefficients"]][2] * Watts + lm.mph[["coefficients"]][3] * RPM + lm.mph[["coefficients"]][4] * Resistance + lm.mph[["coefficients"]][5] * cubeRootWatts]

ggplot(data = dt.s22[resistanceChange == 'steady' & Incline == 0 & MPH > 10 & source != '2020_09_30_22_09_Manual_Workout'], mapping = aes(x = predictedMPH, y = MPH)) +
  geom_hline(yintercept = 0) + geom_vline(xintercept = 0) + theme_nicely() +
  geom_point(alpha = 0.2, col = 'steelblue2') +
  geom_abline(slope = 1, col = 'grey', lwd = 2) +
  scale_x_continuous(name = "Predicted Speed (MPH)") +
  scale_y_continuous(name = 'Actual Speed (MPH)') +
  ggtitle("Prediction Using Cadence, Power & Resistance") +
  theme(plot.title = element_text(hjust = 0.5)) +
  annotate("text", x = 5, y = 20, label = paste0("italic(R) ^ 2 == ", round(summary(lm.mph)$adj.r.squared,4)), parse = TRUE)

For those that care, this model is only very slightly better (in terms of r²) than the model that does not include the cube root of the power term. Still, it feels like we should be able to get an r² of 1 given that we presumably have all of the inputs. I guess not. Or NordicTrack is just trying to be mysterious (or maybe the odd distribution of Watts is to blame?). At the end of the day, the integer rounding has to be some source of the error.
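A back-of-the-envelope check on how much error rounding alone could contribute is sketched below. It rests on loud assumptions: that the bike computes speed from unrounded versions of these same inputs, that the fitted model form is right, that Resistance is an exact integer setting, and that 100 W is a representative operating point for the cube-root term.

```r
# Coefficients from the kitchen-sink model summary above
bWatts <- 0.0255855
bRPM   <- 0.1201198
bCube  <- 1.0475348

# Worst-case rounding error in each input (half of the reporting precision)
errWatts <- 0.5    # Watts reported as integers
errRPM   <- 0.5    # RPM reported as integers
errMPH   <- 0.005  # MPH reported to the hundredth

# Slope of the cube-root term at an illustrative 100 W (assumed operating point)
slopeCube <- bCube / 3 * 100^(-2 / 3)

# Largest speed error that rounding alone could produce under these assumptions
maxRoundingErr <- bWatts * errWatts + bRPM * errRPM + slopeCube * errWatts + errMPH

# Residual standard error reported by the model, for comparison
residualSE <- 0.2006
```

Under these assumptions the worst case comes out below 0.1 MPH, less than half of the residual standard error, which hints that rounding is a contributor but probably not the whole story.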

Further Data Exploration: TCX

Now that we’ve looked at the data available in the CSV format, let’s see what the TCX has to offer (or not). Don’t laugh at the code. I understand that I made this horribly inefficient, but I struggled to get the XML parsing functions to do what I wanted them to do. I don’t know if it’s because <Track> is buried so deep in the XML or not. Laugh if you must, but this did work. Here we go:

library(XML)  # xmlParse(), xmlToList()

# list.files() takes a regular expression, not a glob
vec.tcx <- list.files(path = chr.location, pattern = '\\.tcx$')

parseIFit <- function(fileName) {
  fl <- paste0(chr.location, fileName)
  doc <- xmlParse(fl)
  tmp <- xmlToList(doc)$Activities$Activity$Lap$Track
  dt.temp <- rbindlist(lapply(tmp, function(rw){
    nm <- names(rw)
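    # HeartRateBpm appears to unlist to two values (its <Value> and its xsi:type attribute),
    # so add a placeholder name to keep the names and values aligned
    if(any(nm == 'HeartRateBpm')) nm <- c(nm[1:which(nm == "HeartRateBpm")], "HeartRateWorthless", nm[(which(nm == "HeartRateBpm")+1):length(nm)])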
    vec <- as.vector(unlist(rw))
    if(length(vec) > 0) {
      dt <- data.table(t(vec))
      setnames(dt, nm)
    } else {
      dt <- data.table()
    }
    return(dt)
  }), fill = TRUE)
  dt.temp[, source :=  substr(fileName, 1, nchar(fileName) - 4) ]
  if(any(names(dt.temp) == "HeartRateWorthless")) dt.temp[, HeartRateWorthless := NULL]
  return(dt.temp)
}

dt.tcx <- rbindlist(lapply(vec.tcx, function(fl){
  return(parseIFit(fileName = fl))
}), fill = TRUE)

dt.tcx[, `:=` (DistanceMeters = as.numeric(DistanceMeters), RPM = as.numeric(Cadence), Calories = as.numeric(Calories), HR = as.numeric(HeartRateBpm))]
dt.tcx[, lastDistanceMeters := c(NA, DistanceMeters[1:(nrow(dt.tcx)-1)])]
dt.tcx[, lastSource := c(NA, source[1:(nrow(dt.tcx)-1)])]
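# Meters covered since the previous trackpoint, converted from m/s to MPH
# (assumes trackpoints are one second apart)
dt.tcx[, MPH := (DistanceMeters - lastDistanceMeters) * 2.23694]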
dt.tcx[, deltaMeters := DistanceMeters - lastDistanceMeters]
dt.tcx[, I := .I]
dt.tcx[, lastCalories := c(NA, Calories[1:(nrow(dt.tcx)-1)])]
dt.tcx[, caloriesBurned := Calories - lastCalories]
dt.tcx[source != lastSource, `:=` (MPH = 0, caloriesBurned = 0)]
# kcal per second to Watts, using the roughly 1000 W per kcal/s shortcut explained below
dt.tcx[, Watts := caloriesBurned * 1000]
dt.tcx[, baseTime := min(gsub("T", " ", Time)), by = source]
dt.tcx[, tmFromStartSec := as.numeric(difftime(gsub("T", " ",Time), baseTime, units = "secs"))]
dt.tcx[, maxTime := max(gsub("T", " ", Time)), by = source]
dt.tcx[, perComplete := as.numeric(difftime(gsub("T", " ", Time), baseTime, units = "secs")) / as.numeric(difftime(maxTime, baseTime, units = "secs"))]

You will note above that I converted Calories (kcal) burned per second to Watts using a simplification from Gear and Grit: multiply by 1,000. It basically assumes that your body’s efficiency at turning burned calories into pedal power equals the Joule-to-calorie conversion factor, about 23.89%, so that 4,184 J per kcal times 0.2389 comes out to roughly 1,000 W per kcal/s. Makes the math pretty easy.
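The arithmetic behind that shortcut can be checked with the lap totals from the TCX example earlier in the post. This sketch assumes the TCX Calories field is total kcal burned and that the 23.89% efficiency figure holds; the variable names are illustrative.

```r
kcalToJoules <- 4184      # 1 kcal = 4184 J
efficiency   <- 0.2389    # assumed fraction of burned energy reaching the pedals

# Watts per (kcal/second): very close to 1000, hence the shortcut
wattsPerKcalPerSec <- kcalToJoules * efficiency

# Lap totals from the TCX example earlier in the post
totalKcal    <- 553.826   # <Calories>
totalSeconds <- 2276      # <TotalTimeSeconds>

# Average power over the lap, with the shortcut and with the full conversion
avgWattsShortcut <- totalKcal / totalSeconds * 1000
avgWattsExact    <- totalKcal / totalSeconds * wattsPerKcalPerSec
```

The two averages differ by a fraction of a Watt, so the shortcut costs essentially nothing relative to the uncertainty in the efficiency assumption itself.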

Let’s do a simple check on our data to make sure that they have the same profile of percentage complete (i.e., the other TCX files are not skewed in some way):

ggplot() + 
  geom_hline(yintercept = 0) + theme_nicely() +
  geom_density(data = dt.s22, mapping = aes(x = perComplete), col = "steelblue2") + 
  geom_density(data = dt.tcx, mapping = aes(x = perComplete), col = "red", lty = 'dashed') +
  scale_x_continuous("Percent of Workout Complete") +
  scale_y_continuous("Probability Density Function") +
  ggtitle("Simple Distribution of Data Points") +
  theme(plot.title = element_text(hjust = 0.5)) +
  annotate("text", x = 0.5, y = 0.5, label = "Blue = CSV\nRed = TCX")

Seems pretty reasonable, though I would have expected a flat line. The drop-off at both ends is likely an artifact of the kernel smoothing, which has no data to draw on beyond 0% and 100%. Also, at the end of the day this would just show that the two file types are internally consistent.
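The tapering at the ends can be reproduced with perfectly even data. A minimal sketch using base R’s density() on an evenly spaced grid from 0 to 1, standing in for perComplete:

```r
# Perfectly even coverage of the 0..1 range
x <- seq(0, 1, length.out = 1000)

# Kernel density estimate evaluated only inside the observed range
d <- density(x, from = 0, to = 1)

# Estimate at the left edge and near the middle of the range
edge   <- d$y[1]
middle <- d$y[which.min(abs(d$x - 0.5))]

# Ratio of edge to middle: well below 1 even though the data are uniform
edgeToMiddle <- edge / middle
```

Each point is smeared over a small window, and at the boundary about half of that window has no data in it, so the curve sags at both ends even when coverage is uniform.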

Let’s return to the ggpairs function to quickly see if the TCX data looks reasonable:

ggpairs(dt.tcx[MPH < 30, .(Watts, MPH, RPM)]) +
  ggtitle("Initial TCX Data Exploration") +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme_nicely()

Oddly, the RPM distribution here does not look the same as the one from the CSV data. Let’s look at how different they are:

ggplot() +
  geom_hline(yintercept = 0) + theme_nicely() +
  geom_density(data = dt.s22, mapping = aes(x = RPM), lwd = 2) +
  geom_density(data = dt.tcx, mapping = aes(x = RPM), lwd = 2, col = 'red') +
  ggtitle("Distribution of Cadence") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_y_continuous(name = 'Probability Density  Function') +
  scale_x_continuous(name = "Cadence (RPM)") +
  annotate("text", x = 50, y = 0.075, label = "Black = CSV\nRed = TCX") +
  theme(axis.text.y=element_blank(), axis.ticks.y=element_blank())

Maybe we can dig in and figure out which ride it was on? Let’s try:

ggplot() +
  geom_hline(yintercept = 0) + theme_nicely() +
  geom_density(data = dt.s22, mapping = aes(x = RPM), lwd = 2) +
  geom_density(data = dt.tcx, mapping = aes(x = RPM), lwd = 2, col = 'red', lty = 'dashed') +
  facet_wrap(. ~ source) +
  ggtitle("Distribution of Cadence\nFaceted by Ride") +
  scale_y_continuous(name = 'Probability Density Function') +
  theme(axis.text.y=element_blank(), axis.ticks.y=element_blank(), plot.title = element_text(hjust = 0.5))

So, this was the point that I figured out that my workout source of 2020_09_30_22_09_Manual_Workout was only a partial file for the TCX data, but a long file for the CSV data. But, the CSV data was all 72 RPM exactly, the whole workout. Something fishy is here, so I’m going to exclude that source (and you’ll note that it has been above without explanation until now) for any of the analysis. If we remove the spike caused by the one bad workout, it looks like we have reasonably good alignment, at least on RPM. Let’s look at the core three variables that we care about: RPM, Watts and MPH.

dt.plot <- rbind(dt.s22[source != '2020_09_30_22_09_Manual_Workout', .(RPM, Watts, MPH, `File Type` = 'CSV')],
                 dt.tcx[source != '2020_09_30_22_09_Manual_Workout', .(RPM, Watts, MPH, `File Type` = 'TCX')])
dt.melt <- melt(dt.plot, id.vars = c("File Type"), measure.vars = c("RPM", "MPH", "Watts"))
ggplot(data = dt.melt, mapping = aes(x = value, col = `File Type`)) +
  geom_hline(yintercept = 0) + theme_nicely() +
  facet_wrap(. ~ variable) +
  geom_density() +
  ggtitle("Distribution of Key Variables\nFaceted by Variable") +
  scale_x_continuous(name = "Value") +
  scale_y_continuous(name = 'Probability Density Function') +
  theme(axis.text.y=element_blank(), axis.ticks.y=element_blank(), 
        plot.title = element_text(hjust = 0.5), legend.position = c(0.8, 0.8))

## Warning: Removed 2 rows containing non-finite values (stat_density).

Unfortunately, putting all of these variables on the same graph does not make for easy comparisons. Let’s try again:

p1 <- ggplot(data = dt.melt[variable == 'RPM'], mapping = aes(x = value, col = `File Type`)) +
  geom_hline(yintercept = 0) + theme_nicely() +
  geom_density() +
  scale_x_continuous(name = "Cadence (RPM)") +
  scale_y_continuous(name = 'Probability Density Function') +
  scale_color_discrete(guide = "none") +
  theme(axis.text.y=element_blank(),
        axis.ticks.y=element_blank())

p2 <- ggplot(data = dt.melt[variable == 'MPH'], mapping = aes(x = value, col = `File Type`)) +
  geom_hline(yintercept = 0) + theme_nicely() +
  geom_density() +
  scale_x_continuous(name = "Speed (MPH)") +
  theme(axis.title.y=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks.y=element_blank(),
        legend.position = c(0.7, 0.8))
p3 <- ggplot(data = dt.melt[variable == 'Watts'], mapping = aes(x = value, col = `File Type`)) +
  geom_hline(yintercept = 0) + theme_nicely() +
  geom_density() +
  scale_x_continuous(name = "Power (Watts)") +
  theme(axis.title.y=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks.y=element_blank()) +
  scale_color_discrete(guide = "none")
grid.arrange(p1, p2, p3, nrow = 1)

## Warning: Removed 1 rows containing non-finite values (stat_density).

## Warning: Removed 1 rows containing non-finite values (stat_density).

Unfortunately, except for the cadence, there seem to be some outliers in the data. Realistically, I haven’t gone over 600 Watts or 35 MPH on any of my workouts. Let’s see where the problems are:

dt.melt[variable == 'MPH' & value > 35, .(.N), by = `File Type`]

##    File Type  N
## 1:       TCX 30

and then for power (using a stricter cutoff of 450 Watts):

dt.melt[variable == 'Watts' & value > 450, .(.N), by = `File Type`]

##    File Type   N
## 1:       TCX 984

So our TCX files have some suspect data. Let’s see if we can figure out the circumstances for those outliers to occur (though there are a lot of them!).

Let’s compare the variables across workouts to see how well they line up. Note that the CSV data will be in orange and the TCX data in blue (to save space on the plot):

dt.plot <- rbind(dt.s22[source != '2020_09_30_22_09_Manual_Workout', .(RPM, Watts, MPH, source, perComplete, `File Type` = 'CSV')],
                 dt.tcx[source != '2020_09_30_22_09_Manual_Workout', .(RPM, Watts, MPH, source, perComplete, `File Type` = 'TCX')])
dt.melt <- melt(dt.plot, id.vars = c("File Type", "perComplete", "source"), measure.vars = c("RPM", "MPH", "Watts"))
ggplot(data = dt.melt[variable == 'RPM'], mapping = aes(x = perComplete, y = value, col = `File Type`)) +
  theme_nicely() +
  geom_point(alpha = 0.2) +
  facet_grid(rows = vars(source), cols = vars(variable)) +
  scale_x_continuous(name = "Percent of Workout Complete", labels = scales::percent) +
  scale_y_continuous(name = "Cadence (RPM)", breaks = c(50, 100)) +
  ggtitle("Cadence over Time\nFaceted by Workout") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_color_discrete(guide = "none")

Cadence looks pretty good from this view. How about power?

ggplot(data = dt.melt[variable == 'Watts' & value < 400], mapping = aes(x = perComplete, y = value, col = `File Type`)) +
  geom_hline(yintercept = 0) + theme_nicely() +
  geom_point(alpha = 0.2) +
  facet_grid(rows = vars(source), cols = vars(variable)) +
  scale_x_continuous(name = "Percent of Workout Complete", labels = scales::percent) +
  scale_y_continuous(name = "Power (Watts)", breaks = c(0, 150, 300)) +
  ggtitle("Power over Time\nFaceted by Workout") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_color_discrete(guide = FALSE)

Uggh. There seems to be a systematic bias here. And finally, speed (MPH)?

ggplot(data = dt.melt[variable == 'MPH' & value < 30], mapping = aes(x = perComplete, y = value, col = `File Type`)) +
  geom_hline(yintercept = 0) + theme_nicely() +
  geom_point(alpha = 0.2) +
  facet_grid(rows = vars(source), cols = vars(variable)) +
  scale_x_continuous(name = "Percent of Workout Complete", labels = scales::percent) +
  scale_y_continuous(name = "Speed (MPH)", breaks = c(0, 25)) +
  ggtitle("Speed over Time\nFaceted by Workout") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_color_discrete(guide = FALSE)

We are almost never getting the same values. Not great.

Another way to look at the data is a scatter plot of like-for-like values from the CSV and the TCX. Here, though, you have to be able to line them up from the start of the ride. We can do this by merging the two data sets on the tmFromStartSec variable and the source. Here goes the first view, cadence (RPM):

dt.plot <- merge(dt.s22[source != '2020_09_30_22_09_Manual_Workout', .(source = substr(source, 18, nchar(source)), tmFromStartSec, csv = RPM)], 
                 dt.tcx[source != '2020_09_30_22_09_Manual_Workout', .(source = substr(source, 18, nchar(source)), tmFromStartSec, tcx = RPM)],
                 by = c('source', 'tmFromStartSec'))
ggplot(data = dt.plot, mapping = aes(x = csv, y = tcx, col = source)) +
  theme_nicely() +
  geom_abline(slope = 1, lwd = 2, col = 'grey') +
  geom_point(alpha = 0.2) +
  scale_x_continuous(name = "Cadence from CSV (RPM)") +
  scale_y_continuous(name = "Cadence from TCX (RPM)") +
  scale_color_discrete(guide = FALSE) +
  theme(plot.title = element_text(hjust = 0.5)) +
  ggtitle("Cadence Comparison") +
  annotate("text", x = 60, y = 100, label = "Grey line represents y = x")

Yay - everything lines up well! Let’s see about our other derived variables:

dt.plot <- merge(dt.s22[source != '2020_09_30_22_09_Manual_Workout', .(source = substr(source, 18, nchar(source)), tmFromStartSec, csv = Watts)], 
                 dt.tcx[source != '2020_09_30_22_09_Manual_Workout', .(source = substr(source, 18, nchar(source)), tmFromStartSec, tcx = Watts)],
                 by = c('source', 'tmFromStartSec'))
ggplot(data = dt.plot, mapping = aes(x = csv, y = tcx, col = source)) +
  theme_nicely() +
  geom_abline(slope = 1, lwd = 2, col = 'grey') +
  geom_point(alpha = 0.2) +
  theme(plot.title = element_text(hjust = 0.5)) +
  ggtitle("Power Comparison") +
  scale_x_continuous(name = "Power from CSV (Watts)") +
  scale_y_continuous(name = "Power from TCX (Watts)") +
  scale_color_discrete(guide = FALSE)

## Warning: Removed 1 rows containing missing values (geom_point).

Totally not awesome. Maybe faceting by ride and limiting the y-axis height will help us understand what is going on:

ggplot(data = dt.plot, mapping = aes(x = csv, y = tcx, col = source)) +
  theme_nicely() +
  geom_abline(slope = 1, lwd = 2, col = 'grey') +
  geom_point(alpha = 0.2) +
  theme(plot.title = element_text(hjust = 0.5)) +
  ggtitle("Power Comparison\nFaceted by Workout") +
  scale_y_continuous(name = "Power from TCX (Watts)") +
  scale_x_continuous(name = "Power from CSV (Watts)") +
  scale_color_discrete(guide = FALSE) + 
  facet_wrap(. ~ source) +
  coord_cartesian(ylim = c(0, 600))

## Warning: Removed 1 rows containing missing values (geom_point).

For some workouts, the derivation appears to nail it almost perfectly. For others, we clearly have three lines instead of just the y = x line. This bears some additional thinking to understand why this might be the case, but at first glance, I’m stumped. Let’s look further:

ggplot(data = dt.tcx[22798:22809], mapping = aes(x = tmFromStartSec)) +
  geom_hline(yintercept = 0) + theme_nicely() +
  geom_point(mapping = aes(y = Watts), col = 'steelblue2') +
  geom_line(mapping = aes(y = Watts), col = 'steelblue2') +
  geom_step(mapping = aes(y = (Calories-175) * 50), col = 'red') +
  geom_point(mapping = aes(y = (Calories-175) * 50), col = 'red') +
  scale_y_continuous(name = 'Derived Power (Blue - Watts)\nCumulative Calories Burned\n(Red - scaled up to show variation)') +
  scale_x_continuous(name = "Seconds from Start of Workout") +
  ggtitle("Calorie Problem") +
  theme(plot.title = element_text(hjust = 0.5))

Aha. It appears that Calories gets stuck every so often. Seems like smoothing is in order:

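# Smooth the cumulative Calories with a centered 20-row rolling mean, per workout
dt.tcx[, smoothCal := frollmean(Calories, 20, align = 'center'), by = source]
# Lag within each workout so a difference never spans two rides
dt.tcx[, lastSmoothCal := shift(smoothCal), by = source]
dt.tcx[, smoothCaloriesBurned := smoothCal - lastSmoothCal]
dt.tcx[, smoothWatts := smoothCaloriesBurned * 1000]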

dt.plot <- merge(dt.s22[source != '2020_09_30_22_09_Manual_Workout', .(source = substr(source, 18, nchar(source)), tmFromStartSec, csv = Watts)], 
                 dt.tcx[source != '2020_09_30_22_09_Manual_Workout', .(source = substr(source, 18, nchar(source)), tmFromStartSec, tcx = smoothWatts)],
                 by = c('source', 'tmFromStartSec'))

ggplot(data = dt.plot, mapping = aes(x = csv, y = tcx, col = source)) +
  theme_nicely() +
  geom_abline(slope = 1, lwd = 2, col = 'grey') +
  geom_point(alpha = 0.2) +
  scale_x_continuous(name = "Power from CSV (Watts)") +
  scale_y_continuous(name = "Smooth Power from TCX (Watts)") +
  scale_color_discrete(guide = FALSE) +
  ggtitle("Smoothed Power Comparison") +
  theme(plot.title = element_text(hjust = 0.5))

## Warning: Removed 200 rows containing missing values (geom_point).

As the above code suggests, it required smoothing the Calories over 20 seconds to get rid of the three-line artifact. Unacceptable, IMHO. You will also note that after we have removed it, there is a clear bias in the way that I’m deriving power from the Calorie information. Given that this is not even the first-order problem, I’m going to ignore it for now. If I were to treat it, though, my guess is that the difference is due to (1) Calories including a basal metabolic rate (i.e., the line will not cross the y-axis at zero) and (2) my conversion factor between calories and power being off, which would change the slope.
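If I did want to estimate the offset and the slope, a straight-line fit of the TCX-derived power against the CSV power would be one place to start. This is only a sketch on synthetic numbers: the `csv` and `tcx` vectors below are invented stand-ins for the merged columns, and the 80 Watt offset and 1.2 slope are arbitrary.

```r
# Synthetic paired readings standing in for the csv and tcx columns.
set.seed(1)
csv <- seq(50, 300, by = 10)                        # reference power (Watts)
tcx <- 80 + 1.2 * csv + rnorm(length(csv), sd = 5)  # made-up biased estimate

fit <- lm(tcx ~ csv)
intercept <- unname(coef(fit)[1])  # offset, e.g. a resting burn counted in Calories
slope     <- unname(coef(fit)[2])  # scale error in the calorie-to-power factor

# Invert the fit to put the TCX-derived values back on the CSV scale.
tcx.calibrated <- (tcx - intercept) / slope
print(c(intercept = intercept, slope = slope))
```

On the conversion factor itself: if the Calories field is in kilocalories and there is one row per second (both assumptions on my part), then 1 kcal is 4184 J, so multiplying by 1000 amounts to assuming that roughly 24% of the energy burned ends up as mechanical power. A fitted slope different from 1 would point to a different effective efficiency.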

Let’s now look at speed. Maybe speed is better?

dt.plot <- merge(dt.s22[source != '2020_09_30_22_09_Manual_Workout', .(source = substr(source, 18, nchar(source)), tmFromStartSec, csv = MPH)], 
                 dt.tcx[source != '2020_09_30_22_09_Manual_Workout', .(source = substr(source, 18, nchar(source)), tmFromStartSec, tcx = MPH)],
                 by = c('source', 'tmFromStartSec'))
ggplot(data = dt.plot, mapping = aes(x = csv, y = tcx, col = source)) +
  theme_nicely() +
  geom_abline(slope = 1, lwd = 2, col = 'grey') +
  geom_point(alpha = 0.2) +
  scale_x_continuous(name = "Speed from CSV (MPH)") +
  scale_y_continuous(name = "Speed from TCX (MPH)") +
  scale_color_discrete(guide = FALSE) +
  ggtitle("Speed Comparison") +
  theme(plot.title = element_text(hjust = 0.5))

## Warning: Removed 1 rows containing missing values (geom_point).

This seems to be the worst. Numerous outliers! To even see what is going on, we should restrict the y-axis to roughly the same range as the x. There is also clearly a problem with discrete values on the y-axis (the TCX files). That is due to the rounding of meters to an integer value. But the spread goes beyond rounding and I really don’t know why it would be this bad. Is it some rides, or all of them?

ggplot(data = dt.plot, mapping = aes(x = csv, y = tcx, col = source)) +
  theme_nicely() +
  geom_abline(slope = 1, lwd = 2, col = 'grey') +
  geom_point(alpha = 0.2) +
  scale_x_continuous(name = "Speed from CSV (MPH)") +
  scale_y_continuous(name = "Speed from TCX (MPH)") +
  scale_color_discrete(guide = FALSE) + 
  facet_wrap(. ~ source)  +
  ggtitle("Speed Comparison\nFaceted by Ride") +
  theme(plot.title = element_text(hjust = 0.5)) +
  coord_cartesian(xlim = c(0,26), ylim = c(0, 40))

## Warning: Removed 1 rows containing missing values (geom_point).

It’s all of them. Arguably terrible. But maybe we are getting the same on average?

dt.tcxSummary <- dt.tcx[source != '2020_09_30_22_09_Manual_Workout', .(mxTime = maxTime[1], mnTime = baseTime[1], 
                                                                       endMeter = max(DistanceMeters), 
                                                                       beginMeter = min(DistanceMeters)), by = source]
dt.tcxSummary[, deltaMeters := endMeter - beginMeter]
dt.tcxSummary[, deltaHours := as.numeric(difftime(mxTime, mnTime, units = "hours"))]
dt.tcxSummary[, tcxMPH := deltaMeters / 1609.34 / deltaHours]

dt.s22Summary <- dt.s22[source != '2020_09_30_22_09_Manual_Workout', .(csvMPH = mean(MPH), csvRPM = mean(RPM), csvWatts = mean(Watts)), by = source]
dt.plot <- merge(dt.tcxSummary[, .(source = substr(source, 18, nchar(source)), tcxMPH)], 
                 dt.s22Summary[, .(source = substr(source, 18, nchar(source)), csvMPH)], by = "source")
ggplot(data = dt.plot, mapping = aes(x = csvMPH, y = tcxMPH, col = source)) +
  theme_nicely() +
  geom_abline(slope = 1, lwd = 2, col = 'grey') +
  geom_point() +
  scale_x_continuous(name = 'Speed from CSV (MPH)') +
  scale_y_continuous(name = 'Speed from TCX (MPH)') +
  scale_color_discrete(name = 'Workout') +
  theme(legend.position = c(0.3, 0.8)) +
  coord_cartesian(xlim = c(17, 23), ylim = c(17,23)) +
  theme(plot.title = element_text(hjust = 0.5), legend.text = element_text(size = rel(1))) +
  ggtitle("Speed Comparison")

Nope. It looks like the CSV reports higher average speeds than the TCX. Odd that it would be systematically biased in this way.

Conclusions

Icon Fitness, the owner of the NordicTrack and iFit brands, does not appear to do a very good job of making its data available to its users in a useful form. The variable that I care most about, Power, is not even provided in the fancier data format. A summary table:

Data Type	`CSV` Accuracy	`TCX` Accuracy
Power	Watt (Integer, sort of)	Not provided, derivation specious
Cadence	RPM (Integer)	RPM (Integer)
Speed	MPH (hundredth of an mph)	Cumulative Meters (whole meters; 1 m/s is equivalent to 2.23694 mph)
Heart Rate	Argh!	Argh!
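The 2.23694 mph figure is simply one meter per second expressed in miles per hour. A small sketch of what that does to a derived speed, assuming one track point per second and distances stored as whole meters (the steady 18.5 mph ride is made up):

```r
# Assumed: one track point per second, distance stored as whole meters.
mph.per.mps <- 3600 / 1609.344            # 1 m/s expressed in mph, about 2.23694

true.speed.mph <- 18.5                    # invented steady speed
true.meters    <- cumsum(rep(true.speed.mph / mph.per.mps, 60))
tcx.meters     <- round(true.meters)      # what an integer-meter file would hold

# Speed recovered from successive differences of the rounded distance.
derived.mph <- diff(tcx.meters) * mph.per.mps
print(sort(unique(round(derived.mph, 3))))
```

Every derived speed is a whole multiple of 2.23694 mph, so a steady ride shows up as a flicker between two neighboring steps rather than as the speed actually ridden.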

Given that the TCX doesn’t provide power directly and the derivation is specious, why would you use the TCX? Why don’t they directly provide power in the TCX???

And the Argh!’s above are due to my frustration at getting the bike to even recognize my heart rate monitor, even though it is the iFit-branded one provided with my treadmill. Their own equipment doesn’t work well together.

A simple Python project called upload.bike will take my CSV and create a better TCX than iFit does. Meters are given to the ten-thousandth. The activity isn’t Other, but Biking. I’m just flabbergasted that Icon Fitness is so bad at its job. Embarrassing.