Introduction

This document explores a small portion of the Health data recorded by my Apple Watch over the last four years of daily wear. The analysis focuses on pedometer-related measures: daily step counts, walking distance, and active energy burn.

The raw data was exported from the Health app using the “Export All Health Data” feature highlighted in the image below. The export produces two XML files along with a directory of GPX files for workouts.

All data was extracted from the “export.xml” file, which contained over 4.5M records. An R script leveraging the dplyr and lubridate libraries was used to parse, clean, and aggregate the records and produce the “daily_steps.csv” file analyzed here. Note: any bolded figures throughout the text of this document are the result of an inline code calculation.

Import Daily Steps Data

Load and Inspect

To begin, we’ll import the daily steps (activity) data, inspect its structure, and summarize the contents of each column. The str() function shows there are 1509 rows in the data set with 6 columns: the date, hardware and software metadata, calories burned, distance, and step count. The summary shows the data type of each column and also reveals that distance and steps each have one missing value.

str(activity)
## 'data.frame':    1509 obs. of  6 variables:
##  $ recordDate : chr  "2015-09-22" "2015-09-23" "2015-09-24" "2015-09-25" ...
##  $ hardware   : chr  "Watch1,1" "Watch1,1" "Watch1,1" "Watch1,1" ...
##  $ software   : chr  "2.0" "2.0" "2.0" "2.0" ...
##  $ energy_kcal: num  506 250 327 632 345 ...
##  $ distance_mi: num  6.22 3.64 3.49 4.43 5.5 ...
##  $ steps_count: int  12769 7416 6951 8943 10726 20552 5826 15162 18815 7913 ...
summary(activity)
##   recordDate          hardware           software        
##  Length:1509        Length:1509        Length:1509       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##   energy_kcal        distance_mi       steps_count   
##  Min.   :   0.003   Min.   : 0.0282   Min.   :   54  
##  1st Qu.: 362.171   1st Qu.: 4.5590   1st Qu.:10080  
##  Median : 506.708   Median : 6.0200   Median :13402  
##  Mean   : 548.740   Mean   : 6.3928   Mean   :14034  
##  3rd Qu.: 701.943   3rd Qu.: 7.9576   3rd Qu.:17318  
##  Max.   :1759.868   Max.   :16.1763   Max.   :37020  
##                     NA's   :1         NA's   :1
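The import call itself is not shown above. A minimal sketch of how it might look follows, assuming the CSV has the six columns listed in the str() output. The temporary file, the three sample rows (copied from the output above), and the name activity_demo are stand-ins for illustration, not the author's actual script.

```r
# Hypothetical stand-in for "daily_steps.csv": three rows copied from the
# str() output above, written to a temporary file so the sketch is runnable.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "recordDate,hardware,software,energy_kcal,distance_mi,steps_count",
  "2015-09-22,\"Watch1,1\",2.0,506.490,6.224021,12769",
  "2015-09-23,\"Watch1,1\",2.0,250.358,3.639957,7416",
  "2015-09-24,\"Watch1,1\",2.0,327.250,3.488659,6951"
), csv_path)

# Keep text columns as character, and force `software` to character so that
# a version string such as "2.0" is not silently converted to the number 2.
activity_demo <- read.csv(
  csv_path,
  stringsAsFactors = FALSE,
  colClasses = c(software = "character")
)

str(activity_demo)
```

Reading the version column as character matters mostly for small samples like this one; in the full file, versions such as 4.3.1 cannot be parsed as numbers, so the column would presumably be read as character anyway.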

Which days are missing steps and distance?

sapply(activity, function(x) which(is.na(x)))
## $recordDate
## integer(0)
## 
## $hardware
## integer(0)
## 
## $software
## integer(0)
## 
## $energy_kcal
## integer(0)
## 
## $distance_mi
## [1] 170
## 
## $steps_count
## [1] 170
activity[170, ]
##     recordDate hardware software energy_kcal distance_mi steps_count
## 170 2016-03-08 Watch1,1      2.1       0.003          NA          NA

It appears the date missing both steps and distance data (March 8, 2016) also has an energy burn of approximately 0 (0.003 kcal). This could coincide with a day when I forgot to charge the watch overnight, put it on briefly, then removed it. We’ll remove this record from the data frame moving forward.
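As a cross-check on the inspection above, the same question can be answered with colSums() and complete.cases() from base R. This is a sketch on a made-up three-row frame: only the March 8, 2016 row reflects the real data, and the neighbouring rows and the name activity_demo are invented for illustration.

```r
# Toy frame: the 2016-03-08 row mirrors the record shown above; the other
# two rows are invented so there is something to keep after filtering.
activity_demo <- data.frame(
  recordDate  = c("2016-03-07", "2016-03-08", "2016-03-09"),
  energy_kcal = c(512.4, 0.003, 430.1),
  distance_mi = c(6.1, NA, 5.2),
  steps_count = c(12950L, NA, 11020L),
  stringsAsFactors = FALSE
)

# Number of missing values per column.
na_counts <- colSums(is.na(activity_demo))

# Rows with at least one missing value, and the frame without them.
incomplete <- activity_demo[!complete.cases(activity_demo), ]
cleaned    <- activity_demo[complete.cases(activity_demo), ]

na_counts
incomplete
```

Dropping rows with complete.cases() removes a row if any column is missing, which is slightly broader than filtering on steps_count alone; for this data set the two approaches remove the same single record.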

Clean Formatting

Next, let’s fix the formatting of the columns in addition to handling the missing data points by making the following updates:

  • convert the hardware and software columns to factors because they are categorical
  • change recordDate to a Date type
  • create helper date columns for Month, Year, and Day of the Week
activity <- activity %>%
  filter(!is.na(steps_count)) %>%
  mutate(hardware = as.factor(hardware),
         software = as.factor(software),
         recordDate = as.Date(recordDate),
         recordMonth = as.Date(format(recordDate, "%Y-%m-01")),
         recordYear = format(recordDate, "%Y"),
         dayOfWeek = format(recordDate, "%A"))
head(activity) %>% kable()
recordDate  hardware  software  energy_kcal  distance_mi  steps_count  recordMonth  recordYear  dayOfWeek
2015-09-22  Watch1,1  2.0           506.490     6.224021        12769  2015-09-01   2015        Tuesday
2015-09-23  Watch1,1  2.0           250.358     3.639957         7416  2015-09-01   2015        Wednesday
2015-09-24  Watch1,1  2.0           327.250     3.488659         6951  2015-09-01   2015        Thursday
2015-09-25  Watch1,1  2.0           631.611     4.430512         8943  2015-09-01   2015        Friday
2015-09-26  Watch1,1  2.0           345.439     5.498719        10726  2015-09-01   2015        Saturday
2015-09-27  Watch1,1  2.0           602.511     9.926667        20552  2015-09-01   2015        Sunday
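One caveat with format(recordDate, "%A") is that it returns plain character strings in the current locale, so later tables and plots would order the days alphabetically. The sketch below shows one possible way to get an ordered, locale-independent weekday factor. The vectors day_levels and sunday_first are names introduced here for illustration, and a Monday-first week is an assumption.

```r
# Three dates taken from the table above.
dates <- as.Date(c("2015-09-22", "2015-09-26", "2015-09-27"))

# Desired display order (Monday first is an assumption of this sketch).
day_levels <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                "Friday", "Saturday", "Sunday")

# as.POSIXlt()$wday counts 0 = Sunday through 6 = Saturday regardless of
# locale, so English labels are looked up from a fixed vector.
sunday_first <- c("Sunday", "Monday", "Tuesday", "Wednesday",
                  "Thursday", "Friday", "Saturday")
day_names <- sunday_first[as.POSIXlt(dates)$wday + 1]

dayOfWeek <- factor(day_names, levels = day_levels, ordered = TRUE)
table(dayOfWeek)
```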

Descriptive Statistics

Metadata

Now that activity is in a tidy format, we will proceed with some high-level profiling and descriptive statistics of the columns to get a sense of the data. The data ranges from Tuesday, September 22, 2015 through Friday, October 11, 2019. Over that time, I used 2 different watches; their first dates of use are summarized below.

aggregate(recordDate ~ hardware, activity, FUN = min)
##   hardware recordDate
## 1 Watch1,1 2015-09-22
## 2 Watch3,1 2018-01-13

There were a total of 29 software versions of watchOS rolled out over that same time frame. The figure below shows how many days’ worth of activity were recorded on each version. I generally update my watch on the day of release, so the number of days per version can roughly be interpreted as a software release cycle. From the chart, we can see that releases tend to follow a 30-, 60-, or 90-day cadence. Interestingly, the last two major releases, 5.0 and 6.0, were followed by updates much sooner than 3.0 and 4.0 were.
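The per-version day counts behind that figure could be computed along the following lines. This is a sketch on invented data: the ten dates, the version assignments, and the names activity_demo and release_cycle are all hypothetical, and the real analysis may have used dplyr instead of aggregate().

```r
# Invented example: ten consecutive days, seven on one version, three on the next.
activity_demo <- data.frame(
  recordDate = as.Date("2018-09-10") + 0:9,
  software   = c(rep("4.3.2", 7), rep("5.0", 3)),
  stringsAsFactors = FALSE
)

# Days of activity recorded on each version.
days_recorded <- aggregate(recordDate ~ software, activity_demo, FUN = length)
names(days_recorded)[2] <- "days_recorded"

# First date each version appears, used to order versions chronologically.
first_seen <- aggregate(recordDate ~ software, activity_demo, FUN = min)
names(first_seen)[2] <- "first_seen"

release_cycle <- merge(first_seen, days_recorded, by = "software")
release_cycle <- release_cycle[order(release_cycle$first_seen), ]
release_cycle
```

Ordering by first appearance rather than by version string avoids sorting problems with multi-part version numbers.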

The same data is shown below with date on the horizontal axis.

Movement Data

The activity data contains three numeric variables related to movement: steps taken (count), distance traveled (mi), and energy burned (kcal). The distributions for each are illustrated in the kernel density plots below. Each measure appears to be right-skewed, with energy burned showing the most pronounced right tail.
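The visual impression of right skew can also be put into a number. The helper below, sample_skewness(), is a name introduced here (it is not part of base R) and implements the simple moment-based skewness coefficient; positive values indicate a longer right tail. The ten step counts are the first ten values shown in the str() output earlier, so the result says nothing about the full data set.

```r
# Moment-based sample skewness: third central moment divided by the
# second central moment raised to the power 3/2.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# First ten daily step counts from the str() output above.
steps_demo <- c(12769, 7416, 6951, 8943, 10726,
                20552, 5826, 15162, 18815, 7913)

sample_skewness(steps_demo)
```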

The quantile-quantile plot for daily steps below highlights the right skew as well as a handful of low outliers. The low outliers could correspond to days when I forgot to charge the watch, was ill, or was traveling. Taken together with the results of the Shapiro-Wilk test, we can conclude the step data is not normally distributed.

shapiro.test(activity$steps_count)
## 
##  Shapiro-Wilk normality test
## 
## data:  activity$steps_count
## W = 0.98248, p-value = 1.381e-12
ggqqplot(activity$steps_count)

Since the distance and energy measures are presumably estimated from a combination of the step count and other data not included here, such as GPS, altitude, and heart rate, we would expect these measures to be somewhat correlated. Both Pearson and Spearman correlation coefficients are shown below for the pairwise combinations; the two methods give comparable results. It is logical that steps and distance are highly correlated. It also follows that distance and energy burned are the least correlated, given that I regularly practice yoga and other anaerobic workouts, which would result in higher energy expenditure with less movement than walking. Hiking at high altitude could also explain this discrepancy. A follow-up analysis incorporating altitude gain could be an interesting way to explore this further.

Pearson Corr.
             steps_count  distance_mi  energy_kcal
steps_count    1.0000000    0.9835406    0.7337584
distance_mi    0.9835406    1.0000000    0.6962424
energy_kcal    0.7337584    0.6962424    1.0000000

Spearman Corr.
             steps_count  distance_mi  energy_kcal
steps_count    1.0000000    0.9833065    0.7563483
distance_mi    0.9833065    1.0000000    0.7184943
energy_kcal    0.7563483    0.7184943    1.0000000
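Matrices like the ones above can be produced with cor() by switching its method argument. The sketch below uses only the six rows shown in the head() table earlier, under the hypothetical name movement_demo, so its coefficients will not match the full-data values reported here.

```r
# Six rows copied from the head(activity) table shown earlier.
movement_demo <- data.frame(
  steps_count = c(12769, 7416, 6951, 8943, 10726, 20552),
  distance_mi = c(6.224021, 3.639957, 3.488659, 4.430512, 5.498719, 9.926667),
  energy_kcal = c(506.490, 250.358, 327.250, 631.611, 345.439, 602.511)
)

# Pearson measures linear association; Spearman applies the same formula
# to ranks, so it captures any monotonic relationship.
pearson_corr  <- cor(movement_demo, method = "pearson")
spearman_corr <- cor(movement_demo, method = "spearman")

pearson_corr
spearman_corr
```

With missing values still present, cor() would return NA for the affected pairs unless use = "complete.obs" is supplied, which is one reason to drop the incomplete record first.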

Regression: Estimating Step Length

In the previous section we saw that, unsurprisingly, number of steps taken and distance traveled have a high degree of linear association. For fun, we will use this relationship to fit a linear model for distance based on step count to estimate my step length, which should be the slope. This simple model will be fit with the lm() function. In this instance, we will force the intercept to be 0 since 0 steps should result in no distance traveled.

mod1 <- lm(activity$distance_mi ~ activity$steps_count + 0)
summary(mod1)
## 
## Call:
## lm(formula = activity$distance_mi ~ activity$steps_count + 0)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.61977 -0.20200  0.06156  0.27953  1.71485 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## activity$steps_count 4.570e-04  8.438e-07   541.6   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4973 on 1507 degrees of freedom
## Multiple R-squared:  0.9949, Adjusted R-squared:  0.9949 
## F-statistic: 2.933e+05 on 1 and 1507 DF,  p-value: < 2.2e-16
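For a regression through the origin, the least-squares slope has a simple closed form, sum(x * y) / sum(x^2), which makes it easy to sanity-check what lm() is doing. The sketch below confirms the equivalence on the six rows from the head() table; demo, fit_demo, and the slope variables are names introduced for illustration.

```r
# Six rows copied from the head(activity) table shown earlier.
demo <- data.frame(
  steps_count = c(12769, 7416, 6951, 8943, 10726, 20552),
  distance_mi = c(6.224021, 3.639957, 3.488659, 4.430512, 5.498719, 9.926667)
)

# "+ 0" removes the intercept, forcing the fitted line through the origin.
fit_demo <- lm(distance_mi ~ steps_count + 0, data = demo)
slope_lm <- unname(coef(fit_demo))

# Closed-form least-squares slope for a no-intercept model.
slope_closed_form <- with(demo, sum(steps_count * distance_mi) / sum(steps_count^2))

isTRUE(all.equal(slope_lm, slope_closed_form))
```

One thing to keep in mind when reading the summary above: for models without an intercept, R computes R-squared relative to zero rather than to the mean of the response, so the value tends to look higher than it would for a comparable model with an intercept.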

As expected, the model fits well, with a high R-squared and a low p-value. The slope of 4.5701142 × 10^-4 miles per step is approximately 2.4 feet per step. This looks like a reasonable figure given that the average step length is around 2.6 feet according to the University of Wyoming: https://www.livestrong.com/article/438170-the-average-walking-stride-length/
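The unit conversion behind the feet-per-step figure is a single multiplication by 5280 feet per mile. A short sketch follows, with the slope typed in from the model output rather than read from mod1; the variable names are introduced here for illustration.

```r
# Slope reported for mod1, in miles per step.
slope_mi_per_step <- 4.5701142e-4

# One statute mile is 5280 feet.
feet_per_mile <- 5280

step_length_ft <- slope_mi_per_step * feet_per_mile
round(step_length_ft, 1)
```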

Looking at the residual plots for the model, however, we see there may be some underlying issues.

Let’s plot Steps vs Distance with our line of fit to explore further. The figure below illustrates there may be two distinct populations within the data.

Back in the descriptive statistics section, we saw that I have used two different watches. One could postulate that two different devices might be calibrated differently and yield visibly different estimates of distance. Here is the same data colored by hardware and without the line of fit.

Interestingly, the watch hardware model does not appear to fully explain the discrepancy. Both watches are represented in the group with the higher slope, but only the newer model appears in the lower-slope segment. Instead, let’s repeat the exercise, mapping the color aesthetic to software version and faceting the graphs.

It’s a bit difficult to discern, but there appears to be a consistent decrease in the slopes of the lines of fit after the release of version 5.0. For ease of viewing, we can select a few versions to look at more closely and see when this change occurs.

The plot below shows the combined figure for versions 4.3.1 through 5.1.1. Here, it would appear that versions 5.0 and earlier use a higher slope (that is, a larger step size) to estimate distance traveled than versions 5.0.1 and later.

Now, let’s create a new categorical variable to indicate whether the software version is before or after 5.0.1 and replot the data. There is still some noise visible with high outliers in the newer software version group. Trial and error with adjusting the cut-off forward/back a version did not yield improvement.
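One way to build such a before/after indicator is to compare parsed version numbers rather than raw strings, since plain string comparison can misorder multi-part versions. The sketch below uses base R's numeric_version(); the sample versions are ones mentioned in this document, while the labels "legacy" and "5.0.1+" and the variable names are assumptions made for illustration.

```r
# A few software versions mentioned in the text, stored as a factor
# to mirror the cleaned `software` column.
software <- factor(c("2.0", "4.3.1", "5.0", "5.0.1", "5.1.1", "6.0"))

# Parse to version objects so that 5.0 < 5.0.1 < 5.1.1 compares correctly.
cutoff   <- numeric_version("5.0.1")
is_newer <- numeric_version(as.character(software)) >= cutoff

# Two-level grouping variable with the older group as the first level.
version_group <- factor(ifelse(is_newer, "5.0.1+", "legacy"),
                        levels = c("legacy", "5.0.1+"))

data.frame(software, version_group)
```

Moving the cut-off forward or back a version, as described above, only requires changing the string passed to numeric_version().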

The summary statistics for the models fit to the two groups are as follows. The legacy model yields a step size of approximately 2.5 ft, while the 5.0.1+ model yields a step size of approximately 2.2 ft. Assuming that the algorithms for calculating distance are more accurate in the later software versions, we could scale all the “legacy distances” by a factor of roughly 0.87 (87%) to account for the difference.

summary(mod2)
## 
## Call:
## lm(formula = distance_mi ~ steps_count + 0, data = legacy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.36332 -0.15698 -0.05961  0.08132  1.27845 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## steps_count 4.738e-04  4.669e-07    1015   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2347 on 1119 degrees of freedom
## Multiple R-squared:  0.9989, Adjusted R-squared:  0.9989 
## F-statistic: 1.03e+06 on 1 and 1119 DF,  p-value: < 2.2e-16
summary(mod3)
## 
## Call:
## lm(formula = distance_mi ~ steps_count + 0, data = newer)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.66037 -0.18626 -0.08463  0.01848  2.63099 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## steps_count 4.123e-04  1.171e-06     352   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3604 on 387 degrees of freedom
## Multiple R-squared:  0.9969, Adjusted R-squared:  0.9969 
## F-statistic: 1.239e+05 on 1 and 387 DF,  p-value: < 2.2e-16
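The code that builds the legacy and newer subsets and fits mod2 and mod3 is not shown. The sketch below illustrates one way the per-group fits and the 87% scaling factor could be obtained. The data frame demo is synthetic, generated exactly from the two reported slopes, so it only demonstrates the mechanics; the group labels and variable names are assumptions.

```r
# Slopes reported in the two summaries above, in miles per step.
slope_legacy <- 4.738e-4
slope_newer  <- 4.123e-4

# Synthetic data built from those slopes (no noise), three days per group.
demo <- data.frame(
  steps_count   = c(8000, 12000, 16000, 9000, 13000, 17000),
  version_group = rep(c("legacy", "newer"), each = 3),
  stringsAsFactors = FALSE
)
demo$distance_mi <- demo$steps_count *
  ifelse(demo$version_group == "legacy", slope_legacy, slope_newer)

# Fit one no-intercept model per group and collect the slopes.
fits   <- lapply(split(demo, demo$version_group),
                 function(d) lm(distance_mi ~ steps_count + 0, data = d))
slopes <- sapply(fits, function(m) unname(coef(m)))

# Step length in feet for each group, and the legacy-to-newer scaling factor.
step_length_ft <- slopes * 5280
scale_factor   <- slopes[["newer"]] / slopes[["legacy"]]

round(step_length_ft, 1)
round(scale_factor, 2)
```

Multiplying the legacy distances by scale_factor would put both groups on the newer calibration, under the stated assumption that the later algorithm is the more accurate one.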