This document takes an exploratory dive into a small portion of the Health data recorded on my Apple Watch over the last four years of daily wear. The analysis focuses on pedometer-related measures such as daily step count, walking distance, and active energy burn.
The raw data was exported from the Health app using the “Export All Health Data” feature highlighted in the image below. The export produces two XML files along with a directory containing GPX files for workouts.
All data was extracted from the “export.xml” file, which contained over 4.5M records. An R script leveraging the dplyr and lubridate libraries was used to parse, clean, and aggregate the records and produce the “daily_steps.csv” file analyzed here. Note: any bolded figures throughout the text of this document are the result of an inline code calculation.
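As a rough illustration of the aggregation step, the sketch below rolls a handful of per-record rows up into daily step totals. It is not the original script: it uses base R rather than dplyr and lubridate to stay dependency-free, and the `records` data frame, its column names, and the timestamp format are assumptions about what the parsed export might look like.

```r
# Toy stand-in for records parsed from export.xml (values are made up;
# column names and timestamp format are assumptions).
records <- data.frame(
  type = c("HKQuantityTypeIdentifierStepCount",
           "HKQuantityTypeIdentifierStepCount",
           "HKQuantityTypeIdentifierStepCount",
           "HKQuantityTypeIdentifierDistanceWalkingRunning"),
  startDate = c("2015-09-22 08:15:00 -0500",
                "2015-09-22 17:40:00 -0500",
                "2015-09-23 09:05:00 -0500",
                "2015-09-22 08:15:00 -0500"),
  value = c(1200, 800, 950, 0.6),
  stringsAsFactors = FALSE
)

# Keep step records only and take the calendar date from the timestamp text.
steps <- records[records$type == "HKQuantityTypeIdentifierStepCount", ]
steps$recordDate <- as.Date(substr(steps$startDate, 1, 10))

# Sum the per-record values into one row per day.
daily_steps <- aggregate(value ~ recordDate, data = steps, FUN = sum)
names(daily_steps)[2] <- "steps_count"
daily_steps
```

The same pattern would apply to distance and energy records, each filtered by its own record type before summing.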
To begin, we’ll import the daily activity data, inspect its structure, and summarize the contents of each column. The str() function shows there are 1509 rows in the data set with 6 columns: the date, hardware and software metadata, and the calories burned, distance, and step count. The summary shows the data type of each column and also reveals that distance and steps each have one missing value.
str(activity)
## 'data.frame': 1509 obs. of 6 variables:
## $ recordDate : chr "2015-09-22" "2015-09-23" "2015-09-24" "2015-09-25" ...
## $ hardware : chr "Watch1,1" "Watch1,1" "Watch1,1" "Watch1,1" ...
## $ software : chr "2.0" "2.0" "2.0" "2.0" ...
## $ energy_kcal: num 506 250 327 632 345 ...
## $ distance_mi: num 6.22 3.64 3.49 4.43 5.5 ...
## $ steps_count: int 12769 7416 6951 8943 10726 20552 5826 15162 18815 7913 ...
summary(activity)
## recordDate hardware software
## Length:1509 Length:1509 Length:1509
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## energy_kcal distance_mi steps_count
## Min. : 0.003 Min. : 0.0282 Min. : 54
## 1st Qu.: 362.171 1st Qu.: 4.5590 1st Qu.:10080
## Median : 506.708 Median : 6.0200 Median :13402
## Mean : 548.740 Mean : 6.3928 Mean :14034
## 3rd Qu.: 701.943 3rd Qu.: 7.9576 3rd Qu.:17318
## Max. :1759.868 Max. :16.1763 Max. :37020
## NA's :1 NA's :1
Which days are missing steps and distance?
sapply(activity, function(x) which(is.na(x)))
## $recordDate
## integer(0)
##
## $hardware
## integer(0)
##
## $software
## integer(0)
##
## $energy_kcal
## integer(0)
##
## $distance_mi
## [1] 170
##
## $steps_count
## [1] 170
activity[170, ]
## recordDate hardware software energy_kcal distance_mi steps_count
## 170 2016-03-08 Watch1,1 2.1 0.003 NA NA
The one date missing both steps and distance data (March 8, 2016) also has an energy burn of nearly zero (0.003 kcal). This could be a day when I forgot to charge the watch overnight, put it on briefly, then removed it. We’ll remove this record from the data frame moving forward.
Next, let’s fix the formatting of the columns in addition to handling the missing data points by making the following updates:
library(dplyr)  # filter(), mutate(), %>%
library(knitr)  # kable()
activity <- activity %>%
  filter(!is.na(steps_count)) %>%  # drop the record with no steps or distance
  mutate(hardware = as.factor(hardware),
         software = as.factor(software),
         recordDate = as.Date(recordDate),
         recordMonth = as.Date(format(recordDate, "%Y-%m-01")),  # first day of the month
         recordYear = format(recordDate, "%Y"),
         dayOfWeek = format(recordDate, "%A"))  # full weekday name
head(activity) %>% kable()
| recordDate | hardware | software | energy_kcal | distance_mi | steps_count | recordMonth | recordYear | dayOfWeek |
|---|---|---|---|---|---|---|---|---|
| 2015-09-22 | Watch1,1 | 2.0 | 506.490 | 6.224021 | 12769 | 2015-09-01 | 2015 | Tuesday |
| 2015-09-23 | Watch1,1 | 2.0 | 250.358 | 3.639957 | 7416 | 2015-09-01 | 2015 | Wednesday |
| 2015-09-24 | Watch1,1 | 2.0 | 327.250 | 3.488659 | 6951 | 2015-09-01 | 2015 | Thursday |
| 2015-09-25 | Watch1,1 | 2.0 | 631.611 | 4.430512 | 8943 | 2015-09-01 | 2015 | Friday |
| 2015-09-26 | Watch1,1 | 2.0 | 345.439 | 5.498719 | 10726 | 2015-09-01 | 2015 | Saturday |
| 2015-09-27 | Watch1,1 | 2.0 | 602.511 | 9.926667 | 20552 | 2015-09-01 | 2015 | Sunday |
Now that activity is in a tidy format, we will proceed with some high-level profiling and descriptive statistics of the various columns to get a sense of the data. The data ranges from Tuesday, September 22, 2015 through Friday, October 11, 2019. Over that time, I used two different watches. Their first dates of use are summarized below.
aggregate(recordDate ~ hardware, activity, FUN = min)
## hardware recordDate
## 1 Watch1,1 2015-09-22
## 2 Watch3,1 2018-01-13
There were a total of 29 software versions of watchOS rolled out over that same time frame. The figure below shows how many days’ worth of activity were recorded on each version. I generally update my watch on the day of release, so the number of days per version can roughly be interpreted as a software release cycle. From the chart, we can see that releases tend to follow a 30-, 60-, or 90-day cadence. Interestingly, the last two major releases, 5.0 and 6.0, were followed by updates much sooner than 3.0 and 4.0 were.
The same data is shown below with date on the horizontal axis.
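The days-per-version counts behind these two figures could be computed along the following lines. This is a minimal sketch on a toy data frame shaped like `activity`; the dates and version strings are illustrative, and `days_per_version`, `first_seen`, and `version_summary` are names introduced here.

```r
# Toy subset shaped like `activity` (dates and versions are illustrative).
toy <- data.frame(
  recordDate = as.Date("2015-09-22") + 0:5,
  software   = c("2.0", "2.0", "2.0", "2.0.1", "2.0.1", "2.1"),
  stringsAsFactors = FALSE
)

# Number of recorded days on each software version.
days_per_version <- as.data.frame(table(software = toy$software),
                                  responseName = "days")

# First day each version appears, for plotting against date.
first_seen <- aggregate(recordDate ~ software, data = toy, FUN = min)

version_summary <- merge(days_per_version, first_seen, by = "software")
version_summary
```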
The activity data contains three numeric type variables related to movement including steps taken (count), distance traveled (mi), and energy burned (kcal). The distributions for each are illustrated in the kernel density plots below. Each measure appears to be right-skewed, where energy burned has the most dramatic right tail.
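For readers who want to put a number on the skew, the sketch below computes a kernel density estimate and a simple moment-based skewness. The data is simulated from a gamma distribution as a stand-in for a right-skewed measure such as energy_kcal; the distribution and its parameters are assumptions chosen only for illustration.

```r
# Simulated right-skewed daily values standing in for energy_kcal.
set.seed(1)
x <- rgamma(500, shape = 4, rate = 4 / 550)

# Kernel density estimate; plot(d) would draw the density curve.
d <- density(x)

# Moment-based sample skewness; positive values indicate a right tail.
skewness <- mean((x - mean(x))^3) / sd(x)^3
skewness
```

Applying the same calculation to each of the three columns would give a way to compare their right tails numerically rather than by eye.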
The quantile-quantile plot for daily steps below highlights the right skew as well as a handful of low outliers. The low outliers could correspond to days when I forgot to charge the watch, was ill, or was traveling. Taken together with the results of the Shapiro-Wilk test, we can conclude the step data is not normally distributed.
shapiro.test(activity$steps_count)
##
## Shapiro-Wilk normality test
##
## data: activity$steps_count
## W = 0.98248, p-value = 1.381e-12
ggqqplot(activity$steps_count)
Since the distance and energy measures are presumably estimated from a combination of the step count and other data not included here, such as GPS, altitude, and heart rate, we would expect these measures to be somewhat correlated. Both Pearson and Spearman correlation coefficients are shown below for the pairwise combinations; the two methods give comparable results. It is logical that steps and distance are highly correlated. It also follows that distance and energy burned are the least correlated, given that I regularly practice yoga and other anaerobic workouts, which would result in higher energy expenditure with less movement than walking. Hiking at high altitude could also explain this discrepancy. A follow-up analysis incorporating altitude gain could be an interesting way to explore this further.
| | steps_count | distance_mi | energy_kcal |
|---|---|---|---|
| steps_count | 1.0000000 | 0.9835406 | 0.7337584 |
| distance_mi | 0.9835406 | 1.0000000 | 0.6962424 |
| energy_kcal | 0.7337584 | 0.6962424 | 1.0000000 |
| | steps_count | distance_mi | energy_kcal |
|---|---|---|---|
| steps_count | 1.0000000 | 0.9833065 | 0.7563483 |
| distance_mi | 0.9833065 | 1.0000000 | 0.7184943 |
| energy_kcal | 0.7563483 | 0.7184943 | 1.0000000 |
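Correlation matrices like the two above can be produced with the base R cor() function by switching its `method` argument. The sketch below uses only the first six days of data shown earlier (rounded), so its coefficients will not match the full-data tables; `toy`, `pearson`, and `spearman` are names introduced here.

```r
# First six rows of the activity data shown earlier, rounded.
toy <- data.frame(
  steps_count = c(12769, 7416, 6951, 8943, 10726, 20552),
  distance_mi = c(6.22, 3.64, 3.49, 4.43, 5.50, 9.93),
  energy_kcal = c(506, 250, 327, 632, 345, 603)
)

# Pearson measures linear association; Spearman works on ranks,
# so it is less sensitive to skew and outliers.
pearson  <- cor(toy, method = "pearson")
spearman <- cor(toy, method = "spearman")

round(pearson, 3)
round(spearman, 3)
```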
In the previous section we saw that, unsurprisingly, number of steps taken and distance traveled have a high degree of linear association. For fun, we will use this relationship to fit a linear model for distance based on step count to estimate my step length, which should be the slope. This simple model will be fit with the lm() function. In this instance, we will force the intercept to be 0 since 0 steps should result in no distance traveled.
mod1 <- lm(activity$distance_mi ~ activity$steps_count + 0)
summary(mod1)
##
## Call:
## lm(formula = activity$distance_mi ~ activity$steps_count + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.61977 -0.20200 0.06156 0.27953 1.71485
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## activity$steps_count 4.570e-04 8.438e-07 541.6 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4973 on 1507 degrees of freedom
## Multiple R-squared: 0.9949, Adjusted R-squared: 0.9949
## F-statistic: 2.933e+05 on 1 and 1507 DF, p-value: < 2.2e-16
As expected, the model fits well, with a high R-squared and a very small p-value, although R computes R-squared differently for models without an intercept, so the value is not directly comparable to that of a model with one. The slope of 4.5701142 × 10^-4 miles per step is approximately 2.4 feet per step. This looks like a reasonable figure given that the average step length is around 2.6 feet according to the University of Wyoming: https://www.livestrong.com/article/438170-the-average-walking-stride-length/
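The conversion from the fitted slope to a step length is simple arithmetic, spelled out here using the coefficient as printed in the model summary (so it is rounded to four significant figures). The variable names are introduced for illustration.

```r
slope_mi_per_step <- 4.570e-04  # estimate reported by summary(mod1)
feet_per_mile     <- 5280

# Miles per step times feet per mile gives feet per step.
step_length_ft <- slope_mi_per_step * feet_per_mile
round(step_length_ft, 2)
```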
Looking at the residual plots for the model, however, we see that there may be some underlying issues.
Let’s plot Steps vs Distance with our line of fit to explore further. The figure below illustrates there may be two distinct populations within the data.
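To see why a single pooled fit leaves structured residuals when two populations are present, here is a small simulation. The two slopes, the noise level, and the group labels are illustrative assumptions, not values estimated from my data at this point in the analysis.

```r
set.seed(42)
n <- 300

# Two simulated groups that convert steps to distance with different slopes.
steps    <- round(runif(2 * n, 2000, 25000))
group    <- rep(c("A", "B"), each = n)
slope    <- ifelse(group == "A", 4.7e-04, 4.1e-04)  # assumed miles per step
distance <- slope * steps + rnorm(2 * n, sd = 0.2)

# One through-origin line for everything, as in mod1.
pooled <- lm(distance ~ steps + 0)

# Residuals sit above the pooled line for one group and below it for the other.
mean_resid <- tapply(resid(pooled), group, mean)
mean_resid
```

A pooled slope lands between the two group slopes, so each group's residuals drift away from zero in opposite directions as step count grows.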
Back in the descriptive statistics section, we saw that I have used two different watches. One could postulate that two different devices might be calibrated differently and yield visibly different estimates of distance. Here is the same data colored by hardware and without the line of fit.
Interestingly, the Watch hardware model does not appear to fully explain the discrepancy. Both watches are represented in the group with the higher slope, but only the newer model appears in the lower-slope segment. Instead, let’s repeat the exercise, mapping the color aesthetic to software version and faceting the graphs.
It’s a bit difficult to discern, but it appears that there is a consistent decrease in the slopes of the lines of fit after the release of Version 5.0. For ease of viewing, we can select a few versions to look at more closely and see when this change occurs.
The plot below shows the combined figure for versions 4.3.1 - 5.1.1. Here, it would appear that version 5.0 and earlier utilize a higher slope (or larger step size) to estimate distance traveled than 5.0.1+.
Now, let’s create a new categorical variable to indicate whether the software version is before or after 5.0.1 and replot the data. There is still some noise visible with high outliers in the newer software version group. Trial and error with adjusting the cut-off forward/back a version did not yield improvement.
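One way to build such a before/after flag is to compare version strings with base R's numeric_version(), which compares component by component rather than alphabetically. This is a sketch only; the `softwareGroup` name and its labels are assumptions, and the version list is a small illustrative sample.

```r
software <- c("2.0", "4.3.1", "5.0", "5.0.1", "5.1.1", "6.0")

# numeric_version() compares versions component by component,
# so "5.0.1" sorts after "5.0" and "10.0" would sort after "9.0".
is_newer <- numeric_version(software) >= numeric_version("5.0.1")

softwareGroup <- factor(ifelse(is_newer, "5.0.1+", "legacy"),
                        levels = c("legacy", "5.0.1+"))
data.frame(software, softwareGroup)
```

Comparing the raw strings or converting them with as.numeric() would not work for versions with three components, which is why a version-aware comparison is used here.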
The summary statistics for the models fit to the two groups are as follows. The legacy model yields a step size of approximately 2.5 ft, while the 5.0.1+ model yields a step size of approximately 2.2 ft. Assuming that the algorithms for calculating distance are more accurate in the later software versions, we could scale all the “legacy distances” by a factor of about 0.87 (the ratio of the two slopes) to account for the difference.
summary(mod2)
##
## Call:
## lm(formula = distance_mi ~ steps_count + 0, data = legacy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.36332 -0.15698 -0.05961 0.08132 1.27845
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## steps_count 4.738e-04 4.669e-07 1015 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2347 on 1119 degrees of freedom
## Multiple R-squared: 0.9989, Adjusted R-squared: 0.9989
## F-statistic: 1.03e+06 on 1 and 1119 DF, p-value: < 2.2e-16
summary(mod3)
##
## Call:
## lm(formula = distance_mi ~ steps_count + 0, data = newer)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.66037 -0.18626 -0.08463 0.01848 2.63099
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## steps_count 4.123e-04 1.171e-06 352 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3604 on 387 degrees of freedom
## Multiple R-squared: 0.9969, Adjusted R-squared: 0.9969
## F-statistic: 1.239e+05 on 1 and 387 DF, p-value: < 2.2e-16
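The step sizes and the 87% scale factor quoted above follow directly from the two slope estimates. The sketch below reproduces that arithmetic from the printed coefficients and shows how a rescaling might be applied; the `toy` data frame, the `softwareGroup` column, and `distance_adj` are hypothetical names used only for illustration.

```r
slope_legacy <- 4.738e-04  # mod2 estimate, miles per step
slope_newer  <- 4.123e-04  # mod3 estimate, miles per step

# Step length in feet for each group.
step_ft <- c(legacy = slope_legacy, newer = slope_newer) * 5280
round(step_ft, 1)

# Ratio of newer to legacy slope, used to rescale legacy distances.
scale_factor <- slope_newer / slope_legacy
round(scale_factor, 2)

# Hypothetical rescaling: only legacy rows are adjusted.
toy <- data.frame(
  softwareGroup = c("legacy", "legacy", "5.0.1+"),
  distance_mi   = c(6.22, 3.64, 5.10)
)
toy$distance_adj <- ifelse(toy$softwareGroup == "legacy",
                           toy$distance_mi * scale_factor,
                           toy$distance_mi)
toy
```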
Like many individuals with a typical 8AM-5PM Monday-Friday work schedule, I am busiest during the week and have more leisure time on weekends for exercise and recreation. Let’s see if this is reflected in the Apple Watch movement data. Below are some summary statistics along with a boxplot of step count by day of week. Overall, Thursday appears to be my least active day of the week based on step count, while Sunday is my most active. From visual inspection of the boxplot, it does appear I am more active on weekends than on weekdays.
| dayOfWeek | Mean | Median | Max | StDev |
|---|---|---|---|---|
| Monday | 13215.55 | 12685.0 | 28421 | 4748.726 |
| Tuesday | 13462.85 | 13316.0 | 30534 | 5265.251 |
| Wednesday | 12790.08 | 12447.0 | 27622 | 4592.931 |
| Thursday | 11960.07 | 11680.0 | 27008 | 5112.986 |
| Friday | 12857.07 | 12475.5 | 29907 | 4985.119 |
| Saturday | 16397.20 | 16137.0 | 37020 | 6668.578 |
| Sunday | 17668.83 | 16900.0 | 34837 | 6376.533 |
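A summary table of this shape can be produced in base R with aggregate(). The sketch below runs on a small toy data frame with made-up step counts, so the numbers will not match the table above; the column names mirror those in `activity`.

```r
day_levels <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                "Friday", "Saturday", "Sunday")

# Toy data: two observations each for three days (values are illustrative).
toy <- data.frame(
  dayOfWeek   = factor(c("Saturday", "Sunday", "Saturday",
                         "Sunday", "Monday", "Monday"),
                       levels = day_levels),
  steps_count = c(10726, 20552, 16000, 17000, 12000, 13000)
)

# Several statistics per group; aggregate() returns them as a matrix column.
by_day <- aggregate(steps_count ~ dayOfWeek, data = toy,
                    FUN = function(x) c(Mean = mean(x), Median = median(x),
                                        Max = max(x), StDev = sd(x)))

# Flatten the matrix column into ordinary columns.
by_day <- do.call(data.frame, by_day)
by_day
```

Storing the day names as a factor with explicit levels keeps the rows in calendar order instead of alphabetical order.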
To test the hypothesis, we can use a Wilcoxon rank-sum test, since we previously saw that the distribution of steps does not meet the normality requirement for a t-test. The results below confirm that weekends are significantly more active days than weekdays. The effect size, measured by Cliff’s Delta, was calculated as well and suggests that the difference between the groups is “medium”.
| dayGroup | Mean | StDev |
|---|---|---|
| Weekday | 12848.65 | 4966.632 |
| Weekend | 17024.10 | 6549.600 |
wilcox.test(steps_count ~ dayGroup, activity, alternative = "less")
##
## Wilcoxon rank sum test with continuity correction
##
## data: steps_count by dayGroup
## W = 142890, p-value < 2.2e-16
## alternative hypothesis: true location shift is less than 0
cliff.delta(steps_count ~ dayGroup, activity)
##
## Cliff's Delta
##
## delta estimate: -0.3817476 (medium)
## 95 percent confidence interval:
## lower upper
## -0.4410538 -0.3191219
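The creation of the dayGroup variable is not shown above, so here is one way it might be derived, together with a by-hand version of Cliff's Delta to show what the statistic measures. The data is the first six days from the table shown earlier; the grouping logic is an assumption, and the manual calculation is a sketch of the definition rather than the effsize implementation.

```r
dates <- as.Date("2015-09-22") + 0:5  # Tuesday through Sunday
steps <- c(12769, 7416, 6951, 8943, 10726, 20552)

# as.POSIXlt()$wday is 0 for Sunday and 6 for Saturday, independent of locale.
wday <- as.POSIXlt(dates)$wday
dayGroup <- factor(ifelse(wday %in% c(0, 6), "Weekend", "Weekday"),
                   levels = c("Weekday", "Weekend"))

# Cliff's Delta: share of (weekday, weekend) pairs where the weekday value
# is larger, minus the share where it is smaller.
x <- steps[dayGroup == "Weekday"]
y <- steps[dayGroup == "Weekend"]
diffs <- outer(x, y, "-")
delta <- (sum(diffs > 0) - sum(diffs < 0)) / length(diffs)
delta
```

A value of -1 would mean every weekend day beat every weekday, and 0 would mean no tendency either way, which helps put the reported estimate of about -0.38 in context.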
Just as people tend to be more active on certain days of the week due to their schedules, activity levels are also known to vary over the year with changes in weather, holidays, and the like. The general population tends to be more active during the warmer months and less active in the winter. We can explore such seasonality in my step data by performing a seasonal trend decomposition.
The following analysis aggregates the data on a monthly basis. The plots below show the original time series along with the trend component, seasonal component, and remainder. Visual inspection reveals some expected insights. In 2017 I moved to an urban neighborhood with more amenities accessible within walking distance, which would explain the overall increase in the trend of about 3,000 steps per day on average. The seasonal component is also in line with expectation: activity is lower in the colder months and higher in the warmer months experienced in the continental United States. The magnitude of this seasonal swing is quite small, however, roughly half that of the remainder.
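A decomposition into trend, seasonal, and remainder components can be sketched with the stl() function from base R's stats package. Whether the plots here were produced with stl() specifically is an assumption, and the monthly series below is simulated (trend, yearly cycle, and noise levels are made up) purely to show the mechanics.

```r
set.seed(7)
n_months <- 50                                 # Sep 2015 through Oct 2019
m <- seq_len(n_months)

# Simulated monthly mean steps: slow upward trend + yearly cycle + noise.
monthly_steps <- 13000 + 40 * m +
  800 * sin(2 * pi * (m + 5) / 12) +
  rnorm(n_months, sd = 1200)

steps_ts <- ts(monthly_steps, start = c(2015, 9), frequency = 12)

# "periodic" forces the same seasonal pattern in every year.
fit <- stl(steps_ts, s.window = "periodic")
head(fit$time.series)

# Compare the size of the seasonal swing with the spread of the remainder.
apply(fit$time.series, 2, function(v) diff(range(v)))
```

The three components add back up to the original series, so comparing their ranges is a quick way to judge how much of the month-to-month variation is seasonal.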
These findings can also be observed in the heatmap below. Displayed in this way, the low outlier in March 2016 becomes more easily discernible. This is consistent with a period of several weeks in which I was incapacitated with pneumonia.
The following figures are additional exploratory charts one might use to look for additional patterns not previously uncovered.
Line + Bubble Graph in Polar Coordinates, standard deviation for week mapped to bubble size:
Here is a version with cumulative steps and weekly total mapped to bubble size:
Parallel Coordinates Plot:
Interactive 3D Scatter, click and drag to zoom/pan/rotate: