Apple Health EDA

Introduction

This document seeks to perform an exploratory dive into a small portion of the Health data recorded on my Apple Watch over the last four years with daily wear. The analysis will focus on pedometer-related measures like daily step counts, walking distance, and active energy burn.

The raw data was exported out of the Health app using the “Export All Health Data” feature highlighted in the image below. The outputs of the export include two XML files along with a directory containing gpx files for workouts.

All data was extracted from the “export.xml” file, which contained over 4.5M records. An R script leveraging the dplyr and lubridate libraries was used to parse, clean, and aggregate the records and produce the “daily_steps.csv” file for analysis here. Note: any bolded figures throughout the text of this document are the result of an inline code calculation.
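The extraction script is not reproduced in this document. As a rough sketch of the aggregation step only, the base R snippet below rolls record-level step entries up to daily totals. The toy records are made up, and the column names (type, startDate, value) and the timestamp format are assumptions about how the parsed export might look, not the script actually used.

```r
# Toy record-level data (made-up values); column names are assumed to mirror
# the attributes of a parsed Health export record.
records <- data.frame(
  type = c(rep("HKQuantityTypeIdentifierStepCount", 3),
           "HKQuantityTypeIdentifierDistanceWalkingRunning"),
  startDate = c("2015-09-22 08:15:00 -0700", "2015-09-22 12:40:00 -0700",
                "2015-09-23 09:05:00 -0700", "2015-09-22 08:15:00 -0700"),
  value = c(500, 700, 300, 0.25),
  stringsAsFactors = FALSE
)

# Keep only step-count records
steps <- records[records$type == "HKQuantityTypeIdentifierStepCount", ]

# The first 10 characters of the timestamp hold the local calendar date
steps$recordDate <- as.Date(substr(steps$startDate, 1, 10))

# Sum the step values within each day
daily <- aggregate(value ~ recordDate, steps, FUN = sum)
names(daily)[2] <- "steps_count"
daily
```

A real pipeline would also need to handle overlapping sources (for example phone and watch records covering the same interval), which this sketch ignores.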

Import Daily Steps Data

Load and Inspect

To begin, we’ll import the daily steps (activity) data, inspect the structure, and summarize the contents of each column. The str() function shows there are 1509 rows in the data set with 6 columns: date, hardware and software metadata, calories burned, distance, and step count. The summary shows the data type of each column and also reveals that distance and steps each have one missing value.
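The import call itself is not shown here. A minimal sketch of what it might look like follows, assuming the file is read with base R's read.csv(). To keep the snippet runnable on its own, a two-row temporary file (rows copied from output shown later in this document) stands in for the real “daily_steps.csv”. The colClasses argument is an assumption, added so that version strings such as "2.0" stay as text.

```r
# Stand-in for daily_steps.csv: two rows copied from output in this document
csv_lines <- c(
  "recordDate,hardware,software,energy_kcal,distance_mi,steps_count",
  "2015-09-22,\"Watch1,1\",2.0,506.490,6.224021,12769",
  "2016-03-08,\"Watch1,1\",2.1,0.003,,"
)
path <- tempfile(fileext = ".csv")
writeLines(csv_lines, path)

# In practice `path` would simply be "daily_steps.csv"
activity <- read.csv(path,
                     stringsAsFactors = FALSE,
                     colClasses = c(software = "character"))

str(activity)
```

Empty fields in the numeric columns are read as NA, which is how the missing distance and step values surface in the summary.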

str(activity)

## 'data.frame':    1509 obs. of  6 variables:
##  $ recordDate : chr  "2015-09-22" "2015-09-23" "2015-09-24" "2015-09-25" ...
##  $ hardware   : chr  "Watch1,1" "Watch1,1" "Watch1,1" "Watch1,1" ...
##  $ software   : chr  "2.0" "2.0" "2.0" "2.0" ...
##  $ energy_kcal: num  506 250 327 632 345 ...
##  $ distance_mi: num  6.22 3.64 3.49 4.43 5.5 ...
##  $ steps_count: int  12769 7416 6951 8943 10726 20552 5826 15162 18815 7913 ...

summary(activity)

##   recordDate          hardware           software        
##  Length:1509        Length:1509        Length:1509       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##   energy_kcal        distance_mi       steps_count   
##  Min.   :   0.003   Min.   : 0.0282   Min.   :   54  
##  1st Qu.: 362.171   1st Qu.: 4.5590   1st Qu.:10080  
##  Median : 506.708   Median : 6.0200   Median :13402  
##  Mean   : 548.740   Mean   : 6.3928   Mean   :14034  
##  3rd Qu.: 701.943   3rd Qu.: 7.9576   3rd Qu.:17318  
##  Max.   :1759.868   Max.   :16.1763   Max.   :37020  
##                     NA's   :1         NA's   :1

Which days are missing steps and distance?

sapply(activity, function(x) which(is.na(x)))

## $recordDate
## integer(0)
## 
## $hardware
## integer(0)
## 
## $software
## integer(0)
## 
## $energy_kcal
## integer(0)
## 
## $distance_mi
## [1] 170
## 
## $steps_count
## [1] 170

activity[170, ]

##     recordDate hardware software energy_kcal distance_mi steps_count
## 170 2016-03-08 Watch1,1      2.1       0.003          NA          NA

It appears that the date missing both steps and distance data (March 8, 2016) has an energy burn record of approximately 0. This could correspond to a day when I forgot to charge the watch overnight, put it on briefly, then removed it. We’ll remove this record from the data frame moving forward.

Clean Formatting

Next, let’s fix the formatting of the columns in addition to handling the missing data points by making the following updates:

factorize hardware and software columns because they are categorical
change recordDate to a Date type
create helper date columns for Month, Year, and Day of the Week

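library(dplyr)  # filter(), mutate(), %>%
library(knitr)  # kable()

activity <- activity %>%
  filter(!is.na(steps_count)) %>%                               # drop the record with no steps or distance
  mutate(hardware = as.factor(hardware),                        # categorical
         software = as.factor(software),                        # categorical
         recordDate = as.Date(recordDate),                      # character -> Date
         recordMonth = as.Date(format(recordDate, "%Y-%m-01")), # first day of the month
         recordYear = format(recordDate, "%Y"),
         dayOfWeek = format(recordDate, "%A"))
head(activity) %>% kable()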

recordDate	hardware	software	energy_kcal	distance_mi	steps_count	recordMonth	recordYear	dayOfWeek
2015-09-22	Watch1,1	2.0	506.490	6.224021	12769	2015-09-01	2015	Tuesday
2015-09-23	Watch1,1	2.0	250.358	3.639957	7416	2015-09-01	2015	Wednesday
2015-09-24	Watch1,1	2.0	327.250	3.488659	6951	2015-09-01	2015	Thursday
2015-09-25	Watch1,1	2.0	631.611	4.430512	8943	2015-09-01	2015	Friday
2015-09-26	Watch1,1	2.0	345.439	5.498719	10726	2015-09-01	2015	Saturday
2015-09-27	Watch1,1	2.0	602.511	9.926667	20552	2015-09-01	2015	Sunday

Descriptive Statistics

Metadata

Now that activity is in a tidy format, we will proceed with some high-level profiling and descriptive statistics of the various columns to get a sense of the data. The data ranges from Tuesday, September 22, 2015 through Friday, October 11, 2019. Over that time, I used 2 different watches. Their first dates of use are summarized below.

aggregate(recordDate ~ hardware, activity, FUN = min)

##   hardware recordDate
## 1 Watch1,1 2015-09-22
## 2 Watch3,1 2018-01-13

There were a total of 29 software versions of watchOS rolled out over that same time frame. The figure below shows how many days’ worth of activity were recorded on each version. I generally update my watch on the day of release, so the number of days per version can roughly be interpreted as the length of a software release cycle. From the chart, we can see that releases tend to follow a 30-, 60-, or 90-day cadence. Interestingly, the last two major releases, 5.0 and 6.0, were followed by updates much faster than 3.0 and 4.0 were.
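The counts behind such a chart can be tabulated in a line or two. The sketch below uses a ten-row toy data frame with made-up dates (not the real release dates) and the same column names as activity. The object names days_per_version and first_seen are illustrative only.

```r
# Toy data (made-up dates): seven days on version 4.3.1, then three on 5.0
toy <- data.frame(
  recordDate = as.Date("2018-09-10") + 0:9,
  software   = factor(c(rep("4.3.1", 7), rep("5.0", 3)))
)

# Number of recorded days per software version
days_per_version <- table(toy$software)
days_per_version

# First date each version appears, using the same aggregate() idiom as above
first_seen <- aggregate(recordDate ~ software, toy, FUN = min)
first_seen
```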

The same data is shown below with date on the horizontal axis.

Movement Data

The activity data contains three numeric variables related to movement: steps taken (count), distance traveled (mi), and energy burned (kcal). The distributions for each are illustrated in the kernel density plots below. Each measure appears to be right-skewed, with energy burned having the most dramatic right tail.

The quantile-quantile plot for daily steps below highlights the right skew as well as a handful of low outliers. The low outliers could correspond to days when I forgot to charge the watch, was ill, or was traveling. Taken together with the results of the Shapiro-Wilk test, we can conclude the step data is not normally distributed.

shapiro.test(activity$steps_count)

## 
##  Shapiro-Wilk normality test
## 
## data:  activity$steps_count
## W = 0.98248, p-value = 1.381e-12

ggqqplot(activity$steps_count)

Since the distance and energy measures are presumably estimated from a combination of the step count and other data not included here, like GPS, altitude, and heart rate, we could imagine that these measures should be somewhat correlated. Both Pearson and Spearman correlation coefficients are shown below for the pair-wise combinations; the two methods give comparable results. It is logical that steps and distance are highly correlated. It also follows that distance and energy burned are the least correlated, given that I regularly practice yoga and other anaerobic workouts, which would result in higher energy expenditure with less movement than walking. Hiking at high altitude could also explain this discrepancy. A follow-up analysis incorporating altitude gain could be an interesting way to explore this further.

Pearson Corr.
	steps_count	distance_mi	energy_kcal
steps_count	1.0000000	0.9835406	0.7337584
distance_mi	0.9835406	1.0000000	0.6962424
energy_kcal	0.7337584	0.6962424	1.0000000

Spearman Corr.
	steps_count	distance_mi	energy_kcal
steps_count	1.0000000	0.9833065	0.7563483
distance_mi	0.9833065	1.0000000	0.7184943
energy_kcal	0.7563483	0.7184943	1.0000000
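For reference, both matrices can be produced with cor() by switching the method argument. The sketch below runs on just the six rows displayed earlier in the head(activity) table, so its coefficients will not match the full-data values above.

```r
# The six rows shown in the head(activity) table
toy <- data.frame(
  steps_count = c(12769, 7416, 6951, 8943, 10726, 20552),
  distance_mi = c(6.224021, 3.639957, 3.488659, 4.430512, 5.498719, 9.926667),
  energy_kcal = c(506.490, 250.358, 327.250, 631.611, 345.439, 602.511)
)

# Pearson measures linear association; Spearman works on ranks
pearson  <- cor(toy, method = "pearson")
spearman <- cor(toy, method = "spearman")

pearson
spearman
```

On the full data frame the call would be restricted to the three numeric columns, for example cor(activity[, c("steps_count", "distance_mi", "energy_kcal")], method = "spearman").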

Regression: Estimating Step Length

In the previous section we saw that, unsurprisingly, number of steps taken and distance traveled have a high degree of linear association. For fun, we will use this relationship to fit a linear model for distance based on step count to estimate my step length, which should be the slope. This simple model will be fit with the lm() function. In this instance, we will force the intercept to be 0 since 0 steps should result in no distance traveled.

mod1 <- lm(activity$distance_mi ~ activity$steps_count + 0)
summary(mod1)

## 
## Call:
## lm(formula = activity$distance_mi ~ activity$steps_count + 0)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.61977 -0.20200  0.06156  0.27953  1.71485 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## activity$steps_count 4.570e-04  8.438e-07   541.6   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4973 on 1507 degrees of freedom
## Multiple R-squared:  0.9949, Adjusted R-squared:  0.9949 
## F-statistic: 2.933e+05 on 1 and 1507 DF,  p-value: < 2.2e-16

As expected, the model fits well, with a high R-squared and a low p-value. The slope of 4.5701142 × 10^-4 miles per step is approximately 2.4 feet per step. This looks like a reasonable figure given that the average step length is around 2.6 feet according to the University of Wyoming: https://www.livestrong.com/article/438170-the-average-walking-stride-length/
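The unit conversion behind the step-length figure is a single multiplication by 5280 feet per mile. A quick check, using the rounded coefficient printed by summary(mod1):

```r
# Slope of the no-intercept model, in miles per step (from summary(mod1))
slope_mi_per_step <- 4.570e-04

# Convert miles per step to feet per step
feet_per_mile  <- 5280
step_length_ft <- slope_mi_per_step * feet_per_mile

round(step_length_ft, 1)  # 2.4
```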

Looking at the residual plots for the model, however, we see there might be some underlying issues with the model.

Let’s plot Steps vs Distance with our line of fit to explore further. The figure below illustrates there may be two distinct populations within the data.

Back in the descriptive statistics section, we saw that I have used two different watches. One could postulate that two different devices might be calibrated differently and yield visibly different estimates of distance. Here is the same data colored by hardware and without the line of fit.

Interestingly, the watch hardware model does not appear to fully explain the discrepancy. Both watches are represented in the group with the higher slope, but only the newer model appears in the lower-slope segment. Instead, let’s repeat the exercise, mapping the color aesthetic to software version and faceting the graphs.

It’s a bit difficult to discern, but it appears that there is a consistent decrease in the slopes of the lines of fit after the release of Version 5.0. For ease of viewing, we can select a few versions to look at more closely and see when this change occurs.

The plot below shows the combined figure for versions 4.3.1 - 5.1.1. Here, it would appear that version 5.0 and earlier utilize a higher slope (or larger step size) to estimate distance traveled than 5.0.1+.

Now, let’s create a new categorical variable to indicate whether the software version is before or after 5.0.1 and replot the data. There is still some noise visible with high outliers in the newer software version group. Trial and error with adjusting the cut-off forward/back a version did not yield improvement.
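One way such an indicator could be built is with numeric_version(), which compares version strings component by component rather than alphabetically. This is a sketch only: the column name versionGroup and the group labels are illustrative, while the data frame names legacy and newer follow the model calls shown in the output below.

```r
# Toy data using version strings that appear in this document
toy <- data.frame(
  software = c("2.0", "4.3.1", "5.0", "5.0.1", "5.1.1"),
  stringsAsFactors = FALSE
)

# Versions at or after the cut-off go in the newer group.
# (For a factor column, wrap it in as.character() first.)
cutoff <- numeric_version("5.0.1")
toy$versionGroup <- ifelse(numeric_version(toy$software) >= cutoff,
                           "5.0.1+", "legacy")

# Split into the two subsets used for the separate model fits
legacy <- toy[toy$versionGroup == "legacy", , drop = FALSE]
newer  <- toy[toy$versionGroup == "5.0.1+", , drop = FALSE]

toy
```

Moving the cut-off forward or back a version only requires changing the string passed to numeric_version().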

The summary statistics for the models fit to the two groups are as follows. The legacy model yields a step size of approximately 2.5 ft, while the 5.0.1+ model yields a step size of approximately 2.2 ft. Assuming that the algorithms for calculating distance are more accurate in the later software versions, we could scale all the “legacy distances” by a factor of about 0.87 (87%) to account for the difference.

summary(mod2)

## 
## Call:
## lm(formula = distance_mi ~ steps_count + 0, data = legacy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.36332 -0.15698 -0.05961  0.08132  1.27845 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## steps_count 4.738e-04  4.669e-07    1015   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2347 on 1119 degrees of freedom
## Multiple R-squared:  0.9989, Adjusted R-squared:  0.9989 
## F-statistic: 1.03e+06 on 1 and 1119 DF,  p-value: < 2.2e-16

summary(mod3)

## 
## Call:
## lm(formula = distance_mi ~ steps_count + 0, data = newer)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.66037 -0.18626 -0.08463  0.01848  2.63099 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## steps_count 4.123e-04  1.171e-06     352   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3604 on 387 degrees of freedom
## Multiple R-squared:  0.9969, Adjusted R-squared:  0.9969 
## F-statistic: 1.239e+05 on 1 and 387 DF,  p-value: < 2.2e-16
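The step sizes and the scaling factor quoted for the two groups follow directly from the two slopes. A quick check using the rounded coefficients printed in the summaries:

```r
# steps_count coefficients (miles per step) from summary(mod2) and summary(mod3)
slope_legacy <- 4.738e-04
slope_newer  <- 4.123e-04

# Step length in feet for each group
feet_per_mile <- 5280
step_ft <- c(legacy = slope_legacy, newer = slope_newer) * feet_per_mile
round(step_ft, 1)       # legacy 2.5, newer 2.2

# Ratio of the newer slope to the legacy slope
scale_factor <- slope_newer / slope_legacy
round(scale_factor, 2)  # 0.87
```

Multiplying a legacy distance_mi value by scale_factor would put it on the same footing as the 5.0.1+ estimates, under the stated assumption that the newer algorithm is the more accurate one.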

Trends in Activity Level

Hypothesis Testing

Like many individuals with a typical 8AM-5PM Monday-Friday work schedule, I am usually busiest during the week and have more leisure time on weekends for exercise and recreation. Let’s see if this is reflected in the Apple Watch movement data. Below are some summary statistics along with a boxplot of step count by day of week. Overall, Thursday appears to be my least active day of the week based on step count, while Sunday is my most active. From visual inspection of the boxplot, it does appear I am more active on weekends than weekdays.

dayOfWeek	Mean	Median	Max	StDev
Monday	13215.55	12685.0	28421	4748.726
Tuesday	13462.85	13316.0	30534	5265.251
Wednesday	12790.08	12447.0	27622	4592.931
Thursday	11960.07	11680.0	27008	5112.986
Friday	12857.07	12475.5	29907	4985.119
Saturday	16397.20	16137.0	37020	6668.578
Sunday	17668.83	16900.0	34837	6376.533

To test the hypothesis, we can use a Wilcoxon rank-sum test, since we previously saw that the distribution of steps does not meet the normality requirement for a t-test. The results below indicate that weekends are significantly more active days than weekdays. The effect size, measured by Cliff’s Delta, was calculated as well and suggests that the difference between the groups is “medium”.

dayGroup	Mean	StDev
Weekday	12848.65	4966.632
Weekend	17024.10	6549.600
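The dayGroup column is not created in any of the code shown so far. A minimal sketch of one way to derive it follows, using the first six days of the data (the rows from the head(activity) table). It relies on the numeric weekday from as.POSIXlt() rather than day names, an assumption made here so that the result does not depend on the system locale.

```r
# First six days of the data, copied from the head(activity) table
toy <- data.frame(
  recordDate  = as.Date("2015-09-22") + 0:5,
  steps_count = c(12769, 7416, 6951, 8943, 10726, 20552)
)

# wday runs from 0 (Sunday) to 6 (Saturday)
wday <- as.POSIXlt(toy$recordDate)$wday
toy$dayGroup <- ifelse(wday %in% c(0, 6), "Weekend", "Weekday")

# Group means, analogous to the summary table above
group_means <- aggregate(steps_count ~ dayGroup, toy, FUN = mean)
group_means
```

Because "Weekday" sorts before "Weekend", a one-sided test with alternative = "less" asks whether weekday steps tend to be lower than weekend steps.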

wilcox.test(steps_count ~ dayGroup, activity, alternative = "less")

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  steps_count by dayGroup
## W = 142890, p-value < 2.2e-16
## alternative hypothesis: true location shift is less than 0

cliff.delta(steps_count ~ dayGroup, activity)

## 
## Cliff's Delta
## 
## delta estimate: -0.3817476 (medium)
## 95 percent confidence interval:
##      lower      upper 
## -0.4410538 -0.3191219
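To make the effect size less of a black box: Cliff's Delta is the proportion of (weekday, weekend) pairs in which the weekday value is larger, minus the proportion in which it is smaller. It is also tied to the Wilcoxon statistic by delta = 2W / (n * m) - 1, where n and m are the group sizes. The sketch below checks both on six step counts from the head(activity) table. The helper cliff_delta_manual is a hypothetical name, not part of any package.

```r
# Hypothetical helper: Cliff's Delta from all pairwise comparisons
cliff_delta_manual <- function(x, y) {
  diffs <- outer(x, y, "-")  # every x[i] - y[j]
  (sum(diffs > 0) - sum(diffs < 0)) / (length(x) * length(y))
}

# Step counts from the first six days of the data
weekday <- c(12769, 7416, 6951, 8943)  # Tuesday through Friday
weekend <- c(10726, 20552)             # Saturday and Sunday

d <- cliff_delta_manual(weekday, weekend)

# Same quantity recovered from the rank-sum statistic
W <- unname(wilcox.test(weekday, weekend)$statistic)
d_from_W <- 2 * W / (length(weekday) * length(weekend)) - 1

c(delta = d, from_W = d_from_W)
```

A negative value means the first group (weekdays) tends to have fewer steps. The “medium” label appears to follow the commonly cited magnitude cut-offs of roughly 0.147, 0.33, and 0.474.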

Seasonality

Just as people tend to be more active on certain days of the week due to their schedules, activity levels are also known to vary over the year with changes in weather, holidays, and the like. The general population tends to be more active during the warmer months and less active in the winter. We can explore such seasonality in my step data by performing a seasonal trend decomposition.

The following analysis will aggregate the data on a monthly basis. The plots below show the original time series along with the trend component, seasonal component, and remainder. Visual inspection reveals some expected insights. In 2017 I moved to an urban neighborhood with more amenities accessible within walking distance, which would explain the overall increase in the trend by about 3,000 steps per day on average. The seasonal component is also in line with expectations: activity is lower in the colder months and higher in the warmer months experienced in the continental United States. The magnitude of this seasonal swing is quite small, however, roughly half that of the remainder.
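The decomposition code is not shown. A minimal sketch follows, assuming it was done with stl() from base R's stats package, whose output uses the same seasonal, trend, and remainder naming. The series here is synthetic (a straight-line trend plus a sine wave), purely to show the mechanics, and the start month is illustrative.

```r
# Synthetic monthly series: 48 months of trend plus a yearly cycle
months    <- 1:48
synthetic <- 12000 + 60 * months + 1000 * sin(2 * pi * months / 12)

# Monthly time series; frequency = 12 tells stl() the seasonal period
steps_ts <- ts(synthetic, start = c(2015, 10), frequency = 12)

# Decompose with a fixed (periodic) seasonal pattern
fit        <- stl(steps_ts, s.window = "periodic")
components <- fit$time.series  # columns: seasonal, trend, remainder

head(components)
# plot(fit) draws the four stacked panels: data, seasonal, trend, remainder
```

Comparing the range of the seasonal column with the range of the remainder column is one way to quantify how large the seasonal swing is relative to the unexplained variation.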

These findings can also be observed in the heatmap below. Displayed in this way, the low outlier in March 2016 becomes more easily discernible. This is consistent with a period of several weeks in which I was incapacitated with pneumonia.

Supplementary/Just for Fun Visualizations

The following figures are additional exploratory charts one might use to look for additional patterns not previously uncovered.

Line + Bubble Graph in Polar Coordinates, standard deviation for week mapped to bubble size:

Here is a version with cumulative steps and weekly total mapped to bubble size:

Parallel Coordinates Plot:

Interactive 3D Scatter, click and drag to zoom/pan/rotate: