This post is an English translation of my original blog post on nipponkiyoshi.com.
We will learn how to extract Apple Health data (steps, distance traveled, sleep time, and so on). Apple Health data is saved in .xml format, so filtering it is a bit more involved than with the usual .csv or .xls files. However, there is nothing we can’t handle. Since I do not use an Apple Watch, the data in this demonstration may be more limited than what you have.
To get the original file containing all your information, go to
Health -> Profile -> Export Health Data
After obtaining the .zip file, extract it and you will get a folder
containing 2 files, export.xml and
export_cda.xml. The export_cda.xml file follows the
Clinical Document Architecture (CDA) standard common in the
US; CDA is often used in clinics, and this file contains data
about your medical history. Here, we are interested in the export.xml file,
which contains data about your distance traveled, activity, and sleep.
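If you prefer to do the extraction in Python rather than in Finder, here is a minimal sketch (the archive name export.zip is an assumption; use whatever name your phone exported):
import zipfile
# Extract the archive; it should produce the 'apple_health_export' folder
with zipfile.ZipFile('export.zip') as zf:
    zf.extractall('.')
    print(zf.namelist())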
To work with the data, you can use popular tools
such as Visual Studio Code or Jupyter Notebook. Using Jupyter is more
comfortable because you don’t have to re-run everything from start to finish after
every code change. I’m not an expert in XML, but after a bit
of research on Wikipedia, I can roughly visualize its structure as
follows.
Product
├───Name
└───Details
    ├───Description
    └───Price
<Product>
<Name>Widget</Name>
<Details>
<Description>
This Widget is the highest quality widget.
</Description>
<Price>5.50</Price>
</Details>
</Product>
At first glance, it looks just like a Python dictionary with
key:value pairs. However, a value here can itself contain more
dictionaries (nesting). We’ll use a few libraries to parse the
data from this XML file into a format more suitable for processing in
Pandas.
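To get a feel for this, here is a minimal sketch using a slightly extended version of the toy Product XML above (an id attribute is added purely for illustration). xmltodict, available from PyPI, turns the XML into nested dictionaries, and XML attributes become keys prefixed with “@”, which is why the Apple Health columns below end up named @type, @value, and so on.
import xmltodict

sample = """
<Product id="42">
  <Name>Widget</Name>
  <Details>
    <Description>This Widget is the highest quality widget.</Description>
    <Price>5.50</Price>
  </Details>
</Product>
"""
parsed = xmltodict.parse(sample)
print(parsed['Product']['Name'])              # Widget
print(parsed['Product']['Details']['Price'])  # 5.50
print(parsed['Product']['@id'])               # 42, attributes get an '@' prefix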
We are going to use the following libraries
import numpy as np
import pandas as pd
import xmltodict  # parses XML into a Python dict
import datetime
import matplotlib.pyplot as plt
In this case, the xmltodict library will help us split
and transform the data mentioned above into proper Pandas format.
# parse the XML export into a pandas DataFrame
input_path = 'apple_health_export/export.xml'
with open(input_path, 'r') as xml_file:
    input_read = xmltodict.parse(xml_file.read())
records = input_read['HealthData']['Record']
df = pd.DataFrame(records)
# Convert date columns to datetime
date_format = '%Y-%m-%d %H:%M:%S %z'
df['@creationDate'] = pd.to_datetime(df['@creationDate'], format=date_format)
df['@startDate'] = pd.to_datetime(df['@startDate'], format=date_format)
df['@endDate'] = pd.to_datetime(df['@endDate'], format=date_format)
# Try to convert values to numeric (skipped if some records hold text values)
try:
    df['@value'] = pd.to_numeric(df['@value'])
except (ValueError, TypeError):
    pass
# sort by creation date
df = df.sort_values('@creationDate')
pd.set_option('display.max_columns', None) #display all columns
# print the first 3 rows
df.head(3)
## @type @sourceName \
## 135322 HKQuantityTypeIdentifierDistanceWalkingRunning ThanhTran
## 71331 HKQuantityTypeIdentifierStepCount ThanhTran
## 71333 HKQuantityTypeIdentifierStepCount ThanhTran
##
## @sourceVersion @unit @creationDate \
## 135322 9.3.3 km 2016-07-25 08:29:46+09:00
## 71331 9.3.3 count 2016-07-25 08:29:46+09:00
## 71333 9.3.3 count 2016-07-25 11:28:48+09:00
##
## @startDate @endDate @value \
## 135322 2016-07-25 07:34:40+09:00 2016-07-25 07:34:42+09:00 0.0063
## 71331 2016-07-25 07:34:40+09:00 2016-07-25 07:34:42+09:00 14
## 71333 2016-07-25 11:22:48+09:00 2016-07-25 11:27:51+09:00 615
##
## MetadataEntry @device \
## 135322 NaN <<HKDevice: 0x3029ff930>, name:iPhone, manufac...
## 71331 NaN <<HKDevice: 0x302965a40>, name:iPhone, manufac...
## 71333 NaN <<HKDevice: 0x302965a40>, name:iPhone, manufac...
##
## HeartRateVariabilityMetadataList
## 135322 NaN
## 71331 NaN
## 71333 NaN
The input_path is the path to the
export.xml file you just obtained. The
input_read variable holds the parsed output, which is a nested dictionary. We then
extract the information under HealthData and
Record and save it to the records variable.
Finally, we use Pandas to convert it into a DataFrame.
After converting to a DataFrame, I convert the date columns to
pandas datetime.
The output shows that @type is the name of a variable.
For example, in the row with index 71331, we have
HKQuantityTypeIdentifierStepCount, which is the variable
“steps”. Its value sits in the @value column (here it is
14 steps), the unit (@unit) is simply “count”, and this number
of steps was written at @creationDate, covering the interval
from @startDate to @endDate. There are many
types of variables depending on the data that Apple Health records. We
can list them all with the following command
print(df['@type'].unique())
## ['HKQuantityTypeIdentifierDistanceWalkingRunning'
## 'HKQuantityTypeIdentifierStepCount' 'HKQuantityTypeIdentifierHeight'
## 'HKQuantityTypeIdentifierBodyMass'
## 'HKCategoryTypeIdentifierSleepAnalysis'
## 'HKQuantityTypeIdentifierFlightsClimbed'
## 'HKQuantityTypeIdentifierBasalEnergyBurned'
## 'HKQuantityTypeIdentifierActiveEnergyBurned'
## 'HKQuantityTypeIdentifierHeadphoneAudioExposure'
## 'HKDataTypeSleepDurationGoal'
## 'HKQuantityTypeIdentifierDietaryEnergyConsumed'
## 'HKQuantityTypeIdentifierDietaryCarbohydrates'
## 'HKQuantityTypeIdentifierDietaryFiber'
## 'HKQuantityTypeIdentifierDietaryFatMonounsaturated'
## 'HKQuantityTypeIdentifierDietaryProtein'
## 'HKQuantityTypeIdentifierDietarySodium'
## 'HKQuantityTypeIdentifierDietaryVitaminC'
## 'HKQuantityTypeIdentifierDietarySugar'
## 'HKQuantityTypeIdentifierDietaryCalcium'
## 'HKQuantityTypeIdentifierDietaryFatSaturated'
## 'HKQuantityTypeIdentifierDietaryIron'
## 'HKQuantityTypeIdentifierDietaryFatPolyunsaturated'
## 'HKQuantityTypeIdentifierDietaryFatTotal'
## 'HKQuantityTypeIdentifierDietaryVitaminA'
## 'HKQuantityTypeIdentifierDietaryCholesterol'
## 'HKQuantityTypeIdentifierDistanceCycling'
## 'HKQuantityTypeIdentifierWalkingStepLength'
## 'HKQuantityTypeIdentifierWalkingSpeed'
## 'HKQuantityTypeIdentifierWalkingAsymmetryPercentage'
## 'HKQuantityTypeIdentifierWalkingDoubleSupportPercentage'
## 'HKQuantityTypeIdentifierAppleWalkingSteadiness'
## 'HKQuantityTypeIdentifierHeartRate'
## 'HKQuantityTypeIdentifierOxygenSaturation'
## 'HKQuantityTypeIdentifierHeartRateVariabilitySDNN'
## 'HKQuantityTypeIdentifierEnvironmentalAudioExposure'
## 'HKCategoryTypeIdentifierAppleStandHour'
## 'HKQuantityTypeIdentifierAppleStandTime'
## 'HKCategoryTypeIdentifierHandwashingEvent'
## 'HKQuantityTypeIdentifierStairDescentSpeed'
## 'HKQuantityTypeIdentifierAppleExerciseTime'
## 'HKQuantityTypeIdentifierStairAscentSpeed'
## 'HKQuantityTypeIdentifierRespiratoryRate'
## 'HKQuantityTypeIdentifierDietaryNiacin'
## 'HKQuantityTypeIdentifierDietaryVitaminK'
## 'HKQuantityTypeIdentifierDietaryVitaminE'
## 'HKQuantityTypeIdentifierDietarySelenium'
## 'HKQuantityTypeIdentifierDietaryCopper'
## 'HKQuantityTypeIdentifierDietaryZinc'
## 'HKQuantityTypeIdentifierDietaryFolate'
## 'HKQuantityTypeIdentifierDietaryPhosphorus'
## 'HKQuantityTypeIdentifierDietaryThiamin'
## 'HKQuantityTypeIdentifierDietaryMagnesium'
## 'HKQuantityTypeIdentifierDietaryVitaminB6'
## 'HKQuantityTypeIdentifierDietaryPotassium'
## 'HKQuantityTypeIdentifierDietaryVitaminB12'
## 'HKQuantityTypeIdentifierDietaryVitaminD'
## 'HKCategoryTypeIdentifierMindfulSession'
## 'HKQuantityTypeIdentifierWalkingHeartRateAverage'
## 'HKQuantityTypeIdentifierRestingHeartRate'
## 'HKQuantityTypeIdentifierDietaryRiboflavin'
## 'HKQuantityTypeIdentifierDietaryManganese'
## 'HKQuantityTypeIdentifierDietaryPantothenicAcid'
## 'HKQuantityTypeIdentifierSixMinuteWalkTestDistance'
## 'HKCategoryTypeIdentifierAudioExposureEvent']
Note that each row of the data is a record of one uninterrupted session. Thus, one day can contain many records of the same variable.
Once you’ve created a DataFrame that contains all the Health data,
it’s pretty easy to filter out what you need. First, to make it
easier, I’m going to create a list containing the variables from the
@type column above. When we need to retrieve any data, we can
refer to the i-th element of that list directly.
variables = df['@type'].unique()
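As a quick illustration (the index 1 below is an assumption; the ordering in your own export may differ, so check the printed list first), we can pick a type by position instead of typing out the full identifier:
print(variables[1])
sample_data = df[df['@type'] == variables[1]]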
For example, now we’re going to take data on “step counts.” The step
counts are stored in the variable
HKQuantityTypeIdentifierStepCount.
data = df[df['@type'] == 'HKQuantityTypeIdentifierStepCount'].copy()
# Make sure the values are numeric before summing
data['@value'] = pd.to_numeric(data['@value'])
# Take the sum of the values by start date
data_bystart = data.groupby('@startDate')['@value'].sum().reset_index()
# Set the start date as the datetime index
data_bystart = data_bystart.set_index('@startDate')
print(data_bystart)
## @value
## @startDate
## 2016-07-25 07:34:40+09:00 14
## 2016-07-25 10:50:29+09:00 26
## 2016-07-25 11:22:48+09:00 615
## 2016-07-25 11:27:51+09:00 749
## 2016-07-25 11:32:53+09:00 851
## ... ...
## 2020-04-29 12:11:41+09:00 9
## 2020-04-29 14:08:40+09:00 55
## 2020-04-30 12:00:35+09:00 39
## 2020-04-30 12:11:21+09:00 8
## 2020-04-30 12:23:38+09:00 28
##
## [36954 rows x 1 columns]
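Note that the table above still has one row per recording session rather than per day. If you want daily totals, here is a minimal sketch (assuming data_bystart above ended up with a proper datetime index):
# Resample the per-session sums into daily step totals
daily_steps = data_bystart['@value'].resample('D').sum()
print(daily_steps.tail())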
Or take walking distance. Next, I will calculate the total distance traveled each month, along with the average distance per record.
# Filter for 'HKQuantityTypeIdentifierDistanceWalkingRunning'
dist = df[df['@type'] == 'HKQuantityTypeIdentifierDistanceWalkingRunning'].copy()
# Convert the value column to numeric
dist['@value'] = pd.to_numeric(dist['@value'])
# Group by start date and sum the values
dist_data = dist.groupby('@startDate')['@value'].sum().reset_index()
# Set the start date as the datetime index
dist_data = dist_data.set_index('@startDate')
dist_data.index = pd.to_datetime(dist_data.index)
# Resample by month and calculate total and average distance
dist_data_bymonth = dist_data['@value'].resample('ME').sum()
df3 = pd.DataFrame(dist_data_bymonth)
df3['@avg_walk'] = dist_data['@value'].resample('ME').mean()
# Reset index and rename columns
df3 = df3.reset_index()
df4 = df3.rename(columns={'@startDate': 'date', '@value': 'total_dist', '@avg_walk': 'avg_dist'})
# Print the resulting DataFrame
print(df4)
## date total_dist avg_dist
## 0 2016-07-31 00:00:00+09:00 54.688923 0.168793
## 1 2016-08-31 00:00:00+09:00 230.031815 0.152238
## 2 2016-09-30 00:00:00+09:00 105.741351 0.091235
## 3 2016-10-31 00:00:00+09:00 80.733125 0.091224
## 4 2016-11-30 00:00:00+09:00 88.068147 0.116492
## 5 2016-12-31 00:00:00+09:00 87.970852 0.101349
## 6 2017-01-31 00:00:00+09:00 85.965145 0.09791
## 7 2017-02-28 00:00:00+09:00 106.775491 0.120378
## 8 2017-03-31 00:00:00+09:00 108.721833 0.135226
## 9 2017-04-30 00:00:00+09:00 115.359123 0.13882
## 10 2017-05-31 00:00:00+09:00 114.385991 0.105038
## 11 2017-06-30 00:00:00+09:00 58.150204 0.106308
## 12 2017-07-31 00:00:00+09:00 88.400538 0.093348
## 13 2017-08-31 00:00:00+09:00 77.355347 0.10496
## 14 2017-09-30 00:00:00+09:00 75.581428 0.102553
## 15 2017-10-31 00:00:00+09:00 70.448534 0.10499
## 16 2017-11-30 00:00:00+09:00 42.742395 0.056092
## 17 2017-12-31 00:00:00+09:00 53.34522 0.063506
## 18 2018-01-31 00:00:00+09:00 57.114284 0.07595
## 19 2018-02-28 00:00:00+09:00 70.975541 0.074869
## 20 2018-03-31 00:00:00+09:00 89.566851 0.084978
## 21 2018-04-30 00:00:00+09:00 87.107801 0.092275
## 22 2018-05-31 00:00:00+09:00 101.35164 0.104057
## 23 2018-06-30 00:00:00+09:00 104.15383 0.088491
## 24 2018-07-31 00:00:00+09:00 78.811111 0.091748
## 25 2018-08-31 00:00:00+09:00 76.696434 0.150091
## 26 2018-09-30 00:00:00+09:00 58.234732 0.325334
## 27 2018-10-31 00:00:00+09:00 95.680963 0.349201
## 28 2018-11-30 00:00:00+09:00 111.211247 0.138841
## 29 2018-12-31 00:00:00+09:00 85.686536 0.113643
## 30 2019-01-31 00:00:00+09:00 78.670124 0.115352
## 31 2019-02-28 00:00:00+09:00 62.129361 0.087506
## 32 2019-03-31 00:00:00+09:00 80.4408 0.127481
## 33 2019-04-30 00:00:00+09:00 88.949413 0.118758
## 34 2019-05-31 00:00:00+09:00 72.255543 0.108329
## 35 2019-06-30 00:00:00+09:00 81.950318 0.111497
## 36 2019-07-31 00:00:00+09:00 81.673372 0.11391
## 37 2019-08-31 00:00:00+09:00 97.603354 0.097799
## 38 2019-09-30 00:00:00+09:00 139.312885 0.142447
## 39 2019-10-31 00:00:00+09:00 155.117759 0.23291
## 40 2019-11-30 00:00:00+09:00 134.078732 0.180213
## 41 2019-12-31 00:00:00+09:00 114.501228 0.12982
## 42 2020-01-31 00:00:00+09:00 119.648032 0.140103
## 43 2020-02-29 00:00:00+09:00 92.990764 0.123823
## 44 2020-03-31 00:00:00+09:00 96.42829 0.130662
## 45 2020-04-30 00:00:00+09:00 36.197117 0.134562
Now it is straightforward to visualize the data. For example, we can plot the walking distance for each month.
fig, ax1 = plt.subplots()
# x-axis
x = df4['date']
# y-axis
y1 = df4['total_dist']
y2 = df4['avg_dist']
plt.xticks(rotation=60)
ax2 = ax1.twinx()
ax1.bar(x, y1, color='orange', width = 20)
ax2.plot(x, y2, color='blue', marker='o')
ax1.set_xlabel('Time')
ax1.set_ylabel('Total distance (km)', color='orange')
ax2.set_ylabel('Average per record (km)', color='blue')
plt.title('Walking distance record')
fig.tight_layout()
plt.show()
Sleep data, however, can be tricky because it spans
multiple days (e.g., sleep that starts at night and ends the next
morning). Grouping by @creationDate or
@startDate might not be the best approach, since it could
split sleep periods that cross midnight. Hence, we group by
@endDate instead.
First, we extract sleep data
sleep_counts = df[df['@type'] == 'HKCategoryTypeIdentifierSleepAnalysis']
sleep_counts['@value'].unique()
## array(['HKCategoryValueSleepAnalysisInBed',
## 'HKCategoryValueSleepAnalysisAsleepUnspecified'], dtype=object)
There are several sleep-related categories. Since I don’t have an Apple Watch, I can only plot the time I spent in bed.
# filter the in-bed data (take a copy so we can add columns without warnings)
inbed = sleep_counts[sleep_counts['@value'] == 'HKCategoryValueSleepAnalysisInBed'].copy()
Now we can calculate the total time spent in bed each day.
# calculate the duration of each in-bed record
inbed['@duration'] = inbed['@endDate'] - inbed['@startDate']
# convert the duration to hours
inbed['@duration'] = inbed['@duration'].dt.total_seconds() / 3600
# Extract the date part from @endDate for grouping
inbed['@endDate_date'] = inbed['@endDate'].dt.date
# Group by the date part of the end date and sum the durations
grouped_sleep = inbed.groupby('@endDate_date')['@duration'].sum().reset_index()
# Rename columns for clarity
grouped_sleep.rename(columns={'@endDate_date': 'date', '@duration': 'total_sleep_hours'}, inplace=True)
# Display the grouped DataFrame
print(grouped_sleep)
## date total_sleep_hours
## 0 2016-09-21 7.290000
## 1 2016-09-23 6.181667
## 2 2016-09-26 5.608611
## 3 2016-09-27 3.385833
## 4 2016-09-28 6.926389
## ... ... ...
## 1112 2020-04-25 8.500000
## 1113 2020-04-26 1.297222
## 1114 2020-04-27 2.073889
## 1115 2020-04-28 4.000556
## 1116 2020-04-29 6.550833
##
## [1117 rows x 2 columns]
Finally, we can plot the data.
fig, ax1 = plt.subplots()
x = grouped_sleep['date']
y1 = grouped_sleep['total_sleep_hours']
ax1.bar(x, y1, color='orange', width=1)
ax1.set_xlabel('Time')
ax1.set_ylabel('Time in bed (hours)')
plt.title('Sleep record')
fig.tight_layout()
plt.show()
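As a possible follow-up (a sketch, assuming the grouped_sleep DataFrame above), we could smooth this daily record by looking at the average time in bed per month:
# Average time in bed per month ('ME' is the month-end frequency used earlier)
monthly_sleep = grouped_sleep.copy()
monthly_sleep['date'] = pd.to_datetime(monthly_sleep['date'])
monthly_sleep = monthly_sleep.set_index('date')['total_sleep_hours'].resample('ME').mean()
print(monthly_sleep.head())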