This post is an English translation of my original blog post on nipponkiyoshi.com.
We will learn how to extract Apple Health data (steps, distance traveled, sleep time, and so on). Apple Health data is saved in .xml format, so filtering it is a bit more involved than with the usual .csv or .xls files. However, there is nothing we can’t handle. Since I do not use an Apple Watch, the data in this demonstration may be more limited than what you have.
To get the original file containing all your information, go to
Health -> Profile -> Export Health Data
After obtaining the .zip file, extract it and you will get a folder
containing 2 files, export.xml and
export_cda.xml. The export_cda.xml file follows the
Clinical Document Architecture (CDA) standard common in the
US; CDA is often used in clinics, and this file contains data
about your medical history. Here, we are interested in the export.xml file,
which contains data about your distance traveled, activity, and sleep.
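If you prefer to do the extraction in Python rather than in Finder, here is a minimal sketch (the archive name export.zip is an assumption; use whatever name your phone exported):
import zipfile
# Extract the archive; it should produce the 'apple_health_export' folder
with zipfile.ZipFile('export.zip') as zf:
    zf.extractall('.')
    print(zf.namelist())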
To work with the data, you can use popular tools
such as Visual Studio Code or Jupyter Notebook. Using Jupyter is more
comfortable because you don’t have to re-run everything from start to finish after
every code change. I’m not an expert in XML, but after a bit
of research on Wikipedia, I can roughly visualize its structure as
follows.
Product
├───Name
└───Details
    ├───Description
    └───Price
<Product>
<Name>Widget</Name>
<Details>
<Description>
This Widget is the highest quality widget.
</Description>
<Price>5.50</Price>
</Details>
</Product>
At first glance, it looks just like a Python dictionary with
key:value pairs. However, a value here can itself contain more
dictionaries (nesting). We’ll use a few libraries to parse the
data from this XML file into a format more suitable for processing in
Pandas.
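To get a feel for this, here is a minimal sketch using a slightly extended version of the toy Product XML above (an id attribute is added purely for illustration). xmltodict, available from PyPI, turns the XML into nested dictionaries, and XML attributes become keys prefixed with “@”, which is why the Apple Health columns below end up named @type, @value, and so on.
import xmltodict

sample = """
<Product id="42">
  <Name>Widget</Name>
  <Details>
    <Description>This Widget is the highest quality widget.</Description>
    <Price>5.50</Price>
  </Details>
</Product>
"""
parsed = xmltodict.parse(sample)
print(parsed['Product']['Name'])              # Widget
print(parsed['Product']['Details']['Price'])  # 5.50
print(parsed['Product']['@id'])               # 42, attributes get an '@' prefix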
We are going to use the following libraries
import numpy as np
import pandas as pd
import xmltodict  # parses XML into a Python dict
import datetime
import matplotlib.pyplot as plt
In this case, the xmltodict library will help us split
and transform the data mentioned above into proper Pandas format.
# parse the XML export into a pandas DataFrame
input_path = 'apple_health_export/export.xml'
with open(input_path, 'r') as xml_file:
    input_read = xmltodict.parse(xml_file.read())
records = input_read['HealthData']['Record']
df = pd.DataFrame(records)
# Convert date columns to datetime
date_format = '%Y-%m-%d %H:%M:%S %z'
df['@creationDate'] = pd.to_datetime(df['@creationDate'], format=date_format)
df['@startDate'] = pd.to_datetime(df['@startDate'], format=date_format)
df['@endDate'] = pd.to_datetime(df['@endDate'], format=date_format)
# Try to convert values to numeric (skipped if some records hold text values)
try:
    df['@value'] = pd.to_numeric(df['@value'])
except (ValueError, TypeError):
    pass
# sort by creation date
df = df.sort_values('@creationDate')
pd.set_option('display.max_columns', None) #display all columns
# print the first 3 rows
df.head(3)
## @type @sourceName \
## 135322 HKQuantityTypeIdentifierDistanceWalkingRunning ThanhTran
## 71331 HKQuantityTypeIdentifierStepCount ThanhTran
## 71333 HKQuantityTypeIdentifierStepCount ThanhTran
##
## @sourceVersion @unit @creationDate \
## 135322 9.3.3 km 2016-07-25 08:29:46+09:00
## 71331 9.3.3 count 2016-07-25 08:29:46+09:00
## 71333 9.3.3 count 2016-07-25 11:28:48+09:00
##
## @startDate @endDate @value \
## 135322 2016-07-25 07:34:40+09:00 2016-07-25 07:34:42+09:00 0.0063
## 71331 2016-07-25 07:34:40+09:00 2016-07-25 07:34:42+09:00 14
## 71333 2016-07-25 11:22:48+09:00 2016-07-25 11:27:51+09:00 615
##
## MetadataEntry @device \
## 135322 NaN <<HKDevice: 0x3029ff930>, name:iPhone, manufac...
## 71331 NaN <<HKDevice: 0x302965a40>, name:iPhone, manufac...
## 71333 NaN <<HKDevice: 0x302965a40>, name:iPhone, manufac...
##
## HeartRateVariabilityMetadataList
## 135322 NaN
## 71331 NaN
## 71333 NaN
The input_path is the path to the
export.xml file you just obtained. The
input_read variable holds the parsed output, which is a nested dictionary. We then
extract the information under HealthData and
Record and save it to the records variable.
Finally, we use Pandas to convert it into a DataFrame.
After converting to a DataFrame, I convert the date columns to
pandas datetime.
The output shows that @type is the name of a variable.
For example, in the row with index 71331, we have
HKQuantityTypeIdentifierStepCount, which is the variable
“steps”. Its value sits in the @value column (here it is
14 steps), the unit (@unit) is simply “count”, and this number
of steps was written at @creationDate, covering the interval
from @startDate to @endDate. There are many
types of variables depending on the data that Apple Health records. We
can list them all with the following command
print(df['@type'].unique())
## ['HKQuantityTypeIdentifierDistanceWalkingRunning'
## 'HKQuantityTypeIdentifierStepCount' 'HKQuantityTypeIdentifierHeight'
## 'HKQuantityTypeIdentifierBodyMass'
## 'HKCategoryTypeIdentifierSleepAnalysis'
## 'HKQuantityTypeIdentifierFlightsClimbed'
## 'HKQuantityTypeIdentifierBasalEnergyBurned'
## 'HKQuantityTypeIdentifierActiveEnergyBurned'
## 'HKQuantityTypeIdentifierHeadphoneAudioExposure'
## 'HKDataTypeSleepDurationGoal'
## 'HKQuantityTypeIdentifierDietaryEnergyConsumed'
## 'HKQuantityTypeIdentifierDietaryCarbohydrates'
## 'HKQuantityTypeIdentifierDietaryFiber'
## 'HKQuantityTypeIdentifierDietaryFatMonounsaturated'
## 'HKQuantityTypeIdentifierDietaryProtein'
## 'HKQuantityTypeIdentifierDietarySodium'
## 'HKQuantityTypeIdentifierDietaryVitaminC'
## 'HKQuantityTypeIdentifierDietarySugar'
## 'HKQuantityTypeIdentifierDietaryCalcium'
## 'HKQuantityTypeIdentifierDietaryFatSaturated'
## 'HKQuantityTypeIdentifierDietaryIron'
## 'HKQuantityTypeIdentifierDietaryFatPolyunsaturated'
## 'HKQuantityTypeIdentifierDietaryFatTotal'
## 'HKQuantityTypeIdentifierDietaryVitaminA'
## 'HKQuantityTypeIdentifierDietaryCholesterol'
## 'HKQuantityTypeIdentifierDistanceCycling'
## 'HKQuantityTypeIdentifierWalkingStepLength'
## 'HKQuantityTypeIdentifierWalkingSpeed'
## 'HKQuantityTypeIdentifierWalkingAsymmetryPercentage'
## 'HKQuantityTypeIdentifierWalkingDoubleSupportPercentage'
## 'HKQuantityTypeIdentifierAppleWalkingSteadiness'
## 'HKQuantityTypeIdentifierHeartRate'
## 'HKQuantityTypeIdentifierOxygenSaturation'
## 'HKQuantityTypeIdentifierHeartRateVariabilitySDNN'
## 'HKQuantityTypeIdentifierEnvironmentalAudioExposure'
## 'HKCategoryTypeIdentifierAppleStandHour'
## 'HKQuantityTypeIdentifierAppleStandTime'
## 'HKCategoryTypeIdentifierHandwashingEvent'
## 'HKQuantityTypeIdentifierStairDescentSpeed'
## 'HKQuantityTypeIdentifierAppleExerciseTime'
## 'HKQuantityTypeIdentifierStairAscentSpeed'
## 'HKQuantityTypeIdentifierRespiratoryRate'
## 'HKQuantityTypeIdentifierDietaryNiacin'
## 'HKQuantityTypeIdentifierDietaryVitaminK'
## 'HKQuantityTypeIdentifierDietaryVitaminE'
## 'HKQuantityTypeIdentifierDietarySelenium'
## 'HKQuantityTypeIdentifierDietaryCopper'
## 'HKQuantityTypeIdentifierDietaryZinc'
## 'HKQuantityTypeIdentifierDietaryFolate'
## 'HKQuantityTypeIdentifierDietaryPhosphorus'
## 'HKQuantityTypeIdentifierDietaryThiamin'
## 'HKQuantityTypeIdentifierDietaryMagnesium'
## 'HKQuantityTypeIdentifierDietaryVitaminB6'
## 'HKQuantityTypeIdentifierDietaryPotassium'
## 'HKQuantityTypeIdentifierDietaryVitaminB12'
## 'HKQuantityTypeIdentifierDietaryVitaminD'
## 'HKCategoryTypeIdentifierMindfulSession'
## 'HKQuantityTypeIdentifierWalkingHeartRateAverage'
## 'HKQuantityTypeIdentifierRestingHeartRate'
## 'HKQuantityTypeIdentifierDietaryRiboflavin'
## 'HKQuantityTypeIdentifierDietaryManganese'
## 'HKQuantityTypeIdentifierDietaryPantothenicAcid'
## 'HKQuantityTypeIdentifierSixMinuteWalkTestDistance'
## 'HKCategoryTypeIdentifierAudioExposureEvent']
Note that each row of the data is a record of one uninterrupted session. Thus, one day can contain many records of the same variable.
Once you’ve created a DataFrame that contains all the Health data,
it’s pretty easy to filter out what you need. First, to make it
easier, I’m going to create a list containing the variables from the
@type column above. When we need to retrieve any data, we can
refer to the i-th element of that list directly.
variables = df['@type'].unique()
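As a quick illustration (the index 1 below is an assumption; the ordering in your own export may differ, so check the printed list first), we can pick a type by position instead of typing out the full identifier:
print(variables[1])
sample_data = df[df['@type'] == variables[1]]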
For example, now we’re going to take data on “step counts.” The step
counts are stored in the variable
HKQuantityTypeIdentifierStepCount.
data = df[df['@type'] == 'HKQuantityTypeIdentifierStepCount'].copy()
# Make sure the values are numeric before summing
data['@value'] = pd.to_numeric(data['@value'])
# Take the sum of the values by start date
data_bystart = data.groupby('@startDate')['@value'].sum().reset_index()
# Set the start date as the datetime index
data_bystart = data_bystart.set_index('@startDate')
print(data_bystart)
## @value
## @startDate
## 2016-07-25 07:34:40+09:00 14
## 2016-07-25 10:50:29+09:00 26
## 2016-07-25 11:22:48+09:00 615
## 2016-07-25 11:27:51+09:00 749
## 2016-07-25 11:32:53+09:00 851
## ... ...
## 2020-04-29 12:11:41+09:00 9
## 2020-04-29 14:08:40+09:00 55
## 2020-04-30 12:00:35+09:00 39
## 2020-04-30 12:11:21+09:00 8
## 2020-04-30 12:23:38+09:00 28
##
## [36954 rows x 1 columns]
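Note that the table above still has one row per recording session rather than per day. If you want daily totals, here is a minimal sketch (assuming data_bystart above ended up with a proper datetime index):
# Resample the per-session sums into daily step totals
daily_steps = data_bystart['@value'].resample('D').sum()
print(daily_steps.tail())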
Or take walking distance. Next, I will calculate the total distance traveled each month, along with the average distance per record.
# Filter for 'HKQuantityTypeIdentifierDistanceWalkingRunning'
dist = df[df['@type'] == 'HKQuantityTypeIdentifierDistanceWalkingRunning'].copy()
# Convert the value column to numeric
dist['@value'] = pd.to_numeric(dist['@value'])
# Group by start date and sum the values
dist_data = dist.groupby('@startDate')['@value'].sum().reset_index()
# Set the start date as the datetime index
dist_data = dist_data.set_index('@startDate')
dist_data.index = pd.to_datetime(dist_data.index)
# Resample by month and calculate total and average distance
dist_data_bymonth = dist_data['@value'].resample('ME').sum()
df3 = pd.DataFrame(dist_data_bymonth)
df3['@avg_walk'] = dist_data['@value'].resample('ME').mean()
# Reset index and rename columns
df3 = df3.reset_index()
df4 = df3.rename(columns={'@startDate': 'date', '@value': 'total_dist', '@avg_walk': 'avg_dist'})
# Print the resulting DataFrame
print(df4)
## date total_dist avg_dist
## 0 2016-07-31 00:00:00+09:00 54.688923 0.168793
## 1 2016-08-31 00:00:00+09:00 230.031815 0.152238
## 2 2016-09-30 00:00:00+09:00 105.741351 0.091235
## 3 2016-10-31 00:00:00+09:00 80.733125 0.091224
## 4 2016-11-30 00:00:00+09:00 88.068147 0.116492
## 5 2016-12-31 00:00:00+09:00 87.970852 0.101349
## 6 2017-01-31 00:00:00+09:00 85.965145 0.09791
## 7 2017-02-28 00:00:00+09:00 106.775491 0.120378
## 8 2017-03-31 00:00:00+09:00 108.721833 0.135226
## 9 2017-04-30 00:00:00+09:00 115.359123 0.13882
## 10 2017-05-31 00:00:00+09:00 114.385991 0.105038
## 11 2017-06-30 00:00:00+09:00 58.150204 0.106308
## 12 2017-07-31 00:00:00+09:00 88.400538 0.093348
## 13 2017-08-31 00:00:00+09:00 77.355347 0.10496
## 14 2017-09-30 00:00:00+09:00 75.581428 0.102553
## 15 2017-10-31 00:00:00+09:00 70.448534 0.10499
## 16 2017-11-30 00:00:00+09:00 42.742395 0.056092
## 17 2017-12-31 00:00:00+09:00 53.34522 0.063506
## 18 2018-01-31 00:00:00+09:00 57.114284 0.07595
## 19 2018-02-28 00:00:00+09:00 70.975541 0.074869
## 20 2018-03-31 00:00:00+09:00 89.566851 0.084978
## 21 2018-04-30 00:00:00+09:00 87.107801 0.092275
## 22 2018-05-31 00:00:00+09:00 101.35164 0.104057
## 23 2018-06-30 00:00:00+09:00 104.15383 0.088491
## 24 2018-07-31 00:00:00+09:00 78.811111 0.091748
## 25 2018-08-31 00:00:00+09:00 76.696434 0.150091
## 26 2018-09-30 00:00:00+09:00 58.234732 0.325334
## 27 2018-10-31 00:00:00+09:00 95.680963 0.349201
## 28 2018-11-30 00:00:00+09:00 111.211247 0.138841
## 29 2018-12-31 00:00:00+09:00 85.686536 0.113643
## 30 2019-01-31 00:00:00+09:00 78.670124 0.115352
## 31 2019-02-28 00:00:00+09:00 62.129361 0.087506
## 32 2019-03-31 00:00:00+09:00 80.4408 0.127481
## 33 2019-04-30 00:00:00+09:00 88.949413 0.118758
## 34 2019-05-31 00:00:00+09:00 72.255543 0.108329
## 35 2019-06-30 00:00:00+09:00 81.950318 0.111497
## 36 2019-07-31 00:00:00+09:00 81.673372 0.11391
## 37 2019-08-31 00:00:00+09:00 97.603354 0.097799
## 38 2019-09-30 00:00:00+09:00 139.312885 0.142447
## 39 2019-10-31 00:00:00+09:00 155.117759 0.23291
## 40 2019-11-30 00:00:00+09:00 134.078732 0.180213
## 41 2019-12-31 00:00:00+09:00 114.501228 0.12982
## 42 2020-01-31 00:00:00+09:00 119.648032 0.140103
## 43 2020-02-29 00:00:00+09:00 92.990764 0.123823
## 44 2020-03-31 00:00:00+09:00 96.42829 0.130662
## 45 2020-04-30 00:00:00+09:00 36.197117 0.134562
Now it is straightforward to visualize the data. For example, we can plot the walking distance for each month.
fig, ax1 = plt.subplots()
# x-axis
x = df4['date']
# y-axis
y1 = df4['total_dist']
y2 = df4['avg_dist']
plt.xticks(rotation=60)
ax2 = ax1.twinx()
ax1.bar(x, y1, color='orange', width = 20)
ax2.plot(x, y2, color='blue', marker='o')
ax1.set_xlabel('Time')
ax1.set_ylabel('Total distance (km)', color='orange')
ax2.set_ylabel('Average per record (km)', color='blue')
plt.title('Walking distance record')
fig.tight_layout()
plt.show()
Sleep data, however, can be tricky because it spans
multiple days (e.g., sleep that starts at night and ends the next
morning). Grouping by @creationDate or
@startDate might not be the best approach, since it could
split sleep periods that cross midnight. Hence, we group by
@endDate instead.
First, we extract sleep data
sleep_counts = df[df['@type'] == 'HKCategoryTypeIdentifierSleepAnalysis']
sleep_counts['@value'].unique()
## array(['HKCategoryValueSleepAnalysisInBed',
## 'HKCategoryValueSleepAnalysisAsleepUnspecified'], dtype=object)
There are several sleep-related categories. Since I don’t have an Apple Watch, I can only plot the time I spent in bed.
# filter the in-bed data (take a copy so we can add columns without warnings)
inbed = sleep_counts[sleep_counts['@value'] == 'HKCategoryValueSleepAnalysisInBed'].copy()
Now we can calculate the total time spent in bed each day.
# calculate the duration of each in-bed record
inbed['@duration'] = inbed['@endDate'] - inbed['@startDate']
# convert the duration to hours
inbed['@duration'] = inbed['@duration'].dt.total_seconds() / 3600
# Extract the date part from @endDate for grouping
inbed['@endDate_date'] = inbed['@endDate'].dt.date
# Group by the date part of the end date and sum the durations
grouped_sleep = inbed.groupby('@endDate_date')['@duration'].sum().reset_index()
# Rename columns for clarity
grouped_sleep.rename(columns={'@endDate_date': 'date', '@duration': 'total_sleep_hours'}, inplace=True)
# Display the grouped DataFrame
print(grouped_sleep)
## date total_sleep_hours
## 0 2016-09-21 7.290000
## 1 2016-09-23 6.181667
## 2 2016-09-26 5.608611
## 3 2016-09-27 3.385833
## 4 2016-09-28 6.926389
## ... ... ...
## 1112 2020-04-25 8.500000
## 1113 2020-04-26 1.297222
## 1114 2020-04-27 2.073889
## 1115 2020-04-28 4.000556
## 1116 2020-04-29 6.550833
##
## [1117 rows x 2 columns]
Finally, we can plot the data.
fig, ax1 = plt.subplots()
x = grouped_sleep['date']
y1 = grouped_sleep['total_sleep_hours']
ax1.bar(x, y1, color='orange', width=1)
ax1.set_xlabel('Time')
ax1.set_ylabel('Time in bed (hours)')
plt.title('Sleep record')
fig.tight_layout()
plt.show()
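As a possible follow-up (a sketch, assuming the grouped_sleep DataFrame above), we could smooth this daily record by looking at the average time in bed per month:
# Average time in bed per month ('ME' is the month-end frequency used earlier)
monthly_sleep = grouped_sleep.copy()
monthly_sleep['date'] = pd.to_datetime(monthly_sleep['date'])
monthly_sleep = monthly_sleep.set_index('date')['total_sleep_hours'].resample('ME').mean()
print(monthly_sleep.head())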