Final Project for Unsupervised Learning
Introduction
Literature Review
The proliferation of wearable devices, such as the Apple Watch, has revolutionized the collection of health-related data, offering continuous, high-resolution monitoring of physiological signals including heart rate, activity levels, and sleep patterns (Piwek et al., 2016). This influx of granular data has created new opportunities for data-driven approaches in healthcare, particularly through the application of data science techniques. Unsupervised learning methods, such as Principal Component Analysis (PCA), clustering algorithms, and association rule mining, have proven particularly valuable in extracting latent patterns and associations from large-scale, unlabeled datasets (Jiang et al., 2020; Esteva et al., 2019).
PCA serves as a foundational tool for dimensionality reduction, enabling researchers to summarize high-dimensional health data while preserving variance and facilitating subsequent analyses (Wold et al., 1987). Clustering techniques, such as HDBSCAN, allow the discovery of inherent groupings within heterogeneous datasets, revealing subpopulations with distinct physiological or behavioral profiles without prior labeling (Campello et al., 2013). Furthermore, association rule mining offers a complementary approach for identifying frequently co-occurring health events or behaviors, providing interpretable insights that can inform preventive strategies and personalized interventions (Agrawal et al., 1993).
The integration of these methodologies in wearable health data analysis holds significant societal relevance. By uncovering hidden patterns and behavioral phenotypes, these approaches can inform early detection of health risks, promote personalized lifestyle recommendations, and support the transition toward precision medicine. Such data-driven insights have the potential to improve public health outcomes while advancing the scientific understanding of human physiology in real-world contexts (Patel et al., 2015). Recent reviews highlight the growing importance of machine learning and data mining in wearable health monitoring, emphasizing trends, challenges, and opportunities in unsupervised learning applications (From data to diagnosis, 2023; Data Mining for Wearable Sensors, 2013; Scoping Review of ML in Health Economics, 2022).
Research Project
This project aims to explore health-related data collected from Apple Watch devices using a combination of unsupervised learning techniques, including PCA for dimensionality reduction, HDBSCAN for clustering, and association rule mining for pattern discovery. The central objective is to identify underlying structures, groupings, and co-occurring behavioral or physiological events within the dataset, thereby generating insights relevant to both individual health monitoring and broader public health initiatives.
Specifically, the workflow of the project includes:
1. Data Preprocessing: Cleaning and normalizing the collected physiological and activity data.
2. Dimensionality Reduction: Applying PCA to reduce the feature space while preserving meaningful variability.
3. Clustering: Utilizing HDBSCAN to identify natural groupings among users or time periods based on physiological metrics.
4. Association Rules: Applying association rule mining to detect frequent co-occurrences and potential dependencies among health indicators.
Research Questions and Hypotheses
The study is guided by the following research questions:
Q1: Can PCA effectively capture the primary modes of variability in Apple Watch-derived health data, and which physiological metrics contribute most to these components?
Q2: Are there distinct clusters of health patterns identifiable through HDBSCAN?
Q3: What frequent associations exist among physiological and activity variables?
Corresponding hypotheses include:
H1: PCA effectively captures the dominant modes of variability in Apple Watch–derived physiological and activity data, with the first few principal components accounting for a substantial proportion of total variance.
H2: The application of HDBSCAN reveals distinct and statistically robust clusters representing heterogeneous health‑behavior profiles.
H3: Frequent pattern mining will identify recurrent associations among physiological and activity variables.
By addressing these questions, the project seeks to demonstrate the utility of unsupervised learning in wearable health data analysis, highlighting its potential contribution to preventive healthcare and personalized medicine.
Data and Methodology
Data
This study uses physiological and physical activity data recorded by the Apple Watch. The dataset was obtained from the Harvard Dataverse repository and originates from the scientific study “Replication Data for: Using machine learning methods to predict physical activity types with Apple Watch and Fitbit data using indirect calorimetry as the criterion” (https://doi.org/10.7910/DVN/ZS2Z2J). The study was conducted in 2019 in Canada. The data are owned by Daniel Fuller of Memorial University of Newfoundland.
Data cleaning included renaming selected columns and adding newly derived variables. Negative values in the Intensity variable were replaced with 0, and all relevant fields were converted to appropriate data types. Subsequently, the data were standardized.
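The two numeric cleaning steps described above can be illustrated with a minimal sketch. It assumes z-score standardization with the population standard deviation; the source does not state which variant was used, and the readings below are hypothetical values, not taken from the dataset.

```python
from statistics import mean, pstdev


def clip_negative(values):
    """Replace negative readings with 0, as described for the Intensity variable."""
    return [max(v, 0.0) for v in values]


def standardize(values):
    """Z-score a column: subtract the mean, divide by the standard deviation."""
    mu = mean(values)
    # Assumption: population SD. The sample SD (n - 1) is an equally common choice.
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("cannot standardize a constant column")
    return [(v - mu) / sigma for v in values]


# Hypothetical intensity readings for illustration only.
intensity = [-0.2, 0.1, 0.4, 0.9, -0.05]
cleaned = clip_negative(intensity)
z_scores = standardize(cleaned)
```

After this step each column has mean 0 and unit variance, so variables measured on different scales (for example steps and heart rate) contribute comparably to PCA.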
For the PCA and clustering procedures, only numerical variables were included; therefore, the non-numeric attributes (gender and the variable describing the type of physical activity) were excluded from the analytical sample.
Methodology
The prepared data were subjected to Principal Component Analysis (PCA). This is a widely used dimensionality reduction technique in data analysis that transforms a high-dimensional dataset into a lower-dimensional space while retaining as much of the original variance as possible (Wold et al., 1987). The main idea behind PCA is to identify new, uncorrelated variables called principal components, which are linear combinations of the original features and are ordered by the amount of variance they capture from the data (Jolliffe & Cadima, 2016).
PCA is particularly useful in health data analytics for wearable devices because physiological datasets often include numerous correlated metrics, such as heart rate, activity levels, and sleep duration. By reducing dimensionality, PCA simplifies data visualization, noise reduction, and subsequent analyses such as clustering or pattern discovery, without substantial loss of information (Ringnér, 2008).
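To make the idea of a principal component concrete, the sketch below finds the first component of a small standardized dataset by power iteration on its correlation matrix. This is an illustration under stated assumptions, not the procedure used in the study: the three columns (heart rate, steps, weight) hold hypothetical values, and the function names are invented for this example.

```python
import math


def correlation_matrix(rows):
    """Correlation matrix of observation rows (equals the covariance of z-scores)."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    sds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
           for j in range(p)]
    z = [[(r[j] - means[j]) / sds[j] for j in range(p)] for r in rows]
    return [[sum(row[i] * row[j] for row in z) / (n - 1) for j in range(p)]
            for i in range(p)]


def mat_vec(matrix, vector):
    """Matrix-vector product for plain nested lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def first_principal_component(corr, iterations=200):
    """Power iteration: largest eigenvalue and its unit-length eigenvector."""
    # Non-uniform start vector, to avoid starting orthogonal to the solution.
    v = [1.0 / (j + 1) for j in range(len(corr))]
    for _ in range(iterations):
        w = mat_vec(corr, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, mat_vec(corr, v)))
    return eigenvalue, v


# Hypothetical rows: (heart_rate, steps, weight).
rows = [(60, 10, 70), (70, 30, 82), (80, 45, 65), (90, 70, 90), (100, 90, 75)]
corr = correlation_matrix(rows)
eigenvalue, loadings = first_principal_component(corr)
# With standardized variables the total variance equals the number of variables.
explained_share = eigenvalue / len(corr)
```

The eigenvector holds the loadings of the first component, and the eigenvalue divided by the number of variables is the proportion of variance it explains. Library routines compute all components at once, typically through a singular value decomposition.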
Subsequently, the clustering procedure was conducted based on the results derived from the three most influential principal components. Prior to this, a scatter plot was generated to assess the underlying structure of the data.
Due to the irregular shapes of the point groupings observed in the scatter plot, the clustering procedure was carried out using the HDBSCAN method.
HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) is an advanced clustering algorithm that extends the widely used DBSCAN method by combining density-based clustering with hierarchical clustering techniques (Campello, Moulavi, & Sander, 2013).
Unlike DBSCAN, which requires a fixed density threshold (ε) to define clusters, HDBSCAN constructs a hierarchy of clusters based on varying density levels and then extracts the most stable clusters from this hierarchy. This allows it to:
- Detect clusters of variable density,
- Handle noise and outliers effectively,
- Avoid the need to pre-specify the number of clusters.
HDBSCAN works by computing the mutual reachability distance between points, building a minimum spanning tree (MST) of the data, and condensing the hierarchical clustering tree to select clusters that are most persistent across density levels (Campello et al., 2013).
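The first of these steps, the mutual reachability distance, can be sketched in a few lines. The example assumes that the core distance of a point is the distance to its k-th nearest neighbour, not counting the point itself; implementations differ on this detail, and the points and the value of k are hypothetical.

```python
import math


def core_distances(points, k):
    """Distance from each point to its k-th nearest neighbour (itself excluded)."""
    cores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        cores.append(dists[k - 1])
    return cores


def mutual_reachability(points, k):
    """d_mreach(a, b) = max(core_k(a), core_k(b), d(a, b)) for every pair of points."""
    cores = core_distances(points, k)
    n = len(points)
    return [[0.0 if i == j
             else max(cores[i], cores[j], math.dist(points[i], points[j]))
             for j in range(n)]
            for i in range(n)]


# Hypothetical data: a dense square of four points and one distant outlier.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
mreach = mutual_reachability(points, k=2)
```

Within the dense square the distances are barely changed, while every distance involving the outlier is raised to at least its large core distance. This pushes sparse points away from dense regions before the minimum spanning tree is built.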
It has become popular in data science and machine learning applications where clusters are irregularly shaped and densities vary, such as in bioinformatics, geospatial analysis, and anomaly detection (McInnes, Healy, & Astels, 2017).
The quality of the clustering results was evaluated using the Silhouette Coefficient, an internal cluster validation measure that jointly assesses intra-cluster cohesion and inter-cluster separation. For each data point, it compares the average distance to the other points in the same cluster with the average distance to the points in the nearest neighboring cluster, yielding values in the range [−1, 1], where higher values indicate better-defined clusters (Rousseeuw, 1987). The average Silhouette Coefficient is commonly used to compare clustering results and to select the number of clusters.
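As a worked illustration of this definition, the sketch below computes s(i) = (b − a) / max(a, b) for each point, where a is the mean distance to the other members of the point's own cluster and b is the smallest mean distance to the members of any other cluster. The points and labels are hypothetical, and the sketch assumes at least two clusters and that any noise points have already been removed; the source does not state how noise points were treated.

```python
import math


def silhouette_scores(points, labels):
    """Per-point silhouette values s(i) = (b - a) / max(a, b)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        # (the distance of the point to itself is 0, hence the n - 1 divisor).
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sum(math.dist(point, q) for q in members) / len(members)
                for other, members in clusters.items() if other != label)
        scores.append((b - a) / max(a, b))
    return scores


# Hypothetical example: two compact, well-separated clusters.
points = [(0, 0), (0, 1), (4, 0), (4, 1)]
labels = [0, 0, 1, 1]
scores = silhouette_scores(points, labels)
mean_silhouette = sum(scores) / len(scores)
```

Values near 1 indicate points that sit much closer to their own cluster than to any other, values near 0 indicate points on a boundary, and negative values indicate points that may be assigned to the wrong cluster.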
In the subsequent stage, eight variables whose PCA loadings exceeded 0.25 were selected from the dataset and discretized into categorical variables. Additionally, two previously excluded variables, gender and activity, were reintroduced into the analysis. Association rule mining was then applied to the resulting dataset.
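The source does not specify how the variables were discretized. One common option, shown here purely as an assumption, is equal-frequency binning into three ordered categories; the function name, the category labels, and the step counts are hypothetical.

```python
def equal_frequency_labels(values, labels=("low", "medium", "high")):
    """Rank the values and split the ranks into len(labels) equally sized groups."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    result = [None] * n
    for rank, index in enumerate(order):
        # Tied values are separated by their original order (stable sort).
        result[index] = labels[rank * len(labels) // n]
    return result


# Hypothetical step counts for six observations.
steps = [1200, 8500, 300, 4300, 9900, 5100]
step_levels = equal_frequency_labels(steps)

# Each categorical value becomes an "item" for association rule mining.
step_items = [f"steps={level}" for level in step_levels]
```

Encoding each observation as a set of such items (together with gender and activity type) gives the transactional format that association rule mining expects.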
Association rules are a data mining technique used to discover frequent co-occurrence relationships between variables in large datasets. They are most commonly applied in market basket analysis to identify patterns of the form X⇒Y, where the presence of itemset X implies a higher likelihood of itemset Y occurring (Agrawal, Imieliński, & Swami, 1993).
The quality of association rules is typically evaluated using measures such as support, confidence, and lift, which quantify the frequency, reliability, and strength of the discovered associations. Association rule mining aims to uncover interpretable and actionable patterns in transactional data without requiring predefined class labels.
In this study, the extracted rules of the form X⇒Y were evaluated using the support and confidence measures.
Support quantifies how frequently an itemset occurs in the dataset and is defined as the proportion of transactions that contain both X and Y. It reflects the prevalence of a rule within the data. Confidence measures the conditional probability that itemset Y appears in a transaction given that itemset X is present. It indicates the strength and reliability of the implication X⇒Y.
Both measures are fundamental for filtering and interpreting association rules, ensuring that discovered patterns are both frequent and meaningful (Agrawal et al., 1993).
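The definitions above translate directly into code. The following minimal sketch computes support, confidence, and lift for a single rule; the transactions and item names are hypothetical and only mimic the kind of discretized records described earlier.

```python
def support(transactions, itemset):
    """Share of transactions that contain every item in the itemset."""
    items = set(itemset)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(Y | X): support of X and Y together divided by support of X."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


def lift(transactions, antecedent, consequent):
    """Confidence relative to the baseline frequency of the consequent."""
    return (confidence(transactions, antecedent, consequent)
            / support(transactions, consequent))


# Hypothetical discretized records, one set of items per observation.
transactions = [
    {"intensity=high", "heart_rate=high", "activity=running"},
    {"intensity=high", "heart_rate=high", "activity=running"},
    {"intensity=low", "heart_rate=low", "activity=lying"},
    {"intensity=high", "heart_rate=medium", "activity=walking"},
    {"intensity=low", "heart_rate=low", "activity=sitting"},
]

X, Y = {"intensity=high"}, {"heart_rate=high"}
rule_support = support(transactions, X | Y)       # 2 of 5 transactions
rule_confidence = confidence(transactions, X, Y)  # 2 of the 3 containing X
rule_lift = lift(transactions, X, Y)
```

A lift above 1 means that the consequent occurs more often together with the antecedent than it does overall; a lift of exactly 1 would indicate independence.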
Results
This section presents the results obtained from the PCA, clustering, and association rule analyses.
PCA
A principal component analysis (PCA) was performed on 19 standardized variables to reduce dimensionality and identify the dominant sources of variability in the data. The first three principal components (PC1–PC3) together explain approximately 50.6% of the total variance, while the first ten components account for over 91% of the variance, indicating that most of the information in the dataset is captured by a limited number of components.
## Standard deviations (1, .., p=19):
## [1] 2.151750e+00 1.664285e+00 1.490993e+00 1.354762e+00 1.212629e+00
## [6] 1.047239e+00 1.013420e+00 9.684864e-01 8.838961e-01 7.582916e-01
## [11] 7.319553e-01 7.124577e-01 5.366382e-01 4.700018e-01 2.974602e-01
## [16] 9.728283e-02 5.532574e-02 3.876934e-15 1.048456e-15
##
## Rotation (n x k) = (19 x 19):
## PC1 PC2 PC3 PC4
## age 0.17026404 -0.032903167 -0.07471672 -0.44161360
## height -0.17818096 -0.191545742 0.38157856 0.21774538
## weight -0.13107361 -0.311988725 0.48373308 -0.11858970
## steps 0.12065969 -0.418297671 -0.24909285 0.24874562
## heart_rate 0.38112435 0.083260982 0.16018438 0.23033262
## calories -0.35570372 0.008087567 0.05737633 0.22406094
## distance 0.04712192 -0.514528238 -0.18269409 0.11555000
## entropy_heart_per_day -0.02027078 0.211698637 -0.13416212 0.10664207
## entropy_steps_per_day -0.26608551 0.116054203 -0.06324543 0.25911258
## resting_heart_rate 0.16600632 0.042056642 -0.21018280 -0.12946859
## correlation_heart_rate_steps 0.07412204 0.090474014 0.09207671 -0.04053844
## normalized_heart_rate 0.34279193 0.072004592 0.28341757 0.32154404
## intensity 0.36340183 0.067991673 0.25902124 0.26948565
## sd_normalized_heart_rate 0.28805446 -0.005939188 0.15136829 0.08530385
## stepsXdistance 0.05421691 -0.499758058 -0.20514483 0.15812584
## bmi -0.04018122 -0.265019781 0.35828319 -0.34754685
## steps_per_km 0.05945750 0.047328532 -0.25081374 0.25112230
## calories_per_step -0.24919827 0.133415127 0.09384552 0.16047670
## weight_loss_kg -0.35570372 0.008087567 0.05737633 0.22406094
## PC5 PC6 PC7 PC8
## age -0.21814204 0.24971818 0.05098951 -0.184096283
## height 0.34374492 -0.16290768 0.03450000 -0.007289308
## weight 0.05851094 -0.22024006 0.09678620 0.162035582
## steps 0.07096035 -0.01621819 0.14941379 0.113733659
## heart_rate -0.27494699 -0.10423605 0.07976258 0.090513791
## calories -0.34894459 0.20019249 0.09564692 0.042562368
## distance -0.13834750 0.07617252 -0.22096038 0.002939658
## entropy_heart_per_day -0.11476337 -0.47471443 -0.55626116 0.207757787
## entropy_steps_per_day 0.05735289 -0.15856612 -0.19270940 0.098214174
## resting_heart_rate -0.36493567 -0.27790446 0.25795262 0.508879272
## correlation_heart_rate_steps 0.24030193 0.50507334 -0.25936509 0.662179156
## normalized_heart_rate -0.12525951 0.02197313 -0.03936018 -0.152255330
## intensity -0.18551053 0.01634780 -0.01223357 -0.112922071
## sd_normalized_heart_rate 0.15778414 0.26357628 -0.06804184 0.164417668
## stepsXdistance -0.11686740 0.09372777 -0.21891920 -0.012130168
## bmi -0.22235803 -0.19061209 0.06475518 0.233516438
## steps_per_km 0.28426186 -0.10922544 0.58228793 0.191534101
## calories_per_step -0.24747764 0.23540216 0.11469830 0.102287596
## weight_loss_kg -0.34894459 0.20019249 0.09564692 0.042562368
## PC9 PC10 PC11
## age -0.526834837 0.006878036 0.072131159
## height 0.160223166 0.110409228 0.022117451
## weight -0.103998919 0.021994785 0.018512171
## steps -0.139195842 0.034357816 -0.013419976
## heart_rate 0.059719242 -0.010210653 -0.123780447
## calories 0.001122053 -0.312640514 0.061778675
## distance 0.032894255 0.001208693 -0.033312677
## entropy_heart_per_day -0.255881790 0.109312886 0.220575249
## entropy_steps_per_day -0.389749958 -0.208491632 -0.054111878
## resting_heart_rate 0.365113226 -0.009091281 0.034185400
## correlation_heart_rate_steps -0.069727872 -0.022771537 -0.371774814
## normalized_heart_rate -0.115070989 -0.006872282 -0.155181956
## intensity -0.135110540 -0.008764395 -0.135061554
## sd_normalized_heart_rate 0.048994241 -0.176143786 0.840561081
## stepsXdistance 0.002642834 0.146776681 0.006437946
## bmi -0.286538339 -0.062343091 0.018526546
## steps_per_km -0.427493067 -0.007031209 0.041234731
## calories_per_step -0.087310246 0.823319460 0.166742042
## weight_loss_kg 0.001122053 -0.312640514 0.061778675
## PC12 PC13 PC14
## age 0.123837349 0.5654931065 -0.0051417428
## height 0.192359063 0.5551019152 -0.0692391654
## weight 0.030379999 0.0651484807 -0.0233262005
## steps 0.009980977 0.0346122455 0.7877943273
## heart_rate -0.054429595 0.0923950755 -0.0325225367
## calories 0.206776482 0.0247025517 0.0311698121
## distance 0.006336632 -0.0334702814 -0.3625358608
## entropy_heart_per_day 0.451136652 0.0007499187 0.0631538120
## entropy_steps_per_day -0.719532545 0.2249489714 -0.0483825152
## resting_heart_rate -0.141999337 0.3254562045 -0.0710739541
## correlation_heart_rate_steps 0.113563093 0.0304030866 0.0014418304
## normalized_heart_rate 0.009923004 -0.0588586448 -0.0009261428
## intensity 0.008057061 0.0096483440 -0.0213751050
## sd_normalized_heart_rate -0.140497582 -0.0260617901 -0.0132850593
## stepsXdistance -0.041676226 0.0054757232 -0.2868236923
## bmi -0.105065966 -0.4151213914 0.0298437025
## steps_per_km 0.226565006 -0.1416465127 -0.3790787265
## calories_per_step -0.140718893 -0.0406781294 0.0265733987
## weight_loss_kg 0.206776482 0.0247025517 0.0311698121
## PC15 PC16 PC17
## age 0.014319832 -0.067311517 0.0141510389
## height 0.011676573 0.019501173 0.4514443197
## weight -0.009320743 -0.028698540 -0.7291125557
## steps 0.057093403 0.006990753 0.0051326669
## heart_rate -0.008163266 -0.369616549 0.0145063051
## calories -0.034515414 -0.004483385 -0.0017897111
## distance 0.690996246 -0.016080349 0.0040735205
## entropy_heart_per_day 0.007967456 -0.005989196 -0.0244077825
## entropy_steps_per_day 0.019777449 -0.002639191 0.0022629506
## resting_heart_rate -0.009846930 0.092996913 0.0061830246
## correlation_heart_rate_steps -0.013239172 0.001453746 0.0024268260
## normalized_heart_rate -0.004210826 -0.458862589 0.0131147385
## intensity 0.014750285 0.798672791 -0.0271156474
## sd_normalized_heart_rate 0.013202389 -0.005371121 -0.0008144821
## stepsXdistance -0.709931015 0.005647063 -0.0061728176
## bmi -0.037245627 0.007108037 0.5123162669
## steps_per_km -0.008443855 -0.005907717 0.0085412592
## calories_per_step 0.098944107 -0.005345384 0.0009830153
## weight_loss_kg -0.034515414 -0.004483385 -0.0017897111
## PC18 PC19
## age -5.720707e-16 -2.419170e-16
## height -2.785340e-16 -2.897511e-17
## weight 1.878064e-15 1.544297e-16
## steps 2.476796e-16 2.056489e-17
## heart_rate -7.068315e-01 -5.503839e-03
## calories 5.505816e-03 -7.070853e-01
## distance -2.596307e-16 -3.643770e-17
## entropy_heart_per_day 9.951262e-17 1.527408e-16
## entropy_steps_per_day -3.693731e-16 -1.077882e-16
## resting_heart_rate 3.151856e-01 2.454235e-03
## correlation_heart_rate_steps -2.482151e-16 -1.015604e-16
## normalized_heart_rate 6.332351e-01 4.930771e-03
## intensity 7.170308e-16 -6.340873e-16
## sd_normalized_heart_rate 8.498536e-17 4.887811e-17
## stepsXdistance 2.075705e-16 1.554983e-16
## bmi -3.551076e-16 -2.763674e-17
## steps_per_km -1.135095e-16 -1.588685e-16
## calories_per_step -1.140283e-16 2.052659e-16
## weight_loss_kg -5.505816e-03 7.070853e-01
Principal Component 1 (PC1)
The first principal component (PC1) explains 24.4% of the total variance and is primarily associated with cardiovascular intensity and activity-related variables. The largest positive loadings are observed for intensity, heart rate, normalized heart rate, and standard deviation of normalized heart rate, while strong negative loadings are found for calories, calories per step, entropy of steps per day, and weight loss. This component can be interpreted as a general physiological activity and exertion axis, contrasting high cardiovascular intensity with energy expenditure efficiency and variability in daily activity patterns.
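The ranking of variables behind this interpretation can be reproduced from the PC1 column of the rotation matrix printed above. The sketch below copies those loadings and sorts them by absolute value; the cut-off of eight variables is chosen only for display.

```python
# PC1 loadings copied from the rotation matrix above.
pc1 = {
    "age": 0.17026404, "height": -0.17818096, "weight": -0.13107361,
    "steps": 0.12065969, "heart_rate": 0.38112435, "calories": -0.35570372,
    "distance": 0.04712192, "entropy_heart_per_day": -0.02027078,
    "entropy_steps_per_day": -0.26608551, "resting_heart_rate": 0.16600632,
    "correlation_heart_rate_steps": 0.07412204,
    "normalized_heart_rate": 0.34279193, "intensity": 0.36340183,
    "sd_normalized_heart_rate": 0.28805446, "stepsXdistance": 0.05421691,
    "bmi": -0.04018122, "steps_per_km": 0.05945750,
    "calories_per_step": -0.24919827, "weight_loss_kg": -0.35570372,
}

# Sort variables by the magnitude of their loading, largest first.
ranked = sorted(pc1.items(), key=lambda item: abs(item[1]), reverse=True)
top_positive = [name for name, value in ranked[:8] if value > 0]
top_negative = [name for name, value in ranked[:8] if value < 0]

# The loadings of a principal component form a unit vector.
squared_norm = sum(value ** 2 for value in pc1.values())
```

Note that calories and weight_loss_kg have identical loadings on PC1, so they contribute the same information to this component.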
Principal Component 2 (PC2)
The second principal component (PC2) accounts for 14.6% of the total variance and is dominated by movement volume and distance-related variables. Strong negative loadings are observed for distance, steps, and steps × distance, indicating that PC2 primarily reflects overall locomotion and physical activity volume. Because these loadings are negative, lower PC2 scores correspond to higher step counts and longer distances. This component therefore distinguishes individuals with high step counts and longer distances from those with lower levels of ambulatory activity.
Principal Component 3 (PC3)
The third principal component (PC3) explains 11.7% of the total variance and is mainly driven by anthropometric characteristics and activity structure. High positive loadings are observed for weight, height, and BMI, while negative contributions are associated with steps, distance-related variables, and steps per kilometer. PC3 therefore captures differences related to body composition and movement efficiency, separating individuals based on physical build and gait-related activity patterns.
Together, PC1, PC2, and PC3 explain approximately 50.6% of the total variance, indicating that these components capture the most prominent physiological, behavioral, and anthropometric dimensions of the data. Extending the analysis to the first ten principal components increases the cumulative explained variance to approximately 91.3%, suggesting that higher-order components contribute progressively less unique information.
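These percentages follow from the component standard deviations printed in the PCA output: each component's variance is its squared standard deviation, and its share is that variance divided by the total. The sketch below repeats the arithmetic with the values copied from the output above.

```python
from itertools import accumulate

# Standard deviations of the 19 components, copied from the PCA output above.
sdev = [
    2.151750, 1.664285, 1.490993, 1.354762, 1.212629,
    1.047239, 1.013420, 0.9684864, 0.8838961, 0.7582916,
    0.7319553, 0.7124577, 0.5366382, 0.4700018, 0.2974602,
    0.09728283, 0.05532574, 3.876934e-15, 1.048456e-15,
]

variances = [s ** 2 for s in sdev]
# For 19 standardized variables the total variance is approximately 19.
total_variance = sum(variances)
proportion = [v / total_variance for v in variances]
cumulative = list(accumulate(proportion))

for k in (1, 2, 3, 10):
    print(f"PC1..PC{k}: {cumulative[k - 1]:.1%} of total variance")
```

The last two standard deviations are numerically zero, so components 18 and 19 carry no variance.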
The Principal Component Space Visualization displays the projection of the original high-dimensional dataset onto the first two principal components, DIM 1 and DIM 2, which account for 24.4% and 14.6% of the total variance, respectively. This indicates that together, these two components capture approximately 39% of the overall variability in the dataset, providing a meaningful, albeit partial, low-dimensional representation of the data structure.
Clustering
The clustering procedure was conducted based on the results derived from the most influential principal components. Prior to this, a scatter plot was generated to assess the underlying structure of the data.
Due to the irregular shapes of the point groupings observed in the scatter plot, the clustering procedure was carried out using the HDBSCAN method.
In the plot above, we can observe that two clusters have formed.
## [1] 0.246
The average Silhouette Coefficient of 0.246 indicates a modest cluster structure: the grouping captures some underlying patterns in the data, although the separation between the two clusters is weak.
Association Rules
The chart above shows the extracted association rules, which reveal health patterns, arranged according to their confidence and support.
Conclusions
Based on the conducted study, answers to the research questions were obtained and the research hypotheses were confirmed.
References
Agrawal, R., Imieliński, T., & Swami, A. (1993). Mining association rules between sets of items in large databases. ACM SIGMOD Record, 22(2), 207–216.
Esteva, A., Robicquet, A., Ramsundar, B., Kuleshov, V., DePristo, M., Chou, K., et al. (2019). A guide to deep learning in healthcare. Nature Medicine, 25(1), 24–29.
Jiang, F., Jiang, Y., Zhi, H., Dong, Y., Li, H., Ma, S., et al. (2020). Artificial intelligence in healthcare: Past, present and future. Stroke and Vascular Neurology, 5(4), 230–243.
Patel, M. S., Asch, D. A., & Volpp, K. G. (2015). Wearable devices as facilitators, not drivers, of health behavior change. JAMA, 313(5), 459–460.
Piwek, L., Ellis, D. A., Andrews, S., & Joinson, A. (2016). The rise of consumer health wearables: Promises and barriers. PLoS Medicine, 13(2), e1001953.
Wold, S., Esbensen, K., & Geladi, P. (1987). Principal component analysis. Chemometrics and Intelligent Laboratory Systems, 2(1–3), 37–52.
From data to diagnosis: A comprehensive review of machine learning-driven wearable sensors in healthcare. (2023). PubMed. Source: https://pubmed.ncbi.nlm.nih.gov/41534133/
A Scoping Review of the Use of Machine Learning in Health Economics and Outcomes Research: Part 1 — Data From Wearable Devices. (2022). Value in Health. Source: https://www.sciencedirect.com/science/article/pii/S1098301522021453
Feature selection for unsupervised machine learning of accelerometer data physical activity clusters: A systematic review. (2021). PubMed. Source: https://pubmed.ncbi.nlm.nih.gov/34438293/
Wearable Devices and Explainable Unsupervised Learning for COVID‑19 Detection and Monitoring. (2023). MDPI Diagnostics. Source: https://pubmed.ncbi.nlm.nih.gov/37835814/
Data Mining for Wearable Sensors in Health Monitoring Systems: A Review of Recent Trends and Challenges. (2013). MDPI Sensors. Source: https://www.mdpi.com/1424-8220/13/12/17472
Jolliffe, I. T., & Cadima, J. (2016). Principal component analysis: A review and recent developments. Philosophical Transactions of the Royal Society A, 374(2065), 20150202.
Ringnér, M. (2008). What is principal component analysis? Nature Biotechnology, 26(3), 303–304.
Hopkins, B., & Skellam, J. G. (1954). A new method for determining the type of distribution of plant individuals. Annals of Botany, 18(2), 213–227.
Tan, P.-N., Steinbach, M., & Kumar, V. (2006). Introduction to Data Mining. Addison-Wesley.
Hennig, C. (2015). Cluster Validation. Chapman & Hall/CRC.
Campello, R. J. G. B., Moulavi, D., & Sander, J. (2013). Density-based clustering based on hierarchical density estimates. In Advances in Knowledge Discovery and Data Mining (pp. 160–172). Springer.
McInnes, L., Healy, J., & Astels, S. (2017). HDBSCAN: Hierarchical density based clustering. Journal of Open Source Software, 2(11), 205.
Rousseeuw, P. J. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20, 53–65.