High-dimensional datasets, like heart health data, can be challenging to analyze. Dimensionality reduction methods such as PCA help simplify the data by capturing the most important features, allowing us to uncover patterns or groupings in a more interpretable way. Can we simplify heart health data to uncover patterns or group patients with similar clinical characteristics?
The dataset contains clinical features of patients, such as Age, MaxHR, and Cholesterol, and indicators like HeartDisease, which denotes whether a patient has heart disease (1 for “Has Heart Disease”, 0 for “No Heart Disease”).
The dataset used can be sourced from Kaggle.
High-dimensional data with multiple clinical features can obscure patterns and make it challenging to visualize relationships. By applying dimensionality reduction techniques like PCA, we reduce complexity while retaining as much variance as possible. This allows for easier visualization and identification of patterns that could assist in understanding patient groups.
We use a scree plot to visualize the variance explained by each principal component. This helps identify how many PCs are needed to capture most of the variance.
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     1.8069 1.2194 1.1917 1.08922 1.03710 0.98055 0.93133
## Proportion of Variance 0.2332 0.1062 0.1014 0.08474 0.07683 0.06868 0.06196
## Cumulative Proportion  0.2332 0.3394 0.4409 0.52559 0.60242 0.67109 0.73305
##                           PC8     PC9    PC10    PC11    PC12    PC13      PC14
## Standard deviation     0.9196 0.89464 0.78780 0.75183 0.70515 0.63887 1.965e-15
## Proportion of Variance 0.0604 0.05717 0.04433 0.04038 0.03552 0.02915 0.000e+00
## Cumulative Proportion  0.7934 0.85062 0.89495 0.93533 0.97085 1.00000 1.000e+00
Scree plot showing the proportion of variance explained by each principal component.
Based on the scree plot, we observe that the first two components explain approximately 23% and 11% of the total variance, respectively. By including the first three components, we capture around 44% of the total variance.
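The variance figures can be reproduced from the standard deviations alone, since each component's variance is its squared standard deviation. The following is a minimal sketch in Python with NumPy (the analysis itself appears to have been run in R, so this is an illustration rather than the original code). The standard deviations are copied from the summary output; the 85% threshold is an assumption chosen only for illustration.

```python
import numpy as np

# Standard deviations of PC1..PC14, copied from the summary output.
sdev = np.array([
    1.8069, 1.2194, 1.1917, 1.08922, 1.03710, 0.98055, 0.93133,
    0.9196, 0.89464, 0.78780, 0.75183, 0.70515, 0.63887, 1.965e-15,
])

variance = sdev ** 2                     # variance carried by each component
proportion = variance / variance.sum()   # matches the "Proportion of Variance" row
cumulative = np.cumsum(proportion)       # matches the "Cumulative Proportion" row

# Smallest number of components whose cumulative share reaches a threshold.
# The 0.85 value is an illustrative assumption, not part of the analysis.
threshold = 0.85
n_components = int(np.searchsorted(cumulative, threshold) + 1)

for i in range(3):
    print(f"PC{i + 1}: {proportion[i]:.4f} (cumulative {cumulative[i]:.4f})")
print(f"Components needed to reach {threshold:.0%}: {n_components}")
```

The standard deviation of PC14 is numerically zero. This usually means one of the 14 input columns is a linear combination of the others (for example, a redundant dummy variable), though the text does not say how the features were encoded.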
To understand how features contribute to the principal components, we examine the PCA loadings. Larger contributions indicate that the feature plays a significant role in the corresponding PC.
Feature contributions to the principal components. Larger contributions indicate greater influence on the corresponding PC.
From the feature contributions, we see that Age and MaxHR have high contributions to PC1, while Oldpeak and RestingBP influence PC2 the most. Although facet wrapping helps display contributions for each principal component clearly, it reduces the ability to compare across PCs directly.
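To make the idea of feature contributions concrete, here is a hedged sketch in Python with NumPy. It runs on synthetic data, not the heart dataset, and it assumes that a feature's "contribution" to a component is its squared loading expressed as a percentage. That is one common definition, but the text does not state which one the plots use. The feature names are borrowed from the text purely as labels.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the clinical table: 200 patients, 4 numeric features.
feature_names = ["Age", "MaxHR", "Cholesterol", "RestingBP"]
X = rng.normal(size=(200, 4))
# Make the second column strongly (negatively) related to the first so that
# the two share a component; this relationship is invented for the demo.
X[:, 1] = -0.7 * X[:, 0] + 0.3 * X[:, 1]

# Standardize each column so features on different scales are comparable.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# PCA via SVD: the rows of Vt are the principal axes.
U, S, Vt = np.linalg.svd(Z, full_matrices=False)
loadings = Vt.T                      # shape (n_features, n_components)

# Assumed definition of contribution: squared loading as a percentage.
# Each principal axis has unit length, so every column sums to 100.
contrib = 100 * loadings ** 2

for j in range(2):
    order = np.argsort(contrib[:, j])[::-1]
    top = ", ".join(f"{feature_names[i]} ({contrib[i, j]:.1f}%)" for i in order[:2])
    print(f"PC{j + 1} top contributors: {top}")
```

Keeping all contributions in a single features-by-components matrix like `contrib` also makes it easy to compare a feature across PCs, which faceted plots make harder.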
Next, we visualize the data in 2D and 3D space using the first two and the first three principal components, respectively. This allows us to observe patterns or separations between patients with and without heart disease.
2D PCA plot of PC1 vs PC2. Points are colored by heart disease status.
3D PCA plot of PC1, PC2, and PC3. Points are colored by heart disease status.
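The coordinates plotted in such 2D and 3D views are the PCA scores, that is, the standardized data projected onto the leading principal axes. The sketch below (Python with NumPy, synthetic data, hypothetical variable names) shows how those coordinates could be obtained. It also computes the group centroids in the PC1-PC2 plane as a rough numeric companion to visual inspection. The labels here are random, so no separation is expected.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in: 150 patients, 5 numeric features, and a random 0/1 label.
n_patients, n_features = 150, 5
X = rng.normal(size=(n_patients, n_features))
heart_disease = rng.integers(0, 2, size=n_patients)

# Standardize, then obtain the principal axes from the SVD.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
U, S, Vt = np.linalg.svd(Z, full_matrices=False)

# Scores: coordinates of every patient on every principal component.
scores = Z @ Vt.T
scores_2d = scores[:, :2]   # PC1 vs PC2, as in a 2D plot
scores_3d = scores[:, :3]   # PC1, PC2, PC3, as in a 3D plot

# Centroid of each group in the PC1-PC2 plane.
for label in (0, 1):
    centroid = scores_2d[heart_disease == label].mean(axis=0)
    print(f"HeartDisease={label}: centroid (PC1, PC2) = {np.round(centroid, 3)}")
```

Passing `scores_2d` or `scores_3d` to any plotting library, with points colored by the label, would give the corresponding scatter plots.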
Using PCA, we found that the first few principal components explain a limited proportion of the total variance, as observed in the scree plot. Specifically, the first two components capture approximately 34% of the variance, and each later component adds roughly 10% or less, so nine components are needed to exceed 85%.
From the feature contributions, we identified that Age and MaxHR strongly influence PC1, while Oldpeak and RestingBP contribute significantly to PC2. However, the 2D and 3D visualizations did not show clear separations between patients with and without heart disease. Because these views retain only about 34% and 44% of the total variance, the lack of separation may partly reflect information left in the higher components. It also suggests that linear relationships alone may not capture the complexity of the dataset.
To address this limitation, non-linear dimensionality reduction techniques such as t-SNE or UMAP could be explored in future analyses to uncover more intricate patterns or clusters in the data.