Homework 4

Introduction

High-dimensional datasets, like heart health data, can be challenging to analyze. Dimensionality reduction methods such as PCA help simplify the data by capturing the most important features, allowing us to uncover patterns or groupings in a more interpretable way. This report asks: can we simplify heart health data to uncover patterns or group patients with similar clinical characteristics?

Data Preparation

The dataset contains clinical features of patients, such as Age, MaxHR, Cholesterol, and indicators like HeartDisease, which denotes whether a patient has heart disease (1 for “Has Heart Disease”, 0 for “No Heart Disease”).

The dataset used can be sourced from Kaggle.

High-dimensional data with multiple clinical features can obscure patterns and make it challenging to visualize relationships. By applying dimensionality reduction techniques like PCA, we reduce complexity while retaining as much variance as possible. This allows for easier visualization and identification of patterns that could assist in understanding patient groups.
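
PCA is sensitive to the scale of each feature, so clinical variables measured in different units are usually standardized first. The report does not show this step; the sketch below assumes z-score scaling and uses made-up values, not rows from the actual dataset. The function name `standardize` is hypothetical.

```python
from statistics import mean, stdev

def standardize(columns):
    """Z-score each column: subtract its mean, divide by its sample standard deviation."""
    scaled = {}
    for name, values in columns.items():
        mu = mean(values)
        sigma = stdev(values)  # sample standard deviation (n - 1 denominator)
        scaled[name] = [(v - mu) / sigma for v in values]
    return scaled

# Made-up values for illustration only (assumption: not taken from the dataset).
toy = {
    "Age": [35, 45, 55, 65, 50],
    "MaxHR": [180, 160, 140, 120, 150],
    "Cholesterol": [200, 240, 220, 260, 230],
}

scaled = standardize(toy)
for name, values in scaled.items():
    print(name, [round(v, 2) for v in values])
```

After scaling, every column has mean 0 and standard deviation 1, so no single feature dominates the components simply because of its units.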

Principal Component Analysis (PCA)

We use a scree plot to visualize the variance explained by each principal component. This helps identify how many PCs are needed to capture most of the variance.

## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     1.8069 1.2194 1.1917 1.08922 1.03710 0.98055 0.93133
## Proportion of Variance 0.2332 0.1062 0.1014 0.08474 0.07683 0.06868 0.06196
## Cumulative Proportion  0.2332 0.3394 0.4409 0.52559 0.60242 0.67109 0.73305
##                           PC8     PC9    PC10    PC11    PC12    PC13      PC14
## Standard deviation     0.9196 0.89464 0.78780 0.75183 0.70515 0.63887 1.965e-15
## Proportion of Variance 0.0604 0.05717 0.04433 0.04038 0.03552 0.02915 0.000e+00
## Cumulative Proportion  0.7934 0.85062 0.89495 0.93533 0.97085 1.00000 1.000e+00

Scree plot showing the proportion of variance explained by each principal component.
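
The "Proportion of Variance" row in the summary table follows directly from the "Standard deviation" row: each component's variance is its standard deviation squared, divided by the total. The short sketch below recomputes those rows from the printed standard deviations. It is an illustration of the arithmetic, not the code used to produce the report.

```python
# Standard deviations copied from the "Importance of components" table.
sdev = [1.8069, 1.2194, 1.1917, 1.08922, 1.03710, 0.98055, 0.93133,
        0.9196, 0.89464, 0.78780, 0.75183, 0.70515, 0.63887, 1.965e-15]

variances = [s ** 2 for s in sdev]   # variance carried by each component
total = sum(variances)               # close to 14, the number of components
proportion = [v / total for v in variances]

# Running sum of the proportions gives the "Cumulative Proportion" row.
cumulative = []
running = 0.0
for p in proportion:
    running += p
    cumulative.append(running)

for i, (p, c) in enumerate(zip(proportion, cumulative), start=1):
    print(f"PC{i}: proportion={p:.4f} cumulative={c:.4f}")
```

The total variance comes out very close to 14, which is what one would expect if the 14 input columns were standardized to unit variance before PCA.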

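The scree plot shows that PC1 and PC2 explain approximately 23% and 11% of the total variance, respectively, or about 34% together. Including the first three components captures around 44%. PC14 has a standard deviation of essentially zero, which usually means that at least one input column is a linear combination of the others, so 13 components already account for all of the variance.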
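
One way to turn the cumulative proportions into a decision is to pick a variance threshold and count how many leading components are needed to reach it. The sketch below does this with the values from the summary table; the helper name `components_needed` and the thresholds are illustrative choices, not part of the original analysis.

```python
# Cumulative proportions copied from the summary table (PC1 to PC14).
cumulative = [0.2332, 0.3394, 0.4409, 0.52559, 0.60242, 0.67109, 0.73305,
              0.7934, 0.85062, 0.89495, 0.93533, 0.97085, 1.00000, 1.00000]

def components_needed(cumulative, threshold):
    """Return the smallest number of leading PCs whose cumulative variance reaches threshold."""
    for k, c in enumerate(cumulative, start=1):
        if c >= threshold:
            return k
    raise ValueError("threshold is never reached")

for threshold in (0.50, 0.80, 0.90):
    k = components_needed(cumulative, threshold)
    print(f"{threshold:.0%} of variance needs {k} PCs")
```

With these numbers, 50% of the variance needs 4 components, 80% needs 9, and 90% needs 11, which shows how slowly the variance accumulates in this dataset.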

To understand how features contribute to the principal components, we examine the PCA loadings. Larger contributions indicate that the feature plays a significant role in the corresponding PC.

Feature contributions to the principal components. Larger contributions indicate greater influence on the corresponding PC.

From the feature contributions, we see that Age and MaxHR have high contributions to PC1, while Oldpeak and RestingBP influence PC2 the most. Although facet wrapping helps display contributions for each principal component clearly, it reduces the ability to compare across PCs directly.
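
A common convention, assumed here because the report does not state how contributions were computed, expresses each feature's contribution to a component as its squared loading divided by the sum of squared loadings, in percent. The loadings below are hypothetical numbers chosen for illustration; they are not the values from this analysis.

```python
def contributions(loadings):
    """Percentage contribution of each feature to one PC: 100 * loading^2 / sum(loading^2)."""
    total = sum(w ** 2 for w in loadings.values())
    return {name: 100 * w ** 2 / total for name, w in loadings.items()}

# Hypothetical loadings for a single component (assumption: illustrative only).
pc_loadings = {"Age": 0.6, "MaxHR": -0.6, "Oldpeak": 0.4, "RestingBP": 0.35}

contrib = contributions(pc_loadings)
for name, pct in sorted(contrib.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {pct:.1f}%")
```

Because the loading is squared, the sign is lost: a feature with loading -0.6 contributes as much as one with +0.6, even though the two move the component in opposite directions. Contributions therefore show how strongly a feature influences a PC, not in which direction.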

Next, we visualize the data in 2D and 3D space using the first two and first three principal components, respectively. This allows us to observe patterns or separations between patients with and without heart disease.

PCA 2D Plot

2D PCA plot of PC1 vs PC2. Points are colored by heart disease status.

PCA 3D Plot

3D PCA plot of PC1, PC2, and PC3. Points are colored by heart disease status.

Conclusion

Using PCA, we found that the first few principal components explain a limited proportion of the total variance, as observed in the scree plot. Specifically, the first two components capture approximately 34% of the variance and the first three about 44%. Each component from PC4 onward adds between roughly 3% and 8.5%, and nine components are needed to exceed 80%.

From the feature contributions, we identified that Age and MaxHR strongly influence PC1, while Oldpeak and RestingBP contribute significantly to PC2. However, the 2D and 3D visualizations did not show clear separations between patients with and without heart disease, suggesting that linear relationships alone may not capture the complexity of the dataset.

To address this limitation, non-linear dimensionality reduction techniques such as t-SNE or UMAP could be explored in future analyses to uncover more intricate patterns or clusters in the data.