Methodology

Data Preprocessing

Cleaned and standardized patient data (cholesterol, blood pressure, ECG, etc.).
Separated numerical and categorical features.

PCA (Principal Component Analysis)

Reduced multiple risk factors into four key components (PC1-PC4).
PC1 captured cholesterol & BP (cardiovascular risk), PC2 linked to exercise response, PC3 showed stress markers, and PC4 represented ECG/metabolic abnormalities.

Clustering (K-Means & Hierarchical)

Identified three risk groups: Low, Moderate, and High Risk.
High-risk patients had high cholesterol, BP, and ischemia markers.

Feature Descriptions

-age: Age of the patient (in years) -sex: Sex of the patient (1 = male, 0 = female) -cp: Chest pain type (1-4) -trestbps: Resting blood pressure (in mm Hg on admission to the hospital) -chol: Serum cholesterol in mg/dl -fbs: Fasting blood sugar > 120 mg/dl (1 = true; 0 = false) -restecg: Resting electrocardiographic results (0-2) -thalach: Maximum heart rate achieved -exang: Exercise-induced angina (1 = yes; 0 = no) -oldpeak: ST depression induced by exercise relative to rest

Load Libraries

library(dplyr)
library(caret)
library(ggplot2)
library(arules)
library(arulesViz)
library(FactoMineR)
library(factoextra)
library(psych)
library(corrplot)

Load Data

setwd("C:/Users/ozdil/Downloads")
life <- read.csv("USL project2/archive (1)/heart-disease.csv")
head(life)

##   age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
## 1  63   1  3      145  233   1       0     150     0     2.3     0  0    1
## 2  37   1  2      130  250   0       1     187     0     3.5     0  0    2
## 3  41   0  1      130  204   0       0     172     0     1.4     2  0    2
## 4  56   1  1      120  236   0       1     178     0     0.8     2  0    2
## 5  57   0  0      120  354   0       1     163     1     0.6     2  0    2
## 6  57   1  0      140  192   0       1     148     0     0.4     1  0    1
##   target
## 1      1
## 2      1
## 3      1
## 4      1
## 5      1
## 6      1

Data Preprocessing

categorical_columns <- c("sex", "cp", "fbs", "restecg", "exang", "slope", "ca", "thal")
sum(is.na(life))  # Check for missing values

## [1] 0

life <- na.omit(life)  # Remove missing values

# Separate features
df_features <- life %>% select(-target)
df_numeric <- df_features %>% select(-one_of(categorical_columns))

# Standardize numerical features
df_scaled <- as.data.frame(scale(df_numeric))
df_scaled_full <- cbind(df_scaled, df_features[, categorical_columns])

Principal Component Analysis (PCA)

fa.parallel(df_scaled_full, fa = "pc", n.iter = 100)

## Parallel analysis suggests that the number of factors =  NA  and the number of components =  2

pca_model <- prcomp(df_scaled_full, center = TRUE, scale. = TRUE)
summary(pca_model)

## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     1.6622 1.2396 1.10582 1.08681 1.01092 0.98489 0.92885
## Proportion of Variance 0.2125 0.1182 0.09406 0.09086 0.07861 0.07462 0.06637
## Cumulative Proportion  0.2125 0.3307 0.42481 0.51567 0.59428 0.66890 0.73527
##                            PC8    PC9    PC10    PC11    PC12   PC13
## Standard deviation     0.88088 0.8479 0.78840 0.72808 0.65049 0.6098
## Proportion of Variance 0.05969 0.0553 0.04781 0.04078 0.03255 0.0286
## Cumulative Proportion  0.79495 0.8503 0.89807 0.93885 0.97140 1.0000

Scree Plot & Variance Explained

screeplot(pca_model, type = "lines", main = "Scree Plot")

fviz_eig(pca_model, addlabels = TRUE, main = "Cumulative Variance Explained by PCs")

factoextra::fviz_pca_var(pca_model, col.var="contrib")

Clustering (K-Means & Hierarchical)

fviz_nbclust(df_scaled, kmeans, method = "wss")

set.seed(123)
kmeans_result <- kmeans(df_scaled, centers = 3)
fviz_cluster(kmeans_result, data = df_scaled)

Hierarchical Clustering

hclust_result <- hclust(dist(df_scaled))
fviz_dend(hclust_result, k = 3)

## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
##   Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Cluster Characteristics

Cluster 1 (Low-Risk Group): Low PC1 (cholesterol, BP) & low PC3 (no ischemia/blockages)
Cluster 2 (Moderate-Risk Group): Moderate PC1, PC3 & PC4 (some risk factors but not extreme)
Cluster 3 (High-Risk Group): High PC1 (cholesterol, BP) & high PC3 (blockages & ischemia)

Association Rule Mining (Apriori Algorithm)

pca_data_discrete <- data.frame(
  PC1 = discretize(pca_model$x[,1], method = "frequency", breaks = 3),
  PC2 = discretize(pca_model$x[,2], method = "frequency", breaks = 3),
  PC3 = discretize(pca_model$x[,3], method = "frequency", breaks = 3),
  PC4 = discretize(pca_model$x[,4], method = "frequency", breaks = 3)
)

pca_data_trans <- as(pca_data_discrete, "transactions")
rules <- apriori(pca_data_trans, parameter = list(supp = 0.1, conf = 0.8))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5     0.1      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 30 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[12 item(s), 303 transaction(s)] done [0.00s].
## sorting and recoding items ... [12 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [0 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

inspect(head(rules))

Association Rule Insights

PC1, PC2, PC3, and PC4 have strong interrelationships.
Higher PC2 often correlates with higher PC1.
Low PC1 is linked with moderate PC3 and high PC4.
Lift values (~2.5) indicate strong associations.
Confidence (~80-83%) suggests reliable rules.
There are hidden patterns in the data that might predict heart disease risk.
If PC1 (which might represent cholesterol, age, and blood pressure) is high, it often correlates with PC2 being high, meaning these features are linked.
This method could help identify risk groups based on the most important factors from PCA.

Visualizing Cluster Characteristics

library(ggplot2)
df_pca <- as.data.frame(pca_model$x)  # Extract PCA-transformed data
df_pca$Cluster <- as.factor(kmeans_result$cluster)

ggplot(df_pca, aes(x = Cluster, y = PC1, fill = Cluster)) +
  geom_boxplot() +
  theme_minimal() +
  ggtitle("Distribution of PC1 Across Clusters")

rule_clusters <- table(df_pca$Cluster, pca_data_discrete$PC1)
rule_clusters_df <- as.data.frame(rule_clusters)
colnames(rule_clusters_df) <- c("Cluster", "PC1_Bin", "Count")

ggplot(rule_clusters_df, aes(x = PC1_Bin, y = Count, fill = Cluster)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme_minimal() +
  ggtitle("Frequency of PC1 Rules in Each Cluster")

Correlation Analysis

cor_matrix <- cor(pca_model$x[,1:4], df_numeric)
corrplot(cor_matrix, method = "color", tl.cex = 0.8)

Results & Insights

PCA simplified risk factor analysis, allowing early detection of high-risk individuals.
Clustering revealed groups with distinct heart disease risk levels.
Association Rules confirmed strong correlations between high cholesterol, exercise response, and ischemia.

Heart Disease Risk Analysis: A Data-Driven Journey

Dilara Ozdil - 474544

Aim

Methodology

Data Preprocessing

PCA (Principal Component Analysis)

Clustering (K-Means & Hierarchical)

Feature Descriptions

Load Libraries

Load Data

Data Preprocessing

Principal Component Analysis (PCA)

Scree Plot & Variance Explained

Clustering (K-Means & Hierarchical)

Hierarchical Clustering

Cluster Characteristics

Association Rule Mining (Apriori Algorithm)

Association Rule Insights

Visualizing Cluster Characteristics

Correlation Analysis

Results & Insights