Choose a dataset: You get to decide which dataset you want to work on. The dataset must be different from the ones used in previous homework. You can work on a problem from your job, or something you are interested in. You may also obtain a dataset from sites such as Kaggle, Data.gov, the Census Bureau, the USGS, or other open data portals.
Select one of the methodologies studied in weeks 1-10, and another methodology from weeks 11-15, to apply to the newly selected dataset. Weeks 1-10: Linear/Logistic Regression, KNN, LDA, Naive Bayes, Decision Trees, AdaBoost, Gradient Boosting, SVM, K-Means/Hierarchical Clustering. Weeks 11-15: PCA, Bootstrap, Neural Networks.
To complete this task:
I'd like to do some classification analysis using Neural Networks and Clustering, which are supervised and unsupervised learning approaches, respectively.
I'll center my analysis on the Heart Disease Dataset, found in the UCI Machine Learning Repository here: https://archive.ics.uci.edu/. It has features about medical patients, such as their sex, age, blood pressure, and cholesterol levels, which are used to classify whether or not the patient has heart disease. I hope to see if clustering can group certain characteristics together to discern high- from low-risk patients, and will also try to do the same using neural networks.
https://www.kaggle.com/datasets/redwankarimsony/heart-disease-data https://archive.ics.uci.edu/dataset/45/heart+disease
1. Age: Age of the patient.
2. Sex: Sex of the patient [1 = male, 0 = female].
3. cp: Chest pain type [typical angina, atypical angina, non-anginal, asymptomatic].
4. trestbps: Resting blood pressure in mm Hg on admission to the hospital.
5. chol: Serum cholesterol in mg/dl.
6. fbs: Whether fasting blood sugar is > 120 mg/dl [1 = true; 0 = false].
7. restecg: Results of the resting electrocardiogram [values: 0, 1, 2].
8. thalach: Maximum heart rate achieved.
9. exang: Whether exercise-induced angina is present [1 = yes; 0 = no].
10. oldpeak: ST depression induced by exercise relative to rest, i.e., how far the ST segment drops below its resting baseline on the ECG.
11. slope: Slope of the peak exercise ST segment [1 = upsloping; 2 = flat; 3 = downsloping].
12. ca: Number of major vessels colored by fluoroscopy [0-3].
13. thal: Thalassemia status [3 = normal; 6 = fixed defect; 7 = reversible defect].
14. num: Whether the patient has heart disease [0 = no heart disease; 1-4 = stages of heart disease].
As I was unsure of the meaning behind the fields 'slope' and 'oldpeak', I looked them up: ST depression, in the context of heart disease, refers to the position of the ST segment in a person's electrocardiogram (ECG) results, the region between the end of the QRS complex and the beginning of the T wave. The QRS complex shows the electrical activity in the ventricles (the two lower chambers of the heart) that stimulates them to contract. The T wave shows the electrical activity involved in the heart's ventricular repolarization, that is, the electrical reset of the heart as it prepares for the next cardiac cycle. I've attached a URL below with a visual for this:
https://www.medicalnewstoday.com/articles/st-depression-on-ecg#what-it-is (schematic of an ECG trace showing the ST segment)
Based on statistics from the World Health Organization (WHO), cardiovascular diseases are the leading cause of death globally: roughly 18 million people died from cardiovascular diseases in 2019, representing 32% of all global deaths, and of these deaths, 85% were due to heart attack and stroke. Data scientists and analysts can use machine learning and modeling techniques for the diagnosis, prediction, and detection of such diseases. The data itself is rather clean, thanks to UCI prepping it for machine learning work, and I did not need to recode this version of the dataset (e.g., changing 1 to male, 1 to up-sloping, or 6 to fixed defect).
Import needed libraries for analysis
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.3.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.3
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
library(keras)
## Warning: package 'keras' was built under R version 4.3.3
library(caret)
## Warning: package 'caret' was built under R version 4.3.3
## Loading required package: lattice
library(neuralnet)
## Warning: package 'neuralnet' was built under R version 4.3.3
##
## Attaching package: 'neuralnet'
## The following object is masked from 'package:dplyr':
##
## compute
#library(tensorflow)
#install_tensorflow(version = "2.10.0")
Load and prepare Dataset
heart_data_full <- read.table(file="https://raw.githubusercontent.com/RonBalaban/CUNY-SPS/refs/heads/main/DATA622/heart_disease_uci.csv", header=TRUE, sep=",")
# Filter to the Cleveland subset only, not the full expanded data with missing/illogical values
heart_data <- heart_data_full %>%
filter(dataset == "Cleveland")
# Remove unneeded fields
heart_data <- heart_data %>%
select(-dataset) %>%
select(-id)
Change column names
new_names <- c("Age",
"Sex",
"Chest_Pain_Type",
"Resting_Blood_Pressure",
"Serum_Cholesterol",
"Fasting_Blood_Sugar_Over120",
"Resting_ECG",
"Max_Heart_Rate", #
"Exercise_Angina_Present",
"ST_Depression_Exercise",
"Slope_Peak_Exercise_ST_Segment", #Slope of the peak exercise ST segment (1 = upsloping; 2 = flat; 3 = downsloping).
"Num_Major_Vessels_Fluoroscopy",
"Thalassemia", #
"Heart_Disease")
# Apply column names to dataframe
colnames(heart_data) <- new_names
Check for Null/Missing Data
Missing_values <- colSums(is.na(heart_data) | heart_data == "")
Missing_values[Missing_values>0]
## Slope_Peak_Exercise_ST_Segment Num_Major_Vessels_Fluoroscopy
## 1 5
## Thalassemia
## 3
Impute missing data for numeric fields
# Replace blank strings in 'Num_Major_Vessels_Fluoroscopy' with NA
heart_data$Num_Major_Vessels_Fluoroscopy[heart_data$Num_Major_Vessels_Fluoroscopy == ""] <- NA
# Impute the remaining NAs with the median of the observed values
heart_data$Num_Major_Vessels_Fluoroscopy[is.na(heart_data$Num_Major_Vessels_Fluoroscopy)] <-
median(as.numeric(heart_data$Num_Major_Vessels_Fluoroscopy), na.rm = TRUE)
Impute missing data for categorical fields
# Impute missing or blank values for 'Thalassemia' with the mode
# Helper: return the most frequent value (statistical mode) of a vector
getmode <- function(v) {
uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]
}
# First, replace blank strings with NA
heart_data$Thalassemia[heart_data$Thalassemia == ""] <- NA
# Impute with the mode
heart_data$Thalassemia[is.na(heart_data$Thalassemia)] <- getmode(heart_data$Thalassemia)
# Repeat for 'Slope_Peak_Exercise_ST_Segment'
heart_data$Slope_Peak_Exercise_ST_Segment[heart_data$Slope_Peak_Exercise_ST_Segment == ""] <- NA
heart_data$Slope_Peak_Exercise_ST_Segment[is.na(heart_data$Slope_Peak_Exercise_ST_Segment)] <- getmode(heart_data$Slope_Peak_Exercise_ST_Segment)
Now I have no missing values, so let's shift to clustering. I will exclude my target feature (Heart_Disease) from the clustering process. I could simply exclude the categorical fields from the clustering, since Euclidean distance (which K-means uses to measure within-cluster similarity) is not meaningful for them. Alternatively, I can one-hot encode them to prevent significant information loss, which is the route I take below.
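For completeness, there is also a third option: a dissimilarity measure designed for mixed data types. Below is a minimal sketch (not used in the rest of this analysis, and assuming the cluster package is installed) of Gower distance paired with k-medoids (PAM), which handles numeric and categorical fields together without one-hot encoding:
library(cluster) # assumed available; provides daisy() and pam()
# Gower dissimilarity works directly on mixed numeric/factor columns;
# character columns are converted to factors so they are treated as categorical
mixed <- heart_data[, setdiff(names(heart_data), "Heart_Disease")]
mixed[] <- lapply(mixed, function(x) if (is.character(x)) factor(x) else x)
gower_dist <- daisy(mixed, metric = "gower")
# K-medoids clustering on the precomputed dissimilarity matrix
pam_fit <- pam(gower_dist, diss = TRUE, k = 3)
table(pam_fit$clustering) # cluster sizes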
# First, make all categorical variables factors
heart_data$Sex <- factor(heart_data$Sex)
heart_data$Chest_Pain_Type <- factor(heart_data$Chest_Pain_Type)
heart_data$Fasting_Blood_Sugar_Over120 <- factor(heart_data$Fasting_Blood_Sugar_Over120)
heart_data$Resting_ECG <- factor(heart_data$Resting_ECG)
heart_data$Exercise_Angina_Present <- factor(heart_data$Exercise_Angina_Present)
heart_data$Slope_Peak_Exercise_ST_Segment <- factor(heart_data$Slope_Peak_Exercise_ST_Segment)
heart_data$Thalassemia <- factor(heart_data$Thalassemia)
# Create one-hot encoded variables
dummies <- dummyVars(~ Sex + Chest_Pain_Type + Fasting_Blood_Sugar_Over120 +
Resting_ECG + Exercise_Angina_Present +
Slope_Peak_Exercise_ST_Segment + Thalassemia,
data = heart_data)
# Create the encoded dataframe
heart_data_encoded <- predict(dummies, newdata = heart_data)
# Convert the result to a data frame
heart_data_encoded <- data.frame(heart_data_encoded)
# Standardize the encoded matrix by scaling each column. Note that only the
# one-hot encoded categorical features are present in heart_data_encoded;
# the continuous variables were not part of the dummyVars formula, so the
# clustering below runs on the encoded categorical features only.
heart_data_scaled <- scale(heart_data_encoded)
Encoding is done; let's do K-means clustering, starting with k = 3.
set.seed(12345)
kmeans_result <- kmeans(heart_data_scaled, centers = 3, nstart = 25)
# Add the cluster assignments back to the original data
heart_data_encoded$Cluster <- kmeans_result$cluster
# View the clustered data
head(heart_data_encoded)
## Sex.Female Sex.Male Chest_Pain_Type.asymptomatic
## 1 0 1 0
## 2 0 1 1
## 3 0 1 1
## 4 0 1 0
## 5 1 0 0
## 6 0 1 0
## Chest_Pain_Type.atypical.angina Chest_Pain_Type.non.anginal
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 1
## 5 1 0
## 6 1 0
## Chest_Pain_Type.typical.angina Fasting_Blood_Sugar_Over120.FALSE
## 1 1 0
## 2 0 1
## 3 0 1
## 4 0 1
## 5 0 1
## 6 0 1
## Fasting_Blood_Sugar_Over120.TRUE Resting_ECG.lv.hypertrophy
## 1 1 1
## 2 0 1
## 3 0 1
## 4 0 0
## 5 0 1
## 6 0 0
## Resting_ECG.normal Resting_ECG.st.t.abnormality Exercise_Angina_Present.FALSE
## 1 0 0 1
## 2 0 0 0
## 3 0 0 0
## 4 1 0 1
## 5 0 0 1
## 6 1 0 1
## Exercise_Angina_Present.TRUE Slope_Peak_Exercise_ST_Segment.downsloping
## 1 0 1
## 2 1 0
## 3 1 0
## 4 0 1
## 5 0 0
## 6 0 0
## Slope_Peak_Exercise_ST_Segment.flat Slope_Peak_Exercise_ST_Segment.upsloping
## 1 0 0
## 2 1 0
## 3 1 0
## 4 0 0
## 5 0 1
## 6 0 1
## Thalassemia.fixed.defect Thalassemia.normal Thalassemia.reversable.defect
## 1 1 0 0
## 2 0 1 0
## 3 0 0 1
## 4 0 1 0
## 5 0 1 0
## 6 0 1 0
## Cluster
## 1 1
## 2 3
## 3 3
## 4 2
## 5 2
## 6 2
# Reduce dimensions using PCA for visualization
pca <- prcomp(heart_data_scaled)
# Add PCA components to the dataframe
heart_data_encoded$PCA1 <- pca$x[,1]
heart_data_encoded$PCA2 <- pca$x[,2]
# Plot clusters using the first two principal components
ggplot(heart_data_encoded, aes(x = PCA1, y = PCA2, color = factor(Cluster))) +
geom_point() +
labs(title = "K-means Clustering (PCA Projection)", x = "PCA1", y = "PCA2") +
theme_minimal()
I will use the elbow method to find the optimal number of clusters: it tracks the within-cluster sum of squares (WCSS) and looks for the point where its rate of decrease slows down (the "elbow").
# Elbow method: compute WCSS for k = 1 to 25
wss <- numeric(25)
for (k in 1:25) {
kmeans_model <- kmeans(heart_data_scaled, centers = k, nstart = 25)
wss[k] <- kmeans_model$tot.withinss
}
# Plot the WCSS for k = 1 to 25
plot(1:25, wss, type = "b", main = "Elbow Method for K-means", xlab = "Number of Clusters", ylab = "WCSS")
The WCSS drops rapidly until about k = 15, then decreases much more slowly, and adding more clusters doesn't meaningfully improve the clustering. At k = 15 the WCSS is about 2200, and at k = 18 it is about 2000, which demonstrates this point.
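To make "rate of decrease" concrete, here is a small sketch that reuses the wss vector from the loop above and reports the percentage WCSS improvement gained by each additional cluster:
# Percentage drop in WCSS when moving from k clusters to k + 1
wss_drop <- -diff(wss) / head(wss, -1) * 100
round(wss_drop, 1) # flattens out once extra clusters stop helping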
I want to plot the clusters for k ranging from 3 to 7.
# Create a list to store plots
plot_list <- list()
for (k in 3:7) {
set.seed(12345)
# Perform K-means clustering
kmeans_result <- kmeans(heart_data_scaled, centers = k, nstart = 25)
# Add the cluster assignments to the dataset
heart_data_encoded$Cluster <- as.factor(kmeans_result$cluster)
# Perform PCA to reduce the data to 2 dimensions for visualization
pca <- prcomp(heart_data_scaled)
heart_data_encoded$PCA1 <- pca$x[,1]
heart_data_encoded$PCA2 <- pca$x[,2]
# Create a plot for the current k
p <- ggplot(heart_data_encoded, aes(x = PCA1, y = PCA2, color = Cluster)) +
geom_point(alpha = 0.7) +
labs(title = paste("k =", k), x = "PCA1", y = "PCA2") +
theme_minimal() +
theme(legend.position = "none") # hide the legend; clusters are color-coded
# Store the plot in the list
plot_list[[as.character(k)]] <- p
}
# Plot all graphs together
grid.arrange(grobs = plot_list, ncol = 2)
Clustering methods like K-means are unsupervised learning techniques, which means they don't directly predict or classify whether a patient has heart disease. Instead, I can group similar patients together based on their features to see if there are any inherent underlying patterns among the patients, without resorting to labels. This could group them based on their sex, type of chest pain, cholesterol level, heart rate, and so on. I can now use these clusters to train a model that predicts whether a patient has heart disease based on their features and the cluster they belong to.
Let's fit a random forest first.
# 1. Recode Heart_Disease to binary (0 = No Heart Disease, 1 = Heart Disease)
heart_data_encoded$Heart_Disease <- ifelse(heart_data$Heart_Disease == 0, "No Heart Disease", "Heart Disease")
heart_data_encoded$Heart_Disease <- as.factor(heart_data_encoded$Heart_Disease)
# 2. Add the cluster as a feature for classification
heart_data_encoded$Cluster <- as.factor(heart_data_encoded$Cluster)
# 3. Split the data into training and testing sets
set.seed(12345)
train_index <- createDataPartition(heart_data_encoded$Heart_Disease, p = 0.7, list = FALSE)
train_data <- heart_data_encoded[train_index, ]
test_data <- heart_data_encoded[-train_index, ]
# 4. Train a Random Forest model (binary classification)
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.3.3
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:gridExtra':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
rf_model <- randomForest(Heart_Disease ~ ., data = train_data, ntree = 100)
# 5. Evaluate model performance on the test data
predictions <- predict(rf_model, test_data)
# Confusion Matrix
confusion_matrix <- confusionMatrix(predictions, test_data$Heart_Disease)
print(confusion_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Heart Disease No Heart Disease
## Heart Disease 30 11
## No Heart Disease 11 38
##
## Accuracy : 0.7556
## 95% CI : (0.6536, 0.84)
## No Information Rate : 0.5444
## P-Value [Acc > NIR] : 2.879e-05
##
## Kappa : 0.5072
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.7317
## Specificity : 0.7755
## Pos Pred Value : 0.7317
## Neg Pred Value : 0.7755
## Prevalence : 0.4556
## Detection Rate : 0.3333
## Detection Prevalence : 0.4556
## Balanced Accuracy : 0.7536
##
## 'Positive' Class : Heart Disease
##
# 6. Evaluate ROC Curve and AUC
library(pROC)
## Warning: package 'pROC' was built under R version 4.3.3
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
probabilities <- predict(rf_model, test_data, type = "prob")[,2] # column 2 = P("No Heart Disease")
roc_curve <- roc(test_data$Heart_Disease, probabilities)
## Setting levels: control = Heart Disease, case = No Heart Disease
## Setting direction: controls < cases
plot(roc_curve)
auc(roc_curve) #Area under the curve: 0.8377
## Area under the curve: 0.8377
This random forest model has an accuracy of 75.56%, with a tiny p-value against the no-information rate. The sensitivity and specificity are 0.7317 and 0.7755 respectively, indicating how well the model handles true positives and true negatives, both of which are important when dealing with medical diagnoses. The ROC curve is a visual representation of model performance across all thresholds, drawn by calculating the true positive rate (TPR) and false positive rate (FPR) at every possible threshold. The area under the curve is 0.8377, which represents the probability that the random forest model, given a randomly chosen positive example and a randomly chosen negative example, will rank the positive higher than the negative.
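That ranking interpretation can be checked directly. Below is a small sketch (reusing probabilities and test_data from above; recall that pROC treated "No Heart Disease" as the case class here) that computes the proportion of case/control pairs in which the case receives the higher score, counting ties as one half:
# AUC as a ranking probability (Mann-Whitney form)
case_scores <- probabilities[test_data$Heart_Disease == "No Heart Disease"]
control_scores <- probabilities[test_data$Heart_Disease == "Heart Disease"]
pair_wins <- outer(case_scores, control_scores, ">") +
0.5 * outer(case_scores, control_scores, "==")
mean(pair_wins) # should match auc(roc_curve) up to rounding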
Now I want to check whether the clusters generated earlier by K-means can be used on their own to classify patients as having heart disease or not. I'll do so by using the cluster assignments as the sole feature in my classification model.
# Re-code the Heart_Disease column to binary (0 = No Heart Disease, 1 = Heart Disease)
heart_data_encoded$Heart_Disease <- ifelse(heart_data$Heart_Disease == 0, "No Heart Disease", "Heart Disease")
heart_data_encoded$Heart_Disease <- as.factor(heart_data_encoded$Heart_Disease)
# Use the cluster as the only feature for classification
heart_data_encoded$Cluster <- as.factor(heart_data_encoded$Cluster)
# Split the data into training and testing sets
set.seed(12345)
train_index <- createDataPartition(heart_data_encoded$Heart_Disease, p = 0.7, list = FALSE)
train_data <- heart_data_encoded[train_index, ]
test_data <- heart_data_encoded[-train_index, ]
# Train a Random Forest model with only Cluster as the predictor
library(randomForest)
rf_model_clusters <- randomForest(Heart_Disease ~ Cluster, data = train_data, ntree = 100)
# Evaluate the model on the test data
predictions <- predict(rf_model_clusters, test_data)
# Confusion Matrix to see how well clustering predicts heart disease status
confusion_matrix <- confusionMatrix(predictions, test_data$Heart_Disease)
print(confusion_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Heart Disease No Heart Disease
## Heart Disease 25 10
## No Heart Disease 16 39
##
## Accuracy : 0.7111
## 95% CI : (0.606, 0.8018)
## No Information Rate : 0.5444
## P-Value [Acc > NIR] : 0.0008961
##
## Kappa : 0.4106
##
## Mcnemar's Test P-Value : 0.3267996
##
## Sensitivity : 0.6098
## Specificity : 0.7959
## Pos Pred Value : 0.7143
## Neg Pred Value : 0.7091
## Prevalence : 0.4556
## Detection Rate : 0.2778
## Detection Prevalence : 0.3889
## Balanced Accuracy : 0.7028
##
## 'Positive' Class : Heart Disease
##
# Evaluate ROC Curve and AUC
library(pROC)
probabilities <- predict(rf_model_clusters, test_data, type = "prob")[,2] # column 2 = P("No Heart Disease")
roc_curve <- roc(test_data$Heart_Disease, probabilities)
## Setting levels: control = Heart Disease, case = No Heart Disease
## Setting direction: controls < cases
plot(roc_curve)
auc(roc_curve)
## Area under the curve: 0.7847
Here, instead of using all features (Age, Sex, Chest_Pain_Type, Serum_Cholesterol, etc.), I trained the model using only the Cluster feature, and it is noticeably less effective at classifying patients (accuracy 71.11% vs. 75.56%, AUC 0.7847 vs. 0.8377). I also want to see which features have the most bearing on whether or not someone has heart disease; a sketch of that check follows.
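randomForest records how much each predictor reduces Gini impurity across the forest, so the full-feature rf_model fit earlier can rank the features. A minimal sketch (an illustrative check, not part of the pipeline above):
# Rank predictors by mean decrease in Gini impurity
imp <- importance(rf_model)
imp[order(-imp[, "MeanDecreaseGini"]), , drop = FALSE]
# Or visually:
varImpPlot(rf_model, main = "Random Forest Variable Importance")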
Neural Network
library(neuralnet)
library(caret)
# Recode Heart_Disease to binary factor
heart_data_encoded$Heart_Disease <- ifelse(heart_data$Heart_Disease == 0, "No Heart Disease", "Heart Disease")
heart_data_encoded$Heart_Disease <- as.factor(heart_data_encoded$Heart_Disease)
# Convert 'Cluster' to numeric
heart_data_encoded$Cluster <- as.numeric(heart_data_encoded$Cluster)
# Scale the numeric features
heart_data_scaled <- heart_data_encoded
heart_data_scaled[, -which(names(heart_data_scaled) == "Heart_Disease")] <- scale(heart_data_encoded[, -which(names(heart_data_encoded) == "Heart_Disease")])
# Split the data into training and test sets
set.seed(12345)
trainIndex <- createDataPartition(heart_data_scaled$Heart_Disease, p = 0.7, list = FALSE)
train_data <- heart_data_scaled[trainIndex, ]
test_data <- heart_data_scaled[-trainIndex, ]
# Verify that all columns are numeric
sapply(train_data, class)
## Sex.Female
## "numeric"
## Sex.Male
## "numeric"
## Chest_Pain_Type.asymptomatic
## "numeric"
## Chest_Pain_Type.atypical.angina
## "numeric"
## Chest_Pain_Type.non.anginal
## "numeric"
## Chest_Pain_Type.typical.angina
## "numeric"
## Fasting_Blood_Sugar_Over120.FALSE
## "numeric"
## Fasting_Blood_Sugar_Over120.TRUE
## "numeric"
## Resting_ECG.lv.hypertrophy
## "numeric"
## Resting_ECG.normal
## "numeric"
## Resting_ECG.st.t.abnormality
## "numeric"
## Exercise_Angina_Present.FALSE
## "numeric"
## Exercise_Angina_Present.TRUE
## "numeric"
## Slope_Peak_Exercise_ST_Segment.downsloping
## "numeric"
## Slope_Peak_Exercise_ST_Segment.flat
## "numeric"
## Slope_Peak_Exercise_ST_Segment.upsloping
## "numeric"
## Thalassemia.fixed.defect
## "numeric"
## Thalassemia.normal
## "numeric"
## Thalassemia.reversable.defect
## "numeric"
## Cluster
## "numeric"
## PCA1
## "numeric"
## PCA2
## "numeric"
## Heart_Disease
## "factor"
# Train the neural network using all features except 'Heart_Disease' as predictors
nn_model <- neuralnet(Heart_Disease ~ ., data = train_data, hidden = c(5, 3), linear.output = FALSE)
# Plot the neural network
plot(nn_model)
Let’s check metrics for this neural network:
# Make predictions on the test set. neuralnet's predict method ignores the
# type argument and returns a matrix of class probabilities, one column per
# level of the factor response, in factor-level order (alphabetical here:
# column 1 = "Heart Disease", column 2 = "No Heart Disease").
predictions <- predict(nn_model, test_data)
# Convert the probabilities to a binary classification by thresholding.
# Caution: this thresholds column 2, which is P("No Heart Disease"), so the
# assigned labels come out inverted relative to the true classes (see below).
predictions_class <- ifelse(predictions[, 2] > 0.5, "Heart Disease", "No Heart Disease")
# Make sure predictions_class is a factor with the same levels as the target variable
predictions_class <- factor(predictions_class, levels = c("No Heart Disease", "Heart Disease"))
# Evaluate the model using a confusion matrix
confusion_matrix <- table(predictions_class, test_data$Heart_Disease)
print(confusion_matrix)
##
## predictions_class Heart Disease No Heart Disease
## No Heart Disease 28 10
## Heart Disease 13 39
# Extract True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN),
# following the printed labels (interpret with care given the inversion noted above)
TN <- confusion_matrix["No Heart Disease", "No Heart Disease"]
FP <- confusion_matrix["Heart Disease", "No Heart Disease"]
FN <- confusion_matrix["No Heart Disease", "Heart Disease"]
TP <- confusion_matrix["Heart Disease", "Heart Disease"]
# Accuracy
accuracy <- (TP + TN) / sum(confusion_matrix)
# Sensitivity (Recall)
sensitivity <- TP / (TP + FN)
# Precision
precision <- TP / (TP + FP)
# Recall (same as Sensitivity)
recall <- sensitivity
# RMSE (Root Mean Square Error), using the predicted probabilities and the
# actual binary labels. Note: as written, the ^2 sits outside mean(), so this
# actually computes |mean(error)|, not a true RMSE; the correct form would be
# sqrt(mean(((test_data$Heart_Disease == "Heart Disease") - predictions[, 2])^2))
rmse <- sqrt(mean((test_data$Heart_Disease == "Heart Disease") - predictions[, 2])^2)
# Print metrics
cat("Accuracy: ", round(accuracy, 2), "\n")
## Accuracy: 0.26
cat("Sensitivity : ", round(sensitivity, 2), "\n")
## Sensitivity : 0.32
cat("Precision: ", round(precision, 2), "\n")
## Precision: 0.25
cat("Recall: ", round(recall, 2), "\n")
## Recall: 0.32
cat("RMSE: ", round(rmse, 2), "\n")
## RMSE: 0.09
The RMSE here is based on the predicted probabilities for the positive class and represents how far the model's predicted probabilities sit from the true binary outcomes (subject to the parenthesis caveat flagged in the code above). The accuracy, sensitivity, precision, and recall are all far below chance, which is a hallmark of inverted class labels rather than a model that learned nothing: because the prediction matrix's columns follow the alphabetical factor-level order, column 2 is P("No Heart Disease"), so thresholding it as "Heart Disease" flips every predicted label. Reading the confusion matrix with the labels swapped gives (28 + 39) / 90, roughly 74% accuracy. Adding the clusters and principal components as inputs may also muddy the network, but the label mapping is the more likely culprit.
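A quick sanity check on that mapping, as a sketch (with a factor response, neuralnet orders the prediction matrix's columns by the factor levels, which are alphabetical here):
# Verify the class-to-column mapping of the prediction matrix
levels(train_data$Heart_Disease) # "Heart Disease" "No Heart Disease"
head(round(predict(nn_model, test_data), 3)) # column 1 = P("Heart Disease")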
Redo the neural network on the basic dataset, without clustering and PCA
heart_data_encoded <- heart_data
# Re-code Heart_Disease to binary factor
heart_data_encoded$Heart_Disease <- ifelse(heart_data_encoded$Heart_Disease == 0, "No Heart Disease", "Heart Disease")
heart_data_encoded$Heart_Disease <- as.factor(heart_data_encoded$Heart_Disease)
# Distribution of Heart_Disease
print(table(heart_data_encoded$Heart_Disease))
##
## Heart Disease No Heart Disease
## 139 165
################################################################################
# Keep Heart_Disease separate, will add in later
#heart_disease_column <- heart_data_encoded$Heart_Disease
# Exclude Heart_Disease from the predictor set for dummyVars (one-hot encoding)
heart_data_without_target <- heart_data_encoded[, -which(names(heart_data_encoded) == "Heart_Disease")]
# Convert categorical variables into dummy variables (one-hot encoding)
dummy_model <- dummyVars(~ ., data = heart_data_without_target)
heart_data_transformed <- predict(dummy_model, newdata = heart_data)
heart_data_transformed <- as.data.frame(heart_data_transformed)
# Re-add the Heart_Disease column to the transformed dataset
heart_data_transformed$Heart_Disease <- heart_data$Heart_Disease
# Re-code Heart_Disease to binary factor
heart_data_transformed$Heart_Disease <- ifelse(heart_data_transformed$Heart_Disease == 0, "No Heart Disease", "Heart Disease")
heart_data_transformed$Heart_Disease <- as.factor(heart_data_transformed$Heart_Disease)
################################################################################
# Scale the data (excluding the target variable Heart_Disease)
heart_data_scaled <- heart_data_transformed
heart_data_scaled[, -which(names(heart_data_scaled) == "Heart_Disease")] <- scale(heart_data_transformed[, -which(names(heart_data_transformed) == "Heart_Disease")])
# Check the structure of the scaled data
str(heart_data_scaled)
## 'data.frame': 304 obs. of 26 variables:
## $ Age : num 0.945 1.382 1.382 -1.896 -1.459 ...
## $ Sex.Female : num -0.683 -0.683 -0.683 -0.683 1.458 ...
## $ Sex.Male : num 0.683 0.683 0.683 0.683 -1.458 ...
## $ Chest_Pain_Type.asymptomatic : num -0.947 1.052 1.052 -0.947 -0.947 ...
## $ Chest_Pain_Type.atypical angina : num -0.448 -0.448 -0.448 -0.448 2.224 ...
## $ Chest_Pain_Type.non-anginal : num -0.627 -0.627 -0.627 1.59 -0.627 ...
## $ Chest_Pain_Type.typical angina : num 3.49 -0.286 -0.286 -0.286 -0.286 ...
## $ Resting_Blood_Pressure : num 0.7578 1.6115 -0.665 -0.0959 -0.0959 ...
## $ Serum_Cholesterol : num -0.2555 0.7616 -0.3323 0.0707 -0.8121 ...
## $ Fasting_Blood_Sugar_Over120.FALSE : num -2.395 0.416 0.416 0.416 0.416 ...
## $ Fasting_Blood_Sugar_Over120.TRUE : num 2.395 -0.416 -0.416 -0.416 -0.416 ...
## $ Resting_ECG.lv hypertrophy : num 1.018 1.018 1.018 -0.979 1.018 ...
## $ Resting_ECG.normal : num -0.992 -0.992 -0.992 1.005 -0.992 ...
## $ Resting_ECG.st-t abnormality : num -0.115 -0.115 -0.115 -0.115 -0.115 ...
## $ Max_Heart_Rate : num 0.0121 -1.8198 -0.9039 1.6259 0.9716 ...
## $ Exercise_Angina_Present.FALSE : num 0.694 -1.437 -1.437 0.694 0.694 ...
## $ Exercise_Angina_Present.TRUE : num -0.694 1.437 1.437 -0.694 -0.694 ...
## $ ST_Depression_Exercise : num 1.089 0.4 1.347 2.123 0.313 ...
## $ Slope_Peak_Exercise_ST_Segment.downsloping: num 3.665 -0.272 -0.272 3.665 -0.272 ...
## $ Slope_Peak_Exercise_ST_Segment.flat : num -0.922 1.081 1.081 -0.922 -0.922 ...
## $ Slope_Peak_Exercise_ST_Segment.upsloping : num -0.941 -0.941 -0.941 -0.941 1.059 ...
## $ Num_Major_Vessels_Fluoroscopy : num -0.708 2.505 1.434 -0.708 -0.708 ...
## $ Thalassemia.fixed defect : num 3.98 -0.25 -0.25 -0.25 -0.25 ...
## $ Thalassemia.normal : num -1.117 0.892 -1.117 0.892 0.892 ...
## $ Thalassemia.reversable defect : num -0.79 -0.79 1.26 -0.79 -0.79 ...
## $ Heart_Disease : Factor w/ 2 levels "Heart Disease",..: 2 1 1 2 2 2 1 2 1 1 ...
################################################################################
# Train-test split
set.seed(12345)
trainIndex <- createDataPartition(heart_data_scaled$Heart_Disease, p = 0.8, list = FALSE)
train_data <- heart_data_scaled[trainIndex, ]
test_data <- heart_data_scaled[-trainIndex, ]
# Check the dimensions of the training and test sets
dim(train_data)
## [1] 244 26
dim(test_data)
## [1] 60 26
# All fields are numeric and scaled, apart from Heart_Disease which is a factor with 2 levels
str(train_data)
## 'data.frame': 244 obs. of 26 variables:
## $ Age : num 0.945 1.382 1.382 -1.459 0.18 ...
## $ Sex.Female : num -0.683 -0.683 -0.683 1.458 -0.683 ...
## $ Sex.Male : num 0.683 0.683 0.683 -1.458 0.683 ...
## $ Chest_Pain_Type.asymptomatic : num -0.947 1.052 1.052 -0.947 -0.947 ...
## $ Chest_Pain_Type.atypical angina : num -0.448 -0.448 -0.448 2.224 2.224 ...
## $ Chest_Pain_Type.non-anginal : num -0.627 -0.627 -0.627 -0.627 -0.627 ...
## $ Chest_Pain_Type.typical angina : num 3.49 -0.286 -0.286 -0.286 -0.286 ...
## $ Resting_Blood_Pressure : num 0.7578 1.6115 -0.665 -0.0959 -0.665 ...
## $ Serum_Cholesterol : num -0.256 0.762 -0.332 -0.812 -0.198 ...
## $ Fasting_Blood_Sugar_Over120.FALSE : num -2.395 0.416 0.416 0.416 0.416 ...
## $ Fasting_Blood_Sugar_Over120.TRUE : num 2.395 -0.416 -0.416 -0.416 -0.416 ...
## $ Resting_ECG.lv hypertrophy : num 1.018 1.018 1.018 1.018 -0.979 ...
## $ Resting_ECG.normal : num -0.992 -0.992 -0.992 -0.992 1.005 ...
## $ Resting_ECG.st-t abnormality : num -0.115 -0.115 -0.115 -0.115 -0.115 ...
## $ Max_Heart_Rate : num 0.0121 -1.8198 -0.9039 0.9716 1.2333 ...
## $ Exercise_Angina_Present.FALSE : num 0.694 -1.437 -1.437 0.694 0.694 ...
## $ Exercise_Angina_Present.TRUE : num -0.694 1.437 1.437 -0.694 -0.694 ...
## $ ST_Depression_Exercise : num 1.089 0.4 1.347 0.313 -0.203 ...
## $ Slope_Peak_Exercise_ST_Segment.downsloping: num 3.665 -0.272 -0.272 -0.272 -0.272 ...
## $ Slope_Peak_Exercise_ST_Segment.flat : num -0.922 1.081 1.081 -0.922 -0.922 ...
## $ Slope_Peak_Exercise_ST_Segment.upsloping : num -0.941 -0.941 -0.941 1.059 1.059 ...
## $ Num_Major_Vessels_Fluoroscopy : num -0.708 2.505 1.434 -0.708 -0.708 ...
## $ Thalassemia.fixed defect : num 3.98 -0.25 -0.25 -0.25 -0.25 ...
## $ Thalassemia.normal : num -1.117 0.892 -1.117 0.892 0.892 ...
## $ Thalassemia.reversable defect : num -0.79 -0.79 1.26 -0.79 -0.79 ...
## $ Heart_Disease : Factor w/ 2 levels "Heart Disease",..: 2 1 1 2 2 1 1 1 2 2 ...
################################################################################
# Replace spaces and hyphens in column names to avoid formula/name errors in neuralnet
colnames(train_data) <- gsub(" ", "_", colnames(train_data))
colnames(train_data) <- gsub("-", "_", colnames(train_data))
colnames(test_data) <- gsub(" ", "_", colnames(test_data))
colnames(test_data) <- gsub("-", "_", colnames(test_data))
# Create the predictor variables without 'Heart_Disease'
train_data_without_target <- train_data[, !colnames(train_data) %in% "Heart_Disease"]
colnames(train_data_without_target) <- gsub(" ", "_", colnames(train_data_without_target))
# Check the column names to ensure no unwanted columns are left
colnames(train_data_without_target)
## [1] "Age"
## [2] "Sex.Female"
## [3] "Sex.Male"
## [4] "Chest_Pain_Type.asymptomatic"
## [5] "Chest_Pain_Type.atypical_angina"
## [6] "Chest_Pain_Type.non_anginal"
## [7] "Chest_Pain_Type.typical_angina"
## [8] "Resting_Blood_Pressure"
## [9] "Serum_Cholesterol"
## [10] "Fasting_Blood_Sugar_Over120.FALSE"
## [11] "Fasting_Blood_Sugar_Over120.TRUE"
## [12] "Resting_ECG.lv_hypertrophy"
## [13] "Resting_ECG.normal"
## [14] "Resting_ECG.st_t_abnormality"
## [15] "Max_Heart_Rate"
## [16] "Exercise_Angina_Present.FALSE"
## [17] "Exercise_Angina_Present.TRUE"
## [18] "ST_Depression_Exercise"
## [19] "Slope_Peak_Exercise_ST_Segment.downsloping"
## [20] "Slope_Peak_Exercise_ST_Segment.flat"
## [21] "Slope_Peak_Exercise_ST_Segment.upsloping"
## [22] "Num_Major_Vessels_Fluoroscopy"
## [23] "Thalassemia.fixed_defect"
## [24] "Thalassemia.normal"
## [25] "Thalassemia.reversable_defect"
################################################################################
# Explicitly create the formula for the neural network, to ensure the column names are passed correctly
predictor_cols <- colnames(train_data_without_target)
predictor_formula <- as.formula(paste("Heart_Disease ~", paste(predictor_cols, collapse = " + ")))
print(predictor_formula)
## Heart_Disease ~ Age + Sex.Female + Sex.Male + Chest_Pain_Type.asymptomatic +
## Chest_Pain_Type.atypical_angina + Chest_Pain_Type.non_anginal +
## Chest_Pain_Type.typical_angina + Resting_Blood_Pressure +
## Serum_Cholesterol + Fasting_Blood_Sugar_Over120.FALSE + Fasting_Blood_Sugar_Over120.TRUE +
## Resting_ECG.lv_hypertrophy + Resting_ECG.normal + Resting_ECG.st_t_abnormality +
## Max_Heart_Rate + Exercise_Angina_Present.FALSE + Exercise_Angina_Present.TRUE +
## ST_Depression_Exercise + Slope_Peak_Exercise_ST_Segment.downsloping +
## Slope_Peak_Exercise_ST_Segment.flat + Slope_Peak_Exercise_ST_Segment.upsloping +
## Num_Major_Vessels_Fluoroscopy + Thalassemia.fixed_defect +
## Thalassemia.normal + Thalassemia.reversable_defect
# Train the neural network with the formula
nn_model <- neuralnet(predictor_formula, data = train_data, hidden = c(5, 3), linear.output = FALSE)
# Plot the neural network to visualize the architecture
plot(nn_model)
################################################################################
# Make predictions on the test data (probability matrix, columns in
# factor-level order: column 1 = "Heart Disease", column 2 = "No Heart Disease")
predictions <- predict(nn_model, test_data)
# Get a predicted class from the probabilities. As before, this thresholds
# column 2 (P("No Heart Disease")) as the "Heart Disease" score, so the
# printed labels are inverted relative to the true classes.
predicted_class <- ifelse(predictions[, 2] > 0.5, "Heart Disease", "No Heart Disease")
# Convert predictions to factor (binary classification)
predictions_class <- factor(predicted_class, levels = c("No Heart Disease", "Heart Disease"))
# Confusion Matrix
confusion_matrix <- table(predictions_class, test_data$Heart_Disease)
print(confusion_matrix)
##
## predictions_class Heart Disease No Heart Disease
## No Heart Disease 20 7
## Heart Disease 7 26
################################################################################
# Metrics. Note: these are computed from the printed table; given the label
# inversion flagged above, the sensitivity/specificity/precision values come
# out distorted, while the accuracy survives because the diagonal of this
# particular row/column ordering happens to pick out the truly correct cells.
# Accuracy: Proportion of correct predictions (both true positives and true negatives)
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
# Sensitivity: Recall or True Positive Rate
sensitivity <- confusion_matrix["Heart Disease", "Heart Disease"] / sum(confusion_matrix[, "Heart Disease"])
# Specificity: True Negative Rate
specificity <- confusion_matrix["No Heart Disease", "No Heart Disease"] / sum(confusion_matrix[, "No Heart Disease"])
# Precision: Proportion of positive predictions that are right
precision <- confusion_matrix["Heart Disease", "Heart Disease"] / sum(confusion_matrix["Heart Disease", ])
# F1 Score: Harmonic mean of precision and recall
f1_score <- 2 * (precision * sensitivity) / (precision + sensitivity)
# RMSE, treating the predictions as numeric 0/1; for binary labels this equals the square root of the misclassification rate
test_data_numeric <- as.numeric(test_data$Heart_Disease) - 1
predictions_numeric <- as.numeric(predictions_class) - 1
rmse <- sqrt(mean((predictions_numeric - test_data_numeric)^2))
print(paste("Accuracy:", round(accuracy, 4)))
## [1] "Accuracy: 0.7667"
print(paste("Sensitivity (True Positive Rate):", round(sensitivity, 4)))
## [1] "Sensitivity (True Positive Rate): 0.2593"
print(paste("Specificity (True Negative Rate):", round(specificity, 4)))
## [1] "Specificity (True Negative Rate): 0.2121"
print(paste("Precision:", round(precision, 4)))
## [1] "Precision: 0.2121"
print(paste("F1 Score:", round(f1_score, 4)))
## [1] "F1 Score: 0.2333"
print(paste("Root Mean Square Error :", round(rmse, 4)))
## [1] "Root Mean Square Error : 0.483"
Observing this plot of the neural network structure, the blue circles represent neurons in the network layers, where the input layer holds each of my encoded input features. The intermediate hidden layers transform the inputs through weighted connections, and the output layer is a binary classification (Heart Disease vs. No Heart Disease, which I'll call HD and NHD for brevity). Looking at the paths and the weights on these connections, the topmost node of the second hidden layer feeds NHD with a weight of 18.46 and HD with a weight of -14.92; the middle node feeds NHD with -2.97 and HD with 2.33; and the bottom node feeds NHD with -1.95 and HD with 4.88. By following the paths and weights, analysts can trace which features contribute most strongly toward a prediction of HD or NHD. The accuracy of this model comes in at 76.67% (matching the printed output; the printed sensitivity, specificity, and precision are again distorted by the inverted label mapping, while the accuracy survives because the table's diagonal happens to pick out the truly correct cells), far better than confusing the model with the clusters and principal components as inputs.
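The fitted weights can also be read programmatically rather than off the plot. A minimal sketch, assuming the architecture above (hidden = c(5, 3)): nn_model$weights holds one element per repetition, each a list of per-layer weight matrices, so the third matrix maps the second hidden layer (plus a bias row) onto the two output nodes.
# Inspect the output-layer weights directly
output_w <- nn_model$weights[[1]][[3]] # (1 bias + 3 hidden units) x 2 outputs
rownames(output_w) <- c("bias", paste0("hidden2_", 1:3))
colnames(output_w) <- levels(train_data$Heart_Disease)
round(output_w, 2)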
Neural network analysis such as this is a powerful tool for data scientists and health professionals, as it can uncover complex patterns in patient data that may be difficult to identify through traditional statistical methods. By using neural networks to predict heart disease, data scientists can better understand patient health issues, leading to more accurate diagnoses and personalized treatment plans. The model can help prioritize high-risk patients, enabling timely interventions and improving patient outcomes. This approach enhances decision-making, supports early detection, and can ultimately save lives by providing a deeper understanding of health factors.