For this project I will be using the Heart Disease dataset from UCI [https://archive.ics.uci.edu/dataset/45/heart+disease], which contains the Cleveland heart disease data. The dataset has 303 instances and 14 variables, and I will use it to predict the presence of heart disease from the patient health information captured in those variables.
I will build models with Random Forest, Support Vector Machine (SVM) and Neural Network algorithms to predict heart disease from this set of features.
The variables in the dataset are:

- age: age in years
- sex: sex (1 = male; 0 = female)
- cp: chest pain type (1: typical angina; 2: atypical angina; 3: non-anginal pain; 4: asymptomatic)
- trestbps: resting blood pressure (in mm Hg on admission to the hospital)
- chol: serum cholesterol in mg/dl
- fbs: fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
- restecg: resting electrocardiographic results (0: normal; 1: ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV); 2: probable or definite left ventricular hypertrophy by Estes' criteria)
- thalach: maximum heart rate achieved
- exang: exercise induced angina (1 = yes; 0 = no)
- oldpeak: ST depression induced by exercise relative to rest
- slope: the slope of the peak exercise ST segment (0: upsloping; 1: flat; 2: downsloping)
- ca: number of major vessels (0-4) colored by fluoroscopy
- thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
- target: diagnosis of heart disease, angiographic disease status (0: < 50% diameter narrowing; 1: > 50% diameter narrowing)

The final deliverable is the R (or Python) file together with either an essay (minimum 500 words) or a video (5 to 8 minutes) that includes the execution and explanation of the code; the video can be recorded on any platform (YouTube, Free Cam).
I uploaded the dataset to my personal GitHub account and load it here with the read.csv function. As noted earlier, the dataset has 303 observations and 14 variables.
heart_data <- read.csv("https://raw.githubusercontent.com/petferns/DATA622/main/cleveland.heart.data.csv", header = TRUE)
dim(heart_data)
## [1] 303 14
The summary of the dataset shows there are no character columns, so no dummy coding of variables is needed.
summary(heart_data)
## age sex cp trestbps
## Min. :29.00 Min. :0.0000 Min. :0.000 Min. : 94.0
## 1st Qu.:47.50 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:120.0
## Median :55.00 Median :1.0000 Median :1.000 Median :130.0
## Mean :54.37 Mean :0.6832 Mean :0.967 Mean :131.6
## 3rd Qu.:61.00 3rd Qu.:1.0000 3rd Qu.:2.000 3rd Qu.:140.0
## Max. :77.00 Max. :1.0000 Max. :3.000 Max. :200.0
## chol fbs restecg thalach
## Min. :126.0 Min. :0.0000 Min. :0.0000 Min. : 71.0
## 1st Qu.:211.0 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:133.5
## Median :240.0 Median :0.0000 Median :1.0000 Median :153.0
## Mean :246.3 Mean :0.1485 Mean :0.5281 Mean :149.6
## 3rd Qu.:274.5 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:166.0
## Max. :564.0 Max. :1.0000 Max. :2.0000 Max. :202.0
## exang oldpeak slope ca
## Min. :0.0000 Min. :0.00 Min. :0.000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.00 1st Qu.:1.000 1st Qu.:0.0000
## Median :0.0000 Median :0.80 Median :1.000 Median :0.0000
## Mean :0.3267 Mean :1.04 Mean :1.399 Mean :0.7294
## 3rd Qu.:1.0000 3rd Qu.:1.60 3rd Qu.:2.000 3rd Qu.:1.0000
## Max. :1.0000 Max. :6.20 Max. :2.000 Max. :4.0000
## thal target
## Min. :0.000 Min. :0.0000
## 1st Qu.:2.000 1st Qu.:0.0000
## Median :2.000 Median :1.0000
## Mean :2.314 Mean :0.5446
## 3rd Qu.:3.000 3rd Qu.:1.0000
## Max. :3.000 Max. :1.0000
A glimpse of the dataset shows that the variables ca, cp, exang, restecg, slope, target, thal and sex are categorical, so I convert each of them with the as.factor function.
library(dplyr)
glimpse(heart_data)
## Rows: 303
## Columns: 14
## $ age <int> 63, 37, 41, 56, 57, 57, 56, 44, 52, 57, 54, 48, 49, 64, 58, 5…
## $ sex <int> 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1…
## $ cp <int> 3, 2, 1, 1, 0, 0, 1, 1, 2, 2, 0, 2, 1, 3, 3, 2, 2, 3, 0, 3, 0…
## $ trestbps <int> 145, 130, 130, 120, 120, 140, 140, 120, 172, 150, 140, 130, 1…
## $ chol <int> 233, 250, 204, 236, 354, 192, 294, 263, 199, 168, 239, 275, 2…
## $ fbs <int> 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0…
## $ restecg <int> 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1…
## $ thalach <int> 150, 187, 172, 178, 163, 148, 153, 173, 162, 174, 160, 139, 1…
## $ exang <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0…
## $ oldpeak <dbl> 2.3, 3.5, 1.4, 0.8, 0.6, 0.4, 1.3, 0.0, 0.5, 1.6, 1.2, 0.2, 0…
## $ slope <int> 0, 0, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 0, 2, 2, 1…
## $ ca <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0…
## $ thal <int> 1, 2, 2, 2, 2, 1, 2, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3…
## $ target <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
heart_data[['ca']] <- as.factor(heart_data[['ca']])
heart_data[['cp']] <- as.factor(heart_data[['cp']])
heart_data[['exang']] <- as.factor(heart_data[['exang']])
heart_data[['restecg']] <- as.factor(heart_data[['restecg']])
heart_data[['slope']] <- as.factor(heart_data[['slope']])
heart_data[['target']] <- as.factor(heart_data[['target']])
heart_data[['thal']] <- as.factor(heart_data[['thal']])
heart_data[['sex']] <- as.factor(heart_data[['sex']])
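The same conversion can be written more compactly; the sketch below is an equivalent alternative using dplyr's across() (available in dplyr 1.0 or later), not an additional step to run on top of the calls above.

# Alternative to the per-column as.factor() calls above (run one or the other)
factor_cols <- c("ca", "cp", "exang", "restecg", "slope", "target", "thal", "sex")
heart_data <- heart_data %>%
  mutate(across(all_of(factor_cols), as.factor))  # convert all listed columns to factors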
Let us check whether any missing values exist; the plot below shows that there are none.
library(visdat)
vis_miss(heart_data)
We examine the distribution of each variable using plot_histogram from the DataExplorer package. The histograms show that chol, oldpeak and trestbps are skewed and will need a transformation, while the remaining variables look approximately normally distributed.
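The plotting chunk itself is not echoed above; a minimal call along the following lines, assuming the DataExplorer package is installed, reproduces the histograms.

library(DataExplorer)
plot_histogram(heart_data)  # one histogram per continuous variable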
Since target has been converted to a factor, Filter(is.numeric) drops it and the correlation matrix below covers only the remaining numeric variables; the table lists them sorted by their correlation with age. trestbps, chol and oldpeak show a modest positive correlation with age, fbs only a weak one, and thalach a clear negative correlation.
library(corrplot)
library(RColorBrewer)
library(knitr)
library(kableExtra)
corrplot.mixed(cor(Filter(is.numeric, dplyr::select(heart_data, target, everything()))),
               number.cex = .9,
               upper = "number",
               lower = "shade",
               lower.col = brewer.pal(n = 12, name = "Paired"),
               upper.col = brewer.pal(n = 12, name = "Paired"))
kable(sort(cor(Filter(is.numeric, dplyr::select(heart_data, target, everything())))[, 1], decreasing = T), col.names = c("Correlation")) %>%
  kable_styling(full_width = F)
|  | Correlation |
|---|---|
| age | 1.0000000 |
| trestbps | 0.2793509 |
| chol | 0.2136780 |
| oldpeak | 0.2100126 |
| fbs | 0.1213076 |
| thalach | -0.3985219 |
As seen in the previous section, chol, oldpeak and trestbps are skewed, so I apply log transformations to these variables.
heart_data_trans <- heart_data %>%
dplyr::select(everything()) %>%
mutate(chol = log(chol),
oldpeak = log(oldpeak + 1),
trestbps = log(trestbps)
)
After the transformations the distributions of these variables are much closer to normal; the transformed distributions are shown below.
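As a quick numeric check, the sample skewness of the three variables can be compared before and after the log transformation; this is just a sketch and assumes the e1071 package (loaded later for the SVM models), which provides skewness().

skew_cols <- c("chol", "oldpeak", "trestbps")
sapply(heart_data[skew_cols], e1071::skewness)        # skewness before transformation
sapply(heart_data_trans[skew_cols], e1071::skewness)  # skewness after log transformation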
I split the dataset into training and testing sets in an 80:20 ratio using the createDataPartition function; df_train holds the training data and df_test the testing data.
library(caret)
set.seed(123)
index <- createDataPartition(heart_data_trans$target, p=0.8, list = FALSE)
df_train <- heart_data_trans[index,]
df_test <- heart_data_trans[-index,]
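createDataPartition samples within each level of target, so the class balance in the two subsets should closely mirror the full dataset; a quick check (sketch):

prop.table(table(heart_data_trans$target))  # class proportions in the full data
prop.table(table(df_train$target))          # class proportions in the training set
prop.table(table(df_test$target))           # class proportions in the test set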
Random Forest is a powerful and versatile ensemble learning method that can be applied to both classification and regression problems. It is based on bagging (Bootstrap Aggregating): many decision trees are built during training, each on a random bootstrap sample of the data (and a random subset of predictors at each split), and the final prediction combines the individual trees by majority vote for classification or averaging for regression.
library(randomForest)
set.seed(123)
fit.forest <- randomForest(target ~ ., data = df_train, importance=TRUE, ntree=2000)
fit.forest
##
## Call:
## randomForest(formula = target ~ ., data = df_train, importance = TRUE, ntree = 2000)
## Type of random forest: classification
## Number of trees: 2000
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 14.81%
## Confusion matrix:
## 0 1 class.error
## 0 90 21 0.1891892
## 1 15 117 0.1136364
The plot of error versus number of trees below shows that the error drops sharply as more trees are added. The black line is the overall out-of-bag misclassification rate, the red line is the misclassification rate for class 0 (no heart disease) and the green line is the misclassification rate for class 1 (heart disease).
plot(fit.forest)
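Since the forest was fit with importance = TRUE, the contribution of each predictor can also be inspected; a minimal sketch using the randomForest package's own helpers:

importance(fit.forest)   # mean decrease in accuracy and in Gini impurity per predictor
varImpPlot(fit.forest)   # plot both importance measures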
On the held-out test set the random forest classifies about 83% of the observations correctly, as shown in the confusion matrix below.
rf.pred <- predict(fit.forest, newdata=df_test, type = "class")
(forest.cm_train <- confusionMatrix(rf.pred, df_test$target))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 21 4
## 1 6 29
##
## Accuracy : 0.8333
## 95% CI : (0.7148, 0.9171)
## No Information Rate : 0.55
## P-Value [Acc > NIR] : 3.483e-06
##
## Kappa : 0.661
##
## Mcnemar's Test P-Value : 0.7518
##
## Sensitivity : 0.7778
## Specificity : 0.8788
## Pos Pred Value : 0.8400
## Neg Pred Value : 0.8286
## Prevalence : 0.4500
## Detection Rate : 0.3500
## Detection Prevalence : 0.4167
## Balanced Accuracy : 0.8283
##
## 'Positive' Class : 0
##
Support Vector Machines (SVMs) are a powerful and versatile family of supervised learning algorithms used for classification and regression. They are particularly well suited to high-dimensional datasets and to situations where the data are not linearly separable.
A linear kernel Support Vector Machine (SVM) is a type of SVM that uses a linear decision boundary for classification. The linear kernel is the simplest type of kernel and is often used when the relationship between the input features and the output variable is expected to be approximately linear. The decision boundary is a hyperplane that separates the data into different classes.
library(e1071)
set.seed(234)
linear <- svm(target~.,
data=df_train,
kernel="linear",
type = 'C-classification',
)
linear
##
## Call:
## svm(formula = target ~ ., data = df_train, kernel = "linear", type = "C-classification",
## )
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1
##
## Number of Support Vectors: 91
The linear kernel achieves an accuracy of 80% on the test set.
predictions <- predict(linear, newdata = df_test)
confusion_matrix <- table(Actual_Label = df_test$target, Predicted_Label = predictions)
accuracy_lin <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
accuracy_lin
## [1] 0.8
Polynomial kernel Support Vector Machine (SVM) is a type of SVM that uses a polynomial function to map the input features into a higher-dimensional space. This allows the SVM to capture nonlinear relationships in the data. The decision boundary in the higher-dimensional space is still a hyperplane.
set.seed(234)
poly <- svm(target~.,
data=df_train,
kernel="polynomial",
type = 'C-classification',
)
poly
##
## Call:
## svm(formula = target ~ ., data = df_train, kernel = "polynomial",
## type = "C-classification", )
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: polynomial
## cost: 1
## degree: 3
## coef.0: 0
##
## Number of Support Vectors: 212
The polynomial kernel reaches about 73% accuracy, lower than our earlier linear-kernel model.
predictions <- predict(poly, newdata = df_test)
confusion_matrix <- table(Actual_Label = df_test$target, Predicted_Label = predictions)
accuracy_poly <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
accuracy_poly
## [1] 0.7333333
A sigmoid kernel Support Vector Machine (SVM) is another type of SVM that uses a sigmoid function as the kernel to map the input features into a higher-dimensional space. The sigmoid kernel is particularly useful when dealing with non-linear relationships and when the data is not linearly separable. The decision boundary in the higher-dimensional space is still a hyperplane.
set.seed(345)
sig <- svm(target~.,
data=df_train,
kernel="sigmoid",
type = 'C-classification',
)
sig
##
## Call:
## svm(formula = target ~ ., data = df_train, kernel = "sigmoid", type = "C-classification",
## )
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: sigmoid
## cost: 1
## coef.0: 0
##
## Number of Support Vectors: 133
predictions <- predict(sig, newdata = df_test)
confusion_matrix <- table(Actual_Label = df_test$target, Predicted_Label = predictions)
accuracy_sig <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
accuracy_sig
## [1] 0.8
The sigmoid kernel achieves 80% accuracy, matching the linear kernel and beating the polynomial kernel. Let us next try the radial kernel and check its accuracy.
Radial kernel, also known as a radial basis function (RBF) kernel, is one of the most commonly used kernel functions in Support Vector Machines (SVM). The RBF kernel is particularly powerful in capturing complex, non-linear relationships in the data. It transforms the input features into a higher-dimensional space, allowing the SVM to find a hyperplane that separates different classes.
set.seed(456)
rad <- svm(target~.,
data=df_train,
kernel="radial",
type = 'C-classification',
)
rad
##
## Call:
## svm(formula = target ~ ., data = df_train, kernel = "radial", type = "C-classification",
## )
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
##
## Number of Support Vectors: 142
We see that the accuracy drops with the radial kernel compared with the linear and sigmoid kernels.
predictions <- predict(rad, newdata = df_test)
confusion_matrix <- table(Actual_Label = df_test$target, Predicted_Label = predictions)
accuracy_rad <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
accuracy_rad
## [1] 0.7666667
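All four kernels above use the default cost = 1 (and, where relevant, the default gamma), so the accuracies might improve with tuning. The sketch below uses e1071's tune.svm to grid-search these hyperparameters for the radial kernel; the grid values are illustrative, not taken from the original analysis.

set.seed(456)
tuned <- tune.svm(target ~ ., data = df_train,
                  kernel = "radial",
                  gamma = 10^(-3:0),   # candidate gamma values
                  cost = 10^(-1:2))    # candidate cost values
summary(tuned)          # cross-validated error for each gamma/cost combination
tuned$best.parameters   # best combination found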
Neural network modeling, often referred to as deep learning, uses artificial neural networks to model complex relationships and patterns in data. Neural networks are a class of machine learning models inspired by the structure and functioning of the human brain: interconnected nodes (neurons) arranged in layers learn complex patterns from the data.
For our dataset we shall try applying the neural network technique and see if it can further improve the accuracy.
train_params <- trainControl(method = "repeatedcv", number = 10, repeats=5)
nnet_model1 <- train(df_train[, 1:13],df_train[, 14],
method = "nnet",
trControl= train_params,
preProcess=c("scale","center"),
na.action = na.omit
)
plot(nnet_model1)
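By default train() tries only a small built-in grid of the nnet tuning parameters (size and decay), which is what the plot above summarizes. A custom grid can be supplied instead; the values below are purely illustrative and nnet_model2 is a hypothetical name.

nnet_grid <- expand.grid(size = c(3, 5, 7), decay = c(0, 0.01, 0.1))  # illustrative grid
nnet_model2 <- train(df_train[, 1:13], df_train[, 14],
                     method = "nnet",
                     trControl = train_params,
                     tuneGrid = nnet_grid,
                     preProcess = c("scale", "center"),
                     trace = FALSE)   # suppress nnet's training log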
set.seed(123)
nnPred_train <-predict(nnet_model1, df_train)
#Training confusion matrix
table(df_train$target, nnPred_train)
## nnPred_train
## 0 1
## 0 97 14
## 1 8 124
The neural network improves on the earlier models, reaching an accuracy of about 91% on the training data.
correct_predictions <- 97 + 124           # correct counts from the training confusion matrix above
all_predictions <- 97 + 124 + 8 + 14
accuracy <- correct_predictions / all_predictions
accuracy
## [1] 0.909465
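The figure above is computed on the training data. For a comparison on the same footing as the random forest and SVM results, the network could also be scored on the held-out test set; a sketch (output not shown):

nnPred_test <- predict(nnet_model1, df_test)   # predict on the test set
confusionMatrix(nnPred_test, df_test$target)   # test-set confusion matrix and accuracy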
From the summary below we see the neural network achieves the highest accuracy of all the models built for this project, correctly predicting the presence of heart disease about 91% of the time (note that this figure is measured on the training data, while the other accuracies are measured on the held-out test set).
In my last assignment the polynomial kernel performed better than the other kernels, so it is surprising that on the current dataset the polynomial kernel performs worst of the kernels tried. This analysis shows that machine learning models cannot be treated as one-size-fits-all.
rf <- forest.cm_train$overall[["Accuracy"]]  # random forest test-set accuracy from its confusion matrix
kable(cbind(rf, accuracy_lin, accuracy_poly, accuracy_sig, accuracy_rad, accuracy), col.names = c("RandomForest", "SVM Linear", "SVM Polynomial", "SVM Sigmoid", "SVM Radial", "Neural network")) %>%
  kable_styling(full_width = T)
| RandomForest | SVM Linear | SVM Polynomial | SVM Sigmoid | SVM Radial | Neural network |
|---|---|---|---|---|---|
| 0.8333333 | 0.8 | 0.7333333 | 0.8 | 0.7666667 | 0.909465 |