For this project I will be using the Heart Disease dataset from UCI [https://archive.ics.uci.edu/dataset/45/heart+disease], which contains the Cleveland heart disease data. The dataset has 303 instances and 14 variables, and I will use it to predict the presence of heart disease from the patient health information captured in those variables.
I will build models with Random Forest, Support Vector Machine (SVM) and Neural Network algorithms to predict heart disease from this set of features.
The variables in the dataset are:

- age: age in years
- sex: sex (1 = male; 0 = female)
- cp: chest pain type (1: typical angina; 2: atypical angina; 3: non-anginal pain; 4: asymptomatic)
- trestbps: resting blood pressure (in mm Hg on admission to the hospital)
- chol: serum cholesterol in mg/dl
- fbs: fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
- restecg: resting electrocardiographic results (0: normal; 1: ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV); 2: probable or definite left ventricular hypertrophy by Estes' criteria)
- thalach: maximum heart rate achieved
- exang: exercise induced angina (1 = yes; 0 = no)
- oldpeak: ST depression induced by exercise relative to rest
- slope: the slope of the peak exercise ST segment (0: upsloping; 1: flat; 2: downsloping)
- ca: number of major vessels (0-4) colored by fluoroscopy
- thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
- target: diagnosis of heart disease, angiographic disease status (0: < 50% diameter narrowing; 1: > 50% diameter narrowing)

The final deliverable is the R (or Python) file together with either an essay (minimum 500 words) or a video (5 to 8 minutes) that includes the execution and explanation of the code; the video can be recorded on any platform (YouTube, Free Cam).
I uploaded the dataset to my personal GitHub account and load it here with the read.csv function. As noted earlier, the dataset has 303 observations and 14 variables.
heart_data <- read.csv("https://raw.githubusercontent.com/petferns/DATA622/main/cleveland.heart.data.csv", header = TRUE)
dim(heart_data)
## [1] 303 14
The summary of the dataset shows there are no character columns, so no dummy coding of variables is needed.
summary(heart_data)
## age sex cp trestbps
## Min. :29.00 Min. :0.0000 Min. :0.000 Min. : 94.0
## 1st Qu.:47.50 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:120.0
## Median :55.00 Median :1.0000 Median :1.000 Median :130.0
## Mean :54.37 Mean :0.6832 Mean :0.967 Mean :131.6
## 3rd Qu.:61.00 3rd Qu.:1.0000 3rd Qu.:2.000 3rd Qu.:140.0
## Max. :77.00 Max. :1.0000 Max. :3.000 Max. :200.0
## chol fbs restecg thalach
## Min. :126.0 Min. :0.0000 Min. :0.0000 Min. : 71.0
## 1st Qu.:211.0 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:133.5
## Median :240.0 Median :0.0000 Median :1.0000 Median :153.0
## Mean :246.3 Mean :0.1485 Mean :0.5281 Mean :149.6
## 3rd Qu.:274.5 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:166.0
## Max. :564.0 Max. :1.0000 Max. :2.0000 Max. :202.0
## exang oldpeak slope ca
## Min. :0.0000 Min. :0.00 Min. :0.000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.00 1st Qu.:1.000 1st Qu.:0.0000
## Median :0.0000 Median :0.80 Median :1.000 Median :0.0000
## Mean :0.3267 Mean :1.04 Mean :1.399 Mean :0.7294
## 3rd Qu.:1.0000 3rd Qu.:1.60 3rd Qu.:2.000 3rd Qu.:1.0000
## Max. :1.0000 Max. :6.20 Max. :2.000 Max. :4.0000
## thal target
## Min. :0.000 Min. :0.0000
## 1st Qu.:2.000 1st Qu.:0.0000
## Median :2.000 Median :1.0000
## Mean :2.314 Mean :0.5446
## 3rd Qu.:3.000 3rd Qu.:1.0000
## Max. :3.000 Max. :1.0000
A glimpse of the dataset shows that the variables ca, cp, exang, restecg, slope, target, thal and sex are categorical, so I convert each of them with the as.factor function.
library(dplyr)
glimpse(heart_data)
## Rows: 303
## Columns: 14
## $ age <int> 63, 37, 41, 56, 57, 57, 56, 44, 52, 57, 54, 48, 49, 64, 58, 5…
## $ sex <int> 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1…
## $ cp <int> 3, 2, 1, 1, 0, 0, 1, 1, 2, 2, 0, 2, 1, 3, 3, 2, 2, 3, 0, 3, 0…
## $ trestbps <int> 145, 130, 130, 120, 120, 140, 140, 120, 172, 150, 140, 130, 1…
## $ chol <int> 233, 250, 204, 236, 354, 192, 294, 263, 199, 168, 239, 275, 2…
## $ fbs <int> 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0…
## $ restecg <int> 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1…
## $ thalach <int> 150, 187, 172, 178, 163, 148, 153, 173, 162, 174, 160, 139, 1…
## $ exang <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0…
## $ oldpeak <dbl> 2.3, 3.5, 1.4, 0.8, 0.6, 0.4, 1.3, 0.0, 0.5, 1.6, 1.2, 0.2, 0…
## $ slope <int> 0, 0, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 0, 2, 2, 1…
## $ ca <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0…
## $ thal <int> 1, 2, 2, 2, 2, 1, 2, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3…
## $ target <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
heart_data[['ca']] <- as.factor(heart_data[['ca']])
heart_data[['cp']] <- as.factor(heart_data[['cp']])
heart_data[['exang']] <- as.factor(heart_data[['exang']])
heart_data[['restecg']] <- as.factor(heart_data[['restecg']])
heart_data[['slope']] <- as.factor(heart_data[['slope']])
heart_data[['target']] <- as.factor(heart_data[['target']])
heart_data[['thal']] <- as.factor(heart_data[['thal']])
heart_data[['sex']] <- as.factor(heart_data[['sex']])
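The same conversion can be written more compactly; the sketch below is an equivalent alternative using dplyr's across() (available in dplyr 1.0 or later), not an additional step to run on top of the calls above.

# Alternative to the per-column as.factor() calls above (run one or the other)
factor_cols <- c("ca", "cp", "exang", "restecg", "slope", "target", "thal", "sex")
heart_data <- heart_data %>%
  mutate(across(all_of(factor_cols), as.factor))  # convert all listed columns to factors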
Let us check whether any missing values exist; the plot below shows that there are none.
library(visdat)
vis_miss(heart_data)
We examine the distribution of each variable using plot_histogram from the DataExplorer package. The histograms show that chol, oldpeak and trestbps are skewed and will need a transformation, while the remaining variables look approximately normally distributed.
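The plotting chunk itself is not echoed above; a minimal call along the following lines, assuming the DataExplorer package is installed, reproduces the histograms.

library(DataExplorer)
plot_histogram(heart_data)  # one histogram per continuous variable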
Since target has been converted to a factor, Filter(is.numeric) drops it and the correlation matrix below covers only the remaining numeric variables; the table lists them sorted by their correlation with age. trestbps, chol and oldpeak show a modest positive correlation with age, fbs only a weak one, and thalach a clear negative correlation.
library(corrplot)
library(RColorBrewer)
library(knitr)
library(kableExtra)
corrplot.mixed(cor(Filter(is.numeric, dplyr::select(heart_data, target, everything()))),
               number.cex = .9,
               upper = "number",
               lower = "shade",
               lower.col = brewer.pal(n = 12, name = "Paired"),
               upper.col = brewer.pal(n = 12, name = "Paired"))
kable(sort(cor(Filter(is.numeric, dplyr::select(heart_data, target, everything())))[, 1], decreasing = T), col.names = c("Correlation")) %>%
  kable_styling(full_width = F)
|  | Correlation |
|---|---|
| age | 1.0000000 |
| trestbps | 0.2793509 |
| chol | 0.2136780 |
| oldpeak | 0.2100126 |
| fbs | 0.1213076 |
| thalach | -0.3985219 |
As seen in the previous section, chol, oldpeak and trestbps are skewed, so I apply log transformations to these variables.
heart_data_trans <- heart_data %>%
dplyr::select(everything()) %>%
mutate(chol = log(chol),
oldpeak = log(oldpeak + 1),
trestbps = log(trestbps)
)
After the transformations the distributions of these variables are much closer to normal; the transformed distributions are shown below.
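As a quick numeric check, the sample skewness of the three variables can be compared before and after the log transformation; this is just a sketch and assumes the e1071 package (loaded later for the SVM models), which provides skewness().

skew_cols <- c("chol", "oldpeak", "trestbps")
sapply(heart_data[skew_cols], e1071::skewness)        # skewness before transformation
sapply(heart_data_trans[skew_cols], e1071::skewness)  # skewness after log transformation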
I split the dataset into training and testing sets in an 80:20 ratio using the createDataPartition function; df_train holds the training data and df_test the testing data.
library(caret)
set.seed(123)
index <- createDataPartition(heart_data_trans$target, p=0.8, list = FALSE)
df_train <- heart_data_trans[index,]
df_test <- heart_data_trans[-index,]
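createDataPartition samples within each level of target, so the class balance in the two subsets should closely mirror the full dataset; a quick check (sketch):

prop.table(table(heart_data_trans$target))  # class proportions in the full data
prop.table(table(df_train$target))          # class proportions in the training set
prop.table(table(df_test$target))           # class proportions in the test set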
Random Forest is a powerful and versatile ensemble learning method that can be applied to both classification and regression problems. It is based on bagging (Bootstrap Aggregating): many decision trees are built during training, each on a random bootstrap sample of the data (and a random subset of predictors at each split), and the final prediction combines the individual trees by majority vote for classification or averaging for regression.
library(randomForest)
set.seed(123)
fit.forest <- randomForest(target ~ ., data = df_train, importance=TRUE, ntree=2000)
fit.forest
##
## Call:
## randomForest(formula = target ~ ., data = df_train, importance = TRUE, ntree = 2000)
## Type of random forest: classification
## Number of trees: 2000
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 14.81%
## Confusion matrix:
## 0 1 class.error
## 0 90 21 0.1891892
## 1 15 117 0.1136364
The plot of error versus number of trees below shows that the error drops sharply as more trees are added. The black line is the overall out-of-bag misclassification rate, the red line is the misclassification rate for class 0 (no heart disease) and the green line is the misclassification rate for class 1 (heart disease).
plot(fit.forest)
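Since the forest was fit with importance = TRUE, the contribution of each predictor can also be inspected; a minimal sketch using the randomForest package's own helpers:

importance(fit.forest)   # mean decrease in accuracy and in Gini impurity per predictor
varImpPlot(fit.forest)   # plot both importance measures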
On the held-out test set the random forest classifies about 83% of the observations correctly, as shown in the confusion matrix below.
rf.pred <- predict(fit.forest, newdata=df_test, type = "class")
(forest.cm_train <- confusionMatrix(rf.pred, df_test$target))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 21 4
## 1 6 29
##
## Accuracy : 0.8333
## 95% CI : (0.7148, 0.9171)
## No Information Rate : 0.55
## P-Value [Acc > NIR] : 3.483e-06
##
## Kappa : 0.661
##
## Mcnemar's Test P-Value : 0.7518
##
## Sensitivity : 0.7778
## Specificity : 0.8788
## Pos Pred Value : 0.8400
## Neg Pred Value : 0.8286
## Prevalence : 0.4500
## Detection Rate : 0.3500
## Detection Prevalence : 0.4167
## Balanced Accuracy : 0.8283
##
## 'Positive' Class : 0
##
Support Vector Machines (SVMs) are a powerful and versatile family of supervised learning algorithms used for classification and regression. They are particularly well suited to high-dimensional datasets and to situations where the data are not linearly separable.
A linear kernel Support Vector Machine (SVM) is a type of SVM that uses a linear decision boundary for classification. The linear kernel is the simplest type of kernel and is often used when the relationship between the input features and the output variable is expected to be approximately linear. The decision boundary is a hyperplane that separates the data into different classes.
library(e1071)
set.seed(234)
linear <- svm(target~.,
data=df_train,
kernel="linear",
type = 'C-classification',
)
linear
##
## Call:
## svm(formula = target ~ ., data = df_train, kernel = "linear", type = "C-classification",
## )
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1
##
## Number of Support Vectors: 91
The linear kernel achieves an accuracy of 80% on the test set.
predictions <- predict(linear, newdata = df_test)
confusion_matrix <- table(Actual_Label = df_test$target, Predicted_Label = predictions)
accuracy_lin <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
accuracy_lin
## [1] 0.8
Polynomial kernel Support Vector Machine (SVM) is a type of SVM that uses a polynomial function to map the input features into a higher-dimensional space. This allows the SVM to capture nonlinear relationships in the data. The decision boundary in the higher-dimensional space is still a hyperplane.
set.seed(234)
poly <- svm(target~.,
data=df_train,
kernel="polynomial",
type = 'C-classification',
)
poly
##
## Call:
## svm(formula = target ~ ., data = df_train, kernel = "polynomial",
## type = "C-classification", )
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: polynomial
## cost: 1
## degree: 3
## coef.0: 0
##
## Number of Support Vectors: 212
The polynomial kernel reaches about 73% accuracy, lower than our earlier linear-kernel model.
predictions <- predict(poly, newdata = df_test)
confusion_matrix <- table(Actual_Label = df_test$target, Predicted_Label = predictions)
accuracy_poly <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
accuracy_poly
## [1] 0.7333333
A sigmoid kernel Support Vector Machine (SVM) is another type of SVM that uses a sigmoid function as the kernel to map the input features into a higher-dimensional space. The sigmoid kernel is particularly useful when dealing with non-linear relationships and when the data is not linearly separable. The decision boundary in the higher-dimensional space is still a hyperplane.
set.seed(345)
sig <- svm(target~.,
data=df_train,
kernel="sigmoid",
type = 'C-classification',
)
sig
##
## Call:
## svm(formula = target ~ ., data = df_train, kernel = "sigmoid", type = "C-classification",
## )
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: sigmoid
## cost: 1
## coef.0: 0
##
## Number of Support Vectors: 133
predictions <- predict(sig, newdata = df_test)
confusion_matrix <- table(Actual_Label = df_test$target, Predicted_Label = predictions)
accuracy_sig <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
accuracy_sig
## [1] 0.8
The sigmoid kernel achieves 80% accuracy, matching the linear kernel and beating the polynomial kernel. Let us next try the radial kernel and check its accuracy.
Radial kernel, also known as a radial basis function (RBF) kernel, is one of the most commonly used kernel functions in Support Vector Machines (SVM). The RBF kernel is particularly powerful in capturing complex, non-linear relationships in the data. It transforms the input features into a higher-dimensional space, allowing the SVM to find a hyperplane that separates different classes.
set.seed(456)
rad <- svm(target~.,
data=df_train,
kernel="radial",
type = 'C-classification',
)
rad
##
## Call:
## svm(formula = target ~ ., data = df_train, kernel = "radial", type = "C-classification",
## )
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
##
## Number of Support Vectors: 142
We see that the accuracy drops with the radial kernel compared with the linear and sigmoid kernels.
predictions <- predict(rad, newdata = df_test)
confusion_matrix <- table(Actual_Label = df_test$target, Predicted_Label = predictions)
accuracy_rad <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
accuracy_rad
## [1] 0.7666667
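All four kernels above use the default cost = 1 (and, where relevant, the default gamma), so the accuracies might improve with tuning. The sketch below uses e1071's tune.svm to grid-search these hyperparameters for the radial kernel; the grid values are illustrative, not taken from the original analysis.

set.seed(456)
tuned <- tune.svm(target ~ ., data = df_train,
                  kernel = "radial",
                  gamma = 10^(-3:0),   # candidate gamma values
                  cost = 10^(-1:2))    # candidate cost values
summary(tuned)          # cross-validated error for each gamma/cost combination
tuned$best.parameters   # best combination found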
Neural network modeling, often referred to as deep learning, uses artificial neural networks to model complex relationships and patterns in data. Neural networks are a class of machine learning models inspired by the structure and functioning of the human brain: interconnected nodes (neurons) arranged in layers learn complex patterns from the data.
For our dataset we shall try applying the neural network technique and see if it can further improve the accuracy.
train_params <- trainControl(method = "repeatedcv", number = 10, repeats=5)
nnet_model1 <- train(df_train[, 1:13],df_train[, 14],
method = "nnet",
trControl= train_params,
preProcess=c("scale","center"),
na.action = na.omit
)
plot(nnet_model1)
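By default train() tries only a small built-in grid of the nnet tuning parameters (size and decay), which is what the plot above summarizes. A custom grid can be supplied instead; the values below are purely illustrative and nnet_model2 is a hypothetical name.

nnet_grid <- expand.grid(size = c(3, 5, 7), decay = c(0, 0.01, 0.1))  # illustrative grid
nnet_model2 <- train(df_train[, 1:13], df_train[, 14],
                     method = "nnet",
                     trControl = train_params,
                     tuneGrid = nnet_grid,
                     preProcess = c("scale", "center"),
                     trace = FALSE)   # suppress nnet's training log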
set.seed(123)
nnPred_train <-predict(nnet_model1, df_train)
#Training confusion matrix
table(df_train$target, nnPred_train)
## nnPred_train
## 0 1
## 0 97 14
## 1 8 124
The neural network improves on the earlier models, reaching an accuracy of about 91% on the training data.
correct_predictions <- 97 + 124           # correct counts from the training confusion matrix above
all_predictions <- 97 + 124 + 8 + 14
accuracy <- correct_predictions / all_predictions
accuracy
## [1] 0.909465
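The figure above is computed on the training data. For a comparison on the same footing as the random forest and SVM results, the network could also be scored on the held-out test set; a sketch (output not shown):

nnPred_test <- predict(nnet_model1, df_test)   # predict on the test set
confusionMatrix(nnPred_test, df_test$target)   # test-set confusion matrix and accuracy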
From the summary below we see the neural network achieves the highest accuracy of all the models built for this project, correctly predicting the presence of heart disease about 91% of the time (note that this figure is measured on the training data, while the other accuracies are measured on the held-out test set).
In my last assignment the polynomial kernel performed better than the other kernels, so it is surprising that on the current dataset the polynomial kernel performs worst of the kernels tried. This analysis shows that machine learning models cannot be treated as one-size-fits-all.
rf <- forest.cm_train$overall[["Accuracy"]]  # random forest test-set accuracy from its confusion matrix
kable(cbind(rf, accuracy_lin, accuracy_poly, accuracy_sig, accuracy_rad, accuracy), col.names = c("RandomForest", "SVM Linear", "SVM Polynomial", "SVM Sigmoid", "SVM Radial", "Neural network")) %>%
  kable_styling(full_width = T)
| RandomForest | SVM Linear | SVM Polynomial | SVM Sigmoid | SVM Radial | Neural network |
|---|---|---|---|---|---|
| 0.8333333 | 0.8 | 0.7333333 | 0.8 | 0.7666667 | 0.909465 |