# Loading the data, variables names, and categorical descriptions

files <- dir("data", full = T)


if(length(files)==0 ){
warning("FIRST YOU MUST CREATE A NEW R PROJEC THEN COPY THE FILES DOWNLOADED FROM GIT HUB AND PAST ALL INTO THE NEW PROJECT FOLDER") }


names(files) <- gsub("\\.csv|\\.txt", "", dir("data"))
set_categorical_external <- function(){
   
   col_names <- str_split(stri_read_lines(files["col_names"])," ")
   names(col_names) <- c("var_name", "fac_name")
   level <- str_split(stri_read_lines(files["factors"])," ")
   heart_data <- read.csv(files["heart_cleveland_upload"], col.names = col_names[["var_name"]])
   heart_data2 <- as.data.table(heart_data)
 
   for(i in seq_along(level)){
      values <-  heart_data[ ,col_names[["fac_name"]][i] , drop = T] 
      
      heart_data2[, col_names[["fac_name"]][i] := factor(values,levels = sort(unique(values)),
                             labels = level[[i]])][,"blood_sugar" :=
                                                      ifelse(heart_data2$blood_sugar==1,
                                                              TRUE, FALSE)]
   }
   return(as.data.frame(heart_data2))
}

heart_named <- set_categorical_external()
Confusion Matrix

Introduction

Heart disease is one of the biggest causes of morbidity and mortality among the population of the world. As reported by World Health Organization (WHO), Heart Disease and Stroke are the world’s biggest killers and have remained the leading causes of death globally in the last 15 years.

Prediction of cardiovascular disease is regarded as one of the most important subjects in the section of clinical data analysis. The amount of data in the healthcare industry is huge. Data mining turns the large collection of raw healthcare data into information that can help to make informed decisions and predictions. Machine Learning can present remarkable features that simplify the identification of unseen patterns, eventually providing clinical insights that assist physicians in planning and providing care. In this analysis, the presence of heart disease is predicted by employing Logistic regression and Tree-Based Models.

You can find the code used in this article in the Github Repository. Right-click on the link and choose (open in a New Window).


The Data

The Heart Disease dataset has been taken from Kaggle. This data contains 76 attributes, but all published experiments refer to using a subset of 14 of them. It has a total number of 297 rows and 14 columns among which 137 have a heart disease. You can check the original dataset here.

  • age: age in years
  • sex: sex (1 = male; 0 = female)
  • cp: chest pain type
    • Value 0: typical angina
    • Value 1: atypical angina
    • Value 2: non-anginal pain
    • Value 3: asymptomatic
  • trestbps: resting blood pressure (in mm Hg on admission to the hospital)
  • chol: serum cholestoral in mg/dl
  • fbs: (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
  • restecg: resting electrocardiographic results
    • Value 0: normal
    • Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    • Value 2: showing probable or definite left ventricular hypertrophy by Estes’ criteria
  • thalach: maximum heart rate achieved
  • exang: exercise induced angina (1 = yes; 0 = no)
  • oldpeak = ST depression induced by exercise relative to rest
  • slope: the slope of the peak exercise ST segment
    • Value 0: upsloping
    • Value 1: flat
    • Value 2: downsloping
  • ca: number of major vessels (0-3) colored by flourosopy
  • thal: 0 = normal; 1 = fixed defect; 2 = reversable defect and the label
  • condition: 0 = no disease, 1 = disease

Data cleaning and pre-processing

There were no duplicate entries and no missing values the table below gives a summary of these:

paste0("There are ",sum(duplicated(heart_named)), " duplicated observations") 
## [1] "There are 0 duplicated observations"
# The skim  providing a strong set of summary statistics that are generated for a variety of different data types.
skim(heart_named)
Data summary
Name heart_named
Number of rows 297
Number of columns 14
_______________________
Column type frequency:
factor 7
logical 1
numeric 6
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
sex 0 1 FALSE 2 mal: 201, fem: 96
chestpain 0 1 FALSE 4 asy: 142, non: 83, aty: 49, typ: 23
electro_results 0 1 FALSE 3 nor: 147, ven: 146, abn: 4
exer_induc_angina 0 1 FALSE 2 no: 200, yes: 97
slope_peak 0 1 FALSE 3 ups: 139, fla: 137, dow: 21
thal 0 1 FALSE 3 nor: 164, rev: 115, fix: 18
condition 0 1 FALSE 2 no_: 160, dis: 137

Variable type: logical

skim_variable n_missing complete_rate mean count
blood_sugar 0 1 0.14 FAL: 254, TRU: 43

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
age 0 1 54.54 9.05 29 48 56.0 61.0 77.0 ▁▅▇▇▁
blood_pressure 0 1 131.69 17.76 94 120 130.0 140.0 200.0 ▃▇▅▁▁
serum_cholestrol 0 1 247.35 52.00 126 211 243.0 276.0 564.0 ▃▇▂▁▁
heart_rate 0 1 149.60 22.94 71 133 153.0 166.0 202.0 ▁▂▅▇▂
oldpeak 0 1 1.06 1.17 0 0 0.8 1.6 6.2 ▇▂▁▁▁
major_vessel 0 1 0.68 0.94 0 0 0.0 1.0 3.0 ▇▃▁▂▁



Problem Statements

Perform dataset analysis of Heart Disease both visually and statistically to obtain important observations that can be used for inference. Fom this analysis, build a model having high accuracy and precision, to predict whether a person has heart disease with greater confidence.



Data Analysis

ggplot(heart_named, aes(x=condition, fill=condition)) + 
   geom_bar(stat="count") +
   geom_text(aes(label = ..count..), stat = "count", vjust = 1.5, colour = "white",
             fontface = "bold",size = 7)+
   theme_minimal() +
   scale_fill_manual(values =c("#808080", "#B3B6B7"))+
   ggtitle("Count of Presence and Absence of Heart Disease") +
   xlab("Heart Disease") +
   ylab("Count")+
   theme(
      axis.text.y = element_blank(),
      axis.text.x =  element_text(size = 15, color="black") ,
      panel.grid.major = element_blank(),
      panel.grid.minor = element_blank(),
      axis.ticks = element_blank(),
      plot.title = element_text(size = 17),
                                
      legend.position = "none") 
Figure 1

Figure 1

Althoug, this dataset is small, it is well distributed between the outcomes. We can check on the bar chart that roughly 46% of the outcome tested positive for heart disease. This will help the precision of the algorithm in predicting heart disease.


ggplot(heart_named, aes(sex, fill = sex))+
   geom_bar(stat="count") +
   geom_text(aes(label = ..count..), stat = "count",  vjust = 1.5, colour = "white",
             fontface = "bold",size = 7)+
   facet_grid(cols = vars(condition),labeller = label_both)+
   theme_minimal()  +
   scale_fill_manual(values = c("#808080", "#B3B6B7"))+
   ggtitle("Heart Diseases Frequency for Sexes")+
   theme(
      axis.text.y = element_blank(),
      axis.text.x =  element_text(size = 12, color = "black") ,
      panel.grid.major = element_blank(),
      panel.grid.minor = element_blank(),
      axis.ticks = element_blank(),
      strip.text.x = element_text(size = 14),
      plot.title = element_text(size = 17),
      legend.position = "none") 
Figure 2

Figure 2

Here, we can verify that more than two thirds of the dataset is composed of males which infers and that the incidence of heart disease in males is far greater than it is in females.

ggplot(heart_named,aes(age, fill= condition)) +
   geom_histogram(aes(y=..density..),breaks=seq(29, 80, by=1), color="grey17") +
   facet_wrap(~condition, ncol=1,scale="fixed") +
   theme_clean()+
   scale_fill_manual(values =c("#808080", "#B3B6B7"))+
   scale_x_continuous(breaks=seq(29, 80, by=5))+
   xlab("Age") +
   ylab("Density") +
   ggtitle("Age Histogram")+
   theme(
         plot.title = element_text(size = 17),
         strip.text.x = element_text(size = 14),
         legend.position = "none") 
Figure 3

Figure 3

We see that age plays an important role in increasing the risk of having heart disease. The incidences of positive (disease) starts to increase at the age of 53, peaking at the age of 57, then it starts decreasing until the age of 62. Most people who are suffering are of the age of 57, followed by 56.


ggplot(heart_named,aes(blood_pressure, fill=condition)) +
   geom_histogram(aes(y=..density..),binwidth = 1, color="grey17") +
   geom_density(alpha=.1, fill="black")+
   facet_wrap(~condition, ncol=1,scale="fixed") +
   scale_x_continuous(breaks=seq(94.0, 200.0, by=10))+
   theme_clean()   +
   scale_fill_manual(values = c("#808080", "#B3B6B7")) +
   xlab("Blood Pressure") +
   ylab("Density") +
   ggtitle("Resting Blood Pressure (in mm Hg on admission to the hospital)")+
   theme(
      plot.title = element_text(size = 15),
      strip.text.x = element_text(size = 14),
      legend.position = "none") 
Figure 4

Figure 4

Although, it is well known that, over time, hypertension can damage the arteries that feed the heart the distribution between the two groups appears to be quite similar. It does not indicate any pattern that suggests that the group with the disease has higher blood pressure. Except, for the incidence of some outliers.


ggplot(heart_named, aes(y = heart_rate, fill =condition))+
   geom_boxplot()+
   scale_fill_manual(values = c("#808080", "#B3B6B7")) +
   scale_y_continuous(breaks=seq(70.0, 200.0, by=10))+
   theme_minimal()+
   ggtitle("Distributions Max Heart Rate achieved")+
   theme(
      plot.title = element_text(size = 15)) 
Figure 5

Figure 5

ggplot(heart_named, aes( heart_rate, fill =condition))+
   geom_density(alpha= .5)+
   scale_fill_manual(values = c("#404040", "#B3B6B7")) +
   theme_minimal()+
   ggtitle("Distributions Max Heart Rate achieved")+
   theme(
      plot.title = element_text(size = 15)) 
Figure 5.1

Figure 5.1

We can see from Figures 5 and 5.1 that healthier people have a higher heart rate in comparison with people who have heart disease. Although the distribution of both groups is normal, we can identify that the spread of the distribution for the people with a detected problem is more spread with a lower mean.


ggplot(heart_named, aes(y = oldpeak, fill =condition))+
   geom_boxplot()+
   scale_fill_manual(values = c("#808080", "#B3B6B7")) +
   theme_minimal()+
   ggtitle("Distribution of ST Depression Induced by Exercise Relative to Rest")  
Figure 6

Figure 6

ggplot(heart_named, aes( oldpeak, fill =condition))+
   geom_density(alpha= .5)+
   scale_fill_manual(values = c("#404040", "#B3B6B7")) +
   theme_minimal()+
   ggtitle("Distribution of ST Depression Induced by Exercise Relative to Rest")
Figure 6.1

Figure 6.1

We can identify in Figures 6 and 6.1 an opposite pattern from Figures 5 and 5.1. Here healthier people have a lower oldpeak, in comparison with people who have heart disease. The spread of the distribution remains more spread out for the people with a detected problem.


ggplot(heart_named,aes(chestpain, fill=condition)) +
   geom_bar(stat="count") +
   geom_text(aes(label = ..count..), stat = "count", vjust = -.5, colour = "black",
             fontface = "bold" )+
   facet_grid(cols = vars(condition), labeller = label_both)+
   theme_minimal()  +
   scale_fill_manual(values = c("#808080", "#B3B6B7"))+
   ggtitle("Bar Chart of Chest Pain Type for Condition")+
   theme(
      axis.text.y = element_blank(),
      panel.grid.major = element_blank(),
      panel.grid.minor = element_blank(),
      axis.ticks = element_blank(),
      axis.text.x=element_text(angle=35, hjust=1, vjust=1.2, color="black"),
      plot.title = element_text(size = 17),
      strip.text.x = element_text(size = 14),
      legend.position = "none")
Figure 7

Figure 7

The predominant chest pain experienced by the group with a heart problem is asymptomatic. While in the other group. It is non-anginal pain.


ggplot(heart_named,aes(thal, fill=condition)) +
   geom_bar(stat="count") +
   geom_text(aes(label = ..count..), stat = "count", vjust = -.2, colour = "black",
             fontface = "bold", size = 4)+
   facet_grid(cols = vars(condition), labeller = label_both)+
   theme_minimal()  +
   scale_fill_manual(values = c("#808080", "#B3B6B7"))+
   ggtitle("Blood Disorder Thalassemia")+
   theme(
      axis.text.y = element_blank(),
      panel.grid.major = element_blank(),
      panel.grid.minor = element_blank(),
      axis.ticks = element_blank(),
      axis.text.x=element_text(angle=35, hjust=1, vjust=1.1, color="black"),
      plot.title = element_text(size = 17),
      strip.text.x = element_text(size = 14),
      legend.position = "none")
Figure 8

Figure 8

Figure 7 shows that only about 21% of the people with a healthy heart do not have abnormality caused by Thalassemia. However, in the group with heart disease, almost 73% present the abnormality.


ggplot(heart_named, aes(age, blood_pressure, color = condition))+ 
   geom_point(size = 4, alpha= 0.6)+
   scale_color_manual(values = c("#A0A0A0", "black"))+
   scale_fill_manual(values = c("#808080", "#B3B6B7"))+
   facet_grid(cols = vars(sex))+
   geom_smooth(method = "lm",se =FALSE )+
   theme_minimal()  +
   ggtitle("Relationship Between Age and Blood Pressure")+
   theme(
      axis.text.x=element_text( color="black"),
      plot.title = element_text(size = 17),
      strip.text.x = element_text(size = 14))
Figure 9

Figure 9

We can see that age and blood pressure are positively correlated. As we saw in Figure 3.0 the spread of blood pressure remains quite similar for men and women. However, by separating the groups by gender, we are able to verify that blood pressure changes for women. Women with heart disease tend to have higher blood pressure.


ggplot(heart_named, aes(age, heart_rate, color = condition))+
   geom_point(size = 4, alpha= 0.6)+
   scale_color_manual(values = c("#A0A0A0", "black"))+
   scale_fill_manual(values = c("#black", "#black"))+
   facet_grid(cols = vars(sex))+
   geom_smooth(method = "lm",se =FALSE )+
   theme_minimal()  +
   ggtitle("Relationship Between Age and Blood Pressure")+
   theme(
      axis.text.x=element_text( color="black"),
      plot.title = element_text(size = 17),
      strip.text.x = element_text(size = 14))
Figure 10

Figure 10

We can see that age and heart rate are negatively correlated in most groups. An exception is the case of women with heart disease, which is positively correlated. As we get older, our hearts tend to decrease heart rate. Though, for women with abnormalities in the heart, it tends to increase.





Machine Learning

Algorithm Description
Random Forest Decision tree
Logistic Regression Binary predictor
Gradient Boosting Machine (GBM) Decision tree


Evaluation Metrics

Confusion Matrix

The confusion matrix displays the correctly predicted as well as incorrectly predicted values by a classifier. The sum of TP and TN, from the confusion matrix, is the number of correctly classified entries by the classifier.

Confusion Matrix



ROC Curve

ROC Curves summarizes the trade-off between the true positive rate and the false positive rate for the predictive model using different probability thresholds.



Applying algorithms

Splitting Data

The sample will be done using 75% of the observations for training and 25% for testing the model. The variables were selected based on the data analyses.

heart_data_2 <- heart_named %>%
   select(chestpain,thal,major_vessel,heart_rate,oldpeak,blood_pressure,
          condition,age,sex)

set.seed(754)

n <- nrow(heart_data_2)

n_train <- round(0.74 * n)

train_indices <- sample(1:n, n_train)

train <- heart_data_2[train_indices, ]

test  <- heart_data_2[-train_indices, ]

The train dataset is composed of: 220 rows and the test dataset is composed of: 77 rows.

Random Forest

#================= TURNING A RANDOM FOREST VIA MTRY 
# This function is a specific utility to tune the mtry parameter based on OOB error, which is # helpful when you want a quick & easy way to tune your model.

set.seed(754)  
model <-  tuneRF(x = subset(train, select = -condition),
                y = train$condition,
                ntreeTry = 500,
                doBest = TRUE,
                plot= FALSE,
                trace = FALSE)


# Generate predictions on the test set
test$target_pred <- predict(object = model, 
                            newdata = test)


Confusion matrix of predictions from the training set (Random Forest):

model$confusion[c(1,2),c(1,2)]
##            no_disease disease
## no_disease         97      17
## disease            24      82


Accuracy for Random Forest for training set = 81%



Confusion matrix of predictions from the test set(Random Forest):

f <- table(test$condition,test$target_pred)
t.df<-as.data.frame(f)


ggplot(data = t.df, aes(x = Var1, y = Var2, label=Freq)) +
   geom_tile(aes(fill = Freq)) +
   scale_fill_gradient(low="#E0E0E0" , high= "#808080") +
   theme_minimal() +
   xlab("Actual Heart Disease") +
   ylab("Predicted Heart Disease") +
   geom_text(size=8) +
   ggtitle("Random Forest")


Accuracy for Random Forest for test set = 84%


Plot of the ROC curve (Random Forest):

 ROC <- suppressMessages(roc(ifelse(test$target_pred == "disease", 1,0),
            ifelse(test$condition   == "disease", 1,0)))
 
 # Plot the ROC curve
 plot(ROC, col = "black")

 # Calculate the area under the curve (AUC)
 auc(ROC)
## Area under the curve: 0.8371



Logistic Regression

set.seed(754)

# Train a Logistic Regression model
model <- glm(condition ~ ., 
             data = train, family = "binomial")


# Generate predictions on the test set
test$target_pred  <- predict(model ,newdata = test, 
                             type = "response")

test$target_pred <- ifelse(test$target_pred >= 0.504,"disease", "no_disease")

Confusion matrix of predictions from the test set(Logistic Regression):

f <- table(test$condition,test$target_pred)
t.df<-as.data.frame(f)


ggplot(data = t.df, aes(x = Var1, y = Var2, label=Freq)) +
   geom_tile(aes(fill = Freq)) +
   scale_fill_gradient(low="#E0E0E0" , high= "#808080") +
   theme_minimal() +
   xlab("Actual Heart Disease") +
   ylab("Predicted Heart Disease") +
   geom_text(size=8) +
   ggtitle("Logistic Regression")


Accuracy for Logistic Regression for test set = 90%


Plot of the ROC curve (Logistic Regression):

 ROC <- suppressMessages(roc(ifelse(test$target_pred == "disease", 1,0),
            ifelse(test$condition   == "disease", 1,0)))
 
 # Plot the ROC curve
 plot(ROC, col = "black")

 # Calculate the area under the curve (AUC)
 auc(ROC)
## Area under the curve: 0.8902



Gradient Boosting Machine (GBM)

train$condition <- ifelse(train$condition == "disease", 1,0)

# Train a 10000-tree GBM model
set.seed(754)

model <- gbm(formula = condition ~ ., 
                    distribution = "bernoulli", 
                    data = train,
                    n.trees = 10000)



#==================== EARLY STOPPING IN GBMs
# Optimal ntree estimate based on OOB
ntree_opt_oob <- gbm.perf(object = model, 
                          method = "OOB", 
                          oobag.curve = TRUE,
                          plot.it = FALSE)



test$condition<- ifelse(test$condition == "disease", 1,0)

# Generate predictions on the test set (scale to response)
test$target_pred <- predict(object  =  model, 
                            newdata = test,
                            n.trees = ntree_opt_oob,
                            type = "response")


test$target_pred <- ifelse(test$target_pred >= 0.504, "disease", "no_disease")

test$condition   <- ifelse(test$condition == 1, "disease", "no_disease")

Confusion matrix of predictions from the test set(GBM):

f <- table(test$condition,test$target_pred)
t.df<-as.data.frame(f)


ggplot(data = t.df, aes(x = Var1, y = Var2, label=Freq)) +
   geom_tile(aes(fill = Freq)) +
   scale_fill_gradient(low="#E0E0E0" , high= "#808080") +
   theme_minimal() +
   xlab("Actual Heart Disease") +
   ylab("Predicted Heart Disease") +
   geom_text(size=8) +
   ggtitle("Gradient Boosting Machine")


Accuracy for (GBM) for test set = 88%


Plot of the ROC curve (GBM):

 ROC <- suppressMessages(roc(ifelse(test$target_pred == "disease", 1,0),
            ifelse(test$condition   == "disease", 1,0)))
 
 # Plot the ROC curve
 plot(ROC, col = "black")

 # Calculate the area under the curve (AUC)
 auc(ROC)
## Area under the curve: 0.8728


Conclusion

Among the algorithms applied to predict the incidence of heart disease, the Logistic Regression performed better with 90% of accuracy on its predictions.

The number of people affected by heart diseases is increasing day by day so it is very important to have a method to detect even the subtle patterns which point out a problem in the heart. With the help of machine learning models we can identify who suffers from these diseases based on the patterns of the data which incerases the odds for possible early treatment and healthier life.