# Loading the data, variables names, and categorical descriptions
files <- dir("data", full = T)
if(length(files)==0 ){
warning("FIRST YOU MUST CREATE A NEW R PROJEC THEN COPY THE FILES DOWNLOADED FROM GIT HUB AND PAST ALL INTO THE NEW PROJECT FOLDER") }
names(files) <- gsub("\\.csv|\\.txt", "", dir("data"))
set_categorical_external <- function(){
col_names <- str_split(stri_read_lines(files["col_names"])," ")
names(col_names) <- c("var_name", "fac_name")
level <- str_split(stri_read_lines(files["factors"])," ")
heart_data <- read.csv(files["heart_cleveland_upload"], col.names = col_names[["var_name"]])
heart_data2 <- as.data.table(heart_data)
for(i in seq_along(level)){
values <- heart_data[ ,col_names[["fac_name"]][i] , drop = T]
heart_data2[, col_names[["fac_name"]][i] := factor(values,levels = sort(unique(values)),
labels = level[[i]])][,"blood_sugar" :=
ifelse(heart_data2$blood_sugar==1,
TRUE, FALSE)]
}
return(as.data.frame(heart_data2))
}
heart_named <- set_categorical_external()
Heart disease is one of the biggest causes of morbidity and mortality among the population of the world. As reported by World Health Organization (WHO), Heart Disease and Stroke are the world’s biggest killers and have remained the leading causes of death globally in the last 15 years.
Prediction of cardiovascular disease is regarded as one of the most important subjects in the section of clinical data analysis. The amount of data in the healthcare industry is huge. Data mining turns the large collection of raw healthcare data into information that can help to make informed decisions and predictions. Machine Learning can present remarkable features that simplify the identification of unseen patterns, eventually providing clinical insights that assist physicians in planning and providing care. In this analysis, the presence of heart disease is predicted by employing Logistic regression and Tree-Based Models.
You can find the code used in this article in the Github Repository. Right-click on the link and choose (open in a New Window).
The Heart Disease dataset has been taken from Kaggle. This data contains 76 attributes, but all published experiments refer to using a subset of 14 of them. It has a total number of 297 rows and 14 columns among which 137 have a heart disease. You can check the original dataset here.
There were no duplicate entries and no missing values the table below gives a summary of these:
paste0("There are ",sum(duplicated(heart_named)), " duplicated observations")
## [1] "There are 0 duplicated observations"
# The skim providing a strong set of summary statistics that are generated for a variety of different data types.
skim(heart_named)
| Name | heart_named |
| Number of rows | 297 |
| Number of columns | 14 |
| _______________________ | |
| Column type frequency: | |
| factor | 7 |
| logical | 1 |
| numeric | 6 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| sex | 0 | 1 | FALSE | 2 | mal: 201, fem: 96 |
| chestpain | 0 | 1 | FALSE | 4 | asy: 142, non: 83, aty: 49, typ: 23 |
| electro_results | 0 | 1 | FALSE | 3 | nor: 147, ven: 146, abn: 4 |
| exer_induc_angina | 0 | 1 | FALSE | 2 | no: 200, yes: 97 |
| slope_peak | 0 | 1 | FALSE | 3 | ups: 139, fla: 137, dow: 21 |
| thal | 0 | 1 | FALSE | 3 | nor: 164, rev: 115, fix: 18 |
| condition | 0 | 1 | FALSE | 2 | no_: 160, dis: 137 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| blood_sugar | 0 | 1 | 0.14 | FAL: 254, TRU: 43 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| age | 0 | 1 | 54.54 | 9.05 | 29 | 48 | 56.0 | 61.0 | 77.0 | ▁▅▇▇▁ |
| blood_pressure | 0 | 1 | 131.69 | 17.76 | 94 | 120 | 130.0 | 140.0 | 200.0 | ▃▇▅▁▁ |
| serum_cholestrol | 0 | 1 | 247.35 | 52.00 | 126 | 211 | 243.0 | 276.0 | 564.0 | ▃▇▂▁▁ |
| heart_rate | 0 | 1 | 149.60 | 22.94 | 71 | 133 | 153.0 | 166.0 | 202.0 | ▁▂▅▇▂ |
| oldpeak | 0 | 1 | 1.06 | 1.17 | 0 | 0 | 0.8 | 1.6 | 6.2 | ▇▂▁▁▁ |
| major_vessel | 0 | 1 | 0.68 | 0.94 | 0 | 0 | 0.0 | 1.0 | 3.0 | ▇▃▁▂▁ |
Perform dataset analysis of Heart Disease both visually and statistically to obtain important observations that can be used for inference. Fom this analysis, build a model having high accuracy and precision, to predict whether a person has heart disease with greater confidence.
ggplot(heart_named, aes(x=condition, fill=condition)) +
geom_bar(stat="count") +
geom_text(aes(label = ..count..), stat = "count", vjust = 1.5, colour = "white",
fontface = "bold",size = 7)+
theme_minimal() +
scale_fill_manual(values =c("#808080", "#B3B6B7"))+
ggtitle("Count of Presence and Absence of Heart Disease") +
xlab("Heart Disease") +
ylab("Count")+
theme(
axis.text.y = element_blank(),
axis.text.x = element_text(size = 15, color="black") ,
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.ticks = element_blank(),
plot.title = element_text(size = 17),
legend.position = "none")
Figure 1
Althoug, this dataset is small, it is well distributed between the outcomes. We can check on the bar chart that roughly 46% of the outcome tested positive for heart disease. This will help the precision of the algorithm in predicting heart disease.
ggplot(heart_named, aes(sex, fill = sex))+
geom_bar(stat="count") +
geom_text(aes(label = ..count..), stat = "count", vjust = 1.5, colour = "white",
fontface = "bold",size = 7)+
facet_grid(cols = vars(condition),labeller = label_both)+
theme_minimal() +
scale_fill_manual(values = c("#808080", "#B3B6B7"))+
ggtitle("Heart Diseases Frequency for Sexes")+
theme(
axis.text.y = element_blank(),
axis.text.x = element_text(size = 12, color = "black") ,
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.ticks = element_blank(),
strip.text.x = element_text(size = 14),
plot.title = element_text(size = 17),
legend.position = "none")
Figure 2
Here, we can verify that more than two thirds of the dataset is composed of males which infers and that the incidence of heart disease in males is far greater than it is in females.
ggplot(heart_named,aes(age, fill= condition)) +
geom_histogram(aes(y=..density..),breaks=seq(29, 80, by=1), color="grey17") +
facet_wrap(~condition, ncol=1,scale="fixed") +
theme_clean()+
scale_fill_manual(values =c("#808080", "#B3B6B7"))+
scale_x_continuous(breaks=seq(29, 80, by=5))+
xlab("Age") +
ylab("Density") +
ggtitle("Age Histogram")+
theme(
plot.title = element_text(size = 17),
strip.text.x = element_text(size = 14),
legend.position = "none")
Figure 3
We see that age plays an important role in increasing the risk of having heart disease. The incidences of positive (disease) starts to increase at the age of 53, peaking at the age of 57, then it starts decreasing until the age of 62. Most people who are suffering are of the age of 57, followed by 56.
ggplot(heart_named,aes(blood_pressure, fill=condition)) +
geom_histogram(aes(y=..density..),binwidth = 1, color="grey17") +
geom_density(alpha=.1, fill="black")+
facet_wrap(~condition, ncol=1,scale="fixed") +
scale_x_continuous(breaks=seq(94.0, 200.0, by=10))+
theme_clean() +
scale_fill_manual(values = c("#808080", "#B3B6B7")) +
xlab("Blood Pressure") +
ylab("Density") +
ggtitle("Resting Blood Pressure (in mm Hg on admission to the hospital)")+
theme(
plot.title = element_text(size = 15),
strip.text.x = element_text(size = 14),
legend.position = "none")
Figure 4
Although, it is well known that, over time, hypertension can damage the arteries that feed the heart the distribution between the two groups appears to be quite similar. It does not indicate any pattern that suggests that the group with the disease has higher blood pressure. Except, for the incidence of some outliers.
ggplot(heart_named, aes(y = heart_rate, fill =condition))+
geom_boxplot()+
scale_fill_manual(values = c("#808080", "#B3B6B7")) +
scale_y_continuous(breaks=seq(70.0, 200.0, by=10))+
theme_minimal()+
ggtitle("Distributions Max Heart Rate achieved")+
theme(
plot.title = element_text(size = 15))
Figure 5
ggplot(heart_named, aes( heart_rate, fill =condition))+
geom_density(alpha= .5)+
scale_fill_manual(values = c("#404040", "#B3B6B7")) +
theme_minimal()+
ggtitle("Distributions Max Heart Rate achieved")+
theme(
plot.title = element_text(size = 15))
Figure 5.1
We can see from Figures 5 and 5.1 that healthier people have a higher heart rate in comparison with people who have heart disease. Although the distribution of both groups is normal, we can identify that the spread of the distribution for the people with a detected problem is more spread with a lower mean.
ggplot(heart_named, aes(y = oldpeak, fill =condition))+
geom_boxplot()+
scale_fill_manual(values = c("#808080", "#B3B6B7")) +
theme_minimal()+
ggtitle("Distribution of ST Depression Induced by Exercise Relative to Rest")
Figure 6
ggplot(heart_named, aes( oldpeak, fill =condition))+
geom_density(alpha= .5)+
scale_fill_manual(values = c("#404040", "#B3B6B7")) +
theme_minimal()+
ggtitle("Distribution of ST Depression Induced by Exercise Relative to Rest")
Figure 6.1
We can identify in Figures 6 and 6.1 an opposite pattern from Figures 5 and 5.1. Here healthier people have a lower oldpeak, in comparison with people who have heart disease. The spread of the distribution remains more spread out for the people with a detected problem.
ggplot(heart_named,aes(chestpain, fill=condition)) +
geom_bar(stat="count") +
geom_text(aes(label = ..count..), stat = "count", vjust = -.5, colour = "black",
fontface = "bold" )+
facet_grid(cols = vars(condition), labeller = label_both)+
theme_minimal() +
scale_fill_manual(values = c("#808080", "#B3B6B7"))+
ggtitle("Bar Chart of Chest Pain Type for Condition")+
theme(
axis.text.y = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.ticks = element_blank(),
axis.text.x=element_text(angle=35, hjust=1, vjust=1.2, color="black"),
plot.title = element_text(size = 17),
strip.text.x = element_text(size = 14),
legend.position = "none")
Figure 7
The predominant chest pain experienced by the group with a heart problem is asymptomatic. While in the other group. It is non-anginal pain.
ggplot(heart_named,aes(thal, fill=condition)) +
geom_bar(stat="count") +
geom_text(aes(label = ..count..), stat = "count", vjust = -.2, colour = "black",
fontface = "bold", size = 4)+
facet_grid(cols = vars(condition), labeller = label_both)+
theme_minimal() +
scale_fill_manual(values = c("#808080", "#B3B6B7"))+
ggtitle("Blood Disorder Thalassemia")+
theme(
axis.text.y = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.ticks = element_blank(),
axis.text.x=element_text(angle=35, hjust=1, vjust=1.1, color="black"),
plot.title = element_text(size = 17),
strip.text.x = element_text(size = 14),
legend.position = "none")
Figure 8
Figure 7 shows that only about 21% of the people with a healthy heart do not have abnormality caused by Thalassemia. However, in the group with heart disease, almost 73% present the abnormality.
ggplot(heart_named, aes(age, blood_pressure, color = condition))+
geom_point(size = 4, alpha= 0.6)+
scale_color_manual(values = c("#A0A0A0", "black"))+
scale_fill_manual(values = c("#808080", "#B3B6B7"))+
facet_grid(cols = vars(sex))+
geom_smooth(method = "lm",se =FALSE )+
theme_minimal() +
ggtitle("Relationship Between Age and Blood Pressure")+
theme(
axis.text.x=element_text( color="black"),
plot.title = element_text(size = 17),
strip.text.x = element_text(size = 14))
Figure 9
We can see that age and blood pressure are positively correlated. As we saw in Figure 3.0 the spread of blood pressure remains quite similar for men and women. However, by separating the groups by gender, we are able to verify that blood pressure changes for women. Women with heart disease tend to have higher blood pressure.
ggplot(heart_named, aes(age, heart_rate, color = condition))+
geom_point(size = 4, alpha= 0.6)+
scale_color_manual(values = c("#A0A0A0", "black"))+
scale_fill_manual(values = c("#black", "#black"))+
facet_grid(cols = vars(sex))+
geom_smooth(method = "lm",se =FALSE )+
theme_minimal() +
ggtitle("Relationship Between Age and Blood Pressure")+
theme(
axis.text.x=element_text( color="black"),
plot.title = element_text(size = 17),
strip.text.x = element_text(size = 14))
Figure 10
We can see that age and heart rate are negatively correlated in most groups. An exception is the case of women with heart disease, which is positively correlated. As we get older, our hearts tend to decrease heart rate. Though, for women with abnormalities in the heart, it tends to increase.
| Algorithm | Description |
|---|---|
| Random Forest | Decision tree |
| Logistic Regression | Binary predictor |
| Gradient Boosting Machine (GBM) | Decision tree |
The confusion matrix displays the correctly predicted as well as incorrectly predicted values by a classifier. The sum of TP and TN, from the confusion matrix, is the number of correctly classified entries by the classifier.
ROC Curves summarizes the trade-off between the true positive rate and the false positive rate for the predictive model using different probability thresholds.
The sample will be done using 75% of the observations for training and 25% for testing the model. The variables were selected based on the data analyses.
heart_data_2 <- heart_named %>%
select(chestpain,thal,major_vessel,heart_rate,oldpeak,blood_pressure,
condition,age,sex)
set.seed(754)
n <- nrow(heart_data_2)
n_train <- round(0.74 * n)
train_indices <- sample(1:n, n_train)
train <- heart_data_2[train_indices, ]
test <- heart_data_2[-train_indices, ]
The train dataset is composed of: 220 rows and the test dataset is composed of: 77 rows.
#================= TURNING A RANDOM FOREST VIA MTRY
# This function is a specific utility to tune the mtry parameter based on OOB error, which is # helpful when you want a quick & easy way to tune your model.
set.seed(754)
model <- tuneRF(x = subset(train, select = -condition),
y = train$condition,
ntreeTry = 500,
doBest = TRUE,
plot= FALSE,
trace = FALSE)
# Generate predictions on the test set
test$target_pred <- predict(object = model,
newdata = test)
Confusion matrix of predictions from the training set (Random Forest):
model$confusion[c(1,2),c(1,2)]
## no_disease disease
## no_disease 97 17
## disease 24 82
Accuracy for Random Forest for training set = 81%
Confusion matrix of predictions from the test set(Random Forest):
f <- table(test$condition,test$target_pred)
t.df<-as.data.frame(f)
ggplot(data = t.df, aes(x = Var1, y = Var2, label=Freq)) +
geom_tile(aes(fill = Freq)) +
scale_fill_gradient(low="#E0E0E0" , high= "#808080") +
theme_minimal() +
xlab("Actual Heart Disease") +
ylab("Predicted Heart Disease") +
geom_text(size=8) +
ggtitle("Random Forest")
Accuracy for Random Forest for test set = 84%
Plot of the ROC curve (Random Forest):
ROC <- suppressMessages(roc(ifelse(test$target_pred == "disease", 1,0),
ifelse(test$condition == "disease", 1,0)))
# Plot the ROC curve
plot(ROC, col = "black")
# Calculate the area under the curve (AUC)
auc(ROC)
## Area under the curve: 0.8371
set.seed(754)
# Train a Logistic Regression model
model <- glm(condition ~ .,
data = train, family = "binomial")
# Generate predictions on the test set
test$target_pred <- predict(model ,newdata = test,
type = "response")
test$target_pred <- ifelse(test$target_pred >= 0.504,"disease", "no_disease")
Confusion matrix of predictions from the test set(Logistic Regression):
f <- table(test$condition,test$target_pred)
t.df<-as.data.frame(f)
ggplot(data = t.df, aes(x = Var1, y = Var2, label=Freq)) +
geom_tile(aes(fill = Freq)) +
scale_fill_gradient(low="#E0E0E0" , high= "#808080") +
theme_minimal() +
xlab("Actual Heart Disease") +
ylab("Predicted Heart Disease") +
geom_text(size=8) +
ggtitle("Logistic Regression")
Accuracy for Logistic Regression for test set = 90%
Plot of the ROC curve (Logistic Regression):
ROC <- suppressMessages(roc(ifelse(test$target_pred == "disease", 1,0),
ifelse(test$condition == "disease", 1,0)))
# Plot the ROC curve
plot(ROC, col = "black")
# Calculate the area under the curve (AUC)
auc(ROC)
## Area under the curve: 0.8902
train$condition <- ifelse(train$condition == "disease", 1,0)
# Train a 10000-tree GBM model
set.seed(754)
model <- gbm(formula = condition ~ .,
distribution = "bernoulli",
data = train,
n.trees = 10000)
#==================== EARLY STOPPING IN GBMs
# Optimal ntree estimate based on OOB
ntree_opt_oob <- gbm.perf(object = model,
method = "OOB",
oobag.curve = TRUE,
plot.it = FALSE)
test$condition<- ifelse(test$condition == "disease", 1,0)
# Generate predictions on the test set (scale to response)
test$target_pred <- predict(object = model,
newdata = test,
n.trees = ntree_opt_oob,
type = "response")
test$target_pred <- ifelse(test$target_pred >= 0.504, "disease", "no_disease")
test$condition <- ifelse(test$condition == 1, "disease", "no_disease")
Confusion matrix of predictions from the test set(GBM):
f <- table(test$condition,test$target_pred)
t.df<-as.data.frame(f)
ggplot(data = t.df, aes(x = Var1, y = Var2, label=Freq)) +
geom_tile(aes(fill = Freq)) +
scale_fill_gradient(low="#E0E0E0" , high= "#808080") +
theme_minimal() +
xlab("Actual Heart Disease") +
ylab("Predicted Heart Disease") +
geom_text(size=8) +
ggtitle("Gradient Boosting Machine")
Accuracy for (GBM) for test set = 88%
Plot of the ROC curve (GBM):
ROC <- suppressMessages(roc(ifelse(test$target_pred == "disease", 1,0),
ifelse(test$condition == "disease", 1,0)))
# Plot the ROC curve
plot(ROC, col = "black")
# Calculate the area under the curve (AUC)
auc(ROC)
## Area under the curve: 0.8728
Among the algorithms applied to predict the incidence of heart disease, the Logistic Regression performed better with 90% of accuracy on its predictions.
The number of people affected by heart diseases is increasing day by day so it is very important to have a method to detect even the subtle patterns which point out a problem in the heart. With the help of machine learning models we can identify who suffers from these diseases based on the patterns of the data which incerases the odds for possible early treatment and healthier life.