Heart disease is the leading cause of death in the United States, causing about 1 in 4 deaths. The term “heart disease” refers to several types of heart conditions. In the United States, the most common type is coronary artery disease (CAD), which can lead to a heart attack. This dataset comes from Cleveland, a major city in the U.S. state of Ohio. Here we build a model to predict heart disease from the Cleveland data.
Source dataset: https://www.kaggle.com/ronitf/heart-disease-uci
library(dplyr)
library(tidyr)
library(MASS)
library(caret)
heart <- read.csv("heart.csv", stringsAsFactors = TRUE)
glimpse(heart)
## Rows: 303
## Columns: 14
## $ age <int> 63, 37, 41, 56, 57, 57, 56, 44, 52, 57, 54, 48, 49, 64, 58, 5…
## $ sex <int> 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1…
## $ cp <int> 3, 2, 1, 1, 0, 0, 1, 1, 2, 2, 0, 2, 1, 3, 3, 2, 2, 3, 0, 3, 0…
## $ trestbps <int> 145, 130, 130, 120, 120, 140, 140, 120, 172, 150, 140, 130, 1…
## $ chol <int> 233, 250, 204, 236, 354, 192, 294, 263, 199, 168, 239, 275, 2…
## $ fbs <int> 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0…
## $ restecg <int> 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1…
## $ thalach <int> 150, 187, 172, 178, 163, 148, 153, 173, 162, 174, 160, 139, 1…
## $ exang <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0…
## $ oldpeak <dbl> 2.3, 3.5, 1.4, 0.8, 0.6, 0.4, 1.3, 0.0, 0.5, 1.6, 1.2, 0.2, 0…
## $ slope <int> 0, 0, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 0, 2, 2, 1…
## $ ca <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0…
## $ thal <int> 1, 2, 2, 2, 2, 1, 2, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3…
## $ target <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
Attribute information:
age : age.
sex : sex.
cp : chest pain type (4 values).
trestbps : resting blood pressure.
chol : serum cholesterol in mg/dl.
fbs : fasting blood sugar > 120 mg/dl.
restecg : resting electrocardiographic results (values 0,1,2).
thalach : maximum heart rate achieved.
exang : exercise induced angina.
oldpeak : ST depression induced by exercise relative to rest.
slope : the slope of the peak exercise ST segment.
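ca : number of major vessels (0-3) colored by fluoroscopy; in this file the column also takes the value 4 (see the ca4 term in the model summary).
thal : 3 = normal; 6 = fixed defect; 7 = reversible defect in the original UCI coding; in this file the column is coded 0-3 (see the glimpse() output and the thal1 to thal3 model terms).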
target : 0 = Health ; 1 = Not Health.
Change Data Type
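heart <- heart %>%
  # Convert only the categorical codes to factors; age, trestbps, chol and thalach stay numeric
  mutate(across(c(sex, cp, fbs, restecg, exang, slope, ca, thal), as.factor)) %>%
  # Label the outcome; "levels" is spelled out instead of relying on partial argument matching
  mutate(target = factor(target, levels = c(0, 1), labels = c("Health", "Not Health")))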
head(heart)
Check Missing Values
colSums(is.na(heart))
##      age      sex       cp trestbps     chol      fbs  restecg  thalach
##        0        0        0        0        0        0        0        0
##    exang  oldpeak    slope       ca     thal   target
##        0        0        0        0        0        0
There are no missing values in our dataset.
Check Class Proportion
prop.table(table(heart$target))
##
##     Health Not Health
##  0.4554455  0.5445545
The two classes are fairly balanced (about 46% vs. 54%), so we can continue to the next step without resampling.
Cross Validation (Train-Test Split)
We split the dataset into training and test sets: 80% of the rows go to the training data and 20% to the test data.
set.seed(212)
index <- sample(nrow(heart), nrow(heart)*0.8)
heart_train <- heart[index, ]
heart_test <- heart[-index, ]
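As a quick sanity check on the split above, the expected set sizes can be worked out in base R. This is only a sketch: it assumes the 303 rows reported by `glimpse()`, and the names `n_total`, `n_train`, and `n_test` are introduced here for illustration.

```r
# Row count reported by glimpse(heart)
n_total <- 303

# sample(n, size) truncates a fractional size, so 303 * 0.8 = 242.4 yields 242 rows
n_train <- floor(n_total * 0.8)
n_test  <- n_total - n_train

cat("train rows:", n_train, "\n")
cat("test rows: ", n_test, "\n")
```

The 242 training rows agree with the 241 null-deviance degrees of freedom in the model summary further down, and the 61 test rows agree with the total count in the confusion matrix.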
We build a logistic regression model to predict heart disease.
model_logistic <- glm(formula = target ~ sex + cp + fbs + exang + oldpeak + slope + ca + thal,
                      family = "binomial",
                      data = heart_train)
summary(model_logistic)
##
## Call:
## glm(formula = target ~ sex + cp + fbs + exang + oldpeak + slope +
## ca + thal, family = "binomial", data = heart_train)
##
## Deviance Residuals:
##      Min        1Q    Median        3Q       Max
## -2.86182  -0.35074   0.08587   0.36404   3.06243
##
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)   0.7316     4.2749   0.171 0.864122
## sex1         -1.9091     0.6021  -3.171 0.001521 **
## cp1           0.9981     0.6662   1.498 0.134070
## cp2           2.0423     0.5949   3.433 0.000597 ***
## cp3           2.2555     0.7997   2.821 0.004794 **
## fbs1          0.4578     0.6333   0.723 0.469792
## exang1       -0.9428     0.5094  -1.851 0.064226 .
## oldpeak      -0.5059     0.2550  -1.984 0.047213 *
## slope1       -1.3048     0.8848  -1.475 0.140283
## slope2        0.6994     0.9782   0.715 0.474593
## ca1          -2.1061     0.5798  -3.633 0.000281 ***
## ca2          -2.6051     0.8824  -2.952 0.003154 **
## ca3          -2.1581     0.9293  -2.322 0.020216 *
## ca4           1.4331     1.5944   0.899 0.368728
## thal1         2.2960     4.2561   0.539 0.589578
## thal2         2.3008     4.1713   0.552 0.581237
## thal3         0.9524     4.1754   0.228 0.819567
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 334.42 on 241 degrees of freedom
## Residual deviance: 146.50 on 225 degrees of freedom
## AIC: 180.5
##
## Number of Fisher Scoring iterations: 6
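Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios. The sketch below is not part of the original analysis: it copies a few estimates from the summary above into a hypothetical vector `coefs` to show the conversion. With the fitted object available, `exp(coef(model_logistic))` should give the same quantities for every term.

```r
# Estimates copied from the summary above (log-odds scale)
coefs <- c(sex1 = -1.9091, cp2 = 2.0423, oldpeak = -0.5059, ca1 = -2.1061)

# Exponentiate to obtain odds ratios relative to each reference level
odds_ratios <- exp(coefs)
print(round(odds_ratios, 3))
```

For example, the `sex1` odds ratio is roughly 0.15: holding the other predictors fixed, the fitted odds of the "Not Health" class for sex = 1 are about 85% lower than for sex = 0.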
heart_test$prob_heart <- predict(model_logistic, type = "response", newdata = heart_test)
ggplot(heart_test, aes(x = prob_heart)) +
  geom_density(lwd = 0.5) +
  labs(title = "Distribution of Probability Prediction Data") +
  theme_minimal()
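Based on the graph above, the predicted probabilities tend to concentrate near 1, which corresponds to the "Not Health" class.
# Use the same level order as the reference so confusionMatrix() still works if one class is never predicted
heart_test$pred_heart <- factor(ifelse(heart_test$prob_heart > 0.5, "Not Health", "Health"),
                                levels = levels(heart_test$target))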
cm_logis <- confusionMatrix(heart_test$pred_heart, heart_test$target, positive = "Not Health")
cm_logis
## Confusion Matrix and Statistics
##
##             Reference
## Prediction   Health Not Health
##   Health         19          4
##   Not Health      6         32
##
## Accuracy : 0.8361
## 95% CI : (0.7191, 0.9185)
## No Information Rate : 0.5902
## P-Value [Acc > NIR] : 3.428e-05
##
## Kappa : 0.6569
##
## Mcnemar's Test P-Value : 0.7518
##
## Sensitivity : 0.8889
## Specificity : 0.7600
## Pos Pred Value : 0.8421
## Neg Pred Value : 0.8261
## Prevalence : 0.5902
## Detection Rate : 0.5246
## Detection Prevalence : 0.6230
## Balanced Accuracy : 0.8244
##
## 'Positive' Class : Not Health
##
Based on the confusionMatrix output, the model reaches an accuracy of 83.6%, a sensitivity (recall) of 88.9%, a specificity of 76.0%, and a precision of 84.2%.
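These figures can be reproduced by hand from the four cells of the confusion matrix. The sketch below uses only the counts printed above, with "Not Health" as the positive class; the variable names are chosen here for illustration.

```r
# Counts read from the confusion matrix above ("Not Health" is the positive class)
tp <- 32  # predicted Not Health, actually Not Health
fn <- 4   # predicted Health, actually Not Health
fp <- 6   # predicted Not Health, actually Health
tn <- 19  # predicted Health, actually Health

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
sensitivity <- tp / (tp + fn)  # recall
specificity <- tn / (tn + fp)
precision   <- tp / (tp + fp)  # positive predictive value

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, precision = precision), 4))
```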
Our main objective is sensitivity (recall), which is 88.9%. We want to keep false negatives as low as possible, even at the cost of more false positives, because the model is meant to serve as a pre-screening tool for the doctor: a patient flagged as positive ("Not Health") can still be examined in more detail, whereas a missed case would go unchecked.
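If recall is the priority, one lever is the classification cutoff: lowering it below 0.5 flags more patients as "Not Health". The sketch below illustrates the idea with a small made-up vector of probabilities and labels (hypothetical toy values, not taken from the dataset) and a hypothetical helper `recall_at()`.

```r
# Hypothetical toy data for illustration only (not from heart.csv)
toy_prob  <- c(0.95, 0.80, 0.62, 0.45, 0.38, 0.30, 0.15, 0.05)
toy_truth <- factor(c("Not Health", "Not Health", "Not Health", "Not Health",
                      "Health", "Not Health", "Health", "Health"),
                    levels = c("Health", "Not Health"))

# Recall (sensitivity) for the "Not Health" class at a given cutoff
recall_at <- function(prob, truth, threshold) {
  predicted_positive <- prob > threshold
  actual_positive    <- truth == "Not Health"
  sum(predicted_positive & actual_positive) / sum(actual_positive)
}

# Recall rises as the cutoff is lowered
for (cutoff in c(0.5, 0.4, 0.25)) {
  cat("cutoff", cutoff, "-> recall", recall_at(toy_prob, toy_truth, cutoff), "\n")
}
```

On the real test set, the same comparison could be made by replacing 0.5 in the `ifelse()` call with a lower cutoff and rerunning `confusionMatrix()`. The trade-off is that specificity usually drops as more healthy patients are flagged.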