Heart Disease
Heart Disease
- Introduction
- Library
- Read Data
- Exploratory Data Analysis
- Variable Transforming
- Data Visualization
- Sex
- Age
- Chest Pain
- Resting Blood Presure
- Cholestoral
- Fasting Blood Sugar
- Resting Electrocardiographic Results
- Maximum Heart Rate Achieved
- Exercise Induced Angina
- ST Depression
- Slope of The Peak Exercise
- Number of Major Vessels
- Thallasemia
- Multivariables Interpretation
- Checking Data Availability
- Checking Target Proportion
- Modeling
- Evaluation
- Conclusion
Introduction
We would like to use logistic regression model and K-Nearest Neighbour (K-NN) to predict the presence of heart disease in patients, which is target variable with 13 attributes, which are:
age: in years
sex: male or female
chest pain type (4 values)
resting blood pressure: detect blood pressure during rest (in mmHg)
serum cholestoral in mg/dl
fasting blood sugar > 120 mg/dl
resting electrocardiographic results (values 0,1,2)
maximum heart rate achieved
exercise induced angina
oldpeak = ST depression induced by exercise relative to rest
the slope of the peak exercise ST segment
number of major vessels (0-3) colored by flourosopy
thal: 3 = No Thalassemia; 6 = Fixed Defect Thalassemia; 7 = Reversible Defect Thalassemia
Library
Read Data
Exploratory Data Analysis
Variable Transforming
We need to change some variables from numeric into character class. ca variable is not transformed into character class because it need to be know whether more visibled blood vessels will affect the model calculation or not.
## Observations: 303
## Variables: 14
## $ age <int> 63, 37, 41, 56, 57, 57, 56, 44, 52, 57, 54, 48, 49, 64, 58, …
## $ sex <int> 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, …
## $ cp <int> 3, 2, 1, 1, 0, 0, 1, 1, 2, 2, 0, 2, 1, 3, 3, 2, 2, 3, 0, 3, …
## $ trestbps <int> 145, 130, 130, 120, 120, 140, 140, 120, 172, 150, 140, 130, …
## $ chol <int> 233, 250, 204, 236, 354, 192, 294, 263, 199, 168, 239, 275, …
## $ fbs <int> 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, …
## $ restecg <int> 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, …
## $ thalach <int> 150, 187, 172, 178, 163, 148, 153, 173, 162, 174, 160, 139, …
## $ exang <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, …
## $ oldpeak <dbl> 2.3, 3.5, 1.4, 0.8, 0.6, 0.4, 1.3, 0.0, 0.5, 1.6, 1.2, 0.2, …
## $ slope <int> 0, 0, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 0, 2, 2, …
## $ ca <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, …
## $ thal <int> 1, 2, 2, 2, 2, 1, 2, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ target <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
heart <- heart %>%
mutate(sex = ifelse(sex==1, "Male","Female"),
fbs = ifelse(fbs == 1, "> 120 mg/dl", "< 120 mg/dl"),
exang = ifelse(exang == 1, "Exercise Induced Angina" ,"No Exercise Induced Angina"),
cp = ifelse(cp == 0, "Chest Pain Type 0",
ifelse(cp == 1, "Chest Pain Type 1", ifelse(cp==2, "Chest Pain Type 2", "Chest Pain Type 3"))),
restecg = ifelse(restecg == 0, "Normal",
if_else(restecg == 1, "Abnormality", "Probable or Definite")),
thal = ifelse(thal== 0, "No Thalassemia", ifelse(thal==1, "Normal Thalassemia", ifelse(thal==2, "Fixed Defect Thalassemia", "Reversible Defect Thalassemia"))),
target = ifelse(target == 0, "Healthy", "Heart Disease"),
slope = ifelse(slope == 0, "Peak Excercise ST Slope 0", ifelse(slope==1,"Peak Excercise ST Slope 1", "Peak Excercise ST Slope 2"))
) %>%
mutate_if(is.character, as.factor)
summary(heart)## age sex cp trestbps
## Min. :29.00 Female: 96 Chest Pain Type 0:143 Min. : 94.0
## 1st Qu.:47.50 Male :207 Chest Pain Type 1: 50 1st Qu.:120.0
## Median :55.00 Chest Pain Type 2: 87 Median :130.0
## Mean :54.37 Chest Pain Type 3: 23 Mean :131.6
## 3rd Qu.:61.00 3rd Qu.:140.0
## Max. :77.00 Max. :200.0
## chol fbs restecg thalach
## Min. :126.0 < 120 mg/dl:258 Abnormality :152 Min. : 71.0
## 1st Qu.:211.0 > 120 mg/dl: 45 Normal :147 1st Qu.:133.5
## Median :240.0 Probable or Definite: 4 Median :153.0
## Mean :246.3 Mean :149.6
## 3rd Qu.:274.5 3rd Qu.:166.0
## Max. :564.0 Max. :202.0
## exang oldpeak
## Exercise Induced Angina : 99 Min. :0.00
## No Exercise Induced Angina:204 1st Qu.:0.00
## Median :0.80
## Mean :1.04
## 3rd Qu.:1.60
## Max. :6.20
## slope ca
## Peak Excercise ST Slope 0: 21 Min. :0.0000
## Peak Excercise ST Slope 1:140 1st Qu.:0.0000
## Peak Excercise ST Slope 2:142 Median :0.0000
## Mean :0.7294
## 3rd Qu.:1.0000
## Max. :4.0000
## thal target
## Fixed Defect Thalassemia :166 Healthy :138
## No Thalassemia : 2 Heart Disease:165
## Normal Thalassemia : 18
## Reversible Defect Thalassemia:117
##
##
Data Visualization
In data visualization, we would try to visualize the relationship between heart disesase with each variable.
Sex
sex <- heart %>%
group_by(sex, target) %>%
summarise(count=n()) %>%
mutate(perc=count/sum(count))
ggplot(sex, aes(x=sex, y=count, fill=target)) + geom_col(position = "dodge") + coord_flip()+
facet_wrap(~target, ncol=1,scale="fixed")+xlab("Sex") +
ylab("Number") +
ggtitle("Gender") +
scale_fill_discrete(name = "Heart Disease", labels = c("Absence", "Presence")) + theme(plot.title = element_text(hjust = 0.5))In our data, there is only 30 more male who has heart disease than female. However, there is much more healthy male than healthy female.
Age
age <- heart %>%
mutate(age=as.factor(age)) %>%
group_by(age, target) %>%
summarise(count=n()) %>%
mutate(perc=count/sum(count))
age2 <- ggplot(age, aes(x=age, y=count, fill=target)) + geom_bar(stat = "identity") + xlab("Age") + scale_x_discrete(breaks = c("30", "35","40","45","50","55","60","65","70","75","80")) +
ylab("Number") +
ggtitle("Age Distribution") +
scale_fill_discrete(name = "Heart Disease", labels = c("Absence", "Presence")) + theme(plot.title = element_text(hjust = 0.5))
ggplotly(age2)The patient in the data distributed from the age of 29 into 77. Age with the highest number with heart disease is 55.
Chest Pain
cp <- heart %>%
group_by(cp, target) %>%
summarise(count=n()) %>%
mutate(perc=count/sum(count))
ggplot(cp, aes(x=cp, y=count, fill=target)) + geom_col(position = "dodge") + coord_flip() +
ylab("Number") +
ggtitle("Chest Pain") +
scale_fill_discrete(name = "Heart Disease", labels = c("Absence", "Presence")) + theme(plot.title = element_text(hjust = 0.5))Chest pain type 0 shows significant number of healthy patient compare to another types.
Resting Blood Presure
ggplot(heart, aes(x=trestbps, fill=target)) + geom_histogram(aes(y=..density..), color="grey17") +
geom_density(alpha=.2, fill="yellow")+
facet_wrap(~target, ncol=1,scale="fixed")+
xlab("Resting Blood Pressure (mmHg)") +
ylab("Density/Count") +
ggtitle("Resting Blood Pressure") +
scale_fill_discrete(name = "Heart Disease", labels = c("Absence", "Presence")) + theme(plot.title = element_text(hjust = 0.5))## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
In heart disease patients, many of them have blood pressure of 120 - 140 mmHG, although the number is not too far off from the healthy patients.
Cholestoral
ggplot(heart, aes(x=chol, fill=target)) + geom_histogram(aes(y=..density..), color="grey17") +
geom_density(alpha=.2, fill="yellow")+
facet_wrap(~target, ncol=1,scale="fixed")+
xlab("Cholesterol") +
ylab("Density/Count") +
ggtitle("Serum Cholestoral (mg/dl)") +
scale_fill_discrete(name = "Heart Disease", labels = c("Absence", "Presence")) + theme(plot.title = element_text(hjust = 0.5))## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
There is no significant differences between healthy and heart disease patients in their cholestoral (mg/dl).
Fasting Blood Sugar
fbs <- heart %>%
group_by(fbs, target) %>%
summarise(count=n()) %>%
mutate(perc=count/sum(count))
ggplot(fbs, aes(x=fbs, y=count, fill=target)) + geom_col(position = "dodge") + coord_flip() + xlab("fbs >120 mg/dl") +
ylab("Number") +
ggtitle("Fasting Blood Sugar") +
scale_fill_discrete(name = "Heart Disease", labels = c("Absence", "Presence")) + theme(plot.title = element_text(hjust = 0.5))There is no significant differences between healthy and heart disease patients in their fasting blood sugar, whether it is more or less than 120 mg/dl.
Resting Electrocardiographic Results
restecg <- heart %>%
group_by(restecg, target) %>%
summarise(count=n()) %>%
mutate(perc=count/sum(count))
ggplot(restecg, aes(x=restecg, y=count, fill=target)) + geom_col(position = "dodge") + coord_flip() + facet_wrap(~target, ncol=1,scale="fixed")+ xlab("RER") +
ylab("Number") +
ggtitle("Resting Electrocardiographic Results") +
scale_fill_discrete(name = "Heart Disease", labels = c("Absence", "Presence")) + theme(plot.title = element_text(hjust = 0.5))There are twice of heart disease patients that detected with abnormality compare to healthy patients that detected with abnormality.
Maximum Heart Rate Achieved
ggplot(heart, aes(x=thalach , fill=target)) + geom_histogram(aes(y=..density..), color="grey17") +
geom_density(alpha=.2, fill="yellow")+
facet_wrap(~target, ncol=1,scale="fixed")+
xlab("Maximum Heart Rate Achieved") +
ylab("Density/Count") +
ggtitle("Maximum Heart Rate Achieved") +
scale_fill_discrete(name = "Heart Disease", labels = c("Absence", "Presence")) + theme(plot.title = element_text(hjust = 0.5))## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Patients with heart disease appear to have maximum of heart rate achieved of more than 150.
Exercise Induced Angina
ex <- heart %>%
group_by(exang, target) %>%
summarise(count=n()) %>%
mutate(perc=count/sum(count))
ggplot(ex, aes(x=exang, y=count, fill=target)) + geom_col(position = "dodge") + coord_flip() + facet_wrap(~target, ncol=1,scale="fixed")+ xlab("RER") +
ylab("EIA") +
ggtitle("Exercise Induced Angina") +
scale_fill_discrete(name = "Heart Disease", labels = c("Absence", "Presence")) + theme(plot.title = element_text(hjust = 0.5))Most of the patient with heart disease detected with no exercise induced angina.
ST Depression
ggplot(heart, aes(x=oldpeak , fill=target)) + geom_histogram(aes(y=..density..), color="grey17") +
geom_density(alpha=.2, fill="yellow")+
facet_wrap(~target, ncol=1,scale="fixed")+
xlab("Oldpeak") +
ylab("Density/Count") +
ggtitle("ST depression induced by exercise relative to rest") +
scale_fill_discrete(name = "Heart Disease", labels = c("Absence", "Presence")) + theme(plot.title = element_text(hjust = 0.5))## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Most of the patient with heart disease detected with zero oldpeak.
Slope of The Peak Exercise
slope <- heart %>%
group_by(slope, target) %>%
summarise(count=n()) %>%
mutate(perc=count/sum(count))
ggplot(slope, aes(x=slope, y=count, fill=target)) + geom_col(position = "dodge") + coord_flip() + facet_wrap(~target, ncol=1,scale="fixed")+ xlab("Slope") +
ylab("Number") +
ggtitle("the slope of the peak exercise ST segment") +
scale_fill_discrete(name = "Heart Disease", labels = c("Absence", "Presence")) + theme(plot.title = element_text(hjust = 0.5))More than half of the patient with heart disease detected with peak exercise ST slope 2, while more than half of the healthy patient with detected with peak exercise ST slope 1
Number of Major Vessels
ca <- heart %>%
group_by(ca, target) %>%
summarise(count=n()) %>%
mutate(perc=count/sum(count))
ggplot(ca, aes(x=ca, y=count, fill=target)) + geom_col(position = "dodge") + coord_flip() + facet_wrap(~target, ncol=1,scale="fixed")+ xlab("Number of Major Vessels") +
ylab("Number") +
ggtitle("Major Vessels Colored by Flourosopy") +
scale_fill_discrete(name = "Heart Disease", labels = c("Absence", "Presence")) + theme(plot.title = element_text(hjust = 0.5))Most of the patient with heart disease detected with zero number of major vessel colored by flourosopy.
Thallasemia
thal <- heart %>%
group_by(thal, target) %>%
summarise(count=n()) %>%
mutate(perc=count/sum(count))
ggplot(thal, aes(x=thal, y=count, fill=target)) + geom_col(position = "dodge") + coord_flip() + facet_wrap(~target, ncol=1,scale="fixed")+ xlab("Thalassemia Examination Result") +
ylab("Number") +
ggtitle("Thalassemia") +
scale_fill_discrete(name = "Heart Disease", labels = c("Absence", "Presence")) + theme(plot.title = element_text(hjust = 0.5))Most of the patient with heart disease detected with fixed defect of thalassemia.
Multivariables Interpretation
We would try to correlate the relationship between oldpeak, number of blood vessels to the patient who is healty or having a heart disease.
ggplot(heart, aes(x=age, y=oldpeak, color=sex, size=ca)) + geom_point(alpha=0.7) + facet_wrap(~target, ncol=1,scale="fixed") + xlab("Age") +
ylab("Oldpeak") +
scale_fill_discrete(name = "Heart Disease", labels = c("Absence", "Presence")) + theme(plot.title = element_text(hjust = 0.5)) From the graph above, we could interpretate:
- Patient who has heart disease having a relatively low oldpeak.
- Patient who has heart disease having a relatively low oldpeak number of blood vessels (
ca). - Age has no siginificant effect whether the patient has heart disease or not.
Checking Data Availability
Before going further, we need to check if there is any not available data.
## age sex cp trestbps chol fbs restecg thalach
## 0 0 0 0 0 0 0 0
## exang oldpeak slope ca thal target
## 0 0 0 0 0 0
Checking Target Proportion
h <- heart %>%
group_by(target) %>%
summarise(count=n()) %>%
mutate(perc=count/sum(count))
ggplot(h, aes(x=target, y=perc*100, fill=target)) + geom_bar(stat = "identity") + coord_flip() + xlab("Patient Status") +
ylab("Percentage (%)") +
ggtitle("Patient Status Proportion") +
scale_fill_discrete(name = "Heart Disease", labels = c("Absence", "Presence")) + theme(plot.title = element_text(hjust = 0.5)) Finally, we need to cross validate the
target variable. The proportion is balanced, 54:46.
Modeling
There would be two models that we will use which are logistic regression and k-nearest neighbour (K-NN)
Logistic Regression
Data Splitting
First, we must slip the dataset into train and test dataset. The data will be split into ratio of 70% for data train and 30% for data test.
Targetting All Variables
##
## Call:
## glm(formula = target ~ ., family = "binomial", data = heart_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7597 -0.3682 0.1828 0.4810 2.8007
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.465457 3.332549 0.740 0.45941
## age -0.009925 0.029102 -0.341 0.73306
## sexMale -0.972760 0.594311 -1.637 0.10168
## cpChest Pain Type 1 0.676937 0.650570 1.041 0.29809
## cpChest Pain Type 2 1.683990 0.561586 2.999 0.00271 **
## cpChest Pain Type 3 1.876914 0.763048 2.460 0.01390 *
## trestbps -0.021103 0.013683 -1.542 0.12301
## chol -0.005397 0.005413 -0.997 0.31867
## fbs> 120 mg/dl 0.464879 0.694490 0.669 0.50325
## restecgNormal -0.114183 0.481740 -0.237 0.81264
## restecgProbable or Definite -0.036626 2.214469 -0.017 0.98680
## thalach 0.017983 0.012822 1.403 0.16076
## exangNo Exercise Induced Angina 1.286237 0.507204 2.536 0.01121 *
## oldpeak -0.340274 0.261999 -1.299 0.19403
## slopePeak Excercise ST Slope 1 -0.372844 1.008251 -0.370 0.71154
## slopePeak Excercise ST Slope 2 0.646221 1.097124 0.589 0.55585
## ca -0.658304 0.265271 -2.482 0.01308 *
## thalNo Thalassemia -1.687082 2.231244 -0.756 0.44958
## thalNormal Thalassemia -1.059155 0.980853 -1.080 0.28022
## thalReversible Defect Thalassemia -1.427661 0.508727 -2.806 0.00501 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 292.00 on 211 degrees of freedom
## Residual deviance: 142.36 on 192 degrees of freedom
## AIC: 182.36
##
## Number of Fisher Scoring iterations: 6
By using all the variables provided, we have numbers of variables that do not have significant effect for the target variable which could be seen on the p values. To eliminate those unsignificant variables, we would try stepwise method with backward direction.
Stepwise Method
## Start: AIC=182.36
## target ~ age + sex + cp + trestbps + chol + fbs + restecg + thalach +
## exang + oldpeak + slope + ca + thal
##
## Df Deviance AIC
## - restecg 2 142.42 178.42
## - age 1 142.48 180.48
## - fbs 1 142.82 180.82
## - chol 1 143.36 181.36
## - slope 2 146.04 182.04
## - oldpeak 1 144.12 182.12
## <none> 142.36 182.36
## - thalach 1 144.41 182.41
## - trestbps 1 144.84 182.84
## - sex 1 145.13 183.13
## - thal 3 150.81 184.81
## - ca 1 148.69 186.69
## - exang 1 148.87 186.87
## - cp 3 154.84 188.84
##
## Step: AIC=178.42
## target ~ age + sex + cp + trestbps + chol + fbs + thalach + exang +
## oldpeak + slope + ca + thal
##
## Df Deviance AIC
## - age 1 142.58 176.58
## - fbs 1 142.87 176.87
## - chol 1 143.57 177.57
## - oldpeak 1 144.17 178.17
## - slope 2 146.38 178.38
## <none> 142.42 178.42
## - thalach 1 144.47 178.47
## - trestbps 1 144.94 178.94
## - sex 1 145.37 179.37
## - thal 3 150.90 180.90
## - ca 1 148.76 182.76
## - exang 1 148.97 182.97
## - cp 3 154.87 184.87
##
## Step: AIC=176.58
## target ~ sex + cp + trestbps + chol + fbs + thalach + exang +
## oldpeak + slope + ca + thal
##
## Df Deviance AIC
## - fbs 1 143.02 175.02
## - chol 1 143.93 175.93
## - oldpeak 1 144.25 176.25
## - slope 2 146.51 176.51
## <none> 142.58 176.58
## - sex 1 145.48 177.48
## - thalach 1 145.64 177.64
## - trestbps 1 145.73 177.73
## - thal 3 151.10 179.10
## - exang 1 149.22 181.22
## - ca 1 149.83 181.83
## - cp 3 155.30 183.30
##
## Step: AIC=175.02
## target ~ sex + cp + trestbps + chol + thalach + exang + oldpeak +
## slope + ca + thal
##
## Df Deviance AIC
## - chol 1 144.24 174.24
## - oldpeak 1 144.79 174.79
## - slope 2 146.93 174.93
## <none> 143.02 175.02
## - sex 1 145.84 175.84
## - trestbps 1 145.92 175.92
## - thalach 1 146.09 176.09
## - thal 3 151.40 177.40
## - exang 1 149.39 179.39
## - ca 1 149.85 179.85
## - cp 3 156.88 182.88
##
## Step: AIC=174.24
## target ~ sex + cp + trestbps + thalach + exang + oldpeak + slope +
## ca + thal
##
## Df Deviance AIC
## <none> 144.24 174.24
## - oldpeak 1 146.30 174.30
## - sex 1 146.47 174.47
## - slope 2 148.57 174.57
## - thalach 1 147.16 175.16
## - trestbps 1 147.24 175.24
## - thal 3 152.49 176.49
## - exang 1 150.29 178.29
## - ca 1 151.44 179.44
## - cp 3 158.83 182.83
This method eliminates six unsignificant variables, which are age, fbs, chol, restecg, oldpeak, and slope. So therefore, we would use model.b as our prefer mode.
model.b <- glm(target ~ sex + cp + trestbps + thalach + exang + oldpeak + slope +
ca + thal,heart_train, family='binomial')
summary(model.b)##
## Call:
## glm(formula = target ~ sex + cp + trestbps + thalach + exang +
## oldpeak + slope + ca + thal, family = "binomial", data = heart_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7746 -0.3799 0.2014 0.4931 2.7368
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.43152 2.58633 0.167 0.86749
## sexMale -0.83157 0.56335 -1.476 0.13991
## cpChest Pain Type 1 0.72184 0.63672 1.134 0.25692
## cpChest Pain Type 2 1.79685 0.54745 3.282 0.00103 **
## cpChest Pain Type 3 1.93688 0.75034 2.581 0.00984 **
## trestbps -0.02183 0.01295 -1.685 0.09200 .
## thalach 0.01927 0.01156 1.666 0.09568 .
## exangNo Exercise Induced Angina 1.21859 0.49692 2.452 0.01419 *
## oldpeak -0.36267 0.25753 -1.408 0.15906
## slopePeak Excercise ST Slope 1 -0.44129 0.98775 -0.447 0.65504
## slopePeak Excercise ST Slope 2 0.64093 1.07046 0.599 0.54934
## ca -0.64851 0.24833 -2.612 0.00901 **
## thalNo Thalassemia -1.25894 2.45145 -0.514 0.60757
## thalNormal Thalassemia -0.84997 0.94975 -0.895 0.37082
## thalReversible Defect Thalassemia -1.38805 0.49318 -2.815 0.00489 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 292.00 on 211 degrees of freedom
## Residual deviance: 144.24 on 197 degrees of freedom
## AIC: 174.24
##
## Number of Fisher Scoring iterations: 6
Model Interpretation
## .
## (Intercept) 1.5395889
## sexMale 0.4353669
## cpChest Pain Type 1 2.0582241
## cpChest Pain Type 2 6.0306449
## cpChest Pain Type 3 6.9370818
## trestbps 0.9784109
## thalach 1.0194538
## exangNo Exercise Induced Angina 3.3824019
## oldpeak 0.6958139
## slopePeak Excercise ST Slope 1 0.6432050
## slopePeak Excercise ST Slope 2 1.8982437
## ca 0.5228225
## thalNo Thalassemia 0.2839554
## thalNormal Thalassemia 0.4274267
## thalReversible Defect Thalassemia 0.2495606
Fom the model, model.b, we could interprete several things:
The odds of male to diagnosed with heart disease is 43% less than female.
Patient detected with no exercise induced angina (unstable angina) is 3 times more reluctant to heart disease compared with patient with exercise induced angina.
The odds of patient detected with oldpeak to diagnosed with heart disease is 70% less than those who don’t.
K-Nearest Neighbour
Variable Transforming
## 'data.frame': 303 obs. of 28 variables:
## $ age : num 63 37 41 56 57 57 56 44 52 57 ...
## $ sex.Female : num 0 0 1 0 1 0 1 0 0 0 ...
## $ sex.Male : num 1 1 0 1 0 1 0 1 1 1 ...
## $ cp.Chest.Pain.Type.0 : num 0 0 0 0 1 1 0 0 0 0 ...
## $ cp.Chest.Pain.Type.1 : num 0 0 1 1 0 0 1 1 0 0 ...
## $ cp.Chest.Pain.Type.2 : num 0 1 0 0 0 0 0 0 1 1 ...
## $ cp.Chest.Pain.Type.3 : num 1 0 0 0 0 0 0 0 0 0 ...
## $ trestbps : num 145 130 130 120 120 140 140 120 172 150 ...
## $ chol : num 233 250 204 236 354 192 294 263 199 168 ...
## $ fbs...120.mg.dl : num 0 1 1 1 1 1 1 1 0 1 ...
## $ fbs...120.mg.dl.1 : num 1 0 0 0 0 0 0 0 1 0 ...
## $ restecg.Abnormality : num 0 1 0 1 1 1 0 1 1 1 ...
## $ restecg.Normal : num 1 0 1 0 0 0 1 0 0 0 ...
## $ restecg.Probable.or.Definite : num 0 0 0 0 0 0 0 0 0 0 ...
## $ thalach : num 150 187 172 178 163 148 153 173 162 174 ...
## $ exang.Exercise.Induced.Angina : num 0 0 0 0 1 0 0 0 0 0 ...
## $ exang.No.Exercise.Induced.Angina : num 1 1 1 1 0 1 1 1 1 1 ...
## $ oldpeak : num 2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
## $ slope.Peak.Excercise.ST.Slope.0 : num 1 1 0 0 0 0 0 0 0 0 ...
## $ slope.Peak.Excercise.ST.Slope.1 : num 0 0 0 0 0 1 1 0 0 0 ...
## $ slope.Peak.Excercise.ST.Slope.2 : num 0 0 1 1 1 0 0 1 1 1 ...
## $ ca : num 0 0 0 0 0 0 0 0 0 0 ...
## $ thal.Fixed.Defect.Thalassemia : num 0 1 1 1 1 0 1 0 0 1 ...
## $ thal.No.Thalassemia : num 0 0 0 0 0 0 0 0 0 0 ...
## $ thal.Normal.Thalassemia : num 1 0 0 0 0 1 0 0 0 0 ...
## $ thal.Reversible.Defect.Thalassemia: num 0 0 0 0 0 0 0 1 1 0 ...
## $ target.Healthy : num 0 0 0 0 0 0 0 0 0 0 ...
## $ target.Heart.Disease : num 1 1 1 1 1 1 1 1 1 1 ...
We need to transform all the variables into numeric class, as KNN only processing numeric variables.
heart2 <- heart2 %>%
dplyr::select(-c(target.Healthy, sex.Female, fbs...120.mg.dl.1 , exang.Exercise.Induced.Angina))
names(heart2) ## [1] "age" "sex.Male"
## [3] "cp.Chest.Pain.Type.0" "cp.Chest.Pain.Type.1"
## [5] "cp.Chest.Pain.Type.2" "cp.Chest.Pain.Type.3"
## [7] "trestbps" "chol"
## [9] "fbs...120.mg.dl" "restecg.Abnormality"
## [11] "restecg.Normal" "restecg.Probable.or.Definite"
## [13] "thalach" "exang.No.Exercise.Induced.Angina"
## [15] "oldpeak" "slope.Peak.Excercise.ST.Slope.0"
## [17] "slope.Peak.Excercise.ST.Slope.1" "slope.Peak.Excercise.ST.Slope.2"
## [19] "ca" "thal.Fixed.Defect.Thalassemia"
## [21] "thal.No.Thalassemia" "thal.Normal.Thalassemia"
## [23] "thal.Reversible.Defect.Thalassemia" "target.Heart.Disease"
Then, we neglect variables with only two levels, such as sex variable.
Data Splitting
For KNN, we need to split data into the main data and label data.
heart2_train <- heart2[id,1:23]
heart2_test <- heart2[-id,1:23]
heart2_train_label <- heart2[id,24]%>% as.factor()
heart2_test_label <- heart2[-id,24] %>% as.factor()Then, we need to scale the data as the scale from each variables are different.
Prediction with KNN
## [1] 14.56022
Before modeling, we need to define the value of K. From the formula above, here we get the value of 15.
Evaluation
Logistic Regression
Preparation
We put the predict value from the model into our data, heart_train and heart_test.
Tuning Cutoff
performa <- function(cutoff, prob, ref, postarget, negtarget)
{
predict <- factor(ifelse(prob >= cutoff, postarget, negtarget))
conf <- caret::confusionMatrix(predict , ref, positive = postarget)
acc <- conf$overall[1]
rec <- conf$byClass[1]
prec <- conf$byClass[3]
spec <- conf$byClass[2]
mat <- t(as.matrix(c(rec , acc , prec, spec)))
colnames(mat) <- c("recall", "accuracy", "precicion", "specificity")
return(mat)
}
co <- seq(0.01,0.80,length=100)
result <- matrix(0,100,4)
for(i in 1:100){
result[i,] = performa(cutoff = co[i],
prob = heart_test$pred.Target,
ref = heart_test$target,
postarget = "Heart Disease",
negtarget = "Healthy")
}
data_frame("Recall" = result[,1],
"Accuracy" = result[,2],
"Precision" = result[,3],
"Specificity" = result[,4],
"Cutoff" = co) %>%
gather(key = "performa", value = "value", 1:4) %>%
ggplot(aes(x = Cutoff, y = value, col = performa)) +
geom_line(lwd = 1.5) +
scale_color_manual(values = c("darkred","darkgreen","orange", "blue")) +
scale_y_continuous(breaks = seq(0,1,0.1), limits = c(0,1)) +
scale_x_continuous(breaks = seq(0,1,0.1)) +
labs(title = "Tradeoff model perfomance") +
theme_minimal() +
theme(legend.position = "top",
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank())## Warning: `data_frame()` is deprecated, use `tibble()`.
## This warning is displayed once per session.
Before doing evaluation, we need to inspect the good cutoff or threshold to maximize the accuracy, recall and precision value. Here from the graph, 0.53 is the most balance value.
Data Training Evaluation
## Confusion Matrix and Statistics
##
## Reference
## Prediction Healthy Heart Disease
## Healthy 80 13
## Heart Disease 16 103
##
## Accuracy : 0.8632
## 95% CI : (0.8095, 0.9064)
## No Information Rate : 0.5472
## P-Value [Acc > NIR] : <0.0000000000000002
##
## Kappa : 0.7232
##
## Mcnemar's Test P-Value : 0.7103
##
## Sensitivity : 0.8879
## Specificity : 0.8333
## Pos Pred Value : 0.8655
## Neg Pred Value : 0.8602
## Prevalence : 0.5472
## Detection Rate : 0.4858
## Detection Prevalence : 0.5613
## Balanced Accuracy : 0.8606
##
## 'Positive' Class : Heart Disease
##
Data Test Evaluation
## Confusion Matrix and Statistics
##
## Reference
## Prediction Healthy Heart Disease
## Healthy 35 8
## Heart Disease 7 41
##
## Accuracy : 0.8352
## 95% CI : (0.7427, 0.9047)
## No Information Rate : 0.5385
## P-Value [Acc > NIR] : 0.000000002394
##
## Kappa : 0.6689
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.8367
## Specificity : 0.8333
## Pos Pred Value : 0.8542
## Neg Pred Value : 0.8140
## Prevalence : 0.5385
## Detection Rate : 0.4505
## Detection Prevalence : 0.5275
## Balanced Accuracy : 0.8350
##
## 'Positive' Class : Heart Disease
##
The train data give accuracy of 86.32%, while the data test give accurracy of 83.52%. Because the accuracy is not far, so we can assume the model is fit.
K-Nearest Neighbour
Data Training Evaluation
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 76 18
## 1 20 98
##
## Accuracy : 0.8208
## 95% CI : (0.7624, 0.8699)
## No Information Rate : 0.5472
## P-Value [Acc > NIR] : <0.0000000000000002
##
## Kappa : 0.6376
##
## Mcnemar's Test P-Value : 0.8711
##
## Sensitivity : 0.8448
## Specificity : 0.7917
## Pos Pred Value : 0.8305
## Neg Pred Value : 0.8085
## Prevalence : 0.5472
## Detection Rate : 0.4623
## Detection Prevalence : 0.5566
## Balanced Accuracy : 0.8182
##
## 'Positive' Class : 1
##
Data Test Evaluation
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 34 5
## 1 8 44
##
## Accuracy : 0.8571
## 95% CI : (0.7681, 0.9217)
## No Information Rate : 0.5385
## P-Value [Acc > NIR] : 0.0000000001091
##
## Kappa : 0.7111
##
## Mcnemar's Test P-Value : 0.5791
##
## Sensitivity : 0.8980
## Specificity : 0.8095
## Pos Pred Value : 0.8462
## Neg Pred Value : 0.8718
## Prevalence : 0.5385
## Detection Rate : 0.4835
## Detection Prevalence : 0.5714
## Balanced Accuracy : 0.8537
##
## 'Positive' Class : 1
##
The train data give accuracy of 82.08%, while the data test give accurracy of 85.71%. Because the accuracy is not far, so we can assume the model is fit.
Conclusion
Logistic Regression
eval_logit <- data_frame(Accuracy = logtest$overall[1],
Recall = logtest$byClass[1],
Precision = logtest$byClass[3]) %>% print()## # A tibble: 1 x 3
## Accuracy Recall Precision
## <dbl> <dbl> <dbl>
## 1 0.835 0.837 0.854
K-Nearest Neighbour
eval_knn <- data_frame(Accuracy = knn.test$overall[1],
Recall = knn.test$byClass[1],
Precision = knn.test$byClass[3]) %>% print()## # A tibble: 1 x 3
## Accuracy Recall Precision
## <dbl> <dbl> <dbl>
## 1 0.857 0.898 0.846
KNN give accuracy of 85.71%, while logistic regression model give accuracy of 83.51%. KNN also could predict the actual patient with heart disease with recall value of 89.79%, compared to logistic regression model with value of 83.67%.