to classify whether a patient has heart disease or not based on the provided variables. For this classification we will build two models: Logistic Regression and K-Nearest Neighbor (KNN).
Load the required packages:
library(dplyr)   # data wrangling
library(rsample) # train/test splitting
library(class)   # knn()
library(caret)   # confusionMatrix() and resampling helpers
As a first step, let's load the data.
heart <- read.csv("data_input/heart.csv")
head(heart)
Check the structure of the data
str(heart)
## 'data.frame': 303 obs. of 14 variables:
## $ ï..age : int 63 37 41 56 57 57 56 44 52 57 ...
## $ sex : int 1 1 0 1 0 1 0 1 1 1 ...
## $ cp : int 3 2 1 1 0 0 1 1 2 2 ...
## $ trestbps: int 145 130 130 120 120 140 140 120 172 150 ...
## $ chol : int 233 250 204 236 354 192 294 263 199 168 ...
## $ fbs : int 1 0 0 0 0 0 0 0 1 0 ...
## $ restecg : int 0 1 0 1 1 1 0 1 1 1 ...
## $ thalach : int 150 187 172 178 163 148 153 173 162 174 ...
## $ exang : int 0 0 0 0 1 0 0 0 0 0 ...
## $ oldpeak : num 2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
## $ slope : int 0 0 2 2 2 1 1 2 2 2 ...
## $ ca : int 0 0 0 0 0 0 0 0 0 0 ...
## $ thal : int 1 2 2 2 2 1 2 3 3 2 ...
## $ target : int 1 1 1 1 1 1 1 1 1 1 ...
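Note the odd first column name ï..age: read.csv() absorbed the file's UTF-8 byte-order mark into the column name. One way to avoid this is to declare the encoding when reading; we leave the name as-is here so it matches the outputs below.
# optional fix, not applied in this analysis:
heart <- read.csv("data_input/heart.csv", fileEncoding = "UTF-8-BOM")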
Here is the data dictionary:
- age -> age in years
- sex -> (1 = male; 0 = female)
- cp -> chest pain type
- trestbps -> resting blood pressure (in mm Hg on admission to the hospital)
- chol -> serum cholesterol in mg/dl
- fbs -> fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
- restecg -> resting electrocardiographic results
- thalach -> maximum heart rate achieved
- exang -> exercise induced angina (1 = yes; 0 = no)
- oldpeak -> ST depression induced by exercise relative to rest
- slope -> the slope of the peak exercise ST segment
- ca -> number of major vessels (0-3) colored by fluoroscopy
- thal -> 1 = fixed defect; 2 = normal; 3 = reversible defect
- target -> 1 = has heart disease; 0 = no heart disease
Several variables are categorical and need to be converted to factors: sex, cp, fbs, exang, thal, and target.
heart1 <- heart %>%
mutate(sex = as.factor(sex),
cp = as.factor(cp),
fbs = as.factor(fbs),
exang = as.factor(exang),
thal = as.factor(thal),
target = as.factor(target))
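As a side note, the same conversion can be written more compactly with across() (requires dplyr >= 1.0.0):
# equivalent to the mutate() call above
heart1 <- heart %>%
  mutate(across(c(sex, cp, fbs, exang, thal, target), as.factor))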
Check for missing values:
anyNA(heart1)
## [1] FALSE
"FALSE" means there are no missing values.
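Had anyNA() returned TRUE, a per-column count would show where the gaps are:
colSums(is.na(heart1))  # number of NAs per column (all zero here)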
summary(heart1)
## ï..age sex cp trestbps chol fbs
## Min. :29.00 0: 96 0:143 Min. : 94.0 Min. :126.0 0:258
## 1st Qu.:47.50 1:207 1: 50 1st Qu.:120.0 1st Qu.:211.0 1: 45
## Median :55.00 2: 87 Median :130.0 Median :240.0
## Mean :54.37 3: 23 Mean :131.6 Mean :246.3
## 3rd Qu.:61.00 3rd Qu.:140.0 3rd Qu.:274.5
## Max. :77.00 Max. :200.0 Max. :564.0
## restecg thalach exang oldpeak slope
## Min. :0.0000 Min. : 71.0 0:204 Min. :0.00 Min. :0.000
## 1st Qu.:0.0000 1st Qu.:133.5 1: 99 1st Qu.:0.00 1st Qu.:1.000
## Median :1.0000 Median :153.0 Median :0.80 Median :1.000
## Mean :0.5281 Mean :149.6 Mean :1.04 Mean :1.399
## 3rd Qu.:1.0000 3rd Qu.:166.0 3rd Qu.:1.60 3rd Qu.:2.000
## Max. :2.0000 Max. :202.0 Max. :6.20 Max. :2.000
## ca thal target
## Min. :0.0000 0: 2 0:138
## 1st Qu.:0.0000 1: 18 1:165
## Median :0.0000 2:166
## Mean :0.7294 3:117
## 3rd Qu.:1.0000
## Max. :4.0000
The summary shows two 0 values in the thal variable; the data dictionary lists only levels 1-3, so let's inspect those rows.
heart1 %>%
filter(thal==0)
Comparing these two observations with the summary, nothing else about them looks anomalous, but thal = 0 stands for a missing value in this variable, so we drop both rows.
heart1 <- heart1 %>%
filter(thal!=0)
table(heart1$target)
##
## 0 1
## 137 164
The target classes are reasonably balanced (137 vs. 164), so we don't need to take any resampling action.
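Had the classes been badly skewed, caret offers simple rebalancing helpers; a sketch of what that could look like (not needed, and not run, for this data):
# downSample() randomly drops majority-class rows until the classes match
balanced <- downSample(x = heart1 %>% select(-target),
                       y = heart1$target,
                       yname = "target")
table(balanced$target)
Next, we split the data into 75% training and 25% testing sets, stratified by target to preserve the class proportions.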
set.seed(124)
splitted <- initial_split(heart1, prop = 0.75, strata = target)
heart_train <- training(splitted)
heart_test <- testing(splitted)
Check the target proportion in the training data and compare it to the original:
prop.table(table(heart_train$target))
##
## 0 1
## 0.4557522 0.5442478
prop.table(table(heart1$target))
##
## 0 1
## 0.4551495 0.5448505
Now fit a logistic regression model using all predictors:
heart_model <- glm(target ~ ., data = heart_train, family = "binomial")
summary(heart_model)
##
## Call:
## glm(formula = target ~ ., family = "binomial", data = heart_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7031 -0.3483 0.1887 0.5644 2.5136
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.989770 3.313846 -0.299 0.765186
## ï..age 0.018539 0.028546 0.649 0.516048
## sex1 -1.008176 0.594303 -1.696 0.089810 .
## cp1 1.397972 0.669245 2.089 0.036718 *
## cp2 2.023641 0.568660 3.559 0.000373 ***
## cp3 2.172573 0.757348 2.869 0.004122 **
## trestbps -0.016059 0.011989 -1.339 0.180428
## chol -0.001780 0.004441 -0.401 0.688580
## fbs1 -0.209264 0.639025 -0.327 0.743310
## restecg 0.652053 0.415747 1.568 0.116790
## thalach 0.020439 0.011962 1.709 0.087523 .
## exang1 -0.658821 0.475766 -1.385 0.166127
## oldpeak -0.374505 0.257358 -1.455 0.145617
## slope 0.640606 0.426476 1.502 0.133073
## ca -0.955366 0.254445 -3.755 0.000174 ***
## thal2 -0.143108 0.815993 -0.175 0.860782
## thal3 -1.456762 0.796775 -1.828 0.067501 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 311.53 on 225 degrees of freedom
## Residual deviance: 154.86 on 209 degrees of freedom
## AIC: 188.86
##
## Number of Fisher Scoring iterations: 6
Several predictors do not contribute significantly to the model. Rather than removing them by hand, we can feed the model into backward stepwise selection, which drops a predictor only if removing it improves (lowers) the AIC.
step(heart_model, direction = "backward", trace = 0)
##
## Call: glm(formula = target ~ sex + cp + restecg + thalach + exang +
## oldpeak + slope + ca + thal, family = "binomial", data = heart_train)
##
## Coefficients:
## (Intercept) sex1 cp1 cp2 cp3 restecg
## -1.86110 -0.90781 1.38345 1.92019 1.97448 0.64641
## thalach exang1 oldpeak slope ca thal2
## 0.01584 -0.72678 -0.42578 0.61315 -0.89164 -0.06270
## thal3
## -1.39826
##
## Degrees of Freedom: 225 Total (i.e. Null); 213 Residual
## Null Deviance: 311.5
## Residual Deviance: 157.3 AIC: 183.3
None of the coefficients is abnormally large, which would have indicated perfect separation, so we can save the reduced model.
heart_model2 <- glm(formula = target ~ sex + cp + restecg + thalach + exang +
oldpeak + slope + ca + thal, family = "binomial", data = heart_train)
summary(heart_model2)
##
## Call:
## glm(formula = target ~ sex + cp + restecg + thalach + exang +
## oldpeak + slope + ca + thal, family = "binomial", data = heart_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6086 -0.3760 0.2009 0.5462 2.4617
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.86110 1.76889 -1.052 0.292740
## sex1 -0.90781 0.52445 -1.731 0.083458 .
## cp1 1.38345 0.65875 2.100 0.035717 *
## cp2 1.92019 0.52882 3.631 0.000282 ***
## cp3 1.97448 0.72202 2.735 0.006244 **
## restecg 0.64641 0.39528 1.635 0.101983
## thalach 0.01584 0.01061 1.493 0.135561
## exang1 -0.72678 0.46232 -1.572 0.115946
## oldpeak -0.42578 0.25307 -1.682 0.092481 .
## slope 0.61315 0.42000 1.460 0.144321
## ca -0.89164 0.24301 -3.669 0.000243 ***
## thal2 -0.06270 0.79523 -0.079 0.937156
## thal3 -1.39826 0.76859 -1.819 0.068872 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 311.53 on 225 degrees of freedom
## Residual deviance: 157.35 on 213 degrees of freedom
## AIC: 183.35
##
## Number of Fisher Scoring iterations: 6
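As a quick check, the reduced model indeed has the lower AIC, which is the criterion step() minimizes:
AIC(heart_model, heart_model2)
# heart_model : 188.86 (full model)
# heart_model2: 183.35 (after backward elimination)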
Use the model to predict the probability of heart disease for each test observation:
heart_test$pred.target <- predict(heart_model2, heart_test, type = "response")
For now, let's use the default threshold of 0.5 to turn probabilities into labels.
heart_test$pred.label <- ifelse(heart_test$pred.target>0.5,"1","0")
heart_test <- heart_test %>%
mutate(pred.label= as.factor(pred.label))
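The 0.5 cutoff is only a convention. A minimal sketch of how other cutoffs could be compared on the test set (exploratory only; in practice the threshold should be validated on separate data):
cutoffs <- c(0.3, 0.4, 0.5, 0.6, 0.7)
sapply(cutoffs, function(ct) {
  pred <- factor(ifelse(heart_test$pred.target > ct, "1", "0"),
                 levels = levels(heart_test$target))
  mean(pred == heart_test$target)  # accuracy at this cutoff
})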
Now for the KNN model. Because KNN is distance-based and scale() only accepts numeric input, we first convert the factor predictors back to numeric.
heart_train1 <- heart_train %>%
mutate(sex = as.numeric(sex),
cp = as.numeric(cp),
fbs = as.numeric(fbs),
exang = as.numeric(exang),
thal = as.numeric(thal),
target = as.numeric(target))
heart_test1 <- heart_test %>%
mutate(sex = as.numeric(sex),
cp = as.numeric(cp),
fbs = as.numeric(fbs),
exang = as.numeric(exang),
thal = as.numeric(thal),
target = as.numeric(target)) %>%
select(-pred.label,-pred.target)
# scale the training predictors, then scale the test predictors with the
# TRAINING means and standard deviations, so no test information leaks in
heart_train_p <- scale(x = heart_train1[,-14])
heart_test_p <- scale(x = heart_test1[,-14],
                      center = attr(heart_train_p, "scaled:center"),
                      scale = attr(heart_train_p, "scaled:scale"))
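As a sanity check, the scaled training predictors should now have mean 0 and standard deviation 1 (the test set will deviate slightly, since it was scaled with the training parameters):
round(colMeans(heart_train_p), 3)      # all ~0
round(apply(heart_train_p, 2, sd), 3)  # all ~1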
#data label(target)
heart_train_label <- heart_train1[,14]
heart_test_label <- heart_test1[,14]
#find optimum k
round(sqrt(nrow(heart_train1)),0)
## [1] 15
The number of classes is 2, so an odd k avoids tied votes; sqrt(226) ≈ 15 is already odd, so we use k = 15.
heart_pred <- knn(train = heart_train_p,
test = heart_test_p,
cl = heart_train_label,
k=15)
#checking result of prediction
head(heart_pred)
## [1] 2 2 2 1 2 2
## Levels: 1 2
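sqrt(n) is only a rule of thumb. A minimal sketch of how a few odd k values could be compared on the test set (exploratory only; proper tuning would cross-validate on the training data):
ks <- seq(3, 21, by = 2)  # odd values avoid tied votes
acc <- sapply(ks, function(k) {
  pred <- knn(train = heart_train_p, test = heart_test_p,
              cl = heart_train_label, k = k)
  mean(as.character(pred) == as.character(heart_test_label))
})
data.frame(k = ks, accuracy = acc)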
First, evaluate the logistic regression predictions on the test set:
confusionMatrix(data = heart_test$pred.label,
                reference = heart_test$target,
                positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 28 3
## 1 6 38
##
## Accuracy : 0.88
## 95% CI : (0.7844, 0.9436)
## No Information Rate : 0.5467
## P-Value [Acc > NIR] : 5.906e-10
##
## Kappa : 0.7561
##
## Mcnemar's Test P-Value : 0.505
##
## Sensitivity : 0.9268
## Specificity : 0.8235
## Pos Pred Value : 0.8636
## Neg Pred Value : 0.9032
## Prevalence : 0.5467
## Detection Rate : 0.5067
## Detection Prevalence : 0.5867
## Balanced Accuracy : 0.8752
##
## 'Positive' Class : 1
##
The logistic regression results are good: Accuracy (88%), Sensitivity (93%), and Pos Pred Value (86%) are all above 80%.
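These figures come straight from the confusion matrix; recomputing them by hand makes the definitions explicit:
# rebuild the confusion matrix and derive the metrics manually
cm <- table(pred = heart_test$pred.label, actual = heart_test$target)
TP <- cm["1", "1"]; TN <- cm["0", "0"]
FP <- cm["1", "0"]; FN <- cm["0", "1"]
c(accuracy    = (TP + TN) / sum(cm),  # (38 + 28) / 75 = 0.88
  sensitivity = TP / (TP + FN),       # 38 / 41  = 0.9268
  pos_pred    = TP / (TP + FP))       # 38 / 44  = 0.8636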
#class checking
class(heart_test_label)
## [1] "numeric"
class(heart_pred)
## [1] "factor"
The confusionMatrix() function requires both arguments to be factors, so we convert heart_test_label first.
heart_test_label <- as.factor(heart_test_label)
confusionMatrix(data = heart_pred,
reference = heart_test_label,
positive = "2")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2
## 1 26 3
## 2 8 38
##
## Accuracy : 0.8533
## 95% CI : (0.7527, 0.9244)
## No Information Rate : 0.5467
## P-Value [Acc > NIR] : 1.664e-08
##
## Kappa : 0.7003
##
## Mcnemar's Test P-Value : 0.2278
##
## Sensitivity : 0.9268
## Specificity : 0.7647
## Pos Pred Value : 0.8261
## Neg Pred Value : 0.8966
## Prevalence : 0.5467
## Detection Rate : 0.5067
## Detection Prevalence : 0.6133
## Balanced Accuracy : 0.8458
##
## 'Positive' Class : 2
##
The KNN results are also good: Accuracy (85%), Sensitivity (93%), and Pos Pred Value (83%) are all above 80%.
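For a side-by-side view, the headline metrics from the two confusion matrices above:
# values taken from the confusionMatrix outputs
data.frame(model          = c("Logistic Regression", "KNN"),
           accuracy       = c(0.880, 0.853),
           sensitivity    = c(0.927, 0.927),
           pos_pred_value = c(0.864, 0.826))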
Both models predict the data well. Still, the logistic regression model is slightly better than KNN based on accuracy. For this data, accuracy is the metric to prioritize, because we want to predict as precisely as possible whether a patient has heart disease.