Hi, welcome to my Learning by Building (LBB) for Classification. In this project, I will try to analyze heart disease using a dataset I found on kaggle.com.
1 Load Library
Here are the libraries I will use in this LBB.
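The exact packages aren't shown in the original, so the block below is a sketch of the likely setup; it covers everything used later in this LBB.
library(dplyr)    # data wrangling: %>%, mutate(), glimpse()
library(ggplot2)  # plots and the custom theme
library(caret)    # confusionMatrix()
library(class)    # knn()
library(GGally)   # ggcorr() correlation plot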
2 Load Dataset
Let’s load the data for this project.
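A minimal sketch of the load step, assuming the Kaggle CSV is saved locally as heart.csv:
heart <- read.csv("heart.csv")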
The dataset has 14 columns, which are:
age : the patient’s age
sex : the patient’s sex, boolean (0 = female, 1 = male)
cp : chest pain type (4 values)
trestbps : resting blood pressure
chol : serum cholesterol in mg/dl
fbs : fasting blood sugar > 120 mg/dl, boolean (0 = no, 1 = yes)
restecg : resting electrocardiographic results (values 0, 1, 2)
thalach : maximum heart rate achieved
exang : exercise-induced angina, boolean (0 = no, 1 = yes)
oldpeak : ST depression induced by exercise relative to rest
slope : the slope of the peak exercise ST segment
ca : number of major vessels (0-3) colored by fluoroscopy
thal : 3 = normal; 6 = fixed defect; 7 = reversible defect
target : the target class, boolean (0 = healthy, 1 = not healthy)
In this project, I will use the ggplot theme I’ve already created:
# custom ggplot theme: white background, minimal grid lines, black text
white_theme <- theme(
  panel.background = element_rect(fill = "white"),
  plot.background = element_rect(fill = "white"),
  panel.grid.minor.x = element_blank(),
  panel.grid.major.x = element_blank(),
  panel.grid.minor.y = element_blank(),
  text = element_text(color = "black"),
  axis.text = element_text(color = "black"),
  strip.background = element_rect(fill = "snow3"),
  strip.text = element_text(colour = "black")
)
3 Exploratory Data Analysis (EDA)
Before we continue, let’s check the correlation between our predictors and the target, so we can identify candidate predictors.
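One way to run this check (a sketch assuming the GGally package; at this point every column is still numeric):
# correlation matrix of all columns, with coefficients printed on the tiles
ggcorr(heart, label = TRUE)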
Predictors that have a strong correlation with our target are slope, thalach, restecg, and cp. This will be useful when we build a glm() model.
Now, let’s check whether our dataset already has the proper data types.
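The structure check below presumably comes from glimpse():
glimpse(heart)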
## Rows: 303
## Columns: 14
## $ age <int> 63, 37, 41, 56, 57, 57, 56, 44, 52, 57, 54, 48, 49, 64, 58...
## $ sex <int> 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0...
## $ cp <int> 3, 2, 1, 1, 0, 0, 1, 1, 2, 2, 0, 2, 1, 3, 3, 2, 2, 3, 0, 3...
## $ trestbps <int> 145, 130, 130, 120, 120, 140, 140, 120, 172, 150, 140, 130...
## $ chol <int> 233, 250, 204, 236, 354, 192, 294, 263, 199, 168, 239, 275...
## $ fbs <int> 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0...
## $ restecg <int> 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1...
## $ thalach <int> 150, 187, 172, 178, 163, 148, 153, 173, 162, 174, 160, 139...
## $ exang <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0...
## $ oldpeak <dbl> 2.3, 3.5, 1.4, 0.8, 0.6, 0.4, 1.3, 0.0, 0.5, 1.6, 1.2, 0.2...
## $ slope <int> 0, 0, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 0, 2, 2...
## $ ca <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2...
## $ thal <int> 1, 2, 2, 2, 2, 1, 2, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...
## $ target <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
We need to change some data types before analyzing the data. I will convert target into a factor with the labels Healthy and Not Healthy, and also convert the binary predictors into factors.
# recode the binary columns and the target as labeled factors
heart <- heart %>%
  mutate(sex = factor(sex, levels = c(0,1), labels = c("Female", "Male")),
         fbs = factor(fbs, levels = c(0,1), labels = c("False", "True")),
         exang = factor(exang, levels = c(0,1), labels = c("No", "Yes")),
         target = factor(target, levels = c(0,1),
                         labels = c("Healthy", "Not Healthy")))
glimpse(heart)
## Rows: 303
## Columns: 14
## $ age <int> 63, 37, 41, 56, 57, 57, 56, 44, 52, 57, 54, 48, 49, 64, 58...
## $ sex <fct> Male, Male, Female, Male, Female, Male, Female, Male, Male...
## $ cp <int> 3, 2, 1, 1, 0, 0, 1, 1, 2, 2, 0, 2, 1, 3, 3, 2, 2, 3, 0, 3...
## $ trestbps <int> 145, 130, 130, 120, 120, 140, 140, 120, 172, 150, 140, 130...
## $ chol <int> 233, 250, 204, 236, 354, 192, 294, 263, 199, 168, 239, 275...
## $ fbs <fct> True, False, False, False, False, False, False, False, Tru...
## $ restecg <int> 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1...
## $ thalach <int> 150, 187, 172, 178, 163, 148, 153, 173, 162, 174, 160, 139...
## $ exang <fct> No, No, No, No, Yes, No, No, No, No, No, No, No, No, Yes, ...
## $ oldpeak <dbl> 2.3, 3.5, 1.4, 0.8, 0.6, 0.4, 1.3, 0.0, 0.5, 1.6, 1.2, 0.2...
## $ slope <int> 0, 0, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 0, 2, 2...
## $ ca <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2...
## $ thal <int> 1, 2, 2, 2, 2, 1, 2, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...
## $ target <fct> Not Healthy, Not Healthy, Not Healthy, Not Healthy, Not He...
Now, let’s check whether there are missing values in our dataset
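A check along these lines produces the output below (the exact call isn’t shown in the original):
# count missing values per column
colSums(is.na(heart))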
## age sex cp trestbps chol fbs restecg thalach
## 0 0 0 0 0 0 0 0
## exang oldpeak slope ca thal target
## 0 0 0 0 0 0
There are no missing values in our data
3.1 Pre-Processing Data
Before we continue, let’s check the class proportions of our target; this step tells us whether our data is balanced.
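A sketch of the proportion check (the exact call isn’t shown):
prop.table(table(heart$target))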
##
## Healthy Not Healthy
## 0.4554455 0.5445545
The data has reasonably balanced class proportions.
Now, let’s look at a summary of our data.
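The output below presumably comes from:
summary(heart)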
## age sex cp trestbps chol
## Min. :29.00 Female: 96 Min. :0.000 Min. : 94.0 Min. :126.0
## 1st Qu.:47.50 Male :207 1st Qu.:0.000 1st Qu.:120.0 1st Qu.:211.0
## Median :55.00 Median :1.000 Median :130.0 Median :240.0
## Mean :54.37 Mean :0.967 Mean :131.6 Mean :246.3
## 3rd Qu.:61.00 3rd Qu.:2.000 3rd Qu.:140.0 3rd Qu.:274.5
## Max. :77.00 Max. :3.000 Max. :200.0 Max. :564.0
## fbs restecg thalach exang oldpeak
## False:258 Min. :0.0000 Min. : 71.0 No :204 Min. :0.00
## True : 45 1st Qu.:0.0000 1st Qu.:133.5 Yes: 99 1st Qu.:0.00
## Median :1.0000 Median :153.0 Median :0.80
## Mean :0.5281 Mean :149.6 Mean :1.04
## 3rd Qu.:1.0000 3rd Qu.:166.0 3rd Qu.:1.60
## Max. :2.0000 Max. :202.0 Max. :6.20
## slope ca thal target
## Min. :0.000 Min. :0.0000 Min. :0.000 Healthy :138
## 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.:2.000 Not Healthy:165
## Median :1.000 Median :0.0000 Median :2.000
## Mean :1.399 Mean :0.7294 Mean :2.314
## 3rd Qu.:2.000 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :2.000 Max. :4.0000 Max. :3.000
From the summary above, the youngest patient is 29 years old and the oldest is 77; there are 96 female patients and 207 male patients. cp takes values from 0 to 3, while restecg takes the values 0, 1, and 2. trestbps, chol, and thalach take a wide range of values.
3.2 Cross Validation
Let’s split our data into train and test sets. I will split it with an 85% and 15% proportion.
# make the sampling reproducible across R versions
RNGkind(sample.kind = "Rounding")
set.seed(100)

# draw 85% of the row indices for training; the rest go to test
index <- sample(nrow(heart), nrow(heart)*0.85)
train <- heart[index,]
test <- heart[-index,]
Now, let’s check the proportions of the target column in our train and test data.
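A sketch of the checks (the exact calls aren’t shown):
prop.table(table(train$target))
prop.table(table(test$target))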
##
## Healthy Not Healthy
## 0.4552529 0.5447471
##
## Healthy Not Healthy
## 0.4565217 0.5434783
The target proportions in our train and test data are already balanced.
4 Modelling
4.1 Logistic Regression
Now, let’s compare a model that uses all predictors with a model that uses only the predictors that have a strong correlation with the target.
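A sketch of the first fit; the object name model1 is my assumption, chosen to match the model2 referenced later:
# logistic regression with every predictor
model1 <- glm(target ~ ., family = "binomial", data = train)
model1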
##
## Call: glm(formula = target ~ ., family = "binomial", data = train)
##
## Coefficients:
## (Intercept) age sexMale cp trestbps chol
## 4.814425 -0.004785 -1.750236 0.939859 -0.021192 -0.004124
## fbsTrue restecg thalach exangYes oldpeak slope
## 0.050349 0.276302 0.014837 -1.059257 -0.706850 0.643376
## ca thal
## -0.758366 -0.891504
##
## Degrees of Freedom: 256 Total (i.e. Null); 243 Residual
## Null Deviance: 354.2
## Residual Deviance: 175.5 AIC: 203.5
Using all predictors, we get an AIC of 203.5.
Now, let’s check the model using slope, thalach, restecg, and cp:
step(glm(target ~ slope + thalach + restecg + cp, train, family = "binomial"), direction = "backward")
## Start: AIC=265.7
## target ~ slope + thalach + restecg + cp
##
## Df Deviance AIC
## - restecg 1 256.89 264.89
## <none> 255.70 265.70
## - thalach 1 265.33 273.33
## - slope 1 268.92 276.92
## - cp 1 292.24 300.24
##
## Step: AIC=264.89
## target ~ slope + thalach + cp
##
## Df Deviance AIC
## <none> 256.89 264.89
## - thalach 1 266.41 272.41
## - slope 1 270.19 276.19
## - cp 1 294.38 300.38
##
## Call: glm(formula = target ~ slope + thalach + cp, family = "binomial",
## data = train)
##
## Coefficients:
## (Intercept) slope thalach cp
## -5.59122 1.02332 0.02338 0.94153
##
## Degrees of Freedom: 256 Total (i.e. Null); 253 Residual
## Null Deviance: 354.2
## Residual Deviance: 256.9 AIC: 264.9
This AIC (264.9) is higher than the AIC we got with all predictors (203.5). So, we will just use all predictors.
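The detailed summary below can be reproduced with (assuming the full model is the model1 object from above):
summary(model1)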
##
## Call:
## glm(formula = target ~ ., family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6436 -0.3717 0.1510 0.5962 2.5437
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.814425 2.860851 1.683 0.092401 .
## age -0.004785 0.025127 -0.190 0.848966
## sexMale -1.750236 0.518216 -3.377 0.000732 ***
## cp 0.939859 0.206652 4.548 0.00000542 ***
## trestbps -0.021192 0.011537 -1.837 0.066242 .
## chol -0.004124 0.004084 -1.010 0.312601
## fbsTrue 0.050349 0.598407 0.084 0.932947
## restecg 0.276302 0.380824 0.726 0.468122
## thalach 0.014837 0.011216 1.323 0.185885
## exangYes -1.059257 0.459936 -2.303 0.021276 *
## oldpeak -0.706850 0.260123 -2.717 0.006580 **
## slope 0.643376 0.430717 1.494 0.135246
## ca -0.758366 0.223014 -3.401 0.000673 ***
## thal -0.891504 0.318591 -2.798 0.005138 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 354.22 on 256 degrees of freedom
## Residual deviance: 175.54 on 243 degrees of freedom
## AIC: 203.54
##
## Number of Fisher Scoring iterations: 6
Now, let’s find out whether stepwise selection gives us a smaller AIC.
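A sketch of the stepwise run; the object name model2 matches the one used in the text below:
# backward elimination starting from the full model
model2 <- step(model1, direction = "backward")
summary(model2)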
## Start: AIC=203.54
## target ~ age + sex + cp + trestbps + chol + fbs + restecg + thalach +
## exang + oldpeak + slope + ca + thal
##
## Df Deviance AIC
## - fbs 1 175.54 201.54
## - age 1 175.57 201.57
## - restecg 1 176.06 202.06
## - chol 1 176.53 202.53
## - thalach 1 177.34 203.34
## <none> 175.54 203.54
## - slope 1 177.75 203.75
## - trestbps 1 179.03 205.03
## - exang 1 180.85 206.85
## - thal 1 183.59 209.59
## - oldpeak 1 183.60 209.60
## - ca 1 187.77 213.77
## - sex 1 188.78 214.78
## - cp 1 199.76 225.76
##
## Step: AIC=201.54
## target ~ age + sex + cp + trestbps + chol + restecg + thalach +
## exang + oldpeak + slope + ca + thal
##
## Df Deviance AIC
## - age 1 175.58 199.58
## - restecg 1 176.08 200.08
## - chol 1 176.54 200.54
## - thalach 1 177.36 201.36
## <none> 175.54 201.54
## - slope 1 177.76 201.76
## - trestbps 1 179.06 203.06
## - exang 1 180.85 204.85
## - oldpeak 1 183.70 207.70
## - thal 1 183.82 207.82
## - ca 1 187.89 211.89
## - sex 1 188.90 212.90
## - cp 1 200.38 224.38
##
## Step: AIC=199.58
## target ~ sex + cp + trestbps + chol + restecg + thalach + exang +
## oldpeak + slope + ca + thal
##
## Df Deviance AIC
## - restecg 1 176.14 198.14
## - chol 1 176.65 198.65
## <none> 175.58 199.58
## - slope 1 177.82 199.82
## - thalach 1 178.01 200.01
## - trestbps 1 179.45 201.45
## - exang 1 180.85 202.85
## - oldpeak 1 183.71 205.71
## - thal 1 183.85 205.85
## - ca 1 188.70 210.70
## - sex 1 188.95 210.95
## - cp 1 200.41 222.41
##
## Step: AIC=198.14
## target ~ sex + cp + trestbps + chol + thalach + exang + oldpeak +
## slope + ca + thal
##
## Df Deviance AIC
## - chol 1 177.52 197.52
## <none> 176.14 198.14
## - slope 1 178.40 198.40
## - thalach 1 178.67 198.67
## - trestbps 1 180.16 200.16
## - exang 1 181.33 201.33
## - oldpeak 1 184.15 204.15
## - thal 1 184.17 204.17
## - ca 1 189.16 209.16
## - sex 1 189.80 209.80
## - cp 1 200.75 220.75
##
## Step: AIC=197.52
## target ~ sex + cp + trestbps + thalach + exang + oldpeak + slope +
## ca + thal
##
## Df Deviance AIC
## <none> 177.52 197.52
## - thalach 1 179.67 197.67
## - slope 1 179.67 197.67
## - trestbps 1 181.56 199.56
## - exang 1 182.79 200.79
## - oldpeak 1 186.52 204.52
## - thal 1 186.57 204.57
## - sex 1 189.82 207.82
## - ca 1 190.58 208.58
## - cp 1 203.07 221.07
##
## Call:
## glm(formula = target ~ sex + cp + trestbps + thalach + exang +
## oldpeak + slope + ca + thal, family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6138 -0.3894 0.1584 0.5505 2.5092
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.82212 2.14716 1.780 0.075062 .
## sexMale -1.58325 0.48331 -3.276 0.001053 **
## cp 0.94813 0.20278 4.676 0.00000293 ***
## trestbps -0.02177 0.01099 -1.981 0.047630 *
## thalach 0.01444 0.01002 1.442 0.149344
## exangYes -1.03926 0.45284 -2.295 0.021735 *
## oldpeak -0.73241 0.25702 -2.850 0.004377 **
## slope 0.62350 0.42402 1.470 0.141436
## ca -0.74482 0.21581 -3.451 0.000558 ***
## thal -0.92191 0.30936 -2.980 0.002882 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 354.22 on 256 degrees of freedom
## Residual deviance: 177.52 on 247 degrees of freedom
## AIC: 197.52
##
## Number of Fisher Scoring iterations: 6
The AIC of model2 is smaller than that of model1. With the stepwise model, we get an AIC of 197.52 using the combination of sex, cp, trestbps, thalach, exang, oldpeak, slope, ca, and thal.
Now, with model2, we will try to make predictions on our test data.
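A sketch of the prediction step; the pred column is the one used by the density plot below:
# predicted probability of the Not Healthy class for each test patient
test$pred <- predict(model2, newdata = test, type = "response")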
Now, let’s see the distribution of the predicted probabilities on test.
ggplot(test, aes(x = pred)) +
geom_density(lwd = 1) +
labs(title = "Distribution of Probability of Data Prediction",
x = "Probability",
y = "Density") +
theme(plot.title = element_text(hjust = 0.5)) +
white_theme
From the graph above, we can see that the predicted probabilities lean toward Not Healthy.
Whenever the predicted probability is greater than 0.5, we label the patient Not Healthy. Since we want to predict Not Healthy patients, we must set Not Healthy as the positive class in confusionMatrix().
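A sketch of that step; pred_label is a helper name I’m assuming, while logreg_cm matches the object used later in the model-evaluation section:
# label a patient Not Healthy when the predicted probability exceeds 0.5
test$pred_label <- factor(ifelse(test$pred > 0.5, "Not Healthy", "Healthy"),
                          levels = c("Healthy", "Not Healthy"))

# evaluate with Not Healthy as the positive class
logreg_cm <- confusionMatrix(data = test$pred_label,
                             reference = test$target,
                             positive = "Not Healthy")
logreg_cm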
## Confusion Matrix and Statistics
##
## Reference
## Prediction Healthy Not Healthy
## Healthy 17 3
## Not Healthy 4 22
##
## Accuracy : 0.8478
## 95% CI : (0.7113, 0.9366)
## No Information Rate : 0.5435
## P-Value [Acc > NIR] : 0.000013
##
## Kappa : 0.6922
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.8800
## Specificity : 0.8095
## Pos Pred Value : 0.8462
## Neg Pred Value : 0.8500
## Prevalence : 0.5435
## Detection Rate : 0.4783
## Detection Prevalence : 0.5652
## Balanced Accuracy : 0.8448
##
## 'Positive' Class : Not Healthy
##
We have just finished our Logistic Regression model. Now, let’s move on to K-Nearest Neighbor.
4.2 K-Nearest Neighbor
Before we continue to k-NN, let’s scale the numeric columns.
# identify the numeric columns in each set
numerictrain <- sapply(X = train, FUN = is.numeric)
numerictest <- sapply(X = test, FUN = is.numeric)

# scale the training predictors, then apply the same
# centering and scaling values to the test predictors
train_x <- scale(train[,numerictrain])
train_y <- train$target
test_x <- scale(test[,numerictest],
                center = attr(train_x, "scaled:center"),
                scale = attr(train_x, "scaled:scale"))
test_y <- test$target
Now, let’s determine the number k from train_x; a common rule of thumb is the square root of the number of observations.
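A sketch of that computation, assuming the square-root rule:
# square root of the number of training rows, rounded
round(sqrt(nrow(train_x)))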
## [1] 16
Since we have 2 target classes, an odd number is preferable to avoid ties, so let’s use 15. Now, let’s make a prediction with k-NN.
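A sketch, assuming the class package’s knn() with k = 15:
# classify each test row by its 15 nearest scaled neighbours
knn_pred <- knn(train = train_x, test = test_x, cl = train_y, k = 15)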
Just as with model2, it’s time to evaluate the predictions with confusionMatrix().
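A sketch, with knn_cm matching the object used in the model-evaluation section:
# evaluate k-NN predictions, with Not Healthy as the positive class
knn_cm <- confusionMatrix(data = knn_pred,
                          reference = test_y,
                          positive = "Not Healthy")
knn_cm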
## Confusion Matrix and Statistics
##
## Reference
## Prediction Healthy Not Healthy
## Healthy 14 2
## Not Healthy 7 23
##
## Accuracy : 0.8043
## 95% CI : (0.6609, 0.9064)
## No Information Rate : 0.5435
## P-Value [Acc > NIR] : 0.0002066
##
## Kappa : 0.5981
##
## Mcnemar's Test P-Value : 0.1824224
##
## Sensitivity : 0.9200
## Specificity : 0.6667
## Pos Pred Value : 0.7667
## Neg Pred Value : 0.8750
## Prevalence : 0.5435
## Detection Rate : 0.5000
## Detection Prevalence : 0.6522
## Balanced Accuracy : 0.7933
##
## 'Positive' Class : Not Healthy
##
5 Model Evaluation of Logistic Regression and k-NN
# collect Accuracy, Recall (Sensitivity), Specificity, and Precision (Pos Pred Value)
eval_log <- data.frame(Accuracy = logreg_cm$overall[1],
Recall = logreg_cm$byClass[1],
Specificity = logreg_cm$byClass[2],
Precision = logreg_cm$byClass[3])
eval_knn <- data.frame(Accuracy = knn_cm$overall[1],
Recall = knn_cm$byClass[1],
Specificity = knn_cm$byClass[2],
Precision = knn_cm$byClass[3])
eval_total <- rbind(round(eval_log,2), round(eval_knn,2))
rownames(eval_total) <- c("Logistic Regression", "kNN")
eval_total
From the results above, we can see that Recall is the strongest metric for both the Logistic Regression and k-NN models (with k-NN achieving the higher Recall), while Logistic Regression gives the higher Accuracy, Specificity, and Precision.
6 Conclusion
In this case, since the model predicts whether a patient is Healthy or Not Healthy, I will focus on the Precision metric of the Logistic Regression model. Precision measures how reliably we identify the patients who truly have heart problems, so we won’t give treatment to the wrong person; a wrong prediction here could be fatal.
Our Logistic Regression model seems to perform better than our k-NN model. I believe this is because the categorical variables we omitted from the k-NN model have a strong effect on model quality.