The main objective of this project is to predict whether a given person has heart disease, based on several contributing factors such as age, cholesterol level, and type of chest pain.
The algorithms we use for this problem are Logistic Regression and K-Nearest Neighbour (K-NN). We start by loading the required libraries:
library(dplyr)
library(gtools)
library(gmodels)
library(ggplot2)
library(class)
library(tidyr)
library(caret)
library(ROCR)
First, we need to import the dataset to R. The data used in the project contains information about patients with heart disease based on several factors. You can download it from Kaggle.
heart <- read.csv("heart.csv")
str(heart)
## 'data.frame': 303 obs. of 14 variables:
## $ age : int 63 37 41 56 57 57 56 44 52 57 ...
## $ sex : int 1 1 0 1 0 1 0 1 1 1 ...
## $ cp : int 3 2 1 1 0 0 1 1 2 2 ...
## $ trestbps: int 145 130 130 120 120 140 140 120 172 150 ...
## $ chol : int 233 250 204 236 354 192 294 263 199 168 ...
## $ fbs : int 1 0 0 0 0 0 0 0 1 0 ...
## $ restecg : int 0 1 0 1 1 1 0 1 1 1 ...
## $ thalach : int 150 187 172 178 163 148 153 173 162 174 ...
## $ exang : int 0 0 0 0 1 0 0 0 0 0 ...
## $ oldpeak : num 2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
## $ slope : int 0 0 2 2 2 1 1 2 2 2 ...
## $ ca : int 0 0 0 0 0 0 0 0 0 0 ...
## $ thal : int 1 2 2 2 2 1 2 3 3 2 ...
## $ target : int 1 1 1 1 1 1 1 1 1 1 ...
Here are some brief explanations of the variables :
age : Age of the individual in years
sex : Gender of the individual (1 = male; 0 = female)
cp : Chest pain type (0 = typical angina; 1 = atypical angina; 2 = non-anginal pain; 3 = asymptomatic)
trestbps : Resting blood pressure (in mm Hg on admission to the hospital)
chol : Serum cholesterol in mg/dl
fbs : Fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
restecg : Resting electrocardiographic results (0 = normal; 1 = ST-T wave abnormality; 2 = left ventricular hypertrophy)
thalach : Maximum heart rate achieved
exang : Exercise-induced angina (1 = yes; 0 = no)
oldpeak : ST depression induced by exercise relative to rest
slope : The slope of the peak exercise ST segment (0 = upsloping; 1 = flat; 2 = downsloping)
ca : Number of major vessels (0-4) colored by fluoroscopy
thal : Thalassemia, an inherited blood disorder that affects the body's ability to produce hemoglobin and red blood cells (1 = normal; 2 = fixed defect; 3 = reversible defect)
target : The predicted attribute, i.e. the diagnosis of heart disease (angiographic disease status): 0 = less than 50% diameter narrowing; 1 = more than 50% diameter narrowing
Exploratory Data Analysis is the phase where we explore the data variables to understand the condition of the data before modeling.
Check the distribution of all variables:
summary(heart)
## age sex cp trestbps
## Min. :29.00 Min. :0.0000 Min. :0.000 Min. : 94.0
## 1st Qu.:47.50 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:120.0
## Median :55.00 Median :1.0000 Median :1.000 Median :130.0
## Mean :54.37 Mean :0.6832 Mean :0.967 Mean :131.6
## 3rd Qu.:61.00 3rd Qu.:1.0000 3rd Qu.:2.000 3rd Qu.:140.0
## Max. :77.00 Max. :1.0000 Max. :3.000 Max. :200.0
## chol fbs restecg thalach
## Min. :126.0 Min. :0.0000 Min. :0.0000 Min. : 71.0
## 1st Qu.:211.0 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:133.5
## Median :240.0 Median :0.0000 Median :1.0000 Median :153.0
## Mean :246.3 Mean :0.1485 Mean :0.5281 Mean :149.6
## 3rd Qu.:274.5 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:166.0
## Max. :564.0 Max. :1.0000 Max. :2.0000 Max. :202.0
## exang oldpeak slope ca
## Min. :0.0000 Min. :0.00 Min. :0.000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.00 1st Qu.:1.000 1st Qu.:0.0000
## Median :0.0000 Median :0.80 Median :1.000 Median :0.0000
## Mean :0.3267 Mean :1.04 Mean :1.399 Mean :0.7294
## 3rd Qu.:1.0000 3rd Qu.:1.60 3rd Qu.:2.000 3rd Qu.:1.0000
## Max. :1.0000 Max. :6.20 Max. :2.000 Max. :4.0000
## thal target
## Min. :0.000 Min. :0.0000
## 1st Qu.:2.000 1st Qu.:0.0000
## Median :2.000 Median :1.0000
## Mean :2.314 Mean :0.5446
## 3rd Qu.:3.000 3rd Qu.:1.0000
## Max. :3.000 Max. :1.0000
Check whether there are any missing values:
colSums(is.na(heart))
## age sex cp trestbps chol fbs restecg thalach
## 0 0 0 0 0 0 0 0
## exang oldpeak slope ca thal target
## 0 0 0 0 0 0
Observing the summary and str output above, we can make the following points:
sex cannot be a continuous variable, since per our description it is either male or female. Hence we convert sex from integer to factor and label it to avoid any further confusion.
cp cannot be a continuous variable, since it is the type of chest pain. We therefore convert cp to a factor and label it for convenience.
fbs cannot be a continuous variable or an integer, since it only indicates whether the fasting blood sugar level is above 120 mg/dl or not. Therefore we convert it to a factor and label it.
restecg should be a factor, since it is the type of ECG result and cannot stay an integer. So we convert it to a factor and label it.
exang should be a factor as per the dataset description: exercise-induced angina either happens or it does not (yes or no). Therefore we convert the variable to a factor and label it.
slope cannot be an integer, since it is the type of slope observed in the ECG. Therefore we convert the variable to a factor and label it.
ca, as per the description of our dataset, cannot be an integer. Therefore we convert the variable to a factor.
thal cannot be an integer, since it is the type of thalassemia, which is not numeric. Therefore we convert the variable to a factor and label it.
target is the predicted variable and tells us whether the individual has heart disease or not. Therefore we convert the variable to a factor and label it for convenience.
According to the observations above, we implement the changes:
heart <- heart %>%
mutate(sex = factor(sex, levels = c(0,1), labels = c("Female", "Male")),
cp = factor(cp, levels = c(0,1,2,3), labels = c("typical","atypical","non-anginal","asymptomatic")),
fbs =factor(fbs, levels = c(0,1), labels = c("False", "True")),
restecg =factor(restecg, levels = c(0,1,2), labels = c("normal","stt","hypertrophy")),
exang = factor(exang, levels = c(0,1), labels = c("No", "Yes")),
slope = factor(slope, levels = c(0,1,2), labels = c("upsloping","flat","downsloping")),
ca = factor(ca),
thal = factor(thal),
target = factor(target, levels = c(0,1),
labels = c("Health", "Not Health")))
glimpse(heart)
## Rows: 303
## Columns: 14
## $ age <int> 63, 37, 41, 56, 57, 57, 56, 44, 52, 57, 54, 48, 49, 64, 58, 5~
## $ sex <fct> Male, Male, Female, Male, Female, Male, Female, Male, Male, M~
## $ cp <fct> asymptomatic, non-anginal, atypical, atypical, typical, typic~
## $ trestbps <int> 145, 130, 130, 120, 120, 140, 140, 120, 172, 150, 140, 130, 1~
## $ chol <int> 233, 250, 204, 236, 354, 192, 294, 263, 199, 168, 239, 275, 2~
## $ fbs <fct> True, False, False, False, False, False, False, False, True, ~
## $ restecg <fct> normal, stt, normal, stt, stt, stt, normal, stt, stt, stt, st~
## $ thalach <int> 150, 187, 172, 178, 163, 148, 153, 173, 162, 174, 160, 139, 1~
## $ exang <fct> No, No, No, No, Yes, No, No, No, No, No, No, No, No, Yes, No,~
## $ oldpeak <dbl> 2.3, 3.5, 1.4, 0.8, 0.6, 0.4, 1.3, 0.0, 0.5, 1.6, 1.2, 0.2, 0~
## $ slope <fct> upsloping, upsloping, downsloping, downsloping, downsloping, ~
## $ ca <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0~
## $ thal <fct> 1, 2, 2, 2, 2, 1, 2, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3~
## $ target <fct> Not Health, Not Health, Not Health, Not Health, Not Health, N~
And these are the proportions of our target classes:
prop.table(table(heart$target))
##
## Health Not Health
## 0.4554455 0.5445545
Before we train our models, we must split the data into training and test sets. We will use the training data to fit the models and the test data to validate how the models handle unseen data. The data will be split with a 75/25 ratio (75% of the data will be used for training, 25% for testing).
set.seed(100)
index<-sample(nrow(heart),0.75*nrow(heart))
train<-heart[index,]
test<-heart[-index,]
This is the equation of a logistic regression model. The left-hand side is called the log-odds or logit. On the right-hand side, B0 is the model intercept and B1 is the coefficient of the feature X.
\[\log\left(\frac{p(X)}{1-p(X)}\right)= B_0 + B_1 X\]
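To see how the logit maps onto a probability, here is a minimal sketch using hypothetical values of B0, B1, and X (plogis() in base R is the inverse of the logit, equivalent to inv.logit() from gtools):
# Hypothetical intercept, coefficient, and feature value, purely for illustration
b0 <- -6
b1 <- 0.1
x  <- 55
# Linear predictor (log-odds) and the corresponding probability
log_odds <- b0 + b1 * x
plogis(log_odds)   # same as exp(log_odds) / (1 + exp(log_odds))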
Generate the model on training data and then validating the model with testing data.
model_lgr<-glm(target~.,data = train,family = "binomial")
# family = " binomial" means it contains only two outcomes.
summary(model_lgr)
##
## Call:
## glm(formula = target ~ ., family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.03957 -0.26644 0.08235 0.42215 2.97504
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.864308 4.269846 0.202 0.839587
## age 0.036153 0.031524 1.147 0.251452
## sexMale -2.082095 0.707450 -2.943 0.003249 **
## cpatypical 1.357814 0.730943 1.858 0.063223 .
## cpnon-anginal 1.871870 0.616335 3.037 0.002389 **
## cpasymptomatic 3.473455 1.021746 3.400 0.000675 ***
## trestbps -0.027971 0.015272 -1.832 0.067024 .
## chol -0.006059 0.006112 -0.991 0.321497
## fbsTrue 0.427703 0.695171 0.615 0.538390
## restecgstt 0.958900 0.493394 1.943 0.051958 .
## restecghypertrophy -1.049684 3.174112 -0.331 0.740870
## thalach 0.009176 0.013970 0.657 0.511258
## exangYes -0.704661 0.542541 -1.299 0.194007
## oldpeak -0.282308 0.306674 -0.921 0.357286
## slopeflat -0.069356 1.187088 -0.058 0.953410
## slopedownsloping 1.573649 1.290392 1.220 0.222650
## ca1 -2.618266 0.658078 -3.979 6.93e-05 ***
## ca2 -4.114748 1.041030 -3.953 7.73e-05 ***
## ca3 -2.171650 1.016714 -2.136 0.032683 *
## ca4 1.056279 2.046429 0.516 0.605745
## thal1 2.867969 2.767630 1.036 0.300083
## thal2 2.822062 2.645828 1.067 0.286149
## thal3 1.190640 2.643434 0.450 0.652412
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 313.10 on 226 degrees of freedom
## Residual deviance: 125.41 on 204 degrees of freedom
## AIC: 171.41
##
## Number of Fisher Scoring iterations: 6
Then we can predict using the model_lgr we created before. We will use the default value of 0.5 as the classification threshold / cutoff.
test$pred<-predict(model_lgr,test,type = "response")
test$pred<-ifelse(test$pred < 0.5,"Health","Not Health")
rmarkdown::paged_table(data.frame(test$pred))
Interpretation
The logistic regression model is an interpretable model: we can see how each variable contributes to the probability of the outcome. This can be done by creating a data frame of the model coefficients and transforming the values with inv.logit(), which converts the log-odds values into probabilities.
# Return the probability value
model_logit <- model_lgr$coefficients %>%
inv.logit() %>%
data.frame()
rmarkdown::paged_table(model_logit)
So, from the data frame above we can interpret how each coefficient translates into a probability; one example is sketched below.
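For illustration, an individual coefficient is often read as an odds ratio rather than a probability. A minimal sketch using the sexMale estimate from the model summary above (the exact value depends on your training split):
# Extract the sexMale coefficient from the fitted model
coef_sex <- coef(model_lgr)["sexMale"]
# exp(coefficient) is the odds ratio: how the odds of the positive class
# ("Not Health") change for males relative to females, other variables held constant
exp(coef_sex)
# inv.logit() is the transformation applied to all coefficients in model_logit above
inv.logit(coef_sex)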
K-NN classifies the outcome by looking at the nearest "neighbours". In other words, K-NN looks at the class of the data point(s) with the shortest distance to the data we want to classify. The distance is measured with the Euclidean distance, which penalizes neighbours that are further away.
\[d(x,y)=\sqrt{\sum_i(x_i - y_i)^2}\]
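As a quick illustration, the Euclidean distance between two observations can be computed directly in R; a minimal sketch with two made-up numeric vectors:
# Two hypothetical observations described by three numeric features
x <- c(63, 145, 233)
y <- c(37, 130, 250)
# Euclidean distance: square root of the sum of squared differences
sqrt(sum((x - y)^2))
# The same result using the built-in dist() function
dist(rbind(x, y), method = "euclidean")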
Scaling
As seen in the overview of our new data frame above, some variables have very different ranges from each other. Before we create our K-NN model, we need to scale the predictors in the training data so that no predictor dominates only because its scale is larger than the others.
numerictrain <- sapply(X = train, FUN = is.numeric)
numerictest <- sapply(X = test, FUN = is.numeric)
train_x <- scale(train[,numerictrain])
train_y <- train$target
test_x <- scale(test[,numerictest],
center = attr(train_x, "scaled:center"),
scale = attr(train_x, "scaled:scale"))
test_y <- test$target
Then let's pick the optimum k from train_x; a common rule of thumb is to start near the square root of the number of training observations.
nrow(train_x)
## [1] 227
sqrt(nrow(train_x))
## [1] 15.06652
Because the number of target classes is even (two classes), k is usually chosen to be odd to avoid a tie in the majority vote. As a quick check we can also scan a few candidate values of k (see the sketch below), and then make predictions with the K-NN model.
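A minimal, purely illustrative sketch comparing a handful of candidate k values by test-set accuracy (the values of k chosen here are arbitrary and not part of the original workflow):
# Compare test-set accuracy for a few candidate values of k
for (k_try in c(13, 15, 16, 17, 19)) {
  pred_try <- knn(train = train_x, test = test_x, cl = train_y, k = k_try)
  print(paste("k =", k_try, "accuracy =", round(mean(pred_try == test_y), 3)))
}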
knn <- knn(train = train_x,
test = test_x,
cl = train_y,
k = 16)
rmarkdown::paged_table(data.frame(knn))
We will evaluate both the logistic regression and K-NN models with a confusion matrix. A confusion matrix is a table that shows four different categories: True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN).
knitr::include_graphics("cf.PNG")The performance will be the Accuracy, Sensitivity/Recall, Specificity, and Precision (Saito and Rehmsmeier, 2015). Accuracy measures how many of our data is correctly predicted. Sensitivity measures out of all positive outcome, how many are correctly predicted. Specificty measure how many negative outcome is correctly predicted. Precision measures how many of our positive prediction is correct.
\[Accuracy=\frac{TP+TN}{TP+TN+FP+FN}\]
\[Sensitivity=\frac{TP}{TP+FN}\]
\[Specificity=\frac{TN}{TN+FP}\]
\[Precision=\frac{TP}{TP+FP}\]
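To make the formulas concrete, here is a minimal sketch computing the four metrics by hand, using the counts from the logistic regression confusion matrix shown below (with "Not Health" as the positive class):
# Counts from the logistic regression confusion matrix (0.5 cutoff)
tp <- 38; tn <- 24; fp <- 10; fn <- 4
accuracy    <- (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
sensitivity <- tp / (tp + fn)                   # recall: share of actual positives found
specificity <- tn / (tn + fp)                   # share of actual negatives found
precision   <- tp / (tp + fp)                   # share of positive predictions that are correct
round(c(accuracy, sensitivity, specificity, precision), 4)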
log_cm <- confusionMatrix(factor(test$pred),test$target,positive = "Not Health")
log_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction Health Not Health
## Health 24 4
## Not Health 10 38
##
## Accuracy : 0.8158
## 95% CI : (0.7103, 0.8955)
## No Information Rate : 0.5526
## P-Value [Acc > NIR] : 1.294e-06
##
## Kappa : 0.6211
##
## Mcnemar's Test P-Value : 0.1814
##
## Sensitivity : 0.9048
## Specificity : 0.7059
## Pos Pred Value : 0.7917
## Neg Pred Value : 0.8571
## Prevalence : 0.5526
## Detection Rate : 0.5000
## Detection Prevalence : 0.6316
## Balanced Accuracy : 0.8053
##
## 'Positive' Class : Not Health
##
knn_cm <- confusionMatrix(knn, test$target, positive = "Not Health")
knn_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction Health Not Health
## Health 21 8
## Not Health 13 34
##
## Accuracy : 0.7237
## 95% CI : (0.6091, 0.8201)
## No Information Rate : 0.5526
## P-Value [Acc > NIR] : 0.001639
##
## Kappa : 0.4332
##
## Mcnemar's Test P-Value : 0.382733
##
## Sensitivity : 0.8095
## Specificity : 0.6176
## Pos Pred Value : 0.7234
## Neg Pred Value : 0.7241
## Prevalence : 0.5526
## Detection Rate : 0.4474
## Detection Prevalence : 0.6184
## Balanced Accuracy : 0.7136
##
## 'Positive' Class : Not Health
##
eval_log <- data.frame(Accuracy = log_cm$overall[1],
Recall = log_cm$byClass[1],
Specificity = log_cm$byClass[2],
Precision = log_cm$byClass[3])
eval_knn <- data.frame(Accuracy = knn_cm$overall[1],
Recall = knn_cm$byClass[1],
Specificity = knn_cm$byClass[2],
Precision = knn_cm$byClass[3])
eval_total <- rbind(round(eval_log,2), round(eval_knn,2))
rownames(eval_total) <- c("Logistic Regression", "K-NN")
rmarkdown::paged_table(data.frame(eval_total))
To get a better result from the logistic regression model, we can tune it in several ways; one way is by changing the classification threshold used to determine the prediction. For this we use the ROC curve (receiver operating characteristic curve), a graph that shows the performance of a classification model at all classification thresholds. It will allow us to pick a proper cutoff.
train$pred<-fitted(model_lgr)
# fitted() can only be used to get the predicted scores for the data on which the model was trained.
pred<-prediction(train$pred,train$target)
perf<-performance(pred,"tpr","fpr")
plot(perf,colorize = T,print.cutoffs.at = seq(0.1,by = 0.1))
From the ROC curve we can observe that a cutoff of 0.6 gives a better balance of sensitivity and specificity, so we change our cutoff to 0.6.
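Before applying the new cutoff, we could also summarize the whole curve with the area under the ROC curve (AUC) using ROCR; a minimal sketch (not part of the original analysis):
# Area under the ROC curve, computed from the same prediction object
auc <- performance(pred, measure = "auc")
auc@y.values[[1]]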
test$pred<-predict(model_lgr,test,type = "response")
test$pred<-ifelse(test$pred < 0.6,"Health","Not Health")
And let's calculate the confusion matrix again:
log_cm <- confusionMatrix(factor(test$pred),test$target,positive = "Not Health")
log_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction Health Not Health
## Health 25 5
## Not Health 9 37
##
## Accuracy : 0.8158
## 95% CI : (0.7103, 0.8955)
## No Information Rate : 0.5526
## P-Value [Acc > NIR] : 1.294e-06
##
## Kappa : 0.6232
##
## Mcnemar's Test P-Value : 0.4227
##
## Sensitivity : 0.8810
## Specificity : 0.7353
## Pos Pred Value : 0.8043
## Neg Pred Value : 0.8333
## Prevalence : 0.5526
## Detection Rate : 0.4868
## Detection Prevalence : 0.6053
## Balanced Accuracy : 0.8081
##
## 'Positive' Class : Not Health
##
As a result, the precision and specificity have improved compared with the 0.5 cutoff, while the overall accuracy stays the same.
To improve the K-NN model, we will try changing k from 16 to 18 and see whether there is any difference.
knn <- knn(train = train_x,
test = test_x,
cl = train_y,
k = 18)
knn_cm <- confusionMatrix(knn, test$target, positive = "Not Health")
knn_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction Health Not Health
## Health 21 4
## Not Health 13 38
##
## Accuracy : 0.7763
## 95% CI : (0.6662, 0.864)
## No Information Rate : 0.5526
## P-Value [Acc > NIR] : 4.229e-05
##
## Kappa : 0.5359
##
## Mcnemar's Test P-Value : 0.05235
##
## Sensitivity : 0.9048
## Specificity : 0.6176
## Pos Pred Value : 0.7451
## Neg Pred Value : 0.8400
## Prevalence : 0.5526
## Detection Rate : 0.5000
## Detection Prevalence : 0.6711
## Balanced Accuracy : 0.7612
##
## 'Positive' Class : Not Health
##
And we can see there is an improvement over the previous results, with higher accuracy and precision than before.
eval_log <- data.frame(Accuracy = log_cm$overall[1],
Recall = log_cm$byClass[1],
Specificity = log_cm$byClass[2],
Precision = log_cm$byClass[3])
eval_knn <- data.frame(Accuracy = knn_cm$overall[1],
Recall = knn_cm$byClass[1],
Specificity = knn_cm$byClass[2],
Precision = knn_cm$byClass[3])
eval_total <- rbind(round(eval_log,2), round(eval_knn,2))
rownames(eval_total) <- c("Logistic Regression", "K-NN")
rmarkdown::paged_table(data.frame(eval_total))
Both models' performance increased after the tuning process. From the comparison table above, we can see that Logistic Regression achieves higher Accuracy, Specificity, and Precision than the K-NN model, while K-NN has a slightly higher Recall. Given what we want to achieve, we should look in particular at the precision metric, because we do not want the model to be wrong when predicting which patients are really Not Health or Health. So we choose the Logistic Regression model: its ability to correctly identify Not Health patients among its positive predictions is better, with a precision of about 0.80 versus 0.75 for the K-NN model.