Berikut ini adalah analisis dan prediksi dari penyakit jantung yang didapatkan dari Cleveland Database.
Data tesebut akan diolah dengan Generalised Logarithmic Regression dan K Near Neighbour, serta dibuat perbandingan antara kedua Machine Learning tersebut dengan variable target adalah target pasien yang bernilai Heakth/Sehat dan Not Helath/Tidak Sehat. Hal yang menjadi variable prediktor adalah semua variable yang ada pada data tersebut. tujuan dari project ini adalah untuk memprediksi pasien yang beresiko terkena penyakit jantung. Metric Evaluation yang dipakai pada variable target adalah Recall.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## -- Attaching packages --------------------------- tidyverse 1.3.0 --
## v ggplot2 3.2.1 v purrr 0.3.3
## v tibble 2.1.3 v stringr 1.4.0
## v tidyr 1.0.0 v forcats 0.4.0
## v readr 1.3.1
## -- Conflicts ------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(ggplot2)
library(funModeling)
## Loading required package: Hmisc
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
## funModeling v.1.9.3 :)
## Examples and tutorials at livebook.datascienceheroes.com
## / Now in Spanish: librovivodecienciadedatos.ai
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
##
## Attaching package: 'GGally'
## The following object is masked from 'package:funModeling':
##
## range01
## The following object is masked from 'package:dplyr':
##
## nasa
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(gtools)
library(class)
library(caret)
##
## Attaching package: 'caret'
## The following object is masked from 'package:survival':
##
## cluster
## The following object is masked from 'package:purrr':
##
## lift
heart <- read.csv("heart.csv")
str(heart)
## 'data.frame': 303 obs. of 14 variables:
## $ ï..age : int 63 37 41 56 57 57 56 44 52 57 ...
## $ sex : int 1 1 0 1 0 1 0 1 1 1 ...
## $ cp : int 3 2 1 1 0 0 1 1 2 2 ...
## $ trestbps: int 145 130 130 120 120 140 140 120 172 150 ...
## $ chol : int 233 250 204 236 354 192 294 263 199 168 ...
## $ fbs : int 1 0 0 0 0 0 0 0 1 0 ...
## $ restecg : int 0 1 0 1 1 1 0 1 1 1 ...
## $ thalach : int 150 187 172 178 163 148 153 173 162 174 ...
## $ exang : int 0 0 0 0 1 0 0 0 0 0 ...
## $ oldpeak : num 2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
## $ slope : int 0 0 2 2 2 1 1 2 2 2 ...
## $ ca : int 0 0 0 0 0 0 0 0 0 0 ...
## $ thal : int 1 2 2 2 2 1 2 3 3 2 ...
## $ target : int 1 1 1 1 1 1 1 1 1 1 ...
Informasi penting dalam data :
ï..age : dalam beberapa tahunsex : (1 = laki-laki; 0 = perempuan)cp : tipe nyeri yang paling parahtrestbps : melacak tekanan darah(dalam mm Hg saat masuk ke rumah sakit)chol : kolestoral dalam mg / dlfbs : (gula darah puasa> 120 mg / dl) (1 = benar; 0 = salah)restecg : mengembalikan hasil elektrokardiografithalach : denyut jantung maksimum tercapaiexang : kejadian nyeri pada dada karena kekurangan suplai darah (1 = ya; 0 = tidak)oldpeak : ST depresi yang disebabkan oleh olahraga relatif terhadap istirahatslope : kemiringan segmen ST latihan puncakca : jumlah pembuluh darah utama (0-3) diwarnai dengan fluoroskopithal : 3 = normal; 6 = cacat tetap; 7 = cacat yang dapat dibaliktarget : 1 = sakit atau 0=tidak sakitMengubah tipe data kategorik yg semula tipe integer menjadi tipe data factor
heart <- heart %>%
mutate_if(is.integer, as.factor) %>%
mutate(sex = factor(sex, levels = c(0,1), labels = c("Female","Male")),
fbs = factor(fbs, levels = c(0,1), labels = c("False","True")),
exang = factor(exang, levels = c(0,1), labels = c("No","Yes")),
target = factor(target, levels = c(0,1), labels = c("Health","Not Health")))
str(heart)
## 'data.frame': 303 obs. of 14 variables:
## $ ï..age : Factor w/ 41 levels "29","34","35",..: 30 4 8 23 24 24 23 11 19 24 ...
## $ sex : Factor w/ 2 levels "Female","Male": 2 2 1 2 1 2 1 2 2 2 ...
## $ cp : Factor w/ 4 levels "0","1","2","3": 4 3 2 2 1 1 2 2 3 3 ...
## $ trestbps: Factor w/ 49 levels "94","100","101",..: 32 23 23 15 15 29 29 15 44 35 ...
## $ chol : Factor w/ 152 levels "126","131","141",..: 65 81 36 68 146 26 117 93 32 10 ...
## $ fbs : Factor w/ 2 levels "False","True": 2 1 1 1 1 1 1 1 2 1 ...
## $ restecg : Factor w/ 3 levels "0","1","2": 1 2 1 2 2 2 1 2 2 2 ...
## $ thalach : Factor w/ 91 levels "71","88","90",..: 50 85 72 77 63 48 53 73 62 74 ...
## $ exang : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 1 1 1 1 1 ...
## $ oldpeak : num 2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
## $ slope : Factor w/ 3 levels "0","1","2": 1 1 3 3 3 2 2 3 3 3 ...
## $ ca : Factor w/ 5 levels "0","1","2","3",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ thal : Factor w/ 4 levels "0","1","2","3": 2 3 3 3 3 2 3 4 4 3 ...
## $ target : Factor w/ 2 levels "Health","Not Health": 2 2 2 2 2 2 2 2 2 2 ...
Melakukan pengecekan nilai NA
colSums(is.na(heart))
## ï..age sex cp trestbps chol fbs restecg thalach
## 0 0 0 0 0 0 0 0
## exang oldpeak slope ca thal target
## 0 0 0 0 0 0
kesimpulan tidak ada data NA pada data
Mengecek komposisi pada variabel target
table(heart$target)
##
## Health Not Health
## 138 165
prop.table(table(heart$target))
##
## Health Not Health
## 0.4554455 0.5445545
Selisih antara perbandingan data pada target tidak terlalu jauh, maka dapat dijadikan sebagai variable target
Membuat data train dan test
set.seed(123)
intrain<-sample(nrow(heart),nrow(heart)*0.8)
heart_train<-heart[intrain,]
heart_test<-heart[-intrain,]
heart$target %>%
levels()
## [1] "Health" "Not Health"
Melakukan pemodelan menggunakan regresi logistik. Pemodelan menggunakan fungsi glm dalam memodelkan menggunakan regresi logistik. Variabel yang digunakan adalah semua variabel, dimana variabel target menjadi variabel responnya.
model<-glm(formula = target ~ sex+cp+fbs+exang+oldpeak+slope+ca+thal, family = "binomial", data = heart_train)
summary(model)
##
## Call:
## glm(formula = target ~ sex + cp + fbs + exang + oldpeak + slope +
## ca + thal, family = "binomial", data = heart_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7457 -0.3730 0.1611 0.5319 2.6855
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -11.8672 1455.3983 -0.008 0.99349
## sexMale -1.3487 0.5284 -2.552 0.01070 *
## cp1 0.9420 0.6422 1.467 0.14241
## cp2 1.6332 0.5672 2.879 0.00398 **
## cp3 1.9326 0.6847 2.823 0.00476 **
## fbsTrue 0.3648 0.5863 0.622 0.53381
## exangYes -1.0363 0.4672 -2.218 0.02654 *
## oldpeak -0.5609 0.2544 -2.205 0.02749 *
## slope1 -1.1177 0.9600 -1.164 0.24429
## slope2 0.1152 1.0686 0.108 0.91414
## ca1 -1.6318 0.5288 -3.086 0.00203 **
## ca2 -2.8979 0.7327 -3.955 7.65e-05 ***
## ca3 -1.9490 0.9890 -1.971 0.04877 *
## ca4 0.8910 1.5496 0.575 0.56528
## thal1 15.6744 1455.3979 0.011 0.99141
## thal2 14.9142 1455.3977 0.010 0.99182
## thal3 13.5356 1455.3977 0.009 0.99258
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 333.48 on 241 degrees of freedom
## Residual deviance: 161.57 on 225 degrees of freedom
## AIC: 195.57
##
## Number of Fisher Scoring iterations: 14
Model fitting menggunakan step
model2<-stepAIC(model,direction = "backward")
## Start: AIC=195.57
## target ~ sex + cp + fbs + exang + oldpeak + slope + ca + thal
##
## Df Deviance AIC
## - fbs 1 161.96 193.96
## <none> 161.57 195.57
## - exang 1 166.50 198.50
## - oldpeak 1 166.79 198.79
## - slope 2 168.79 198.79
## - sex 1 168.48 200.48
## - cp 3 174.44 202.44
## - thal 3 174.55 202.55
## - ca 4 189.09 215.09
##
## Step: AIC=193.96
## target ~ sex + cp + exang + oldpeak + slope + ca + thal
##
## Df Deviance AIC
## <none> 161.96 193.96
## - exang 1 166.77 196.77
## - slope 2 168.98 196.98
## - oldpeak 1 167.57 197.57
## - sex 1 168.72 198.72
## - thal 3 174.75 200.75
## - cp 3 175.98 201.98
## - ca 4 189.15 213.15
summary(model2)
##
## Call:
## glm(formula = target ~ sex + cp + exang + oldpeak + slope + ca +
## thal, family = "binomial", data = heart_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7604 -0.3788 0.1531 0.5385 2.6567
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -11.48079 1455.39812 -0.008 0.99371
## sexMale -1.33279 0.52856 -2.522 0.01169 *
## cp1 0.96192 0.63968 1.504 0.13265
## cp2 1.69640 0.55766 3.042 0.00235 **
## cp3 1.97114 0.68149 2.892 0.00382 **
## exangYes -1.02151 0.46595 -2.192 0.02836 *
## oldpeak -0.57722 0.25328 -2.279 0.02267 *
## slope1 -1.15377 0.95315 -1.210 0.22610
## slope2 0.04235 1.05726 0.040 0.96805
## ca1 -1.61500 0.52360 -3.084 0.00204 **
## ca2 -2.82707 0.72135 -3.919 8.89e-05 ***
## ca3 -1.83846 0.97478 -1.886 0.05929 .
## ca4 1.00063 1.59506 0.627 0.53044
## thal1 15.35183 1455.39779 0.011 0.99158
## thal2 14.58769 1455.39765 0.010 0.99200
## thal3 13.22013 1455.39763 0.009 0.99275
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 333.48 on 241 degrees of freedom
## Residual deviance: 161.96 on 226 degrees of freedom
## AIC: 193.96
##
## Number of Fisher Scoring iterations: 14
Melihat peluang pada prediksi data
heart_test$prob_heart<-predict(model2, type = "response", newdata = heart_test[,-14])
ggplot(heart_test, aes(x=prob_heart))+geom_density(lwd=0.5)+theme_minimal()
Pada grafik diatas dapat diinterpretasi bahwa hasil prediksi bergerak ke arah 1 yg berarti Not Health
heart_test$pred_heart<-factor(ifelse(heart_test$prob_heart>0.5, "Not Health","Health"))
heart_test[1:10,c("pred_heart","target")]
## pred_heart target
## 2 Not Health Not Health
## 3 Not Health Not Health
## 12 Not Health Not Health
## 15 Not Health Not Health
## 19 Not Health Not Health
## 28 Not Health Not Health
## 38 Not Health Not Health
## 44 Not Health Not Health
## 47 Not Health Not Health
## 49 Health Not Health
Dalam syntax diatas, ketika probabilitas data test lebih dari 0.5, artinya dia sakit atau Not Health.
log_conf <- confusionMatrix(heart_test$pred_heart, heart_test$target, positive="Not Health")
log_conf
## Confusion Matrix and Statistics
##
## Reference
## Prediction Health Not Health
## Health 20 3
## Not Health 8 30
##
## Accuracy : 0.8197
## 95% CI : (0.7002, 0.9064)
## No Information Rate : 0.541
## P-Value [Acc > NIR] : 4.82e-06
##
## Kappa : 0.6319
##
## Mcnemar's Test P-Value : 0.2278
##
## Sensitivity : 0.9091
## Specificity : 0.7143
## Pos Pred Value : 0.7895
## Neg Pred Value : 0.8696
## Prevalence : 0.5410
## Detection Rate : 0.4918
## Detection Prevalence : 0.6230
## Balanced Accuracy : 0.8117
##
## 'Positive' Class : Not Health
##
Beberapa pengertian:
\(Recall/Sensitivity = (30)/(3+30)\)
\(Specificity = (20)/(20+8)\)
\(Accuracy = (20+30)/(20+3+8+30)\)
Sensitivity = (30)/(3+30)
Specificity = (20)/(20+8)
Accuracy = (20+30)/(20+3+8+30)
paste("Sensitivity =", round(Sensitivity,4))
## [1] "Sensitivity = 0.9091"
paste("Specificity =", round(Specificity,4))
## [1] "Specificity = 0.7143"
paste("Accuracy =", round(Accuracy,4))
## [1] "Accuracy = 0.8197"
Berdasarkan hasil confusionMatrix diatas, dapat kita ambil informasi bahwa:
Tuning Cutoff digunakan untuk mengetahui nilai threshold maksimum yang akan diteliti
# tuning cutoff
performa <- function(cutoff, prob, ref, postarget, negtarget)
{
predict <- factor(ifelse(prob >= cutoff, postarget, negtarget))
conf <- caret::confusionMatrix(predict , ref, positive = postarget)
acc <- conf$overall[1]
rec <- conf$byClass[1]
prec <- conf$byClass[3]
spec <- conf$byClass[2]
mat <- t(as.matrix(c(rec , acc , prec, spec)))
colnames(mat) <- c("recall", "accuracy", "precicion", "specificity")
return(mat)
}
co <- seq(0.01,0.80,length=100)
result <- matrix(0,100,4)
for(i in 1:100){
result[i,] = performa(cutoff = co[i],
prob = heart_test$prob_heart,
ref = heart_test$target,
postarget = "Not Health",
negtarget = "Health")
}
data_frame("Recall" = result[,1],
"Accuracy" = result[,2],
"Precision" = result[,3],
"Specificity" = result[,4],
"Cutoff" = co) %>%
gather(key = "performa", value = "value", 1:4) %>%
ggplot(aes(x = Cutoff, y = value, col = performa)) +
geom_line(lwd = 1.5) +
scale_color_manual(values = c("darkred","darkgreen","orange", "blue")) +
scale_y_continuous(breaks = seq(0,1,0.1), limits = c(0,1)) +
scale_x_continuous(breaks = seq(0,1,0.1)) +
labs(title = "Tradeoff model perfomance") +
theme_minimal() +
theme(legend.position = "top",
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank())
## Warning: `data_frame()` is deprecated, use `tibble()`.
## This warning is displayed once per session.
Berdasarkan Tradeoff Model Performance diatas, kita bisa mengetahui bahwa dengan cutoff 0.6 kita memperoleh nilai precision dan re-call yang agak tinggi, namun nilai accuracy dan nilai specificity agak rendah.
#Odds ratio all coefficients
exp(model2$coefficients) %>%
data.frame()
## .
## (Intercept) 1.032657e-05
## sexMale 2.637416e-01
## cp1 2.616728e+00
## cp2 5.454282e+00
## cp3 7.178880e+00
## exangYes 3.600514e-01
## oldpeak 5.614586e-01
## slope1 3.154464e-01
## slope2 1.043264e+00
## ca1 1.988907e-01
## ca2 5.918608e-02
## ca3 1.590614e-01
## ca4 2.719990e+00
## thal1 4.647452e+06
## thal2 2.164479e+06
## thal3 5.513553e+05
Interpretasi model :
Membuat variabel dummy dari data-data kategori yang akan digunakan dalam klasifikasi.
dmy <- dummyVars(" ~target+sex+cp+fbs+exang+oldpeak+slope+ca+thal", data = heart)
dmy <- data.frame(predict(dmy, newdata = heart))
str(dmy)
## 'data.frame': 303 obs. of 25 variables:
## $ target.Health : num 0 0 0 0 0 0 0 0 0 0 ...
## $ target.Not.Health: num 1 1 1 1 1 1 1 1 1 1 ...
## $ sex.Female : num 0 0 1 0 1 0 1 0 0 0 ...
## $ sex.Male : num 1 1 0 1 0 1 0 1 1 1 ...
## $ cp.0 : num 0 0 0 0 1 1 0 0 0 0 ...
## $ cp.1 : num 0 0 1 1 0 0 1 1 0 0 ...
## $ cp.2 : num 0 1 0 0 0 0 0 0 1 1 ...
## $ cp.3 : num 1 0 0 0 0 0 0 0 0 0 ...
## $ fbs.False : num 0 1 1 1 1 1 1 1 0 1 ...
## $ fbs.True : num 1 0 0 0 0 0 0 0 1 0 ...
## $ exang.No : num 1 1 1 1 0 1 1 1 1 1 ...
## $ exang.Yes : num 0 0 0 0 1 0 0 0 0 0 ...
## $ oldpeak : num 2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
## $ slope.0 : num 1 1 0 0 0 0 0 0 0 0 ...
## $ slope.1 : num 0 0 0 0 0 1 1 0 0 0 ...
## $ slope.2 : num 0 0 1 1 1 0 0 1 1 1 ...
## $ ca.0 : num 1 1 1 1 1 1 1 1 1 1 ...
## $ ca.1 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ ca.2 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ ca.3 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ ca.4 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ thal.0 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ thal.1 : num 1 0 0 0 0 1 0 0 0 0 ...
## $ thal.2 : num 0 1 1 1 1 0 1 0 0 1 ...
## $ thal.3 : num 0 0 0 0 0 0 0 1 1 0 ...
Menghapus variabel dummy yang variabel sebelumnya hanya terdapat 2 kategori.
dmy$target.Health <- NULL
dmy$sex.Female <- NULL
dmy$fbs.False <- NULL
dmy$exang.No <- NULL
Mengetahui nama-nama dari variabel dummy yang terbentuk.
names(dmy)
## [1] "target.Not.Health" "sex.Male" "cp.0"
## [4] "cp.1" "cp.2" "cp.3"
## [7] "fbs.True" "exang.Yes" "oldpeak"
## [10] "slope.0" "slope.1" "slope.2"
## [13] "ca.0" "ca.1" "ca.2"
## [16] "ca.3" "ca.4" "thal.0"
## [19] "thal.1" "thal.2" "thal.3"
Membentuk data training dan data testing dari data dmy yang telah terbentuk.
set.seed(123)
dmy_train <- dmy[intrain,2:21]
dmy_test <- dmy[-intrain,2:21]
dmy_train_label <- dmy[intrain,1]
dmy_test_label <- dmy[-intrain,1]
Menentukan jumlah k untuk K-NN
sqrt(nrow(dmy_train))
## [1] 15.55635
Kesimpulan dipilih nilai K sebesar 15.55
Melakukan prediksi dengan K-NN
pred_knn <- class::knn(train = dmy_train,
test = dmy_test,
cl = dmy_train_label,
k = 15.55)
Membuat confusion matrix dari prediksi K-NN
pred_knn_conf <- confusionMatrix(as.factor(pred_knn), as.factor(dmy_test_label),"1")
pred_knn_conf
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 21 2
## 1 7 31
##
## Accuracy : 0.8525
## 95% CI : (0.7383, 0.9302)
## No Information Rate : 0.541
## P-Value [Acc > NIR] : 2.6e-07
##
## Kappa : 0.6988
##
## Mcnemar's Test P-Value : 0.1824
##
## Sensitivity : 0.9394
## Specificity : 0.7500
## Pos Pred Value : 0.8158
## Neg Pred Value : 0.9130
## Prevalence : 0.5410
## Detection Rate : 0.5082
## Detection Prevalence : 0.6230
## Balanced Accuracy : 0.8447
##
## 'Positive' Class : 1
##
Berdasarkan hasil confusion matrix diatas, dapat kita ketahui bahwa:
Membuat datafarame yang hanya mengambil matrix evaluation buat Logistic Regression dan K-NN
eval_logit <- data_frame(Accuracy = log_conf$overall[1],
Recall = log_conf$byClass[1],
Precision = log_conf$byClass[3])
eval_knn <- data_frame(Accuracy = pred_knn_conf$overall[1],
Recall = pred_knn_conf$byClass[1],
Precision = pred_knn_conf$byClass[3])
#Model Evaluation Logit
eval_logit
## # A tibble: 1 x 3
## Accuracy Recall Precision
## <dbl> <dbl> <dbl>
## 1 0.820 0.909 0.789
#Model Evaluation Logit
eval_knn
## # A tibble: 1 x 3
## Accuracy Recall Precision
## <dbl> <dbl> <dbl>
## 1 0.852 0.939 0.816
Jika dilihat dari kedua metode tersebut, yaitu dengan menggunakan Regresi Logistik dan K-NN, kemampuan model dalam memprediksi benar dari data aktual orang yang Not Health lebih baik dengan menggunakan metode K-NN karena memiliki nilai recall sebesar 93,9%, nilai tersebut lebih besar dari pada menggunakan metode Regresi Logistik.