Pada kesempatan kali ini, Saya ingin membuat model untuk memprediksi pasien mana yang terindikasi mempunyai penyakit jantung berdasarkan kategori dari beberapa variabel penunjangnya. Algoritma yang akan saya gunakan yaitu menggunakan model Classification yang terdiri dari Logistik Regression, Classification Tree, Random Forest dan Super Vector Machine yang termasuk dalam supervised learning.
Dataset yang akan digunakan untuk melakukan penelitian ini diambil dari https://www.kaggle.com/rishidamarla/heart-disease-prediction.
Materi yang akan dibahas, yaitu:
1. Data Extraction
2. Data Exploration
3. Data Preparation
4. Modeling
5. Evaluation
6. Recommendation
Proses mengimport data dalam format file csv dapat dilakukan menggunakan fungsi read.csv() dan menyebutkan nama file beserta folder tempat menyimpannya.
heart_df <- read.csv(file = "data/heart.csv")
Dengan perintah di atas, pada workspace-R kita memiliki data frame dengan nama heart. Selanjutnya kita akan bekerja dengan dataframe ini.
Sebelum lebih jauh bekerja dengan data, kita perlu mengetahui informasi dasar dari data yang kita miliki. Informasi dasar tersebut meliputi ukuran/banyaknya data dan nama-nama kolom atau variabel yang ada di dalamnya.
Fungsi dim() dapat digunakan untuk menampilkan berapa banyak baris dan kolom pada dataframe kita, dan fungsi colnames() dapat digunakan untuk menampilkan nama-nama kolom pada data.
dim(heart_df)
## [1] 270 14
Tampak bahwa ada sebanyak 270 baris pada data dengan kolom sebanyak 14 buah. Berikut ini adalah nama-nama kolom yang ada pada Data Heart.
colnames(heart_df)
## [1] "Age" "Sex"
## [3] "Chest.pain.type" "BP"
## [5] "Cholesterol" "FBS.over.120"
## [7] "EKG.results" "Max.HR"
## [9] "Exercise.angina" "ST.depression"
## [11] "Slope.of.ST" "Number.of.vessels.fluro"
## [13] "Thallium" "Heart.Disease"
Untuk melihat tipe data, dapat menggunakan fungsi str:
str(heart_df)
## 'data.frame': 270 obs. of 14 variables:
## $ Age : int 70 67 57 64 74 65 56 59 60 63 ...
## $ Sex : int 1 0 1 1 0 1 1 1 1 0 ...
## $ Chest.pain.type : int 4 3 2 4 2 4 3 4 4 4 ...
## $ BP : int 130 115 124 128 120 120 130 110 140 150 ...
## $ Cholesterol : int 322 564 261 263 269 177 256 239 293 407 ...
## $ FBS.over.120 : int 0 0 0 0 0 0 1 0 0 0 ...
## $ EKG.results : int 2 2 0 0 2 0 2 2 2 2 ...
## $ Max.HR : int 109 160 141 105 121 140 142 142 170 154 ...
## $ Exercise.angina : int 0 0 0 1 1 0 1 1 0 0 ...
## $ ST.depression : num 2.4 1.6 0.3 0.2 0.2 0.4 0.6 1.2 1.2 4 ...
## $ Slope.of.ST : int 2 2 1 2 1 1 2 2 2 2 ...
## $ Number.of.vessels.fluro: int 3 0 0 1 1 0 1 1 2 3 ...
## $ Thallium : int 3 7 7 7 3 7 6 7 7 7 ...
## $ Heart.Disease : chr "Presence" "Absence" "Presence" "Absence" ...
Ada baiknya kita mengetahui berapa banyak pasien yang statusnya tergolong Absence (Tidak punya penyakit jantung) dan Presence (Punya penyakit jantung), karena variabel ini merupakan variabel yang akan kita gunakan dalam pemodelan sebagai variabel target.
Fungsi table() merupakan fungsi yang dapat menghasilkan frekuensi dari setiap kategori, sedangkan prop.table() dapat menghasilkan proporsi atau persentase-nya.
table(heart_df$Heart.Disease)
##
## Absence Presence
## 150 120
Cek Prosentase Variabel Target
prop.table(table(heart_df$Heart.Disease))
##
## Absence Presence
## 0.5555556 0.4444444
Berdasarkan hasil di atas diperoleh bahwa pasien yang memiliki penyakit jantung ada sebanyak 120 pasien atau 44% dari keseluruhan yang ada.
Cek Missing Value (Apakah ada kolom yang kosong)
colSums(is.na(heart_df))
## Age Sex Chest.pain.type
## 0 0 0
## BP Cholesterol FBS.over.120
## 0 0 0
## EKG.results Max.HR Exercise.angina
## 0 0 0
## ST.depression Slope.of.ST Number.of.vessels.fluro
## 0 0 0
## Thallium Heart.Disease
## 0 0
Merubah tipe data variabel Chest.pain.type dari integer ke factor
heart_df$Chest.pain.type <- factor(heart_df$Chest.pain.type , levels = c(1,2,3,4),
labels = c("typical angina","atypical angina",
"non-anginal pain","asymptomatic"))
Merubah tipe data variabel FBS.over.120 dari integer ke factor
heart_df$FBS.over.120 <- factor(heart_df$FBS.over.120 , levels = c(0,1),
labels = c("Kurang Dari 120","Lebih Dari 120"))
Merubah tipe data variabel EKG.results dari integer ke factor
heart_df$EKG.results <- factor(heart_df$EKG.results , levels = c(0,1,2),
labels = c("Normal","ST-T Wave Abnormality","Ventricular kiri mengalami hipertropi"))
Merubah tipe data variabel Exercise.angina dari integer ke factor
heart_df$Exercise.angina <- factor(heart_df$Exercise.angina , levels = c(0,1),
labels = c("Tidak Nyeri Dada","Nyeri Dada"))
Merubah tipe data variabel Sex dari integer ke factor
heart_df$Sex <- factor(heart_df$Sex , levels = c(0,1),
labels = c("Female","Male"))
Merubah tipe data variabel Thallium dari integer ke factor
heart_df$Thallium <- factor(heart_df$Thallium , levels = c(3,6,7),
labels = c("Normal","Fixed Defect","Reversal Defect"))
Melakukan install.packages
#install.packages("ggplot2")
library(ggplot2)
A. Chest Pain Type
ggplot(data = heart_df, aes(x = "", fill = Chest.pain.type)) +
geom_bar(position = "dodge") + labs(title = "Prediksi Kemungkinan Terkena Penyakit Jantung Berdasarkan Chest Pain")
B. FBS.over.120
ggplot(data = heart_df, aes(x = "", fill = FBS.over.120)) +
geom_bar(position = "dodge") + labs(title = "Prediksi Kemungkinan Terkena Penyakit Jantung Berdasarkan FBS")
C. EKG.results
ggplot(data = heart_df, aes(x = "", fill = EKG.results)) +
geom_bar(position = "dodge") + labs(title = "Prediksi Kemungkinan Terkena Penyakit Jantung Berdasarkan EKG")
D. Exercise.angina
ggplot(data = heart_df, aes(x = "", fill = Exercise.angina)) +
geom_bar(position = "dodge") + labs(title = "Prediksi Kemungkinan Terkena Penyakit Jantung Berdasarkan Engercise Angina")
A. Usia Pasien Yang Terkena Penyakit Jantung
ggplot (data = heart_df, aes(x = Heart.Disease, y = Age)) +
geom_boxplot() +
labs(title = "Prediksi Usia Pasien Kemungkinan Terkena Penyakit Jantung",
x = "Heart Disease",
y = "Age")
B. Jenis Kelamin Pasien Yang Terkena Penyakit Jantung
ggplot(data = heart_df, aes(x = Heart.Disease, fill = Sex)) +
geom_bar(position = "dodge") + labs(title = "Prediksi Kemungkinan Terkena Penyakit Jantung Berdasarkan Sex")
C. Kolestrol dan BP Pasien Yang Terkena Penyakit Jantung
ggplot(data = heart_df, aes(x= Cholesterol, y = BP))+
geom_point()+ labs(title = "Prediksi Terkena Penyakit Jantung Berdasarkan BP+Kolestrol")
D. Max.HR Pasien Yang Terkena Penyakit Jantung
ggplot(data = heart_df, aes(y = Max.HR, x = Heart.Disease)) +
geom_boxplot() +
geom_point(position = "jitter",
color = "red",
alpha = 0.5)+ labs(title = "Prediksi Kemungkinan Terkena Penyakit Jantung Berdasarkan Detak Jantung")
A. Melihat Usia Pasien Berdasarkan Sex
library(ggplot2)
ggplot(data = heart_df, aes(x=Heart.Disease, y=Age, fill = Sex))+
geom_boxplot()+ labs(title = "Usia Pasien Berdasarkan Sex Yang Memiliki Penyakit Jantung")
B. Melihat ST.depression
library(ggplot2)
ggplot(data = heart_df, aes(x= Age, y=ST.depression, color=Heart.Disease, shape=Sex)) +
geom_point()+ labs(title = "ST Depression Pasien Berdasarkan Sex")
heart_df$Heart.Disease <- as.factor(heart_df$Heart.Disease)
library(caret)
## Loading required package: lattice
# OHE (Chest.pain.type and Thallium)
dmy <- dummyVars(" ~ Chest.pain.type+Thallium", data = heart_df)
heart_ohe <- data.frame(predict(dmy, newdata = heart_df))
heart_df<- cbind(heart_df, heart_ohe)
# REMOVE OHE (Chest.pain.type and Thallium)
heart_df$Chest.pain.type<- NULL
heart_df$Thallium<- NULL
## get outlier values
out_heart <- boxplot.stats(heart_df$Max.HR)$out
## get row index of outlier values
out_idx <- which(heart_df$Max.HR %in% c(out_heart))
## remove rows with outliers
heart_df <- heart_df[-out_idx,]
## Remove Rows with Zero Price
heart_df <- heart_df[heart_df$Max.HR > 0, ]
set.seed(2021)
m <- nrow(heart_df)
train_idx <- sample(m, 0.8 * m)
train_idx
## [1] 166 231 70 192 251 102 110 103 23 146 123 188 125 131 46 104 26 164
## [19] 101 86 159 240 98 147 68 117 250 133 73 191 38 18 171 59 267 243
## [37] 198 248 79 162 150 114 144 268 113 88 17 265 148 196 200 211 67 130
## [55] 223 254 169 121 31 225 201 155 139 210 252 269 263 126 241 19 16 222
## [73] 127 106 44 22 239 120 27 195 13 33 174 149 115 89 95 173 75 137
## [91] 261 205 94 255 7 29 129 246 109 206 178 202 185 170 209 138 266 65
## [109] 259 37 157 80 172 76 58 175 64 57 207 217 176 20 45 1 143 244
## [127] 237 230 167 132 189 177 85 186 39 128 112 71 260 224 53 180 134 43
## [145] 54 160 77 118 69 48 72 190 151 181 56 187 182 99 83 152 212 219
## [163] 87 194 14 218 234 220 108 161 227 93 203 247 253 229 135 235 245 96
## [181] 193 124 41 111 51 49 35 92 197 40 84 24 168 78 141 264 63 184
## [199] 179 122 11 32 142 153 97 156 66 8 82 62 199 3 163 61 221
train_df <- heart_df[train_idx, ]
test_df <- heart_df[-train_idx, ]
Logistic Regression
fit.logit <- glm(formula = Heart.Disease ~ .,
family = binomial(link="logit"),
data = train_df)
summary(fit.logit)
##
## Call:
## glm(formula = Heart.Disease ~ ., family = binomial(link = "logit"),
## data = train_df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4359 -0.4812 -0.1452 0.3021 2.8813
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error z value
## (Intercept) -1.816e+00 3.690e+00 -0.492
## Age -1.798e-02 3.061e-02 -0.588
## SexMale 1.281e+00 6.725e-01 1.905
## BP 2.502e-02 1.329e-02 1.883
## Cholesterol 6.918e-03 5.587e-03 1.238
## FBS.over.120Lebih Dari 120 -1.650e-01 6.842e-01 -0.241
## EKG.resultsST-T Wave Abnormality 1.332e+01 1.455e+03 0.009
## EKG.resultsVentricular kiri mengalami hipertropi 8.252e-01 4.774e-01 1.729
## Max.HR -2.413e-02 1.461e-02 -1.651
## Exercise.anginaNyeri Dada 9.559e-01 5.309e-01 1.800
## ST.depression 3.261e-01 2.659e-01 1.227
## Slope.of.ST 4.117e-01 4.320e-01 0.953
## Number.of.vessels.fluro 1.112e+00 3.059e-01 3.637
## Chest.pain.type.typical.angina -2.234e+00 7.869e-01 -2.839
## Chest.pain.type.atypical.angina -1.257e+00 7.249e-01 -1.734
## Chest.pain.type.non.anginal.pain -2.105e+00 5.960e-01 -3.532
## Chest.pain.type.asymptomatic NA NA NA
## Thallium.Normal -1.778e+00 5.324e-01 -3.339
## Thallium.Fixed.Defect -1.709e+00 8.977e-01 -1.904
## Thallium.Reversal.Defect NA NA NA
## Pr(>|z|)
## (Intercept) 0.622588
## Age 0.556856
## SexMale 0.056751 .
## BP 0.059699 .
## Cholesterol 0.215597
## FBS.over.120Lebih Dari 120 0.809405
## EKG.resultsST-T Wave Abnormality 0.992699
## EKG.resultsVentricular kiri mengalami hipertropi 0.083885 .
## Max.HR 0.098672 .
## Exercise.anginaNyeri Dada 0.071786 .
## ST.depression 0.219994
## Slope.of.ST 0.340578
## Number.of.vessels.fluro 0.000276 ***
## Chest.pain.type.typical.angina 0.004519 **
## Chest.pain.type.atypical.angina 0.082993 .
## Chest.pain.type.non.anginal.pain 0.000413 ***
## Chest.pain.type.asymptomatic NA
## Thallium.Normal 0.000842 ***
## Thallium.Fixed.Defect 0.056956 .
## Thallium.Reversal.Defect NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 294.65 on 214 degrees of freedom
## Residual deviance: 136.10 on 197 degrees of freedom
## AIC: 172.1
##
## Number of Fisher Scoring iterations: 14
#Evaluation (testing)
pred <- predict(fit.logit, test_df)
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
pred
## 2 4 5 6 9 10
## -0.03692991 3.76009866 -0.86647497 -0.31245212 4.07922134 6.19440714
## 12 15 21 25 28 30
## 3.14327636 -0.80684157 5.41189923 -1.22226784 -3.92801028 -3.34219143
## 34 36 42 47 50 52
## 5.66028679 4.25925131 -3.68128699 0.47973917 4.88626350 -4.07607616
## 55 60 74 81 90 91
## -3.16605356 2.77262116 9.50133327 3.63812802 3.40846532 -2.18743117
## 100 106 108 117 120 137
## -4.19723040 1.40657788 3.57621748 1.38642588 4.55662394 -3.80986577
## 141 146 155 159 166 184
## 3.04094057 5.90493198 -3.78715530 -1.07914749 -0.34036492 -1.06283525
## 205 209 214 215 216 217
## 4.95822350 0.68019165 6.61773697 -3.06345152 -3.38014956 -4.62352516
## 227 229 233 234 237 239
## 0.02297126 -3.67090069 -1.67231029 2.09233551 -2.09858922 -2.55875726
## 243 250 257 258 259 263
## -1.13884984 6.28154735 -1.44507530 4.31925486 -3.14826243 -1.26959281
prob <- predict(fit.logit, test_df, type = "response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
prob
## 2 4 5 6 9 10
## 0.490768572 0.977248250 0.295988317 0.422516316 0.983360908 0.997963344
## 12 15 21 25 28 30
## 0.958642973 0.308563945 0.995556676 0.227537599 0.019302863 0.034151799
## 34 36 42 47 50 52
## 0.996530561 0.986064075 0.024571563 0.617686282 0.992506990 0.016690632
## 55 60 74 81 90 91
## 0.040463363 0.941178266 0.999925253 0.974372509 0.967968052 0.100884866
## 100 106 108 117 120 137
## 0.014814400 0.803225624 0.972780305 0.800021043 0.989611612 0.021671112
## 141 146 155 159 166 184
## 0.954389790 0.997281444 0.022157874 0.253667381 0.415720836 0.256768007
## 205 209 214 215 216 217
## 0.993023615 0.663781471 0.998665332 0.044640273 0.032921633 0.009722666
## 227 229 233 234 237 239
## 0.505742562 0.024821733 0.158116400 0.890155996 0.109234018 0.071840363
## 243 250 257 258 259 263
## 0.242531595 0.998132988 0.190760639 0.986865026 0.041159798 0.219326965
logit.pred <- factor(prob > 0.5,
levels = c(FALSE, TRUE),
labels = c("Absence", "Presence"))
logit.perf <- table(test_df$Heart.Disease , logit.pred,
dnn = c("Actual","Predicted"))
logit.perf
## Predicted
## Actual Absence Presence
## Absence 26 3
## Presence 2 23
Classification Tree
library(party)
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: sandwich
#membuat model
fit.ctree <- ctree(formula = Heart.Disease ~ .,
data = train_df)
#ngeplot ctree nya
plot(fit.ctree)
ctree.pred <- predict(fit.ctree, test_df)
ctree.pred
## [1] Absence Presence Absence Presence Presence Presence Presence Presence
## [9] Presence Absence Absence Absence Presence Presence Absence Absence
## [17] Presence Absence Absence Absence Absence Presence Presence Absence
## [25] Absence Presence Presence Presence Presence Absence Presence Presence
## [33] Absence Absence Absence Absence Presence Absence Presence Absence
## [41] Absence Absence Absence Absence Absence Presence Absence Absence
## [49] Absence Presence Absence Presence Absence Absence
## Levels: Absence Presence
ctree.perf <- table(test_df$Heart.Disease, ctree.pred,
dnn = c("Actual","Predicted"))
ctree.perf
## Predicted
## Actual Absence Presence
## Absence 25 4
## Presence 6 19
Random Forest
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
fit.forest <- randomForest(formula = Heart.Disease ~ .,
data = train_df,
na.action = na.roughfix)
forest.pred <- predict(fit.forest, test_df)
forest.pred
## 2 4 5 6 9 10 12 15
## Absence Presence Absence Presence Presence Presence Presence Presence
## 21 25 28 30 34 36 42 47
## Presence Absence Absence Absence Presence Presence Absence Absence
## 50 52 55 60 74 81 90 91
## Presence Absence Absence Presence Absence Presence Presence Absence
## 100 106 108 117 120 137 141 146
## Absence Absence Presence Presence Presence Absence Presence Presence
## 155 159 166 184 205 209 214 215
## Absence Absence Presence Absence Presence Presence Presence Absence
## 216 217 227 229 233 234 237 239
## Absence Absence Presence Absence Absence Presence Absence Absence
## 243 250 257 258 259 263
## Absence Presence Absence Presence Absence Absence
## Levels: Absence Presence
forest.perf <- table(test_df$Heart.Disease, forest.pred,
dnn = c("Actual","Predicted"))
forest.perf
## Predicted
## Actual Absence Presence
## Absence 24 5
## Presence 4 21
SVM (Support Vector Machine)
library(e1071)
fit.svm <- svm(formula = Heart.Disease ~ .,
data = train_df)
svm.pred <- predict(fit.svm, na.omit(test_df))
svm.pred
## 2 4 5 6 9 10 12 15
## Absence Presence Absence Absence Presence Presence Presence Absence
## 21 25 28 30 34 36 42 47
## Presence Absence Absence Absence Presence Presence Absence Presence
## 50 52 55 60 74 81 90 91
## Presence Absence Absence Presence Absence Presence Presence Absence
## 100 106 108 117 120 137 141 146
## Absence Presence Presence Presence Presence Absence Presence Presence
## 155 159 166 184 205 209 214 215
## Absence Absence Presence Absence Presence Absence Presence Absence
## 216 217 227 229 233 234 237 239
## Absence Absence Presence Absence Absence Presence Absence Absence
## 243 250 257 258 259 263
## Absence Presence Absence Presence Absence Absence
## Levels: Absence Presence
svm.perf <- table(na.omit(test_df)$Heart.Disease, svm.pred,
dnn = c("Actual","Predicted"))
svm.perf
## Predicted
## Actual Absence Presence
## Absence 26 3
## Presence 3 22
# Performance Comparison
jumlah <- function(a, b){
result <- a + b
}
jumlah(0,1)
# Performance Comparison
performance <- function(table){
tn <- table[1,1]
tp <- table[2,2]
fn <- table[1,2]
fp <- table[2,1]
accuracy <- (tn + tp) / (tn + tp + fn + fp)
precision <- tp / (tp + fp)
recall <- tp / (tp + fn)
f1score <- 2 * precision * recall / (precision + recall)
result <- paste("Accuracy = ", round(accuracy, 3),
"\nPrecision = ", round(precision, 3),
"\nRecall = ", round(recall, 3),
"\nF1 Score = ", round(f1score, 3), "\n")
cat(result)
}
# Performance Comparison
performance(logit.perf)
## Accuracy = 0.907
## Precision = 0.92
## Recall = 0.885
## F1 Score = 0.902
performance(ctree.perf)
## Accuracy = 0.815
## Precision = 0.76
## Recall = 0.826
## F1 Score = 0.792
performance(forest.perf)
## Accuracy = 0.833
## Precision = 0.84
## Recall = 0.808
## F1 Score = 0.824
performance(svm.perf)
## Accuracy = 0.889
## Precision = 0.88
## Recall = 0.88
## F1 Score = 0.88