0.1 PENDAHULUAN

WHO memperkirakan 12 juta kematian terjadi di seluruh dunia, setiap tahun karena penyakit jantung. Separuh kematian di Amerika Serikat dan negara maju lainnya disebabkan oleh penyakit kardiovaskular. Prognosis dini penyakit kardiovaskular dapat membantu dalam membuat keputusan tentang perubahan gaya hidup pada pasien berisiko tinggi dan pada gilirannya mengurangi komplikasi. Penelitian ini bermaksud untuk menentukan faktor risiko penyakit jantung yang paling relevan serta memprediksi risiko keseluruhan menggunakan logistic regression.

Dataset berikut berasal dari situs web Kaggle, dan berasal dari studi kardiovaskular yang sedang berlangsung pada penduduk kota Framingham, Massachusetts. Tujuan klasifikasi adalah untuk memprediksi apakah pasien memiliki risiko penyakit jantung koroner (PJK) 10 tahun di masa depan. Kumpulan data menyediakan informasi pasien. Ini mencakup lebih dari 4.000 records dan 15 variable(kolom).

Library yang digunakan :

library(class)
library(GGally)

## Loading required package: ggplot2

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ lubridate 1.9.2     ✔ tibble    3.1.8
## ✔ purrr     1.0.1     ✔ tidyr     1.3.0
## ✔ readr     2.1.4

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors

library(gtools)

1 IMPORT DATA

pasien_jantung <- read.csv("data/framingham.csv")
glimpse(pasien_jantung)

## Rows: 4,238
## Columns: 16
## $ male            <int> 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, …
## $ age             <int> 39, 46, 48, 61, 46, 43, 63, 45, 52, 43, 50, 43, 46, 41…
## $ education       <int> 4, 2, 1, 3, 3, 2, 1, 2, 1, 1, 1, 2, 1, 3, 2, 2, 3, 2, …
## $ currentSmoker   <int> 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, …
## $ cigsPerDay      <int> 0, 0, 20, 30, 23, 0, 0, 20, 0, 30, 0, 0, 15, 0, 9, 20,…
## $ BPMeds          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, …
## $ prevalentStroke <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ prevalentHyp    <int> 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, …
## $ diabetes        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ totChol         <int> 195, 250, 245, 225, 285, 228, 205, 313, 260, 225, 254,…
## $ sysBP           <dbl> 106.0, 121.0, 127.5, 150.0, 130.0, 180.0, 138.0, 100.0…
## $ diaBP           <dbl> 70.0, 81.0, 80.0, 95.0, 84.0, 110.0, 71.0, 71.0, 89.0,…
## $ BMI             <dbl> 26.97, 28.73, 25.34, 28.58, 23.10, 30.30, 33.11, 21.68…
## $ heartRate       <int> 80, 95, 75, 65, 85, 77, 60, 79, 76, 93, 75, 72, 98, 65…
## $ glucose         <int> 77, 76, 70, 103, 85, 99, 85, 78, 79, 88, 76, 61, 64, 8…
## $ TenYearCHD      <int> 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, …

2 CEK MISSING VALUE

anyNA(pasien_jantung)

## [1] TRUE

missing_values <- colSums(is.na(pasien_jantung))
print(missing_values)

##            male             age       education   currentSmoker      cigsPerDay 
##               0               0             105               0              29 
##          BPMeds prevalentStroke    prevalentHyp        diabetes         totChol 
##              53               0               0               0              50 
##           sysBP           diaBP             BMI       heartRate         glucose 
##               0               0              19               1             388 
##      TenYearCHD 
##               0

pasien_jantung <- na.omit(pasien_jantung)
anyNA(pasien_jantung)

## [1] FALSE

3 DATA WRANGLING

pasien_jantung <- pasien_jantung %>% 
  mutate(male = as.factor(male),
         currentSmoker = as.factor(currentSmoker),
         BPMeds = as.factor(BPMeds),
         prevalentStroke = as.factor(prevalentStroke),
         prevalentHyp = as.factor(male),
         diabetes = as.factor(diabetes),
         TenYearCHD = as.factor(TenYearCHD))

str(pasien_jantung)

## 'data.frame':    3656 obs. of  16 variables:
##  $ male           : Factor w/ 2 levels "0","1": 2 1 2 1 1 1 1 1 2 2 ...
##  $ age            : int  39 46 48 61 46 43 63 45 52 43 ...
##  $ education      : int  4 2 1 3 3 2 1 2 1 1 ...
##  $ currentSmoker  : Factor w/ 2 levels "0","1": 1 1 2 2 2 1 1 2 1 2 ...
##  $ cigsPerDay     : int  0 0 20 30 23 0 0 20 0 30 ...
##  $ BPMeds         : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ prevalentStroke: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ prevalentHyp   : Factor w/ 2 levels "0","1": 2 1 2 1 1 1 1 1 2 2 ...
##  $ diabetes       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ totChol        : int  195 250 245 225 285 228 205 313 260 225 ...
##  $ sysBP          : num  106 121 128 150 130 ...
##  $ diaBP          : num  70 81 80 95 84 110 71 71 89 107 ...
##  $ BMI            : num  27 28.7 25.3 28.6 23.1 ...
##  $ heartRate      : int  80 95 75 65 85 77 60 79 76 93 ...
##  $ glucose        : int  77 76 70 103 85 99 85 78 79 88 ...
##  $ TenYearCHD     : Factor w/ 2 levels "0","1": 1 1 1 2 1 1 2 1 1 1 ...
##  - attr(*, "na.action")= 'omit' Named int [1:582] 15 22 27 34 37 43 50 55 71 73 ...
##   ..- attr(*, "names")= chr [1:582] "15" "22" "27" "34" ...

Berikut adalah keterangan untuk setiap kolom : - male: Gender

- age: Usia

- education: Jenjang Pendidikan

- currentSmoker: Status Merokok

- cigsPerDay: Konsumsi Rokok Per Hari

- BPMeds: Status Tekanan Darah

- prevalentStroke: Status Stroke

- prevalentHyp: Status Hipertensi

- diabetes : Status Diabetes

- totChol: Total Kolesterol

- sysBP: Tekanan Darah Systolic

- diaBP: Tekanan Darah Diastolic

- BMI: Body Mass Index

- heartRate: Heart Rate

- glucose: Level Glukosa

- TenYearCHD: Resiko Penyakit Jantung dalam 10 Tahun Ke Depan

4 EXPLORATORY DATA ANALYSIS

summary(pasien_jantung)

##  male          age          education    currentSmoker   cigsPerDay    
##  0:2034   Min.   :32.00   Min.   :1.00   0:1868        Min.   : 0.000  
##  1:1622   1st Qu.:42.00   1st Qu.:1.00   1:1788        1st Qu.: 0.000  
##           Median :49.00   Median :2.00                 Median : 0.000  
##           Mean   :49.56   Mean   :1.98                 Mean   : 9.022  
##           3rd Qu.:56.00   3rd Qu.:3.00                 3rd Qu.:20.000  
##           Max.   :70.00   Max.   :4.00                 Max.   :70.000  
##  BPMeds   prevalentStroke prevalentHyp diabetes    totChol          sysBP      
##  0:3545   0:3635          0:2034       0:3557   Min.   :113.0   Min.   : 83.5  
##  1: 111   1:  21          1:1622       1:  99   1st Qu.:206.0   1st Qu.:117.0  
##                                                 Median :234.0   Median :128.0  
##                                                 Mean   :236.9   Mean   :132.4  
##                                                 3rd Qu.:263.2   3rd Qu.:144.0  
##                                                 Max.   :600.0   Max.   :295.0  
##      diaBP             BMI          heartRate         glucose       TenYearCHD
##  Min.   : 48.00   Min.   :15.54   Min.   : 44.00   Min.   : 40.00   0:3099    
##  1st Qu.: 75.00   1st Qu.:23.08   1st Qu.: 68.00   1st Qu.: 71.00   1: 557    
##  Median : 82.00   Median :25.38   Median : 75.00   Median : 78.00             
##  Mean   : 82.91   Mean   :25.78   Mean   : 75.73   Mean   : 81.86             
##  3rd Qu.: 90.00   3rd Qu.:28.04   3rd Qu.: 82.00   3rd Qu.: 87.00             
##  Max.   :142.50   Max.   :56.80   Max.   :143.00   Max.   :394.00

prop.table(table(pasien_jantung$TenYearCHD))

## 
##         0         1 
## 0.8476477 0.1523523

pasien_tidak_jantung <- pasien_jantung %>%
  filter(pasien_jantung$TenYearCHD %in% "0")

pasien_tidak_jantung_downsample <- pasien_tidak_jantung[sample(nrow(pasien_tidak_jantung), 557), ]

pasien_jantung <- pasien_jantung %>%
  filter(pasien_jantung$TenYearCHD %in% "1")

pasien_jantung <- bind_rows(pasien_tidak_jantung_downsample,pasien_jantung)
pasien_jantung <- pasien_jantung[sample(1:nrow(pasien_jantung)), ]

prop.table(table(pasien_jantung$TenYearCHD))

## 
##   0   1 
## 0.5 0.5

ggcorr(pasien_jantung, label = T)

## Warning in ggcorr(pasien_jantung, label = T): data in column(s) 'male',
## 'currentSmoker', 'BPMeds', 'prevalentStroke', 'prevalentHyp', 'diabetes',
## 'TenYearCHD' are not numeric and were ignored

5 CROSS VALIDATION

set.seed(417) 

index <- sample(x = nrow(pasien_jantung), size = nrow(pasien_jantung)*0.8)

pasien_jantung_train <- pasien_jantung[index,]
pasien_jantung_test <- pasien_jantung[-index,]

6 MEMBUAT MODEL

# model base
model_pasien_jantung <- glm(
  formula = TenYearCHD ~ .,
  data = pasien_jantung_train,
  family = "binomial",
  control = list(trace=FALSE) # cek perubahan deviasi setiap iterasi
)

summary(model_pasien_jantung)

## 
## Call:
## glm(formula = TenYearCHD ~ ., family = "binomial", data = pasien_jantung_train, 
##     control = list(trace = FALSE))
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.265  -1.001   0.365   1.001   2.240  
## 
## Coefficients: (1 not defined because of singularities)
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -7.041253   0.991289  -7.103 1.22e-12 ***
## male1             0.531743   0.161034   3.302 0.000960 ***
## age               0.054760   0.010057   5.445 5.19e-08 ***
## education        -0.068126   0.074764  -0.911 0.362185    
## currentSmoker1    0.320431   0.250059   1.281 0.200045    
## cigsPerDay        0.012425   0.010113   1.229 0.219218    
## BPMeds1           0.109960   0.412281   0.267 0.789693    
## prevalentStroke1  1.227634   1.099875   1.116 0.264354    
## prevalentHyp1           NA         NA      NA       NA    
## diabetes1         0.818239   0.691344   1.184 0.236592    
## totChol           0.003258   0.001666   1.955 0.050533 .  
## sysBP             0.022441   0.005788   3.877 0.000106 ***
## diaBP            -0.003623   0.009980  -0.363 0.716575    
## BMI               0.015231   0.019437   0.784 0.433265    
## heartRate        -0.006627   0.006613  -1.002 0.316316    
## glucose           0.004987   0.004166   1.197 0.231254    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1234.4  on 890  degrees of freedom
## Residual deviance: 1058.4  on 876  degrees of freedom
## AIC: 1088.4
## 
## Number of Fisher Scoring iterations: 4

# stepwise
model_step <- step(object = model_pasien_jantung,
                   direction = "backward",
                   trace = F)
summary(model_step)

## 
## Call:
## glm(formula = TenYearCHD ~ male + age + cigsPerDay + diabetes + 
##     totChol + sysBP, family = "binomial", data = pasien_jantung_train, 
##     control = list(trace = FALSE))
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.273  -1.009   0.408   1.005   2.207  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -6.976530   0.685436 -10.178  < 2e-16 ***
## male1        0.516739   0.156168   3.309 0.000937 ***
## age          0.055798   0.009608   5.808 6.34e-09 ***
## cigsPerDay   0.020972   0.006591   3.182 0.001463 ** 
## diabetes1    1.397930   0.511287   2.734 0.006254 ** 
## totChol      0.003083   0.001650   1.868 0.061763 .  
## sysBP        0.021523   0.003668   5.868 4.40e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1234.4  on 890  degrees of freedom
## Residual deviance: 1065.7  on 884  degrees of freedom
## AIC: 1079.7
## 
## Number of Fisher Scoring iterations: 4

exp(model_step$coefficient)

##  (Intercept)        male1          age   cigsPerDay    diabetes1      totChol 
## 0.0009335367 1.6765521990 1.0573843107 1.0211934388 4.0468141905 1.0030877274 
##        sysBP 
## 1.0217562106

Setiap kenaikan 1 nilai usia, pasien meningkatkan odds berpenyakit jantung sebesar 1.06 kali dengan catatan seluruh prediktor lain bernilai tetap. Setiap pasien laki- laki, 1.7 kali lebih mungkin berpenyakit jantung dibandingkan dengan perempuan.

7 PREDICT

pasien_jantung_test$pred_Risk <- predict(
  object = model_step,
  newdata = pasien_jantung_test,
  type = "response"
)

pasien_jantung_test$pred_Label <- 
  ifelse(pasien_jantung_test$pred_Risk < 0.5, 0, 1) %>% 
  as.factor()

# lihat hasil prediksi
pasien_jantung_test %>% 
  select(pred_Risk, pred_Label, TenYearCHD)

8 MODEL EVALUATION

table(predict = pasien_jantung_test$pred_Label,
      actual = pasien_jantung_test$TenYearCHD)

##        actual
## predict  0  1
##       0 83 30
##       1 42 68

# confusion matrix
library(caret)

## Loading required package: lattice

## 
## Attaching package: 'caret'

## The following object is masked from 'package:purrr':
## 
##     lift

confusionMatrix(
  data = pasien_jantung_test$pred_Label, # hasil prediksi kita
  reference = pasien_jantung_test$TenYearCHD, # data asli/aktual
  positive = "1"
)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 83 30
##          1 42 68
##                                          
##                Accuracy : 0.6771         
##                  95% CI : (0.6115, 0.738)
##     No Information Rate : 0.5605         
##     P-Value [Acc > NIR] : 0.0002452      
##                                          
##                   Kappa : 0.3532         
##                                          
##  Mcnemar's Test P-Value : 0.1948506      
##                                          
##             Sensitivity : 0.6939         
##             Specificity : 0.6640         
##          Pos Pred Value : 0.6182         
##          Neg Pred Value : 0.7345         
##              Prevalence : 0.4395         
##          Detection Rate : 0.3049         
##    Detection Prevalence : 0.4933         
##       Balanced Accuracy : 0.6789         
##                                          
##        'Positive' Class : 1              
##

Dari hasil confusionMatrix() di atas kita dapat melihat rentang nilai yang cukup besar antara True Positive dan False Positive. Kita dapat melihat kebaikan dari sebuah model ini berdasarkan nilai Sensitivity. Model yang kita punya memiliki nilai Sensitivity 0.7327.

9 K-NN

Metode K-NN ini akan mengklasifikasikan data test berdasarkan karakteristik data train. Klasifikasi ditentukan berdasarkan kedekatan karakteristik yang diukur dengan cara Euclidean Distance atau menghitung jarak. Setelah itu dipilih menggunakan majority voting sesuai dengan nilai k yang sudah kita tentukan. Nilai k adalah jumlah data terdekat dari data test. Untuk mendapatkan nilai k yang optimal kita dapat mengakar kuadratkan jumlah baris pada data kita.

sqrt(nrow(pasien_jantung_train))

## [1] 29.84962

Kita akan memisahkan terlebih dahulu variabel target dan variabel prediktor sebelum membuat model K-NN.

pasien_jantung_train_x <- pasien_jantung_train[,-c(1,4,6,7,8,9,16)]
pasien_jantung_test_x <- pasien_jantung_test[,-c(1,4,6,7,8,9,16,17,18)]

pasien_jantung_train_y <- pasien_jantung_train$TenYearCHD
pasien_jantung_test_y <- pasien_jantung_test$TenYearCHD

Berikut adalah proses scaling pada variabel prediktor di data train dan data test.

pasien_jantung_train_xs <- scale(pasien_jantung_train_x)

pasien_jantung_test_xs <- scale(pasien_jantung_test_x , 
                        center = attr(pasien_jantung_train_xs,"scaled:center"), 
                        scale = attr(pasien_jantung_train_xs,"scaled:scale"))

Berikut adalah prediksi KNN menggunakan fungsi knn().

pasien_jantung_pred <- knn(train = pasien_jantung_train_xs, 
                   test = pasien_jantung_test_xs, 
                   cl = pasien_jantung_train_y, 
                   k = 29)

head(pasien_jantung_pred)

## [1] 1 0 1 0 1 1
## Levels: 0 1

Kemudian kita melakukan confusionMatrix untuk evaluasi model KNN.

confusionMatrix(data = pasien_jantung_pred, 
                reference = pasien_jantung_test_y, 
                positive = "1")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 82 38
##          1 43 60
##                                           
##                Accuracy : 0.6368          
##                  95% CI : (0.5699, 0.6999)
##     No Information Rate : 0.5605          
##     P-Value [Acc > NIR] : 0.01252         
##                                           
##                   Kappa : 0.2668          
##                                           
##  Mcnemar's Test P-Value : 0.65672         
##                                           
##             Sensitivity : 0.6122          
##             Specificity : 0.6560          
##          Pos Pred Value : 0.5825          
##          Neg Pred Value : 0.6833          
##              Prevalence : 0.4395          
##          Detection Rate : 0.2691          
##    Detection Prevalence : 0.4619          
##       Balanced Accuracy : 0.6341          
##                                           
##        'Positive' Class : 1               
##

Model K-NN memiliki nilai Sensitivity yang lebih rendah dibanding dengan model sebelumnya.

10 KESIMPULAN

Pada kasus data penyakit jantung ini, model Logistic Regression masih lebih baik dibanding dengan model K-NN. Kemungkinan karena faktor jenis kelamin dan diabetes cukup mempengaruhi penyakit jantung. Dimana jenis kelamin dan diabetes ini tidak digunakan pada model KNN karena merupakan variable kategorik.

LBB Classification in Machine Learning I

David Ricardo

2023-03-29