0.1 PENDAHULUAN

WHO memperkirakan 12 juta kematian terjadi di seluruh dunia, setiap tahun karena penyakit jantung. Separuh kematian di Amerika Serikat dan negara maju lainnya disebabkan oleh penyakit kardiovaskular. Prognosis dini penyakit kardiovaskular dapat membantu dalam membuat keputusan tentang perubahan gaya hidup pada pasien berisiko tinggi dan pada gilirannya mengurangi komplikasi. Penelitian ini bermaksud untuk menentukan faktor risiko penyakit jantung yang paling relevan serta memprediksi risiko keseluruhan menggunakan logistic regression.

Dataset berikut berasal dari situs web Kaggle, dan berasal dari studi kardiovaskular yang sedang berlangsung pada penduduk kota Framingham, Massachusetts. Tujuan klasifikasi adalah untuk memprediksi apakah pasien memiliki risiko penyakit jantung koroner (PJK) 10 tahun di masa depan. Kumpulan data menyediakan informasi pasien. Ini mencakup lebih dari 4.000 records dan 15 variable(kolom).

Library yang digunakan :

library(e1071)
library(caret)

## Loading required package: ggplot2

## Loading required package: lattice

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(partykit)

## Loading required package: grid

## Loading required package: libcoin

## Loading required package: mvtnorm

library(class)

1 IMPORT DATA

pasien_jantung <- read.csv("data/framingham.csv")
glimpse(pasien_jantung)

## Rows: 4,238
## Columns: 16
## $ male            <int> 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, …
## $ age             <int> 39, 46, 48, 61, 46, 43, 63, 45, 52, 43, 50, 43, 46, 41…
## $ education       <int> 4, 2, 1, 3, 3, 2, 1, 2, 1, 1, 1, 2, 1, 3, 2, 2, 3, 2, …
## $ currentSmoker   <int> 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, …
## $ cigsPerDay      <int> 0, 0, 20, 30, 23, 0, 0, 20, 0, 30, 0, 0, 15, 0, 9, 20,…
## $ BPMeds          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, …
## $ prevalentStroke <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ prevalentHyp    <int> 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, …
## $ diabetes        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ totChol         <int> 195, 250, 245, 225, 285, 228, 205, 313, 260, 225, 254,…
## $ sysBP           <dbl> 106.0, 121.0, 127.5, 150.0, 130.0, 180.0, 138.0, 100.0…
## $ diaBP           <dbl> 70.0, 81.0, 80.0, 95.0, 84.0, 110.0, 71.0, 71.0, 89.0,…
## $ BMI             <dbl> 26.97, 28.73, 25.34, 28.58, 23.10, 30.30, 33.11, 21.68…
## $ heartRate       <int> 80, 95, 75, 65, 85, 77, 60, 79, 76, 93, 75, 72, 98, 65…
## $ glucose         <int> 77, 76, 70, 103, 85, 99, 85, 78, 79, 88, 76, 61, 64, 8…
## $ TenYearCHD      <int> 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, …

2 CEK MISSING VALUE

anyNA(pasien_jantung)

## [1] TRUE

missing_values <- colSums(is.na(pasien_jantung))
print(missing_values)

##            male             age       education   currentSmoker      cigsPerDay 
##               0               0             105               0              29 
##          BPMeds prevalentStroke    prevalentHyp        diabetes         totChol 
##              53               0               0               0              50 
##           sysBP           diaBP             BMI       heartRate         glucose 
##               0               0              19               1             388 
##      TenYearCHD 
##               0

pasien_jantung <- na.omit(pasien_jantung)
anyNA(pasien_jantung)

## [1] FALSE

3 DATA WRANGLING

pasien_jantung <- pasien_jantung %>% 
  mutate(male = as.factor(male),
         currentSmoker = as.factor(currentSmoker),
         BPMeds = as.factor(BPMeds),
         prevalentStroke = as.factor(prevalentStroke),
         prevalentHyp = as.factor(male),
         diabetes = as.factor(diabetes),
         TenYearCHD = as.factor(TenYearCHD))

str(pasien_jantung)

## 'data.frame':    3656 obs. of  16 variables:
##  $ male           : Factor w/ 2 levels "0","1": 2 1 2 1 1 1 1 1 2 2 ...
##  $ age            : int  39 46 48 61 46 43 63 45 52 43 ...
##  $ education      : int  4 2 1 3 3 2 1 2 1 1 ...
##  $ currentSmoker  : Factor w/ 2 levels "0","1": 1 1 2 2 2 1 1 2 1 2 ...
##  $ cigsPerDay     : int  0 0 20 30 23 0 0 20 0 30 ...
##  $ BPMeds         : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ prevalentStroke: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ prevalentHyp   : Factor w/ 2 levels "0","1": 2 1 2 1 1 1 1 1 2 2 ...
##  $ diabetes       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ totChol        : int  195 250 245 225 285 228 205 313 260 225 ...
##  $ sysBP          : num  106 121 128 150 130 ...
##  $ diaBP          : num  70 81 80 95 84 110 71 71 89 107 ...
##  $ BMI            : num  27 28.7 25.3 28.6 23.1 ...
##  $ heartRate      : int  80 95 75 65 85 77 60 79 76 93 ...
##  $ glucose        : int  77 76 70 103 85 99 85 78 79 88 ...
##  $ TenYearCHD     : Factor w/ 2 levels "0","1": 1 1 1 2 1 1 2 1 1 1 ...
##  - attr(*, "na.action")= 'omit' Named int [1:582] 15 22 27 34 37 43 50 55 71 73 ...
##   ..- attr(*, "names")= chr [1:582] "15" "22" "27" "34" ...

Berikut adalah keterangan untuk setiap kolom : - male: Gender

- age: Usia

- education: Jenjang Pendidikan

- currentSmoker: Status Merokok

- cigsPerDay: Konsumsi Rokok Per Hari

- BPMeds: Status Tekanan Darah

- prevalentStroke: Status Stroke

- prevalentHyp: Status Hipertensi

- diabetes : Status Diabetes

- totChol: Total Kolesterol

- sysBP: Tekanan Darah Systolic

- diaBP: Tekanan Darah Diastolic

- BMI: Body Mass Index

- heartRate: Heart Rate

- glucose: Level Glukosa

- TenYearCHD: Resiko Penyakit Jantung dalam 10 Tahun Ke Depan

4 EXPLORATORY DATA ANALYSIS

summary(pasien_jantung)

##  male          age          education    currentSmoker   cigsPerDay    
##  0:2034   Min.   :32.00   Min.   :1.00   0:1868        Min.   : 0.000  
##  1:1622   1st Qu.:42.00   1st Qu.:1.00   1:1788        1st Qu.: 0.000  
##           Median :49.00   Median :2.00                 Median : 0.000  
##           Mean   :49.56   Mean   :1.98                 Mean   : 9.022  
##           3rd Qu.:56.00   3rd Qu.:3.00                 3rd Qu.:20.000  
##           Max.   :70.00   Max.   :4.00                 Max.   :70.000  
##  BPMeds   prevalentStroke prevalentHyp diabetes    totChol          sysBP      
##  0:3545   0:3635          0:2034       0:3557   Min.   :113.0   Min.   : 83.5  
##  1: 111   1:  21          1:1622       1:  99   1st Qu.:206.0   1st Qu.:117.0  
##                                                 Median :234.0   Median :128.0  
##                                                 Mean   :236.9   Mean   :132.4  
##                                                 3rd Qu.:263.2   3rd Qu.:144.0  
##                                                 Max.   :600.0   Max.   :295.0  
##      diaBP             BMI          heartRate         glucose       TenYearCHD
##  Min.   : 48.00   Min.   :15.54   Min.   : 44.00   Min.   : 40.00   0:3099    
##  1st Qu.: 75.00   1st Qu.:23.08   1st Qu.: 68.00   1st Qu.: 71.00   1: 557    
##  Median : 82.00   Median :25.38   Median : 75.00   Median : 78.00             
##  Mean   : 82.91   Mean   :25.78   Mean   : 75.73   Mean   : 81.86             
##  3rd Qu.: 90.00   3rd Qu.:28.04   3rd Qu.: 82.00   3rd Qu.: 87.00             
##  Max.   :142.50   Max.   :56.80   Max.   :143.00   Max.   :394.00

prop.table(table(pasien_jantung$TenYearCHD))

## 
##         0         1 
## 0.8476477 0.1523523

Karena proporsi data target tidak seimbang, maka kita perlu untuk melakukan downsample pada data pasien tidak jantung.

pasien_tidak_jantung <- pasien_jantung %>%
  filter(pasien_jantung$TenYearCHD %in% "0")

pasien_tidak_jantung_downsample <- pasien_tidak_jantung[sample(nrow(pasien_tidak_jantung), 557), ]

pasien_jantung <- pasien_jantung %>%
  filter(pasien_jantung$TenYearCHD %in% "1")

pasien_jantung <- bind_rows(pasien_tidak_jantung_downsample,pasien_jantung)
pasien_jantung <- pasien_jantung[sample(1:nrow(pasien_jantung)), ]

Proporsi data target telah seimbang.

prop.table(table(pasien_jantung$TenYearCHD))

## 
##   0   1 
## 0.5 0.5

5 CROSS VALIDATION

Membuat data train dan data test :

set.seed(417) 

index <- sample(x = nrow(pasien_jantung), size = nrow(pasien_jantung)*0.8)

pasien_jantung_train <- pasien_jantung[index,]
pasien_jantung_test <- pasien_jantung[-index,]

#Menghilangkan Varibel Target
pasien_jantung_test_prediktor <- pasien_jantung_test[1:15]

6 MEMBUAT MODEL

6.1 Model Naive Bayes

model_naive<- naiveBayes(TenYearCHD ~ ., data = pasien_jantung_train) 
pred_naive <- predict(model_naive, newdata = pasien_jantung_test_prediktor) 

(conf_matrix_naive <- table(pred_naive, pasien_jantung_test$TenYearCHD))

##           
## pred_naive  0  1
##          0 86 54
##          1 21 62

confusionMatrix(conf_matrix_naive)

## Confusion Matrix and Statistics
## 
##           
## pred_naive  0  1
##          0 86 54
##          1 21 62
##                                           
##                Accuracy : 0.6637          
##                  95% CI : (0.5976, 0.7254)
##     No Information Rate : 0.5202          
##     P-Value [Acc > NIR] : 9.968e-06       
##                                           
##                   Kappa : 0.3342          
##                                           
##  Mcnemar's Test P-Value : 0.0002199       
##                                           
##             Sensitivity : 0.8037          
##             Specificity : 0.5345          
##          Pos Pred Value : 0.6143          
##          Neg Pred Value : 0.7470          
##              Prevalence : 0.4798          
##          Detection Rate : 0.3857          
##    Detection Prevalence : 0.6278          
##       Balanced Accuracy : 0.6691          
##                                           
##        'Positive' Class : 0               
##

Dari output model Naive Bayes terlihat bahwa tingkat sensitivity model sebesar 79.09%

6.2 Model Decision Tree

dtree_model <- ctree(TenYearCHD ~ .,pasien_jantung_train)
plot(dtree_model)

plot(dtree_model, type="simple")

7 PREDICT

predict(dtree_model, head(pasien_jantung_test[,-1]))

## 563 585  43 578 594 178 
##   1   1   1   1   0   0 
## Levels: 0 1

pred_dtree_model <- predict(dtree_model, pasien_jantung_test[,-1])

(conf_matrix_dtree <- table(pred_dtree_model, pasien_jantung_test$TenYearCHD))

##                 
## pred_dtree_model  0  1
##                0 64 40
##                1 43 76

predict(dtree_model, head(pasien_jantung_test[,-1]), type="prob")

##             0         1
## 563 0.2618026 0.7381974
## 585 0.2618026 0.7381974
## 43  0.3791209 0.6208791
## 578 0.3791209 0.6208791
## 594 0.6524823 0.3475177
## 178 0.6524823 0.3475177

caret::confusionMatrix(pred_dtree_model, pasien_jantung_test[,1])

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 77 27
##          1 37 82
##                                           
##                Accuracy : 0.713           
##                  95% CI : (0.6488, 0.7714)
##     No Information Rate : 0.5112          
##     P-Value [Acc > NIR] : 6.713e-10       
##                                           
##                   Kappa : 0.4269          
##                                           
##  Mcnemar's Test P-Value : 0.2606          
##                                           
##             Sensitivity : 0.6754          
##             Specificity : 0.7523          
##          Pos Pred Value : 0.7404          
##          Neg Pred Value : 0.6891          
##              Prevalence : 0.5112          
##          Detection Rate : 0.3453          
##    Detection Prevalence : 0.4664          
##       Balanced Accuracy : 0.7139          
##                                           
##        'Positive' Class : 0               
##

Dari output model Decision Tree terlihat bahwa tingkat sensitivity model sebesar 62.61%

8 KESIMPULAN

Pada kasus data penyakit jantung ini, model Naive Bayes masih lebih baik dibanding dengan model Decision Tree.

LBB Classification in Machine Learning I

David Ricardo

2023-03-29