Heart Disease Prediction Analysis Using the C5.0 Algorithm

Moh. Ainur Rohman and Prof. Dr. Suhartono, M.Kom.

UIN Maulana Malik Ibrahim Malang

Master's Program in Informatics
28 November 2021

The C5.0 algorithm is a classification algorithm that produces a decision tree, developed by Ross Quinlan as a refinement of his earlier ID3 and C4.5 algorithms. In terms of memory management and accuracy, C5.0 outperforms C4.5 [1].

C5.0 can express its model as a set of rules (a rule-based model), which makes it easy to inspect the rules behind the decision tree. In addition, C5.0 can handle missing values. These features are regarded as advantages of C5.0 over other algorithms.

The C5.0 model works by splitting the samples on the attribute with the highest information gain [2]. The information of a set of cases over the classes is computed with the formula

\(I(S_1,S_2,...,S_m)=-\sum\limits_{i=1}^m p_i\log_2(p_i)\)

where \(I(S_1,S_2,...,S_m)\) is the information of the set of cases over the m classes, \(p_i = S_i/S\) is the proportion of class i, \(S_i\) is the number of samples in class i, and \(S\) is the total set of cases. The next step is to compute the information of the cases of class i within subset j:

\(I(S_{1j},S_{2j},...,S_{mj})=-\sum\limits_{i=1}^m p_{ij}\log_2(p_{ij})\)

where \(I(S_{1j},S_{2j},...,S_{mj})\) is the information of the cases of class i within subset j, \(S_{ij}\) is the number of samples of class i in subset j, and \(p_{ij}\) is the proportion of class i within subset j.

The information of the cases in class i and subset j is then used to compute the entropy of an attribute:

\(E(A)=\sum\limits_{j=1}^{v} \frac{S_{1j}+...+S_{mj}}{S}\, I(S_{1j},...,S_{mj})\)

where \(E(A)\) is the entropy of attribute A, \(v\) is the number of subsets (distinct values) of A, and \(S_{ij}\) is the number of samples of class i in subset j of attribute A. The final step is to compute the information gain, which is used to select the attribute that becomes the next node:

\(Gain (A) = I(S_1,S_2,...,S_m)-E(A)\)

The process is repeated until the sample subsets can no longer be split.
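
Before applying C5.0 to a real dataset, the formulas above can be illustrated with a short R sketch. The toy data frame and the helper functions info, entropy_attr, and the final gain expression below are hypothetical, chosen only to make the calculation concrete; they are not part of the C50 package.

# Toy example (hypothetical data, not the heart-disease set) illustrating
# I(S), E(A), and Gain(A) as defined above.
toy <- data.frame(
  Outlook = c("Sunny", "Sunny", "Rain", "Rain", "Rain", "Sunny"),
  Play    = c("No", "No", "Yes", "Yes", "No", "Yes")
)

# I(S1, ..., Sm): information of a set of cases
info <- function(classes) {
  p <- prop.table(table(classes))
  -sum(p * log2(p))
}

# E(A): information of the subsets induced by attribute A, weighted by size
entropy_attr <- function(attr, classes) {
  sum(sapply(split(classes, attr),
             function(s) length(s) / length(classes) * info(s)))
}

# Gain(A) = I(S) - E(A); the attribute with the largest gain becomes the node
info(toy$Play) - entropy_attr(toy$Outlook, toy$Play)  # about 0.082 for this toy data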

In this article, the C5.0 algorithm is used to analyse heart disease. The dataset comes from Kaggle, uploaded by the user "Rishi Damarla" [4].

  1. Import Libraries

library(dplyr) #Data preprocessing
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(party) #Decision tree
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: sandwich
library(C50) #C5.0
library(tidyrules)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.5     v stringr 1.4.0
## v tidyr   1.1.4     v forcats 0.5.1
## v readr   2.0.2
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x stringr::boundary() masks strucchange::boundary()
## x dplyr::filter()     masks stats::filter()
## x dplyr::lag()        masks stats::lag()
library(pander)
library(DT) #data table

  2. Import the Dataset

data = read.csv("data/Heart_Disease_Prediction.csv", header = TRUE)
datatable(data, caption = "Dataset Prediksi Penyakit Hati")

  3. Data Preprocessing

Check whether the data contains any missing values.

summary(data)
##       Age             Sex         Chest.pain.type       BP       
##  Min.   :29.00   Min.   :0.0000   Min.   :1.000   Min.   : 94.0  
##  1st Qu.:48.00   1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:120.0  
##  Median :55.00   Median :1.0000   Median :3.000   Median :130.0  
##  Mean   :54.43   Mean   :0.6778   Mean   :3.174   Mean   :131.3  
##  3rd Qu.:61.00   3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:140.0  
##  Max.   :77.00   Max.   :1.0000   Max.   :4.000   Max.   :200.0  
##   Cholesterol     FBS.over.120     EKG.results        Max.HR     
##  Min.   :126.0   Min.   :0.0000   Min.   :0.000   Min.   : 71.0  
##  1st Qu.:213.0   1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:133.0  
##  Median :245.0   Median :0.0000   Median :2.000   Median :153.5  
##  Mean   :249.7   Mean   :0.1481   Mean   :1.022   Mean   :149.7  
##  3rd Qu.:280.0   3rd Qu.:0.0000   3rd Qu.:2.000   3rd Qu.:166.0  
##  Max.   :564.0   Max.   :1.0000   Max.   :2.000   Max.   :202.0  
##  Exercise.angina  ST.depression   Slope.of.ST    Number.of.vessels.fluro
##  Min.   :0.0000   Min.   :0.00   Min.   :1.000   Min.   :0.0000         
##  1st Qu.:0.0000   1st Qu.:0.00   1st Qu.:1.000   1st Qu.:0.0000         
##  Median :0.0000   Median :0.80   Median :2.000   Median :0.0000         
##  Mean   :0.3296   Mean   :1.05   Mean   :1.585   Mean   :0.6704         
##  3rd Qu.:1.0000   3rd Qu.:1.60   3rd Qu.:2.000   3rd Qu.:1.0000         
##  Max.   :1.0000   Max.   :6.20   Max.   :3.000   Max.   :3.0000         
##     Thallium     Heart.Disease     
##  Min.   :3.000   Length:270        
##  1st Qu.:3.000   Class :character  
##  Median :3.000   Mode  :character  
##  Mean   :4.696                     
##  3rd Qu.:7.000                     
##  Max.   :7.000

The summary shows that the data contains no missing values. Next, we check whether any column is of type chr, since, as in the previous experiment, the decision-tree functions do not accept character columns.

str(data)
## 'data.frame':    270 obs. of  14 variables:
##  $ Age                    : int  70 67 57 64 74 65 56 59 60 63 ...
##  $ Sex                    : int  1 0 1 1 0 1 1 1 1 0 ...
##  $ Chest.pain.type        : int  4 3 2 4 2 4 3 4 4 4 ...
##  $ BP                     : int  130 115 124 128 120 120 130 110 140 150 ...
##  $ Cholesterol            : int  322 564 261 263 269 177 256 239 293 407 ...
##  $ FBS.over.120           : int  0 0 0 0 0 0 1 0 0 0 ...
##  $ EKG.results            : int  2 2 0 0 2 0 2 2 2 2 ...
##  $ Max.HR                 : int  109 160 141 105 121 140 142 142 170 154 ...
##  $ Exercise.angina        : int  0 0 0 1 1 0 1 1 0 0 ...
##  $ ST.depression          : num  2.4 1.6 0.3 0.2 0.2 0.4 0.6 1.2 1.2 4 ...
##  $ Slope.of.ST            : int  2 2 1 2 1 1 2 2 2 2 ...
##  $ Number.of.vessels.fluro: int  3 0 0 1 1 0 1 1 2 3 ...
##  $ Thallium               : int  3 7 7 7 3 7 6 7 7 7 ...
##  $ Heart.Disease          : chr  "Presence" "Absence" "Presence" "Absence" ...

The Heart.Disease attribute is of type chr, so we first convert it to a factor.

data <- data %>%
  mutate(across(where(is.character), as.factor)) # convert every chr column to factor

Check again that the chr column has indeed become a factor.

str(data)
## 'data.frame':    270 obs. of  14 variables:
##  $ Age                    : int  70 67 57 64 74 65 56 59 60 63 ...
##  $ Sex                    : int  1 0 1 1 0 1 1 1 1 0 ...
##  $ Chest.pain.type        : int  4 3 2 4 2 4 3 4 4 4 ...
##  $ BP                     : int  130 115 124 128 120 120 130 110 140 150 ...
##  $ Cholesterol            : int  322 564 261 263 269 177 256 239 293 407 ...
##  $ FBS.over.120           : int  0 0 0 0 0 0 1 0 0 0 ...
##  $ EKG.results            : int  2 2 0 0 2 0 2 2 2 2 ...
##  $ Max.HR                 : int  109 160 141 105 121 140 142 142 170 154 ...
##  $ Exercise.angina        : int  0 0 0 1 1 0 1 1 0 0 ...
##  $ ST.depression          : num  2.4 1.6 0.3 0.2 0.2 0.4 0.6 1.2 1.2 4 ...
##  $ Slope.of.ST            : int  2 2 1 2 1 1 2 2 2 2 ...
##  $ Number.of.vessels.fluro: int  3 0 0 1 1 0 1 1 2 3 ...
##  $ Thallium               : int  3 7 7 7 3 7 6 7 7 7 ...
##  $ Heart.Disease          : Factor w/ 2 levels "Absence","Presence": 2 1 2 1 1 1 2 2 2 2 ...

  4. Data Analysis

Define the model formula.

predictor <- Heart.Disease~Age+Sex+Chest.pain.type+BP+Cholesterol+FBS.over.120+EKG.results+Max.HR+Exercise.angina+ST.depression+Slope.of.ST+Number.of.vessels.fluro+Thallium
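
As an aside, since every remaining column is used as a predictor, the same formula can be written with the dot shorthand; the name predictor_dot below is mine, not part of the original analysis.

predictor_dot <- Heart.Disease ~ .  # equivalent: use all other columns as predictors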

The dataset is divided into 10 folds to be tested against the model.

set.seed(1234)
# cross fold validation
folds <- cut(seq(1, nrow(data)), breaks = 10, labels = FALSE)
for(i in 1:10){
  testIndexes <- which(folds==i, arr.ind = TRUE)
  testData <- data[testIndexes,]
  trainData <- data[-testIndexes,]
}
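
Note that this loop overwrites testData and trainData on every pass, so only the last fold (27 rows) is actually held out; the models below are trained on the remaining 243 rows. A full 10-fold evaluation is not performed here; a minimal sketch of what it could look like (my assumption, reusing the folds and predictor objects defined above) is:

# Hypothetical sketch: train and test a C5.0 tree on each of the 10 folds
# and average the accuracy (note: `folds` are contiguous blocks, not shuffled).
cv_acc <- sapply(1:10, function(i) {
  testIdx <- which(folds == i)
  fit     <- C5.0(predictor, data = data[-testIdx, ])
  pred    <- predict(fit, data[testIdx, ])
  mean(pred == data[testIdx, "Heart.Disease"])
})
mean(cv_acc)  # average accuracy across the 10 folds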

The C5.0 algorithm can produce two kinds of models: a tree-based model and a rule-based model.

Build the tree-based model.

treec5 <- C5.0(predictor, data = trainData)

Display the tree-based model.

treec5
## 
## Call:
## C5.0.formula(formula = predictor, data = trainData)
## 
## Classification Tree
## Number of samples: 243 
## Number of predictors: 13 
## 
## Tree size: 20 
## 
## Non-standard options: attempt to group attributes
plot(treec5)

summary(treec5)
## 
## Call:
## C5.0.formula(formula = predictor, data = trainData)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Sun Nov 28 17:22:24 2021
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 243 cases (14 attributes) from undefined.data
## 
## Decision tree:
## 
## Thallium <= 3:
## :...Number.of.vessels.fluro <= 0:
## :   :...BP <= 146: Absence (82/3)
## :   :   BP > 146:
## :   :   :...Slope.of.ST <= 1: Absence (8/2)
## :   :       Slope.of.ST > 1: Presence (5/1)
## :   Number.of.vessels.fluro > 0:
## :   :...Chest.pain.type <= 3:
## :       :...Slope.of.ST <= 1: Absence (18/2)
## :       :   Slope.of.ST > 1:
## :       :   :...ST.depression <= 0.9: Absence (3)
## :       :       ST.depression > 0.9: Presence (2)
## :       Chest.pain.type > 3:
## :       :...Sex > 0: Presence (12)
## :           Sex <= 0:
## :           :...Slope.of.ST <= 1: Absence (2)
## :               Slope.of.ST > 1: Presence (3/1)
## Thallium > 3:
## :...Chest.pain.type <= 3:
##     :...Number.of.vessels.fluro <= 0:
##     :   :...Exercise.angina <= 0: Absence (15/2)
##     :   :   Exercise.angina > 0:
##     :   :   :...ST.depression <= 1.5: Absence (2)
##     :   :       ST.depression > 1.5: Presence (3)
##     :   Number.of.vessels.fluro > 0:
##     :   :...Slope.of.ST > 1: Presence (12/1)
##     :       Slope.of.ST <= 1:
##     :       :...EKG.results <= 1: Absence (3)
##     :           EKG.results > 1: Presence (2)
##     Chest.pain.type > 3:
##     :...ST.depression > 0.5: Presence (53/2)
##         ST.depression <= 0.5:
##         :...EKG.results > 1: Presence (7/1)
##             EKG.results <= 1:
##             :...Max.HR <= 151: Absence (3)
##                 Max.HR > 151:
##                 :...Number.of.vessels.fluro <= 0: Absence (4/1)
##                     Number.of.vessels.fluro > 0: Presence (4)
## 
## 
## Evaluation on training data (243 cases):
## 
##      Decision Tree   
##    ----------------  
##    Size      Errors  
## 
##      20   16( 6.6%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##     130     6    (a): class Absence
##      10    97    (b): class Presence
## 
## 
##  Attribute usage:
## 
##  100.00% Thallium
##   74.07% Number.of.vessels.fluro
##   60.91% Chest.pain.type
##   39.09% BP
##   33.33% ST.depression
##   23.87% Slope.of.ST
##    9.47% EKG.results
##    8.23% Exercise.angina
##    7.00% Sex
##    4.53% Max.HR
## 
## 
## Time: 0.0 secs

Build the rule-based model.

rules <- C5.0(predictor, data = trainData, rules = TRUE)

Display the rule-based model.

rules
## 
## Call:
## C5.0.formula(formula = predictor, data = trainData, rules = TRUE)
## 
## Rule-Based Model
## Number of samples: 243 
## Number of predictors: 13 
## 
## Number of Rules: 10 
## 
## Non-standard options: attempt to group attributes
summary(rules)
## 
## Call:
## C5.0.formula(formula = predictor, data = trainData, rules = TRUE)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Sun Nov 28 17:22:30 2021
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 243 cases (14 attributes) from undefined.data
## 
## Rules:
## 
## Rule 1: (42/2, lift 1.7)
##  Chest.pain.type <= 3
##  EKG.results <= 1
##  Slope.of.ST <= 1
##  ->  class Absence  [0.932]
## 
## Rule 2: (68/4, lift 1.7)
##  Chest.pain.type <= 3
##  ST.depression <= 1.5
##  Number.of.vessels.fluro <= 0
##  ->  class Absence  [0.929]
## 
## Rule 3: (65/4, lift 1.7)
##  Chest.pain.type <= 3
##  ST.depression <= 0.9
##  Thallium <= 3
##  ->  class Absence  [0.925]
## 
## Rule 4: (74/5, lift 1.6)
##  Chest.pain.type <= 3
##  Exercise.angina <= 0
##  Number.of.vessels.fluro <= 0
##  ->  class Absence  [0.921]
## 
## Rule 5: (30/2, lift 1.6)
##  EKG.results <= 1
##  Max.HR > 151
##  ST.depression <= 0.5
##  Number.of.vessels.fluro <= 0
##  ->  class Absence  [0.906]
## 
## Rule 6: (135/27, lift 1.4)
##  Thallium <= 3
##  ->  class Absence  [0.796]
## 
## Rule 7: (47/2, lift 2.1)
##  Sex > 0
##  Chest.pain.type > 3
##  Number.of.vessels.fluro > 0
##  ->  class Presence  [0.939]
## 
## Rule 8: (46/2, lift 2.1)
##  ST.depression > 0.9
##  Slope.of.ST > 1
##  Number.of.vessels.fluro > 0
##  ->  class Presence  [0.938]
## 
## Rule 9: (24/4, lift 1.8)
##  BP > 146
##  Slope.of.ST > 1
##  ->  class Presence  [0.808]
## 
## Rule 10: (108/28, lift 1.7)
##  Thallium > 3
##  ->  class Presence  [0.736]
## 
## Default class: Absence
## 
## 
## Evaluation on training data (243 cases):
## 
##          Rules     
##    ----------------
##      No      Errors
## 
##      10   20( 8.2%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##     127     9    (a): class Absence
##      11    96    (b): class Presence
## 
## 
##  Attribute usage:
## 
##  100.00% Thallium
##   64.20% Number.of.vessels.fluro
##   62.14% Chest.pain.type
##   57.61% ST.depression
##   42.39% Slope.of.ST
##   30.45% Exercise.angina
##   21.81% EKG.results
##   19.34% Sex
##   12.35% Max.HR
##    9.88% BP
## 
## 
## Time: 0.0 secs

Inspect the raw rule output (truncated here to the first 500 characters).

rules$output %>%
  stringr::str_sub(start = 1,
                   end = 500) %>%
  writeLines()
## 
## C5.0 [Release 2.07 GPL Edition]      Sun Nov 28 17:22:30 2021
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 243 cases (14 attributes) from undefined.data
## 
## Rules:
## 
## Rule 1: (42/2, lift 1.7)
##  Chest.pain.type <= 3
##  EKG.results <= 1
##  Slope.of.ST <= 1
##  ->  class Absence  [0.932]
## 
## Rule 2: (68/4, lift 1.7)
##  Chest.pain.type <= 3
##  ST.depression <= 1.5
##  Number.of.vessels.fluro <= 0
##  ->  class Absence  [0.929]
## 
## Rule 3: (65/4, lift 1.7)
##  Chest.pain.type <= 3
##  ST.depression <= 0.

Extract the rules into a tidy table so they can be reused.

tr_att <- tidyRules(rules)
tr_att
## # A tibble: 10 x 8
##       id LHS             RHS   support confidence  lift rule_number trial_number
##    <int> <chr>           <chr>   <int>      <dbl> <dbl>       <int>        <int>
##  1     1 Chest.pain.typ~ Abse~      42      0.932   1.7           1            1
##  2     2 Chest.pain.typ~ Abse~      68      0.929   1.7           2            1
##  3     3 Chest.pain.typ~ Abse~      65      0.925   1.7           3            1
##  4     4 Chest.pain.typ~ Abse~      74      0.921   1.6           4            1
##  5     5 EKG.results <=~ Abse~      30      0.906   1.6           5            1
##  6     6 Thallium <= 3   Abse~     135      0.796   1.4           6            1
##  7     7 Sex > 0 & Ches~ Pres~      47      0.939   2.1           7            1
##  8     8 ST.depression ~ Pres~      46      0.938   2.1           8            1
##  9     9 BP > 146 & Slo~ Pres~      24      0.808   1.8           9            1
## 10    10 Thallium > 3    Pres~     108      0.736   1.7          10            1
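
The earlier prose mentions ordering the rules by how influential they are; since tidyRules returns confidence and lift columns, the tibble can be sorted explicitly. A small sketch follows (sorting by descending lift is my choice of criterion, not part of the original analysis):

# Order the extracted rules so the most discriminative ones (highest lift) come first
tr_att %>%
  arrange(desc(lift), desc(confidence)) %>%
  select(rule_number, RHS, support, confidence, lift)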

  5. Evaluation

The evaluation stage consists of two tests: one for the tree-based model and one for the rule-based model. Both are evaluated with a confusion matrix.

Compute the accuracy of the tree-based model using the test data.

treec5Predict <- predict(treec5, testData)
table(treec5Predict, testData$Heart.Disease)
##              
## treec5Predict Absence Presence
##      Absence       13        4
##      Presence       1        9

Compute the accuracy of the rule-based model using the test data.

rulesPred <- predict(rules, testData)
table(predict(rules, testData), testData$Heart.Disease)
##           
##            Absence Presence
##   Absence       13        4
##   Presence       1        9
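
Accuracy can be read off the confusion matrices directly as the proportion of correctly classified test cases. A minimal sketch (the object names cmTree and cmRules are mine):

# Accuracy = correctly classified test cases / total test cases
cmTree  <- table(treec5Predict, testData$Heart.Disease)
cmRules <- table(rulesPred, testData$Heart.Disease)
sum(diag(cmTree))  / sum(cmTree)    # (13 + 9) / 27, about 0.815
sum(diag(cmRules)) / sum(cmRules)   # (13 + 9) / 27, about 0.815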

References

[1] P. Nilima, L. Rekha, and V. Chitre, "Customer Card Classification Based on C5.0 & CART Algorithms," 2012.
[2] A. M. Elsayad and H. A. Elsalamony, "Diagnosis of Breast Cancer using Decision Tree Models and SVM," Int. J. Comput. Appl., vol. 83, no. 5, pp. 19–29, 2013, doi: 10.5120/14445-2604.
[3] https://www.youtube.com/Irwansight
[4] https://www.kaggle.com/rishidamarla/heart-disease-prediction