C5.0 is a classification algorithm that produces decision trees, developed by Ross Quinlan. It is a refinement of his earlier ID3 and C4.5 algorithms, and it improves on C4.5 in both memory management and accuracy [1].
C5.0 can also express a model as a set of rules (a rule-based model), which makes the rules behind the decision tree easy to read. In addition, C5.0 can handle missing values. These features are commonly cited as advantages of C5.0 over other decision tree algorithms.
C5.0 builds its model by splitting the samples on the attribute with the highest information gain [2]. The information of a set of cases divided into m classes is computed as

\[I(S_1,S_2,...,S_m) = -\sum_{i=1}^{m} p_i \log_2(p_i)\]

where \(I(S_1,S_2,...,S_m)\) is the information of the set of cases, \(p_i = S_i/S\) is the proportion of class i, \(S_i\) is the number of samples in class i, and \(S\) is the total number of samples in the set of cases. The next step computes the information of the cases of class i within subset j:

\[I(S_{1j},S_{2j},...,S_{mj}) = -\sum_{i=1}^{m} p_{ij} \log_2(p_{ij})\]

where \(I(S_{1j},S_{2j},...,S_{mj})\) is the information of the cases in subset j and \(p_{ij} = S_{ij}/(S_{1j}+S_{2j}+...+S_{mj})\) is the proportion of class i in subset j.

The information of each subset is then used to compute the entropy of an attribute A that partitions the cases into v subsets:

\[E(A) = \sum_{j=1}^{v} \frac{S_{1j}+S_{2j}+...+S_{mj}}{S} \, I(S_{1j},S_{2j},...,S_{mj})\]

where \(E(A)\) is the entropy of attribute A and \(S_{ij}\) is the number of samples of class i in subset j of attribute A. The last step computes the information gain, which determines the attribute used as a node:

\[Gain(A) = I(S_1,S_2,...,S_m) - E(A)\]

The process is repeated until the subsets of samples can no longer be split.
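The quantities described above can be computed by hand on a small example. The following is a minimal base R sketch, not the internal implementation of C5.0; the function names `info` and `info_gain` and the toy vectors `disease` and `thal_high` are hypothetical and exist only for illustration.

```r
# Information (entropy, in bits) of a vector of class labels
info <- function(classes) {
  p <- table(classes) / length(classes)  # class proportions p_i
  p <- p[p > 0]                          # treat 0 * log2(0) as 0
  -sum(p * log2(p))
}

# Information gain obtained by splitting `classes` on a categorical attribute
info_gain <- function(classes, attribute) {
  subsets <- split(classes, attribute)                  # one subset per attribute value
  weights <- sapply(subsets, length) / length(classes)  # share of cases in each subset
  e_a <- sum(weights * sapply(subsets, info))           # E(A)
  info(classes) - e_a                                   # Gain(A)
}

# Hypothetical toy data: 10 cases, two classes, one binary attribute
disease   <- c("Presence", "Presence", "Presence", "Presence", "Absence",
               "Absence", "Absence", "Absence", "Absence", "Presence")
thal_high <- c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE)

info(disease)                  # information of the full set of cases
info_gain(disease, thal_high)  # gain from splitting on thal_high
```

With five cases in each class the information of the full set is 1 bit, and each subset is an 80/20 mix, so the gain is roughly 0.278 bits. The attribute with the largest such value would be chosen for the split.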
In this article, C5.0 is used to analyze heart disease. The dataset comes from Kaggle, published by the user Rishi Damarla [4].
library(dplyr) #Data preprocessing
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(party) #Decision tree
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: sandwich
library(C50) #C5.0
library(tidyrules)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.5 v stringr 1.4.0
## v tidyr 1.1.4 v forcats 0.5.1
## v readr 2.0.2
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x stringr::boundary() masks strucchange::boundary()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(pander)
library(DT) #data table
data = read.csv("data/Heart_Disease_Prediction.csv", header = TRUE)
datatable(data, caption = "Heart Disease Prediction Dataset")
Check whether the data contains missing values
summary(data)
## Age Sex Chest.pain.type BP
## Min. :29.00 Min. :0.0000 Min. :1.000 Min. : 94.0
## 1st Qu.:48.00 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:120.0
## Median :55.00 Median :1.0000 Median :3.000 Median :130.0
## Mean :54.43 Mean :0.6778 Mean :3.174 Mean :131.3
## 3rd Qu.:61.00 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:140.0
## Max. :77.00 Max. :1.0000 Max. :4.000 Max. :200.0
## Cholesterol FBS.over.120 EKG.results Max.HR
## Min. :126.0 Min. :0.0000 Min. :0.000 Min. : 71.0
## 1st Qu.:213.0 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:133.0
## Median :245.0 Median :0.0000 Median :2.000 Median :153.5
## Mean :249.7 Mean :0.1481 Mean :1.022 Mean :149.7
## 3rd Qu.:280.0 3rd Qu.:0.0000 3rd Qu.:2.000 3rd Qu.:166.0
## Max. :564.0 Max. :1.0000 Max. :2.000 Max. :202.0
## Exercise.angina ST.depression Slope.of.ST Number.of.vessels.fluro
## Min. :0.0000 Min. :0.00 Min. :1.000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.00 1st Qu.:1.000 1st Qu.:0.0000
## Median :0.0000 Median :0.80 Median :2.000 Median :0.0000
## Mean :0.3296 Mean :1.05 Mean :1.585 Mean :0.6704
## 3rd Qu.:1.0000 3rd Qu.:1.60 3rd Qu.:2.000 3rd Qu.:1.0000
## Max. :1.0000 Max. :6.20 Max. :3.000 Max. :3.0000
## Thallium Heart.Disease
## Min. :3.000 Length:270
## 1st Qu.:3.000 Class :character
## Median :3.000 Mode :character
## Mean :4.696
## 3rd Qu.:7.000
## Max. :7.000
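Because `summary()` only prints an NA count for columns that contain missing values, it can help to count them explicitly. The following is a small sketch on a hypothetical stand-in data frame (`toy` is not the article's dataset and has one value removed on purpose):

```r
# Hypothetical stand-in for the dataset, with one deliberately missing value
toy <- data.frame(
  Age = c(70, 67, NA, 64),
  BP = c(130, 115, 124, 128),
  Heart.Disease = c("Presence", "Absence", "Presence", "Absence")
)

na_per_column <- colSums(is.na(toy))  # number of NA values in each column
na_per_column
anyNA(toy)                            # TRUE if any value is missing
```

Applied to the article's data frame, the same check would be `colSums(is.na(data))`.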
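The summary lists no NA count for any numeric column, so no missing values are visible in the data. Next, we check whether any column is of type chr. As seen in the previous experiment, the decision tree does not accept chr data; C5.0 in particular requires the outcome to be a factor.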
str(data)
## 'data.frame': 270 obs. of 14 variables:
## $ Age : int 70 67 57 64 74 65 56 59 60 63 ...
## $ Sex : int 1 0 1 1 0 1 1 1 1 0 ...
## $ Chest.pain.type : int 4 3 2 4 2 4 3 4 4 4 ...
## $ BP : int 130 115 124 128 120 120 130 110 140 150 ...
## $ Cholesterol : int 322 564 261 263 269 177 256 239 293 407 ...
## $ FBS.over.120 : int 0 0 0 0 0 0 1 0 0 0 ...
## $ EKG.results : int 2 2 0 0 2 0 2 2 2 2 ...
## $ Max.HR : int 109 160 141 105 121 140 142 142 170 154 ...
## $ Exercise.angina : int 0 0 0 1 1 0 1 1 0 0 ...
## $ ST.depression : num 2.4 1.6 0.3 0.2 0.2 0.4 0.6 1.2 1.2 4 ...
## $ Slope.of.ST : int 2 2 1 2 1 1 2 2 2 2 ...
## $ Number.of.vessels.fluro: int 3 0 0 1 1 0 1 1 2 3 ...
## $ Thallium : int 3 7 7 7 3 7 6 7 7 7 ...
## $ Heart.Disease : chr "Presence" "Absence" "Presence" "Absence" ...
The Heart.Disease attribute is of type chr, so we first convert it to a factor.
data <- data %>%
  mutate(across(where(is.character), as.factor))
Check again whether the chr column has become a factor
str(data)
## 'data.frame': 270 obs. of 14 variables:
## $ Age : int 70 67 57 64 74 65 56 59 60 63 ...
## $ Sex : int 1 0 1 1 0 1 1 1 1 0 ...
## $ Chest.pain.type : int 4 3 2 4 2 4 3 4 4 4 ...
## $ BP : int 130 115 124 128 120 120 130 110 140 150 ...
## $ Cholesterol : int 322 564 261 263 269 177 256 239 293 407 ...
## $ FBS.over.120 : int 0 0 0 0 0 0 1 0 0 0 ...
## $ EKG.results : int 2 2 0 0 2 0 2 2 2 2 ...
## $ Max.HR : int 109 160 141 105 121 140 142 142 170 154 ...
## $ Exercise.angina : int 0 0 0 1 1 0 1 1 0 0 ...
## $ ST.depression : num 2.4 1.6 0.3 0.2 0.2 0.4 0.6 1.2 1.2 4 ...
## $ Slope.of.ST : int 2 2 1 2 1 1 2 2 2 2 ...
## $ Number.of.vessels.fluro: int 3 0 0 1 1 0 1 1 2 3 ...
## $ Thallium : int 3 7 7 7 3 7 6 7 7 7 ...
## $ Heart.Disease : Factor w/ 2 levels "Absence","Presence": 2 1 2 1 1 1 2 2 2 2 ...
Defining the model formula
predictor <- Heart.Disease~Age+Sex+Chest.pain.type+BP+Cholesterol+FBS.over.120+EKG.results+Max.HR+Exercise.angina+ST.depression+Slope.of.ST+Number.of.vessels.fluro+Thallium
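The dataset is divided into 10 folds, each of which can serve as test data for the model. Note that the loop below overwrites testData and trainData on every iteration, so once it finishes only the last fold (27 rows) is held out for testing and the remaining 243 rows are used for training.
set.seed(1234)
# 10-fold split: consecutive rows are assigned to folds 1..10 (no shuffling)
folds <- cut(seq(1, nrow(data)), breaks = 10, labels = FALSE)
for (i in 1:10) {
  testIndexes <- which(folds == i)
  testData <- data[testIndexes, ]    # rows of fold i
  trainData <- data[-testIndexes, ]  # all remaining rows
}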
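A full 10-fold cross-validation would fit and score a model inside each iteration instead of keeping only the last split. The following is a minimal sketch under that assumption; it uses R's built-in `iris` data as a stand-in so that it runs on its own, and the names `fold_id`, `train_set`, `test_set`, and `accuracy` are illustrative, not taken from the article.

```r
library(C50)

set.seed(1234)
k <- 10
# Shuffle the fold labels so each fold is a random subset of the rows
fold_id <- sample(rep(seq_len(k), length.out = nrow(iris)))

accuracy <- numeric(k)
for (i in seq_len(k)) {
  test_rows <- which(fold_id == i)
  train_set <- iris[-test_rows, ]
  test_set  <- iris[test_rows, ]

  fit  <- C5.0(Species ~ ., data = train_set)  # fit on the other k - 1 folds
  pred <- predict(fit, test_set)               # predict the held-out fold
  accuracy[i] <- mean(pred == test_set$Species)
}

accuracy        # accuracy per fold
mean(accuracy)  # cross-validated accuracy estimate
```

Shuffling before assigning folds matters when the rows of a file are ordered in some way; here `set.seed()` makes the random assignment reproducible.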
C5.0 provides two kinds of model: a tree-based model and a rule-based model.
Building the tree-based model
treec5 <- C5.0(predictor, data = trainData)
Displaying the result of the tree-based model
treec5
##
## Call:
## C5.0.formula(formula = predictor, data = trainData)
##
## Classification Tree
## Number of samples: 243
## Number of predictors: 13
##
## Tree size: 20
##
## Non-standard options: attempt to group attributes
plot(treec5)
summary(treec5)
##
## Call:
## C5.0.formula(formula = predictor, data = trainData)
##
##
## C5.0 [Release 2.07 GPL Edition] Sun Nov 28 17:22:24 2021
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 243 cases (14 attributes) from undefined.data
##
## Decision tree:
##
## Thallium <= 3:
## :...Number.of.vessels.fluro <= 0:
## : :...BP <= 146: Absence (82/3)
## : : BP > 146:
## : : :...Slope.of.ST <= 1: Absence (8/2)
## : : Slope.of.ST > 1: Presence (5/1)
## : Number.of.vessels.fluro > 0:
## : :...Chest.pain.type <= 3:
## : :...Slope.of.ST <= 1: Absence (18/2)
## : : Slope.of.ST > 1:
## : : :...ST.depression <= 0.9: Absence (3)
## : : ST.depression > 0.9: Presence (2)
## : Chest.pain.type > 3:
## : :...Sex > 0: Presence (12)
## : Sex <= 0:
## : :...Slope.of.ST <= 1: Absence (2)
## : Slope.of.ST > 1: Presence (3/1)
## Thallium > 3:
## :...Chest.pain.type <= 3:
## :...Number.of.vessels.fluro <= 0:
## : :...Exercise.angina <= 0: Absence (15/2)
## : : Exercise.angina > 0:
## : : :...ST.depression <= 1.5: Absence (2)
## : : ST.depression > 1.5: Presence (3)
## : Number.of.vessels.fluro > 0:
## : :...Slope.of.ST > 1: Presence (12/1)
## : Slope.of.ST <= 1:
## : :...EKG.results <= 1: Absence (3)
## : EKG.results > 1: Presence (2)
## Chest.pain.type > 3:
## :...ST.depression > 0.5: Presence (53/2)
## ST.depression <= 0.5:
## :...EKG.results > 1: Presence (7/1)
## EKG.results <= 1:
## :...Max.HR <= 151: Absence (3)
## Max.HR > 151:
## :...Number.of.vessels.fluro <= 0: Absence (4/1)
## Number.of.vessels.fluro > 0: Presence (4)
##
##
## Evaluation on training data (243 cases):
##
## Decision Tree
## ----------------
## Size Errors
##
## 20 16( 6.6%) <<
##
##
## (a) (b) <-classified as
## ---- ----
## 130 6 (a): class Absence
## 10 97 (b): class Presence
##
##
## Attribute usage:
##
## 100.00% Thallium
## 74.07% Number.of.vessels.fluro
## 60.91% Chest.pain.type
## 39.09% BP
## 33.33% ST.depression
## 23.87% Slope.of.ST
## 9.47% EKG.results
## 8.23% Exercise.angina
## 7.00% Sex
## 4.53% Max.HR
##
##
## Time: 0.0 secs
Building the rule-based model
rules<-C5.0(predictor, data = trainData, rules = TRUE)
Displaying the result of the rule-based model
rules
##
## Call:
## C5.0.formula(formula = predictor, data = trainData, rules = TRUE)
##
## Rule-Based Model
## Number of samples: 243
## Number of predictors: 13
##
## Number of Rules: 10
##
## Non-standard options: attempt to group attributes
summary(rules)
##
## Call:
## C5.0.formula(formula = predictor, data = trainData, rules = TRUE)
##
##
## C5.0 [Release 2.07 GPL Edition] Sun Nov 28 17:22:30 2021
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 243 cases (14 attributes) from undefined.data
##
## Rules:
##
## Rule 1: (42/2, lift 1.7)
## Chest.pain.type <= 3
## EKG.results <= 1
## Slope.of.ST <= 1
## -> class Absence [0.932]
##
## Rule 2: (68/4, lift 1.7)
## Chest.pain.type <= 3
## ST.depression <= 1.5
## Number.of.vessels.fluro <= 0
## -> class Absence [0.929]
##
## Rule 3: (65/4, lift 1.7)
## Chest.pain.type <= 3
## ST.depression <= 0.9
## Thallium <= 3
## -> class Absence [0.925]
##
## Rule 4: (74/5, lift 1.6)
## Chest.pain.type <= 3
## Exercise.angina <= 0
## Number.of.vessels.fluro <= 0
## -> class Absence [0.921]
##
## Rule 5: (30/2, lift 1.6)
## EKG.results <= 1
## Max.HR > 151
## ST.depression <= 0.5
## Number.of.vessels.fluro <= 0
## -> class Absence [0.906]
##
## Rule 6: (135/27, lift 1.4)
## Thallium <= 3
## -> class Absence [0.796]
##
## Rule 7: (47/2, lift 2.1)
## Sex > 0
## Chest.pain.type > 3
## Number.of.vessels.fluro > 0
## -> class Presence [0.939]
##
## Rule 8: (46/2, lift 2.1)
## ST.depression > 0.9
## Slope.of.ST > 1
## Number.of.vessels.fluro > 0
## -> class Presence [0.938]
##
## Rule 9: (24/4, lift 1.8)
## BP > 146
## Slope.of.ST > 1
## -> class Presence [0.808]
##
## Rule 10: (108/28, lift 1.7)
## Thallium > 3
## -> class Presence [0.736]
##
## Default class: Absence
##
##
## Evaluation on training data (243 cases):
##
## Rules
## ----------------
## No Errors
##
## 10 20( 8.2%) <<
##
##
## (a) (b) <-classified as
## ---- ----
## 127 9 (a): class Absence
## 11 96 (b): class Presence
##
##
## Attribute usage:
##
## 100.00% Thallium
## 64.20% Number.of.vessels.fluro
## 62.14% Chest.pain.type
## 57.61% ST.depression
## 42.39% Slope.of.ST
## 30.45% Exercise.angina
## 21.81% EKG.results
## 19.34% Sex
## 12.35% Max.HR
## 9.88% BP
##
##
## Time: 0.0 secs
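The numbers attached to each rule can be checked by hand. In C5.0's rule output, (n/m) is the number of training cases a rule covers and how many of them it gets wrong, the bracketed value is the Laplace ratio (n - m + 1) / (n + 2), and the lift divides that value by the relative frequency of the predicted class in the training data. The following sketch recomputes Rule 1; the helper name `rule_confidence` is hypothetical.

```r
# Laplace-corrected confidence for a rule covering n cases with m errors
rule_confidence <- function(n, m) (n - m + 1) / (n + 2)

# Rule 1 in the output: (42/2, lift 1.7) -> class Absence [0.932]
conf_rule1 <- rule_confidence(42, 2)
round(conf_rule1, 3)

# Lift: confidence divided by the class's share of the training data.
# The training confusion matrix shows 127 + 9 = 136 Absence cases out of 243.
lift_rule1 <- conf_rule1 / (136 / 243)
round(lift_rule1, 1)
```

The results, 0.932 and 1.7, match the values printed for Rule 1.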
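Printing the first 500 characters of the raw C5.0 output. The rules are listed by predicted class and, within each class, in decreasing order of confidence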
rules$output %>%
stringr::str_sub(start = 1,
end = 500) %>%
writeLines()
##
## C5.0 [Release 2.07 GPL Edition] Sun Nov 28 17:22:30 2021
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 243 cases (14 attributes) from undefined.data
##
## Rules:
##
## Rule 1: (42/2, lift 1.7)
## Chest.pain.type <= 3
## EKG.results <= 1
## Slope.of.ST <= 1
## -> class Absence [0.932]
##
## Rule 2: (68/4, lift 1.7)
## Chest.pain.type <= 3
## ST.depression <= 1.5
## Number.of.vessels.fluro <= 0
## -> class Absence [0.929]
##
## Rule 3: (65/4, lift 1.7)
## Chest.pain.type <= 3
## ST.depression <= 0.
Converting the rules into a tidy table so they can be sorted and used further
tr_att <- tidyRules(rules)
tr_att
## # A tibble: 10 x 8
## id LHS RHS support confidence lift rule_number trial_number
## <int> <chr> <chr> <int> <dbl> <dbl> <int> <int>
## 1 1 Chest.pain.typ~ Abse~ 42 0.932 1.7 1 1
## 2 2 Chest.pain.typ~ Abse~ 68 0.929 1.7 2 1
## 3 3 Chest.pain.typ~ Abse~ 65 0.925 1.7 3 1
## 4 4 Chest.pain.typ~ Abse~ 74 0.921 1.6 4 1
## 5 5 EKG.results <=~ Abse~ 30 0.906 1.6 5 1
## 6 6 Thallium <= 3 Abse~ 135 0.796 1.4 6 1
## 7 7 Sex > 0 & Ches~ Pres~ 47 0.939 2.1 7 1
## 8 8 ST.depression ~ Pres~ 46 0.938 2.1 8 1
## 9 9 BP > 146 & Slo~ Pres~ 24 0.808 1.8 9 1
## 10 10 Thallium > 3 Pres~ 108 0.736 1.7 10 1
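Once the rules are in tabular form they can be ranked. The following is a base R sketch that rebuilds the numeric columns of the table above by hand (`rule_stats` and `ranked` are illustrative names) and orders the rules by confidence, breaking ties by support.

```r
# Values copied from the tidyRules() table above (LHS and RHS omitted for brevity)
rule_stats <- data.frame(
  id         = 1:10,
  support    = c(42, 68, 65, 74, 30, 135, 47, 46, 24, 108),
  confidence = c(0.932, 0.929, 0.925, 0.921, 0.906, 0.796, 0.939, 0.938, 0.808, 0.736),
  lift       = c(1.7, 1.7, 1.7, 1.6, 1.6, 1.4, 2.1, 2.1, 1.8, 1.7)
)

# Order by confidence, then by support, both descending
ranked <- rule_stats[order(-rule_stats$confidence, -rule_stats$support), ]
head(ranked, 3)
```

The three most confident rules are 7, 8, and 1. With dplyr loaded, `tr_att %>% arrange(desc(confidence))` should give the same ordering directly on the tibble.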
The evaluation stage consists of two tests: one for the tree-based model and one for the rule-based model. Both are evaluated with a confusion matrix.
Evaluating the accuracy of the tree-based model on the test data
treec5Predict <- predict(treec5, testData)
table(treec5Predict, testData$Heart.Disease)
##
## treec5Predict Absence Presence
## Absence 13 4
## Presence 1 9
Evaluating the accuracy of the rule-based model on the test data
rulesPred <- predict(rules, testData)
table(rulesPred, testData$Heart.Disease)
##
## rulesPred Absence Presence
## Absence 13 4
## Presence 1 9
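Both models produce the same confusion matrix on the test fold, so their accuracy can be derived from it once. The following sketch copies the counts from the output above into a matrix (`cm` is an illustrative name) and computes accuracy along with sensitivity and specificity for the Presence class.

```r
# Confusion matrix copied from the output above (rows: predicted, columns: actual)
cm <- matrix(c(13, 1, 4, 9), nrow = 2,
             dimnames = list(predicted = c("Absence", "Presence"),
                             actual    = c("Absence", "Presence")))

accuracy    <- sum(diag(cm)) / sum(cm)                             # (13 + 9) / 27
sensitivity <- cm["Presence", "Presence"] / sum(cm[, "Presence"])  # 9 / 13
specificity <- cm["Absence", "Absence"] / sum(cm[, "Absence"])     # 13 / 14

round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 3)
```

This gives an accuracy of about 0.815 on the 27 held-out cases, with sensitivity near 0.692 and specificity near 0.929. Since it comes from a single fold, it is only a rough estimate.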
[1] P. Nilima, L. Rekha, and V. Chitre, “Customer Card Classification Based on C5.0 & CART Algorithms,” 2012.
[2] A. M. Elsayad and H. A. Elsalamony, “Diagnosis of Breast Cancer using Decision Tree Models and SVM,” Int. J. Comput. Appl., vol. 83, no. 5, pp. 19–29, 2013, doi: 10.5120/14445-2604.
[3] https://www.youtube.com/Irwansight
[4] https://www.kaggle.com/rishidamarla/heart-disease-prediction