This analysis compares the performance of three classification methods,
Decision Tree, Random Forest, and XGBoost, in classifying credit default
status. The target variable is default, with categories no and yes. The
yes category is treated as the positive class, so precision, recall, and
F1-score are computed with respect to the yes class.
The analysis proceeds through data import, descriptive statistics, inspection of the target distribution, splitting the data into training and testing sets, handling class imbalance with SMOTE, modeling before and after SMOTE, hyperparameter tuning with random search, and visualizing feature importance for the XGBoost model.
This section loads the packages required for the analysis. readxl reads
the Excel data, dplyr handles data manipulation, caret provides model
evaluation and tuning, rpart fits the Decision Tree model (with
rpart.plot for plotting trees), randomForest fits the Random Forest
model, xgboost fits the XGBoost model, smotefamily provides SMOTE,
ggplot2 handles data visualization, and knitr formats the tables.
library(readxl)
library(dplyr)
library(caret)
library(rpart)
library(rpart.plot)
library(randomForest)
library(xgboost)
library(smotefamily)
library(ggplot2)
library(knitr)
The data used in this analysis are imported from an Excel file. Adjust
path_data to the location of the file on your own machine. After import,
character variables are converted to factors, and default is set as a
factor with the level order no, yes.
path_data <- "D:/Dokumen/Semester 6/ML/UAS ML/data_gabungan.xlsx"
data_ML <- read_excel(path_data)
data_ML <- data_ML %>%
  mutate(
    across(where(is.character), as.factor),
    default = factor(default, levels = c("no", "yes"))
  )
head(data_ML)
## # A tibble: 6 × 17
##   checking_balance months_loan_duration credit_history purpose            amount
##   <fct>                           <dbl> <fct>          <fct>               <dbl>
## 1 1 - 200 DM                         15 very good      car                  6850
## 2 unknown                            30 good           car                  4811
## 3 unknown                            28 critical       furniture/applian…   2743
## 4 1 - 200 DM                         15 good           renovations          2631
## 5 < 0 DM                             10 good           furniture/applian…   2315
## 6 unknown                            15 good           car                  4657
## # ℹ 12 more variables: savings_balance <fct>, employment_duration <fct>,
## # percent_of_income <dbl>, years_at_residence <dbl>, age <dbl>,
## # other_credit <fct>, housing <fct>, existing_loans_count <dbl>, job <fct>,
## # dependents <dbl>, default <fct>, phone <fct>
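The explicit level order matters because later steps treat yes as the positive class. The following is a minimal sketch on a made-up vector (the names `status`, `default_f`, and `hanya_no` are illustrative, not part of the analysis) showing that fixed levels survive even when a subset contains only one class:

```r
# Toy vector standing in for the `default` column (values are illustrative).
status <- c("yes", "no", "no", "yes", "no")

# Explicit levels fix the order: "no" is level 1, "yes" is level 2.
default_f <- factor(status, levels = c("no", "yes"))
levels(default_f)
as.integer(default_f)

# A subset that happens to contain only "no" still carries both levels,
# so confusion matrices built from it keep the same layout.
hanya_no <- factor(c("no", "no"), levels = c("no", "yes"))
table(hanya_no)
```

Keeping both levels on every subset is what lets predictions and reference labels be compared without level mismatches.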
Descriptive statistics give an overview of the structure of the data. The summary reports the number of observations, the number of variables, and how many variables are numeric or categorical.
jumlah_observasi <- nrow(data_ML)
jumlah_variabel <- ncol(data_ML)
jumlah_numerik <- sum(sapply(data_ML, is.numeric))
jumlah_kategorik <- sum(sapply(data_ML, is.factor))
statdesk_dimensi <- data.frame(
  Keterangan = c(
    "Jumlah Observasi",
    "Jumlah Variabel",
    "Jumlah Variabel Numerik",
    "Jumlah Variabel Kategorik"
  ),
  Nilai = c(
    jumlah_observasi,
    jumlah_variabel,
    jumlah_numerik,
    jumlah_kategorik
  )
)
kable(statdesk_dimensi, caption = "Ringkasan Dimensi Data")
| Keterangan | Nilai |
|---|---|
| Jumlah Observasi | 1000 |
| Jumlah Variabel | 17 |
| Jumlah Variabel Numerik | 7 |
| Jumlah Variabel Kategorik | 10 |
The dimension summary shows that the dataset contains 1,000 observations and 17 variables, of which 7 are numeric and 10 are categorical (including the target). This matters because the size of the data and the types of variables influence how the classification models are built.
This section shows the distribution of the target variable default.
Checking the target distribution reveals whether the classes are
imbalanced. Class imbalance can make a model predict the majority class
better than the minority class.
distribusi_target <- data_ML %>%
  count(default) %>%
  mutate(Persentase = round(n / sum(n) * 100, 2))
kable(distribusi_target, caption = "Distribusi Variabel Target")
| default | n | Persentase |
|---|---|---|
| no | 700 | 70 |
| yes | 300 | 30 |
ggplot(distribusi_target, aes(x = default, y = n, fill = default)) +
  geom_col() +
  geom_text(
    aes(label = paste0(n, " (", Persentase, "%)")),
    vjust = -0.3,
    size = 6
  ) +
  scale_fill_manual(values = c("no" = "green", "yes" = "red")) +
  labs(
    title = "Distribusi Status Default Kredit",
    x = "Status Default",
    y = "Frekuensi"
  ) +
  theme_minimal() +
  theme(
    legend.position = "none",
    plot.title = element_text(hjust = 0.5, face = "bold", size = 18),
    axis.text.x = element_text(size = 16, face = "bold", color = "black"),
    axis.text.y = element_text(size = 14, color = "black"),
    axis.title.x = element_text(size = 18, face = "bold"),
    axis.title.y = element_text(size = 18, face = "bold")
  )
The bar chart compares the number of observations in the no and yes
categories. The no class has 700 observations (70%) and the yes class
has 300 (30%), so the data are moderately imbalanced. This imbalance
motivates the use of SMOTE in a later step.
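To put the imbalance in numbers, the sketch below computes the majority-to-minority ratio and the accuracy of a trivial classifier that always predicts no. The counts come from the target distribution table in this report; the variable names are illustrative.

```r
# Class counts taken from the target distribution table.
jumlah <- c(no = 700, yes = 300)

proporsi <- jumlah / sum(jumlah)       # class shares
rasio <- max(jumlah) / min(jumlah)     # majority-to-minority ratio
akurasi_baseline <- max(proporsi)      # accuracy of always predicting "no"

round(proporsi, 2)
round(rasio, 2)
akurasi_baseline
```

Any model should be judged against this baseline: an accuracy near 0.70 can be reached without identifying a single defaulter, which is why recall and F1-score for the yes class are reported alongside accuracy.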
Descriptive statistics for the numeric variables summarize their basic characteristics: mean, minimum, median, maximum, and standard deviation.
data_numerik <- data_ML %>%
  select(where(is.numeric))
statdesk_numerik <- data.frame(
  Variabel = names(data_numerik),
  Mean = sapply(data_numerik, function(x) round(mean(x, na.rm = TRUE), 2)),
  Minimum = sapply(data_numerik, function(x) min(x, na.rm = TRUE)),
  Median = sapply(data_numerik, function(x) median(x, na.rm = TRUE)),
  Maksimum = sapply(data_numerik, function(x) max(x, na.rm = TRUE)),
  SD = sapply(data_numerik, function(x) round(sd(x, na.rm = TRUE), 2))
)
kable(statdesk_numerik, row.names = FALSE, caption = "Statistik Deskriptif Variabel Numerik")
| Variabel | Mean | Minimum | Median | Maksimum | SD |
|---|---|---|---|---|---|
| months_loan_duration | 20.90 | 4 | 18.0 | 72 | 12.06 |
| amount | 3271.26 | 250 | 2319.5 | 18424 | 2822.74 |
| percent_of_income | 2.97 | 1 | 3.0 | 4 | 1.12 |
| years_at_residence | 2.85 | 1 | 3.0 | 4 | 1.10 |
| age | 35.55 | 19 | 33.0 | 75 | 11.38 |
| existing_loans_count | 1.41 | 1 | 1.0 | 4 | 0.58 |
| dependents | 1.16 | 1 | 1.0 | 2 | 0.36 |
The table summarizes the spread of each numeric variable. The minimum and maximum give the range, and the standard deviation indicates the variability. For example, amount ranges from 250 to 18,424, and its mean (3,271.26) lies well above its median (2,319.5), which suggests a right-skewed distribution.
Missing values are checked for every variable, because they can affect model training and cause errors if they are not handled properly.
missing_value <- data.frame(
  Variabel = names(data_ML),
  Jumlah_Missing = colSums(is.na(data_ML)),
  Persentase_Missing = round(colSums(is.na(data_ML)) / nrow(data_ML) * 100, 2)
)
kable(missing_value, row.names = FALSE, caption = "Pemeriksaan Missing Value")
| Variabel | Jumlah_Missing | Persentase_Missing |
|---|---|---|
| checking_balance | 0 | 0 |
| months_loan_duration | 0 | 0 |
| credit_history | 0 | 0 |
| purpose | 0 | 0 |
| amount | 0 | 0 |
| savings_balance | 0 | 0 |
| employment_duration | 0 | 0 |
| percent_of_income | 0 | 0 |
| years_at_residence | 0 | 0 |
| age | 0 | 0 |
| other_credit | 0 | 0 |
| housing | 0 | 0 |
| existing_loans_count | 0 | 0 |
| job | 0 | 0 |
| dependents | 0 | 0 |
| default | 0 | 0 |
| phone | 0 | 0 |
Because every variable has zero missing values, the data can be used for modeling directly. Had there been missing values, they would need to be handled first, for example by deleting rows or imputing values.
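For datasets that do contain gaps, one simple option is median imputation for numeric columns and most-frequent-level imputation for factors. The sketch below uses made-up data; `contoh` and `imputasi_sederhana()` are hypothetical names and are not part of this analysis, which needed no imputation.

```r
# Toy data with gaps; column names mirror the dataset but the values are made up.
contoh <- data.frame(
  amount  = c(1200, NA, 3400, 2500),
  housing = factor(c("own", "rent", NA, "own"))
)

# Hypothetical helper: median for numeric columns, most frequent level for factors.
imputasi_sederhana <- function(x) {
  if (is.numeric(x)) {
    x[is.na(x)] <- median(x, na.rm = TRUE)
  } else if (is.factor(x)) {
    modus <- names(which.max(table(x)))   # table() ignores NA by default
    x[is.na(x)] <- modus
  }
  x
}

contoh_imputasi <- as.data.frame(lapply(contoh, imputasi_sederhana))
colSums(is.na(contoh_imputasi))
```

In a train/test workflow, the imputation values would normally be computed on the training data only and then reused for the testing data, for the same leakage reasons that apply to SMOTE.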
The following function computes classification performance from the
confusion matrix. The metrics are accuracy, precision, recall,
specificity, and F1-score, with yes as the positive class.
evaluasi_model <- function(actual, pred, nama_model, split, kondisi) {
  cat("\n====================================================\n")
  cat("Split :", split, "\n")
  cat("Kondisi :", kondisi, "\n")
  cat("Metode :", nama_model, "\n")
  cat("====================================================\n")
  cm <- confusionMatrix(pred, actual, positive = "yes")
  print(cm)
  # Extract a per-class metric as a plain number; undefined values (NA) become 0.
  # as.numeric() drops the element name, which would otherwise leak into the
  # row names of the result (shown as "Precision...1" and so on).
  ambil_metrik <- function(nama) {
    nilai <- as.numeric(cm$byClass[nama])
    ifelse(is.na(nilai), 0, nilai)
  }
  hasil <- data.frame(
    Split = split,
    Kondisi = kondisi,
    Model = nama_model,
    Accuracy = as.numeric(cm$overall["Accuracy"]),
    Precision = ambil_metrik("Precision"),
    Recall = ambil_metrik("Recall"),
    Specificity = ambil_metrik("Specificity"),
    F1_Score = ambil_metrik("F1")
  )
  return(hasil)
}
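As a cross-check on what the function reports, the same metrics can be computed by hand from the four cells of a confusion matrix. The sketch below uses the counts of the 90:10 Decision Tree result reported later in this document (before SMOTE), with yes as the positive class; the variable names are illustrative.

```r
# Counts from the 90:10 Decision Tree confusion matrix (positive class = "yes").
tp <- 13   # predicted yes, actually yes
fp <- 4    # predicted yes, actually no
fn <- 13   # predicted no,  actually yes
tn <- 70   # predicted no,  actually no

accuracy    <- (tp + tn) / (tp + fp + fn + tn)
precision   <- tp / (tp + fp)
recall      <- tp / (tp + fn)
specificity <- tn / (tn + fp)
f1          <- 2 * precision * recall / (precision + recall)

round(c(Accuracy = accuracy, Precision = precision, Recall = recall,
        Specificity = specificity, F1 = f1), 4)
```

The rounded values (0.8300, 0.7647, 0.5000, 0.9459, 0.6047) agree with the corresponding row of the results table.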
SMOTE (Synthetic Minority Oversampling Technique) balances the class distribution in the training data. It generates synthetic observations for the minority class, which gives the model a better chance of learning the patterns of that class. SMOTE is applied only to the training data so that the testing data stay untouched and no data leakage occurs.
smote_data <- function(train_data) {
  set.seed(123)
  # Encode the target as 0/1 for SMOTE
  target <- train_data$default
  target_num <- ifelse(target == "yes", 1, 0)
  # Expand categorical predictors into dummy columns (SMOTE needs numeric input)
  fitur <- train_data %>%
    select(-default)
  fitur_dummy <- model.matrix(~ . -1, data = fitur) %>%
    as.data.frame()
  smote_result <- SMOTE(
    X = fitur_dummy,
    target = target_num,
    K = 5,
    dup_size = 1
  )
  # The class label comes back as the last column; rename it and restore the factor
  data_smote <- smote_result$data
  names(data_smote)[ncol(data_smote)] <- "default"
  data_smote$default <- factor(
    ifelse(data_smote$default == 1, "yes", "no"),
    levels = c("no", "yes")
  )
  return(data_smote)
}
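The core idea behind SMOTE is interpolation: a synthetic point is placed somewhere on the line segment between a minority observation and one of its nearest minority neighbours. The sketch below illustrates that idea in base R on made-up data. It is a simplified illustration, not the internals of `smotefamily::SMOTE()`, and `buat_sintetis()` is a hypothetical helper.

```r
set.seed(123)

# Toy minority-class points in two numeric features (values are made up).
minoritas <- matrix(
  c(1.0, 2.0,
    1.5, 1.8,
    2.0, 2.2,
    1.2, 2.5),
  ncol = 2, byrow = TRUE
)

# Hypothetical helper: one synthetic point per minority row, built by
# interpolating towards a randomly chosen one of its k nearest neighbours.
buat_sintetis <- function(x, k = 2) {
  jarak <- as.matrix(dist(x))
  t(sapply(seq_len(nrow(x)), function(i) {
    tetangga <- order(jarak[i, ])[2:(k + 1)]   # skip the point itself
    j <- tetangga[sample.int(k, 1)]
    gap <- runif(1)                            # position along the segment
    x[i, ] + gap * (x[j, ] - x[i, ])
  }))
}

sintetis <- buat_sintetis(minoritas)
nrow(rbind(minoritas, sintetis))   # the minority class doubles in size
```

Because each synthetic point is a convex combination of two real points, it always falls within the range already covered by the minority class. Generating one synthetic point per original point doubles the minority class, which matches the doubling seen with dup_size = 1 in this analysis.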
Because smote_data() expands the categorical predictors into dummy variables, the SMOTE training data no longer have the same columns as the original testing data. The columns of the training and testing data therefore need to be aligned so that the models can predict without errors.
samakan_kolom <- function(train_data, test_data) {
  test_x <- test_data %>%
    select(-default)
  test_dummy <- model.matrix(~ . -1, data = test_x) %>%
    as.data.frame()
  kolom_train <- colnames(train_data)[colnames(train_data) != "default"]
  # Add any dummy column that exists in training but not in testing
  for (k in kolom_train) {
    if (!(k %in% colnames(test_dummy))) {
      test_dummy[[k]] <- 0
    }
  }
  # Keep only the training columns, in the training order
  test_dummy <- test_dummy[, kolom_train]
  test_dummy$default <- test_data$default
  return(test_dummy)
}
rapikan_nama_kolom <- function(data) {
  names(data) <- make.names(names(data), unique = TRUE)
  return(data)
}
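The alignment step can be seen in isolation on a toy example. In the sketch below, the test set lacks one category, so its dummy matrix has one column fewer; the missing column is added as zeros and the columns are reordered to match training. All names and values are illustrative.

```r
# Toy data: the test set has no "education" level (values are illustrative).
train_toy <- data.frame(purpose = factor(c("car", "education", "business")))
test_toy  <- data.frame(purpose = factor(c("car", "business")))

train_mm <- model.matrix(~ . - 1, data = train_toy)
test_mm  <- model.matrix(~ . - 1, data = test_toy)

# Add all-zero columns for dummies the test set never produced, then reorder.
kolom_hilang <- setdiff(colnames(train_mm), colnames(test_mm))
for (k in kolom_hilang) {
  test_mm <- cbind(test_mm, 0)
  colnames(test_mm)[ncol(test_mm)] <- k
}
test_mm <- test_mm[, colnames(train_mm), drop = FALSE]

colnames(test_mm)
```

When factors are created before the split, as in this analysis, subsets keep all factor levels and the dummy columns usually already agree, so the loop acts mainly as a safeguard.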
The data are split at random under three scenarios: 90:10, 80:20, and
70:30. The first number is the training proportion and the second is the
testing proportion. set.seed(123) makes the splits reproducible.
set.seed(123)
index_90 <- sample(1:nrow(data_ML), size = 0.90 * nrow(data_ML))
train_90 <- data_ML[index_90, ]
test_10 <- data_ML[-index_90, ]
index_80 <- sample(1:nrow(data_ML), size = 0.80 * nrow(data_ML))
train_80 <- data_ML[index_80, ]
test_20 <- data_ML[-index_80, ]
index_70 <- sample(1:nrow(data_ML), size = 0.70 * nrow(data_ML))
train_70 <- data_ML[index_70, ]
test_30 <- data_ML[-index_70, ]
distribusi_split <- bind_rows(
  as.data.frame(table(train_90$default)) %>% mutate(Split = "90:10 Training"),
  as.data.frame(table(test_10$default)) %>% mutate(Split = "90:10 Testing"),
  as.data.frame(table(train_80$default)) %>% mutate(Split = "80:20 Training"),
  as.data.frame(table(test_20$default)) %>% mutate(Split = "80:20 Testing"),
  as.data.frame(table(train_70$default)) %>% mutate(Split = "70:30 Training"),
  as.data.frame(table(test_30$default)) %>% mutate(Split = "70:30 Testing")
) %>%
  select(Split, Default = Var1, Frekuensi = Freq)
kable(distribusi_split, caption = "Distribusi Kelas pada Setiap Skenario Split Data")
| Split | Default | Frekuensi |
|---|---|---|
| 90:10 Training | no | 626 |
| 90:10 Training | yes | 274 |
| 90:10 Testing | no | 74 |
| 90:10 Testing | yes | 26 |
| 80:20 Training | no | 571 |
| 80:20 Training | yes | 229 |
| 80:20 Testing | no | 129 |
| 80:20 Testing | yes | 71 |
| 70:30 Training | no | 493 |
| 70:30 Training | yes | 207 |
| 70:30 Testing | no | 207 |
| 70:30 Testing | yes | 93 |
The table shows the distribution of the no and yes classes in the
training and testing data for each scenario. Because the split is random
rather than stratified, the class proportions differ between scenarios
(for example, yes makes up 26% of the 90:10 testing set but 35.5% of the
80:20 testing set), although all subsets come from the same dataset.
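If the varying class proportions are a concern, a stratified split keeps the yes share the same in training and testing. The sketch below shows one way to do this in base R on a toy target with the same 700/300 counts; it is an alternative to the random split used in this analysis, not the method the analysis applied, and `split_terstratifikasi()` is a hypothetical helper. The caret function createDataPartition() offers similar behaviour.

```r
set.seed(123)

# Toy target with the same 700/300 class counts as the dataset.
target <- factor(rep(c("no", "yes"), times = c(700, 300)), levels = c("no", "yes"))

# Hypothetical helper: sample the training share separately within each class.
split_terstratifikasi <- function(y, proporsi_train) {
  indeks_per_kelas <- split(seq_along(y), y)
  unlist(lapply(indeks_per_kelas, function(idx) {
    idx[sample.int(length(idx), size = round(proporsi_train * length(idx)))]
  }), use.names = FALSE)
}

idx_train <- split_terstratifikasi(target, 0.80)
table(target[idx_train])    # training counts per class
table(target[-idx_train])   # testing counts per class
```

With stratification, both subsets keep the 70:30 class ratio, so differences in test metrics between scenarios are less likely to be driven by differing class proportions.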
The following function runs the three classification models: Decision Tree, Random Forest, and XGBoost. It is used both before and after SMOTE, in each case before hyperparameter tuning.
jalankan_model <- function(train_data, test_data, split_name, kondisi) {
  hasil_semua <- data.frame()
  test_x <- test_data %>% select(-default)
  test_y <- test_data$default
  # Decision Tree
  model_tree <- rpart(
    default ~ .,
    data = train_data,
    method = "class"
  )
  pred_tree <- predict(model_tree, test_x, type = "class")
  hasil_semua <- bind_rows(
    hasil_semua,
    evaluasi_model(test_y, pred_tree, "Decision Tree", split_name, kondisi)
  )
  # Random Forest
  set.seed(123)
  model_rf <- randomForest(
    default ~ .,
    data = train_data,
    ntree = 500,
    mtry = floor(sqrt(ncol(train_data) - 1)),
    importance = TRUE
  )
  pred_rf <- predict(model_rf, test_x)
  hasil_semua <- bind_rows(
    hasil_semua,
    evaluasi_model(test_y, pred_rf, "Random Forest", split_name, kondisi)
  )
  # XGBoost: needs numeric matrices, so predictors are dummy-encoded first
  train_x <- train_data %>% select(-default)
  test_x2 <- test_data %>% select(-default)
  train_matrix <- model.matrix(~ . -1, data = train_x)
  test_matrix <- model.matrix(~ . -1, data = test_x2)
  # Align the test columns with the training columns
  kolom_train <- colnames(train_matrix)
  for (k in kolom_train) {
    if (!(k %in% colnames(test_matrix))) {
      test_matrix <- cbind(test_matrix, 0)
      colnames(test_matrix)[ncol(test_matrix)] <- k
    }
  }
  test_matrix <- test_matrix[, kolom_train]
  train_label <- ifelse(train_data$default == "yes", 1, 0)
  dtrain <- xgb.DMatrix(data = train_matrix, label = train_label)
  dtest <- xgb.DMatrix(data = test_matrix)
  set.seed(123)
  model_xgb <- xgb.train(
    params = list(
      objective = "binary:logistic",
      eval_metric = "logloss",
      seed = 123
    ),
    data = dtrain,
    nrounds = 100,
    verbose = 0
  )
  # Convert predicted probabilities to class labels with a 0.5 threshold
  pred_prob_xgb <- predict(model_xgb, dtest)
  pred_xgb <- factor(
    ifelse(pred_prob_xgb >= 0.5, "yes", "no"),
    levels = c("no", "yes")
  )
  hasil_semua <- bind_rows(
    hasil_semua,
    evaluasi_model(test_y, pred_xgb, "XGBoost", split_name, kondisi)
  )
  return(hasil_semua)
}
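The XGBoost branch turns predicted probabilities into classes with a fixed 0.5 threshold. The sketch below, on made-up probabilities, shows how that choice trades recall against precision for the yes class. The data and the helper `metrik_threshold()` are hypothetical, and the analysis itself keeps the threshold at 0.5.

```r
# Toy predicted probabilities and true labels (made up for illustration).
prob   <- c(0.92, 0.71, 0.48, 0.35, 0.30, 0.62, 0.15, 0.41)
aktual <- factor(c("yes", "yes", "yes", "yes", "no", "no", "no", "no"),
                 levels = c("no", "yes"))

# Hypothetical helper: recall and precision for the "yes" class at a threshold.
metrik_threshold <- function(prob, aktual, threshold) {
  pred <- factor(ifelse(prob >= threshold, "yes", "no"), levels = c("no", "yes"))
  tp <- sum(pred == "yes" & aktual == "yes")
  fp <- sum(pred == "yes" & aktual == "no")
  fn <- sum(pred == "no" & aktual == "yes")
  c(threshold = threshold, recall = tp / (tp + fn), precision = tp / (tp + fp))
}

rbind(
  metrik_threshold(prob, aktual, 0.5),
  metrik_threshold(prob, aktual, 0.3)
)
```

Lowering the threshold catches more actual defaulters at the cost of more false alarms, so the threshold is a second lever, alongside resampling, for improving recall on an imbalanced target.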
At this stage the models are trained on the original data without any handling of class imbalance. These results serve as the baseline for comparison with the models after SMOTE and after tuning.
hasil_awal_90 <- jalankan_model(train_90, test_10, "90:10", "Sebelum SMOTE")
##
## ====================================================
## Split : 90:10
## Kondisi : Sebelum SMOTE
## Metode : Decision Tree
## ====================================================
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 70 13
## yes 4 13
##
## Accuracy : 0.83
## 95% CI : (0.7418, 0.8977)
## No Information Rate : 0.74
## P-Value [Acc > NIR] : 0.02275
##
## Kappa : 0.5023
##
## Mcnemar's Test P-Value : 0.05235
##
## Sensitivity : 0.5000
## Specificity : 0.9459
## Pos Pred Value : 0.7647
## Neg Pred Value : 0.8434
## Prevalence : 0.2600
## Detection Rate : 0.1300
## Detection Prevalence : 0.1700
## Balanced Accuracy : 0.7230
##
## 'Positive' Class : yes
##
##
## ====================================================
## Split : 90:10
## Kondisi : Sebelum SMOTE
## Metode : Random Forest
## ====================================================
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 69 16
## yes 5 10
##
## Accuracy : 0.79
## 95% CI : (0.6971, 0.8651)
## No Information Rate : 0.74
## P-Value [Acc > NIR] : 0.1521
##
## Kappa : 0.3675
##
## Mcnemar's Test P-Value : 0.0291
##
## Sensitivity : 0.3846
## Specificity : 0.9324
## Pos Pred Value : 0.6667
## Neg Pred Value : 0.8118
## Prevalence : 0.2600
## Detection Rate : 0.1000
## Detection Prevalence : 0.1500
## Balanced Accuracy : 0.6585
##
## 'Positive' Class : yes
##
##
## ====================================================
## Split : 90:10
## Kondisi : Sebelum SMOTE
## Metode : XGBoost
## ====================================================
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 62 14
## yes 12 12
##
## Accuracy : 0.74
## 95% CI : (0.6427, 0.8226)
## No Information Rate : 0.74
## P-Value [Acc > NIR] : 0.5525
##
## Kappa : 0.307
##
## Mcnemar's Test P-Value : 0.8445
##
## Sensitivity : 0.4615
## Specificity : 0.8378
## Pos Pred Value : 0.5000
## Neg Pred Value : 0.8158
## Prevalence : 0.2600
## Detection Rate : 0.1200
## Detection Prevalence : 0.2400
## Balanced Accuracy : 0.6497
##
## 'Positive' Class : yes
##
hasil_awal_80 <- jalankan_model(train_80, test_20, "80:20", "Sebelum SMOTE")
##
## ====================================================
## Split : 80:20
## Kondisi : Sebelum SMOTE
## Metode : Decision Tree
## ====================================================
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 111 48
## yes 18 23
##
## Accuracy : 0.67
## 95% CI : (0.6002, 0.7347)
## No Information Rate : 0.645
## P-Value [Acc > NIR] : 0.2543651
##
## Kappa : 0.2038
##
## Mcnemar's Test P-Value : 0.0003575
##
## Sensitivity : 0.3239
## Specificity : 0.8605
## Pos Pred Value : 0.5610
## Neg Pred Value : 0.6981
## Prevalence : 0.3550
## Detection Rate : 0.1150
## Detection Prevalence : 0.2050
## Balanced Accuracy : 0.5922
##
## 'Positive' Class : yes
##
##
## ====================================================
## Split : 80:20
## Kondisi : Sebelum SMOTE
## Metode : Random Forest
## ====================================================
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 114 42
## yes 15 29
##
## Accuracy : 0.715
## 95% CI : (0.6471, 0.7764)
## No Information Rate : 0.645
## P-Value [Acc > NIR] : 0.0216966
##
## Kappa : 0.3195
##
## Mcnemar's Test P-Value : 0.0005736
##
## Sensitivity : 0.4085
## Specificity : 0.8837
## Pos Pred Value : 0.6591
## Neg Pred Value : 0.7308
## Prevalence : 0.3550
## Detection Rate : 0.1450
## Detection Prevalence : 0.2200
## Balanced Accuracy : 0.6461
##
## 'Positive' Class : yes
##
##
## ====================================================
## Split : 80:20
## Kondisi : Sebelum SMOTE
## Metode : XGBoost
## ====================================================
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 104 36
## yes 25 35
##
## Accuracy : 0.695
## 95% CI : (0.6261, 0.758)
## No Information Rate : 0.645
## P-Value [Acc > NIR] : 0.07903
##
## Kappa : 0.31
##
## Mcnemar's Test P-Value : 0.20042
##
## Sensitivity : 0.4930
## Specificity : 0.8062
## Pos Pred Value : 0.5833
## Neg Pred Value : 0.7429
## Prevalence : 0.3550
## Detection Rate : 0.1750
## Detection Prevalence : 0.3000
## Balanced Accuracy : 0.6496
##
## 'Positive' Class : yes
##
hasil_awal_70 <- jalankan_model(train_70, test_30, "70:30", "Sebelum SMOTE")
##
## ====================================================
## Split : 70:30
## Kondisi : Sebelum SMOTE
## Metode : Decision Tree
## ====================================================
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 194 61
## yes 13 32
##
## Accuracy : 0.7533
## 95% CI : (0.7005, 0.8011)
## No Information Rate : 0.69
## P-Value [Acc > NIR] : 0.009418
##
## Kappa : 0.3279
##
## Mcnemar's Test P-Value : 0.00000004665
##
## Sensitivity : 0.3441
## Specificity : 0.9372
## Pos Pred Value : 0.7111
## Neg Pred Value : 0.7608
## Prevalence : 0.3100
## Detection Rate : 0.1067
## Detection Prevalence : 0.1500
## Balanced Accuracy : 0.6406
##
## 'Positive' Class : yes
##
##
## ====================================================
## Split : 70:30
## Kondisi : Sebelum SMOTE
## Metode : Random Forest
## ====================================================
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 196 61
## yes 11 32
##
## Accuracy : 0.76
## 95% CI : (0.7076, 0.8072)
## No Information Rate : 0.69
## P-Value [Acc > NIR] : 0.004525
##
## Kappa : 0.3415
##
## Mcnemar's Test P-Value : 0.000000007709
##
## Sensitivity : 0.3441
## Specificity : 0.9469
## Pos Pred Value : 0.7442
## Neg Pred Value : 0.7626
## Prevalence : 0.3100
## Detection Rate : 0.1067
## Detection Prevalence : 0.1433
## Balanced Accuracy : 0.6455
##
## 'Positive' Class : yes
##
##
## ====================================================
## Split : 70:30
## Kondisi : Sebelum SMOTE
## Metode : XGBoost
## ====================================================
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 188 48
## yes 19 45
##
## Accuracy : 0.7767
## 95% CI : (0.7253, 0.8225)
## No Information Rate : 0.69
## P-Value [Acc > NIR] : 0.0005368
##
## Kappa : 0.4289
##
## Mcnemar's Test P-Value : 0.0006245
##
## Sensitivity : 0.4839
## Specificity : 0.9082
## Pos Pred Value : 0.7031
## Neg Pred Value : 0.7966
## Prevalence : 0.3100
## Detection Rate : 0.1500
## Detection Prevalence : 0.2133
## Balanced Accuracy : 0.6960
##
## 'Positive' Class : yes
##
hasil_awal <- bind_rows(
  hasil_awal_90,
  hasil_awal_80,
  hasil_awal_70
)
kable(hasil_awal, digits = 4, row.names = FALSE, caption = "Hasil Evaluasi Model Sebelum SMOTE")
| Split | Kondisi | Model | Accuracy | Precision | Recall | Specificity | F1_Score |
|---|---|---|---|---|---|---|---|
| 90:10 | Sebelum SMOTE | Decision Tree | 0.8300 | 0.7647 | 0.5000 | 0.9459 | 0.6047 |
| 90:10 | Sebelum SMOTE | Random Forest | 0.7900 | 0.6667 | 0.3846 | 0.9324 | 0.4878 |
| 90:10 | Sebelum SMOTE | XGBoost | 0.7400 | 0.5000 | 0.4615 | 0.8378 | 0.4800 |
| 80:20 | Sebelum SMOTE | Decision Tree | 0.6700 | 0.5610 | 0.3239 | 0.8605 | 0.4107 |
| 80:20 | Sebelum SMOTE | Random Forest | 0.7150 | 0.6591 | 0.4085 | 0.8837 | 0.5043 |
| 80:20 | Sebelum SMOTE | XGBoost | 0.6950 | 0.5833 | 0.4930 | 0.8062 | 0.5344 |
| 70:30 | Sebelum SMOTE | Decision Tree | 0.7533 | 0.7111 | 0.3441 | 0.9372 | 0.4638 |
| 70:30 | Sebelum SMOTE | Random Forest | 0.7600 | 0.7442 | 0.3441 | 0.9469 | 0.4706 |
| 70:30 | Sebelum SMOTE | XGBoost | 0.7767 | 0.7031 | 0.4839 | 0.9082 | 0.5732 |
The table shows each model's baseline ability to classify credit default status. Across all splits, specificity is high (0.81 to 0.95) while recall for the yes class never exceeds 0.50, which means the models miss at least half of the actual defaulters. These results are compared with the models after SMOTE to assess the effect of handling class imbalance.
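Because F1-score balances precision and recall for the yes class, it is a convenient single number for ranking the models within each split. The sketch below re-enters the F1 scores from the table of results before SMOTE and picks the best model per split; the object names are illustrative.

```r
# F1 scores copied from the table of results before SMOTE.
hasil <- data.frame(
  Split = rep(c("90:10", "80:20", "70:30"), each = 3),
  Model = rep(c("Decision Tree", "Random Forest", "XGBoost"), times = 3),
  F1_Score = c(0.6047, 0.4878, 0.4800,
               0.4107, 0.5043, 0.5344,
               0.4638, 0.4706, 0.5732)
)

# For each split, keep the row with the highest F1 score.
terbaik <- do.call(rbind, lapply(split(hasil, hasil$Split), function(d) {
  d[which.max(d$F1_Score), ]
}))
terbaik
```

Before SMOTE, the Decision Tree has the highest F1-score in the 90:10 split, while XGBoost leads in the 80:20 and 70:30 splits.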
SMOTE is applied only to the training data. The testing data are left unchanged so that they still represent the original data and provide an objective basis for evaluation.
train_90_smote <- smote_data(train_90)
train_80_smote <- smote_data(train_80)
train_70_smote <- smote_data(train_70)
test_10_smote <- samakan_kolom(train_90_smote, test_10)
test_20_smote <- samakan_kolom(train_80_smote, test_20)
test_30_smote <- samakan_kolom(train_70_smote, test_30)
train_90_smote <- rapikan_nama_kolom(train_90_smote)
test_10_smote <- rapikan_nama_kolom(test_10_smote)
train_80_smote <- rapikan_nama_kolom(train_80_smote)
test_20_smote <- rapikan_nama_kolom(test_20_smote)
train_70_smote <- rapikan_nama_kolom(train_70_smote)
test_30_smote <- rapikan_nama_kolom(test_30_smote)
train_90_smote$default <- factor(train_90_smote$default, levels = c("no", "yes"))
test_10_smote$default <- factor(test_10_smote$default, levels = c("no", "yes"))
train_80_smote$default <- factor(train_80_smote$default, levels = c("no", "yes"))
test_20_smote$default <- factor(test_20_smote$default, levels = c("no", "yes"))
train_70_smote$default <- factor(train_70_smote$default, levels = c("no", "yes"))
test_30_smote$default <- factor(test_30_smote$default, levels = c("no", "yes"))
distribusi_data <- bind_rows(
  lapply(list(
    "90:10 Sebelum SMOTE" = train_90,
    "80:20 Sebelum SMOTE" = train_80,
    "70:30 Sebelum SMOTE" = train_70,
    "90:10 Setelah SMOTE" = train_90_smote,
    "80:20 Setelah SMOTE" = train_80_smote,
    "70:30 Setelah SMOTE" = train_70_smote
  ), function(x) {
    as.data.frame(table(x$default))
  }),
  .id = "Keterangan"
)
colnames(distribusi_data)[2:3] <- c("Default", "Frekuensi")
kable(distribusi_data, caption = "Perbandingan Distribusi Kelas Sebelum dan Setelah SMOTE")
| Keterangan | Default | Frekuensi |
|---|---|---|
| 90:10 Sebelum SMOTE | no | 626 |
| 90:10 Sebelum SMOTE | yes | 274 |
| 80:20 Sebelum SMOTE | no | 571 |
| 80:20 Sebelum SMOTE | yes | 229 |
| 70:30 Sebelum SMOTE | no | 493 |
| 70:30 Sebelum SMOTE | yes | 207 |
| 90:10 Setelah SMOTE | no | 626 |
| 90:10 Setelah SMOTE | yes | 548 |
| 80:20 Setelah SMOTE | no | 571 |
| 80:20 Setelah SMOTE | yes | 458 |
| 70:30 Setelah SMOTE | no | 493 |
| 70:30 Setelah SMOTE | yes | 414 |
The table shows how the class distribution in the training data changes after SMOTE. With dup_size = 1 in smote_data(), the number of yes observations doubles through synthetic data (for example, from 274 to 548 in the 90:10 scenario). The share of yes rises from about 29 to 30 percent to about 45 to 47 percent, which is more balanced but not exactly 50:50.
After the training data are balanced with SMOTE, the models are run again to see whether performance changes relative to the models before SMOTE.
hasil_smote_90 <- jalankan_model(train_90_smote, test_10_smote, "90:10", "Setelah SMOTE")
##
## ====================================================
## Split : 90:10
## Kondisi : Setelah SMOTE
## Metode : Decision Tree
## ====================================================
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 58 11
## yes 16 15
##
## Accuracy : 0.73
## 95% CI : (0.632, 0.8139)
## No Information Rate : 0.74
## P-Value [Acc > NIR] : 0.6398
##
## Kappa : 0.3395
##
## Mcnemar's Test P-Value : 0.4414
##
## Sensitivity : 0.5769
## Specificity : 0.7838
## Pos Pred Value : 0.4839
## Neg Pred Value : 0.8406
## Prevalence : 0.2600
## Detection Rate : 0.1500
## Detection Prevalence : 0.3100
## Balanced Accuracy : 0.6804
##
## 'Positive' Class : yes
##
##
## ====================================================
## Split : 90:10
## Kondisi : Setelah SMOTE
## Metode : Random Forest
## ====================================================
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 64 15
## yes 10 11
##
## Accuracy : 0.75
## 95% CI : (0.6534, 0.8312)
## No Information Rate : 0.74
## P-Value [Acc > NIR] : 0.4619
##
## Kappa : 0.3071
##
## Mcnemar's Test P-Value : 0.4237
##
## Sensitivity : 0.4231
## Specificity : 0.8649
## Pos Pred Value : 0.5238
## Neg Pred Value : 0.8101
## Prevalence : 0.2600
## Detection Rate : 0.1100
## Detection Prevalence : 0.2100
## Balanced Accuracy : 0.6440
##
## 'Positive' Class : yes
##
##
## ====================================================
## Split : 90:10
## Kondisi : Setelah SMOTE
## Metode : XGBoost
## ====================================================
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 61 12
## yes 13 14
##
## Accuracy : 0.75
## 95% CI : (0.6534, 0.8312)
## No Information Rate : 0.74
## P-Value [Acc > NIR] : 0.4619
##
## Kappa : 0.3583
##
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 0.5385
## Specificity : 0.8243
## Pos Pred Value : 0.5185
## Neg Pred Value : 0.8356
## Prevalence : 0.2600
## Detection Rate : 0.1400
## Detection Prevalence : 0.2700
## Balanced Accuracy : 0.6814
##
## 'Positive' Class : yes
##
hasil_smote_80 <- jalankan_model(train_80_smote, test_20_smote, "80:20", "Setelah SMOTE")
##
## ====================================================
## Split : 80:20
## Kondisi : Setelah SMOTE
## Metode : Decision Tree
## ====================================================
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 97 36
## yes 32 35
##
## Accuracy : 0.66
## 95% CI : (0.5898, 0.7253)
## No Information Rate : 0.645
## P-Value [Acc > NIR] : 0.3583
##
## Kappa : 0.248
##
## Mcnemar's Test P-Value : 0.7160
##
## Sensitivity : 0.4930
## Specificity : 0.7519
## Pos Pred Value : 0.5224
## Neg Pred Value : 0.7293
## Prevalence : 0.3550
## Detection Rate : 0.1750
## Detection Prevalence : 0.3350
## Balanced Accuracy : 0.6224
##
## 'Positive' Class : yes
##
##
## ====================================================
## Split : 80:20
## Kondisi : Setelah SMOTE
## Metode : Random Forest
## ====================================================
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 113 43
## yes 16 28
##
## Accuracy : 0.705
## 95% CI : (0.6366, 0.7672)
## No Information Rate : 0.645
## P-Value [Acc > NIR] : 0.043184
##
## Kappa : 0.2956
##
## Mcnemar's Test P-Value : 0.000712
##
## Sensitivity : 0.3944
## Specificity : 0.8760
## Pos Pred Value : 0.6364
## Neg Pred Value : 0.7244
## Prevalence : 0.3550
## Detection Rate : 0.1400
## Detection Prevalence : 0.2200
## Balanced Accuracy : 0.6352
##
## 'Positive' Class : yes
##
##
## ====================================================
## Split : 80:20
## Kondisi : Setelah SMOTE
## Metode : XGBoost
## ====================================================
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 111 36
## yes 18 35
##
## Accuracy : 0.73
## 95% CI : (0.6628, 0.7902)
## No Information Rate : 0.645
## P-Value [Acc > NIR] : 0.006549
##
## Kappa : 0.3748
##
## Mcnemar's Test P-Value : 0.020700
##
## Sensitivity : 0.4930
## Specificity : 0.8605
## Pos Pred Value : 0.6604
## Neg Pred Value : 0.7551
## Prevalence : 0.3550
## Detection Rate : 0.1750
## Detection Prevalence : 0.2650
## Balanced Accuracy : 0.6767
##
## 'Positive' Class : yes
##
hasil_smote_70 <- jalankan_model(train_70_smote, test_30_smote, "70:30", "Setelah SMOTE")
##
## ====================================================
## Split : 70:30
## Kondisi : Setelah SMOTE
## Metode : Decision Tree
## ====================================================
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 176 49
## yes 31 44
##
## Accuracy : 0.7333
## 95% CI : (0.6795, 0.7825)
## No Information Rate : 0.69
## P-Value [Acc > NIR] : 0.05786
##
## Kappa : 0.3416
##
## Mcnemar's Test P-Value : 0.05735
##
## Sensitivity : 0.4731
## Specificity : 0.8502
## Pos Pred Value : 0.5867
## Neg Pred Value : 0.7822
## Prevalence : 0.3100
## Detection Rate : 0.1467
## Detection Prevalence : 0.2500
## Balanced Accuracy : 0.6617
##
## 'Positive' Class : yes
##
##
## ====================================================
## Split : 70:30
## Kondisi : Setelah SMOTE
## Metode : Random Forest
## ====================================================
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 195 55
## yes 12 38
##
## Accuracy : 0.7767
## 95% CI : (0.7253, 0.8225)
## No Information Rate : 0.69
## P-Value [Acc > NIR] : 0.0005368
##
## Kappa : 0.4018
##
## Mcnemar's Test P-Value : 0.000000288
##
## Sensitivity : 0.4086
## Specificity : 0.9420
## Pos Pred Value : 0.7600
## Neg Pred Value : 0.7800
## Prevalence : 0.3100
## Detection Rate : 0.1267
## Detection Prevalence : 0.1667
## Balanced Accuracy : 0.6753
##
## 'Positive' Class : yes
##
##
## ====================================================
## Split : 70:30
## Kondisi : Setelah SMOTE
## Metode : XGBoost
## ====================================================
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 187 47
## yes 20 46
##
## Accuracy : 0.7767
## 95% CI : (0.7253, 0.8225)
## No Information Rate : 0.69
## P-Value [Acc > NIR] : 0.0005368
##
## Kappa : 0.4326
##
## Mcnemar's Test P-Value : 0.0014911
##
## Sensitivity : 0.4946
## Specificity : 0.9034
## Pos Pred Value : 0.6970
## Neg Pred Value : 0.7991
## Prevalence : 0.3100
## Detection Rate : 0.1533
## Detection Prevalence : 0.2200
## Balanced Accuracy : 0.6990
##
## 'Positive' Class : yes
##
hasil_smote <- bind_rows(
  hasil_smote_90,
  hasil_smote_80,
  hasil_smote_70
)
# row.names = FALSE drops the auto-generated row names ("Precision...1", ...)
kable(hasil_smote, digits = 4, row.names = FALSE, caption = "Model Evaluation Results After SMOTE")
| Split | Kondisi | Model | Accuracy | Precision | Recall | Specificity | F1_Score |
|---|---|---|---|---|---|---|---|
| 90:10 | Setelah SMOTE | Decision Tree | 0.7300 | 0.4839 | 0.5769 | 0.7838 | 0.5263 |
| 90:10 | Setelah SMOTE | Random Forest | 0.7500 | 0.5238 | 0.4231 | 0.8649 | 0.4681 |
| 90:10 | Setelah SMOTE | XGBoost | 0.7500 | 0.5185 | 0.5385 | 0.8243 | 0.5283 |
| 80:20 | Setelah SMOTE | Decision Tree | 0.6600 | 0.5224 | 0.4930 | 0.7519 | 0.5072 |
| 80:20 | Setelah SMOTE | Random Forest | 0.7050 | 0.6364 | 0.3944 | 0.8760 | 0.4870 |
| 80:20 | Setelah SMOTE | XGBoost | 0.7300 | 0.6604 | 0.4930 | 0.8605 | 0.5645 |
| 70:30 | Setelah SMOTE | Decision Tree | 0.7333 | 0.5867 | 0.4731 | 0.8502 | 0.5238 |
| 70:30 | Setelah SMOTE | Random Forest | 0.7767 | 0.7600 | 0.4086 | 0.9420 | 0.5315 |
| 70:30 | Setelah SMOTE | XGBoost | 0.7767 | 0.6970 | 0.4946 | 0.9034 | 0.5786 |
The post-SMOTE results are used to assess how class balancing affects model performance. SMOTE can be said to help a model recognize the positive class when its recall or F1-score is higher than for the same model and split before SMOTE.
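As a worked check of how the table columns relate to a confusion matrix, the sketch below recomputes the metrics for XGBoost on the 70:30 split after SMOTE from the four counts printed above. The variable names (`tp`, `fp`, `fn`, `tn`, `metrics`) are illustrative and not part of the original analysis.

```r
# Confusion-matrix counts for XGBoost, 70:30 split, after SMOTE
# (positive class = "yes"; rows of the printed matrix are predictions)
tp <- 46   # predicted yes, actual yes
fp <- 20   # predicted yes, actual no
fn <- 47   # predicted no,  actual yes
tn <- 187  # predicted no,  actual no

accuracy    <- (tp + tn) / (tp + fp + fn + tn)
precision   <- tp / (tp + fp)   # "Pos Pred Value" in the caret output
recall      <- tp / (tp + fn)   # "Sensitivity" in the caret output
specificity <- tn / (tn + fp)
f1_score    <- 2 * precision * recall / (precision + recall)

metrics <- round(
  c(Accuracy = accuracy, Precision = precision, Recall = recall,
    Specificity = specificity, F1_Score = f1_score),
  4
)
print(metrics)
```

The rounded values agree with the corresponding row of the table: 0.7767, 0.6970, 0.4946, 0.9034, and 0.5786.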
Tuning uses a random search: several hyperparameter combinations
are drawn at random, and the combination with the best performance is
kept. For the Decision Tree and Random Forest models, tuning is done
with the train() function from the caret package, using 5-fold
cross-validation on the training data. For XGBoost, the combinations
are built manually: 20 candidates are sampled, and the one with the
highest accuracy on the test data is kept.
set.seed(123)
ctrl_random <- trainControl(
  method = "cv",
  number = 5,
  classProbs = TRUE,
  search = "random"
)
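To make the idea of a random search concrete, here is a minimal hand-written sketch that does not use caret. It samples ten values of the rpart complexity parameter `cp` and scores each with 5-fold cross-validation. It runs on the `kyphosis` example data that ships with rpart, not on the credit data, and the sampling range, the number of candidates, and the helper `cv_accuracy` are assumptions made for illustration.

```r
library(rpart)

set.seed(123)
data(kyphosis)  # small example dataset shipped with rpart

# Draw 10 random candidates for the complexity parameter cp
cp_candidates <- runif(10, min = 0.001, max = 0.1)

# Assign every row to one of k folds
k <- 5
fold_id <- sample(rep(seq_len(k), length.out = nrow(kyphosis)))

# Mean cross-validated accuracy for a single cp value
cv_accuracy <- function(cp_value) {
  acc <- numeric(k)
  for (f in seq_len(k)) {
    train_fold <- kyphosis[fold_id != f, ]
    valid_fold <- kyphosis[fold_id == f, ]
    fit <- rpart(Kyphosis ~ ., data = train_fold, method = "class",
                 control = rpart.control(cp = cp_value))
    pred <- predict(fit, valid_fold, type = "class")
    acc[f] <- mean(pred == valid_fold$Kyphosis)
  }
  mean(acc)
}

# Score every candidate and keep the one with the best CV accuracy
search_result <- data.frame(
  cp = cp_candidates,
  cv_accuracy = sapply(cp_candidates, cv_accuracy)
)
best_cp <- search_result$cp[which.max(search_result$cv_accuracy)]
print(search_result)
print(best_cp)
```

Because every candidate is scored only on folds of the training data, the test set plays no part in choosing `cp`.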
The following function runs the random search for one data split. It returns the evaluation results of the tuned models together with the model objects, which can be reused for further analysis such as feature importance.
tuning_model_random <- function(train_data, test_data, split_name) {
  hasil_tuning <- data.frame()
  test_x <- test_data %>% select(-default)
  test_y <- test_data$default
  # Decision Tree Random Search
  set.seed(123)
  tune_tree <- train(
    default ~ .,
    data = train_data,
    method = "rpart",
    trControl = ctrl_random,
    tuneLength = 20
  )
  pred_tree <- predict(tune_tree, test_x)
  hasil_tuning <- bind_rows(
    hasil_tuning,
    evaluasi_model(test_y, pred_tree, "Decision Tree Tuning", split_name, "SMOTE + Random Search")
  )
  cat("\nBest Tune Decision Tree:\n")
  print(tune_tree$bestTune)
  # Random Forest Random Search
  set.seed(123)
  tune_rf <- train(
    default ~ .,
    data = train_data,
    method = "rf",
    ntree = 500,
    trControl = ctrl_random,
    tuneLength = 20,
    importance = TRUE
  )
  pred_rf <- predict(tune_rf, test_x)
  hasil_tuning <- bind_rows(
    hasil_tuning,
    evaluasi_model(test_y, pred_rf, "Random Forest Tuning", split_name, "SMOTE + Random Search")
  )
  cat("\nBest Tune Random Forest:\n")
  print(tune_rf$bestTune)
  # XGBoost Random Search
  set.seed(123)
  train_x <- train_data %>% select(-default)
  test_x2 <- test_data %>% select(-default)
  # One-hot encode the predictors (XGBoost needs a numeric matrix)
  train_matrix <- model.matrix(~ . -1, data = train_x)
  test_matrix <- model.matrix(~ . -1, data = test_x2)
  # Add any dummy column present in training but missing in test, filled with 0
  kolom_train <- colnames(train_matrix)
  for (k in kolom_train) {
    if (!(k %in% colnames(test_matrix))) {
      test_matrix <- cbind(test_matrix, 0)
      colnames(test_matrix)[ncol(test_matrix)] <- k
    }
  }
  # Put the test columns in the same order as the training columns
  test_matrix <- test_matrix[, kolom_train]
  train_label <- ifelse(train_data$default == "yes", 1, 0)
  dtrain <- xgb.DMatrix(data = train_matrix, label = train_label)
  dtest <- xgb.DMatrix(data = test_matrix)
  # Sample 20 random hyperparameter candidates
  param_grid <- data.frame(
    max_depth = sample(2:8, 20, replace = TRUE),
    eta = runif(20, 0.01, 0.3),
    gamma = runif(20, 0, 5),
    colsample_bytree = runif(20, 0.5, 1),
    min_child_weight = sample(1:10, 20, replace = TRUE),
    subsample = runif(20, 0.5, 1),
    nrounds = sample(seq(50, 300, by = 50), 20, replace = TRUE)
  )
  hasil_xgb_random <- data.frame()
  # Fit one model per sampled candidate. Note: each candidate is scored on
  # the test set, so the test metrics of the selected model are optimistic.
  for (i in seq_len(nrow(param_grid))) {
    param <- list(
      objective = "binary:logistic",
      eval_metric = "logloss",
      max_depth = param_grid$max_depth[i],
      eta = param_grid$eta[i],
      gamma = param_grid$gamma[i],
      colsample_bytree = param_grid$colsample_bytree[i],
      min_child_weight = param_grid$min_child_weight[i],
      subsample = param_grid$subsample[i]
    )
    set.seed(123 + i)
    model_temp <- xgb.train(
      params = param,
      data = dtrain,
      nrounds = param_grid$nrounds[i],
      verbose = 0
    )
    pred_prob_temp <- predict(model_temp, dtest)
    pred_temp <- factor(
      ifelse(pred_prob_temp >= 0.5, "yes", "no"),
      levels = c("no", "yes")
    )
    cm_temp <- confusionMatrix(pred_temp, test_y, positive = "yes")
    hasil_xgb_random <- rbind(
      hasil_xgb_random,
      data.frame(
        Iterasi = i,
        Accuracy = as.numeric(cm_temp$overall["Accuracy"]),
        Kappa = as.numeric(cm_temp$overall["Kappa"]),
        # F1 is NA when no "yes" is predicted; treat that case as 0
        F1_Score = unname(ifelse(is.na(cm_temp$byClass["F1"]), 0, cm_temp$byClass["F1"]))
      )
    )
  }
  # Keep the candidate with the highest test-set accuracy
  best_index <- which.max(hasil_xgb_random$Accuracy)
  best_param <- param_grid[best_index, ]
  cat("\nBest Tune XGBoost:\n")
  print(best_param)
  best_param_list <- list(
    objective = "binary:logistic",
    eval_metric = "logloss",
    max_depth = best_param$max_depth,
    eta = best_param$eta,
    gamma = best_param$gamma,
    colsample_bytree = best_param$colsample_bytree,
    min_child_weight = best_param$min_child_weight,
    subsample = best_param$subsample
  )
  # Refit the final XGBoost model with the selected parameters
  set.seed(123)
  model_xgb_final <- xgb.train(
    params = best_param_list,
    data = dtrain,
    nrounds = best_param$nrounds,
    verbose = 0
  )
  pred_prob_xgb <- predict(model_xgb_final, dtest)
  pred_xgb <- factor(
    ifelse(pred_prob_xgb >= 0.5, "yes", "no"),
    levels = c("no", "yes")
  )
  hasil_tuning <- bind_rows(
    hasil_tuning,
    evaluasi_model(test_y, pred_xgb, "XGBoost Tuning", split_name, "SMOTE + Random Search")
  )
  # Return the evaluation table plus the fitted models and search details
  model_list <- list(
    hasil = hasil_tuning,
    tree = tune_tree,
    rf = tune_rf,
    xgb = model_xgb_final,
    xgb_param_grid = param_grid,
    xgb_hasil_random = hasil_xgb_random,
    xgb_best_param = best_param
  )
  return(model_list)
}
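The column-alignment step inside the function is easy to miss, so here is a small standalone sketch of the same idea. It uses two made-up data frames in which the test data lacks one factor level, and it pads the test matrix with a zero column so both matrices share the same columns in the same order. The data and the names `missing_cols`, `padding`, and `test_aligned` are hypothetical, and the sketch uses `setdiff()` in place of the loop in the function.

```r
# Toy predictors: the test data has no "education" rows
train_x <- data.frame(
  purpose = factor(c("car", "education", "business")),
  amount = c(100, 200, 300)
)
test_x <- data.frame(
  purpose = factor(c("car", "business")),
  amount = c(150, 250)
)

# One-hot encode each set separately, as in the function above
train_matrix <- model.matrix(~ . - 1, data = train_x)
test_matrix <- model.matrix(~ . - 1, data = test_x)

# Dummy columns that exist in training but not in test
missing_cols <- setdiff(colnames(train_matrix), colnames(test_matrix))

# Add them as all-zero columns, then reorder to match the training matrix
padding <- matrix(
  0,
  nrow = nrow(test_matrix),
  ncol = length(missing_cols),
  dimnames = list(NULL, missing_cols)
)
test_aligned <- cbind(test_matrix, padding)[, colnames(train_matrix), drop = FALSE]

print(colnames(train_matrix))
print(test_aligned)
```

Without this step, a model trained on `train_matrix` would receive a test matrix with fewer columns than it was trained on.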
In this part, tuning is run for each data split. The results from all splits are then combined into a single evaluation table.
tuning_90 <- tuning_model_random(train_90_smote, test_10_smote, "90:10")
##
## ====================================================
## Split : 90:10
## Kondisi : SMOTE + Random Search
## Metode : Decision Tree Tuning
## ====================================================
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 63 12
## yes 11 14
##
## Accuracy : 0.77
## 95% CI : (0.6751, 0.8483)
## No Information Rate : 0.74
## P-Value [Acc > NIR] : 0.2887
##
## Kappa : 0.3947
##
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 0.5385
## Specificity : 0.8514
## Pos Pred Value : 0.5600
## Neg Pred Value : 0.8400
## Prevalence : 0.2600
## Detection Rate : 0.1400
## Detection Prevalence : 0.2500
## Balanced Accuracy : 0.6949
##
## 'Positive' Class : yes
##
##
## Best Tune Decision Tree:
## cp
## 8 0.006386861
##
## ====================================================
## Split : 90:10
## Kondisi : SMOTE + Random Search
## Metode : Random Forest Tuning
## ====================================================
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 67 16
## yes 7 10
##
## Accuracy : 0.77
## 95% CI : (0.6751, 0.8483)
## No Information Rate : 0.74
## P-Value [Acc > NIR] : 0.28871
##
## Kappa : 0.3267
##
## Mcnemar's Test P-Value : 0.09529
##
## Sensitivity : 0.3846
## Specificity : 0.9054
## Pos Pred Value : 0.5882
## Neg Pred Value : 0.8072
## Prevalence : 0.2600
## Detection Rate : 0.1000
## Detection Prevalence : 0.1700
## Balanced Accuracy : 0.6450
##
## 'Positive' Class : yes
##
##
## Best Tune Random Forest:
## mtry
## 1 3
##
## Best Tune XGBoost:
## max_depth eta gamma colsample_bytree min_child_weight subsample
## 7 3 0.1823012 2.329812 0.9061948 6 0.8602981
## nrounds
## 7 200
##
## ====================================================
## Split : 90:10
## Kondisi : SMOTE + Random Search
## Metode : XGBoost Tuning
## ====================================================
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 62 11
## yes 12 15
##
## Accuracy : 0.77
## 95% CI : (0.6751, 0.8483)
## No Information Rate : 0.74
## P-Value [Acc > NIR] : 0.2887
##
## Kappa : 0.4097
##
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 0.5769
## Specificity : 0.8378
## Pos Pred Value : 0.5556
## Neg Pred Value : 0.8493
## Prevalence : 0.2600
## Detection Rate : 0.1500
## Detection Prevalence : 0.2700
## Balanced Accuracy : 0.7074
##
## 'Positive' Class : yes
##
tuning_80 <- tuning_model_random(train_80_smote, test_20_smote, "80:20")
##
## ====================================================
## Split : 80:20
## Kondisi : SMOTE + Random Search
## Metode : Decision Tree Tuning
## ====================================================
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 97 36
## yes 32 35
##
## Accuracy : 0.66
## 95% CI : (0.5898, 0.7253)
## No Information Rate : 0.645
## P-Value [Acc > NIR] : 0.3583
##
## Kappa : 0.248
##
## Mcnemar's Test P-Value : 0.7160
##
## Sensitivity : 0.4930
## Specificity : 0.7519
## Pos Pred Value : 0.5224
## Neg Pred Value : 0.7293
## Prevalence : 0.3550
## Detection Rate : 0.1750
## Detection Prevalence : 0.3350
## Balanced Accuracy : 0.6224
##
## 'Positive' Class : yes
##
##
## Best Tune Decision Tree:
## cp
## 4 0.006550218
##
## ====================================================
## Split : 80:20
## Kondisi : SMOTE + Random Search
## Metode : Random Forest Tuning
## ====================================================
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 119 50
## yes 10 21
##
## Accuracy : 0.7
## 95% CI : (0.6314, 0.7626)
## No Information Rate : 0.645
## P-Value [Acc > NIR] : 0.05902
##
## Kappa : 0.2499
##
## Mcnemar's Test P-Value : 0.0000004782
##
## Sensitivity : 0.2958
## Specificity : 0.9225
## Pos Pred Value : 0.6774
## Neg Pred Value : 0.7041
## Prevalence : 0.3550
## Detection Rate : 0.1050
## Detection Prevalence : 0.1550
## Balanced Accuracy : 0.6091
##
## 'Positive' Class : yes
##
##
## Best Tune Random Forest:
## mtry
## 1 3
##
## Best Tune XGBoost:
## max_depth eta gamma colsample_bytree min_child_weight subsample
## 10 6 0.289277 0.2291558 0.8772376 2 0.9770456
## nrounds
## 10 300
##
## ====================================================
## Split : 80:20
## Kondisi : SMOTE + Random Search
## Metode : XGBoost Tuning
## ====================================================
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 108 36
## yes 21 35
##
## Accuracy : 0.715
## 95% CI : (0.6471, 0.7764)
## No Information Rate : 0.645
## P-Value [Acc > NIR] : 0.02170
##
## Kappa : 0.3466
##
## Mcnemar's Test P-Value : 0.06369
##
## Sensitivity : 0.4930
## Specificity : 0.8372
## Pos Pred Value : 0.6250
## Neg Pred Value : 0.7500
## Prevalence : 0.3550
## Detection Rate : 0.1750
## Detection Prevalence : 0.2800
## Balanced Accuracy : 0.6651
##
## 'Positive' Class : yes
##
tuning_70 <- tuning_model_random(train_70_smote, test_30_smote, "70:30")
##
## ====================================================
## Split : 70:30
## Kondisi : SMOTE + Random Search
## Metode : Decision Tree Tuning
## ====================================================
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 176 49
## yes 31 44
##
## Accuracy : 0.7333
## 95% CI : (0.6795, 0.7825)
## No Information Rate : 0.69
## P-Value [Acc > NIR] : 0.05786
##
## Kappa : 0.3416
##
## Mcnemar's Test P-Value : 0.05735
##
## Sensitivity : 0.4731
## Specificity : 0.8502
## Pos Pred Value : 0.5867
## Neg Pred Value : 0.7822
## Prevalence : 0.3100
## Detection Rate : 0.1467
## Detection Prevalence : 0.2500
## Balanced Accuracy : 0.6617
##
## 'Positive' Class : yes
##
##
## Best Tune Decision Tree:
## cp
## 5 0.005636071
##
## ====================================================
## Split : 70:30
## Kondisi : SMOTE + Random Search
## Metode : Random Forest Tuning
## ====================================================
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 192 55
## yes 15 38
##
## Accuracy : 0.7667
## 95% CI : (0.7146, 0.8134)
## No Information Rate : 0.69
## P-Value [Acc > NIR] : 0.002032
##
## Kappa : 0.3813
##
## Mcnemar's Test P-Value : 0.000003141
##
## Sensitivity : 0.4086
## Specificity : 0.9275
## Pos Pred Value : 0.7170
## Neg Pred Value : 0.7773
## Prevalence : 0.3100
## Detection Rate : 0.1267
## Detection Prevalence : 0.1767
## Balanced Accuracy : 0.6681
##
## 'Positive' Class : yes
##
##
## Best Tune Random Forest:
## mtry
## 4 8
##
## Best Tune XGBoost:
## max_depth eta gamma colsample_bytree min_child_weight subsample
## 17 6 0.0727583 3.766539 0.8063855 6 0.6847444
## nrounds
## 17 300
##
## ====================================================
## Split : 70:30
## Kondisi : SMOTE + Random Search
## Metode : XGBoost Tuning
## ====================================================
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 184 49
## yes 23 44
##
## Accuracy : 0.76
## 95% CI : (0.7076, 0.8072)
## No Information Rate : 0.69
## P-Value [Acc > NIR] : 0.004525
##
## Kappa : 0.3922
##
## Mcnemar's Test P-Value : 0.003216
##
## Sensitivity : 0.4731
## Specificity : 0.8889
## Pos Pred Value : 0.6567
## Neg Pred Value : 0.7897
## Prevalence : 0.3100
## Detection Rate : 0.1467
## Detection Prevalence : 0.2233
## Balanced Accuracy : 0.6810
##
## 'Positive' Class : yes
##
hasil_tuning_semua <- bind_rows(
  tuning_90$hasil,
  tuning_80$hasil,
  tuning_70$hasil
)
# row.names = FALSE drops the auto-generated row names ("Precision...1", ...)
kable(hasil_tuning_semua, digits = 4, row.names = FALSE, caption = "Model Evaluation Results After SMOTE and Random Search")
| Split | Kondisi | Model | Accuracy | Precision | Recall | Specificity | F1_Score |
|---|---|---|---|---|---|---|---|
| 90:10 | SMOTE + Random Search | Decision Tree Tuning | 0.7700 | 0.5600 | 0.5385 | 0.8514 | 0.5490 |
| 90:10 | SMOTE + Random Search | Random Forest Tuning | 0.7700 | 0.5882 | 0.3846 | 0.9054 | 0.4651 |
| 90:10 | SMOTE + Random Search | XGBoost Tuning | 0.7700 | 0.5556 | 0.5769 | 0.8378 | 0.5660 |
| 80:20 | SMOTE + Random Search | Decision Tree Tuning | 0.6600 | 0.5224 | 0.4930 | 0.7519 | 0.5072 |
| 80:20 | SMOTE + Random Search | Random Forest Tuning | 0.7000 | 0.6774 | 0.2958 | 0.9225 | 0.4118 |
| 80:20 | SMOTE + Random Search | XGBoost Tuning | 0.7150 | 0.6250 | 0.4930 | 0.8372 | 0.5512 |
| 70:30 | SMOTE + Random Search | Decision Tree Tuning | 0.7333 | 0.5867 | 0.4731 | 0.8502 | 0.5238 |
| 70:30 | SMOTE + Random Search | Random Forest Tuning | 0.7667 | 0.7170 | 0.4086 | 0.9275 | 0.5205 |
| 70:30 | SMOTE + Random Search | XGBoost Tuning | 0.7600 | 0.6567 | 0.4731 | 0.8889 | 0.5500 |
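The tuning results show model performance after the parameter search. The model with the highest evaluation values can be considered the best, especially when it balances accuracy, recall, and F1-score well. In these results, tuning raised accuracy on the 90:10 split, where all three models reach 0.7700, but on the 80:20 and 70:30 splits no tuned model is more accurate than its untuned post-SMOTE counterpart.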
This part combines all evaluation results: the models before SMOTE, after SMOTE, and after SMOTE with random search. The combined table is used to identify the best model overall.
perbandingan_final <- bind_rows(
  hasil_awal,
  hasil_smote,
  hasil_tuning_semua
)
# row.names = FALSE drops the auto-generated row names ("Precision...1", ...)
kable(perbandingan_final, digits = 4, row.names = FALSE, caption = "Final Comparison of All Models")
| Split | Kondisi | Model | Accuracy | Precision | Recall | Specificity | F1_Score |
|---|---|---|---|---|---|---|---|
| 90:10 | Sebelum SMOTE | Decision Tree | 0.8300 | 0.7647 | 0.5000 | 0.9459 | 0.6047 |
| 90:10 | Sebelum SMOTE | Random Forest | 0.7900 | 0.6667 | 0.3846 | 0.9324 | 0.4878 |
| 90:10 | Sebelum SMOTE | XGBoost | 0.7400 | 0.5000 | 0.4615 | 0.8378 | 0.4800 |
| 80:20 | Sebelum SMOTE | Decision Tree | 0.6700 | 0.5610 | 0.3239 | 0.8605 | 0.4107 |
| 80:20 | Sebelum SMOTE | Random Forest | 0.7150 | 0.6591 | 0.4085 | 0.8837 | 0.5043 |
| 80:20 | Sebelum SMOTE | XGBoost | 0.6950 | 0.5833 | 0.4930 | 0.8062 | 0.5344 |
| 70:30 | Sebelum SMOTE | Decision Tree | 0.7533 | 0.7111 | 0.3441 | 0.9372 | 0.4638 |
| 70:30 | Sebelum SMOTE | Random Forest | 0.7600 | 0.7442 | 0.3441 | 0.9469 | 0.4706 |
| 70:30 | Sebelum SMOTE | XGBoost | 0.7767 | 0.7031 | 0.4839 | 0.9082 | 0.5732 |
| 90:10 | Setelah SMOTE | Decision Tree | 0.7300 | 0.4839 | 0.5769 | 0.7838 | 0.5263 |
| 90:10 | Setelah SMOTE | Random Forest | 0.7500 | 0.5238 | 0.4231 | 0.8649 | 0.4681 |
| 90:10 | Setelah SMOTE | XGBoost | 0.7500 | 0.5185 | 0.5385 | 0.8243 | 0.5283 |
| 80:20 | Setelah SMOTE | Decision Tree | 0.6600 | 0.5224 | 0.4930 | 0.7519 | 0.5072 |
| 80:20 | Setelah SMOTE | Random Forest | 0.7050 | 0.6364 | 0.3944 | 0.8760 | 0.4870 |
| 80:20 | Setelah SMOTE | XGBoost | 0.7300 | 0.6604 | 0.4930 | 0.8605 | 0.5645 |
| 70:30 | Setelah SMOTE | Decision Tree | 0.7333 | 0.5867 | 0.4731 | 0.8502 | 0.5238 |
| 70:30 | Setelah SMOTE | Random Forest | 0.7767 | 0.7600 | 0.4086 | 0.9420 | 0.5315 |
| 70:30 | Setelah SMOTE | XGBoost | 0.7767 | 0.6970 | 0.4946 | 0.9034 | 0.5786 |
| 90:10 | SMOTE + Random Search | Decision Tree Tuning | 0.7700 | 0.5600 | 0.5385 | 0.8514 | 0.5490 |
| 90:10 | SMOTE + Random Search | Random Forest Tuning | 0.7700 | 0.5882 | 0.3846 | 0.9054 | 0.4651 |
| 90:10 | SMOTE + Random Search | XGBoost Tuning | 0.7700 | 0.5556 | 0.5769 | 0.8378 | 0.5660 |
| 80:20 | SMOTE + Random Search | Decision Tree Tuning | 0.6600 | 0.5224 | 0.4930 | 0.7519 | 0.5072 |
| 80:20 | SMOTE + Random Search | Random Forest Tuning | 0.7000 | 0.6774 | 0.2958 | 0.9225 | 0.4118 |
| 80:20 | SMOTE + Random Search | XGBoost Tuning | 0.7150 | 0.6250 | 0.4930 | 0.8372 | 0.5512 |
| 70:30 | SMOTE + Random Search | Decision Tree Tuning | 0.7333 | 0.5867 | 0.4731 | 0.8502 | 0.5238 |
| 70:30 | SMOTE + Random Search | Random Forest Tuning | 0.7667 | 0.7170 | 0.4086 | 0.9275 | 0.5205 |
| 70:30 | SMOTE + Random Search | XGBoost Tuning | 0.7600 | 0.6567 | 0.4731 | 0.8889 | 0.5500 |
top_5_model <- perbandingan_final %>%
  arrange(desc(Accuracy), desc(F1_Score), desc(Recall)) %>%
  slice(1:5)
# row.names = FALSE drops the auto-generated row names ("Precision...1", ...)
kable(top_5_model, digits = 4, row.names = FALSE, caption = "Top Five Models by Accuracy")
| Split | Kondisi | Model | Accuracy | Precision | Recall | Specificity | F1_Score |
|---|---|---|---|---|---|---|---|
| 90:10 | Sebelum SMOTE | Decision Tree | 0.8300 | 0.7647 | 0.5000 | 0.9459 | 0.6047 |
| 90:10 | Sebelum SMOTE | Random Forest | 0.7900 | 0.6667 | 0.3846 | 0.9324 | 0.4878 |
| 70:30 | Setelah SMOTE | XGBoost | 0.7767 | 0.6970 | 0.4946 | 0.9034 | 0.5786 |
| 70:30 | Sebelum SMOTE | XGBoost | 0.7767 | 0.7031 | 0.4839 | 0.9082 | 0.5732 |
| 70:30 | Setelah SMOTE | Random Forest | 0.7767 | 0.7600 | 0.4086 | 0.9420 | 0.5315 |
Based on the final comparison table, the best model by accuracy is the Decision Tree on the 90:10 split before SMOTE (accuracy 0.8300), which also has the highest F1-score in the table (0.6047). However, for credit classification the choice of model should also take recall and F1-score into account, especially when the main goal is to detect the positive class better.
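To illustrate how the ranking changes when F1-score is the first criterion, the sketch below re-sorts the five rows of the top-five table by F1-score and then by recall. The values are copied from that table; the data frame `models` and the result `by_f1` are illustrative names, and base R `order()` stands in for the dplyr `arrange()` call used in the analysis.

```r
# The five models with the highest accuracy, copied from the table above
models <- data.frame(
  Split = c("90:10", "90:10", "70:30", "70:30", "70:30"),
  Kondisi = c("Sebelum SMOTE", "Sebelum SMOTE", "Setelah SMOTE",
              "Sebelum SMOTE", "Setelah SMOTE"),
  Model = c("Decision Tree", "Random Forest", "XGBoost",
            "XGBoost", "Random Forest"),
  Accuracy = c(0.8300, 0.7900, 0.7767, 0.7767, 0.7767),
  Recall = c(0.5000, 0.3846, 0.4946, 0.4839, 0.4086),
  F1_Score = c(0.6047, 0.4878, 0.5786, 0.5732, 0.5315)
)

# Sort by F1-score first, then recall (both descending)
by_f1 <- models[order(-models$F1_Score, -models$Recall), ]
rownames(by_f1) <- NULL
print(by_f1)
```

Among these five rows, the Decision Tree stays first, while the Random Forest on the 90:10 split before SMOTE drops from second place by accuracy to last place by F1-score.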
Feature importance identifies the variables that contribute most to the XGBoost model. Here, the tuned XGBoost model from the split with the highest accuracy (90:10, accuracy 0.7700) is used to compute variable importance. Because XGBoost works on dummy variables for the categorical predictors, the dummy names are grouped back under their original variable names.
hasil_xgb_tuning <- hasil_tuning_semua %>%
  filter(Model == "XGBoost Tuning") %>%
  arrange(desc(Accuracy), desc(F1_Score), desc(Recall))
split_xgb_terbaik <- hasil_xgb_tuning$Split[1]
model_xgb_terbaik <- switch(
  split_xgb_terbaik,
  "90:10" = tuning_90$xgb,
  "80:20" = tuning_80$xgb,
  "70:30" = tuning_70$xgb
)
importance_xgb <- xgb.importance(model = model_xgb_terbaik)
kable(head(importance_xgb, 10), digits = 4, caption = "Top Ten XGBoost Feature Importances")
| Feature | Gain | Cover | Frequency |
|---|---|---|---|
| checking_balanceunknown | 0.2695 | 0.0652 | 0.0357 |
| checking_balance..0.DM | 0.0898 | 0.0540 | 0.0516 |
| months_loan_duration | 0.0771 | 0.1134 | 0.0913 |
| amount | 0.0704 | 0.1163 | 0.1310 |
| purposefurniture.appliances | 0.0662 | 0.0530 | 0.0397 |
| other_creditnone | 0.0399 | 0.0423 | 0.0556 |
| housingown | 0.0387 | 0.0280 | 0.0238 |
| percent_of_income | 0.0361 | 0.0324 | 0.0317 |
| age | 0.0295 | 0.0465 | 0.0754 |
| savings_balanceunknown | 0.0294 | 0.0382 | 0.0278 |
# Map each dummy column back to its original variable, then sum Gain
importance_df <- importance_xgb %>%
  mutate(
    Variabel = case_when(
      grepl("^checking_balance", Feature) ~ "checking_balance",
      grepl("^months_loan_duration", Feature) ~ "months_loan_duration",
      grepl("^credit_history", Feature) ~ "credit_history",
      grepl("^purpose", Feature) ~ "purpose",
      grepl("^amount", Feature) ~ "amount",
      grepl("^savings_balance", Feature) ~ "savings_balance",
      grepl("^employment_duration", Feature) ~ "employment_duration",
      grepl("^percent_of_income", Feature) ~ "percent_of_income",
      grepl("^years_at_residence", Feature) ~ "years_at_residence",
      grepl("^age", Feature) ~ "age",
      grepl("^other_credit", Feature) ~ "other_credit",
      grepl("^housing", Feature) ~ "housing",
      grepl("^existing_loans_count", Feature) ~ "existing_loans_count",
      grepl("^job", Feature) ~ "job",
      grepl("^dependents", Feature) ~ "dependents",
      grepl("^phone", Feature) ~ "phone",
      TRUE ~ Feature
    )
  ) %>%
  group_by(Variabel) %>%
  summarise(Gain = sum(Gain), .groups = "drop") %>%
  arrange(desc(Gain)) %>%
  slice(1:10)
# Horizontal bar chart of the ten variables with the largest total Gain
ggplot(importance_df, aes(x = reorder(Variabel, Gain), y = Gain)) +
  geom_col(fill = "red") +
  coord_flip() +
  labs(
    title = "Feature Importance XGBoost",
    x = NULL,
    y = "Gain"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 12, face = "bold", hjust = 0.5),
    axis.text.y = element_text(size = 12, color = "black"),
    axis.text.x = element_text(size = 11, color = "black")
  )
The feature importance plot shows the variables that matter most in building the XGBoost model. The higher the Gain, the larger a variable's contribution to improving the model's classification ability. In the top-ten table, the dummy column checking_balanceunknown has the largest Gain (0.2695).
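As a small worked example of the grouping step, the sketch below sums Gain per original variable using only the ten rows of the feature importance table. It is a base R variant of the dplyr pipeline, with the hypothetical names `top10`, `original_vars`, and `grouped`. Because it sees only ten dummy columns, the totals for variables with further dummy columns may be lower than in the plotted result.

```r
# Feature and Gain columns copied from the top-ten importance table
top10 <- data.frame(
  Feature = c("checking_balanceunknown", "checking_balance..0.DM",
              "months_loan_duration", "amount",
              "purposefurniture.appliances", "other_creditnone",
              "housingown", "percent_of_income", "age",
              "savings_balanceunknown"),
  Gain = c(0.2695, 0.0898, 0.0771, 0.0704, 0.0662,
           0.0399, 0.0387, 0.0361, 0.0295, 0.0294)
)

# Original variable names that appear among these ten features
original_vars <- c("checking_balance", "months_loan_duration", "purpose",
                   "amount", "other_credit", "housing",
                   "percent_of_income", "age", "savings_balance")

# Assign each feature to the original variable whose name it starts with
top10$Variabel <- vapply(
  top10$Feature,
  function(f) original_vars[startsWith(f, original_vars)][1],
  character(1),
  USE.NAMES = FALSE
)

# Sum Gain within each original variable and sort from largest to smallest
grouped <- aggregate(Gain ~ Variabel, data = top10, FUN = sum)
grouped <- grouped[order(-grouped$Gain), ]
rownames(grouped) <- NULL
print(grouped)
```

Here the two checking_balance dummy columns combine to a Gain of 0.2695 + 0.0898 = 0.3593, which puts checking_balance well ahead of the next variable, months_loan_duration (0.0771).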
Across the analysis, classification models were built with three methods: Decision Tree, Random Forest, and XGBoost. They were evaluated on three data splits: 90:10, 80:20, and 70:30. Class imbalance was handled by applying SMOTE to the training data, after which the models were tuned with a random search.
The best model can be identified from the final comparison table.
In general, the model with the highest accuracy has the best overall
classification performance; here that is the Decision Tree on the
90:10 split before SMOTE (accuracy 0.8300). However, when the analysis
focuses on detecting the yes class, recall and F1-score also need to
be considered, so that the chosen model not only performs well overall
but also recognizes the positive class better.
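One lever that affects recall directly is the probability cutoff used to turn XGBoost scores into class labels, which is fixed at 0.5 in this analysis. The sketch below is not part of the original analysis: it uses ten made-up probabilities to show how a lower cutoff trades precision for recall. The data and the helpers `recall_at` and `precision_at` are hypothetical.

```r
# Made-up predicted probabilities of "yes" and the true labels
prob_yes <- c(0.92, 0.61, 0.45, 0.38, 0.36, 0.20, 0.55, 0.41, 0.15, 0.08)
actual <- factor(
  c("yes", "yes", "yes", "yes", "no", "no", "no", "no", "no", "no"),
  levels = c("no", "yes")
)

# Recall: share of actual "yes" cases that are predicted "yes"
recall_at <- function(threshold) {
  predicted <- ifelse(prob_yes >= threshold, "yes", "no")
  sum(predicted == "yes" & actual == "yes") / sum(actual == "yes")
}

# Precision: share of predicted "yes" cases that are actually "yes"
precision_at <- function(threshold) {
  predicted <- ifelse(prob_yes >= threshold, "yes", "no")
  sum(predicted == "yes" & actual == "yes") / sum(predicted == "yes")
}

comparison <- data.frame(
  Threshold = c(0.50, 0.35),
  Recall = c(recall_at(0.50), recall_at(0.35)),
  Precision = c(precision_at(0.50), precision_at(0.35))
)
print(comparison)
```

On this toy data, lowering the cutoff from 0.50 to 0.35 raises recall from 0.5 to 1 while precision falls from 2/3 to 4/7. Any such cutoff would need to be chosen on training or validation data rather than on the test set.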