1 Introduction

This analysis compares the performance of three classification methods, Decision Tree, Random Forest, and XGBoost, in classifying credit default status. The target variable is default, with the categories no and yes. The yes category is treated as the positive class, so precision, recall, and F1-score are all computed for the yes class.

The analysis proceeds through data import, descriptive statistics, inspection of the target distribution, splitting the data into training and testing sets, handling class imbalance with SMOTE, modeling before and after SMOTE, tuning with random search, and visualizing feature importance for the XGBoost model.

2 Package Setup

This section loads the packages required for the analysis. readxl reads the Excel data, dplyr handles data manipulation, caret handles model evaluation and tuning, rpart fits the Decision Tree, rpart.plot plots rpart trees, randomForest fits the Random Forest, xgboost fits the XGBoost model, smotefamily provides SMOTE, ggplot2 handles visualization, and knitr renders the result tables through kable().

library(readxl)
library(dplyr)
library(caret)
library(rpart)
library(rpart.plot)
library(randomForest)
library(xgboost)
library(smotefamily)
library(ggplot2)
library(knitr)

3 Data Import

The data are imported from an Excel file. Adjust path_data to the location of the file on your own machine. After import, character variables are converted to factors, and default is set as a factor with the level order no, yes.

path_data <- "D:/Dokumen/Semester 6/ML/UAS ML/data_gabungan.xlsx"

data_ML <- read_excel(path_data)

data_ML <- data_ML %>%
  mutate(
    across(where(is.character), as.factor),
    default = factor(default, levels = c("no", "yes"))
  )

head(data_ML)

## # A tibble: 6 × 17
##   checking_balance months_loan_duration credit_history purpose            amount
##   <fct>                           <dbl> <fct>          <fct>               <dbl>
## 1 1 - 200 DM                         15 very good      car                  6850
## 2 unknown                            30 good           car                  4811
## 3 unknown                            28 critical       furniture/applian…   2743
## 4 1 - 200 DM                         15 good           renovations          2631
## 5 < 0 DM                             10 good           furniture/applian…   2315
## 6 unknown                            15 good           car                  4657
## # ℹ 12 more variables: savings_balance <fct>, employment_duration <fct>,
## #   percent_of_income <dbl>, years_at_residence <dbl>, age <dbl>,
## #   other_credit <fct>, housing <fct>, existing_loans_count <dbl>, job <fct>,
## #   dependents <dbl>, default <fct>, phone <fct>

4 Descriptive Statistics of the Data

Descriptive statistics give an overview of the structure of the data: the number of observations, the number of variables, and how many of the variables are numeric or categorical.

jumlah_observasi <- nrow(data_ML)
jumlah_variabel <- ncol(data_ML)
jumlah_numerik <- sum(sapply(data_ML, is.numeric))
jumlah_kategorik <- sum(sapply(data_ML, is.factor))

statdesk_dimensi <- data.frame(
  Keterangan = c(
    "Jumlah Observasi",
    "Jumlah Variabel",
    "Jumlah Variabel Numerik",
    "Jumlah Variabel Kategorik"
  ),
  Nilai = c(
    jumlah_observasi,
    jumlah_variabel,
    jumlah_numerik,
    jumlah_kategorik
  )
)

kable(statdesk_dimensi, caption = "Ringkasan Dimensi Data")

Ringkasan Dimensi Data
Keterangan	Nilai
Jumlah Observasi	1000
Jumlah Variabel	17
Jumlah Variabel Numerik	7
Jumlah Variabel Kategorik	10

The summary shows 1000 observations and 17 variables, of which 7 are numeric and 10 are categorical (the categorical count includes the target default). This matters because the sample size and the variable types affect how each classification model is built.

5 Distribution of the Target Variable

This section shows the distribution of the target variable default. Checking it reveals whether the classes are imbalanced. Class imbalance can cause a model to predict the majority class better than the minority class.

distribusi_target <- data_ML %>%
  count(default) %>%
  mutate(Persentase = round(n / sum(n) * 100, 2))

kable(distribusi_target, caption = "Distribusi Variabel Target")

Distribusi Variabel Target
default	n	Persentase
no	700	70
yes	300	30

ggplot(distribusi_target, aes(x = default, y = n, fill = default)) +
  geom_bar(stat = "identity") +
  geom_text(
    aes(label = paste0(n, " (", Persentase, "%)")),
    vjust = -0.3,
    size = 6
  ) +
  scale_fill_manual(values = c("no" = "green", "yes" = "red")) +
  labs(
    title = "Distribusi Status Default Kredit",
    x = "Status Default",
    y = "Frekuensi"
  ) +
  theme_minimal() +
  theme(
    legend.position = "none",
    plot.title = element_text(hjust = 0.5, face = "bold", size = 18),
    axis.text.x = element_text(size = 16, face = "bold", color = "black"),
    axis.text.y = element_text(size = 14, color = "black"),
    axis.title.x = element_text(size = 18, face = "bold"),
    axis.title.y = element_text(size = 18, face = "bold")
  )

The plot compares the number of observations in the no and yes categories: 700 (70%) are no and 300 (30%) are yes. Because no is more than twice as frequent as yes, the data can be regarded as imbalanced. This is the reason class imbalance is handled with SMOTE in a later step.
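To illustrate why a 70:30 target distribution calls for metrics beyond accuracy, the sketch below scores a hypothetical classifier that always predicts no. The vectors are rebuilt from the class counts in the table above, and the names actual and pred_majority are illustrative assumptions, not objects from the analysis.

```r
# Rebuild the target from the reported class counts (700 no, 300 yes).
actual <- factor(rep(c("no", "yes"), times = c(700, 300)), levels = c("no", "yes"))

# Hypothetical baseline: always predict the majority class.
pred_majority <- factor(rep("no", length(actual)), levels = levels(actual))

# Accuracy looks acceptable, but no defaulter is ever detected.
accuracy   <- mean(pred_majority == actual)
recall_yes <- sum(pred_majority == "yes" & actual == "yes") / sum(actual == "yes")

c(accuracy = accuracy, recall_yes = recall_yes)
```

Such a baseline reaches 70% accuracy with zero recall on yes, which is why recall and F1-score for the yes class are tracked alongside accuracy.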

6 Descriptive Statistics of Numeric Variables

Descriptive statistics for the numeric variables show the basic characteristics of each one: the mean, minimum, median, maximum, and standard deviation.

data_numerik <- data_ML %>%
  select(where(is.numeric))

statdesk_numerik <- data.frame(
  Variabel = names(data_numerik),
  Mean = sapply(data_numerik, function(x) round(mean(x, na.rm = TRUE), 2)),
  Minimum = sapply(data_numerik, function(x) min(x, na.rm = TRUE)),
  Median = sapply(data_numerik, function(x) median(x, na.rm = TRUE)),
  Maksimum = sapply(data_numerik, function(x) max(x, na.rm = TRUE)),
  SD = sapply(data_numerik, function(x) round(sd(x, na.rm = TRUE), 2)),
  # sapply() returns named vectors; row.names = NULL stops the names from
  # becoming row names, which would repeat every variable name in the table
  row.names = NULL
)

kable(statdesk_numerik, caption = "Statistik Deskriptif Variabel Numerik")

Statistik Deskriptif Variabel Numerik
Variabel	Mean	Minimum	Median	Maksimum	SD
months_loan_duration	20.90	4	18.0	72	12.06
amount	3271.26	250	2319.5	18424	2822.74
percent_of_income	2.97	1	3.0	4	1.12
years_at_residence	2.85	1	3.0	4	1.10
age	35.55	19	33.0	75	11.38
existing_loans_count	1.41	1	1.0	4	0.58
dependents	1.16	1	1.0	2	0.36

The table summarizes the spread of each numeric variable. The minimum and maximum give the range of the data, and the standard deviation indicates how much each variable varies.

7 Missing Value Check

The missing value check determines whether any variable contains empty values. This matters because missing values can affect model training and cause errors if they are not handled properly.

missing_value <- data.frame(
  Variabel = names(data_ML),
  Jumlah_Missing = colSums(is.na(data_ML)),
  Persentase_Missing = round(colSums(is.na(data_ML)) / nrow(data_ML) * 100, 2),
  # colSums() returns a named vector; row.names = NULL stops the names from
  # becoming row names, which would repeat every variable name in the table
  row.names = NULL
)

kable(missing_value, caption = "Pemeriksaan Missing Value")

Pemeriksaan Missing Value
Variabel	Jumlah_Missing	Persentase_Missing
checking_balance	0	0
months_loan_duration	0	0
credit_history	0	0
purpose	0	0
amount	0	0
savings_balance	0	0
employment_duration	0	0
percent_of_income	0	0
years_at_residence	0	0
age	0	0
other_credit	0	0
housing	0	0
existing_loans_count	0	0
job	0	0
dependents	0	0
default	0	0
phone	0	0

All 17 variables have zero missing values, so the data can be used for modeling without further treatment. Had any values been missing, they would need to be handled first, for example by deleting the affected rows or by imputing the values.
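The two treatments named above can be sketched on a small toy data frame. The dataset in this analysis has no missing values, so the data frame toy and its values are purely hypothetical. Median imputation for the numeric column and most-frequent-level imputation for the factor are one common choice, not a method prescribed by the analysis.

```r
# Hypothetical toy data with one missing value per column.
toy <- data.frame(
  amount  = c(1200, NA, 3400, 2500),
  housing = factor(c("own", "rent", NA, "own"))
)

# Option 1: drop every row that contains a missing value.
toy_complete <- na.omit(toy)

# Option 2: impute instead of dropping.
toy_imputed <- toy

# Numeric column: replace NA with the median of the observed values.
toy_imputed$amount[is.na(toy_imputed$amount)] <- median(toy$amount, na.rm = TRUE)

# Factor column: replace NA with the most frequent observed level.
mode_housing <- names(which.max(table(toy$housing)))
toy_imputed$housing[is.na(toy_imputed$housing)] <- mode_housing

toy_imputed
```

In a modeling workflow the imputation values would normally be computed from the training data only, for the same leakage reason that applies to SMOTE.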

8 Model Evaluation Function

The following function computes classification performance from the confusion matrix. The metrics are accuracy, precision, recall, specificity, and F1-score, with yes set as the positive class.

evaluasi_model <- function(actual, pred, nama_model, split, kondisi) {
  cat("\n====================================================\n")
  cat("Split   :", split, "\n")
  cat("Kondisi :", kondisi, "\n")
  cat("Metode  :", nama_model, "\n")
  cat("====================================================\n")
  
  cm <- confusionMatrix(pred, actual, positive = "yes")
  print(cm)
  
  hasil <- data.frame(
    Split = split,
    Kondisi = kondisi,
    Model = nama_model,
    Accuracy = as.numeric(cm$overall["Accuracy"]),
    # A metric that comes back as NA (for example from a 0/0 division) is reported as 0
    Precision = ifelse(is.na(cm$byClass["Precision"]), 0, cm$byClass["Precision"]),
    Recall = ifelse(is.na(cm$byClass["Recall"]), 0, cm$byClass["Recall"]),
    Specificity = ifelse(is.na(cm$byClass["Specificity"]), 0, cm$byClass["Specificity"]),
    F1_Score = ifelse(is.na(cm$byClass["F1"]), 0, cm$byClass["F1"]),
    # byClass values are named; row.names = NULL keeps "Precision" from becoming the row name
    row.names = NULL
  )
  
  return(hasil)
}
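As a cross-check on what the function reports, the five metrics can be computed by hand from the four cells of a confusion matrix. The sketch below uses the counts from the Decision Tree result for the 90:10 split before SMOTE (shown later in this document); the variable names tp, fp, fn, and tn are illustrative.

```r
# Cells of the confusion matrix with "yes" as the positive class
# (Decision Tree, 90:10 split, before SMOTE).
tp <- 13  # predicted yes, actually yes
fp <- 4   # predicted yes, actually no
fn <- 13  # predicted no,  actually yes
tn <- 70  # predicted no,  actually no

accuracy    <- (tp + tn) / (tp + fp + fn + tn)
precision   <- tp / (tp + fp)   # share of predicted yes that are truly yes
recall      <- tp / (tp + fn)   # share of actual yes that are detected
specificity <- tn / (tn + fp)   # share of actual no that are detected
f1          <- 2 * precision * recall / (precision + recall)

round(c(Accuracy = accuracy, Precision = precision, Recall = recall,
        Specificity = specificity, F1_Score = f1), 4)
```

The rounded values (0.8300, 0.7647, 0.5000, 0.9459, 0.6047) agree with the corresponding row of the evaluation table.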

9 SMOTE Function

SMOTE (Synthetic Minority Oversampling Technique) balances the class distribution of the training data. It generates synthetic observations for the minority class so that the model has a better chance of learning that class's patterns. SMOTE is applied only to the training data to avoid data leakage into the testing data.

smote_data <- function(train_data) {
  set.seed(123)
  target <- train_data$default
  target_num <- ifelse(target == "yes", 1, 0)
  
  fitur <- train_data %>%
    select(-default)
  
  fitur_dummy <- model.matrix(~ . -1, data = fitur) %>%
    as.data.frame()
  
  smote_result <- SMOTE(
    X = fitur_dummy,
    target = target_num,
    K = 5,
    dup_size = 1
  )
  
  data_smote <- smote_result$data
  names(data_smote)[ncol(data_smote)] <- "default"
  
  data_smote$default <- factor(
    ifelse(data_smote$default == 1, "yes", "no"),
    levels = c("no", "yes")
  )
  
  return(data_smote)
}
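The core idea behind a synthetic observation can be sketched in base R: pick a minority point, pick one of its nearest minority neighbours, and place a new point at a random position on the segment between them. This is a simplified illustration, not the smotefamily implementation. It uses a single nearest neighbour rather than a random choice among K = 5, and the matrix minority holds made-up values.

```r
# Toy minority-class points with two numeric features (illustrative values).
minority <- matrix(
  c(1.0, 2.0,
    1.5, 1.8,
    3.0, 3.2),
  ncol = 2, byrow = TRUE
)

set.seed(123)
i <- 1                                  # index of the seed point
d <- as.matrix(dist(minority))[i, ]     # Euclidean distances from the seed point
neighbour <- order(d)[2]                # closest other point (position 1 is the seed itself)
gap <- runif(1)                         # random position on the segment, between 0 and 1

# Synthetic point = seed + gap * (neighbour - seed)
synthetic <- minority[i, ] + gap * (minority[neighbour, ] - minority[i, ])
synthetic
```

Because the new point is an interpolation, dummy-coded categorical columns can take fractional values after SMOTE, which is worth keeping in mind when reading the resampled training data.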

10 Column Alignment Functions

After SMOTE, the categorical predictors have been converted to dummy variables. The columns of the training and testing data therefore need to be aligned so that the models can predict without errors.

samakan_kolom <- function(train_data, test_data) {
  test_x <- test_data %>%
    select(-default)
  
  test_dummy <- model.matrix(~ . -1, data = test_x) %>%
    as.data.frame()
  
  kolom_train <- colnames(train_data)[colnames(train_data) != "default"]
  
  for (k in kolom_train) {
    if (!(k %in% colnames(test_dummy))) {
      test_dummy[[k]] <- 0
    }
  }
  
  test_dummy <- test_dummy[, kolom_train]
  test_dummy$default <- test_data$default
  
  return(test_dummy)
}

rapikan_nama_kolom <- function(data) {
  names(data) <- make.names(names(data), unique = TRUE)
  return(data)
}
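The alignment step can be illustrated on a toy example in which one category appears in the training data but not in the testing data. The data frames train_x and test_x below are hypothetical; the padding logic mirrors what samakan_kolom does with a loop.

```r
# Hypothetical predictors: "education" occurs only in the training data.
train_x <- data.frame(
  purpose = factor(c("car", "education", "furniture")),
  amount  = c(100, 200, 300)
)
test_x <- data.frame(
  purpose = factor(c("car", "furniture")),
  amount  = c(150, 250)
)

# Dummy-encode each set separately, as done for the real data.
train_m <- model.matrix(~ . - 1, data = train_x)
test_m  <- model.matrix(~ . - 1, data = test_x)

# Columns present in training but absent from testing are added as zeros,
# then the test columns are put in the training column order.
missing_cols <- setdiff(colnames(train_m), colnames(test_m))
padding <- matrix(0, nrow = nrow(test_m), ncol = length(missing_cols),
                  dimnames = list(NULL, missing_cols))
test_aligned <- cbind(test_m, padding)[, colnames(train_m), drop = FALSE]

colnames(test_aligned)
```

A zero in the added column simply encodes that no test observation belongs to the missing category.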

11 Data Splitting

The data are randomly split under three scenarios: 90:10, 80:20, and 70:30. The first number is the proportion of training data and the second is the proportion of testing data. set.seed(123) is called once before the three sample() calls so that the splits can be reproduced. Each scenario draws its own sample, and the sampling is not stratified by default.

set.seed(123)

index_90 <- sample(1:nrow(data_ML), size = 0.90 * nrow(data_ML))
train_90 <- data_ML[index_90, ]
test_10  <- data_ML[-index_90, ]

index_80 <- sample(1:nrow(data_ML), size = 0.80 * nrow(data_ML))
train_80 <- data_ML[index_80, ]
test_20  <- data_ML[-index_80, ]

index_70 <- sample(1:nrow(data_ML), size = 0.70 * nrow(data_ML))
train_70 <- data_ML[index_70, ]
test_30  <- data_ML[-index_70, ]

distribusi_split <- bind_rows(
  as.data.frame(table(train_90$default)) %>% mutate(Split = "90:10 Training"),
  as.data.frame(table(test_10$default)) %>% mutate(Split = "90:10 Testing"),
  as.data.frame(table(train_80$default)) %>% mutate(Split = "80:20 Training"),
  as.data.frame(table(test_20$default)) %>% mutate(Split = "80:20 Testing"),
  as.data.frame(table(train_70$default)) %>% mutate(Split = "70:30 Training"),
  as.data.frame(table(test_30$default)) %>% mutate(Split = "70:30 Testing")
) %>%
  select(Split, Default = Var1, Frekuensi = Freq)

kable(distribusi_split, caption = "Distribusi Kelas pada Setiap Skenario Split Data")

Distribusi Kelas pada Setiap Skenario Split Data
Split	Default	Frekuensi
90:10 Training	no	626
90:10 Training	yes	274
90:10 Testing	no	74
90:10 Testing	yes	26
80:20 Training	no	571
80:20 Training	yes	229
80:20 Testing	no	129
80:20 Testing	yes	71
70:30 Training	no	493
70:30 Training	yes	207
70:30 Testing	no	207
70:30 Testing	yes	93

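The table shows the distribution of no and yes in the training and testing data for each scenario. Because each split is drawn at random without stratification, the class proportions vary between scenarios even though all of them come from the same dataset: yes makes up 26% of the 90:10 testing data, 35.5% of the 80:20 testing data, and 31% of the 70:30 testing data, compared with 30% in the full dataset.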
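If the variation in class proportions across scenarios is a concern, a stratified split keeps the share of yes the same in the training and testing data. The base-R sketch below is an alternative to the simple random split used in this analysis, not the method the analysis applied. The vector y is rebuilt from the reported class counts, and caret's createDataPartition() offers similar behaviour.

```r
set.seed(123)

# Target rebuilt from the reported class counts (700 no, 300 yes).
y <- factor(rep(c("no", "yes"), times = c(700, 300)), levels = c("no", "yes"))

# Sample 80% of the row indices within each class separately.
idx_by_class <- split(seq_along(y), y)
train_idx <- unlist(
  lapply(idx_by_class, function(idx) sample(idx, size = round(0.8 * length(idx)))),
  use.names = FALSE
)

# Class proportions are now identical in both parts.
prop_train <- prop.table(table(y[train_idx]))
prop_test  <- prop.table(table(y[-train_idx]))
rbind(train = prop_train, test = prop_test)
```

With this approach both parts keep the 70:30 ratio of the full dataset, so differences between scenarios reflect the training size rather than a shifted class mix.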

12 Modeling Function Before Tuning

The following function runs the three classification models: Decision Tree, Random Forest, and XGBoost. It is used both before SMOTE and after SMOTE, in each case before any parameter tuning.

jalankan_model <- function(train_data, test_data, split_name, kondisi) {
  hasil_semua <- data.frame()
  
  test_x <- test_data %>% select(-default)
  test_y <- test_data$default
  
  # Decision Tree
  model_tree <- rpart(
    default ~ .,
    data = train_data,
    method = "class"
  )
  
  pred_tree <- predict(model_tree, test_x, type = "class")
  
  hasil_semua <- bind_rows(
    hasil_semua,
    evaluasi_model(test_y, pred_tree, "Decision Tree", split_name, kondisi)
  )
  
  # Random Forest
  set.seed(123)
  model_rf <- randomForest(
    default ~ .,
    data = train_data,
    ntree = 500,
    mtry = floor(sqrt(ncol(train_data) - 1)),
    importance = TRUE
  )
  
  pred_rf <- predict(model_rf, test_x)
  
  hasil_semua <- bind_rows(
    hasil_semua,
    evaluasi_model(test_y, pred_rf, "Random Forest", split_name, kondisi)
  )
  
  # XGBoost
  train_x <- train_data %>% select(-default)
  test_x2  <- test_data %>% select(-default)
  
  train_matrix <- model.matrix(~ . -1, data = train_x)
  test_matrix  <- model.matrix(~ . -1, data = test_x2)
  
  kolom_train <- colnames(train_matrix)
  
  for (k in kolom_train) {
    if (!(k %in% colnames(test_matrix))) {
      test_matrix <- cbind(test_matrix, 0)
      colnames(test_matrix)[ncol(test_matrix)] <- k
    }
  }
  
  test_matrix <- test_matrix[, kolom_train]
  train_label <- ifelse(train_data$default == "yes", 1, 0)
  
  dtrain <- xgb.DMatrix(data = train_matrix, label = train_label)
  dtest <- xgb.DMatrix(data = test_matrix)
  
  set.seed(123)
  model_xgb <- xgb.train(
    params = list(
      objective = "binary:logistic",
      eval_metric = "logloss",
      seed = 123
    ),
    data = dtrain,
    nrounds = 100,
    verbose = 0
  )
  
  pred_prob_xgb <- predict(model_xgb, dtest)
  pred_xgb <- factor(
    ifelse(pred_prob_xgb >= 0.5, "yes", "no"),
    levels = c("no", "yes")
  )
  
  hasil_semua <- bind_rows(
    hasil_semua,
    evaluasi_model(test_y, pred_xgb, "XGBoost", split_name, kondisi)
  )
  
  return(hasil_semua)
}

13 Models Before SMOTE

At this stage the models are fitted on the original data, with no handling of class imbalance. The results serve as the baseline for comparison with the models after SMOTE and after tuning.

hasil_awal_90 <- jalankan_model(train_90, test_10, "90:10", "Sebelum SMOTE")

## 
## ====================================================
## Split   : 90:10 
## Kondisi : Sebelum SMOTE 
## Metode  : Decision Tree 
## ====================================================
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction no yes
##        no  70  13
##        yes  4  13
##                                           
##                Accuracy : 0.83            
##                  95% CI : (0.7418, 0.8977)
##     No Information Rate : 0.74            
##     P-Value [Acc > NIR] : 0.02275         
##                                           
##                   Kappa : 0.5023          
##                                           
##  Mcnemar's Test P-Value : 0.05235         
##                                           
##             Sensitivity : 0.5000          
##             Specificity : 0.9459          
##          Pos Pred Value : 0.7647          
##          Neg Pred Value : 0.8434          
##              Prevalence : 0.2600          
##          Detection Rate : 0.1300          
##    Detection Prevalence : 0.1700          
##       Balanced Accuracy : 0.7230          
##                                           
##        'Positive' Class : yes             
##                                           
## 
## ====================================================
## Split   : 90:10 
## Kondisi : Sebelum SMOTE 
## Metode  : Random Forest 
## ====================================================
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction no yes
##        no  69  16
##        yes  5  10
##                                           
##                Accuracy : 0.79            
##                  95% CI : (0.6971, 0.8651)
##     No Information Rate : 0.74            
##     P-Value [Acc > NIR] : 0.1521          
##                                           
##                   Kappa : 0.3675          
##                                           
##  Mcnemar's Test P-Value : 0.0291          
##                                           
##             Sensitivity : 0.3846          
##             Specificity : 0.9324          
##          Pos Pred Value : 0.6667          
##          Neg Pred Value : 0.8118          
##              Prevalence : 0.2600          
##          Detection Rate : 0.1000          
##    Detection Prevalence : 0.1500          
##       Balanced Accuracy : 0.6585          
##                                           
##        'Positive' Class : yes             
##                                           
## 
## ====================================================
## Split   : 90:10 
## Kondisi : Sebelum SMOTE 
## Metode  : XGBoost 
## ====================================================
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction no yes
##        no  62  14
##        yes 12  12
##                                           
##                Accuracy : 0.74            
##                  95% CI : (0.6427, 0.8226)
##     No Information Rate : 0.74            
##     P-Value [Acc > NIR] : 0.5525          
##                                           
##                   Kappa : 0.307           
##                                           
##  Mcnemar's Test P-Value : 0.8445          
##                                           
##             Sensitivity : 0.4615          
##             Specificity : 0.8378          
##          Pos Pred Value : 0.5000          
##          Neg Pred Value : 0.8158          
##              Prevalence : 0.2600          
##          Detection Rate : 0.1200          
##    Detection Prevalence : 0.2400          
##       Balanced Accuracy : 0.6497          
##                                           
##        'Positive' Class : yes             
##

hasil_awal_80 <- jalankan_model(train_80, test_20, "80:20", "Sebelum SMOTE")

## 
## ====================================================
## Split   : 80:20 
## Kondisi : Sebelum SMOTE 
## Metode  : Decision Tree 
## ====================================================
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  111  48
##        yes  18  23
##                                           
##                Accuracy : 0.67            
##                  95% CI : (0.6002, 0.7347)
##     No Information Rate : 0.645           
##     P-Value [Acc > NIR] : 0.2543651       
##                                           
##                   Kappa : 0.2038          
##                                           
##  Mcnemar's Test P-Value : 0.0003575       
##                                           
##             Sensitivity : 0.3239          
##             Specificity : 0.8605          
##          Pos Pred Value : 0.5610          
##          Neg Pred Value : 0.6981          
##              Prevalence : 0.3550          
##          Detection Rate : 0.1150          
##    Detection Prevalence : 0.2050          
##       Balanced Accuracy : 0.5922          
##                                           
##        'Positive' Class : yes             
##                                           
## 
## ====================================================
## Split   : 80:20 
## Kondisi : Sebelum SMOTE 
## Metode  : Random Forest 
## ====================================================
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  114  42
##        yes  15  29
##                                           
##                Accuracy : 0.715           
##                  95% CI : (0.6471, 0.7764)
##     No Information Rate : 0.645           
##     P-Value [Acc > NIR] : 0.0216966       
##                                           
##                   Kappa : 0.3195          
##                                           
##  Mcnemar's Test P-Value : 0.0005736       
##                                           
##             Sensitivity : 0.4085          
##             Specificity : 0.8837          
##          Pos Pred Value : 0.6591          
##          Neg Pred Value : 0.7308          
##              Prevalence : 0.3550          
##          Detection Rate : 0.1450          
##    Detection Prevalence : 0.2200          
##       Balanced Accuracy : 0.6461          
##                                           
##        'Positive' Class : yes             
##                                           
## 
## ====================================================
## Split   : 80:20 
## Kondisi : Sebelum SMOTE 
## Metode  : XGBoost 
## ====================================================
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  104  36
##        yes  25  35
##                                          
##                Accuracy : 0.695          
##                  95% CI : (0.6261, 0.758)
##     No Information Rate : 0.645          
##     P-Value [Acc > NIR] : 0.07903        
##                                          
##                   Kappa : 0.31           
##                                          
##  Mcnemar's Test P-Value : 0.20042        
##                                          
##             Sensitivity : 0.4930         
##             Specificity : 0.8062         
##          Pos Pred Value : 0.5833         
##          Neg Pred Value : 0.7429         
##              Prevalence : 0.3550         
##          Detection Rate : 0.1750         
##    Detection Prevalence : 0.3000         
##       Balanced Accuracy : 0.6496         
##                                          
##        'Positive' Class : yes            
##

hasil_awal_70 <- jalankan_model(train_70, test_30, "70:30", "Sebelum SMOTE")

## 
## ====================================================
## Split   : 70:30 
## Kondisi : Sebelum SMOTE 
## Metode  : Decision Tree 
## ====================================================
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  194  61
##        yes  13  32
##                                           
##                Accuracy : 0.7533          
##                  95% CI : (0.7005, 0.8011)
##     No Information Rate : 0.69            
##     P-Value [Acc > NIR] : 0.009418        
##                                           
##                   Kappa : 0.3279          
##                                           
##  Mcnemar's Test P-Value : 0.00000004665   
##                                           
##             Sensitivity : 0.3441          
##             Specificity : 0.9372          
##          Pos Pred Value : 0.7111          
##          Neg Pred Value : 0.7608          
##              Prevalence : 0.3100          
##          Detection Rate : 0.1067          
##    Detection Prevalence : 0.1500          
##       Balanced Accuracy : 0.6406          
##                                           
##        'Positive' Class : yes             
##                                           
## 
## ====================================================
## Split   : 70:30 
## Kondisi : Sebelum SMOTE 
## Metode  : Random Forest 
## ====================================================
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  196  61
##        yes  11  32
##                                           
##                Accuracy : 0.76            
##                  95% CI : (0.7076, 0.8072)
##     No Information Rate : 0.69            
##     P-Value [Acc > NIR] : 0.004525        
##                                           
##                   Kappa : 0.3415          
##                                           
##  Mcnemar's Test P-Value : 0.000000007709  
##                                           
##             Sensitivity : 0.3441          
##             Specificity : 0.9469          
##          Pos Pred Value : 0.7442          
##          Neg Pred Value : 0.7626          
##              Prevalence : 0.3100          
##          Detection Rate : 0.1067          
##    Detection Prevalence : 0.1433          
##       Balanced Accuracy : 0.6455          
##                                           
##        'Positive' Class : yes             
##                                           
## 
## ====================================================
## Split   : 70:30 
## Kondisi : Sebelum SMOTE 
## Metode  : XGBoost 
## ====================================================
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  188  48
##        yes  19  45
##                                           
##                Accuracy : 0.7767          
##                  95% CI : (0.7253, 0.8225)
##     No Information Rate : 0.69            
##     P-Value [Acc > NIR] : 0.0005368       
##                                           
##                   Kappa : 0.4289          
##                                           
##  Mcnemar's Test P-Value : 0.0006245       
##                                           
##             Sensitivity : 0.4839          
##             Specificity : 0.9082          
##          Pos Pred Value : 0.7031          
##          Neg Pred Value : 0.7966          
##              Prevalence : 0.3100          
##          Detection Rate : 0.1500          
##    Detection Prevalence : 0.2133          
##       Balanced Accuracy : 0.6960          
##                                           
##        'Positive' Class : yes             
##

hasil_awal <- bind_rows(
  hasil_awal_90,
  hasil_awal_80,
  hasil_awal_70
)

kable(hasil_awal, digits = 4, caption = "Hasil Evaluasi Model Sebelum SMOTE")

Hasil Evaluasi Model Sebelum SMOTE
Split	Kondisi	Model	Accuracy	Precision	Recall	Specificity	F1_Score
90:10	Sebelum SMOTE	Decision Tree	0.8300	0.7647	0.5000	0.9459	0.6047
90:10	Sebelum SMOTE	Random Forest	0.7900	0.6667	0.3846	0.9324	0.4878
90:10	Sebelum SMOTE	XGBoost	0.7400	0.5000	0.4615	0.8378	0.4800
80:20	Sebelum SMOTE	Decision Tree	0.6700	0.5610	0.3239	0.8605	0.4107
80:20	Sebelum SMOTE	Random Forest	0.7150	0.6591	0.4085	0.8837	0.5043
80:20	Sebelum SMOTE	XGBoost	0.6950	0.5833	0.4930	0.8062	0.5344
70:30	Sebelum SMOTE	Decision Tree	0.7533	0.7111	0.3441	0.9372	0.4638
70:30	Sebelum SMOTE	Random Forest	0.7600	0.7442	0.3441	0.9469	0.4706
70:30	Sebelum SMOTE	XGBoost	0.7767	0.7031	0.4839	0.9082	0.5732

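The table shows each model's baseline ability to classify credit default status before SMOTE. Across the nine runs, recall for the yes class stays between 0.3239 and 0.5000 while specificity stays between 0.8062 and 0.9469, so every model identifies non-defaulters far more reliably than defaulters. These results need to be compared with the models after SMOTE to see how handling class imbalance affects performance.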
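One way to read the table is to pick, for each split, the model with the highest F1-score on the yes class. The sketch below re-enters the F1_Score column from the table by hand; the object names hasil_f1 and best_per_split are illustrative and not part of the original analysis.

```r
# F1-scores before SMOTE, copied from the evaluation table.
hasil_f1 <- data.frame(
  Split = rep(c("90:10", "80:20", "70:30"), each = 3),
  Model = rep(c("Decision Tree", "Random Forest", "XGBoost"), times = 3),
  F1_Score = c(0.6047, 0.4878, 0.4800,
               0.4107, 0.5043, 0.5344,
               0.4638, 0.4706, 0.5732)
)

# For each split, keep the row with the highest F1-score.
best_per_split <- do.call(
  rbind,
  lapply(split(hasil_f1, hasil_f1$Split), function(d) d[which.max(d$F1_Score), ])
)
best_per_split
```

On these figures the Decision Tree leads for 90:10 and XGBoost leads for 80:20 and 70:30, so no single model is best across all splits before SMOTE.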

14 Applying SMOTE

SMOTE is applied to the training data only. The testing data are left untouched so that they still represent the original data and provide an objective basis for evaluation.

train_90_smote <- smote_data(train_90)
train_80_smote <- smote_data(train_80)
train_70_smote <- smote_data(train_70)

test_10_smote <- samakan_kolom(train_90_smote, test_10)
test_20_smote <- samakan_kolom(train_80_smote, test_20)
test_30_smote <- samakan_kolom(train_70_smote, test_30)

train_90_smote <- rapikan_nama_kolom(train_90_smote)
test_10_smote  <- rapikan_nama_kolom(test_10_smote)

train_80_smote <- rapikan_nama_kolom(train_80_smote)
test_20_smote  <- rapikan_nama_kolom(test_20_smote)

train_70_smote <- rapikan_nama_kolom(train_70_smote)
test_30_smote  <- rapikan_nama_kolom(test_30_smote)

train_90_smote$default <- factor(train_90_smote$default, levels = c("no", "yes"))
test_10_smote$default  <- factor(test_10_smote$default, levels = c("no", "yes"))

train_80_smote$default <- factor(train_80_smote$default, levels = c("no", "yes"))
test_20_smote$default  <- factor(test_20_smote$default, levels = c("no", "yes"))

train_70_smote$default <- factor(train_70_smote$default, levels = c("no", "yes"))
test_30_smote$default  <- factor(test_30_smote$default, levels = c("no", "yes"))

distribusi_data <- bind_rows(
  lapply(list(
    "90:10 Sebelum SMOTE" = train_90,
    "80:20 Sebelum SMOTE" = train_80,
    "70:30 Sebelum SMOTE" = train_70,
    "90:10 Setelah SMOTE" = train_90_smote,
    "80:20 Setelah SMOTE" = train_80_smote,
    "70:30 Setelah SMOTE" = train_70_smote
  ), function(x) {
    as.data.frame(table(x$default))
  }),
  .id = "Keterangan"
)

colnames(distribusi_data)[2:3] <- c("Default", "Frekuensi")

kable(distribusi_data, caption = "Perbandingan Distribusi Kelas Sebelum dan Setelah SMOTE")

Perbandingan Distribusi Kelas Sebelum dan Setelah SMOTE
Keterangan	Default	Frekuensi
90:10 Sebelum SMOTE	no	626
90:10 Sebelum SMOTE	yes	274
80:20 Sebelum SMOTE	no	571
80:20 Sebelum SMOTE	yes	229
70:30 Sebelum SMOTE	no	493
70:30 Sebelum SMOTE	yes	207
90:10 Setelah SMOTE	no	626
90:10 Setelah SMOTE	yes	548
80:20 Setelah SMOTE	no	571
80:20 Setelah SMOTE	yes	458
70:30 Setelah SMOTE	no	493
70:30 Setelah SMOTE	yes	414

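The table shows how the class distribution of the training data changes after SMOTE. With dup_size = 1, the number of yes observations doubles in every scenario (274 to 548, 229 to 458, and 207 to 414) through the addition of synthetic data, while the number of no observations is unchanged. The classes are therefore more balanced, although yes remains somewhat smaller than no.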
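The counts in the table can be reproduced with simple arithmetic. The sketch below assumes, consistent with the table, that dup_size = 1 adds one synthetic observation per original yes observation; the data frame train_counts is re-entered by hand from the reported frequencies.

```r
# Training-set class counts before SMOTE, copied from the table.
train_counts <- data.frame(
  Split = c("90:10", "80:20", "70:30"),
  no    = c(626, 571, 493),
  yes   = c(274, 229, 207)
)

dup_size <- 1  # assumption: one synthetic observation per original minority observation

# Minority count after SMOTE = original + dup_size * original.
train_counts$yes_smote <- train_counts$yes * (1 + dup_size)

# Share of the yes class before and after resampling.
train_counts$share_yes_before <- train_counts$yes / (train_counts$no + train_counts$yes)
train_counts$share_yes_after  <- train_counts$yes_smote /
  (train_counts$no + train_counts$yes_smote)

train_counts
```

The share of yes rises from roughly 30% to roughly 45% in each scenario, which is closer to balance without reaching an exact 50:50 split.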

15 Models After SMOTE, Before Tuning

After the training data have been balanced with SMOTE, the models are run again to see whether performance changes relative to the models before SMOTE.

hasil_smote_90 <- jalankan_model(train_90_smote, test_10_smote, "90:10", "Setelah SMOTE")

## 
## ====================================================
## Split   : 90:10 
## Kondisi : Setelah SMOTE 
## Metode  : Decision Tree 
## ====================================================
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction no yes
##        no  58  11
##        yes 16  15
##                                          
##                Accuracy : 0.73           
##                  95% CI : (0.632, 0.8139)
##     No Information Rate : 0.74           
##     P-Value [Acc > NIR] : 0.6398         
##                                          
##                   Kappa : 0.3395         
##                                          
##  Mcnemar's Test P-Value : 0.4414         
##                                          
##             Sensitivity : 0.5769         
##             Specificity : 0.7838         
##          Pos Pred Value : 0.4839         
##          Neg Pred Value : 0.8406         
##              Prevalence : 0.2600         
##          Detection Rate : 0.1500         
##    Detection Prevalence : 0.3100         
##       Balanced Accuracy : 0.6804         
##                                          
##        'Positive' Class : yes            
##                                          
## 
## ====================================================
## Split   : 90:10 
## Kondisi : Setelah SMOTE 
## Metode  : Random Forest 
## ====================================================
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction no yes
##        no  64  15
##        yes 10  11
##                                           
##                Accuracy : 0.75            
##                  95% CI : (0.6534, 0.8312)
##     No Information Rate : 0.74            
##     P-Value [Acc > NIR] : 0.4619          
##                                           
##                   Kappa : 0.3071          
##                                           
##  Mcnemar's Test P-Value : 0.4237          
##                                           
##             Sensitivity : 0.4231          
##             Specificity : 0.8649          
##          Pos Pred Value : 0.5238          
##          Neg Pred Value : 0.8101          
##              Prevalence : 0.2600          
##          Detection Rate : 0.1100          
##    Detection Prevalence : 0.2100          
##       Balanced Accuracy : 0.6440          
##                                           
##        'Positive' Class : yes             
##                                           
## 
## ====================================================
## Split   : 90:10 
## Kondisi : Setelah SMOTE 
## Metode  : XGBoost 
## ====================================================
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction no yes
##        no  61  12
##        yes 13  14
##                                           
##                Accuracy : 0.75            
##                  95% CI : (0.6534, 0.8312)
##     No Information Rate : 0.74            
##     P-Value [Acc > NIR] : 0.4619          
##                                           
##                   Kappa : 0.3583          
##                                           
##  Mcnemar's Test P-Value : 1.0000          
##                                           
##             Sensitivity : 0.5385          
##             Specificity : 0.8243          
##          Pos Pred Value : 0.5185          
##          Neg Pred Value : 0.8356          
##              Prevalence : 0.2600          
##          Detection Rate : 0.1400          
##    Detection Prevalence : 0.2700          
##       Balanced Accuracy : 0.6814          
##                                           
##        'Positive' Class : yes             
##

hasil_smote_80 <- jalankan_model(train_80_smote, test_20_smote, "80:20", "Setelah SMOTE")

## 
## ====================================================
## Split   : 80:20 
## Kondisi : Setelah SMOTE 
## Metode  : Decision Tree 
## ====================================================
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction no yes
##        no  97  36
##        yes 32  35
##                                           
##                Accuracy : 0.66            
##                  95% CI : (0.5898, 0.7253)
##     No Information Rate : 0.645           
##     P-Value [Acc > NIR] : 0.3583          
##                                           
##                   Kappa : 0.248           
##                                           
##  Mcnemar's Test P-Value : 0.7160          
##                                           
##             Sensitivity : 0.4930          
##             Specificity : 0.7519          
##          Pos Pred Value : 0.5224          
##          Neg Pred Value : 0.7293          
##              Prevalence : 0.3550          
##          Detection Rate : 0.1750          
##    Detection Prevalence : 0.3350          
##       Balanced Accuracy : 0.6224          
##                                           
##        'Positive' Class : yes             
##                                           
## 
## ====================================================
## Split   : 80:20 
## Kondisi : Setelah SMOTE 
## Metode  : Random Forest 
## ====================================================
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  113  43
##        yes  16  28
##                                           
##                Accuracy : 0.705           
##                  95% CI : (0.6366, 0.7672)
##     No Information Rate : 0.645           
##     P-Value [Acc > NIR] : 0.043184        
##                                           
##                   Kappa : 0.2956          
##                                           
##  Mcnemar's Test P-Value : 0.000712        
##                                           
##             Sensitivity : 0.3944          
##             Specificity : 0.8760          
##          Pos Pred Value : 0.6364          
##          Neg Pred Value : 0.7244          
##              Prevalence : 0.3550          
##          Detection Rate : 0.1400          
##    Detection Prevalence : 0.2200          
##       Balanced Accuracy : 0.6352          
##                                           
##        'Positive' Class : yes             
##                                           
## 
## ====================================================
## Split   : 80:20 
## Kondisi : Setelah SMOTE 
## Metode  : XGBoost 
## ====================================================
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  111  36
##        yes  18  35
##                                           
##                Accuracy : 0.73            
##                  95% CI : (0.6628, 0.7902)
##     No Information Rate : 0.645           
##     P-Value [Acc > NIR] : 0.006549        
##                                           
##                   Kappa : 0.3748          
##                                           
##  Mcnemar's Test P-Value : 0.020700        
##                                           
##             Sensitivity : 0.4930          
##             Specificity : 0.8605          
##          Pos Pred Value : 0.6604          
##          Neg Pred Value : 0.7551          
##              Prevalence : 0.3550          
##          Detection Rate : 0.1750          
##    Detection Prevalence : 0.2650          
##       Balanced Accuracy : 0.6767          
##                                           
##        'Positive' Class : yes             
##

hasil_smote_70 <- jalankan_model(train_70_smote, test_30_smote, "70:30", "Setelah SMOTE")

## 
## ====================================================
## Split   : 70:30 
## Kondisi : Setelah SMOTE 
## Metode  : Decision Tree 
## ====================================================
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  176  49
##        yes  31  44
##                                           
##                Accuracy : 0.7333          
##                  95% CI : (0.6795, 0.7825)
##     No Information Rate : 0.69            
##     P-Value [Acc > NIR] : 0.05786         
##                                           
##                   Kappa : 0.3416          
##                                           
##  Mcnemar's Test P-Value : 0.05735         
##                                           
##             Sensitivity : 0.4731          
##             Specificity : 0.8502          
##          Pos Pred Value : 0.5867          
##          Neg Pred Value : 0.7822          
##              Prevalence : 0.3100          
##          Detection Rate : 0.1467          
##    Detection Prevalence : 0.2500          
##       Balanced Accuracy : 0.6617          
##                                           
##        'Positive' Class : yes             
##                                           
## 
## ====================================================
## Split   : 70:30 
## Kondisi : Setelah SMOTE 
## Metode  : Random Forest 
## ====================================================
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  195  55
##        yes  12  38
##                                           
##                Accuracy : 0.7767          
##                  95% CI : (0.7253, 0.8225)
##     No Information Rate : 0.69            
##     P-Value [Acc > NIR] : 0.0005368       
##                                           
##                   Kappa : 0.4018          
##                                           
##  Mcnemar's Test P-Value : 0.000000288     
##                                           
##             Sensitivity : 0.4086          
##             Specificity : 0.9420          
##          Pos Pred Value : 0.7600          
##          Neg Pred Value : 0.7800          
##              Prevalence : 0.3100          
##          Detection Rate : 0.1267          
##    Detection Prevalence : 0.1667          
##       Balanced Accuracy : 0.6753          
##                                           
##        'Positive' Class : yes             
##                                           
## 
## ====================================================
## Split   : 70:30 
## Kondisi : Setelah SMOTE 
## Metode  : XGBoost 
## ====================================================
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  187  47
##        yes  20  46
##                                           
##                Accuracy : 0.7767          
##                  95% CI : (0.7253, 0.8225)
##     No Information Rate : 0.69            
##     P-Value [Acc > NIR] : 0.0005368       
##                                           
##                   Kappa : 0.4326          
##                                           
##  Mcnemar's Test P-Value : 0.0014911       
##                                           
##             Sensitivity : 0.4946          
##             Specificity : 0.9034          
##          Pos Pred Value : 0.6970          
##          Neg Pred Value : 0.7991          
##              Prevalence : 0.3100          
##          Detection Rate : 0.1533          
##    Detection Prevalence : 0.2200          
##       Balanced Accuracy : 0.6990          
##                                           
##        'Positive' Class : yes             
##

hasil_smote <- bind_rows(
  hasil_smote_90,
  hasil_smote_80,
  hasil_smote_70
)

kable(hasil_smote, digits = 4, row.names = FALSE, caption = "Hasil Evaluasi Model Setelah SMOTE")

Hasil Evaluasi Model Setelah SMOTE
Split	Kondisi	Model	Accuracy	Precision	Recall	Specificity	F1_Score
90:10	Setelah SMOTE	Decision Tree	0.7300	0.4839	0.5769	0.7838	0.5263
90:10	Setelah SMOTE	Random Forest	0.7500	0.5238	0.4231	0.8649	0.4681
90:10	Setelah SMOTE	XGBoost	0.7500	0.5185	0.5385	0.8243	0.5283
80:20	Setelah SMOTE	Decision Tree	0.6600	0.5224	0.4930	0.7519	0.5072
80:20	Setelah SMOTE	Random Forest	0.7050	0.6364	0.3944	0.8760	0.4870
80:20	Setelah SMOTE	XGBoost	0.7300	0.6604	0.4930	0.8605	0.5645
70:30	Setelah SMOTE	Decision Tree	0.7333	0.5867	0.4731	0.8502	0.5238
70:30	Setelah SMOTE	Random Forest	0.7767	0.7600	0.4086	0.9420	0.5315
70:30	Setelah SMOTE	XGBoost	0.7767	0.6970	0.4946	0.9034	0.5786

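The post-SMOTE results show how class balancing affects performance on the positive class. Compared with the pre-SMOTE models (listed in the final comparison table in Section 19), recall rises in seven of the nine model and split combinations, stays at 0.4930 for XGBoost at 80:20, and falls only for Random Forest at 80:20 (0.4085 to 0.3944). F1-score rises in six of the nine combinations. The gain comes with lower precision in most cases, so SMOTE helps the models recognise the yes class at the cost of more false alarms.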
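One way to make the before and after comparison explicit is to put both sets of metrics side by side and take differences. The sketch below is illustrative only: the recall and F1 values are copied by hand from the evaluation tables in this report, and the data frame perbandingan and its column names are assumptions introduced here.

```r
# Recall and F1 for the "yes" class, copied from the evaluation tables in this report.
# Row order: Decision Tree, Random Forest, XGBoost within each split.
perbandingan <- data.frame(
  Split = rep(c("90:10", "80:20", "70:30"), each = 3),
  Model = rep(c("Decision Tree", "Random Forest", "XGBoost"), times = 3),
  Recall_sebelum = c(0.5000, 0.3846, 0.4615, 0.3239, 0.4085, 0.4930, 0.3441, 0.3441, 0.4839),
  Recall_setelah = c(0.5769, 0.4231, 0.5385, 0.4930, 0.3944, 0.4930, 0.4731, 0.4086, 0.4946),
  F1_sebelum = c(0.6047, 0.4878, 0.4800, 0.4107, 0.5043, 0.5344, 0.4638, 0.4706, 0.5732),
  F1_setelah = c(0.5263, 0.4681, 0.5283, 0.5072, 0.4870, 0.5645, 0.5238, 0.5315, 0.5786)
)

# Positive differences mean the metric improved after SMOTE.
perbandingan$Delta_Recall <- perbandingan$Recall_setelah - perbandingan$Recall_sebelum
perbandingan$Delta_F1 <- perbandingan$F1_setelah - perbandingan$F1_sebelum

perbandingan[, c("Split", "Model", "Delta_Recall", "Delta_F1")]

# Number of model and split combinations where each metric improved.
sum(perbandingan$Delta_Recall > 0)
sum(perbandingan$Delta_F1 > 0)
```

In the actual analysis the same differences could be derived by joining hasil_awal and hasil_smote on Split and Model instead of typing the values.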

16 Random Search Cross Validation

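The tuning stage uses random search: a set of hyperparameter combinations is sampled at random and the combination with the best performance is kept. For Decision Tree and Random Forest, tuning is done with train() from the caret package, using 5-fold cross-validation on the training data. For XGBoost, 20 parameter combinations are generated manually inside the tuning function. Note that in that function each XGBoost candidate is scored on the test data rather than by cross-validation, so the test metrics reported for the tuned XGBoost model are likely to be optimistic.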

set.seed(123)

ctrl_random <- trainControl(
  method = "cv",
  number = 5,
  classProbs = TRUE,
  search = "random"
)

17 Random Search Tuning Function

The following function runs random search for one train/test split. It returns the evaluation of the tuned models together with the fitted model objects, which can be reused for follow-up analysis such as feature importance.

tuning_model_random <- function(train_data, test_data, split_name) {
  hasil_tuning <- data.frame()
  
  test_x <- test_data %>% select(-default)
  test_y <- test_data$default
  
  # Decision Tree Random Search
  set.seed(123)
  tune_tree <- train(
    default ~ .,
    data = train_data,
    method = "rpart",
    trControl = ctrl_random,
    tuneLength = 20
  )
  
  pred_tree <- predict(tune_tree, test_x)
  
  hasil_tuning <- bind_rows(
    hasil_tuning,
    evaluasi_model(test_y, pred_tree, "Decision Tree Tuning", split_name, "SMOTE + Random Search")
  )
  
  cat("\nBest Tune Decision Tree:\n")
  print(tune_tree$bestTune)
  
  # Random Forest Random Search
  set.seed(123)
  tune_rf <- train(
    default ~ .,
    data = train_data,
    method = "rf",
    ntree = 500,
    trControl = ctrl_random,
    tuneLength = 20,
    importance = TRUE
  )
  
  pred_rf <- predict(tune_rf, test_x)
  
  hasil_tuning <- bind_rows(
    hasil_tuning,
    evaluasi_model(test_y, pred_rf, "Random Forest Tuning", split_name, "SMOTE + Random Search")
  )
  
  cat("\nBest Tune Random Forest:\n")
  print(tune_rf$bestTune)
  
  # XGBoost Random Search
  set.seed(123)
  
  train_x <- train_data %>% select(-default)
  
  # One-hot encode the predictors; test_x was already created at the top of the function.
  train_matrix <- model.matrix(~ . -1, data = train_x)
  test_matrix  <- model.matrix(~ . -1, data = test_x)
  
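  kolom_train <- colnames(train_matrix)
  
  # Add any dummy column that exists in training but not in testing, filled with 0.
  for (k in kolom_train) {
    if (!(k %in% colnames(test_matrix))) {
      test_matrix <- cbind(test_matrix, 0)
      colnames(test_matrix)[ncol(test_matrix)] <- k
    }
  }
  
  # Keep only the training columns, in the training order.
  test_matrix <- test_matrix[, kolom_train, drop = FALSE]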
  train_label <- ifelse(train_data$default == "yes", 1, 0)
  
  dtrain <- xgb.DMatrix(data = train_matrix, label = train_label)
  dtest <- xgb.DMatrix(data = test_matrix)
  
  param_grid <- data.frame(
    max_depth = sample(2:8, 20, replace = TRUE),
    eta = runif(20, 0.01, 0.3),
    gamma = runif(20, 0, 5),
    colsample_bytree = runif(20, 0.5, 1),
    min_child_weight = sample(1:10, 20, replace = TRUE),
    subsample = runif(20, 0.5, 1),
    nrounds = sample(seq(50, 300, by = 50), 20, replace = TRUE)
  )
  
  hasil_xgb_random <- data.frame()
  
  # Fit and score one model per sampled parameter combination.
  for (i in seq_len(nrow(param_grid))) {
    param <- list(
      objective = "binary:logistic",
      eval_metric = "logloss",
      max_depth = param_grid$max_depth[i],
      eta = param_grid$eta[i],
      gamma = param_grid$gamma[i],
      colsample_bytree = param_grid$colsample_bytree[i],
      min_child_weight = param_grid$min_child_weight[i],
      subsample = param_grid$subsample[i]
    )
    
    set.seed(123 + i)
    model_temp <- xgb.train(
      params = param,
      data = dtrain,
      nrounds = param_grid$nrounds[i],
      verbose = 0
    )
    
    pred_prob_temp <- predict(model_temp, dtest)
    pred_temp <- factor(
      ifelse(pred_prob_temp >= 0.5, "yes", "no"),
      levels = c("no", "yes")
    )
    
    cm_temp <- confusionMatrix(pred_temp, test_y, positive = "yes")
    
    hasil_xgb_random <- rbind(
      hasil_xgb_random,
      data.frame(
        Iterasi = i,
        Accuracy = as.numeric(cm_temp$overall["Accuracy"]),
        Kappa = as.numeric(cm_temp$overall["Kappa"]),
        F1_Score = as.numeric(ifelse(is.na(cm_temp$byClass["F1"]), 0, cm_temp$byClass["F1"]))
      )
    )
  }
  
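  # NOTE: candidates are ranked by accuracy on the test set (cm_temp above), not by
  # cross-validation, so the test metrics reported for the tuned XGBoost are optimistic.
  best_index <- which.max(hasil_xgb_random$Accuracy)
  best_param <- param_grid[best_index, ]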
  
  cat("\nBest Tune XGBoost:\n")
  print(best_param)
  
  best_param_list <- list(
    objective = "binary:logistic",
    eval_metric = "logloss",
    max_depth = best_param$max_depth,
    eta = best_param$eta,
    gamma = best_param$gamma,
    colsample_bytree = best_param$colsample_bytree,
    min_child_weight = best_param$min_child_weight,
    subsample = best_param$subsample
  )
  
  set.seed(123)
  model_xgb_final <- xgb.train(
    params = best_param_list,
    data = dtrain,
    nrounds = best_param$nrounds,
    verbose = 0
  )
  
  pred_prob_xgb <- predict(model_xgb_final, dtest)
  pred_xgb <- factor(
    ifelse(pred_prob_xgb >= 0.5, "yes", "no"),
    levels = c("no", "yes")
  )
  
  hasil_tuning <- bind_rows(
    hasil_tuning,
    evaluasi_model(test_y, pred_xgb, "XGBoost Tuning", split_name, "SMOTE + Random Search")
  )
  
  model_list <- list(
    hasil = hasil_tuning,
    tree = tune_tree,
    rf = tune_rf,
    xgb = model_xgb_final,
    xgb_param_grid = param_grid,
    xgb_hasil_random = hasil_xgb_random,
    xgb_best_param = best_param
  )
  
  return(model_list)
}
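Because the function above picks the XGBoost parameters by test-set accuracy, a leakage-free alternative is to rank the candidates by cross-validated loss on the training data only. The sketch below swaps in that different technique and is not the method used in this report. It assumes the xgboost package provides xgb.cv() with an evaluation_log element containing test_logloss_mean, which holds for the versions I am aware of. The simulated data and the names dtrain_contoh, grid_contoh, and best_contoh are hypothetical.

```r
library(xgboost)

set.seed(123)

# Simulated stand-in for the SMOTE training matrix (hypothetical data).
n <- 300
x <- matrix(rnorm(n * 5), ncol = 5, dimnames = list(NULL, paste0("x", 1:5)))
y <- as.numeric(x[, 1] + 0.5 * x[, 2] + rnorm(n) > 0)
dtrain_contoh <- xgb.DMatrix(data = x, label = y)

# A small random grid, drawn in the same way as param_grid in the function above.
n_kandidat <- 5
grid_contoh <- data.frame(
  max_depth = sample(2:8, n_kandidat, replace = TRUE),
  eta = runif(n_kandidat, 0.01, 0.3),
  nrounds = sample(seq(50, 150, by = 50), n_kandidat, replace = TRUE)
)

# Score each candidate by 5-fold cross-validated log loss on the training data only.
grid_contoh$cv_logloss <- vapply(seq_len(nrow(grid_contoh)), function(i) {
  cv <- xgb.cv(
    params = list(
      objective = "binary:logistic",
      eval_metric = "logloss",
      max_depth = grid_contoh$max_depth[i],
      eta = grid_contoh$eta[i]
    ),
    data = dtrain_contoh,
    nrounds = grid_contoh$nrounds[i],
    nfold = 5,
    verbose = FALSE
  )
  # Mean held-out log loss after the final boosting round.
  tail(cv$evaluation_log$test_logloss_mean, 1)
}, numeric(1))

# Lower log loss is better; the test set is never touched during selection.
best_contoh <- grid_contoh[which.min(grid_contoh$cv_logloss), ]
best_contoh
```

With this design the test set is used once, for the final evaluation of the refitted model. One remaining caveat: if the cross-validation runs on data that were already oversampled with SMOTE, synthetic points derived from one fold can appear in another, so the cross-validated loss may still be somewhat optimistic.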

18 Random Search Tuning Results

In this section the tuning function is run for each data split. The results of the three splits are then combined into a single evaluation table.

tuning_90 <- tuning_model_random(train_90_smote, test_10_smote, "90:10")

## 
## ====================================================
## Split   : 90:10 
## Kondisi : SMOTE + Random Search 
## Metode  : Decision Tree Tuning 
## ====================================================
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction no yes
##        no  63  12
##        yes 11  14
##                                           
##                Accuracy : 0.77            
##                  95% CI : (0.6751, 0.8483)
##     No Information Rate : 0.74            
##     P-Value [Acc > NIR] : 0.2887          
##                                           
##                   Kappa : 0.3947          
##                                           
##  Mcnemar's Test P-Value : 1.0000          
##                                           
##             Sensitivity : 0.5385          
##             Specificity : 0.8514          
##          Pos Pred Value : 0.5600          
##          Neg Pred Value : 0.8400          
##              Prevalence : 0.2600          
##          Detection Rate : 0.1400          
##    Detection Prevalence : 0.2500          
##       Balanced Accuracy : 0.6949          
##                                           
##        'Positive' Class : yes             
##                                           
## 
## Best Tune Decision Tree:
##            cp
## 8 0.006386861
## 
## ====================================================
## Split   : 90:10 
## Kondisi : SMOTE + Random Search 
## Metode  : Random Forest Tuning 
## ====================================================
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction no yes
##        no  67  16
##        yes  7  10
##                                           
##                Accuracy : 0.77            
##                  95% CI : (0.6751, 0.8483)
##     No Information Rate : 0.74            
##     P-Value [Acc > NIR] : 0.28871         
##                                           
##                   Kappa : 0.3267          
##                                           
##  Mcnemar's Test P-Value : 0.09529         
##                                           
##             Sensitivity : 0.3846          
##             Specificity : 0.9054          
##          Pos Pred Value : 0.5882          
##          Neg Pred Value : 0.8072          
##              Prevalence : 0.2600          
##          Detection Rate : 0.1000          
##    Detection Prevalence : 0.1700          
##       Balanced Accuracy : 0.6450          
##                                           
##        'Positive' Class : yes             
##                                           
## 
## Best Tune Random Forest:
##   mtry
## 1    3
## 
## Best Tune XGBoost:
##   max_depth       eta    gamma colsample_bytree min_child_weight subsample
## 7         3 0.1823012 2.329812        0.9061948                6 0.8602981
##   nrounds
## 7     200
## 
## ====================================================
## Split   : 90:10 
## Kondisi : SMOTE + Random Search 
## Metode  : XGBoost Tuning 
## ====================================================
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction no yes
##        no  62  11
##        yes 12  15
##                                           
##                Accuracy : 0.77            
##                  95% CI : (0.6751, 0.8483)
##     No Information Rate : 0.74            
##     P-Value [Acc > NIR] : 0.2887          
##                                           
##                   Kappa : 0.4097          
##                                           
##  Mcnemar's Test P-Value : 1.0000          
##                                           
##             Sensitivity : 0.5769          
##             Specificity : 0.8378          
##          Pos Pred Value : 0.5556          
##          Neg Pred Value : 0.8493          
##              Prevalence : 0.2600          
##          Detection Rate : 0.1500          
##    Detection Prevalence : 0.2700          
##       Balanced Accuracy : 0.7074          
##                                           
##        'Positive' Class : yes             
##

tuning_80 <- tuning_model_random(train_80_smote, test_20_smote, "80:20")

## 
## ====================================================
## Split   : 80:20 
## Kondisi : SMOTE + Random Search 
## Metode  : Decision Tree Tuning 
## ====================================================
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction no yes
##        no  97  36
##        yes 32  35
##                                           
##                Accuracy : 0.66            
##                  95% CI : (0.5898, 0.7253)
##     No Information Rate : 0.645           
##     P-Value [Acc > NIR] : 0.3583          
##                                           
##                   Kappa : 0.248           
##                                           
##  Mcnemar's Test P-Value : 0.7160          
##                                           
##             Sensitivity : 0.4930          
##             Specificity : 0.7519          
##          Pos Pred Value : 0.5224          
##          Neg Pred Value : 0.7293          
##              Prevalence : 0.3550          
##          Detection Rate : 0.1750          
##    Detection Prevalence : 0.3350          
##       Balanced Accuracy : 0.6224          
##                                           
##        'Positive' Class : yes             
##                                           
## 
## Best Tune Decision Tree:
##            cp
## 4 0.006550218
## 
## ====================================================
## Split   : 80:20 
## Kondisi : SMOTE + Random Search 
## Metode  : Random Forest Tuning 
## ====================================================
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  119  50
##        yes  10  21
##                                           
##                Accuracy : 0.7             
##                  95% CI : (0.6314, 0.7626)
##     No Information Rate : 0.645           
##     P-Value [Acc > NIR] : 0.05902         
##                                           
##                   Kappa : 0.2499          
##                                           
##  Mcnemar's Test P-Value : 0.0000004782    
##                                           
##             Sensitivity : 0.2958          
##             Specificity : 0.9225          
##          Pos Pred Value : 0.6774          
##          Neg Pred Value : 0.7041          
##              Prevalence : 0.3550          
##          Detection Rate : 0.1050          
##    Detection Prevalence : 0.1550          
##       Balanced Accuracy : 0.6091          
##                                           
##        'Positive' Class : yes             
##                                           
## 
## Best Tune Random Forest:
##   mtry
## 1    3
## 
## Best Tune XGBoost:
##    max_depth      eta     gamma colsample_bytree min_child_weight subsample
## 10         6 0.289277 0.2291558        0.8772376                2 0.9770456
##    nrounds
## 10     300
## 
## ====================================================
## Split   : 80:20 
## Kondisi : SMOTE + Random Search 
## Metode  : XGBoost Tuning 
## ====================================================
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  108  36
##        yes  21  35
##                                           
##                Accuracy : 0.715           
##                  95% CI : (0.6471, 0.7764)
##     No Information Rate : 0.645           
##     P-Value [Acc > NIR] : 0.02170         
##                                           
##                   Kappa : 0.3466          
##                                           
##  Mcnemar's Test P-Value : 0.06369         
##                                           
##             Sensitivity : 0.4930          
##             Specificity : 0.8372          
##          Pos Pred Value : 0.6250          
##          Neg Pred Value : 0.7500          
##              Prevalence : 0.3550          
##          Detection Rate : 0.1750          
##    Detection Prevalence : 0.2800          
##       Balanced Accuracy : 0.6651          
##                                           
##        'Positive' Class : yes             
##

tuning_70 <- tuning_model_random(train_70_smote, test_30_smote, "70:30")

## 
## ====================================================
## Split   : 70:30 
## Kondisi : SMOTE + Random Search 
## Metode  : Decision Tree Tuning 
## ====================================================
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  176  49
##        yes  31  44
##                                           
##                Accuracy : 0.7333          
##                  95% CI : (0.6795, 0.7825)
##     No Information Rate : 0.69            
##     P-Value [Acc > NIR] : 0.05786         
##                                           
##                   Kappa : 0.3416          
##                                           
##  Mcnemar's Test P-Value : 0.05735         
##                                           
##             Sensitivity : 0.4731          
##             Specificity : 0.8502          
##          Pos Pred Value : 0.5867          
##          Neg Pred Value : 0.7822          
##              Prevalence : 0.3100          
##          Detection Rate : 0.1467          
##    Detection Prevalence : 0.2500          
##       Balanced Accuracy : 0.6617          
##                                           
##        'Positive' Class : yes             
##                                           
## 
## Best Tune Decision Tree:
##            cp
## 5 0.005636071
## 
## ====================================================
## Split   : 70:30 
## Kondisi : SMOTE + Random Search 
## Metode  : Random Forest Tuning 
## ====================================================
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  192  55
##        yes  15  38
##                                           
##                Accuracy : 0.7667          
##                  95% CI : (0.7146, 0.8134)
##     No Information Rate : 0.69            
##     P-Value [Acc > NIR] : 0.002032        
##                                           
##                   Kappa : 0.3813          
##                                           
##  Mcnemar's Test P-Value : 0.000003141     
##                                           
##             Sensitivity : 0.4086          
##             Specificity : 0.9275          
##          Pos Pred Value : 0.7170          
##          Neg Pred Value : 0.7773          
##              Prevalence : 0.3100          
##          Detection Rate : 0.1267          
##    Detection Prevalence : 0.1767          
##       Balanced Accuracy : 0.6681          
##                                           
##        'Positive' Class : yes             
##                                           
## 
## Best Tune Random Forest:
##   mtry
## 4    8
## 
## Best Tune XGBoost:
##    max_depth       eta    gamma colsample_bytree min_child_weight subsample
## 17         6 0.0727583 3.766539        0.8063855                6 0.6847444
##    nrounds
## 17     300
## 
## ====================================================
## Split   : 70:30 
## Kondisi : SMOTE + Random Search 
## Metode  : XGBoost Tuning 
## ====================================================
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  184  49
##        yes  23  44
##                                           
##                Accuracy : 0.76            
##                  95% CI : (0.7076, 0.8072)
##     No Information Rate : 0.69            
##     P-Value [Acc > NIR] : 0.004525        
##                                           
##                   Kappa : 0.3922          
##                                           
##  Mcnemar's Test P-Value : 0.003216        
##                                           
##             Sensitivity : 0.4731          
##             Specificity : 0.8889          
##          Pos Pred Value : 0.6567          
##          Neg Pred Value : 0.7897          
##              Prevalence : 0.3100          
##          Detection Rate : 0.1467          
##    Detection Prevalence : 0.2233          
##       Balanced Accuracy : 0.6810          
##                                           
##        'Positive' Class : yes             
##

hasil_tuning_semua <- bind_rows(
  tuning_90$hasil,
  tuning_80$hasil,
  tuning_70$hasil
)

kable(hasil_tuning_semua, digits = 4, row.names = FALSE, caption = "Hasil Evaluasi Model Setelah SMOTE dan Random Search")

Hasil Evaluasi Model Setelah SMOTE dan Random Search
Split	Kondisi	Model	Accuracy	Precision	Recall	Specificity	F1_Score
90:10	SMOTE + Random Search	Decision Tree Tuning	0.7700	0.5600	0.5385	0.8514	0.5490
90:10	SMOTE + Random Search	Random Forest Tuning	0.7700	0.5882	0.3846	0.9054	0.4651
90:10	SMOTE + Random Search	XGBoost Tuning	0.7700	0.5556	0.5769	0.8378	0.5660
80:20	SMOTE + Random Search	Decision Tree Tuning	0.6600	0.5224	0.4930	0.7519	0.5072
80:20	SMOTE + Random Search	Random Forest Tuning	0.7000	0.6774	0.2958	0.9225	0.4118
80:20	SMOTE + Random Search	XGBoost Tuning	0.7150	0.6250	0.4930	0.8372	0.5512
70:30	SMOTE + Random Search	Decision Tree Tuning	0.7333	0.5867	0.4731	0.8502	0.5238
70:30	SMOTE + Random Search	Random Forest Tuning	0.7667	0.7170	0.4086	0.9275	0.5205
70:30	SMOTE + Random Search	XGBoost Tuning	0.7600	0.6567	0.4731	0.8889	0.5500

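Tuning did not improve the results uniformly. At the 90:10 split all three tuned models reach an accuracy of 0.7700, up from 0.7300 to 0.7500 before tuning. At 80:20 and 70:30 the tuned Decision Tree produces the same confusion matrix as the untuned one, the tuned Random Forest loses recall at 80:20 (0.3944 to 0.2958), and the tuned XGBoost scores slightly lower accuracy and F1-score than its untuned counterpart. Among the tuned models, XGBoost has the highest F1-score in every split (0.5660, 0.5512, and 0.5500), which makes it the strongest candidate when accuracy, recall, and F1-score are weighed together.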
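Picking the top model per split can be done programmatically instead of by reading the table. The sketch below is a minimal illustration with dplyr: the values are copied by hand from the tuning table above, and hasil_contoh and terbaik_per_split are names introduced here. Ranking by F1-score is an assumption about which metric matters most; accuracy or recall could be substituted.

```r
library(dplyr)

# Tuned-model metrics copied from the table above.
hasil_contoh <- data.frame(
  Split = rep(c("90:10", "80:20", "70:30"), each = 3),
  Model = rep(c("Decision Tree Tuning", "Random Forest Tuning", "XGBoost Tuning"), times = 3),
  Accuracy = c(0.7700, 0.7700, 0.7700, 0.6600, 0.7000, 0.7150, 0.7333, 0.7667, 0.7600),
  Recall = c(0.5385, 0.3846, 0.5769, 0.4930, 0.2958, 0.4930, 0.4731, 0.4086, 0.4731),
  F1_Score = c(0.5490, 0.4651, 0.5660, 0.5072, 0.4118, 0.5512, 0.5238, 0.5205, 0.5500)
)

# Keep the row with the highest F1-score within each split.
terbaik_per_split <- hasil_contoh %>%
  group_by(Split) %>%
  slice_max(F1_Score, n = 1, with_ties = FALSE) %>%
  ungroup()

terbaik_per_split
```

On the real objects the same pipeline can be applied directly to hasil_tuning_semua.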

19 Final Model Comparison

This section combines all evaluation results: the models before SMOTE, the models after SMOTE, and the models after SMOTE with random search. The combined table is used to choose the best model overall.

perbandingan_final <- bind_rows(
  hasil_awal,
  hasil_smote,
  hasil_tuning_semua
)

kable(perbandingan_final, digits = 4, row.names = FALSE, caption = "Final Comparison of All Models")

Final Comparison of All Models
Split	Kondisi	Model	Accuracy	Precision	Recall	Specificity	F1_Score
90:10	Sebelum SMOTE	Decision Tree	0.8300	0.7647	0.5000	0.9459	0.6047
90:10	Sebelum SMOTE	Random Forest	0.7900	0.6667	0.3846	0.9324	0.4878
90:10	Sebelum SMOTE	XGBoost	0.7400	0.5000	0.4615	0.8378	0.4800
80:20	Sebelum SMOTE	Decision Tree	0.6700	0.5610	0.3239	0.8605	0.4107
80:20	Sebelum SMOTE	Random Forest	0.7150	0.6591	0.4085	0.8837	0.5043
80:20	Sebelum SMOTE	XGBoost	0.6950	0.5833	0.4930	0.8062	0.5344
70:30	Sebelum SMOTE	Decision Tree	0.7533	0.7111	0.3441	0.9372	0.4638
70:30	Sebelum SMOTE	Random Forest	0.7600	0.7442	0.3441	0.9469	0.4706
70:30	Sebelum SMOTE	XGBoost	0.7767	0.7031	0.4839	0.9082	0.5732
90:10	Setelah SMOTE	Decision Tree	0.7300	0.4839	0.5769	0.7838	0.5263
90:10	Setelah SMOTE	Random Forest	0.7500	0.5238	0.4231	0.8649	0.4681
90:10	Setelah SMOTE	XGBoost	0.7500	0.5185	0.5385	0.8243	0.5283
80:20	Setelah SMOTE	Decision Tree	0.6600	0.5224	0.4930	0.7519	0.5072
80:20	Setelah SMOTE	Random Forest	0.7050	0.6364	0.3944	0.8760	0.4870
80:20	Setelah SMOTE	XGBoost	0.7300	0.6604	0.4930	0.8605	0.5645
70:30	Setelah SMOTE	Decision Tree	0.7333	0.5867	0.4731	0.8502	0.5238
70:30	Setelah SMOTE	Random Forest	0.7767	0.7600	0.4086	0.9420	0.5315
70:30	Setelah SMOTE	XGBoost	0.7767	0.6970	0.4946	0.9034	0.5786
90:10	SMOTE + Random Search	Decision Tree Tuning	0.7700	0.5600	0.5385	0.8514	0.5490
90:10	SMOTE + Random Search	Random Forest Tuning	0.7700	0.5882	0.3846	0.9054	0.4651
90:10	SMOTE + Random Search	XGBoost Tuning	0.7700	0.5556	0.5769	0.8378	0.5660
80:20	SMOTE + Random Search	Decision Tree Tuning	0.6600	0.5224	0.4930	0.7519	0.5072
80:20	SMOTE + Random Search	Random Forest Tuning	0.7000	0.6774	0.2958	0.9225	0.4118
80:20	SMOTE + Random Search	XGBoost Tuning	0.7150	0.6250	0.4930	0.8372	0.5512
70:30	SMOTE + Random Search	Decision Tree Tuning	0.7333	0.5867	0.4731	0.8502	0.5238
70:30	SMOTE + Random Search	Random Forest Tuning	0.7667	0.7170	0.4086	0.9275	0.5205
70:30	SMOTE + Random Search	XGBoost Tuning	0.7600	0.6567	0.4731	0.8889	0.5500

top_5_model <- perbandingan_final %>%
  arrange(desc(Accuracy), desc(F1_Score), desc(Recall)) %>%
  slice(1:5)

kable(top_5_model, digits = 4, row.names = FALSE, caption = "Top Five Models by Accuracy")

Top Five Models by Accuracy
Split	Kondisi	Model	Accuracy	Precision	Recall	Specificity	F1_Score
90:10	Sebelum SMOTE	Decision Tree	0.8300	0.7647	0.5000	0.9459	0.6047
90:10	Sebelum SMOTE	Random Forest	0.7900	0.6667	0.3846	0.9324	0.4878
70:30	Setelah SMOTE	XGBoost	0.7767	0.6970	0.4946	0.9034	0.5786
70:30	Sebelum SMOTE	XGBoost	0.7767	0.7031	0.4839	0.9082	0.5732
70:30	Setelah SMOTE	Random Forest	0.7767	0.7600	0.4086	0.9420	0.5315

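In the final comparison table, the highest accuracy belongs to the Decision Tree on the 90:10 split before SMOTE (0.8300), which also has the highest F1-score (0.6047). However, accuracy alone is not sufficient for credit classification: when the main goal is to detect the positive class, recall and F1-score must also be considered. The highest recall in the table (0.5769) is reached by the Decision Tree after SMOTE and by XGBoost Tuning, both on the 90:10 split. The 90:10 split also has the smallest test set of the three, so its results should be read together with those from 80:20 and 70:30.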
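
How much the ranking depends on the chosen metric can be seen by re-sorting a few rows. The sketch below uses base R on a four-row subset copied from the final comparison table; the object names `contoh`, `urut_f1`, and `urut_recall` are introduced here for illustration and do not appear in the analysis.

```r
# A few rows copied from the final comparison table (illustrative subset)
contoh <- data.frame(
  Split    = c("90:10", "90:10", "70:30", "90:10"),
  Kondisi  = c("Sebelum SMOTE", "Sebelum SMOTE", "Setelah SMOTE", "SMOTE + Random Search"),
  Model    = c("Decision Tree", "Random Forest", "XGBoost", "XGBoost Tuning"),
  Accuracy = c(0.8300, 0.7900, 0.7767, 0.7700),
  Recall   = c(0.5000, 0.3846, 0.4946, 0.5769),
  F1_Score = c(0.6047, 0.4878, 0.5786, 0.5660)
)

# Rank by F1-score first, then recall, then accuracy
urut_f1 <- contoh[order(-contoh$F1_Score, -contoh$Recall, -contoh$Accuracy), ]

# Rank by recall first, for a view that prioritizes catching the yes class
urut_recall <- contoh[order(-contoh$Recall, -contoh$F1_Score), ]

print(urut_f1)
print(urut_recall)
```

In this subset, Random Forest before SMOTE is second by accuracy but last by both F1-score and recall, while XGBoost Tuning is last by accuracy but first by recall.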

20 XGBoost Feature Importance

Feature importance identifies the variables that contribute most to the XGBoost model. The code below takes the tuned XGBoost model (SMOTE + Random Search) from the split with the highest accuracy, which per the tuning results is 90:10 (accuracy 0.7700), and computes the importance of each variable. Because XGBoost uses dummy variables for categorical predictors, the dummy names are grouped back to the names of the original variables before plotting.

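# Keep only the tuned XGBoost rows and sort them from best to worst
hasil_xgb_tuning <- hasil_tuning_semua %>%
  filter(Model == "XGBoost Tuning") %>%
  arrange(desc(Accuracy), desc(F1_Score), desc(Recall))

# as.character() guards against Split being stored as a factor,
# because switch() below matches on character strings
split_xgb_terbaik <- as.character(hasil_xgb_tuning$Split[1])

# Pick the fitted XGBoost model that belongs to the best split
model_xgb_terbaik <- switch(
  split_xgb_terbaik,
  "90:10" = tuning_90$xgb,
  "80:20" = tuning_80$xgb,
  "70:30" = tuning_70$xgb
)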

importance_xgb <- xgb.importance(model = model_xgb_terbaik)

kable(head(importance_xgb, 10), digits = 4, caption = "Top Ten XGBoost Feature Importances")

Top Ten XGBoost Feature Importances
Feature	Gain	Cover	Frequency
checking_balanceunknown	0.2695	0.0652	0.0357
checking_balance..0.DM	0.0898	0.0540	0.0516
months_loan_duration	0.0771	0.1134	0.0913
amount	0.0704	0.1163	0.1310
purposefurniture.appliances	0.0662	0.0530	0.0397
other_creditnone	0.0399	0.0423	0.0556
housingown	0.0387	0.0280	0.0238
percent_of_income	0.0361	0.0324	0.0317
age	0.0295	0.0465	0.0754
savings_balanceunknown	0.0294	0.0382	0.0278

importance_df <- importance_xgb %>%
  mutate(
    Variabel = case_when(
      grepl("^checking_balance", Feature) ~ "checking_balance",
      grepl("^months_loan_duration", Feature) ~ "months_loan_duration",
      grepl("^credit_history", Feature) ~ "credit_history",
      grepl("^purpose", Feature) ~ "purpose",
      grepl("^amount", Feature) ~ "amount",
      grepl("^savings_balance", Feature) ~ "savings_balance",
      grepl("^employment_duration", Feature) ~ "employment_duration",
      grepl("^percent_of_income", Feature) ~ "percent_of_income",
      grepl("^years_at_residence", Feature) ~ "years_at_residence",
      grepl("^age", Feature) ~ "age",
      grepl("^other_credit", Feature) ~ "other_credit",
      grepl("^housing", Feature) ~ "housing",
      grepl("^existing_loans_count", Feature) ~ "existing_loans_count",
      grepl("^job", Feature) ~ "job",
      grepl("^dependents", Feature) ~ "dependents",
      grepl("^phone", Feature) ~ "phone",
      TRUE ~ Feature
    )
  ) %>%
  group_by(Variabel) %>%
  summarise(Gain = sum(Gain), .groups = "drop") %>%
  arrange(desc(Gain)) %>%
  slice(1:10)

ggplot(importance_df, aes(x = reorder(Variabel, Gain), y = Gain)) +
  geom_col(fill = "red") +
  coord_flip() +
  labs(
    title = "Feature Importance XGBoost",
    x = NULL,
    y = "Gain"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 12, face = "bold", hjust = 0.5),
    axis.text.y = element_text(size = 12, color = "black"),
    axis.text.x = element_text(size = 11, color = "black")
  )

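The feature importance chart shows which variables matter most in building the XGBoost model. Gain is each feature's share of the total improvement obtained from the splits that use it, so a higher Gain means a larger contribution to the model's ability to classify. In the top-ten table, the two checking_balance dummies alone account for about 0.36 of total Gain (0.2695 + 0.0898), followed by months_loan_duration (0.0771) and amount (0.0704).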
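
The grouping of dummy names back to their source variables can also be done by prefix matching against a list of original column names instead of one pattern per variable. This is a minimal base R sketch under that assumption, not the method used above. The helper `petakan_variabel` and the vector `variabel_asal` are hypothetical names, and the Gain values are the top-ten rows copied from the table.

```r
# Original column names (only those appearing in the top ten, for illustration)
variabel_asal <- c("checking_balance", "months_loan_duration", "amount",
                   "purpose", "other_credit", "housing",
                   "percent_of_income", "age", "savings_balance")

# Top-ten features and Gain values copied from the table above
fitur <- c("checking_balanceunknown", "checking_balance..0.DM",
           "months_loan_duration", "amount", "purposefurniture.appliances",
           "other_creditnone", "housingown", "percent_of_income",
           "age", "savings_balanceunknown")
gain <- c(0.2695, 0.0898, 0.0771, 0.0704, 0.0662,
          0.0399, 0.0387, 0.0361, 0.0295, 0.0294)

# Hypothetical helper: return the longest original name that is a prefix
# of the feature name, or the feature name itself when nothing matches
petakan_variabel <- function(f, kandidat) {
  cocok <- kandidat[startsWith(f, kandidat)]
  if (length(cocok) == 0) return(f)
  cocok[which.max(nchar(cocok))]
}

variabel <- vapply(fitur, petakan_variabel, character(1),
                   kandidat = variabel_asal, USE.NAMES = FALSE)

# Sum Gain per original variable and sort in decreasing order
gain_per_variabel <- sort(tapply(gain, variabel, sum), decreasing = TRUE)
print(round(gain_per_variabel, 4))
```

Like the pattern-based grouping, prefix matching assumes that no original column name is itself the start of an unrelated feature name; a short name such as age would capture any feature that begins with the same letters.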

21 Conclusion

Based on all stages of the analysis, classification models were built with three methods: Decision Tree, Random Forest, and XGBoost. They were evaluated on three train/test splits: 90:10, 80:20, and 70:30. Class imbalance was handled by applying SMOTE to the training data, and the models were then tuned with random search.

The best model can be read from the final comparison table. By accuracy, it is the Decision Tree on the 90:10 split before SMOTE (accuracy 0.8300, F1-score 0.6047). However, when the focus is on detecting the yes class, recall and F1-score also need to be considered, so that the chosen model performs well overall and also recognizes the positive class. No model in the table exceeds a recall of 0.5769, which means that more than 40% of actual defaults in the test data are missed even in the best case.

Credit Default Classification Analysis Using Decision Tree, Random Forest, and XGBoost

Dandi Haryadi, Danielman Saragih, Amalia Safitri, Devira Azira Ramadhani, Ghea Ananta Ramadani

2026-05-28