1. Latar Belakang

Penyakit Lardiovaskular (Cardiovascular diseases) adalah penyebab kematian nomor 1 secara global, mengambil sekitar 17,9 juta jiwa setiap tahun, yang menyumbang 31% dari semua kematian di seluruh dunia. Gagal jantung adalah kejadian umum yang disebabkan oleh Penyakit Lardiovaskular dan kumpulan data ini berisi 12 fitur yang dapat digunakan untuk memprediksi kematian akibat gagal jantung.

Sebagian besar penyakit kardiovaskular dapat dicegah dengan mengatasi faktor risiko perilaku seperti penggunaan tembakau, diet tidak sehat dan obesitas, kurangnya aktivitas fisik, dan penggunaan alkohol yang berbahaya menggunakan strategi di seluruh populasi.Orang dengan penyakit kardiovaskular atau yang berada pada risiko kardiovaskular tinggi (karena adanya satu atau lebih faktor risiko seperti hipertensi, diabetes, hiperlipidemia, atau penyakit yang sudah ada).

1.1 Deskripsi Data

Dataset dari Davide Chicco, Giuseppe Jurman:

  1. Age: usia pasien (dalam tahun)

  2. Anemia: Penurunan sel darah merah atau hemoglobin

  3. High blood pressure: Jika pasien menderita hipertensi

  4. Creatinine phosphokinase: Tingkat enzim CPK dalam darah (mcg/L)

  5. Diabetes: Jika pasien menderita diabetes

  6. Ejection fraction: Persentase darah yang meninggalkan jantung pada setiap kontraksi

  7. Sex: Wanita atau pria

  8. Platelets: Trombosit dalam darah (kiloplatelet/mL)

  9. Serum creatinine: Tingkat kreatinin dalam darah (mg/dL)

  10. Serum sodium: Tingkat natrium dalam darah (mEq/L)

  11. Smoking: Jika pasien merokok

  12. Time:: Periode tindak lanjut (dalam hari)

  13. (target) death event: Jika pasien meninggal selama masa tindak lanjut

1.2 Deskripsi Masalah

  • Membuat model untuk memprediksi kemungkinan pasien meninggal karena gagal jantung.
  • Ini adalah masalah klasifikasi biner karena kelas target (Kematian) terdiri dari dua kelas Benar atau Salah

2. Persiapan Data

2.1 Package yang Dibutuhkan

Install packages (jalankan sekali) install.packages(c( “tidyverse”, # data manipulation & visualization (ggplot2, dplyr, dll) “data.table”, # alternatif manipulasi data “e1071”, # SVM “randomForest”, # Random Forest “ranger”, # Extra Trees / Random Forest cepat “caret”, # model training & evaluation (setara sklearn) “smotefamily”, # SMOTE (imbalance data) “ROSE”, # alternatif SMOTE “pROC”, # ROC curve & AUC “plotly”, # visualisasi interaktif “bestNormalize”, # PowerTransformer “scales”, “gridExtra”, “ggplot2”, update = FALSE ))

# Data Manipulation 
library(tidyverse)   # dplyr + ggplot2 (setara pandas + matplotlib)
## Warning: package 'tidyverse' was built under R version 4.4.3
## Warning: package 'readr' was built under R version 4.4.3
## Warning: package 'forcats' was built under R version 4.4.3
## Warning: package 'lubridate' was built under R version 4.4.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(data.table)  # setara numpy untuk operasi array/matrix cepat
## Warning: package 'data.table' was built under R version 4.4.3
## 
## Attaching package: 'data.table'
## 
## The following objects are masked from 'package:lubridate':
## 
##     hour, isoweek, mday, minute, month, quarter, second, wday, week,
##     yday, year
## 
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
## 
## The following object is masked from 'package:purrr':
## 
##     transpose
# Data Visualization 
library(ggplot2)     # setara matplotlib

# Data Preprocessing 
library(smotefamily) # SMOTE
## Warning: package 'smotefamily' was built under R version 4.4.3
# Classifiers 
library(e1071)        # SVM
## Warning: package 'e1071' was built under R version 4.4.3
library(rpart.plot)   # plot_tree
## Warning: package 'rpart.plot' was built under R version 4.4.3
## Loading required package: rpart
## Warning: package 'rpart' was built under R version 4.4.3
library(randomForest) # Random Forest & Extra Trees
## Warning: package 'randomForest' was built under R version 4.4.3
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
## 
## The following object is masked from 'package:ggplot2':
## 
##     margin
# Model Evaluation & Selection 
library(caret)  # MinMaxScaler
## Warning: package 'caret' was built under R version 4.4.3
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
                # PowerTransformer# \createDataPartition()
                # \trainControl(method = "cv")
                # \train() dengan tuneGrid
                # train() dengan tuneLength
                # confusionMatrix()
library(pROC)   # roc(), auc()
## Warning: package 'pROC' was built under R version 4.4.3
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## 
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

2.2 Dataset dan Deskripsi Fitur

# Load data
df <- read.csv("C:\\Users\\Elsa Ika R\\OneDrive\\KULIAH\\SEMESTER 6\\Komstat Lanjut\\heart_failure_clinical_records_dataset.csv")

# Setara df.head()
head(df)
# df.shape
dim(df)
## [1] 299  13
# df.info()
str(df)
## 'data.frame':    299 obs. of  13 variables:
##  $ age                     : num  75 55 65 50 65 90 75 60 65 80 ...
##  $ anaemia                 : int  0 0 0 1 1 1 1 1 0 1 ...
##  $ creatinine_phosphokinase: int  582 7861 146 111 160 47 246 315 157 123 ...
##  $ diabetes                : int  0 0 0 0 1 0 0 1 0 0 ...
##  $ ejection_fraction       : int  20 38 20 20 20 40 15 60 65 35 ...
##  $ high_blood_pressure     : int  1 0 0 0 0 1 0 0 0 1 ...
##  $ platelets               : num  265000 263358 162000 210000 327000 ...
##  $ serum_creatinine        : num  1.9 1.1 1.3 1.9 2.7 2.1 1.2 1.1 1.5 9.4 ...
##  $ serum_sodium            : int  130 136 129 137 116 132 137 131 138 133 ...
##  $ sex                     : int  1 1 1 1 0 1 1 1 0 1 ...
##  $ smoking                 : int  0 0 1 0 0 1 0 1 0 1 ...
##  $ time                    : int  4 6 7 7 8 8 10 10 10 10 ...
##  $ DEATH_EVENT             : int  1 1 1 1 1 1 1 1 1 1 ...
# Perbaiki tipe data usia → integer
df$age <- as.integer(df$age)
# Kolom numerik
numeric <- c('age', 'creatinine_phosphokinase',
             'ejection_fraction', 'platelets',
             'serum_creatinine', 'time')

# Kolom kategori
categorical <- c('anaemia', 'diabetes', 'high_blood_pressure',
                 'sex', 'smoking')

2.3 Missing Values

colSums(is.na(df))
##                      age                  anaemia creatinine_phosphokinase 
##                        0                        0                        0 
##                 diabetes        ejection_fraction      high_blood_pressure 
##                        0                        0                        0 
##                platelets         serum_creatinine             serum_sodium 
##                        0                        0                        0 
##                      sex                  smoking                     time 
##                        0                        0                        0 
##              DEATH_EVENT 
##                        0

2.4 Cek Variabel Target

library(ggplot2)

death_color <- c("0" = "navy", "1" = "#DC143C")
target_count <- table(df$DEATH_EVENT)

df$DEATH_EVENT_label <- factor(df$DEATH_EVENT, labels = c("No", "True"))

ggplot(df, aes(x = DEATH_EVENT_label, fill = factor(DEATH_EVENT))) +
  geom_bar() +
  geom_text(stat = "count",
            aes(label = paste0(round(after_stat(count)/nrow(df)*100, 2), "%\n(",
                               after_stat(count), ")")),
            position = position_stack(vjust = 0.5),
            color = "white", size = 5) +
  scale_fill_manual(values = death_color) +
  scale_y_continuous(breaks = seq(0, 225, 25)) +
  theme_classic() +
  labs(x = "DEATH_EVENT", fill = "DEATH_EVENT")

2.5 Exploratory Data Visualization

library(ggplot2)
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 4.4.3
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:randomForest':
## 
##     combine
## The following object is masked from 'package:dplyr':
## 
##     combine
library(tidyr)
# Countplot fitur Kategori per DEATH_EVENT
cat_plots <- list()
for (name in categorical) {
  p <- ggplot(df, aes(x = factor(.data[[name]]), fill = factor(DEATH_EVENT))) +
    geom_bar(position = "dodge") +
    scale_fill_manual(values = c("0" = "navy", "1" = "red3"),
                      labels = c("No", "True"), name = "DEATH_EVENT") +
    labs(x = name, title = name) +
    theme_bw()
  cat_plots[[name]] <- p
}

grid.arrange(grobs = cat_plots, ncol = 2)

# Histogram fitur Numerik 
colors_hist <- scales::hue_pal()(length(numeric))

hist_plots <- list()
for (i in seq_along(numeric)) {
  name <- numeric[i]
  p <- ggplot(df, aes(x = .data[[name]])) +
    geom_histogram(fill = colors_hist[i], bins = 30, color = "white") +
    labs(title = name) +
    theme_bw()
  hist_plots[[name]] <- p
}

grid.arrange(grobs = hist_plots, ncol = 3,
             top = grid::textGrob("Histogram fitur Numerik", 
                                  gp = grid::gpar(fontsize = 15)))

3. Feature Engineering

3.1 Pisahkan Fitur dan Kelas Target

X <- df[, !names(df) %in% c("DEATH_EVENT", "DEATH_EVENT_label")]
dim(X)
## [1] 299  12
colnames(X)
##  [1] "age"                      "anaemia"                 
##  [3] "creatinine_phosphokinase" "diabetes"                
##  [5] "ejection_fraction"        "high_blood_pressure"     
##  [7] "platelets"                "serum_creatinine"        
##  [9] "serum_sodium"             "sex"                     
## [11] "smoking"                  "time"
y <- df$DEATH_EVENT            # kolom target
length(y)
## [1] 299

3.2 Perbaiki Ketidakseimbangan Kelas

SMOTE adalah teknik oversampling di mana sampel sintetis dihasilkan untuk kelas minoritas.

library(smotefamily)

# Setara SMOTE(random_state=2021, k_neighbors=5)
set.seed(2021)
smote_result <- SMOTE(X, y, K = 5)

# Ambil data hasil resample
X_smote <- smote_result$data[, -ncol(smote_result$data)]  # fitur
dim(X_smote)
## [1] 395  12
y_smote <- smote_result$data[, ncol(smote_result$data)]   # target (kolom "class")
length(y_smote)
## [1] 395

3.3 Transformasi Data

Selama EDA untuk fitur numerik, histogram dari beberapa fitur menunjukkan kemiringan. Beberapa fitur seperti creatinine_phosphokinase dan serum_creatinine sangat miring. Fitur miring seperti ini dapat dibuat lebih mirip Gaussian menggunakan transformasi daya atau transformasi log.

1. creatine_phospokinase menggunakan transformasi log dapat membuat data sesuai dengan normalitas. Dalam hal ini, log-transform menghilangkan atau mengurangi skewness karena data asli mengikuti distribusi log-normal atau kira-kira begitu.

library(ggplot2)
# QQ Plot sebelum transformasi
p1 <- ggplot(df, aes(sample = creatinine_phosphokinase)) +
  stat_qq() +
  stat_qq_line(color = "red") +
  labs(title = "sebelum transformasi",
       x = "Theoretical Quantiles",
       y = "Sample Quantiles") +
  theme_bw()

# QQ Plot setelah transformasi log10
p2 <- ggplot(df, aes(sample = log10(creatinine_phosphokinase))) +
  stat_qq() +
  stat_qq_line(color = "red") +
  labs(title = "setelah transformasi",
       x = "Theoretical Quantiles",
       y = "Sample Quantiles") +
  theme_bw()


grid.arrange(p1, p2, ncol = 2,
  top = grid::textGrob(
    "QQ Plot untuk creatinine_phosphokinase",
    gp = grid::gpar(fontface = "bold")
  )
)

2. serum_sreatine menggunakan transformasi resiprokal (p = -1). Transformasi ini memiliki efek radikal karena membalikkan urutan di antara nilai-nilai dari tanda yang sama, oleh karena itu, nilai yang lebih besar menjadi lebih kecil, dan sebaliknya.

p <- -1

# QQ Plot sebelum transformasi
p1 <- ggplot(df, aes(sample = serum_creatinine)) +
  stat_qq() +
  stat_qq_line(color = "red") +
  labs(title = "sebelum transformasi",
       x = "Theoretical Quantiles",
       y = "Sample Quantiles") +
  theme_bw()

# QQ Plot sesudah transformasi (pangkat -1)
p2 <- ggplot(df, aes(sample = serum_creatinine^p)) +
  stat_qq() +
  stat_qq_line(color = "red") +
  labs(title = "sesudah transformasi",
       x = "Theoretical Quantiles",
       y = "Sample Quantiles") +
  theme_bw()

# Gabungkan kedua plot
grid.arrange(p1, p2, ncol = 2,
  top = grid::textGrob(
    "QQ Plot untuk serum_creatinine",
    gp = grid::gpar(fontface = "bold")
  )
)

Power Transformer dari paket sklearn-learn menyediakan dua metode untuk membuat distribusi seperti gaussian: 1. Boxcox 2. Yeo-johnson Kedua metode ini mencari nilai p yang tepat (seperti pada contoh di atas) agar distribusinya menjadi normal. Yeo-johnson adalah versi upgrade dari Boxcox karena berurusan dengan data dengan nilai negatif.

library(caret)

# Setara PowerTransformer(method='yeo-johnson')
pt <- preProcess(X_smote, method = "YeoJohnson")

# Setara pt.fit_transform(X_smote)
X_pt <- predict(pt, X_smote)

# Tampilkan hasil
X_pt

3.4 Normalisasi Data

mm <- preProcess(X_pt, method = "range")
X_scaled <- predict(mm, X_pt)
colnames(X_scaled)
##  [1] "age"                      "anaemia"                 
##  [3] "creatinine_phosphokinase" "diabetes"                
##  [5] "ejection_fraction"        "high_blood_pressure"     
##  [7] "platelets"                "serum_creatinine"        
##  [9] "serum_sodium"             "sex"                     
## [11] "smoking"                  "time"

3.5 EDA setelah Transformasi

library(tidyr)
library(ggplot2)

# Konversi ke dataframe dengan nama kolom dari X
X_scaled_df <- as.data.frame(X_scaled)
colnames(X_scaled_df) <- colnames(X)

# Histogram semua fitur
X_scaled_df |>
  pivot_longer(cols = everything(), names_to = "fitur", values_to = "nilai") |>
  ggplot(aes(x = nilai)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white") +
  facet_wrap(~ fitur, scales = "free") +
  theme_bw()

3.6 Feature Selection

library(randomForest)

set.seed(2021)

# Konversi y_smote ke factor (wajib untuk klasifikasi)
y_smote_factor <- as.factor(y_smote)

# Setara RandomForestClassifier(n_estimators=100, max_depth=10, class_weight='balanced')
rf <- randomForest(x = X_scaled, 
                   y = y_smote_factor,
                   ntree = 100,       # n_estimators
                   maxnodes = 2^10,   # pendekatan max_depth=10
                   classwt = c("0" = 1, "1" = 1),  # class_weight='balanced'
                   importance = TRUE)
## Warning in randomForest.default(x = X_scaled, y = y_smote_factor, ntree = 100,
## : maxnodes exceeds its max value.
rf
## 
## Call:
##  randomForest(x = X_scaled, y = y_smote_factor, ntree = 100, classwt = c(`0` = 1,      `1` = 1), maxnodes = 2^10, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 100
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 12.15%
## Confusion matrix:
##     0   1 class.error
## 0 178  25   0.1231527
## 1  23 169   0.1197917
library(ggplot2)

# Ambil feature importance dari model rf
feature_imp <- data.frame(
  fitur      = colnames(X),
  importance = round(rf$importance[, "MeanDecreaseGini"] / 
                     sum(rf$importance[, "MeanDecreaseGini"]) * 100, 2)
)

# Sort descending
feature_imp <- feature_imp[order(feature_imp$importance, decreasing = FALSE), ]
feature_imp$fitur <- factor(feature_imp$fitur, levels = feature_imp$fitur)

# Plot barh (horizontal)
ggplot(feature_imp, aes(x = importance, y = fitur)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(x = "percentage", y = NULL) +
  theme_bw()

# Ambil 3 fitur teratas (index sudah diurutkan descending sebelumnya)
imp_features <- tail(feature_imp$fitur, 3)
imp_features
## [1] ejection_fraction serum_creatinine  time             
## 12 Levels: smoking sex anaemia diabetes high_blood_pressure ... time
# Pilih kolom berdasarkan imp_features
X_scaled_df <- as.data.frame(X_scaled)
colnames(X_scaled_df) <- colnames(X)

X_selected <- X_scaled_df[, as.character(imp_features)]
X_selected

4. Pemodelan Data dan Hyperparameter

Oleh karena itu, tujuannya sekarang adalah untuk memisahkan kedua kelas seperti yang ditunjukkan pada gambar di bawah ini

Catatan: Semua pengklasifikasi telah disetel untuk memaksimalkan skor f1 alih-alih akurasi. Skor F1 adalah rata-rata harmonik dari recall dan presisi. Skor ini akan mendukung pengklasifikasi dengan presisi dan daya ingat yang sama. Saya bisa saja mencapai daya ingat atau presisi yang tinggi tetapi sayangnya, kita tidak dapat memilikinya dua arah karena meningkatkan presisi mengurangi daya ingat, dan sebaliknya.

model_data <- X_selected
model_data$target <- y_smote

model_data
library(caret)

set.seed(2021)

# Pisahkan fitur dan target
X_model <- model_data[, !names(model_data) %in% "target"]
y_model  <- as.factor(model_data$target)

# Stratified split 80/20 (setara stratify=model_data['target'])
train_idx <- createDataPartition(y_model, p = 0.80, list = FALSE)

X_train <- X_model[train_idx, ]
X_test  <- X_model[-train_idx, ]
y_train <- y_model[train_idx]
y_test  <- y_model[-train_idx]

# Setara print shape
dim(X_train)
## [1] 317   3
length(y_train)
## [1] 317
dim(X_test)
## [1] 78  3
length(y_test)
## [1] 78

4.1 Support Vector Machine (SVM)

set.seed(2021)

# Grid parameter C
tune_grid <- expand.grid(C = c(0.001, 0.01, 0.1, 1, 10, 100, 1000))

# Setara GridSearchCV dengan cv=10 dan scoring='f1'
train_ctrl <- trainControl(method = "cv", 
                           number = 10,
                           classProbs = TRUE,
                           summaryFunction = twoClassSummary)

# Konversi y_train ke label valid (tidak boleh 0/1)
levels(y_train) <- c("No", "Yes")
levels(y_test)  <- c("No", "Yes")

# Setara SVC(kernel='linear', class_weight='balanced')
lin_svm_cv <- train(x = X_train,
                    y = y_train,
                    method = "svmLinear",
                    trControl = train_ctrl,
                    tuneGrid = tune_grid,
                    metric = "F1",
                    class.weights = c("No" = 1, "Yes" = 1))
## Warning in train.default(x = X_train, y = y_train, method = "svmLinear", : The
## metric "F1" was not in the result set. ROC will be used instead.
# Setara lin_svm_cv.best_params_
lin_svm_cv$bestTune
# Setara cross_val_predict + classification_report pada data train
train_ctrl_pred <- trainControl(method = "cv",
                                number = 10,
                                savePredictions = "final",
                                classProbs = TRUE)

set.seed(2021)
lin_svm_cv_pred <- train(x = X_train,
                         y = y_train,
                         method = "svmLinear",
                         trControl = train_ctrl_pred,
                         tuneGrid = data.frame(C = lin_svm_cv$bestTune$C),
                         metric = "Accuracy")

# Ambil prediksi cross-validation
lin_svm_train_pred <- lin_svm_cv_pred$pred$pred

# Setara classification_report
confusionMatrix(lin_svm_train_pred, 
                lin_svm_cv_pred$pred$obs,
                positive = "Yes",
                dnn = c("Predicted", "Actual"))
## Confusion Matrix and Statistics
## 
##          Actual
## Predicted  No Yes
##       No  107   6
##       Yes  56 148
##                                           
##                Accuracy : 0.8044          
##                  95% CI : (0.7564, 0.8466)
##     No Information Rate : 0.5142          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.612           
##                                           
##  Mcnemar's Test P-Value : 4.877e-10       
##                                           
##             Sensitivity : 0.9610          
##             Specificity : 0.6564          
##          Pos Pred Value : 0.7255          
##          Neg Pred Value : 0.9469          
##              Prevalence : 0.4858          
##          Detection Rate : 0.4669          
##    Detection Prevalence : 0.6435          
##       Balanced Accuracy : 0.8087          
##                                           
##        'Positive' Class : Yes             
## 
set.seed(2021)

# Grid parameter C
tune_grid_rbf <- expand.grid(C = c(0.001, 0.01, 0.1, 1, 10, 100, 1000),
                              sigma = 0.1)  # sigma = 1/gamma (RBF kernel)

# Setara GridSearchCV dengan cv=10 dan scoring='f1'
train_ctrl_rbf <- trainControl(method = "cv",
                               number = 10,
                               classProbs = TRUE,
                               summaryFunction = twoClassSummary)

# Setara SVC(kernel='rbf', class_weight='balanced')
rbf_svm_cv <- train(x = X_train,
                    y = y_train,
                    method = "svmRadial",
                    trControl = train_ctrl_rbf,
                    tuneGrid = tune_grid_rbf,
                    metric = "ROC",
                    class.weights = c("No" = 1, "Yes" = 1))

# Setara rbf_svm_cv.best_params_
rbf_svm_cv$bestTune
library(caret)
library(MLmetrics)
## Warning: package 'MLmetrics' was built under R version 4.4.3
## 
## Attaching package: 'MLmetrics'
## The following objects are masked from 'package:caret':
## 
##     MAE, RMSE
## The following object is masked from 'package:base':
## 
##     Recall
set.seed(2021)

# Setara cross_val_predict pada data train
train_ctrl_pred_rbf <- trainControl(method = "cv",
                                    number = 10,
                                    savePredictions = "final",
                                    classProbs = TRUE)

rbf_svm_cv_pred <- train(x = X_train,
                         y = y_train,
                         method = "svmRadial",
                         trControl = train_ctrl_pred_rbf,
                         tuneGrid = data.frame(C = rbf_svm_cv$bestTune$C,
                                               sigma = rbf_svm_cv$bestTune$sigma),
                         metric = "Accuracy")

# Ambil prediksi cross-validation
rbf_svm_train_pred <- rbf_svm_cv_pred$pred$pred
actual            <- rbf_svm_cv_pred$pred$obs

# Setara classification_report
confusionMatrix(rbf_svm_train_pred, actual,
                positive = "Yes",
                dnn = c("Predicted", "Actual"))
## Confusion Matrix and Statistics
## 
##          Actual
## Predicted  No Yes
##       No  113   6
##       Yes  50 148
##                                           
##                Accuracy : 0.8233          
##                  95% CI : (0.7768, 0.8637)
##     No Information Rate : 0.5142          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6492          
##                                           
##  Mcnemar's Test P-Value : 9.132e-09       
##                                           
##             Sensitivity : 0.9610          
##             Specificity : 0.6933          
##          Pos Pred Value : 0.7475          
##          Neg Pred Value : 0.9496          
##              Prevalence : 0.4858          
##          Detection Rate : 0.4669          
##    Detection Prevalence : 0.6246          
##       Balanced Accuracy : 0.8271          
##                                           
##        'Positive' Class : Yes             
## 
# Detail per kelas
cat("Precision:", round(Precision(actual, rbf_svm_train_pred, positive = "No"),  4), "\n")
## Precision: 0.9496
cat("Recall   :", round(Recall(actual, rbf_svm_train_pred,    positive = "No"),  4), "\n")
## Recall   : 0.6933
cat("F1       :", round(F1_Score(actual, rbf_svm_train_pred,  positive = "No"),  4), "\n\n")
## F1       : 0.8014
cat("Precision:", round(Precision(actual, rbf_svm_train_pred, positive = "Yes"), 4), "\n")
## Precision: 0.7475
cat("Recall   :", round(Recall(actual, rbf_svm_train_pred,    positive = "Yes"), 4), "\n")
## Recall   : 0.961
cat("F1       :", round(F1_Score(actual, rbf_svm_train_pred,  positive = "Yes"), 4), "\n")
## F1       : 0.8409

4.2 Random Forest

set.seed(2021)

tune_grid_rf <- expand.grid(
  mtry          = floor(sqrt(ncol(X_train))),
  splitrule     = c("gini", "extratrees"),
  min.node.size = 1
)

train_ctrl_rf <- trainControl(method = "cv",
                              number = 10,
                              savePredictions = "final",
                              classProbs = TRUE,
                              summaryFunction = twoClassSummary)

rf_cv <- train(x = X_train,
               y = y_train,
               method = "ranger",
               trControl = train_ctrl_rf,
               tuneGrid = tune_grid_rf,
               metric = "ROC",
               num.trees = 100,
               max.depth = 10,   # pakai nilai tunggal
               importance = "impurity",
               seed = 2021)

rf_cv$bestTune
# Cross-val predict
rf_train_pred <- rf_cv$pred$pred
actual        <- rf_cv$pred$obs

# Confusion Matrix
cm_rf <- confusionMatrix(rf_train_pred, actual,
                         positive = "Yes",
                         dnn = c("Predicted", "Actual"))
print(cm_rf)
## Confusion Matrix and Statistics
## 
##          Actual
## Predicted  No Yes
##       No  142  29
##       Yes  21 125
##                                           
##                Accuracy : 0.8423          
##                  95% CI : (0.7974, 0.8806)
##     No Information Rate : 0.5142          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.6838          
##                                           
##  Mcnemar's Test P-Value : 0.3222          
##                                           
##             Sensitivity : 0.8117          
##             Specificity : 0.8712          
##          Pos Pred Value : 0.8562          
##          Neg Pred Value : 0.8304          
##              Prevalence : 0.4858          
##          Detection Rate : 0.3943          
##    Detection Prevalence : 0.4606          
##       Balanced Accuracy : 0.8414          
##                                           
##        'Positive' Class : Yes             
## 
# Detail per kelas
cat("Precision:", round(cm_rf$byClass["Neg Pred Value"], 4), "\n")
## Precision: 0.8304
cat("Recall   :", round(cm_rf$byClass["Specificity"],    4), "\n\n")
## Recall   : 0.8712
cat("Precision:", round(cm_rf$byClass["Pos Pred Value"], 4), "\n")
## Precision: 0.8562
cat("Recall   :", round(cm_rf$byClass["Sensitivity"],    4), "\n")
## Recall   : 0.8117
cat("F1       :", round(cm_rf$byClass["F1"],              4), "\n\n")
## F1       : 0.8333
cat("Accuracy :", round(cm_rf$overall["Accuracy"],        4), "\n")
## Accuracy : 0.8423
library(caret)
library(MLmetrics)

set.seed(2021)

# Setara cross_val_predict pada data train
train_ctrl_pred_rf <- trainControl(method = "cv",
                                   number = 10,
                                   savePredictions = "final",
                                   classProbs = TRUE,
                                   verboseIter = TRUE)  # setara verbose=1

rf_cv_pred <- train(x = X_train,
                    y = y_train,
                    method = "ranger",
                    trControl = train_ctrl_pred_rf,
                    tuneGrid = rf_cv$bestTune,
                    metric = "Accuracy",
                    num.trees = 100,
                    importance = "impurity")
## + Fold01: mtry=1, splitrule=gini, min.node.size=1 
## - Fold01: mtry=1, splitrule=gini, min.node.size=1 
## + Fold02: mtry=1, splitrule=gini, min.node.size=1 
## - Fold02: mtry=1, splitrule=gini, min.node.size=1 
## + Fold03: mtry=1, splitrule=gini, min.node.size=1 
## - Fold03: mtry=1, splitrule=gini, min.node.size=1 
## + Fold04: mtry=1, splitrule=gini, min.node.size=1 
## - Fold04: mtry=1, splitrule=gini, min.node.size=1 
## + Fold05: mtry=1, splitrule=gini, min.node.size=1 
## - Fold05: mtry=1, splitrule=gini, min.node.size=1 
## + Fold06: mtry=1, splitrule=gini, min.node.size=1 
## - Fold06: mtry=1, splitrule=gini, min.node.size=1 
## + Fold07: mtry=1, splitrule=gini, min.node.size=1 
## - Fold07: mtry=1, splitrule=gini, min.node.size=1 
## + Fold08: mtry=1, splitrule=gini, min.node.size=1 
## - Fold08: mtry=1, splitrule=gini, min.node.size=1 
## + Fold09: mtry=1, splitrule=gini, min.node.size=1 
## - Fold09: mtry=1, splitrule=gini, min.node.size=1 
## + Fold10: mtry=1, splitrule=gini, min.node.size=1 
## - Fold10: mtry=1, splitrule=gini, min.node.size=1 
## Aggregating results
## Fitting final model on full training set
# Ambil prediksi cross-validation
rf_train_pred <- rf_cv_pred$pred$pred
actual        <- rf_cv_pred$pred$obs
library(MLmetrics)

# Setara classification_report
confusionMatrix(rf_train_pred, actual,
                positive = "Yes",
                dnn = c("Predicted", "Actual"))
## Confusion Matrix and Statistics
## 
##          Actual
## Predicted  No Yes
##       No  142  27
##       Yes  21 127
##                                           
##                Accuracy : 0.8486          
##                  95% CI : (0.8043, 0.8862)
##     No Information Rate : 0.5142          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.6966          
##                                           
##  Mcnemar's Test P-Value : 0.4705          
##                                           
##             Sensitivity : 0.8247          
##             Specificity : 0.8712          
##          Pos Pred Value : 0.8581          
##          Neg Pred Value : 0.8402          
##              Prevalence : 0.4858          
##          Detection Rate : 0.4006          
##    Detection Prevalence : 0.4669          
##       Balanced Accuracy : 0.8479          
##                                           
##        'Positive' Class : Yes             
## 
# Detail per kelas
cat("Precision:", round(Precision(actual, rf_train_pred, positive = "No"),  4), "\n")
## Precision: 0.8402
cat("Recall   :", round(Recall(actual, rf_train_pred,    positive = "No"),  4), "\n")
## Recall   : 0.8712
cat("F1       :", round(F1_Score(actual, rf_train_pred,  positive = "No"),  4), "\n\n")
## F1       : 0.8554
cat("Precision:", round(Precision(actual, rf_train_pred, positive = "Yes"), 4), "\n")
## Precision: 0.8581
cat("Recall   :", round(Recall(actual, rf_train_pred,    positive = "Yes"), 4), "\n")
## Recall   : 0.8247
cat("F1       :", round(F1_Score(actual, rf_train_pred,  positive = "Yes"), 4), "\n\n")
## F1       : 0.8411
# Accuracy & Macro avg
cat("=== Overall ===\n")
## === Overall ===
cat("Accuracy :", round(Accuracy(actual, rf_train_pred), 4), "\n")
## Accuracy : 0.8486
cat("Macro F1 :", round((F1_Score(actual, rf_train_pred, positive = "No") +
                         F1_Score(actual, rf_train_pred, positive = "Yes")) / 2, 4), "\n")
## Macro F1 : 0.8482

6. Evaluasi Model

Random Forest telah mengungguli semua pengklasifikasi lainnya dalam akurasi, presisi, daya ingat, skor f1, dan skor AUC.

library(MLmetrics)
library(pROC)
library(ggplot2)
library(tidyr)

# Daftar nama model
models <- c("Linear SVM", "Non-Linear SVM", "Random Forest")

# Daftar prediksi tiap model
predictions <- list(lin_svm_train_pred, rbf_svm_train_pred, rf_train_pred)

# Inisialisasi vektor metrik
accuracy  <- c()
recall    <- c()
precision <- c()
f1        <- c()
auc       <- c()

# Hitung metrik tiap model
for (model_pred in predictions) {
  accuracy  <- c(accuracy,  Accuracy(actual, model_pred))
  precision <- c(precision, Precision(actual, model_pred, positive = "Yes"))
  recall    <- c(recall,    Recall(actual, model_pred,    positive = "Yes"))
  f1        <- c(f1,        F1_Score(actual, model_pred,  positive = "Yes"))
  auc       <- c(auc,       as.numeric(roc(actual, as.numeric(model_pred == "Yes"),
                                           quiet = TRUE)$auc))
}

# Gabungkan ke dataframe
results_df <- data.frame(
  model     = models,
  accuracy  = round(accuracy,  4),
  precision = round(precision, 4),
  recall    = round(recall,    4),
  f1        = round(f1,        4),
  auc       = round(auc,       4)
)

results_df
library(ggplot2)

# Setara sns.color_palette("Dark2")
model_colors <- RColorBrewer::brewer.pal(length(models), "Dark2")

ggplot(results_df, aes(x = model, y = accuracy, fill = model)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = paste0(round(accuracy * 100, 3), "%")),
            vjust = -0.5, hjust = 0.5, size = 3.5) +
  scale_fill_manual(values = model_colors) +
  labs(x = "Models",
       y = "Accuracy percentage (%)",
       title = "Model comparison on training data using Accuracy") +
  theme_classic() +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 15, hjust = 1))

ggplot(results_df, aes(x = model, y = recall, fill = model)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = paste0(round(recall * 100, 3), "%")),
            vjust = -0.5, hjust = 0.5, size = 3.5) +
  scale_fill_manual(values = model_colors) +
  labs(x = "Models",
       y = "Recall percentage (%)",
       title = "Model comparison on training data using Recall") +
  theme_classic() +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 15, hjust = 1))

ggplot(results_df, aes(x = model, y = precision, fill = model)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = paste0(round(precision * 100, 3), "%")),
            vjust = -0.5, hjust = 0.5, size = 3.5) +
  scale_fill_manual(values = model_colors) +
  labs(x = "Models",
       y = "Precision percentage (%)",
       title = "Model comparison on training data using Precision") +
  theme_classic() +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 15, hjust = 1))

ggplot(results_df, aes(x = model, y = f1, fill = model)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = paste0(round(f1 * 100, 3), "%")),
            vjust = -0.5, hjust = 0.5, size = 3.5) +
  scale_fill_manual(values = model_colors) +
  labs(x = "Models",
       y = "F1 percentage (%)",
       title = "Model comparison on training data using F1 score") +
  theme_classic() +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 15, hjust = 1))

ggplot(results_df, aes(x = model, y = auc, fill = model)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = round(auc, 3)),
            vjust = -0.5, hjust = 0.5, size = 3.5) +
  scale_fill_manual(values = model_colors) +
  labs(x = "Models",
       y = "AUC",
       title = "Model comparison on training data using AUC") +
  theme_classic() +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 15, hjust = 1))

7. Prediksi pada Data Uji

library(caret)
library(MLmetrics)

# Daftar model
best_models <- list(lin_svm_cv, rbf_svm_cv, rf_cv)

# Evaluasi tiap model pada data test
for (i in seq_along(models)) {
  model      <- best_models[[i]]
  name       <- models[i]
  best_pred  <- predict(model, newdata = X_test)
  
  cat(toupper(name), "\n")
  
  # Confusion Matrix + metrik
  cm <- confusionMatrix(best_pred, y_test,
                        positive = "Yes",
                        dnn = c("Predicted", "Actual"))
  print(cm)
  
  # Detail per kelas
  cat("Precision:", round(Precision(y_test, best_pred, positive = "No"),  4), "\n")
  cat("Recall   :", round(Recall(y_test, best_pred,    positive = "No"),  4), "\n")
  cat("F1       :", round(F1_Score(y_test, best_pred,  positive = "No"),  4), "\n\n")
  
  cat("Precision:", round(Precision(y_test, best_pred, positive = "Yes"), 4), "\n")
  cat("Recall   :", round(Recall(y_test, best_pred,    positive = "Yes"), 4), "\n")
  cat("F1       :", round(F1_Score(y_test, best_pred,  positive = "Yes"), 4), "\n\n")
  
  cat("Accuracy :", round(Accuracy(y_test, best_pred), 4), "\n")
}
## LINEAR SVM 
## Confusion Matrix and Statistics
## 
##          Actual
## Predicted No Yes
##       No  17   0
##       Yes 23  38
##                                          
##                Accuracy : 0.7051         
##                  95% CI : (0.5911, 0.803)
##     No Information Rate : 0.5128         
##     P-Value [Acc > NIR] : 0.0004323      
##                                          
##                   Kappa : 0.4187         
##                                          
##  Mcnemar's Test P-Value : 4.49e-06       
##                                          
##             Sensitivity : 1.0000         
##             Specificity : 0.4250         
##          Pos Pred Value : 0.6230         
##          Neg Pred Value : 1.0000         
##              Prevalence : 0.4872         
##          Detection Rate : 0.4872         
##    Detection Prevalence : 0.7821         
##       Balanced Accuracy : 0.7125         
##                                          
##        'Positive' Class : Yes            
##                                          
## Precision: 1 
## Recall   : 0.425 
## F1       : 0.5965 
## 
## Precision: 0.623 
## Recall   : 1 
## F1       : 0.7677 
## 
## Accuracy : 0.7051 
## NON-LINEAR SVM 
## Confusion Matrix and Statistics
## 
##          Actual
## Predicted No Yes
##       No  20   0
##       Yes 20  38
##                                           
##                Accuracy : 0.7436          
##                  95% CI : (0.6321, 0.8358)
##     No Information Rate : 0.5128          
##     P-Value [Acc > NIR] : 2.584e-05       
##                                           
##                   Kappa : 0.4935          
##                                           
##  Mcnemar's Test P-Value : 2.152e-05       
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.5000          
##          Pos Pred Value : 0.6552          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.4872          
##          Detection Rate : 0.4872          
##    Detection Prevalence : 0.7436          
##       Balanced Accuracy : 0.7500          
##                                           
##        'Positive' Class : Yes             
##                                           
## Precision: 1 
## Recall   : 0.5 
## F1       : 0.6667 
## 
## Precision: 0.6552 
## Recall   : 1 
## F1       : 0.7917 
## 
## Accuracy : 0.7436 
## RANDOM FOREST 
## Confusion Matrix and Statistics
## 
##          Actual
## Predicted No Yes
##       No  35   3
##       Yes  5  35
##                                           
##                Accuracy : 0.8974          
##                  95% CI : (0.8079, 0.9547)
##     No Information Rate : 0.5128          
##     P-Value [Acc > NIR] : 4.197e-13       
##                                           
##                   Kappa : 0.795           
##                                           
##  Mcnemar's Test P-Value : 0.7237          
##                                           
##             Sensitivity : 0.9211          
##             Specificity : 0.8750          
##          Pos Pred Value : 0.8750          
##          Neg Pred Value : 0.9211          
##              Prevalence : 0.4872          
##          Detection Rate : 0.4487          
##    Detection Prevalence : 0.5128          
##       Balanced Accuracy : 0.8980          
##                                           
##        'Positive' Class : Yes             
##                                           
## Precision: 0.9211 
## Recall   : 0.875 
## F1       : 0.8974 
## 
## Precision: 0.875 
## Recall   : 0.9211 
## F1       : 0.8974 
## 
## Accuracy : 0.8974

Perbandingan model pada data uji menggunakan Kurva ROC kurva ROC, memplot tingkat positif sejati (nama lain untuk penarikan) terhadap tingkat positif palsu (FPR). Sekali lagi ada trade-off: semakin tinggi recall (TPR), semakin banyak false positive (FPR) yang dihasilkan classifier. Garis putus-putus mewakili kurva ROC dari pengklasifikasi acak murni; pengklasifikasi yang baik tetap sejauh mungkin dari garis itu Salah satu cara untuk membandingkan pengklasifikasi adalah dengan mengukur area di bawah kurva (AUC). Pengklasifikasi sempurna akan memiliki ROC AUC sama dengan 1, sedangkan pengklasifikasi acak murni akan memiliki ROC AUC sama dengan 0,5. Hutan acak memiliki AUC tertinggi.

library(ggplot2)
library(pROC)

best_models <- list(lin_svm_cv, rbf_svm_cv, rf_cv)

# Kumpulkan data ROC semua model
roc_data <- data.frame()

for (i in seq_along(models)) {
  model     <- best_models[[i]]
  name      <- models[i]
  best_pred <- predict(model, newdata = X_test)
  
  roc_obj <- roc(y_test, as.numeric(best_pred == "Yes"), quiet = TRUE)
  
  roc_data <- rbind(roc_data, data.frame(
    fpr   = 1 - roc_obj$specificities,
    tpr   = roc_obj$sensitivities,
    model = name
  ))
}

# Plot ROC curve semua model
ggplot(roc_data, aes(x = fpr, y = tpr, color = model)) +
  geom_line(linewidth = 1) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "black") +
  labs(x = "False Positive Rate",
       y = "True Positive Rate (Recall)",
       color = "Model") +
  theme_classic()

8. Kesimpulan

Model yang terbaik untuk melakukan klasifikasi untuk dataset ini adalah Random Forest.