Analisis sebelumnya menghasilkan akurasi LDA dan QDA yang sangat rendah pada data uji (~25%), hampir setara dengan random guessing untuk 4 kelas. Setelah dievaluasi, terdapat dua penyebab utama:
RNA_Base,
RNA4, RNA12, RNA_EOT, dan
RNA_EF memiliki rentang nilai ratusan ribu hingga jutaan.
Distribusi yang jauh dari normal melanggar asumsi LDA (normalitas
multivariat) sehingga fungsi diskriminan yang dihasilkan tidak
optimal.Dua strategi diterapkan untuk meningkatkan akurasi:
| Solusi | Strategi | Mekanisme |
|---|---|---|
| 1 | Seleksi Fitur | Pilih top-N variabel berdasarkan nilai absolut koefisien LD1 dari model awal. Mengurangi noise dari variabel tidak relevan. |
| 3 | Transformasi Log | Terapkan log1p() pada variabel RNA dan ALT sebelum
standardisasi untuk memperbaiki normalitas dan homogenitas
kovarians. |
library(tidyverse)
library(caret)
library(MASS) # lda(), qda()
library(klaR) # partimat()
library(ggplot2)
library(dplyr)
library(reshape2)
library(gridExtra)path <- "D:/02. UNS/01. Kuliah/Sem 4/PDM/Diskriminan/HCV-Egy-Data.csv"
df <- read.csv(path, check.names = TRUE)
# Hapus kolom nama kosong & bersihkan spasi
df <- df[, nzchar(trimws(colnames(df)))]
colnames(df) <- trimws(colnames(df))
# Rename kolom
colnames(df)[colnames(df) == "Baselinehistological.staging"] <- "Staging"
colnames(df)[colnames(df) == "Fatigue...generalized.bone.ache"] <- "Fatigue"
colnames(df)[colnames(df) == "Nausea.Vomting"] <- "Nausea"
colnames(df)[colnames(df) == "Epigastric.pain"] <- "EpigastricPain"
colnames(df)[colnames(df) == "Baseline.histological.Grading"] <- "Grading"
colnames(df)[colnames(df) == "ALT.after.24.w"] <- "ALT_after24w"
colnames(df)[colnames(df) == "RNA.Base"] <- "RNA_Base"
colnames(df)[colnames(df) == "RNA.EOT"] <- "RNA_EOT"
colnames(df)[colnames(df) == "RNA.EF"] <- "RNA_EF"
colnames(df)[colnames(df) == "AST.1"] <- "AST1"
colnames(df)[colnames(df) == "ALT.1"] <- "ALT1"
colnames(df)[colnames(df) == "ALT4"] <- "ALT4"
colnames(df)[colnames(df) == "ALT.12"] <- "ALT12"
colnames(df)[colnames(df) == "ALT.24"] <- "ALT24"
colnames(df)[colnames(df) == "ALT.36"] <- "ALT36"
colnames(df)[colnames(df) == "ALT.48"] <- "ALT48"
colnames(df)[colnames(df) == "RNA.4"] <- "RNA4"
colnames(df)[colnames(df) == "RNA.12"] <- "RNA12"
df$Staging <- as.factor(df$Staging)
levels(df$Staging) <- c("Staging1", "Staging2", "Staging3", "Staging4")
features <- setdiff(colnames(df), "Staging")
cat("Total fitur prediktor:", length(features), "\n")## Total fitur prediktor: 28
## Distribusi kelas:
##
## Staging1 Staging2 Staging3 Staging4
## 336 332 355 362
Variabel RNA memiliki skala ratusan ribu hingga jutaan, sementara ALT
berada di skala puluhan hingga ratusan. Distribusi seperti ini sangat
miring ke kanan (right-skewed) dan melanggar asumsi normalitas
LDA. Transformasi log1p(x) (yaitu \(\ln(x+1)\)) digunakan agar nilai nol tidak
menghasilkan -Inf.
# Variabel RNA dan ALT yang akan ditransformasi
vars_log <- c("RNA_Base", "RNA4", "RNA12", "RNA_EOT", "RNA_EF",
"ALT1", "ALT4", "ALT12", "ALT24", "ALT36", "ALT48",
"ALT_after24w", "AST1")
# Cek distribusi SEBELUM transformasi (contoh RNA_Base)
cat("=== Ringkasan RNA_Base SEBELUM transformasi ===\n")## === Ringkasan RNA_Base SEBELUM transformasi ===
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11 269253 593103 590951 886791 1201086
cat("Skewness RNA_Base (approx):",
round((mean(df$RNA_Base) - median(df$RNA_Base)) / sd(df$RNA_Base), 3), "\n\n")## Skewness RNA_Base (approx): -0.006
# Terapkan log1p
df_log <- df
df_log[, vars_log] <- lapply(df[, vars_log], log1p)
# Cek distribusi SETELAH transformasi
cat("=== Ringkasan RNA_Base SETELAH log1p ===\n")## === Ringkasan RNA_Base SETELAH log1p ===
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.485 12.503 13.293 12.953 13.695 13.999
cat("Skewness RNA_Base (approx):",
round((mean(df_log$RNA_Base) - median(df_log$RNA_Base)) / sd(df_log$RNA_Base), 3), "\n")## Skewness RNA_Base (approx): -0.32
par(mfrow = c(1, 2))
# Sebelum
hist(df$RNA_Base,
main = "RNA_Base — Sebelum Transformasi",
xlab = "Nilai", col = "#d5e8f5", border = "white",
breaks = 40)
# Sesudah
hist(df_log$RNA_Base,
main = "RNA_Base — Setelah log1p",
xlab = "log1p(Nilai)", col = "#a8d5a2", border = "white",
breaks = 40)Dilakukan pada data yang sudah ditransformasi.
df_scaled <- df_log
df_scaled[, features] <- as.data.frame(scale(df_log[, features]))
cat("Standardisasi selesai. Contoh mean & sd setelah scaling:\n")## Standardisasi selesai. Contoh mean & sd setelah scaling:
## Age Gender BMI Fever Nausea
## mean 0 0 0 0 0
## sd 1 1 1 1 1
set.seed(123)
train_idx <- createDataPartition(df_scaled$Staging, p = 0.7, list = FALSE)
data_train <- df_scaled[ train_idx, ]
data_test <- df_scaled[-train_idx, ]
cat("Data latih :", nrow(data_train), "observasi\n")## Data latih : 972 observasi
## Data uji : 413 observasi
##
## Distribusi kelas — Data Latih:
##
## Staging1 Staging2 Staging3 Staging4
## 236 233 249 254
##
## Distribusi kelas — Data Uji:
##
## Staging1 Staging2 Staging3 Staging4
## 100 99 106 108
Model LDA awal dilatih menggunakan seluruh 28 fitur (yang sudah ditransformasi dan distandardisasi) untuk mendapatkan koefisien LD1 sebagai dasar seleksi.
model_lda_full <- lda(Staging ~ ., data = data_train)
# Ambil koefisien LD1, urutkan berdasarkan nilai absolut
coef_ld1 <- abs(model_lda_full$scaling[, "LD1"])
coef_ld1_sorted <- sort(coef_ld1, decreasing = TRUE)
cat("=== Peringkat Variabel berdasarkan |Koefisien LD1| ===\n")## === Peringkat Variabel berdasarkan |Koefisien LD1| ===
## RNA12 RNA_EOT BMI RNA_EF Age
## 1.3414 0.8517 0.6046 0.4005 0.3070
## Fever RNA4 WBC Plat AST1
## 0.2326 0.2113 0.2061 0.2053 0.2037
## EpigastricPain ALT12 Gender Grading ALT_after24w
## 0.2021 0.1930 0.1898 0.1728 0.1685
## ALT4 ALT36 Diarrhea Fatigue RNA_Base
## 0.1658 0.1638 0.1231 0.1218 0.1168
## Nausea ALT1 Headache ALT24 HGB
## 0.1163 0.1006 0.0880 0.0850 0.0846
## ALT48 Jaundice RBC
## 0.0677 0.0348 0.0341
Top-10 fitur dipilih berdasarkan nilai absolut koefisien LD1 tertinggi. Ini merupakan variabel yang paling berkontribusi dalam memisahkan kelas Staging.
# Pilih top-10 variabel
n_top <- 17
top_features <- names(coef_ld1_sorted)[1:n_top]
cat("=== Top", n_top, "Fitur Terpilih ===\n")## === Top 17 Fitur Terpilih ===
## [1] "RNA12" "RNA_EOT" "BMI" "RNA_EF"
## [5] "Age" "Fever" "RNA4" "WBC"
## [9] "Plat" "AST1" "EpigastricPain" "ALT12"
## [13] "Gender" "Grading" "ALT_after24w" "ALT4"
## [17] "ALT36"
##
## Nilai |Koefisien LD1|:
## RNA12 RNA_EOT BMI RNA_EF Age
## 1.3414 0.8517 0.6046 0.4005 0.3070
## Fever RNA4 WBC Plat AST1
## 0.2326 0.2113 0.2061 0.2053 0.2037
## EpigastricPain ALT12 Gender Grading ALT_after24w
## 0.2021 0.1930 0.1898 0.1728 0.1685
## ALT4 ALT36
## 0.1658 0.1638
# Semua variabel + tandai yang terpilih
coef_df <- data.frame(
Variabel = names(coef_ld1_sorted),
AbsCoef = as.numeric(coef_ld1_sorted),
Terpilih = names(coef_ld1_sorted) %in% top_features
)
coef_df$Variabel <- factor(coef_df$Variabel,
levels = rev(coef_df$Variabel))
ggplot(coef_df, aes(x = AbsCoef, y = Variabel,
fill = Terpilih, alpha = Terpilih)) +
geom_bar(stat = "identity") +
geom_vline(
xintercept = coef_ld1_sorted[n_top],
linetype = "dashed", color = "#e74c3c", linewidth = 0.8
) +
scale_fill_manual(
values = c("TRUE" = "#2980b9", "FALSE" = "#bdc3c7"),
labels = c("TRUE" = "Terpilih", "FALSE" = "Tidak terpilih")
) +
scale_alpha_manual(values = c("TRUE" = 1, "FALSE" = 0.5), guide = "none") +
annotate("text", x = coef_ld1_sorted[n_top] + 0.01,
y = 1, label = paste0("Threshold top-", n_top),
color = "#e74c3c", size = 3, hjust = 0) +
labs(
title = paste0("Seleksi Fitur: Top-", n_top,
" Berdasarkan |Koefisien LD1|"),
subtitle = "Garis merah putus-putus = batas threshold seleksi",
x = "|Koefisien LD1|",
y = "Variabel",
fill = "Status"
) +
theme_bw(base_size = 10) +
theme(
plot.title = element_text(face = "bold"),
axis.text.y = element_text(size = 8)
)# Subset data hanya fitur terpilih + target
data_train_sel <- data_train[, c(top_features, "Staging")]
data_test_sel <- data_test[, c(top_features, "Staging")]
cat("Dimensi data latih (fitur terpilih):", dim(data_train_sel), "\n")## Dimensi data latih (fitur terpilih): 972 18
## Dimensi data uji (fitur terpilih): 413 18
## Fitur yang digunakan:
## [1] "RNA12" "RNA_EOT" "BMI" "RNA_EF"
## [5] "Age" "Fever" "RNA4" "WBC"
## [9] "Plat" "AST1" "EpigastricPain" "ALT12"
## [13] "Gender" "Grading" "ALT_after24w" "ALT4"
## [17] "ALT36"
## Call:
## lda(Staging ~ ., data = data_train_sel)
##
## Prior probabilities of groups:
## Staging1 Staging2 Staging3 Staging4
## 0.2427984 0.2397119 0.2561728 0.2613169
##
## Group means:
## RNA12 RNA_EOT BMI RNA_EF Age
## Staging1 0.08153927 0.10797836 -0.006907335 0.08427587 -0.04116680
## Staging2 -0.11486932 -0.11166845 0.192871439 -0.12113648 0.07509045
## Staging3 0.02064937 0.01696758 -0.008431122 0.03400137 0.03271559
## Staging4 0.01442030 -0.01599513 -0.147389230 -0.01165920 -0.08745112
## Fever RNA4 WBC Plat AST1
## Staging1 0.07913214 0.04790800 -0.01277460 -0.047917098 0.011036204
## Staging2 0.03333479 0.06129037 -0.04539041 0.099416138 0.002823359
## Staging3 0.03723067 0.03334706 -0.03541789 -0.008423181 0.003214609
## Staging4 -0.07042553 -0.06440774 0.04323104 -0.042663762 -0.098137913
## EpigastricPain ALT12 Gender Grading ALT_after24w
## Staging1 0.08525000 -0.004151124 0.01246228 0.04341757 -0.03186623
## Staging2 -0.08087695 -0.055510820 -0.09492773 0.04854715 -0.04124403
## Staging3 -0.09224918 -0.075754800 0.10526104 -0.01564098 0.03587153
## Staging4 0.03928890 0.056175715 0.02880856 -0.02982198 0.05074444
## ALT4 ALT36
## Staging1 -0.0005127219 0.006093398
## Staging2 0.0069159749 -0.039754574
## Staging3 0.0570109767 -0.005716282
## Staging4 -0.0782862189 0.045380910
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## RNA12 -1.3269398 -1.56440151 -0.14090482
## RNA_EOT 0.8487747 1.87324935 -2.07048683
## BMI 0.6271381 -0.12379509 -0.13454469
## RNA_EF 0.3647851 0.23581020 2.30237834
## Age 0.3294103 -0.04013388 0.18198921
## Fever 0.2480340 0.30837301 0.02073241
## RNA4 0.2494758 0.21186201 0.02267521
## WBC -0.1974222 -0.05431645 -0.07213922
## Plat 0.2128685 -0.26368813 -0.01933234
## AST1 0.2078274 0.22703814 0.05265109
## EpigastricPain -0.2024045 0.25374011 -0.47855581
## ALT12 -0.1829528 -0.04823834 -0.25078101
## Gender -0.2124784 0.27686589 0.48364707
## Grading 0.1971010 0.13529508 -0.17586702
## ALT_after24w -0.1712792 -0.12131325 0.26589266
## ALT4 0.1706905 0.17264606 0.27010664
## ALT36 -0.1758468 0.01299862 -0.02326335
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.6141 0.2044 0.1815
ev_imp <- model_lda_imp$svd^2 / sum(model_lda_imp$svd^2)
ev_imp_pct <- round(ev_imp * 100, 2)
ev_imp_tbl <- data.frame(
Fungsi = paste0("LD", 1:3),
Proporsi = paste0(ev_imp_pct, "%"),
Kumulatif = paste0(round(cumsum(ev_imp_pct), 2), "%")
)
print(ev_imp_tbl)## Fungsi Proporsi Kumulatif
## 1 LD1 61.41% 61.41%
## 2 LD2 20.44% 81.85%
## 3 LD3 18.15% 100%
##
## LD1 menjelaskan 61.41% pemisahan antar kelas.
## LD1 + LD2 menjelaskan 81.85% pemisahan antar kelas.
## LD1 LD2 LD3
## RNA12 -1.3269 -1.5644 -0.1409
## RNA_EOT 0.8488 1.8732 -2.0705
## BMI 0.6271 -0.1238 -0.1345
## RNA_EF 0.3648 0.2358 2.3024
## Age 0.3294 -0.0401 0.1820
## Fever 0.2480 0.3084 0.0207
## RNA4 0.2495 0.2119 0.0227
## WBC -0.1974 -0.0543 -0.0721
## Plat 0.2129 -0.2637 -0.0193
## AST1 0.2078 0.2270 0.0527
## EpigastricPain -0.2024 0.2537 -0.4786
## ALT12 -0.1830 -0.0482 -0.2508
## Gender -0.2125 0.2769 0.4836
## Grading 0.1971 0.1353 -0.1759
## ALT_after24w -0.1713 -0.1213 0.2659
## ALT4 0.1707 0.1726 0.2701
## ALT36 -0.1758 0.0130 -0.0233
pred_train_lda_imp <- predict(model_lda_imp, data_train_sel)
cm_train_lda_imp <- confusionMatrix(pred_train_lda_imp$class,
data_train_sel$Staging)
print(cm_train_lda_imp$table)## Reference
## Prediction Staging1 Staging2 Staging3 Staging4
## Staging1 54 33 42 39
## Staging2 52 87 58 47
## Staging3 50 48 70 52
## Staging4 80 65 79 116
cat(sprintf("\nAkurasi LDA Improved (train): %.2f%%\n",
cm_train_lda_imp$overall["Accuracy"] * 100))##
## Akurasi LDA Improved (train): 33.64%
pred_test_lda_imp <- predict(model_lda_imp, data_test_sel)
cm_test_lda_imp <- confusionMatrix(pred_test_lda_imp$class,
data_test_sel$Staging)
print(cm_test_lda_imp$table)## Reference
## Prediction Staging1 Staging2 Staging3 Staging4
## Staging1 14 23 17 13
## Staging2 22 27 19 26
## Staging3 23 21 29 26
## Staging4 41 28 41 43
##
## Akurasi LDA Improved (test): 27.36%
## Sensitivity Specificity Precision F1
## Class: Staging1 0.1400000 0.8306709 0.2089552 0.1676647
## Class: Staging2 0.2727273 0.7866242 0.2872340 0.2797927
## Class: Staging3 0.2735849 0.7719870 0.2929293 0.2829268
## Class: Staging4 0.3981481 0.6393443 0.2810458 0.3295019
## Call:
## qda(Staging ~ ., data = data_train_sel)
##
## Prior probabilities of groups:
## Staging1 Staging2 Staging3 Staging4
## 0.2427984 0.2397119 0.2561728 0.2613169
##
## Group means:
## RNA12 RNA_EOT BMI RNA_EF Age
## Staging1 0.08153927 0.10797836 -0.006907335 0.08427587 -0.04116680
## Staging2 -0.11486932 -0.11166845 0.192871439 -0.12113648 0.07509045
## Staging3 0.02064937 0.01696758 -0.008431122 0.03400137 0.03271559
## Staging4 0.01442030 -0.01599513 -0.147389230 -0.01165920 -0.08745112
## Fever RNA4 WBC Plat AST1
## Staging1 0.07913214 0.04790800 -0.01277460 -0.047917098 0.011036204
## Staging2 0.03333479 0.06129037 -0.04539041 0.099416138 0.002823359
## Staging3 0.03723067 0.03334706 -0.03541789 -0.008423181 0.003214609
## Staging4 -0.07042553 -0.06440774 0.04323104 -0.042663762 -0.098137913
## EpigastricPain ALT12 Gender Grading ALT_after24w
## Staging1 0.08525000 -0.004151124 0.01246228 0.04341757 -0.03186623
## Staging2 -0.08087695 -0.055510820 -0.09492773 0.04854715 -0.04124403
## Staging3 -0.09224918 -0.075754800 0.10526104 -0.01564098 0.03587153
## Staging4 0.03928890 0.056175715 0.02880856 -0.02982198 0.05074444
## ALT4 ALT36
## Staging1 -0.0005127219 0.006093398
## Staging2 0.0069159749 -0.039754574
## Staging3 0.0570109767 -0.005716282
## Staging4 -0.0782862189 0.045380910
pred_train_qda_imp <- predict(model_qda_imp, data_train_sel)
cm_train_qda_imp <- confusionMatrix(pred_train_qda_imp$class,
data_train_sel$Staging)
print(cm_train_qda_imp$table)## Reference
## Prediction Staging1 Staging2 Staging3 Staging4
## Staging1 87 29 34 41
## Staging2 38 102 33 29
## Staging3 53 56 132 42
## Staging4 58 46 50 142
cat(sprintf("\nAkurasi QDA Improved (train): %.2f%%\n",
cm_train_qda_imp$overall["Accuracy"] * 100))##
## Akurasi QDA Improved (train): 47.63%
pred_test_qda_imp <- predict(model_qda_imp, data_test_sel)
cm_test_qda_imp <- confusionMatrix(pred_test_qda_imp$class,
data_test_sel$Staging)
print(cm_test_qda_imp$table)## Reference
## Prediction Staging1 Staging2 Staging3 Staging4
## Staging1 22 31 26 16
## Staging2 23 19 24 25
## Staging3 32 27 23 26
## Staging4 23 22 33 41
##
## Akurasi QDA Improved (test): 25.42%
## Sensitivity Specificity Precision F1
## Class: Staging1 0.2200000 0.7667732 0.2315789 0.2256410
## Class: Staging2 0.1919192 0.7707006 0.2087912 0.2000000
## Class: Staging3 0.2169811 0.7231270 0.2129630 0.2149533
## Class: Staging4 0.3796296 0.7442623 0.3445378 0.3612335
# ---- Bangun ulang model awal (28 fitur, tanpa log-transform) ----
df_baseline <- df
features_base <- setdiff(colnames(df_baseline), "Staging")
df_baseline[, features_base] <- as.data.frame(scale(df_baseline[, features_base]))
set.seed(123)
idx_base <- createDataPartition(df_baseline$Staging, p = 0.7, list = FALSE)
train_base <- df_baseline[ idx_base, ]
test_base <- df_baseline[-idx_base, ]
model_lda_base <- lda(Staging ~ ., data = train_base)
model_qda_base <- qda(Staging ~ ., data = train_base)
acc_lda_train_base <- confusionMatrix(
predict(model_lda_base, train_base)$class, train_base$Staging
)$overall["Accuracy"] * 100
acc_lda_test_base <- confusionMatrix(
predict(model_lda_base, test_base)$class, test_base$Staging
)$overall["Accuracy"] * 100
acc_qda_train_base <- confusionMatrix(
predict(model_qda_base, train_base)$class, train_base$Staging
)$overall["Accuracy"] * 100
acc_qda_test_base <- confusionMatrix(
predict(model_qda_base, test_base)$class, test_base$Staging
)$overall["Accuracy"] * 100
# ---- Susun tabel perbandingan ----
comparison_tbl <- data.frame(
Model = c("LDA Baseline", "LDA Improved",
"QDA Baseline", "QDA Improved"),
Fitur = c(28, n_top, 28, n_top),
Log_Transform = c("Tidak", "Ya", "Tidak", "Ya"),
Akurasi_Train = round(c(acc_lda_train_base,
cm_train_lda_imp$overall["Accuracy"] * 100,
acc_qda_train_base,
cm_train_qda_imp$overall["Accuracy"] * 100), 2),
Akurasi_Test = round(c(acc_lda_test_base,
cm_test_lda_imp$overall["Accuracy"] * 100,
acc_qda_test_base,
cm_test_qda_imp$overall["Accuracy"] * 100), 2)
)
comparison_tbl$Gap_Overfit <- round(
comparison_tbl$Akurasi_Train - comparison_tbl$Akurasi_Test, 2
)
comparison_tbl$Delta_Test <- round(
comparison_tbl$Akurasi_Test - c(acc_lda_test_base,
acc_lda_test_base,
acc_qda_test_base,
acc_qda_test_base), 2
)
print(comparison_tbl)## Model Fitur Log_Transform Akurasi_Train Akurasi_Test Gap_Overfit
## 1 LDA Baseline 28 Tidak 35.49 25.42 10.07
## 2 LDA Improved 17 Ya 33.64 27.36 6.28
## 3 QDA Baseline 28 Tidak 63.37 26.39 36.98
## 4 QDA Improved 17 Ya 47.63 25.42 22.21
## Delta_Test
## 1 0.00
## 2 1.94
## 3 0.00
## 4 -0.97
##
## === Peningkatan Akurasi Test ===
cat(sprintf("LDA: %.2f%% → %.2f%% (Δ = %+.2f%%)\n",
acc_lda_test_base,
cm_test_lda_imp$overall["Accuracy"] * 100,
cm_test_lda_imp$overall["Accuracy"] * 100 - acc_lda_test_base))## LDA: 25.42% → 27.36% (Δ = +1.94%)
cat(sprintf("QDA: %.2f%% → %.2f%% (Δ = %+.2f%%)\n",
acc_qda_test_base,
cm_test_qda_imp$overall["Accuracy"] * 100,
cm_test_qda_imp$overall["Accuracy"] * 100 - acc_qda_test_base))## QDA: 26.39% → 25.42% (Δ = -0.97%)
lda_scores_imp <- as.data.frame(pred_train_lda_imp$x)
lda_scores_imp$Staging <- data_train_sel$Staging
ggplot(lda_scores_imp, aes(x = LD1, y = LD2, color = Staging)) +
geom_point(alpha = 0.5, size = 1.5) +
stat_ellipse(level = 0.68, linewidth = 0.8) +
labs(
title = "Proyeksi LDA Improved pada LD1 dan LD2 (Data Latih)",
subtitle = sprintf("LD1: %.2f%% | LD2: %.2f%% | LD3: %.2f%%",
ev_imp_pct[1], ev_imp_pct[2], ev_imp_pct[3]),
x = sprintf("LD1 (%.2f%%)", ev_imp_pct[1]),
y = sprintf("LD2 (%.2f%%)", ev_imp_pct[2]),
color = "Staging"
) +
scale_color_manual(values = c("#e74c3c", "#3498db", "#2ecc71", "#f39c12")) +
theme_bw(base_size = 11) +
theme(plot.title = element_text(face = "bold"))cm_lda_imp_df <- as.data.frame(cm_test_lda_imp$table)
colnames(cm_lda_imp_df) <- c("Prediksi", "Aktual", "Freq")
ggplot(cm_lda_imp_df, aes(x = Prediksi, y = Aktual, fill = Freq)) +
geom_tile(color = "white") +
geom_text(aes(label = Freq), size = 4.5, fontface = "bold") +
scale_fill_gradient(low = "#dceefb", high = "#2980b9") +
labs(
title = sprintf("Confusion Matrix LDA Improved (Data Uji) — Akurasi: %.2f%%",
cm_test_lda_imp$overall["Accuracy"] * 100),
x = "Prediksi", y = "Aktual"
) +
theme_bw(base_size = 11) +
theme(plot.title = element_text(face = "bold"))cm_qda_imp_df <- as.data.frame(cm_test_qda_imp$table)
colnames(cm_qda_imp_df) <- c("Prediksi", "Aktual", "Freq")
ggplot(cm_qda_imp_df, aes(x = Prediksi, y = Aktual, fill = Freq)) +
geom_tile(color = "white") +
geom_text(aes(label = Freq), size = 4.5, fontface = "bold") +
scale_fill_gradient(low = "#fdebd0", high = "#e67e22") +
labs(
title = sprintf("Confusion Matrix QDA Improved (Data Uji) — Akurasi: %.2f%%",
cm_test_qda_imp$overall["Accuracy"] * 100),
x = "Prediksi", y = "Aktual"
) +
theme_bw(base_size = 11) +
theme(plot.title = element_text(face = "bold"))acc_comp_df <- data.frame(
Model = rep(c("LDA", "QDA"), each = 4),
Versi = rep(c("Baseline\nTrain", "Baseline\nTest",
"Improved\nTrain", "Improved\nTest"), 2),
Akurasi = c(
round(acc_lda_train_base, 2),
round(acc_lda_test_base, 2),
round(cm_train_lda_imp$overall["Accuracy"] * 100, 2),
round(cm_test_lda_imp$overall["Accuracy"] * 100, 2),
round(acc_qda_train_base, 2),
round(acc_qda_test_base, 2),
round(cm_train_qda_imp$overall["Accuracy"] * 100, 2),
round(cm_test_qda_imp$overall["Accuracy"] * 100, 2)
),
Tipe = rep(c("Train", "Test", "Train", "Test"), 2)
)
acc_comp_df$Versi <- factor(
acc_comp_df$Versi,
levels = c("Baseline\nTrain", "Baseline\nTest",
"Improved\nTrain", "Improved\nTest")
)
ggplot(acc_comp_df, aes(x = Versi, y = Akurasi,
fill = interaction(Model, Tipe))) +
geom_bar(stat = "identity", position = position_dodge(width = 0.7),
width = 0.6) +
geom_text(aes(label = paste0(Akurasi, "%")),
position = position_dodge(width = 0.7),
vjust = -0.4, size = 3) +
scale_fill_manual(
values = c(
"LDA.Train" = "#2c3e50", "LDA.Test" = "#3498db",
"QDA.Train" = "#7d3c98", "QDA.Test" = "#a569bd"
),
labels = c(
"LDA.Train" = "LDA Train", "LDA.Test" = "LDA Test",
"QDA.Train" = "QDA Train", "QDA.Test" = "QDA Test"
)
) +
ylim(0, 100) +
labs(
title = "Perbandingan Akurasi: Baseline vs Improved",
subtitle = paste0("Improved = Log-Transform + Seleksi Top-",
n_top, " Fitur"),
x = "",
y = "Akurasi (%)",
fill = "Model & Data"
) +
theme_bw(base_size = 11) +
theme(
plot.title = element_text(face = "bold"),
legend.position = "right"
)# Skor LD1 baseline (data uji)
lda_test_base <- as.data.frame(predict(model_lda_base, test_base)$x)
lda_test_base$Staging <- test_base$Staging
lda_test_base$Versi <- "Baseline (28 fitur)"
# Skor LD1 improved (data uji)
lda_test_imp <- as.data.frame(pred_test_lda_imp$x)
lda_test_imp$Staging <- data_test_sel$Staging
lda_test_imp$Versi <- paste0("Improved (Top-", n_top, " + Log)")
ld1_compare <- bind_rows(lda_test_base, lda_test_imp)
ggplot(ld1_compare, aes(x = LD1, fill = Staging)) +
geom_density(alpha = 0.45) +
facet_wrap(~ Versi, ncol = 2) +
scale_fill_manual(values = c("#e74c3c", "#3498db", "#2ecc71", "#f39c12")) +
labs(
title = "Distribusi Skor LD1 per Staging (Data Uji): Baseline vs Improved",
x = "Skor LD1",
y = "Densitas",
fill = "Staging"
) +
theme_bw(base_size = 10) +
theme(
plot.title = element_text(face = "bold"),
strip.text = element_text(face = "bold"),
legend.position = "bottom"
)scaling_imp_df <- as.data.frame(model_lda_imp$scaling)
scaling_imp_df$Variabel <- rownames(scaling_imp_df)
scaling_imp_df <- scaling_imp_df %>%
arrange(abs(LD1)) %>%
mutate(
Variabel = factor(Variabel, levels = Variabel),
Warna = ifelse(LD1 >= 0, "Positif", "Negatif")
)
ggplot(scaling_imp_df, aes(x = LD1, y = Variabel, fill = Warna)) +
geom_bar(stat = "identity") +
geom_vline(xintercept = 0, linewidth = 0.6) +
scale_fill_manual(values = c("Positif" = "#3498db", "Negatif" = "#e74c3c")) +
labs(
title = "Koefisien LD1 — Model Improved",
subtitle = paste0("Hanya top-", n_top, " fitur terpilih"),
x = "Koefisien LD1", y = "Variabel", fill = "Arah"
) +
theme_bw(base_size = 10) +
theme(
plot.title = element_text(face = "bold"),
axis.text.y = element_text(size = 9)
)Partition plot menggunakan 2 fitur teratas dari model improved. Proses ini membutuhkan waktu komputasi lebih lama.
# Ambil 2 fitur teratas dari top_features
feat1 <- top_features[1]
feat2 <- top_features[2]
cat("Partition plot menggunakan:", feat1, "vs", feat2, "\n")## Partition plot menggunakan: RNA12 vs RNA_EOT
partimat(
as.formula(paste("Staging ~", feat1, "+", feat2)),
data = data_train_sel,
method = "lda",
main = paste("Partition Plot LDA Improved:", feat1, "vs", feat2)
)partimat(
as.formula(paste("Staging ~", feat1, "+", feat2)),
data = data_train_sel,
method = "qda",
main = paste("Partition Plot QDA Improved:", feat1, "vs", feat2)
)Dua strategi perbaikan diterapkan secara bersamaan pada model LDA dan QDA:
Solusi 1 — Seleksi Fitur: Dari 28 variabel prediktor, dipilih top-17 berdasarkan nilai absolut koefisien LD1. Variabel yang tidak berpengaruh signifikan dieliminasi untuk mengurangi noise pada fungsi diskriminan.
Solusi 3 — Transformasi Log: Variabel RNA
(RNA_Base, RNA4, RNA12,
RNA_EOT, RNA_EF) dan ALT
(ALT1–ALT48, AST1) yang memiliki
distribusi sangat miring (right-skewed) ditransformasi
menggunakan log1p(). Hasilnya distribusi mendekati simetris
yang lebih sesuai dengan asumsi normalitas LDA.
Hasil Perbaikan:
## Model Akurasi_Test Gap_Overfit Peningkatan
## 1 LDA Baseline 25.42 10.07 -
## 2 LDA Improved 27.36 6.28 +1.94%
## 3 QDA Baseline 26.39 36.98 -
## 4 QDA Improved 25.42 22.21 +-0.97%
Penurunan Gap Overfit pada QDA menunjukkan bahwa seleksi fitur berhasil mengurangi kompleksitas estimasi matriks kovarians per kelas, sehingga model lebih mampu melakukan generalisasi pada data baru.
## === SESSION INFO ===
## R version 4.4.1 (2024-06-14 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
##
## Matrix products: default
##
##
## locale:
## [1] LC_COLLATE=Indonesian_Indonesia.utf8 LC_CTYPE=Indonesian_Indonesia.utf8
## [3] LC_MONETARY=Indonesian_Indonesia.utf8 LC_NUMERIC=C
## [5] LC_TIME=Indonesian_Indonesia.utf8
##
## time zone: Asia/Jakarta
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] gridExtra_2.3 reshape2_1.4.5 klaR_1.7-4 MASS_7.3-60.2
## [5] caret_7.0-1 lattice_0.22-6 lubridate_1.9.4 forcats_1.0.0
## [9] stringr_1.5.1 dplyr_1.1.4 purrr_1.0.2 readr_2.1.5
## [13] tidyr_1.3.1 tibble_3.2.1 ggplot2_4.0.2 tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 timeDate_4052.112 farver_2.1.2
## [4] S7_0.2.1 fastmap_1.2.0 combinat_0.0-8
## [7] promises_1.3.3 pROC_1.19.0.1 labelled_2.16.0
## [10] digest_0.6.37 rpart_4.1.23 mime_0.12
## [13] timechange_0.3.0 lifecycle_1.0.4 survival_3.6-4
## [16] magrittr_2.0.3 compiler_4.4.1 rlang_1.1.4
## [19] sass_0.4.9 tools_4.4.1 utf8_1.2.4
## [22] yaml_2.3.10 data.table_1.16.4 knitr_1.49
## [25] labeling_0.4.3 plyr_1.8.9 RColorBrewer_1.1-3
## [28] miniUI_0.1.2 withr_3.0.1 nnet_7.3-19
## [31] grid_4.4.1 stats4_4.4.1 fansi_1.0.6
## [34] e1071_1.7-17 xtable_1.8-4 future_1.69.0
## [37] globals_0.19.0 scales_1.4.0 iterators_1.0.14
## [40] cli_3.6.3 rmarkdown_2.30 generics_0.1.3
## [43] rstudioapi_0.17.1 future.apply_1.20.2 tzdb_0.4.0
## [46] proxy_0.4-29 cachem_1.1.0 splines_4.4.1
## [49] parallel_4.4.1 vctrs_0.6.5 hardhat_1.4.2
## [52] Matrix_1.7-0 jsonlite_1.8.9 hms_1.1.3
## [55] listenv_0.10.0 foreach_1.5.2 gower_1.0.2
## [58] jquerylib_0.1.4 recipes_1.3.1 glue_1.7.0
## [61] parallelly_1.46.1 codetools_0.2-20 stringi_1.8.4
## [64] gtable_0.3.6 later_1.4.2 questionr_0.8.2
## [67] pillar_1.9.0 htmltools_0.5.8.1 ipred_0.9-15
## [70] lava_1.8.2 R6_2.5.1 evaluate_1.0.1
## [73] shiny_1.10.0 haven_2.5.4 highr_0.11
## [76] httpuv_1.6.16 bslib_0.8.0 class_7.3-22
## [79] Rcpp_1.0.13 nlme_3.1-164 prodlim_2025.04.28
## [82] xfun_0.49 pkgconfig_2.0.3 ModelMetrics_1.2.2.2