1 Pendahuluan

1.1 Latar Belakang

Analisis sebelumnya menghasilkan akurasi LDA dan QDA yang sangat rendah pada data uji (~25%), hampir setara dengan random guessing untuk 4 kelas. Setelah dievaluasi, terdapat dua penyebab utama:

  1. Fitur tidak diskriminatif secara merata — dari 28 variabel, sebagian besar tidak berkontribusi signifikan dalam memisahkan kelas Staging. Kehadiran variabel “noise” justru menurunkan kinerja model.
  2. Distribusi variabel RNA dan ALT sangat miring (skewed) — variabel seperti RNA_Base, RNA4, RNA12, RNA_EOT, dan RNA_EF memiliki rentang nilai ratusan ribu hingga jutaan. Distribusi yang jauh dari normal melanggar asumsi LDA (normalitas multivariat) sehingga fungsi diskriminan yang dihasilkan tidak optimal.

1.2 Strategi Perbaikan

Dua strategi diterapkan untuk meningkatkan akurasi:

Solusi Strategi Mekanisme
1 Seleksi Fitur Pilih top-N variabel berdasarkan nilai absolut koefisien LD1 dari model awal. Mengurangi noise dari variabel tidak relevan.
3 Transformasi Log Terapkan log1p() pada variabel RNA dan ALT sebelum standardisasi untuk memperbaiki normalitas dan homogenitas kovarians.

1.3 Tujuan

  1. Menerapkan transformasi log pada variabel RNA dan ALT.
  2. Melakukan seleksi fitur berdasarkan koefisien fungsi diskriminan awal.
  3. Membandingkan akurasi model LDA dan QDA sebelum dan sesudah perbaikan.

2 Load Library

library(tidyverse)
library(caret)
library(MASS)       # lda(), qda()
library(klaR)       # partimat()
library(ggplot2)
library(dplyr)
library(reshape2)
library(gridExtra)

3 Load & Pra-Pemrosesan Data

3.1 Load Dataset

path <- "D:/02. UNS/01. Kuliah/Sem 4/PDM/Diskriminan/HCV-Egy-Data.csv"
df <- read.csv(path, check.names = TRUE)

# Hapus kolom nama kosong & bersihkan spasi
df <- df[, nzchar(trimws(colnames(df)))]
colnames(df) <- trimws(colnames(df))

# Rename kolom
colnames(df)[colnames(df) == "Baselinehistological.staging"]    <- "Staging"
colnames(df)[colnames(df) == "Fatigue...generalized.bone.ache"] <- "Fatigue"
colnames(df)[colnames(df) == "Nausea.Vomting"]                  <- "Nausea"
colnames(df)[colnames(df) == "Epigastric.pain"]                 <- "EpigastricPain"
colnames(df)[colnames(df) == "Baseline.histological.Grading"]   <- "Grading"
colnames(df)[colnames(df) == "ALT.after.24.w"]                  <- "ALT_after24w"
colnames(df)[colnames(df) == "RNA.Base"]                        <- "RNA_Base"
colnames(df)[colnames(df) == "RNA.EOT"]                         <- "RNA_EOT"
colnames(df)[colnames(df) == "RNA.EF"]                          <- "RNA_EF"
colnames(df)[colnames(df) == "AST.1"]                           <- "AST1"
colnames(df)[colnames(df) == "ALT.1"]                           <- "ALT1"
colnames(df)[colnames(df) == "ALT4"]                            <- "ALT4"
colnames(df)[colnames(df) == "ALT.12"]                          <- "ALT12"
colnames(df)[colnames(df) == "ALT.24"]                          <- "ALT24"
colnames(df)[colnames(df) == "ALT.36"]                          <- "ALT36"
colnames(df)[colnames(df) == "ALT.48"]                          <- "ALT48"
colnames(df)[colnames(df) == "RNA.4"]                           <- "RNA4"
colnames(df)[colnames(df) == "RNA.12"]                          <- "RNA12"

df$Staging <- as.factor(df$Staging)
levels(df$Staging) <- c("Staging1", "Staging2", "Staging3", "Staging4")

features <- setdiff(colnames(df), "Staging")
cat("Total fitur prediktor:", length(features), "\n")
## Total fitur prediktor: 28
cat("Distribusi kelas:\n")
## Distribusi kelas:
print(table(df$Staging))
## 
## Staging1 Staging2 Staging3 Staging4 
##      336      332      355      362

3.2 Solusi 3: Transformasi Log pada Variabel RNA dan ALT

Variabel RNA memiliki skala ratusan ribu hingga jutaan, sementara ALT berada di skala puluhan hingga ratusan. Distribusi seperti ini sangat miring ke kanan (right-skewed) dan melanggar asumsi normalitas LDA. Transformasi log1p(x) (yaitu \(\ln(x+1)\)) digunakan agar nilai nol tidak menghasilkan -Inf.

# Variabel RNA dan ALT yang akan ditransformasi
vars_log <- c("RNA_Base", "RNA4", "RNA12", "RNA_EOT", "RNA_EF",
              "ALT1", "ALT4", "ALT12", "ALT24", "ALT36", "ALT48",
              "ALT_after24w", "AST1")

# Cek distribusi SEBELUM transformasi (contoh RNA_Base)
cat("=== Ringkasan RNA_Base SEBELUM transformasi ===\n")
## === Ringkasan RNA_Base SEBELUM transformasi ===
print(summary(df$RNA_Base))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      11  269253  593103  590951  886791 1201086
cat("Skewness RNA_Base (approx):",
    round((mean(df$RNA_Base) - median(df$RNA_Base)) / sd(df$RNA_Base), 3), "\n\n")
## Skewness RNA_Base (approx): -0.006
# Terapkan log1p
df_log <- df
df_log[, vars_log] <- lapply(df[, vars_log], log1p)

# Cek distribusi SETELAH transformasi
cat("=== Ringkasan RNA_Base SETELAH log1p ===\n")
## === Ringkasan RNA_Base SETELAH log1p ===
print(summary(df_log$RNA_Base))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.485  12.503  13.293  12.953  13.695  13.999
cat("Skewness RNA_Base (approx):",
    round((mean(df_log$RNA_Base) - median(df_log$RNA_Base)) / sd(df_log$RNA_Base), 3), "\n")
## Skewness RNA_Base (approx): -0.32

3.3 Visualisasi Efek Transformasi Log

par(mfrow = c(1, 2))

# Sebelum
hist(df$RNA_Base,
     main  = "RNA_Base — Sebelum Transformasi",
     xlab  = "Nilai", col = "#d5e8f5", border = "white",
     breaks = 40)

# Sesudah
hist(df_log$RNA_Base,
     main  = "RNA_Base — Setelah log1p",
     xlab  = "log1p(Nilai)", col = "#a8d5a2", border = "white",
     breaks = 40)

par(mfrow = c(1, 1))

3.4 Standardisasi Z-Score

Dilakukan pada data yang sudah ditransformasi.

df_scaled <- df_log
df_scaled[, features] <- as.data.frame(scale(df_log[, features]))

cat("Standardisasi selesai. Contoh mean & sd setelah scaling:\n")
## Standardisasi selesai. Contoh mean & sd setelah scaling:
round(apply(df_scaled[, features[1:5]], 2, function(x) c(mean=mean(x), sd=sd(x))), 4)
##      Age Gender BMI Fever Nausea
## mean   0      0   0     0      0
## sd     1      1   1     1      1

4 Split Data Train & Test (70:30 Stratified)

set.seed(123)
train_idx  <- createDataPartition(df_scaled$Staging, p = 0.7, list = FALSE)
data_train <- df_scaled[ train_idx, ]
data_test  <- df_scaled[-train_idx, ]

cat("Data latih :", nrow(data_train), "observasi\n")
## Data latih : 972 observasi
cat("Data uji   :", nrow(data_test),  "observasi\n")
## Data uji   : 413 observasi
cat("\nDistribusi kelas — Data Latih:\n")
## 
## Distribusi kelas — Data Latih:
print(table(data_train$Staging))
## 
## Staging1 Staging2 Staging3 Staging4 
##      236      233      249      254
cat("\nDistribusi kelas — Data Uji:\n")
## 
## Distribusi kelas — Data Uji:
print(table(data_test$Staging))
## 
## Staging1 Staging2 Staging3 Staging4 
##      100       99      106      108

5 Solusi 1: Seleksi Fitur Berbasis Koefisien LD1

5.1 Langkah 1 — Latih LDA Awal dengan Semua Fitur (Post Log-Transform)

Model LDA awal dilatih menggunakan seluruh 28 fitur (yang sudah ditransformasi dan distandardisasi) untuk mendapatkan koefisien LD1 sebagai dasar seleksi.

model_lda_full <- lda(Staging ~ ., data = data_train)

# Ambil koefisien LD1, urutkan berdasarkan nilai absolut
coef_ld1 <- abs(model_lda_full$scaling[, "LD1"])
coef_ld1_sorted <- sort(coef_ld1, decreasing = TRUE)

cat("=== Peringkat Variabel berdasarkan |Koefisien LD1| ===\n")
## === Peringkat Variabel berdasarkan |Koefisien LD1| ===
print(round(coef_ld1_sorted, 4))
##          RNA12        RNA_EOT            BMI         RNA_EF            Age 
##         1.3414         0.8517         0.6046         0.4005         0.3070 
##          Fever           RNA4            WBC           Plat           AST1 
##         0.2326         0.2113         0.2061         0.2053         0.2037 
## EpigastricPain          ALT12         Gender        Grading   ALT_after24w 
##         0.2021         0.1930         0.1898         0.1728         0.1685 
##           ALT4          ALT36       Diarrhea        Fatigue       RNA_Base 
##         0.1658         0.1638         0.1231         0.1218         0.1168 
##         Nausea           ALT1       Headache          ALT24            HGB 
##         0.1163         0.1006         0.0880         0.0850         0.0846 
##          ALT48       Jaundice            RBC 
##         0.0677         0.0348         0.0341

5.2 Langkah 2 — Pilih Top-10 Fitur Terpenting

Top-10 fitur dipilih berdasarkan nilai absolut koefisien LD1 tertinggi. Ini merupakan variabel yang paling berkontribusi dalam memisahkan kelas Staging.

# Pilih top-10 variabel
n_top <- 17
top_features <- names(coef_ld1_sorted)[1:n_top]

cat("=== Top", n_top, "Fitur Terpilih ===\n")
## === Top 17 Fitur Terpilih ===
print(top_features)
##  [1] "RNA12"          "RNA_EOT"        "BMI"            "RNA_EF"        
##  [5] "Age"            "Fever"          "RNA4"           "WBC"           
##  [9] "Plat"           "AST1"           "EpigastricPain" "ALT12"         
## [13] "Gender"         "Grading"        "ALT_after24w"   "ALT4"          
## [17] "ALT36"
cat("\nNilai |Koefisien LD1|:\n")
## 
## Nilai |Koefisien LD1|:
print(round(coef_ld1_sorted[1:n_top], 4))
##          RNA12        RNA_EOT            BMI         RNA_EF            Age 
##         1.3414         0.8517         0.6046         0.4005         0.3070 
##          Fever           RNA4            WBC           Plat           AST1 
##         0.2326         0.2113         0.2061         0.2053         0.2037 
## EpigastricPain          ALT12         Gender        Grading   ALT_after24w 
##         0.2021         0.1930         0.1898         0.1728         0.1685 
##           ALT4          ALT36 
##         0.1658         0.1638

5.3 Visualisasi Seleksi Fitur

# Semua variabel + tandai yang terpilih
coef_df <- data.frame(
  Variabel = names(coef_ld1_sorted),
  AbsCoef  = as.numeric(coef_ld1_sorted),
  Terpilih = names(coef_ld1_sorted) %in% top_features
)
coef_df$Variabel <- factor(coef_df$Variabel,
                           levels = rev(coef_df$Variabel))

ggplot(coef_df, aes(x = AbsCoef, y = Variabel,
                    fill = Terpilih, alpha = Terpilih)) +
  geom_bar(stat = "identity") +
  geom_vline(
    xintercept = coef_ld1_sorted[n_top],
    linetype   = "dashed", color = "#e74c3c", linewidth = 0.8
  ) +
  scale_fill_manual(
    values = c("TRUE" = "#2980b9", "FALSE" = "#bdc3c7"),
    labels = c("TRUE" = "Terpilih", "FALSE" = "Tidak terpilih")
  ) +
  scale_alpha_manual(values = c("TRUE" = 1, "FALSE" = 0.5), guide = "none") +
  annotate("text", x = coef_ld1_sorted[n_top] + 0.01,
           y = 1, label = paste0("Threshold top-", n_top),
           color = "#e74c3c", size = 3, hjust = 0) +
  labs(
    title    = paste0("Seleksi Fitur: Top-", n_top,
                      " Berdasarkan |Koefisien LD1|"),
    subtitle = "Garis merah putus-putus = batas threshold seleksi",
    x        = "|Koefisien LD1|",
    y        = "Variabel",
    fill     = "Status"
  ) +
  theme_bw(base_size = 10) +
  theme(
    plot.title  = element_text(face = "bold"),
    axis.text.y = element_text(size = 8)
  )

5.4 Langkah 3 — Buat Dataset dengan Fitur Terpilih

# Subset data hanya fitur terpilih + target
data_train_sel <- data_train[, c(top_features, "Staging")]
data_test_sel  <- data_test[,  c(top_features, "Staging")]

cat("Dimensi data latih (fitur terpilih):", dim(data_train_sel), "\n")
## Dimensi data latih (fitur terpilih): 972 18
cat("Dimensi data uji  (fitur terpilih):", dim(data_test_sel),  "\n")
## Dimensi data uji  (fitur terpilih): 413 18
cat("Fitur yang digunakan:\n")
## Fitur yang digunakan:
print(top_features)
##  [1] "RNA12"          "RNA_EOT"        "BMI"            "RNA_EF"        
##  [5] "Age"            "Fever"          "RNA4"           "WBC"           
##  [9] "Plat"           "AST1"           "EpigastricPain" "ALT12"         
## [13] "Gender"         "Grading"        "ALT_after24w"   "ALT4"          
## [17] "ALT36"

6 LDA dengan Fitur Terpilih + Log-Transform

6.1 Melatih Model LDA (Improved)

model_lda_imp <- lda(Staging ~ ., data = data_train_sel)
print(model_lda_imp)
## Call:
## lda(Staging ~ ., data = data_train_sel)
## 
## Prior probabilities of groups:
##  Staging1  Staging2  Staging3  Staging4 
## 0.2427984 0.2397119 0.2561728 0.2613169 
## 
## Group means:
##                RNA12     RNA_EOT          BMI      RNA_EF         Age
## Staging1  0.08153927  0.10797836 -0.006907335  0.08427587 -0.04116680
## Staging2 -0.11486932 -0.11166845  0.192871439 -0.12113648  0.07509045
## Staging3  0.02064937  0.01696758 -0.008431122  0.03400137  0.03271559
## Staging4  0.01442030 -0.01599513 -0.147389230 -0.01165920 -0.08745112
##                Fever        RNA4         WBC         Plat         AST1
## Staging1  0.07913214  0.04790800 -0.01277460 -0.047917098  0.011036204
## Staging2  0.03333479  0.06129037 -0.04539041  0.099416138  0.002823359
## Staging3  0.03723067  0.03334706 -0.03541789 -0.008423181  0.003214609
## Staging4 -0.07042553 -0.06440774  0.04323104 -0.042663762 -0.098137913
##          EpigastricPain        ALT12      Gender     Grading ALT_after24w
## Staging1     0.08525000 -0.004151124  0.01246228  0.04341757  -0.03186623
## Staging2    -0.08087695 -0.055510820 -0.09492773  0.04854715  -0.04124403
## Staging3    -0.09224918 -0.075754800  0.10526104 -0.01564098   0.03587153
## Staging4     0.03928890  0.056175715  0.02880856 -0.02982198   0.05074444
##                   ALT4        ALT36
## Staging1 -0.0005127219  0.006093398
## Staging2  0.0069159749 -0.039754574
## Staging3  0.0570109767 -0.005716282
## Staging4 -0.0782862189  0.045380910
## 
## Coefficients of linear discriminants:
##                       LD1         LD2         LD3
## RNA12          -1.3269398 -1.56440151 -0.14090482
## RNA_EOT         0.8487747  1.87324935 -2.07048683
## BMI             0.6271381 -0.12379509 -0.13454469
## RNA_EF          0.3647851  0.23581020  2.30237834
## Age             0.3294103 -0.04013388  0.18198921
## Fever           0.2480340  0.30837301  0.02073241
## RNA4            0.2494758  0.21186201  0.02267521
## WBC            -0.1974222 -0.05431645 -0.07213922
## Plat            0.2128685 -0.26368813 -0.01933234
## AST1            0.2078274  0.22703814  0.05265109
## EpigastricPain -0.2024045  0.25374011 -0.47855581
## ALT12          -0.1829528 -0.04823834 -0.25078101
## Gender         -0.2124784  0.27686589  0.48364707
## Grading         0.1971010  0.13529508 -0.17586702
## ALT_after24w   -0.1712792 -0.12131325  0.26589266
## ALT4            0.1706905  0.17264606  0.27010664
## ALT36          -0.1758468  0.01299862 -0.02326335
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.6141 0.2044 0.1815

6.2 Proporsi Trace

ev_imp     <- model_lda_imp$svd^2 / sum(model_lda_imp$svd^2)
ev_imp_pct <- round(ev_imp * 100, 2)

ev_imp_tbl <- data.frame(
  Fungsi    = paste0("LD", 1:3),
  Proporsi  = paste0(ev_imp_pct, "%"),
  Kumulatif = paste0(round(cumsum(ev_imp_pct), 2), "%")
)
print(ev_imp_tbl)
##   Fungsi Proporsi Kumulatif
## 1    LD1   61.41%    61.41%
## 2    LD2   20.44%    81.85%
## 3    LD3   18.15%      100%
cat(sprintf("\nLD1 menjelaskan %.2f%% pemisahan antar kelas.\n", ev_imp_pct[1]))
## 
## LD1 menjelaskan 61.41% pemisahan antar kelas.
cat(sprintf("LD1 + LD2 menjelaskan %.2f%% pemisahan antar kelas.\n",
            sum(ev_imp_pct[1:2])))
## LD1 + LD2 menjelaskan 81.85% pemisahan antar kelas.

6.3 Koefisien Fungsi Diskriminan (Improved)

round(model_lda_imp$scaling, 4)
##                    LD1     LD2     LD3
## RNA12          -1.3269 -1.5644 -0.1409
## RNA_EOT         0.8488  1.8732 -2.0705
## BMI             0.6271 -0.1238 -0.1345
## RNA_EF          0.3648  0.2358  2.3024
## Age             0.3294 -0.0401  0.1820
## Fever           0.2480  0.3084  0.0207
## RNA4            0.2495  0.2119  0.0227
## WBC            -0.1974 -0.0543 -0.0721
## Plat            0.2129 -0.2637 -0.0193
## AST1            0.2078  0.2270  0.0527
## EpigastricPain -0.2024  0.2537 -0.4786
## ALT12          -0.1830 -0.0482 -0.2508
## Gender         -0.2125  0.2769  0.4836
## Grading         0.1971  0.1353 -0.1759
## ALT_after24w   -0.1713 -0.1213  0.2659
## ALT4            0.1707  0.1726  0.2701
## ALT36          -0.1758  0.0130 -0.0233

6.4 Evaluasi Model LDA (Improved)

6.4.1 Confusion Matrix — Data Latih

pred_train_lda_imp <- predict(model_lda_imp, data_train_sel)
cm_train_lda_imp   <- confusionMatrix(pred_train_lda_imp$class,
                                       data_train_sel$Staging)
print(cm_train_lda_imp$table)
##           Reference
## Prediction Staging1 Staging2 Staging3 Staging4
##   Staging1       54       33       42       39
##   Staging2       52       87       58       47
##   Staging3       50       48       70       52
##   Staging4       80       65       79      116
cat(sprintf("\nAkurasi LDA Improved (train): %.2f%%\n",
            cm_train_lda_imp$overall["Accuracy"] * 100))
## 
## Akurasi LDA Improved (train): 33.64%

6.4.2 Confusion Matrix — Data Uji

pred_test_lda_imp <- predict(model_lda_imp, data_test_sel)
cm_test_lda_imp   <- confusionMatrix(pred_test_lda_imp$class,
                                      data_test_sel$Staging)
print(cm_test_lda_imp$table)
##           Reference
## Prediction Staging1 Staging2 Staging3 Staging4
##   Staging1       14       23       17       13
##   Staging2       22       27       19       26
##   Staging3       23       21       29       26
##   Staging4       41       28       41       43
cat(sprintf("\nAkurasi LDA Improved (test): %.2f%%\n",
            cm_test_lda_imp$overall["Accuracy"] * 100))
## 
## Akurasi LDA Improved (test): 27.36%

6.4.3 Statistik Per Kelas — Data Uji

cm_test_lda_imp$byClass[, c("Sensitivity", "Specificity", "Precision", "F1")]
##                 Sensitivity Specificity Precision        F1
## Class: Staging1   0.1400000   0.8306709 0.2089552 0.1676647
## Class: Staging2   0.2727273   0.7866242 0.2872340 0.2797927
## Class: Staging3   0.2735849   0.7719870 0.2929293 0.2829268
## Class: Staging4   0.3981481   0.6393443 0.2810458 0.3295019

7 QDA dengan Fitur Terpilih + Log-Transform

7.1 Melatih Model QDA (Improved)

model_qda_imp <- qda(Staging ~ ., data = data_train_sel)
print(model_qda_imp)
## Call:
## qda(Staging ~ ., data = data_train_sel)
## 
## Prior probabilities of groups:
##  Staging1  Staging2  Staging3  Staging4 
## 0.2427984 0.2397119 0.2561728 0.2613169 
## 
## Group means:
##                RNA12     RNA_EOT          BMI      RNA_EF         Age
## Staging1  0.08153927  0.10797836 -0.006907335  0.08427587 -0.04116680
## Staging2 -0.11486932 -0.11166845  0.192871439 -0.12113648  0.07509045
## Staging3  0.02064937  0.01696758 -0.008431122  0.03400137  0.03271559
## Staging4  0.01442030 -0.01599513 -0.147389230 -0.01165920 -0.08745112
##                Fever        RNA4         WBC         Plat         AST1
## Staging1  0.07913214  0.04790800 -0.01277460 -0.047917098  0.011036204
## Staging2  0.03333479  0.06129037 -0.04539041  0.099416138  0.002823359
## Staging3  0.03723067  0.03334706 -0.03541789 -0.008423181  0.003214609
## Staging4 -0.07042553 -0.06440774  0.04323104 -0.042663762 -0.098137913
##          EpigastricPain        ALT12      Gender     Grading ALT_after24w
## Staging1     0.08525000 -0.004151124  0.01246228  0.04341757  -0.03186623
## Staging2    -0.08087695 -0.055510820 -0.09492773  0.04854715  -0.04124403
## Staging3    -0.09224918 -0.075754800  0.10526104 -0.01564098   0.03587153
## Staging4     0.03928890  0.056175715  0.02880856 -0.02982198   0.05074444
##                   ALT4        ALT36
## Staging1 -0.0005127219  0.006093398
## Staging2  0.0069159749 -0.039754574
## Staging3  0.0570109767 -0.005716282
## Staging4 -0.0782862189  0.045380910

7.2 Evaluasi Model QDA (Improved)

7.2.1 Confusion Matrix — Data Latih

pred_train_qda_imp <- predict(model_qda_imp, data_train_sel)
cm_train_qda_imp   <- confusionMatrix(pred_train_qda_imp$class,
                                       data_train_sel$Staging)
print(cm_train_qda_imp$table)
##           Reference
## Prediction Staging1 Staging2 Staging3 Staging4
##   Staging1       87       29       34       41
##   Staging2       38      102       33       29
##   Staging3       53       56      132       42
##   Staging4       58       46       50      142
cat(sprintf("\nAkurasi QDA Improved (train): %.2f%%\n",
            cm_train_qda_imp$overall["Accuracy"] * 100))
## 
## Akurasi QDA Improved (train): 47.63%

7.2.2 Confusion Matrix — Data Uji

pred_test_qda_imp <- predict(model_qda_imp, data_test_sel)
cm_test_qda_imp   <- confusionMatrix(pred_test_qda_imp$class,
                                      data_test_sel$Staging)
print(cm_test_qda_imp$table)
##           Reference
## Prediction Staging1 Staging2 Staging3 Staging4
##   Staging1       22       31       26       16
##   Staging2       23       19       24       25
##   Staging3       32       27       23       26
##   Staging4       23       22       33       41
cat(sprintf("\nAkurasi QDA Improved (test): %.2f%%\n",
            cm_test_qda_imp$overall["Accuracy"] * 100))
## 
## Akurasi QDA Improved (test): 25.42%

7.2.3 Statistik Per Kelas — Data Uji

cm_test_qda_imp$byClass[, c("Sensitivity", "Specificity", "Precision", "F1")]
##                 Sensitivity Specificity Precision        F1
## Class: Staging1   0.2200000   0.7667732 0.2315789 0.2256410
## Class: Staging2   0.1919192   0.7707006 0.2087912 0.2000000
## Class: Staging3   0.2169811   0.7231270 0.2129630 0.2149533
## Class: Staging4   0.3796296   0.7442623 0.3445378 0.3612335

8 Ringkasan Perbandingan: Sebelum vs Sesudah Perbaikan

# ---- Bangun ulang model awal (28 fitur, tanpa log-transform) ----
df_baseline        <- df
features_base      <- setdiff(colnames(df_baseline), "Staging")
df_baseline[, features_base] <- as.data.frame(scale(df_baseline[, features_base]))

set.seed(123)
idx_base       <- createDataPartition(df_baseline$Staging, p = 0.7, list = FALSE)
train_base     <- df_baseline[ idx_base, ]
test_base      <- df_baseline[-idx_base, ]

model_lda_base <- lda(Staging ~ ., data = train_base)
model_qda_base <- qda(Staging ~ ., data = train_base)

acc_lda_train_base <- confusionMatrix(
  predict(model_lda_base, train_base)$class, train_base$Staging
)$overall["Accuracy"] * 100

acc_lda_test_base <- confusionMatrix(
  predict(model_lda_base, test_base)$class, test_base$Staging
)$overall["Accuracy"] * 100

acc_qda_train_base <- confusionMatrix(
  predict(model_qda_base, train_base)$class, train_base$Staging
)$overall["Accuracy"] * 100

acc_qda_test_base <- confusionMatrix(
  predict(model_qda_base, test_base)$class, test_base$Staging
)$overall["Accuracy"] * 100

# ---- Susun tabel perbandingan ----
comparison_tbl <- data.frame(
  Model   = c("LDA Baseline", "LDA Improved",
               "QDA Baseline", "QDA Improved"),
  Fitur   = c(28, n_top, 28, n_top),
  Log_Transform = c("Tidak", "Ya", "Tidak", "Ya"),
  Akurasi_Train = round(c(acc_lda_train_base,
                           cm_train_lda_imp$overall["Accuracy"] * 100,
                           acc_qda_train_base,
                           cm_train_qda_imp$overall["Accuracy"] * 100), 2),
  Akurasi_Test  = round(c(acc_lda_test_base,
                           cm_test_lda_imp$overall["Accuracy"] * 100,
                           acc_qda_test_base,
                           cm_test_qda_imp$overall["Accuracy"] * 100), 2)
)
comparison_tbl$Gap_Overfit <- round(
  comparison_tbl$Akurasi_Train - comparison_tbl$Akurasi_Test, 2
)
comparison_tbl$Delta_Test <- round(
  comparison_tbl$Akurasi_Test - c(acc_lda_test_base,
                                   acc_lda_test_base,
                                   acc_qda_test_base,
                                   acc_qda_test_base), 2
)

print(comparison_tbl)
##          Model Fitur Log_Transform Akurasi_Train Akurasi_Test Gap_Overfit
## 1 LDA Baseline    28         Tidak         35.49        25.42       10.07
## 2 LDA Improved    17            Ya         33.64        27.36        6.28
## 3 QDA Baseline    28         Tidak         63.37        26.39       36.98
## 4 QDA Improved    17            Ya         47.63        25.42       22.21
##   Delta_Test
## 1       0.00
## 2       1.94
## 3       0.00
## 4      -0.97
cat("\n=== Peningkatan Akurasi Test ===\n")
## 
## === Peningkatan Akurasi Test ===
cat(sprintf("LDA: %.2f%% → %.2f%% (Δ = %+.2f%%)\n",
    acc_lda_test_base,
    cm_test_lda_imp$overall["Accuracy"] * 100,
    cm_test_lda_imp$overall["Accuracy"] * 100 - acc_lda_test_base))
## LDA: 25.42% → 27.36% (Δ = +1.94%)
cat(sprintf("QDA: %.2f%% → %.2f%% (Δ = %+.2f%%)\n",
    acc_qda_test_base,
    cm_test_qda_imp$overall["Accuracy"] * 100,
    cm_test_qda_imp$overall["Accuracy"] * 100 - acc_qda_test_base))
## QDA: 26.39% → 25.42% (Δ = -0.97%)

9 Visualisasi

9.1 Proyeksi LDA Improved (LD1 vs LD2)

lda_scores_imp <- as.data.frame(pred_train_lda_imp$x)
lda_scores_imp$Staging <- data_train_sel$Staging

ggplot(lda_scores_imp, aes(x = LD1, y = LD2, color = Staging)) +
  geom_point(alpha = 0.5, size = 1.5) +
  stat_ellipse(level = 0.68, linewidth = 0.8) +
  labs(
    title    = "Proyeksi LDA Improved pada LD1 dan LD2 (Data Latih)",
    subtitle = sprintf("LD1: %.2f%%  |  LD2: %.2f%%  |  LD3: %.2f%%",
                       ev_imp_pct[1], ev_imp_pct[2], ev_imp_pct[3]),
    x     = sprintf("LD1 (%.2f%%)", ev_imp_pct[1]),
    y     = sprintf("LD2 (%.2f%%)", ev_imp_pct[2]),
    color = "Staging"
  ) +
  scale_color_manual(values = c("#e74c3c", "#3498db", "#2ecc71", "#f39c12")) +
  theme_bw(base_size = 11) +
  theme(plot.title = element_text(face = "bold"))

9.2 Confusion Matrix Heatmap — LDA Improved (Data Uji)

cm_lda_imp_df <- as.data.frame(cm_test_lda_imp$table)
colnames(cm_lda_imp_df) <- c("Prediksi", "Aktual", "Freq")

ggplot(cm_lda_imp_df, aes(x = Prediksi, y = Aktual, fill = Freq)) +
  geom_tile(color = "white") +
  geom_text(aes(label = Freq), size = 4.5, fontface = "bold") +
  scale_fill_gradient(low = "#dceefb", high = "#2980b9") +
  labs(
    title = sprintf("Confusion Matrix LDA Improved (Data Uji) — Akurasi: %.2f%%",
                    cm_test_lda_imp$overall["Accuracy"] * 100),
    x = "Prediksi", y = "Aktual"
  ) +
  theme_bw(base_size = 11) +
  theme(plot.title = element_text(face = "bold"))

9.3 Confusion Matrix Heatmap — QDA Improved (Data Uji)

cm_qda_imp_df <- as.data.frame(cm_test_qda_imp$table)
colnames(cm_qda_imp_df) <- c("Prediksi", "Aktual", "Freq")

ggplot(cm_qda_imp_df, aes(x = Prediksi, y = Aktual, fill = Freq)) +
  geom_tile(color = "white") +
  geom_text(aes(label = Freq), size = 4.5, fontface = "bold") +
  scale_fill_gradient(low = "#fdebd0", high = "#e67e22") +
  labs(
    title = sprintf("Confusion Matrix QDA Improved (Data Uji) — Akurasi: %.2f%%",
                    cm_test_qda_imp$overall["Accuracy"] * 100),
    x = "Prediksi", y = "Aktual"
  ) +
  theme_bw(base_size = 11) +
  theme(plot.title = element_text(face = "bold"))

9.4 Perbandingan Akurasi: Sebelum vs Sesudah (Grouped Bar Chart)

acc_comp_df <- data.frame(
  Model   = rep(c("LDA", "QDA"), each = 4),
  Versi   = rep(c("Baseline\nTrain", "Baseline\nTest",
                   "Improved\nTrain", "Improved\nTest"), 2),
  Akurasi = c(
    round(acc_lda_train_base, 2),
    round(acc_lda_test_base, 2),
    round(cm_train_lda_imp$overall["Accuracy"] * 100, 2),
    round(cm_test_lda_imp$overall["Accuracy"]  * 100, 2),
    round(acc_qda_train_base, 2),
    round(acc_qda_test_base, 2),
    round(cm_train_qda_imp$overall["Accuracy"] * 100, 2),
    round(cm_test_qda_imp$overall["Accuracy"]  * 100, 2)
  ),
  Tipe = rep(c("Train", "Test", "Train", "Test"), 2)
)

acc_comp_df$Versi <- factor(
  acc_comp_df$Versi,
  levels = c("Baseline\nTrain", "Baseline\nTest",
             "Improved\nTrain", "Improved\nTest")
)

ggplot(acc_comp_df, aes(x = Versi, y = Akurasi,
                         fill = interaction(Model, Tipe))) +
  geom_bar(stat = "identity", position = position_dodge(width = 0.7),
           width = 0.6) +
  geom_text(aes(label = paste0(Akurasi, "%")),
            position = position_dodge(width = 0.7),
            vjust = -0.4, size = 3) +
  scale_fill_manual(
    values = c(
      "LDA.Train" = "#2c3e50", "LDA.Test" = "#3498db",
      "QDA.Train" = "#7d3c98", "QDA.Test" = "#a569bd"
    ),
    labels = c(
      "LDA.Train" = "LDA Train", "LDA.Test" = "LDA Test",
      "QDA.Train" = "QDA Train", "QDA.Test" = "QDA Test"
    )
  ) +
  ylim(0, 100) +
  labs(
    title    = "Perbandingan Akurasi: Baseline vs Improved",
    subtitle = paste0("Improved = Log-Transform + Seleksi Top-",
                      n_top, " Fitur"),
    x        = "",
    y        = "Akurasi (%)",
    fill     = "Model & Data"
  ) +
  theme_bw(base_size = 11) +
  theme(
    plot.title   = element_text(face = "bold"),
    legend.position = "right"
  )

9.5 Distribusi Skor LD1 — Sebelum vs Sesudah

# Skor LD1 baseline (data uji)
lda_test_base  <- as.data.frame(predict(model_lda_base, test_base)$x)
lda_test_base$Staging <- test_base$Staging
lda_test_base$Versi   <- "Baseline (28 fitur)"

# Skor LD1 improved (data uji)
lda_test_imp  <- as.data.frame(pred_test_lda_imp$x)
lda_test_imp$Staging <- data_test_sel$Staging
lda_test_imp$Versi   <- paste0("Improved (Top-", n_top, " + Log)")

ld1_compare <- bind_rows(lda_test_base, lda_test_imp)

ggplot(ld1_compare, aes(x = LD1, fill = Staging)) +
  geom_density(alpha = 0.45) +
  facet_wrap(~ Versi, ncol = 2) +
  scale_fill_manual(values = c("#e74c3c", "#3498db", "#2ecc71", "#f39c12")) +
  labs(
    title = "Distribusi Skor LD1 per Staging (Data Uji): Baseline vs Improved",
    x     = "Skor LD1",
    y     = "Densitas",
    fill  = "Staging"
  ) +
  theme_bw(base_size = 10) +
  theme(
    plot.title    = element_text(face = "bold"),
    strip.text    = element_text(face = "bold"),
    legend.position = "bottom"
  )

9.6 Koefisien LD1 — Model Improved

scaling_imp_df <- as.data.frame(model_lda_imp$scaling)
scaling_imp_df$Variabel <- rownames(scaling_imp_df)
scaling_imp_df <- scaling_imp_df %>%
  arrange(abs(LD1)) %>%
  mutate(
    Variabel = factor(Variabel, levels = Variabel),
    Warna    = ifelse(LD1 >= 0, "Positif", "Negatif")
  )

ggplot(scaling_imp_df, aes(x = LD1, y = Variabel, fill = Warna)) +
  geom_bar(stat = "identity") +
  geom_vline(xintercept = 0, linewidth = 0.6) +
  scale_fill_manual(values = c("Positif" = "#3498db", "Negatif" = "#e74c3c")) +
  labs(
    title    = "Koefisien LD1 — Model Improved",
    subtitle = paste0("Hanya top-", n_top, " fitur terpilih"),
    x = "Koefisien LD1", y = "Variabel", fill = "Arah"
  ) +
  theme_bw(base_size = 10) +
  theme(
    plot.title  = element_text(face = "bold"),
    axis.text.y = element_text(size = 9)
  )

9.7 Partition Plot — LDA Improved

Partition plot menggunakan 2 fitur teratas dari model improved. Proses ini membutuhkan waktu komputasi lebih lama.

# Ambil 2 fitur teratas dari top_features
feat1 <- top_features[1]
feat2 <- top_features[2]

cat("Partition plot menggunakan:", feat1, "vs", feat2, "\n")
## Partition plot menggunakan: RNA12 vs RNA_EOT
partimat(
  as.formula(paste("Staging ~", feat1, "+", feat2)),
  data   = data_train_sel,
  method = "lda",
  main   = paste("Partition Plot LDA Improved:", feat1, "vs", feat2)
)

partimat(
  as.formula(paste("Staging ~", feat1, "+", feat2)),
  data   = data_train_sel,
  method = "qda",
  main   = paste("Partition Plot QDA Improved:", feat1, "vs", feat2)
)


10 Kesimpulan

Dua strategi perbaikan diterapkan secara bersamaan pada model LDA dan QDA:

Solusi 1 — Seleksi Fitur: Dari 28 variabel prediktor, dipilih top-17 berdasarkan nilai absolut koefisien LD1. Variabel yang tidak berpengaruh signifikan dieliminasi untuk mengurangi noise pada fungsi diskriminan.

Solusi 3 — Transformasi Log: Variabel RNA (RNA_Base, RNA4, RNA12, RNA_EOT, RNA_EF) dan ALT (ALT1ALT48, AST1) yang memiliki distribusi sangat miring (right-skewed) ditransformasi menggunakan log1p(). Hasilnya distribusi mendekati simetris yang lebih sesuai dengan asumsi normalitas LDA.

Hasil Perbaikan:

##          Model Akurasi_Test Gap_Overfit Peningkatan
## 1 LDA Baseline        25.42       10.07           -
## 2 LDA Improved        27.36        6.28      +1.94%
## 3 QDA Baseline        26.39       36.98           -
## 4 QDA Improved        25.42       22.21     +-0.97%

Penurunan Gap Overfit pada QDA menunjukkan bahwa seleksi fitur berhasil mengurangi kompleksitas estimasi matriks kovarians per kelas, sehingga model lebih mampu melakukan generalisasi pada data baru.

## === SESSION INFO ===
## R version 4.4.1 (2024-06-14 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
## 
## Matrix products: default
## 
## 
## locale:
## [1] LC_COLLATE=Indonesian_Indonesia.utf8  LC_CTYPE=Indonesian_Indonesia.utf8   
## [3] LC_MONETARY=Indonesian_Indonesia.utf8 LC_NUMERIC=C                         
## [5] LC_TIME=Indonesian_Indonesia.utf8    
## 
## time zone: Asia/Jakarta
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] gridExtra_2.3   reshape2_1.4.5  klaR_1.7-4      MASS_7.3-60.2  
##  [5] caret_7.0-1     lattice_0.22-6  lubridate_1.9.4 forcats_1.0.0  
##  [9] stringr_1.5.1   dplyr_1.1.4     purrr_1.0.2     readr_2.1.5    
## [13] tidyr_1.3.1     tibble_3.2.1    ggplot2_4.0.2   tidyverse_2.0.0
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.1     timeDate_4052.112    farver_2.1.2        
##  [4] S7_0.2.1             fastmap_1.2.0        combinat_0.0-8      
##  [7] promises_1.3.3       pROC_1.19.0.1        labelled_2.16.0     
## [10] digest_0.6.37        rpart_4.1.23         mime_0.12           
## [13] timechange_0.3.0     lifecycle_1.0.4      survival_3.6-4      
## [16] magrittr_2.0.3       compiler_4.4.1       rlang_1.1.4         
## [19] sass_0.4.9           tools_4.4.1          utf8_1.2.4          
## [22] yaml_2.3.10          data.table_1.16.4    knitr_1.49          
## [25] labeling_0.4.3       plyr_1.8.9           RColorBrewer_1.1-3  
## [28] miniUI_0.1.2         withr_3.0.1          nnet_7.3-19         
## [31] grid_4.4.1           stats4_4.4.1         fansi_1.0.6         
## [34] e1071_1.7-17         xtable_1.8-4         future_1.69.0       
## [37] globals_0.19.0       scales_1.4.0         iterators_1.0.14    
## [40] cli_3.6.3            rmarkdown_2.30       generics_0.1.3      
## [43] rstudioapi_0.17.1    future.apply_1.20.2  tzdb_0.4.0          
## [46] proxy_0.4-29         cachem_1.1.0         splines_4.4.1       
## [49] parallel_4.4.1       vctrs_0.6.5          hardhat_1.4.2       
## [52] Matrix_1.7-0         jsonlite_1.8.9       hms_1.1.3           
## [55] listenv_0.10.0       foreach_1.5.2        gower_1.0.2         
## [58] jquerylib_0.1.4      recipes_1.3.1        glue_1.7.0          
## [61] parallelly_1.46.1    codetools_0.2-20     stringi_1.8.4       
## [64] gtable_0.3.6         later_1.4.2          questionr_0.8.2     
## [67] pillar_1.9.0         htmltools_0.5.8.1    ipred_0.9-15        
## [70] lava_1.8.2           R6_2.5.1             evaluate_1.0.1      
## [73] shiny_1.10.0         haven_2.5.4          highr_0.11          
## [76] httpuv_1.6.16        bslib_0.8.0          class_7.3-22        
## [79] Rcpp_1.0.13          nlme_3.1-164         prodlim_2025.04.28  
## [82] xfun_0.49            pkgconfig_2.0.3      ModelMetrics_1.2.2.2