Hepatitis C Virus (HCV) merupakan penyakit yang menyerang organ hati dan dapat berkembang menjadi fibrosis hingga sirosis apabila tidak ditangani. Tingkat keparahan penyakit ini dinilai melalui Baselinehistological Staging (Staging 1–4) yang mencerminkan derajat kerusakan jaringan hati pasien.
Dataset yang digunakan adalah HCV for Egyptian Patients dari UCI Machine Learning Repository, yang memuat 28 variabel klinis dan biomarker dari 1.385 pasien. Dengan jumlah variabel yang besar, analisis diskriminan dimanfaatkan tidak hanya sebagai alat klasifikasi, tetapi sekaligus sebagai teknik reduksi dimensi berbasis kelas melalui fungsi diskriminan (LD1, LD2, LD3).
LDA menghasilkan fungsi diskriminan (LD1, LD2, …) yang merupakan kombinasi linear dari prediktor. Berbeda dengan PCA yang memaksimalkan keragaman total (unsupervised), LDA memaksimalkan jarak antar kelas sambil meminimalkan keragaman dalam kelas (supervised):
\[D = b_1X_1 + b_2X_2 + \cdots + b_pX_p\]
Untuk \(g\) kelas, LDA menghasilkan maksimal \(g-1\) fungsi diskriminan. Dengan 4 kelas Staging, terdapat 3 fungsi diskriminan (LD1, LD2, LD3) yang memproyeksikan data dari 28 dimensi ke ruang 3 dimensi. Proporsi trace menunjukkan seberapa besar setiap LD menjelaskan pemisahan antar kelas.
library(tidyverse)
library(caret)
library(MASS) # lda(), qda()
library(klaR) # partimat()
library(ggplot2)
library(dplyr)
library(reshape2)
library(gridExtra)path <- "D:/02. UNS/01. Kuliah/Sem 4/PDM/Diskriminan/HCV-Egy-Data.csv"
df <- read.csv(path, check.names = TRUE)
# Hapus kolom dengan nama kosong (mencegah error zero-length variable name)
df <- df[, nzchar(trimws(colnames(df)))]
## Struktur Data## 'data.frame': 1385 obs. of 29 variables:
## $ Age : int 56 46 57 49 59 58 42 48 44 45 ...
## $ Gender : int 1 1 1 2 1 2 2 2 1 1 ...
## $ BMI : int 35 29 33 33 32 22 26 30 23 30 ...
## $ Fever : int 2 1 2 1 1 2 1 1 1 2 ...
## $ Nausea.Vomting : int 1 2 2 2 1 2 1 1 1 1 ...
## $ Headache : int 1 2 2 1 2 2 2 2 2 2 ...
## $ Diarrhea : int 1 1 2 2 1 1 2 2 2 2 ...
## $ Fatigue...generalized.bone.ache: int 2 2 1 1 2 2 2 1 2 1 ...
## $ Jaundice : int 2 2 1 2 2 2 2 1 1 1 ...
## $ Epigastric.pain : int 2 1 1 1 2 1 2 2 2 2 ...
## $ WBC : int 7425 12101 4178 6490 3661 11785 11620 7335 10480 6681 ...
## $ RBC : num 4248807 4429425 4621191 4794631 4606375 ...
## $ HGB : int 14 10 12 10 11 15 12 11 12 12 ...
## $ Plat : num 112132 129367 151522 146457 187684 ...
## $ AST.1 : int 99 91 113 43 99 66 78 119 93 55 ...
## $ ALT.1 : int 84 123 49 64 104 104 57 112 83 68 ...
## $ ALT4 : num 52 95 95 109 67 121 113 80 55 72 ...
## $ ALT.12 : int 109 75 107 80 48 96 118 127 102 127 ...
## $ ALT.24 : int 81 113 116 88 120 65 107 45 97 81 ...
## $ ALT.36 : int 5 57 5 48 94 73 84 96 122 125 ...
## $ ALT.48 : int 5 123 5 77 90 114 80 53 39 43 ...
## $ ALT.after.24.w : int 5 44 5 33 30 29 28 39 45 30 ...
## $ RNA.Base : int 655330 40620 571148 1041941 660410 1157452 325694 641129 591441 1151206 ...
## $ RNA.4 : int 634536 538635 661346 449939 738756 1086852 1034008 72050 757361 230488 ...
## $ RNA.12 : int 288194 637056 5 585688 3731527 5 275095 787295 5 267320 ...
## $ RNA.EOT : int 5 336804 735945 744463 338946 5 214566 370605 371090 275295 ...
## $ RNA.EF : int 5 31085 558829 582301 242861 5 635157 506296 203042 555516 ...
## $ Baseline.histological.Grading : int 13 4 4 10 11 4 12 12 5 4 ...
## $ Baselinehistological.staging : int 2 2 4 3 1 4 4 3 2 2 ...
## Jumlah baris : 1385
## Jumlah kolom : 29
## Jumlah fitur : 28 (exclude target)
##
## === CEK MISSING VALUE ===
## Age Gender
## 0 0
## BMI Fever
## 0 0
## Nausea.Vomting Headache
## 0 0
## Diarrhea Fatigue...generalized.bone.ache
## 0 0
## Jaundice Epigastric.pain
## 0 0
## WBC RBC
## 0 0
## HGB Plat
## 0 0
## AST.1 ALT.1
## 0 0
## ALT4 ALT.12
## 0 0
## ALT.24 ALT.36
## 0 0
## ALT.48 ALT.after.24.w
## 0 0
## RNA.Base RNA.4
## 0 0
## RNA.12 RNA.EOT
## 0 0
## RNA.EF Baseline.histological.Grading
## 0 0
## Baselinehistological.staging
## 0
##
## === DISTRIBUSI VARIABEL TARGET ===
##
## 1 2 3 4
## 336 332 355 362
##
## 1 2 3 4
## 0.2425993 0.2397112 0.2563177 0.2613718
colnames(df)[colnames(df) == "Baselinehistological.staging"] <- "Staging"
colnames(df)[colnames(df) == "Fatigue...generalized.bone.ache"] <- "Fatigue"
colnames(df)[colnames(df) == "Nausea.Vomting"] <- "Nausea"
colnames(df)[colnames(df) == "Epigastric.pain"] <- "EpigastricPain"
colnames(df)[colnames(df) == "Baseline.histological.Grading"] <- "Grading"
colnames(df)[colnames(df) == "ALT.after.24.w"] <- "ALT_after24w"
colnames(df)[colnames(df) == "RNA.Base"] <- "RNA_Base"
colnames(df)[colnames(df) == "RNA.EOT"] <- "RNA_EOT"
colnames(df)[colnames(df) == "RNA.EF"] <- "RNA_EF"
colnames(df)[colnames(df) == "AST.1"] <- "AST1"
colnames(df)[colnames(df) == "ALT.1"] <- "ALT1"
colnames(df)[colnames(df) == "ALT4"] <- "ALT4"
colnames(df)[colnames(df) == "ALT.12"] <- "ALT12"
colnames(df)[colnames(df) == "ALT.24"] <- "ALT24"
colnames(df)[colnames(df) == "ALT.36"] <- "ALT36"
colnames(df)[colnames(df) == "ALT.48"] <- "ALT48"
colnames(df)[colnames(df) == "RNA.4"] <- "RNA4"
colnames(df)[colnames(df) == "RNA.12"] <- "RNA12"
df$Staging <- as.factor(df$Staging)
levels(df$Staging) <- c("Staging1", "Staging2", "Staging3", "Staging4")
features <- setdiff(colnames(df), "Staging")
cat("Nama kolom setelah cleaning:\n")## Nama kolom setelah cleaning:
## [1] "Age" "Gender" "BMI" "Fever"
## [5] "Nausea" "Headache" "Diarrhea" "Fatigue"
## [9] "Jaundice" "EpigastricPain" "WBC" "RBC"
## [13] "HGB" "Plat" "AST1" "ALT1"
## [17] "ALT4" "ALT12" "ALT24" "ALT36"
## [21] "ALT48" "ALT_after24w" "RNA_Base" "RNA4"
## [25] "RNA12" "RNA_EOT" "RNA_EF" "Grading"
## [29] "Staging"
##
## Total fitur prediktor: 28
Pengecekan baris duplikat dilakukan untuk memastikan tidak ada observasi yang tercatat lebih dari satu kali. Baris duplikat dapat menyebabkan model overfit karena data yang sama muncul di training dan testing set.
## Jumlah baris duplikat : 0
if (n_duplikat > 0) {
cat("\nIndeks baris duplikat:\n")
print(which(duplicated(df)))
cat("\nContoh baris duplikat (5 pertama):\n")
print(head(df[duplicated(df), ], 5))
# Hapus duplikat
df <- df[!duplicated(df), ]
cat("\nDuplikat dihapus. Sisa baris:", nrow(df), "\n")
} else {
cat("Tidak ditemukan duplikat. Data bersih.\n")
}## Tidak ditemukan duplikat. Data bersih.
Metode IQR mendefinisikan outlier sebagai nilai di luar batas:
\[\text{Batas Bawah} = Q1 - 1.5 \times IQR \quad ; \quad \text{Batas Atas} = Q3 + 1.5 \times IQR\]
Pengecekan dilakukan pada seluruh variabel numerik (fitur prediktor). Variabel kategoris biner (nilai hanya 1 dan 2) dilewati karena metode IQR tidak relevan untuk variabel tersebut.
# Identifikasi variabel numerik kontinu (lebih dari 2 nilai unik)
vars_numerik <- features[sapply(df[, features], function(x) {
is.numeric(x) && length(unique(x)) > 2
})]
cat("Variabel yang dicek outlier:", length(vars_numerik), "variabel\n")## Variabel yang dicek outlier: 20 variabel
## Age, BMI, WBC, RBC, HGB, Plat, AST1, ALT1, ALT4, ALT12, ALT24, ALT36, ALT48, ALT_after24w, RNA_Base, RNA4, RNA12, RNA_EOT, RNA_EF, Grading
# Hitung jumlah outlier per variabel
outlier_summary <- data.frame(
Variabel = vars_numerik,
Q1 = NA_real_,
Q3 = NA_real_,
IQR = NA_real_,
Batas_Bawah = NA_real_,
Batas_Atas = NA_real_,
N_Outlier = NA_integer_,
Pct_Outlier = NA_real_
)
for (i in seq_along(vars_numerik)) {
v <- vars_numerik[i]
x <- df[[v]]
q1 <- quantile(x, 0.25, na.rm = TRUE)
q3 <- quantile(x, 0.75, na.rm = TRUE)
iqr <- q3 - q1
lb <- q1 - 1.5 * iqr
ub <- q3 + 1.5 * iqr
n_out <- sum(x < lb | x > ub, na.rm = TRUE)
outlier_summary[i, "Q1"] <- round(q1, 2)
outlier_summary[i, "Q3"] <- round(q3, 2)
outlier_summary[i, "IQR"] <- round(iqr, 2)
outlier_summary[i, "Batas_Bawah"] <- round(lb, 2)
outlier_summary[i, "Batas_Atas"] <- round(ub, 2)
outlier_summary[i, "N_Outlier"] <- n_out
outlier_summary[i, "Pct_Outlier"] <- round(n_out / nrow(df) * 100, 2)
}
cat("=== RINGKASAN OUTLIER PER VARIABEL ===\n")## === RINGKASAN OUTLIER PER VARIABEL ===
## Variabel Q1 Q3 IQR Batas_Bawah Batas_Atas N_Outlier
## 1 Age 39 54 15 16.5 76.5 0
## 2 BMI 25 32 7 14.5 42.5 0
## 3 WBC 5219 9902 4683 -1805.5 16926.5 0
## 4 RBC 4121374 4721279 599905 3221516.5 5621136.5 0
## 5 HGB 11 14 3 6.5 18.5 0
## 6 Plat 124479 190314 65835 25726.5 289066.5 0
## 7 AST1 60 105 45 -7.5 172.5 0
## 8 ALT1 62 106 44 -4.0 172.0 0
## 9 ALT4 61 107 46 -8.0 176.0 0
## 10 ALT12 60 106 46 -9.0 175.0 0
## 11 ALT24 61 107 46 -8.0 176.0 0
## 12 ALT36 61 106 45 -6.5 173.5 0
## 13 ALT48 61 106 45 -6.5 173.5 0
## 14 ALT_after24w 28 40 12 10.0 58.0 3
## 15 RNA_Base 269253 886791 617538 -657054.0 1813098.0 0
## 16 RNA4 270893 909093 638200 -686407.0 1866393.0 0
## 17 RNA12 5 524819 524814 -787216.0 1312040.0 1
## 18 RNA_EOT 5 517806 517801 -776696.5 1294507.5 0
## 19 RNA_EF 5 527864 527859 -791783.5 1319652.5 0
## 20 Grading 6 13 7 -4.5 23.5 0
## Pct_Outlier
## 1 0.00
## 2 0.00
## 3 0.00
## 4 0.00
## 5 0.00
## 6 0.00
## 7 0.00
## 8 0.00
## 9 0.00
## 10 0.00
## 11 0.00
## 12 0.00
## 13 0.00
## 14 0.22
## 15 0.00
## 16 0.00
## 17 0.07
## 18 0.00
## 19 0.00
## 20 0.00
##
## Total variabel yang memiliki outlier: 2
## Total sel outlier (seluruh variabel): 4
# Ambil hanya variabel yang memiliki outlier untuk divisualisasikan
vars_ada_outlier <- outlier_summary$Variabel[outlier_summary$N_Outlier > 0]
if (length(vars_ada_outlier) > 0) {
df_long <- df[, vars_ada_outlier] %>%
pivot_longer(cols = everything(),
names_to = "Variabel",
values_to = "Nilai")
ggplot(df_long, aes(x = Variabel, y = Nilai)) +
geom_boxplot(fill = "#aed6f1", color = "#1a5276",
outlier.color = "#e74c3c",
outlier.shape = 16,
outlier.size = 1.2,
linewidth = 0.5) +
labs(
title = "Deteksi Outlier dengan Metode IQR",
subtitle = "Titik merah = outlier (di luar Q1 - 1.5·IQR atau Q3 + 1.5·IQR)",
x = "Variabel",
y = "Nilai"
) +
theme_bw(base_size = 9) +
theme(
plot.title = element_text(face = "bold"),
axis.text.x = element_text(angle = 45, hjust = 1, size = 8)
)
} else {
cat("Tidak ada variabel dengan outlier yang perlu divisualisasikan.\n")
}outlier_plot_df <- outlier_summary %>%
filter(N_Outlier > 0) %>%
arrange(desc(Pct_Outlier)) %>%
mutate(Variabel = factor(Variabel, levels = Variabel))
if (nrow(outlier_plot_df) > 0) {
ggplot(outlier_plot_df,
aes(x = Variabel, y = Pct_Outlier)) +
geom_bar(stat = "identity", fill = "#e74c3c", width = 0.6) +
geom_text(aes(label = paste0(Pct_Outlier, "%")),
vjust = -0.4, size = 3) +
labs(
title = "Persentase Outlier per Variabel (Metode IQR)",
subtitle = "Hanya variabel dengan outlier > 0 yang ditampilkan",
x = "Variabel",
y = "% Outlier dari Total Observasi"
) +
theme_bw(base_size = 10) +
theme(
plot.title = element_text(face = "bold"),
axis.text.x = element_text(angle = 45, hjust = 1, size = 8)
)
} else {
cat("Tidak ada outlier yang ditemukan.\n")
}Untuk analisis diskriminan, outlier tidak dihapus melainkan di-cap (winsorizing) pada batas IQR. Strategi ini mempertahankan jumlah observasi (penting untuk estimasi kovarians pada QDA) sambil mengurangi pengaruh nilai ekstrem terhadap fungsi diskriminan.
\[x_{\text{cap}} = \begin{cases} \text{Batas Bawah} & \text{jika } x < Q1 - 1.5 \times IQR \\ x & \text{jika } Q1 - 1.5 \times IQR \leq x \leq Q3 + 1.5 \times IQR \\ \text{Batas Atas} & \text{jika } x > Q3 + 1.5 \times IQR \end{cases}\]
df_clean <- df
for (v in vars_numerik) {
x <- df_clean[[v]]
q1 <- quantile(x, 0.25, na.rm = TRUE)
q3 <- quantile(x, 0.75, na.rm = TRUE)
iqr <- q3 - q1
lb <- q1 - 1.5 * iqr
ub <- q3 + 1.5 * iqr
df_clean[[v]] <- pmin(pmax(x, lb), ub)
}
# Verifikasi: hitung outlier setelah winsorizing
n_outlier_after <- sum(sapply(vars_numerik, function(v) {
x <- df_clean[[v]]
q1 <- quantile(x, 0.25, na.rm = TRUE)
q3 <- quantile(x, 0.75, na.rm = TRUE)
iqr <- q3 - q1
sum(x < q1 - 1.5 * iqr | x > q3 + 1.5 * iqr, na.rm = TRUE)
}))
cat("Jumlah outlier SEBELUM winsorizing:", sum(outlier_summary$N_Outlier), "\n")## Jumlah outlier SEBELUM winsorizing: 4
## Jumlah outlier SETELAH winsorizing: 0
## Dimensi data setelah penanganan : 1385 x 29
set.seed(123)
train_idx <- createDataPartition(df_scaled$Staging, p = 0.7, list = FALSE)
data_train <- df_scaled[ train_idx, ]
data_test <- df_scaled[-train_idx, ]
cat("Data latih :", nrow(data_train), "observasi\n")## Data latih : 972 observasi
## Data uji : 413 observasi
##
## Distribusi kelas — Data Latih:
##
## Staging1 Staging2 Staging3 Staging4
## 236 233 249 254
##
## Distribusi kelas — Data Uji:
##
## Staging1 Staging2 Staging3 Staging4
## 100 99 106 108
## Call:
## lda(Staging ~ ., data = data_train)
##
## Prior probabilities of groups:
## Staging1 Staging2 Staging3 Staging4
## 0.2427984 0.2397119 0.2561728 0.2613169
##
## Group means:
## Age Gender BMI Fever Nausea
## Staging1 -0.04116680 0.01246228 -0.006907335 0.07913214 -0.03893889
## Staging2 0.07509045 -0.09492773 0.192871439 0.03333479 -0.04366555
## Staging3 0.03271559 0.10526104 -0.008431122 0.03723067 0.01502094
## Staging4 -0.08745112 0.02880856 -0.147389230 -0.07042553 0.02643270
## Headache Diarrhea Fatigue Jaundice EpigastricPain
## Staging1 0.04182675 0.06272060 -0.05713547 -0.01063682 0.08525000
## Staging2 -0.03067426 -0.02650414 -0.01928624 0.03644746 -0.08087695
## Staging3 -0.04425201 0.05516759 -0.04199557 -0.08647238 -0.09224918
## Staging4 0.01581104 0.05004652 0.05726363 0.06867543 0.03928890
## WBC RBC HGB Plat AST1
## Staging1 -0.01277460 -0.004632153 0.05513693 -0.047917098 0.011600782
## Staging2 -0.04539041 -0.023528199 0.03772072 0.099416138 0.002887526
## Staging3 -0.03541789 -0.017863048 -0.03127443 -0.008423181 0.003413371
## Staging4 0.04323104 -0.016507172 0.04070862 -0.042663762 -0.097660336
## ALT1 ALT4 ALT12 ALT24 ALT36
## Staging1 -0.08503638 0.015051362 0.001874303 0.048777499 0.003246343
## Staging2 0.03336331 0.002985419 -0.064867090 -0.005929071 -0.035835244
## Staging3 0.01051235 0.065238886 -0.076132923 -0.018320603 -0.021496574
## Staging4 0.06762564 -0.083855879 0.061226237 0.012004764 0.030290252
## ALT48 ALT_after24w RNA_Base RNA4 RNA12
## Staging1 -0.040328356 -0.04155957 0.07645703 0.07828545 0.02216276
## Staging2 0.017070283 -0.06140861 -0.11574169 0.05099877 -0.07175850
## Staging3 0.049653991 0.02579360 0.04280821 0.01317952 0.01181295
## Staging4 -0.009746323 0.04753783 0.03611084 -0.04085885 0.11859502
## RNA_EOT RNA_EF Grading
## Staging1 0.14305299 -0.006916713 0.04341757
## Staging2 -0.10623974 -0.101179568 0.04854715
## Staging3 -0.02516523 0.086308247 -0.01564098
## Staging4 -0.01384450 -0.004377552 -0.02982198
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## Age 0.298139658 0.043451487 -0.09067125
## Gender -0.203854473 -0.089721778 -0.43315233
## BMI 0.584624973 -0.006574060 0.17119362
## Fever 0.210229515 -0.278852105 -0.06533441
## Nausea -0.113299666 0.074865688 -0.14878909
## Headache -0.082909731 -0.117885721 0.15036269
## Diarrhea -0.140408608 -0.126401583 -0.04000293
## Fatigue -0.108756561 0.174072142 0.08157072
## Jaundice -0.018048351 0.166561811 0.39112798
## EpigastricPain -0.215024572 -0.245110887 0.29624732
## WBC -0.187643473 0.082225930 0.04269870
## RBC -0.043499445 -0.062835918 0.00574792
## HGB -0.095523942 -0.045118166 0.22801009
## Plat 0.237194387 0.137788121 0.09318579
## AST1 0.160553229 -0.225494806 -0.08635463
## ALT1 -0.069242366 0.375557251 0.03292185
## ALT4 0.175886489 -0.127294513 -0.24722853
## ALT12 -0.235032261 0.000646653 0.18647315
## ALT24 -0.068976459 -0.121357218 0.10947546
## ALT36 -0.113472361 0.001774546 0.04624063
## ALT48 0.055847118 0.084377488 -0.16011193
## ALT_after24w -0.158298318 0.134311922 -0.15451587
## RNA_Base -0.223969439 -0.271529834 -0.17268027
## RNA4 0.127926783 -0.194573409 0.04312482
## RNA12 -0.385260107 0.255811118 0.11433871
## RNA_EOT -0.002753399 -0.689494586 0.24105531
## RNA_EF 0.137564016 0.091599413 -0.61316789
## Grading 0.138220305 -0.169724522 0.11274798
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.5094 0.2582 0.2324
Proporsi trace menunjukkan seberapa besar setiap fungsi diskriminan berkontribusi dalam memisahkan antar kelas. LDA mereduksi dimensi fitur menjadi maksimal 3 dimensi (LD1, LD2, LD3) untuk 4 kelas.
ev <- model_lda$svd^2 / sum(model_lda$svd^2)
ev_pct <- round(ev * 100, 2)
ev_tbl <- data.frame(
Fungsi = paste0("LD", 1:3),
Proporsi = paste0(ev_pct, "%"),
Kumulatif = paste0(round(cumsum(ev_pct), 2), "%")
)
print(ev_tbl)## Fungsi Proporsi Kumulatif
## 1 LD1 50.94% 50.94%
## 2 LD2 25.82% 76.76%
## 3 LD3 23.24% 100%
##
## LD1 menjelaskan 50.94% pemisahan antar kelas.
## LD1 + LD2 menjelaskan 76.76% pemisahan antar kelas.
Koefisien ini membentuk persamaan fungsi diskriminan yang memproyeksikan data ke ruang berdimensi lebih rendah. Variabel dengan nilai absolut koefisien terbesar paling berpengaruh dalam pemisahan kelas.
## LD1 LD2 LD3
## Age 0.2981 0.0435 -0.0907
## Gender -0.2039 -0.0897 -0.4332
## BMI 0.5846 -0.0066 0.1712
## Fever 0.2102 -0.2789 -0.0653
## Nausea -0.1133 0.0749 -0.1488
## Headache -0.0829 -0.1179 0.1504
## Diarrhea -0.1404 -0.1264 -0.0400
## Fatigue -0.1088 0.1741 0.0816
## Jaundice -0.0180 0.1666 0.3911
## EpigastricPain -0.2150 -0.2451 0.2962
## WBC -0.1876 0.0822 0.0427
## RBC -0.0435 -0.0628 0.0057
## HGB -0.0955 -0.0451 0.2280
## Plat 0.2372 0.1378 0.0932
## AST1 0.1606 -0.2255 -0.0864
## ALT1 -0.0692 0.3756 0.0329
## ALT4 0.1759 -0.1273 -0.2472
## ALT12 -0.2350 0.0006 0.1865
## ALT24 -0.0690 -0.1214 0.1095
## ALT36 -0.1135 0.0018 0.0462
## ALT48 0.0558 0.0844 -0.1601
## ALT_after24w -0.1583 0.1343 -0.1545
## RNA_Base -0.2240 -0.2715 -0.1727
## RNA4 0.1279 -0.1946 0.0431
## RNA12 -0.3853 0.2558 0.1143
## RNA_EOT -0.0028 -0.6895 0.2411
## RNA_EF 0.1376 0.0916 -0.6132
## Grading 0.1382 -0.1697 0.1127
## Staging1 Staging2 Staging3 Staging4
## 5 0.2016 0.3291 0.1481 0.3212
## 6 0.1892 0.1680 0.2961 0.3468
## 7 0.2798 0.1656 0.2273 0.3273
## 8 0.2197 0.1646 0.2629 0.3528
## 9 0.2846 0.1156 0.1584 0.4413
## 10 0.3421 0.1343 0.2248 0.2989
## Reference
## Prediction Staging1 Staging2 Staging3 Staging4
## Staging1 62 39 43 35
## Staging2 46 82 54 48
## Staging3 48 57 80 54
## Staging4 80 55 72 117
##
## Akurasi LDA (train): 35.08%
pred_test_lda <- predict(model_lda, data_test)
cm_test_lda <- confusionMatrix(pred_test_lda$class, data_test$Staging)
print(cm_test_lda$table)## Reference
## Prediction Staging1 Staging2 Staging3 Staging4
## Staging1 14 23 21 12
## Staging2 29 23 20 31
## Staging3 27 19 32 26
## Staging4 30 34 33 39
##
## Akurasi LDA (test): 26.15%
## Sensitivity Specificity Precision F1
## Class: Staging1 0.1400000 0.8210863 0.2000000 0.1647059
## Class: Staging2 0.2323232 0.7452229 0.2233010 0.2277228
## Class: Staging3 0.3018868 0.7654723 0.3076923 0.3047619
## Class: Staging4 0.3611111 0.6819672 0.2867647 0.3196721
## Call:
## qda(Staging ~ ., data = data_train)
##
## Prior probabilities of groups:
## Staging1 Staging2 Staging3 Staging4
## 0.2427984 0.2397119 0.2561728 0.2613169
##
## Group means:
## Age Gender BMI Fever Nausea
## Staging1 -0.04116680 0.01246228 -0.006907335 0.07913214 -0.03893889
## Staging2 0.07509045 -0.09492773 0.192871439 0.03333479 -0.04366555
## Staging3 0.03271559 0.10526104 -0.008431122 0.03723067 0.01502094
## Staging4 -0.08745112 0.02880856 -0.147389230 -0.07042553 0.02643270
## Headache Diarrhea Fatigue Jaundice EpigastricPain
## Staging1 0.04182675 0.06272060 -0.05713547 -0.01063682 0.08525000
## Staging2 -0.03067426 -0.02650414 -0.01928624 0.03644746 -0.08087695
## Staging3 -0.04425201 0.05516759 -0.04199557 -0.08647238 -0.09224918
## Staging4 0.01581104 0.05004652 0.05726363 0.06867543 0.03928890
## WBC RBC HGB Plat AST1
## Staging1 -0.01277460 -0.004632153 0.05513693 -0.047917098 0.011600782
## Staging2 -0.04539041 -0.023528199 0.03772072 0.099416138 0.002887526
## Staging3 -0.03541789 -0.017863048 -0.03127443 -0.008423181 0.003413371
## Staging4 0.04323104 -0.016507172 0.04070862 -0.042663762 -0.097660336
## ALT1 ALT4 ALT12 ALT24 ALT36
## Staging1 -0.08503638 0.015051362 0.001874303 0.048777499 0.003246343
## Staging2 0.03336331 0.002985419 -0.064867090 -0.005929071 -0.035835244
## Staging3 0.01051235 0.065238886 -0.076132923 -0.018320603 -0.021496574
## Staging4 0.06762564 -0.083855879 0.061226237 0.012004764 0.030290252
## ALT48 ALT_after24w RNA_Base RNA4 RNA12
## Staging1 -0.040328356 -0.04155957 0.07645703 0.07828545 0.02216276
## Staging2 0.017070283 -0.06140861 -0.11574169 0.05099877 -0.07175850
## Staging3 0.049653991 0.02579360 0.04280821 0.01317952 0.01181295
## Staging4 -0.009746323 0.04753783 0.03611084 -0.04085885 0.11859502
## RNA_EOT RNA_EF Grading
## Staging1 0.14305299 -0.006916713 0.04341757
## Staging2 -0.10623974 -0.101179568 0.04854715
## Staging3 -0.02516523 0.086308247 -0.01564098
## Staging4 -0.01384450 -0.004377552 -0.02982198
pred_train_qda <- predict(model_qda, data_train)
cm_train_qda <- confusionMatrix(pred_train_qda$class, data_train$Staging)
print(cm_train_qda$table)## Reference
## Prediction Staging1 Staging2 Staging3 Staging4
## Staging1 129 24 30 23
## Staging2 38 156 27 26
## Staging3 39 28 160 31
## Staging4 30 25 32 174
##
## Akurasi QDA (train): 63.68%
pred_test_qda <- predict(model_qda, data_test)
cm_test_qda <- confusionMatrix(pred_test_qda$class, data_test$Staging)
print(cm_test_qda$table)## Reference
## Prediction Staging1 Staging2 Staging3 Staging4
## Staging1 20 29 31 19
## Staging2 23 25 18 22
## Staging3 29 28 31 29
## Staging4 28 17 26 38
##
## Akurasi QDA (test): 27.60%
## Sensitivity Specificity Precision F1
## Class: Staging1 0.2000000 0.7476038 0.2020202 0.2010050
## Class: Staging2 0.2525253 0.7993631 0.2840909 0.2673797
## Class: Staging3 0.2924528 0.7198697 0.2649573 0.2780269
## Class: Staging4 0.3518519 0.7672131 0.3486239 0.3502304
summary_tbl <- data.frame(
Model = c("LDA", "QDA"),
Akurasi_Train = c(round(cm_train_lda$overall["Accuracy"] * 100, 2),
round(cm_train_qda$overall["Accuracy"] * 100, 2)),
Akurasi_Test = c(round(cm_test_lda$overall["Accuracy"] * 100, 2),
round(cm_test_qda$overall["Accuracy"] * 100, 2))
)
summary_tbl$Gap_Overfit <- summary_tbl$Akurasi_Train - summary_tbl$Akurasi_Test
print(summary_tbl)## Model Akurasi_Train Akurasi_Test Gap_Overfit
## 1 LDA 35.08 26.15 8.93
## 2 QDA 63.68 27.60 36.08
Plot ini menampilkan data yang telah direduksi dari dimensi fitur ke ruang LD1 dan LD2. Semakin terpisah klaster antar warna, semakin baik kemampuan LDA memisahkan kelas.
lda_scores <- as.data.frame(pred_train_lda$x)
lda_scores$Staging <- data_train$Staging
ggplot(lda_scores, aes(x = LD1, y = LD2, color = Staging)) +
geom_point(alpha = 0.5, size = 1.5) +
stat_ellipse(level = 0.68, linewidth = 0.8) +
labs(
title = "Proyeksi Data pada Ruang LD1 dan LD2 (Data Latih)",
subtitle = sprintf("LD1: %.2f%% | LD2: %.2f%% | LD3: %.2f%%",
ev_pct[1], ev_pct[2], ev_pct[3]),
x = sprintf("LD1 (%.2f%%)", ev_pct[1]),
y = sprintf("LD2 (%.2f%%)", ev_pct[2]),
color = "Staging"
) +
scale_color_manual(values = c("#e74c3c", "#3498db", "#2ecc71", "#f39c12")) +
theme_bw(base_size = 11) +
theme(plot.title = element_text(face = "bold"))ev_df <- data.frame(
LD = paste0("LD", 1:3),
Var = ev_pct,
Cum = cumsum(ev_pct)
)
ggplot(ev_df, aes(x = LD, y = Var)) +
geom_bar(stat = "identity",
fill = c("#2c3e50", "#7f8c8d", "#bdc3c7"), width = 0.5) +
geom_line(aes(y = Cum, group = 1), color = "red", linewidth = 0.8) +
geom_point(aes(y = Cum), color = "red", size = 2.5) +
geom_text(aes(label = paste0(Var, "%")), vjust = -0.5, size = 3.5) +
ylim(0, 120) +
labs(
title = "Proporsi Varian Antar Kelas — Fungsi Diskriminan LDA",
subtitle = "Garis merah = varian kumulatif",
x = "Fungsi Diskriminan",
y = "Explained Variance (%)"
) +
theme_bw(base_size = 11) +
theme(plot.title = element_text(face = "bold"))Koefisien LD1 menunjukkan seberapa besar setiap variabel asli berkontribusi pada fungsi diskriminan pertama. Ini merupakan interpretasi reduksi dimensi LDA: variabel mana yang paling membedakan antar kelas Staging.
scaling_df <- as.data.frame(model_lda$scaling)
scaling_df$Variabel <- rownames(scaling_df)
scaling_df <- scaling_df %>%
arrange(abs(LD1)) %>%
mutate(
Variabel = factor(Variabel, levels = Variabel),
Warna = ifelse(LD1 >= 0, "Positif", "Negatif")
)
ggplot(scaling_df, aes(x = LD1, y = Variabel, fill = Warna)) +
geom_bar(stat = "identity") +
geom_vline(xintercept = 0, linewidth = 0.6) +
scale_fill_manual(values = c("Positif" = "#3498db", "Negatif" = "#e74c3c")) +
labs(
title = "Koefisien Fungsi Diskriminan LD1",
subtitle = "Variabel dengan |koefisien| besar = paling berpengaruh dalam pemisahan kelas",
x = "Koefisien LD1",
y = "Variabel",
fill = "Arah"
) +
theme_bw(base_size = 10) +
theme(
plot.title = element_text(face = "bold"),
axis.text.y = element_text(size = 8)
)scaling_df2 <- as.data.frame(model_lda$scaling)
scaling_df2$Variabel <- rownames(scaling_df2)
scaling_df2 <- scaling_df2 %>%
arrange(abs(LD2)) %>%
mutate(
Variabel = factor(Variabel, levels = Variabel),
Warna = ifelse(LD2 >= 0, "Positif", "Negatif")
)
ggplot(scaling_df2, aes(x = LD2, y = Variabel, fill = Warna)) +
geom_bar(stat = "identity") +
geom_vline(xintercept = 0, linewidth = 0.6) +
scale_fill_manual(values = c("Positif" = "#3498db", "Negatif" = "#e74c3c")) +
labs(
title = "Koefisien Fungsi Diskriminan LD2",
subtitle = "Variabel dengan |koefisien| besar = paling berpengaruh pada LD2",
x = "Koefisien LD2",
y = "Variabel",
fill = "Arah"
) +
theme_bw(base_size = 10) +
theme(
plot.title = element_text(face = "bold"),
axis.text.y = element_text(size = 8)
)cm_lda_df <- as.data.frame(cm_test_lda$table)
colnames(cm_lda_df) <- c("Prediksi", "Aktual", "Freq")
ggplot(cm_lda_df, aes(x = Prediksi, y = Aktual, fill = Freq)) +
geom_tile(color = "white") +
geom_text(aes(label = Freq), size = 4.5, fontface = "bold") +
scale_fill_gradient(low = "#dceefb", high = "#2980b9") +
labs(
title = sprintf("Confusion Matrix LDA (Data Uji) — Akurasi: %.2f%%",
cm_test_lda$overall["Accuracy"] * 100),
x = "Prediksi", y = "Aktual"
) +
theme_bw(base_size = 11) +
theme(plot.title = element_text(face = "bold"))cm_qda_df <- as.data.frame(cm_test_qda$table)
colnames(cm_qda_df) <- c("Prediksi", "Aktual", "Freq")
ggplot(cm_qda_df, aes(x = Prediksi, y = Aktual, fill = Freq)) +
geom_tile(color = "white") +
geom_text(aes(label = Freq), size = 4.5, fontface = "bold") +
scale_fill_gradient(low = "#fdebd0", high = "#e67e22") +
labs(
title = sprintf("Confusion Matrix QDA (Data Uji) — Akurasi: %.2f%%",
cm_test_qda$overall["Accuracy"] * 100),
x = "Prediksi", y = "Aktual"
) +
theme_bw(base_size = 11) +
theme(plot.title = element_text(face = "bold"))acc_df <- data.frame(
Model = c("LDA Train", "LDA Test", "QDA Train", "QDA Test"),
Akurasi = c(
cm_train_lda$overall["Accuracy"] * 100,
cm_test_lda$overall["Accuracy"] * 100,
cm_train_qda$overall["Accuracy"] * 100,
cm_test_qda$overall["Accuracy"] * 100
),
Jenis = c("Train", "Test", "Train", "Test")
)
acc_df$Model <- factor(acc_df$Model, levels = acc_df$Model)
ggplot(acc_df, aes(x = Model, y = Akurasi, fill = Jenis)) +
geom_bar(stat = "identity", width = 0.5) +
geom_text(aes(label = paste0(round(Akurasi, 2), "%")),
vjust = -0.4, size = 3.5) +
scale_fill_manual(values = c("Train" = "#2c3e50", "Test" = "#3498db")) +
ylim(0, 100) +
labs(
title = "Perbandingan Akurasi LDA vs QDA",
x = "",
y = "Akurasi (%)",
fill = "Data"
) +
theme_bw(base_size = 11) +
theme(plot.title = element_text(face = "bold"))Plot densitas ini menunjukkan sejauh mana keempat kelas Staging terpisah pada sumbu LD1 setelah reduksi dimensi.
lda_scores_test <- as.data.frame(pred_test_lda$x)
lda_scores_test$Staging <- data_test$Staging
ggplot(lda_scores_test, aes(x = LD1, fill = Staging)) +
geom_density(alpha = 0.5) +
scale_fill_manual(values = c("#e74c3c", "#3498db", "#2ecc71", "#f39c12")) +
labs(
title = "Distribusi Skor LD1 per Staging (Data Uji)",
x = "Skor LD1",
y = "Densitas",
fill = "Staging"
) +
theme_bw(base_size = 11) +
theme(plot.title = element_text(face = "bold"))ggplot(lda_scores_test, aes(x = LD2, fill = Staging)) +
geom_density(alpha = 0.5) +
scale_fill_manual(values = c("#e74c3c", "#3498db", "#2ecc71", "#f39c12")) +
labs(
title = "Distribusi Skor LD2 per Staging (Data Uji)",
x = "Skor LD2",
y = "Densitas",
fill = "Staging"
) +
theme_bw(base_size = 11) +
theme(plot.title = element_text(face = "bold"))Partition plot menggunakan dua variabel paling berpengaruh pada LD1: RNA_EF dan RNA_EOT. Proses ini membutuhkan waktu komputasi lebih lama.
partimat(
Staging ~ RNA_EF + RNA_EOT,
data = df_scaled,
method = "lda",
main = "Partition Plot LDA: RNA_EF vs RNA_EOT"
)partimat(
Staging ~ RNA_EF + RNA_EOT,
data = df_scaled,
method = "qda",
main = "Partition Plot QDA: RNA_EF vs RNA_EOT"
)Analisis diskriminan pada dataset HCV Egyptian Patients menghasilkan temuan sebagai berikut:
Pra-Pemrosesan: Pengecekan duplikasi memastikan tidak ada observasi ganda dalam data. Outlier dideteksi menggunakan metode IQR dan ditangani dengan winsorizing (pem-cap-an pada batas \(Q1 - 1.5 \times IQR\) dan \(Q3 + 1.5 \times IQR\)) untuk mempertahankan jumlah observasi sekaligus mengurangi pengaruh nilai ekstrem.
Reduksi Dimensi via LDA: LDA berhasil mereduksi dimensi fitur menjadi 3 fungsi diskriminan (LD1, LD2, LD3). LD1 menjelaskan porsi terbesar pemisahan antar kelas Staging, yang berarti sebagian besar informasi pembeda kelas dapat dirangkum hanya dalam satu dimensi baru.
Perbandingan LDA vs QDA:
## === SESSION INFO ===
## R version 4.4.1 (2024-06-14 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
##
## Matrix products: default
##
##
## locale:
## [1] LC_COLLATE=Indonesian_Indonesia.utf8 LC_CTYPE=Indonesian_Indonesia.utf8
## [3] LC_MONETARY=Indonesian_Indonesia.utf8 LC_NUMERIC=C
## [5] LC_TIME=Indonesian_Indonesia.utf8
##
## time zone: Asia/Jakarta
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] gridExtra_2.3 reshape2_1.4.5 klaR_1.7-4 MASS_7.3-60.2
## [5] caret_7.0-1 lattice_0.22-6 lubridate_1.9.4 forcats_1.0.0
## [9] stringr_1.5.1 dplyr_1.1.4 purrr_1.0.2 readr_2.1.5
## [13] tidyr_1.3.1 tibble_3.2.1 ggplot2_4.0.2 tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 timeDate_4052.112 farver_2.1.2
## [4] S7_0.2.1 fastmap_1.2.0 combinat_0.0-8
## [7] promises_1.3.3 pROC_1.19.0.1 labelled_2.16.0
## [10] digest_0.6.37 rpart_4.1.23 mime_0.12
## [13] timechange_0.3.0 lifecycle_1.0.4 survival_3.6-4
## [16] magrittr_2.0.3 compiler_4.4.1 rlang_1.1.4
## [19] sass_0.4.9 tools_4.4.1 utf8_1.2.4
## [22] yaml_2.3.10 data.table_1.16.4 knitr_1.49
## [25] labeling_0.4.3 plyr_1.8.9 RColorBrewer_1.1-3
## [28] miniUI_0.1.2 withr_3.0.1 nnet_7.3-19
## [31] grid_4.4.1 stats4_4.4.1 fansi_1.0.6
## [34] e1071_1.7-17 xtable_1.8-4 future_1.69.0
## [37] globals_0.19.0 scales_1.4.0 iterators_1.0.14
## [40] cli_3.6.3 rmarkdown_2.30 generics_0.1.3
## [43] rstudioapi_0.17.1 future.apply_1.20.2 tzdb_0.4.0
## [46] proxy_0.4-29 cachem_1.1.0 splines_4.4.1
## [49] parallel_4.4.1 vctrs_0.6.5 hardhat_1.4.2
## [52] Matrix_1.7-0 jsonlite_1.8.9 hms_1.1.3
## [55] listenv_0.10.0 foreach_1.5.2 gower_1.0.2
## [58] jquerylib_0.1.4 recipes_1.3.1 glue_1.7.0
## [61] parallelly_1.46.1 codetools_0.2-20 stringi_1.8.4
## [64] gtable_0.3.6 later_1.4.2 questionr_0.8.2
## [67] pillar_1.9.0 htmltools_0.5.8.1 ipred_0.9-15
## [70] lava_1.8.2 R6_2.5.1 evaluate_1.0.1
## [73] shiny_1.10.0 haven_2.5.4 highr_0.11
## [76] httpuv_1.6.16 bslib_0.8.0 class_7.3-22
## [79] Rcpp_1.0.13 nlme_3.1-164 prodlim_2025.04.28
## [82] xfun_0.49 pkgconfig_2.0.3 ModelMetrics_1.2.2.2