Breast cancer is among the most frequently diagnosed cancers worldwide, and accurate diagnosis is needed to improve the chances of successful therapy. Advances in machine learning make it possible to build classification models that support faster and more objective diagnosis. This study builds Support Vector Machine (SVM) models to classify breast tumors as malignant or benign using the Breast Cancer Wisconsin Diagnostic Dataset from the UCI Machine Learning Repository. Two kernels are compared: the linear kernel and the radial basis function (RBF) kernel. The data are split into training and test sets by stratified sampling in an 80:20 ratio. The numeric predictors are then standardized and the hyperparameters are tuned with 5-fold cross-validation. The models are evaluated with the confusion matrix, accuracy, sensitivity, specificity, and the Area Under the Curve (AUC). The results are expected to show how well SVM detects malignant cases and to give an indication of any advantage of the RBF kernel over the linear kernel in capturing complex patterns in medical data.
Keywords: Support Vector Machine, Classification, Breast Cancer, Machine Learning, Breast Cancer Wisconsin Diagnostic Dataset
Breast cancer is a major health problem that causes high morbidity and mortality among women in many countries. Medical research identifies early detection as an important factor in improving treatment effectiveness and patients' chances of recovery. Methods that support fast, consistent, and accurate diagnosis are therefore needed.
Advances in machine learning have opened the way for computational algorithms in healthcare. One widely used classification method is the Support Vector Machine (SVM). SVM handles high-dimensional data well and produces an optimal separating boundary between classes through the concepts of the hyperplane and the maximum margin.
The Breast Cancer Wisconsin Diagnostic Dataset (WDBC) is widely used in medical classification research. It contains cell-nucleus characteristics from Fine Needle Aspirate (FNA) examinations, which are used to distinguish malignant tumors from benign tumors.
Can cell-nucleus characteristics be used to classify tumors as malignant or benign with the Support Vector Machine method?
This study aims to:
This study is expected to improve understanding of how machine learning is applied in healthcare and to show the effectiveness of SVM in classifying breast cancer diagnoses.
The dataset used is Breast Cancer Wisconsin Diagnostic (WDBC), obtained from the UCI Machine Learning Repository.
Data source:
https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic
The dataset consists of 569 observations with 30 numeric predictor variables, one identifier column, and one target variable, the tumor diagnosis.
Target variable: diagnosis (M = malignant, B = benign in the raw file).
The unit of observation is the result of a Fine Needle Aspirate (FNA) examination of a breast mass.
library(tidyverse)
library(tidymodels)
library(kernlab)
library(corrplot)
library(knitr)
library(GGally)
theme_set(theme_minimal())
set.seed(15062026)
url_wdbc <- paste0(
"https://archive.ics.uci.edu/ml/machine-learning-databases/",
"breast-cancer-wisconsin/wdbc.data"
)
feature_names <- c(
"radius", "texture", "perimeter", "area",
"smoothness", "compactness", "concavity",
"concave_points", "symmetry",
"fractal_dimension"
)
col_names <- c(
"id",
"diagnosis",
paste0(feature_names, "_mean"),
paste0(feature_names, "_se"),
paste0(feature_names, "_worst")
)
wdbc <- read_csv(
url_wdbc,
col_names = col_names,
show_col_types = FALSE
) %>%
mutate(
diagnosis = factor(
if_else(
diagnosis == "M",
"Malignant",
"Benign"
),
levels = c(
"Malignant",
"Benign"
)
)
)
dim(wdbc)
## [1] 569 32
Interpretation:
The output gives the number of observations (569) and the number of variables (32) in the dataset.
glimpse(wdbc)
## Rows: 569
## Columns: 32
## $ id <dbl> 842302, 842517, 84300903, 84348301, 84358402, …
## $ diagnosis <fct> Malignant, Malignant, Malignant, Malignant, Ma…
## $ radius_mean <dbl> 17.990, 20.570, 19.690, 11.420, 20.290, 12.450…
## $ texture_mean <dbl> 10.38, 17.77, 21.25, 20.38, 14.34, 15.70, 19.9…
## $ perimeter_mean <dbl> 122.80, 132.90, 130.00, 77.58, 135.10, 82.57, …
## $ area_mean <dbl> 1001.0, 1326.0, 1203.0, 386.1, 1297.0, 477.1, …
## $ smoothness_mean <dbl> 0.11840, 0.08474, 0.10960, 0.14250, 0.10030, 0…
## $ compactness_mean <dbl> 0.27760, 0.07864, 0.15990, 0.28390, 0.13280, 0…
## $ concavity_mean <dbl> 0.30010, 0.08690, 0.19740, 0.24140, 0.19800, 0…
## $ concave_points_mean <dbl> 0.14710, 0.07017, 0.12790, 0.10520, 0.10430, 0…
## $ symmetry_mean <dbl> 0.2419, 0.1812, 0.2069, 0.2597, 0.1809, 0.2087…
## $ fractal_dimension_mean <dbl> 0.07871, 0.05667, 0.05999, 0.09744, 0.05883, 0…
## $ radius_se <dbl> 1.0950, 0.5435, 0.7456, 0.4956, 0.7572, 0.3345…
## $ texture_se <dbl> 0.9053, 0.7339, 0.7869, 1.1560, 0.7813, 0.8902…
## $ perimeter_se <dbl> 8.589, 3.398, 4.585, 3.445, 5.438, 2.217, 3.18…
## $ area_se <dbl> 153.40, 74.08, 94.03, 27.23, 94.44, 27.19, 53.…
## $ smoothness_se <dbl> 0.006399, 0.005225, 0.006150, 0.009110, 0.0114…
## $ compactness_se <dbl> 0.049040, 0.013080, 0.040060, 0.074580, 0.0246…
## $ concavity_se <dbl> 0.05373, 0.01860, 0.03832, 0.05661, 0.05688, 0…
## $ concave_points_se <dbl> 0.015870, 0.013400, 0.020580, 0.018670, 0.0188…
## $ symmetry_se <dbl> 0.03003, 0.01389, 0.02250, 0.05963, 0.01756, 0…
## $ fractal_dimension_se <dbl> 0.006193, 0.003532, 0.004571, 0.009208, 0.0051…
## $ radius_worst <dbl> 25.38, 24.99, 23.57, 14.91, 22.54, 15.47, 22.8…
## $ texture_worst <dbl> 17.33, 23.41, 25.53, 26.50, 16.67, 23.75, 27.6…
## $ perimeter_worst <dbl> 184.60, 158.80, 152.50, 98.87, 152.20, 103.40,…
## $ area_worst <dbl> 2019.0, 1956.0, 1709.0, 567.7, 1575.0, 741.6, …
## $ smoothness_worst <dbl> 0.1622, 0.1238, 0.1444, 0.2098, 0.1374, 0.1791…
## $ compactness_worst <dbl> 0.6656, 0.1866, 0.4245, 0.8663, 0.2050, 0.5249…
## $ concavity_worst <dbl> 0.71190, 0.24160, 0.45040, 0.68690, 0.40000, 0…
## $ concave_points_worst <dbl> 0.26540, 0.18600, 0.24300, 0.25750, 0.16250, 0…
## $ symmetry_worst <dbl> 0.4601, 0.2750, 0.3613, 0.6638, 0.2364, 0.3985…
## $ fractal_dimension_worst <dbl> 0.11890, 0.08902, 0.08758, 0.17300, 0.07678, 0…
missing_tbl <- data.frame(
Variable = names(wdbc),
Missing = colSums(is.na(wdbc)),
# drop the row names so each variable name is not printed twice
row.names = NULL
)
kable(
missing_tbl,
caption = "Number of Missing Values per Variable"
)
| Variable | Missing |
|---|---|
| id | 0 |
| diagnosis | 0 |
| radius_mean | 0 |
| texture_mean | 0 |
| perimeter_mean | 0 |
| area_mean | 0 |
| smoothness_mean | 0 |
| compactness_mean | 0 |
| concavity_mean | 0 |
| concave_points_mean | 0 |
| symmetry_mean | 0 |
| fractal_dimension_mean | 0 |
| radius_se | 0 |
| texture_se | 0 |
| perimeter_se | 0 |
| area_se | 0 |
| smoothness_se | 0 |
| compactness_se | 0 |
| concavity_se | 0 |
| concave_points_se | 0 |
| symmetry_se | 0 |
| fractal_dimension_se | 0 |
| radius_worst | 0 |
| texture_worst | 0 |
| perimeter_worst | 0 |
| area_worst | 0 |
| smoothness_worst | 0 |
| compactness_worst | 0 |
| concavity_worst | 0 |
| concave_points_worst | 0 |
| symmetry_worst | 0 |
| fractal_dimension_worst | 0 |
Interpretation:
No missing values were found in any variable, so no imputation is required.
class_dist <- wdbc %>%
count(diagnosis) %>%
mutate(
proporsi = n/sum(n)
)
kable(
class_dist,
digits = 3,
caption = "Class Distribution of Diagnosis"
)
| diagnosis | n | proporsi |
|---|---|---|
| Malignant | 212 | 0.373 |
| Benign | 357 | 0.627 |
ggplot(
class_dist,
aes(
x = diagnosis,
y = n,
fill = diagnosis
)
) +
geom_col() +
labs(
title = "Distribution of Tumor Diagnosis",
x = "Diagnosis",
y = "Frequency"
)
Interpretation:
The classes are moderately imbalanced: 212 malignant observations (37.3%) against 357 benign observations (62.7%). These proportions should be preserved in the train-test split through stratified sampling.
summary(
select(
wdbc,
-id
)
)
## diagnosis radius_mean texture_mean perimeter_mean
## Malignant:212 Min. : 6.981 Min. : 9.71 Min. : 43.79
## Benign :357 1st Qu.:11.700 1st Qu.:16.17 1st Qu.: 75.17
## Median :13.370 Median :18.84 Median : 86.24
## Mean :14.127 Mean :19.29 Mean : 91.97
## 3rd Qu.:15.780 3rd Qu.:21.80 3rd Qu.:104.10
## Max. :28.110 Max. :39.28 Max. :188.50
## area_mean smoothness_mean compactness_mean concavity_mean
## Min. : 143.5 Min. :0.05263 Min. :0.01938 Min. :0.00000
## 1st Qu.: 420.3 1st Qu.:0.08637 1st Qu.:0.06492 1st Qu.:0.02956
## Median : 551.1 Median :0.09587 Median :0.09263 Median :0.06154
## Mean : 654.9 Mean :0.09636 Mean :0.10434 Mean :0.08880
## 3rd Qu.: 782.7 3rd Qu.:0.10530 3rd Qu.:0.13040 3rd Qu.:0.13070
## Max. :2501.0 Max. :0.16340 Max. :0.34540 Max. :0.42680
## concave_points_mean symmetry_mean fractal_dimension_mean radius_se
## Min. :0.00000 Min. :0.1060 Min. :0.04996 Min. :0.1115
## 1st Qu.:0.02031 1st Qu.:0.1619 1st Qu.:0.05770 1st Qu.:0.2324
## Median :0.03350 Median :0.1792 Median :0.06154 Median :0.3242
## Mean :0.04892 Mean :0.1812 Mean :0.06280 Mean :0.4052
## 3rd Qu.:0.07400 3rd Qu.:0.1957 3rd Qu.:0.06612 3rd Qu.:0.4789
## Max. :0.20120 Max. :0.3040 Max. :0.09744 Max. :2.8730
## texture_se perimeter_se area_se smoothness_se
## Min. :0.3602 Min. : 0.757 Min. : 6.802 Min. :0.001713
## 1st Qu.:0.8339 1st Qu.: 1.606 1st Qu.: 17.850 1st Qu.:0.005169
## Median :1.1080 Median : 2.287 Median : 24.530 Median :0.006380
## Mean :1.2169 Mean : 2.866 Mean : 40.337 Mean :0.007041
## 3rd Qu.:1.4740 3rd Qu.: 3.357 3rd Qu.: 45.190 3rd Qu.:0.008146
## Max. :4.8850 Max. :21.980 Max. :542.200 Max. :0.031130
## compactness_se concavity_se concave_points_se symmetry_se
## Min. :0.002252 Min. :0.00000 Min. :0.000000 Min. :0.007882
## 1st Qu.:0.013080 1st Qu.:0.01509 1st Qu.:0.007638 1st Qu.:0.015160
## Median :0.020450 Median :0.02589 Median :0.010930 Median :0.018730
## Mean :0.025478 Mean :0.03189 Mean :0.011796 Mean :0.020542
## 3rd Qu.:0.032450 3rd Qu.:0.04205 3rd Qu.:0.014710 3rd Qu.:0.023480
## Max. :0.135400 Max. :0.39600 Max. :0.052790 Max. :0.078950
## fractal_dimension_se radius_worst texture_worst perimeter_worst
## Min. :0.0008948 Min. : 7.93 Min. :12.02 Min. : 50.41
## 1st Qu.:0.0022480 1st Qu.:13.01 1st Qu.:21.08 1st Qu.: 84.11
## Median :0.0031870 Median :14.97 Median :25.41 Median : 97.66
## Mean :0.0037949 Mean :16.27 Mean :25.68 Mean :107.26
## 3rd Qu.:0.0045580 3rd Qu.:18.79 3rd Qu.:29.72 3rd Qu.:125.40
## Max. :0.0298400 Max. :36.04 Max. :49.54 Max. :251.20
## area_worst smoothness_worst compactness_worst concavity_worst
## Min. : 185.2 Min. :0.07117 Min. :0.02729 Min. :0.0000
## 1st Qu.: 515.3 1st Qu.:0.11660 1st Qu.:0.14720 1st Qu.:0.1145
## Median : 686.5 Median :0.13130 Median :0.21190 Median :0.2267
## Mean : 880.6 Mean :0.13237 Mean :0.25427 Mean :0.2722
## 3rd Qu.:1084.0 3rd Qu.:0.14600 3rd Qu.:0.33910 3rd Qu.:0.3829
## Max. :4254.0 Max. :0.22260 Max. :1.05800 Max. :1.2520
## concave_points_worst symmetry_worst fractal_dimension_worst
## Min. :0.00000 Min. :0.1565 Min. :0.05504
## 1st Qu.:0.06493 1st Qu.:0.2504 1st Qu.:0.07146
## Median :0.09993 Median :0.2822 Median :0.08004
## Mean :0.11461 Mean :0.2901 Mean :0.08395
## 3rd Qu.:0.16140 3rd Qu.:0.3179 3rd Qu.:0.09208
## Max. :0.29100 Max. :0.6638 Max. :0.20750
ggplot(
wdbc,
aes(
radius_mean,
fill = diagnosis
)
) +
geom_density(
alpha = 0.5
) +
labs(
title = "Distribution of radius_mean"
)
ggplot(
wdbc,
aes(
area_mean,
fill = diagnosis
)
) +
geom_density(
alpha = 0.5
) +
labs(
title = "Distribution of area_mean"
)
ggplot(
wdbc,
aes(
concavity_mean,
fill = diagnosis
)
) +
geom_density(
alpha = 0.5
) +
labs(
title = "Distribution of concavity_mean"
)
Interpretation:
All three variables show different distributions for the malignant and benign classes, so they are potentially important predictors for classification.
ggplot(
wdbc,
aes(
diagnosis,
radius_mean,
fill = diagnosis
)
) +
geom_boxplot()
Interpretation:
The boxplot shows that the malignant group tends to have higher radius_mean values than the benign group.
cor_matrix <- wdbc %>%
select(
-id,
-diagnosis
) %>%
cor()
corrplot(
cor_matrix,
method = "color",
tl.cex = 0.6,
type = "upper"
)
Interpretation:
Several groups of variables are highly correlated. The complex relationships among the predictors are one reason for using SVM in this study.
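Part of this correlation is expected from geometry, because radius, perimeter, and area all describe the size of the nucleus. The sketch below uses simulated values, not the WDBC data; the idealised circular-nucleus assumption and all variable names are illustrative only. It shows how such features become nearly collinear and how strongly correlated pairs can be listed from a correlation matrix.

```r
# Simulated, idealised circular nuclei (assumption for illustration only).
set.seed(1)
radius    <- runif(200, min = 6, max = 28)
perimeter <- 2 * pi * radius
area      <- pi * radius^2
texture   <- rnorm(200, mean = 19, sd = 4)  # unrelated to size by construction

sim     <- data.frame(radius, perimeter, area, texture)
cor_sim <- cor(sim)

# List each pair of variables once (upper triangle) with |r| above 0.9.
idx <- which(abs(cor_sim) > 0.9 & upper.tri(cor_sim), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cor_sim)[idx[, "row"]],
  var2 = colnames(cor_sim)[idx[, "col"]],
  r    = cor_sim[idx]
)
high_pairs
```

The same `which(..., arr.ind = TRUE)` pattern could be applied to `cor_matrix` to name the correlated groups that the plot only shows visually.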
The data are split by stratified sampling to preserve the proportions of the diagnosis classes in the training and test sets.
split_obj <- initial_split(
wdbc,
prop = 0.80,
strata = diagnosis
)
train_dat <- training(split_obj)
test_dat <- testing(split_obj)
dim(train_dat)
## [1] 454 32
dim(test_dat)
## [1] 115 32
train_dat %>%
count(diagnosis) %>%
mutate(proporsi = round(n/sum(n), 3))
## # A tibble: 2 × 3
## diagnosis n proporsi
## <fct> <int> <dbl>
## 1 Malignant 169 0.372
## 2 Benign 285 0.628
test_dat %>%
count(diagnosis) %>%
mutate(proporsi = round(n/sum(n), 3))
## # A tibble: 2 × 3
## diagnosis n proporsi
## <fct> <int> <dbl>
## 1 Malignant 43 0.374
## 2 Benign 72 0.626
Interpretation:
The share of malignant cases is nearly the same in the training data (37.2%) and the test data (37.4%), and both are close to the full dataset (37.3%), so stratified sampling preserved the class proportions.
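The idea behind a stratified split can be sketched in base R: sample the training rows separately within each class so that both parts keep roughly the original class mix. This is a simplified illustration, not the algorithm that `initial_split()` implements; the rounding rule (`floor`) and the seed are assumptions.

```r
# Class labels with the same counts as the WDBC data (212 malignant, 357 benign).
diagnosis <- factor(
  rep(c("Malignant", "Benign"), times = c(212, 357)),
  levels = c("Malignant", "Benign")
)

set.seed(1)
# Within each class, draw 80% of the row indices (rounded down) for training.
rows_by_class <- split(seq_along(diagnosis), diagnosis)
train_idx <- unlist(lapply(rows_by_class, function(rows) {
  sample(rows, size = floor(0.8 * length(rows)))
}))

train_counts <- table(diagnosis[train_idx])
test_counts  <- table(diagnosis[-train_idx])
round(prop.table(train_counts), 3)
round(prop.table(test_counts), 3)
```

Because the sampling is done per class, the class counts in each part are fixed by the split proportion and only the choice of rows is random.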
folds <- vfold_cv(
train_dat,
v = 5,
strata = diagnosis
)
folds
## # 5-fold cross-validation using stratification
## # A tibble: 5 × 2
## splits id
## <list> <chr>
## 1 <split [363/91]> Fold1
## 2 <split [363/91]> Fold2
## 3 <split [363/91]> Fold3
## 4 <split [363/91]> Fold4
## 5 <split [364/90]> Fold5
Interpretation:
Cross-validation is used to obtain a more stable estimate of model performance during tuning.
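A minimal base R sketch of stratified fold assignment is given below. It is an illustration under stated assumptions (round-robin fold labels shuffled within each class, arbitrary seed) and is not taken from the `vfold_cv()` source; it only shows why each fold ends up with about one fifth of the rows and a similar class mix.

```r
# Training-set class labels with the counts reported above (169 / 285).
train_diag <- factor(
  rep(c("Malignant", "Benign"), times = c(169, 285)),
  levels = c("Malignant", "Benign")
)

set.seed(1)
fold_id <- integer(length(train_diag))
for (cls in levels(train_diag)) {
  rows <- which(train_diag == cls)
  # Deal fold labels 1..5 round-robin, then shuffle them within the class.
  fold_id[rows] <- sample(rep_len(1:5, length(rows)))
}

# Assessment-set size of each fold and its class mix.
fold_sizes <- table(fold_id)
fold_sizes
round(prop.table(table(fold_id, train_diag), margin = 1), 3)
```

Each fold serves once as the assessment set while the remaining four folds are used to fit the model.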
rec_svm <- recipe(
diagnosis ~ .,
data = train_dat
) %>%
update_role(
id,
new_role = "id"
) %>%
step_zv(all_predictors()) %>%
step_normalize(
all_numeric_predictors()
)
rec_svm
##
## ── Recipe ──────────────────────────────────────────────────────────────────────
##
## ── Inputs
## Number of variables by role
## outcome: 1
## predictor: 30
## id: 1
##
## ── Operations
## • Zero variance filter on: all_predictors()
## • Centering and scaling for: all_numeric_predictors()
Interpretation:
Normalization is applied because SVM is sensitive to the scale of the variables. The centering and scaling statistics are estimated from the training data only, so no information leaks from the test data.
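The leakage-free pattern can be illustrated in base R: the mean and standard deviation come from the training part and are then reused, unchanged, on the test part. The data below are simulated and the object names are hypothetical; the sketch mirrors the principle only, not the internals of `step_normalize()`.

```r
set.seed(1)
# Hypothetical predictor on an arbitrary scale, split into train and test parts.
x_train <- rnorm(100, mean = 650, sd = 350)
x_test  <- rnorm(25,  mean = 650, sd = 350)

# Estimate centering and scaling constants from the training data only.
mu <- mean(x_train)
s  <- sd(x_train)

# Apply the same training-derived constants to both sets.
z_train <- (x_train - mu) / s
z_test  <- (x_test  - mu) / s

c(mean_train = mean(z_train), sd_train = sd(z_train))
c(mean_test  = mean(z_test),  sd_test  = sd(z_test))
```

The scaled test values generally do not have mean 0 and standard deviation 1 exactly, which is expected: the test set is transformed with constants it did not help to estimate.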
Support Vector Machine (SVM) is a classification method that looks for the optimal hyperplane separating two classes with the maximum margin.
This study compares two kernels: the linear kernel and the radial basis function (RBF) kernel.
The cost parameter controls the penalty for misclassification.
The sigma parameter is used in the RBF kernel to control the flexibility of the decision boundary.
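To make the role of sigma concrete, the sketch below evaluates a Gaussian RBF kernel by hand, assuming the parameterisation used by kernlab's `rbfdot`, k(x, y) = exp(-sigma * ||x - y||^2). The function name `rbf_kernel` and the two example points are hypothetical.

```r
# Gaussian RBF kernel, assuming the kernlab::rbfdot parameterisation:
# k(x, y) = exp(-sigma * ||x - y||^2)
rbf_kernel <- function(x, y, sigma) {
  exp(-sigma * sum((x - y)^2))
}

x <- c(0, 0)
y <- c(1, 2)  # squared Euclidean distance from x is 5

# Similarity of the same pair of points under increasing sigma.
sigmas <- c(0.001, 0.01, 0.1, 1)
sims   <- sapply(sigmas, function(s) rbf_kernel(x, y, sigma = s))
data.frame(sigma = sigmas, similarity = round(sims, 4))
```

With a small sigma, distant points still look similar and the decision boundary is smooth; with a large sigma, similarity falls off quickly with distance and the boundary can bend around individual observations.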
svm_linear_spec <- svm_linear(
cost = tune()
) %>%
set_engine(
"kernlab",
prob.model = TRUE
) %>%
set_mode(
"classification"
)
svm_linear_spec
## Linear Support Vector Machine Model Specification (classification)
##
## Main Arguments:
## cost = tune()
##
## Engine-Specific Arguments:
## prob.model = TRUE
##
## Computational engine: kernlab
svm_linear_wf <- workflow() %>%
add_recipe(rec_svm) %>%
add_model(svm_linear_spec)
svm_linear_wf
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: svm_linear()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 2 Recipe Steps
##
## • step_zv()
## • step_normalize()
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Linear Support Vector Machine Model Specification (classification)
##
## Main Arguments:
## cost = tune()
##
## Engine-Specific Arguments:
## prob.model = TRUE
##
## Computational engine: kernlab
linear_grid <- grid_regular(
cost(
range = c(-5, 5)
),
levels = 10
)
linear_grid
## # A tibble: 10 × 1
## cost
## <dbl>
## 1 0.0312
## 2 0.0675
## 3 0.146
## 4 0.315
## 5 0.680
## 6 1.47
## 7 3.17
## 8 6.86
## 9 14.8
## 10 32
linear_metrics <- metric_set(
roc_auc,
accuracy,
sens,
spec
)
linear_tuned <- tune_grid(
svm_linear_wf,
resamples = folds,
grid = linear_grid,
metrics = linear_metrics,
control = control_grid(
save_pred = TRUE
)
)
show_best(
linear_tuned,
metric = "roc_auc"
)
## # A tibble: 5 × 7
## cost .metric .estimator mean n std_err .config
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 0.0312 roc_auc binary 0.995 5 0.00265 pre0_mod01_post0
## 2 0.0675 roc_auc binary 0.995 5 0.00327 pre0_mod02_post0
## 3 0.146 roc_auc binary 0.994 5 0.00369 pre0_mod03_post0
## 4 0.680 roc_auc binary 0.994 5 0.00390 pre0_mod05_post0
## 5 0.315 roc_auc binary 0.994 5 0.00371 pre0_mod04_post0
collect_metrics(
linear_tuned
)
## # A tibble: 40 × 7
## cost .metric .estimator mean n std_err .config
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 0.0312 accuracy binary 0.974 5 0.00438 pre0_mod01_post0
## 2 0.0312 roc_auc binary 0.995 5 0.00265 pre0_mod01_post0
## 3 0.0312 sens binary 0.935 5 0.0110 pre0_mod01_post0
## 4 0.0312 spec binary 0.996 5 0.00351 pre0_mod01_post0
## 5 0.0675 accuracy binary 0.980 5 0.00410 pre0_mod02_post0
## 6 0.0675 roc_auc binary 0.995 5 0.00327 pre0_mod02_post0
## 7 0.0675 sens binary 0.953 5 0.00710 pre0_mod02_post0
## 8 0.0675 spec binary 0.996 5 0.00351 pre0_mod02_post0
## 9 0.146 accuracy binary 0.982 5 0.00560 pre0_mod03_post0
## 10 0.146 roc_auc binary 0.994 5 0.00369 pre0_mod03_post0
## # ℹ 30 more rows
autoplot(
linear_tuned
) +
labs(
title = "Tuning Summary: Linear SVM"
)
Interpretation:
The plot shows how model performance changes across the cost values. The cost value with the highest mean cross-validated ROC-AUC is selected as the best parameter.
best_linear <- select_best(
linear_tuned,
metric = "roc_auc"
)
best_linear
## # A tibble: 1 × 2
## cost .config
## <dbl> <chr>
## 1 0.0312 pre0_mod01_post0
final_linear_wf <- finalize_workflow(
svm_linear_wf,
best_linear
)
final_linear_wf
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: svm_linear()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 2 Recipe Steps
##
## • step_zv()
## • step_normalize()
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Linear Support Vector Machine Model Specification (classification)
##
## Main Arguments:
## cost = 0.03125
##
## Engine-Specific Arguments:
## prob.model = TRUE
##
## Computational engine: kernlab
svm_rbf_spec <- svm_rbf(
cost = tune(),
rbf_sigma = tune()
) %>%
set_engine(
"kernlab",
prob.model = TRUE
) %>%
set_mode(
"classification"
)
svm_rbf_spec
## Radial Basis Function Support Vector Machine Model Specification (classification)
##
## Main Arguments:
## cost = tune()
## rbf_sigma = tune()
##
## Engine-Specific Arguments:
## prob.model = TRUE
##
## Computational engine: kernlab
svm_rbf_wf <- workflow() %>%
add_recipe(rec_svm) %>%
add_model(svm_rbf_spec)
svm_rbf_wf
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: svm_rbf()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 2 Recipe Steps
##
## • step_zv()
## • step_normalize()
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Radial Basis Function Support Vector Machine Model Specification (classification)
##
## Main Arguments:
## cost = tune()
## rbf_sigma = tune()
##
## Engine-Specific Arguments:
## prob.model = TRUE
##
## Computational engine: kernlab
svm_grid <- grid_regular(
cost(
range = c(-5, 5)
),
rbf_sigma(
range = c(-10, 0)
),
levels = 5
)
svm_grid
## # A tibble: 25 × 2
## cost rbf_sigma
## <dbl> <dbl>
## 1 0.0312 0.0000000001
## 2 0.177 0.0000000001
## 3 1 0.0000000001
## 4 5.66 0.0000000001
## 5 32 0.0000000001
## 6 0.0312 0.0000000316
## 7 0.177 0.0000000316
## 8 1 0.0000000316
## 9 5.66 0.0000000316
## 10 32 0.0000000316
## # ℹ 15 more rows
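The grid values printed above are consistent with `cost` being spaced evenly on a log2 scale and `rbf_sigma` on a log10 scale, so the `range` arguments are exponents and not raw parameter values. The base R sketch below rebuilds the same 5 x 5 grid by hand under that assumption; `manual_grid` is a hypothetical name.

```r
# Assumption: range = c(-5, 5) means 2^-5 .. 2^5 for cost,
# and range = c(-10, 0) means 10^-10 .. 10^0 for rbf_sigma.
cost_vals  <- 2^seq(-5, 5, length.out = 5)
sigma_vals <- 10^seq(-10, 0, length.out = 5)

# Full factorial grid: every cost value crossed with every sigma value.
manual_grid <- expand.grid(cost = cost_vals, rbf_sigma = sigma_vals)

signif(cost_vals, 3)
signif(sigma_vals, 3)
nrow(manual_grid)

# The 10-level grid of the linear model follows the same rule.
signif(2^seq(-5, 5, length.out = 10), 3)
```

Reading the ranges as exponents also explains why the candidate values are not evenly spaced on the original scale.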
svm_metrics <- metric_set(
roc_auc,
accuracy,
sens,
spec
)
svm_tuned <- tune_grid(
svm_rbf_wf,
resamples = folds,
grid = svm_grid,
metrics = svm_metrics,
control = control_grid(
save_pred = TRUE
)
)
## maximum number of iterations reached 2.861758e-05 -2.861758e-05
## (the same "maximum number of iterations reached" message was printed repeatedly, with different values, during tuning; the repeats are omitted here)
show_best(
svm_tuned,
metric = "roc_auc"
)
## # A tibble: 5 × 8
## cost rbf_sigma .metric .estimator mean n std_err .config
## <dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 5.66 0.00316 roc_auc binary 0.994 5 0.00281 pre0_mod19_post0
## 2 32 0.00316 roc_auc binary 0.993 5 0.00386 pre0_mod24_post0
## 3 1 0.00316 roc_auc binary 0.993 5 0.00347 pre0_mod14_post0
## 4 32 0.00001 roc_auc binary 0.988 5 0.00373 pre0_mod23_post0
## 5 0.177 0.00316 roc_auc binary 0.988 5 0.00369 pre0_mod09_post0
collect_metrics(
svm_tuned
)
## # A tibble: 100 × 8
## cost rbf_sigma .metric .estimator mean n std_err .config
## <dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 0.0312 0.0000000001 accuracy binary 0.628 5 0.00139 pre0_mod01_post0
## 2 0.0312 0.0000000001 roc_auc binary 0.979 5 0.00465 pre0_mod01_post0
## 3 0.0312 0.0000000001 sens binary 0 5 0 pre0_mod01_post0
## 4 0.0312 0.0000000001 spec binary 1 5 0 pre0_mod01_post0
## 5 0.0312 0.0000000316 accuracy binary 0.628 5 0.00139 pre0_mod02_post0
## 6 0.0312 0.0000000316 roc_auc binary 0.985 5 0.00377 pre0_mod02_post0
## 7 0.0312 0.0000000316 sens binary 0 5 0 pre0_mod02_post0
## 8 0.0312 0.0000000316 spec binary 1 5 0 pre0_mod02_post0
## 9 0.0312 0.00001 accuracy binary 0.628 5 0.00139 pre0_mod03_post0
## 10 0.0312 0.00001 roc_auc binary 0.985 5 0.00370 pre0_mod03_post0
## # ℹ 90 more rows
autoplot(
svm_tuned
) +
labs(
title = "Tuning Summary: RBF SVM"
)
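Interpretation:
The tuning plot shows how combinations of the cost and sigma parameters affect model performance. For the lowest cost with very small sigma values, sensitivity is 0 and specificity is 1, which means every observation is predicted as benign and accuracy equals the benign share (0.628).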
best_par <- select_best(
svm_tuned,
metric = "roc_auc"
)
best_par
## # A tibble: 1 × 3
## cost rbf_sigma .config
## <dbl> <dbl> <chr>
## 1 5.66 0.00316 pre0_mod19_post0
final_rbf_wf <- finalize_workflow(
svm_rbf_wf,
best_par
)
final_rbf_wf
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: svm_rbf()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 2 Recipe Steps
##
## • step_zv()
## • step_normalize()
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Radial Basis Function Support Vector Machine Model Specification (classification)
##
## Main Arguments:
## cost = 5.65685424949238
## rbf_sigma = 0.00316227766016838
##
## Engine-Specific Arguments:
## prob.model = TRUE
##
## Computational engine: kernlab
linear_fit <- last_fit(
final_linear_wf,
split_obj,
metrics = svm_metrics
)
linear_fit
## # Resampling results
## # Manual resampling
## # A tibble: 1 × 6
## splits id .metrics .notes .predictions .workflow
## <list> <chr> <list> <list> <list> <list>
## 1 <split [454/115]> train/test split <tibble> <tibble> <tibble> <workflow>
rbf_fit <- last_fit(
final_rbf_wf,
split_obj,
metrics = svm_metrics
)
rbf_fit
## # Resampling results
## # Manual resampling
## # A tibble: 1 × 6
## splits id .metrics .notes .predictions .workflow
## <list> <chr> <list> <list> <list> <list>
## 1 <split [454/115]> train/test split <tibble> <tibble> <tibble> <workflow>
Evaluation uses the test data, which were not used during model training or tuning, and therefore gives a more objective picture of model performance.
linear_test_metrics <- collect_metrics(
linear_fit
)
kable(
linear_test_metrics,
digits = 4,
caption = "Test-Set Evaluation Metrics: Linear SVM"
)
| .metric | .estimator | .estimate | .config |
|---|---|---|---|
| accuracy | binary | 0.9739 | pre0_mod0_post0 |
| sens | binary | 0.9535 | pre0_mod0_post0 |
| spec | binary | 0.9861 | pre0_mod0_post0 |
| roc_auc | binary | 0.9958 | pre0_mod0_post0 |
rbf_test_metrics <- collect_metrics(
rbf_fit
)
kable(
rbf_test_metrics,
digits = 4,
caption = "Test-Set Evaluation Metrics: RBF SVM"
)
| .metric | .estimator | .estimate | .config |
|---|---|---|---|
| accuracy | binary | 0.9739 | pre0_mod0_post0 |
| sens | binary | 0.9535 | pre0_mod0_post0 |
| spec | binary | 0.9861 | pre0_mod0_post0 |
| roc_auc | binary | 0.9958 | pre0_mod0_post0 |
Interpretation:
The tables above report the performance of both models on the test data in terms of accuracy, sensitivity, specificity, and ROC-AUC.
pred_linear <- collect_predictions(
linear_fit
)
head(pred_linear)
## # A tibble: 6 × 7
## .pred_class .pred_Malignant .pred_Benign id diagnosis .row .config
## <fct> <dbl> <dbl> <chr> <fct> <int> <chr>
## 1 Malignant 1.000 0.00000164 train/test s… Malignant 3 pre0_m…
## 2 Malignant 1.000 0.000289 train/test s… Malignant 4 pre0_m…
## 3 Malignant 0.997 0.00317 train/test s… Malignant 9 pre0_m…
## 4 Malignant 0.997 0.00318 train/test s… Malignant 12 pre0_m…
## 5 Malignant 1.000 0.0000000572 train/test s… Malignant 19 pre0_m…
## 6 Benign 0.0501 0.950 train/test s… Benign 20 pre0_m…
cm_linear <- conf_mat(
pred_linear,
truth = diagnosis,
estimate = .pred_class
)
cm_linear
## Truth
## Prediction Malignant Benign
## Malignant 41 1
## Benign 2 71
pred_rbf <- collect_predictions(
rbf_fit
)
head(pred_rbf)
## # A tibble: 6 × 7
## .pred_class .pred_Malignant .pred_Benign id diagnosis .row .config
## <fct> <dbl> <dbl> <chr> <fct> <int> <chr>
## 1 Malignant 1.000 0.00000185 train/test s… Malignant 3 pre0_m…
## 2 Malignant 0.999 0.000695 train/test s… Malignant 4 pre0_m…
## 3 Malignant 0.999 0.00133 train/test s… Malignant 9 pre0_m…
## 4 Malignant 0.998 0.00206 train/test s… Malignant 12 pre0_m…
## 5 Malignant 1.000 0.000000105 train/test s… Malignant 19 pre0_m…
## 6 Benign 0.0415 0.959 train/test s… Benign 20 pre0_m…
cm_rbf <- conf_mat(
pred_rbf,
truth = diagnosis,
estimate = .pred_class
)
cm_rbf
## Truth
## Prediction Malignant Benign
## Malignant 41 1
## Benign 2 71
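Interpretation:
The confusion matrix shows how many observations the model classified correctly and incorrectly. Both models classify 41 of the 43 malignant and 71 of the 72 benign test observations correctly, leaving 2 false negatives and 1 false positive.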
names(pred_rbf)
## [1] ".pred_class" ".pred_Malignant" ".pred_Benign" "id"
## [5] "diagnosis" ".row" ".config"
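The output confirms that the prediction table contains the class-probability columns .pred_Malignant and .pred_Benign.
The ROC curves that follow are drawn from .pred_Malignant, the predicted probability of the malignant class.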
roc_curve(
pred_linear,
truth = diagnosis,
.pred_Malignant
) %>%
autoplot() +
labs(
title = "ROC Curve SVM Linear"
)
roc_curve(
pred_rbf,
truth = diagnosis,
.pred_Malignant
) %>%
autoplot() +
labs(
title = "ROC Curve SVM RBF"
)
Interpretation:
The closer the ROC curve lies to the top-left corner, the better the model distinguishes the malignant and benign classes.
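How an ROC curve and its AUC arise from predicted probabilities can be sketched in base R. The probabilities and labels below are hypothetical toy values, not the predictions of either model, and the code is a simplified illustration and not the routine used by `roc_curve()` or `roc_auc()`.

```r
# Hypothetical predicted probabilities of the malignant class and true labels
# (1 = malignant, 0 = benign); these are not the model's actual predictions.
prob  <- c(0.90, 0.80, 0.70, 0.60, 0.55, 0.40, 0.30, 0.20)
truth <- c(1,    1,    0,    1,    0,    1,    0,    0)

# One ROC point per threshold: classify as malignant when prob >= threshold.
thresholds <- sort(unique(prob), decreasing = TRUE)
roc_points <- t(sapply(thresholds, function(th) {
  pred <- as.integer(prob >= th)
  c(threshold   = th,
    sensitivity = sum(pred == 1 & truth == 1) / sum(truth == 1),
    specificity = sum(pred == 0 & truth == 0) / sum(truth == 0))
}))
roc_points

# AUC via the rank (Mann-Whitney) identity: the share of malignant/benign
# pairs in which the malignant case receives the higher score.
n_pos <- sum(truth == 1)
n_neg <- sum(truth == 0)
auc <- (sum(rank(prob)[truth == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
auc
```

In this toy example 13 of the 16 malignant/benign pairs are ordered correctly, so the AUC is 13/16 = 0.8125. An AUC of 1 would mean every malignant case is scored above every benign case.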
comparison_metrics <- bind_rows(
linear_test_metrics %>%
mutate(Model = "Linear"),
rbf_test_metrics %>%
mutate(Model = "RBF")
) %>%
select(
Model,
.metric,
.estimate
)
kable(
comparison_metrics,
digits = 4,
caption = "Comparison of Evaluation Metrics for Both Models"
)
| Model | .metric | .estimate |
|---|---|---|
| Linear | accuracy | 0.9739 |
| Linear | sens | 0.9535 |
| Linear | spec | 0.9861 |
| Linear | roc_auc | 0.9958 |
| RBF | accuracy | 0.9739 |
| RBF | sens | 0.9535 |
| RBF | spec | 0.9861 |
| RBF | roc_auc | 0.9958 |
comparison_plot <- comparison_metrics %>%
filter(
.metric %in%
c(
"accuracy",
"sens",
"spec",
"roc_auc"
)
)
ggplot(
comparison_plot,
aes(
x = Model,
y = .estimate,
fill = Model
)
) +
geom_col() +
facet_wrap(
~ .metric,
scales = "free_y"
) +
labs(
title = "Performance Comparison of Linear and RBF SVM",
x = "Model",
y = "Metric Value"
)
A false negative is a case in which an observation that is actually malignant is predicted as benign.
This error matters greatly in cancer diagnosis because it can delay the patient's treatment.
false_negative_rbf <- pred_rbf %>%
filter(
diagnosis == "Malignant",
.pred_class == "Benign"
)
nrow(false_negative_rbf)
## [1] 2
false_negative_rbf
## # A tibble: 2 × 7
## .pred_class .pred_Malignant .pred_Benign id diagnosis .row .config
## <fct> <dbl> <dbl> <chr> <fct> <int> <chr>
## 1 Benign 0.250 0.750 train/test s… Malignant 39 pre0_m…
## 2 Benign 0.0733 0.927 train/test s… Malignant 74 pre0_m…
Interpretation:
The RBF model misclassifies two malignant test observations as benign. Their predicted probabilities of being malignant are 0.250 and 0.0733.
Accuracy is the proportion of all observations that are classified correctly.
Sensitivity is the most important metric in this study because it measures the model's ability to detect malignant cases.
The higher the sensitivity, the fewer cancer cases go undetected.
Specificity measures the model's ability to recognize benign cases correctly.
ROC-AUC evaluates the model's ability to distinguish the two diagnosis classes.
Values close to 1 indicate very good discrimination.
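The first three metrics can be recomputed by hand from the test-set confusion matrix reported earlier (41, 1, 2, 71), with malignant treated as the positive class. The short base R sketch below does this; the object names are illustrative.

```r
# Test-set confusion matrix counts reported earlier (same for both kernels).
tp <- 41  # malignant predicted as malignant
fn <- 2   # malignant predicted as benign
fp <- 1   # benign predicted as malignant
tn <- 71  # benign predicted as benign

metrics <- c(
  accuracy    = (tp + tn) / (tp + tn + fp + fn),
  sensitivity = tp / (tp + fn),
  specificity = tn / (tn + fp)
)
round(metrics, 4)
```

The results, 112/115, 41/43, and 71/72, reproduce the accuracy, sensitivity, and specificity values in the metric tables.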
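On the test data the linear and RBF models obtain identical accuracy (0.9739), sensitivity (0.9535), specificity (0.9861), and ROC-AUC (0.9958).
The model with the highest sensitivity and ROC-AUC would be considered the best model, but on this split these criteria do not separate the two kernels.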
This study has several limitations:
This study applies Support Vector Machine (SVM) to classify breast cancer diagnoses using the Breast Cancer Wisconsin Diagnostic Dataset.
Two kernels are compared: the linear kernel and the radial basis function (RBF) kernel.
Evaluation uses the confusion matrix, accuracy, sensitivity, specificity, and ROC-AUC.
The model with the highest sensitivity and ROC-AUC can be considered the best model because it detects malignant cases more reliably; on the test data used here, both kernels reached the same values for these metrics.
Breast Cancer Wisconsin (Diagnostic) Dataset. UCI Machine Learning Repository.
https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic
Kuhn, M., & Silge, J. (2022). Tidy Modeling with R.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning.
Vapnik, V. N. (1998). Statistical Learning Theory.
Posit Team. Tidymodels Documentation.