1. Dataset Used

The dataset used is Palmer Penguins, which contains morphological measurements and species labels for penguins from the Palmer Archipelago.

library(palmerpenguins)
## Warning: package 'palmerpenguins' was built under R version 4.4.3
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)

data("penguins")
head(penguins)

2. Data Exploration and Initial Visualization

Exploration starts with a summary of the data, followed by a scatter plot of bill length against flipper length to see how well these two features separate the species.

summary(penguins)
##       species          island    bill_length_mm  bill_depth_mm  
##  Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
##  Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
##  Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
##                                  Mean   :43.92   Mean   :17.15  
##                                  3rd Qu.:48.50   3rd Qu.:18.70  
##                                  Max.   :59.60   Max.   :21.50  
##                                  NA's   :2       NA's   :2      
##  flipper_length_mm  body_mass_g       sex           year     
##  Min.   :172.0     Min.   :2700   female:165   Min.   :2007  
##  1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007  
##  Median :197.0     Median :4050   NA's  : 11   Median :2008  
##  Mean   :200.9     Mean   :4202                Mean   :2008  
##  3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009  
##  Max.   :231.0     Max.   :6300                Max.   :2009  
##  NA's   :2         NA's   :2
ggplot(penguins, aes(x = flipper_length_mm, y = bill_length_mm, color = species)) +
  geom_point(size = 2) +
  labs(title = "Penguin Distribution by Morphological Features") +
  theme_minimal()
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
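
The warning above reports that two rows were dropped from the plot. As a small sketch of how to check where the missing values sit, the snippet below counts NAs per column and compares the row count before and after `na.omit()`. It uses only base R on the `penguins` data frame; the variable names `na_counts` and `n_complete` are introduced here for illustration.

```r
library(palmerpenguins)

# Number of missing values in each column
na_counts <- colSums(is.na(penguins))
na_counts

# Rows remaining after dropping every row that has at least one NA
n_complete <- nrow(na.omit(penguins))
c(all_rows = nrow(penguins), complete_rows = n_complete)
```

Because `na.omit()` drops a row when any column is missing, the rows with an unknown `sex` are removed as well, even when the measurements themselves are present.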

3. Linear and RBF SVM Models with K-Fold CV

Both models are evaluated with K-fold cross-validation (k = 5). The implementation and a summary of each model's performance follow.

library(e1071)
## Warning: package 'e1071' was built under R version 4.4.3
library(caret)
## Warning: package 'caret' was built under R version 4.4.3
## Loading required package: lattice
penguins_clean <- penguins %>%
  na.omit() %>%
  select(species, bill_length_mm, flipper_length_mm)

train_control <- trainControl(method = "cv", number = 5)

svm_linear <- train(species ~ ., data = penguins_clean, method = "svmLinear", trControl = train_control)
svm_rbf <- train(species ~ ., data = penguins_clean, method = "svmRadial", trControl = train_control)

svm_linear
## Support Vector Machines with Linear Kernel 
## 
## 333 samples
##   2 predictor
##   3 classes: 'Adelie', 'Chinstrap', 'Gentoo' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 267, 267, 266, 266, 266 
## Resampling results:
## 
##   Accuracy  Kappa    
##   0.951877  0.9241863
## 
## Tuning parameter 'C' was held constant at a value of 1
svm_rbf
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 333 samples
##   2 predictor
##   3 classes: 'Adelie', 'Chinstrap', 'Gentoo' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 265, 266, 267, 268, 266 
## Resampling results across tuning parameters:
## 
##   C     Accuracy   Kappa    
##   0.25  0.9521416  0.9246351
##   0.50  0.9551719  0.9295420
##   1.00  0.9551719  0.9295158
## 
## Tuning parameter 'sigma' was held constant at a value of 2.332133
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 2.332133 and C = 0.5.
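
The two runs above were resampled independently, so their fold sizes differ (267, 267, 266, ... versus 265, 266, 267, ...). A minimal sketch of a like-for-like comparison is shown below, assuming it is acceptable to fix the folds up front: `createFolds()` builds one set of training indices, both models reuse it through the `index` argument of `trainControl()`, and `resamples()` collects the per-fold metrics. The seed value and the names `folds`, `ctrl`, and `cv_results` are choices made for this example, and its numbers will not match the output above exactly.

```r
library(palmerpenguins)
library(caret)

# Same cleaning as above: drop rows with any NA, keep the two predictors
penguins_clean <- na.omit(penguins)[, c("species", "bill_length_mm", "flipper_length_mm")]

# Fix the fold assignment once so both models see identical splits
set.seed(123)  # arbitrary seed, used only for reproducibility
folds <- createFolds(penguins_clean$species, k = 5, returnTrain = TRUE)
ctrl  <- trainControl(method = "cv", number = 5, index = folds)

svm_linear <- train(species ~ ., data = penguins_clean,
                    method = "svmLinear", trControl = ctrl)
svm_rbf    <- train(species ~ ., data = penguins_clean,
                    method = "svmRadial", trControl = ctrl)

# Per-fold accuracy and kappa for both models, side by side
cv_results <- resamples(list(linear = svm_linear, rbf = svm_rbf))
summary(cv_results)

# Paired per-fold differences between the two models
summary(diff(cv_results))
```

With shared folds, the difference in accuracy reflects the models rather than the luck of the split, which matters when the gap is only a fraction of a percentage point.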

4. Decision Boundary Visualization (2D)

The decision boundaries are plotted to compare the linear and RBF models.

model_linear <- svm(species ~ ., data = penguins_clean, kernel = "linear")
model_rbf <- svm(species ~ ., data = penguins_clean, kernel = "radial")

x_range <- seq(min(penguins_clean$bill_length_mm), max(penguins_clean$bill_length_mm), length.out = 200)
y_range <- seq(min(penguins_clean$flipper_length_mm), max(penguins_clean$flipper_length_mm), length.out = 200)
grid <- expand.grid(bill_length_mm = x_range, flipper_length_mm = y_range)

grid$pred_linear <- predict(model_linear, grid)
grid$pred_rbf <- predict(model_rbf, grid)

# SVM Linear
ggplot() +
  geom_tile(data = grid, aes(x = bill_length_mm, y = flipper_length_mm, fill = pred_linear), alpha = 0.3) +
  geom_point(data = penguins_clean, aes(x = bill_length_mm, y = flipper_length_mm, color = species), size = 2) +
  labs(title = "Decision Boundary - SVM Linear") +
  theme_minimal()

# SVM RBF
ggplot() +
  geom_tile(data = grid, aes(x = bill_length_mm, y = flipper_length_mm, fill = pred_rbf), alpha = 0.3) +
  geom_point(data = penguins_clean, aes(x = bill_length_mm, y = flipper_length_mm, color = species), size = 2) +
  labs(title = "Decision Boundary - SVM RBF") +
  theme_minimal()
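
The two plots can also be compared numerically. The sketch below refits the same two `e1071::svm` models, predicts over the same 200 x 200 grid, and reports the share of grid cells where the linear and RBF models assign different species. The names `pred_linear`, `pred_rbf`, and `disagreement` are introduced for this example; no result is quoted here because it depends on the fitted models.

```r
library(palmerpenguins)
library(e1071)

# Same cleaning as above: drop rows with any NA, keep the two predictors
penguins_clean <- na.omit(penguins)[, c("species", "bill_length_mm", "flipper_length_mm")]

model_linear <- svm(species ~ ., data = penguins_clean, kernel = "linear")
model_rbf    <- svm(species ~ ., data = penguins_clean, kernel = "radial")

# Regular grid spanning the observed range of both predictors
grid <- expand.grid(
  bill_length_mm = seq(min(penguins_clean$bill_length_mm),
                       max(penguins_clean$bill_length_mm), length.out = 200),
  flipper_length_mm = seq(min(penguins_clean$flipper_length_mm),
                          max(penguins_clean$flipper_length_mm), length.out = 200)
)

pred_linear <- predict(model_linear, grid)
pred_rbf    <- predict(model_rbf, grid)

# Fraction of grid cells where the two models disagree
disagreement <- mean(pred_linear != pred_rbf)
disagreement

# Cross-tabulation showing which species the disagreements involve
table(linear = pred_linear, rbf = pred_rbf)
```

Much of the grid lies in regions with no observed penguins, so this figure describes how differently the two models extrapolate, not how differently they classify real data.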

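5. Interpreting the C and Gamma Parameters

The parameter C (`cost`) controls the trade-off between a wide margin and misclassified training points: a small C tolerates more margin violations (stronger regularization), while a large C penalizes them more heavily. Gamma controls how far the influence of a single training point reaches in the RBF kernel: a small gamma gives broad, smooth decision regions, while a large gamma lets the boundary follow individual points closely.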

set.seed(123)
tune.out <- tune(svm, species ~ ., data = penguins_clean,
                 kernel = "radial",
                 ranges = list(cost = c(0.1, 1, 10), gamma = c(0.01, 0.1, 1)))
summary(tune.out)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost gamma
##    10     1
## 
## - best performance: 0.04180036 
## 
## - Detailed performance results:
##   cost gamma      error dispersion
## 1  0.1  0.01 0.24367201 0.07043843
## 2  1.0  0.01 0.05989305 0.03432061
## 3 10.0  0.01 0.04483066 0.02052299
## 4  0.1  0.10 0.05989305 0.03432061
## 5  1.0  0.10 0.04483066 0.02052299
## 6 10.0  0.10 0.04777184 0.02813163
## 7  0.1  1.00 0.04786096 0.02037625
## 8  1.0  1.00 0.04786096 0.02037625
## 9 10.0  1.00 0.04180036 0.03497188
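
To make the role of gamma concrete, the sketch below evaluates the radial kernel in the form documented for `e1071::svm`, exp(-gamma * |u - v|^2), for the three gamma values in the grid above. The two points `u` and `v` are hypothetical and sit one unit apart. `svm()` scales the predictors by default, so a unit here corresponds roughly to one standard deviation of a feature.

```r
# Radial (RBF) kernel: similarity decays with squared distance, at a rate set by gamma
rbf_kernel <- function(u, v, gamma) {
  exp(-gamma * sum((u - v)^2))
}

# Two hypothetical points, one unit apart in scaled feature space
u <- c(0, 0)
v <- c(1, 0)

gammas <- c(0.01, 0.1, 1)  # same gamma grid as the tuning run
similarity <- sapply(gammas, function(g) rbf_kernel(u, v, g))

data.frame(gamma = gammas, similarity = round(similarity, 3))
##   gamma similarity
## 1  0.01      0.990
## 2  0.10      0.905
## 3  1.00      0.368
```

With gamma = 0.01 the two points look almost identical to the kernel, so the model needs a large `cost` to fit the classes, which is consistent with the high error at cost = 0.1, gamma = 0.01 in the table above. With gamma = 1 the similarity falls off quickly, so each training point influences only its close neighbourhood.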

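6. Conclusions and Reflection

SVMs classify the penguin species effectively using only bill length and flipper length, reaching about 95% cross-validated accuracy. The RBF model scored slightly higher than the linear model (0.955 versus 0.952), although the gap is small. Parameter tuning strongly affects accuracy and generalization: in the grid search, the cross-validation error ranged from 0.244 (cost = 0.1, gamma = 0.01) to 0.042 (cost = 10, gamma = 1).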