The dataset used is Palmer Penguins, which contains morphological measurements and species labels for penguins from the Palmer Archipelago, Antarctica.
library(palmerpenguins)
library(dplyr)
library(ggplot2)
data("penguins")
head(penguins)
The data are explored with summary statistics and a scatter plot of bill length against flipper length, colored by species, to see how well these two features separate the species.
summary(penguins)
##       species          island    bill_length_mm  bill_depth_mm
##  Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10
##  Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60
##  Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30
##                                  Mean   :43.92   Mean   :17.15
##                                  3rd Qu.:48.50   3rd Qu.:18.70
##                                  Max.   :59.60   Max.   :21.50
##                                  NA's   :2       NA's   :2
##  flipper_length_mm  body_mass_g       sex           year
##  Min.   :172.0     Min.   :2700   female:165   Min.   :2007
##  1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007
##  Median :197.0     Median :4050   NA's  : 11   Median :2008
##  Mean   :200.9     Mean   :4202                Mean   :2008
##  3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009
##  Max.   :231.0     Max.   :6300                Max.   :2009
##  NA's   :2         NA's   :2
ggplot(penguins, aes(x = flipper_length_mm, y = bill_length_mm, color = species)) +
  geom_point(size = 2) +
  labs(title = "Penguin Distribution by Morphological Features") +
  theme_minimal()
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
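The warning above comes from rows with missing measurements. As a minimal sketch (the variable names `cols`, `n_complete_all`, and `n_complete_subset` are illustrative, not from the original analysis), the missing values can be counted per column, and the number of usable rows compared depending on which columns are required to be complete:

```r
library(palmerpenguins)

# Missing values per column; per the summary above, each measurement
# column has 2 and sex has 11.
colSums(is.na(penguins))

# Rows that are complete in every column (what na.omit() on the full
# data frame keeps).
n_complete_all <- sum(complete.cases(penguins))

# Rows that are complete in species and the two plotted features only.
cols <- c("species", "bill_length_mm", "flipper_length_mm")
n_complete_subset <- sum(complete.cases(penguins[, cols]))

c(all_columns = n_complete_all, three_columns = n_complete_subset)
```

Requiring every column to be complete leaves 333 rows, while requiring only these three columns leaves 342, so the order in which rows are filtered and columns are selected changes the sample size.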
SVM models with a linear kernel and an RBF (radial) kernel are evaluated with 5-fold cross-validation (k = 5) using caret. The implementation and a summary of model performance follow.
library(e1071)
library(caret)
penguins_clean <- penguins %>%
  na.omit() %>%
  select(species, bill_length_mm, flipper_length_mm)
train_control <- trainControl(method = "cv", number = 5)
svm_linear <- train(species ~ ., data = penguins_clean, method = "svmLinear", trControl = train_control)
svm_rbf <- train(species ~ ., data = penguins_clean, method = "svmRadial", trControl = train_control)
svm_linear
## Support Vector Machines with Linear Kernel
##
## 333 samples
## 2 predictor
## 3 classes: 'Adelie', 'Chinstrap', 'Gentoo'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 267, 267, 266, 266, 266
## Resampling results:
##
## Accuracy Kappa
## 0.951877 0.9241863
##
## Tuning parameter 'C' was held constant at a value of 1
svm_rbf
## Support Vector Machines with Radial Basis Function Kernel
##
## 333 samples
## 2 predictor
## 3 classes: 'Adelie', 'Chinstrap', 'Gentoo'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 265, 266, 267, 268, 266
## Resampling results across tuning parameters:
##
## C Accuracy Kappa
## 0.25 0.9521416 0.9246351
## 0.50 0.9551719 0.9295420
## 1.00 0.9551719 0.9295158
##
## Tuning parameter 'sigma' was held constant at a value of 2.332133
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 2.332133 and C = 0.5.
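The two summaries above report different training-set sizes per fold, which suggests the two models were resampled on different splits. A hedged sketch of one way to compare both kernels on identical folds is shown here: the folds are created once with `createFolds()` and passed to `trainControl()` through its `index` argument. The seed value and the names `dat`, `folds`, `ctrl`, `fit_linear`, and `fit_rbf` are assumptions for illustration, and the caret methods `svmLinear` and `svmRadial` additionally need the kernlab package installed.

```r
library(palmerpenguins)
library(caret)

# Same cleaning as in the report: drop incomplete rows, keep three columns.
dat <- as.data.frame(
  na.omit(penguins)[, c("species", "bill_length_mm", "flipper_length_mm")]
)

set.seed(123)  # assumed seed, chosen only to make the sketch reproducible

# Stratified 5-fold split; each list element holds the TRAINING row indices.
folds <- createFolds(dat$species, k = 5, list = TRUE, returnTrain = TRUE)

# Passing the same index list makes both models use the same folds.
ctrl <- trainControl(method = "cv", number = 5, index = folds)

fit_linear <- train(species ~ ., data = dat, method = "svmLinear", trControl = ctrl)
fit_rbf    <- train(species ~ ., data = dat, method = "svmRadial", trControl = ctrl)

# Fold-by-fold comparison of accuracy and kappa for the two models.
summary(resamples(list(linear = fit_linear, rbf = fit_rbf)))
```

With shared folds, a difference in accuracy between the kernels is not confounded with a difference in how the data happened to be split.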
Decision boundaries are plotted to compare the linear and RBF models.
model_linear <- svm(species ~ ., data = penguins_clean, kernel = "linear")
model_rbf <- svm(species ~ ., data = penguins_clean, kernel = "radial")
x_range <- seq(min(penguins_clean$bill_length_mm), max(penguins_clean$bill_length_mm), length.out = 200)
y_range <- seq(min(penguins_clean$flipper_length_mm), max(penguins_clean$flipper_length_mm), length.out = 200)
grid <- expand.grid(bill_length_mm = x_range, flipper_length_mm = y_range)
grid$pred_linear <- predict(model_linear, grid)
grid$pred_rbf <- predict(model_rbf, grid)
# SVM Linear
ggplot() +
  geom_tile(data = grid, aes(x = bill_length_mm, y = flipper_length_mm, fill = pred_linear), alpha = 0.3) +
  geom_point(data = penguins_clean, aes(x = bill_length_mm, y = flipper_length_mm, color = species), size = 2) +
  labs(title = "Decision Boundary - SVM Linear") +
  theme_minimal()
# SVM RBF
ggplot() +
  geom_tile(data = grid, aes(x = bill_length_mm, y = flipper_length_mm, fill = pred_rbf), alpha = 0.3) +
  geom_point(data = penguins_clean, aes(x = bill_length_mm, y = flipper_length_mm, color = species), size = 2) +
  labs(title = "Decision Boundary - SVM RBF") +
  theme_minimal()
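The two boundary plots can also be compared numerically. The sketch below is an illustration rather than part of the original analysis: it refits both e1071 models, predicts on a coarser grid (100 x 100 is an arbitrary choice), and measures how often the two kernels disagree. The names `dat`, `m_lin`, `m_rbf`, `g`, `p_lin`, and `p_rbf` are hypothetical.

```r
library(palmerpenguins)
library(e1071)

dat <- as.data.frame(
  na.omit(penguins)[, c("species", "bill_length_mm", "flipper_length_mm")]
)

# Same model calls as in the report, with default cost and gamma.
m_lin <- svm(species ~ ., data = dat, kernel = "linear")
m_rbf <- svm(species ~ ., data = dat, kernel = "radial")

# Prediction grid spanning the observed range of both features.
g <- expand.grid(
  bill_length_mm    = seq(min(dat$bill_length_mm), max(dat$bill_length_mm),
                          length.out = 100),
  flipper_length_mm = seq(min(dat$flipper_length_mm), max(dat$flipper_length_mm),
                          length.out = 100)
)

p_lin <- predict(m_lin, g)
p_rbf <- predict(m_rbf, g)

# Share of grid cells where the two kernels predict different species.
disagreement <- mean(p_lin != p_rbf)
disagreement

# Cross-tabulation showing which species the disagreements involve.
table(linear = p_lin, rbf = p_rbf)
```

Much of the grid lies far from any observed penguin, so this number describes how differently the models extrapolate as well as how they differ near the data.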
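The cost parameter C controls regularization: a small C tolerates misclassified training points in exchange for a wider margin, while a large C penalizes them more heavily and fits the training data more tightly. Gamma sets the width of the RBF kernel, that is, how far the influence of a single training point reaches: a small gamma gives smooth boundaries, while a large gamma gives boundaries that follow individual points closely. The grid below is searched with tune() from e1071, which uses 10-fold cross-validation by default.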
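To make the role of gamma concrete, here is a small base-R sketch of the radial kernel in the form e1071 documents it, exp(-gamma * |u - v|^2). The helper `rbf_kernel` and the two points are hypothetical. The points are meant to be on a standardized scale, since svm() scales its inputs by default.

```r
# RBF kernel as documented for kernel = "radial" in e1071:
# k(u, v) = exp(-gamma * |u - v|^2)
rbf_kernel <- function(u, v, gamma) {
  exp(-gamma * sum((u - v)^2))
}

# Two hypothetical points, 1.5 units apart on a standardized scale.
u <- c(0, 0)
v <- c(1.5, 0)

# Same gamma values as in the tuning grid.
gammas <- c(0.01, 0.1, 1)
similarity <- sapply(gammas, function(g) rbf_kernel(u, v, g))

round(setNames(similarity, gammas), 3)
#  0.01   0.1     1
# 0.978 0.799 0.105
```

With gamma = 0.01 the two points still look almost identical to the kernel, so the boundary is smooth. With gamma = 1 their similarity drops to about 0.1, so each training point mostly influences its own neighborhood.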
set.seed(123)
tune.out <- tune(svm, species ~ ., data = penguins_clean,
                 kernel = "radial",
                 ranges = list(cost = c(0.1, 1, 10), gamma = c(0.01, 0.1, 1)))
summary(tune.out)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost gamma
## 10 1
##
## - best performance: 0.04180036
##
## - Detailed performance results:
##   cost gamma      error dispersion
## 1  0.1  0.01 0.24367201 0.07043843
## 2  1.0  0.01 0.05989305 0.03432061
## 3 10.0  0.01 0.04483066 0.02052299
## 4  0.1  0.10 0.05989305 0.03432061
## 5  1.0  0.10 0.04483066 0.02052299
## 6 10.0  0.10 0.04777184 0.02813163
## 7  0.1  1.00 0.04786096 0.02037625
## 8  1.0  1.00 0.04786096 0.02037625
## 9 10.0  1.00 0.04180036 0.03497188
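The object returned by tune() also carries the selected parameters and a model refitted on the full data with them. The sketch below shows one way to pull these out, assuming the same grid and seed as above. The names `dat`, `tuned`, `best_svm`, and `cv_accuracy` are illustrative.

```r
library(palmerpenguins)
library(e1071)

dat <- as.data.frame(
  na.omit(penguins)[, c("species", "bill_length_mm", "flipper_length_mm")]
)

set.seed(123)
tuned <- tune(svm, species ~ ., data = dat,
              kernel = "radial",
              ranges = list(cost = c(0.1, 1, 10), gamma = c(0.01, 0.1, 1)))

# Selected cost and gamma, and the cross-validated accuracy (1 - error).
tuned$best.parameters
cv_accuracy <- 1 - tuned$best.performance
cv_accuracy

# Model refitted on all rows with the selected parameters.
best_svm <- tuned$best.model

# Confusion table on the training data. This is an optimistic view,
# because the model has already seen these rows.
table(predicted = predict(best_svm, dat), actual = dat$species)
```

The cross-validated accuracy is the figure to report. The confusion table is mainly useful for seeing which species are confused with each other.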
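SVM classifies the penguin species well from only two features, bill length and flipper length, with cross-validated accuracy of about 95%. The RBF kernel scored slightly higher than the linear kernel (0.955 versus 0.952 in 5-fold cross-validation), but the gap is small and the two models were evaluated on different fold splits, so it is not strong evidence that RBF is better. Parameter tuning has a clear effect on accuracy and generalization: across the grid, the 10-fold cross-validation error ranged from 0.244 (cost = 0.1, gamma = 0.01) to 0.042 (cost = 10, gamma = 1).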