1 Pendahuluan

Dataset Palmer Penguins berisi informasi morfologi tiga spesies penguin — Adelie, Chinstrap, dan Gentoo — yang dikumpulkan di Kepulauan Palmer, Antarktika oleh Dr. Kristen Gorman. Dataset ini terdiri dari 344 observasi dengan variabel ukuran fisik penguin.

Tujuan analisis ini adalah mengklasifikasikan spesies penguin menggunakan dua metode statistika multivariat:

  1. Linear Discriminant Analysis (LDA) — metode dependensi yang mencari fungsi diskriminan untuk memisahkan kelompok.
  2. Regresi Logistik Multinomial — memodelkan peluang keanggotaan ke salah satu dari tiga kategori spesies.

Variabel yang digunakan:

Variabel Tipe Keterangan
species Dependen (Nominal) Adelie, Chinstrap, Gentoo
culmen_length_mm Prediktor (Rasio) Panjang paruh (mm)
culmen_depth_mm Prediktor (Rasio) Kedalaman paruh (mm)
flipper_length_mm Prediktor (Rasio) Panjang sirip (mm)
body_mass_g Prediktor (Rasio) Massa tubuh (gram)

2 Persiapan Data

2.1 Load Library

# Install jika belum ada:
# install.packages(c("MASS", "nnet", "tidyverse", "kableExtra",
#                    "ggplot2", "car", "MVN", "biotools", "caret"))

library(MASS)          # LDA
library(nnet)          # Multinomial Logistic Regression
library(tidyverse)     # Data manipulation & visualisasi
library(kableExtra)    # Tabel yang rapi
library(ggplot2)       # Visualisasi
library(car)           # Uji asumsi
library(MVN)           # Uji normalitas multivariat
library(biotools)      # Uji homogenitas matriks kovarians (Box's M)
library(caret)         # Confusion matrix

2.2 Load & Inspeksi Data

# Load dataset
df <- read.csv("penguins_size.csv", stringsAsFactors = TRUE)

# Tampilkan 10 baris pertama
head(df, 10) %>%
  kable(caption = "10 Baris Pertama Dataset Palmer Penguins") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE)
10 Baris Pertama Dataset Palmer Penguins
species island culmen_length_mm culmen_depth_mm flipper_length_mm body_mass_g sex
Adelie Torgersen 39.1 18.7 181 3750 MALE
Adelie Torgersen 39.5 17.4 186 3800 FEMALE
Adelie Torgersen 40.3 18.0 195 3250 FEMALE
Adelie Torgersen NA NA NA NA NA
Adelie Torgersen 36.7 19.3 193 3450 FEMALE
Adelie Torgersen 39.3 20.6 190 3650 MALE
Adelie Torgersen 38.9 17.8 181 3625 FEMALE
Adelie Torgersen 39.2 19.6 195 4675 MALE
Adelie Torgersen 34.1 18.1 193 3475 NA
Adelie Torgersen 42.0 20.2 190 4250 NA
# Struktur data
glimpse(df)
## Rows: 344
## Columns: 7
## $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
## $ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
## $ culmen_length_mm  <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
## $ culmen_depth_mm   <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
## $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
## $ sex               <fct> MALE, FEMALE, FEMALE, NA, FEMALE, MALE, FEMALE, MALE…
# Cek missing values
cat("Jumlah missing values per variabel:\n")
## Jumlah missing values per variabel:
colSums(is.na(df)) %>%
  as.data.frame() %>%
  setNames("Missing") %>%
  kable(caption = "Missing Values per Variabel") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Missing Values per Variabel
Missing
species 0
island 0
culmen_length_mm 2
culmen_depth_mm 2
flipper_length_mm 2
body_mass_g 2
sex 10

2.3 Penanganan Missing Values

# Hapus baris dengan missing values
df_clean <- df %>%
  select(species, culmen_length_mm, culmen_depth_mm,
         flipper_length_mm, body_mass_g) %>%
  na.omit()

cat("Dimensi data setelah cleaning:", nrow(df_clean), "baris x", ncol(df_clean), "kolom\n")
## Dimensi data setelah cleaning: 342 baris x 5 kolom
cat("Distribusi spesies:\n")
## Distribusi spesies:
table(df_clean$species)
## 
##    Adelie Chinstrap    Gentoo 
##       151        68       123

3 Statistika Deskriptif

df_clean %>%
  group_by(species) %>%
  summarise(
    n          = n(),
    Mean_CulLen  = round(mean(culmen_length_mm), 2),
    SD_CulLen    = round(sd(culmen_length_mm), 2),
    Mean_CulDep  = round(mean(culmen_depth_mm), 2),
    SD_CulDep    = round(sd(culmen_depth_mm), 2),
    Mean_Flip    = round(mean(flipper_length_mm), 2),
    SD_Flip      = round(sd(flipper_length_mm), 2),
    Mean_Mass    = round(mean(body_mass_g), 2),
    SD_Mass      = round(sd(body_mass_g), 2)
  ) %>%
  kable(caption = "Statistika Deskriptif Berdasarkan Spesies Penguin",
        col.names = c("Spesies", "n",
                      "Mean Culmen Length", "SD",
                      "Mean Culmen Depth", "SD",
                      "Mean Flipper Length", "SD",
                      "Mean Body Mass", "SD")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = TRUE) %>%
  add_header_above(c(" " = 2,
                     "Culmen Length (mm)" = 2,
                     "Culmen Depth (mm)" = 2,
                     "Flipper Length (mm)" = 2,
                     "Body Mass (g)" = 2))
Statistika Deskriptif Berdasarkan Spesies Penguin
Culmen Length (mm)
Culmen Depth (mm)
Flipper Length (mm)
Body Mass (g)
Spesies n Mean Culmen Length SD Mean Culmen Depth SD Mean Flipper Length SD Mean Body Mass SD
Adelie 151 38.79 2.66 18.35 1.22 189.95 6.54 3700.66 458.57
Chinstrap 68 48.83 3.34 18.42 1.14 195.82 7.13 3733.09 384.34
Gentoo 123 47.50 3.08 14.98 0.98 217.19 6.48 5076.02 504.12
df_clean %>%
  count(species) %>%
  mutate(pct = round(n / sum(n) * 100, 1),
         label = paste0(species, "\n", n, " (", pct, "%)")) %>%
  ggplot(aes(x = "", y = n, fill = species)) +
  geom_col(width = 1, color = "white") +
  coord_polar(theta = "y") +
  geom_text(aes(label = label),
            position = position_stack(vjust = 0.5), size = 3.5) +
  scale_fill_manual(values = c("#F4A460", "#6495ED", "#90EE90")) +
  labs(title = "Distribusi Spesies Penguin",
       fill = "Spesies") +
  theme_void() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 13))
Distribusi Spesies Penguin

Distribusi Spesies Penguin

df_clean %>%
  pivot_longer(cols = -species, names_to = "Variabel", values_to = "Nilai") %>%
  ggplot(aes(x = species, y = Nilai, fill = species)) +
  geom_boxplot(alpha = 0.7, outlier.color = "red", outlier.size = 1.5) +
  facet_wrap(~Variabel, scales = "free_y", ncol = 2,
             labeller = as_labeller(c(
               culmen_length_mm  = "Culmen Length (mm)",
               culmen_depth_mm   = "Culmen Depth (mm)",
               flipper_length_mm = "Flipper Length (mm)",
               body_mass_g       = "Body Mass (g)"
             ))) +
  scale_fill_manual(values = c("#F4A460", "#6495ED", "#90EE90")) +
  labs(title = "Distribusi Variabel Prediktor per Spesies",
       x = "Spesies", y = "Nilai", fill = "Spesies") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"),
        strip.text = element_text(face = "bold"),
        legend.position = "none")
Boxplot Variabel Prediktor per Spesies

Boxplot Variabel Prediktor per Spesies


4 Uji Asumsi

4.1 Uji Normalitas Multivariat (Mardia’s Test)

LDA mengasumsikan bahwa data pada setiap kelompok berdistribusi normal multivariat (Johnson & Wichern, 2007).

prediktor <- df_clean %>%
  select(culmen_length_mm, culmen_depth_mm, flipper_length_mm, body_mass_g)

# Uji Mardia menggunakan MVN
mvn_args <- names(formals(MVN::mvn))

if ("mvnTest" %in% mvn_args) {
  mvn_result <- mvn(data = prediktor, mvnTest = "mardia")
} else {
  mvn_result <- mvn(data = prediktor)
}

# Debug: lihat struktur output
cat("=== Struktur mvn_result ===\n")
## === Struktur mvn_result ===
print(str(mvn_result))
## List of 6
##  $ multivariate_normality:'data.frame':  1 obs. of  5 variables:
##   ..$ Test     : chr "Henze-Zirkler"
##   ..$ Statistic: num 5.22
##   ..$ p.value  : chr "<0.001"
##   ..$ Method   : chr "asymptotic"
##   ..$ MVN      : chr "✗ Not normal"
##  $ univariate_normality  :'data.frame':  4 obs. of  5 variables:
##   ..$ Test     : chr [1:4] "Anderson-Darling" "Anderson-Darling" "Anderson-Darling" "Anderson-Darling"
##   ..$ Variable : chr [1:4] "culmen_length_mm" "culmen_depth_mm" "flipper_length_mm" "body_mass_g"
##   ..$ Statistic: num [1:4] 3.02 2.9 6.44 4.54
##   ..$ p.value  : chr [1:4] "<0.001" "<0.001" "<0.001" "<0.001"
##   ..$ Normality: chr [1:4] "✗ Not normal" "✗ Not normal" "✗ Not normal" "✗ Not normal"
##  $ descriptives          :'data.frame':  4 obs. of  11 variables:
##   ..$ Variable: chr [1:4] "culmen_length_mm" "culmen_depth_mm" "flipper_length_mm" "body_mass_g"
##   ..$ n       : num [1:4] 342 342 342 342
##   ..$ Mean    : num [1:4] 43.9 17.2 200.9 4201.8
##   ..$ Std.Dev : num [1:4] 5.46 1.98 14.06 801.96
##   ..$ Median  : num [1:4] 44.5 17.3 197 4050
##   ..$ Min     : num [1:4] 32.1 13.1 172 2700
##   ..$ Max     : num [1:4] 59.6 21.5 231 6300
##   ..$ 25th    : num [1:4] 39.2 15.6 190 3550
##   ..$ 75th    : num [1:4] 48.5 18.7 213 4750
##   ..$ Skew    : num [1:4] 0.053 -0.143 0.344 0.468
##   ..$ Kurtosis: num [1:4] 2.12 2.09 2.01 2.27
##  $ data                  :'data.frame':  342 obs. of  4 variables:
##   ..$ culmen_length_mm : num [1:342] 39.1 39.5 40.3 36.7 39.3 38.9 39.2 34.1 42 37.8 ...
##   ..$ culmen_depth_mm  : num [1:342] 18.7 17.4 18 19.3 20.6 17.8 19.6 18.1 20.2 17.1 ...
##   ..$ flipper_length_mm: int [1:342] 181 186 195 193 190 181 195 193 190 186 ...
##   ..$ body_mass_g      : int [1:342] 3750 3800 3250 3450 3650 3625 4675 3475 4250 3300 ...
##   ..- attr(*, "na.action")= 'omit' Named int [1:2] 4 340
##   .. ..- attr(*, "names")= chr [1:2] "4" "340"
##  $ subset                : NULL
##  $ outlierMethod         : chr "none"
##  - attr(*, "class")= chr "mvn"
## NULL
cat("\n=== Names ===\n")
## 
## === Names ===
print(names(mvn_result))
## [1] "multivariate_normality" "univariate_normality"   "descriptives"          
## [4] "data"                   "subset"                 "outlierMethod"
cat("\n=== multivariateNormality ===\n")
## 
## === multivariateNormality ===
print(mvn_result$multivariateNormality)
## NULL
cat("\n=== Class & Dim ===\n")
## 
## === Class & Dim ===
print(class(mvn_result$multivariateNormality))
## [1] "NULL"
print(dim(mvn_result$multivariateNormality))
## NULL
# Q-Q Plot manual menggunakan jarak Mahalanobis
n  <- nrow(prediktor)
p  <- ncol(prediktor)
mu <- colMeans(prediktor)
S  <- cov(prediktor)

# Hitung jarak Mahalanobis kuadrat
d2 <- mahalanobis(prediktor, center = mu, cov = S)

# Quantile chi-square teoritis
q_teoritis <- qchisq(ppoints(n), df = p)

# Plot
qqplot(q_teoritis, sort(d2),
       main  = "Chi-Square Q-Q Plot (Normalitas Multivariat)",
       xlab  = "Kuantil Chi-Square Teoritis",
       ylab  = "Jarak Mahalanobis (Data)",
       pch   = 19, col = "#4472C4", cex = 0.8)
abline(0, 1, col = "red", lwd = 2, lty = 2)
legend("topleft", legend = "Garis Referensi", col = "red",
       lty = 2, lwd = 2, bty = "n")
Chi-Square Q-Q Plot Normalitas Multivariat

Chi-Square Q-Q Plot Normalitas Multivariat

Interpretasi: Jika p-value > 0.05, asumsi normalitas multivariat terpenuhi.

4.2 Uji Homogenitas Matriks Kovarians (Box’s M Test)

LDA mengasumsikan bahwa matriks kovarians antar kelompok adalah homogen (sama).

boxm_result <- boxM(
  data   = df_clean[, c("culmen_length_mm", "culmen_depth_mm",
                         "flipper_length_mm", "body_mass_g")],
  grouping = df_clean$species
)

print(boxm_result)
## 
##  Box's M-test for Homogeneity of Covariance Matrices
## 
## data:  df_clean[, c("culmen_length_mm", "culmen_depth_mm", "flipper_length_mm",     "body_mass_g")]
## Chi-Sq (approx.) = 76.795, df = 20, p-value = 1.365e-08

Interpretasi: Jika p-value > 0.05, asumsi homogenitas matriks kovarians terpenuhi dan LDA dapat digunakan. Jika p-value < 0.05, dapat dipertimbangkan Quadratic DA (QDA) sebagai alternatif, namun LDA tetap cukup robust untuk ukuran sampel yang memadai.


5 Linear Discriminant Analysis (LDA)

5.1 Pembentukan Fungsi Diskriminan

set.seed(42)

# Split data: 80% training, 20% testing
idx      <- createDataPartition(df_clean$species, p = 0.8, list = FALSE)
train    <- df_clean[idx, ]
test     <- df_clean[-idx, ]

# Fit LDA model
lda_model <- lda(species ~ culmen_length_mm + culmen_depth_mm +
                   flipper_length_mm + body_mass_g,
                 data = train,
                 prior = as.vector(prop.table(table(train$species))))

# Tampilkan hasil
print(lda_model)
## Call:
## lda(species ~ culmen_length_mm + culmen_depth_mm + flipper_length_mm + 
##     body_mass_g, data = train, prior = as.vector(prop.table(table(train$species))))
## 
## Prior probabilities of groups:
##    Adelie Chinstrap    Gentoo 
##      0.44      0.20      0.36 
## 
## Group means:
##           culmen_length_mm culmen_depth_mm flipper_length_mm body_mass_g
## Adelie            38.79752        18.39174          190.3471    3704.132
## Chinstrap         48.80909        18.36182          195.4182    3721.364
## Gentoo            47.66061        14.99091          217.9293    5101.263
## 
## Coefficients of linear discriminants:
##                            LD1          LD2
## culmen_length_mm  -0.086609377 -0.419396547
## culmen_depth_mm    1.011226772 -0.005156790
## flipper_length_mm -0.089996132  0.013219110
## body_mass_g       -0.001281395  0.001769146
## 
## Proportion of trace:
##    LD1    LD2 
## 0.8659 0.1341

5.2 Koefisien Fungsi Diskriminan

lda_model$scaling %>%
  as.data.frame() %>%
  rownames_to_column("Variabel") %>%
  kable(caption = "Koefisien Fungsi Diskriminan Linear",
        digits = 4) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Koefisien Fungsi Diskriminan Linear
Variabel LD1 LD2
culmen_length_mm -0.0866 -0.4194
culmen_depth_mm 1.0112 -0.0052
flipper_length_mm -0.0900 0.0132
body_mass_g -0.0013 0.0018

Interpretasi: Terdapat 2 fungsi diskriminan yang terbentuk (= jumlah kelas - 1). LD1 adalah fungsi diskriminan pertama yang memisahkan kelompok paling maksimal, diikuti LD2.

5.3 Proporsi Variansi yang Dijelaskan

prop_var <- lda_model$svd^2 / sum(lda_model$svd^2) * 100

data.frame(
  `Fungsi Diskriminan` = c("LD1", "LD2"),
  `Eigenvalue`         = round(lda_model$svd^2, 4),
  `Proporsi (%)`       = round(prop_var, 2),
  `Kumulatif (%)`      = round(cumsum(prop_var), 2)
) %>%
  kable(caption = "Proporsi Variansi yang Dijelaskan Setiap Fungsi Diskriminan") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Proporsi Variansi yang Dijelaskan Setiap Fungsi Diskriminan
Fungsi.Diskriminan Eigenvalue Proporsi…. Kumulatif….
LD1 2088.3517 86.59 86.59
LD2 323.5356 13.41 100.00

5.4 Visualisasi LDA

lda_pred_train <- predict(lda_model, train)

lda_plot_df <- data.frame(
  LD1     = lda_pred_train$x[, 1],
  LD2     = lda_pred_train$x[, 2],
  species = train$species
)

ggplot(lda_plot_df, aes(x = LD1, y = LD2, color = species, shape = species)) +
  geom_point(size = 2.5, alpha = 0.8) +
  stat_ellipse(aes(fill = species), geom = "polygon",
               alpha = 0.1, level = 0.95) +
  scale_color_manual(values = c("#E07B39", "#4472C4", "#5BAD5B")) +
  scale_fill_manual(values  = c("#E07B39", "#4472C4", "#5BAD5B")) +
  labs(title  = "Plot Fungsi Diskriminan Linear",
       subtitle = "Pemisahan Tiga Spesies Penguin",
       x = "LD1", y = "LD2",
       color = "Spesies", shape = "Spesies", fill = "Spesies") +
  theme_bw() +
  theme(plot.title    = element_text(hjust = 0.5, face = "bold"),
        plot.subtitle = element_text(hjust = 0.5))
Scatter Plot Fungsi Diskriminan Linear (LD1 vs LD2)

Scatter Plot Fungsi Diskriminan Linear (LD1 vs LD2)

5.5 Evaluasi Model LDA

5.5.1 Klasifikasi pada Data Testing

lda_pred_test <- predict(lda_model, test)

# Confusion Matrix
cm_lda <- confusionMatrix(lda_pred_test$class, test$species)
print(cm_lda)
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Adelie Chinstrap Gentoo
##   Adelie        30         1      0
##   Chinstrap      0        12      0
##   Gentoo         0         0     24
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9851          
##                  95% CI : (0.9196, 0.9996)
##     No Information Rate : 0.4478          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9763          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity                 1.0000           0.9231        1.0000
## Specificity                 0.9730           1.0000        1.0000
## Pos Pred Value              0.9677           1.0000        1.0000
## Neg Pred Value              1.0000           0.9818        1.0000
## Prevalence                  0.4478           0.1940        0.3582
## Detection Rate              0.4478           0.1791        0.3582
## Detection Prevalence        0.4627           0.1791        0.3582
## Balanced Accuracy           0.9865           0.9615        1.0000

5.5.2 Tabel Confusion Matrix LDA

cm_lda$table %>%
  as.data.frame() %>%
  pivot_wider(names_from = Reference, values_from = Freq) %>%
  rename(`Prediksi` = Prediction) %>%
  kable(caption = "Confusion Matrix — Linear Discriminant Analysis") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  add_header_above(c(" " = 1, "Aktual" = 3))
Confusion Matrix — Linear Discriminant Analysis
Aktual
Prediksi Adelie Chinstrap Gentoo
Adelie 30 1 0
Chinstrap 0 12 0
Gentoo 0 0 24

5.5.3 APER (Apparent Error Rate)

# APER pada data training (in-sample)
lda_pred_train_class <- predict(lda_model, train)$class
n_salah_train        <- sum(lda_pred_train_class != train$species)
aper_train           <- n_salah_train / nrow(train)

# APER pada data testing (out-of-sample)
n_salah_test <- sum(lda_pred_test$class != test$species)
aper_test    <- n_salah_test / nrow(test)

data.frame(
  Data        = c("Training", "Testing"),
  `N Total`   = c(nrow(train), nrow(test)),
  `N Salah`   = c(n_salah_train, n_salah_test),
  `APER (%)`  = round(c(aper_train, aper_test) * 100, 2),
  `Akurasi (%)` = round((1 - c(aper_train, aper_test)) * 100, 2)
) %>%
  kable(caption = "APER dan Akurasi Model LDA") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
APER dan Akurasi Model LDA
Data N.Total N.Salah APER…. Akurasi….
Training 275 3 1.09 98.91
Testing 67 1 1.49 98.51

Interpretasi APER: Nilai APER yang kecil menunjukkan fungsi diskriminan yang terbentuk memiliki kemampuan klasifikasi yang baik. Akurasi pada data testing mencerminkan kemampuan generalisasi model.


6 Regresi Logistik Multinomial

6.1 Pembentukan Model

# Set referensi kategori (kategori pembanding = Gentoo, terbanyak)
train$species <- relevel(train$species, ref = "Gentoo")
test$species  <- relevel(test$species,  ref = "Gentoo")

# Fit model multinomial
multinom_model <- multinom(
  species ~ culmen_length_mm + culmen_depth_mm +
    flipper_length_mm + body_mass_g,
  data  = train,
  trace = FALSE,
  Hess  = TRUE
)

summary(multinom_model)
## Call:
## multinom(formula = species ~ culmen_length_mm + culmen_depth_mm + 
##     flipper_length_mm + body_mass_g, data = train, Hess = TRUE, 
##     trace = FALSE)
## 
## Coefficients:
##           (Intercept) culmen_length_mm culmen_depth_mm flipper_length_mm
## Adelie       28.10100        -19.61744       36.333602           2.41380
## Chinstrap   -21.02514         25.72737        1.335902          -1.96246
##           body_mass_g
## Adelie     -0.0673793
## Chinstrap  -0.1910147
## 
## Std. Errors:
##           (Intercept) culmen_length_mm culmen_depth_mm flipper_length_mm
## Adelie      0.8292441         14.10153        20.24099          5.101465
## Chinstrap   0.9970848         15.28610        35.34625          5.513135
##           body_mass_g
## Adelie      0.1515416
## Chinstrap   0.1491054
## 
## Residual Deviance: 0.0007645983 
## AIC: 20.00076

6.2 Estimasi Parameter

koef <- summary(multinom_model)$coefficients
se   <- summary(multinom_model)$standard.errors

koef %>%
  as.data.frame() %>%
  rownames_to_column("Model") %>%
  kable(caption = "Estimasi Koefisien (β) Model Regresi Logistik Multinomial",
        digits = 4) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = TRUE)
Estimasi Koefisien (β) Model Regresi Logistik Multinomial
Model (Intercept) culmen_length_mm culmen_depth_mm flipper_length_mm body_mass_g
Adelie 28.1010 -19.6174 36.3336 2.4138 -0.0674
Chinstrap -21.0251 25.7274 1.3359 -1.9625 -0.1910

Catatan: Kategori referensi adalah Gentoo. Model Logit 1 membandingkan Adelie vs Gentoo, Model Logit 2 membandingkan Chinstrap vs Gentoo.

6.3 Uji Signifikansi Parameter

6.3.1 Uji Serentak (Likelihood Ratio Test / G²)

# Model null (intercept only)
multinom_null <- multinom(species ~ 1, data = train, trace = FALSE)

# Likelihood Ratio Test
lrt <- anova(multinom_null, multinom_model, test = "Chisq")
print(lrt)
## Likelihood ratio tests of Multinomial Models
## 
## Response: species
##                                                                  Model
## 1                                                                    1
## 2 culmen_length_mm + culmen_depth_mm + flipper_length_mm + body_mass_g
##   Resid. df   Resid. Dev   Test    Df LR stat. Pr(Chi)
## 1       548 5.780024e+02                              
## 2       540 7.645983e-04 1 vs 2     8 578.0016       0

Interpretasi: Tolak H₀ jika p-value < 0.05, artinya minimal satu variabel prediktor berpengaruh signifikan terhadap model.

6.3.2 Uji Parsial (Wald Test)

# Hitung Wald statistik (z-value)
z_val  <- koef / se
p_val  <- 2 * (1 - pnorm(abs(z_val)))

# Tabel Wald untuk Logit 1 (Adelie vs Gentoo)
data.frame(
  Variabel  = colnames(koef),
  Beta      = round(koef["Adelie", ], 4),
  SE        = round(se["Adelie", ], 4),
  Wald_Z    = round(z_val["Adelie", ], 4),
  P_value   = round(p_val["Adelie", ], 4),
  Sig       = ifelse(p_val["Adelie", ] < 0.001, "***",
              ifelse(p_val["Adelie", ] < 0.01,  "**",
              ifelse(p_val["Adelie", ] < 0.05,  "*", "ns")))
) %>%
  kable(caption = "Uji Parsial (Wald) — Model Logit 1: Adelie vs Gentoo",
        row.names = FALSE) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  footnote(general = "*** p<0.001  ** p<0.01  * p<0.05  ns = tidak signifikan")
Uji Parsial (Wald) — Model Logit 1: Adelie vs Gentoo
Variabel Beta SE Wald_Z P_value Sig
(Intercept) 28.1010 0.8292 33.8875 0.0000 ***
culmen_length_mm -19.6174 14.1015 -1.3912 0.1642 ns
culmen_depth_mm 36.3336 20.2410 1.7951 0.0726 ns
flipper_length_mm 2.4138 5.1015 0.4732 0.6361 ns
body_mass_g -0.0674 0.1515 -0.4446 0.6566 ns
Note:
*** p<0.001 ** p<0.01 * p<0.05 ns = tidak signifikan
# Tabel Wald untuk Logit 2 (Chinstrap vs Gentoo)
data.frame(
  Variabel  = colnames(koef),
  Beta      = round(koef["Chinstrap", ], 4),
  SE        = round(se["Chinstrap", ], 4),
  Wald_Z    = round(z_val["Chinstrap", ], 4),
  P_value   = round(p_val["Chinstrap", ], 4),
  Sig       = ifelse(p_val["Chinstrap", ] < 0.001, "***",
              ifelse(p_val["Chinstrap", ] < 0.01,  "**",
              ifelse(p_val["Chinstrap", ] < 0.05,  "*", "ns")))
) %>%
  kable(caption = "Uji Parsial (Wald) — Model Logit 2: Chinstrap vs Gentoo",
        row.names = FALSE) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  footnote(general = "*** p<0.001  ** p<0.01  * p<0.05  ns = tidak signifikan")
Uji Parsial (Wald) — Model Logit 2: Chinstrap vs Gentoo
Variabel Beta SE Wald_Z P_value Sig
(Intercept) -21.0251 0.9971 -21.0866 0.0000 ***
culmen_length_mm 25.7274 15.2861 1.6831 0.0924 ns
culmen_depth_mm 1.3359 35.3462 0.0378 0.9699 ns
flipper_length_mm -1.9625 5.5131 -0.3560 0.7219 ns
body_mass_g -0.1910 0.1491 -1.2811 0.2002 ns
Note:
*** p<0.001 ** p<0.01 * p<0.05 ns = tidak signifikan

6.4 Persamaan Model

b <- round(koef, 4)

cat("**Model Logit 1 (Adelie vs Gentoo):**\n\n")

Model Logit 1 (Adelie vs Gentoo):

cat(sprintf("$$\\ln\\left(\\frac{P(Adelie)}{P(Gentoo)}\\right) = %.4f + %.4f \\cdot X_1 + %.4f \\cdot X_2 + %.4f \\cdot X_3 + %.4f \\cdot X_4$$\n\n",
    b["Adelie","(Intercept)"], b["Adelie","culmen_length_mm"],
    b["Adelie","culmen_depth_mm"], b["Adelie","flipper_length_mm"],
    b["Adelie","body_mass_g"]))

\[\ln\left(\frac{P(Adelie)}{P(Gentoo)}\right) = 28.1010 + -19.6174 \cdot X_1 + 36.3336 \cdot X_2 + 2.4138 \cdot X_3 + -0.0674 \cdot X_4\]

cat("**Model Logit 2 (Chinstrap vs Gentoo):**\n\n")

Model Logit 2 (Chinstrap vs Gentoo):

cat(sprintf("$$\\ln\\left(\\frac{P(Chinstrap)}{P(Gentoo)}\\right) = %.4f + %.4f \\cdot X_1 + %.4f \\cdot X_2 + %.4f \\cdot X_3 + %.4f \\cdot X_4$$\n\n",
    b["Chinstrap","(Intercept)"], b["Chinstrap","culmen_length_mm"],
    b["Chinstrap","culmen_depth_mm"], b["Chinstrap","flipper_length_mm"],
    b["Chinstrap","body_mass_g"]))

\[\ln\left(\frac{P(Chinstrap)}{P(Gentoo)}\right) = -21.0251 + 25.7274 \cdot X_1 + 1.3359 \cdot X_2 + -1.9625 \cdot X_3 + -0.1910 \cdot X_4\]

cat("Di mana: X₁ = culmen_length_mm, X₂ = culmen_depth_mm, X₃ = flipper_length_mm, X₄ = body_mass_g\n")

Di mana: X₁ = culmen_length_mm, X₂ = culmen_depth_mm, X₃ = flipper_length_mm, X₄ = body_mass_g

6.5 Odds Ratio

# Odds Ratio = exp(beta)
or_adelie    <- round(exp(koef["Adelie", ]), 4)
or_chinstrap <- round(exp(koef["Chinstrap", ]), 4)

data.frame(
  Variabel      = colnames(koef),
  OR_Adelie     = or_adelie,
  OR_Chinstrap  = or_chinstrap
) %>%
  kable(caption = "Odds Ratio — Model Regresi Logistik Multinomial",
        col.names = c("Variabel", "OR (Adelie vs Gentoo)", "OR (Chinstrap vs Gentoo)"),
        row.names = FALSE) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  footnote(general = "OR < 1: hubungan negatif; OR > 1: hubungan positif terhadap kategori tersebut dibanding Gentoo")
Odds Ratio — Model Regresi Logistik Multinomial
Variabel OR (Adelie vs Gentoo) OR (Chinstrap vs Gentoo)
(Intercept) 1.599955e+12 0.000000e+00
culmen_length_mm 0.000000e+00 1.490229e+11
culmen_depth_mm 6.018423e+15 3.803400e+00
flipper_length_mm 1.117630e+01 1.405000e-01
body_mass_g 9.348000e-01 8.261000e-01
Note:
OR < 1: hubungan negatif; OR > 1: hubungan positif terhadap kategori tersebut dibanding Gentoo

Cara membaca Odds Ratio: - OR > 1 → setiap kenaikan 1 satuan variabel tersebut meningkatkan kecenderungan masuk kategori tersebut dibanding Gentoo - OR < 1 → setiap kenaikan 1 satuan variabel tersebut menurunkan kecenderungan masuk kategori tersebut dibanding Gentoo

6.6 Evaluasi Model Multinomial

# Prediksi pada data testing
multinom_pred <- predict(multinom_model, newdata = test, type = "class")

# Confusion Matrix
cm_multi <- confusionMatrix(multinom_pred, test$species)
print(cm_multi)
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Gentoo Adelie Chinstrap
##   Gentoo        24      1         0
##   Adelie         0     28         1
##   Chinstrap      0      1        12
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9552          
##                  95% CI : (0.8747, 0.9907)
##     No Information Rate : 0.4478          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9295          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Gentoo Class: Adelie Class: Chinstrap
## Sensitivity                 1.0000        0.9333           0.9231
## Specificity                 0.9767        0.9730           0.9815
## Pos Pred Value              0.9600        0.9655           0.9231
## Neg Pred Value              1.0000        0.9474           0.9815
## Prevalence                  0.3582        0.4478           0.1940
## Detection Rate              0.3582        0.4179           0.1791
## Detection Prevalence        0.3731        0.4328           0.1940
## Balanced Accuracy           0.9884        0.9532           0.9523
cm_multi$table %>%
  as.data.frame() %>%
  pivot_wider(names_from = Reference, values_from = Freq) %>%
  rename(`Prediksi` = Prediction) %>%
  kable(caption = "Confusion Matrix — Regresi Logistik Multinomial") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  add_header_above(c(" " = 1, "Aktual" = 3))
Confusion Matrix — Regresi Logistik Multinomial
Aktual
Prediksi Gentoo Adelie Chinstrap
Gentoo 24 1 0
Adelie 0 28 1
Chinstrap 0 1 12
# Akurasi multinomial
n_salah_multi <- sum(multinom_pred != test$species)
aper_multi    <- n_salah_multi / nrow(test)

cat(sprintf("APER Multinomial (Testing) : %.4f (%.2f%%)\n",
            aper_multi, aper_multi * 100))
## APER Multinomial (Testing) : 0.0448 (4.48%)
cat(sprintf("Akurasi Multinomial (Testing): %.4f (%.2f%%)\n",
            1 - aper_multi, (1 - aper_multi) * 100))
## Akurasi Multinomial (Testing): 0.9552 (95.52%)

7 Perbandingan LDA dan Regresi Logistik Multinomial

data.frame(
  Aspek = c(
    "Metode",
    "Asumsi Utama",
    "Variabel Dependen",
    "APER (Testing)",
    "Akurasi (Testing)",
    "Kelebihan"
  ),
  LDA = c(
    "Linear Discriminant Analysis",
    "Normalitas multivariat & homogenitas kovarians",
    "Nominal ≥ 2 kategori",
    paste0(round(aper_test * 100, 2), "%"),
    paste0(round((1 - aper_test) * 100, 2), "%"),
    "Interpretasi geometris jelas, visualisasi mudah"
  ),
  Multinomial = c(
    "Regresi Logistik Multinomial",
    "Tidak memerlukan normalitas prediktor",
    "Nominal ≥ 3 kategori",
    paste0(round(aper_multi * 100, 2), "%"),
    paste0(round((1 - aper_multi) * 100, 2), "%"),
    "Lebih fleksibel, menghasilkan odds ratio"
  )
) %>%
  kable(caption = "Perbandingan LDA dan Regresi Logistik Multinomial",
        col.names = c("Aspek", "LDA", "Multinomial Logistik")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = TRUE)
Perbandingan LDA dan Regresi Logistik Multinomial
Aspek LDA Multinomial Logistik
Metode Linear Discriminant Analysis Regresi Logistik Multinomial
Asumsi Utama Normalitas multivariat & homogenitas kovarians Tidak memerlukan normalitas prediktor
Variabel Dependen Nominal ≥ 2 kategori Nominal ≥ 3 kategori
APER (Testing) 1.49% 4.48%
Akurasi (Testing) 98.51% 95.52%
Kelebihan Interpretasi geometris jelas, visualisasi mudah Lebih fleksibel, menghasilkan odds ratio

8 Kesimpulan

Berdasarkan analisis klasifikasi spesies penguin menggunakan dataset Palmer Penguins dengan 4 variabel prediktor kontinu (culmen length, culmen depth, flipper length, dan body mass), diperoleh kesimpulan sebagai berikut:

  1. Linear Discriminant Analysis (LDA) menghasilkan dua fungsi diskriminan (LD1 dan LD2). LD1 menjelaskan sebagian besar variansi antar kelompok. Berdasarkan visualisasi plot diskriminan, ketiga spesies penguin terpisah dengan cukup baik, terutama spesies Gentoo yang memiliki karakteristik paling berbeda. Nilai APER testing sebesar 1.49% menunjukkan kemampuan klasifikasi yang sangat baik.

  2. Regresi Logistik Multinomial menghasilkan dua persamaan logit dengan kategori referensi Gentoo. Berdasarkan uji serentak, minimal satu variabel prediktor berpengaruh signifikan terhadap model. Odds ratio menunjukkan kontribusi masing-masing variabel dalam membedakan spesies. Nilai APER testing sebesar 4.48% menunjukkan akurasi klasifikasi yang sangat baik.

  3. Secara perbandingan, kedua metode menghasilkan performa klasifikasi yang relatif setara. LDA menghasilkan APER yang lebih kecil sehingga sedikit lebih unggul dalam klasifikasi spesies penguin pada dataset ini.


9 Referensi

  • Gorman KB, Williams TD, Fraser WR (2014). Ecological Sexual Dimorphism and Environmental Variability within a Community of Antarctic Penguins. PLoS ONE 9(3): e90081.
  • Hosmer, D.W. & Lemeshow, S. (2000). Applied Logistic Regression (2nd ed.). Wiley.
  • Johnson, R.A. & Wichern, D.W. (2007). Applied Multivariate Statistical Analysis (6th ed.). Pearson.
  • Agresti, A. (2002). Categorical Data Analysis (2nd ed.). Wiley.
  • Sharma, S. (1996). Applied Multivariate Techniques. Wiley.